![Page 1: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/1.jpg)
Comparative Sequence Analysis in Molecular Biology
Martin Tompa
Computer Science & Engineering
Genome Sciences
University of Washington
Seattle, Washington, U.S.A.
![Page 2: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/2.jpg)
2
![Page 3: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/3.jpg)
3
Outline
• What genome data is available?
• What is phylogenetic footprinting?
• Phylogenetic footprinting by multiple sequence alignment
• Which parts of multiple sequence alignments are trustworthy?
• FootPrinter: phylogenetic footprinting without alignment
![Page 4: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/4.jpg)
4
![Page 5: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/5.jpg)
5
How Many Genomes Are Available?
• 46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes)
• 1766 bacterial genomes sequenced (as of 2/12/2012)
• Insects, fungi, worms, plants, …
• Many more will be finished very soon
• Fertile ground for comparative genomics
![Page 6: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/6.jpg)
6
1982-2003: number of nucleotides in GenBank doubled every 18 months
Since 2003: doubled every 3 years
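As a rough check on what the two doubling times mean per year: a doubling time of $T$ years corresponds to an annual growth factor of $2^{1/T}$. The arithmetic below is only an illustration of the two figures quoted on the slide.

```latex
\[
2^{1/1.5} = 2^{2/3} \approx 1.59 \quad \text{(1982 to 2003)}, \qquad
2^{1/3} \approx 1.26 \quad \text{(since 2003)}
\]
```

That is roughly 59% growth per year before 2003 versus roughly 26% per year afterwards.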
![Page 7: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/7.jpg)
7
![Page 8: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/8.jpg)
8
Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones.
1. Consider a set of corresponding DNA sequences from related species.
2. Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”
![Page 9: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/9.jpg)
9
![Page 10: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/10.jpg)
10
![Page 11: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/11.jpg)
11
How to Find Conserved Motifs
ACTAACCGGGAGATTTCAGA human
AAGTTCCGGGAGATTTCCA chimp
TAGTTATCCGGGAGATTAGA mouse
AAAACCGGTAGATTTCAGG rat
![Page 12: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/12.jpg)
12
Multiple Sequence Alignment
AC--TAACCGGGAGATTTCAGA human
AAGTT--CCGGGAGATTTCC-A chimp
TAGTTATCCGGGAGATT--AGA mouse
AA---AACCGGTAGATTTCAGG rat
(Finding the optimal alignment is NP-complete.)
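Once an alignment like the one above is given, spotting conserved columns is straightforward. The sketch below is a minimal illustration, not the method used in the talk: it uses the four aligned rows from the slide, and the helper names `conserved_columns` and `longest_run` are made up for this example.

```python
# The four aligned rows from the slide (all rows have the same length).
ALIGNMENT = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def conserved_columns(alignment):
    """Return indices of columns where every row has the same base and no gap."""
    rows = list(alignment.values())
    result = []
    for i in range(len(rows[0])):
        column = {row[i] for row in rows}
        if len(column) == 1 and "-" not in column:
            result.append(i)
    return result

def longest_run(indices):
    """Return (start, end), end inclusive, of the longest run of consecutive indices."""
    best = None
    start = prev = None
    for i in indices:
        if prev is None or i != prev + 1:
            start = i  # a new run begins here
        prev = i
        if best is None or i - start > best[1] - best[0]:
            best = (start, i)
    return best

cols = conserved_columns(ALIGNMENT)
start, end = longest_run(cols)
print(cols)                                # [7, 8, 9, 10, 12, 13, 14, 15, 16]
print(ALIGNMENT["human"][start:end + 1])   # AGATT
```

Requiring perfect identity is the crudest possible conservation measure; it is used here only because it can be checked by eye against the slide.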
![Page 13: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/13.jpg)
13
Phylogenetic Footprinting
1. Use a whole-genome multiple alignment, such as the one provided by the UCSC Genome Browser.
2. Search for regions of well conserved alignment.
– Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)]
– RNA elements [Pedersen; Washietl]
– General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]
![Page 14: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/14.jpg)
14
![Page 15: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/15.jpg)
15
Why Doubt Alignments?
• Multiple sequence alignment of short sequences (proteins, promoters) is difficult (NP-complete)
• Aligning whole genomes adds the complications of huge sequences and genomic rearrangements
• Vertebrate alignment has 3.8 billion columns
• Automatically generated
![Page 16: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/16.jpg)
16
Assessing 4 Genome-Size Alignments (with Xiaoyu Chen)
• Alignments: MLAGAN [Brudno 2003], MAVID [Bray 2003], TBA [Blanchette 2003], Pecan [Paten 2008]
• Target ENCODE regions: 30 Mbp covering 1% of the human genome (ENCODE targets)
• Total input: 554 Mbp over 28 vertebrates
• Rich resource for comparing and assessing genome-size alignments
Margulies et al. 2007, Genome Research
![Page 17: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/17.jpg)
17
Coverage of each alignment
[Figure: four panels (Coding, UTR, Intronic, Intergenic) with y-axis in Kbp (0 to 600 for Coding and UTR, 0 to 12000 for Intronic and Intergenic), comparing TBA, MAVID, MLAGAN, and Pecan]
Alignment coverage: number of human bases aligned to a given species
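The coverage definition above can be sketched in a few lines. This is an illustration on the toy four-species alignment from the "Multiple Sequence Alignment" slide, not on the ENCODE data; the function name `coverage` is made up for this example.

```python
# Toy alignment copied from the earlier "Multiple Sequence Alignment" slide.
TOY_ALIGNMENT = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def coverage(alignment, reference, species):
    """Count columns in which both the reference and the other species have a base."""
    pairs = zip(alignment[reference], alignment[species])
    return sum(1 for ref_base, other_base in pairs if ref_base != "-" and other_base != "-")

for name in ("chimp", "mouse", "rat"):
    print(name, coverage(TOY_ALIGNMENT, "human", name))
# chimp 17
# mouse 18
# rat 19
```

A human base opposite a gap in the other species does not count, which is why coverage can differ between aligners given the same input sequences.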
![Page 18: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/18.jpg)
18
Coverage of each alignment
[Figure: four panels (Coding, UTR, Intronic, Intergenic) with y-axis in Kbp (0 to 600 for Coding and UTR, 0 to 12000 for Intronic and Intergenic), comparing TBA, MAVID, MLAGAN, and Pecan]
In noncoding regions, as species distance from human↑, coverage↓
![Page 19: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/19.jpg)
19
Coverage of each alignment
[Figure: four panels (Coding, UTR, Intronic, Intergenic) with y-axis in Kbp (0 to 600 for Coding and UTR, 0 to 12000 for Intronic and Intergenic), comparing TBA, MAVID, MLAGAN, and Pecan]
MAVID has lowest coverage
![Page 20: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/20.jpg)
20
Coverage of each alignment
[Figure: four panels (Coding, UTR, Intronic, Intergenic) with y-axis in Kbp (0 to 600 for Coding and UTR, 0 to 12000 for Intronic and Intergenic), comparing TBA, MAVID, MLAGAN, and Pecan]
Other 3 have comparable coverage in placental mammals
![Page 21: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/21.jpg)
21
Coverage of each alignment
[Figure: four panels (Coding, UTR, Intronic, Intergenic) with y-axis in Kbp (0 to 600 for Coding and UTR, 0 to 12000 for Intronic and Intergenic), comparing TBA, MAVID, MLAGAN, and Pecan]
MLAGAN has highest coverage in distant species, intronic and intergenic
![Page 22: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/22.jpg)
22
Level of agreement among alignments
[Figure: four panels (Coding, UTR, Intronic, Intergenic bases); repeated groups of three bars labeled T (TBA), V (MAVID), L (MLAGAN), each split into Agree%, Disagree%, and Unique% on a 0.0 to 1.0 scale]
![Page 23: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/23.jpg)
23
Level of agreement among alignments
[Figure: four panels (Coding, UTR, Intronic, Intergenic bases); repeated groups of three bars labeled T (TBA), V (MAVID), L (MLAGAN), each split into Agree%, Disagree%, and Unique% on a 0.0 to 1.0 scale]
Agree%: Coding > UTR > Int.
![Page 24: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/24.jpg)
24
Level of agreement among alignments
[Figure: four panels (Coding, UTR, Intronic, Intergenic bases); repeated groups of three bars labeled T (TBA), V (MAVID), L (MLAGAN), each split into Agree%, Disagree%, and Unique% on a 0.0 to 1.0 scale]
Unique%: Coding < UTR < Int.
![Page 25: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/25.jpg)
25
Level of agreement among alignments
[Figure: four panels (Coding, UTR, Intronic, Intergenic bases); repeated groups of three bars labeled T (TBA), V (MAVID), L (MLAGAN), each split into Agree%, Disagree%, and Unique% on a 0.0 to 1.0 scale]
As species distance from human↑: Agree%↓, Unique%↑
![Page 26: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/26.jpg)
26
Level of agreement among alignments
[Figure: four panels (Coding, UTR, Intronic, Intergenic bases); repeated groups of three bars labeled T (TBA), V (MAVID), L (MLAGAN), each split into Agree%, Disagree%, and Unique% on a 0.0 to 1.0 scale]
Primates: high Agree%
![Page 27: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/27.jpg)
27
Level of agreement among alignments
[Figure: four panels (Coding, UTR, Intronic, Intergenic bases); repeated groups of three bars labeled T (TBA), V (MAVID), L (MLAGAN), each split into Agree%, Disagree%, and Unique% on a 0.0 to 1.0 scale]
Placental nonprimates: Agree% > 0.5
![Page 28: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/28.jpg)
28
Level of agreement among alignments
[Figure: four panels (Coding, UTR, Intronic, Intergenic bases); repeated groups of three bars labeled T (TBA), V (MAVID), L (MLAGAN), each split into Agree%, Disagree%, and Unique% on a 0.0 to 1.0 scale]
Distant species, Int: low Agree%, high Unique%
![Page 29: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/29.jpg)
29
Alignment agreement for mouse
[Figure: two panels (Intronic bases, Intergenic bases) showing Agree%, Disagree%, and Unique% on a 0.0 to 1.0 scale]
• Intronic & intergenic account for 95% of mouse bases aligned to human
• Agree% in those categories: 44% to 62%
• Much worse for more distant species
• Building reliable MSA remains challenging
![Page 30: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/30.jpg)
30
Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)
Goal: label each alignment column with confidence measure of alignment correctness
– Identify sequences that do not belong
• Users forewarned about regions of interest
• Genome browser designers consider realigning
• Alignment tool designers get feedback for possible improvements
![Page 31: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/31.jpg)
31
Sample Suspicious Alignment
Human -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Chimp -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Rhesus -----------GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Mouse -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Rat -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Dog -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Cow -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Elephant -----------GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Tenrec -----------GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Opossum -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA----------TTAACAC
Chicken -----------GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA----------TTAACAC
Zebrafish GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG--ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC
![Page 32: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/32.jpg)
32
Scoring Function
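Pairwise: $\mathrm{score}(1, 2) = \log \dfrac{\Pr(1, 2)}{\Pr(1)\,\Pr(2)}$
Multiple, with leaves 1 = Human, 2 = Chimp, 3 = Mouse, 4 = Rat, 5 = Chicken:
$\mathrm{sc}(12345 \mid \ ) = \log \dfrac{\Pr(12345 \mid \ )}{\Pr(125 \mid \ )\,\Pr(34 \mid \ )}$
(The argument after each bar is a graphic that is missing from this transcript.)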
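To make the pairwise log-odds score concrete, here is a minimal sketch. The probability model is entirely hypothetical (uniform background frequencies and an 80% chance that two aligned bases match); the talk's actual scores come from a phylogenetic model that is not reproduced here.

```python
import math

# Hypothetical toy model, for illustration only.
BACKGROUND = {base: 0.25 for base in "ACGT"}   # Pr(a)
MATCH_PROB = 0.80                              # total probability of an aligned pair matching

def joint(a, b):
    """Pr(a, b): probability of seeing bases a and b aligned, under the toy model."""
    if a == b:
        return MATCH_PROB / 4          # 4 matching pairs share 0.80
    return (1 - MATCH_PROB) / 12       # 12 mismatching pairs share 0.20

def score(a, b):
    """Pairwise log-odds score: log( Pr(a, b) / (Pr(a) * Pr(b)) )."""
    return math.log(joint(a, b) / (BACKGROUND[a] * BACKGROUND[b]))

print(round(score("A", "A"), 3))   # positive: the pair is more likely related than random
print(round(score("A", "C"), 3))   # negative: the pair is less likely related than random
```

A positive score means the aligned pair is better explained by common ancestry than by chance, which is what lets a run of columns be scored as a segment.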
![Page 33: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/33.jpg)
33
Outline of Computation
Input: multiple sequence alignment A
Output: discordance, max_k p_k
For each branch k of the tree {
  Compute scoring function sc_k (Felsenstein)
  Find all maximally scoring segments of A using sc_k (Ruzzo & Tompa)
  Compute K, λ using sc_k (Karlin & Altschul)
  Compute p-value p_k of each segment score using K, λ (Karlin & Altschul)
}
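Two of the steps above can be sketched compactly. The code below is a simplified stand-in, not the talk's implementation: it finds only the single best-scoring segment (Kadane's algorithm) rather than all maximal scoring segments as in Ruzzo & Tompa, and it evaluates the standard Karlin-Altschul tail approximation with placeholder values of K and λ that are assumptions for illustration.

```python
import math

def max_scoring_segment(scores):
    """Return (best_sum, start, end), end inclusive, of the best contiguous segment."""
    best_sum, best_start, best_end = scores[0], 0, 0
    run_sum, run_start = 0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0:
            run_sum, run_start = s, i   # a non-positive prefix never helps: restart here
        else:
            run_sum += s
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i
    return best_sum, best_start, best_end

def karlin_altschul_pvalue(score, length, K, lam):
    """Approximate P(best segment score >= score) as 1 - exp(-K * length * exp(-lam * score))."""
    return 1.0 - math.exp(-K * length * math.exp(-lam * score))

column_scores = [1, -2, 3, 4, -1, 2, -5]          # made-up per-column scores
best, start, end = max_scoring_segment(column_scores)
print(best, start, end)                            # 8 2 5
# K = 0.1 and lam = 1.0 are placeholders, not values from the talk.
print(karlin_altschul_pvalue(best, len(column_scores), K=0.1, lam=1.0))
```

In the real pipeline K and λ are derived from the scoring function for each branch, so the p-values are comparable across branches before taking the maximum.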
![Page 34: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/34.jpg)
34
Suspicious Alignment Regions
Back to four ENCODE alignments spanning 30 Mbp of human aligned to 27 other vertebrates (with Xiaoyu Chen)
• Identify suspicious alignment regions:
– Length 50 bp
– Discordance 0.1 at each position, all with respect to the same worst species
– Fewer than 50% gapped sites
• Suspicious%
– Percentage of aligned bases in suspicious regions
![Page 35: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/35.jpg)
35
Alignment accuracy
[Figure: four panels (Coding, UTR, Intronic, Intergenic bases) with y-axis from 0% to 30%, comparing TBA, MAVID, MLAGAN, and Pecan]
![Page 36: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/36.jpg)
36
![Page 37: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/37.jpg)
37
![Page 38: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/38.jpg)
38
![Page 39: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/39.jpg)
39
Can suspicious alignments be improved?
Baboon and MLAGAN (for example): all points (x,y), where
• x = human-baboon alignment score of MLAGAN region suspicious for baboon
• y = human-baboon alignment score of alternative alignment for same human region but not suspicious for baboon
y = x
y - x = μ, where μ = average y-x over all points
y - x = μ ± σ, where σ = standard deviation of y-x over all points
![Page 40: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/40.jpg)
40
Can suspicious alignments be improved?
![Page 41: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/41.jpg)
41
Summary of comparisons (all categories)
[Figure: plot of suspicious% (y-axis, 0% to 30%; low is better) against log(coverage) (x-axis, 12 to 18; high is better) for TBA, MAVID, MLAGAN, and Pecan, grouped into primates, other placental mammals, and distant species]
![Page 42: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/42.jpg)
42
Conclusions
1. Disturbing lack of agreement among alignments: alignment still a hard problem
2. Performance of the aligners varies significantly by species group and region type, particularly distant species and noncoding regions
![Page 43: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/43.jpg)
43
![Page 44: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/44.jpg)
44
DNA, Genes, and Proteins
DNA: program for cell processes
Proteins: execute cell processes
[Figure: DNA sequence fragments (TCCAA, CGGTGC, TGAGGT, GCAC) with labels DNA, Gene, Protein]
![Page 45: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/45.jpg)
45
Regulation of Genes
• What turns genes on and off?
• When is a gene turned on or off?
• Where (in which cells) is a gene turned on?
• How many copies of the gene product are produced?
![Page 46: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/46.jpg)
46
Regulation of Genes
[Figure labels: DNA, Regulatory Element, Gene, Transcription Factor, RNA polymerase]
![Page 47: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/47.jpg)
47
Regulation of Genes
[Figure labels: DNA, Regulatory Element, Gene, Transcription Factor, RNA polymerase]
![Page 48: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/48.jpg)
48
Goal
Identify regulatory elements in DNA sequences. These are:
• Binding sites for proteins
• Short subsequences (5-25 nucleotides)
• Up to 1000 nucleotides (or farther) from gene
• Inexactly repeating patterns (“motifs”)
![Page 49: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/49.jpg)
49
CLUSTALW multiple sequence alignment (rbcS gene)
Cotton ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
![Page 50: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/50.jpg)
50
Finding Short Motifs
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4
![Page 51: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/51.jpg)
51
Most Parsimonious Solution
“Parsimony score”: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Tree node labels: ACGG, ACGT, ACGT, ACGT]
![Page 52: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/52.jpg)
52
Substring Parsimony Problem
Given:
• phylogenetic tree T,
• set of orthologous sequences at leaves of T,
• length k of motif
• threshold d
Problem:
• Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-complete.
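For a fixed choice of one k-mer per leaf, the parsimony score itself is easy to compute. The sketch below uses Fitch's algorithm column by column, which gives the minimum number of substitutions under unit costs. The tree topology is an assumption (the slide's tree did not survive as text), and the k-mers are one possible choice from the five sequences on the "Finding Short Motifs" slide.

```python
# Hypothetical rooted binary tree; the real topology from the slide is not available.
TREE = (("human", "chimp"), ("rabbit", ("mouse", "rat")))

# One k-mer picked from each leaf sequence (k = 4).
KMERS = {"human": "ACGT", "chimp": "ACGT", "rabbit": "ACGT", "mouse": "ACGG", "rat": "ACGG"}

def fitch(tree, leaf_base):
    """Return (candidate base set, mutation count) for a single column of the k-mers."""
    if isinstance(tree, str):
        return {leaf_base[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_base)
    right_set, right_cost = fitch(tree[1], leaf_base)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared candidate: one substitution is needed on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, kmers):
    """Sum the per-column Fitch costs over all k positions."""
    k = len(next(iter(kmers.values())))
    total = 0
    for pos in range(k):
        column = {species: kmer[pos] for species, kmer in kmers.items()}
        total += fitch(tree, column)[1]
    return total

print(parsimony_score(TREE, KMERS))   # 1
```

The hard part of the problem is not scoring one set S but searching over all ways to pick a k-mer from every leaf, which is what the exact algorithm on the next slide addresses.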
![Page 53: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/53.jpg)
53
FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for subtree rooted at node u, if u is labeled with string s.
AGTCGTACGTG
ACGGGACGTGC
ACGTGAGATAC
GAACGGAGTAC
TCGTGACGGTG
[Figure: phylogenetic tree with the five sequences above at its leaves. Every node u carries a table $W_u$ with $4^k$ entries, one per k-mer. Tables shown include (ACGG: 2, ACGT: 1), (ACGG: 0, ACGT: 2), (ACGG: 1, ACGT: 1), (ACGG: 1, ACGT: 0), (ACGG: +∞, ACGT: 0), and (ACGG: 0, ACGT: +∞).]
![Page 54: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/54.jpg)
54
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \bigl( W_v[t] + d(s, t) \bigr)$
Running Time
Total time $O(n k (4^{2k} + l))$, where n = number of species, l = average sequence length, k = motif length
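The recurrence can be written out directly for small k. The following is a sketch of the naive version (the one with the $4^{2k}$ term), not FootPrinter's actual code: it uses the five sequences from the "Finding Short Motifs" slide, Hamming distance for d(s, t), and a tree topology that is an assumption made for this example.

```python
from itertools import product

K = 4
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INF = float("inf")

# Sequences from the "Finding Short Motifs" slide.
SEQS = {
    "human":  "AGTCGTACGTGAC",
    "chimp":  "AGTAGACGTGCCG",
    "rabbit": "ACGTGAGATACGT",
    "mouse":  "GAACGGAGTACGT",
    "rat":    "TCGTGACGGTGAT",
}

# Hypothetical topology; the slide's tree is not available as text.
TREE = (("human", "chimp"), ("rabbit", ("mouse", "rat")))

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def table(node):
    """Return W_node as a dict mapping each k-mer s to its best parsimony score."""
    if isinstance(node, str):
        # Leaf: 0 if s occurs in the sequence, +infinity otherwise.
        seq = SEQS[node]
        present = {seq[i:i + K] for i in range(len(seq) - K + 1)}
        return {s: (0 if s in present else INF) for s in ALL_KMERS}
    child_tables = [table(child) for child in node]
    # Internal node: for each child, take the cheapest child label plus edge cost, then sum.
    return {
        s: sum(min(w[t] + hamming(s, t) for t in ALL_KMERS) for w in child_tables)
        for s in ALL_KMERS
    }

root = table(TREE)
best = min(root.values())
print(best, sorted(s for s, value in root.items() if value == best))
```

The minimum over the root table is the best parsimony score of any motif; recovering which k-mer was chosen at each leaf would need a traceback through the tables, which is omitted here.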
![Page 55: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/55.jpg)
55
Improvements
• Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$
• By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
• Amenable to many useful extensions (e.g., allow insertions and deletions)
![Page 56: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/56.jpg)
56
Application to β-actin Gene
Gilthead sea bream (678 bp)
Medaka fish (1016 bp)
Common carp (696 bp)
Grass carp (917 bp)
Chicken (871 bp)
Human (646 bp)
Rabbit (636 bp)
Rat (966 bp)
Mouse (684 bp)
Hamster (1107 bp)
![Page 57: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/57.jpg)
57
Common carp
ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTG
AGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCA
GACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC
Chicken
ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTG
TTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGA
GCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGA
TAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT
Human
GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTT
TTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTT
GGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTC
CCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG
Parsimony score over 10 vertebrates: 0 1 2
![Page 58: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/58.jpg)
58
Motifs Absent from Some Species
• Find motifs
– with small parsimony score
– that span a large part of the tree
• Example: in tree of 10 species spanning 760 Myrs, find all motifs with
– score 0 spanning at least 250 Myrs
– score 1 spanning at least 350 Myrs
– score 2 spanning at least 450 Myrs
– score 3 spanning at least 550 Myrs
![Page 59: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/59.jpg)
59
Application to c-fos Gene
Asked for motifs of length 10, with
– 0 mutations over tree of size 6
– 1 mutation over tree of size 11
– 2 mutations over tree of size 16
– 3 mutations over tree of size 21
– 4 mutations over tree of size 26
[Figure: phylogenetic tree of Puffer fish, Chicken, Pig, Mouse, Hamster, and Human; branch lengths as extracted: 10, 2, 7, 2, 2, 21, 0, 1, 1]
Found:
– 0 mutations over tree of size 8
– 1 mutation over tree of size 16
– 3 mutations over tree of size 21
– 4 mutations over tree of size 28
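The requested thresholds act as a simple filter on (mutations, tree size) pairs. The sketch below assumes that "over tree of size N" in the request means "spanning a subtree of size at least N", by analogy with the "at least" wording on the previous slide; the function name `meets_threshold` is made up for this example.

```python
# Requested on the slide: mutations allowed -> minimum tree size to span (assumed "at least").
REQUESTED = {0: 6, 1: 11, 2: 16, 3: 21, 4: 26}

# (mutations, tree size) pairs reported as found on the slide.
FOUND = [(0, 8), (1, 16), (3, 21), (4, 28)]

def meets_threshold(mutations, tree_size, requested=REQUESTED):
    """True if a motif with this many mutations spans a large enough part of the tree."""
    return mutations in requested and tree_size >= requested[mutations]

for mutations, tree_size in FOUND:
    print(mutations, tree_size, meets_threshold(mutations, tree_size))
```

Letting the allowed number of mutations grow with the size of the subtree spanned is what allows motifs missing from some species to be reported.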
![Page 60: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/60.jpg)
60
Application to c-fos Gene

| Motif | Score | Conserved in | Known? |
|---|---|---|---|
| CAGGTGCGAATGTTC | 0 | 4 mammals | |
| TTCCCGCCTCCCCTCCCC | 0 | 4 mammals | yes |
| GAGTTGGCTGcagcc | 3 | puffer + 4 mammals | |
| GTTCCCGTCAATCcct | 1 | chicken + 4 mammals | yes |
| CACAGGATGTcc | 4 | all 6 | yes |
| AGGACATCTG | 1 | chicken + 4 mammals | yes |
| GTCAGCAGGTTTCCACG | 0 | 4 mammals | yes |
| TACTCCAACCGC | 0 | 4 mammals | |
metK in B. subtilis
![Page 61: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/61.jpg)
61
Microbial Footprinting
• 1889 prokaryotes with genomes completely sequenced (as of 2/12/2012)
– For any prokaryotic gene of interest, plenty of close genes in other species available
– Relatively simple genomes
• MicroFootPrinter (with Shane Neph)
– Designed specifically for phylogenetic footprinting in microbial genomes
– undergraduate Computational Biology Capstone project
– User specifies species and gene of interest
– Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters
![Page 62: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/62.jpg)
62
Demo
• MicroFootPrinter home
• Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
– chvI (two component response regulator)
– ropB (outer membrane protein)
![Page 63: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/63.jpg)
63
Sample chvI motif
Parsimony score: 2
Span: 41.10
Significance score: 4.22
B. henselae -151 GCTACAATTT
R. etli -90 GCCACAATTT
R. leguminosarum -106 GCCACAATTT
S. meliloti -119 GCCACAATTT
S. medicae -118 GCCACAATTT
A. tumefaciens -105 GCCACAATTT
M. loti -80 GCCACATTTT
M. sp. -87 GCCACATTTT
O. anthropi -158 GCCACATTTT
B. suis -38 GCCACATTTT
B. melitensis -156 GCCACATTTT
B. abortus -156 GCCACATTTT
B. ovis -156 GCCACATTTT
B. canis -38 GCCACATTTT
![Page 64: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/64.jpg)
64
Sample ropB motif
Parsimony score: 1
Span: 20.70
Significance score: 1.34
Jannaschia sp. -151 CACATTTTGG
R. etli -134 CACAATTTGG
R. leguminosarum -135 CACAATTTGG
A. tumefaciens -131 CACATTTTGG
S. meliloti -128 CACATTTTGG
S. medicae -128 CACATTTTGG
![Page 65: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/65.jpg)
65
Combined ChvI Motif
ropB: CACATTTTGG
chvI: GCCACAATTT
Atu1221: TTGTCACAAT
ultimate: GYCACAWTTTGG
Y={C,T}
W={A,T}
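The degenerate consensus can be checked against the individual motifs by expanding the IUPAC codes into a regular expression. This is a small sketch for illustration: the helper `iupac_to_regex` is made up, and the overlaps used (chvI against the first 10 positions, ropB against the last 10, Atu1221 from its third base against the first 8) are read off the motifs by eye rather than taken from the slide.

```python
import re

# Standard IUPAC nucleotide codes; the slide itself only uses Y and W.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Expand an IUPAC nucleotide pattern into an equivalent regular expression."""
    return "".join(IUPAC[code] for code in pattern)

CONSENSUS = "GYCACAWTTTGG"
print(iupac_to_regex(CONSENSUS))   # G[CT]CACA[AT]TTTGG

def matches(fragment, site):
    """True if the site is a full match for the given piece of the consensus."""
    return re.fullmatch(iupac_to_regex(fragment), site) is not None

print(matches(CONSENSUS[:10], "GCCACAATTT"))   # chvI site: True
print(matches(CONSENSUS[2:], "CACATTTTGG"))    # ropB site: True
print(matches(CONSENSUS[:8], "TTGTCACAAT"[2:]))  # Atu1221 overlap: True
```

The B. henselae variant GCTACAATTT does not match the consensus, which is consistent with the chvI motif having a nonzero parsimony score.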
![Page 66: Comparative Sequence Analysis in Molecular Biology](https://reader036.vdocument.in/reader036/viewer/2022062301/56814b2e550346895db8393f/html5/thumbnails/66.jpg)
66
References and Acknowledgments
• Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.
• Xiaoyu Chen & Martin Tompa, Comparative Assessment of Methods for Aligning Multiple Genome Sequences. Nature Biotechnology, June 2010, 567-572.
• Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002, 739-748.
• Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.
• All software available at bio.cs.washington.edu/software