Whole Genome Phylogenetic Analysis
Yifeng Liu and Reihaneh Rabbanyk Khorasgani
April 8th, 2009
Agenda
• Introduction
• Our method proposals
• Datasets and experiments
• Results
• Discussion
• Future work
• Conclusion
Whole Genome Phylogeny: Motivations
• Currently the dominant method for phylogenetic analysis is based on a single gene or protein.
• However, different genes tell different stories.
• Recently, more genomic sequences have become available.
• We hope to resolve the above inconsistency by using the entire genome (or proteome) to reconstruct the phylogenetic tree.
Whole Genome Phylogeny: Methods
• Major categories of methods are based on:
– Shared gene (ortholog) content
– Nucleotide and amino acid (string) composition
– Genome compression
– Gene order
• In our study, we focus on string composition and compression methods
Complete Composition Vector (CCV)
• The observed occurrence probability for a k-string:
• The estimated background occurrence probability based on the Markov assumption is:
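For reference, the composition vector literature commonly writes these two quantities as shown below. The notation is an assumption rather than taken from the slide: $f$ is the number of occurrences of a k-string, $N$ is the sequence length, and $p$ and $p^0$ are the observed and background probabilities. The background estimate predicts a k-string from its two (k-1)-substrings and their shared (k-2)-substring.

```latex
p(a_1 a_2 \cdots a_k) = \frac{f(a_1 a_2 \cdots a_k)}{N - k + 1}

p^0(a_1 a_2 \cdots a_k) =
  \frac{p(a_1 a_2 \cdots a_{k-1}) \, p(a_2 a_3 \cdots a_k)}{p(a_2 a_3 \cdots a_{k-1})}
```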
Complete Composition Vector (CCV)
• The occurrence probability due to selective pressure:
• The k-th composition vector:
• The Complete Composition Vector (CCV):
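A minimal Python sketch of these definitions follows. It assumes the usual composition-vector formulation, in which S = (p - p0) / p0 when p0 is nonzero, and p0 is predicted from the (k-1)- and (k-2)-strings. The function names (`kmer_probs`, `composition_vector`, `ccv`) are hypothetical. Two simplifications are also assumptions: only k-strings that actually occur in the sequence are stored (a sparse vector), and the sketch requires k >= 2 because the handling of k = 1 is not specified here.

```python
from collections import Counter


def kmer_probs(seq, k):
    """Observed probability of each k-string: count / (len(seq) - k + 1)."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: c / total for kmer, c in counts.items()}


def composition_vector(seq, k):
    """Sparse k-th composition vector {k-string: S}, for k >= 2.

    Assumed form: S = (p - p0) / p0, with the background estimate
    p0(a1..ak) = p(a1..a_{k-1}) * p(a2..ak) / p(a2..a_{k-1}).
    """
    if k < 2:
        raise ValueError("this sketch needs k >= 2")
    p_k = kmer_probs(seq, k)
    p_k1 = kmer_probs(seq, k - 1)
    p_k2 = kmer_probs(seq, k - 2)  # for k == 2 this is {"": 1.0}
    vec = {}
    for kmer, p in p_k.items():
        denom = p_k2.get(kmer[1:-1], 0.0)
        if denom == 0.0:
            continue
        p0 = p_k1[kmer[:-1]] * p_k1[kmer[1:]] / denom
        if p0 > 0.0:
            vec[kmer] = (p - p0) / p0
    return vec


def ccv(seq, k1, k2):
    """Concatenate the composition vectors for k = k1..k2 into one sparse vector."""
    merged = {}
    for k in range(k1, k2 + 1):
        # k-strings of different lengths never collide as dictionary keys.
        merged.update(composition_vector(seq, k))
    return merged
```

Storing only observed k-strings keeps memory proportional to the sequence length rather than to the number of possible k-strings, at the cost of omitting entries where a k-string is expected but never observed.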
Compression Methods
• Kolmogorov Complexity
• Lempel-Ziv complexity
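As a rough illustration of a compression-based distance, the sketch below computes the normalized compression distance, using compressed size as a computable stand-in for Kolmogorov complexity. It is an assumption for illustration only: Python's general-purpose `zlib` is swapped in for a DNA-specific compressor, and the function names are hypothetical.

```python
import random
import zlib


def compressed_size(s):
    """Size in bytes of the zlib-compressed sequence (a proxy for complexity)."""
    return len(zlib.compress(s.encode("ascii"), 9))


def ncd(x, y):
    """Normalized compression distance between two sequences."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Usage: a lightly mutated copy should be closer to the original
# than an unrelated random sequence of the same length.
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(1000))
b = "".join("A" if i % 100 == 0 else ch for i, ch in enumerate(a))
c = "".join(rng.choice("ACGT") for _ in range(1000))
print(ncd(a, b), ncd(a, c))
```

A general-purpose compressor with a limited match window is likely to underestimate similarity on genome-scale inputs, which is one reason dedicated DNA compressors are used in practice.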
A new term weighting scheme
• CCV uses S(•) to weight each k-string, which
– Utilizes only local information available within a single sequence
– Estimates the random background based on a Markov model
• Can we have a measure that uses both local and global information without making the Markov assumption?
Term and Document Frequency
• Genomes are documents written in a language with a four-letter alphabet {A,T,C,G}; similarly, proteomes are documents written in a language with a twenty-letter alphabet.
• Each k-string can be viewed as a word within a genome (or proteome) document.
• The collection of all genomes in the dataset is therefore a corpus.
Term and Document Frequency
• In statistical Natural Language Processing, a well-known term weighting scheme TF-IDF combines both term frequency and document frequency into a single weight.
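A minimal sketch of TF-IDF applied to k-strings is given below. It assumes the common textbook weighting w = tf × log(N / df), where tf is the count of a k-string in one genome, df is the number of genomes that contain it, and N is the number of genomes. The exact variant used in this work is not stated here, and the function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-string in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tfidf_vectors(genomes, k):
    """genomes: dict name -> sequence. Returns dict name -> {k-string: weight}."""
    # Pass 1: term frequencies per genome, document frequencies over the corpus.
    tf = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each genome contributes 1 per distinct k-string
    n = len(genomes)
    # Pass 2: combine both into a single weight per k-string per genome.
    return {
        name: {kmer: c * math.log(n / df[kmer]) for kmer, c in counts.items()}
        for name, counts in tf.items()
    }
```

With this weighting, a k-string present in every genome gets weight zero, so it cannot contribute to any pairwise distance.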
CCV meets Document Frequency
• We can also combine the occurrence probability due to selection, S(•), with the inverse document frequency into a single weight called CCV-IDF.
• S(•) provides local information and the document frequency df_i provides global information.
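One way such a combined weight could look is sketched below. The exact combination is not preserved in the text, so the product S × log(N / df) is an assumption made by analogy with TF-IDF. The function name `ccv_idf` is hypothetical, and the input is assumed to be one sparse vector of S values per genome, where a key is present only if the k-string occurs in that genome.

```python
import math
from collections import Counter


def ccv_idf(s_vectors):
    """s_vectors: dict genome name -> {k-string: S value}.

    Returns the same structure with each S scaled by log(N / df),
    where df is the number of genomes whose vector contains the k-string.
    """
    n = len(s_vectors)
    df = Counter()
    for vec in s_vectors.values():
        df.update(vec.keys())
    return {
        name: {kmer: s * math.log(n / df[kmer]) for kmer, s in vec.items()}
        for name, vec in s_vectors.items()
    }
```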
Ensemble Measures
• Normalizing distances to the same range
• Combining distance matrices: $d = \alpha\, d_{CCV} + \beta\, d_{Compression} + \gamma\, d_{TFIDF}$
• These parameters ($\alpha$, $\beta$, $\gamma$) should be adjusted
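The two steps can be sketched as follows. This is a minimal illustration under stated assumptions: each matrix is scaled by its largest entry (the normalization actually used is not given), the weights are free parameters chosen by the user, and the function names are hypothetical.

```python
def normalize(matrix):
    """Scale a square distance matrix so that its largest entry is 1."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [row[:] for row in matrix]
    return [[d / largest for d in row] for row in matrix]


def combine(matrices, weights):
    """Weighted sum of normalized distance matrices of equal size."""
    if len(matrices) != len(weights):
        raise ValueError("need one weight per matrix")
    scaled = [normalize(m) for m in matrices]
    n = len(scaled[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, scaled)) for j in range(n)]
        for i in range(n)
    ]
```

Normalizing first keeps a measure with a large numeric range from dominating the sum regardless of the weights.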
Tree Evaluation
• We propose a new method for evaluating phylogenetic trees
• A numeric measure
• Shows how compatible the tree is with the given taxonomy
Tree Evaluation (Cont.)
• Labeling the inner nodes in the tree
• For each species
– A path in the tree gives a sequence of inner node labels
– A taxonomy description gives a taxonomy sequence
– There should be a many-to-many alignment between these two sequences
Tree Evaluation (Cont.)
• Finding an alignment between these sequences for all the species
– Using a Bayesian network
• Finding the most probable alignments
• Measuring the log likelihood of these alignments
– How probable is this tree given this taxonomy?
Tree Evaluation (Example)
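• Phylogenetic tree: leaves A, B, D, C; inner nodes 1, 2, 3
• Taxonomy, tree path, alignment and alignment probability for each species:
– A: taxonomy T1;T2; path 1; alignment <T1;T2, 1>; probability P1
– B: taxonomy T1;T3; path 1;2; alignment <T1, 1> <T3, 2>; probability P2
– C: taxonomy T1;T3; path 1;2;3; alignment <T1, 1> <T3, 2;3>; probability P3
– D: taxonomy T1;T3; path 1;2;3; alignment <T1, 1> <T3, 2;3>; probability P4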
Dataset: influenza virus
• Influenza virus genomes (flu)
– 44 influenza A genomes (3 for H1-H13, 2 for H16)
– 3 influenza B genomes
– 1 influenza C genome (outgroup)
– Coding gene sequences only
– Collected and joined from individual gene sequences in the following order: HA, NA, NP, M, NS, PA, PB1, PB2
Dataset: Prokaryotes
• Prokaryote genomes (bac)
– 88 bacterial genomes
– 11 archaeal genomes
– Uses Nanoarchaeum equitans as the outgroup
– Collected from NCBI according to the accession numbers provided in the CCV paper
– Genomic DNA sequences, including intergenic regions
Dataset: Mammal mitochondria
• Mammal mitochondria (mito)
– 425 mammal mitochondria
– 1 Arabidopsis mitochondrion (outgroup)
– Collected from the Organelle Genome Megasequencing Program website
– Converted from NCBI format to FASTA format
– Contains many duplicated entries for:
    • Bos taurus (cattle)
    • Sus scrofa (wild boar)
    • Mus musculus (mouse)
    • Rattus norvegicus (rat)
Experiments
• We built a multiple sequence alignment (MSA) tree for flu
• We ran CCV, TF-IDF and CCV-IDF on all three datasets with the following k-string lengths (we fixed K1 = 1 and varied only K2, so L = K2 - K1 + 1 = K2):
– Flu: L = 7 and L = 15
– Bac and mito: L = 7 and L = 9
• Each run generates a pairwise distance matrix.
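A pairwise distance matrix can be derived from the per-genome vectors as sketched below. The result slides label trees with "cos", which suggests a cosine-based distance. The specific form d = (1 - cos) / 2 is an assumption borrowed from the composition vector literature, and the function names are hypothetical. Vectors are sparse dictionaries mapping k-strings to weights.

```python
import math


def cosine_distance(u, v):
    """u, v: sparse vectors as {k-string: weight}. Returns (1 - cos) / 2."""
    dot = sum(w * v[key] for key, w in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.5  # treat an all-zero vector as uncorrelated with everything
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


def distance_matrix(vectors):
    """vectors: dict name -> sparse vector. Returns (sorted names, square matrix)."""
    names = sorted(vectors)
    matrix = [
        [cosine_distance(vectors[a], vectors[b]) for b in names]
        for a in names
    ]
    return names, matrix
```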
Experiments
• We ran the GenCompress and LZ compression programs on flu and mito and calculated pairwise distances
• We tried ensembling different measures [Reihaneh]
Experiments
• We converted pairwise distance matrices into phylogenetic trees using the Neighbor-Joining program in PHYLIP
• We visualized resulting trees using DRAWGRAM and TreeView.
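To hand a distance matrix to PHYLIP's neighbor-joining program, it has to be written in PHYLIP's distance-matrix layout: a first line with the number of taxa, then one row per taxon with a name padded to 10 characters followed by the distances. The sketch below writes the square form. The function name is hypothetical, and truncating names to 10 characters is an assumption that requires the truncated names to stay unique.

```python
def phylip_distance_matrix(names, matrix):
    """Return a square distance matrix as PHYLIP-formatted text."""
    labels = [name[:10].ljust(10) for name in names]
    if len(set(labels)) != len(labels):
        raise ValueError("taxon names are not unique after truncation to 10 characters")
    lines = [f"{len(names):5d}"]
    for label, row in zip(labels, matrix):
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"
```

The returned text can be saved to a file and supplied to the program as its input.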
MSA trees versus HA tree
• Figure: HA tree by Suzuki et al. and our MSA tree
• MSA tree clades: H1, H2, H3; H5, H6, H9; H4, H15, H16, H13; B; H7, H10, H12, H8
MSA versus Compression
• MSA tree clades: H1, H2, H3; H5, H6, H9; H4, H15, H16, H13; B; H7, H10, H12, H8
• GenCompress tree clades: H1, H2, H3; H15, H4, H5, H6, H9; H10, H12, H8; H13, H16; B; H7
MSA versus CCV
• MSA tree clades: H1, H2, H3; H5, H6, H9; H4, H15, H16, H13; B; H7, H8, H10, H12
• CCV L15 cos tree clades: H1, H2, H3; H7, H8, H10, H12; B; H4, H5, H6, H9; H15; H13, H16
MSA vs TF-IDF
• MSA tree clades: H1, H2, H3; H5, H6, H9; H4, H15, H16, H13; B; H7, H10, H12, H8
• TF-IDF L15 cos tree clades: H1, H2, H3; B; H13, H16; H4, H5, H6, H9; H15; H7, H8, H10, H12, H11
MSA vs CCV-IDF
• MSA tree clades: H1, H2, H3; H5, H6, H9; H4, H15, H16, H13; B; H7, H8, H10, H12
• CCV-IDF L15 cos tree clades: H1, H2, H3; H13, H16; H8, H10, H12; B; H7; H4, H5, H6, H9; H15
CCV vs TF-IDF
• TF-IDF L15 cos tree clades: H1, H2, H3; B; H13, H16; H4, H5, H6, H9; H15; H7, H8, H10, H12, H11
• CCV L15 cos tree clades: H1, H2, H3; H7, H8, H10, H12; B; H4, H5, H6, H9; H15; H13, H16
CCV vs CCV-IDF
• CCV-IDF L15 cos tree clades: H1, H2, H3; H13, H16; H8, H10, H12; B; H7; H4, H5, H6, H9; H15
• CCV L15 cos tree clades: H1, H2, H3; H7, H8, H10, H12; B; H4, H5, H6, H9; H15; H13, H16
Observations
• All methods (MSA, CCV, GenCompress, TF-IDF, CCV-IDF) generate similar results.
• Our results are significantly different from previous studies.
• Most clades are intact while some are scattered around.
• Most clades are pure while some are mixed with species from nearby clades.
• CCV and CCV-IDF results are highly similar.
AA versus DNA
• Figure: CCV tree on protein sequences (k1=3, k2=7) and CCV tree on DNA sequences (k1=1, k2=7)
CCV L=7 and L=9
• Figure: CCV tree on DNA sequences with k1=1, k2=7 and CCV tree on DNA sequences with k1=1, k2=9
Observations
• Most clades are intact.
• For similar CCV lengths, the DNA tree is worse than the protein tree and is unable to recognize Archaea as a distinct clade.
• CCV trees are similar for length 7 and length 9.
– Similarly, the L7, L15 and L21 trees for flu are almost identical.
Mito results
• For the mito dataset, we have similar observations.
• All methods failed to resolve the fine branches of the tree, mixing distant species into them.
Mito: primates
• Figure: TF-IDF L9 cos tree and CCV L9 cos tree
DNA versus AA Sequence
• There are more possible k-strings for protein sequences than for DNA sequences of the same length.
– We need longer k-strings for DNA to achieve the same resolution as amino acid (AA) sequences.
• Due to the redundancy of the genetic code, different DNA k-strings may correspond to the same AA k-string.
– AA k-strings can share information even though their DNA sequences might differ.
• DNA sequences may contain intergenic regions, which do not respond to selection pressure.
– Intergenic regions may not contribute much to the resolution of the tree; they might even reduce it.
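As a worked comparison of the first point, the number of possible k-strings can be counted directly. Treating "resolution" as the size of the k-string space is a simplifying assumption, and $k_{DNA}$ and $k_{AA}$ are notation introduced here.

```latex
\#\{\text{DNA } k\text{-strings}\} = 4^{k}, \qquad
\#\{\text{AA } k\text{-strings}\} = 20^{k}

4^{7} = 16\,384, \qquad 20^{7} = 1.28 \times 10^{9}

4^{k_{DNA}} = 20^{k_{AA}}
\;\Longrightarrow\;
k_{DNA} = k_{AA} \log_{4} 20 \approx 2.16\, k_{AA}
```

Under this count, amino acid strings of length 7 span about as many possibilities as DNA strings of length 15.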
Thoughts on Document Frequency
• We did not observe a significant performance difference when adding document frequency information.
• For longer genomes (e.g. bac), we need longer k-strings to see the effect of DF.
– All bac genomes share 87.9% of the 9-strings but only 0.8% of the 11-strings.
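A statistic of this kind can be computed as sketched below. The denominator behind the percentages above is not stated, so this sketch assumes the fraction is taken over all distinct k-strings observed in at least one genome. The function name is hypothetical.

```python
def shared_fraction(genomes, k):
    """Fraction of distinct observed k-strings that occur in every genome.

    genomes: list of sequences.
    """
    sets = [{seq[i:i + k] for i in range(len(seq) - k + 1)} for seq in genomes]
    if not sets:
        return 0.0
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set.intersection(*sets)
    return len(common) / len(union)
```

If the inverse document frequency takes the common form log(N / df), every k-string counted in the numerator here has weight zero, which is consistent with document frequency having little effect at short k.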
Compression programs
• Current compression programs are problematic
– LZ could not handle large datasets
– Kolmogorov is not applicable to large sequences
• These methods should be reimplemented
Future works
• Run the same experiments on protein sequences
– To investigate the effect of using AA versus DNA sequences.
– We expect to see better results with protein sequences.
– New results may reveal subtle differences between the methods.
Future works
• Speed up the implementation of TF-IDF and run it on longer k-strings
– Computational complexity is the bottleneck for achieving high resolution in a reasonable amount of time.
– Initially the calculations of TF and IDF were separate, which was slow.
– We achieved a significant speedup by integrating the calculation of TF and IDF into a two-pass algorithm.
– We may drop k-strings with low TF-IDF values to further speed up the program.
Future works
• Perform bootstrapping analysis
– We were unable to perform bootstrapping analysis due to time and computational resource constraints.
Future works
• Our proposed evaluation method needs a many-to-many alignment, which is not a trivial task
– It is well studied in machine translation and natural language processing, and those techniques could help here
• This measure could also be used as a measure of similarity between trees
Conclusion
• All string composition methods (CCV, TF-IDF, CCV-IDF) group most similar species together to some extent and produce consistent results.
– However, they all failed to resolve big branches as well as fine branches.
• We did not observe a significant improvement from adding document frequency.
– But we will need further experiments (with longer k-strings on AA sequences) to fully understand the effect.
Major Contributions
• We proposed a novel term weighting scheme which achieves performance similar to CCV in our experiments
• We proposed the notion of adding global information in the form of document frequency
• We discovered that using protein sequences may significantly improve performance for all methods
• We proposed a novel evaluation method for phylogenetic trees
Author Contributions
• Yifeng
– Collected all three datasets
– Performed CCV experiments
– Implemented TF-IDF and CCV-IDF
• Reihaneh
– Built MSA tree for flu
– Performed compression experiments
– Implemented ensemble and evaluation methods
Special thanks to …
• Professor Guohui Lin
• Dr. Zhipeng Cai
• Proteome Analyst Research Group
References
• [1] Xin Chen, Sam Kwong, and Ming Li. A compression algorithm for DNA sequences and its applications in genome comparison. In Genome Informatics, pages 52-61, 1999.
• [2] Joseph Felsenstein. PHYLIP - Phylogeny Inference Package (version 3.2). Cladistics, 5:164-166, 1989.
• [3] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149-154, 2001.
• [4] Yifeng Liu and Reihaneh Rabbanyk Khorasgani. A survey on whole genome phylogenetic analysis. CMPUT 606 course survey, February 2009.
• [5] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, June 1999.
• [6] Hasan H. Otu and Khalid Sayood. A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 19(16):2122-2130, 2003.
• [7] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406-425, July 1987.
• [8] Yoshiyuki Suzuki and Masatoshi Nei. Origin and evolution of influenza virus hemagglutinin genes. Mol Biol Evol, 19(4):501-509, April 2002.
• [9] Xiaomeng Wu, Xiufeng Wan, Gang Wu, Dong Xu, and Guohui Lin. Phylogenetic analysis using complete signature information of whole genomes and clustered neighbor-joining method. International Journal of Bioinformatics Research and Applications, 2(3):219-248, 2006.