Correlogram Method for comparing Bio- Sequences - Gandhali Samant , M.S. Computer Science Committee Dr. Debasis Mitra, PhD Dr. William Shoaff, PhD Dr. Alan Leonard, PhD

Post on 05-Jan-2016

Correlogram Method for comparing Bio-Sequences

- Gandhali Samant, M.S. Computer Science

Committee: Dr. Debasis Mitra, PhD; Dr. William Shoaff, PhD; Dr. Alan Leonard, PhD

What is Sequence Comparison

Sequence Comparison – One of the most important primitive operations in computational biology.

Finding resemblance or similarity between sequences

Basis for many other more complex manipulations.

Used for database search, phylogeny development, clustering etc.

What is Sequence Comparison …Contd.

Two important notions are:

Similarity – How similar are the two sequences? This gives a numeric score of similarity between two sequences.

A  G  T  C  T  C
A  T  T  G  T  C
--------------------------
1 -1  1 -1  1  1  = 2

Alignment – A way of placing one sequence above the other to make clear the correspondence between them.

A  G  T  C  G  T  C
A  _  T  C  _  T  C
--------------------------
1 -2  1  1 -2  1  1  = 1
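
The two scores above can be reproduced with a short sketch. The scoring values (+1 for a match, -1 for a mismatch, -2 for a gap) are read off the numbers on this slide; the function name `alignment_score` and the use of `_` as the gap character are illustrative choices, not part of the original work.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Score two already-aligned rows of equal length, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            total += gap          # a gap in either row
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total


similarity_example = alignment_score("AGTCTC", "ATTGTC")
alignment_example = alignment_score("AGTCGTC", "A_TC_TC")
print(similarity_example, alignment_example)
```

Note that this only scores a given placement of the two rows; finding the best placement is the job of the alignment algorithms discussed next.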

What is Sequence Comparison …Contd.

Many methods have been proposed for sequence comparison.

Some important ones include:

Dynamic programming algorithms for sequence alignment - Global, Local or Semi-Global Alignment

Heuristic and Database Search Algorithms - BLAST, FASTA.

What is Sequence Comparison …Contd.

Multiple sequence alignment Algorithms

Multiple sequence alignment methods are mainly used when there is a need to extract information from a group of sequences.

Examples of situations in which these techniques are used include the determination of secondary or tertiary structures, characterization of protein families, identification of similar regions etc.

What is Sequence Comparison …Contd.

Also many miscellaneous techniques have been proposed for sequence comparison

Contact based sequence alignment

Using Correlation Images

Some methods have been proposed without using the fundamental tool of Sequence Alignment

Shortest unique substring

Background Study

Basic Concepts of Molecular Biology

BLAST

Clustering

Phylogeny Trees / Phylip

Basic Concepts of Molecular Biology

Proteins – Most substances in our body are proteins.

Some of these are structural proteins and some are enzymes.

Proteins are responsible for what an organism is and what it does in a physical sense.

Amino Acids – A protein is a chain of simple molecules called Amino Acids. There are 20 amino acids in total.

Basic Concepts of Molecular Biology

Nucleic Acids – Nucleic Acids encode the information necessary to produce proteins. They are responsible for passing this recipe on to subsequent generations. Two types of nucleic acids are present in living organisms:

RNA (ribonucleic acid)

DNA (deoxyribonucleic acid)

BLAST

BLAST (Basic Local Alignment Search Tool)

BLAST algorithms are heuristic search methods

This method seeks words of length W (default=3 in blastP) that score at least T when aligned with the query and scored with the substitution matrix (e.g. PAM)
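
The word-seeding step described above can be sketched as follows. This is only an illustration of "words of length W that score at least T": the scoring function `toy_score` (+2 for identical residues, -1 otherwise) is a hypothetical stand-in for a real substitution matrix such as PAM, and the threshold `t=3` is likewise chosen only for the example. It is not BLAST's actual implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix lookup (e.g. PAM).
    return 2 if a == b else -1


def neighborhood_words(query, w=3, t=3, alphabet=AMINO_ACIDS):
    """For each query position, list all words of length w scoring >= t."""
    hits = {}
    for i in range(len(query) - w + 1):
        query_word = query[i:i + w]
        hits[i] = [
            "".join(candidate)
            for candidate in product(alphabet, repeat=w)
            if sum(toy_score(a, b) for a, b in zip(query_word, candidate)) >= t
        ]
    return hits


words = neighborhood_words("MKV")
print(len(words[0]))
```

With this toy scoring, a word passes the threshold when it differs from the query word in at most one position.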

Clustering

Clustering can be defined as “The process of organizing objects into groups whose members are similar in some way”.

Phylogeny Trees / Phylip

Phylogeny - The context of evolutionary biology

Phylogeny Trees - Relationships between different species and their common ancestors are shown by constructing a tree.

PHYLIP, the Phylogeny Inference Package, is a package of programs from the University of Washington for inferring phylogenies (evolutionary trees).

What can PHYLIP do?

Data used by PHYLIP.

Phylip…Contd.

The following programs from the PHYLIP package were used in this research.

FITCH - Estimates phylogenies from distance matrix data.

KITSCH - Estimates phylogenies from distance matrix data, assuming an evolutionary clock.

NEIGHBOR - Produces an unrooted tree from distance matrix data using the Neighbor-Joining method.

DRAWGRAM - Plots rooted phylogenies, cladograms, circular trees and phenograms in a wide variety of user-controllable formats. The program is interactive.

DRAWTREE - Similar to DRAWGRAM but plots unrooted phylogenies.

Our Approach …Correlogram

What is a Correlogram??

Representation of sequence in mathematical space.

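3-D matrix in which two dimensions are the set of entities (e.g. Amino Acids, Nucleic Acids) and the third dimension is distance.

[Figure: 3-D correlogram; two axes are labelled with the characters A, T, G, C and the third axis, D, with the distances 0, 1, 2, 3.]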

Correlogram for Image Comparison

The Correlogram method has already been used for image comparison.

“Image indexing using color correlograms” by Jing Huang, S Ravi Kumar, Mandar Mitra, Wei-Jing Zhu, Ramin Zabih

A color correlogram expresses how the spatial correlation of pairs of colors changes with the distance

Color correlogram has also been used recently for object tracking

Correlogram Usage in the field of Bioinformatics

Correlograms were used to analyze autocorrelation characteristics of active polypeptides.

MF Macchiato, V Cuomo and A Tramontano (1985), “Determination of the autocorrelation orders of proteins”

For analyzing spatial patterns in various experiments.

– Giorgio Bertorelle and Guido Barbujani (1995), “Analysis of DNA Diversity by Spatial Auto Correlation”

In studies regarding patterns of transitional mutation biases within and among mammalian genomes

– Michael S. Rosenberg, Sankar Subramanian, and Sudhir Kumar (2003), “Patterns of Transitional Mutation Biases Within and Among Mammalian Genomes”

Constructing a Correlogram plane

Example Sequence ….. agcttactgt

If we count the occurrences of every ordered pair of characters at distance 1, the Correlogram Plane for distance 1 is the table below (rows: first character of the pair, columns: second character).

A Correlogram can be constructed as a set of such frequency planes, one for each distance.

      A  T  G  C
  A   0  0  1  1
  T   1  1  1  0
  G   0  1  0  1
  C   0  2  0  0

d = 1
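
A minimal sketch of how one correlogram plane can be counted for the example sequence. The function name `correlogram_plane` and the dictionary-of-pairs representation are illustrative assumptions; the convention that the first character of a pair selects the row and the second the column is inferred from the example counts.

```python
from collections import Counter


def correlogram_plane(sequence, d):
    """Count ordered pairs (sequence[i], sequence[i + d]) for one distance d."""
    seq = sequence.upper()
    return Counter((seq[i], seq[i + d]) for i in range(len(seq) - d))


plane1 = correlogram_plane("agcttactgt", 1)

# Print the plane as a 4 x 4 table in the slide's A T G C order.
print("  " + " ".join("ATGC"))
for first in "ATGC":
    print(first, " ".join(str(plane1[(first, second)]) for second in "ATGC"))
```

Calling the same function with `d=0` or `d=2` gives the other planes of the example; at distance 0 every character is paired with itself, so only the diagonal is populated.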

Constructing a Correlogram plane…Contd.

Example Sequence ….. agcttactgt

Correlogram plane for d=0

      A  T  G  C
  A   2  0  0  0
  T   0  4  0  0
  G   0  0  2  0
  C   0  0  0  2

d = 0

Constructing a Correlogram plane…Contd.

Example Sequence ….. agcttactgt

Correlogram plane for d=2

      A  T  G  C
  A   0  1  0  1
  T   1  1  0  1
  G   0  1  0  0
  C   0  1  1  0

d = 2

Graphical Representation of Correlogram

Correlogram plane shown here is of a protein sequence for distance 0.

At distance 0 each character is compared with itself so we can see all the peaks at diagonal.

This is a Histogram.

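[Figure: 3-D histogram of the correlogram plane at distance 0 for a protein sequence; both horizontal axes list the 20 amino acids (A C D E F G H I K L M N P Q R S T V W Y) and the vertical axis runs from 0 to 0.1.]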

Graphical Representation …Contd.

Similarly Correlogram frequencies for distance 1 and distance 2 can be represented as…

Normalization of Correlogram

Need for normalization – Finding similarity between sequences of different length.

For every correlogram plane, each value is divided by the total volume of that plane.
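
The normalization step might look like the sketch below. It assumes that the "total volume" of a plane means the sum of all its cell values; the function name `normalized_plane` is hypothetical.

```python
from collections import Counter


def normalized_plane(sequence, d):
    """Correlogram plane for distance d with every cell divided by the plane total."""
    seq = sequence.upper()
    counts = Counter((seq[i], seq[i + d]) for i in range(len(seq) - d))
    total = sum(counts.values())  # assumed meaning of "total volume"
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


plane = normalized_plane("agcttactgt", 1)
print(plane[("C", "T")], sum(plane.values()))
```

After this step the cells of each plane sum to 1, so planes built from sequences of different lengths are on the same scale.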

Extension - Gapped Correlogram

Gapped Correlogram - Takes the gapped alignment of sequences into consideration.

The reason is that if a pair of characters is at distance d in one sequence, there is a probability that in the other sequence it appears at distance d-1 or d+1.

Adding a ‘delta’ to the Correlogram.

d ->      2     3     4     5     6
weight    0.25  0.5   1     0.5   0.25

For every pair at distance n with frequency f, and with delta = δ, a fraction $f / 2^{|n - m|}$ of the frequency is added to the plane at each distance m = n-1, n-2, …, n-δ and m = n+1, n+2, …, n+δ.
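
A sketch of the gapped extension, reading the weights 1, 0.5, 0.25 on this slide as the frequency divided by 2 raised to the number of planes between source and destination. Two points are assumptions made for the example and not stated in the deck: shares are only added to planes that belong to the chosen set of distances, and the function name `gapped_correlogram` is hypothetical.

```python
from collections import Counter, defaultdict


def gapped_correlogram(sequence, distances, delta=1):
    """Build planes for the given distances, spreading each count to nearby planes."""
    seq = sequence.upper()
    planes = defaultdict(lambda: defaultdict(float))
    for n in distances:
        counts = Counter((seq[i], seq[i + n]) for i in range(len(seq) - n))
        for pair, f in counts.items():
            planes[n][pair] += f                    # full frequency at distance n
            for step in range(1, delta + 1):
                share = f / 2 ** step               # 0.5, 0.25, ... of the frequency
                for m in (n - step, n + step):
                    if m in distances:              # assumption: only existing planes
                        planes[m][pair] += share
    return planes


gapped = gapped_correlogram("agcttactgt", [1, 2, 3], delta=1)
print(gapped[2][("C", "T")])
```

For the example sequence, the pair C→T occurs twice at distance 1 and once each at distances 2 and 3, so with delta = 1 the plane for distance 2 receives 1 + 2/2 + 1/2.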

Extension - Gapped Correlogram…Contd.

[Figure: gapped correlogram with Delta = 1. Half of each frequency in the plane for D=3 is added to the corresponding cell of the previous plane (D=2) and of the next plane (D=4); for example, a count of 2 in the D=3 plane adds 1 to each neighbouring plane, and a count of 1 adds 0.5.]

Correlogram for Sequence Comparison

We are using these Correlograms for comparison of 2 sequences.

Correlograms were constructed using same set of distances for both the sequences being compared.

Then the distance between each pair of corresponding cells of the two Correlograms (i.e. two 3-D matrices) is calculated as

$$d_{ijk} = \frac{(S_{ijk} - S'_{ijk})^2}{1 + S_{ijk} + S'_{ijk}}$$

where i, j and k index the three dimensions.

These cell distances were then added to get a final distance between the two sequences.

$$d = \sqrt{\sum_{i,j,k} d_{ijk}}$$
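
The distance computation can be sketched as below. The cell formula and the final square root follow this slide; the choice of distances `(0, 1, 2)`, the per-plane normalization, and the function names are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def correlogram(sequence, distances):
    """Normalized correlogram: one {pair: frequency} plane per distance."""
    seq = sequence.upper()
    planes = {}
    for d in distances:
        counts = Counter((seq[i], seq[i + d]) for i in range(len(seq) - d))
        total = sum(counts.values()) or 1
        planes[d] = {pair: n / total for pair, n in counts.items()}
    return planes


def correlogram_distance(seq_a, seq_b, distances=(0, 1, 2)):
    """Square root of the summed cell distances (S - S')^2 / (1 + S + S')."""
    s, s_prime = correlogram(seq_a, distances), correlogram(seq_b, distances)
    total = 0.0
    for d in distances:
        # Cells that are zero in both correlograms contribute nothing,
        # so it is enough to visit the pairs present in either plane.
        for pair in set(s[d]) | set(s_prime[d]):
            x, y = s[d].get(pair, 0.0), s_prime[d].get(pair, 0.0)
            total += (x - y) ** 2 / (1 + x + y)
    return sqrt(total)


print(correlogram_distance("agcttactgt", "agcttactgt"))
print(correlogram_distance("agcttactgt", "agcttacggt"))
```

Identical sequences give a distance of zero, and the measure is symmetric in its two arguments.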

One major difference !!

Synthetic Data Experiments using Correlogram

Purpose: To compare the discriminating capability of the correlogram method with that of one of the "traditional" comparison techniques, i.e. the Smith-Waterman Dynamic Programming algorithm.

The reason for using DP algorithms for comparison was that they are the most standard method for sequence comparison.

The sequences used in these experiments were amino acid sequences
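
For reference, the Smith-Waterman local alignment score can be computed with the standard dynamic programming recurrence, sketched here. The scoring values (+1 match, -1 mismatch, -2 gap, linear gap penalty) are illustrative assumptions; the deck does not state which scoring scheme or substitution matrix was used for the amino acid sequences.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTAGTCTT", "CCAGTCCC"))
```

Only two rows of the table are kept in memory because the sketch returns the score alone, not the alignment itself.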

Synthetic Data Experiments…Contd.

In all the experiments, the pair of sequences was compared using both Correlogram method and DP Method.

Synthetic Data Experiments…Contd.

The experiments were designed as follows:

Compare a base sequence with its reverse sequence.

Wrap around the target sequence at different character lengths and measure the difference with respect to the reference sequence each time.

Delete an amino acid from the target sequence and measure the difference with respect to the reference sequence each time.

Replace an amino acid at different locations and measure the difference with respect to the reference sequence each time.

Add an amino acid to the target sequence and measure the difference with respect to the reference sequence each time.
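
The five kinds of synthetic variants could be generated with helpers like the ones below. This is a sketch under assumptions: "wrap around" is interpreted here as rotating the sequence by k characters, and the base sequence and all function names are hypothetical.

```python
def reverse(seq):
    """Reverse sequence."""
    return seq[::-1]


def wrap_around(seq, k):
    """Move the first k characters to the end (assumed meaning of wrap around)."""
    if not seq:
        return seq
    k %= len(seq)
    return seq[k:] + seq[:k]


def delete_at(seq, i):
    """Delete the residue at position i."""
    return seq[:i] + seq[i + 1:]


def replace_at(seq, i, residue):
    """Replace the residue at position i."""
    return seq[:i] + residue + seq[i + 1:]


def insert_at(seq, i, residue):
    """Add a residue before position i."""
    return seq[:i] + residue + seq[i:]


base = "ACDEFGHIKL"  # hypothetical amino acid sequence
variants = [wrap_around(base, k) for k in range(1, len(base))]
print(reverse(base), variants[2], delete_at(base, 0))
```

Each variant would then be scored against the reference sequence with both the correlogram method and the DP method.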

Synthetic Data Experiments…Contd.

Comparing a base sequence with its reverse sequence.

[Figure: Correlogram Score and DP Score plotted against Iterations (1 to 4); Scores axis from -2 to 5.]

Synthetic Data Experiments…Contd.

Wrap around the target sequence at different character length and measure the difference with respect to the reference sequence each time.

[Figure: Correlogram Score and DP Score plotted against Iterations (0 to 12); Score axis from -1 to 5.]

Synthetic Data Experiments…Contd.

Delete an amino acid from target sequence and measure the difference with respect to the reference sequence each time.

[Figure: Correlogram Score and DP Score plotted against Iterations (1 to 10); Scores axis from -1 to 5.]

Synthetic Data Experiments…Contd.

Replace an amino acid at different location and measure the difference with respect to the reference sequence each time.

[Figure: Correlogram Score and DP Score plotted against Iterations (1 to 14); Scores axis from -1 to 5.]

Synthetic Data Experiments…Contd.

Add an amino acid at different location and measure the difference with respect to the reference sequence each time.

[Figure: Correlogram Score and DP Score plotted against Iterations (1 to 14); Scores axis from -1 to 5.]

Finding Test data..

“Alternate circulation of recent equine-2 influenza viruses (H3N8) from two distinct lineages in the United States” By Alexander C.K. Lai, Kristin M. Rogers, Amy Glaser, Lynn Tudor, Thomas Chambers

Hemagglutinin (HA) gene from different strains of equine-2 influenza viruses.

GeneTool version 1.1 – Compilation and analysis

Phylogenetic analysis was performed using the deduced HA1 amino acid sequence and the PHYLIP software package.

The distance matrix was calculated using the PROTDIST program, and an unrooted tree was generated using the FITCH program.

Test Data

Phylogeny Tree

Experiment 1 : Using same Test data

We performed an experiment with the same test data.

All the protein sequences were looked up at http://www.ebi.ac.uk/cgi-bin/expasyfetch

A distance matrix was created using the correlogram distances for every pair among these sequences.

From this distance matrix, a tree was created using the PHYLIP software package.

The program ‘FITCH’ was used for creating the tree, whereas the program ‘DRAWTREE’ was used for visualizing it.
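
To hand a distance matrix to FITCH, it has to be written in PHYLIP's distance matrix format: a first line with the number of taxa, then one row per taxon starting with a name padded to ten characters. The sketch below writes a square matrix in that layout; the helper name `phylip_distance_matrix` is hypothetical, and the three strains and their values are taken from the distance matrix shown later in this deck.

```python
def phylip_distance_matrix(names, matrix):
    """Format a square distance matrix as PHYLIP distance-matrix text."""
    if len(names) != len(matrix) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # PHYLIP names occupy ten characters
        lines.append(label + " ".join(f"{value:.6f}" for value in row))
    return "\n".join(lines) + "\n"


names = ["SA90", "LM92", "HK92"]
matrix = [
    [0.0, 0.035637, 0.087184],
    [0.035637, 0.0, 0.081841],
    [0.087184, 0.081841, 0.0],
]
text = phylip_distance_matrix(names, matrix)
print(text)
```

The resulting text would typically be saved as the input file that the PHYLIP distance programs read.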

Graphical Representation of Correlogram for SA90

[Figures: correlogram planes for SA90 at Distance = 0 (vertical axis 0 to 0.1), Distance = 1 (0 to 0.02) and Distance = 2 (0 to 0.02); both horizontal axes list the 20 amino acids A C D E F G H I K L M N P Q R S T V W Y.]

Graphical Representation of Correlogram for SA90

[Figures: correlogram planes for SA90 at Distance = 3 and Distance = 4 (vertical axes 0 to 0.015); both horizontal axes list the 20 amino acids A C D E F G H I K L M N P Q R S T V W Y.]

Distance Matrix

        SA90      SU89      LM92      HK92      KY92      KY91      KY94
SA90    0         0.084172  0.035637  0.087184  0.085504  0.085942  0.086551
SU89    0.084172  0         0.082183  0.014866  0.020469  0.021679  0.020881
LM92    0.035637  0.082183  0         0.081841  0.085076  0.085575  0.085744
HK92    0.087184  0.014866  0.081841  0         0.024637  0.026439  0.025417
KY92    0.085504  0.020469  0.085076  0.024637  0         0.018841  0.017493
KY91    0.085942  0.021679  0.085575  0.026439  0.018841  0         0.016823
KY94    0.086551  0.020881  0.085744  0.025417  0.017493  0.016823  0

Phylogeny Tree found with Correlogram Distances

Comparison of two trees.

Experiment 2 : Finding Test Data

Parvovirus causes stomach diseases in children.

Coat protein – Some coat proteins are important as they are responsible for the resistance.

Different strains of parvoviruses were studied for their VP1 protein.

Reference for the test data – Dr. Mavis McKenna and Dr. Rob McKenna from University of Florida, Gainesville.

From these distance matrices, trees were created using PHYLIP software package.

The programs ‘NEIGHBOR’ and ‘DRAWTREE’ were used.

Comparison of two trees.

Experiment 3 -Correlogram for Sequence Scanning

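The next experiment was to use the correlogram for scanning sequences, i.e. Pattern Finding.

The algorithm Scan Correlogram was developed for finding the occurrences of a given pattern over a long sequence.

Pattern: A T C G T

Target: A T C G A T C G T T A G C T C C

[Figure: the pattern is compared with successive substrings of the target, from the 1st Comparison at the start of the target, through the 2nd Comparison, to the Last Comparison at its end.]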
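
The scanning idea can be sketched as a sliding window. The details here are assumptions, since the deck does not spell them out: the window has the pattern's length and moves one character at a time, raw (unnormalized) counts at distances 0, 1 and 2 are compared with the cell distance formula given earlier, a window is reported when its difference score is at or below the cut-off, and locations are 0-based. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def correlogram(seq, distances):
    """One Counter of ordered pairs per distance (raw counts)."""
    return {d: Counter((seq[i], seq[i + d]) for i in range(len(seq) - d))
            for d in distances}


def difference(c1, c2, distances):
    """Square root of the summed cell distances between two correlograms."""
    total = 0.0
    for d in distances:
        for pair in set(c1[d]) | set(c2[d]):
            x, y = c1[d][pair], c2[d][pair]  # missing pairs count as 0
            total += (x - y) ** 2 / (1 + x + y)
    return sqrt(total)


def scan_correlogram(pattern, target, cutoff, distances=(0, 1, 2)):
    """Return (location, score, substring) for every window scoring <= cutoff."""
    pattern, target = pattern.upper(), target.upper()
    reference = correlogram(pattern, distances)
    width = len(pattern)
    hits = []
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        score = difference(reference, correlogram(window, distances), distances)
        if score <= cutoff:
            hits.append((start, round(score, 2), window))
    return hits


hits = scan_correlogram("ATCGT", "ATCGATCGTTAGCTCC", cutoff=0.5)
print(hits)
```

Because the comparison is between correlograms rather than aligned characters, windows that share the pattern's pair frequencies can score well even when they are not exact matches.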

Experiment 3 -Correlogram for Sequence Scanning…Contd.

The following viruses were used in this experiment:

Porcine-parvovirus

Bovine Parvovirus

CPV Packaged Strand

H1 Complementary

MVM Packaged Strand

PhiX-Genome

AAV NC001401

AAV Complementary

ADV Complementary

Astell and Tattersall MVMi Packaged Sequence

Experiment 3 -Correlogram for Sequence Scanning…Contd.

The patterns searched were as follows:

ACACCAAAA

ATACCTCTTGCATCCTCTATCAC

Results for Bovine Parvovirus

Following are the results shown for Bovine Parvovirus.

The length of the sequence was 5517 and the cut-off score used was 2.48 for all three patterns.

Pattern 1 - ACACCAAAA

[Figure: Difference Score (0 to 3) plotted against Location of Substring (0 to 6000).]

Results for Bovine Parvovirus

Following are the results shown for Bovine Parvovirus for pattern ACACCAAAA.

Location  Score Distance  Substring
129       2.28            ACAACTAAA
2167      1.99            ACCCAAATA
3543      2.39            AACTCCAAA
4149      1.83            TACCACCAA
4150      1.83            ACCACCAAA
4151      2.09            CCACCAAAT
4152      1.81            CACCAAATC
4798      2.48            ACCCCCAAT

Conclusions??

This research developed the correlogram comparison method for comparing sequences. Experiments were performed on real sequences and on synthetic sequences to answer the research question of whether the correlogram method can be used to compare biological sequences.

It was observed that the Dynamic Programming method was more sensitive to the positioning of characters (i.e. amino acids or nucleic acids) in the sequence (sequence alignment), whereas the Correlogram method was more sensitive to the characters themselves (the contents of the sequence).

Conclusions??

The real data experiments were conducted on different strains of the horse influenza virus and the parvovirus. It was observed that the phylogeny was retained in most cases; however, there were certain notable differences between the two trees.

The scan correlogram algorithm was developed and used in this research to find motifs or patterns. The results of this experiment showed that the sub-sequences obtained were very similar to the given pattern.

Future Work

Further study can be done to see how the array of distances used for correlogram computations impacts the results.

It will be interesting to study various delta values for Gapped correlograms and how they affect the scores. This gapped correlogram method can be further researched to see if the delta values are useful in determining global versus local alignments.

Future Work…Contd.

Enhancements can be made to the scan correlogram method to use the gapped correlogram method for finding patterns and also to find the sub-sequences of more or less length than that of the pattern sequence.

Acknowledgement

Dr. Kuntal Sengupta suggested that the correlogram method could be used for comparison of bio-sequences.

Dr. Mavis McKenna and Dr. Rob McKenna, University of Florida, Gainesville.

Mridula Anand, Florida Institute of Technology.

References

http://www.ncbi.nlm.nih.gov/BLAST/

http://highwire.stanford.edu/

http://au.expasy.org/

http://evolution.gs.washington.edu/phylip.html

“Alternate circulation of recent equine-2 influenza viruses (H3N8) from two distinct lineages in the United States” by Alexander C.K. Lai, Kristin M. Rogers, Amy Glaser, Lynn Tudor, Thomas Chambers.

“Image indexing using color correlograms” by Jing Huang, S Ravi Kumar, Mandar Mitra, Wei-Jing Zhu, Ramin Zabih.

“Phylogeny of the genus Haemophilus as determined by comparison of partial infB sequences” by Jakob Hedegaard, Henrik Okkels, Brita Bruun, Mogens Kilian, Kim K. Mortensen and Niels Nørskov-Lauritsen.

Thanks!!