the bioquest curriculum consortium at clark atlanta university atlanta, georgia feb. 14-16, 2003...

The BioQUEST Curriculum Consortium

at Clark Atlanta University

Atlanta, Georgia

Feb. 14-16, 2003

Evolutionary Bioinformatics Education: a National Science Foundation Chautauqua Course

More data yields stronger analyses, if done carefully!

Mosaic ideas and evolutionary ‘importance.’

Multiple Sequence Alignment & Analysis

Steven M. Thompson

Florida State University School of Computational Science and Information Technology (CSIT)

So what; why even bother?

Applications:

Probe, primer, and motif design;

Graphical illustrations;

Comparative ‘homology’ inference;

Molecular evolutionary analysis.

OK, well, how do you do it?

Applicability?

Dynamic programming’s complexity increases exponentially with the number of sequences being compared:

N-dimensional matrix . . . . complexity = [sequence length]^(number of sequences)
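To make the exponential growth concrete, here is a minimal sketch that evaluates the formula above for a few dataset sizes. The function name `dp_cells` and the example length of 300 residues are assumptions chosen for illustration; a real N-dimensional matrix also carries one extra boundary row per dimension, which this sketch ignores to stay faithful to the slide's formula.

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Cells in a full N-dimensional dynamic programming matrix,
    following the slide's formula: [sequence length] ** (number of sequences)."""
    return seq_len ** n_seqs


# Hypothetical example: sequences of 300 residues each.
for n_seqs in (2, 3, 4, 6):
    print(f"{n_seqs} sequences -> {dp_cells(300, n_seqs):,} cells")
```

Two sequences need tens of thousands of cells; six already need far more than any machine can store, which is why exhaustive multiple-sequence dynamic programming is restricted to very small problems.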

See:

MSA (‘global’ within ‘bounding box’) and

PIMA (‘local’ portions only) on the multiple alignment page at the

Baylor College of Medicine’s Search Launcher,

http://searchlauncher.bcm.tmc.edu/, but note the

severely limiting restrictions!

‘Global’ heuristic solutions

Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
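The first stage of the progressive strategy described above can be sketched as follows: score every pair with ordinary two-sequence dynamic programming, then decide the order in which sequences join the growing alignment. This is a simplified illustration, not the procedure used by PileUp or ClustalW. The scoring values, the function names `nw_score` and `joining_order`, and the greedy "join whatever is most similar to anything already placed" rule are all assumptions; real programs build a guide tree and then align groups to groups.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def joining_order(seqs):
    """seqs: dict of name -> sequence (at least two entries).
    Returns names in the order they would be added to the alignment."""
    scores = {
        frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }
    # Start with the most similar pair.
    order = sorted(max(scores, key=scores.get))
    remaining = set(seqs) - set(order)
    # Greedily add the sequence most similar to any already-placed sequence.
    while remaining:
        best = max(
            sorted(remaining),
            key=lambda n: max(scores[frozenset((n, m))] for m in order),
        )
        order.append(best)
        remaining.remove(best)
    return order
```

Because only two sequences (or two groups) are compared at any step, the cost grows with the number of pairs rather than exponentially with the number of sequences.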

Multiple Sequence Dynamic Programming

Web resources for pairwise, progressive multiple alignment:

http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html

http://pbil.univ-lyon1.fr/alignment.html

http://www.ebi.ac.uk/clustalw/

http://searchlauncher.bcm.tmc.edu/

However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

Reliability and the Comparative Approach:

explicit homologous correspondence;

manual adjustments based on knowledge,

especially structural, regulatory, and functional sites.

Therefore, editors like SeqLab and

the Ribosomal Database Project:

http://rdp.cme.msu.edu/html/.

Structural & Functional correspondence in the Wisconsin Package’s SeqLab

Work with proteins if at all possible!

Twenty match symbols versus four, plus similarity! Way better signal to noise.

Also guarantees no indels are placed within codons. So translate, then align.

Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
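One way to act on "translate, then align" is to align the protein translations first and then thread each coding sequence back onto its aligned protein, so that every gap becomes a whole-codon gap. The sketch below assumes the coding sequence is in frame, has had its stop codon removed, and uses `-` as the gap character; the function name `thread_codons` is hypothetical.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an unaligned coding sequence onto its aligned protein translation.

    Each residue consumes one codon; each protein gap becomes three
    nucleotide gaps, so no indel can ever fall inside a codon.
    """
    residues = sum(1 for c in aligned_protein if c != gap)
    if len(cds) != 3 * residues:
        raise ValueError("CDS length must be 3 x the number of aligned residues")
    out, pos = [], 0
    for c in aligned_protein:
        if c == gap:
            out.append(gap * 3)
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


# Toy example: protein 'M-K' aligned with one gap, CDS 'ATGAAA'.
print(thread_codons("M-K", "ATGAAA"))  # ATG---AAA
```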

Beware of aligning apples and oranges [and grapefruit]!

Paralogous versus orthologous;

genomic versus cDNA;

mature versus precursor.

Mask out uncertain areas.

Complications:

Order dependence.

Not that big of a deal.

Substitution matrices and gap penalties.

A very big deal!

Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).

Complications cont.:

Format hassles!

Specialized format conversion tools such as GCG’s From’ and To’ programs and PAUPSearch.

Don Gilbert’s public domain ReadSeq program.

Still more complications:

Indels and missing data symbols (i.e. gaps) designation discrepancy headaches:

., -, ~, ?, N, or X

. . . . . Help!
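A small normalization step before analysis can take some of the pain out of these symbol discrepancies. The sketch below converts a chosen set of gap-like characters to a single symbol. Which characters count as gaps is an assumption that depends on the program that wrote the file: by default only `.` and `~` are converted, because `?`, `N`, and `X` can also mean missing or ambiguous data rather than an indel. The function name `normalize_gaps` is hypothetical.

```python
def normalize_gaps(seq: str, gap_chars: str = ".~", gap: str = "-") -> str:
    """Replace every character in gap_chars with a single gap symbol."""
    table = str.maketrans({ch: gap for ch in gap_chars})
    return seq.translate(table)


print(normalize_gaps("AC..GT~~A-"))               # AC--GT--A-
print(normalize_gaps("AC?N", gap_chars=".~?"))    # AC-N
```

Keeping the conversion explicit, rather than blanket-replacing every odd character, avoids silently turning real ambiguity codes into indels.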

The consensus and motifs:

Conserved regions can be visualized with a sliding window approach and appear as peaks.

P-Loop

Let’s concentrate on the first peak seen here to simplify matters.
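The sliding-window idea can be sketched in a few lines: score each alignment column, then average the scores over a moving window so that conserved stretches show up as peaks. The per-column measure used here (fraction of sequences sharing the most common non-gap residue) is an assumption made for simplicity; plotting programs typically use a substitution-matrix-based similarity instead. The function names are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Per column: fraction of sequences sharing the most common non-gap residue."""
    n = len(alignment)
    scores = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c != "-")
        scores.append(max(counts.values()) / n if counts else 0.0)
    return scores


def sliding_window(values, window=3):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Toy alignment (equal-length rows); conserved columns score 1.0.
toy = ["GKSA", "GKTC", "GKSD"]
print(sliding_window(column_conservation(toy), window=2))
```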

The first GTP binding domain of EF 1α/Tu:

A consensus isn’t necessarily the biologically “correct” combination.

A simple consensus throws much information away!

Therefore, motif definition.
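To see what a simple consensus discards, compare a majority-rule consensus with the full set of residues observed in each column. The three-residue alignment is a made-up toy, and both function names are hypothetical.

```python
from collections import Counter


def majority_consensus(alignment):
    """Most common character in each column (ties resolved arbitrarily)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


def residue_sets(alignment):
    """All distinct characters seen in each column, sorted for readability."""
    return ["".join(sorted(set(col))) for col in zip(*alignment)]


toy = ["GKS", "GKT", "AKS"]
print(majority_consensus(toy))  # GKS
print(residue_sets(toy))        # ['AG', 'K', 'ST']
```

The consensus reports only `GKS`; the alternatives `A` in the first column and `T` in the third are lost, which is exactly the information a motif definition tries to retain.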

The EF 1α/Tu P-Loop:

Defined as:

(A,G)x4GK(S,T).

A one-dimensional ‘regular-expression’ of a conserved site.

Not necessarily biologically meaningful.

Motifs are limited in their ability to discriminate a residue’s ‘importance.’
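The motif (A,G)x4GK(S,T) translates directly into a standard regular expression: A or G, any four residues, G, K, then S or T. The sketch below uses Python's `re` module; the test fragment is a short illustrative string, not a database entry, and the name `find_p_loops` is hypothetical. Note that `finditer` reports non-overlapping matches only.

```python
import re

# (A,G)x4GK(S,T) written as a regular expression.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(protein: str):
    """Return (start index, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(protein)]


print(find_p_loops("TIGHVDHGKTTL"))  # [(2, 'GHVDHGKT')]
```

Every position in the pattern is all-or-nothing: a residue is either allowed or not, with no way to say that one alternative is more common or more important than another.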

FOR MORE INFO...

Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html and

http://bio.fsu.edu/~stevet/workshop.html and

http://bio.fsu.edu/~stevet/BSC5936.html

Contact me (stevet@[email protected]) for specific long-distance bioinformatics assistance and collaboration.

So how do we include ‘all’ the information of a multiple sequence alignment, or of a region within an alignment, in a description that doesn’t throw anything away?

Enter, for remote homology searching, the ‘profile’ . . .

profile algorithms, incl. ‘traditional’ Gribskov profiles, Expectation Maximization (MEME’s), and Hidden Markov Models (HMMer’s).
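The core idea of a profile, a position-specific table that scores every residue at every column, can be illustrated with a very small log-odds matrix. This is a deliberately simplified sketch and is not the Gribskov, MEME, or HMMER algorithm: it assumes a uniform background frequency, a flat pseudocount, no gap handling, and an ungapped toy alignment. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """One dict per column: residue -> log2(observed frequency / background)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, segment):
    """Sum of per-column scores; unknown characters contribute 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, segment))


def best_hit(profile, sequence):
    """(score, start) of the best ungapped window; sequence must be
    at least as long as the profile."""
    width = len(profile)
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


profile = build_profile(["GKS", "GKT", "GKS"])
print(best_hit(profile, "AAGKSAA"))
```

Unlike a consensus or a regular-expression motif, the table keeps graded information: in the third column S scores higher than T, and T scores higher than residues never observed there.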

Conclusions:

Many fine texts are starting to become available in the field.

To ‘honk-my-own-horn’ a bit, check out the new Current Protocols in Bioinformatics from John Wiley & Sons, Inc:

http://www.does.org/cp/bioinfo.html.

They asked me to contribute a chapter on multiple sequence analysis using GCG software.

Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their Introduction to Bioinformatics: A Theoretical And Practical Approach:

http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0.

Both volumes are now available.

AND FOR EVEN MORE INFO...

References:

Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A., pp. 28–36.

Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013–2018.

Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.

Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360.

Genetics Computer Group (Copyright 1982-2002) Program Manual for the Wisconsin Package, Version 10.3, Accelrys, subsidiary of Pharmacopeia Inc.

Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.

Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355–4358.

Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2, 459–472.

Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.

Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A.

Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 24, 4876–4882.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680.
