![Page 1: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/1.jpg)
Tutorial 2: Some problems in bioinformatics
1. Alignment
   pairs of sequences
   Database searching for sequences
   Multiple sequence alignment
   Protein classification
2. Phylogeny prediction (tree construction)
Sources:
1) "Bioinformatics: Sequence and Genome Analysis" by David W. Mount. 2001. Cold Spring Harbor Press
2) NCBI tutorial http://www.ncbi.nlm.nih.gov/Education/ and http://www.ncbi.nih.gov/BLAST/tutorial/Altschul-1.html
3) Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics
![Page 2: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/2.jpg)
Alignment: pairs of sequences
DNA: A, G, C, T
protein: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
    KQTGKG
    |  |||
    KSAGKG

    TCGCA
    || ||
    TC-CA
![Page 3: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/3.jpg)
![Page 4: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/4.jpg)
DNA to RNA to protein to phenotype
![Page 5: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/5.jpg)
DNA to RNA to protein to phenotype
![Page 6: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/6.jpg)
Alignment: pairs of sequences
Concepts:
Similarity
Identity
Homology
Orthology
Paralogy

    KQTGKGV
    |  |||:
    KSAGKGL

4/7 identical
5/7 similar
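The counts on this slide can be reproduced with a short sketch. Identity is an exact column match; similarity additionally needs a definition of which residues count as "similar". The grouping below (`SIMILAR_GROUPS`) is a hypothetical choice made only for illustration; real tools usually derive similarity from a substitution matrix.

```python
# Hypothetical groups of chemically similar amino acids (an assumption for
# illustration; not taken from the slides).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("NQ"), set("ST")]


def identity_and_similarity(a, b):
    """Count identical and similar columns in two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    similar = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gap columns are neither identical nor similar
        if x == y:
            identical += 1
            similar += 1  # every identical column also counts as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    return identical, similar, len(a)


ident, sim, n = identity_and_similarity("KQTGKGV", "KSAGKGL")
print(f"{ident}/{n} identical, {sim}/{n} similar")  # 4/7 identical, 5/7 similar
```

With this grouping, the V/L column is the only non-identical column that counts as similar, which matches the ":" in the alignment above.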
![Page 7: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/7.jpg)
Homology is based on evolutionary history
![Page 8: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/8.jpg)
![Page 9: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/9.jpg)
Figure 45 Lineage-specific expansions of domains and architectures of transcription factors. Top, specific families of transcription factors that have been expanded in each of the proteomes. Approximate numbers of domains identified in each of the (nearly) complete proteomes representing the lineages are shown next to the domains, and some of the most common architectures are shown. Some are shared by different animal lineages; others are lineage-specific.
![Page 10: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/10.jpg)
![Page 11: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/11.jpg)
A partial alignment of globin sequences.
Proteins with very little identity (10% or less) can be recognized as sharing a common domain if they match a pattern.
![Page 12: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/12.jpg)
![Page 13: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/13.jpg)
![Page 14: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/14.jpg)
- Fitch, W.M. 2000. Homology: A personal view on some of the problems. Trends Genet. 16: 227-231.
Homology, orthology and paralogy
orthologs diverged at a speciation event
paralogs diverged at a gene duplication event
![Page 15: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/15.jpg)
Alignment: pairs of sequences
Scoring schemes
Score = matches - mismatches - gaps
    GKG-RRWDAKR
    |||  ||
    GKGAKRWESAP
What is the best way to evaluate the contribution of each?
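As a rough illustration of why the weighting matters, the sketch below applies the slide's formula to the alignment shown. The function name `simple_score` and the default weights of 1 are assumptions made for this example, not values from the slides.

```python
def simple_score(a, b, match=1, mismatch=1, gap=1):
    """Score = match*matches - mismatch*mismatches - gap*gap_columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return match * matches - mismatch * mismatches - gap * gaps


top, bottom = "GKG-RRWDAKR", "GKGAKRWESAP"
print(simple_score(top, bottom))           # 5 matches, 5 mismatches, 1 gap -> -1
print(simple_score(top, bottom, match=2))  # doubling the match weight -> 4
```

The same alignment moves from a negative to a positive score when only the match weight changes, which is why the relative contribution of each term has to be chosen carefully.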
![Page 16: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/16.jpg)
A partial alignment of globin sequences from Pfam.
Proteins with very little identity (10% or less) can be recognized as sharing a common domain if they match a pattern.
![Page 17: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/17.jpg)
Alignment: pairs of sequences
Global vs. local alignment (end gaps are ignored in local alignment)
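One common way to compute a local alignment score is the Smith-Waterman recurrence; the slide does not name an algorithm, so treat this as an assumed illustration. Cell values are clamped at zero, so a poor prefix or suffix never drags down a good internal match. The scores (+1, -1, -1) and the function name `local_score` are assumptions.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score using a Smith-Waterman style recurrence."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere, which is what makes it local.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Only the shared GKG core contributes; the unrelated flanks are ignored.
print(local_score("AAAGKGTTT", "CCGKGCC"))  # 3
```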
![Page 18: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/18.jpg)
Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html
Dynamic programming

    TCGCA
    || ||
    TC-CA
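A minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch recurrence) is shown below for the two sequences on this slide. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions for illustration; the referenced lecture may use different values.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_align("TCGCA", "TCCA")
print(score)   # 3
print(top)     # TCGCA
print(bottom)  # TC-CA
```

Under these assumed scores the traceback recovers the alignment drawn on the slide: four matches and one gap opposite the G.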
![Page 19: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/19.jpg)
Dynamic programming
Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html
![Page 20: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/20.jpg)
Dynamic programming
Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html
![Page 21: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/21.jpg)
Dynamic programming
Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html
![Page 22: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/22.jpg)
Alignment: pairs of sequences
Scoring schemes
Score = matches - mismatches - gaps
    GKG-RRWDAKR
    |||  ||
    GKGAKRWESAP

"The dynamic programming algorithm was improved in performance by Gotoh (1982) by using the linear relationship for a gap weight w_x = g + rx, where the weight for a gap of length x is the sum of a gap opening penalty (g) and a gap extension penalty (r) times the gap length (x), and by simplifying the dynamic programming algorithm."
- D. W. Mount

    KQTGKG-RRWDAKR
    |  |||     |||
    KSAGKG-----AKR
VS.
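The gap weight w_x = g + rx from the quotation can be sketched as follows. The values g = 10 and r = 1 and the function names are assumptions for illustration only. The point is that one long gap becomes much cheaper than many short gaps of the same total length.

```python
import re


def gap_weight(x, g=10.0, r=1.0):
    """Weight of a single gap of length x: w_x = g + r * x."""
    return g + r * x


def total_gap_penalty(aligned, g=10.0, r=1.0):
    """Sum the gap weight over every run of '-' in one aligned sequence."""
    return sum(gap_weight(len(run), g, r) for run in re.findall(r"-+", aligned))


# One gap of length 5 versus five separate gaps of length 1.
print(total_gap_penalty("KSAGKG-----AKR"))  # 10 + 1*5 = 15.0
print(total_gap_penalty("K-S-A-G-K-GAKR"))  # 5 * (10 + 1) = 55.0
```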
![Page 23: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/23.jpg)
Alignment: amino acid substitution matrices
Scoring schemes
"Any [scoring] matrix has an implicit amino acid pair frequency distribution that characterizes the alignments it is optimized for finding. More precisely, let $p_i$ be the frequency with which amino acid i occurs in protein sequences and let $q_{ij}$ be the frequency with which amino acids i and j are aligned within the class of alignments sought. Then, the scores that best distinguish these alignments from chance are given by the formula:
$$S_{ij} = \log\left(\frac{q_{ij}}{p_i \, p_j}\right)$$
The base of the logarithm is arbitrary, affecting only the scale of the scores. Any set of scores useful for local alignment can be written in this form, so a choice of substitution matrices can be viewed as an implicit choice of 'target frequencies'."
- Altschul et al. 1994 (Nature Genetics 6:119)
Those frequencies are characteristic of the sequences being aligned, and are primarily a function of their degree of divergence.
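The log-odds formula can be evaluated directly. The frequencies below are made-up numbers chosen so the arithmetic is easy to follow; they are not taken from any real substitution matrix, and the function name `log_odds` is an assumption.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """S_ij = log(q_ij / (p_i * p_j)); the base only rescales the scores."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical background frequencies of 0.1 for both residues.
# A pair aligned 4x more often than chance scores about +2 bits,
# and a pair aligned 4x less often than chance scores about -2 bits.
print(round(log_odds(0.04, 0.1, 0.1), 6))
print(round(log_odds(0.0025, 0.1, 0.1), 6))
```

Pairs seen more often than chance expectation get positive scores, and pairs seen less often get negative scores, which is what allows a local alignment to end where similarity runs out.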
![Page 24: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/24.jpg)
Alignment: amino acid substitution matrices
Substitution matrices -- BLOSUM 62
Henikoff and Henikoff. 1992. Amino acid substitution matrices from protein blocks. PNAS 89: 10915-10919.
![Page 25: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/25.jpg)
Alignment: amino acid substitution matrices
Substitution matrices -- BLOSUM 62
![Page 26: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/26.jpg)
Alignment: implementations
FASTA
Introduces the concept of short perfect k-tuple matches to seed longer alignments.
BLAST -- Basic Local Alignment Search Tool
Initiates an alignment locally and then extends that alignment.

    GKG
    |||
    GKG

    GKG-RRW
    |||  ||
    GKGAKRW
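The seeding step can be sketched as an exact word lookup: index every k-tuple of the query, then scan the subject for the same words. This is a simplification made for illustration; the function name `ktuple_seeds` is hypothetical, and the real programs add scoring thresholds and extension steps that are not shown here.

```python
from collections import defaultdict


def ktuple_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos, word) for every exact k-tuple match."""
    index = defaultdict(list)  # word -> start positions in the query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        word = subject[j:j + k]
        for i in index.get(word, []):
            seeds.append((i, j, word))
    return seeds


# Ungapped versions of the two sequences in the alignment above.
print(ktuple_seeds("GKGRRW", "GKGAKRW", k=3))  # [(0, 0, 'GKG')]
print(ktuple_seeds("GKGRRW", "GKGAKRW", k=2))  # [(0, 0, 'GK'), (1, 1, 'KG'), (4, 5, 'RW')]
```

With k = 2 the last seed lies on a different diagonal (subject position minus query position is 1 rather than 0), which is the signature of the single gap in the extended alignment.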
![Page 27: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/27.jpg)
Alignment: Searching databases for sequences
![Page 28: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/28.jpg)
There are many modifications of BLAST for specific purposes.
![Page 29: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/29.jpg)
The NCBI BLAST interface
![Page 30: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/30.jpg)
The NCBI BLAST interface
![Page 31: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/31.jpg)
![Page 32: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/32.jpg)
Extreme value distribution: the expected distribution of the maximum of many independent random variables, generally $Y = \exp[-x - e^{-x}]$
K and lambda are statistical parameters dependent upon the scoring system and the background amino acid frequencies of the sequences being compared. While FASTA estimates these parameters from the scores generated by actual database searches, BLAST estimates them beforehand for specific scoring schemes by comparing many random sequences generated using a standard protein amino acid composition [12].
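The sketch below evaluates the density given above and shows how K and lambda enter the significance of a score. The expected-hits formula E = K m n exp(-lambda S) is the standard Karlin-Altschul form and is not spelled out in this transcript; the numeric values of K, lambda, and the sequence lengths are illustrative assumptions only.

```python
import math


def evd_density(x):
    """Standard extreme value density: Y = exp(-x - exp(-x))."""
    return math.exp(-x - math.exp(-x))


def expected_hits(score, m, n, K, lam):
    """Expected number of chance hits scoring at least `score`: E = K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """Probability of at least one chance hit scoring >= score: P = 1 - exp(-E)."""
    return -math.expm1(-expected_hits(score, m, n, K, lam))


# Illustrative parameters (assumptions): query length m, database length n.
K, lam, m, n = 0.134, 0.3176, 250, 1_000_000
for s in (40, 60, 80):
    print(s, expected_hits(s, m, n, K, lam), p_value(s, m, n, K, lam))
```

Because the score appears in an exponent, raising the score by ln(2)/lambda halves the expected number of chance hits.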
![Page 33: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/33.jpg)
![Page 34: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/34.jpg)
FASTA can be run at EMBL. The software is also available for download.
![Page 35: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/35.jpg)
![Page 36: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/36.jpg)
![Page 37: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/37.jpg)
Alignment: Multiple sequence alignment
![Page 38: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/38.jpg)
![Page 39: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/39.jpg)
![Page 40: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/40.jpg)
Alignment: Protein classification
![Page 41: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/41.jpg)
![Page 42: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/42.jpg)
![Page 43: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/43.jpg)
![Page 44: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/44.jpg)
Phylogeny prediction (tree construction)
![Page 45: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/45.jpg)
root
![Page 46: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/46.jpg)
![Page 47: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/47.jpg)
Phylogeny prediction (tree construction)
Character-based Methods
  Parsimony
  Maximum Likelihood: the tree that maximizes the likelihood of seeing the data
  Bayesian Analysis: trees with the greatest posterior probability given the data
Distance Methods
  Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
  Neighbor joining
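As an example of a distance method, the sketch below implements a bare-bones UPGMA: repeatedly merge the two closest clusters and average their distances to the remaining clusters, weighted by cluster size. The species names and distances are made up for illustration, branch lengths are omitted, and the function name `upgma` is an assumption.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree topology (Newick string, no branch lengths) by UPGMA.

    names: list of taxon labels; dist: symmetric matrix (list of lists).
    """
    # cluster id -> (subtree as Newick text, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])  # closest pair
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        new = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted average keeps every leaf equally weighted.
            new[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new)
        clusters[next_id] = (f"({ta},{tb})", na + nb)
        next_id += 1
    (tree, _), = clusters.values()
    return tree + ";"


# Hypothetical pairwise distances, for illustration only.
names = ["human", "chimp", "gorilla", "orangutan"]
dist = [[0, 2, 4, 6],
        [2, 0, 4, 6],
        [4, 4, 0, 6],
        [6, 6, 6, 0]]
print(upgma(names, dist))  # (orangutan,(gorilla,(human,chimp)));
```

UPGMA assumes a constant rate of change along all lineages; neighbor joining relaxes that assumption.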
![Page 48: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/48.jpg)
a, The interspecies relationships of five chromosome regions to corresponding DNA sequences in a chimpanzee and a gorilla. Most regions show humans to be most closely related to chimpanzees (red) whereas a few regions show other relationships (green and blue). b, The among-human relationships of the same regions are illustrated schematically for five individual chromosomes.
Within- and between-species variation along a single chromosome.
![Page 49: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/49.jpg)
Tutorial III: Open problems in bioinformatics
Tentatively:
Detection of subtle signals
  promoter elements
  exon splicing enhancers
  noncoding RNAs
  weak protein similarities
Microarrays
Protein folding and homology modeling
Thursday, June 10, 2:00 - 3:45
![Page 50: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/50.jpg)
Microarray expression data
Statistical analysis -- what has changed
Clustering -- which genes change together
Clustering -- promoter recognition
Clustering -- database integration
Phenotype determination (e.g. cancer prognosis)
![Page 51: Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences](https://reader035.vdocument.in/reader035/viewer/2022062407/56812ddc550346895d932d21/html5/thumbnails/51.jpg)
Tutorial 2: Some problems in bioinformatics
1. Alignment
   pairs of sequences
   Multiple sequence alignment
   Database searching for sequences
   Protein classification
2. Phylogeny prediction (tree construction)
3. Microarray expression data
4. Protein structure
   Protein folding
   Structure prediction
   Homology modeling
Sources:
1) "Bioinformatics: Sequence and Genome Analysis" by David W. Mount. 2001. Cold Spring Harbor Press
2) NCBI tutorial http://www.ncbi.nlm.nih.gov/Education/
3) Cold Spring Harbor course in Computational Genomics (1999) Pearson