genomics and personalized care in health systems lecture 3 sequence alignment leming zhou school...
Genomics and Personalized Care in Health Systems
Lecture 3 Sequence Alignment
Leming Zhou
School of Health and Rehabilitation Sciences
Department of Health Information Management
Outline
• Pairwise sequence alignment
• Multiple sequence alignment
• Phylogenetic tree
Similarity Search
• Find statistically significant matches to a protein or DNA sequence of interest.
• Obtain information on the inferred function of the gene
• Sequence identity/similarity is a quantitative measurement of the number of nucleotides / amino acids which are identical / similar in two aligned sequences
– Calculated from a sequence alignment
– Can be expressed as a percentage
– In proteins, some residues are chemically similar but not identical
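As a minimal sketch of how percent identity can be calculated from an alignment: the function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption, since other denominators are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns where both rows carry the same residue.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


# Three identical columns out of four aligned columns.
print(percent_identity("ACGT", "AC-T"))
```

For protein sequences, a similarity percentage would additionally count chemically similar residue pairs, which requires a substitution matrix.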
Sequence Alignment
• A linear, one-to-one correspondence between some of the symbols in one sequence and some of the symbols in another sequence
– Four possible outcomes in aligning two sequences
• Identity; mismatch; gap in one sequence; gap in the other sequence
• May be DNA or protein sequences.
Evolutionary Basis of Alignment
• The simplest molecular mechanisms of evolution are substitution, insertion, and deletion
• If a sequence alignment represents the evolutionary relationship of two sequences, residues that are aligned but do not match represent substitutions
• Residues that are aligned with a gap in the sequence represent insertions or deletions
Alignment Algorithms
• Sequences often contain highly conserved regions
• These regions can be used for an initial alignment
Alignments
• Two sequences
Seq 1: ACGGACT
Seq 2: ATCGGATCT
• There may be multiple ways of creating the alignment. Which alignment is the best?

A - C - G G - A C T
|   |   |       | |
A T C G G A T - C T

A T C G G A T C T
|   | | |     | |
A - C G G - A C T
Optimal vs. Correct Alignment
• For a given group of sequences, there is no single “correct” alignment, only an alignment that is “optimal” according to some set of calculations
• This is partly due to:
– the complexity of the problem,
– limitations of the scoring systems used,
– our limited understanding of life and evolution
• Success of the alignment will depend on the similarity of the sequences. If sequence variation is great it will be very difficult to find an optimal alignment
Optimal Alignment
• Every alignment has a score
• Choose the alignment with the highest score
• Must choose appropriate scoring function
• Scoring function based on evolutionary model with insertions, deletions, and substitutions
• Use substitution score matrix – contains an entry for every amino acid pair
Gaps
• Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
• Biologically, it should in general be easier for a sequence to accept a different residue in a position, rather than having parts of the sequence deleted or inserted. Gaps/insertions should therefore be more rare than point mutations (substitutions)
Gaps in Sequence Alignment
• Gaps can occur
– Before the first character of a string
– Inside a string
– After the last character of a string

CTGCGGG---GGTAAT
  ||||    || ||
--GCGG-AGAGG-AA-
Gap penalties
• There is no suitable theory for gap penalties.
• The simplest gap penalty is a constant penalty for each gap
• The most common type of gap penalty is the affine gap penalty: g = a + bx
– a is the gap opening penalty
– b is the gap extension penalty
– x is the number of gapped-out residues.
• A contiguous block of residues is more likely to be inserted or deleted in one event than the same residues in several separate events
• Scoring scheme should therefore penalize new gaps more than gap extensions
• Typical values, e.g. a = 10 and b = 1 for BLAST.
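The affine formula g = a + bx can be sketched in a few lines. The function name `affine_gap_penalty` is hypothetical; the defaults a = 10 and b = 1 are the example values quoted above, and the penalty is returned as a positive cost to be subtracted from the alignment score.

```python
def affine_gap_penalty(x: int, a: float = 10.0, b: float = 1.0) -> float:
    """Cost g = a + b*x of one gap spanning x residues (0 if there is no gap)."""
    if x < 0:
        raise ValueError("gap length must be non-negative")
    if x == 0:
        return 0.0
    return a + b * x


# One gap of three residues versus three separate one-residue gaps.
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

With these values a single three-residue gap costs 13, while three separate one-residue gaps cost 33, so opening a new gap is penalized far more than extending an existing one.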
Pairwise Sequence Alignment
Pairwise Alignment
• The process of lining up two sequences to achieve maximal levels of identity or conservation for the purpose of assessing the degree of similarity and the possibility of homology
• It is used to
– Decide if two genes are related structurally or functionally
• Find the similarities between two sequences with the same evolutionary background
– Identify domains or motifs that are shared between proteins
– Analyze genomes
• Identify genes, search large databases, determine overlaps of sequences (DNA assembly)
DNA and Protein Sequences
• DNA alphabet: {A, C, G, T}+
– Four discrete possibilities – it’s either a match or a mismatch
• Protein alphabet: {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}+
– 20 possibilities which fall into several categories
– Residues can be similar without being identical
• In some cases, protein sequence is more informative
– Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• In some cases, DNA alignments are appropriate
– To confirm the identity of a cDNA; to study noncoding regions of DNA; to study DNA polymorphisms, …
Translating a DNA Sequence into Proteins
• DNA sequences can be translated into protein, and then used in pairwise alignments
• One DNA sequence can be translated into six potential proteins

Forward-strand reading frames:
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT

Reverse-strand reading frames:
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
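The six reading frames can be sketched as follows, assuming the standard genetic code. All names (`CODON_TABLE`, `six_frame_translations`, and so on) are hypothetical; stop codons are written as `*` and any codon containing a non-ACGT letter as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; a trailing partial codon is dropped."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated protein strings."""
    dna = dna.upper()
    opposite = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(opposite[offset:])
    return frames


slide_sequence = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for label, protein in six_frame_translations(slide_sequence).items():
    print(label, protein)
```

The first two codons of each frame printed here correspond to the codon pairs listed above (CAT CAA, ATC AAC, TCA ACT on the forward strand; GTG GGT, TGG GTA, GGG TAG on the reverse strand).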
DNA Alignment Score

CGAAGACTTGAGCTGAT
|| |||| ||| ||||
CGCAGACATGA-CTGAC

$$S = \sum (\text{identity}, \text{mismatches}, \text{gap penalties}), \qquad \text{Score} = \max(S)$$

[Figure labels: Match, Mismatch, Gap]
Alignment Scoring Scheme
• Possible scoring scheme:
– match: +5
– mismatch: -3
– indel: -4
• Example:

G  A  A  T  T  C  A  G  T  T  A
|     |     |  |     |        |
G  G  A  -  T  C  -  G  -  -  A
+5 -3 +5 -4 +5 +5 -4 +5 -4 -4 +5

S = 5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11

Substitution scores for DNA under this scheme:

     A   C   G   T
A    5
C   -3   5
G   -3  -3   5
T   -3  -3  -3   5
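The worked example above can be checked with a short sketch. The function name `score_alignment` is hypothetical; the default scores are the +5 / -3 / -4 scheme from this slide, with every gap column charged the same indel cost.

```python
def score_alignment(row_a: str, row_b: str, match: int = 5,
                    mismatch: int = -3, indel: int = -4, gap: str = "-") -> int:
    """Sum the column scores of a given pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot contain two gaps")
        if x == gap or y == gap:
            total += indel      # gap in either sequence
        elif x == y:
            total += match      # identity
        else:
            total += mismatch   # substitution
    return total


# The alignment from the example: 6 matches, 1 mismatch, 4 indels.
print(score_alignment("GAATTCAGTTA", "GGA-TC-G--A"))  # 11
```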
Amino Acid Sequence Alignment
• No exact match/mismatch scores
• Match state score calculated by table lookup
• Lookup table is substitution matrix (or scoring matrix)
Substitution Matrix
• A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.
• Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
• Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
• The two major types of substitution matrices are Point Accepted Mutation (PAM) and BLOcks SUbstitution Matrix (BLOSUM).
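As a hedged aside, matrices of this kind are commonly expressed as log-odds scores. The symbols below are not from the slides and are introduced only for illustration: q_ij is the frequency with which residues i and j are observed aligned in the trusted alignments, p_i and p_j are the background frequencies of the two residues, and λ is a scaling constant.

$$s(i,j) = \frac{1}{\lambda}\,\log\frac{q_{ij}}{p_i\,p_j}$$

Under this form a positive score means the pair is seen aligned more often than chance would predict, and a negative score means it is seen less often.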
Sequence Alignment Algorithms
• Dynamic Programming:
– Needleman-Wunsch Global Alignment (1970)
– Smith-Waterman Local Alignment (1981)
• Guaranteed to find the best-scoring alignment
• Slow, especially when used to compare a sequence against a large database
• Heuristics
– FASTA, BLAST: heuristic approximations to Smith-Waterman
• Fast, with results comparable to the Smith-Waterman algorithm
Dynamic Programming
• Solve optimization problems by dividing the problem into smaller, overlapping subproblems and reusing their solutions
• Sequence alignment has the optimal substructure property
– Subproblem: alignment of prefixes of two sequences
– Each subproblem is computed once and stored in a matrix
• Optimal score: built upon optimal alignment computed to that point
• Aligns two sequences beginning at ends, attempting to align all possible pairs of characters
– Alignment contains matches, mismatches and gaps
– Scoring scheme for matches, mismatches, gaps
– Highest set of scores defines optimal alignment between sequences
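The prefix-by-prefix idea described above can be sketched as a small global (Needleman-Wunsch style) aligner. This is an illustration only: the function name is hypothetical, and it uses the simple +5 / -3 / -4 scheme with a constant per-residue indel cost rather than an affine gap penalty or a substitution matrix.

```python
def needleman_wunsch(a: str, b: str, match: int = 5,
                     mismatch: int = -3, indel: int = -4):
    """Return (optimal score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + indel, score[i][j - 1] + indel)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ACGGACT", "ATCGGATCT")
print(best)
print(top)
print(bottom)
```

For the two short sequences from the earlier "Alignments" slide, this scheme finds an alignment with seven matches and two indels (score 27), which scores higher than either alignment drawn on that slide.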
The Big O Notation
• Computational complexity of an algorithm is how its execution time increases as the problem is made larger (e.g. more sequences to align)
• The big-O notation
– If we have a problem of size n, then an algorithm takes O(n) time if the time increases linearly with n. If the algorithm needs time proportional to the square of n, then it is O(n^2)
– More examples (here c is a constant):
• O(c) utopian
• O(log n) excellent
• O(n) very good
• O(n^2) not so good
• O(n^3) pretty bad
• O(c^n) disaster
Drawbacks to DP Approaches
• Compute intensive
• Memory intensive
• Complexity of DP Algorithm
– Time O(nm); space O(nm)
• where n, m are the lengths of the two sequences.
– Space complexity can be reduced to O(n) by not storing the entries of the dynamic programming table that are no longer needed for the computation (keep the current row and the previous row only)
• A fast heuristic (BLAST) will be discussed next week
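The two-row trick can be sketched as below. Note the limitation: keeping only two rows yields the optimal score but not the alignment itself, because the traceback information is discarded. The function name and the +5 / -3 / -4 scoring values are assumptions carried over from the earlier scoring-scheme slide.

```python
def nw_score_two_rows(a: str, b: str, match: int = 5,
                      mismatch: int = -3, indel: int = -4) -> int:
    """Optimal global alignment score using only two rows of the DP table."""
    if len(b) > len(a):
        a, b = b, a  # the scoring is symmetric, so keep the shorter sequence along the row
    previous = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(diag, previous[j] + indel, current[j - 1] + indel)
        previous = current  # the older row is no longer needed
    return previous[-1]


print(nw_score_two_rows("ACGGACT", "ATCGGATCT"))
```

Time remains O(nm); only the memory footprint shrinks, to a pair of rows as long as the shorter sequence.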
Two Sequences
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA
TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGGCTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTATCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACACTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCCCCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAAACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGTCAAAATGACAAAGACCTTTGAAAATCTGTCCTACTAATAAAAGGCATTTACTTTCACTGC
Pairwise Sequence Alignment
• FASTA: http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi
• DNA vs. DNA comparison
• Default parameters:
– Match: +5
– Mismatch: -4
– Gap open penalty: -12
– Gap extension penalty: -4
• BLAST search will be covered next week
Multiple Sequence Alignment
Multiple Sequence Alignment
• Multiple sequence alignment (MSA) is a generalization of pairwise sequence alignment: instead of aligning two sequences, n (>2) sequences are aligned simultaneously
• A multiple sequence alignment is obtained by inserting gaps (“-”) into sequences such that the resulting sequences all have length L and can be arranged in a matrix of n rows and L columns where each column represents a homologous position
• MSA applies both to DNA and protein sequences
Why Do We Need MSA?
• MSA can help to develop a sequence “fingerprint” which allows the identification of members of a distantly related protein family (motifs)
• Formulate & test hypotheses about protein 3-D structure
• MSA can help us to reveal biological facts about proteins, e.g. how protein function has changed or the evolutionary pressure acting on a gene
• Crucial for genome sequencing:
– Random fragments of a large molecule are sequenced and those that overlap are found by a multiple sequence alignment program.
• To establish homology for phylogenetic analyses
• Identify homologous sequences in other organisms
Multiple Sequence Alignment
• Difficulty: introducing multiple sequences increases the number of possible combinations of matches, mismatches, and gaps
• In pairwise alignments, one has a 2D matrix with the sequences on each axis. The number of operations required to locate the best “path” through the matrix is approximately proportional to the product of the lengths of the two sequences
• A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a DP algorithm in N dimensions. Algorithmically, this is not difficult to do
Example

fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
MSA
• How do we generate a multiple alignment? Given a pairwise alignment, just add the third, then the fourth, and so on, until all have been aligned. Does it work?
• It is not self-evident how these sequences are to be aligned together.
• It depends not only on the various alignment parameters but also on the order in which sequences are added to the multiple alignment
Dynamic Programming for MSA
• Dynamic programming with two sequences
– Relatively easy to code
– Guaranteed to obtain optimal alignment
• An extension of the pairwise sequence alignment
– Alignment of K sequences
• K(K-1)/2 possible sequence comparisons
• Alignment algorithms operate in a similar manner as pairwise alignment but now the distance matrix is K dimensional and the weight function compares K letters
Time Complexity of Optimal MSA
• Space complexity (hyperlattice size): O(n^k) for k sequences each n long.
• Computing a hyperlattice node: O(2^k).
• Time complexity: O(2^k n^k).
• Finding the optimal solution takes time exponential in k (non-polynomial; the problem is NP-hard).
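To get a feel for how quickly these bounds grow, here is a rough counting sketch. The function name `msa_dp_cost` is hypothetical, and the counts are simple estimates that follow the O(n^k) and O(2^k) terms above: (n + 1)^k lattice nodes, each looking back at up to 2^k - 1 predecessor nodes.

```python
def msa_dp_cost(n: int, k: int) -> tuple[int, int]:
    """Return (lattice nodes, approximate node-predecessor lookups)
    for exact DP over k sequences of length n."""
    nodes = (n + 1) ** k          # hyperlattice size, O(n^k)
    predecessors = 2 ** k - 1     # neighbours examined per node, O(2^k)
    return nodes, nodes * predecessors


for k in (2, 3, 5, 10):
    nodes, work = msa_dp_cost(100, k)
    print(f"k={k:2d}  nodes={nodes:.3e}  lookups={work:.3e}")
```

Even for sequences of only 100 residues, the lattice for ten sequences has on the order of 10^20 nodes, which is why exact dynamic programming is not used for realistic multiple alignments.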
Heuristics for Optimal MSA
• Reduction of space and time
• Heuristic alignment – not guaranteed to be optimal
• Alignment provides a limit to the volume within which optimal alignments are likely to be found
• Heuristics:
– Progressive alignments (ClustalW)
Progressive Alignment
• Works by progressive alignment: it aligns a pair of sequences, then aligns the next one onto the first pair
• Most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments
• Uses alignment scores to produce a guide tree
• Aligns the sequences sequentially, guided by the relationships indicated by the tree
– If the order is wrong and distantly related sequences are merged too soon, errors in the alignment may occur and propagate
• Gap penalties can be adjusted based on the specific sequences
CLUSTALW
• http://www.ebi.ac.uk/clustalw/
• Perform pairwise alignments of all sequences
• Use alignment scores to produce a guide tree
• Align sequences sequentially, guided by the tree
• Enhanced Dynamic Programming used to align sequences
• Genetic distance determined by number of mismatches divided by number of matches
• Gaps are added to an existing profile in progressive methods
• CLUSTALW incorporates a statistical model in order to place gaps where they are most likely to occur
ClustalW MSA Procedure
[Figure: sequences S1, S2, S3, S4 → all pairwise alignments → similarity matrix → cluster analysis → dendrogram with a distance axis, grouping S1 with S3 and S2 with S4]
Multiple Alignment Step:
1. Aligning S1 and S3
2. Aligning S2 and S4
3. Aligning (S1,S3) with (S2,S4).
From Higgins (1991) and Thompson (1994).
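The order of the alignment steps above can be illustrated with a small clustering sketch. This is not ClustalW's actual tree-building code: it uses plain average-linkage clustering as a stand-in, the function name `guide_tree_order` is hypothetical, and the pairwise distances are made-up values chosen so that S1/S3 and S2/S4 are the closest pairs, as in the figure.

```python
from itertools import combinations


def guide_tree_order(names, dist):
    """Return the sequence of cluster merges, closest pair first.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    Clusters are tuples of sequence names; names are assumed to be unique.
    """
    clusters = [(name,) for name in names]
    merges = []

    def average_distance(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Illustrative distances only (not taken from the slide).
example_dist = {
    frozenset(("S1", "S2")): 0.60, frozenset(("S1", "S3")): 0.10,
    frozenset(("S1", "S4")): 0.70, frozenset(("S2", "S3")): 0.65,
    frozenset(("S2", "S4")): 0.20, frozenset(("S3", "S4")): 0.75,
}
for step, (left, right) in enumerate(guide_tree_order(["S1", "S2", "S3", "S4"], example_dist), 1):
    print(step, left, right)
```

Each merge corresponds to one progressive alignment step: two single sequences are aligned pairwise, and merged groups are aligned to each other as profiles.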
Three Protein Sequences
>sp|P25454|RAD51_YEAST DNA repair protein RAD51 OS=Saccharomyces cerevisiae GN=RAD51 PE=1 SV=1
MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNGSGDGGGLQEQAEAQGEMEDEAYD
EAALGSFVPIEKLQVNGITMADVKKLRESGLHTAEAVAYAPRKDLLEIKGISEAKADKLLNEAARLVPMGFVTAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLCHTLAVTCQIPLDIGGGEGKCLYIDTEGTFRPVRLVSIAQRFGLDPDDALNNVAYARAYNADHQLRLLDAAAQMMSESRFSLIVVDSVMALYRTDFSGRGELSARQMHLAKFMRALQRLADQFGVAVVVTNQVVAQVDGGMAFNPDPKKPIGGNIMAHSSTTRLGFKKGKGCQRLCKVVDSPCLPEAECVFAIYEDGVGDPREEDE
>sp|P25453|DMC1_YEAST Meiotic recombination protein DMC1 OS=Saccharomyces cerevisiae GN=DMC1 PE=1 SV=1
MSVTGTEIDSDTAKNILSVDELQNYGINASDLQKLKSGGIYTVNTVLSTTRRHLCKIKGLSEVKVEKIKEAAGKIIQVGFIPATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMSHTLCVTTQLPREMGGGEGKVAYIDTEGTFRPERIKQIAEGYELDPESCLANVSYARALNSEHQMELVEQLGEELSSGDYRLIVVDSIMANFRVDYCGRGELSERQQKLNQHLFKLNRLAEEFNVAVFLTNQVQSDPGASALFASADGRKPIGGHVLAHASATRILLRKGRGDERVAKLQDSPDMPEKECVYVIGEKGITDSSD
>sp|P48295|RECA_STRVL Protein recA OS=Streptomyces violaceus GN=recA PE=3 SV=1
MAGTDREKALDAALAQIERQFGKGAVMRMGDRTQEPIEVISTGSTALDIALGVGGLPRGRVVEIYGPESSGKTTL
TLHAVANAQKAGGQVAFVDAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDMLVRSGALDLIVIDSVAALVPRAEIEGEMGDSHVGLQARLMSQALRKITSALNQSKTTAIFINQLREKIGVMFGSPETTTGGRALKFYASVRLDIRRIETLKDGTDAVGNRTRVKVVKNKVAPPFKQAEFDILYGQGISREGGLIDMGVEHGFVRKAGAWYTYEGDQLGQGKENARNFLKDNPDLADEIERKIKEKLGVGVRPDAAKAEAATDAAAADTAGTDDAAKSVPAPASKTAKATKATAVKS
An Alignment from ClustalW

sp|P25454|RAD51_YEAST MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNG 50
sp|P25453|DMC1_YEAST  ---------------------MSVTGTEIDSDTAKN-------------- 15
sp|P48295|RECA_STRVL  ------------MAGTDREKALDAALAQIERQFGKG-------------- 24
conservation          :... :::. . ..

sp|P25454|RAD51_YEAST SGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLR 100
sp|P25453|DMC1_YEAST  -----------------------------ILSVDELQNYGINASDLQKLK 36
sp|P48295|RECA_STRVL  -------------------------------AVMRMGDRTQEPIEVISTG 43
conservation          .: .: :: .

sp|P25454|RAD51_YEAST ESGLHTAEAVAYAPRKDLLEIKG-ISEAKADKLLNEAARLVPMG----FV 145
sp|P25453|DMC1_YEAST  SGGIYTVNTVLSTTRRHLCKIKG-LSEVKVEKIKEAAGKIIQVG----FI 81
sp|P48295|RECA_STRVL  STALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHAVANAQKAGGQVAFV 93
conservation          . .: . * .* : :* * *. *. . ... * *:

sp|P25454|RAD51_YEAST TAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLC 195
sp|P25453|DMC1_YEAST  PATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMS 131
sp|P48295|RECA_STRVL  DAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDML--VRSGALDLI 141
conservation          * . .: : : ..* : .*.::: .* * ::
Phylogenetic Analysis
Evolution
• At the molecular level, evolution is a process of mutation with selection.
• Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life.
• Phylogeny is the inference of evolutionary relationships.
• Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are also used for phylogenetic analyses.
Phylogenetic Trees
• Phylogenetic trees are trees that describe the “relations” among species (genes, sequences)
– Evolutionary relationships are shown as branches
• Sequences most closely related are drawn as neighboring branches
– Length and nesting reflect the degree of similarity between any two items (sequences, species, etc.)
• Objective of Phylogenetic Analysis: determine branch length and figure out how the tree should be drawn
– Dependent upon good multiple sequence alignment programs
– Group sequences with similar patterns of substitutions
Uses of Phylogenetic Analysis
• Phylogeny can answer questions such as:
– How many genes are related to the gene I am working on?
– Are humans really closest to chimps and gorillas?
– How related are chicken, dog, mouse to zebrafish?
– Where and when did HIV originate?
– What is the history of life on earth?
• Given a set of genes, determine genes likely to have equivalent functions
• Follow changes occurring in a rapidly changing species
• Example: influenza
– Study rapidly changing genes in influenza genome, predict next year’s strain and develop flu vaccination accordingly
Difficulties With Phylogenetic Analysis
• Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine the phylogenetic origin of some evolutionary events
• Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically
• Two sites within comparative sequences may be evolving at different rates
• Rearrangements of genetic material can lead to false conclusions
• Duplicated genes can evolve along separate pathways, leading to different functions
![Page 46: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/46.jpg)
Rooted Trees
• One sequence (the root) is defined to be the common ancestor of all other sequences
• The root is chosen as a sequence thought to have branched off earliest
• A rooted tree specifies an evolutionary path for each sequence
• A tree can be rooted using an outgroup (that is, a sequence known to be distantly related to all other sequences)
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
[Figure: rooted tree with nodes numbered 1 to 9, drawn along a time axis running from past to present]
![Page 47: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/47.jpg)
Unrooted Tree
• Indicates evolutionary relationships without revealing the location of the oldest ancestor
[Figure: unrooted tree with nodes numbered 1 to 8]
![Page 48: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/48.jpg)
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
![Page 49: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/49.jpg)
4 Steps of Phylogenetic Analysis
• Molecular phylogenetic analysis may be described in four steps:
  – Selection of sequences for analysis
  – Multiple sequence alignment
  – Tree building
  – Tree evaluation
![Page 50: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/50.jpg)
Page 371
Selection of Sequences (1/2)
• For phylogeny, DNA can be more informative.
  – Protein-coding sequences have synonymous and nonsynonymous substitutions. Thus, some DNA changes do not have corresponding protein changes.
  – Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, coincidental substitutions.
  – Additional mutational events can be inferred by analysis of ancestral sequences. These changes include parallel substitutions, convergent substitutions, and back substitutions.
  – Pseudogenes and noncoding regions may be analyzed using DNA.
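The difference between synonymous and nonsynonymous substitutions can be illustrated with a small sketch. The codon table below is a hand-picked subset of the standard genetic code (only four codons), and the function name `classify_substitution` is hypothetical; a real analysis would use a complete codon table.

```python
# Partial standard genetic code: only the codons used in this illustration.
CODON_TO_AA = {"GAA": "Glu", "GAG": "Glu", "GAT": "Asp", "GAC": "Asp"}


def classify_substitution(codon_before, codon_after):
    """Label a codon change by whether the encoded amino acid changes."""
    aa_before = CODON_TO_AA[codon_before]
    aa_after = CODON_TO_AA[codon_after]
    return "synonymous" if aa_before == aa_after else "nonsynonymous"


# GAA -> GAG keeps glutamate, so the DNA change is invisible at the protein level.
print(classify_substitution("GAA", "GAG"))
# GAA -> GAT replaces glutamate with aspartate.
print(classify_substitution("GAA", "GAT"))
```

The first change is visible only in a DNA alignment, which is why DNA can carry phylogenetic information that the protein sequence does not.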
![Page 51: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/51.jpg)
Selection of Sequences (2/2)
• For phylogeny, protein sequences are also often used.
  – Proteins have 20 states (amino acids) instead of only four for DNA, so there is a stronger phylogenetic signal.
  – Nucleotides are unordered characters: any one nucleotide can change to any other in one step.
  – An ordered character must pass through one or more intermediate states before reaching the final state.
  – Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.
![Page 52: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/52.jpg)
Multiple Sequence Alignment
• The fundamental basis of a phylogenetic tree is a multiple sequence alignment.
  – Confirm that all sequences are homologous
  – Adjust gap creation and extension penalties as needed to optimize the alignment
  – Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all sequences (species)
![Page 53: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/53.jpg)
Building Tree
• Two classes of tree-building methods: distance-based and character-based
  – Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score
  – Character-based methods include maximum parsimony and maximum likelihood
• In both distance-based and character-based methods for building a tree, the starting point is a multiple sequence alignment
![Page 54: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/54.jpg)
Maximum Parsimony
• Predicts the evolutionary tree by minimizing the number of steps required to generate the observed variation
• For each position, the phylogenetic trees requiring the smallest number of evolutionary changes to produce the observed sequence changes are identified
• Columns representing greater variation dominate the analysis
• Trees producing the smallest number of changes over all sequence positions are identified
• Time-consuming algorithm
• Only works well if the sequences have strong sequence similarity
![Page 55: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/55.jpg)
Maximum Parsimony Example
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
• Four sequences, three possible unrooted trees
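[Figure: the three possible unrooted trees for sequences 1 to 4. Tree 1 pairs (1, 2) with (3, 4); Tree 2 pairs (1, 3) with (2, 4); Tree 3 pairs (1, 4) with (2, 3)]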
![Page 56: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/56.jpg)
Maximum Parsimony Example
• Some sites are informative, others are not
• A site is informative if there are at least two different kinds of letters at the site, each of which is represented in at least two of the sequences
• Only informative sites are considered
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
Three informative columns (positions 5, 7, and 9)
![Page 57: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/57.jpg)
Maximum Parsimony Example
1 G G A
2 G G G
3 A C A
4 A C G
[Figure: for each informative column (Col 1, Col 2, Col 3), the three candidate unrooted trees are drawn, with each required substitution marked on a branch]
# of changes:
Tree 1: 4
Tree 2: 5
Tree 3: 6
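The counting in this example can be reproduced with a short sketch. It assumes that Tree 1 pairs sequences (1, 2) with (3, 4), Tree 2 pairs (1, 3) with (2, 4), and Tree 3 pairs (1, 4) with (2, 3), which is the labeling consistent with the change counts on the slide. The function names are hypothetical, and the counting follows a Fitch-style rule restricted to four sequences; it is not a general parsimony program.

```python
from collections import Counter

# The four aligned sequences from the example, with spaces removed.
ALIGNMENT = {
    "1": "AAGAGTGCA",
    "2": "AGCCGTGCG",
    "3": "AGATATCCA",
    "4": "AGAGATCCG",
}

# Assumed labeling of the three unrooted trees (not stated in the slide text).
TREES = {
    "Tree 1": (("1", "2"), ("3", "4")),
    "Tree 2": (("1", "3"), ("2", "4")),
    "Tree 3": (("1", "4"), ("2", "3")),
}


def informative_sites(alignment):
    """Return 0-based columns with at least two letters that each occur at least twice."""
    columns = zip(*alignment.values())
    return [
        i for i, col in enumerate(columns)
        if sum(1 for n in Counter(col).values() if n >= 2) >= 2
    ]


def quartet_changes(alignment, pair_a, pair_b, sites):
    """Count the minimum substitutions needed on the unrooted tree (pair_a | pair_b)."""
    total = 0
    for i in sites:
        states = []
        for x, y in (pair_a, pair_b):
            sx, sy = {alignment[x][i]}, {alignment[y][i]}
            if sx & sy:
                states.append(sx & sy)
            else:
                states.append(sx | sy)
                total += 1  # the two neighbors disagree: one change
        if not (states[0] & states[1]):
            total += 1  # the two pairs share no state: one change on the internal branch
    return total


sites = informative_sites(ALIGNMENT)
scores = {name: quartet_changes(ALIGNMENT, a, b, sites) for name, (a, b) in TREES.items()}
print([i + 1 for i in sites])  # [5, 7, 9]
print(scores)                  # {'Tree 1': 4, 'Tree 2': 5, 'Tree 3': 6}
```

Tree 1 needs the fewest changes, so it is the most parsimonious of the three.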
![Page 58: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/58.jpg)
Distance Methods
• Look at the number of changes between each pair in a group of sequences
• Identify the tree that positions neighbors correctly and has branch lengths reproducing the original data as closely as possible
• Distance score counted as:
  – # of mismatched positions in the alignment
  – # of sequence positions changed to generate the second sequence
• Success depends on the degree to which the distances are additive on a predicted evolutionary tree
![Page 59: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/59.jpg)
Example of Distance Analysis
• Consider the alignment:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
• Calculate distances (# of differences)
• Using this information, a tree can be drawn:
[Figure: tree with leaves C, D, A, and B]
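The pairwise distances for this alignment can be computed as a simple count of mismatched positions. The sketch below covers only that counting step, not tree construction; the helper name `hamming` is an illustrative choice.

```python
from itertools import combinations

# The four aligned sequences from the example.
SEQUENCES = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


distances = {
    (p, q): hamming(SEQUENCES[p], SEQUENCES[q])
    for p, q in combinations(sorted(SEQUENCES), 2)
}
for (p, q), d in distances.items():
    print(p, q, d)
# A B 3
# A C 7
# A D 8
# B C 6
# B D 7
# C D 3
```

The two smallest distances are A to B and C to D (3 differences each), so a distance method would be expected to place A next to B and C next to D.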
![Page 60: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/60.jpg)
Maximum Likelihood (ML)
• A likelihood is calculated for each residue in an alignment, based upon some model of the substitution process.
• A maximum likelihood method constructs, from DNA sequences, the phylogenetic tree whose likelihood is a maximum. This corresponds to the tree that makes the data the most probable evolutionary outcome
  – Calculates the likelihood of a tree given an alignment
  – Probability of each tree is the product of mutation rates in each branch
  – Likelihoods given by each column are multiplied to give the likelihood of the tree
• This approach requires an explicit model of evolution, which is both a strength and a weakness because the results depend on the model used
• This method can also be very computationally expensive
• Can only be done for a handful of sequences
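A minimal sketch of the "multiply the per-column likelihoods" idea, reduced to the simplest possible case: two sequences joined by a single branch of length t. It assumes the Jukes-Cantor (JC69) substitution model, which is not named in the slides and is used here only because its probabilities have a simple closed form. The function names are hypothetical, and the sequences are A and B from the distance example.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, t):
    """Log-likelihood of two aligned sequences separated by branch length t (JC69)."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site keeps its nucleotide
    p_diff = 0.25 - 0.25 * decay   # probability of one specific other nucleotide
    log_lik = 0.0
    for x, y in zip(seq_a, seq_b):
        # Per-column likelihood: base frequency 1/4 times the transition probability.
        # Summing logs is the same as multiplying the column likelihoods.
        log_lik += math.log(0.25 * (p_same if x == y else p_diff))
    return log_lik


def jc69_ml_distance(seq_a, seq_b):
    """Branch length that maximizes the JC69 likelihood for two sequences."""
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 estimate")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


a = "ACGCGTTGGGCGATGGCAAC"
b = "ACGCGTTGGGCGACGGTAAT"
t_hat = jc69_ml_distance(a, b)
print(round(t_hat, 4))
for t in (0.05, t_hat, 0.5):
    print(round(t, 4), round(jc69_log_likelihood(a, b, t), 3))
```

The log-likelihood is highest at `t_hat` and lower at the shorter and longer branch lengths. A full ML tree search repeats this kind of calculation over many tree topologies and branch lengths, which is where the computational cost comes from.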
![Page 61: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/61.jpg)
Which Method to Choose?
• Depends upon the sequences that are being compared
  – Strong sequence similarity:
    • Maximum parsimony
  – Clearly recognizable sequence similarity:
    • Distance methods
  – All others:
    • Maximum likelihood
• Best to choose at least two approaches
• Compare the results: if they are similar, you can have more confidence
![Page 62: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/62.jpg)
Evaluating Trees
• The main criteria by which the accuracy of a phylogenetic tree is assessed are consistency, efficiency, and robustness.
• Bootstrapping is a commonly used approach to measuring the robustness of a tree topology
  – Given a branching order, how consistently does an algorithm find that branching order in a randomly permuted version of the original data set?
  – To bootstrap, make an artificial dataset by randomly sampling columns from your multiple sequence alignment.
  – Make the dataset the same size as the original.
  – Do 100 bootstrap replicates.
  – Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates
  – >70% is considered significant
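The bootstrap procedure can be sketched for the four-sequence parsimony example. Several things here are assumptions rather than part of the lecture: columns are sampled with replacement (the usual bootstrap convention), the "tree" is reduced to which pairing of the four sequences is backed by the most informative columns, and all function names are hypothetical.

```python
import random

ALIGNMENT = {
    "1": "AAGAGTGCA",
    "2": "AGCCGTGCG",
    "3": "AGATATCCA",
    "4": "AGAGATCCG",
}


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original width."""
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def supported_split(alignment):
    """For four sequences, return the pairing backed by most columns (None on a tie)."""
    a, b, c, d = sorted(alignment)
    splits = {((a, b), (c, d)): 0, ((a, c), (b, d)): 0, ((a, d), (b, c)): 0}
    for i in range(len(alignment[a])):
        for left, right in splits:
            l0, l1 = alignment[left[0]][i], alignment[left[1]][i]
            r0, r1 = alignment[right[0]][i], alignment[right[1]][i]
            if l0 == l1 and r0 == r1 and l0 != r0:
                splits[(left, right)] += 1  # column pattern xxyy supports this pairing
    best = max(splits.values())
    winners = [split for split, count in splits.items() if count == best]
    return winners[0] if len(winners) == 1 else None


def bootstrap_support(alignment, n_replicates=100, seed=0):
    """Fraction of replicates that recover the pairing found in the original data."""
    rng = random.Random(seed)
    original = supported_split(alignment)
    hits = sum(
        supported_split(bootstrap_replicate(alignment, rng)) == original
        for _ in range(n_replicates)
    )
    return original, hits / n_replicates


split, support = bootstrap_support(ALIGNMENT)
print(split, support)
```

With only nine columns the support value is noisy; the point is the procedure, in which each replicate has the same width as the original alignment and the reported number is the fraction of replicates that agree with the original grouping.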
![Page 63: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/63.jpg)
MEGA 5: Molecular Evolutionary Genetics Analysis
• http://www.megasoftware.net/
• Human, mouse, rat, and zebrafish CFTR gene
• Multiple sequence alignment by ClustalW
• Build a tree using Maximum Parsimony
• The obtained phylogenetic tree
[Figure: leaf labels of the resulting tree, in drawing order: NM_021050.2 Mouse CFTR; XM_001062374.1 Rat CFTR; NM_001044883.1 Zebrafish CFTR; NM_000492.3 Human CFTR]
![Page 64: Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of…](https://reader036.vdocument.in/reader036/viewer/2022081521/5a4d1c137f8b9ab0599f80a9/html5/thumbnails/64.jpg)
Homework 2
• Retrieve the BRCA1 gene in human (Homo sapiens), mouse (Mus musculus), cow (Bos taurus), and dog (Canis lupus familiaris)
• Use the FASTA program to perform all-against-all pairwise sequence alignments
• Create a multiple sequence alignment with ClustalW using the web server
• Build phylogenetic trees using different methods (such as Neighbor Joining, minimum evolution, UPGMA, and maximum parsimony implemented in MEGA)