Algorithmic Approaches for Biological Data, Lecture #20
Katherine St. John
City University of New York
American Museum of Natural History
20 April 2016
Outline
Aligning with Gaps and Substitution Matrices
Global versus Local Alignment
Searching Graphs: Breadth First & Depth First
K. St. John (CUNY & AMNH) Algorithms #20 20 April 2016 2 / 16
Pairwise Sequence Alignment
      A  G  A  G
   0 -1 -2 -3 -4
A -1  1
G -2
G -3
[Slide figure: the table update shown pictorially and as equations; the equations are not preserved in this transcript.]
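The equations for this slide did not survive extraction. A plausible reconstruction, assuming the standard Needleman-Wunsch recurrence with a linear gap penalty δ and a match/mismatch score σ (the symbols used later in this lecture), would be:

```latex
s(i, j) = \max \begin{cases}
\sigma(i, j) + s(i-1, j-1) \\
-\delta + s(i, j-1) \\
-\delta + s(i-1, j)
\end{cases}
\qquad s(i, 0) = -i\,\delta, \quad s(0, j) = -j\,\delta
```

Under this assumption, δ = 1 gives the first row 0, -1, -2, -3, -4 of the table, and a match score of 1 gives the entry 1 for A aligned with A.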
Aligning with Gaps and Substitution Matrices
The basic dynamic programming format can be adjusted for different gap and substitution models, where:
δ: the gap penalty
σ: scores matches/mismatches.
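As a minimal sketch of how δ and σ plug into the dynamic programming table, the following fills a global alignment table with both passed in as parameters. The function names and the choice δ = 1 are illustrative assumptions (δ = 1 is read off the first row of the example table), not code from the lecture.

```python
def global_alignment_table(x, y, sigma, delta):
    """Fill the global alignment table for x (rows) against y (columns)."""
    rows, cols = len(x) + 1, len(y) + 1
    s = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        s[i][0] = -delta * i
    for j in range(1, cols):
        s[0][j] = -delta * j
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(
                sigma(x[i - 1], y[j - 1]) + s[i - 1][j - 1],  # match or mismatch
                -delta + s[i][j - 1],                         # gap in x
                -delta + s[i - 1][j],                         # gap in y
            )
    return s


def match_mismatch(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


table = global_alignment_table("AGG", "AGAG", match_mismatch, delta=1)
print(table[0])       # first row of the table
print(table[-1][-1])  # score of the best global alignment
```

Swapping in a different `sigma` or `delta` changes the model without touching the table-filling loop.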
Gaps Are Treated Equally
      A  G  A  G
   0 -1 -2 -3 -4
A -1  1
G -2
G -3
Commonly use an affine gap penalty function:
- h: penalty associated with opening a gap
- g: (smaller) penalty associated with extending the gap.
To implement this efficiently, use 2 additional matrices that keep track of the gaps (one for each sequence).
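The two extra matrices can be sketched as follows. This is a hedged illustration, not the lecture's code: it assumes a gap of length k costs h + (k - 1)·g (opening costs h, each further position costs g), and the names `M`, `GX`, `GY` and `affine_global_score` are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, sigma, h, g):
    """Best global alignment score with gap-open penalty h and gap-extend penalty g."""
    n, m = len(x), len(y)
    # M:  best score where x[i-1] is aligned to y[j-1]
    # GX: best score where x[i-1] is aligned to a gap
    # GY: best score where y[j-1] is aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    GX = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    GY = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        GX[i][0] = -h - g * (i - 1)
    for j in range(1, m + 1):
        GY[0][j] = -h - g * (j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = max(M[i - 1][j - 1], GX[i - 1][j - 1], GY[i - 1][j - 1])
            M[i][j] = sigma(x[i - 1], y[j - 1]) + prev
            # Either open a new gap (pay h) or extend an existing one (pay g).
            GX[i][j] = max(M[i - 1][j] - h, GY[i - 1][j] - h, GX[i - 1][j] - g)
            GY[i][j] = max(M[i][j - 1] - h, GX[i][j - 1] - h, GY[i][j - 1] - g)
    return max(M[n][m], GX[n][m], GY[n][m])


def match_mismatch(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


# With h == g the affine model reduces to the equal-gap model.
linear_like = affine_global_score("AGG", "AGAG", match_mismatch, h=1, g=1)
# One long gap is cheaper than two short ones when h > g.
one_long_gap = affine_global_score("AAAA", "AA", match_mismatch, h=3, g=1)
print(linear_like, one_long_gap)
```

Keeping the gap states in their own matrices is what lets each cell decide between opening and extending in constant time.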
Affine Gap
Burr Settles, U Wisconsin, 2008
Using Substitution Matrices
    A  C  G  T
A   1 -1 -1 -1
C  -1  1 -1 -1
G  -1 -1  1 -1
T  -1 -1 -1  1
Can view σ(i, j) as a substitution matrix.
Substitution matrices are commonly used for protein sequences.
PAM = Point Accepted Mutation (also expanded as Percent Accepted Mutation)
- Dayhoff et al., 1978
- Used for closely related protein sequences
- Based on global alignment
BLOSUM = Blocks Substitution Matrix
- Henikoff & Henikoff, 1992
- Used for more divergent sequences
- Based on local alignment
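Viewing σ as a matrix lookup can be sketched with a nested dictionary. The helper names below are hypothetical, and the values are the +1/-1 DNA matrix from the slide; a protein matrix such as PAM or BLOSUM would replace the table, and its values are not reproduced here.

```python
ALPHABET = "ACGT"


def make_match_matrix(alphabet, match=1, mismatch=-1):
    """Build a substitution matrix as a dict of dicts: matrix[a][b] -> score."""
    return {
        a: {b: (match if a == b else mismatch) for b in alphabet}
        for a in alphabet
    }


SUB = make_match_matrix(ALPHABET)


def sigma(a, b):
    """Score for aligning character a with character b."""
    return SUB[a][b]


print(sigma("A", "A"), sigma("A", "G"))
```

Because the alignment code only ever calls `sigma(a, b)`, changing the scoring model means changing the table, not the algorithm.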
Global versus Local Alignment
Paul Reiners, IBM, 2008
Global: Needleman & Wunsch, 1970.
Local: Smith & Waterman, 1981.
Instead of looking for the global best score, look for the best score for subsequences of the initial sequences.
Examples:
- finding motifs (conserved patterns) across sequences,
- comparing sequences against longer sequences (e.g. BLAST search).
Smith-Waterman Algorithm
Paul Reiners, IBM, 2008
The equation is slightly different:
\[
s(i, j) = \max \begin{cases}
\sigma(i, j) + s(i-1, j-1) \\
-\delta + s(i, j-1) \\
-\delta + s(i-1, j) \\
0
\end{cases}
\]
Initialize: first row and first column set to 0’s
Traceback: find the maximum value of s(i, j) anywhere in the matrix, trace back from there, and stop when we get to a cell with 0.
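The recurrence, initialization, and traceback can be put together in a short sketch. The function name and the +1/-1 scoring are assumptions for illustration; when several cells tie for the maximum, this version reports only the first one it meets.

```python
def smith_waterman(x, y, sigma, delta):
    """Local alignment of x (rows) and y (columns); returns table, score, aligned strings."""
    rows, cols = len(x) + 1, len(y) + 1
    s = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(
                sigma(x[i - 1], y[j - 1]) + s[i - 1][j - 1],
                -delta + s[i][j - 1],
                -delta + s[i - 1][j],
                0,  # a local alignment may restart anywhere
            )
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)
    # Traceback from the maximum cell until a cell with value 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == sigma(x[i - 1], y[j - 1]) + s[i - 1][j - 1]:
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == -delta + s[i - 1][j]:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return s, best, "".join(reversed(top)), "".join(reversed(bottom))


def match_mismatch(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


table, score, aligned_x, aligned_y = smith_waterman("TTAAG", "AAGA", match_mismatch, delta=2)
print(score, aligned_x, aligned_y)
```

The only differences from the global version are the extra 0 in the max, the all-zero borders, and starting the traceback at the largest cell rather than the bottom-right corner.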
In Pairs: Local Alignment
A A G A
T
T
A
A
G
Use σ from Monday, but δ = 2.
What are the best local alignments?
Initialized (first row and first column set to 0):
      A  A  G  A
   0  0  0  0  0
T  0
T  0
A  0
A  0
G  0
Filled in:
      A  A  G  A
   0  0  0  0  0
T  0  0  0  0  0
T  0  0  0  0  0
A  0  1  1  0  1
A  0  1  2  0  1
G  0  0  0  3  1
In Pairs: Searching Graphs
Bastert et al., 2002
Develop a strategy to visit every node of the graph (i.e. what data structures are needed?)
The bookkeeping is important.
Two common strategies:
- Breadth First Search (BFS): visit all the neighbors, then visit all the neighbors' neighbors, etc.
- Depth First Search (DFS): for each neighbor, visit its neighbors, and continue as far down as possible.
Bookkeeping is important:
- Keep a “To Do” list of nodes still to visit (a queue for BFS, a stack for DFS).
- Mark nodes as you visit them, so you know not to visit them again.
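Both strategies and their bookkeeping can be sketched on a small graph. The adjacency-list dictionary and the node names are made up for illustration; the "To Do" list is a first-in-first-out queue for BFS and a last-in-first-out stack for DFS, and a set records the visited nodes.

```python
from collections import deque


def bfs(graph, start):
    """Visit nodes level by level; returns the visiting order."""
    visited = {start}
    order = []
    todo = deque([start])          # FIFO queue
    while todo:
        node = todo.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)   # mark when queued so it is never queued twice
                todo.append(nbr)
    return order


def dfs(graph, start):
    """Follow one path as far as possible before backing up; returns the visiting order."""
    visited = set()
    order = []
    todo = [start]                 # LIFO stack
    while todo:
        node = todo.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push in reverse so the first listed neighbor is explored first.
        for nbr in reversed(graph[node]):
            if nbr not in visited:
                todo.append(nbr)
    return order


# Hypothetical example graph as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
print(bfs(graph, "a"))
print(dfs(graph, "a"))
```

The two functions differ only in which end of the "To Do" list the next node is taken from, which is why the choice of data structure determines the search order.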
Recap
Dynamic Programming: will do local & global alignments in lab today.
More on searching graphs on Monday.
Email lab reports to [email protected].
Challenges available at rosalind.info.