Lesson 3: Aligning sequences and searching databases


Post on 19-Dec-2015


TRANSCRIPT

Page 1: 1 Lesson 3 Aligning sequences and searching databases

Lesson 3

Aligning sequences and searching databases

Page 2: 1 Lesson 3 Aligning sequences and searching databases

Homology

Similarity between objects due to common ancestry

Page 3: 1 Lesson 3 Aligning sequences and searching databases

Sequence homology

Similarity between sequences that results from a common ancestor

VLSPAVKWAKVGAHAAGHG
VLSEAVLWAKVEADVAGHG

Basic assumption: sequence homology → similar structure/function

Page 4: 1 Lesson 3 Aligning sequences and searching databases

Sequence alignment

Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.

Page 5: 1 Lesson 3 Aligning sequences and searching databases

Homology

• Ortholog – homolog that arose via speciation (typically with similar function)
• Paralog – homolog that arose from gene duplication

• Orthologs – 2 homologs from different species
• Paralogs – 2 homologs within the same species

[Figure: two gene trees for a gene G, one in which G is passed on unchanged to the descendant species, and one in which G is duplicated into G1 and G2]

Page 6: 1 Lesson 3 Aligning sequences and searching databases

How close?

Rule of thumb:
• Proteins are homologous if over 25% identical (length > 100)
• DNA sequences are homologous if over 70% identical
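To make the rule of thumb concrete, percent identity between two already-aligned sequences of equal length could be computed roughly as follows. This is a minimal sketch: the function name `percent_identity` is hypothetical, and the example pair is the one shown on the sequence-homology slide (which, at 19 residues, is far shorter than the length > 100 the rule assumes).

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which the two sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    # A gap character never counts as a match.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


identity = percent_identity("VLSPAVKWAKVGAHAAGHG", "VLSEAVLWAKVEADVAGHG")
print(f"{identity:.1f}% identical")
```

The two sequences agree at 14 of 19 positions, so the value is well above the 25% threshold quoted for proteins.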

Page 7: 1 Lesson 3 Aligning sequences and searching databases

Twilight zone

• < 20% identity in proteins – may be homologous and may not be…
• (Note that 5% identity will be obtained completely by chance!)

Page 8: 1 Lesson 3 Aligning sequences and searching databases

Why sequence alignment?

Predict characteristics of a protein – use the structure/function of known proteins to predict the structure/function of an unknown protein

Page 9: 1 Lesson 3 Aligning sequences and searching databases

Sequence modifications

Sequences change in the course of evolution due to random mutations

Three types of mutations:
1. Insertion – an insertion of one or several nucleotides into the sequence. AAGA → AAGTA
2. Deletion – a deletion of a nucleotide (or more) from the sequence. AAGA → AGA
3. Substitution – a replacement of a nucleotide by another. AAGA → AACA

Insertion or deletion? → Indel

Page 10: 1 Lesson 3 Aligning sequences and searching databases

Local vs. global

• Global alignment – finds the best alignment across the entire two sequences.
• Local alignment – finds regions of similarity in parts of the sequences.

Global alignment: forces alignment in regions which differ

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment: will return only regions of good alignment

ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ

Page 11: 1 Lesson 3 Aligning sequences and searching databases

When global and when local?

Page 12: 1 Lesson 3 Aligning sequences and searching databases

Global alignment

PTK2 (protein tyrosine kinase 2) of human and rhesus monkey

Page 13: 1 Lesson 3 Aligning sequences and searching databases

Protein tyrosine kinase domain

Page 14: 1 Lesson 3 Aligning sequences and searching databases

Protein tyrosine kinase domain

• Human PTK2 and leukocyte tyrosine kinase
• Both function as tyrosine kinases, in completely different contexts
• Ancient duplication

Page 15: 1 Lesson 3 Aligning sequences and searching databases

Global alignment of PTK and LTK

Page 16: 1 Lesson 3 Aligning sequences and searching databases

Local alignment of PTK and LTK

Page 17: 1 Lesson 3 Aligning sequences and searching databases

Pairwise alignment

AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

Page 18: 1 Lesson 3 Aligning sequences and searching databases

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes:
• 2 mismatches
• 4 indels (gaps)
• 10 perfect matches

Page 19: 1 Lesson 3 Aligning sequences and searching databases

Choosing an alignment

Many different alignments are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?

Page 20: 1 Lesson 3 Aligning sequences and searching databases

Alignment scoring – scoring of sequence similarity

• Assumes independence between positions
  – Each position is considered separately
• Scores each position
  – Positive if identical (match)
  – Negative if different (mismatch) or gap (indel)
• Total score = sum of position scores
  – Can be positive or negative

Page 21: 1 Lesson 3 Aligning sequences and searching databases

Example – naïve scoring system

• Perfect match: +1
• Mismatch: -2
• Indel (gap): -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)×10 + (-2)×2 + (-1)×4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)×9 + (-2)×2 + (-1)×6 = -1

Higher score → better alignment
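The naïve scoring system can be sketched in a few lines. The function name `score_alignment` is hypothetical; the defaults are the +1 / -2 / -1 values from this slide, and the two alignments are the ones scored above.

```python
def score_alignment(top: str, bottom: str, match=1, mismatch=-2, gap=-1):
    """Return (score, matches, mismatches, gaps) for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1          # indel column
        elif x == y:
            matches += 1       # identical characters
        else:
            mismatches += 1    # substitution
    score = match * matches + mismatch * mismatches + gap * gaps
    return score, matches, mismatches, gaps


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first)   # (2, 10, 2, 4)
print(second)  # (-1, 9, 2, 6)
```

Because each column is scored independently, the total is just a weighted count of the three column types.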

Page 22: 1 Lesson 3 Aligning sequences and searching databases

Scoring system

• The choice of +1, -2, and -1 scores is quite arbitrary
• Different scoring systems → different alignments
• Scoring systems implicitly represent a particular theory of evolution
  – Some mismatches are more plausible
    Transition vs. transversion
    Lys→Arg ≠ Lys→Cys
  – Gap extension ≠ gap opening

Page 23: 1 Lesson 3 Aligning sequences and searching databases

Scoring matrix

Representing the scoring system as an n × n table or matrix (n is the number of letters the alphabet contains: n = 4 for nucleotides, n = 20 for amino acids). The matrix is symmetric.

      A    G    C    T
A     2
G    -6    2
C    -6   -6    2
T    -6   -6   -6    2

Page 24: 1 Lesson 3 Aligning sequences and searching databases

DNA scoring matrices

Uniform substitutions between all nucleotides:

From \ To   A    G    C    T
A           2
G          -6    2
C          -6   -6    2
T          -6   -6   -6    2

Match: 2   Mismatch: -6

Page 25: 1 Lesson 3 Aligning sequences and searching databases

DNA scoring matrices

Can take into account biological phenomena such as:
• Transition-transversion
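One way a DNA matrix might distinguish transitions (purine↔purine, pyrimidine↔pyrimidine) from transversions is sketched below. The match score of 2 and the -6 penalty come from the uniform matrix shown earlier; the milder transition penalty of -4 is purely an assumed value for illustration, as is the function name `substitution_score`.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=2, transition=-4, transversion=-6):
    """Score one nucleotide pair; transitions are penalized less than transversions."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Full 4 x 4 matrix as a dictionary keyed by (from, to).
matrix = {(x, y): substitution_score(x, y) for x in "AGCT" for y in "AGCT"}

for x in "AGCT":
    print(x, [matrix[(x, y)] for y in "AGCT"])
```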

Page 26: 1 Lesson 3 Aligning sequences and searching databases

Amino-acid scoring matrices

Take into account physico-chemical properties

Page 27: 1 Lesson 3 Aligning sequences and searching databases

Amino-acid substitution matrices

Actual substitutions:
– Based on empirical data
– Commonly used by many bioinformatics programs
– PAM & BLOSUM

Page 28: 1 Lesson 3 Aligning sequences and searching databases

Protein matrices – actual substitutions

The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other

M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E

In the fourth column, E and D are found in 7 / 8 sequences
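The column count on this slide can be reproduced with a short sketch. The helper name `column_counts` is hypothetical; the eight rows are the sequences shown above with the spaces removed.

```python
from collections import Counter

alignment = [
    "MGYDE", "MGYDE", "MGYEE", "MGYDE",
    "MGYQE", "MGYDE", "MGYEE", "MGYEE",
]


def column_counts(rows, column):
    """Count how often each residue appears in one alignment column (0-based)."""
    return Counter(row[column] for row in rows)


fourth = column_counts(alignment, 3)
share_d_or_e = (fourth["D"] + fourth["E"]) / len(alignment)
print(fourth, share_d_or_e)
```

Counts like these, gathered over many alignments, are the raw material from which empirical substitution matrices are derived.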

Page 29: 1 Lesson 3 Aligning sequences and searching databases

PAM matrix – Point Accepted Mutations

• Based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity)
  – Alignment was easy
• Counted the number of substitutions per amino-acid pair (20 × 20)
• Found that common substitutions occurred between chemically similar amino acids

Page 30: 1 Lesson 3 Aligning sequences and searching databases

PAM matrices

• Family of matrices: PAM 80, PAM 120, PAM 250
• The number on the PAM matrix represents evolutionary distance
• Larger numbers are for larger distances

Page 31: 1 Lesson 3 Aligning sequences and searching databases

Example: PAM 250

Similar amino acids have greater score

Page 32: 1 Lesson 3 Aligning sequences and searching databases

PAM – limitations

• Based only on a single, limited dataset
• Examines proteins with few differences (85% identity)
• Based mainly on small globular proteins, so the matrix is biased

Page 33: 1 Lesson 3 Aligning sequences and searching databases

BLOSUM

• Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset
• BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

Page 34: 1 Lesson 3 Aligning sequences and searching databases

BLOSUM: Blocks Substitution Matrix

• Based on the BLOCKS database
  – ~2000 blocks from 500 families of related proteins
  – Families of proteins with identical function
• Blocks are short conserved patterns of 3-60 aa without gaps

AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC

Page 35: 1 Lesson 3 Aligning sequences and searching databases

BLOSUM

• Each block represents a sequence alignment with a different identity percentage
• For each block, the amino-acid substitution rates were calculated to create the BLOSUM matrix

Page 36: 1 Lesson 3 Aligning sequences and searching databases

BLOSUM matrices

• BLOSUMn is based on sequences that share at least n percent identity
• BLOSUM62 represents closer sequences than BLOSUM45

Page 37: 1 Lesson 3 Aligning sequences and searching databases

Example: BLOSUM62

Derived from blocks where the sequences share at least 62% identity

Page 38: 1 Lesson 3 Aligning sequences and searching databases

PAM vs. BLOSUM

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
↓ More distant sequences

Page 39: 1 Lesson 3 Aligning sequences and searching databases

Scoring system = substitution matrix + gap penalty

Page 40: 1 Lesson 3 Aligning sequences and searching databases

Gap penalty

• We penalize gaps
• Scoring for gap opening & gap extension:
  – Gap-extension penalty < gap-open penalty
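A gap penalty with separate opening and extension costs is often written as an affine function of gap length. The sketch below assumes illustrative values (open = 5, extend = 1) that are not from the lecture; the name `gap_cost` is likewise hypothetical.

```python
def gap_cost(length: int, gap_open: float = 5.0, gap_extend: float = 1.0) -> float:
    """Affine gap penalty: the first gapped position pays gap_open,
    each further position in the same gap pays gap_extend."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


one_long_gap = gap_cost(3)          # a single gap spanning 3 positions
three_short_gaps = 3 * gap_cost(1)  # three separate 1-position gaps
print(one_long_gap, three_short_gaps)
```

With extension cheaper than opening, one long gap costs less than several short ones, which favors alignments that explain a difference with a single indel event.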

Page 41: 1 Lesson 3 Aligning sequences and searching databases

Optimal alignment algorithms

• Needleman-Wunsch (global)
• Smith-Waterman (local)
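Both algorithms fill a dynamic-programming table, one cell per pair of sequence positions. The sketch below computes only the optimal score (no traceback) and assumes the naïve +1 / -2 / -1 scoring with a linear gap penalty; the function names are hypothetical, and real tools use substitution matrices and affine gaps.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-2, gap=-1):
    """Optimal global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]    # row 0: b aligned to gaps only
    for i in range(1, len(a) + 1):
        curr = [i * gap]                           # column 0: a aligned to gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]                                # bottom-right cell


def smith_waterman_score(a, b, match=1, mismatch=-2, gap=-1):
    """Optimal local alignment score: cells never drop below 0, best cell wins."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(needleman_wunsch_score("AAGCTGAATTCGAA", "AGGCTCATTTCTGA"))
print(smith_waterman_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
```

The only differences between the two are the initialization, the floor at zero, and where the final score is read from. Each function visits every cell once, so the work grows with the product of the two sequence lengths.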

Page 42: 1 Lesson 3 Aligning sequences and searching databases

Alignment search space

• The “search space” (number of possible gapped alignments) for optimally aligning two sequences is exponential in the length of the sequences (n).
• If n = 100, there are 100^100 = 10^200 different alignments (a 1 followed by 200 zeros)!
• Average protein length is about n = 250!

Page 43: 1 Lesson 3 Aligning sequences and searching databases

Searching databases

Page 44: 1 Lesson 3 Aligning sequences and searching databases

Searching a sequence database

Using a sequence as a query to find homologous sequences in a sequence database

Page 45: 1 Lesson 3 Aligning sequences and searching databases

Query sequence: DNA or protein?

For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.

Which is preferable?

Page 46: 1 Lesson 3 Aligning sequences and searching databases

Protein is better!

Selection (and hence conservation) works (mostly) on the protein level:

CTTTCA = Leu-Ser
TTGAGT = Leu-Ser
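The point of the example can be checked with a tiny translation sketch. `CODON_TABLE` below is a deliberately partial, hand-written excerpt of the standard genetic code covering only the four codons on this slide, and the function name `translate` is hypothetical.

```python
# Partial excerpt of the standard genetic code (only the codons used here).
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna: str):
    """Translate a DNA string codon by codon, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3)]


a, b = "CTTTCA", "TTGAGT"
dna_identity = sum(x == y for x, y in zip(a, b)) / len(a)
print(translate(a), translate(b), dna_identity)
```

The two DNA sequences share only one of six nucleotides, yet both encode Leu-Ser, so a protein-level comparison sees them as identical.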

Page 47: 1 Lesson 3 Aligning sequences and searching databases

Query type

• Nucleotides: a four-letter alphabet
• Amino acids: a twenty-letter alphabet

• Two random DNA sequences will, on average, have 25% identity
• Two random protein sequences will, on average, have 5% identity

Page 48: 1 Lesson 3 Aligning sequences and searching databases

Conclusions

• Using the amino-acid sequence is preferable for homology search
• Why use a nucleotide sequence after all?
  – No ORF found, e.g. newly sequenced genome
  – No similar protein sequences were found
  – Specific DNA databases are available (EST)

Page 49: 1 Lesson 3 Aligning sequences and searching databases

Some terminology

• Query sequence – the sequence with which we are searching
• Hit – a sequence found in the database, suspected as homologous

Page 50: 1 Lesson 3 Aligning sequences and searching databases

How do we search a database?

• Assume we perform pairwise alignment of the query against all the sequences in the database
• Exact pairwise alignment is O(mn) ≈ O(n^2) (m – length of sequence 1, n – length of sequence 2)

Page 51: 1 Lesson 3 Aligning sequences and searching databases

How much time will it take?

• O(n^2) computations per pairwise alignment.
• Assume n = 200, so we have 40,000 computations per pairwise alignment
• Size of database: ~60 million entries
• 2.4 × 10^12 computations for each sequence search we perform!
• Assume each computation takes 10^-6 seconds
• 2.4 × 10^6 seconds ≈ 667 hours (about 28 days) for each sequence search
• 150,000 searches (at least!!) are performed per day
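With the figures assumed on this slide (n = 200, about 60 million database entries, 10^-6 seconds per computation), the arithmetic works out to roughly 2.4 × 10^6 seconds per search. The short sketch below just multiplies the stated assumptions; the variable names are illustrative.

```python
n = 200                               # assumed sequence length
computations_per_alignment = n * n    # O(n^2) dynamic-programming cells
database_entries = 60_000_000         # ~60 million entries
seconds_per_computation = 1e-6        # assumed cost of one cell

total_computations = computations_per_alignment * database_entries
total_seconds = total_computations * seconds_per_computation
total_hours = total_seconds / 3600
total_days = total_hours / 24

print(f"{total_computations:.1e} computations")
print(f"{total_seconds:.1e} s = {total_hours:.0f} h = {total_days:.1f} days")
```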

Page 52: 1 Lesson 3 Aligning sequences and searching databases

Conclusion

Running the exact pairwise alignment algorithm between the query and all DB entries is too slow

Page 53: 1 Lesson 3 Aligning sequences and searching databases

Heuristic

Definition: a heuristic is a method of solving a problem that does not guarantee an exact solution (but gives one that is not too bad) while reducing the running time compared with the exact solution

Page 54: 1 Lesson 3 Aligning sequences and searching databases

BLAST

• BLAST – Basic Local Alignment Search Tool
• A heuristic for searching a database for similar sequences

Page 55: 1 Lesson 3 Aligning sequences and searching databases

DNA or protein

All types of searches are possible

Query: DNA or protein
Database: DNA or protein

• blastn – nuc vs. nuc
• blastp – prot vs. prot
• blastx – translated query vs. protein database
• tblastn – protein vs. translated nuc. DB
• tblastx – translated query vs. translated database

Translated databases:
• TrEMBL
• GenPept

Page 56: 1 Lesson 3 Aligning sequences and searching databases

BLAST – underlying hypothesis

The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them

The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment with the remaining sequences

Page 57: 1 Lesson 3 Aligning sequences and searching databases

How do we discard irrelevant sequences quickly?

• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)
• Save the words in a look-up table that can be searched quickly

WTDFGYPAILKGGTAC

WTD TDF DFG FGY GYP …
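Splitting a sequence into overlapping words and storing them in a look-up table could look like the sketch below. The function names `words` and `build_lookup` are hypothetical, and a plain dictionary stands in for whatever index structure a real implementation uses.

```python
def words(sequence: str, w: int = 3):
    """All overlapping words of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def build_lookup(sequence: str, w: int = 3):
    """Map each word to the list of positions where it starts."""
    table = {}
    for position, word in enumerate(words(sequence, w)):
        table.setdefault(word, []).append(position)
    return table


sequence = "WTDFGYPAILKGGTAC"
print(words(sequence)[:5])
lookup = build_lookup(sequence)
print(lookup["GYP"])
```

Looking up a word in the dictionary takes roughly constant time, which is what makes the discard step fast.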

Page 58: 1 Lesson 3 Aligning sequences and searching databases

BLAST: discarding sequences

• When the user gives a query sequence, divide it into words as well
• Search the database for consecutive neighbor words

Page 59: 1 Lesson 3 Aligning sequences and searching databases

Neighbor words

Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level

GFB
GFC (20)
GPC (11)
WAC (5)

Page 60: 1 Lesson 3 Aligning sequences and searching databases

Search for consecutive words

[Figure: neighbor-word hits plotted with the query on one axis and the database record on the other]

• Look for a seed: hits on the same diagonal which can be connected
• At least 2 hits on the same diagonal, at a distance smaller than a predetermined cutoff
• This is the filtering stage – many unrelated hits are filtered, saving lots of time!
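The two-hit criterion can be sketched as follows, assuming each word hit is recorded as a (query position, database position) pair. Hits lie on the same diagonal when the difference between the two positions is equal. The function name `two_hit_seeds` and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance):
    """Return (diagonal, first_query_pos, second_query_pos) for every pair of
    neighboring hits on one diagonal that are closer than max_distance."""
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[db_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for left, right in zip(positions, positions[1:]):
            if 0 < right - left < max_distance:
                seeds.append((diagonal, left, right))
    return seeds


hits = [(2, 12), (8, 18), (5, 40), (30, 31)]
print(two_hit_seeds(hits, max_distance=10))
```

Here only the first two hits share a diagonal (offset 10) and are 6 positions apart, so they form the single seed; the isolated hits are filtered out.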

Page 61: 1 Lesson 3 Aligning sequences and searching databases

Try to extend the alignment

• Stop extending when the score of the alignment drops X beneath the maximal score obtained so far
• Discard segments with score < S

ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS

Score=15   Score=17   Score=14

X=4
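The stopping rule can be illustrated with an ungapped extension to the right of a seed. This is a simplified sketch: it assumes a flat +1 / -1 match/mismatch score instead of a substitution matrix, extends in one direction only, and uses the hypothetical name `extend_right`.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Extend without gaps until the running score falls x_drop below the
    best score seen so far. Returns (best_score, length_at_best_score)."""
    score = best_score = 0
    length = best_length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length   # new maximum
        elif best_score - score >= x_drop:
            break                                     # dropped X below the maximum
    return best_score, best_length


result = extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=3)
print(result)  # (8, 8)
```

After eight matches the score peaks at 8; three mismatches later it has dropped by 3, so the extension stops and the segment is trimmed back to the peak. A segment whose best score is below S would then be discarded.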

Page 62: 1 Lesson 3 Aligning sequences and searching databases

The result – local alignment

The result of BLAST will be a series of local alignments between the query and the different hits found

Page 63: 1 Lesson 3 Aligning sequences and searching databases

E-value

The number of times we will theoretically find an alignment with a score ≥ Y of a random sequence vs. a random database

Theoretically, we could trust any result with an E-value ≤ 1

In practice – BLAST uses estimations.
• E-values of 10^-4 and lower indicate a significant homology.
• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate a good homology
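The practical thresholds above could be applied to a list of hits with a small helper like the one below. The function name `interpret_evalue` and the returned labels are hypothetical; treating values above 1 the same as the weakest category is an assumption, since the slide does not cover them.

```python
def interpret_evalue(evalue: float) -> str:
    """Map a BLAST E-value to the rule-of-thumb categories from the lecture."""
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    # Assumption: anything above 1e-2, including E > 1, is not good evidence.
    return "no good evidence of homology"


for e in (1e-30, 5e-3, 0.5):
    print(e, "->", interpret_evalue(e))
```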

Page 64: 1 Lesson 3 Aligning sequences and searching databases

Filtering low complexity

• Low-complexity regions: e.g., proline-rich areas (in proteins), Alu repeats (in DNA)
• Regions of low complexity generate high alignment scores, BUT this does not indicate homology

Page 65: 1 Lesson 3 Aligning sequences and searching databases

Solution

In BLAST there is an option to mask low-complexity regions in the query sequence (such regions are represented as XXXXX in the query)