Lesson 3: Aligning sequences and searching databases

Post on 19-Dec-2015



Homology

Similarity between objects due to a common ancestry

Sequence homology

Similarity between sequences that results from a common ancestor

VLSPAVKWAKVGAHAAGHG
VLSEAVLWAKVEADVAGHG

Basic assumption: sequence homology → similar structure/function

Sequence alignment

Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.

Homology

Ortholog – a homolog that arose via speciation (typically with a similar function)
Paralog – a homolog that arose from gene duplication

Orthologs – 2 homologs from different species
Paralogs – 2 homologs within the same species

[Figure: gene trees for a gene G, one showing speciation (orthologs) and one showing duplication into G1 and G2 (paralogs)]

How close?

Rule of thumb:
Proteins are homologous if over 25% identical (length > 100)
DNA sequences are homologous if over 70% identical

Twilight zone

< 20% identity in proteins – the sequences may be homologous, and may not be.

(Note that about 5% identity is expected purely by chance!)

Why sequence alignment?

Predict characteristics of a protein – use the structure/function of known proteins to predict the structure/function of an unknown protein

Sequence modifications

Sequences change in the course of evolution due to random mutations

Three types of mutations:
1. Insertion – an insertion of a nucleotide or several nucleotides into the sequence. AAGA → AAGTTA
2. Deletion – a deletion of a nucleotide (or more) from the sequence. AAGA → AGA
3. Substitution – a replacement of a nucleotide by another. AAGA → AACA

Insertion or deletion? -> Indel

Local vs. Global

Global alignment – finds the best alignment across the entire two sequences.
Local alignment – finds regions of similarity in parts of the sequences.

Global alignment: forces alignment in regions which differ

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment: returns only regions of good alignment

ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ

When global and when local?

Global alignment

PTK2 (protein tyrosine kinase 2) of human and rhesus monkey

Protein tyrosine kinase domain

Human PTK2 and leukocyte tyrosine kinase
Both function as tyrosine kinases, in completely different contexts
Ancient duplication

Global alignment of PTK and LTK

Local alignment of PTK and LTK

Pairwise alignment

AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes:
2 mismatches
4 indels (gaps)
10 perfect matches

Choosing an alignment

Many different alignments are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?

Alignment scoring - scoring of sequence similarity:

Assumes independence between positions
– Each position is considered separately
Scores each position
– Positive if identical (match)
– Negative if different (mismatch) or gap (indel)
Total score = sum of position scores
– Can be positive or negative

Example - naïve scoring system:

Perfect match: +1
Mismatch: -2
Indel (gap): -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

Higher score → better alignment
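As a minimal sketch of this naïve scheme, the function below scores an alignment column by column, using the +1 / -2 / -1 values from the slide. The name `score_alignment` and its parameter names are illustrative choices, not part of the lesson.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, gap=-1):
    """Sum per-column scores for two equal-length aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap          # indel column
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total

first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1
```

Because each column is scored independently, the total is just a sum, which is what later makes dynamic programming possible.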

Scoring system:

The choice of +1, -2, and -1 scores is quite arbitrary
Different scoring systems → different alignments
Scoring systems implicitly represent a particular theory of evolution
– Some mismatches are more plausible
  Transition vs. transversion
  Lys→Arg ≠ Lys→Cys
– Gap extension ≠ gap opening

Scoring matrix

Representing the scoring system as a table or matrix n x n (n is the number of letters the alphabet contains: n=4 for nucleotides, n=20 for amino acids)

Symmetric

      A    G    C    T
A     2
G    -6    2
C    -6   -6    2
T    -6   -6   -6    2

DNA scoring matrices

Uniform substitutions between all nucleotides:

From \ To   A    G    C    T
A           2
G          -6    2
C          -6   -6    2
T          -6   -6   -6    2

Match: 2 (diagonal); mismatch: -6 (off-diagonal)

DNA scoring matrices

Can take into account biological phenomena such as:
Transition-transversion
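A sketch of how a nucleotide matrix might separate transitions (A<->G, C<->T) from transversions is shown below. The match score of 2 and the transversion score of -6 mirror the uniform matrix above; the transition score of -4 is an assumed value chosen only for illustration, and the function name is hypothetical.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

def dna_scoring_matrix(match=2, transition=-4, transversion=-6):
    """Build a symmetric 4 x 4 nucleotide scoring matrix as a dict keyed by (from, to)."""
    matrix = {}
    for a in "AGCT":
        for b in "AGCT":
            if a == b:
                matrix[a, b] = match
            elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
                matrix[a, b] = transition      # purine<->purine or pyrimidine<->pyrimidine
            else:
                matrix[a, b] = transversion    # purine<->pyrimidine
    return matrix

matrix = dna_scoring_matrix()
print(matrix["A", "G"], matrix["A", "C"])  # -4 -6
```

Calling `dna_scoring_matrix(transition=-6)` gives back the uniform matrix, so the uniform case is just one setting of the same table.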

Amino-acid scoring matrices

Take into account physico-chemical properties

Amino-acid substitution matrices

Actual substitutions:
– Based on empirical data
– Commonly used by many bioinformatics programs
– PAM & BLOSUM

Protein matrices – actual substitutions

The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other

M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E

In the fourth column, E and D are found in 7 / 8 sequences
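The counting behind this idea can be sketched in a few lines. The code below tallies residues in one alignment column and the residue pairs seen between every two sequences in that column; helper names are illustrative, and this is only the raw counting step, not the full derivation of a PAM or BLOSUM matrix.

```python
from collections import Counter
from itertools import combinations

alignment = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
             "MGYQE", "MGYDE", "MGYEE", "MGYEE"]

def column_counts(rows, col):
    """Count how often each residue appears in column `col` (0-based)."""
    return Counter(row[col] for row in rows)

def column_pair_counts(rows, col):
    """Count unordered residue pairs between every two sequences in one column."""
    residues = [row[col] for row in rows]
    return Counter(tuple(sorted(pair)) for pair in combinations(residues, 2))

counts = column_counts(alignment, 3)
pairs = column_pair_counts(alignment, 3)
print(counts["D"] + counts["E"], "of", len(alignment))  # 7 of 8
print(pairs[("D", "E")])  # 12
```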

PAM Matrix - Point Accepted Mutations

Based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity)
– Alignment was easy
Counted the number of substitutions per amino-acid pair (20 x 20)
Found that common substitutions occurred between chemically similar amino acids

PAM Matrices

Family of matrices: PAM 80, PAM 120, PAM 250
The number on the PAM matrix represents evolutionary distance
Larger numbers are for larger distances

Example: PAM 250

Similar amino acids have a greater score

PAM - limitations

Based only on a single, limited dataset
Examines proteins with few differences (85% identity)
Based mainly on small globular proteins, so the matrix is biased

BLOSUM

Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset
BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

BLOSUM: Blocks Substitution Matrix

Based on the BLOCKS database
– ~2000 blocks from 500 families of related proteins
– Families of proteins with identical function
Blocks are short conserved patterns of 3-60 aa without gaps

AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC

BLOSUM

Each block represents a sequence alignment with a different identity percentage
For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix

BLOSUM Matrices

BLOSUMn is based on sequences that share at least n percent identity
BLOSUM62 represents closer sequences than BLOSUM45

Example: BLOSUM62

Derived from blocks where the sequences share at least 62% identity

PAM vs. BLOSUM

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

(moving down the list: more distant sequences)

Scoring system = substitution matrix + gap penalty

Gap penalty

We penalize gaps
Scoring for gap opening & gap extension:
– Gap-extension penalty < gap-open penalty
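One common way to express this is an affine gap penalty, sketched below. The penalty values (10 to open, 1 to extend) are assumptions for illustration, and conventions differ between programs on whether the first gap position also pays the extension cost.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one gap of `length` positions.

    The opening cost is paid once; every further position pays the
    (smaller) extension cost.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

one_long_gap = affine_gap_penalty(4)        # 10 + 3 * 1
two_short_gaps = 2 * affine_gap_penalty(2)  # 2 * (10 + 1)
print(one_long_gap, two_short_gaps)  # 13 22
```

With these numbers a single gap of four positions costs less than two gaps of two, which reflects the idea that one long indel event is more plausible than several separate ones.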

Optimal alignment algorithms

Needleman-Wunsch (global)
Smith-Waterman (local)
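Both algorithms fill a dynamic-programming table and differ mainly in initialization and in whether scores may drop below zero. The sketch below computes only the optimal score (no traceback) with a simple linear gap penalty and the naïve +1 / -2 / -1 scores; the function name and the `local` flag are illustrative, and real implementations use substitution matrices and affine gaps.

```python
def align_score(s, t, match=1, mismatch=-2, gap=-1, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: Needleman-Wunsch style (global alignment).
    local=True:  Smith-Waterman style (local alignment).
    """
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalized.
        for i in range(1, rows):
            dp[i][0] = i * gap
        for j in range(1, cols):
            dp[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cell = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # Local: an alignment may restart anywhere.
            dp[i][j] = cell
            best = max(best, cell)
    # Global score sits in the bottom-right cell; local score is the table maximum.
    return best if local else dp[-1][-1]

print(align_score("ACGT", "AGT"))                          # 2
print(align_score("TTTACGTTT", "GGGACGGGG", local=True))   # 3
```

The table has (m+1) x (n+1) cells and each is filled in constant time, so the running time is proportional to the product of the two sequence lengths.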

Alignment search space

The “search space” (number of possible gapped alignments) for optimally aligning two sequences is exponential in the length of the sequences (n).

If n=100, there are 100^100 = 10^200 (a 1 followed by 200 zeros) different alignments!

Average protein length is about n=250!

Searching databases

Searching a sequence database

Using a sequence as a query to find homologous sequences in a sequence database

Query sequence: DNA or protein?

For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.

Which is preferable?

Protein is better!

Selection (and hence conservation) works (mostly) on the protein level:

CTTTCA = Leu-Ser
TTGAGT = Leu-Ser
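The example can be checked with a short sketch: the two DNA strings share only one identical position, yet translate to the same dipeptide. The codon table below is a four-entry subset of the standard genetic code, just enough for this example, and the helper names are illustrative.

```python
# Subset of the standard genetic code needed for this example only.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}

def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

dna1, dna2 = "CTTTCA", "TTGAGT"
protein1, protein2 = translate(dna1), translate(dna2)
print(protein1, protein2)            # ['Leu', 'Ser'] ['Leu', 'Ser']
print(identity(dna1, dna2))          # 1 of 6 nucleotide positions match
print(identity(protein1, protein2))  # 1.0
```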

Query type

Nucleotides: a four letter alphabet
Amino acids: a twenty letter alphabet

• Two random DNA sequences will, on average, have 25% identity

• Two random protein sequences will, on average, have 5% identity

Conclusions

Using the amino-acid sequence is preferable for homology search

Why use a nucleotide sequence after all?
– No ORF found, e.g. newly sequenced genome
– No similar protein sequences were found
– Specific DNA databases are available (EST)

Some terminology

Query sequence – the sequence with which we are searching
Hit – a sequence found in the database, suspected as homologous

How do we search a database?

Assume we perform pairwise alignment of the query against all the sequences in the database

Exact pairwise alignment is O(mn) ≈ O(n^2) (m – length of sequence 1, n – length of sequence 2)

How much time will it take?

O(n^2) computations per pairwise alignment.
Assume n=200, so we have 40,000 computations per pairwise alignment
Size of database - ~60 million entries
2.4 x 10^12 computations for each sequence search we perform!
Assume each computation takes 10^-6 seconds
2.4 x 10^6 seconds ≈ 667 hours (about 28 days) for each sequence search
150,000 searches (at least!!) are performed per day
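The estimate can be checked by multiplying the figures out. The sketch below takes the machine speed as a parameter: at 10^-6 seconds per computation a full scan comes to about 2.4 x 10^6 seconds (roughly 667 hours), while a machine 100 times faster (10^-8 seconds per computation) would still need about 24,000 seconds, roughly 6.7 hours. Variable and function names are illustrative.

```python
n = 200                                          # assumed sequence length
cells_per_alignment = n * n                      # 40,000 computations per pairwise alignment
database_entries = 60_000_000                    # ~60 million entries
total = cells_per_alignment * database_entries   # computations per database search

def search_hours(total_computations, computations_per_second):
    """Hours needed for one full database scan at the given speed."""
    return total_computations / computations_per_second / 3600

print(total)                        # 2400000000000
print(search_hours(total, 10**6))   # about 666.7 hours at 10^-6 s per computation
print(search_hours(total, 10**8))   # about 6.67 hours at 10^-8 s per computation
```

Either way, a single exhaustive search takes hours or more, which cannot keep up with 150,000 searches per day.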

Conclusion

Running the exact pairwise alignment algorithm between the query and all DB entries is too slow

Heuristic

Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but gives one that is not too bad) while reducing the running time compared with the exact solution

BLAST

BLAST - Basic Local Alignment Search Tool

A heuristic for searching a database for similar sequences

DNA or Protein

All types of searches are possible

Query: DNA or protein
Database: DNA or protein

blastn – nuc vs. nuc
blastp – prot vs. prot
blastx – translated query vs. protein database
tblastn – protein vs. translated nuc. DB
tblastx – translated query vs. translated database

Translated databases: TrEMBL, GenPept

BLAST - underlying hypothesis

The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between them

The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment with remaining sequences

How do we discard irrelevant sequences quickly?

Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)
Save the words in a look-up table that can be searched quickly

WTDFGYPAILKGGTAC

WTD, TDF, DFG, FGY, GYP, …
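A minimal sketch of the word look-up table, using a plain dictionary that maps each word to the positions where it starts. The function name is illustrative, and a real BLAST index is far more compact than this.

```python
from collections import defaultdict

def build_word_index(sequence, w=3):
    """Map every overlapping word of length w to its start positions (0-based)."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index

sequence = "WTDFGYPAILKGGTAC"
index = build_word_index(sequence)
first_words = [sequence[i:i + 3] for i in range(5)]
print(first_words)   # ['WTD', 'TDF', 'DFG', 'FGY', 'GYP']
print(index["GYP"])  # [4]
```

A dictionary lookup takes roughly constant time per word, which is what makes this step so much faster than aligning against every sequence.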

BLAST: discarding sequences

When the user gives a query sequence, divide it also into words
Search the database for consecutive neighbor words

Neighbor words

Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level

GFB
GFC (20)
GPC (11)
WAC (5)
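The scoring of candidate neighbor words can be sketched as a sum of substitution-matrix entries, position by position. The six matrix entries below were typed in by hand as BLOSUM62 values for this sketch and should be checked against a full matrix before any real use; the cutoff of 11 and the helper names are likewise assumptions.

```python
# Hand-typed subset of BLOSUM62 (assumed values; verify against a full matrix).
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("F", "F"): 6, ("C", "C"): 9,
    ("F", "P"): -4, ("G", "W"): -2, ("A", "F"): -2,
}

def pair_score(a, b):
    """Look up a symmetric matrix entry regardless of argument order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(word1, word2):
    """Score two equal-length words position by position."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))

def neighbors(query_word, candidates, cutoff):
    """Keep candidate words scoring at least `cutoff` against the query word."""
    return [c for c in candidates if word_score(query_word, c) >= cutoff]

print(word_score("GFC", "GPC"))                 # 11
print(word_score("GFC", "WAC"))                 # 5
print(neighbors("GFC", ["GPC", "WAC"], 11))     # ['GPC']
```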

Search for consecutive words

[Figure: query plotted against a database record, with neighbor-word hits marked]

Look for a seed: hits on the same diagonal which can be connected
At least 2 hits on the same diagonal with a distance which is smaller than a predetermined cutoff
This is the filtering stage – many unrelated hits are filtered, saving lots of time!
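The two-hit test can be sketched by grouping word hits by diagonal, where the diagonal of a hit is its query position minus its database position. The function name, the hit format, and the cutoff of 40 are assumptions made for illustration.

```python
from collections import defaultdict

def two_hit_seeds(hits, max_distance=40):
    """Find pairs of neighboring hits on the same diagonal.

    hits: list of (query_pos, db_pos) word hits.
    Returns (diagonal, first_query_pos, second_query_pos) tuples for
    consecutive hits closer than max_distance.
    """
    by_diagonal = defaultdict(list)
    for query_pos, db_pos in hits:
        by_diagonal[query_pos - db_pos].append(query_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for left, right in zip(positions, positions[1:]):
            if 0 < right - left < max_distance:
                seeds.append((diagonal, left, right))
    return seeds

hits = [(5, 12), (20, 27), (30, 3), (70, 77)]
print(two_hit_seeds(hits))  # [(-7, 5, 20)]
```

The isolated hit on diagonal 27 and the far-away third hit on diagonal -7 are both dropped, which is the filtering effect described above.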


Try to extend the alignment

Stop extending when the score of the alignment drops X beneath the maximal score obtained so far
Discard segments with score < S

ASKIOPLLWLAASFLHNEQAPALSDAN
JWQEOPLWPLAASOIHLFACNSIFYAS

Score=15, Score=17, Score=14
X=4
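A sketch of the X-drop rule for ungapped extension to the right of a seed is given below. It uses a simple +1 / -1 match/mismatch score as a stand-in for a substitution matrix, so its scores do not reproduce the ones on the slide; the function name and parameters are illustrative.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right of a seed.

    Stops once the running score falls x_drop below the best score seen.
    Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i      # new maximum: remember how far we got
        elif best - score >= x_drop:
            break                          # dropped X below the maximum: stop
    return best, best_len

query = "ASKIOPLLWLAASFLHNEQAPALSDAN"
subject = "JWQEOPLWPLAASOIHLFACNSIFYAS"
best, length = extend_right(query, subject, 4, 4, x_drop=4)
print(best, query[4:4 + length])  # 5 OPLLWLAAS
```

A full implementation would also extend to the left of the seed and then discard the segment if its final score is below S.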

The result – local alignment

The result of BLAST will be a series of local alignments between the query and the different hits found

E-value

The number of times we would theoretically expect to find an alignment with a score ≥ Y for a random sequence vs. a random database

Theoretically, we could trust any result with an E-value ≤ 1

In practice – BLAST uses estimations.
E-values of 10^-4 and lower indicate a significant homology.
E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
E-values between 10^-2 and 1 do not indicate a good homology
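These rules of thumb can be written down as a small helper for triaging hits. The thresholds are the ones listed above; the function name, the wording of the labels, and the treatment of E-values above 1 as likely chance hits are assumptions of this sketch.

```python
def interpret_evalue(evalue):
    """Classify a BLAST E-value using the rule-of-thumb thresholds from the lesson."""
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if evalue <= 1:
        return "not a good indication of homology"
    return "likely a chance hit"   # assumption: the lesson gives no rule above 1

for e in (1e-30, 5e-3, 0.5, 10):
    print(e, "->", interpret_evalue(e))
```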

Filtering low complexity

Low complexity regions: e.g., proline-rich areas (in proteins), Alu repeats (in DNA)
Regions of low complexity generate high alignment scores, BUT this does not indicate homology

Solution

In BLAST there is an option to mask low-complexity regions in the query sequence (such regions are represented as XXXXX in the query)
