Lesson 2: Aligning sequences and searching databases

Post on 01-Apr-2015


TRANSCRIPT

1

Lesson 2

Aligning sequences and searching databases

2

Homology and sequence alignment.

Homology

Homology = similarity between objects due to a common ancestry

Hund = Dog, Schwein = Pig

4

Sequence homology

VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG

Similarity between sequences as a result of common ancestry.

5

Sequence alignment

Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

6

Why align?

VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV

1. To detect if two sequences are homologous. If so, homology may indicate similarity in function (and structure).

2. Required for evolutionary studies (e.g., tree reconstruction).

3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).

4. To map a sequenced DNA fragment from an unknown region to its location in the genome.

7

Insertions, deletions, and substitutions

8

Sequence alignment

If two sequences share a common ancestor (for example, human and dog hemoglobin), we can represent their evolutionary relationship using a tree.

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

VLSPAV-WAKV VLSEAVLWAKV

9

Perfect match

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

VLSPAV-WAKV VLSEAVLWAKV

A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case).

10

A substitution

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

VLSPAV-WAKV VLSEAVLWAKV

A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred).

11

Indel

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

VLSPAV-WAKV

VLSEAVLWAKV

Option 1: The ancestor had L and it was lost here. In such a case, the event was a deletion.

VLSEAVLWAKV

12

Indel

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

VLSPAV-WAKV

VLSEAVWAKV

Option 2: The ancestor was shorter and the L was inserted here. In such a case, the event was an insertion.

VLSEAVLWAKV

L

13

Indel

VLSPAV-WAKV

Normally, given only two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.

VLSEAVLWAKV

Deletion? Insertion?

14

Indels in protein coding genes

Indels in protein-coding genes often have lengths of 3 bp, 6 bp, 9 bp, and so on, because an indel whose length is a multiple of three preserves the reading frame.

Gene Search

In fact, searching for indels of length 3K (K = 1, 2, 3, …) can help algorithms that search a genome for coding regions.
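The 3K observation can be checked mechanically on an alignment. The following is a minimal sketch, not a gene finder: it assumes gaps are written as `-`, treats every run of gaps in a row as one indel, and uses made-up example rows. The function names are hypothetical.

```python
import re

def gap_run_lengths(aligned_row):
    """Return the length of every run of '-' characters in one aligned row."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]

def frame_preserving_fraction(aligned_rows):
    """Fraction of gap runs whose length is a multiple of three (3K)."""
    lengths = [n for row in aligned_rows for n in gap_run_lengths(row)]
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

# Hypothetical aligned DNA rows, invented for illustration only.
rows = ["ATGGCC---AAATTT", "ATG-CCGGGAAA---"]
print(gap_run_lengths(rows[0]))
print(gap_run_lengths(rows[1]))
print(frame_preserving_fraction(rows))
```

A region in which most indels have length 3K is a candidate coding region under this heuristic; a real gene-search program would combine it with other signals.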

15

Global and Local pairwise alignments

16

Global vs. Local

• Global alignment – finds the best alignment across the entire two sequences.

• Local alignment – finds regions of similarity in parts of the sequences.

Global alignment forces alignment in regions which differ:

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment returns only regions of good alignment:

ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ

17

Global alignment

PTK2 (protein tyrosine kinase 2) of human and rhesus monkey

18

Proteins are composed of domains

Domain B

Protein tyrosine kinase domain

Domain A

Human PTK2 :

19

Protein tyrosine kinase domain

In leukocytes, a different gene for tyrosine kinase is expressed.

Domain X

Protein tyrosine kinase domain

Domain A

20

Domain X

Protein tyrosine kinase domain

Domain B

Protein tyrosine kinase domain

Domain A

Leukocyte TK

PTK2

The sequence similarity is restricted to a single domain.

21

Global alignment of PTK and LTK

22

Local alignment of PTK and LTK

23

Conclusions

Use global alignment when the two sequences share the same overall sequence arrangement.

Use local alignment to detect regions of similarity.
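To make the difference between the two modes concrete, here is a minimal sketch of both as dynamic programming. The slides do not give an algorithm at this point; the code uses the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap penalty. The function name and the default scores (+1, -2, -1) are assumptions, and only the best score is returned, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-2, gap=-1, local=False):
    """Best alignment score of a and b (score only, no traceback).

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of substrings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps must be paid for.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap      # gap in b
            left = table[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # Local mode: a negative prefix is dropped (restart at 0).
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]

# The two sequences from the "Global vs. Local" slide.
s1, s2 = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(s1, s2))
print("local: ", alignment_score(s1, s2, local=True))
```

The local score can never be lower than the global one, because the full-length alignment is itself one of the candidates the local search considers.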

24

How alignments are computed

25

Pairwise alignment

AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

26

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes:
10 perfect matches
2 mismatches
4 indels (gaps)

27

Choosing an alignment for a pair of sequences

AAGCTGAATTCGAA
AGGCTCATTTCTGA

Many different alignments are possible for 2 sequences:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?

28

Scoring system (naïve)

Perfect match: +1
Mismatch: -2
Indel (gap): -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

Higher score → better alignment
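The naive scheme can be applied column by column. Below is a minimal sketch that counts matches, mismatches, and gaps in a gapped alignment and sums the score; the function name is hypothetical, and the two alignments are the ones shown on the slides, split into their two rows.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, gap=-1):
    """Count column types in a gapped pairwise alignment and return (counts, score)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            counts["gap"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    total = (match * counts["match"]
             + mismatch * counts["mismatch"]
             + gap * counts["gap"])
    return counts, total

first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first)   # ({'match': 10, 'mismatch': 2, 'gap': 4}, 2)
print(second)  # ({'match': 9, 'mismatch': 2, 'gap': 6}, -1)
```

Under this scoring system the first alignment is preferred, since 2 > -1.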

29

Alignment scoring - scoring of sequence similarity:

Assumes independence between positions: each position is considered separately

Scores each position:
• Positive if identical (match)
• Negative if different (mismatch or gap)

Total score = sum of position scores
Can be positive or negative

30

Scoring systems

31

Scoring system

• In the example above, the choice of +1 for match, -2 for mismatch, and -1 for gap is quite arbitrary

• Different scoring systems → different alignments

• We want a good scoring system…

32

Scoring matrix

     A    G    C    T
A    2
G   -6    2
C   -6   -6    2
T   -6   -6   -6    2

• The scoring system is represented as an n × n table or matrix (n is the number of letters in the alphabet: n = 4 for nucleotides, n = 20 for amino acids)

• Symmetric
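Because the matrix is symmetric, only the lower triangle needs to be stored. The sketch below expands the triangle from the table above into a full lookup keyed by letter pairs; the dictionary layout and the names `LOWER`, `build_symmetric`, and `SCORES` are assumptions made for illustration.

```python
LETTERS = "AGCT"

# Lower triangle, one row per letter, in the column order A, G, C, T.
LOWER = {
    "A": [2],
    "G": [-6, 2],
    "C": [-6, -6, 2],
    "T": [-6, -6, -6, 2],
}

def build_symmetric(letters, lower):
    """Expand a lower-triangular table into a full symmetric lookup."""
    matrix = {}
    for i, row_letter in enumerate(letters):
        for j in range(i + 1):
            col_letter = letters[j]
            value = lower[row_letter][j]
            matrix[(row_letter, col_letter)] = value
            matrix[(col_letter, row_letter)] = value  # mirror across the diagonal
    return matrix

SCORES = build_symmetric(LETTERS, LOWER)
print(SCORES[("A", "G")], SCORES[("G", "A")], SCORES[("T", "T")])
```

The same layout works for amino acids with n = 20; only `LETTERS` and the triangle change.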

33

DNA scoring matrices

• Uniform substitutions between all nucleotides:

From \ To   A    G    C    T
A           2
G          -6    2
C          -6   -6    2
T          -6   -6   -6    2

Match: 2, Mismatch: -6

34

DNA scoring matrices

Can take into account biological phenomena such as:

• Transition-transversion
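A transition replaces a purine with a purine (A↔G) or a pyrimidine with a pyrimidine (C↔T); every other substitution is a transversion. A matrix that penalizes transitions less than transversions could be sketched as follows. The specific scores (2, -4, -6) are assumptions chosen for illustration, not values from the lesson.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

def substitution_type(x, y):
    """Classify a nucleotide pair as 'match', 'transition', or 'transversion'."""
    if x == y:
        return "match"
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"

def build_dna_matrix(match=2, transition=-4, transversion=-6):
    """Build a 4 x 4 lookup that scores transitions and transversions differently."""
    scores = {"match": match, "transition": transition, "transversion": transversion}
    return {(x, y): scores[substitution_type(x, y)] for x in "AGCT" for y in "AGCT"}

matrix = build_dna_matrix()
for x in "AGCT":
    print(x, [matrix[(x, y)] for y in "AGCT"])
```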

35

Amino-acid scoring matrices

• Take into account physico-chemical properties

36

Scoring gaps (I)

In more advanced algorithms, two separate gaps of one amino acid each are scored differently from a single gap of two amino acids. This is achieved by charging a penalty for each gap that is opened, in addition to a penalty for each position by which the gap is extended.

Gap extension penalty < Gap opening penalty
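One common way to realize this is an affine gap penalty: the first position of a gap costs the opening penalty and each further position costs the (smaller in magnitude) extension penalty. The sketch below scores only the gaps of one aligned row; the values -4 and -1 and the function name are assumptions, and other tools define "open" and "extend" slightly differently.

```python
import re

def affine_gap_score(aligned_row, gap_open=-4, gap_extend=-1):
    """Sum affine penalties over all gap runs in one aligned row.

    A run of n gap characters costs gap_open + (n - 1) * gap_extend.
    """
    total = 0
    for run in re.findall(r"-+", aligned_row):
        total += gap_open + (len(run) - 1) * gap_extend
    return total

print(affine_gap_score("AC-GT-A"))  # two gaps of one position each: -8
print(affine_gap_score("AC--GTA"))  # one gap of two positions: -5
```

With these values, one longer gap is cheaper than two short ones, which matches the intuition that a single indel event can span several positions.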

37

Scoring gaps (II)

The dependency between the penalty and the length of the gap need not be linear.

Linear penalty:

AGGGTTC-GA
AGGGTTCTGA    Score = -2

AGGGTT--GA
AGGGTTCTGA    Score = -4

AGGGT---GA
AGGGTTCTGA    Score = -6

AGGG----GA
AGGGTTCTGA    Score = -8

38

Scoring gaps (II)

The dependency between the penalty and the length of the gap need not be linear.

Non-linear penalty:

AGGGTTC-GA
AGGGTTCTGA    Score = -4

AGGGTT--GA
AGGGTTCTGA    Score = -6

AGGGT---GA
AGGGTTCTGA    Score = -7

AGGG----GA
AGGGTTCTGA    Score = -8
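The difference between the two schemes is easiest to see in the marginal cost of each additional gap position. The sketch below reproduces the linear scores with a formula and takes the non-linear scores directly from the slide as a lookup table, since the slide does not state the formula behind them. All names are hypothetical.

```python
def linear_penalty(length, per_position=-2):
    """Linear scheme: every gap position costs the same."""
    return per_position * length

# Non-linear scores for gap lengths 1 to 4, copied from the slide.
SLIDE_NONLINEAR = {1: -4, 2: -6, 3: -7, 4: -8}

def marginal_costs(penalty_by_length):
    """Cost added by each extra gap position, for lengths 1, 2, 3, ..."""
    costs, previous = [], 0
    for n in sorted(penalty_by_length):
        costs.append(penalty_by_length[n] - previous)
        previous = penalty_by_length[n]
    return costs

linear = {n: linear_penalty(n) for n in range(1, 5)}
print(linear)                           # {1: -2, 2: -4, 3: -6, 4: -8}
print(marginal_costs(linear))           # [-2, -2, -2, -2]
print(marginal_costs(SLIDE_NONLINEAR))  # [-4, -2, -1, -1]
```

In the non-linear scheme, opening the gap is the expensive step and each further position costs less.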

39

PAM AND BLOSUM

40

Amino-acid substitution matrices

• Actual substitutions:
  – Based on empirical data
  – Commonly used by many bioinformatics programs
  – PAM & BLOSUM

41

Protein matrices – actual substitutions

The idea: Given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other.

M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E

In the fourth column, E and D are found in 7 of 8 sequences.
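The counting behind this idea can be sketched in a few lines. The code below tallies the residues in one column of the alignment above and then counts how often each pair of residues co-occurs in that column. Counting every pair of rows once is one simple convention, assumed here for illustration; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# The eight aligned sequences from the slide, without spaces.
ROWS = ["MGYDE", "MGYDE", "MGYEE", "MGYDE", "MGYQE", "MGYDE", "MGYEE", "MGYEE"]

def column_counts(rows, index):
    """Count how often each residue appears in one alignment column."""
    return Counter(row[index] for row in rows)

def pair_counts(rows, index):
    """Count unordered residue pairs over all pairs of rows in one column."""
    column = [row[index] for row in rows]
    return Counter("".join(sorted(pair)) for pair in combinations(column, 2))

fourth = column_counts(ROWS, 3)
print(fourth)                         # D: 4, E: 3, Q: 1
print((fourth["D"] + fourth["E"]) / len(ROWS))
print(pair_counts(ROWS, 3))           # DE: 12, DD: 6, DQ: 4, EE: 3, EQ: 3
```

The D/E pair is by far the most frequent mixed pair in this column, which is the kind of evidence that leads to a favorable D/E substitution score.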

42

PAM Matrix - Point Accepted Mutations

• The Dayhoff PAM matrix is based on a database of 1,572 changes in 71 groups of closely related proteins (at least 85% identity, so alignment was easy and reliable).

• Counted the number of substitutions per amino-acid pair (20 x 20)

• Found that common substitutions occurred between chemically similar amino acids

43

PAM Matrices

• A family of matrices: PAM80, PAM120, PAM250

• The number on the PAM matrix represents evolutionary distance

• Larger numbers are for larger distances

44

Example: PAM 250

Similar amino acids receive a higher score

45

PAM - limitations

• Based on only a single, limited dataset

• Examines proteins with few differences (85% identity)

• Based mainly on small globular proteins so the matrix is biased

46

BLOSUM

• Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset

• BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

47

BLOSUM: Blocks Substitution Matrix

• Based on the BLOCKS database
  – ~2000 blocks from 500 families of related proteins
  – Families of proteins with identical function

• Blocks are short conserved patterns of 3-60 amino acids without gaps

AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC

48

BLOSUM

• Each block represents a sequence alignment with different identity percentage

• For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix

49

BLOSUM Matrices

• BLOSUMn is built from blocks in which sequences sharing at least n percent identity are clustered and counted as a single sequence

• BLOSUM62 represents closer sequences than BLOSUM45

50

Example: BLOSUM62

Derived from blocks in which sequences sharing at least 62% identity are clustered together

51

PAM vs. BLOSUM

Approximately equivalent matrices, ordered from closer to more distant sequences:

PAM100 ≈ BLOSUM90

PAM120 ≈ BLOSUM80

PAM160 ≈ BLOSUM60

PAM200 ≈ BLOSUM52

PAM250 ≈ BLOSUM45

52

Intermediate summary

1. Scoring system = substitution matrix + gap penalty.

2. Used for both global and local alignment.

3. For amino acids, there are two types of substitution matrices: PAM and BLOSUM.
