Introduction to Bioinformatics

Burkhard Morgenstern

Institute of Microbiology and Genetics

Department of Bioinformatics

Goldschmidtstr. 1

Göttingen, March 2004


Bioinformatics in Göttingen:

Dep. of Bioinformatics (UKG), Edgar Wingender

Dep. of Bioinformatics (IMG), BM

Inst. Num. and Applied Mathematics, Stephan Waack

Dep. of Genetics (Hans Fritz, IMG), Rainer Merkl


Definition:

Bioinformatics = development and application of software tools for Molecular Biology

Bioinformatics:

Topics:

(a) Sequence Analysis (Gene finding …)

(b) Structure Analysis (RNA, Protein)

(c) Gene Expression Analysis

(d) Metabolic Pathways, Virtual Cell

Bioinformatics:

Areas of work:

(a) Application of software tools for data analysis in (Molecular) Biology

(b) Computing infrastructure, database development, support

(c) Development of algorithms and software tools

Information flow in the cell


Idea:

Sequence -> Structure -> Function


Lots of data available at the sequence level

Fewer data at the structure and function level

Topics of lecture:

Data bases: SwissProt, GenBank

Pair-wise sequence comparison

Data base searching

Multiple sequence alignment

Gene prediction

Protein data bases

Sanger and Tuppy: protein-sequencing methods (1951)

Margaret Dayhoff: Atlas of Protein Sequence and Structure (1972); later: Protein Identification Resource (PIR) as international collaboration

(a) Organize proteins into families;

(b) Amino acid substitution frequencies

Amos Bairoch: SwissProt (1986)

Exponential growth of data bases

DNA data bases

Maxam and Gilbert; Sanger: DNA sequencing methods (1977)

GenBank DNA data base (1979), now run by NCBI.

Collaboration with EMBL (1982), DDBJ (1984)

Translated DNA sequences stored in protein data bases (PIR, TrEMBL)

Most important tool for sequence analysis:

Sequence comparison

The dot plot

Sequence 1 (horizontal): Y Q E W T Y I V A R E A Q Y E

Sequence 2 (vertical): C I V M R E Q Y

[Figure: dot-plot matrix of the two sequences; an X marks each pair of positions with identical residues.]

[Figure: second dot-plot example with horizontal sequence Y Q E W T Y Q E V R E Y Q E I and vertical sequence C I V M R Y Q E; the repeated segment Y Q E produces several parallel diagonals.]

The dot plot

Advantages:

1. Various types of similarity detectable (repeats, inversions)

2. Useful for large-scale analysis
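The dot plot for the example sequences can be reproduced with a few lines of Python. This is a minimal sketch, assuming that a dot is set only for identical residues; the function names dot_plot and render are hypothetical and not taken from any package.

```python
def dot_plot(horizontal, vertical):
    """Return one row per residue of `vertical`; True marks identical residues."""
    return [[h == v for h in horizontal] for v in vertical]


def render(horizontal, vertical):
    """Format the dot plot as text, with X for a match and . for no match."""
    lines = ["  " + " ".join(horizontal)]
    for residue, row in zip(vertical, dot_plot(horizontal, vertical)):
        lines.append(residue + " " + " ".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


# Example sequences from the slides (horizontal first, vertical second).
print(render("YQEWTYIVAREAQYE", "CIVMREQY"))
```

Similar segments show up as diagonal runs of X; a repeated segment gives several parallel diagonals.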

Pair-wise sequence alignment

Evolutionarily or structurally related sequences: alignment possible

Sequence homologies represented by inserting gaps


[Figure: dot plot of T Y I V A R E A Q Y E (horizontal) against C I V M R E Q Y (vertical); X marks identical residues.]


T Y I V A R E A Q Y E

C I V M R E Q Y



T Y I V A R E A Q Y E
- C I V M R E - Q Y -

Global alignment: sequences aligned over the entire length




Basic task:

Find best alignment of two sequences = alignment that reflects structural and evolutionary relations




Questions:

1. What is a good alignment?

2. How to find the best alignment?



Problem: Astronomical number of possible alignments, for example:

T Y I V A R E A Q Y E
- C I V M R E - Q Y -

T Y I V A R E A Q Y E
C I - V M R E - Q Y -

Stupid computer has to find out: which alignment is best?
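How fast the number of alignments grows can be checked with a short calculation. The sketch below assumes that every column of an alignment is either a pair of residues or one residue opposite a gap, and that alignments differing only in the order of adjacent gaps are counted separately. The function name count_alignments is hypothetical.

```python
def count_alignments(len_a, len_b):
    """Count all gapped alignments of two sequences with the given lengths."""
    # table[i][j] = number of alignments of prefixes of lengths i and j.
    # With an empty prefix there is exactly one alignment (all gaps).
    table = [[1] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            # Last column: residue pair, gap in second row, or gap in first row.
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j] + table[i][j - 1]
    return table[len_a][len_b]


# Lengths of the two example sequences (11 and 8 residues).
print(count_alignments(11, 8))
```

Even for sequences this short, enumerating every alignment is impractical, which motivates a scoring scheme and an efficient search.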

First (simplified) rules:

1. Minimize number of mismatches

2. Maximize number of matches



T Y I V A R E A Q Y E
C I - V M R E - Q Y -

T Y I V A R E A Q Y E
- C I V M R E - Q Y -

Second (simplified) rule:

Minimize number of gaps


T Y I V - A R E A Q Y E
C I - V M - R E - Q Y -


For protein sequences: different degrees of similarity among amino acids. Counting matches/mismatches is oversimplistic.


T Y I V

T L V


T Y I V

T L - V


T Y I V

T - L V


Use similarity scores for amino acids:

Define score s(a,b) for amino acids a and b


Given a similarity score for pairs of amino acids, define the score of an alignment as the sum of similarity values s(a,b) of aligned residues minus a gap penalty g for each residue aligned with a gap.


T Y I V

T - L V

Example:

Score = s(T,T) + s(I,L) + s(V,V) - g
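The score of a given alignment can be computed column by column. This is a minimal sketch: the similarity values in toy_score are invented for illustration (identical residues 2, the similar pair I/L 1, everything else -1) and are not taken from a real substitution matrix; g = 2 is likewise an assumption.

```python
def toy_score(a, b):
    """Hypothetical similarity score s(a,b); values chosen only for illustration."""
    if a == b:
        return 2
    if {a, b} == {"I", "L"}:
        return 1
    return -1


def alignment_score(row_a, row_b, s, g):
    """Sum s(a,b) over aligned residues and subtract g for each residue facing a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("both alignment rows must have the same number of columns")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= g
        else:
            score += s(a, b)
    return score


# Example from the slides: s(T,T) + s(I,L) + s(V,V) - g
print(alignment_score("TYIV", "T-LV", toy_score, 2))
```

With these toy values, the alignment T - L V scores higher than T L - V, because I and L are treated as similar.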


T Y I V

T - L V

Dynamic-programming algorithm finds

alignment with best score.

(Needleman and Wunsch, 1970)



T Y I V A R E A Q Y E
- C I V M R E - Q Y -

Alignment corresponds to path through comparison matrix

[Figure: comparison matrix of T Y I V A R E A Q Y E (horizontal) against C I V M R E Q Y (vertical); the cells on the alignment path are marked with X.]


T W L V - R E A Q I
- C I V M R E - H Y


Score of alignment: Sum of similarity values of aligned residues minus gap penalties

Example: S = - g + s(W,C) + s(L,I) + s(V,V) - g + s(R,R) …



[Figure: comparison matrix of T W L V R E A Q Y I (horizontal, index i) against C I V M R E H Y (vertical, index j); the alignment corresponds to the path of cells marked X.]

Dynamic programming: calculate scores S(i,j) of the optimal alignment of the prefixes up to positions i and j.

T W L V - R
- C I V M R

S(i,j) can be calculated from the possible predecessors S(i-1,j-1), S(i,j-1), S(i-1,j).

Score of optimal path that comes from top left = S(i-1,j-1) + s(R,R)

Score of optimal path that comes from above = S(i,j-1) - g

T W L V R -
- C I V M R

Score of optimal path that comes from left = S(i-1,j) - g

T W L - - V R
- C I V M R -

Score of optimal path = maximum of these three values


Recursion formula:

S(i,j) = max { S(i-1,j-1) + s(a_i,b_j) , S(i-1,j) - g , S(i,j-1) - g }
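The recursion translates directly into a nested loop over the matrix. The sketch below assumes boundary values S(i,0) = -i*g and S(0,j) = -j*g, and uses an invented similarity function toy_score; the names are hypothetical.

```python
def toy_score(a, b):
    """Hypothetical similarity score s(a,b); values chosen only for illustration."""
    if a == b:
        return 2
    if {a, b} == {"I", "L"}:
        return 1
    return -1


def global_score_matrix(a, b, s, g):
    """Return S, where S[i][j] is the best score for aligning a[:i] with b[:j]."""
    S = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        S[i][0] = -i * g  # prefix of a against the empty prefix
    for j in range(1, len(b) + 1):
        S[0][j] = -j * g  # prefix of b against the empty prefix
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # residues aligned
                S[i - 1][j] - g,                          # a_i aligned with a gap
                S[i][j - 1] - g,                          # b_j aligned with a gap
            )
    return S


S = global_score_matrix("TYIV", "TLV", toy_score, 2)
print(S[-1][-1])  # score of the best global alignment
```

The entry in the last row and last column is the score of the best alignment of the two complete sequences.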


[Figure: score matrix for T W L V R (horizontal) and C I V M R E H Y (vertical), filled step by step.]

Fill matrix from top left to bottom right.

Find optimal alignment by trace-back procedure.
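Filling the matrix and tracing back can be combined in one function. This is a sketch under stated assumptions, not a reference implementation: the similarity values in toy_score are invented, g = 2 is an example value, and ties during trace-back are resolved in a fixed order (aligned residues first), so only one of possibly several optimal alignments is returned.

```python
def toy_score(a, b):
    """Hypothetical similarity score s(a,b); values chosen only for illustration."""
    if a == b:
        return 2
    if {a, b} == {"I", "L"}:
        return 1
    return -1


def needleman_wunsch(a, b, s, g):
    """Return (score, aligned row of a, aligned row of b) for a global alignment."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * g
    for j in range(1, m + 1):
        S[0][j] = -j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
    # Trace back from the last cell to the first, collecting columns in reverse.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - g:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = needleman_wunsch("TYIV", "TLV", toy_score, 2)
print(score)
print(top)
print(bottom)
```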


Initial matrix entries?

Entries S(i,j): scores of optimal alignment of prefixes up to positions i and j.

[Figure: score matrix for T W L V R (horizontal, index i) and C I V M R E H Y (vertical, index j) with the path for the prefixes T W L V and C I V marked.]

T W L V
- C I V


Entries S(i,0): scores of optimal alignment of the prefix up to position i and the empty prefix. Score = - i * g

T W L V
- - - -

Initial matrix entries: Example, g = 2

          T    W    L    V    R
     0   -2   -4   -6   -8  -10
C   -2
I   -4
V   -6
M   -8
R  -10
E  -12
H  -14
Y  -16
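The initial entries for the example with g = 2 can be generated as follows. A minimal sketch; the function name initial_entries is hypothetical.

```python
def initial_entries(len_horizontal, len_vertical, g):
    """Return the first row and first column of the global-alignment matrix."""
    # Aligning a prefix of length k with the empty prefix costs k gap penalties.
    first_row = [-i * g for i in range(len_horizontal + 1)]
    first_column = [-j * g for j in range(len_vertical + 1)]
    return first_row, first_column


first_row, first_column = initial_entries(len("TWLVR"), len("CIVMREHY"), 2)
print(first_row)     # entries above T W L V R, starting with the empty prefix
print(first_column)  # entries beside C I V M R E H Y, starting with the empty prefix
```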

Pair-wise global alignment

[Figure: comparison matrix of T W L V R E A Q Y I (horizontal) against C I V M R E F Y (vertical) with the alignment path marked.]

T W L V - R E A Q I
- C I V M R E - F Y

Pair-wise global alignment

Complexity:

l1 and l2: lengths of the sequences

Computing time and memory proportional to l1 * l2

Time and space complexity = O(l1 * l2)

Pair-wise local alignment

Sequences often share only local sequence similarity (conserved genes or domains)

Important for database searching


[Figure: comparison matrix of T W L V R E A Q Y I (horizontal) against C I V M R E H Y (vertical).]



Problem:

Find pair of segments with maximal alignment score

(not necessarily part of optimal global alignment!)


Recursion formula for global alignment:

S(i,j) = max { S(i-1,j-1) + s(a_i,b_j) , S(i-1,j) - g , S(i,j-1) - g }


Recursion formula for local alignment:

S(i,j) = max { 0 , S(i-1,j-1) + s(a_i,b_j) , S(i-1,j) - g , S(i,j-1) - g }


Initial matrix entries = 0

        T   W   L   V   R
    0   0   0   0   0   0
C   0   0
I   0
V   0
M   0
R   0
E   0
H   0
Y   0

Example: s(C,T) = -2, so the entry for row C and column T is max { 0 , 0 + s(C,T) , 0 - g , 0 - g } = 0


Recursion formula for local alignment:

S(i,j) = max { 0 , S(i-1,j-1) + s(a_i,b_j) , S(i-1,j) - g , S(i,j-1) - g }

Store position with maximal value S(i,j) in matrix


Algorithm by Smith and Waterman (1981)

Implementation: e.g. BestFit in GCG package
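The local-alignment recursion, together with storing the position of the maximal entry, can be sketched as below. This is an illustration under stated assumptions and not the BestFit implementation: the similarity function toy_score (identical residues 2, otherwise -1) and g = 2 are invented, and only the first cell with the maximal score is traced back.

```python
def toy_score(a, b):
    """Hypothetical similarity score s(a,b); values chosen only for illustration."""
    return 2 if a == b else -1


def smith_waterman(a, b, s, g):
    """Return (score, segment of a, segment of b) for one best local alignment."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # initial matrix entries = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
            if S[i][j] > best:  # store position with maximal value
                best, best_pos = S[i][j], (i, j)
    # Trace back from the maximal entry until an entry with value 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - g:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Two otherwise unrelated toy sequences that share the segment REQY.
print(smith_waterman("WWWREQYWWW", "CCREQYCC", toy_score, 2))
```

Because negative prefix scores are replaced by 0, the alignment can start and end anywhere in the two sequences, which is what makes the method suitable for database searching.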
