bioinformatics for mnw 2 nd year

179
Bioinformatics For MNW 2 nd Year Jaap Heringa FEW/FALW Integrative Bioinformatics Institute VU (IBIVU)

Upload: deon

Post on 15-Jan-2016

37 views

Category:

Documents


0 download

DESCRIPTION

Bioinformatics For MNW 2 nd Year. Jaap Heringa FEW/FALW Integrative Bioinformatics Institute VU (IBIVU) [email protected]. Current Bioinformatics Unit. Jens Kleinjung (1/11/02) Victor Simosis – PhD (1/12/02) Radek Szklarczyk - PhD (1/01/03) John Romein (1/12/02, Henri Bal). - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Bioinformatics  For MNW 2 nd  Year

Bioinformatics For MNW 2nd Year

Jaap Heringa

FEW/FALW

Integrative Bioinformatics Institute VU (IBIVU)

[email protected]

Page 2: Bioinformatics  For MNW 2 nd  Year

Current Bioinformatics Unit

• Jens Kleinjung (1/11/02)

• Victor Simosis – PhD (1/12/02)

• Radek Szklarczyk - PhD (1/01/03)

• John Romein (1/12/02, Henri Bal)

Page 3: Bioinformatics  For MNW 2 nd  Year

Bioinformatics course 2nd year MNW spring 2003

• Pattern recognition– Supervised/unsupervised learning– Types of data, data normalisation, lacking data– Search image– Similarity tables– Clustering– Principal component analysis– Discriminant analysis

Page 4: Bioinformatics  For MNW 2 nd  Year

Bioinformatics course 2nd year MNW spring 2003

• Protein– Folding– Structure and function– Protein structure prediction– Secondary structure– Tertiary structure– Function– Post-translational modification – Prot.-Prot. Interaction -- Docking algorithm– Molecular dynamics/Monte Carlo

Page 5: Bioinformatics  For MNW 2 nd  Year

Bioinformatics course 2nd year MNW spring 2003

• Sequence analysis– Pairwise alignment– Dynamic programming (NW, SW, shortcuts)– Multiple alignment– Combining information– Database/homology searching (Fasta, Blast,

Statistical issues-E/P values)

Page 6: Bioinformatics  For MNW 2 nd  Year

Bioinformatics course 2nd year MNW spring 2003

• Gene structure and gene finding algorithm• Omics

– DNA makes RNA makes protein– Expression data, Nucleus to ribosome, translation, etc.– Metabolomics– Physiomics– Databases

• DNA, EST• Protein sequence• Protein structure

Page 7: Bioinformatics  For MNW 2 nd  Year

Bioinformatics course 2nd year MNW spring 2003

o Microarray data

o Protein structure (PDB)

o Proteomics

o Mass spectrometry/NMR/X-ray?

Page 8: Bioinformatics  For MNW 2 nd  Year

Bioinformatics course 2nd year MNW spring 2003

• Bioinformatics method development• IPR issues• Programming and scripting languages• Web solutions• Computational issues

– NP-complete problems– CPU, memory, storage problems– Parallel computing

• Bioinformatics method usage/application• Molecular viewers (RasMol, MolMol, etc.)

Page 9: Bioinformatics  For MNW 2 nd  Year

Gathering knowledge

• Anatomy, architecture

• Dynamics, mechanics

• Informatics(Cybernetics – Wiener, 1948) (Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems)

• Genomics, bioinformatics

Rembrandt, 1632

Newton, 1726

Page 10: Bioinformatics  For MNW 2 nd  Year

MathematicsStatistics

Computer ScienceInformatics

BiologyMolecular biology

Medicine

Chemistry

Physics

Bioinformatics

Bioinformatics

Page 11: Bioinformatics  For MNW 2 nd  Year

Bioinformatics

“Studying informational processes in biological systems” (Hogeweg, early

1970s)• No computers necessary• Back of envelope OK

Applying algorithms with mathematical formalisms in biology (genomics) -- USA

“Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)

Page 12: Bioinformatics  For MNW 2 nd  Year

Bioinformatics in the olden days• Close to Molecular Biology:

– (Statistical) analysis of protein and nucleotide structure

– Protein folding problem– Protein-protein and protein-nucleotide

interaction

• Many essential methods were created early on (BG era)– Protein sequence analysis (pairwise and

multiple alignment)– Protein structure prediction (secondary, tertiary

structure)

Page 13: Bioinformatics  For MNW 2 nd  Year

Bioinformatics in the olden days (Cont.)

• Evolution was studied and methods created– Phylogenetic reconstruction (clustering – NJ

method

Page 14: Bioinformatics  For MNW 2 nd  Year

The Human Genome -- 26 June 2000

Page 15: Bioinformatics  For MNW 2 nd  Year

The Human Genome -- 26 June 2000

Dr. Craig Venter

Celera Genomics

-- Shotgun method

Sir John Sulston

Human Genome Project

Page 16: Bioinformatics  For MNW 2 nd  Year

Human DNA

• There are about 3bn (3 109) nucleotides in the nucleus of almost all of the trillions (3.5 1012 ) of cells of a human body (an exception is, for example, red blood cells which have no nucleus and therefore no DNA) – a total of ~1022 nucleotides!

• Many DNA regions code for proteins, and are called genes (1 gene codes for 1 protein in principle)

• Human DNA contains ~30,000 expressed genes • Deoxyribonucleic acid (DNA) comprises 4 different

types of nucleotides: adenine (A), thiamine (T), cytosine (C) and guanine (G). These nucleotides are sometimes also called bases

Page 17: Bioinformatics  For MNW 2 nd  Year

Human DNA (Cont.)

• All people are different, but the DNA of different people only varies for 0.2% or less. So, only 2 letters in 1000 are expected to be different. Over the whole genome, this means that about 3 million letters would differ between individuals.

• The structure of DNA is the so-called double helix, discovered by Watson and Crick in 1953, where the two helices are cross-linked by A-T and C-G base-pairs (nucleotide pairs – so-called Watson-Crick base pairing).

Page 18: Bioinformatics  For MNW 2 nd  Year

Tot hier 3/2 – 10.45-12.30

Page 19: Bioinformatics  For MNW 2 nd  Year

DNA compositional biases

• Base composition of genomes: • E. coli: 25% A, 25% C, 25% G, 25% T• P. falciparum (Malaria parasite): 82%A+T

• Translation initiation: • ATG is the near universal motif indicating the

start of translation in DNA coding sequence.

Page 20: Bioinformatics  For MNW 2 nd  Year

Some facts about human genes

• Comprise about 3% of the genome

• Average gene length: ~ 8,000 bp

• Average of 5-6 exons/gene

• Average exon length: ~200 bp

• Average intron length: ~2,000 bp

• ~8% genes have a single exon

• Some exons can be as small as 1 or 3 bp.

• HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising 3% of a 67,000 bp gene

Page 21: Bioinformatics  For MNW 2 nd  Year

Genetic diseases

• Many diseases run in families and are a result of genes which predispose such family members to these illnesses

• Examples are Alzheimer’s disease, cystic fibrosis (CF), breast or colon cancer, or heart diseases.

• Some of these diseases can be caused by a problem within a single gene, such as with CF.

Page 22: Bioinformatics  For MNW 2 nd  Year

Genetic diseases (Cont.)• For other illnesses, like heart disease, at least 20-30

genes are thought to play a part, and it is still unknown which combination of problems within which genes are responsible.

• With a “problem” within a gene is meant that a single nucleotide or a combination of those within the gene are causing the disease (or make that the body is not sufficiently fighting the disease).

• Persons with different combinations of these nucleotides could then be unaffected by these diseases.

Page 23: Bioinformatics  For MNW 2 nd  Year

Genetic diseases (Cont.)Cystic Fibrosis

• Known since very early on (“Celtic gene”)• Inherited autosomal recessive condition (Chr. 7)• Symptoms:

– Clogging and infection of lungs (early death)

– Intestinal obstruction

– Reduced fertility and (male) anatomical anomalies

• CF gene CFTR has 3-bp deletion leading to Del508 (Phe) in 1480 aa protein (epithelial Cl- channel) – protein degraded in ER instead of inserted into cell membrane

Page 24: Bioinformatics  For MNW 2 nd  Year

Genomic Data Sources

• DNA/protein sequence

• Expression (microarray)

• Proteome (xray, NMR,

mass spectrometry)

• Metabolome

• Physiome (spatial,

temporal)

Integrative bioinformatics

Page 25: Bioinformatics  For MNW 2 nd  Year

Dinner discussion: Integrative Bioinformatics & Genomics VUDinner discussion: Integrative Bioinformatics & Genomics VU

metabolomemetabolome

proteomeproteome

genomegenome

transcriptometranscriptome

physiomephysiome

Genomic Data SourcesVertical Genomics

Page 26: Bioinformatics  For MNW 2 nd  Year

A gene codes for a protein

Protein

mRNA

DNA

transcription

translation

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA

Page 27: Bioinformatics  For MNW 2 nd  Year

Humans havespliced genes…

Page 28: Bioinformatics  For MNW 2 nd  Year

DNA makes RNA makes Protein

Page 29: Bioinformatics  For MNW 2 nd  Year

Remark• The problem of identifying (annotating) human genes is

considerably harder than the early success story for ß-globin might suggest.

• The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron.

• The biggest human gene yet is for dystrophin. It has > 30

exons and is spread over 2.4 million bp.

Page 30: Bioinformatics  For MNW 2 nd  Year

DNA makes RNA makes Protein:Expression data

• More copies of mRNA for a gene leads to more protein

• mRNA can now be measured for all the genes in a cell at ones through microarray technology

• Can have 60,000 spots (genes) on a single gene chip

• Colour change gives intensity of gene expression (over- or under-expression)

Page 31: Bioinformatics  For MNW 2 nd  Year
Page 32: Bioinformatics  For MNW 2 nd  Year

Metabolic networks

Glycolysis and

Gluconeogenesis

Kegg database (Japan)

Page 33: Bioinformatics  For MNW 2 nd  Year

High-throughput Biological Data

• Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming– genomic sequences

– gene expression data

– mass spec. data

– protein-protein interaction

– protein structures

– ......

Page 34: Bioinformatics  For MNW 2 nd  Year

Protein structural data explosion

Protein Data Bank (PDB): 14500 Structures (6 March 2001)10900 x-ray crystallography, 1810 NMR, 278 theoretical models, others...

Page 35: Bioinformatics  For MNW 2 nd  Year

Dickerson’s formula: equivalent to Moore’s law

On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%)!

n = e0.19(y-1960) with y the year.

Page 36: Bioinformatics  For MNW 2 nd  Year

Sequence versus structural data

• Despite structural genomics efforts, growth of PDB slowed down in 2001-2002 (i.e did not keep up with Dickerson’s formula)

• More than 100 completely sequenced genomes

Increasing gap between structural and sequence data

Page 37: Bioinformatics  For MNW 2 nd  Year

BioinformaticsLarge - external(integrative) Science Human

Planetary Science Cultural Anthropology

Population Biology Sociology Sociobiology Psychology Systems Biology Biology Medicine

Molecular Biology Chemistry Physics

Small – internal (individual)

Bioinformatics

Page 38: Bioinformatics  For MNW 2 nd  Year

Bioinformatics• Offers an ever more essential input to

– Molecular Biology– Pharmacology (drug design)– Agriculture– Biotechnology– Clinical medicine– Anthropology– Forensic science– Chemical industries (detergent industries, etc.)

Page 39: Bioinformatics  For MNW 2 nd  Year

High-throughput Biological DataThe data deluge

• Hidden in these data is information that reflects

– existence, organization, activity, functionality …… of biological machineries at different levels in living organisms

Most effectively utilising this information will prove to be essential for Integrative Bioinformatics

Page 40: Bioinformatics  For MNW 2 nd  Year

Data Issues ……• Data collection: getting the data

• Data representation: data standards, data normalisation …..

• Data organisation and storage: database issues …..

• Data analysis and data mining: discovering “knowledge”, patterns/signals, from data, establishing associations among data patterns

• Data utilisation and application: from data patterns/signals to models for bio-machineries

• Data visualization: viewing complex data ……

• Data transmission: data collection, retrieval, …..

• ……

Page 41: Bioinformatics  For MNW 2 nd  Year

Tot hier 5/2

Page 42: Bioinformatics  For MNW 2 nd  Year

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in bioinformatics makes sense except in the light of Biology”

Bioinformatics

Page 43: Bioinformatics  For MNW 2 nd  Year

Pair-wise alignment

Combinatorial explosion- 1 gap in 1 sequence: n+1 possibilities- 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc.

2n (2n)! 22n

= ~ n (n!)2 n 2 sequences of 300 a.a.: ~1088 alignments 2 sequences of 1000 a.a.: ~10600 alignments!

T D W V T A L KT D W L - - I K

Page 44: Bioinformatics  For MNW 2 nd  Year

Dynamic programmingScoring alignments

Sa,b = +

gp(k) = pi + kpe affine gap penalties

pi and pe are the penalties for gap initialisation and extension, respectively

li jbas ),( )(kgpN

kk

Page 45: Bioinformatics  For MNW 2 nd  Year

Dynamic programmingScoring alignments

10 1

Amino Acid Exchange Matrix Gap penalties (open, extension)

2020

Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)+Po+2Px +

+s(L,I)+s(K,K)

T D W V T A L KT D W L - - I K

Page 46: Bioinformatics  For MNW 2 nd  Year

Pairwise sequence alignmentGlobal dynamic programming

MDAGSTVILCFVGMDAASTILCGS

Amino Acid Exchange

Matrix

Gap penalties (open,extension)

Search matrix

MDAGSTVILCFVG-MDAAST-ILC--GS

Evolution

Page 47: Bioinformatics  For MNW 2 nd  Year

Global dynamic programming

i-1

j-1

Si,j = si,j + Max Max{S0<x<i-1, j-1 - Pi - (i-x-1)Px}

Si-1,j-1

Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px}

Page 48: Bioinformatics  For MNW 2 nd  Year

Global dynamic programming

Page 49: Bioinformatics  For MNW 2 nd  Year

Global dynamic programming

Page 50: Bioinformatics  For MNW 2 nd  Year

Tot hier 17/02/03

Page 51: Bioinformatics  For MNW 2 nd  Year

Local dynamic programming (Smith & Waterman, 1981)

LCFVMLAGSTVIVGTREDASTILCGS

Amino AcidExchange Matrix

Gap penalties (open, extension)

Search matrix

Negativenumbers

AGSTVIVG

A-STILCG

Page 52: Bioinformatics  For MNW 2 nd  Year

Local dynamic programming (Smith & Waterman, 1981)

i-1

j-1

Si,j = Max

Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px}

Si,j + Si-1,j-1

Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px}

0

Page 53: Bioinformatics  For MNW 2 nd  Year

Local dynamic programming

Page 54: Bioinformatics  For MNW 2 nd  Year

Sequence database searching – Homology searching

DP too slow for repeated database searches

• FASTA

• BLAST and PSI-BLAST

• QUEST

• HMMER

• SAM-T98

Fast heuristics

Hidden Markov modelling

Page 55: Bioinformatics  For MNW 2 nd  Year

FASTA

• Compares a given query sequence with a library of sequences and calculates for each pair the highest scoring local alignment

• Speed is obtained by delaying application of the dynamic programming technique to the moment where the most similar segments are already identified by faster and less sensitive techniques

• FASTA routine operates in four steps:

Page 56: Bioinformatics  For MNW 2 nd  Year

FASTAOperates in four steps: 1. Rapid searches for identical words of a user specified length

occurring in query and database sequence(s) (Wilbur and Lipman, 1983, 1984). For each target sequence the 10 regions with the highest density of ungapped common words are determined.

2. These 10 regions are rescored using Dayhoff PAM-250 residue exchange matrix (Dayhoff et al., 1983) and the best scoring region of the 10 is reported under init1 in the FASTA output.

3. Regions scoring higher than a threshold value and being sufficiently near each other in the sequence are joined, now allowing gaps. The highest score of these new fragments can be found under initn in the FASTA output.

4. full dynamic programming alignment (Chao et al., 1992) over the final region which is widened by 32 residues at either side, of which the score is written under opt in the FASTA output.

Page 57: Bioinformatics  For MNW 2 nd  Year

FASTA output example

DE METAL RESISTANCE PROTEIN YCF1 (YEAST CADMIUM FACTOR 1). . . .

SCORES Init1: 161 Initn: 161 Opt: 162 z-score: 229.5 E(): 3.4e-06

Smith-Waterman score: 162; 35.1% identity in 57 aa overlap

10 20 30 test.seq MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLE :| :|::| |:::||:|||::|: | YCFI_YEAST CASILLLEALPKKPLMPHQHIHQTLTRRKPNPYDSANIFSRITFSWMSGLMKTGYEKYLV

180 190 200 210 220 230

40 50 60 test.seq LSDIYQIPSVDSADNLSEKLEREWDRE :|:|::| |:::||:|||::|: | YCFI_YEAST EADLYKLPRNFSSEELSQKLEKNWENELKQKSNPSLSWAICRTFGSKMLLAAFFKAIHDV 240 250 260 270 280 290

Page 58: Bioinformatics  For MNW 2 nd  Year

FASTA

(1) Rapid identical word searches:• Searching for k-tuples of a certain size within a

specified bandwidth along search matrix diagonals. • For not-too-distant sequences (> 35% residue

identity), little sensitivity is lost while speed is greatly increased.

• Technique employed is known as hash coding or hashing: a lookup table is constructed for all words in the query sequence, which is then used to compare all encountered words in each database sequence.

Page 59: Bioinformatics  For MNW 2 nd  Year

FASTA• The k-tuple length is user-defined and is usually 1 or

2 for protein sequences (i.e. either the positions of each of the individual 20 amino acids or the positions of each of the 400 possible dipeptides are located).

• For nucleic acid sequences, the k-tuple is 5-20, and should be longer because short k-tuples are much more common due to the 4 letter alphabet of nucleic acids. The larger the k-tuple chosen, the more rapid but less thorough, a database search.

Page 60: Bioinformatics  For MNW 2 nd  Year

BLAST• blastp compares an amino acid query sequence

against a protein sequence database• blastn compares a nucleotide query sequence

against a nucleotide sequence database• blastx compares the six-frame conceptual protein

translation products of a nucleotide query sequence against a protein sequence database

• tblastn compares a protein query sequence against a nucleotide sequence database translated in six reading frames

• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Page 61: Bioinformatics  For MNW 2 nd  Year

BLAST• Generates all tripeptides from a query sequence and

for each of those the derivation of a table of similar tripeptides: number is only fraction of total number possible.

• Quickly scans a database of protein sequences for ungapped regions showing high similarity, which are called high-scoring segment pairs (HSP), using the tables of similar peptides. The initial search is done for a word of length W that scores at least the threshold value T when compared to the query using a substitution matrix.

• Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of S, and as far as the cumulative alignment score can be increased.

Page 62: Bioinformatics  For MNW 2 nd  Year

BLASTExtension of the word hits in each direction are halted• when the cumulative alignment score falls off by the

quantity X from its maximum achieved value• the cumulative score goes to zero or below due to the

accumulation of one or more negative-scoring residue alignments

• upon reaching the end of either sequence

• The T parameter is the most important for the speed and sensitivity of the search resulting in the high-scoring segment pairs

• A Maximal-scoring Segment Pair (MSP) is defined as the highest scoring of all possible segment pairs produced from two sequences.

Page 63: Bioinformatics  For MNW 2 nd  Year

PSI-BLAST• Query sequences are first scanned for the presence of

so-called low-complexity regions (Wooton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; are excluded from alignment.

• The program then initially operates on a single query sequence by performing a gapped BLAST search

• Then, the program takes significant local alignments found, constructs a multiple alignment and abstracts a position specific scoring matrix (PSSM) from this alignment.

• Rescan the database in a subsequent round to find more homologous sequences Iteration continues until user decides to stop or search has converged

Page 64: Bioinformatics  For MNW 2 nd  Year

PSI-BLAST iteration

Q

ACD..Y

PiPx

Query sequence

PSSM

Q Query sequence

Gapped BLAST search

Database hits

Gapped BLAST searchACD..Y

PiPx

PSSM

Database hits

xxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxx

Page 66: Bioinformatics  For MNW 2 nd  Year

Multiple alignment profilesGribskov et al. 1987

ACDWY

Gappenalties

i0.30.100.30.3

0.51.0

Position dependent gap penalties

Page 67: Bioinformatics  For MNW 2 nd  Year

Normalised sequence similarityThe p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences.

This probability follows the Poisson distribution (Waterman and Vingron, 1994):

P(x, n) = 1 – e-nP(S x),

where n is the number of sequences in the database

Depending on x and n (fixed)

Page 68: Bioinformatics  For MNW 2 nd  Year

Normalised sequence similarityStatistical significance

The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences:

E(x, n) = nP(S x)

if E-value = 0.01, then the expected number of random hits with score S x is 0.01, which means that this E-value is expected by chance only once in 100 independent searches over the database.if the E-value of a hit is 5, then five fortuitous hits with S x are expected within a single database search, which renders the hit not significant.

Page 69: Bioinformatics  For MNW 2 nd  Year

Normalised sequence similarityStatistical significance

• Database searching is commonly performed using an E-value in between 0.1 and 0.001.

• Low E-values decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.

Page 70: Bioinformatics  For MNW 2 nd  Year

HMM-based homology searching

• Most widely used HMM-based profile searching tools currently are SAM-T98 (Karplus et al., 1998) and HMMER2 (Eddy, 1998)

• formal probabilistic basis and consistent theory behind gap and insertion scores

• HMMs good for profile searches, bad for alignment

• HMMs are slow

Page 71: Bioinformatics  For MNW 2 nd  Year

The HMM algorithms

Questions:1. What is the most likely die (predicted) sequence? Viterbi

2. What is the probability of the observed sequence? Forward

3. What is the probability that the 3rd state is B, given the observed sequence? Backward

Forward: (i) = P(observed sequence, ending in state i at base t)

Backward:ß (i) = P(obs. after t | ending in state i at base t)

Viterbi: (i) = max P(obs. , ending in state i at base t)

t

t

t

Page 72: Bioinformatics  For MNW 2 nd  Year

HMM-based homology searching

Transition probabilities and Emission probabilities

Gapped HMMs also have insertion and deletion states

Page 73: Bioinformatics  For MNW 2 nd  Year

Profile HMM: m=match state, I-insert state, d=delete state; go from left to right. I and m states output amino acids; d states are ‘silent”.

d1 d2 d3 d4

I0 I2 I3 I4I1

m0 m1 m2 m3 m4 m5

Start End

Page 74: Bioinformatics  For MNW 2 nd  Year

Homology-derived Secondary Structure of Proteins

(HSSP) Sander & Schneider, 1991

Page 75: Bioinformatics  For MNW 2 nd  Year

Tot hier 17/02/03

Page 76: Bioinformatics  For MNW 2 nd  Year

Bio-Data Analysis and Data Mining• Existing/emerging bio-data analysis and mining tools for

– DNA sequence assembly

– Genetic map construction

– Sequence comparison and database searching

– Gene finding

– ….

– Gene expression data analysis

– Phylogenetic tree analysis to infer horizontally-transferred genes

– Mass spec. data analysis for protein complex characterization

– ……

• Current mode of work:

Often enough: developing ad hoc tools for each individual application

Page 77: Bioinformatics  For MNW 2 nd  Year

Bio-Data Analysis and Data Mining• As the amount and types of data and their

cross connections increase rapidly

• the number of analysis tools needed will go up “exponentially”– blast, blastp, blastx, blastn, … from BLAST family

of tools– gene finding tools for human, mouse, fly, rice,

cyanobacteria, …..– tools for finding various signals in genomic

sequences, protein-binding sites, splice junction sites, translation start sites, …..

Page 78: Bioinformatics  For MNW 2 nd  Year

Bio-Data Analysis and Data Mining

Many of these data analysis problems are fundamentally the same problem(s) and can

be solved using the same set of tools: e.g. clustering or optimal segmentation by

Dynamic Programming

Developing ad hoc tools for each application (by each group of individual researchers) may soon become inadequate as bio-data production capabilities further ramp up

Page 79: Bioinformatics  For MNW 2 nd  Year

Bio-data Analysis, Data Mining and Integrative

Bioinformatics

To have analysis capabilities covering wide range of problems, we need to discover the common

fundamental structures of these problems;

HOWEVER in biology one size does NOT fit all…

Goal is development of a data analysis infrastructure in support of Genomics and

beyond

Page 80: Bioinformatics  For MNW 2 nd  Year

Algorithms in bioinformatics• string algorithms• dynamic programming• machine learning (NN, k-NN, SVM, GA, ..)• Markov chain models• hidden Markov models• Markov Chain Monte Carlo (MCMC) algorithms• stochastic context free grammars• EM algorithms• Gibbs sampling• clustering• tree algorithms• text analysis• hybrid/combinatorial techniques and more…

Page 81: Bioinformatics  For MNW 2 nd  Year

Sequence analysis and homology searching

Page 82: Bioinformatics  For MNW 2 nd  Year

Finding genes and regulatory elements

Page 83: Bioinformatics  For MNW 2 nd  Year

Expression data

Page 84: Bioinformatics  For MNW 2 nd  Year

Functional genomics

• Monte Carlo

Page 85: Bioinformatics  For MNW 2 nd  Year

Protein translation

Page 86: Bioinformatics  For MNW 2 nd  Year

Example of algorithm reuse: Data clustering

• Many biological data analysis problems can be formulated as clustering problems– microarray gene expression data analysis

– identification of regulatory binding sites (similarly, splice junction sites, translation start sites, ......)

– (yeast) two-hybrid data analysis (for inference of protein complexes)

– phylogenetic tree clustering (for inference of horizontally transferred genes)

– protein domain identification

– identification of structural motifs

– prediction reliability assessment of protein structures

– NMR peak assignments

– ......

Page 87: Bioinformatics  For MNW 2 nd  Year

Data Clustering Problems

• Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar”

• cluster identification -- identifying clusters with significantly different features than the background

Page 88: Bioinformatics  For MNW 2 nd  Year

Application Examples

• Regulatory binding site identification: CRP (CAP) binding site

• Two hybrid data analysis Gene expression data analysis

Are all solvable by the same algorithm!

Page 89: Bioinformatics  For MNW 2 nd  Year

Other Application Examples

• Phylogenetic tree clustering analysis

• Protein sidechain packing prediction

• Assessment of prediction reliability of protein structures

• Protein secondary structures

• Protein domain prediction

• NMR peak assignments

• ……

Page 90: Bioinformatics  For MNW 2 nd  Year

Integrative bioinformatics @ VU

Studying informational processes at biological system level

• From gene sequence to intercellular processes

• Computers necessary

• We have biology, statistics, computational intelligence (AI), HTC, ..

• VUMC: microarray facility

• Enabling technology: new glue to integrate

• New integrative algorithms

• Goals: understanding cells in terms of genomes, fighting disease (VUMC)

Page 91: Bioinformatics  For MNW 2 nd  Year

Bioinformatics @ VU

Progression:

• DNA: gene prediction, predicting regulatory elements

• mRNA expression

• Proteins: docking, domain prediction

• Metabolic pathways: metabolic control

• Cell-cell communication

Page 92: Bioinformatics  For MNW 2 nd  Year
Page 93: Bioinformatics  For MNW 2 nd  Year

Pyruvate kinasePhosphotransferase

barrel regulatory domain

barrel catalytic substrate binding domain

nucleotide binding domain

1 continuous + 2 discontinuous domains

Protein structure and function can be complex…

Page 94: Bioinformatics  For MNW 2 nd  Year

Bioinformatics @ VU

Qualitative challenges:• High quality alignments (alternative splicing)• In-silico structural genomics• In-silico functional genomics: reliable annotation• Protein-protein interactions.• Metabolic pathways: assign the edges in the

networks• Cell-cell communication: find membrane

associated components• New algorithms

Page 95: Bioinformatics  For MNW 2 nd  Year

Bioinformatics @ VUQuantitative challenges:• Understanding mRNA expression levels• Understanding resulting protein activity• Time dependencies• Spatial constraints, compartmentalisation• Are classical differential equation models adequate or do

we need more individual modeling (e.g macromolecular crowding and activity at oligomolecular level)?

• Metabolic pathways: calculate fluxes through time • Cell-cell communication: tissues, hormones, innervations

Need ‘complete’ experimental data for good biological model system to learn to integrate

Page 96: Bioinformatics  For MNW 2 nd  Year

Bioinformatics @ VUVUMC

• Neuropeptide – addiction

• Oncogenes – disease patterns

• Reumatic disease

CNCR

• From synapses to higher order behaviour

• Addiction

FPP

• Genetic psychology – twin data bank

Page 97: Bioinformatics  For MNW 2 nd  Year

• Integrate data sources

• Integrate methods

• Integrate data through method integration (biological model)

Integrative bioinformatics

Page 98: Bioinformatics  For MNW 2 nd  Year

Bioinformatics tool

Data

Algorithm

BiologicalInterpretation

(model)

tool

Page 99: Bioinformatics  For MNW 2 nd  Year

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in Bioinformatics makes sense except in the light of Biology”

Bioinformatics

Page 100: Bioinformatics  For MNW 2 nd  Year

Pair-wise sequence alignment(more than just string matching)

MDAGSTVILCFVGMDAASTILCGS

Amino Acid Exchange

Matrix

Gap penalties (open,extension)

Search matrix

MDAGSTVILCFVG-MDAAST-ILC--GS

EvolutionGlobal dynamic programming

Page 101: Bioinformatics  For MNW 2 nd  Year

Pair-wise alignment search explosions

Combinatorial explosion- 1 gap in 1 sequence: n+1 possibilities- 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc.

2n (2n)! 22n

= ~ n (n!)2 n 2 sequences of 300 a.a.: ~1088 alignments 2 sequences of 1000 a.a.: ~10600 alignments!

T D W V T A L KT D W L - - I K

Page 102: Bioinformatics  For MNW 2 nd  Year

Global dynamic programming

Page 103: Bioinformatics  For MNW 2 nd  Year

Three integrative methods to predict protein structural aspects:

• Iterative multiple alignment + protein secondary structure (Praline)

Intermezzo: 2½-D structure prediction of flavodoxin fold by hand

• Protein domain delineation based on consistency of multiple ab initio model tertiary structures (SnapDRAGON)

• Protein domain delineation based on combining homology searching with domain prediction (Domaination)

This talk – own kitchen

Page 104: Bioinformatics  For MNW 2 nd  Year

Comparing sequences - Similarity Score -

Many properties can be used:

• Nucleotide or amino acid composition

• Isoelectric point

• Molecular weight

• Morphological characters

Page 105: Bioinformatics  For MNW 2 nd  Year

Multivariate statistics – Cluster analysis

Phylogenetic tree

Scores

Similaritymatrix

5×5

12345

C1 C2 C3 C4 C5 C6 ..

Raw table

Similarity criterion

Cluster criterion

Page 106: Bioinformatics  For MNW 2 nd  Year

Human Evolution

Page 107: Bioinformatics  For MNW 2 nd  Year

Comparing sequences - Similarity Score -

Many properties can be used:

• Nucleotide or amino acid composition

• Isoelectric point

• Molecular weight

• Morphological characters

• But: molecular evolution through sequence alignment

Page 108: Bioinformatics  For MNW 2 nd  Year

Multivariate statistics – Cluster analysis

Phylogenetic tree

Scores

Similaritymatrix

5×5

Multiple alignment

12345

Similarity criterion

Page 109: Bioinformatics  For MNW 2 nd  Year

Human -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ Chicken -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ Dogfish –KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQLamprey SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ Barley TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ Maizey casei -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ Bacillus TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ Lacto__ste -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ Lacto_plant QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ Therma_mari MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ Bifido -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ Thermus_aqua MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ Mycoplasma -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ

Lactate dehydrogenase multiple alignment

Distance Matrix 1 2 3 4 5 6 7 8 9 10 11 12 13 1 Human 0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637 2 Chicken 0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651 3 Dogfish 0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655 4 Lamprey 0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669 5 Barley 0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575 6 Maizey 0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587 7 Lacto_casei 0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495 9 Lacto_plant 0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485 10 Therma_mari 0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598 11 Bifido 0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614 12 Thermus_aqua 0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641 13 Mycoplasma 0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000

Page 110: Bioinformatics  For MNW 2 nd  Year
Page 111: Bioinformatics  For MNW 2 nd  Year

Multiple sequence alignmentWhy?

• It is the most important means to assess relatedness of a set of sequences

• Gain information about the structure/function of a query sequence (conservation patterns)

• Construct a phylogenetic tree• Putting together a set of sequenced fragments

(Fragment assembly)• Comparing a segment sequenced by two different

labs• Many bioinformatics methods depend on it (e.g.

secondary/tertiary structure prediction)

Page 112: Bioinformatics  For MNW 2 nd  Year

Flavodoxin fold: aligning 13 Flavodoxins + cheY

5() fold

Page 113: Bioinformatics  For MNW 2 nd  Year

Flavodoxin-cheY multiple alignmentPraline with pre-processing

1fx1 -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF

FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf

FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf

FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf

FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf

2fcr --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF

FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf

FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf

FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf

FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf

4fxn -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF

FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf

FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf

3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM

T1fx1 GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------

FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------

FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------

FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI--------

FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV--------

2fcr GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------

FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--

FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------

FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------

FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA

4fxn G-----SY-GWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------

FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA---------

FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF-----------

3chy VTAEAKK--ENIIAA---------AQAGAS-------------------------GYVV-----KPFTAATLEEKLNKIFEKLGM------

GIteration 0 SP= 136944.00 AvSP= 10.675 SId= 4009 AvSId= 0.313

Page 114: Bioinformatics  For MNW 2 nd  Year

Flavodoxin-cheY NJ tree

Page 115: Bioinformatics  For MNW 2 nd  Year

Integrating secondary structure prediction in multiple alignment

Victor Simossis

Praline multiple alignment method(Heringa, Comp. Chem. 23, 341-364;1999, Comp. Chem., 26, 459-477;2002;Kleinjung, Douglas & Heringa, Bioinformatics, in press;2002)

• Combining sequence data and secondary structure prediction (Heringa, Curr. Prot. Pept. Sci., 1 (3), 273-301;2000)

• Secondary structure methods: PhD, Predator, PSIPred, Jpred, SSPRED,...

Page 116: Bioinformatics  For MNW 2 nd  Year

Using secondary structure in multiple alignment

“Structure more conserved than sequence”

Page 117: Bioinformatics  For MNW 2 nd  Year

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE (oligomers)

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Page 118: Bioinformatics  For MNW 2 nd  Year

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE (oligomers)

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Page 119: Bioinformatics  For MNW 2 nd  Year

Secondary structure-induced alignment

Page 120: Bioinformatics  For MNW 2 nd  Year

Using secondary structure in multiple alignment

Dynamic programmingsearch matrix

Amino acid exchangeweights matrices

MDAGSTVILCFVHHHCCCEEEEEE

MDAASTILCGS

HHHHCCEEECC

C

H

E

H C

E Default

Page 121: Bioinformatics  For MNW 2 nd  Year

Flavodoxin-cheY predicted secondary structure(PREDATOR)

1fx1 -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeeeFLAV_DESVH MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeeeFLAV_DESGI MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeeeFLAV_DESSA MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeeeFLAV_DESDE MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee2fcr --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeeeFLAV_ANASP SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeeeFLAV_ECOLI -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeeeFLAV_AZOVI -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeeeFLAV_ENTAG MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee4fxn ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeeeFLAV_MEGEL M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeeeFLAV_CLOAB M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee

1fx1 GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI-------- eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhhFLAV_DESVH GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI-------- eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhhFLAV_DESGI GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV-------- eee hhhhhhhhhhhh eeeee hhhhhhhhhhhFLAV_DESSA GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI-------- hhhhhhhhhhhh eeeee e eeeFLAV_DESDE ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL-------- e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh2fcr GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------ eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhhtFLAV_ANASP GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------ hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhFLAV_ECOLI GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhhFLAV_AZOVI GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-- e hhhhhhhhhhhhhh eeeee hhhhhhhhhhhFLAV_ENTAG GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------ hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh4fxn G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI--------- e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhhtFLAV_MEGEL G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA--------- hhhhhhhhhhh eeeee eeee h hhhhhhhhFLAV_CLOAB STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF-- hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h3chy -----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM------ ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht

G

Enough to predict 5() topology

Page 122: Bioinformatics  For MNW 2 nd  Year

Secondary structure-induced alignment

Page 123: Bioinformatics  For MNW 2 nd  Year

3chy-AA SEQUENCE|| AA |ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP|

3chy-ITERATION-0|| PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE |

3chy-ITERATION-1|| PHD | EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE |

3chy-ITERATION-2|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE |

3chy-ITERATION-3|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |

3chy-ITERATION-4|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE |

3chy-ITERATION-5|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |

3chy-ITERATION-6|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE |

3chy-ITERATION-7|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |

3chy-ITERATION-8|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE |

3chy-ITERATION-9|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE |

3chy-AA SEQUENCE|| AA |NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM|

3chy-ITERATION-0|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |

3chy-ITERATION-1|| PHD | HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-2|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-3|| PHD | HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-4|| PHD | HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-5|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-6|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |

3chy-ITERATION-7|| PHD | HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-8|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-9|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |

Flavodoxin-cheY multiple alignment/ secondary structure iteration

cheY SSEs

Page 124: Bioinformatics  For MNW 2 nd  Year

4fxn-AA SEQUENCE|| AA |MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEV|4fxn-ITERATION-0|| PHD | EEEEE HHHHHHHHHHHHHHH EEE EEEEE |4fxn-ITERATION-1|| PHD | EEEEE HHHHHHHHHHHHHHH EEEE EEEEE |4fxn-ITERATION-2|| PHD | EEEEE HHHHHHHHHHHHHHH EEEE EEEEE |4fxn-ITERATION-3|| PHD | EEEEE HHHHHHHHHHHHHHH E EEEEE |4fxn-ITERATION-4|| PHD | EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE |4fxn-ITERATION-5|| PHD | EEEEEE HHHHHHHHHHHHHHH EE EEEEE |4fxn-ITERATION-6|| PHD | EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE |4fxn-ITERATION-7|| PHD | EEEEEE HHHHHHHHHHHHHHH EE EEEEE |4fxn-ITERATION-8|| PHD | EEEEEE HHHHHHHHHHHHHHH EEE EEEEE |4fxn-ITERATION-9|| PHD | EEEEE HHHHHHHHHHHHHHH EEE EEEEE |

4fxn-AA SEQUENCE|| AA |LEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE|4fxn-ITERATION-0|| PHD | EEEEE HHHHHHHHHHHHHHHHH EEE EEE |4fxn-ITERATION-1|| PHD | HHHH EEEEE HHHHHHHHHHHHHHH EEE EE |4fxn-ITERATION-2|| PHD | HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHH EEE EE |4fxn-ITERATION-3|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHH EEE EE |4fxn-ITERATION-4|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E |4fxn-ITERATION-5|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E |4fxn-ITERATION-6|| PHD | HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E |4fxn-ITERATION-7|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E |4fxn-ITERATION-8|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E |4fxn-ITERATION-9|| PHD | HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E |

4fxn-AA SEQUENCE|| AA |PDEAEQDCIEFGKKIANI|4fxn-ITERATION-0|| PHD | HHHHHHHHHHHHH |4fxn-ITERATION-1|| PHD | HHHHHHHHHHHHH |4fxn-ITERATION-2|| PHD | HHHHHHHHHHHHH |4fxn-ITERATION-3|| PHD | HHHHHHHHHHHHH |4fxn-ITERATION-4|| PHD | HHHHHHHHHHHH |4fxn-ITERATION-5|| PHD | HHHHHHHHHHHHH |4fxn-ITERATION-6|| PHD | HHHHHHHHHHHH |4fxn-ITERATION-7|| PHD | HHHHHHHHHHHHH |4fxn-ITERATION-8|| PHD | HHHHHHHHHHHHH |4fxn-ITERATION-9|| PHD | HHHHHHHHHHHH |

Page 125: Bioinformatics  For MNW 2 nd  Year

Optimal segmentation of predicted secondary structures by Dynamic Programming

sequence position

window size

Max scoreOffsetLabel

H scoreE scoreC score

The recorded values are used in a weighted function according to their secondary structure type, that gives each position a window-specific score. The more probable the secondary structure element, the higher the score.

Restrictions:H only if ws>=4E only if ws>=2

5H

2 6

Segmentation score (Total score of each path)

? scoreRegion

Page 126: Bioinformatics  For MNW 2 nd  Year

Example of an optimally segmented secondary structure prediction library for sequence 3chy 3chy ---------------GYVV-----KPFTAATLEEKLNKIFEKLGM------

3chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ????????

3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ????????

3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ????????

3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ????????

3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ????????

3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ?????????

3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ?????????

3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ??????

3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ??????

3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh

3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ????

3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ??????

3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ???????????

3chy <- 3chy --------------- ----- hhhhhhhhhhhhhh ------

Consensus ---------------EEEE----- HHHHHHHHHHHHH ------

Consensus-DSSP ...............****.....****xx***************......

PHD --------------- ----- HHHHHHHHHHHHHH ------

PHD-DSSP ...............xxxx.....******************x**......

DSSP ...............EEEE.....SS HHHHHHHHHHHHHHHT ......

LumpDSSP ...............EEEE..... HHHHHHHHHHHHHHH ......

Page 127: Bioinformatics  For MNW 2 nd  Year

What to do with a multiple alignment?

• Use it to eyeball and detect structural/functional features

• Use it to make a profile and search a database for homologs

• Give it to other bioinformatics methods and predict secondary structure, functional residues, correlated mutations, phylogenetic trees, etc.

Page 128: Bioinformatics  For MNW 2 nd  Year

Rules of thumb when looking at a multiple alignment (MA)

• Hydrophobic residues are internal

• Gly (Thr, Ser) in loops

• MA: hydrophobic block -> internal -strand

• MA: alternating (1-1) hydrophobic/hydrophilic => edge -strand

• MA: alternating 2-2 (or 3-1) periodicity => -helix

• MA: gaps in loops

• MA: Conserved column => functional? => active site

Page 129: Bioinformatics  For MNW 2 nd  Year

Rules of thumb when looking at a multiple alignment (MA)

• Active site residues are together in 3D structure

• Helices often cover up core of strands

• Helices less extended than strands => more residues to cross protein

-- motif is right-handed in >95% of cases (with parallel strands)

• MA: ‘inconsistent’ alignment columns and match errors!

• Secondary structures have local anomalies, e.g. -bulges

Page 130: Bioinformatics  For MNW 2 nd  Year

Rules of thumb when looking at a multiple alignment (MA)

• Active site residues are together in 3D structure

• Helices often cover up core of strands

• Helices less extended than strands => more residues to cross protein

-- motif is right-handed in >95% of cases (with parallel strands)

• MA: ‘inconsistent’ alignment columns and match errors!

• Secondary structures have local anomalies, e.g. -bulges

Page 131: Bioinformatics  For MNW 2 nd  Year

Periodicity patterns

Burried -strand

Edge -strand

-helix

Page 132: Bioinformatics  For MNW 2 nd  Year

Burried and Edge strands

Parallel -sheet

Anti-parallel -sheet

Page 133: Bioinformatics  For MNW 2 nd  Year

-- motif is right-handed in >95% of cases

RH LH

Page 134: Bioinformatics  For MNW 2 nd  Year

Flavodoxin-cheY example: 5()

1fx1 -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF

FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf

FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf

FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf

FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf

2fcr --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF

FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf

FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf

FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf

FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf

4fxn -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF

FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf

FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf

3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM

T1fx1 GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------

FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------

FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------

FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI--------

FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV--------

2fcr GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------

FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--

FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------

FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------

FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA

4fxn G-----SY-GWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------

FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA---------

FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF-----------

3chy VTAEAKK--ENIIAA---------AQAGAS-------------------------GYVV-----KPFTAATLEEKLNKIFEKLGM------

GIteration 0 SP= 136944.00 AvSP= 10.675 SId= 4009 AvSId= 0.313

Page 135: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

21 3 4 5

Page 136: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

21 3 4 5

Page 137: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

21 3 4 5

Page 138: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

21 3 4 5

Page 139: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

21 3 4 5

Page 140: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

21 3 4 5

Page 141: Bioinformatics  For MNW 2 nd  Year

Building flavodoxintry again

RH

12 3 4 5

Page 142: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

12 3 4 5

Page 143: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

12 3 4 5

Page 144: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

12 3 4 5

Page 145: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

12 3 4 5

Page 146: Bioinformatics  For MNW 2 nd  Year

Building flavodoxin

RH

12 3 4 5

Page 147: Bioinformatics  For MNW 2 nd  Year

Flavodoxin family - TOPS diagrams (Flores et al., 1994)

1 2345

1

234

5

Page 148: Bioinformatics  For MNW 2 nd  Year

Protein structure evolutionInsertion/deletion of secondary structural

elements can ‘easily’ be done at loop sites

Page 149: Bioinformatics  For MNW 2 nd  Year

Protein structure evolutionInsertion/deletion of structural domains can

‘easily’ be done at loop sites

N

C

Page 150: Bioinformatics  For MNW 2 nd  Year

SnapDRAGON

Richard A. George

George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.

 

Integrating protein multiple alignment, secondary and tertiary structure

prediction to predict structural domains in sequence data

Page 151: Bioinformatics  For MNW 2 nd  Year

A domain is a:

• Compact, semi-independent unit (Richardson, 1981).

• Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).

• Recurring functional and evolutionary module (Bork, 1992).

“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

Page 152: Bioinformatics  For MNW 2 nd  Year

The DEATH Domain• Present in a variety of Eukaryotic proteins involved with cell death.• Six helices enclose a tightly packed hydrophobic core.• Some DEATH domains form homotypic and heterotypic dimers.

http

://w

ww

.msh

ri.o

n.ca

/paw

son

Page 153: Bioinformatics  For MNW 2 nd  Year

Delineating domains is essential for:

• Obtaining high resolution structures (x-ray, NMR)• Sequence analysis • Multiple sequence alignment methods• Prediction algorithms (SS, Class, secondary/tertiary

structure)• Fold recognition and threading• Elucidating the evolution, structure and function of

a protein family (e.g. ‘Rosetta Stone’ method)• Structural/functional genomics• Cross genome comparative analysis

Page 154: Bioinformatics  For MNW 2 nd  Year

Pyruvate kinasePhosphotransferase

barrel regulatory domain

barrel catalytic substrate binding domain

nucleotide binding domain

1 continuous + 2 discontinuous domains

Structural domain organisation can be nasty…

Page 155: Bioinformatics  For MNW 2 nd  Year

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Page 156: Bioinformatics  For MNW 2 nd  Year

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Page 157: Bioinformatics  For MNW 2 nd  Year

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Page 158: Bioinformatics  For MNW 2 nd  Year

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Page 159: Bioinformatics  For MNW 2 nd  Year

Distance Regularisation Algorithm for Geometry OptimisatioN

(Aszodi & Taylor, 1994)

Domain prediction using DRAGON

•Folds proteins based on the requirement that (conserved) hydrophobic residues cluster together.

•First constructs a random high dimensional C distance matrix.

•Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.

Page 160: Bioinformatics  For MNW 2 nd  Year

The DRAGON target matrix is inferred from:

• A multiple sequence alignment of a protein (old)– Conserved hydrophobicity

• Secondary structure information (SnapDRAGON)– predicted by PREDATOR (Frishman & Argos, 1996).– strands are entered as distance constraints from the N-

terminal Cto the C-terminal C

Page 161: Bioinformatics  For MNW 2 nd  Year

•The C distance matrix is divided into smaller clusters.

•Seperately, each cluster is embedded into a local centroid.

•The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.

3NN

NN

C distancematrix

Targetmatrix

N

CCHHHCCEEE

Multiple alignment

Predicted secondary structure100 randomised

initial matrices

100 predictions Input data

Page 162: Bioinformatics  For MNW 2 nd  Year

SnapDragon

Generated folds by Dragon

Boundary recognition

Summed and Smoothed Boundaries

CCHHHCCEEE

Multiple alignment

Predicted secondary structure

Page 163: Bioinformatics  For MNW 2 nd  Year

Domains in structures assigned using method by Taylor (1997)

Domain boundary positions of each model against sequence

Summed and Smoothed Boundaries (Biased window protocol)

SnapDRAGON

1

2

3

Page 164: Bioinformatics  For MNW 2 nd  Year

SnapDRAGON

• Is very slow (can be hours for proteins>400 aa) – cluster computing implementation

• Uses consistency in the absence of standard of truth

• Goes from primary+secondary to tertiary structure to ‘just’ chop protein sequences

• SnapDRAGON webserver is underway

Page 165: Bioinformatics  For MNW 2 nd  Year

DOMAINATIONRichard A. George

Protein domain identification and improved sequence searching using PSI-BLAST

(George & Heringa, Prot. Struct. Func. Genet., in press; 2002)

Integrating protein sequence database searching and on-the-fly domain recognition

Page 166: Bioinformatics  For MNW 2 nd  Year

Domaination

• Current iterative homology search methods do not take into account that:– Domains may have different ‘rates of

evolution’.– Common conserved domains, such as the

tyrosine kinase domain, can obscure weak but relevant matches to other domain types

– Premature convergence (false negatives)– Matrix migration / Profile wander (false

positives).

Page 167: Bioinformatics  For MNW 2 nd  Year

PSI-BLAST• Query sequence is first scanned for the presence of so-

called low-complexity regions (Wooton and Federhen, 1996), i.e. regions with a biased composition (e.g. TM regions or coiled coils) likely to lead to spurious hits, which are excluded from alignment.

• Initially operates on a single query sequence by performing a gapped BLAST search

• Then takes significant local alignments found, constructs a ‘multiple alignment’ and abstracts a position specific scoring matrix (PSSM) from this alignment.

• Rescans the database in a subsequent round to find more homologous sequences -- Iteration continues until user decides to stop or search converges

Page 168: Bioinformatics  For MNW 2 nd  Year

PSI-BLAST iteration

Q

ACD..Y

PiPx

Query sequence

PSSM

Q Query sequence

Gapped BLAST search

Database hits

Gapped BLAST searchACD..Y

PiPx

PSSM

Database hits

xxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxx

Page 169: Bioinformatics  For MNW 2 nd  Year

DO

MA

INA

TIO

N

Chop and JoinDomains

Page 170: Bioinformatics  For MNW 2 nd  Year

Identifying domain boundaries

Sum N- and C-termini ofgapped local alignments

True N- and C- termini are counted twice (within 10 residues)

Boundaries are smoothed using twowindows (15 residues long)

Combine scores using biased protocol:

if Ni x Ci = 0then Si = Ni+Cielse Si = Ni+Ci +(NixCi)/(Ni+Ci)

Page 171: Bioinformatics  For MNW 2 nd  Year

Identifying domain deletions

• Deletions in the query (or insertion in the DB sequences) are identified by– two adjacent segments in the query align to the

same DB sequences (>70% overlap), which have a region of >35 residues not aligned to the query. (remove N- and C- termini)

DBQuery

Page 172: Bioinformatics  For MNW 2 nd  Year

Identifying domain permutations

• A domain shuffling event is declared – when two local alignments (>35 residues)

within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order.

DB

Query

b a

a b

Page 173: Bioinformatics  For MNW 2 nd  Year

Identifying continuous and discontinuous domains

•Each segment is assigned an independence score (In). If In>10% the segment is assigned as a continuous domain.•An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If score > 50% then segments are considered asdiscontinuous domains and joined.

Page 174: Bioinformatics  For MNW 2 nd  Year

Create domain profiles

• A representative set of the database sequence fragments that overlap a putative domain are selected for alignment using OBSTRUCT (Heringa et al. 1992). > 20% and < 60% sequence identity (including the query seq).

• A multiple sequence alignment is generated using PRALINE (Heringa 1999, 2002; Kleinjung et al., 2002).

• Each domain multiple alignment is used as a profile in further database searches using PSI-BLAST (Altschul et al 1997).

• The whole process is iterated until no new domains are identified.

Page 175: Bioinformatics  For MNW 2 nd  Year

Significant sequences found in database searches

At an E-value cut-off of 0.1 the performance of DOMAINATION

searches with the full-length proteins is 15% better than PSI-BLAST

Page 176: Bioinformatics  For MNW 2 nd  Year

Summary

Algorithmic integration issues:• Integrating data categories• Integrating alternative methods (consensus)• Making an web-integrated genomics pipeline that combines it all

Page 177: Bioinformatics  For MNW 2 nd  Year

Big task ahead @ VU

Needs:

• People

• Teams with an interest in Integrative Bioinformatics

• HTC/Dedicated cluster computing

Page 178: Bioinformatics  For MNW 2 nd  Year

AcknowledgementsVU CvB

FEW

FALW

Victor Simossis – NIMR to VU (1 November 2002)

Jens Kleinjung – NIMR to VU (1 December 2002)

Hans Westerhoff – FALW, VU

Henri Bal – CS, FEW, VU

Hans van Beek – VUMC/FALW, VU

Page 179: Bioinformatics  For MNW 2 nd  Year

View at NIMR (Mill Hill)