Bioinformatics Master Course: DNA/Protein Structure-Function Analysis and Prediction

Lecture 13: Protein Function

Centre for Integrative Bioinformatics VU

Sequence-Structure-Function

[Diagram: Sequence → Structure → Function]

• Sequence to structure: threading; homology searching (BLAST); ab initio prediction and folding (impossible but for the smallest structures)

• Structure to function: function prediction from structure (very difficult)

TERTIARY STRUCTURE (fold)

Genome

Expressome

Proteome

Metabolome

Functional Genomics – Systems Biology

Metabolomics

fluxomics

Systems Biology

Systems biology is the study of the interactions between the components of a biological system, and of how these interactions give rise to the function and behaviour of that system (for example, the enzymes and metabolites in a metabolic pathway). The aim is to understand the system quantitatively and to be able to predict how it behaves over time.

• The interactions are nonlinear

• The interactions give rise to emergent properties, i.e. properties that cannot be explained by the individual components of the system alone

• Biological processes include many time-scales, many compartments and many interconnected network levels (e.g. regulation, signalling, expression, ...)

Systems Biology

Understanding is often achieved through modeling and simulation of the system’s components and interactions.

Often the ‘four Ms’ cycle is adopted:

Measuring

Mining

Modeling

Manipulating


‘The silicon cell’

(some people think ‘silly-con’ cell)


A system response

Apoptosis: programmed cell death

Necrosis: accidental cell death

‘Comparative metabolomics’

This pathway diagram compares the same pathway in Homo sapiens (human, left) and Saccharomyces cerevisiae (baker’s yeast, right). Both the controlling enzymes (red square boxes) and the pathway itself have changed: yeast has one altered (‘overtaking’) path in the graph.

We need to be able to do automatic pathway comparison (pathway alignment).
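As a very rough illustration of what automatic pathway comparison involves, the sketch below treats each pathway as a set of directed (substrate, product) edges and reports the shared and species-specific steps together with a Jaccard similarity. The metabolite names and edges are invented for illustration. A real pathway alignment would also have to deal with enzyme homology and inexact node matches, which this sketch ignores.

```python
def compare_pathways(edges_a, edges_b):
    """Compare two pathways given as sets of (substrate, product) edges."""
    shared = edges_a & edges_b
    union = edges_a | edges_b
    return {
        "shared": shared,
        "only_a": edges_a - edges_b,
        "only_b": edges_b - edges_a,
        # Jaccard similarity: shared steps / all steps seen in either pathway
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical toy pathways: the second one bypasses intermediate C,
# loosely mimicking an 'overtaking' path.
human = {("A", "B"), ("B", "C"), ("C", "D")}
yeast = {("A", "B"), ("B", "D")}

result = compare_pathways(human, yeast)
print(result["only_b"], result["jaccard"])
```

Set operations are enough here because the toy pathways use identical node names; that is exactly the assumption that breaks down between real species.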


The citric-acid cycle

http://en.wikipedia.org/wiki/Krebs_cycle

The citric-acid cycle

Fig. 1. (a) A graphical representation of the reactions of the citric-acid cycle (CAC), including the connections with pyruvate and phosphoenolpyruvate, and the glyoxylate shunt. When there are two enzymes that are not homologous to each other but that catalyse the same reaction (non-homologous gene displacement), one is marked with a solid line and the other with a dashed line. The oxidative direction is clockwise. The enzymes with their EC numbers are as follows:

1, citrate synthase (4.1.3.7)
2, aconitase (4.2.1.3)
3, isocitrate dehydrogenase (1.1.1.42)
4, 2-ketoglutarate dehydrogenase (solid line; 1.2.4.2 and 2.3.1.61) and 2-ketoglutarate ferredoxin oxidoreductase (dashed line; 1.2.7.3)
5, succinyl-CoA synthetase (solid line; 6.2.1.5) or succinyl-CoA–acetoacetate-CoA transferase (dashed line; 2.8.3.5)
6, succinate dehydrogenase or fumarate reductase (1.3.99.1)
7, fumarase (4.2.1.2) class I (dashed line) and class II (solid line)
8, bacterial-type malate dehydrogenase (solid line) or archaeal-type malate dehydrogenase (dashed line) (1.1.1.37)
9, isocitrate lyase (4.1.3.1)
10, malate synthase (4.1.3.2)
11, phosphoenolpyruvate carboxykinase (4.1.1.49) or phosphoenolpyruvate carboxylase (4.1.1.32)
12, malic enzyme (1.1.1.40 or 1.1.1.38)
13, pyruvate carboxylase or oxaloacetate decarboxylase (6.4.1.1)
14, pyruvate dehydrogenase (solid line; 1.2.4.1 and 2.3.1.12) and pyruvate ferredoxin oxidoreductase (dashed line; 1.2.7.1)

M. A. Huynen, T. Dandekar and P. Bork, "Variation and evolution of the citric acid cycle: a genomic approach", Trends Microbiol, 7, 281-291 (1999)


The citric-acid cycle

M. A. Huynen, T. Dandekar and P. Bork, "Variation and evolution of the citric acid cycle: a genomic approach", Trends Microbiol, 7, 281-291 (1999)

(b) Individual species might not have a complete CAC. This diagram shows the genes for the CAC for each unicellular species for which a genome sequence has been published, together with the phylogeny of the species. The distance-based phylogeny was constructed using the fraction of genes shared between genomes as a similarity criterion (ref. 29 in the original paper). The major kingdoms of life are indicated in red (Archaea), blue (Bacteria) and yellow (Eukarya). Question marks represent reactions for which there is biochemical evidence in the species itself or in a related species but for which no genes could be found. Genes that lie in a single operon are shown in the same color. Genes were assumed to be located in a single operon when they were transcribed in the same direction and the stretches of non-coding DNA separating them were less than 50 nucleotides in length.
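The operon rule stated in the caption (same transcription direction, less than 50 nucleotides of non-coding DNA between neighbours) can be sketched as follows. The `Gene` record, the gene names and the coordinates are hypothetical, and the code is only a minimal reading of the rule, not the authors' actual procedure.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_operons(genes, max_gap=50):
    """Group neighbouring same-strand genes separated by fewer than max_gap nt."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            gap = gene.start - prev.end - 1  # non-coding nucleotides in between
            if gene.strand == prev.strand and gap < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


# Hypothetical coordinates for illustration only.
genes = [
    Gene("geneA", 100, 1000, "+"),
    Gene("geneB", 1021, 1900, "+"),  # 20 nt after geneA, same strand
    Gene("geneC", 2100, 3000, "+"),  # 199 nt after geneB
    Gene("geneD", 3010, 3800, "-"),  # close to geneC but on the other strand
]
operon_names = [[g.name for g in operon] for operon in group_operons(genes)]
print(operon_names)
```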

Experimental

• Structural genomics

• Functional genomics

• Protein-protein interaction

• Metabolic pathways

• Expression data

Communicability: Functional Genomics

• Interpretation of genome-scale gene expression data

[Flow diagram: DNA-chip data → external program → cluster of coregulated genes (gene 1, gene 2, ..., gene n) → PFMP query → pathways affected (pathway 1, pathway 2)]
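The step from a cluster of coregulated genes to the pathways affected can be pictured as a simple lookup. The sketch below is not the PFMP query interface, which the slides do not show. It assumes a hypothetical in-memory mapping `gene_to_pathways` and counts how many cluster genes fall into each pathway.

```python
from collections import Counter


def pathways_affected(cluster, gene_to_pathways):
    """Count, per pathway, how many genes of the cluster are annotated to it."""
    counts = Counter()
    for gene in cluster:
        # Genes without any pathway annotation are silently skipped.
        for pathway in gene_to_pathways.get(gene, ()):
            counts[pathway] += 1
    return counts.most_common()


# Hypothetical annotation table and cluster, mirroring the slide's placeholders.
gene_to_pathways = {
    "gene 1": ["pathway 1"],
    "gene 2": ["pathway 1", "pathway 2"],
}
cluster = ["gene 1", "gene 2", "gene 3"]

hits = pathways_affected(cluster, gene_to_pathways)
print(hits)
```

Raw counts favour large pathways, so a real analysis would normally follow this with a statistical enrichment test.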

Communicability: Functional Genomics

• Interpretation of genome-scale gene expression data

[Flow diagram: DNA-chip data → external programs → cluster of coregulated genes (gene 1, gene 2, ..., gene n) → pattern discovery (putative regulatory sites for gene 1, gene 2, ...) → PFMP query → similarities with known regulatory sites (site 1: Factor 1; site 2: Factor 2; ...)]

Other Issues

• Partial information (indirect interactions) and subsequent filling of the missing steps

• Negative results (elements that have been shown not to interact, enzymes missing in an organism)

• Putative interactions resulting from computational analyses

Protein function categories

• Catalysis (enzymes)

• Binding – transport (active/passive)

• Protein-DNA/RNA binding (e.g. histones, transcription factors)

• Protein-protein interactions (e.g. antibody-lysozyme), experimentally determined by yeast two-hybrid (Y2H) or bacterial two-hybrid (B2H) screening

• Protein-fatty acid binding (e.g. apolipoproteins)

• Protein – small molecules (drug interaction, structure decoding)

• Structural component (e.g. crystallin)

• Regulation

• Signalling

• Transcription regulation

• Immune system

• Motor proteins (actin/myosin)

Catalytic properties of enzymes

Michaelis-Menten equation:

$$V = \frac{V_{max}\,[S]}{K_m + [S]}$$

Reaction scheme: $E + S \rightleftharpoons ES \rightarrow E + P$ (binding step labelled $K_m$, catalytic step labelled $k_{cat}$)

• E = enzyme

• S = substrate

• ES = enzyme-substrate complex (transition state)

• P = product

• Km = Michaelis constant

• kcat = catalytic rate constant (turnover number)

• kcat/Km = specificity constant (useful for comparison)

[Plot: V against [S]; V = Vmax/2 at [S] = Km]
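A minimal numerical sketch of the Michaelis-Menten relation is given below. The parameter values are arbitrary and chosen only to show two properties of the curve: the rate equals Vmax/2 when [S] = Km, and it approaches Vmax at saturating substrate concentrations.

```python
def michaelis_menten(s, vmax, km):
    """Reaction rate V at substrate concentration s: V = Vmax*[S] / (Km + [S])."""
    return vmax * s / (km + s)


def specificity_constant(kcat, km):
    """kcat/Km, useful for comparing enzymes or substrates."""
    return kcat / km


# Arbitrary illustrative values (units left abstract).
vmax, km = 10.0, 2.0

rate_at_km = michaelis_menten(km, vmax, km)            # half-maximal rate
rate_saturating = michaelis_menten(1000 * km, vmax, km)  # close to Vmax
print(rate_at_km, rate_saturating)
```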


Protein interaction domains

http://pawsonlab.mshri.on.ca/html/domains.html


Energy difference upon binding

Examples of protein interactions (and functional importance) include:

• Protein – protein (pathway analysis);

• Protein – small molecules (drug interaction, structure decoding);

• Protein – peptides, DNA/RNA (function analysis)

The change in Gibbs free energy of the protein-ligand binding interaction can be monitored and is expressed as follows:

$$\Delta G = \Delta H - T\,\Delta S$$

(H = enthalpy, S = entropy and T = temperature)
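The free-energy relation for binding, ΔG = ΔH − TΔS, can be worked through numerically. The values below are hypothetical and serve only to show the bookkeeping of units and signs. A negative ΔG corresponds to favourable binding.

```python
def gibbs_free_energy(delta_h, temperature, delta_s):
    """Return dG = dH - T*dS.

    delta_h in J/mol, temperature in K, delta_s in J/(mol*K).
    """
    return delta_h - temperature * delta_s


# Hypothetical binding event: favourable enthalpy, unfavourable entropy.
dg = gibbs_free_energy(delta_h=-50_000.0, temperature=298.0, delta_s=-100.0)
print(dg)  # negative value: binding is favourable under these assumptions
```

In this example the enthalpy gain outweighs the entropic penalty, so binding remains favourable at 298 K.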

Experimentally measuring PPIs

Yeast two-hybrid

Bait – fused to the TF DNA-binding domain

Prey – fused to the TF activation domain

TF: the DNA-binding domain and the activation domain together set transcription in motion

Yeast strains of opposite mating types

Make the yeast strains mate and include an easily observable reporter gene (e.g. luciferase) with an appropriate TFBS

Bait and Prey have to interact to activate the reporter gene


Experimentally measuring PPIs

Tandem affinity purification (TAP)

• Add a TAP tag, which contains an IgG-binding domain, to the end of the target gene

• Separate the tagged protein complexes using an affinity column containing IgG beads

• Wash the column; the target-IgG complex stays behind

• If the target protein interacts with other proteins, these are also retained on the column

• Separate the proteins using SDS-PAGE and identify them by mass spectrometry

• Can also use other protein in complex as target protein to verify complex formation

Protein function

• Many proteins combine functions

• For example, some immunoglobulin structures are thought to have more than 100 different functions (and active/binding sites)

• Alternative splicing can generate (partially) alternative structures


Protein function & Interaction

Active site / binding cleft

Shape complementarity


Protein function evolution

Chymotrypsin


How to infer function

• Experiment

• Deduction from sequence

  • Multiple sequence alignment – conservation patterns

  • Homology searching

• Deduction from structure

  • Threading

  • Structure-structure comparison

  • Homology modelling
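To illustrate how conservation patterns can be read off a multiple sequence alignment, the sketch below scores each alignment column by the fraction of sequences that carry the most common residue. This is one simple scoring choice among many, and the four aligned fragments are made up for illustration. Highly conserved columns are candidate functional positions.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences sharing the most common residue.

    Gap characters ('-') never count as the most common residue, but they
    stay in the denominator, so gappy columns score lower.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Made-up toy alignment (four sequences, five columns).
alignment = ["GDSGG", "GDSGA", "GDSG-", "GESGG"]
scores = column_conservation(alignment)
print(scores)
```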

Cholesterol Biosynthesis:

Cholesterol biosynthesis primarily occurs in eukaryotic cells. Cholesterol is necessary for membrane synthesis and is a precursor for steroid hormones and vitamin D. While the pathway had previously been assumed to be localized in the cytosol and ER, more recent evidence suggests that many of the enzymes in the pathway exist largely, if not exclusively, in the peroxisome (the enzymes listed in blue in the pathway to the left are thought to be at least partly peroxisomal). Patients with peroxisome biogenesis disorders (PBDs) have a variable deficiency in cholesterol biosynthesis.

Cholesterol Biosynthesis: from acetyl-CoA to mevalonate

Mevalonate plays a role in epithelial cancers: it can inhibit EGFR

Epidermal Growth Factor as a Clinical Target in Cancer

A malignant tumour is the product of uncontrolled cell proliferation. Cell growth is controlled by a delicate balance between growth-promoting and growth-inhibiting factors. In normal tissue the production and activity of these factors result in differentiated cells growing in a controlled and regulated manner that maintains the normal integrity and functioning of the organ. The malignant cell has evaded this control; the natural balance is disturbed (via a variety of mechanisms) and unregulated, aberrant cell growth occurs. A key driver for growth is the epidermal growth factor (EGF). Its receptor (the EGFR) has been implicated in the development and progression of a number of human solid tumours, including those of the lung, breast, prostate, colon, ovary, head and neck.

Energy housekeeping: Adenosine diphosphate (ADP) – Adenosine triphosphate (ATP)

Chemical Reaction

Enzymatic Catalysis

Gene Expression

Inhibition

Metabolic Pathway: Proline Biosynthesis

Transcriptional Regulation

Methionine Biosynthesis in E. coli

Shortcut Representation

High-level Interaction

Levels of Resolution

Cholesterol Biosynthesis

SREBP Pathway

Signal Transduction

Important signalling pathways: MAP kinase (MAPK) signalling pathway, or TGF-β pathway

Transport


Phosphate Utilization in Yeast

Multiple Levels of Regulation

• Gene expression

• Protein activity

• Protein intracellular location

• Protein degradation

• Substrate transport

Graphical Representation – Gene Expression

Experimental Data – Gene Expression

Experimental Data – Transcriptional Regulation

Experimental Data – Transcriptional Regulation

Transcriptional Regulation: Integrated View

Pathways and Pathway Diagrams

• Pathways

  • Set of nodes (entities) and edges (associations)

• Pathway Diagrams

  • XY coordinates

  • Node splitting allowed

  • Multiple views of the same pathway

  • Different abstraction levels
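The distinction between a pathway (nodes and edges) and its diagrams (layout) could be captured with a data structure along the following lines. All class and field names are hypothetical and chosen for illustration. Node splitting is modelled by letting one node have several XY positions within a diagram, and multiple views are simply several `Diagram` objects that point at the same `Pathway`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str  # association type, e.g. "catalysis" or "inhibition"


@dataclass
class Pathway:
    """The abstract pathway: entities (nodes) and associations (edges)."""
    name: str
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_edge(self, source, target, kind):
        self.nodes.update((source, target))
        self.edges.add(Edge(source, target, kind))


@dataclass
class Diagram:
    """One view of a pathway; a node may be drawn at several XY positions."""
    pathway: Pathway
    positions: dict = field(default_factory=dict)  # node -> list of (x, y)

    def place(self, node, x, y):
        if node not in self.pathway.nodes:
            raise KeyError(f"{node!r} is not in pathway {self.pathway.name!r}")
        self.positions.setdefault(node, []).append((x, y))


toy = Pathway("toy pathway")
toy.add_edge("A", "B", "catalysis")
toy.add_edge("B", "C", "catalysis")

overview = Diagram(toy)
overview.place("A", 0, 0)
overview.place("B", 10, 0)
overview.place("B", 10, 20)  # node splitting: B is drawn twice

detail = Diagram(toy)        # a second view of the same pathway
detail.place("C", 5, 5)
```

Keeping layout out of the `Pathway` class means the same biological content can be shown at different abstraction levels without duplicating it.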

[53] [53] [53]

KEGG database (Japan)

Metabolic networks

Glycolysis and Gluconeogenesis

Gene Ontology (GO)

• Not a genome sequence database

• Developing three structured, controlled vocabularies (ontologies) to describe gene products, in a species-independent manner, in terms of:

  • biological process

  • cellular component

  • molecular function
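As a small illustration of the three vocabularies, the sketch below groups the annotations of one gene product by ontology. The function name and the example annotation set are assumptions made for this sketch and are not taken from any GO tool. The term names are used purely as examples of what each ontology describes.

```python
ASPECTS = ("biological_process", "cellular_component", "molecular_function")


def group_by_aspect(annotations):
    """Group (term, aspect) pairs of one gene product by GO ontology."""
    grouped = {aspect: [] for aspect in ASPECTS}
    for term, aspect in annotations:
        if aspect not in grouped:
            raise ValueError(f"unknown ontology: {aspect!r}")
        grouped[aspect].append(term)
    return grouped


# Illustrative annotations for a hypothetical secreted protease.
annotations = [
    ("proteolysis", "biological_process"),
    ("extracellular region", "cellular_component"),
    ("serine-type endopeptidase activity", "molecular_function"),
]
grouped = group_by_aspect(annotations)
print(grouped["molecular_function"])
```

Because the terms are species-independent, the same grouping applies whether the gene product comes from yeast, fly or mouse.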

[55] [55] [55]

The GO ontology

[56] [56] [56]

Gene Ontology Members

• FlyBase - database for the fruit fly Drosophila melanogaster

• Berkeley Drosophila Genome Project (BDGP) - Drosophila informatics; GO database & software, Sequence Ontology development

• Saccharomyces Genome Database (SGD) - database for the budding yeast Saccharomyces cerevisiae

• Mouse Genome Database (MGD) & Gene Expression Database (GXD) - databases for the mouse Mus musculus

• The Arabidopsis Information Resource (TAIR) - database for the brassica family plant Arabidopsis thaliana

• WormBase - database for the nematode Caenorhabditis elegans

• EBI GOA project - annotation of UniProt (Swiss-Prot/TrEMBL/PIR) and InterPro databases

• Rat Genome Database (RGD) - database for the rat Rattus norvegicus

• DictyBase - informatics resource for the slime mold Dictyostelium discoideum

• GeneDB S. pombe - database for the fission yeast Schizosaccharomyces pombe (part of the Pathogen Sequencing Unit at the Wellcome Trust Sanger Institute)

• GeneDB for protozoa - databases for Plasmodium falciparum, Leishmania major, Trypanosoma brucei, and several other protozoan parasites (part of the Pathogen Sequencing Unit at the Wellcome Trust Sanger Institute)

• Genome Knowledge Base (GK) - a collaboration between Cold Spring Harbor Laboratory and EBI

• TIGR - The Institute for Genomic Research

• Gramene - A Comparative Mapping Resource for Monocots

• Compugen (with its Internet Research Engine)

• The Zebrafish Information Network (ZFIN) - reference datasets and information on Danio rerio
