
Basic Proteomics goes Functional

Biomedicum Helsinki, November 29-30, 2007

Sophia Kossida

Protein Chemistry/Proteomics

Protein Chemistry

• Individual proteins

• Complete sequence analysis

• Emphasis on structure and function

• Structural biology

Proteomics

• Complex mixtures

• Partial sequence analysis

• Emphasis on identification by database matching

• Systems biology

Mining proteomes

to identify as many components of the proteome as possible

• Mapping of proteomes of various organisms and tissues

• Comparison of protein expression levels for the detection of disease biomarkers

Main objectives in Proteomics

i. to identify all proteins from a proteome (map)

ii. to analyze differential protein expression associated with a disease, sample treatments, drug targets

iii. to characterize proteins by discovering their functions, cellular localization, PTMs

iv. to describe & understand protein interaction networks

Palagi et al., Proteomics 2006, 6, 5435-5444

A pH gradient is generated by a limited number of well defined chemicals (immobilines) which are co-polymerized with the acrylamide matrix.

Migration of proteins in a pH gradient: protein stops at pH=pI
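A minimal sketch of why a protein stops at pH = pI: its net charge, estimated with the Henderson-Hasselbalch equation, falls to zero there. The pKa values below are an assumption (one commonly used set; other published sets give slightly different pI values), and the function names are illustrative only.

```python
# Assumed pKa values for ionizable groups; other published sets differ slightly.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence, ph):
    """Estimated net charge of a peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))    # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))   # C-terminal carboxyl
    for residue in sequence:
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=0.001):
    """Find the pH where the net charge crosses zero, by bisection."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positive: the pI lies at higher pH
        else:
            high = mid   # negative or zero: the pI lies at lower pH
    return (low + high) / 2.0
```

Bisection works here because the net charge decreases steadily as the pH rises, so there is a single zero crossing between pH 0 and pH 14.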

Immobilized pH gradients (IPGs)

1st dimension: IsoElectric Focusing, IEF

2nd dimension

pH 3 to pH 10

The strip is loaded onto an SDS gel

Mw

pI

Staining !

Proteins that were separated on the IEF gel are next separated in the second dimension based on their molecular weights.

Limitations/difficulties with the 2D gels

Reproducibility: samples must be run at least in triplicate to rule out effects from gel-to-gel variation (statistics)

Small dynamic range of protein staining as a detection technique: abundant proteins are visualized while less abundant ones might be missed

Co-migrating spots forming a complex region

Streaking and smearing

Weak spots and background

DIGE

Proteins are labeled prior to running the first dimension with up to three different fluorescent cyanine dyes

Allows use of an internal standard in each gel, which reduces gel-to-gel variation and the number of gels to be run

Adds about 500 Da to the labeled protein

Quantification of Spot Relative Levels

2D Fluorescence Difference Gel Electrophoresis

2DE Image Analysis Software


Image analysis

Overlapped Spots - Streaks

Part of gel with some overlapped spots and streaks

Same region of gel visualized in 3D

Closely overlapping spots

Traditional 2DE software packages

1. Pre-processing of the gel images

2. Normalization, cropping and background subtraction

3. Spot detection (segmentation) and expression quantification

4. An initial user guided pairing of a few spots between the reference and sample gels (landmarking). The sample gel is then warped to align the landmarks.

5. An automatic pairing of the rest of the spots

6. Identification of differential expression

7. Data presentation and interpretation

8. Creation of 2-D gel databases
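Steps 2 and 6 above can be sketched as follows. This is a simplified illustration, not the method of any particular package: it assumes spots are already detected and matched, normalizes each spot volume to the total volume on its gel, and flags spots by a fold-change threshold. The function names and the threshold are assumptions.

```python
def normalize_volumes(spot_volumes):
    """Express each spot volume as a percentage of the total volume on the gel."""
    total = sum(spot_volumes.values())
    return {spot: 100.0 * volume / total for spot, volume in spot_volumes.items()}


def differential_spots(reference, sample, min_fold=2.0):
    """Return matched spots whose normalized volume changes by at least min_fold."""
    ref_norm = normalize_volumes(reference)
    smp_norm = normalize_volumes(sample)
    changed = {}
    for spot in ref_norm.keys() & smp_norm.keys():   # only spots paired on both gels
        if ref_norm[spot] == 0 or smp_norm[spot] == 0:
            continue                                  # a ratio is undefined for empty spots
        ratio = smp_norm[spot] / ref_norm[spot]
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            changed[spot] = ratio
    return changed
```

In practice replicate gels and a statistical test would replace the single ratio used here.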

Spot identification

D Iakovidies et al. A Genetic Approach to Spot Detection in Two-Dimensional Gel Electrophoresis Images, ITAB Oct. 2006

Visualization Volume - Intensity

2DE Image Analysis Software


PDQuest

Progenesis

Delta 2D

ImageMaster

Organizing experiments

Organizing the experiment:

creating projects, folders and subfolders.

importing gel images

Melanie/ImageMaster 2D Platinum 6.0

Import gels

Toolbox to easily manipulate gels


Viewing and manipulating images

Adjusting contrast

Intensity variations in x- and y- axes

3D-view


Automatically subtracted background

Spot detection

separation between spots

split overlap

elimination of noise

stain saturation

incomplete resolution


Spots report

A spot report summarizes the information about the selected spots


Modified from: mouse cardiac; 250 µg loading; pH 3-10 IEF strips; 12.5% SDS-PAGE; file ID: sc5bcon vs. sc15iso

PTM?

Downregulation?

Detection/matching

Spot detection, spot matching, normalization of spot intensities

Master gels

Combine several images, creating the master image

• all the spots on a single image

– even those that will never be expressed at the same time

• a summary of groups of replicate gels (average gel)

Delta 2D

Any point on a gel can be labeled, and automatically transferred from one gel to another.

Gel image warping

Compensates for running differences between gels

After warping, corresponding spots will have the same position on every image.

Variations in migration, protein separation, stain artifacts and stain saturation complicate gel matching and quantitation.

Expression

Comparison of individual experimental gels to master gels.

Identification of variant spots

2D Gel Databases

Swiss-2DPAGE www.expasy.ch

GelBank http://www.gelscape.ualberta.ca:8080/htm/gdbIndex.html

Cornea 2D-PAGE http://www.cornea-proteomics.com/

World 2DPAGE, Index of 2D gel databases http://ca.expasy.org/ch2d/2d-index.html

SWISS 2D PAGE

http://au.expasy.org/ch2d/

Swiss 2D PAGE viewer

which gel we want to look at

Swiss 2D PAGE

Swiss-2D PAGE

Estimated position

Estimated position in human liver sample

Vimentin_human(P08670)

The sample has to be introduced into the ionization source of the instrument. Once inside the ionization source, the sample molecules are ionized.

These ions are extracted into the analyzer region of the mass spectrometer, where they are separated according to their mass (m)-to-charge (z) ratios (m/z).

The separated ions are detected and this signal is sent to a data system, where the m/z ratios are stored together with their relative abundance for presentation in the format of an m/z spectrum.

Modified from www.csupomona.edu/~drlivesay/ Chm561/winter04_561_lect1.ppt

A Mass Spectrometer

..consists of..

source, analyzer, detector

Source - produces the ions from the sample (vaporization/ionization)

MALDI, Matrix-Assisted Laser Desorption and Ionisation

ESI, ElectroSpray Ionisation

Generate different, but complementary information

Mass Analyzer - resolves ions based on their mass/charge (m/z) ratio

Detector - detection of mass separated ions

MALDI

Matrix Assisted Laser Desorption and Ionisation

Peptides co-crystallised with matrix

Produces singly charged protonated molecular ions

High throughput

Single proteins

Rapid procedure, high rate of sample throughput

large scale identification (“first look at a sample”)

[Figure: laser pulse desorbing and ionizing the matrix and sample; positive and negative ions]

TOF

Separate ions of different m/z based on flight time

Time of flight

Measures the time it takes for the ions to fly from one end to the other and strike the detector.

The speed with which the ions fly down the analyzer tube depends on their m/z values: it is inversely proportional to the square root of m/z.

The greater the m/z, the slower they fly
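A short sketch of the time-of-flight relation for an idealized linear analyzer, where an ion accelerated through a voltage U gains kinetic energy z·e·U = ½·m·v². The tube length and accelerating voltage are assumed values chosen only for illustration, and the function name is hypothetical.

```python
import math

DALTON_KG = 1.66053906660e-27         # mass of 1 Da in kg
ELEMENTARY_CHARGE_C = 1.602176634e-19  # elementary charge in coulombs


def flight_time_us(mz, tube_length_m=1.0, accel_voltage_v=20000.0):
    """Flight time in microseconds for an ion of a given m/z (Da per unit charge).

    From z*e*U = 0.5*m*v**2: v = sqrt(2*e*U / (m/z)), so t = L / v grows
    with the square root of m/z.
    """
    velocity = math.sqrt(
        2.0 * ELEMENTARY_CHARGE_C * accel_voltage_v / (mz * DALTON_KG)
    )
    return tube_length_m / velocity * 1e6
```

With these assumed settings, an ion with four times the m/z of another takes twice as long to reach the detector.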

MALDI-TOF data

Peak List = List of masses

112.1, 234.4, 890.5, 1296.9, 1876.4, 1987.5, ...

Modified from http://plantsci.arabidopsis.info/pg/day3practical1.ppt

Every peak corresponds to the exact mass (m/z) of a peptide ion

fingerprint

$R = m/\Delta m = m/(m_2 - m_1)$

Mass Resolution

intensity

The ability of the instrument to resolve two closely placed peaks.

Mass accuracy

Mass accuracy: the measured values for the peptide ions must be as close as possible to their real values. It is the relative difference between the measured mass and the true mass, usually expressed in ppm.

The lower the number, the better the mass accuracy
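Both quantities are simple ratios. The sketch below computes resolution from two neighbouring peaks, taking the lower mass as m, and mass accuracy in ppm. The function names and the example masses in the usage note are illustrative assumptions.

```python
def resolution(m1, m2):
    """Resolution R = m / (m2 - m1) for two neighbouring peaks with m1 < m2."""
    return m1 / (m2 - m1)


def ppm_error(measured_mass, true_mass):
    """Mass accuracy as the relative error in parts per million."""
    return (measured_mass - true_mass) / true_mass * 1e6
```

For example, two peaks at 1000.0 and 1000.5 that are just resolved correspond to R = 2000, and a measured value of 1529.7500 for a true value of 1529.7348 is an error of about 9.9 ppm.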

Peptide Mass Fingerprinting

A protein identification technique, that correlates experimental data with theoretical data.

Protein → proteolytic digestion → “experimental” MS

Protein sequence from database → in silico digestion → theoretical MS

Computer search: compare experimental MS with theoretical MS


• Protein digestion with protease (trypsin)

• Determination of the mass by MS - Calibration

• Database searching - Generation of the peptide map

• Comparison with theoretical peptide maps of known proteins - In silico digestion

• Identification of the protein on a probabilistic basis - percent coverage, similarity etc
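The workflow above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the scoring used by Mascot or any other search engine: trypsin is modelled as cleaving after K or R unless followed by P, no missed cleavages or modifications are considered, and proteins are ranked by the number of matched peptides and then by coverage. The function names are hypothetical.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(sequence):
    """In silico digestion: cut after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def mh_plus(peptide):
    """Monoisotopic m/z of the singly charged [M+H]+ ion."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def pmf_score(peaks, protein_sequence, tolerance_da=0.1):
    """Count theoretical peptides matched by a peak and the sequence coverage."""
    matched = [
        p for p in tryptic_digest(protein_sequence)
        if any(abs(mh_plus(p) - peak) <= tolerance_da for peak in peaks)
    ]
    coverage = sum(len(p) for p in matched) / len(protein_sequence)
    return len(matched), coverage


def rank_proteins(peaks, database, tolerance_da=0.1):
    """Rank database entries by matched peptides, then by coverage."""
    results = [(name,) + pmf_score(peaks, seq, tolerance_da)
               for name, seq in database.items()]
    return sorted(results, key=lambda r: (r[1], r[2]), reverse=True)
```

A real search engine additionally estimates how likely each match is to be random, which this count-based ranking does not attempt.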

Determination of mass

Every peak corresponds to the exact mass (m/z) of a peptide ion

MALDI-MS is used to measure the masses of the proteolytic peptide fragments.

Select: Monoisotopic peaks [M+H]+ i.e. singly charged

Peak list

1051.54

1086.52

1094.56

1111.59

1244.64

1421.7

1476.67

1542.84

1613.88

1664.97

1763.79

1777.82

Effect of Mass Accuracy and Mass Tolerance

# hits   mass tolerance (Da)   search m/z
478      1                     1529
164      0.1                   1529.7
25       0.01                  1529.73
4        0.001                 1529.734
2        0.0001                1529.7348

Tryptic digestion of human hemoglobin alpha chain yields 14 tryptic peptides, of which the peptide VGAHAGEYGAEALER has an exact monoisotopic mass of 1528.7348 Da.

The singly charged ion of this peptide has an m/z value of 1529.7348. The table shows the number of hits obtained when searching all human and mouse proteins in the Swiss-Prot database with this m/z value at different mass tolerances.

Liebler, Introduction to Proteomics
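The effect of mass tolerance on the number of hits can be reproduced in miniature. The sketch below counts how many candidate peptide masses fall inside the tolerance window around a search value; the function name and the candidate masses in the usage note are hypothetical and are not taken from Swiss-Prot.

```python
def count_hits(search_mz, tolerance_da, candidate_masses):
    """Number of candidate masses within +/- tolerance_da of the search m/z."""
    return sum(1 for mass in candidate_masses
               if abs(mass - search_mz) <= tolerance_da)
```

With the made-up candidates `[1528.9, 1529.7348, 1529.7360, 1529.74, 1529.9, 1530.2]`, narrowing the tolerance from 1 Da to 0.001 Da around 1529.7348 shrinks the hit list from all six candidates to a single one.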

Database search

Peptide mass fingerprinting provides evidence for the most probable identity of a protein.

The quality of the Protein identification will depend upon:

i. the quality of the mass spectrometry data,

ii. the accuracy of the database,

iii. the power of the search algorithms and software used

The significance of the result depends on the size of the database being searched

Mascot PMF score

Mascot PMF results

Entry name, Coverage, similarity

Probability to be random

Coverage: % of protein length covered by the experimental peptides

Mascot protein view

Sequence Database Search

Theoretical spectra

AVAGCAGARCVAAGAAGRVGGACAAAR..

Compare virtual spectra to real spectrum

i. Compute correlation scores

ii. Rank hits

iii. Peptide/protein validation

Experimental fragmentation spectrum

Select peptides from the database whose mass equals the input mass, and get the sequences that match.

Precursor mass, charge state: [M+H]+ = 775.8
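A minimal sketch of the "virtual spectrum" idea: compute singly charged b- and y-ion m/z values for each candidate peptide and count how many observed peaks they explain. The shared-peak count is a deliberately simple stand-in and is not the correlation score used by Sequest or the ions score used by Mascot; the function names and the tolerance are assumptions.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Singly charged b-ion and y-ion m/z values for a peptide."""
    masses = [MONO[aa] for aa in peptide]
    # b-ions: N-terminal fragments; y-ions: C-terminal fragments plus water.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions, y_ions


def shared_peak_count(observed_peaks, peptide, tolerance_da=0.5):
    """Number of observed peaks explained by a theoretical b- or y-ion."""
    b_ions, y_ions = fragment_ions(peptide)
    theoretical = b_ions + y_ions
    return sum(any(abs(peak - t) <= tolerance_da for t in theoretical)
               for peak in observed_peaks)


def rank_candidates(observed_peaks, candidates, tolerance_da=0.5):
    """Rank candidate peptides (same precursor mass) by shared peak count."""
    scored = [(shared_peak_count(observed_peaks, p, tolerance_da), p)
              for p in candidates]
    return sorted(scored, reverse=True)
```

Candidates with the same composition, such as GAK and AGK, share a precursor mass and can only be told apart by their fragment ions, which is the point of the MS/MS search.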

Sequence Database Search

Modified from: Jimmy Eng, MS/MS Database Searching

http://tools.proteomecenter.org/course/lectures/0610Day1.Eng.pdf


PMF vs PFF

The PFF (peptide fragment fingerprinting) approach is very similar to the PMF approach, but it is applied to MS/MS data: it correlates peptide fragmentation spectra with theoretical peptides from a database

Sequest

Commercially available, distributed by Finnigan Corp.

Developed by Jimmy Eng and John Yates

Correlates uninterpreted tandem mass spectra of peptides with amino acid sequences from protein and nucleotide databases.

Determines the amino acid sequence and thus the protein(s) and organism(s) that correspond to the mass spectrum being analyzed.

http://fields.scripps.edu/sequest/

Missed cleavage site

Parameters of MS/MS id search

Modifications

Cysteine almost always modified

Variable modifications increase search time exponentially

Basic residues (K, R) at the C-terminus attract the ionizing charge, leading to strong y-ions

Large sequence databases contain many irrelevant peptide candidates

Digestion Enzyme

Trypsin (specific)

Non-tryptic search increases search time by two orders of magnitude

http://www.matrixscience.com/

Mascot MS/MS ion search

MS/MS ion search result

It is the ions scores for individual peptide matches that are statistically significant

The proteins are listed, by descending score, each with a table summarising the matched peptides

Protein view

Peptide summary

Experimental m/z value

Expectation value for the peptide match (the number of times we would expect to obtain an equal or higher score, purely by chance). The lower this value, the more significant the result.

(relative molecular mass)

Calculated rel. mass

Peptide view

Ions score

Difference between the experimental and calculated masses.

Hit: Plus sign indicates that multiple proteins contain a match to this peptide

The Brukin2d software, developed with Matlab 7.4, uses the compound data that are exported from Bruker 'DataAnalysis' program, and depicts the mean mass spectra of all the chromatograph compounds from one LC-MS run, in one 2D contour plot.

D Tsagkrasoulis, Brukin2d (in preparation) www.bioacademy.gr/bioinformatics

Brukin2d

Databases and tools

Melanie

ProteinScape

• Hierarchy:

Project

Sample

Gel

Spots

MS Data

Search Events

Platform for storing, organizing, analyzing data generated during the proteomics workflow.

Identification of interactions

Computational

Genomic data

• Phylogenetic profiling

• Gene context

• Gene fusion

• Symmetric evolution

Structural data

• Sequence profile

• 3D structural distance matrix

• Surface patches

• Binding interactions

Experimental

• X-ray crystallography

• NMR spectroscopy

• Mass spectrometry (Tandem affinity purification)

• Immunoprecipitation

• Yeast two-hybrid

• Microarrays

KEGG

http://www.genome.jp/kegg/kegg2.html

KEGG: Kyoto Encyclopedia of Genes and Genomes

• Organism specific entry points:

– KEGG Organisms

• Subject specific entry points:

– DRUG, GLYCAN, REACTION, KAAS

KEGG

Manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for metabolism, other cellular processes, and human diseases.

Functional hierarchies and binary relations of KEGG objects, including genes and proteins, compounds and reactions, drugs and diseases, and cells and organisms.

Gene catalogs of all complete genomes and some partial genomes with ortholog annotation (KO assignment), enabling KEGG PATHWAY mapping and BRITE mapping.

A composite database of chemical substances and reactions representing our knowledge on the chemical repertoire of biological systems and environments.

KEGG is a “biological systems” database integrating both molecular building block information and higher-level systematic information.

Search Pathway

Carbon fixation

Search “Pathway”

“Pathways” _motifs

Reactome

Browse interactions

Interactions databases

STRING (EMBL)

BOND (Unleashed Informatics)

Cytoscape (viewer)

DIP (UCLA)

iHOP

SPIN-PP (protein-protein interfaces in the PDB)

MIPS (Mammalian Protein-Protein Interaction Database)

IntAct (protein interactions from literature curation)

STRING search results

Cytoscape

Cytoscape is an open source bioinformatics software platform for visualizing molecular interactions with gene expression profiles and other state data.

Protein Protein Interactions

From single proteins to systems biology

Protein-Protein Interactions

Proteins “work together”, forming multiprotein complexes to carry out specific functions

• Available datasets of pairwise protein-protein interactions

• Protein interactions are modeled as a graph G=(V,E), called protein interaction graphs, where V is the set of vertices (proteins) and E the set of edges (protein-protein interactions)

• Goal:

– Using protein interaction graphs, find protein complexes (clustering)

– Annotation of unknown proteins

• Creation of a new hierarchical algorithm and application on a well studied organism in order to evaluate it.
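The graph model G=(V,E) maps directly onto an adjacency structure. The sketch below builds one from pairwise interactions, reports node degrees (useful for spotting hubs), and groups proteins by connected components. Connected components are only the crudest possible grouping and are used here as a placeholder; this is not the hierarchical algorithm mentioned above. Protein names in the usage note are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected protein interaction graph as {protein: set of partners}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degrees(graph):
    """Number of interaction partners per protein; hubs have the highest values."""
    return {protein: len(partners) for protein, partners in graph.items()}


def connected_components(graph):
    """Group proteins that are linked by any chain of interactions."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                       # depth-first traversal from start
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components
```

For the edges `[("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P4", "P5")]`, the components are {P1, P2, P3} and {P4, P5}. A real complex-detection method would look for densely connected subgraphs inside such components.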

• Whole Genome Analyses

• Clustering in protein-protein interaction networks

Theodosiou A, Moschopoulos C, Baumann M, Kossida S, Protein interactions and disease, (2007) Idea Group Inc., Handbook of Research in Systems Biology Applications in Medicine

C Moschopoulos, PhD student

Phylogenetic Profile

Org 1   1111
Org 2   1010
Org 3   0101
Org 4   1101

Proteins: Protein A, Protein B, Protein C, Protein D

In silico Prediction of PPI

Phylogenetic profile (against N genomes): for each gene X in a target genome, if gene X has a homolog in genome #i, the ith bit of X’s phylogenetic profile is “1”, otherwise it is “0”

The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every sequenced genome

Conserved presence or absence of a protein pair suggests functional coupling.
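The definition above translates into a few lines of code. In this sketch each genome is represented as a set of gene families with a homolog present, and two proteins are proposed as functionally coupled when their bit strings are identical. The genome contents in the usage note are hypothetical and do not reproduce the figure; exact matching is the simplest possible comparison, and real analyses usually tolerate small differences.

```python
def phylogenetic_profile(gene, genomes):
    """Bit string with '1' where the gene has a homolog in the i-th genome."""
    return "".join("1" if gene in genome else "0" for genome in genomes)


def coupled_pairs(genes, genomes):
    """Profiles for all genes, plus the pairs that share an identical profile."""
    profiles = {gene: phylogenetic_profile(gene, genomes) for gene in genes}
    pairs = []
    for i, first in enumerate(genes):
        for second in genes[i + 1:]:
            if profiles[first] == profiles[second]:
                pairs.append((first, second))
    return profiles, pairs
```

For four made-up genomes `[{"A", "B", "C"}, {"A", "C"}, {"B"}, {"A", "C", "D"}]`, proteins A and C both get the profile 1101 and are the only coupled pair.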

A C

In silico Prediction of PPI

[Figure: Protein A, Protein B and Protein C in Org 1 to Org 4; highlighted pair: A B]

Gene Fusion (Rosetta stone)

Seemingly unrelated proteins are sometimes found fused in another organism

Org 1

Org 2

Gene Context

Conserved gene neighbourhood suggests position-function coupling

Though gene-fusion has low prediction coverage, its false-positive rate is low

• Whole Genome Analyses

• Domain Fusion Analysis (DFA) for Complete Bacteria Genomes

• Proteins in a given species are found to consist of a fusion between two separate full-length proteins in another species.

• DFA predicts protein pairs that have related biological functions (participation in a common structural complex, metabolic pathway or biological process).

• We can predict potential physical protein-protein interactions.

OUR METHOD:

Eubacteria (Aquifex), Complete Genome (Query 1)

Crenarchaeota (Thermofilum), Complete Genome (Query 2)

VS

Protists (Entamoeba), Complete Genome (Reference one)

• Whole Genome Analyses

• Domain Fusion Analysis (DFA) for Complete Bacteria Genomes

Danos V

Gene Ontology Home

What is an Ontology?

• A formal, explicit specification of a shared conceptualization.

• A shared vocabulary that can be used to model a domain, i.e., the objects and/or concepts that exist, their properties and relations.

• An ontology includes:

– Classes : describe concepts in the domain

– Properties : describe features and attributes of each class

– Restrictions : allowed values for the classes’ properties

– Instances : real values for properties

Why use Ontologies?

• To share common understanding of the structure of information among

– people

– software

• To enable reuse of domain knowledge

• To make domain assumptions explicit (easy to find, understand and change)

• To be used as a generalized database

• To create models that make software connectivity easier

• To retrieve information easily and quickly (Information Retrieval Applications)

• More… (Natural Language Processing, Artificial Intelligence, Semantics in Web… )

Molecular Function: describes activities, or tasks, performed by individual gene products or by assembled complexes of gene products. Examples: DNA binding, transcription factor

Biological Process: a series of events accomplished by one or more ordered assemblies of molecular functions. NOT a “pathway”! Examples: mitosis, signal transduction, metabolism

Cellular Component: location or complex, a component of a cell that is also part of some larger object. Examples: nucleus, ribosome, origin recognition complex

The Three Ontologies

www.bioacademy.gr/bioinformatics

Subcellular Location of Identified Human Brain Proteins

• Cytoplasm 52%

• Mitochondria 27%

• Nucleus 9%

• Endoplasmic Reticulum 4%

• Extracellular 3%

• Synapses 2%

• Membranes 2%

• Misc 1%

• In silico Analysis of Specific Genes / Gene Families

• Regulatory analysis of WISP1 & CTGF co-regulated genes

• If there is a common framework of TFs, it could denote common regulation.

• Confirmation of Experimental Results

• Evolutionary Conserved Region → Significant Regulatory Function. If distant from TSS → Putative Enhancer

• Combinatorial use of bioinformatics analysis and biological knowledge could provide novel hypotheses concerning the regulatory mechanisms governing gene expression that could guide further experimental analysis.

• In silico Analysis of Specific Genes / Gene Families

• Regulatory analysis of WISP1 & CTGF co-regulated genes

Kapasa M, Serafimidis I, Gavalas A, Kossida S, Phylogenetic and promoter analysis of the WISP1 and CTG orthologs: functional implications (submitted 2007)

• In silico Analysis of Specific Genes / Gene Families

• Notch3 and CADASIL

Theodosiou A, Baumann M, Kossida S, Evolutionary & promoter analysis of Notch3 (in preparation)

Annotating & reconstructing hypothetical zebrafish proteomics derived sequences to their whole length

Danio rerio Hypothetical Alpha II Spectrin (433 aa), IPI00490244

ESTs: CB363887.1, CR930945.1, CK692840.1, CK024418.1, AL914103.1, CA473456.1, BI981643.1, CK709210.1, CN178436.1, CK399253.1

Human Alpha II Spectrin (2472 aa), Q13813 / SPTA2_HUMAN

Initial zebrafish hypothetical Alpha II Spectrin sequence (from MALDI-TOF): 433aa

Final zebrafish Alpha II Spectrin sequence (with gaps) (database analysis) : 2472 aa

We managed to build the remaining sequence of the zebrafish orthologue protein by adding 1580 amino acids, using EST database searches

How do individual cells assemble into specialised tissues and organs? (muscles, epithelium, blood)

How do different organs interact to form the organism?

Organogenesis: the formation of the living organism

From organogenesis to disease

In order to understand the cause of a disease at the molecular level, we first need to dissect the molecular mechanisms responsible for the formation of the various tissues

Disruption of epithelia formation results in cancer

Metastasis

Weak muscle connections result in myopathy

Some examples of human diseases resulting from abnormal gene function that also causes abnormal tissue morphogenesis

mutation in a cytoskeletal protein

oncogenic mutation

cell migration

tissue A

tissue B

Integrins: an ancient family of proteins that play an important role in health and disease because they function as linkers of the outside micro-environment with the interior of the cell

integrins

cytoskeleton

extracellular

matrix

Our goal:

Identify novel genes involved in integrin-mediated functions, using the genetically tractable organism Drosophila as a model system

The more we learn about the functions of novel genes, the more insight we gain into the fundamental molecular principles that are affected in human diseases

• Why integrins?

– one of the main families of adhesion molecules.

– important role in normal development (organogenesis) and human pathology: cancer metastasis, thrombosis, angiogenesis, inflammation

• Why muscle first?

– more information is available on the function of integrin-associated proteins in muscle

• What are the goals?

1. Search the network of protein interactions around integrins & associated proteins, discover the main hubs.

2. Search the genetic interactions of the integrin complex in different organisms to complete the previous network

3. Search for expression patterns that correlate with those of integrins during development and in subcellular distribution.

• What kind of data do we need to collect?

• Where can we find the data?

Gene name (fly & homologs).

Domains.

Available mutants (P-elements).

Interactions (fly & homologs).

Expression pattern.

Subcellular localisation.

Worm RNAi results.

Internet Data Bases.

Interactions graphical representation tools (e.g. Osprey).

Literature.

Constructing a meta-base

Phases

• Identification of required data.

• DataBase (DB) development.

• Data collection & import into the DB.

• Data Analysis via appropriate queries.

Isolation of genes suitable for further genetic (RNAi) analysis.
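The phases above could look like this in a small relational database. The sketch uses SQLite through Python's standard library; the table layout, the gene names and the selection criteria (interacts with the bait, expressed in muscle, no mutant available yet) are all hypothetical and only illustrate the kind of query meant by "appropriate queries".

```python
import sqlite3


def build_metabase():
    """Create an in-memory database with two tables and a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (
            name TEXT PRIMARY KEY,
            domains TEXT,
            has_mutant INTEGER,
            expressed_in_muscle INTEGER
        );
        CREATE TABLE interaction (
            gene_a TEXT REFERENCES gene(name),
            gene_b TEXT REFERENCES gene(name),
            evidence TEXT
        );
    """)
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", [
        ("integrin_x", "integrin beta", 1, 1),   # hypothetical bait gene
        ("gene_a", "SH3", 0, 1),
        ("gene_b", "kinase", 1, 1),
        ("gene_c", "LIM", 0, 0),
    ])
    conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)", [
        ("integrin_x", "gene_a", "two-hybrid"),
        ("integrin_x", "gene_b", "genetic"),
        ("integrin_x", "gene_c", "two-hybrid"),
    ])
    return conn


def rnai_candidates(conn, bait):
    """Interactors of the bait that are muscle-expressed and lack a mutant."""
    rows = conn.execute("""
        SELECT g.name
        FROM interaction AS i
        JOIN gene AS g ON g.name = i.gene_b
        WHERE i.gene_a = ?
          AND g.expressed_in_muscle = 1
          AND g.has_mutant = 0
        ORDER BY g.name
    """, (bait,))
    return [name for (name,) in rows]
```

With the made-up rows above, `rnai_candidates(build_metabase(), "integrin_x")` returns only `gene_a`.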

already known genes in organism X

[Figure: network of gene nodes a to t]

genes expressed differently in various tissues (e.g. gut / nervous system / muscles)

network of interactions (protein-protein or genetic interactions)

genes that cluster together (likely to participate in similar mechanisms)

[Figure: network of homolog gene nodes a’ to t’]

Organism X (e.g. mouse) Organism Y (e.g. fly)

conserved interactions

novel interactions

non-conserved interactions

gene homologs in the organism Y

expression pattern in similar tissues between the 2 examined organisms

Our approach: data mining and collection of available data in such a way that comparing data from various experiments will generate novel information
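The comparison between two organisms can be sketched with set operations. The code maps the interactions of organism X onto organism Y through a homolog table and splits them into interactions seen in both organisms and interactions seen in only one. How the "only in X" and "only in Y" sets correspond to the labels "novel" and "non-conserved" depends on which organism is taken as the reference, so that mapping is left open here. All gene names in the usage note are placeholders.

```python
def classify_interactions(edges_x, edges_y, homologs):
    """Compare interaction sets of two organisms via a homolog mapping.

    edges_x, edges_y: iterables of (gene, gene) pairs.
    homologs: dict mapping a gene of organism X to its homolog in organism Y.
    Interactions are undirected, so each pair is stored as a frozenset.
    """
    mapped_x = set()
    for a, b in edges_x:
        if a in homologs and b in homologs:   # both partners need a homolog in Y
            mapped_x.add(frozenset((homologs[a], homologs[b])))
    in_y = {frozenset(pair) for pair in edges_y}
    return {
        "conserved": mapped_x & in_y,
        "only_in_x": mapped_x - in_y,
        "only_in_y": in_y - mapped_x,
    }
```

For example, with X interactions a-b and b-c, Y interactions a'-b' and c'-d', and homologs a→a', b→b', c→c', the pair a'-b' is conserved, b'-c' is seen only in X and c'-d' only in Y. Interactions seen in only one organism are the natural candidates for follow-up experiments in the other.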

[Figure: Organism X (e.g. mouse) gene network, nodes a to t, next to Organism Y (e.g. fly) homolog network, nodes a’ to t’; genes h, i, n, o and their homologs h’, i’, n’, o’ appear separately]

Data comparison reveals some interesting links between specific genes

The outcome of the proposed work will be:

the identification of suitable target genes that can direct further experimental approaches aiming to address a specific biological question regarding the functions of the above genes in a model organism

Westerhoff HV, Palsson BO: The evolution of molecular biology into systems biology. Nat Biotech 2004, 22: 1249-1252

Systems biology is the integration of experimental and computational approaches to achieve the overall goal of explaining and predicting complex cellular behaviors of biological systems.

Acknowledgements

BRFAA Bioinformatics & Medical Informatics Group

BRFAA Proteomics Unit

Proteomics Unit, Biomedicum, University of Helsinki, headed by Marc Baumann

• Dimosthenis

• Panos

• Athina

• Manousos

Sigrid Juselius Foundation for funding: S Kossida & A Theodosiou
