Basic Proteomics goes Functional
Biomedicum Helsinki, November 29-30, 2007
Sophia Kossida
Protein Chemistry/Proteomics
Protein Chemistry
• Individual proteins
• Complete sequence analysis
• Emphasis on structure and function
• Structural biology
Proteomics
• Complex mixtures
• Partial sequence analysis
• Emphasis on identification by database matching
• Systems biology
Mining proteomes
to identify as many components of the proteome as possible
• Mapping of proteomes of various organisms and tissues
• Comparison of protein expression levels for the detection of disease biomarkers
Main objectives in Proteomics
i. to identify all proteins from a proteome (map)
ii. to analyze differential protein expression associated with a disease, sample treatments, drug targets
iii. to characterize proteins by discovering their functions, cellular localization, PTMs
iv. to describe & understand protein interaction networks
Palagi et al., Proteomics 2006, 6. 5435-5444
A pH gradient is generated by a limited number of well defined chemicals (immobilines) which are co-polymerized with the acrylamide matrix.
Migration of proteins in a pH gradient: protein stops at pH=pI
Immobilized pH gradients (IPGs)
1st dimension: IsoElectric Focusing (IEF), pH 3 to pH 10
2nd dimension: the strip is loaded onto an SDS gel
[Figure: 2D gel with pI on one axis and Mw on the other]
Staining!
Proteins that were separated on IEF gel are next separated in the second dimension based on their
molecular weights.
Limitations/difficulties with the 2D gels
Reproducibility: samples must be run at least in triplicate to rule out effects from gel-to-gel variation (statistics)
Small dynamic range of protein staining as a detection technique: abundant proteins are visualized while less abundant ones might be missed
Co-migrating spots forming a complex region
Streaking and smearing
Weak spots and background
DIGE
Proteins are labeled prior to running the first dimension with up to three different fluorescent cyanine dyes
Allows use of an internal standard in each gel, which reduces gel-to-gel variation and the number of gels to be run
Adds about 500 Da to each labeled protein
Quantification of Spot Relative Levels
2D Fluorescence Difference Gel Electrophoresis
2DE Image Analysis Software
Palagi et al., Proteomics 2006, 6. 5435-5444
Image analysis
Overlapped Spots - Streaks
Part of gel with some overlapped spots and streaks
Same region of gel visualized in 3D
Closely overlapping spots
Traditional 2DE software packages
1. Pre-processing of the gel images
2. Normalization, cropping and background subtraction
3. Spot detection (segmentation) and expression quantification
4. An initial user guided pairing of a few spots between the reference and sample gels (landmarking). The sample gel is then warped to align the landmarks.
5. An automatic pairing of the rest of the spots
6. Identification of differential expression
7. Data presentation and interpretation
8. Creation of 2-D gel databases
Spot identification
D Iakovidies et al. A Genetic Approach to Spot Detection in Two-Dimensional Gel Electrophoresis Images, ITAB Oct. 2006
Visualization Volume - Intensity
2DE Image Analysis Software
Palagi et al., Proteomics 2006, 6. 5435-5444
PDQuest
Progenesis
Delta 2D
ImageMaster
Organizing experiments
Organizing the experiment:
creating projects, folders and subfolders.
importing gel images
Melanie/ImageMaster 2D Platinum 6.0
Import gels
Toolbox to easily manipulate gels
Viewing and manipulating images
Adjusting contrast
Intensity variations in x- and y- axes
3D-view
Automatically subtracted background
Spot detection
separation between spots
split overlap
elimination of noise
stain saturation
incomplete resolution
Spots report
A spot report summarizes the information about the selected spots
Modified from: mouse cardiac; 250 µg loading; pH 3-10 IEF strips; 12.5% SDS-PAGE; file ID: sc5bcon vs. sc15iso
PTM?
Downregulation?
Detection/matching
Spot detection → Spot matching → Normalization of spot intensities
Master gels
Combine several images, creating the master image
• all the spots on a single image
  – even those that will never be expressed at the same time
• a summary of groups of replicate gels (average gel)
Delta 2D
Any point on a gel can be labeled, and automatically transferred from one gel to another.
Gel image warping
Compensates for running differences between gels
After warping, corresponding spots will have the same position on every image.
Variations in migration, protein separation, stain artifacts and stain saturation complicate gel matching and quantitation.
Expression
Comparison of individual experimental gels to master gels.
Identification of variant spots
2D Gel Databases
Swiss-2DPAGE www.expasy.ch
GelBank http://www.gelscape.ualberta.ca:8080/htm/gdbIndex.html
Cornea 2D-PAGE http://www.cornea-proteomics.com/
World 2DPAGE, Index of 2D gel databases http://ca.expasy.org/ch2d/2d-index.html
SWISS 2D PAGE
http://au.expasy.org/ch2d/
Swiss 2D PAGE viewer
which gel we want to look at
Estimated position in human liver sample
Vimentin_human(P08670)
The sample has to be introduced into the ionization source of the instrument. Once inside
the ionization source the sample molecules are ionized.
These ions are extracted into the analyzer region of the mass spectrometer where they are
separated according to their mass (m)-to-charge (z) ratios (m/z).
The separated ions are detected and this signal is sent to a data system where the m/z
ratios are stored together with their relative abundance for presentation in the format of a
m/z spectrum.
Modified from www.csupomona.edu/~drlivesay/ Chm561/winter04_561_lect1.ppt
A Mass Spectrometer
...consists of: source, analyzer, detector
Source - produces the ions from the sample (vaporization/ionization)
• MALDI, Matrix-Assisted Laser Desorption and Ionisation
• ESI, ElectroSpray Ionisation
(generate different, but complementary information)
Mass Analyzer - resolves ions based on their mass/charge (m/z) ratio
Detector - detection of mass-separated ions
MALDI
Matrix Assisted Laser Desorption and Ionisation
Peptides co-crystallised with matrix
Produces singly charged protonated molecular ions
High throughput
Single proteins
Rapid procedure, high rate of sample throughput
large scale identification (“first look at a sample”)
[Figure: laser striking the sample and releasing positive and negative ions]
TOF
Separate ions of different m/z based on flight time
Time of flight
Measures the time it takes for the ions to fly from one end of the analyzer tube to the other and strike the detector.
The speed with which the ions fly down the analyzer tube depends on their m/z values: for a fixed accelerating voltage it is inversely proportional to the square root of m/z.
The greater the m/z, the slower they fly.
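The relation between m/z and flight time can be sketched numerically. This is a minimal illustration of an ideal linear TOF analyzer, not a model of any particular instrument: the 20 kV accelerating voltage and the 1 m tube length are assumed values, and the function name `flight_time` is hypothetical.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms

def flight_time(mz, voltage=20000.0, tube_length=1.0):
    """Ideal flight time in seconds for an ion with the given m/z (Da per charge).

    Energy balance: z*e*U = (1/2)*m*v^2, so v = sqrt(2*e*U / (m/z)).
    voltage (V) and tube_length (m) are assumed, illustrative values.
    """
    velocity = math.sqrt(2.0 * ELEMENTARY_CHARGE * voltage / (mz * DALTON))
    return tube_length / velocity

for mz in (1000.0, 2000.0, 4000.0):
    print(f"m/z {mz:7.1f}: {flight_time(mz) * 1e6:.2f} microseconds")
```

In this idealized model the flight time grows with the square root of m/z, so an ion with four times the m/z takes twice as long to reach the detector.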
MALDI-TOF data
Peak List = List of masses
112.1, 234.4, 890.5, 1296.9, 1876.4, 1987.5, ...
Modified from http://plantsci.arabidopsis.info/pg/day3practical1.ppt
Every peak corresponds to the exact mass (m/z) of a peptide ion
fingerprint
Mass Resolution
The ability of the instrument to resolve two closely placed peaks.
$R = m/\Delta m = m/(m_2 - m_1)$
[Figure: intensity plot of two adjacent peaks at m1 and m2]
Mass accuracy
The lower the number the better the mass accuracy
Mass accuracy: the measured values for the peptide ions must be as close as possible to their real values. It is expressed as the relative difference between the measured mass and the true mass, usually represented in ppm.
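Both quantities are simple ratios, which a short sketch makes concrete. The function names are hypothetical, the choice of the lower mass m1 as m in the resolution formula is an assumption, and the example masses are illustrative only.

```python
def resolving_power(m1, m2):
    """R = m / (m2 - m1) for two just-resolved peaks; m is taken as m1 (assumption)."""
    return m1 / (m2 - m1)

def ppm_error(measured, true_mass):
    """Relative difference between measured and true mass, in parts per million."""
    return (measured - true_mass) / true_mass * 1e6

# Two peaks 0.5 m/z units apart at m/z 1000 (illustrative values)
print(resolving_power(1000.0, 1000.5))

# A measurement that is 0.0152 m/z units too high at m/z 1529.7348
print(round(ppm_error(1529.7500, 1529.7348), 2))
```

A smaller absolute ppm value means a better mass accuracy, while a larger R means better resolution.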
Peptide Mass Fingerprinting
A protein identification technique, that correlates experimental data with theoretical data.
Protein → Proteolytic digestion → “Experimental” MS
Protein sequence from database → In silico digestion → Theoretical MS
Computer search: experimental MS vs. theoretical MS
Peptide Mass Fingerprinting
• Protein digestion with protease (trypsin)
• Determination of the mass by MS - Calibration
• Database searching - Generation of the peptide map
• Comparison with theoretical peptide maps of known proteins - In silico digestion
• Identification of the protein on a probabilistic basis - percent coverage, similarity etc.
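The in silico digestion and mass comparison steps listed above can be sketched as follows. This is a simplified illustration, not the algorithm of any search engine: it assumes the common trypsin rule (cleave after K or R, but not before P), no missed cleavages and no modifications. The toy sequence, the peak values and the function names are hypothetical.

```python
import re

# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide
PROTON = 1.00728   # added for the singly charged [M+H]+ ion

def tryptic_digest(sequence):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]

def mh_plus(peptide):
    """Monoisotopic m/z of the singly charged ion [M+H]+."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON

def match_peaks(peaks, sequence, tol=0.1):
    """Pair each experimental peak with theoretical peptides within tol (Da)."""
    theoretical = {pep: mh_plus(pep) for pep in tryptic_digest(sequence)}
    return [(peak, pep) for peak in peaks
            for pep, mass in theoretical.items() if abs(peak - mass) <= tol]

protein = "MKTAYIAKQRPLVGGER"  # hypothetical toy sequence, not a real database entry
print(tryptic_digest(protein))
print(match_peaks([666.38, 900.00], protein))
```

A real search repeats this comparison for every protein in the database and then scores the candidates, for example by the number of matched peptides and the sequence coverage.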
Determination of mass
Every peak corresponds to the exact mass (m/z) of a peptide ion
MALDI-MS is used to measure the masses of the proteolytic peptide fragments.
Select: Monoisotopic peaks [M+H]+ i.e. singly charged
Peak list
1051.54
1086.52
1094.56
1111.59
1244.64
1421.7
1476.67
1542.84
1613.88
1664.97
1763.79
1777.82
Effect of Mass Accuracy and Mass Tolerance
search m/z    mass tolerance (Da)    # hits
1529          1                      478
1529.7        0.1                    164
1529.73       0.01                   25
1529.734      0.001                  4
1529.7348     0.0001                 2
Tryptic digestion of human hemoglobin alpha chain yields 14 tryptic peptides, of which the peptide VGAHAGEYGAEALER has an exact monoisotopic mass of 1528.7348 Da.
The singly charged ion of this peptide has an m/z value of 1529.7348. The hit counts are the result of searching the SWISS-PROT database against all human and mouse proteins.
Liebler, Introduction to Proteomics
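The effect of the tolerance window can be reproduced in miniature. The sketch below counts how many candidate peptide masses fall within a given tolerance of the search m/z; the candidate list is invented for illustration and has nothing to do with the real SWISS-PROT counts, and `count_hits` is a hypothetical helper.

```python
def count_hits(search_mz, candidate_mz, tolerance):
    """Number of candidate masses within +/- tolerance (Da) of the search m/z."""
    return sum(1 for mass in candidate_mz if abs(mass - search_mz) <= tolerance)

# Hypothetical theoretical [M+H]+ values of database peptides
candidates = [1528.90, 1529.20, 1529.68, 1529.731, 1529.7349, 1529.80, 1530.40]

for tolerance in (1, 0.1, 0.01, 0.001):
    print(tolerance, count_hits(1529.7348, candidates, tolerance))
```

Tightening the tolerance is only meaningful when the instrument's mass accuracy supports it; otherwise the true peptide itself falls outside the window.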
Database search
Peptide mass fingerprinting provides evidence for the most probable identity of a protein.
The quality of the Protein identification will depend upon:
i. the quality of the mass spectrometry data,
ii. the accuracy of the database,
iii. the power of the search algorithms and software used
Peptide Mass Fingerprinting
Palagi et al., Proteomics 2006, 6. 5435-5444
The significance of the result depends on the size of the database being searched
Mascot PMF score
Mascot PMF results
Entry name Coverage similarity
Probability to be random
Coverage: % of protein length covered by the experimental peptides
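Coverage as defined here is straightforward to compute once the matched peptides are known. A minimal sketch, assuming peptides are located by exact substring match; the toy sequence and the function name are hypothetical, and this is not how Mascot itself is implemented.

```python
def sequence_coverage(protein, peptides):
    """Percentage of protein residues covered by at least one matched peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            # Mark every residue of this occurrence as covered
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)

protein = "MKTAYIAKQRPLVGGER"   # hypothetical toy sequence (17 residues)
matched = ["MK", "TAYIAK"]      # peptides assumed to be matched by the search
print(f"{sequence_coverage(protein, matched):.1f}% coverage")
```

Overlapping peptides are counted once per residue, so the value never exceeds 100%.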
Mascot protein view
Sequence Database Search
Theoretical spectra
AVAGCAGARCVAAGAAGRVGGACAAAR..
Compare virtual spectra to real spectrum
i. Compute correlation scores
ii. Rank hits
iii. Peptide/protein validation
Experimental fragmentation spectrum
Select peptides from the database whose mass equals the input precursor mass, and get the sequences that match.
Precursor mass, charge state: [M+H]+ = 775.8
Sequence Database Search
Modified from: Jimmy Eng, MS/MS Database Searching
http://tools.proteomecenter.org/course/lectures/0610Day1.Eng.pdf
Palagi et al., Proteomics 2006, 6. 5435-5444
PMF vs PFF
The PFF (peptide fragment fingerprinting) approach is very similar to the PMF approach, but is applied to MS/MS data: it correlates peptide fragmentation spectra with theoretical spectra of peptides from a database.
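To show what a theoretical fragmentation spectrum looks like, the sketch below generates singly charged b- and y-ion m/z values for a peptide and counts how many are found in an observed peak list. The shared-peak count is a deliberately simple stand-in: it is not the cross-correlation score of Sequest nor the probability-based score of Mascot. The peptide, the observed peaks and the function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    residues = [MONO[aa] for aa in peptide]
    # b-ions: N-terminal fragments; y-ions: C-terminal fragments (carry one water)
    b_ions = [sum(residues[:i]) + PROTON for i in range(1, len(peptide))]
    y_ions = [sum(residues[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b_ions, y_ions

def shared_peak_count(observed, theoretical, tol=0.5):
    """Number of theoretical fragments that have an observed peak within tol (Da)."""
    return sum(1 for t in theoretical if any(abs(o - t) <= tol for o in observed))

b_ions, y_ions = fragment_ions("GAK")     # hypothetical tripeptide
observed = [129.1, 147.1, 300.0]          # hypothetical MS/MS peak list
print([round(m, 3) for m in b_ions], [round(m, 3) for m in y_ions])
print(shared_peak_count(observed, b_ions + y_ions))
```

In a database search this comparison is repeated for every candidate peptide whose mass matches the precursor, and the candidates are then ranked by score.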
Sequest
Commercially available, distributed by Finnigan Corp.
Developed by Jimmy Eng and John Yates
Correlates uninterpreted tandem mass spectra of peptides with amino acid sequences from protein and nucleotide databases.
Determines the amino acid sequence and thus the protein(s) and organism(s) that correspond to the mass spectrum being analyzed.
http://fields.scripps.edu/sequest/
Parameters of MS/MS id search
Missed cleavage site
Modifications
• Cysteine almost always modified
• Variable modifications increase search time exponentially
Basic residues (K, R) at the C-terminus attract ionizing charge, leading to strong y-ions
Large sequence databases contain many irrelevant peptide candidates
Digestion Enzyme
• Trypsin (specific)
• Non-tryptic search increases time by two orders of magnitude
http://www.matrixscience.com/
Mascot MS/MS ion search
MS/MS ion search result
It is the ions scores for individual
peptide matches that are
statistically significant
The proteins are listed, by descending score, each with a table summarising the matched peptides
Protein view
Peptide summary
Experimental m/z value
Expectation value for the peptide match (the number of times we would expect to obtain an equal or higher score purely by chance; the lower this value, the more significant the result)
(Relative molecular mass)
Calculated rel. mass
Peptide view
Ions score
Difference between the experimental and calculated masses
Hit: a plus sign indicates that multiple proteins contain a match to this peptide
The Brukin2d software, developed with Matlab 7.4, uses the compound data that are exported from Bruker 'DataAnalysis' program, and depicts the mean mass spectra of all the chromatograph compounds from one LC-MS run, in one 2D contour plot.
D Tsagkrasoulis, Brukin2d (in preparation) www.bioacademy.gr/bioinformatics
Brukin2d
Databases and tools
Melanie
ProteinScape
• Hierarchy:
Project
Sample
Gel
Spots
MS Data
Search Events
Platform for storing, organizing, analyzing data generated during the proteomics workflow.
Identification of interactions
Computational
Genomic data
• Phylogenetic profiling
• Gene context
• Gene fusion
• Symmetric evolution
Structural data
• Sequence profile
• 3D structural distance matrix
• Surface patches
• Binding interactions
Experimental
• X-ray crystallography
• NMR spectroscopy
• Mass spectrometry (Tandem affinity purification)
• Immunoprecipitation
• Yeast two-hybrid
• Microarrays
KEGG
http://www.genome.jp/kegg/kegg2.html
KEGG: Kyoto Encyclopedia of Genes and Genomes
• Organism specific entry points:
  - KEGG Organisms
• Subject specific entry points:
  - DRUG, GLYCAN, REACTION, KAAS
KEGG
Manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for metabolism, other cellular processes, and human diseases.
Functional hierarchies and binary relations of KEGG objects, including genes and proteins, compounds and reactions, drugs and diseases, and cells and organisms.
Gene catalogs of all complete genomes and some partial genomes with ortholog annotation (KO assignment), enabling KEGG PATHWAY mapping and BRITE mapping.
A composite database of chemical substances and reactions representing our knowledge on the chemical repertoire of biological systems and environments.
KEGG is a “biological systems” database integrating both molecular building block information and higher-level systematic information.
Search Pathway
Carbon fixation
Search “Pathway”
“Pathways” _motifs
Reactome
Browse interactions
Interactions databases
STRING (EMBL)
BOND (Unleashed Informatics)
Cytoscape (viewer)
DIP (UCLA)
iHOP
SPIN-PP (protein-protein interfaces in the PDB)
MIPS (Mammalian Protein-Protein Interaction Database)
IntAct (protein interactions from literature curation)
STRING search results
Cytoscape
Cytoscape is an open source bioinformatics software
platform for visualizing molecular interactions with gene
expression profiles and other state data.
Protein Protein Interactions
From single proteins to systems biology
Protein-Protein Interactions
Proteins “work together”, forming multiprotein complexes to carry out specific functions
• Available datasets of pairwise protein-protein interactions
• Protein interactions are modeled as a graph G=(V,E), called a protein interaction graph, where V is the set of vertices (proteins) and E the set of edges (protein-protein interactions)
• Goal:
– Using protein interaction graphs, find protein complexes (clustering)
– Annotation of unknown proteins
• Creation of a new hierarchical algorithm and application on a well studied organism in order to evaluate it.
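The graph model G=(V,E) described above maps directly onto an adjacency structure. The sketch below builds such a graph from a list of pairwise interactions and groups proteins by connected components, which is the crudest possible form of clustering and only a stand-in: it is not the hierarchical algorithm mentioned on the slide. The protein names and function names are hypothetical.

```python
from collections import defaultdict

def build_graph(edges):
    """Adjacency sets for an undirected protein interaction graph G=(V,E)."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def connected_components(graph):
    """Group proteins that are linked by any chain of interactions."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

# Hypothetical pairwise interactions
edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P4", "P5")]
graph = build_graph(edges)
print(connected_components(graph))
print({protein: len(partners) for protein, partners in sorted(graph.items())})
```

The second printed line gives the degree of each vertex; proteins with unusually high degree are the hubs of the network.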
• Whole Genome Analyses
• Clustering in protein-protein interaction networks
Theodosiou A, Moschopoulos C, Baumann M, Kossida S, Protein interactions and disease, (2007) Idea Group Inc., Handbook of Research in Systems Biology Applications in Medicine
C Moschopoulos, PhD student
Phylogenetic Profile
         Protein D   Protein C   Protein B   Protein A
Org 4        1           1           0           1
Org 3        0           1           0           1
Org 2        1           0           1           0
Org 1        1           1           1           1
In silico Prediction of PPI
Phylogenetic profile (against N genomes): for each gene X in a target genome, if gene X has a homolog in genome #i, the ith bit of X’s phylogenetic profile is “1”, otherwise it is “0”.
The phylogenetic profile of a protein is a string that encodes the presence or absence of the protein in every sequenced genome
Conserved presence or absence of a protein pair suggests functional coupling.
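Profile comparison reduces to comparing bit strings. The sketch below encodes four profiles over Org 1 to Org 4 and reports protein pairs with identical profiles. The bit values follow one possible reading of the slide's flattened table and should be treated as illustrative; the function names are hypothetical.

```python
from itertools import combinations

# Presence (1) or absence (0) of a homolog in Org 1, Org 2, Org 3, Org 4.
# Assumed reading of the slide's table; illustrative values only.
profiles = {
    "Protein A": (1, 0, 1, 1),
    "Protein B": (1, 1, 0, 0),
    "Protein C": (1, 0, 1, 1),
    "Protein D": (1, 1, 0, 1),
}

def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def coupled_pairs(profiles, max_distance=0):
    """Protein pairs whose profiles differ in at most max_distance genomes."""
    return [(x, y) for x, y in combinations(sorted(profiles), 2)
            if hamming(profiles[x], profiles[y]) <= max_distance]

print(coupled_pairs(profiles))
```

With real data the number of genomes N is much larger, and profiles that are all 1s or all 0s carry little information and are usually ignored.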
A C
In silico Prediction of PPI
Protein A
Protein C
Protein B
Org 1
Org 2
Org 3
Org 4
A B
Gene Fusion (Rosetta stone)
Seemingly unrelated proteins are sometimes found fused in another organism
Org 1
Org 2
Gene Context
Conserved gene neighbourhood suggests position-function coupling
Though gene-fusion has low prediction coverage, its false-positive rate is low
• Whole Genome Analyses
• Domain Fusion Analysis (DFA) for Complete Bacteria Genomes
• Proteins in a given species are found to consist of a fusion between two separate full-length proteins in another species.
• DFA predicts protein pairs that have related biological functions (participation in a common structural complex, metabolic pathway or biological process).
• We can predict potential physical protein-protein interactions.
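The core test of a fusion analysis can be sketched as follows: two different query proteins that align to separate, non-overlapping regions of the same reference protein are reported as a candidate pair. This is a simplified illustration, not the DFA pipeline used by the authors. The alignment hits would normally come from a sequence similarity search; here they are hypothetical tuples, and the names and the `max_overlap` threshold are assumptions.

```python
from itertools import combinations

def rosetta_pairs(hits, max_overlap=10):
    """Find query protein pairs fused in one reference protein.

    hits: {reference_protein: [(query_protein, ref_start, ref_end), ...]}
    Two queries qualify when their aligned regions on the reference overlap
    by at most max_overlap residues.
    """
    pairs = set()
    for reference, aligned in hits.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(aligned, 2):
            if q1 == q2:
                continue
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                pairs.add((tuple(sorted((q1, q2))), reference))
    return sorted(pairs)

# Hypothetical alignment hits of query proteins on reference proteins
hits = {
    "refX": [("qA", 1, 200), ("qB", 210, 450)],   # two separate regions
    "refY": [("qC", 1, 300), ("qD", 50, 280)],    # same region, not a fusion
}
print(rosetta_pairs(hits))
```

Candidate pairs found this way are predictions of functional linkage and need further filtering, for example to exclude pairs that are simply homologous to each other.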
OUR METHOD:
Eubacteria (Aquifex), Complete Genome (Query 1)
Crenarchaeota (Thermofilum), Complete Genome (Query 2)
VS
Protists (Entamoeba), Complete Genome (Reference one)
Danos V
Gene Ontology Home
What is an Ontology?
• A formal, explicit specification of a shared conceptualization.
• A shared vocabulary that can be used to model a domain, i.e., the objects and/or concepts that exist, their properties and relations.
• An ontology includes:
– Classes : describe concepts in the domain
– Properties : describe features and attributes of each class
– Restrictions : allowed values for the classes’ properties
– Instances : real values for properties
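The four building blocks listed above can be mirrored in a small data structure. This is a toy sketch, not an implementation of any ontology language or of the Gene Ontology itself; the class name `OntologyClass`, the property names and the allowed values are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class OntologyClass:
    """A concept (class) with properties and their allowed values (restrictions)."""
    name: str
    properties: dict = field(default_factory=dict)  # property name -> set of allowed values

    def make_instance(self, **values):
        """Create an instance, checking every value against the restrictions."""
        for prop, value in values.items():
            if prop not in self.properties:
                raise KeyError(f"{self.name} has no property {prop!r}")
            if value not in self.properties[prop]:
                raise ValueError(f"{value!r} is not allowed for {prop!r}")
        return {"class": self.name, **values}

gene_product = OntologyClass(
    name="GeneProduct",
    properties={"cellular_component": {"nucleus", "ribosome"}},
)
instance = gene_product.make_instance(cellular_component="nucleus")
print(instance)
```

Because the restrictions are explicit, software can reject an invalid annotation instead of silently storing it, which is one of the practical benefits of a shared ontology.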
Why use Ontologies?
• To share common understanding of the structure of information among
– people
– software
• To enable reuse of domain knowledge
• To make domain assumptions explicit (easy to find, understand and change)
• To be used as a generalized database
• To create models that make easier the software connectivity
• To retrieve information easily and quickly (Information Retrieval Applications)
• More… (Natural Language Processing, Artificial Intelligence, Semantics in Web… )
The Three Ontologies
Molecular Function: describes activities, or tasks, performed by individual gene products or by assembled complexes of gene products. Examples: DNA binding, transcription factor
Biological Process: a series of events accomplished by one or more ordered assemblies of molecular functions. NOT a “pathway”! Examples: mitosis, signal transduction, metabolism
Cellular Component: location or complex; a component of a cell that is also part of some larger object. Examples: nucleus, ribosome, origin recognition complex
www.bioacademy.gr/bioinformatics
Subcellular Location of Identified Human Brain Proteins
Cytoplasm               52%
Mitochondria            27%
Nucleus                  9%
Endoplasmic Reticulum    4%
Extracellular            3%
Membranes                2%
Synapses                 2%
Misc                     1%
• In silico Analysis of Specific Genes / Gene Families
• Regulatory analysis of WISP1 & CTGF co-regulated genes
• If there is common Framework of TFs, it could denote common regulation.
• Confirmation of Experimental Results
• Evolutionarily Conserved Region → Significant Regulatory Function. If distant from TSS → Putative Enhancer
• Combinatorial use of bioinformatics analysis and biological knowledge could provide novel hypotheses concerning the regulatory mechanisms governing gene expression that could guide further experimental analysis.
Kapasa M, Serafimidis I, Gavalas A, Kossida S, Phylogenetic and promoter analysis of the WISP1 and CTGF orthologs: functional implications (submitted 2007)
• In silico Analysis of Specific Genes / Gene Families
• Notch3 and CADASIL
Theodosiou A, Baumann M, Kossida S, Evolutionary & promoter analysis of Notch3 (in preparation)
Annotating & reconstructing hypothetical zebrafish proteomics-derived sequences to their whole length
Danio rerio hypothetical Alpha II Spectrin (433 aa), IPI00490244
Accessions shown in the figure: CB363887.1, CR930945.1, CK692840.1, CK024418.1, AL914103.1, CA473456.1, BI981643.1, CK709210.1, CN178436.1, CK399253.1
Human Alpha II Spectrin (2472 aa), Q13813 / SPTA2_HUMAN
Initial zebrafish hypothetical Alpha II Spectrin sequence (from MALDI-TOF): 433 aa
Final zebrafish Alpha II Spectrin sequence (with gaps) (database analysis): 2472 aa
We managed to build the remaining sequence of the zebrafish orthologue protein by adding 1580 amino acids, using EST database search
Organogenesis: the formation of the living organism
How do individual cells assemble into specialised tissues and organs (muscles, epithelium, blood)?
How do different organs interact to form the organism?
From organogenesis to disease
To understand the cause of a disease at the molecular level, we first need to dissect the molecular mechanisms responsible for the formation of various tissues.
Some examples of human diseases that result from abnormal gene function and also cause abnormal tissue morphogenesis:
• Disruption in epithelia formation results in cancer (oncogenic mutation)
• Metastasis (cell migration from tissue A to tissue B)
• Weak muscle connections result in a myopathy (mutation in a cytoskeletal protein)
Integrins: an ancient family of proteins that play an important role in health and disease because they link the outside micro-environment with the interior of the cell.
[Figure: integrins linking the extracellular matrix to the cytoskeleton]
Our goal:
Identify novel genes involved in integrin-mediated functions, using the genetically tractable organism Drosophila as a model system.
The more we learn about the functions of novel genes, the more insight we gain into the fundamental molecular principles that are affected in human diseases.
• Why integrins?
  – one of the main families of adhesion molecules.
  – important role in normal development (organogenesis) and human pathology: cancer metastasis, thrombosis, angiogenesis, inflammation
• Why muscle first?
  – more information is available on the function of integrin-associated proteins in muscle
• What are the goals?
1. Search the network of protein interactions around integrins & associated proteins, and discover the main hubs.
2. Search the genetic interactions of the integrin complex in different organisms to complete the previous network.
3. Search for expression patterns that correlate with those of integrins during development and in subcellular distribution.
• What kind of data do we need to collect?
  – Gene name (fly & homologs).
  – Domains.
  – Available mutants (P-elements).
  – Interactions (fly & homologs).
  – Expression pattern.
  – Subcellular localisation.
  – Worm RNAi results.
• Where can we find the data?
  – Internet Data Bases.
  – Interactions graphical representation tools (e.g. Osprey).
  – Literature.
Constructing a meta-base
Phases
• Identification of required data.
• DataBase (DB) development.
• Data collection & import into the DB.
• Data Analysis via appropriate queries.
Isolation of genes suitable for further genetic (RNAi) analysis.
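The phases above can be illustrated with a tiny in-memory database. The schema, the gene names and the selection criteria are all hypothetical and only show the idea of collecting heterogeneous data in one place and isolating candidate genes with a query; this is not the authors' actual meta-base.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DB development: a minimal, assumed schema
conn.executescript("""
CREATE TABLE gene (
    id INTEGER PRIMARY KEY,
    fly_name TEXT,
    has_mutant INTEGER,      -- 1 if a mutant (e.g. P-element) is available
    expression TEXT          -- tissue of the main expression pattern
);
CREATE TABLE interaction (
    gene_a INTEGER REFERENCES gene(id),
    gene_b INTEGER REFERENCES gene(id)
);
""")

# Data collection and import (hypothetical records)
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", [
    (1, "geneA", 1, "muscle"),
    (2, "geneB", 1, "muscle"),
    (3, "geneC", 0, "muscle"),
    (4, "geneD", 1, "gut"),
])
conn.executemany("INSERT INTO interaction VALUES (?, ?)",
                 [(1, 2), (1, 3), (1, 4), (2, 4)])

# Data analysis: muscle-expressed genes with a mutant, ranked by interaction count
query = """
SELECT g.fly_name, COUNT(*) AS n_partners
FROM gene g JOIN interaction i ON g.id IN (i.gene_a, i.gene_b)
WHERE g.expression = 'muscle' AND g.has_mutant = 1
GROUP BY g.id
ORDER BY n_partners DESC, g.fly_name
"""
candidates = conn.execute(query).fetchall()
print(candidates)
```

Genes returned by such a query would be the starting list for further genetic (RNAi) analysis.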
[Figure: already known genes (a to t) in organism X]
[Figure: the same gene set of organism X]
genes expressed differently in various tissues (e.g. gut / nervous system / muscles)
network of interactions (protein-protein or genetic interactions)
genes that cluster together (likely to participate in similar mechanisms)
[Figure: genes of Organism X (e.g. mouse) and their homologs (a' to t') in Organism Y (e.g. fly)]
conserved interactions
novel interactions
non-conserved interactions
gene homologs in the organism Y
expression pattern in similar tissues between the 2 examined organisms
Our approach: data mining and collection of available data in such a way that comparing data from various experiments generates novel information
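The cross-organism comparison can be expressed with set operations once a homolog table is available. In this sketch the three categories are defined as follows, which is an assumption about the slide's labels: conserved interactions occur in both organisms, non-conserved ones occur only in organism X, and novel ones occur only in organism Y. Gene names and function names are hypothetical.

```python
def normalize(a, b):
    """Store an undirected interaction in a fixed order."""
    return tuple(sorted((a, b)))

def compare_networks(edges_x, edges_y, homolog):
    """Map organism X interactions onto organism Y via homologs and classify them."""
    y_set = {normalize(a, b) for a, b in edges_y}
    mapped = {normalize(homolog[a], homolog[b])
              for a, b in edges_x if a in homolog and b in homolog}
    return {
        "conserved": sorted(mapped & y_set),
        "non_conserved": sorted(mapped - y_set),
        "novel": sorted(y_set - mapped),
    }

# Hypothetical data: organism X genes a, b, h, n and their organism Y homologs
homolog = {"a": "a'", "b": "b'", "h": "h'", "n": "n'"}
edges_x = [("a", "b"), ("h", "n")]
edges_y = [("a'", "b'"), ("a'", "n'")]

result = compare_networks(edges_x, edges_y, homolog)
print(result)
```

Interactions in organism X that involve genes without a known homolog are skipped here; in practice they would be tracked separately.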
[Figure: compared gene networks of Organism X (e.g. mouse) and Organism Y (e.g. fly)]
Data comparison reveals some interesting links between specific genes
The outcome of the proposed work will be:
the identification of suitable target genes that can direct further experimental approaches aiming to address a specific biological question regarding the functions of the above genes in a model organism
Westerhoff HV, Palsson BO: The evolution of molecular biology into systems biology. Nat Biotech 2004, 22: 1249-1252
Systems biology is the integration of experimental and computational approaches to achieve the overall goal of explaining and predicting complex cellular behaviors of biological systems.
Acknowledgements
BRFAA Bioinformatics & Medical Informatics Group
BRFAA Proteomics Unit
Proteomics Unit, Biomedicum, University of Helsinki, headed by Marc Baumann
• Dimosthenis
• Panos
• Athina
• Manousos
Sigrid Juselius Foundation for funding: S Kossida & A Theodosiou