computational functional genomics
1
Computational functional genomics
Lital Haham, Sivan Pearl
2
Introduction
• Piles of information but only flakes of knowledge.
• The existing information:
Collections of genomic sequences
Expression profiles
Protein-protein interactions
And many more…
3
Introduction
• Computational biology strives to extract the maximal possible information from known sequences, by classifying them according to their homologous relationships, predicting their biochemical activity, cellular function, 3-dimensional structures and evolutionary origin.
4
The COG-Clusters of Orthologous Groups of proteins
• Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes.
• COGs reflect one-to-one, one-to-many and many-to-many orthologous relationships.
• The purpose of COG is to serve as a platform for functional annotation of newly sequenced genomes and for study of genome evolution.
5
The COG-statistics
• As of 2003, there were 3,307 COGs comprising 74,059 proteins from 43 genomes.
• The genomes come from Bacteria, Archaea and Eukaryota.
• The database includes 17 functional groups.
6
The COG- build your own
• The COG construction procedure is based on the notion that any group of at least 3 proteins from distant genomes that are more similar to each other than to any other protein from the same genomes is most likely to belong to an orthologous family.
7
The COG- build your own
• All-against-all protein sequence comparison
• Detect and collapse paralogs
• Detect triangles of mutually genome-specific best hits
• Merge triangles with a common side to form a COG
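The last two steps above can be sketched in a few lines. This is a minimal illustration, not the actual COG pipeline: it assumes the all-against-all comparison has already been reduced to a hypothetical `best_hit` table mapping (protein, target genome) to the best-hit protein, that paralogs are already collapsed, and, as a simplification, that every side of a triangle must be a bidirectional best hit. All protein and genome names are invented.

```python
from itertools import combinations

# Hypothetical toy input: genome of each protein and genome-specific best hits.
genome_of = {"x1": "X", "y1": "Y", "z1": "Z", "w1": "W"}
best_hit = {
    ("x1", "Y"): "y1", ("y1", "X"): "x1",
    ("x1", "Z"): "z1", ("z1", "X"): "x1",
    ("y1", "Z"): "z1", ("z1", "Y"): "y1",
    ("y1", "W"): "w1", ("w1", "Y"): "y1",
    ("z1", "W"): "w1", ("w1", "Z"): "z1",
}

def bidirectional(a, b):
    """True if a and b are each other's best hit in the other's genome."""
    return (best_hit.get((a, genome_of[b])) == b
            and best_hit.get((b, genome_of[a])) == a)

def find_triangles():
    """Triangles of mutual best hits, one protein from each of three genomes."""
    found = set()
    for a, b, c in combinations(sorted(genome_of), 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue
        if bidirectional(a, b) and bidirectional(b, c) and bidirectional(a, c):
            found.add(frozenset((a, b, c)))
    return found

def merge_triangles(found):
    """Union triangles that share a side (exactly two proteins) into clusters."""
    tris = [set(t) for t in found]
    parent = list(range(len(tris)))

    def root(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) == 2:
            parent[root(i)] = root(j)
    clusters = {}
    for i, t in enumerate(tris):
        clusters.setdefault(root(i), set()).update(t)
    return list(clusters.values())

triangles = find_triangles()
clusters = merge_triangles(triangles)
```

In this toy input the triangles (x1, y1, z1) and (y1, z1, w1) share the side y1-z1, so they merge into a single four-member cluster even though x1 and w1 are not best hits of each other.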
8
The COG- build your own
9
The COG- adding new genomes
• The COGNITOR program adds new proteins to pre-existing COGs on the basis of multiple Best Hits.
• In this way, 60-80% of the proteins of a prokaryotic genome could be assigned to COGs.
10
The COG- more applications:
• Detecting missed genes.
• Convenient for a variety of evolution-oriented analyses of protein families.
11
Methods
• Experimental methods: biochemical and genetic experiments
• Computational methods:
Homology method (BLAST)
mRNA expression
Phylogenetic profile
Fusion method (Rosetta Stone analysis)
Gene neighbour method
12
Homology method
• Homology method: searches for proteins whose amino acid sequences are similar.
• 40-70% of the genes in a new genome can be assigned some function this way.
• The assignment usually identifies some molecular function of the protein.
13
mRNA expression
• Analysis of correlated mRNA expression levels makes it possible to establish functional linkages, by detecting changes in mRNA expression across different cell types or different environments.
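As a rough illustration of what "correlated expression" can mean in practice, the sketch below links genes whose expression levels across conditions have a high Pearson correlation. The choice of Pearson correlation, the 0.9 threshold, the gene names and the expression values are all assumptions made for this example; the slides do not specify them.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented toy data: expression level of each gene in four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}

THRESHOLD = 0.9  # assumed cut-off for calling two genes co-expressed

links = [
    (g, h)
    for g, h in combinations(sorted(expression), 2)
    if pearson(expression[g], expression[h]) >= THRESHOLD
]
```

Here only geneA and geneB rise and fall together, so they form the single predicted link.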
14
Phylogenetic profile
• Describes the pattern of presence or absence of a particular protein, across a set of organisms.
• Number of possible profiles: $2^n \sim 10^9$, where $n$ is the number of organisms.
• This number far exceeds the number of protein families.
15
Phylogenetic profile
• Why would two proteins always both be inherited into new species or neither inherited, unless the two function together?
• If two proteins have the same phylogenetic profile, it is inferred that they have a functional link: engaged in a common pathway or complex.
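The idea can be sketched as follows: encode each protein as a vector of 1s and 0s over the organisms, then group proteins with identical vectors. The presence data and names are invented, and treating profiles that differ in at most one position as "neighbours" is an assumption of this sketch.

```python
from collections import defaultdict

GENOMES = ["genome1", "genome2", "genome3", "genome4"]

# Hypothetical input: protein -> set of genomes in which a homolog is found.
presence = {
    "P1": {"genome1", "genome3"},
    "P2": {"genome1", "genome3"},
    "P3": {"genome2", "genome4"},
    "P4": {"genome1", "genome2", "genome3"},
}

def profile(protein):
    """Presence/absence vector of a protein across GENOMES."""
    return tuple(int(g in presence[protein]) for g in GENOMES)

def hamming(p, q):
    """Number of positions at which two profiles differ."""
    return sum(a != b for a, b in zip(p, q))

# Proteins with identical profiles are inferred to be functionally linked.
groups = defaultdict(list)
for protein in sorted(presence):
    groups[profile(protein)].append(protein)
linked = [members for members in groups.values() if len(members) > 1]

# Looser variant (assumption): neighbours differ in at most one position.
neighbors_of_p1 = [
    p for p in sorted(presence)
    if p != "P1" and hamming(profile("P1"), profile(p)) <= 1
]

n_possible = 2 ** len(GENOMES)  # number of possible profiles, 2^n
```

With four genomes there are only 16 possible profiles; the argument on the slide relies on n being large enough that matching profiles are unlikely to arise by chance.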
16
Phylogenetic profile
17
Phylogenetic profile- example
• Analysis of three proteins, RL7, FlgL and His5, according to their phylogenetic profiles.
• RL7: more than half of the proteins with similar profiles have functions associated with the ribosome.
• FlgL: more than half are flagellar proteins or cell-wall maintenance proteins.
• His5: more than half are involved in amino acid metabolism.
18
Phylogenetic profile- example
RL7 ribosome L7
RL15 ribosome L15
RL17 ribosome L17
PTH peptidyl-tRNA hydrolase
RNC ribonuclease III
PgsA phospholipid synthesis
YGGH hypothetical
YBEX hypothetical
RL34 ribosome L34
RL36 ribosome L36
RL27 ribosome L27
RL25 ribosome L25
YQCB hypothetical
YABO hypothetical
YCEC hypothetical
RFH peptide release factor
ClpB heat shock protein
YJFH hypothetical
RS14 ribosome S14
G3P3 dehydrogenase
RL4 ribosome L4
NONE hypothetical
GrpE co-chaperone
GidB glucose inhib. division
RL24 ribosome L24
DEF polypeptide deformylase
RL20 ribosome L20
MesJ cell cycle protein
RL19 ribosome L19
RL21 ribosome L21
RL9 ribosome L9
SmpB small protein B
19
Phylogenetic profile
Keyword | No. proteins | No. neighbors in keyword group | No. neighbors in random group
Ribosome | 60 | 197 | 27
Transcription | 36 | 17 | 10
tRNA synthase and ligase | 26 | 11 | 5
Membrane proteins | 25 | 89 | 5
Flagellar | 21 | 89 | 3
Iron, ferric, and ferritin | 19 | 31 | 2
Galactose metabolism | 18 | 31 | 2
Molybdopterin and molybdenum | 12 | 6 | 1
Hypothetical | 1084 | 108226 | 8440
Phylogenetic profiles link proteins with similar keywords
20
Fusion method or the Rosetta stone analysis
• Some pairs of interacting proteins have homologs in another organism, fused into a single protein chain.
• When two separate proteins in one organism, A and B, are expressed as a fused protein in some other species, there is a high probability that A and B are linked in function.
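A minimal sketch of the search: given precomputed alignment hits, report pairs of proteins that both align to the same protein in another organism, in non-overlapping regions of it. The `hits` tuples (query, candidate fused protein, start, end), the protein names and the simple non-overlap test are all assumptions of this example, not the published procedure.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical alignment hits: (query protein, subject protein in another
# organism, start and end of the aligned region on the subject).
hits = [
    ("A", "fusedAB", 1, 200),
    ("B", "fusedAB", 220, 450),
    ("C", "fusedAB", 10, 180),
    ("D", "otherX", 1, 300),
]

def rosetta_pairs(hits):
    """Pairs of queries aligning to non-overlapping regions of one subject."""
    by_subject = defaultdict(list)
    for query, subject, start, end in hits:
        by_subject[subject].append((query, start, end))
    pairs = set()
    for subject, regions in by_subject.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                pairs.add((tuple(sorted((q1, q2))), subject))
    return sorted(pairs)

candidate_links = rosetta_pairs(hits)
```

In the toy data, A and C align to the same region of the fused protein, so they are not paired with each other, but each is paired with B.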
21
Fusion method
22
The Rosetta Stone model
23
Fusion method –what is it good for?
• Predicts protein pairs that have related biological functions.
• Predicts potential protein-protein interactions.
• Can turn up complexes of proteins, or protein pathways.
24
Fusion method –what is it good for?
25
Fusion method
• The group searched the 4,290 protein sequences of the E. coli genome.
• The proteins could form at most (4290)(4289)/2 pairwise interactions, but we expect far fewer.
• The analysis found 6,809 candidate pairwise interactions.
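To put these numbers in proportion, the upper bound and the fraction of it covered by the candidates can be computed directly. Only the two counts come from the slide; the variable names are illustrative.

```python
from math import comb

n_proteins = 4290      # E. coli protein sequences searched
n_candidates = 6809    # candidate pairs found by the fusion method

max_pairs = comb(n_proteins, 2)   # (4290)(4289)/2 unordered pairs
fraction = n_candidates / max_pairs

print(max_pairs)           # 9199905
print(f"{fraction:.2%}")   # share of all possible pairs that are candidates
```

The candidates are well under 0.1% of the roughly 9.2 million possible pairs.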
26
Fusion method –validation
• Look for similar functions in the existing annotations of the two proteins, which would imply at least a functional interaction.
• Of the E. coli pairs found in the Rosetta Stone analysis, 68% share at least one keyword in their annotations, whereas among randomly selected pairs of E. coli proteins only 15% share a keyword.
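The keyword comparison can be sketched as below. The annotations and pairs are invented, and the baseline here enumerates all possible pairs as a deterministic stand-in for the random sampling described on the slide.

```python
from itertools import combinations

# Hypothetical annotation keywords per protein.
keywords = {
    "p1": {"ribosome"},
    "p2": {"ribosome", "translation"},
    "p3": {"flagellar"},
    "p4": {"transport"},
}

def shared_keyword_fraction(pairs):
    """Fraction of pairs whose annotations share at least one keyword."""
    sharing = sum(1 for a, b in pairs if keywords[a] & keywords[b])
    return sharing / len(pairs)

predicted_pairs = [("p1", "p2"), ("p3", "p4")]        # e.g. fusion-method output
all_pairs = list(combinations(sorted(keywords), 2))   # stand-in for a random set

predicted_fraction = shared_keyword_fraction(predicted_pairs)
baseline_fraction = shared_keyword_fraction(all_pairs)
```

A predicted fraction well above the baseline, as with the 68% versus 15% reported above, suggests the predictions are enriched for functionally related pairs.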
27
Fusion method –validation
• Of the protein pairs in a database of experimentally observed interactions, 6.4% are linked by Rosetta Stone sequences.
• The phylogenetic profile method was applied to the interactions predicted by the fusion method: more than 8 times as many of them were also suggested by the phylogenetic profile method as for randomly chosen sets of interactions.
28
Fusion method –missing pairs
• False negatives:
There was no fusion of the interacting proteins.
The fused protein disappeared during the course of evolution.
29
Fusion method –False alarms
• False positives:
A physical interaction is falsely predicted when the fused proteins are co-regulated but do not physically interact.
The method cannot distinguish between homologs that bind and those that do not.
30
Fusion method –False alarms
• The false positive rate in E. coli due to the inability to distinguish homologs is about 82%.
• To reduce these errors, the “promiscuous” domains were identified and removed during the analysis.
• By filtering out only 5% of all domains, we can remove the majority of falsely predicted interactions.
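One way such a filter could work is sketched below. The slides do not say how promiscuous domains were identified, so ranking domains by their number of distinct predicted partners and dropping the top 5% is purely an assumption of this example; the domain names are invented.

```python
from collections import defaultdict

def filter_promiscuous(links, fraction=0.05):
    """Drop the given fraction of domains with the most distinct partners."""
    partners = defaultdict(set)
    for a, b in links:
        partners[a].add(b)
        partners[b].add(a)
    # Rank by partner count (descending), breaking ties by name.
    ranked = sorted(partners, key=lambda d: (-len(partners[d]), d))
    n_remove = round(fraction * len(ranked))
    promiscuous = set(ranked[:n_remove])
    kept = [(a, b) for a, b in links
            if a not in promiscuous and b not in promiscuous]
    return promiscuous, kept

# Toy data: one hub domain linked to 19 others, plus one ordinary link.
links = [("HUB", f"d{i}") for i in range(1, 20)] + [("d1", "d2")]

promiscuous, kept = filter_promiscuous(links)
```

In this toy set, removing a single domain out of 20 (5%) eliminates 19 of the 20 predicted links, which mirrors the point of the slide.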
31
Fusion method –False alarms
32
Neighbour method
• Functional links between genes can be identified by examining whether the proximity of the genes is conserved across multiple genomes.
• Powerful in uncovering functional linkages in prokaryotes where operons are common.
33
Neighbour method
34
Neighbour method- definitions
• Close: two genes are on the same strand, within 300 bp of each other, and transcribed in the same direction.
• Direct link: two close genes whose orthologs are also close in at least two other genomes of different phylogenetic groups.
• Inferred link: two genes that are not close, but whose orthologs are close in at least three other genomes of different phylogenetic groups.
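The three definitions can be expressed as a short sketch. The `Gene` record, the ortholog labels, the coordinates and the genome table are hypothetical, and "at least two (or three) other genomes of different phylogenetic groups" is read here as at least two (or three) distinct phylogenetic groups among the other genomes, which is one possible interpretation.

```python
from dataclasses import dataclass

MAX_GAP = 300  # bp, from the definition of "close"

@dataclass(frozen=True)
class Gene:
    ortholog: str  # hypothetical ortholog-group label
    strand: str    # "+" or "-"
    start: int
    end: int

def is_close(g, h):
    """Same strand (same transcription direction) and at most 300 bp apart."""
    if g.strand != h.strand:
        return False
    gap = max(g.start, h.start) - min(g.end, h.end)
    return gap <= MAX_GAP

def groups_where_close(x, y, genomes, exclude=None):
    """Phylogenetic groups having a genome in which orthologs x and y are close."""
    groups = set()
    for name, (group, genes) in genomes.items():
        if name == exclude:
            continue
        xs = [g for g in genes if g.ortholog == x]
        ys = [g for g in genes if g.ortholog == y]
        if any(is_close(a, b) for a in xs for b in ys):
            groups.add(group)
    return groups

def classify(x, y, reference, genomes):
    """Return 'direct', 'inferred' or None for genes x and y of the reference."""
    close_here = bool(groups_where_close(x, y, {reference: genomes[reference]}))
    n_groups = len(groups_where_close(x, y, genomes, exclude=reference))
    if close_here and n_groups >= 2:
        return "direct"
    if not close_here and n_groups >= 3:
        return "inferred"
    return None

# Invented toy genomes: name -> (phylogenetic group, genes).
genomes = {
    "ref": ("group0", [Gene("X", "+", 100, 400), Gene("Y", "+", 500, 900),
                       Gene("Z", "+", 5000, 6000)]),
    "a": ("group1", [Gene("X", "-", 10, 300), Gene("Y", "-", 350, 700),
                     Gene("Z", "-", 800, 1200)]),
    "b": ("group2", [Gene("X", "+", 10, 300), Gene("Y", "+", 400, 800),
                     Gene("Z", "+", 900, 1500)]),
    "c": ("group3", [Gene("Y", "+", 10, 300), Gene("Z", "+", 450, 900)]),
}

link_xy = classify("X", "Y", "ref", genomes)
link_yz = classify("Y", "Z", "ref", genomes)
link_xz = classify("X", "Z", "ref", genomes)
```

In the toy data, X and Y are close in the reference and in two other groups (a direct link), while Y and Z are far apart in the reference but close in three other groups (an inferred link).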
35
Neighbour method- definitions
36
Neighbour method
• Proximity between genes is maintained mostly because it facilitates their co-transfer to another organism.
• Example: restriction-modification systems.
37
Neighbour method- validation
• Identify the links whose genes are annotated in KEGG or COG, and calculate the fraction of those that fall in the same functional pathway or category.
• The functional correspondence correlates with the minimum number of phylogenetic groups in which the proximity is detected.
38
Neighbour method- validation
[Figure: the N trade-off]
39
Neighbour method- example
40
Happy end???
• The group analyzed the 6,217 proteins of the yeast Saccharomyces cerevisiae by combining several methods.
• One can expect each protein to be functionally linked to perhaps 5–50 other proteins, giving 30,000–300,000 biologically meaningful links.
41
Happy end???
42
Networks
• When methods of detecting functional linkages are applied to all the proteins of an organism, a network of interacting, functionally linked proteins can be traced.
• As methods improve for detecting protein linkages, it seems likely that most of the proteins will be included in the network.
43
Networks
44
Happy Purim