1. EMBL Outstation - The European Bioinformatics Institute
Automatic and Reliable Functional Annotation of Proteins
2. The Target Database
Your data
Uncharacterized
Any kind of data
– Protein sequences
– Gene sequences
– etc.
Our target: TrEMBL
[Diagram: Target]
3. The External Database
Collection of conditions
– Sequence patterns
– Profiles
– HMMs
– E.C. numbers
– Protein clusters
Example:
– PROSITE
– Pfam
[Diagram: Target, XDB]
4. Direct Transfer
Search target
Transfer annotation to target database
Example: Look up E.C. number and add recommended enzyme name
[Diagram: Target, XDB]
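The direct-transfer example can be sketched in Perl as follows. This is a minimal illustration, not the EBI implementation: the lookup table `%enzyme_name`, the routine `add_enzyme_name`, and the assumption that the DE line carries only an E.C. number are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical excerpt of an E.C. number -> recommended name table
# (in practice this would be loaded from the ENZYME database).
my %enzyme_name = (
    '1.1.1.27' => 'L-LACTATE DEHYDROGENASE',
);

# Return the entry with the recommended enzyme name placed in front of
# the E.C. number on the DE line, or unchanged if nothing is known.
sub add_enzyme_name {
    my ($entry) = @_;
    # Look up the E.C. number given on the description (DE) line.
    my ($ecno) = $entry =~ /^DE.*\(EC ([\d.]+)\)/m;
    return $entry unless defined $ecno && exists $enzyme_name{$ecno};
    $entry =~ s/^(DE\s+)(\(EC \Q$ecno\E\))/$1$enzyme_name{$ecno} $2/m;
    return $entry;
}

my $entry = "DE   (EC 1.1.1.27).\n";
print add_enzyme_name($entry);
```

Entries whose E.C. number is missing from the table are returned untouched, so the transfer never invents a name.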
5. Multiple Sources
Usually more than one external database is used
Combine the different results
[Diagram: Target, XDB]
6. Conflicts
Contradiction
Inconsistencies
Synonyms
Redundancy
7. Translation
Use a translator to map XDB language to target language
[Diagram: Target, XDB]
8. Translation Examples
ENZYME -> TrEMBL
CA L-ALANINE=D-ALANINE
-> CC -!- CATALYTIC ACTIVITY: L-ALANINE=
   CC D-ALANINE.
PROSITE -> TrEMBL
/SITE=3,heme_iron
-> FT METAL IRON
Pfam -> TrEMBL
FT DOMAIN zf_C3HC4
-> FT ZN_FING C3HC4-TYPE
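A hand-written translator for these three examples might look like the sketch below. The table `@translation` and the routine `translate` are hypothetical names, only the slide's three cases are covered, and the catalytic activity is emitted on a single CC line rather than wrapped.

```perl
use strict;
use warnings;

# Hypothetical translation table: each row pairs a pattern for an XDB line
# with a routine that builds the corresponding TrEMBL line.
my @translation = (
    # ENZYME catalytic activity -> CC comment
    [ qr/^CA\s+(.*)$/,               sub { "CC -!- CATALYTIC ACTIVITY: $_[0]." } ],
    # PROSITE heme iron site -> FT METAL feature
    [ qr{^/SITE=\d+,heme_iron$},     sub { "FT METAL IRON" } ],
    # Pfam zinc finger domain -> FT ZN_FING feature
    [ qr/^FT\s+DOMAIN\s+zf_C3HC4$/,  sub { "FT ZN_FING C3HC4-TYPE" } ],
);

# Translate one XDB line; returns nothing if no row applies.
sub translate {
    my ($line) = @_;
    for my $row (@translation) {
        my ($pattern, $builder) = @$row;
        if (my @captured = $line =~ $pattern) {
            return $builder->(@captured);
        }
    }
    return;
}

print translate("CA L-ALANINE=D-ALANINE"), "\n";
```

Every new kind of XDB line needs another hand-written row, which is the maintenance cost that motivates deriving the translation automatically from a reference database.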
9. Automatic Translation
Introduce a standard/reference database
Must be:
– highly reliable
– well-curated
Example: SWISS-PROT
[Diagram: Target, Standard, XDB]
10. Extract Reference Entries
Use XDB to extract entries from standard database
Example: Pfam:PF00509 Hemagglutinin
HEMA_IAVI7/P03435
HEMA_IANT6/P03436
HEMA_IAAIC/P03437
HEMA_IAX31/P03438
HEMA_IAME2/P03439
HEMA_IAEN7/P03440
HEMA_IABAN/P03441
HEMA_IADU3/P03442
HEMA_IADA1/P03443
HEMA_IADMA/P03444
HEMA_IADM1/P03445
HEMA_IADA2/P03446
HEMA_IASH5/P03447
[Diagram: TrEMBL, SWISS-PROT, Pfam]
11. Extract Common Annotation
132 entries read
131 ID HEMA_XXXXX
125 DE HEMAGGLUTININ PRECURSOR.
  6 DE HEMAGGLUTININ.
131 GN HA
130 CC -!- FUNCTION: HEMAGGLUTININ IS RESPONSIBLE FOR ATTACHING THE
130 CC VIRUS TO CELL RECEPTORS AND FOR INITIATING INFECTION.
125 CC -!- SUBUNIT: HOMOTRIMER. EACH OF THE MONOMER IS FORMED BY TWO
125 CC CHAINS (HA1 AND HA2) LINKED BY A DISULFIDE BOND.
 75 DR HSSP; P03437; 1HGD.
 31 DR HSSP; P03437; 1DLH.
131 KW HEMAGGLUTININ; GLYCOPROTEIN; ENVELOPE PROTEIN
102 KW SIGNAL
  1 KW COAT PROTEIN; POLYPROTEIN; 3D-STRUCTURE
130 FT CHAIN HA1 CHAIN.
107 FT CHAIN HA2 CHAIN.
102 FT SIGNAL
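Counts like those above can be produced by tallying, for every annotation line, the number of reference entries that contain it, and keeping the lines shared by most of the group. The sketch below assumes each entry is available as a text string; the routine name `common_annotation` and the default threshold of 0.9 are illustrative choices, not values from the slides.

```perl
use strict;
use warnings;

# Return the annotation lines that occur in at least $min_fraction of
# the given entries (array reference of entry texts), sorted.
sub common_annotation {
    my ($entries, $min_fraction) = @_;
    $min_fraction = 0.9 unless defined $min_fraction;   # illustrative default

    my %count;
    for my $entry (@$entries) {
        my %seen;    # count each distinct line only once per entry
        for my $line (split /\n/, $entry) {
            $count{$line}++ unless $seen{$line}++;
        }
    }

    my $needed = $min_fraction * @$entries;
    my @common = sort grep { $count{$_} >= $needed } keys %count;
    return @common;
}

my @group = (
    "GN   HA\nKW   SIGNAL\n",
    "GN   HA\n",
    "GN   HA\nKW   SIGNAL\n",
);
print "$_\n" for common_annotation(\@group, 0.6);
```

A high threshold keeps only annotation that nearly every reference entry agrees on, at the price of transferring less.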
12. Store Common Annotation
Store the used pattern and the extracted common annotation in a separate database
[Diagram: Target, Standard, XDB, Common]
13. Add Annotation to Target
Extract entries from target
Add common annotation to the entries
[Diagram: Target, Standard, XDB, Common]
14. Modelling of the Rules
Definition of condition types
Definition of action types
Encoding the logic
Storage and retrieval of the rules
Version control
Monitoring the results
15. Formal Language for the Rules
# Comment
#RULE RU000001
#DATE 1997-04-23
? Condition
?PSAC PS00057
?SPOC PLANTA
! Action
!SPDE L-LACTATE DEHYDROGENASE
!ECNO 1.1.1.27
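A rule in this notation is easy to read mechanically, because the first character of each line gives its kind and the following code gives its type. The parser below is a sketch under that reading of the slide; the routine name `parse_rule` and the shape of the returned hash are assumptions.

```perl
use strict;
use warnings;

# Parse one rule into a hash reference with three parts:
#   meta       => { RULE => ..., DATE => ... }   ('#' lines)
#   conditions => [ [TYPE, value], ... ]         ('?' lines)
#   actions    => [ [TYPE, value], ... ]         ('!' lines)
sub parse_rule {
    my ($text) = @_;
    my %rule = (meta => {}, conditions => [], actions => []);
    for my $line (split /\n/, $text) {
        # Skip anything that is not "<kind><TYPE> <value>".
        my ($kind, $type, $value) = $line =~ /^([#?!])(\w+)\s+(.*)$/ or next;
        if    ($kind eq '#') { $rule{meta}{$type} = $value }
        elsif ($kind eq '?') { push @{ $rule{conditions} }, [ $type, $value ] }
        else                 { push @{ $rule{actions} },    [ $type, $value ] }
    }
    return \%rule;
}

my $text = <<'END';
#RULE RU000001
#DATE 1997-04-23
?PSAC PS00057
?SPOC PLANTA
!SPDE L-LACTATE DEHYDROGENASE
!ECNO 1.1.1.27
END

my $rule = parse_rule($text);
print "$rule->{meta}{RULE}: ",
      scalar(@{ $rule->{conditions} }), " conditions, ",
      scalar(@{ $rule->{actions} }), " actions\n";
```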
16. Implementation of Condition Types
Every condition type must be implemented
Example: Perl routine for ‘?PSAC’: does the protein have a link to a given PROSITE entry?
sub condition_PSAC {
    # The entry text is expected in $_.
    my $ac = shift;
    return /^DR PROSITE; $ac/m;
}
17. Implementation of Action Types
Every action type must be implemented
Example: Add enzyme code to the entry.
sub action_ECNO {
    # Appends the E.C. number to the DE line of the entry in $_.
    my $ecno = shift;
    s/^DE.*$/$& (EC $ecno)/m;
}
or
insert into Trembl2Enzyme values (acc,ecno);
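One way to connect condition and action routines to a rule is a pair of dispatch tables keyed by the type code. The sketch below adapts the two routines from the slides so that they receive a reference to the entry text instead of using `$_`; the tables `%condition` and `%action`, the routine `apply_rule`, and the rule layout (lists of type/value pairs) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Type code -> implementation. Each routine gets a reference to the
# entry text and the value given in the rule.
my %condition = (
    PSAC => sub { my ($entry, $ac) = @_; $$entry =~ /^DR\s+PROSITE; \Q$ac\E/m },
);
my %action = (
    ECNO => sub { my ($entry, $ecno) = @_; $$entry =~ s/^(DE.*)$/$1 (EC $ecno)/m },
);

# Run the actions of a rule only if all of its conditions hold.
# Returns 1 if the rule fired, 0 otherwise.
sub apply_rule {
    my ($entry, $rule) = @_;
    for my $cond (@{ $rule->{conditions} }) {
        my ($type, $value) = @$cond;
        my $check = $condition{$type} or die "unknown condition type $type";
        return 0 unless $check->($entry, $value);
    }
    for my $act (@{ $rule->{actions} }) {
        my ($type, $value) = @$act;
        my $run = $action{$type} or die "unknown action type $type";
        $run->($entry, $value);
    }
    return 1;
}

my $rule = {
    conditions => [ [ PSAC => 'PS00057' ] ],
    actions    => [ [ ECNO => '1.1.1.27' ] ],
};
my $entry = "DE   HYPOTHETICAL PROTEIN\nDR   PROSITE; PS00057; EXAMPLE\n";
print $entry if apply_rule(\$entry, $rule);
```

Supporting a new condition or action type then means adding one routine to the matching table, without touching `apply_rule`.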
18. Encoding the Logic
Any logical expression like
a AND (b OR c) BUT NOT d
can be written without brackets as
a AND b AND NOT d OR a AND c AND NOT d
Rules can be identified by their conditions
"a&b&-d|a&c&-d"
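The bracket-free form is a list of alternatives, each of which is a list of possibly negated conditions, and that maps directly onto nested Perl arrays. The sketch below builds the identifying string and evaluates the expression; the routine names `rule_key` and `holds` and the data layout are assumptions.

```perl
use strict;
use warnings;

# Alternatives (OR) of conjunctions (AND); a leading "-" means NOT.
my @dnf = ( [ 'a', 'b', '-d' ], [ 'a', 'c', '-d' ] );

# Build the identifying string, e.g. "a&b&-d|a&c&-d".
sub rule_key {
    my @terms = @_;
    return join('|', map { join('&', @$_) } @terms);
}

# Evaluate the expression against a hash of condition results.
sub holds {
    my ($truth, @terms) = @_;
    TERM: for my $term (@terms) {
        for my $literal (@$term) {
            my ($negated, $name) = $literal =~ /^(-?)(.+)$/;
            my $value = $truth->{$name} ? 1 : 0;
            # This alternative fails as soon as one literal is not satisfied.
            next TERM if $negated ? $value : !$value;
        }
        return 1;    # every literal of this alternative is satisfied
    }
    return 0;
}

print rule_key(@dnf), "\n";
print holds({ a => 1, c => 1 }, @dnf) ? "holds\n" : "does not hold\n";
```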
19. Automatic Annotation of TrEMBL
Extract conditions from XDB
Group SWISS-PROT by conditions
Extract common annotation
Group TrEMBL by conditions
Add common annotation to TrEMBL
[Diagram: TrEMBL, SWISS-PROT, PROSITE, RuleBase, Pfam, ENZYME]
20. Results: RuleBase
Source: PROSITE patterns
262 rules
597 conditions
1099 actions
Result:
2951 of 29330 new TrEMBL 5 entries
1443 of 15078 new TrEMBL 6 entries
9658 of 106330 existing TrEMBL 5 entries
3254 of 140635 existing TrEMBL 6 entries
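The coverage implied by these counts can be worked out with a few lines of Perl. The numbers are taken from the slide; the routine name `coverage_percent` and the one-decimal rounding are illustrative choices.

```perl
use strict;
use warnings;

# [annotated entries, total entries, data set] as given on the slide.
my @results = (
    [ 2951,  29330, 'new TrEMBL 5' ],
    [ 1443,  15078, 'new TrEMBL 6' ],
    [ 9658, 106330, 'existing TrEMBL 5' ],
    [ 3254, 140635, 'existing TrEMBL 6' ],
);

# Share of entries that received annotation, in percent with one decimal.
sub coverage_percent {
    my ($annotated, $total) = @_;
    return sprintf('%.1f', 100 * $annotated / $total);
}

for my $row (@results) {
    my ($annotated, $total, $label) = @$row;
    printf "%-20s %5s%%\n", $label, coverage_percent($annotated, $total);
}
```

This gives roughly 10.1%, 9.6%, 9.1% and 2.3% for the four data sets.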
21. Results: Keywords in TrEMBL
[Chart: keywords per entry (axis label KW/Entry, ticks 0 to 18; other axis ticks 0 to 80000) for SWISS-PROT, TrEMBLnew and TrEMBL. Data labels: 135.555, 243.125, 12.073]
Entries with keywords:
TrEMBLnew 10.970 (32%)
TrEMBL 76.249 (51%)
SWISS-PROT 70.973 (97%)
22. Results: TrEMBL Annotation
[Chart: lines per entry (axis label Lines/Entry, ticks 0 to 45; other axis ticks 0 to 60000)]
Entries with keywords:
TrEMBLnew 10.970 (32%)
TrEMBL 76.249 (51%)
SWISS-PROT 70.973 (97%)
23. Discussion
Stable and reliable, successfully added 68000 lines to TrEMBL
Carefully set thresholds, therefore low coverage
Restricted language better than free text
Feed-back loop between SWISS-PROT and TrEMBL
Rules may be implemented in set-oriented language
Position specific annotation may be improved by alignments
Independent of hierarchy
Based on multiple entries
24. Dynamic Updates
[Diagram: SWISS-PROT, TREMBL, common, spacs, conditions, tracs, XDB]
25. Where to get TrEMBL
SWISS-PROT: sprot.txl
TrEMBL: trembl.txl
TrEMBL new: trembl_new.txl
[Diagram label: sp tr]
ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/
26. Credits
SWISS-PROT at EBI
Rolf Apweiler
Sergio Contrino
Wolfgang Fleischmann
Henning Hermjakob
Viv Junker
Fiona Lang
Claire O'Donovan
Michele Magrane
Maria Jesus Martin
Nicoletta Mitaritonna
Steffen Moeller
Stephanie Kappus
Collaborators
Amos Bairoch
Alain Gateau
Jean-Jacques Codani
Keith Tipton
MGD
Flybase
Pfam
Network of > 200 external experts