

EMBL Outstation - The European Bioinformatics Institute

Automatic and Reliable Functional Annotation of Proteins

The Target Database

Your data
Uncharacterized
Any kind of data
– Protein sequences
– Gene sequences
– etc.
Our target: TrEMBL

[Diagram: Target]

The External Database

Collection of conditions
– Sequence patterns
– Profiles
– HMMs
– E.C. numbers
– Protein clusters
Example:
– PROSITE
– Pfam

[Diagram: Target, XDB]

Direct Transfer

Search target
Transfer annotation to target database
Example: Look up E.C. number and add recommended enzyme name

[Diagram: Target, XDB]
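The direct transfer described on this slide might look roughly like the Perl sketch below. It is an illustration under stated assumptions, not the original implementation: the table %enzyme_name, the routine add_enzyme_name, the entry ID Q12345 and the single-spaced DE line are all hypothetical. Only the E.C. number 1.1.1.27 and its enzyme name come from the rule example in this talk.

```perl
use strict;
use warnings;

# Hypothetical excerpt of an E.C. number -> recommended name table,
# standing in for a lookup in the ENZYME database.
my %enzyme_name = (
    '1.1.1.27' => 'L-LACTATE DEHYDROGENASE',
);

# Return the entry text with a DE line added for the given E.C. number.
# Entries whose E.C. number is not in the table are returned unchanged.
sub add_enzyme_name {
    my ($entry, $ecno) = @_;
    my $name = $enzyme_name{$ecno};
    return $entry unless defined $name;
    return $entry . "DE $name (EC $ecno).\n";
}

# Hypothetical target entry, reduced to its ID line.
my $entry = "ID Q12345 PRELIMINARY\n";
print add_enzyme_name($entry, '1.1.1.27');
```

Leaving the entry untouched when the lookup fails keeps the transfer conservative: nothing is added unless the external database actually supplies a name.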

Multiple Sources

Usually more than one external database is used
Combine the different results

[Diagram: Target, XDB]

Conflicts

Contradiction
Inconsistencies
Synonyms
Redundancy

Translation

Use a translator to map XDB language to target language

[Diagram: Target, XDB]

Translation Examples

ENZYME → TrEMBL
ENZYME: CA L-ALANINE=D-ALANINE
TrEMBL: CC -!- CATALYTIC ACTIVITY: L-ALANINE=
        CC D-ALANINE.

PROSITE → TrEMBL
PROSITE: /SITE=3,heme_iron
TrEMBL:  FT METAL IRON

Pfam → TrEMBL
Pfam:   FT DOMAIN zf_C3HC4
TrEMBL: FT ZN_FING C3HC4-TYPE
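A hand-written translator for these three examples could be sketched in Perl as a table of pattern and builder pairs. The names %translator and translate are hypothetical, the single-space line layout follows the slide rather than the real flat-file format, and the wrapping of long CC lines is left out.

```perl
use strict;
use warnings;

# One rule per source database: a pattern for the XDB line and a
# routine that builds the TrEMBL line. Only the three mappings shown
# on the slide are covered.
my %translator = (
    ENZYME  => [ qr/^CA (.+)$/,              sub { "CC -!- CATALYTIC ACTIVITY: $_[0]." } ],
    PROSITE => [ qr/^\/SITE=\d+,heme_iron$/, sub { 'FT METAL IRON' } ],
    Pfam    => [ qr/^FT DOMAIN zf_C3HC4$/,   sub { 'FT ZN_FING C3HC4-TYPE' } ],
);

# Translate one XDB line; return undef when no rule applies.
sub translate {
    my ($xdb, $line) = @_;
    my $rule = $translator{$xdb} or return;
    my ($pattern, $build) = @$rule;
    my @captures = $line =~ $pattern or return;
    return $build->(@captures);
}

print translate('ENZYME',  'CA L-ALANINE=D-ALANINE'), "\n";
print translate('PROSITE', '/SITE=3,heme_iron'), "\n";
print translate('Pfam',    'FT DOMAIN zf_C3HC4'), "\n";
```

Every new annotation term needs another hand-written entry in such a table, which is the maintenance burden that deriving the translation automatically is meant to avoid.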

Automatic Translation

Introduce a standard/reference database
Must be:
– highly reliable
– well-curated
Example: SWISS-PROT

[Diagram: Target, Standard, XDB]

Extract Reference Entries

Use XDB to extract entries from standard database
Example: Pfam:PF00509 Hemagglutinin
HEMA_IAVI7/P03435
HEMA_IANT6/P03436
HEMA_IAAIC/P03437
HEMA_IAX31/P03438
HEMA_IAME2/P03439
HEMA_IAEN7/P03440
HEMA_IABAN/P03441
HEMA_IADU3/P03442
HEMA_IADA1/P03443
HEMA_IADMA/P03444
HEMA_IADM1/P03445
HEMA_IADA2/P03446
HEMA_IASH5/P03447

[Diagram: TrEMBL, SWISS-PROT, Pfam]

Extract Common Annotation

132 entries read
131 ID HEMA_XXXXX
125 DE HEMAGGLUTININ PRECURSOR.
  6 DE HEMAGGLUTININ.
131 GN HA
130 CC -!- FUNCTION: HEMAGGLUTININ IS RESPONSIBLE FOR ATTACHING THE
130 CC VIRUS TO CELL RECEPTORS AND FOR INITIATING INFECTION.
125 CC -!- SUBUNIT: HOMOTRIMER. EACH OF THE MONOMER IS FORMED BY TWO
125 CC CHAINS (HA1 AND HA2) LINKED BY A DISULFIDE BOND.
 75 DR HSSP; P03437; 1HGD.
 31 DR HSSP; P03437; 1DLH.
131 KW HEMAGGLUTININ; GLYCOPROTEIN; ENVELOPE PROTEIN
102 KW SIGNAL
  1 KW COAT PROTEIN; POLYPROTEIN; 3D-STRUCTURE
130 FT CHAIN HA1 CHAIN.
107 FT CHAIN HA2 CHAIN.
102 FT SIGNAL
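The counts in this listing suggest a simple procedure: count in how many reference entries each annotation line occurs, then keep the lines shared by most of them. The Perl sketch below shows one way to do that. The routine name common_annotation, the 0.9 default threshold and the three toy entries are assumptions for illustration; the talk does not state the thresholds actually used.

```perl
use strict;
use warnings;

# Count in how many entries each annotation line occurs and return the
# lines found in at least $min_fraction of the entries, sorted.
# The 0.9 default is an assumed threshold, not a value from the talk.
sub common_annotation {
    my ($entries, $min_fraction) = @_;
    $min_fraction = 0.9 unless defined $min_fraction;
    my %count;
    for my $entry (@$entries) {
        my %seen;    # count each line at most once per entry
        $count{$_}++ for grep { !$seen{$_}++ } split /\n/, $entry;
    }
    my $needed = $min_fraction * @$entries;
    my @common = grep { $count{$_} >= $needed } keys %count;
    return sort @common;
}

# Toy entries, heavily simplified from the lines on the slide.
my @entries = (
    "DE HEMAGGLUTININ PRECURSOR.\nGN HA\nKW SIGNAL",
    "DE HEMAGGLUTININ PRECURSOR.\nGN HA",
    "DE HEMAGGLUTININ.\nGN HA\nKW SIGNAL",
);
print "$_\n" for common_annotation(\@entries, 0.6);
```

With a threshold of 0.6 the toy run keeps the three lines that occur in at least two of the three entries and drops "DE HEMAGGLUTININ.", which occurs only once. A stricter threshold trades coverage for reliability.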

Store Common Annotation

Store the used pattern and the extracted common annotation in a separate database

[Diagram: Target, Standard, XDB, Common]

Add Annotation to Target

Extract entries from target
Add common annotation to the entries

[Diagram: Target, Standard, XDB, Common]

Modelling of the Rules

Definition of condition types
Definition of action types
Encoding the logic
Storage and retrieval of the rules
Version control
Monitoring the results

Formal Language for the Rules

# Comment
#RULE RU000001
#DATE 1997-04-23

? Condition
?PSAC PS00057
?SPOC PLANTA

! Action
!SPDE L-LACTATE DEHYDROGENASE
!ECNO 1.1.1.27
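A reader for this line-prefix format could be sketched in Perl as below. The routine parse_rule and the layout of the returned hash are hypothetical. The sketch assumes each line is a prefix character, a type code, whitespace and a value, as in the example, and it treats the '#' comment lines as carrying the rule ID and date.

```perl
use strict;
use warnings;

# Parse one rule: '#' lines are comments (here rule ID and date),
# '?' lines are conditions and '!' lines are actions.
sub parse_rule {
    my ($text) = @_;
    my %rule = (meta => {}, conditions => [], actions => []);
    for my $line (split /\n/, $text) {
        my ($kind, $type, $value) = $line =~ /^([#?!])(\w+)\s+(.*\S)/
            or next;    # skip blank or malformed lines
        if    ($kind eq '#') { $rule{meta}{$type} = $value }
        elsif ($kind eq '?') { push @{ $rule{conditions} }, [ $type, $value ] }
        else                 { push @{ $rule{actions} },    [ $type, $value ] }
    }
    return \%rule;
}

my $rule = parse_rule(<<'END');
#RULE RU000001
#DATE 1997-04-23
?PSAC PS00057
?SPOC PLANTA
!SPDE L-LACTATE DEHYDROGENASE
!ECNO 1.1.1.27
END

printf "%s: %d conditions, %d actions\n",
    $rule->{meta}{RULE},
    scalar @{ $rule->{conditions} },
    scalar @{ $rule->{actions} };
```

Keeping the type code (PSAC, ECNO and so on) separate from its value makes it easy to dispatch each line to a routine named after the code.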

Implementation of Condition Types

Every condition type must be implemented
Example: Perl routine for ‘?PSAC’: does the protein have a link to a given PROSITE entry?

    # Matches against the entry text held in $_.
    sub condition_PSAC {
        my $ac = shift;
        # \Q...\E keeps the accession from being read as a pattern
        return /^DR\s+PROSITE; \Q$ac\E/m;
    }

Implementation of Action Types

Every action type must be implemented
Example: Add enzyme code to the entry.

    # Appends the E.C. number to the DE line of the entry held in $_.
    sub action_ECNO {
        my $ecno = shift;
        s/^(DE.*)$/$1 (EC $ecno)/m;
    }

or

    insert into Trembl2Enzyme values (acc,ecno);

Encoding the Logic

Any logical expression like
    a AND (b OR c) BUT NOT d
can be written without brackets as
    a AND b AND NOT d OR a AND c AND NOT d

Rules can be identified by their conditions
    "a&b&-d|a&c&-d"
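The equivalence of the bracketed and bracket-free forms can be checked mechanically. The Perl sketch below evaluates a condition key written in the "a&b&-d|a&c&-d" notation and compares it with the bracketed expression over all 16 truth assignments. The routine eval_key is hypothetical, and the reading of '|' as OR, '&' as AND and a leading '-' as NOT is inferred from the example.

```perl
use strict;
use warnings;

# Evaluate a bracket-free key such as "a&b&-d|a&c&-d".
# '|' separates alternatives, '&' joins literals, '-' negates a literal.
# $true is a hash reference whose keys are the conditions that hold.
sub eval_key {
    my ($key, $true) = @_;
    ALTERNATIVE:
    for my $alternative (split /\|/, $key) {
        for my $literal (split /&/, $alternative) {
            my ($negated, $name) = $literal =~ /^(-?)(\w+)$/
                or die "bad literal '$literal'";
            my $holds = exists $true->{$name} ? 1 : 0;
            # A literal fails when it holds but is negated, or vice versa.
            next ALTERNATIVE if $holds == ($negated ? 1 : 0);
        }
        return 1;    # every literal of this alternative is satisfied
    }
    return 0;
}

# Compare with a AND (b OR c) AND NOT d for every truth assignment.
my $mismatches = 0;
for my $bits (0 .. 15) {
    my %v;
    @v{qw(a b c d)} = map { ($bits >> $_) & 1 } 0 .. 3;
    my %true = map { $_ => 1 } grep { $v{$_} } keys %v;
    my $bracketed = ($v{a} && ($v{b} || $v{c}) && !$v{d}) ? 1 : 0;
    $mismatches++ if $bracketed != eval_key('a&b&-d|a&c&-d', \%true);
}
print "mismatches: $mismatches\n";
```

Because the flat form needs no parser for nested brackets, the key string itself can serve as a rule identifier, provided literals and alternatives are written in a fixed order.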

Automatic Annotation of TrEMBL

Extract conditions from XDB
Group SWISS-PROT by conditions
Extract common annotation
Group TrEMBL by conditions
Add common annotation to TrEMBL

[Diagram: TrEMBL, SWISS-PROT, PROSITE, RuleBase, Pfam, ENZYME]

Results: RuleBase

Source: PROSITE patterns
262 rules
597 conditions
1099 actions

Result:
2951 of 29330 new TrEMBL 5 entries
1443 of 15078 new TrEMBL 6 entries
9658 of 106330 existing TrEMBL 5 entries
3254 of 140635 existing TrEMBL 6 entries
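Expressed as percentages, these counts show how much of each release the rules reached. The short Perl sketch below does the arithmetic. The routine name coverage and the one-decimal rounding are choices made for illustration; the counts are copied from the slide.

```perl
use strict;
use warnings;

# Annotated and total entry counts, copied from the slide.
my @results = (
    [ 'new TrEMBL 5',      2951,  29330 ],
    [ 'new TrEMBL 6',      1443,  15078 ],
    [ 'existing TrEMBL 5', 9658, 106330 ],
    [ 'existing TrEMBL 6', 3254, 140635 ],
);

# Percentage of entries that received annotation, to one decimal place.
sub coverage {
    my ($annotated, $total) = @_;
    return sprintf '%.1f', 100 * $annotated / $total;
}

printf "%-20s %5s%%\n", $_->[0], coverage(@$_[1, 2]) for @results;
```

The four rows come out at about 10.1%, 9.6%, 9.1% and 2.3%, so the rules touch roughly one entry in ten at most.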

Results: Keywords in TrEMBL

[Chart: keywords per entry (KW/Entry, axis 0 to 18) for SWISS-PROT, TREMBLnew and TREMBL; second axis 0 to 80000; unlabelled values 135.555, 243.125, 12.073]

Entries with Keywords
TrEMBLnew 10.970 (32%)
TrEMBL 76.249 (51%)
SWISS-PROT 70.973 (97%)

Results: TrEMBL Annotation

[Chart: lines per entry (Lines/Entry, axis 0 to 45); second axis 0 to 60000]

Entries with Keywords
TrEMBLnew 10.970 (32%)
TrEMBL 76.249 (51%)
SWISS-PROT 70.973 (97%)

Discussion

Stable and reliable, successfully added 68000 lines to TrEMBL
Carefully set thresholds, therefore low coverage
Restricted language better than free text
Feedback loop between SWISS-PROT and TrEMBL
Rules may be implemented in set-oriented language
Position-specific annotation may be improved by alignments
Independent of hierarchy
Based on multiple entries

Dynamic Updates

[Diagram: SWISS-PROT, TREMBL, common, spacs, conditions, tracs, XDB]

Where to get TrEMBL

SWISS-PROT: sprot.txl
TrEMBL: trembl.txl
TrEMBL new: trembl_new.txl

[Diagram: sp, tr]

ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

Credits

SWISS-PROT at EBI
Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Henning Hermjakob, Viv Junker, Fiona Lang, Claire O'Donovan, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Steffen Moeller, Stephanie Kappus

Collaborators
Amos Bairoch, Alain Gateau, Jean-Jacques Codani, Keith Tipton, MGD, Flybase, Pfam, Network of > 200 external experts