robots and automatic genome annotation

Robots and Automatic Genome Annotation Ross D. King Department of Computer Science University of Wales, Aberystwyth


Post on 13-Jan-2016


TRANSCRIPT

Page 1: Robots and Automatic Genome Annotation

Robots and Automatic Genome Annotation

Ross D. King Department of Computer Science University of Wales, Aberystwyth

Page 2: Robots and Automatic Genome Annotation

Talk Plan

Data Mining based gene function prediction

The Robot Scientist

Automating annotation and experimentation

Page 3: Robots and Automatic Genome Annotation

Data Mining Prediction

We have developed a method for predicting the functional class of gene products based on data mining.

The idea is to learn a reliable predictive function on the examples of genes with products of known function.

Then apply this function to genes where the functional class is unknown.

Applied to: E. coli, M. tuberculosis, S. cerevisiae, A. thaliana.

We call this approach: Data Mining Prediction (DMP).

Page 4: Robots and Automatic Genome Annotation

Classification schemes (MIPS/GO)

1,0,0,0 "METABOLISM"
  1,1,0,0 "amino acid metabolism"
    1,1,1,0 "amino acid biosynthesis"
    1,1,4,0 "regulation of amino acid metabolism"
    1,1,7,0 "amino acid transport"
    1,1,10,0 "amino acid degradation (catabolism)"
    1,1,99,0 "other amino acid metabolism activities"
  1,2,0,0 "nitrogen and sulfur metabolism"
  1,3,0,0 "nucleotide metabolism"
  1,4,0,0 "phosphate metabolism"
  1,5,0,0 "C-compound and carbohydrate metabolism"
  1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
  1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
  1,20,0,0 "secondary metabolism"

... and ORFs may have multiple functions too!

Hierarchy of classes
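The hierarchy and multi-label points can be made concrete with a small sketch. It assumes that a class code is a 4-tuple in which trailing zeros mean "no further refinement", as in the listing above; the function names `ancestors` and `labels_by_level` are illustrative and not part of DMP.

```python
def ancestors(code):
    """Return the class at every level of the hierarchy, from level 1 down to `code`.

    Assumes a code such as (1, 1, 10, 0), where trailing zeros mark unused levels.
    """
    parts = list(code)
    depth = max((i + 1 for i, part in enumerate(parts) if part != 0), default=0)
    return [tuple(parts[:level]) + (0,) * (len(parts) - level)
            for level in range(1, depth + 1)]


def labels_by_level(orf_classes):
    """Group the (possibly multiple) classes of one ORF by hierarchy level."""
    levels = {}
    for code in orf_classes:
        for level, ancestor in enumerate(ancestors(code), start=1):
            levels.setdefault(level, set()).add(ancestor)
    return levels


# An ORF annotated with both "amino acid degradation" and "nucleotide metabolism".
example = labels_by_level([(1, 1, 10, 0), (1, 3, 0, 0)])
```

Here `example[2]` holds two level-2 labels for the same ORF, which is why a learner for this task has to cope with several labels per example at each level.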

Page 5: Robots and Automatic Genome Annotation

Sequence Data

478 attributes in total

field             description                                    type
aa_rat_X          % of amino acid X in the protein               real
seq_len           length of the protein sequence                 int
aa_rat_pair_X_Y   % of the amino acids X and Y consecutively     real
mol_wt            molecular weight of the protein                int
theo_pI           theoretical pI (isoelectric point)             real
atomic_comp_X     atomic composition of X (C,H,N,O,S)            real
aliphatic_index   aliphatic index                                real
hydro             grand average of hydropathy                    real
strand            the DNA strand                                 'w' or 'c'
position          the number of exons (no. of start positions)   int
cai               codon adaptation index                         real
motifs            number of PROSITE motifs                       int
tmSpans           number of transmembrane spans                  int
chromosome        chromosome number                              1..16,mit

Page 6: Robots and Automatic Genome Annotation

Homology data

YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....

PSI-BLAST search against the sequence database NRDB:

gene     organism           score
tfc      baker's yeast      0.0
sfc3     fission yeast      1.0e-18
wsv442   white spot virus   2.1
cg9463   fruit fly          2.9
f1l3     Arabidopsis        3.0

We look up the associated information from SwissProt, for example:

sfc3: keyword(membrane), length(358), dbref(prosite), dbref(embl)

Page 7: Robots and Automatic Genome Annotation

Predicted Secondary Structure Data

mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...

We record length and relative positions of the secondary structure elements.

This is relational data.
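A minimal sketch of how such a prediction string could be turned into relational facts, assuming the one-letter codes a (alpha), b (beta) and c (coil) used above. The function names and the tuple layout are illustrative and are not the representation used by the authors.

```python
from itertools import groupby


def structure_elements(prediction):
    """Run-length encode a secondary structure string.

    Returns a list of (kind, start, length) tuples, one per structural element.
    """
    elements = []
    start = 0
    for kind, run in groupby(prediction):
        length = len(list(run))
        elements.append((kind, start, length))
        start += length
    return elements


def followed_by(elements):
    """Pair each element with its successor: the 'relative position' relation."""
    return list(zip(elements, elements[1:]))


# Prefix of the predicted structure shown on the slide.
elements = structure_elements("cbbbbccaaaaaaaaaaaaccccc")
pairs = followed_by(elements)
```

Each ORF yields a different number of elements and pairs, so the data does not fit a single fixed-width table.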

Page 8: Robots and Automatic Genome Annotation

Expression Data

Spellman et al (1998), Roth et al (1998), DeRisi et al (1997), Eisen et al (1998), Gasch et al (2000, 2001), Chu et al (1998)

• Microarray experiments to measure expression changes in yeast under a variety of conditions, including cell cycle, heat shock, diauxic shift.

• Short time series data, numerical-valued

              0       7      14      21
YBR166C    0.33   -0.17    0.04   -0.07
YOR357C   -0.64   -0.38   -0.32   -0.29
YLR292C   -0.23    0.19   -0.36    0.14
YGL112C   -0.69   -0.89   -0.74   -0.56
...

Page 9: Robots and Automatic Genome Annotation

Phenotype Data

• Data from knockout gene growth experiments
• Many missing data
• Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)

s = sensitive (less growth)
w = wild-type (no observable effect)
r = resistant (more growth)
n = no data

Rows are deleted ORFs, columns are growth media:

ORF       calcofluor white   sorbitol   benomyl   H2O2   ...
YAL001C   w                  n          n         w
YAL019W   n                  s          w         w
YAL021C   n                  n          n         n
YAL029C   n                  w          w         r

Page 10: Robots and Automatic Genome Annotation

What are the Machine Learning Issues?

• Large volume of data
• Missing data
• Accurate results required
• Intelligible results required
• Class hierarchy
• Multiple labels
• Relational data

Page 11: Robots and Automatic Genome Annotation

Data Mining Prediction (DMP)

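[Flow diagram: the entire database is split 2/3 : 1/3 into data for rule creation and test data. The data for rule creation is split again 2/3 : 1/3 into training data and validation data. Rule generation (PolyFARM / C4.5) on the training data produces all rules; the validation data is used to select the best rules; the accuracy of the best rules is then measured on the test data to give the results.]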
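The nested 2/3 : 1/3 partitioning in this diagram can be sketched as follows. This is only an illustration of the split: the function names, the random shuffling and the seed are assumptions, and the rule learners themselves (PolyFARM, C4.5) are not reproduced.

```python
import random


def split_two_thirds(items, rng):
    """Shuffle a copy of `items` and split it 2/3 : 1/3."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def dmp_partitions(orfs, seed=0):
    """Partition ORFs into training, validation and test sets.

    Rules would be generated on `training`, the best rules selected on
    `validation`, and their accuracy measured once on `test`.
    """
    rng = random.Random(seed)
    rule_creation, test = split_two_thirds(orfs, rng)
    training, validation = split_two_thirds(rule_creation, rng)
    return {"training": training, "validation": validation, "test": test}


# Hypothetical ORF identifiers, used only to show the resulting set sizes.
parts = dmp_partitions([f"orf{i}" for i in range(900)])
```

Keeping the test set out of both rule generation and rule selection is what lets the final accuracy figures be read as estimates for unseen ORFs.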

Page 12: Robots and Automatic Genome Annotation

Application to Bacterial Genomes

Successful for both M. tuberculosis and E. coli.

Of the ORFs with no assigned function >40% were predicted to have a function at one or more levels of the class hierarchy.

It was found that many of the predictive rules were more general than possible using sequence homology.

References

King et al. (2000) KDD 2000

King et al. (2000) Yeast (Comparative and Functional Genomics)

King et al. (2001) Bioinformatics

Page 13: Robots and Automatic Genome Annotation

Summary Results (Bacteria)

Using voting (2 or more rules agree on a prediction)
– Level 2: 128 ORFs predicted - 87.5% accuracy
– Level 3: 23 ORFs predicted - 91.3% accuracy

All predictions
– Level 2: 335 ORFs predicted - 64.5% accuracy
– Level 3: 204 ORFs predicted - 44.6% accuracy

Page 14: Robots and Automatic Genome Annotation

Example Rule (level 2 E. coli)

If the ORF is not predicted to have a β-strand of length 3
and a homologous protein from class Chytridiomycetes was found
Then its functional class is “Cell processes, Transport/binding proteins”

12/13 (86%) correct on Test Set - probability of this result occurring by chance is estimated at 4×10^-7. 24 ORFs of unknown function are predicted by the rule.

16 ORFs now with putative or confirmed function - 93.8% accurate predictions

Page 15: Robots and Automatic Genome Annotation

Experimental Confirmation

The original bacterial ORF predictions were made over three years ago.

In the intervening time many more ORFs have been sequenced, making traditional homology-based prediction methods more accurate and sensitive, and the function of some ORFs has been determined by wet biology.

The E. coli genome has recently been re-annotated by Monica Riley’s group.

Page 16: Robots and Automatic Genome Annotation

“Wet” Biology Confirmation

A number of predictions have been confirmed or falsified by new “wet” experimental data.

This new data is biased towards hard classes. Despite this the results are still good:
– Level 2: 23 predictions - 47.8% accuracy
– Level 3: 23 predictions - 43.4% accuracy

This is very much better than random as there are many classes.

Page 17: Robots and Automatic Genome Annotation

Confirmation of “Wet” Predictions

ORF | Rule | Predicted Class | Confirmed Function | Result
b0805 | 8 | Cell envelope | Outer membrane protein | C
b1519 | 15 | Degradation of small molecules | Trans-aconitate methyltransferase | C
b1533 | 43 | Transport/binding proteins | Cysteine pathway metabolite transport | C
b1981 | 42 | Transport/binding proteins | Shikimate and dehydroshikimate transport protein | C
b1981 | 56 | Transport/binding proteins | Shikimate and dehydroshikimate transport protein | C
b2210 | 15 | Degradation of small molecules | Malate:quinone oxidoreductase | C
b2392 | 43a | Transport/binding proteins | High-affinity manganese transporter | C
b2392 | 43b | Transport/binding proteins | High-affinity manganese transporter | C
b2392 | 54 | Transport/binding proteins | High-affinity manganese transporter | C
b2924 | 45 | Transport/binding proteins | Component of the MscS mechanosensitive channel – “new gene family” | C
b3839 | 43 | Transport/binding proteins | Essential component of translocase | C
b0103 | 42 | Transport/binding proteins | dephospho-CoA kinase | W
b0103 | 41 | Transport/binding proteins | dephospho-CoA kinase | W
b0103 | 43 | Transport/binding proteins | dephospho-CoA kinase | W
b1822 | 15 | Degradation of small molecules | 23S rRNA m1G745 methyltransferase | W
b2530 | 35 | Global regulatory functions | cysteine desulfurase | W
b2392 | 14 | Degradation of small molecules | High-affinity manganese transporter | W
b2889 | 50 | Energy metabolism carbon | Isopentenyl diphosphate isomerase | W
b3222 | 54 | Transport/binding proteins | ManNAc kinase | W
b3223 | 39 | Ribosome constituents | ManNAc epimerase | W
b3337 | 28 | Laterally acquired elements | regulatory or redox component | W
b3338 | 39 | Ribosome constituents | Periplasmic endochitinase | W
b3569 | 32 | Laterally acquired elements | transcriptional regulator of xylose utilization | W
b3955 | 8 | Cell envelope | Required for invasion of brain microvascular endothelial cells | EF
b3955 | 18 | Energy metabolism carbon | Required for invasion of brain microvascular endothelial cells | EA
b3955 | 20 | Energy metabolism carbon | Required for invasion of brain microvascular endothelial cells | EA

Page 18: Robots and Automatic Genome Annotation

Results (Yeast)

• Many rules from each data type
• Rules at each level of hierarchy
• Some classes are much easier to predict than others (for example "protein synthesis" at 71-93%, "energy" at 20-47%)
• Good levels of accuracy on held out test data
• Many predictions for ORFs of unknown function (some function at some level is predicted for 96% of the ORFs of unknown function)
• Some rules explainable by biology -> scientific knowledge discovery

Clare & King (2003) Bioinformatics suppl. 2., 42-49

Page 19: Robots and Automatic Genome Annotation

Accuracy Table

              Level
Datatype    1    2    3    4   all
Seq        55   55   33    0    71
Struc      49   43    0    0    58
Hom        65   38   69   20    55
Expr       42   37   35    0    75
Phen       75   40    7    0    68

Page 20: Robots and Automatic Genome Annotation

Extension to Arabidopsis Genome

Collaborative project with the Institute of Grassland and Environmental Research and the University of Nottingham.

Large increase in data: 6,000 -> 25,000 ORFs. Large amount of micro-array data from the Nottingham Arabidopsis Stock Centre.

250 million Prolog facts, 200,000 attributes, file sizes almost 2 GB.

7,964 gene function predictions with an expected accuracy >70%, 2,974 with an expected accuracy >90%.

We are currently growing 14 knockout varieties of Arabidopsis to test a sample of these predictions.

Page 21: Robots and Automatic Genome Annotation
Page 22: Robots and Automatic Genome Annotation

Availability

All rules and data available at http://www.aber.ac.uk/compsci/Research/bio/dss/

All predictions available at http://www.genepredictions.org

Page 23: Robots and Automatic Genome Annotation

The Robot Scientist

Page 24: Robots and Automatic Genome Annotation

The Robot Scientist Concept

[Diagram: the Robot Scientist cycle. Background knowledge and machine learning analysis produce consistent hypotheses; experiment selection chooses the experiment(s) for the robot to run; the results feed back into the analysis, and the cycle ends with a final theory.]

The robot scientist project aims to develop a computer system that is capable of originating its own experiments, physically doing them, interpreting the results, and then repeating the cycle.

Page 25: Robots and Automatic Genome Annotation

Motivation: Technological

In many areas of science our ability to generate data is outstripping our ability to analyse the data.

One scientific area where this is true is functional genomics, where data is now being generated on an industrial scale.

The analysis of scientific data needs to become as industrialised as its generation.

Page 26: Robots and Automatic Genome Annotation

The Application Domain

Functional genomics

In yeast (S. cerevisiae) ~30% of the 6,000 genes still have no known function.

EUROFAN 2 has knocked out each of the 6,000 genes in mutant strains.

Task: to determine the “function” of the gene by auxotrophic growth experiments comparing mutants and wild type.

Page 27: Robots and Automatic Genome Annotation

Logical Cell Model

We have built a logical model of the known metabolic pathways (coded in Prolog) - taken from KEGG and other bioinformatic sources. This is essentially a directed graph: with metabolites as nodes and enzymes as arcs.

If a path can be found from cell inputs (metabolites in the growth medium) to all the cell outputs (essential compounds), then the cell can grow.
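The growth criterion can be sketched as plain graph reachability. The original model is coded in Prolog; the Python below is only an illustration, and the toy metabolite and gene names (`sugar`, `geneA`, and so on) are invented for the example rather than taken from KEGG.

```python
from collections import deque


def reachable(arcs, medium, knocked_out=frozenset()):
    """Return every metabolite reachable from the growth medium.

    `arcs` is a list of (source, target, gene) triples: metabolites are nodes
    and each enzyme-catalysed step is an arc labelled with its gene.
    Arcs whose gene is knocked out are ignored.
    """
    outgoing = {}
    for source, target, gene in arcs:
        if gene not in knocked_out:
            outgoing.setdefault(source, []).append(target)
    seen = set(medium)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for target in outgoing.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def can_grow(arcs, medium, essential, knocked_out=frozenset()):
    """The cell grows if every essential compound is reachable from the medium."""
    return set(essential) <= reachable(arcs, medium, knocked_out)


# Toy network (hypothetical names): one input, two essential end products.
toy_arcs = [
    ("sugar", "intermediate", "geneA"),
    ("intermediate", "x", "geneB"),
    ("intermediate", "y", "geneC"),
]
wild_type = can_grow(toy_arcs, {"sugar"}, {"x", "y"})
mutant = can_grow(toy_arcs, {"sugar"}, {"x", "y"}, knocked_out={"geneB"})
rescued = can_grow(toy_arcs, {"sugar", "x"}, {"x", "y"}, knocked_out={"geneB"})
```

In this toy case the `geneB` knockout cannot grow on the basic medium but recovers when `x` is added, which is the pattern an auxotrophic growth experiment looks for.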

Page 28: Robots and Automatic Genome Annotation

AAA Model System

We started using the aromatic amino-acid (AAA) pathway in yeast as a model system to prove the principle of the Robot Scientist.

9 metabolites can be used off the shelf. 15 knockout mutants from EUROFAN.

The mutant can grow iff all three aromatic amino-acids can be synthesised (tyrosine, phenylalanine, tryptophan). Based on a pathway from glycerate-2-phosphate.

Page 29: Robots and Automatic Genome Annotation

[Figure: Phenylalanine, Tyrosine, and Tryptophan Pathways for S. cerevisiae. The pathway runs from the growth medium and metabolite import through glycerate-2-phosphate, phosphoenolpyruvate, D-erythrose-4-phosphate, shikimate and chorismate, then branches via prephenate to tyrosine and phenylalanine and via anthranilate and indole to tryptophan. Metabolite nodes carry KEGG compound IDs (C00631 ... C00082); arcs are labelled with the ORFs encoding each enzyme.]

Page 30: Robots and Automatic Genome Annotation

Experimental Methodology

Experiments consist of making particular growth media and testing if the mutants can grow (add metabolites to a basic defined medium).

A mutant is auxotrophic if it cannot grow on a defined medium that the wild type can grow on.

By observing the pattern of chemicals that recover growth, the function of the knocked-out gene can be inferred.

Page 31: Robots and Automatic Genome Annotation

Inferring Hypotheses

In the philosophy of science it has often been argued that only humans can make the “leaps of imagination” necessary to form hypotheses.

We use Abductive Logic Programming to infer missing arcs/labels in our metabolic graph. With these missing nodes we can explain (deductively) all the experimental results.

Reiser et al., (2001) ETAI 5, 233-244;

Page 32: Robots and Automatic Genome Annotation

The Form of the Hypotheses

The form of the hypotheses we can infer is currently quite simple. Each hypothesis binds a particular gene to an enzyme that catalyses the reaction.
– A correct hypothesis would be that: YDR060C codes for the enzyme for the reaction chorismate → prephenate.
– An incorrect hypothesis would be that: it codes for the enzyme for the reaction chorismate → anthranilate.

We have also demonstrated how more complex abductive hypotheses could be formed.

Page 33: Robots and Automatic Genome Annotation

A Discriminating Experiment

Hypothesis 1: YDR060C codes for the enzyme for the reaction: chorismate → prephenate.

Hypothesis 2: YDR060C codes for the enzyme for the reaction: chorismate → anthranilate.

These can be distinguished by growing the knockout YDR060C on prephenate or anthranilate.

Note that these two experiments will have differing monetary cost.

Page 34: Robots and Automatic Genome Annotation

[Figure: Phenylalanine, Tyrosine, and Tryptophan Pathways for S. cerevisiae, shown again for the discriminating experiment. Metabolite nodes from the growth medium through chorismate to tyrosine, phenylalanine and tryptophan carry KEGG compound IDs; arcs are labelled with the ORFs encoding each enzyme.]

Page 35: Robots and Automatic Genome Annotation

Inferring Experiments

Given a set of hypotheses we wish to infer an experiment that will efficiently discriminate between them.

Assume:
• Every experiment has an associated cost.
• Each hypothesis has a probability of being correct.

The task: to choose a series of experiments which minimise the expected cost of eliminating all but one hypothesis.
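The task can be illustrated with a small exhaustive search. This is a sketch under simplifying assumptions (exactly one hypothesis is true, experiments are noise-free, each hypothesis predicts a definite outcome for each experiment) and is not the selection method actually used by the Robot Scientist. All hypothesis names, experiment names, costs and probabilities are hypothetical.

```python
def min_expected_cost(priors, experiments):
    """Return (expected cost, first experiment) of the cheapest strategy.

    priors:      {hypothesis: probability} for the hypotheses still in play
    experiments: {name: (cost, {hypothesis: predicted outcome})}
    The strategy stops once a single hypothesis remains.
    """
    if len(priors) <= 1:
        return 0.0, None
    total = sum(priors.values())
    best_cost, best_name = float("inf"), None
    for name, (cost, predicted) in experiments.items():
        # Group the remaining hypotheses by the outcome they predict.
        groups = {}
        for hypothesis, probability in priors.items():
            groups.setdefault(predicted[hypothesis], {})[hypothesis] = probability
        if len(groups) < 2:
            continue  # this experiment cannot separate the remaining hypotheses
        remaining = {n: e for n, e in experiments.items() if n != name}
        expected = cost
        for subset in groups.values():
            weight = sum(subset.values()) / total
            expected += weight * min_expected_cost(subset, remaining)[0]
        if expected < best_cost:
            best_cost, best_name = expected, name
    return best_cost, best_name


priors = {"h1": 0.6, "h2": 0.3, "h3": 0.1}
experiments = {
    "cheap":  (1.0,  {"h1": True, "h2": True,  "h3": False}),
    "medium": (2.0,  {"h1": True, "h2": False, "h3": False}),
    "costly": (10.0, {"h1": True, "h2": False, "h3": True}),
}
cost, first = min_expected_cost(priors, experiments)
```

In this example the cheapest experiment is not the best first choice: starting with `medium` has a lower expected total cost because it isolates the most probable hypothesis immediately.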

Page 36: Robots and Automatic Genome Annotation

Comparison of different experimental strategies

ASE - Expected cost minimization.

Naïve - Choose cheapest experiment.

Random - Randomly choose experiments.

The cost of a series of experiments is a function of the time taken and money spent. “Time is Money”.

Page 37: Robots and Automatic Genome Annotation

The Robot

Biomek 200

Page 38: Robots and Automatic Genome Annotation

Closing the Loop

We have physically implemented all aspects of the Robot Scientist system.

To the best of our knowledge this is the first active learning system that both explicitly forms hypotheses and experiments, and physically does real experiments.

Page 39: Robots and Automatic Genome Annotation

Accuracy v Time

[Plot: classification accuracy (%), 50 to 100, against iterations, 0 to 5, for the ASE, Random and Naïve strategies.]

At the end of the 5th iteration: ASE 80.1%, Naïve 74.0%, Random 72.2%. ASE was significantly more accurate than either Naïve (p < 0.05) or Random (p < 0.07) using a paired t-test.

Page 40: Robots and Automatic Genome Annotation

Accuracy v Money

[Plot: classification accuracy (%), 50 to 100, against log10 cost (£), 0 to 4.5, for the ASE, Random and Naïve strategies.]

Given a spend of ≤ £10^2.26, ASE 79.5%, Naïve 73.9%, Random 57.4%. ASE was significantly more accurate than either Naïve (p < 0.05) or Random (p < 0.001).

Page 41: Robots and Automatic Genome Annotation

Time and Money

“Cost” is a positive function of time & money.

ASE dominates for both, therefore ASE dominates for any reasonable cost function.

For example: to achieve an accuracy of ~70%, ASE requires fewer trial iterations, and a hundredth of the price, of Random; and almost half the number of iterations, and a third of the price, of Naïve.

King et al. (2004) Nature. 427, 247-252.

Page 42: Robots and Automatic Genome Annotation

Human Comparisons

We were interested to compare the performance of the Robot Scientist with that of humans.

We adapted the simulator to allow humans to choose experiments and interpret the results over cycles of experimentation.

Compared nine graduate computer scientists and biologists.

No significant difference between the best humans and the Robot.

Page 43: Robots and Automatic Genome Annotation

Robotic Annotation

Page 44: Robots and Automatic Genome Annotation

New Biological Knowledge

So far with the Robot Scientist we have only shown that we can automatically rediscover known biological knowledge.

We wish to extend this result to the discovery of new biological knowledge.

To do this we need to combine the robot scientist with conventional genome annotation bioinformatics, and DMP.

Page 45: Robots and Automatic Genome Annotation

Robotic Annotation

One way of thinking about genome annotation is as a hypothesis formation process.

Hypothesis formation is perhaps the hardest part of automating science.

Our idea is to incorporate bioinformatic annotation methods with genome annotation.

The bioinformatic methods will generate the hypotheses which the robot scientist will experimentally test.

Page 46: Robots and Automatic Genome Annotation

Genome Scale Model of Yeast Metabolism

We have extended our model of aromatic amino acid metabolism to cover most of what is known about yeast metabolism.

• Includes 1,166 ORFs (940 known, 226 inferred)
• Growth if path from growth medium to defined end-points
• 83% accuracy (based on 914 strain/medium predictions)

Page 47: Robots and Automatic Genome Annotation

The Model is Incomplete

It is not possible to find a path from the inputs (growth medium) to all the end-point metabolites using only reactions encoded by known genes.

This suggests automated strategies for determining the identity of the missing genes - new biological knowledge.

One strategy is to take the EC enzyme class of each missing reaction, identify genes that code for this EC class in other organisms, then find homologous genes in yeast.

The predictions can be tested automatically by robot.

Page 48: Robots and Automatic Genome Annotation

Confirmation of DMP Yeast Predictions

The yeast gene YBR147W, of currently “unknown” function.

It is predicted to have a function in “metabolism” by 2 DMP rules with expected accuracies of >80%.

It is predicted to have a function in “amino-acid metabolism” with two rules with expected accuracies of 50% and 60% respectively.

Using our robot scientist auxotrophic methodology we have recovered growth of the knockout with: aspartic acid, tyrosine, leucine, valine, phenylalanine, cystine, arginine.

Page 49: Robots and Automatic Genome Annotation

Conclusions

Machine learning can be used to accurately predict gene function.

Simple forms of scientific reasoning and experimentation can be fully automated.

Developing robotic systems capable of generating new biological knowledge will require a synthesis of traditional genome annotation techniques, machine learning, and a Robot Scientist-like methodology.

Page 50: Robots and Automatic Genome Annotation

The Three Objects of the Intellect

• The True

• The Beautiful

• The Beneficial

Page 51: Robots and Automatic Genome Annotation

Acknowledgements

DMP
  Andreas Karwath, Aberystwyth
  Amanda Clare, Aberystwyth
  Paul Wise, Aberystwyth
  Luc Dehaspe, Leuven

Robot Scientist
  Ken Whelan, Aberystwyth
  Philip Reiser, Aberystwyth
  Ffion Jones, Aberystwyth
  Ugis Sarkans, Aberystwyth (EBI)
  Douglas Kell, Manchester (Aberystwyth)
  Steve Oliver, Manchester
  Stephen Muggleton, Imperial College (York)
  Chris Bryant, Robert Gordons (York)
  David Page, Wisconsin

BBSRC, EPSRC

PharmDM - Commercial Support

Page 52: Robots and Automatic Genome Annotation
Page 53: Robots and Automatic Genome Annotation

Relational vs Propositional

orf       time0   time7   time14
yal001c   0.34    0.52    0.48
yal002w   0.76    0.82    0.89
yal003w   0.77    0.46    0.78
yal004c   0.38    0.50    0.49

orf       SwissProtID   e-val
yal001c   p03415        2e-4
yal001c   p08640        8e-58
yal002w   p32583        6e-52
yal002w   p08775        3e-42

SwissProtID   keyword
p03415        apoptosis
p03415        repeat
p03415        zinc
p08640        membrane

Propositional: single table, fixed number of columns/attributes

Relational: multiple tables, multiple values
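The difference can be sketched with the homology and keyword values shown on this slide. Flattening the relational tables into fixed columns ("propositionalisation") is one common way to bridge the two representations; it is shown here only as an illustration, and the function names and the keyword vocabulary are assumptions.

```python
# Relational data: two tables linked by SwissProt ID (values from the slide).
homology = [
    ("yal001c", "p03415", 2e-4),
    ("yal001c", "p08640", 8e-58),
    ("yal002w", "p32583", 6e-52),
    ("yal002w", "p08775", 3e-42),
]
keywords = [
    ("p03415", "apoptosis"),
    ("p03415", "repeat"),
    ("p03415", "zinc"),
    ("p08640", "membrane"),
]


def keywords_for_orf(orf, max_evalue=1.0):
    """Join the two tables: keywords of every homologue within the e-value cut-off."""
    hits = {sp_id for o, sp_id, evalue in homology if o == orf and evalue <= max_evalue}
    return sorted({kw for sp_id, kw in keywords if sp_id in hits})


def propositionalise(orfs, vocabulary):
    """Flatten to one fixed-width row per ORF: 1 if any homologue has the keyword."""
    return {orf: [int(kw in keywords_for_orf(orf)) for kw in vocabulary] for orf in orfs}


table = propositionalise(["yal001c", "yal002w"], ["membrane", "zinc"])
```

The flattened table loses which homologue carried which keyword and at what e-value, which is the kind of structure a relational learner can still use.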

Page 54: Robots and Automatic Genome Annotation

Expression Data Rule

If in the micro-array experiment (sorbitol incubation) the ORF expression is > -0.25
and in the micro-array experiment (nitrogen depletion) the ORF expression is <= -1.29
and in the micro-array experiment (YPD stationary phase) the ORF expression is > -1.06
then the function of this ORF is "pheromone response, mating type determination, sex-specific proteins"

Accuracy on training data: 11/12 (92%)
Accuracy on the test data: 3/4 (75%)
21 predictions made
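Written as code, the rule is a conjunction of three threshold tests. The thresholds and experiment names come from the rule above; the dictionary layout and the two example ORF profiles are hypothetical.

```python
PREDICTED_CLASS = "pheromone response, mating type determination, sex-specific proteins"


def expression_rule(expression):
    """Return the predicted class if the ORF's expression values match the rule."""
    matches = (
        expression["sorbitol incubation"] > -0.25
        and expression["nitrogen depletion"] <= -1.29
        and expression["YPD stationary phase"] > -1.06
    )
    return PREDICTED_CLASS if matches else None


# Hypothetical expression profiles, not real measurements.
matching_orf = {"sorbitol incubation": 0.10, "nitrogen depletion": -1.50,
                "YPD stationary phase": -0.40}
other_orf = {"sorbitol incubation": -0.30, "nitrogen depletion": -1.50,
             "YPD stationary phase": -0.40}
```

A rule in this form is easy for a biologist to read and check, which is the "intelligible results" requirement listed among the machine learning issues.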

Page 55: Robots and Automatic Genome Annotation

Structure Rule

• 80% accurate on test data
• Most matching ORFs belong to the Mitochondrial Carrier Family
• These have 6 long transmembrane alpha-helices of about 20-30 amino acids
• Why do we notice alpha-helices of length 10-14?

If true: coil (of length 3) followed by alpha (10 <= length < 14)
and true: coil (of length 1 or 2) followed by alpha (10 <= length < 14)
and true: coil (of length 3) followed by alpha (3 <= length < 6)
and false: coil followed by beta followed by coil (c-b-c)
and false: coil (6 <= length < 10) followed by alpha (of length 1 or 2)
then the function of this ORF is "mitochondrial transport"

Page 56: Robots and Automatic Genome Annotation

Alignment

YJL133W -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------ 251
YKR052C -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------ 241
YIL006W ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------ 310
YBR104W ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------ 271
YGR096W ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------ 250
YJR095W -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------ 246
YKL120W -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------ 261
YLR348C -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------ 239
YMR166C ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT 300
YDL198C ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------ 242
YGR257C ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN---------- 302
YDL119C FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------ 255

YJL133W -SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF 310
YKR052C -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF 300
YIL006W -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR 364
YBR104W -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF 325
YGR096W FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY 310
YJR095W ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH 303
YKL120W ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL 312
YLR348C ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH 289
YMR166C HPHVTNGRPAALSNSISLSLRTVYQSEGVLGFFSGVGPRFVWTSVQSSIMLLLYQMTLRG 360
YDL198C ---FDNPESG------LRIVKNTLKNEGVTAFFKGLTPKLLTTGPKLVFSFALAQSLIPR 293
YGR257C ---NSDPKGGNRSRNMFKFLETIWRTEGLAALYTGLAARVIKIRPSCAIMISSYEISKKV 359
YDL119C ----SKFTNS------FNTFTSIVKNENVLKLFSGLSMRLARKAFSAGIAWGIYEELVKR 305

Page 57: Robots and Automatic Genome Annotation

AlignmentYJL133W -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 251YKR052C -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 241YIL006W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 310YBR104W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 271YGR096W ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------ 250YJR095W -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------ 246YKL120W -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 261YLR348C -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 239YMR166C ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc 300YDL198C ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------ 242YGR257C ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc---------- 302YDL119C ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------ 255

YJL133W -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa 310YKR052C -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 300YIL006W -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa 364YBR104W -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 325YGR096W cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 310YJR095W ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 303YKL120W ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 312YLR348C ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 289YMR166C cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 360YDL198C ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa 293YGR257C ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 359YDL119C ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa 305

Page 58: Robots and Automatic Genome Annotation

Types of Logic

Deduction

Rule: If a cell grows, then it can synthesise tryptophan.

Fact: Cell cannot synthesise tryptophan.

Therefore: Cell cannot grow.

Given the rule P → Q, and the fact ¬Q, infer the fact ¬P (modus tollens).

Abduction

Rule: If a cell grows, then it can synthesise tryptophan.

Fact: Cell cannot grow.

Therefore: Cell cannot synthesise tryptophan.

Given the rule P → Q, and the fact ¬P, infer the fact ¬Q.
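The difference between the two inference patterns can be checked with a truth table. In the sketch below P stands for "the cell grows" and Q for "the cell can synthesise tryptophan", following the two examples on this slide; the helper names are illustrative.

```python
from itertools import product


def implies(p, q):
    """Material implication: P -> Q."""
    return (not p) or q


def is_valid(premises, conclusion):
    """An inference is deductively valid if the conclusion holds in every
    truth assignment that satisfies all the premises."""
    return all(
        conclusion(p, q)
        for p, q in product([False, True], repeat=2)
        if all(premise(p, q) for premise in premises)
    )


def rule(p, q):
    return implies(p, q)


# Deduction example: rule, cannot synthesise tryptophan => cannot grow.
deduction_is_valid = is_valid([rule, lambda p, q: not q], lambda p, q: not p)

# Abduction example: rule, cannot grow => cannot synthesise tryptophan.
abduction_is_valid = is_valid([rule, lambda p, q: not p], lambda p, q: not q)
```

The first pattern is valid and the second is not: a cell that cannot grow might still synthesise tryptophan. The abductive conclusion is therefore a hypothesis that explains the observation and needs an experiment to test it.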