Post on 01-Jan-2016

Page 1: Automated Theory Formation: First Steps in Bioinformatics

Automated Theory Formation: First Steps in Bioinformatics

Simon Colton
Computational Bioinformatics Laboratory

Page 2: Automated Theory Formation: First Steps in Bioinformatics

Machine Learning (ML) Questions

Given some background information
  Concepts, hypotheses (axioms)
Given some positive examples
  And some negative examples
Find me an explanation
  Why the positives are positive
  And the negatives are negative

Page 3: Automated Theory Formation: First Steps in Bioinformatics

Example: Predictive Toxicology

Given some theory from chemistry
  Structure of molecules, well-known substructures
Given some examples of toxic drugs
  And some examples of non-toxic drugs
Question: Why are the toxic drugs toxic?

Page 4: Automated Theory Formation: First Steps in Bioinformatics

Automated Theory Formation (ATF) Questions

Given some background information
  Concepts, hypotheses (axioms)
And some objects of interest
  Numbers, molecules, etc.
Find something interesting
  Interesting things could be:
    Concepts, examples, hypotheses, explanations

Page 5: Automated Theory Formation: First Steps in Bioinformatics

ATF Overview

Scientific theories contain (at least):
  Concepts: salt, acid, base
  Hypotheses: acid + base => salt + water
  Explanations: transfer of electrons, dissolving
So, ATF should do (at least):
  Concept formation, conjecture making
  Hypothesis proving and disproving
Also needs to:
  Measure interestingness, present results, etc.

Page 6: Automated Theory Formation: First Steps in Bioinformatics

HR Theory Formation System

Developed in maths
  Designed to be a general-purpose system
Concept-based theory formation
  Tries to make a concept
  Makes a conjecture when it can’t make a concept
  Tries to explain conjectures
Conjecture-based theory formation
  Fix faulty conjectures with concept formation
  PhD work of Alison Pease, based on Lakatos

Page 7: Automated Theory Formation: First Steps in Bioinformatics

Concept Formation in HR

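10 General Production Rules
  Take in old concepts, produce new concepts

[Diagram: production rules applied step by step to construct odd prime numbers]
  Start:   [a,b] : b|a
  Size:    [a,n] : n = |{b : b|a}|
  Split:   [a] : 2 = |{b : b|a}|
  Split:   [a] : 2|a
  Negate:  [a] : not 2|a
  Compose: [a] : 2 = |{b : b|a}| & not 2|a (Odd Prime Numbers)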
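The chain from divisors to odd primes can be replayed in a few lines of Python. This is only an illustrative sketch, not HR's implementation: concepts are modelled as sets of tuples over the numbers 1 to 30, and the function names `size`, `split`, `negate` and `compose` are hypothetical stand-ins for the production rules of the same names.

```python
# Illustrative sketch only: a concept is a set of tuples over the numbers 1..N.
N = 30
numbers = range(1, N + 1)

# Background concept [a,b] : b|a
divisor = {(a, b) for a in numbers for b in numbers if a % b == 0}

def size(concept):
    """Size rule: [a,n] where n counts the tuples whose first entry is a."""
    counts = {}
    for a, _ in concept:
        counts[a] = counts.get(a, 0) + 1
    return set(counts.items())

def split(concept, value):
    """Split rule: fix the last column to `value` and drop that column."""
    return {t[:-1] for t in concept if t[-1] == value}

def negate(concept, universe):
    """Negate rule: objects of the universe that are not in the concept."""
    return {(a,) for a in universe} - concept

def compose(c1, c2):
    """Compose rule, here just the conjunction of two same-arity concepts."""
    return c1 & c2

tau = size(divisor)                 # [a,n] : n = |{b : b|a}|
primes = split(tau, 2)              # [a] : 2 = |{b : b|a}|
evens = split(divisor, 2)           # [a] : 2|a
odds = negate(evens, numbers)       # [a] : not 2|a
odd_primes = compose(primes, odds)  # [a] : 2 = |{b : b|a}| & not 2|a

print(sorted(a for (a,) in odd_primes))
```

Storing each concept by its examples is what makes the later empirical checks cheap: two concepts can be compared simply by comparing their example sets.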

Page 8: Automated Theory Formation: First Steps in Bioinformatics

Conjecture Making

Empirical checks are performed
  After each attempt to invent a new concept
If the concept has no examples
  Makes a non-existence conjecture
If the concept has the same examples as a previous one
  Makes an equivalence conjecture
If another concept subsumes the concept
  Makes an implication conjecture
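The three empirical checks can be sketched as below. This is a simplified illustration under the assumption that every concept is stored with its set of examples; `make_conjecture` and the concept names are hypothetical and are not taken from HR.

```python
def make_conjecture(new_name, new_examples, known):
    """Return a conjecture about a freshly invented concept, or None.

    `known` maps the names of existing concepts to their example sets.
    """
    # No examples at all: conjecture that the concept is empty.
    if not new_examples:
        return ("non-existence", new_name)
    # Same examples as an existing concept: conjecture equivalence.
    for name, examples in known.items():
        if examples == new_examples:
            return ("equivalence", new_name, name)
    # Examples strictly contained in an existing concept: new => old.
    for name, examples in known.items():
        if new_examples < examples:
            return ("implication", new_name, name)
    return None

known = {"prime": {2, 3, 5, 7}, "odd": {1, 3, 5, 7, 9}}
print(make_conjecture("odd_prime", {3, 5, 7}, known))
print(make_conjecture("two_divisors", {2, 3, 5, 7}, known))
print(make_conjecture("even_prime_above_2", set(), known))
```

Because the checks are purely empirical, every conjecture produced this way is only supported by the data at hand and still needs proving or disproving.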

Page 9: Automated Theory Formation: First Steps in Bioinformatics

Conjecture Extraction

Suppose HR makes equivalence conjecture:
  P(a) & Q(a) <=> R(a) & S(a)
Extracts:
  P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)
  R(a) & S(a) => P(a), R(a) & S(a) => Q(a)
Tries to extract: P(a) => R(a), Q(a) => R(a), etc.
  Prime implicates (require proving, though)
Important: gets Horn Clauses
  Can be expressed in Prolog…
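The first extraction step, splitting an equivalence into single-headed implications, might look like the sketch below. The helper names `extract_implications` and `to_prolog` are hypothetical, literals are plain strings, and the search for prime implicates (which needs a prover) is left out.

```python
def extract_implications(lhs, rhs):
    """Split lhs <=> rhs into implications with one literal in the head."""
    clauses = []
    for body, heads in ((lhs, rhs), (rhs, lhs)):
        for head in heads:
            clauses.append((tuple(body), head))
    return clauses

def to_prolog(body, head):
    """Write a (body, head) pair as a Prolog-style Horn clause."""
    return f"{head} :- {', '.join(body)}."

# P(a) & Q(a) <=> R(a) & S(a), written with Prolog-style variables.
for body, head in extract_implications(["p(A)", "q(A)"], ["r(A)", "s(A)"]):
    print(to_prolog(body, head))
```

Each extracted clause has a single head literal, which is why the results are Horn clauses and can be written directly in Prolog syntax.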

Page 10: Automated Theory Formation: First Steps in Bioinformatics

Explanation Generation

In mathematical domains
  HR relies on automated theorem provers
  And model generators
    To find counterexamples
  E.g., group theory: a*a=a <=> a=id (prove easily)
In biological/chemistry domains
  Possibly: visualisation tools, reaction pathways
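For the group theory example, the easy proof is worth spelling out. The derivation below is a standard textbook argument, added here for illustration rather than taken from a prover's output; it uses only associativity, inverses and the identity axiom.

```latex
\begin{align*}
a * a = a
  &\implies a^{-1} * (a * a) = a^{-1} * a \\
  &\implies (a^{-1} * a) * a = \mathrm{id} \\
  &\implies \mathrm{id} * a = \mathrm{id} \\
  &\implies a = \mathrm{id}.
\end{align*}
```

The converse is immediate, since id * id = id.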

Page 11: Automated Theory Formation: First Steps in Bioinformatics

Greatest Hits

Please ask me over coffee about:
  Pre-processing constraint problems
  Learning properties of quadratic residues
  Inventing integer sequences
  Puzzle generation
  Adding to the TPTP library
  Setting mathematical tutorial questions…

Page 12: Automated Theory Formation: First Steps in Bioinformatics

Long term aim in Bioinformatics

Develop an ATF system similar to HOMER
  But working in biological domains
Biologist provides little background info
  In a format they are happy with
Program provides results
  Intelligent, interesting, not too much,
  And very little rubbish

Automated assistant for biology

Page 13: Automated Theory Formation: First Steps in Bioinformatics

Short term aim in Bioinformatics

HR can work with biological data
  Takes input similar to Muggleton’s Progol
Use HR to solve ML problems
  See how bad an idea that is
Use theory formation to improve ML
  Integrate HR and Progol somehow

Page 14: Automated Theory Formation: First Steps in Bioinformatics

Naïve Approach to ML Tasks

Give HR the same input as Progol
Get it to form a theory
Look at the theory
  Extract concepts which do well on the task
  i.e., they look similar to the target concept
Not a goal-based approach
  Bad idea (slow)
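The step of extracting concepts which do well on the task can be pictured as a ranking of every concept in the finished theory by its predictive accuracy. The sketch below assumes each concept is stored as the set of examples it covers; `accuracy`, `best_concepts` and the toy data are hypothetical.

```python
def accuracy(covered, positives, negatives):
    """Fraction of examples classified correctly by 'covered means positive'."""
    true_pos = len(covered & positives)
    true_neg = len(negatives - covered)
    return (true_pos + true_neg) / (len(positives) + len(negatives))

def best_concepts(theory, positives, negatives, top=3):
    """Rank the concepts in a finished theory by accuracy on the task."""
    scored = [(accuracy(ex, positives, negatives), name)
              for name, ex in theory.items()]
    return sorted(scored, reverse=True)[:top]

# Toy data: three positives, four negatives, three concepts from the theory.
positives = {1, 2, 3}
negatives = {4, 5, 6, 7}
theory = {"c1": {1, 2, 3, 4}, "c2": {1, 2, 3}, "c3": {5, 6}}
print(best_concepts(theory, positives, negatives))
```

Nothing in this loop steers theory formation towards the target, which is why the approach is not goal-based: the scoring happens only after the whole theory has been built.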

Page 15: Automated Theory Formation: First Steps in Bioinformatics

Less Naïve Approach

Improve search using “forward look-ahead”
  ICML Paper
This has evolved to “reactive search”
  Uses HR’s own Java interpreter
  HR reacts to certain events in theory formation
    Scripts supplied by the user
HR also makes “near-conjectures”
Faster approach, but still fairly slow

Page 16: Automated Theory Formation: First Steps in Bioinformatics

Example – Mutagenesis42 Data

Mutagenesis similar to carcinogenesis
42 drugs supplied with atom-bond details
  Atom type, number & charge, bond type (1-8)
13 are mutagenic (active), 29 are not active
Progol learned this concept (88% accurate)
  active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E)
  [Diagram: c,21 --1-- ? --2-- ?]

Page 17: Automated Theory Formation: First Steps in Bioinformatics

HR’s Results

Using reactive search, four PRs (production rules), 30K steps
HR learned this concept:
  active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)
  [Diagram: ?,21 --1-- ? --?-- ?]
Also 88% accurate
But, Progol’s answer “better”
  Because higher information content (fewer ?s)
  Biologists sometimes want more information
Is this really a simpler answer?

Page 18: Automated Theory Formation: First Steps in Bioinformatics

But…..

HR also made these equivalence conjectures
  And extracted them (+100 more) for us
  atm(B,X,21) <=> atm(B,c,21)
  atm(B,X,38) <=> atm(B,n,38)
  bond(A,B,C,X1) & atm(C,X2,38) <=> bond(A,B,C,1) & atm(C,X3,38)
  bond(A,X1,B,X2) & atm(B,X3,38) <=> bond(A,B,X4,2) & atm(B,X5,38)
We used these to re-write HR’s answer
  By hand, but hope to automate
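As a rough idea of what automating the re-write could involve, the sketch below applies only the simplest kind of equivalence, the ones saying that an atom type determines its element (type 21 is always c, type 38 is always n). The representation of a substructure as a list of (element, type) pairs and the names `ELEMENT_OF_TYPE` and `fill_in` are assumptions made for illustration. The bond equivalences would need a richer representation and are not handled.

```python
# Equivalences of the form atm(B,X,T) <=> atm(B,e,T): type T implies element e.
ELEMENT_OF_TYPE = {21: "c", 38: "n"}

def fill_in(atoms):
    """Replace unknown elements (None) wherever the atom type determines them."""
    rewritten = []
    for element, atom_type in atoms:
        if element is None and atom_type in ELEMENT_OF_TYPE:
            element = ELEMENT_OF_TYPE[atom_type]
        rewritten.append((element, atom_type))
    return rewritten

# HR's learned concept mentions one atom of type 21 with an unknown element.
hr_answer = [(None, 21), (None, None), (None, None)]
print(fill_in(hr_answer))
```

Each application of an equivalence removes a blank, so the re-written concept carries more information without changing which molecules it covers.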

Page 19: Automated Theory Formation: First Steps in Bioinformatics

Giving us this answer:
  [Diagram: c,21 --1-- n,38 --2-- ?]
Remember that Progol’s answer was:
  [Diagram: c,21 --1-- ? --2-- ?]
So, we filled in one of the blanks!

Page 20: Automated Theory Formation: First Steps in Bioinformatics

Are we making a meal of this?

Yes, possibly for the mutagenesis data
  I was worried about the difficulty of this problem
In the last week I’ve written a
  200-line Prolog program which runs quite fast
  And can be distributed over multiple processors
  And can be easily understood by biologists
And gets these results…

Page 21: Automated Theory Formation: First Steps in Bioinformatics

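Template search – Results

Nice result one (88% accurate, lots of info)
  [Diagram: substructure with atoms c,21; n,38; o,40; o,40 and bond labels 1, 2, 2]
Nice result two (95% accurate)
  [Diagram: substructures with atoms c,21; n,38; o,40 (bond labels 1, 2), atoms c,?; c,22; ? and c,195; c,22; h,3 (bond labels 1, 7, 7, 1), and charges -0.132 and 0.145]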

Page 22: Automated Theory Formation: First Steps in Bioinformatics

Template Search - Assumptions

Connected substructures
  Are interesting answers
  Progol’s answers are all substructures
More specific substructures are not so bad
  Biologists may even want lots of information
  Don’t forget that they want to do science
Each learned concept will be true of
  At least one active (positive) molecule

Page 23: Automated Theory Formation: First Steps in Bioinformatics

Template Search - Overview

User chooses template for substructures
  [Diagram: ?,? --?-- ?,? --?-- ?,?]
User specifies how many ?s are allowed
  E.g., 3 out of 8 in the above template
Algorithm starts with the first positive
  Extracts all substructures in the template
Then takes the next positive, for each substructure in the set
  Add the LGG (least general generalisation) so that it fits both positives
  Don’t go under the IC (information content) limit
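The generalisation step can be sketched as follows. This is not the 200-line Prolog program, only a Python illustration under stated assumptions: a substructure is a tuple of eight slots matching the three-atom template (element, type, bond, element, type, bond, element, type), the limit on ?s plays the role of the information content limit, and extracting the substructures from a molecule is taken as already done. The names `lgg` and `generalise` are hypothetical, and keeping the old substructures alongside the new generalisations is an assumption about what "add" means.

```python
WILD = "?"

def lgg(s1, s2):
    """Least general generalisation of two filled templates, slot by slot."""
    return tuple(a if a == b else WILD for a, b in zip(s1, s2))

def generalise(current, next_positive_subs, max_wild):
    """Add every LGG with the next positive that keeps enough information."""
    result = set(current)
    for s in current:
        for t in next_positive_subs:
            g = lgg(s, t)
            if g.count(WILD) <= max_wild:  # stay within the allowed number of ?s
                result.add(g)
    return result

# Slots: (elem1, type1, bond12, elem2, type2, bond23, elem3, type3)
s1 = ("c", 21, 1, "n", 38, 2, "o", 40)   # from the first positive
s2 = ("c", 21, 1, "n", 38, 2, "c", 22)   # from the next positive
s3 = ("h", 3, 1, "c", 22, 7, "c", 22)    # also from the next positive
for sub in sorted(generalise({s1}, [s2, s3], max_wild=3), key=str):
    print(sub)
```

With at most three ?s allowed, the generalisation of `s1` and `s2` survives (two ?s), while the generalisation of `s1` and `s3` is discarded because it would need seven.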

Page 24: Automated Theory Formation: First Steps in Bioinformatics

Template Search – Final Part

For all the substructures
  Take a disjunction
  Which achieves the best accuracy
Distribution of this algorithm possible
  We’re getting a big Linux farm
  PPP – Processor Per Positive
    Each processor finds substructures true of one positive
    Combine answers at the end
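One way to pick the disjunction is an exhaustive search over small sets of substructures, sketched below. The slides do not say how the best disjunction is found, so the exhaustive strategy, the `max_terms` cap and the names `disjunction_accuracy` and `best_disjunction` are all assumptions; `coverage` maps each substructure to the set of molecules it occurs in.

```python
from itertools import combinations

def disjunction_accuracy(chosen, coverage, positives, negatives):
    """Accuracy of predicting 'active' when any chosen substructure matches."""
    covered = set().union(*(coverage[s] for s in chosen))
    correct = len(covered & positives) + len(negatives - covered)
    return correct / (len(positives) + len(negatives))

def best_disjunction(coverage, positives, negatives, max_terms=2):
    """Try every disjunction of up to max_terms substructures, keep the best."""
    best = (0.0, ())
    for k in range(1, max_terms + 1):
        for chosen in combinations(sorted(coverage), k):
            acc = disjunction_accuracy(chosen, coverage, positives, negatives)
            if acc > best[0]:
                best = (acc, chosen)
    return best

# Toy data: three active and two inactive molecules, three substructures.
positives = {"d1", "d2", "d3"}
negatives = {"d4", "d5"}
coverage = {"s1": {"d1", "d2"}, "s2": {"d3"}, "s3": {"d3", "d4", "d5"}}
print(best_disjunction(coverage, positives, negatives))
```

Under a processor-per-positive scheme, each processor would contribute its own entries to `coverage`, and this final selection would run once over the combined answers.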

Page 25: Automated Theory Formation: First Steps in Bioinformatics

Conclusions & Future Work

Automated Theory Formation
  May be useful to bioinformatics
Use HR’s theory to improve Progol’s results
  Possibly by pre-processing Progol’s input
  Or by post-processing the learned concept
Template search
  Maybe a good idea? Possibly not new…
  Not bad results for the Mutagenesis42 dataset