A Randomized Exhaustive Propositionalization Approach for Molecule Classification

Michele Samorani, Manuel Laguna, Kirk DeLisle, Daniel Weaver

Post on 16-Feb-2016


Page 1

A Randomized Exhaustive Propositionalization Approach for Molecule Classification

Michele Samorani, Manuel Laguna, Kirk DeLisle, Daniel Weaver

Page 2

DRUG DISCOVERY

Page 3

Drug Discovery
• The process of developing new drugs
• The cost of developing a drug typically varies from $500 million to $2 billion
• Molecule classification is used along the entire process to discriminate between:
  – Active and non-active compounds
  – Toxic and non-toxic compounds
• During the development of a new drug:
  – Use the experiments done so far to train a classifier
  – Use the classifier to find the promising compounds to test next
• An ideal classification algorithm:
  – Speeds up the design of new drugs
  – Gives insights about chemical properties

Page 4

Data Mining in Drug Discovery

[Diagram: compounds labeled Active (1) or Non-Active (0) are turned into an attribute representation and used to train a classifier. The chemist designs a compound, and the classifier answers "Non-Active!"]

Page 5

Molecule classification – Binary Fingerprints

Compound  Class  A1  …
1         1
2         2
…         …
400       1

One of the main attribute representations is the so-called binary fingerprints:
– Every attribute represents the absence/presence (0/1) of a characteristic or a substructure
– The attributes are pre-defined characteristics
– The classification process does not find new knowledge, only which attributes are most important

Page 6

Molecule classification – Binary Fingerprints

[Fingerprint table repeated from page 5]

• The focus of this work is NOT on improving the classification procedure
• It is on how to generate a good attribute representation that yields new knowledge

Page 7

PROPOSITIONALIZATION

Page 8

Propositionalization
• The starting point is a database
• By navigating through the database, new features are generated, which represent the results of SQL queries
• These features are added to the mining table
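
As a rough illustration of a feature being "the result of an SQL query", the sketch below builds a tiny in-memory SQLite database and attaches one candidate feature (the number of bonds per compound) to the mining table. The table and column names follow the schema shown on the next slide; the rows and the feature name n_bonds are made up for the example and are not taken from the authors' data or code.

```python
import sqlite3

# Hypothetical toy database; only the table/column names come from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (idTarget INTEGER PRIMARY KEY, class INTEGER);
    CREATE TABLE bond (idTarget INTEGER, idBond INTEGER, type INTEGER);
    INSERT INTO target VALUES (1, 0), (2, 1);
    INSERT INTO bond VALUES (1, 0, 2), (1, 1, 1), (2, 0, 1);
""")

# One candidate feature: the number of bonds of each compound.
# LEFT JOIN keeps compounds without any bond (their count is 0).
mining_table = conn.execute("""
    SELECT t.idTarget, t.class, COUNT(b.idBond) AS n_bonds
    FROM target AS t
    LEFT JOIN bond AS b ON b.idTarget = t.idTarget
    GROUP BY t.idTarget, t.class
    ORDER BY t.idTarget
""").fetchall()

print(mining_table)
```

Each row of mining_table holds the compound id, its class, and the new feature value, which is the shape a classifier expects.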

Page 9

TARGET (it contains the compounds)
idTarget  class
1         0
2         1
…         …

BOND
idTarget  idBond  type
1         0       2
1         1       1
…         …       …

ATOMBOND
idTarget  idAtom  idBond
1         0       0
1         1       0
…         …       …

ATOM
idTarget  idAtom  ele
1         0       O
1         1       C
…         …       …

[Diagram: the four tables are connected by relationships labeled 1 and n.]

Page 10

Generating a new attribute
• Two steps:
  1. Find a path that starts from the target table
  2. Roll up one simple attribute, through aggregations and refinements, from the last table to the target table

Page 11

[Schema diagram repeated from page 9: TARGET, BOND, ATOMBOND, ATOM]

STEP 1: Find a path

Page 12

[Schema diagram repeated from page 9, with the chosen path marked]

STEP 1: Find a path

This path will find attributes of depth 2
Depth = measure of how complex the attribute is

Page 13

[Schema diagram repeated from page 9]

STEP 2: Roll-up

Aggregate to each Atom: count distinct bonds (CDB)

Page 14

[Schema diagram repeated from page 9]

STEP 2: Roll-up

Aggregate to each Atom: count distinct bonds (CDB)

ATOM
idTarget  idAtom  ele  CDB
1         0       O    1
1         1       C    2
…         …       …    …

Page 15

[Schema diagram repeated from page 9, with the CDB column added to ATOM]

STEP 2: Roll-up

Attach to the target table: the maximum number of bonds in which a carbon atom participates

X = max(Atom.CDB) where Atom.ele = 'C'

TARGET
idTarget  class  X
1         0      2
2         1      3
…         …      …
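
The two roll-up steps can be sketched as one nested SQL query. This is only an illustration under stated assumptions: the rows are invented, SQLite is used for convenience, and the names atom_cdb, cdb and x are placeholders rather than the authors' implementation. The inner query performs the aggregation (count distinct bonds per atom); the outer query applies the refinement (ele = 'C') and the second aggregation (max) while attaching the value to the target table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (idTarget INTEGER PRIMARY KEY, class INTEGER);
    CREATE TABLE atom (idTarget INTEGER, idAtom INTEGER, ele TEXT);
    CREATE TABLE atombond (idTarget INTEGER, idAtom INTEGER, idBond INTEGER);
    INSERT INTO target VALUES (1, 0), (2, 1);
    -- Hypothetical data. Compound 1 is the chain O(0)-C(1)-C(2);
    -- compound 2 is a single oxygen atom with no bonds and no carbon.
    INSERT INTO atom VALUES (1, 0, 'O'), (1, 1, 'C'), (1, 2, 'C'), (2, 0, 'O');
    INSERT INTO atombond VALUES (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 2, 1);
""")

rows = conn.execute("""
    WITH atom_cdb AS (
        -- Step 2a, aggregation: count distinct bonds (CDB) for every atom.
        SELECT a.idTarget, a.idAtom, a.ele,
               COUNT(DISTINCT ab.idBond) AS cdb
        FROM atom AS a
        LEFT JOIN atombond AS ab
               ON ab.idTarget = a.idTarget AND ab.idAtom = a.idAtom
        GROUP BY a.idTarget, a.idAtom, a.ele
    )
    -- Step 2b, refinement (carbon only) and aggregation (max) per compound.
    SELECT t.idTarget, t.class, MAX(c.cdb) AS x
    FROM target AS t
    LEFT JOIN atom_cdb AS c
           ON c.idTarget = t.idTarget AND c.ele = 'C'
    GROUP BY t.idTarget, t.class
    ORDER BY t.idTarget
""").fetchall()

print(rows)
```

In this toy data the second compound has no carbon atom, so its attribute value is missing (None); how missing values are handled is not described in the slides.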

Page 16

Propositionalization – graphically

                 Depth 1    Depth 2  Depth 3  Depth 4
Compound  Class  A1 … A55   A56 …    A118 …   A346 …
1         1
2         2
…         …
400       1

Page 17

Our contribution over traditional propositionalization

Our Randomized Exhaustive approach produces:
1. More expressive attributes (Exhaustive)
2. “Deeper” attributes (Randomized)

Page 18

More expressive attributes – Example
• Traditional propositionalization algorithms can generate the following attribute:
  – Count the number of double bonds in which each atom participates
  – Compute the maximum
• But not the following attribute:
  – Count the number of double bonds in which each atom participates
  – Compute the maximum among the oxygen atoms
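
The difference between the two attributes comes down to whether a refinement (a filter on the atoms) may be applied before the final aggregation. The following sketch shows this on a made-up molecule; the data layout and the helper rolled_up_max are hypothetical and only meant to make the contrast concrete.

```python
# Hypothetical toy molecule: for every atom, its element and the number of
# double bonds it participates in (the per-atom aggregation is already done).
atoms = [
    {"ele": "C", "double_bonds": 2},
    {"ele": "O", "double_bonds": 1},
    {"ele": "O", "double_bonds": 0},
    {"ele": "N", "double_bonds": 1},
]


def rolled_up_max(atoms, ele=None):
    """Maximum double-bond count, optionally refined to a single element."""
    values = [a["double_bonds"] for a in atoms if ele is None or a["ele"] == ele]
    return max(values) if values else None


# Attribute reachable by traditional propositionalization: max over all atoms.
traditional = rolled_up_max(atoms)
# Attribute that needs the refinement: max among the oxygen atoms only.
exhaustive = rolled_up_max(atoms, ele="O")

print(traditional, exhaustive)
```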

Page 19

Attributes: Traditional vs Exhaustive

[Figure with two panels: Activity and Mutagenicity]

Page 20

EXPERIMENTS

Page 21

Design of the experiments
• Given an attribute generation strategy:
  – Perform a 10-fold cross-validation using 10 different classifiers (from Weka):
    • MultilayerPerceptron, BayesNet, Bagging, J48, ADTree, REPTree, RandomForest, PART, NNge, Ridor
• The average accuracy across the folds and across the classifiers is the measure of the performance of the strategy used
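
The performance measure can be written down in a few lines. The sketch below assumes the per-fold accuracies have already been obtained (for instance from Weka) and only shows the averaging; the function name strategy_score and the numbers are invented for illustration and are not results from the study.

```python
from statistics import mean


def strategy_score(fold_accuracies):
    """Average accuracy across folds, then across classifiers.

    fold_accuracies maps a classifier name to its list of per-fold accuracies.
    Returns the overall score and the per-classifier averages.
    """
    per_classifier = {name: mean(accs) for name, accs in fold_accuracies.items()}
    return mean(per_classifier.values()), per_classifier


# Made-up accuracies for two classifiers and two folds (illustration only).
fold_accuracies = {"J48": [0.8, 0.7], "PART": [0.6, 0.7]}
overall, per_classifier = strategy_score(fold_accuracies)
print(overall, per_classifier)
```

When every classifier is evaluated on the same number of folds, this two-stage average equals the plain average over all fold results.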

Page 22

Up to a predefined depth

In general, the deeper we go, the higher the accuracy we obtain

Let’s generate attributes at depth > 4

Page 23

Generate all attributes up to depth 4 and add 1,000 attributes randomly sampled from depth 5 to 7

Up to depth 4 + 1,000 in [5,7]

77.93% 77.72% 74.95% 76.30%
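
A minimal sketch of this "scan, then sample" selection follows. It assumes, for simplicity, that the candidate attributes of every depth are available as lists; the slides do not say how deep attributes are drawn in practice, so the function scan_and_sample, its parameters, and the toy pool are all hypothetical.

```python
import random


def scan_and_sample(attributes_by_depth, scan_depth=4, sample_depths=(5, 6, 7),
                    n_samples=1000, seed=0):
    """Keep every attribute up to scan_depth, then add a random sample of
    deeper attributes drawn without replacement from sample_depths."""
    rng = random.Random(seed)
    selected = [attr
                for depth, attrs in sorted(attributes_by_depth.items())
                if depth <= scan_depth
                for attr in attrs]
    deep_pool = [attr
                 for depth in sample_depths
                 for attr in attributes_by_depth.get(depth, [])]
    selected += rng.sample(deep_pool, min(n_samples, len(deep_pool)))
    return selected


# Toy pool: 10 placeholder attribute names for each depth from 1 to 7.
pool = {d: [f"d{d}_a{i}" for i in range(10)] for d in range(1, 8)}
chosen = scan_and_sample(pool, n_samples=5)
print(len(chosen))
```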

Page 24

Summary of the results
• Exhaustive is significantly better than Traditional
• Sampling deep attributes at the end of the attribute generation procedure is significantly better than continuing to generate non-deep attributes
• (in terms of the proportion of classifiers that perform better with one strategy than with the other)
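
The slides do not name the statistical test behind "significantly better". One common way to test a proportion of wins is a one-sided sign test, sketched below purely as an assumption; the function name and the example counts are hypothetical and do not come from the study.

```python
from math import comb


def sign_test_p_value(wins, n):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(n, 0.5).

    wins is the number of classifiers that do better with strategy A than
    with strategy B, out of n classifiers (ties excluded beforehand).
    """
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Hypothetical example: strategy A wins with 9 of 10 classifiers.
p_value = sign_test_p_value(9, 10)
print(p_value)
```

A small p-value means that so many wins would be unlikely if both strategies were equally good.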

Page 25

Comparison to fingerprints

The difference is not significant

Page 26

Comparison to fingerprints

The difference is not significant

Years of research effort in order to identify this attribute representation

2 hours of computing time

Page 27

Additional Attribute Generation Strategies
• Let’s not sample deep attributes randomly
• Strategy 1: find the best mix of depths (scatter search)
• Strategy 2: use a Bayesian Network to retrieve attributes with high information gain
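
For reference, the information gain of a discrete attribute with respect to the class can be computed as below. This sketch covers only the scoring quantity mentioned in Strategy 2, not the Bayesian Network itself, and the helper names and toy data are made up for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Class entropy minus the expected entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: one attribute that separates the classes perfectly, one that does not.
labels = [1, 1, 0, 0]
gain_informative = information_gain([1, 1, 0, 0], labels)
gain_useless = information_gain([1, 0, 1, 0], labels)
print(gain_informative, gain_useless)
```

Numeric attributes such as counts or maxima would need to be discretized before this formula applies.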

Page 28

New knowledge
• Although our best method does not improve upon fingerprints, it has the potential to generate new knowledge
• The attributes used by the classifiers represent important characteristics
  – Number of bromine atoms
  – The average number of double bonds among the atoms different from S

These attributes identify structures that have characteristics that may prevent mutagenesis

Page 29

New knowledge
• But, sometimes, deep attributes are hard to interpret.
• On Estrogen:
• Label each atom A in the following way. 1) Consider the atoms connected to it and count the bonds in which they participate (excluding the bond connecting A to each of them). 2) Compute the sum of these counts and obtain the label for A. Label the molecule with the minimum of these labels across all oxygen atoms.
• Specifically, a high value would represent an oxygen atom that is connected to other atoms participating in a large number of additional bonds, presumably an oxygen atom that is somewhat buried and interacting with highly branched atoms.
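
The verbal definition of this deep attribute is easier to follow on a small graph. The sketch below is one reading of it, assuming a molecule is given as a dictionary of atom elements plus a list of bonded atom pairs, with each pair listed once (so a double bond counts as a single bond). The function name and the toy molecule are hypothetical.

```python
def min_oxygen_label(elements, bonds):
    """elements maps atom id -> element symbol; bonds is a list of atom-id pairs.

    Label of atom A = sum over its neighbors B of (bonds of B, excluding A-B).
    The molecule's attribute is the minimum label over all oxygen atoms.
    """
    neighbors = {atom: set() for atom in elements}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def label(atom):
        # Each neighbor contributes its bond count minus the bond back to `atom`.
        return sum(len(neighbors[nb]) - 1 for nb in neighbors[atom])

    oxygen_labels = [label(a) for a, ele in elements.items() if ele == "O"]
    return min(oxygen_labels) if oxygen_labels else None


# Hypothetical hydrogen-free skeleton: O0-C1(-C2)(-C3)-C4-O5.
elements = {0: "O", 1: "C", 2: "C", 3: "C", 4: "C", 5: "O"}
bonds = [(0, 1), (1, 2), (1, 3), (1, 4), (4, 5)]
value = min_oxygen_label(elements, bonds)
print(value)
```

Here O0 sits next to a highly branched carbon (label 3) while O5 sits next to a chain carbon (label 1), so the molecule's attribute value is 1.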

Page 30

Conclusions
• The current attribute representations (Fingerprints) used for molecule classification do not provide insights into the chemical properties of the compounds
• Traditional propositionalization approaches do not obtain satisfactory accuracy
• Our method extends the traditional propositionalization approach and:
  – Obtains an accuracy comparable to Fingerprints
  – Has the potential to find new knowledge
• Note that our method is applicable to any domain (marketing, medical, etc.)

Page 31

Future Work
• Accuracy improvement:
  – Scan & Sample => Scan & Smartly Sample
• Improve the feature representation
  – Query-like is OK for computer scientists, but chemists would prefer a graphical representation

Page 32

Thank you for your attention