A Randomized Exhaustive Propositionalization Approach for Molecule Classification

Michele Samorani, Manuel Laguna, Kirk DeLisle, Daniel Weaver

DRUG DISCOVERY

Drug Discovery
• The process of developing new drugs
• The cost of developing a drug typically varies from $500 million to $2 billion
• Molecule classification is used along the entire process to discriminate between:
– Active and Non Active compounds
– Toxic and Non Toxic compounds
• During the development of a new drug:
– Use the experiments done so far to train a classifier
– Use the classifier to find the promising compounds to test next
• An ideal Classification Algorithm:
– Speeds up the design of new drugs
– Gives insights about chemical properties

Data Mining in Drug Discovery

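[Figure: compounds labeled Active (1) or Non-Active (0) are turned into an attribute representation and used to train a classifier. The chemist designs a compound and the classifier predicts its class, e.g. "Non-Active!"]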

Molecule classification – Binary Fingerprints

Compound   Class   A1   …
1          1       …    …
2          2       …    …
…          …       …    …
400        1       …    …

One of the main attribute representations is the so-called Binary Fingerprints:
– Every attribute represents the absence/presence (0/1) of a characteristic or a substructure
– The attributes are pre-defined characteristics
– The classification process does not find new knowledge, only which attributes are most important
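As a rough illustration of the idea, the sketch below encodes a compound as a 0/1 vector over a fixed list of characteristics. The molecule encoding (a list of element symbols plus a list of bond types) and the three predicates are hypothetical, chosen only to show that the attribute set is fixed in advance; real fingerprints use far larger, chemically curated sets.

```python
# Hypothetical, simplified fingerprint: each bit answers a predefined yes/no question.
PREDEFINED_BITS = [
    ("contains oxygen", lambda mol: "O" in mol["atoms"]),
    ("contains bromine", lambda mol: "Br" in mol["atoms"]),
    ("has a double bond", lambda mol: 2 in mol["bond_types"]),
]


def fingerprint(mol):
    """Return one 0/1 value per predefined characteristic."""
    return [int(test(mol)) for _, test in PREDEFINED_BITS]


# Toy compound (made up): a C=O skeleton, hydrogens omitted.
toy = {"atoms": ["C", "O"], "bond_types": [2]}
print(fingerprint(toy))  # [1, 0, 1]
```

Because the list of bits never changes, a classifier trained on such vectors can only rank the existing characteristics; it cannot propose a new one.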


• The focus of this work is NOT on improving the classification procedure

• It is on how to generate a good attribute representation that yields new knowledge

PROPOSITIONALIZATION

Propositionalization
• The starting point is a database
• By navigating through the database, new features are generated, which represent the results of SQL queries
• These features are added to the mining table

TARGET (it contains the compounds)
idTarget   class
1          0
2          1
…          …

BOND
idTarget   idBond   type
1          0        2
1          1        1
…          …

ATOMBOND
idTarget   idAtom   idBond
1          0        0
1          1        0
…          …

ATOM
idTarget   idAtom   ele
1          0        O
1          1        C
…          …

[Diagram: the tables are linked by relationships marked with 1 and n cardinalities]

Generating a new attribute
• Two steps:
1. Find a path that starts from the target table
2. Roll up one simple attribute, through aggregations and refinements, from the last table to the target table
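The first step can be sketched as a walk over the schema's join graph. The links below are an assumption inferred from the key columns the example tables share, and treating depth as the number of joins in a path is also an assumption; the slides only describe depth as a measure of attribute complexity.

```python
# Assumed join graph, inferred from the shared key columns of the example schema.
SCHEMA_LINKS = {
    "TARGET": ["ATOM", "BOND"],
    "ATOM": ["TARGET", "ATOMBOND"],
    "BOND": ["TARGET", "ATOMBOND"],
    "ATOMBOND": ["ATOM", "BOND"],
}


def paths_from_target(max_depth, start="TARGET"):
    """Enumerate join paths that start at the target table.

    Depth is taken here to be the number of joins in the path, and a path
    may not go straight back to the table it just came from.
    """
    found = []
    frontier = [[start]]
    for _ in range(max_depth):
        next_frontier = []
        for path in frontier:
            for neighbor in SCHEMA_LINKS[path[-1]]:
                if len(path) >= 2 and neighbor == path[-2]:
                    continue  # skip immediate backtracking
                next_frontier.append(path + [neighbor])
        found.extend(next_frontier)
        frontier = next_frontier
    return found


for p in paths_from_target(2):
    print(" -> ".join(p))
```

Each returned path is a candidate for the second step, where an attribute of the last table is rolled up toward the target table.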



STEP 1: Find a path

[Figure: the database schema with the selected path highlighted]

This path will find attributes of depth 2
Depth = measure of how complex the attribute is



STEP 2: Roll-up

Aggregate to each Atom: count distinct bonds (CDB)

[Figure: the ATOM table gains a CDB column, with values 1 and 2 in the two example rows]



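Attach to the target table: the maximum number of bonds in which a carbon atom participates

X = max(Atom.CDB) where Atom.ele = 'C'

[Figure: the ATOM table shows the CDB column (values 1 and 2); the TARGET table gains a column X, with values 2 and 3 in the two example rows]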
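Since the generated features represent the results of SQL queries, the two roll-up stages can be written as one nested query. The sketch below uses Python's built-in `sqlite3` module with toy rows made up for illustration (they are not the slides' dataset); the table and column names follow the example schema.

```python
import sqlite3

# Toy data, made up for illustration (not taken from the slides' dataset).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TARGET   (idTarget INTEGER, class INTEGER);
CREATE TABLE ATOM     (idTarget INTEGER, idAtom INTEGER, ele TEXT);
CREATE TABLE ATOMBOND (idTarget INTEGER, idAtom INTEGER, idBond INTEGER);

INSERT INTO TARGET VALUES (1, 0), (2, 1);
INSERT INTO ATOM VALUES (1, 0, 'O'), (1, 1, 'C'), (1, 2, 'C'), (2, 0, 'C'), (2, 1, 'O');
INSERT INTO ATOMBOND VALUES (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 2, 1), (2, 0, 0), (2, 1, 0);
""")

query = """
SELECT t.idTarget, MAX(a.CDB) AS X
FROM TARGET t
JOIN (
    -- Aggregation: count distinct bonds (CDB) for each atom.
    SELECT atm.idTarget, atm.idAtom, atm.ele,
           COUNT(DISTINCT ab.idBond) AS CDB
    FROM ATOM atm
    LEFT JOIN ATOMBOND ab
           ON ab.idTarget = atm.idTarget AND ab.idAtom = atm.idAtom
    GROUP BY atm.idTarget, atm.idAtom, atm.ele
) a ON a.idTarget = t.idTarget
WHERE a.ele = 'C'          -- Refinement: keep only carbon atoms.
GROUP BY t.idTarget        -- Roll up to one value per compound.
ORDER BY t.idTarget
"""

rows = conn.execute(query).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

In this simplified form a compound with no carbon atoms drops out of the result; a production query would need a rule for such compounds (for example a default value).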

Propositionalization – graphically

Compound   Class   A1 … A55   A56 … A118   … A346   …
1          1
2          2
…          …
400        1

[The attribute columns are grouped by depth, from left to right: Depth 1, Depth 2, Depth 3, Depth 4]

Our contribution over traditional propositionalization

Our Randomized Exhaustive approach produces:
1. More expressive attributes (Exhaustive)
2. “Deeper” attributes (Randomized)

More expressive attributes – Example

• Traditional propositionalization algorithms can generate the following attribute:
– Count the number of double bonds in which each atom participates
– Compute the maximum
• But not the following attribute:
– Count the number of double bonds in which each atom participates
– Compute the maximum among the oxygen atoms
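The difference between the two attributes is a single refinement applied before the final aggregation. A minimal sketch, using a made-up per-atom table and a hypothetical helper `rollup_max`:

```python
# Toy per-atom table for one compound (made up):
# (element, number of double bonds in which the atom participates).
atoms = [("C", 2), ("O", 1), ("O", 0), ("N", 1)]


def rollup_max(rows, element=None):
    """Max of the per-atom counts, optionally refined to a single element."""
    values = [count for ele, count in rows if element is None or ele == element]
    return max(values) if values else None


print(rollup_max(atoms))       # 2  (no refinement: maximum over all atoms)
print(rollup_max(atoms, "O"))  # 1  (refinement: maximum among the oxygen atoms)
```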

[Figure: Attributes, Traditional vs Exhaustive; panels: Activity, Mutagenicity]

EXPERIMENTS

Design of the experiments
• Given an attribute generation strategy:
– Perform a 10-fold cross-validation using 10 different classifiers (from Weka): MultilayerPerceptron, BayesNet, Bagging, J48, ADTree, REPTree, RandomForest, PART, NNge, Ridor
• The average accuracy across the folds and across the classifiers is the measure of the performance of the strategy used
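The performance measure described above is a plain average over classifiers and folds. The sketch below assumes the per-fold accuracies have already been collected (in the slides they come from Weka); the function name and the numbers are placeholders, not results from the study.

```python
from statistics import mean


def strategy_score(accuracy_by_classifier):
    """Average accuracy across folds, then across classifiers.

    accuracy_by_classifier maps a classifier name to a list with one
    accuracy per cross-validation fold.
    """
    return mean(mean(folds) for folds in accuracy_by_classifier.values())


# Placeholder numbers for two classifiers and two folds (not real results).
example = {"J48": [0.70, 0.80], "PART": [0.60, 0.90]}
print(round(strategy_score(example), 4))
```

Because every classifier is evaluated on the same number of folds, averaging per classifier first gives the same value as averaging all fold accuracies at once.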

Up to a predefined depth: in general, the deeper we go, the higher the accuracy we obtain

Let’s generate attributes at depth > 4:
• Generate all attributes up to depth 4 and add 1,000 attributes randomly sampled from depth 5 to 7

Up to depth 4 + 1,000 in [5,7]

77.93% 77.72% 74.95% 76.30%
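The strategy of exhaustively generating shallow attributes and then sampling deep ones can be sketched as follows. For simplicity the sketch assumes the candidate attributes at every depth are already listed in a dictionary; in practice the deep attributes are too many to enumerate, so a real implementation would presumably sample paths and roll-up choices directly. The function and parameter names are hypothetical.

```python
import random


def scan_and_sample(attributes_by_depth, full_depth=4,
                    sample_depths=(5, 6, 7), n_samples=1000, seed=0):
    """Keep every attribute up to full_depth, then add a random sample of deeper ones."""
    selected = [
        attr
        for depth in sorted(attributes_by_depth)
        if depth <= full_depth
        for attr in attributes_by_depth[depth]
    ]
    deep_pool = [
        attr
        for depth in sample_depths
        for attr in attributes_by_depth.get(depth, [])
    ]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    selected += rng.sample(deep_pool, min(n_samples, len(deep_pool)))
    return selected


# Toy candidate lists (made up): 10 * depth attribute names per depth.
toy_attributes = {d: [f"d{d}_a{i}" for i in range(10 * d)] for d in range(1, 8)}
chosen = scan_and_sample(toy_attributes, n_samples=5)
print(len(chosen))  # 105
```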

Summary of the results
• Exhaustive is significantly better than Traditional
• Sampling deep attributes at the end of the attribute generation procedure is significantly better than continuing to generate non-deep attributes
• (in terms of the proportion of classifiers that perform better with one strategy than with the other)
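The slides do not say which statistical test was applied to the proportion of classifiers. One common choice for this kind of paired comparison is a sign test, sketched below as an assumption rather than as the authors' procedure; the counts in the example are made up.

```python
from math import comb


def sign_test_p_value(wins, n):
    """One-sided sign test.

    Probability of observing at least `wins` successes out of n paired
    comparisons if each strategy were equally likely to win.
    """
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Made-up example: one strategy beats the other on 9 of 10 classifiers.
print(sign_test_p_value(9, 10))  # 0.0107421875
```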

Comparison to fingerprints

The difference is not significant

• Fingerprints: years of research effort in order to identify this attribute representation
• Our method: 2 hours of computing time

Additional Attribute Generation Strategies
• Let’s not sample deep attributes randomly
• Strategy 1: find the best mix of depths (scatter search)
• Strategy 2: use a Bayesian Network to retrieve attributes with high information gain
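For reference, the information gain of a discrete attribute with respect to the class can be computed as below. This only illustrates the quantity that Strategy 2 targets; it does not implement the Bayesian Network itself, and the toy values are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Class entropy minus the expected entropy after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


classes = [0, 0, 1, 1]
print(information_gain([0, 0, 1, 1], classes))  # 1.0 (attribute separates the classes)
print(information_gain([0, 1, 0, 1], classes))  # 0.0 (attribute is uninformative)
```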

New knowledge
• Although our best method does not improve upon fingerprints, it has the potential of generating new knowledge
• The attributes used by the classifiers represent important characteristics
– Number of bromine atoms
– The average number of double bonds among the atoms different from S
These attributes identify structures that have characteristics that may prevent mutagenesis

New knowledge
• But, sometimes, deep attributes are hard to interpret.
• On Estrogen:
• Label each atom A in the following way. 1) Consider the atoms connected to it and, for each of them, count the bonds in which it participates (excluding the bond connecting it to A). 2) Compute the sum of these counts to obtain the label for A. Label the molecule with the minimum of these labels across all oxygen atoms.
• Specifically, a high value would represent an oxygen atom that is connected to other atoms participating in a large number of additional bonds, presumably an oxygen atom that is somewhat buried and interacting with highly branched atoms.
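The verbal description of this attribute is easier to follow on a small graph. The sketch below computes it for a made-up five-atom skeleton; the function name and the molecule encoding are hypothetical, and each pair of bonded atoms is counted as one bond regardless of bond order, which is an assumption.

```python
def molecule_label(elements, bonds, target_element="O"):
    """Minimum, over atoms of target_element, of the summed 'extra' bonds of their neighbors.

    elements: {atom_id: element symbol}
    bonds: list of (atom_id, atom_id) pairs, one per bonded pair
    """
    neighbors = {atom: set() for atom in elements}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def atom_label(atom):
        # For each neighbor, count its bonds other than the one shared with `atom`.
        return sum(len(neighbors[nb]) - 1 for nb in neighbors[atom])

    labels = [atom_label(a) for a, ele in elements.items() if ele == target_element]
    return min(labels) if labels else None


# Toy skeleton (made up): O0-C1, C1-C2, C1-C3, C3-O4.
toy_elements = {0: "O", 1: "C", 2: "C", 3: "C", 4: "O"}
toy_bonds = [(0, 1), (1, 2), (1, 3), (3, 4)]
print(molecule_label(toy_elements, toy_bonds))  # 1
```

Here atom 0 gets label 2 (its neighbor C1 has two other bonds) and atom 4 gets label 1 (its neighbor C3 has one other bond), so the molecule is labeled with the minimum, 1.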

Conclusions
• The current attribute representations (Fingerprints) used for molecule classification do not provide insights on the chemical properties of the compounds
• Traditional propositionalization approaches do not obtain satisfactory accuracy
• Our method extends the traditional propositionalization approach and:
– Obtains an accuracy comparable to Fingerprints
– Has the potential of finding new knowledge
• Note that our method is applicable to any domain (marketing, medical, etc.)

Future Work
• Accuracy improvement:
– Scan & Sample => Scan & Smartly Sample
• Improve the feature representation
– Query-like is ok for computer scientists, but chemists would prefer a graphical representation

Thank you for your attention

