A Randomized Exhaustive Propositionalization Approach for Molecule Classification
Michele Samorani, Manuel Laguna, Kirk DeLisle, Daniel Weaver
![Page 1: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/1.jpg)
A Randomized Exhaustive Propositionalization Approach for Molecule Classification
Michele Samorani, Manuel Laguna, Kirk DeLisle, Daniel Weaver
![Page 2: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/2.jpg)
DRUG DISCOVERY
![Page 3: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/3.jpg)
Drug Discovery
• The process of developing new drugs
• The cost of developing a drug typically ranges from $500 million to $2 billion
• Molecule classification is used throughout the process to discriminate between:
– Active and non-active compounds
– Toxic and non-toxic compounds
• During the development of a new drug:
– Use the experiments done so far to train a classifier
– Use the classifier to find the promising compounds to test next
• An ideal classification algorithm:
– Speeds up the design of new drugs
– Gives insights about chemical properties
![Page 4: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/4.jpg)
Data Mining in Drug Discovery
[Diagram: compounds labeled Active (1) or Non-Active (0) → Attribute representation → Classifier. The chemist designs a compound, and the classifier answers: Non-Active!]
![Page 5: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/5.jpg)
Molecule classification – Binary Fingerprints
[Table: one row per compound (1 to 400), with a Class column and attribute columns A1, …]
One of the main attribute representations is the so-called Binary Fingerprints:
– Every attribute represents the absence/presence (0/1) of a characteristic or a substructure
– The attributes are pre-defined characteristics
– The classification process does not find new knowledge, only which attributes are most important
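As a rough illustration of this representation, the sketch below encodes each compound as a 0/1 vector over a fixed list of pre-defined characteristics. The feature names and the toy compounds are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
# Hypothetical pre-defined characteristics (not from the slides).
FEATURES = ["has_bromine", "has_double_bond", "has_ring"]

def fingerprint(characteristics):
    """Return a 0/1 vector: 1 if the pre-defined feature is present, else 0."""
    return [1 if f in characteristics else 0 for f in FEATURES]

# Toy compounds, each described by the set of characteristics it exhibits.
compounds = {
    1: {"has_double_bond", "has_ring"},
    2: {"has_bromine"},
}

# One row per compound, one binary attribute per pre-defined feature.
mining_table = {cid: fingerprint(chars) for cid, chars in compounds.items()}
print(mining_table)
```

Because the feature list is fixed in advance, a classifier trained on this table can only rank the given attributes; it cannot propose a new one.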
![Page 6: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/6.jpg)
Molecule classification – Binary Fingerprints
[Table: one row per compound (1 to 400), with a Class column and attribute columns A1, …]
• The focus of this work is NOT on improving the classification procedure
• It is on how to generate a good attribute representation that generates new knowledge
![Page 7: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/7.jpg)
PROPOSITIONALIZATION
![Page 8: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/8.jpg)
Propositionalization
• The starting point is a database
• By navigating through the database, new features are generated, which represent the result of SQL queries
• These features are added to the mining table
![Page 9: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/9.jpg)
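TARGET (it contains the compounds)

| idTarget | class |
|---|---|
| 1 | 0 |
| 2 | 1 |
| … | … |

BOND

| idTarget | idBond | type |
|---|---|---|
| 1 | 0 | 2 |
| 1 | 1 | 1 |
| … | … | … |

ATOMBOND

| idTarget | idAtom | idBond |
|---|---|---|
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| … | … | … |

ATOM

| idTarget | idAtom | ele |
|---|---|---|
| 1 | 0 | O |
| 1 | 1 | C |
| … | … | … |

[The links between the tables in the diagram are labeled with 1 and n cardinalities.]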
![Page 10: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/10.jpg)
Generating a new attribute
• Two steps:
1. Find a path that starts from the target table
2. Roll up one simple attribute, through aggregations and refinements, from the last table to the target table
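Step 1 can be sketched as a walk over the schema graph. The links in `SCHEMA` below are an assumption made for illustration (the original diagram does not survive in this transcript), and restricting the search to simple paths, where no table is visited twice, is also a simplification rather than something the slides state.

```python
# Assumed foreign-key links between the tables (hypothetical).
SCHEMA = {
    "TARGET": ["ATOM", "BOND"],
    "ATOM": ["TARGET", "ATOMBOND"],
    "BOND": ["TARGET", "ATOMBOND"],
    "ATOMBOND": ["ATOM", "BOND"],
}

def paths_from(start, max_hops):
    """Enumerate simple paths (no repeated table) of 1..max_hops joins."""
    found = []

    def extend(path):
        if len(path) > 1:
            found.append(tuple(path))
        if len(path) - 1 == max_hops:
            return
        for nxt in SCHEMA[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    extend([start])
    return found

for p in paths_from("TARGET", 2):
    print(" -> ".join(p))
```

Each path returned here would then be a candidate for step 2, where an attribute of the last table is rolled back up to the target table.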
![Page 11: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/11.jpg)
STEP 1: Find a path
[Schema diagram repeated: TARGET, BOND, ATOMBOND, ATOM]
![Page 12: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/12.jpg)
STEP 1: Find a path
[Schema diagram repeated: TARGET, BOND, ATOMBOND, ATOM]
This path will find attributes of depth 2.
Depth = a measure of how complex the attribute is.
![Page 13: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/13.jpg)
STEP 2: Roll-up
[Schema diagram repeated: TARGET, BOND, ATOMBOND, ATOM]
Aggregate to each Atom: count distinct bonds (CDB)
![Page 14: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/14.jpg)
STEP 2: Roll-up
[Schema diagram repeated; ATOM gains a CDB column with sample values 1 and 2]
Aggregate to each Atom: count distinct bonds (CDB)
![Page 15: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/15.jpg)
STEP 2: Roll-up
[Schema diagram repeated; ATOM carries the CDB column (sample values 1 and 2) and TARGET gains a column X (sample values 2 and 3)]
Attach to the target table: the maximum number of bonds in which a carbon atom participates
X = max(Atom.CDB) where Atom.ele = 'C'
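Since the generated features are described as results of SQL queries, the roll-up on this slide can be sketched as a single query. This is a minimal sketch using Python's built-in `sqlite3` module and made-up rows; the table and column names follow the slides, while the data, the query text, and the variable names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ATOM (idTarget INTEGER, idAtom INTEGER, ele TEXT);
CREATE TABLE ATOMBOND (idTarget INTEGER, idAtom INTEGER, idBond INTEGER);
""")

# Toy data (hypothetical): compound 1 is an O-C-C chain, compound 2 has a
# carbon atom bonded to three oxygen atoms.
conn.executemany("INSERT INTO ATOM VALUES (?, ?, ?)", [
    (1, 0, "O"), (1, 1, "C"), (1, 2, "C"),
    (2, 0, "C"), (2, 1, "O"), (2, 2, "O"), (2, 3, "O"),
])
conn.executemany("INSERT INTO ATOMBOND VALUES (?, ?, ?)", [
    (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 2, 1),
    (2, 0, 0), (2, 1, 0), (2, 0, 1), (2, 2, 1), (2, 0, 2), (2, 3, 2),
])

QUERY = """
SELECT a.idTarget, MAX(c.CDB) AS X
FROM ATOM AS a
JOIN (
    -- Aggregation: count distinct bonds (CDB) for each atom.
    SELECT idTarget, idAtom, COUNT(DISTINCT idBond) AS CDB
    FROM ATOMBOND
    GROUP BY idTarget, idAtom
) AS c ON c.idTarget = a.idTarget AND c.idAtom = a.idAtom
WHERE a.ele = 'C'          -- Refinement: keep carbon atoms only.
GROUP BY a.idTarget        -- Roll up to one value per compound.
ORDER BY a.idTarget
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [(1, 2), (2, 3)]
```

The inner query is the aggregation to each atom, the `WHERE` clause is the refinement, and the outer `MAX` attaches one value per compound to the target table.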
![Page 16: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/16.jpg)
Propositionalization – graphically
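[Table: the mining table, one row per compound (1 to 400), with a Class column and generated attribute columns A1 … A55, A56 … A118 … A346 …, grouped left to right into Depth 1, Depth 2, Depth 3 and Depth 4]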
![Page 17: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/17.jpg)
Our contribution over traditional propositionalization
Our Randomized Exhaustive approach produces:
1. More expressive attributes (Exhaustive)
2. “Deeper” attributes (Randomized)
![Page 18: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/18.jpg)
More expressive attributes – Example
• Traditional propositionalization algorithms can generate the following attribute:
– Count the number of double bonds in which each atom participates
– Compute the maximum
• But not the following attribute:
– Count the number of double bonds in which each atom participates
– Compute the maximum among the oxygen atoms
![Page 19: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/19.jpg)
Attributes: Traditional vs Exhaustive
[Figure: two panels, Activity and Mutagenicity]
![Page 20: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/20.jpg)
EXPERIMENTS
![Page 21: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/21.jpg)
Design of the experiments
• Given an attribute generation strategy:
– Perform a 10-fold cross-validation using 10 different classifiers (from Weka): MultilayerPerceptron, BayesNet, Bagging, J48, ADTree, REPTree, RandomForest, PART, NNge, Ridor
• The average accuracy across the folds and across the classifiers is the measure of the performance of the strategy used
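The performance measure can be sketched as a double loop over classifiers and folds. The experiments used Weka classifiers; the two toy learners below are hypothetical stand-ins written in plain Python so that the averaging logic is runnable on its own, and the contiguous, unshuffled folds are a simplification.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def majority_classifier(train_X, train_y):
    """Toy stand-in for a real learner: always predicts the majority class."""
    label = max(set(train_y), key=train_y.count)
    return lambda x: label

def threshold_classifier(train_X, train_y):
    """Toy stand-in: predicts 1 when the first attribute is positive."""
    return lambda x: 1 if x[0] > 0 else 0

def strategy_score(X, y, learners, k=10):
    """Average accuracy across all folds and all learners."""
    accuracies = []
    for learner in learners:
        for test_idx in k_fold_indices(len(X), k):
            test = set(test_idx)
            train_X = [X[i] for i in range(len(X)) if i not in test]
            train_y = [y[i] for i in range(len(X)) if i not in test]
            predict = learner(train_X, train_y)
            hits = sum(predict(X[i]) == y[i] for i in test_idx)
            accuracies.append(hits / len(test_idx))
    return mean(accuracies)

# Toy mining table: the class equals the single attribute.
X = [[i % 2] for i in range(20)]
y = [i % 2 for i in range(20)]
score = strategy_score(X, y, [majority_classifier, threshold_classifier])
print(score)
```

Averaging over several learners, rather than reporting the best one, makes the score reflect the attribute set more than any single classifier.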
![Page 22: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/22.jpg)
Up to a predefined depth
In general, the deeper we go, the higher the accuracy we obtain.
Let’s generate attributes at depth > 4.
![Page 23: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/23.jpg)
Generate all attributes up to depth 4 and add 1,000 attributes randomly sampled from depth 5 to 7
Up to depth 4 + 1,000 in [5,7]
77.93% 77.72% 74.95% 76.30%
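The selection rule on this slide (everything up to depth 4, plus 1,000 random attributes from depths 5 to 7) can be sketched as follows. The function name, the attribute names, and the idea of holding the deep attributes in an in-memory pool are assumptions for illustration; in practice the deep attributes would presumably be sampled without enumerating them all.

```python
import random

def scan_and_sample(attributes_by_depth, full_depth=4,
                    sample_depths=(5, 6, 7), n_samples=1000, seed=0):
    """Keep every attribute up to full_depth, then add a random sample of
    attributes drawn from the deeper levels."""
    rng = random.Random(seed)
    selected = [a for d, attrs in sorted(attributes_by_depth.items())
                if d <= full_depth for a in attrs]
    deep_pool = [a for d in sample_depths
                 for a in attributes_by_depth.get(d, [])]
    selected += rng.sample(deep_pool, min(n_samples, len(deep_pool)))
    return selected

# Toy pool (hypothetical names): 10 * d attributes at each depth d = 1..7.
attributes_by_depth = {d: [f"A{d}_{i}" for i in range(10 * d)]
                       for d in range(1, 8)}
chosen = scan_and_sample(attributes_by_depth, n_samples=5)
print(len(chosen))  # 105
```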
![Page 24: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/24.jpg)
Summary of the results
• Exhaustive is significantly better than Traditional
• Sampling deep attributes at the end of the attribute generation procedure is significantly better than continuing to generate non-deep attributes
• (in terms of the proportion of classifiers that perform better with one strategy than with the other)
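The slides do not say which statistical test was used. One common way to test a proportion of this kind is a two-sided sign test, sketched below purely as an assumption; the win and loss counts are made up.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Two-sided sign test: probability, under a fair coin, of a split at
    least as lopsided as the one observed (ties are dropped beforehand)."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcome: 9 of 10 classifiers do better with strategy A.
p = sign_test_p_value(9, 1)
print(round(p, 4))  # 0.0215
```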
![Page 25: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/25.jpg)
Comparison to fingerprints
The difference is not significant
![Page 26: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/26.jpg)
Comparison to fingerprints
The difference is not significant
Fingerprints: years of research effort to identify this attribute representation
Our method: 2 hours of computing time
![Page 27: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/27.jpg)
Additional Attribute Generation Strategies
• Let’s not sample deep attributes randomly
• Strategy 1: find the best mix of depths (scatter search)
• Strategy 2: use a Bayesian Network to retrieve attributes with high information gain
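For reference, the information gain of a discrete attribute with respect to the class can be computed as below. This sketch covers only the scoring quantity named in Strategy 2; it does not attempt to reproduce the Bayesian Network itself, and the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the class minus its expected entropy after splitting the
    compounds by the value of a discrete attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = [1, 1, 0, 0]
print(information_gain([1, 1, 0, 0], labels))  # attribute determines the class
print(information_gain([1, 0, 1, 0], labels))  # attribute is uninformative
```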
![Page 28: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/28.jpg)
New knowledge
• Although our best method does not improve upon fingerprints, it has the potential to generate new knowledge
• The attributes used by the classifiers represent important characteristics:
– Number of bromine atoms
– The average number of double bonds among the atoms other than S
These attributes identify structures with characteristics that may prevent mutagenesis
![Page 29: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/29.jpg)
New knowledge
• But, sometimes, deep attributes are hard to interpret.
• On Estrogen:
• Label each atom A in the following way. 1) Consider the atoms connected to it and count the bonds in which they participate (excluding the bond connecting A to each of them). 2) Compute the sum of these counts to obtain the label for A. Label the molecule with the minimum of these labels across all oxygen atoms.
• Specifically, a high value would represent an oxygen atom that is connected to other atoms participating in a large number of additional bonds, presumably an oxygen atom that is somewhat buried and interacting with highly branched atoms.
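Written out as code, the attribute is easier to follow. The sketch below is one reading of the description above, assuming the molecule is given as a dictionary of element symbols and a list of bonded atom pairs, with each bond counted once regardless of its type; the function name and the toy molecules are hypothetical.

```python
def min_oxygen_label(elements, bonds):
    """elements: dict atom id -> element symbol; bonds: list of (a, b) pairs.
    Returns the minimum label over oxygen atoms, or None without oxygen."""
    neighbors = {a: set() for a in elements}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def label(atom):
        # For each neighbor, count its bonds other than the one back to `atom`,
        # then sum these counts.
        return sum(len(neighbors[n]) - 1 for n in neighbors[atom])

    oxygen_labels = [label(a) for a, ele in elements.items() if ele == "O"]
    return min(oxygen_labels) if oxygen_labels else None

# Toy skeletons (hydrogens omitted): C-C-O, and C-C with two oxygens on the
# second carbon.
chain = min_oxygen_label({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
branched = min_oxygen_label({0: "C", 1: "C", 2: "O", 3: "O"},
                            [(0, 1), (1, 2), (1, 3)])
print(chain, branched)  # 1 2
```

The branched skeleton scores higher because its oxygen atoms sit next to a carbon with more additional bonds, which matches the interpretation given in the last bullet.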
![Page 30: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/30.jpg)
Conclusions
• The current attribute representations (Fingerprints) used for molecule classification do not provide insights into the chemical properties of the compounds
• Traditional propositionalization approaches do not obtain satisfactory accuracy
• Our method extends the traditional propositionalization approach and:
– Obtains an accuracy comparable to Fingerprints
– Has the potential to find new knowledge
• Note that our method is applicable to any domain (marketing, medical, etc.)
![Page 31: A Randomized Exhaustive Propositionalization Approach for Molecule Classification](https://reader036.vdocument.in/reader036/viewer/2022062410/56815f2e550346895dcdfae5/html5/thumbnails/31.jpg)
Future Work
• Accuracy improvement:
– Scan & Sample => Scan & Smartly Sample
• Improve the feature representation:
– A query-like form is fine for computer scientists, but chemists would prefer a graphical representation