Biological Data Mining - Cheminformatics 1
Patrick Hoffman

TRANSCRIPT

Page 1: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

04/22/23 1

Biological Data Mining Cheminformatics 1

Patrick Hoffman

Page 2: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Tonight's Topics

• Review Lab (Excel, Weka, R-project, Clementine) - SarToxDem4.zip
• Review
    – Regression?
    – SarPredict classify?
• Flattening - exploding
• Best SarPredict classifier (Naïve Bayes, others in Weka)
• Naïve Bayes - explained
• R code for probability density: pnorm, dnorm
• Association Rules
• Predictive Tox - SarToxDem
• Data - ISIS keys, MolConnZ descriptors
• PCA, MDS, Sammon plots
• Other clustering techniques - comparison??

Page 3: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Lab - Understand & find best classifier

• Download SarTox-Dem2.zip
• Unzip (SarTox-Dem2.csv)
• Load into Excel (modify?)
• Load into Weka (visualize)
• Load into Clementine (output to table)
• Load into R-project:
    filename <- "c:/MLCourse/SarTox-Dem4.csv"
    csv <- read.csv(filename)
    attach(csv)
• Histograms of Act-5/BAct-5 (Excel, R, Clementine - overlays)

Page 4: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Example - SAR Data (SarPredict.csv)

• Structural Activity Relationship
• 960 chemicals (records)
• 26 data fields (variables)
    – 11 Biological Activity measures
    – 11 Chemical descriptors
    – 4 Quality Control variables

Page 5: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Regression vs Classification

• Regression was hard!!!
• On Active vs Inactive - a two-class problem
• An easier problem
• Problems
    – R-groups are text strings (flatten or explode)
    – Unbalanced classes
• Naïve Bayes

Page 6: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes with 11 chemical descriptors

Correctly Classified Instances      923       96.1458 %
Incorrectly Classified Instances     37        3.8542 %
Kappa statistic                       0.0494
K&B Relative Info Score          -68022.5337 %
K&B Information Score              -166.1076 bits   -0.173  bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme           203.4613 bits    0.2119 bits/instance
Complexity improvement (Sf)          27.3938 bits    0.0285 bits/instance
Mean absolute error                   0.0677
Root mean squared error               0.1858
Relative absolute error              87.8712 %
Root relative squared error          95.2927 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0.974    0.961      1       0.98       Inactive
0.026    0        1          0.026   0.051      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 922    0 |  a = Inactive
  37    1 |  b = Active

Page 7: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Flattening? - Exploding?

4 categorical columns to:

Page 8: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


25 Binary columns

Page 9: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes with 36 chemical descriptors

Correctly Classified Instances      912       95      %
Incorrectly Classified Instances     48        5      %
Kappa statistic                       0.2475
K&B Relative Info Score          -72819.8932 %
K&B Information Score              -177.8225 bits   -0.1852 bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme           249.0367 bits    0.2594 bits/instance
Complexity improvement (Sf)         -18.1817 bits   -0.0189 bits/instance
Mean absolute error                   0.067
Root mean squared error               0.1959
Relative absolute error              87.051  %
Root relative squared error         100.4683 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.979    0.763    0.969      0.979   0.974      Inactive
0.237    0.021    0.321      0.237   0.273      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 903   19 |  a = Inactive
  29    9 |  b = Active

Overall accuracy is worse, but the Active class is better.

Page 10: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes - 36 descriptors, no normalization

Correctly Classified Instances      878       91.4583 %
Incorrectly Classified Instances     82        8.5417 %
Kappa statistic                       0.3111
K&B Relative Info Score         -135125.7751 %
K&B Information Score              -329.9703 bits   -0.3437 bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme           382.5589 bits    0.3985 bits/instance
Complexity improvement (Sf)        -151.7039 bits   -0.158  bits/instance
Mean absolute error                   0.0968
Root mean squared error               0.2564
Relative absolute error             125.6851 %
Root relative squared error         131.4682 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.928    0.421    0.982      0.928   0.954      Inactive
0.579    0.072    0.25       0.579   0.349      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 856   66 |  a = Inactive
  16   22 |  b = Active

Page 11: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Voting Feature Intervals classifier

Time taken to build model: 0.05 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances      728       75.8333 %
Incorrectly Classified Instances    232       24.1667 %
Kappa statistic                       0.1168
K&B Relative Info Score         -909872.6115 %
K&B Information Score             -2221.8629 bits   -2.3144 bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme           759.1737 bits    0.7908 bits/instance
Complexity improvement (Sf)        -528.3187 bits   -0.5503 bits/instance
Mean absolute error                   0.357
Root mean squared error               0.4215
Relative absolute error             463.5091 %
Root relative squared error         216.1394 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.762    0.342    0.982      0.762   0.858      Inactive
0.658    0.238    0.102      0.658   0.177      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 703  219 |  a = Inactive
  13   25 |  b = Active

Page 12: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

PART Classifier?

Time taken to build model: 1.48 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances      912       95      %
Incorrectly Classified Instances     48        5      %
Kappa statistic                       0.2475
K&B Relative Info Score          -30559.9665 %
K&B Information Score               -74.6259 bits   -0.0777 bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme         18419.7155 bits   19.1872 bits/instance
Complexity improvement (Sf)      -18188.8604 bits  -18.9467 bits/instance
Mean absolute error                   0.0641
Root mean squared error               0.214
Relative absolute error              83.283  %
Root relative squared error         109.7552 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.979    0.763    0.969      0.979   0.974      Inactive
0.237    0.021    0.321      0.237   0.273      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 903   19 |  a = Inactive
  29    9 |  b = Active

Page 13: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Duplicate Active from 38 to 304

Time taken to build model: 0.06 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances     1029       83.9315 %
Incorrectly Classified Instances    197       16.0685 %
Kappa statistic                       0.5996
K&B Relative Info Score           60187.846  %
K&B Information Score               486.9928 bits    0.3972 bits/instance
Class complexity | order 0          990.6589 bits    0.808  bits/instance
Class complexity | scheme          1151.1811 bits    0.939  bits/instance
Complexity improvement (Sf)        -160.5222 bits   -0.1309 bits/instance
Mean absolute error                   0.1714
Root mean squared error               0.3647
Relative absolute error              45.9283 %
Root relative squared error          84.4536 %
Total Number of Instances          1226

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.856    0.211    0.925      0.856   0.889      Inactive
0.789    0.144    0.643      0.789   0.709      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 789  133 |  a = Inactive
  64  240 |  b = Active

Page 14: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes Classifier

First, what is a Bayes classifier?

Bayes Theorem:    P(Ck|x) = p(x|Ck) P(Ck) / p(x)

Ck = class    x = attribute vector
P = posterior probability    p = unconditional density
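As a concrete check of the theorem, a minimal Python sketch with assumed likelihood and prior values (illustrative numbers only, not from the SAR data):

```python
# Bayes theorem: P(Ck|x) = p(x|Ck) * P(Ck) / p(x)
# Assumed likelihoods p(x|Ck) and priors P(Ck) for a two-class problem
likelihood = {"Active": 0.30, "Inactive": 0.05}
prior      = {"Active": 0.04, "Inactive": 0.96}

# p(x) normalizes the posteriors: sum over classes of p(x|Ck) * P(Ck)
evidence = sum(likelihood[c] * prior[c] for c in prior)

posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
# A Bayes classifier picks the class with the largest posterior
best = max(posterior, key=posterior.get)
```

Note how the 0.96 prior lets Inactive win even though the Active likelihood is six times larger - the same unbalanced-class effect seen in the SAR runs above.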

Page 15: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Bayes Classifier

P(Ck|x) > P(Cj|x)

Simply choose the class having the largest posterior probability given the feature vector x.

This is the same as

    p(x|Ck) P(Ck) > p(x|Cj) P(Cj)

Problem: what are p(x|Ck) and p(x|Cj)?

Page 16: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

If one knew the real density function there would be no problem. Normally, p(x|Ck) is a multivariate joint probability density function.

• 1. If there is enough data, build histograms
• 2. Guess the distribution (Gaussian?)
• 3. Calculate the mean and std. dev.
• 4. Use a parametric method: the mean and std. dev. would be the parameters used to calculate p(x|Ck)
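Point 4 can be sketched with Python's standard library: estimate the Gaussian's two parameters from a hypothetical training sample, then evaluate p(x|Ck) from the fitted density (the R equivalent uses mean, sd, and dnorm):

```python
from statistics import NormalDist, mean, stdev

# Hypothetical training values of one attribute for class Ck (not real SAR data)
sample = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0]

# Parametric method: the mean and std. dev. are the Gaussian's only parameters
mu, sigma = mean(sample), stdev(sample)
density = NormalDist(mu, sigma)

# p(x|Ck) evaluated at a query point x
px = density.pdf(5.0)
```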

Page 17: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

The standard normal or Gaussian density function of a single variable:

    p(x) = 1 / (sigma * sqrt(2*pi)) * exp( -(x - mu)^2 / (2*sigma^2) )

Page 18: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Gaussian density of a multivariate distribution (S is the d x d covariance matrix):

    p(x) = (2*pi)^(-d/2) * |S|^(-1/2) * exp( -(1/2) * (x - mu)' S^-1 (x - mu) )

Page 19: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Problems

• The above gives d(d+3)/2 parameters to estimate for the joint density function
• Time consuming, difficult, and might not be the correct density function
• Many of the dimensions or attributes might be independent
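The d(d+3)/2 figure is just d means plus the d(d+1)/2 distinct entries of a symmetric d x d covariance matrix; a quick check:

```python
def gaussian_param_count(d: int) -> int:
    # d means + d*(d+1)/2 distinct covariance entries = d*(d+3)/2
    return d * (d + 3) // 2

# e.g. for 11 chemical descriptors, or all 26 SarPredict fields
few, many = gaussian_param_count(11), gaussian_param_count(26)
```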

Page 20: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Why not build a d-dimensional histogram for each class? This would approximate the joint density function.

Page 21: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

• Say 10 bins (or values) for each dimension (attribute)
• d=2: 100 bins
• d=3: 1,000 bins; d=4: 10,000; etc.
• All of that multiplied by the number of classes
• Usually not enough data or time
• Not enough data to fill the bins

Curse of dimensionality!!!
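The counts above follow directly from bins_per_dim^d cells per class histogram; a tiny function makes the growth explicit:

```python
def histogram_cells(bins_per_dim: int, d: int, n_classes: int) -> int:
    # one d-dimensional histogram per class
    return n_classes * bins_per_dim ** d

# 10 bins per attribute, as on the slide
growth = [histogram_cells(10, d, 1) for d in (2, 3, 4)]
```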

Page 22: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve or Simple Bayes is the answer.

• Assume all dimensions or attributes are independent!
• Simple probability product rule:
    P(x|Ck) --> P(A1|Ck) * P(A2|Ck) * ... * P(Ad|Ck)
  for d attributes
• One can estimate P(Ai|Ck) as Gaussian, or build a histogram for each attribute in a training set
• 10 dimensions, 10 bins becomes 10^2 bins, not 10^10
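The product rule combined with the Gaussian estimate is exactly what the R pnorm examples on the later slides compute: per attribute, the class-conditional probability of landing in [lo, hi] is cdf(hi) - cdf(lo), and the per-attribute terms are multiplied. A Python sketch with made-up per-class statistics (illustrative, not real data):

```python
from statistics import NormalDist

# Hypothetical per-attribute (mean, sd) for one class Ck -- illustrative only
class_stats = [(5.0, 0.4), (3.4, 0.4), (1.5, 0.2), (0.25, 0.1)]
lo, hi = 0.0, 5.0

# Product rule: P(x in range | Ck) ~= product over attributes Ai of
# P(lo < Ai < hi | Ck), each estimated from a per-attribute Gaussian
p = 1.0
for mu, sigma in class_stats:
    nd = NormalDist(mu, sigma)
    p *= nd.cdf(hi) - nd.cdf(lo)   # pnorm(hi, m, s) - pnorm(lo, m, s) in R
```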

Page 23: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Discretization (binning) is better

• Building the histograms is better if you have enough data
• MLC++ has both Naïve Bayes (assumes Gaussian) and Discrete Naïve Bayes
• Several binning techniques (see Kohavi)
• The entropy-based method is very good (see
  http://www.cs.uml.edu/~fjara/mineset/id3/id3_example/id3example.html)

Page 24: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Histogram for each class - Iris

Page 25: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Simple NB in R - What class is most likely to have all dimensions between 0 and 5?

data(iris)                  # get data
attach(iris)                # make names available
hi = 5                      # set hi limit for possible dimensions
lo = 0                      # set lo limit
a1 = Sepal.Length[1:50]     # a vector of the data for each class
a2 = Sepal.Length[51:100]
a3 = Sepal.Length[101:150]
b1 = Sepal.Width[1:50]
b2 = Sepal.Width[51:100]
b3 = Sepal.Width[101:150]
c1 = Petal.Width[1:50]
c2 = Petal.Width[51:100]
c3 = Petal.Width[101:150]
d1 = Petal.Length[1:50]
d2 = Petal.Length[51:100]
d3 = Petal.Length[101:150]
# gets probability of each dimension of each class being in a certain range
p1setosa     = pnorm(hi, mean(a1), sd(a1)) - pnorm(lo, mean(a1), sd(a1))
p1versicolor = pnorm(hi, mean(a2), sd(a2)) - pnorm(lo, mean(a2), sd(a2))
p1virginica  = pnorm(hi, mean(a3), sd(a3)) - pnorm(lo, mean(a3), sd(a3))
p2setosa     = pnorm(hi, mean(b1), sd(b1)) - pnorm(lo, mean(b1), sd(b1))
p2versicolor = pnorm(hi, mean(b2), sd(b2)) - pnorm(lo, mean(b2), sd(b2))
p2virginica  = pnorm(hi, mean(b3), sd(b3)) - pnorm(lo, mean(b3), sd(b3))
p3setosa     = pnorm(hi, mean(c1), sd(c1)) - pnorm(lo, mean(c1), sd(c1))
p3versicolor = pnorm(hi, mean(c2), sd(c2)) - pnorm(lo, mean(c2), sd(c2))
p3virginica  = pnorm(hi, mean(c3), sd(c3)) - pnorm(lo, mean(c3), sd(c3))
p4setosa     = pnorm(hi, mean(d1), sd(d1)) - pnorm(lo, mean(d1), sd(d1))
p4versicolor = pnorm(hi, mean(d2), sd(d2)) - pnorm(lo, mean(d2), sd(d2))
p4virginica  = pnorm(hi, mean(d3), sd(d3)) - pnorm(lo, mean(d3), sd(d3))
psetosa     = p1setosa * p2setosa * p3setosa * p4setosa
pversicolor = p1versicolor * p2versicolor * p3versicolor * p4versicolor
pvirginica  = p1virginica * p2virginica * p3virginica * p4virginica
psetosa
pversicolor
pvirginica

pnorm calculates the cumulative probability.

Page 26: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Simple NB in R - What class is most likely to have all dimensions between 0 and 5?

data(iris)                  # get data
attach(iris)                # make names available
hi = 5                      # set hi limit for possible dimensions
lo = 0                      # set lo limit
# a vector of the data for each class
a1 = Sepal.Length[1:50]
a2 = Sepal.Length[51:100]
a3 = Sepal.Length[101:150]
b1 = Sepal.Width[1:50]
b2 = Sepal.Width[51:100]
b3 = Sepal.Width[101:150]
c1 = Petal.Width[1:50]
c2 = Petal.Width[51:100]
c3 = Petal.Width[101:150]
d1 = Petal.Length[1:50]
d2 = Petal.Length[51:100]
d3 = Petal.Length[101:150]

pnorm calculates the cumulative probability.

Page 27: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Simple NB in R - What class is most likely to have all dimensions between 0 and 5?

# gets probability of each dimension of each class being in a certain range
p1setosa     = pnorm(hi, mean(a1), sd(a1)) - pnorm(lo, mean(a1), sd(a1))
p1versicolor = pnorm(hi, mean(a2), sd(a2)) - pnorm(lo, mean(a2), sd(a2))
p1virginica  = pnorm(hi, mean(a3), sd(a3)) - pnorm(lo, mean(a3), sd(a3))
p2setosa     = pnorm(hi, mean(b1), sd(b1)) - pnorm(lo, mean(b1), sd(b1))
p2versicolor = pnorm(hi, mean(b2), sd(b2)) - pnorm(lo, mean(b2), sd(b2))
p2virginica  = pnorm(hi, mean(b3), sd(b3)) - pnorm(lo, mean(b3), sd(b3))
p3setosa     = pnorm(hi, mean(c1), sd(c1)) - pnorm(lo, mean(c1), sd(c1))
p3versicolor = pnorm(hi, mean(c2), sd(c2)) - pnorm(lo, mean(c2), sd(c2))
p3virginica  = pnorm(hi, mean(c3), sd(c3)) - pnorm(lo, mean(c3), sd(c3))
p4setosa     = pnorm(hi, mean(d1), sd(d1)) - pnorm(lo, mean(d1), sd(d1))
p4versicolor = pnorm(hi, mean(d2), sd(d2)) - pnorm(lo, mean(d2), sd(d2))
p4virginica  = pnorm(hi, mean(d3), sd(d3)) - pnorm(lo, mean(d3), sd(d3))

psetosa     = p1setosa * p2setosa * p3setosa * p4setosa
pversicolor = p1versicolor * p2versicolor * p3versicolor * p4versicolor
pvirginica  = p1virginica * p2virginica * p3virginica * p4virginica

psetosa
pversicolor
pvirginica

pnorm calculates the cumulative probability.

Page 28: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Better NB in R - What class is most likely to have all dimensions between 0 and 5?

## A better way using loops
csv = iris
N = ncol(csv)            # get number of columns
N = N - 1                # don't do the class column
R = nrow(csv)
stats = matrix(0, N, 3)  # store the probabilities for each class and each dimension
probs = matrix(1, 3, 1)  # final probabilities for each class

# loop for 3 classes
for (lp2 in 1:3) {
  # get mean and sd for each class and each dimension
  # loop for each dimension
  for (lp1 in 1:N) {
    clix1 = (lp2-1)*50 + 1
    clix2 = clix1 + 49
    d1 = csv[clix1:clix2, lp1]   # where each class's data is
    m = mean(d1)
    s = sd(d1)
    stats[lp1, lp2] = pnorm(hi, m, s) - pnorm(lo, m, s)
    probs[lp2] = probs[lp2] * stats[lp1, lp2]
  }
}
stats
probs

pnorm calculates the cumulative probability. (hi and lo are set as on the previous slides.)

Page 29: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

This SAR Example

• Regression failed
• Classification failed
• Any other machine learning tricks?

Page 30: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Association rules

• Look for possible rules that have high confidence and support
• There are many; a good method will let you specify what you are looking for
• These are small little pieces of the dimensional space
• Binning or discretization is usually necessary - best is smart or entropy binning

Example:

S5 > 6.405 and 'R3 fmla' = "CN-" and 'R4 fmla' = "C4H9-"

Page 31: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

A. Rules - so far Clementine GRI is the best (does binning)

• Rules below are only for selection index Active
• Support = percent of instances in the dataset where the antecedents are true
• Confidence = percentage of support instances where the consequent is true
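Both definitions are direct ratios; a toy Python sketch (hypothetical true/false flags, not the SAR data):

```python
# Toy instance list: (antecedents_true, consequent_true) per record
# -- hypothetical flags, not the SAR data
records = [(True, True), (True, True), (True, False), (False, True), (False, False)]

n = len(records)
hits = [r for r in records if r[0]]                     # antecedents true
support = len(hits) / n                                 # % of all instances
confidence = sum(1 for r in hits if r[1]) / len(hits)   # % of support instances
```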

Page 32: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Association Rules - Be Wary

• In the dataset there are only 8 instances where the confidence is 100
• Some of the rules can be redundant
• Looking at the output one might think there are 15 instances
• Generally one wants "good" rules where both support and confidence are high

Page 33: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Example - Predictive Toxicology

• Project Objective
    – Understand the relationship between chemical structure and liver isozyme inhibition.
• Data Overview
    – 100,000 chemicals (records)
    – 280 data fields (variables)
        • 1 biological assay
        • 4 liver isozyme assays
        • 275 chemical descriptors
            – 166 Substructure Search Keys - ISIS/Host
            – 109 Electro-topological State Indicators - MolConnZ

Page 34: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Smaller version - SarTox-Dem4.csv

• 82 keys
• 76 descriptors
• 1550 instances (records, rows)
• 1 id column
• 5 activity measurements
• 5 binned activity measurements
• Analyze/predict the last columns: Act-5, BAct-5
• 1280 non-toxic and 269 toxic

Page 35: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Analysis Stages

• Metadata Overview & Data Cleansing
• Isozyme activity binning
• Classifying & Clustering
• Association Rules
• TTEST/Feature Reduction
• Visualizations

Page 36: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Metadata Overview & Cleansing

• 10 ISIS keys and 5 MolConnZ descriptors had only zero values.
    – In our analyses these fields were eliminated from the dataset, reducing the number of descriptors and keys to 260.
• Many records contained missing values:
    – Biological Assay: ~49,000
    – Isozyme 1: ~50,000    Isozyme 3: ~55,000
    – Isozyme 2: ~50,000    Isozyme 4: ~50,000
• About 24,000 records have all values of the biological activity and the four liver isozymes

Page 37: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Pearson Cross Correlations

[Figure: 260 x 260 descriptor correlation matrix, with cells showing positive correlation, negative correlation, or no correlation.]

Page 38: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

ISIS Keys with PatchGrid™

[Figure: PatchGrid of 150 chemical classes x 166 ISIS keys (key on / key off), showing the ISIS key composition of each chemical class.]

Page 39: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

ISIS Keys with PatchGrid™

[Figure: the same 150 chemical classes x 166 ISIS keys grid (key on / key off), with the ISIS keys clustered to show chemical classes with similar keys.]

Page 40: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Data Binning

[Figure: distributions of Isozymes 1-4 binned from low to high inhibition, and of biological activity binned from low to high activity.]

Page 41: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Association Rules - Visually

Isozyme 1 Inhibition = high if:
    key1 > 0.5  &  key2 > 0.5  &  Descriptor A > 3.6

[Figure: Key 1, Key 2, and Descriptor A visualized with low/medium/high bins.]

Page 42: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Sub-Selection Overview

[Figure: linked views of a sub-selection - high inhibition of the sub-selection, biological activity of the sub-selection, high inhibition of the sub-selection in all classes, and the class with a high % inhibition.]

Page 43: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Narrowing Down

• Identify important dimensions in the sub-selection
• Apply important dimensions from the association rules
• Select a single chemical class with a high % inhibition
• Use a t-test or F-test to reduce keys and descriptors

Page 44: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Classification Via Cost Matrix

Cost of a false negative (toxic classified as nontoxic) = 1:

Cost Matrix   Confusion Matrix                 Class Accuracy        Precision
  0   1          a     b   <-- classified as   96.4% Overall
  1   0       19579   308 | a = nontoxic       98.5% Nontoxic        .978
                439   696 | b = toxic          61.3% Toxic
                                               79.9% Class average

Cost of a false negative = 100:

Cost Matrix   Confusion Matrix                 Class Accuracy        Precision
  0   1          a     b   <-- classified as   86.5% Overall
100   0       17239  2648 | a = nontoxic       86.7% Nontoxic        .989
                183   952 | b = toxic          83.9% Toxic
                                               85.3% Class average

Cost of a false negative = 500:

Cost Matrix   Confusion Matrix                 Class Accuracy        Precision
  0   1          a     b   <-- classified as   70.4% Overall
500   0       13753  6134 | a = nontoxic       69.2% Nontoxic        .993 = 13753/(13753+97)
                 97  1038 | b = toxic          91.5% Toxic
                                               80.3% Class average

Raising the cost of false negatives trades more false positives (nontoxic classified as toxic) for fewer false negatives (toxic classified as nontoxic).

Precision is the percentage of chemicals classified as nontoxic that actually are nontoxic.
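All of the percentages above are simple ratios of the confusion-matrix counts; for the cost-500 run:

```python
# Counts from the cost-500 confusion matrix above
tn, fp = 13753, 6134   # actual nontoxic: classified nontoxic / toxic
fn, tp = 97, 1038      # actual toxic:    classified nontoxic / toxic

overall = (tn + tp) / (tn + fp + fn + tp)       # 70.4 %
nontoxic_accuracy = tn / (tn + fp)              # 69.2 %
toxic_accuracy = tp / (fn + tp)                 # 91.5 %
class_average = (nontoxic_accuracy + toxic_accuracy) / 2
# Precision for "nontoxic": of everything classified nontoxic,
# the fraction that really is nontoxic
nontoxic_precision = tn / (tn + fn)             # .993
```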

Page 45: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes - all 158 attributes
Classifier: NaiveBayes -x 10 -v -o -i -k -t

Correctly Classified Instances     1245       91.6789 %
Incorrectly Classified Instances    113        8.3211 %
Kappa statistic                       0.7071
K&B Relative Info Score           76254.3843 %
K&B Information Score               457.5652 bits    0.3369 bits/instance
Class complexity | order 0          811.2275 bits    0.5974 bits/instance
Class complexity | scheme          3181.6444 bits    2.3429 bits/instance
Complexity improvement (Sf)       -2370.4169 bits   -1.7455 bits/instance
Mean absolute error                   0.0856
Root mean squared error               0.2781
Relative absolute error              34.4662 %
Root relative squared error          78.9588 %
Total Number of Instances          1358

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.888    0.078    0.658      0.888   0.756      Toxic
0.922    0.112    0.98       0.922   0.95       Non-Toxic

=== Confusion Matrix ===

   a     b   <-- classified as
 175    22 |  a = Toxic
  91  1070 |  b = Non-Toxic

Page 46: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

NB with top 20 attributes (TTest)
Classifier: NaiveBayes -x 10 -v -o -i -k -t

=== Stratified cross-validation ===

Correctly Classified Instances     1274       93.8144 %
Incorrectly Classified Instances     84        6.1856 %
Kappa statistic                       0.7453
K&B Relative Info Score           90035.977  %
K&B Information Score               540.2618 bits    0.3978 bits/instance
Class complexity | order 0          811.2275 bits    0.5974 bits/instance
Class complexity | scheme          1578.972  bits    1.1627 bits/instance
Complexity improvement (Sf)        -767.7445 bits   -0.5653 bits/instance
Mean absolute error                   0.0648
Root mean squared error               0.233
Relative absolute error              26.0612 %
Root relative squared error          66.1472 %
Total Number of Instances          1358

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.761    0.032    0.802      0.761   0.781      Toxic
0.968    0.239    0.96       0.968   0.964      Non-Toxic

=== Confusion Matrix ===

   a     b   <-- classified as
 150    47 |  a = Toxic
  37  1124 |  b = Non-Toxic

Page 47: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Best: Logistic with only 20 attributes
Classifier: Logistic -x 10 -v -o -i -k -t

=== Stratified cross-validation ===

Correctly Classified Instances     1319       97.1281 %
Incorrectly Classified Instances     39        2.8719 %
Kappa statistic                       0.8789
K&B Relative Info Score          107051.552  %
K&B Information Score               642.364  bits    0.473  bits/instance
Class complexity | order 0          811.2275 bits    0.5974 bits/instance
Class complexity | scheme           192.6762 bits    0.1419 bits/instance
Complexity improvement (Sf)         618.5513 bits    0.4555 bits/instance
Mean absolute error                   0.0465
Root mean squared error               0.1547
Relative absolute error              18.7318 %
Root relative squared error          43.9414 %
Total Number of Instances          1358

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.848    0.008    0.949      0.848   0.895      Toxic
0.992    0.152    0.975      0.992   0.983      Non-Toxic

=== Confusion Matrix ===

   a     b   <-- classified as
 167    30 |  a = Toxic
   9  1152 |  b = Non-Toxic

Page 48: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Principal Component Analysis

• Linear transformations
• Creates new dimensions that are linear combinations of the old attributes (dimensions)
• The new dimensions are chosen to maximize the variation of the data
• PC1 and PC2 contain the most variation of the data
• Typically one plots PC1 vs PC2 and shows class labels (however, they might not separate the classes best)
• There will be N components, where N is the minimum of the number of rows and columns of the data
• Closely related to Singular Value Decomposition - essentially finding the eigenvalues of a matrix

Page 49: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Principal Component Analysis

• Plotting one PC vs another PC can be considered "clustering" using a Euclidean distance measure
• One can then view the class labels to see how good the clustering was
• Possibly making it into a "visual classifier"
• It takes some work to know the "important" attributes
• It is also "feature reduction", since one might use only the first few PCs for classification
• One can even "mix" PCs with original attributes

Page 50: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


sarTox-dem4.csv PCA

Using all 158 keys and descriptors does not help.

Page 51: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


PCA-sarTox-dem4

Using only top 20 keys and descriptors clearly helps !

Page 52: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


PCA-sarTox-dem4 – Selected records

Page 53: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Selected Records Statistics for BAct-5

Class      Selected Count  Support    Confidence  Lift     Total Count  Total Density
Non-Toxic    2              0.14728%   1.36986%   0.01602  1161         85.49337%
Toxic      144             10.60383%  98.63014%   6.79897   197         14.50663%

Page 54: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Mixing PCs and original attributes

RadViz - a special "spring" paradigm clustering/classifying visualization.

Page 55: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

R-PCA Matrices

• PC Analysis Introduction
• Using R-project scripts to do standard PC analysis gives two output matrices. One we will call Z (in R it is called x), the projected matrix, and one called V, the rotation matrix. The matrix operation from the original data is:
• Z = X V, where X is the original data, Z is the new principal-component projected space, and V is a matrix containing the coefficients of the original attributes contributing to the new PCs.
• Both Z (the projected matrix) and V (the rotation matrix) can be used in the analysis.
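The Z = X V relation can be checked numerically. prcomp centers the columns first and returns an orthonormal V, so multiplying Z by V-transpose recovers the centered data. A pure-Python sketch using a stand-in rotation (a 30-degree rotation instead of real eigenvectors, since any orthonormal V demonstrates the algebra):

```python
import math

# Toy data matrix X: rows = records, columns = attributes
X = [[1.0, 2.0], [-1.0, 0.5], [0.0, -2.5]]

# prcomp centers each column before projecting
means = [sum(col) / len(X) for col in zip(*X)]
Xc = [[v - m for v, m in zip(row, means)] for row in X]

# Stand-in rotation matrix V (columns are unit-length, mutually orthogonal
# directions); prcomp's V would be the covariance eigenvectors instead
t = math.radians(30)
V = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]

# Z = X V : coordinates of each centered record in the new axes
Z = [[sum(row[k] * V[k][j] for k in range(2)) for j in range(2)] for row in Xc]

# Because V is orthonormal, Z V^T recovers the centered data exactly
back = [[sum(z[k] * V[j][k] for k in range(2)) for j in range(2)] for z in Z]
```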

Page 56: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

PCA R-code

data(iris)
csv <- iris[1:4]
ct <- as.data.frame(csv)
m <- as.matrix(ct)
# now do PCA
h <- prcomp(m)
dpca <- h$x          # projection matrix
drot <- h$rotation
dm <- as.data.frame(dpca)
ex <- as.data.frame( c(csv, dm) )
# Write the new table to a file.
write.table( ex, file = "out1.csv", append = FALSE, quote = FALSE, sep = ",",
             eol = "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE )
dm <- as.data.frame(drot)    # don't merge rotation matrix with old data
ex <- as.data.frame( c(dm) )
# Write the new table to a file.
write.table( ex, file = "runity2out.csv", append = FALSE, quote = FALSE, sep = ",",
             eol = "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE )
#jpeg(filename="Rplot1.jpg", width=800, height=800, pointsize=12, quality=75)
plot(h)    # plots variation for each PC
#jpeg(filename="Rplot2.jpg", width=800, height=800, pointsize=12, quality=75)
px = c(rep(4,50), rep(3,50), rep(2,50))   # the class labels for shape and color
plot(h$x[,1:2], xlab="first pc", ylab="2nd pc", pch=px, col=px)

Page 57: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Iris PC variation

Page 58: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Iris PCA plot from R

Page 59: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Multidimensional Scaling

• Similar to PCA
• One can choose the distance metric between points
• Sometimes used as an umbrella term for many methods (PCA, Sammon, etc.)
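Classical MDS, which R's cmdscale implements, double-centers the squared distance matrix into a Gram matrix B and then eigendecomposes B to get coordinates. A sketch of the double-centering step on three points whose true coordinates are known, so B can be checked by hand:

```python
# Double-centering step of classical MDS (what R's cmdscale does):
# B = -1/2 * J * D^2 * J turns squared distances into a Gram matrix,
# whose eigendecomposition gives the embedding coordinates.
# Toy data: three points on a line at 0, 3, 5, so we can check B by hand.
D = [[0.0, 3.0, 5.0],
     [3.0, 0.0, 2.0],
     [5.0, 2.0, 0.0]]
n = len(D)
D2 = [[d * d for d in row] for row in D]

row_mean = [sum(row) / n for row in D2]
grand_mean = sum(row_mean) / n
B = [[-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
      for j in range(n)] for i in range(n)]

# For these points, B[i][j] must equal xc[i]*xc[j], with xc the original
# coordinates centered at their mean (0, 3, 5 -> mean 8/3)
xc = [v - 8.0 / 3.0 for v in (0.0, 3.0, 5.0)]
```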

Page 60: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Iris - MDS - Scatter Plot

Page 61: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Iris – Sammon Plot

Page 62: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

MDS R code

csv <- read.csv(filename)        # filename and pam1 are set by the caller
N <- ncol(csv)
ct <- as.data.frame(csv)
m <- as.matrix(ct)
if (pam1 == "pearson") {
  d <- pdistC(m)                 # use 1 - pearson
} else {
  d <- dist(m, method = pam1)
}
# now do MDS
h <- cmdscale(d)
x <- h[,1]                       # get x and y coordinates
y <- h[,2]
hdt <- as.data.frame(h)
attr(hdt, "names") <- c(paste("MDSx", pam1, sep=""), "MDSy")
extendedTable <- as.data.frame( c(csv, hdt) )
# Write the new table to a file.
write.table( extendedTable, file = "runity1out.csv", append = FALSE, quote = FALSE,
             sep = ",", eol = "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE )
jpeg(filename="Rplot.jpg", width=800, height=800, pointsize=12, quality=75)
plot(x, y, xlab="", ylab="", main="Multi-dimensional scale")

Page 63: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Sammon Plots (1969)

• Non-linear
• Projects N dimensions down to 2
• Seeks to preserve the n-dimensional distance from every point to every other point (minimize "stress")
• Uses techniques similar to "simulated annealing" - iterative
• Uses random jittering - one could get different pictures
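The "stress" Sammon's iteration minimizes weights each pair's squared error by the inverse of the original distance, so short original distances are preserved most faithfully. A sketch of the stress itself (the two distance lists run over the same i < j pairs):

```python
def sammon_stress(d_high, d_low):
    # Sammon's stress: each pair's squared error is weighted by 1/d_high,
    # so small original distances matter most; d_high and d_low list the
    # pairwise distances over the same (i, j) pairs
    total = sum(d_high)
    return sum((dh - dl) ** 2 / dh for dh, dl in zip(d_high, d_low)) / total

# A projection that preserves every pairwise distance has zero stress
perfect = sammon_stress([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# Distorting one distance raises it
distorted = sammon_stress([1.0, 2.0, 3.0], [1.5, 2.0, 3.0])
```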

Page 64: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

R code for Sammon

require(mva)
require(MASS)
csv <- read.csv(filename)        # filename and pam1 are set by the caller
N <- ncol(csv)
ct <- as.data.frame(csv)
m <- as.matrix(ct)
if (pam1 == "pearson") {
  d <- pdistC(m)                 # use 1 - pearson
} else {
  d <- dist(m, method = pam1)
  maxd <- max(d)
  # if the distance matrix has Inf we have to shift the data
  if (is.infinite(maxd)) {
    # shift data
    m <- ct - min(ct) + 1
    d <- dist(m, method = pam1)
  }
}
d[d == 0] <- .00001              # replace 0 distances
sam <- sammon(d)
# now get the points
samxy <- sam$points
dm <- as.data.frame(samxy)
attr(dm, "names") <- c("SAMx", "SAMy")
extendedTable <- as.data.frame( c(dm) )
#row.names(dout) <- row.names(d)
# The output table contains columns for each clustering.
# Write the new table to a file.
write.table( extendedTable, file = "runity1out.csv", append = FALSE, quote = FALSE,
             sep = ",", eol = "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE )
jpeg(filename="Rplot.jpg", width=800, height=800, pointsize=12, quality=75)
plot(sam$points)

Page 65: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

R code - Pearson distance matrix calculation

pdistC <- function(x) {
  N <- nrow(x <- as.matrix(x))
  dd <- dim(x)
  sz <- dd[1]*dd[2]
  xmean <- sum(abs(x))/sz
  var <- xmean*0.000001
  cat("dd,mean,var = ", dd, xmean, var, "\n")
  # add tiny random jitter so constant rows don't give NA correlations
  pt <- rnorm(sz, 0, var)
  dim(pt) <- c(dd[1], dd[2])
  xx <- x + pt
  a0 <- cor(t(xx), use="pairwise.complete.obs")
  a <- 1 - a0
  d <- as.vector(a[lower.tri(a)])
  attr(d, "Size") <- N
  attr(d, "Labels") <- dimnames(x)[[1]]
  attr(d, "Diag") <- FALSE      # was `diag`, an undefined variable here
  attr(d, "Upper") <- FALSE     # was the string "upper"
  attr(d, "method") <- "pearson"
  attr(d, "call") <- match.call()
  class(d) <- "dist"
  return(d)
}