Biological Data Mining - Cheminformatics 1
Patrick Hoffman

TRANSCRIPT

Page 1: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

04/22/23 1

Biological Data Mining Cheminformatics 1

Patrick Hoffman

Page 2: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Tonight's Topics

• Review Lab (Excel, Weka, R-project, Clementine) - SarToxDem4.zip
• Review
    – Regression?
    – SarPredict classify?
• Flattening - exploding
• Best SarPredict classifier (Naïve Bayes, others in Weka)
• Naïve Bayes - explained
• R code for probability density: pnorm, dnorm
• Association Rules
• Predictive Tox - SarToxDem
• Data - ISIS keys, MolConnZ descriptors
• PCA, MDS, Sammon plots
• Other clustering techniques - comparison??

Page 3: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Lab - Understand & find best classifier

• Download SarTox-Dem2.zip
• Unzip (SarTox-Dem2.csv)
• Load into Excel (modify?)
• Load into Weka (visualize)
• Load into Clementine (output to table)
• Load into R-project:
    filename <- "c:/MLCourse/SarTox-Dem4.csv"
    csv <- read.csv(filename)
    attach(csv)
• Histograms of Act-5/BAct-5 (Excel, R, Clementine - overlays)

Page 4: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Example - SAR Data (SarPredict.csv)

• Structural Activity Relationship
• 960 chemicals (records)
• 26 data fields (variables)
    – 11 Biological Activity measures
    – 11 Chemical descriptors
    – 4 Quality Control variables

Page 5: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Regression vs Classification

• Regression was hard!!!
• On Active vs Inactive - a two-class problem
• An easier problem
• Problems
    – R-groups are text strings (flatten or explode)
    – Unbalanced classes
• Naïve Bayes

Page 6: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes with 11 chemical descriptors

Correctly Classified Instances      923       96.1458 %
Incorrectly Classified Instances     37        3.8542 %
Kappa statistic                       0.0494
K&B Relative Info Score          -68022.5337 %
K&B Information Score              -166.1076 bits   -0.173  bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme           203.4613 bits    0.2119 bits/instance
Complexity improvement (Sf)          27.3938 bits    0.0285 bits/instance
Mean absolute error                   0.0677
Root mean squared error               0.1858
Relative absolute error              87.8712 %
Root relative squared error          95.2927 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0.974    0.961      1       0.98       Inactive
0.026    0        1          0.026   0.051      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 922    0 |  a = Inactive
  37    1 |  b = Active

Page 7: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Flattening? - Exploding?

4 categorical columns to:

Page 8: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


25 Binary columns

Page 9: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes with 36 chemical descriptors

Correctly Classified Instances      912       95      %
Incorrectly Classified Instances     48        5      %
Kappa statistic                       0.2475
K&B Relative Info Score          -72819.8932 %
K&B Information Score              -177.8225 bits   -0.1852 bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme           249.0367 bits    0.2594 bits/instance
Complexity improvement (Sf)         -18.1817 bits   -0.0189 bits/instance
Mean absolute error                   0.067
Root mean squared error               0.1959
Relative absolute error              87.051  %
Root relative squared error         100.4683 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.979    0.763    0.969      0.979   0.974      Inactive
0.237    0.021    0.321      0.237   0.273      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 903   19 |  a = Inactive
  29    9 |  b = Active

Overall accuracy is worse, but the Active class is better.

Page 10: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes - 36 descriptors, no normalization

Correctly Classified Instances      878       91.4583 %
Incorrectly Classified Instances     82        8.5417 %
Kappa statistic                       0.3111
K&B Relative Info Score         -135125.7751 %
K&B Information Score              -329.9703 bits   -0.3437 bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme           382.5589 bits    0.3985 bits/instance
Complexity improvement (Sf)        -151.7039 bits   -0.158  bits/instance
Mean absolute error                   0.0968
Root mean squared error               0.2564
Relative absolute error             125.6851 %
Root relative squared error         131.4682 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.928    0.421    0.982      0.928   0.954      Inactive
0.579    0.072    0.25       0.579   0.349      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 856   66 |  a = Inactive
  16   22 |  b = Active

Page 11: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Voting Feature Intervals classifier

Time taken to build model: 0.05 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances      728       75.8333 %
Incorrectly Classified Instances    232       24.1667 %
Kappa statistic                       0.1168
K&B Relative Info Score         -909872.6115 %
K&B Information Score             -2221.8629 bits   -2.3144 bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme           759.1737 bits    0.7908 bits/instance
Complexity improvement (Sf)        -528.3187 bits   -0.5503 bits/instance
Mean absolute error                   0.357
Root mean squared error               0.4215
Relative absolute error             463.5091 %
Root relative squared error         216.1394 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.762    0.342    0.982      0.762   0.858      Inactive
0.658    0.238    0.102      0.658   0.177      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 703  219 |  a = Inactive
  13   25 |  b = Active

Page 12: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

PART Classifier?

Time taken to build model: 1.48 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances      912       95      %
Incorrectly Classified Instances     48        5      %
Kappa statistic                       0.2475
K&B Relative Info Score          -30559.9665 %
K&B Information Score               -74.6259 bits   -0.0777 bits/instance
Class complexity | order 0          230.8551 bits    0.2405 bits/instance
Class complexity | scheme         18419.7155 bits   19.1872 bits/instance
Complexity improvement (Sf)      -18188.8604 bits  -18.9467 bits/instance
Mean absolute error                   0.0641
Root mean squared error               0.214
Relative absolute error              83.283  %
Root relative squared error         109.7552 %
Total Number of Instances           960

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.979    0.763    0.969      0.979   0.974      Inactive
0.237    0.021    0.321      0.237   0.273      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 903   19 |  a = Inactive
  29    9 |  b = Active

Page 13: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Duplicate Active from 38 to 304

Time taken to build model: 0.06 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances     1029       83.9315 %
Incorrectly Classified Instances    197       16.0685 %
Kappa statistic                       0.5996
K&B Relative Info Score           60187.846  %
K&B Information Score               486.9928 bits    0.3972 bits/instance
Class complexity | order 0          990.6589 bits    0.808  bits/instance
Class complexity | scheme          1151.1811 bits    0.939  bits/instance
Complexity improvement (Sf)        -160.5222 bits   -0.1309 bits/instance
Mean absolute error                   0.1714
Root mean squared error               0.3647
Relative absolute error              45.9283 %
Root relative squared error          84.4536 %
Total Number of Instances          1226

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.856    0.211    0.925      0.856   0.889      Inactive
0.789    0.144    0.643      0.789   0.709      Active

=== Confusion Matrix ===

   a    b   <-- classified as
 789  133 |  a = Inactive
  64  240 |  b = Active

Page 14: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes Classifier

First, what is a Bayes classifier?

Bayes Theorem:    P(Ck|x) = p(x|Ck) P(Ck) / p(x)

Ck = class    x = attribute vector
P = posterior probability    p = unconditional density
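As a concrete check of the theorem, a minimal Python sketch with assumed likelihood and prior values (illustrative numbers only, not from the SAR data):

```python
# Bayes theorem: P(Ck|x) = p(x|Ck) * P(Ck) / p(x)
# Assumed likelihoods p(x|Ck) and priors P(Ck) for a two-class problem
likelihood = {"Active": 0.30, "Inactive": 0.05}
prior      = {"Active": 0.04, "Inactive": 0.96}

# p(x) normalizes the posteriors: sum over classes of p(x|Ck) * P(Ck)
evidence = sum(likelihood[c] * prior[c] for c in prior)

posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
# A Bayes classifier picks the class with the largest posterior
best = max(posterior, key=posterior.get)
```

Note how the 0.96 prior lets Inactive win even though the Active likelihood is six times larger - the same unbalanced-class effect seen in the SAR runs above.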

Page 15: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Bayes Classifier

P(Ck|x) > P(Cj|x)

Simply choose the class having the largest posterior probability given the feature vector x.

This is the same as

    p(x|Ck) P(Ck) > p(x|Cj) P(Cj)

Problem: what are p(x|Ck) and p(x|Cj)?

Page 16: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

If one knew the real density function there would be no problem. Normally, p(x|Ck) is a multivariate joint probability density function.

• 1. If there is enough data, build histograms
• 2. Guess the distribution (Gaussian?)
• 3. Calculate the mean and std. dev.
• 4. Use a parametric method: the mean and std. dev. would be the parameters used to calculate p(x|Ck)
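Point 4 can be sketched with Python's standard library: estimate the Gaussian's two parameters from a hypothetical training sample, then evaluate p(x|Ck) from the fitted density (the R equivalent uses mean, sd, and dnorm):

```python
from statistics import NormalDist, mean, stdev

# Hypothetical training values of one attribute for class Ck (not real SAR data)
sample = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0]

# Parametric method: the mean and std. dev. are the Gaussian's only parameters
mu, sigma = mean(sample), stdev(sample)
density = NormalDist(mu, sigma)

# p(x|Ck) evaluated at a query point x
px = density.pdf(5.0)
```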

Page 17: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

The standard normal or Gaussian density function of a single variable:

    p(x) = 1 / (sigma * sqrt(2*pi)) * exp( -(x - mu)^2 / (2*sigma^2) )

Page 18: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Gaussian density of a multivariate distribution (S is the d x d covariance matrix):

    p(x) = (2*pi)^(-d/2) * |S|^(-1/2) * exp( -(1/2) * (x - mu)' S^-1 (x - mu) )

Page 19: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Problems

• The above gives d(d+3)/2 parameters to estimate for the joint density function
• Time consuming, difficult, and might not be the correct density function
• Many of the dimensions or attributes might be independent
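The d(d+3)/2 figure is just d means plus the d(d+1)/2 distinct entries of a symmetric d x d covariance matrix; a quick check:

```python
def gaussian_param_count(d: int) -> int:
    # d means + d*(d+1)/2 distinct covariance entries = d*(d+3)/2
    return d * (d + 3) // 2

# e.g. for 11 chemical descriptors, or all 26 SarPredict fields
few, many = gaussian_param_count(11), gaussian_param_count(26)
```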

Page 20: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Why not build a d-dimensional histogram for each class? This would approximate the joint density function.

Page 21: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

• Say 10 bins (or values) for each dimension (attribute)
• d=2: 100 bins
• d=3: 1,000 bins; d=4: 10,000; etc.
• All of that multiplied by the number of classes
• Usually not enough data or time
• Not enough data to fill the bins

Curse of dimensionality!!!
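The counts above follow directly from bins_per_dim^d cells per class histogram; a tiny function makes the growth explicit:

```python
def histogram_cells(bins_per_dim: int, d: int, n_classes: int) -> int:
    # one d-dimensional histogram per class
    return n_classes * bins_per_dim ** d

# 10 bins per attribute, as on the slide
growth = [histogram_cells(10, d, 1) for d in (2, 3, 4)]
```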

Page 22: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve or Simple Bayes is the answer.

• Assume all dimensions or attributes are independent!
• Simple probability product rule:
    P(x|Ck) --> P(A1|Ck) * P(A2|Ck) * ... * P(Ad|Ck)
  for d attributes
• One can estimate P(Ai|Ck) as Gaussian, or build a histogram for each attribute in a training set
• 10 dimensions, 10 bins becomes 10^2 bins, not 10^10
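The product rule combined with the Gaussian estimate is exactly what the R pnorm examples on the later slides compute: per attribute, the class-conditional probability of landing in [lo, hi] is cdf(hi) - cdf(lo), and the per-attribute terms are multiplied. A Python sketch with made-up per-class statistics (illustrative, not real data):

```python
from statistics import NormalDist

# Hypothetical per-attribute (mean, sd) for one class Ck -- illustrative only
class_stats = [(5.0, 0.4), (3.4, 0.4), (1.5, 0.2), (0.25, 0.1)]
lo, hi = 0.0, 5.0

# Product rule: P(x in range | Ck) ~= product over attributes Ai of
# P(lo < Ai < hi | Ck), each estimated from a per-attribute Gaussian
p = 1.0
for mu, sigma in class_stats:
    nd = NormalDist(mu, sigma)
    p *= nd.cdf(hi) - nd.cdf(lo)   # pnorm(hi, m, s) - pnorm(lo, m, s) in R
```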

Page 23: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Discretization (binning) is better

• Building the histograms is better if you have enough data
• MLC++ has both Naïve Bayes (assumes Gaussian) and Discrete Naïve Bayes
• Several binning techniques (see Kohavi)
• The entropy-based method is very good (see
  http://www.cs.uml.edu/~fjara/mineset/id3/id3_example/id3example.html)

Page 24: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Histogram for each class - Iris

Page 25: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Simple NB in R - What class is most likely to have all dimensions between 0 and 5?

data(iris)                  # get data
attach(iris)                # make names available
hi = 5                      # set hi limit for possible dimensions
lo = 0                      # set lo limit
a1 = Sepal.Length[1:50]     # a vector of the data for each class
a2 = Sepal.Length[51:100]
a3 = Sepal.Length[101:150]
b1 = Sepal.Width[1:50]
b2 = Sepal.Width[51:100]
b3 = Sepal.Width[101:150]
c1 = Petal.Width[1:50]
c2 = Petal.Width[51:100]
c3 = Petal.Width[101:150]
d1 = Petal.Length[1:50]
d2 = Petal.Length[51:100]
d3 = Petal.Length[101:150]
# gets probability of each dimension of each class being in a certain range
p1setosa     = pnorm(hi, mean(a1), sd(a1)) - pnorm(lo, mean(a1), sd(a1))
p1versicolor = pnorm(hi, mean(a2), sd(a2)) - pnorm(lo, mean(a2), sd(a2))
p1virginica  = pnorm(hi, mean(a3), sd(a3)) - pnorm(lo, mean(a3), sd(a3))
p2setosa     = pnorm(hi, mean(b1), sd(b1)) - pnorm(lo, mean(b1), sd(b1))
p2versicolor = pnorm(hi, mean(b2), sd(b2)) - pnorm(lo, mean(b2), sd(b2))
p2virginica  = pnorm(hi, mean(b3), sd(b3)) - pnorm(lo, mean(b3), sd(b3))
p3setosa     = pnorm(hi, mean(c1), sd(c1)) - pnorm(lo, mean(c1), sd(c1))
p3versicolor = pnorm(hi, mean(c2), sd(c2)) - pnorm(lo, mean(c2), sd(c2))
p3virginica  = pnorm(hi, mean(c3), sd(c3)) - pnorm(lo, mean(c3), sd(c3))
p4setosa     = pnorm(hi, mean(d1), sd(d1)) - pnorm(lo, mean(d1), sd(d1))
p4versicolor = pnorm(hi, mean(d2), sd(d2)) - pnorm(lo, mean(d2), sd(d2))
p4virginica  = pnorm(hi, mean(d3), sd(d3)) - pnorm(lo, mean(d3), sd(d3))
psetosa     = p1setosa * p2setosa * p3setosa * p4setosa
pversicolor = p1versicolor * p2versicolor * p3versicolor * p4versicolor
pvirginica  = p1virginica * p2virginica * p3virginica * p4virginica
psetosa
pversicolor
pvirginica

pnorm calculates the cumulative probability.

Page 26: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Simple NB in R - What class is most likely to have all dimensions between 0 and 5?

data(iris)                  # get data
attach(iris)                # make names available
hi = 5                      # set hi limit for possible dimensions
lo = 0                      # set lo limit
# a vector of the data for each class
a1 = Sepal.Length[1:50]
a2 = Sepal.Length[51:100]
a3 = Sepal.Length[101:150]
b1 = Sepal.Width[1:50]
b2 = Sepal.Width[51:100]
b3 = Sepal.Width[101:150]
c1 = Petal.Width[1:50]
c2 = Petal.Width[51:100]
c3 = Petal.Width[101:150]
d1 = Petal.Length[1:50]
d2 = Petal.Length[51:100]
d3 = Petal.Length[101:150]

pnorm calculates the cumulative probability.

Page 27: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Simple NB in R - What class is most likely to have all dimensions between 0 and 5?

# gets probability of each dimension of each class being in a certain range
p1setosa     = pnorm(hi, mean(a1), sd(a1)) - pnorm(lo, mean(a1), sd(a1))
p1versicolor = pnorm(hi, mean(a2), sd(a2)) - pnorm(lo, mean(a2), sd(a2))
p1virginica  = pnorm(hi, mean(a3), sd(a3)) - pnorm(lo, mean(a3), sd(a3))
p2setosa     = pnorm(hi, mean(b1), sd(b1)) - pnorm(lo, mean(b1), sd(b1))
p2versicolor = pnorm(hi, mean(b2), sd(b2)) - pnorm(lo, mean(b2), sd(b2))
p2virginica  = pnorm(hi, mean(b3), sd(b3)) - pnorm(lo, mean(b3), sd(b3))
p3setosa     = pnorm(hi, mean(c1), sd(c1)) - pnorm(lo, mean(c1), sd(c1))
p3versicolor = pnorm(hi, mean(c2), sd(c2)) - pnorm(lo, mean(c2), sd(c2))
p3virginica  = pnorm(hi, mean(c3), sd(c3)) - pnorm(lo, mean(c3), sd(c3))
p4setosa     = pnorm(hi, mean(d1), sd(d1)) - pnorm(lo, mean(d1), sd(d1))
p4versicolor = pnorm(hi, mean(d2), sd(d2)) - pnorm(lo, mean(d2), sd(d2))
p4virginica  = pnorm(hi, mean(d3), sd(d3)) - pnorm(lo, mean(d3), sd(d3))

psetosa     = p1setosa * p2setosa * p3setosa * p4setosa
pversicolor = p1versicolor * p2versicolor * p3versicolor * p4versicolor
pvirginica  = p1virginica * p2virginica * p3virginica * p4virginica

psetosa
pversicolor
pvirginica

pnorm calculates the cumulative probability.

Page 28: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Better NB in R - What class is most likely to have all dimensions between 0 and 5?

## A better way using loops
csv = iris
N = ncol(csv)            # get number of columns
N = N - 1                # don't do the class column
R = nrow(csv)
stats = matrix(0, N, 3)  # store the probabilities for each class and each dimension
probs = matrix(1, 3, 1)  # final probabilities for each class

# loop for 3 classes
for (lp2 in 1:3) {
  # get mean and sd for each class and each dimension
  # loop for each dimension
  for (lp1 in 1:N) {
    clix1 = (lp2-1)*50 + 1
    clix2 = clix1 + 49
    d1 = csv[clix1:clix2, lp1]   # where each class's data is
    m = mean(d1)
    s = sd(d1)
    stats[lp1, lp2] = pnorm(hi, m, s) - pnorm(lo, m, s)
    probs[lp2] = probs[lp2] * stats[lp1, lp2]
  }
}
stats
probs

pnorm calculates the cumulative probability. (hi and lo are set as on the previous slides.)

Page 29: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

This SAR Example

• Regression failed
• Classification failed
• Any other machine learning tricks?

Page 30: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Association rules

• Look for possible rules that have high confidence and support
• There are many; a good method will let you specify what you are looking for
• These are small little pieces of the dimensional space
• Binning or discretization is usually necessary - best is smart or entropy binning

Example:

S5 > 6.405 and 'R3 fmla' = "CN-" and 'R4 fmla' = "C4H9-"

Page 31: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

A. Rules - so far Clementine GRI is the best (does binning)

• Rules below are only for selection index Active
• Support = percent of instances in the dataset where the antecedents are true
• Confidence = percentage of support instances where the consequent is true
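Both definitions are direct ratios; a toy Python sketch (hypothetical true/false flags, not the SAR data):

```python
# Toy instance list: (antecedents_true, consequent_true) per record
# -- hypothetical flags, not the SAR data
records = [(True, True), (True, True), (True, False), (False, True), (False, False)]

n = len(records)
hits = [r for r in records if r[0]]                     # antecedents true
support = len(hits) / n                                 # % of all instances
confidence = sum(1 for r in hits if r[1]) / len(hits)   # % of support instances
```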

Page 32: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Association Rules - Be Wary

• In the dataset there are only 8 instances where the confidence is 100
• Some of the rules can be redundant
• Looking at the output one might think there are 15 instances
• Generally one wants "good" rules where both support and confidence are high

Page 33: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Example - Predictive Toxicology

• Project Objective
    – Understand the relationship between chemical structure and liver isozyme inhibition.
• Data Overview
    – 100,000 chemicals (records)
    – 280 data fields (variables)
        • 1 biological assay
        • 4 liver isozyme assays
        • 275 chemical descriptors
            – 166 Substructure Search Keys - ISIS/Host
            – 109 Electro-topological State Indicators - MolConnZ

Page 34: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Smaller version - SarTox-Dem4.csv

• 82 keys
• 76 descriptors
• 1550 instances (records, rows)
• 1 id column
• 5 activity measurements
• 5 binned activity measurements
• Analyze/predict the last columns: Act-5, BAct-5
• 1280 non-toxic and 269 toxic

Page 35: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Analysis Stages

• Metadata Overview & Data Cleansing
• Isozyme activity binning
• Classifying & Clustering
• Association Rules
• TTEST/Feature Reduction
• Visualizations

Page 36: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Metadata Overview & Cleansing

• 10 ISIS keys and 5 MolConnZ descriptors had only zero values.
    – In our analyses these fields were eliminated from the dataset, reducing the number of descriptors and keys to 260.
• Many records contained missing values:
    – Biological Assay: ~49,000
    – Isozyme 1: ~50,000    Isozyme 3: ~55,000
    – Isozyme 2: ~50,000    Isozyme 4: ~50,000
• About 24,000 records have all values of the biological activity and the four liver isozymes

Page 37: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Pearson Cross Correlations

[Figure: 260 x 260 descriptor correlation matrix, with cells showing positive correlation, negative correlation, or no correlation.]

Page 38: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

ISIS Keys with PatchGrid™

[Figure: PatchGrid of 150 chemical classes x 166 ISIS keys (key on / key off), showing the ISIS key composition of each chemical class.]

Page 39: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

ISIS Keys with PatchGrid™

[Figure: the same 150 chemical classes x 166 ISIS keys grid (key on / key off), with the ISIS keys clustered to show chemical classes with similar keys.]

Page 40: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Data Binning

[Figure: distributions of Isozymes 1-4 binned from low to high inhibition, and of biological activity binned from low to high activity.]

Page 41: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Association Rules - Visually

Isozyme 1 Inhibition = high if:
    key1 > 0.5  &  key2 > 0.5  &  Descriptor A > 3.6

[Figure: Key 1, Key 2, and Descriptor A visualized with low/medium/high bins.]

Page 42: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Sub-Selection Overview

[Figure: linked views of a sub-selection - high inhibition of the sub-selection, biological activity of the sub-selection, high inhibition of the sub-selection in all classes, and the class with a high % inhibition.]

Page 43: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Narrowing Down

• Identify important dimensions in the sub-selection
• Apply important dimensions from the association rules
• Select a single chemical class with a high % inhibition
• Use a t-test or F-test to reduce keys and descriptors

Page 44: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Classification Via Cost Matrix

Cost of a false negative (toxic classified as nontoxic) = 1:

Cost Matrix   Confusion Matrix                 Class Accuracy        Precision
  0   1          a     b   <-- classified as   96.4% Overall
  1   0       19579   308 | a = nontoxic       98.5% Nontoxic        .978
                439   696 | b = toxic          61.3% Toxic
                                               79.9% Class average

Cost of a false negative = 100:

Cost Matrix   Confusion Matrix                 Class Accuracy        Precision
  0   1          a     b   <-- classified as   86.5% Overall
100   0       17239  2648 | a = nontoxic       86.7% Nontoxic        .989
                183   952 | b = toxic          83.9% Toxic
                                               85.3% Class average

Cost of a false negative = 500:

Cost Matrix   Confusion Matrix                 Class Accuracy        Precision
  0   1          a     b   <-- classified as   70.4% Overall
500   0       13753  6134 | a = nontoxic       69.2% Nontoxic        .993 = 13753/(13753+97)
                 97  1038 | b = toxic          91.5% Toxic
                                               80.3% Class average

Raising the cost of false negatives trades more false positives (nontoxic classified as toxic) for fewer false negatives (toxic classified as nontoxic).

Precision is the percentage of chemicals classified as nontoxic that actually are nontoxic.
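All of the percentages above are simple ratios of the confusion-matrix counts; for the cost-500 run:

```python
# Counts from the cost-500 confusion matrix above
tn, fp = 13753, 6134   # actual nontoxic: classified nontoxic / toxic
fn, tp = 97, 1038      # actual toxic:    classified nontoxic / toxic

overall = (tn + tp) / (tn + fp + fn + tp)       # 70.4 %
nontoxic_accuracy = tn / (tn + fp)              # 69.2 %
toxic_accuracy = tp / (fn + tp)                 # 91.5 %
class_average = (nontoxic_accuracy + toxic_accuracy) / 2
# Precision for "nontoxic": of everything classified nontoxic,
# the fraction that really is nontoxic
nontoxic_precision = tn / (tn + fn)             # .993
```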

Page 45: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Naïve Bayes - all 158 attributes
Classifier: NaiveBayes -x 10 -v -o -i -k -t

Correctly Classified Instances     1245       91.6789 %
Incorrectly Classified Instances    113        8.3211 %
Kappa statistic                       0.7071
K&B Relative Info Score           76254.3843 %
K&B Information Score               457.5652 bits    0.3369 bits/instance
Class complexity | order 0          811.2275 bits    0.5974 bits/instance
Class complexity | scheme          3181.6444 bits    2.3429 bits/instance
Complexity improvement (Sf)       -2370.4169 bits   -1.7455 bits/instance
Mean absolute error                   0.0856
Root mean squared error               0.2781
Relative absolute error              34.4662 %
Root relative squared error          78.9588 %
Total Number of Instances          1358

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.888    0.078    0.658      0.888   0.756      Toxic
0.922    0.112    0.98       0.922   0.95       Non-Toxic

=== Confusion Matrix ===

   a     b   <-- classified as
 175    22 |  a = Toxic
  91  1070 |  b = Non-Toxic

Page 46: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

NB with top 20 attributes (TTest)
Classifier: NaiveBayes -x 10 -v -o -i -k -t

=== Stratified cross-validation ===

Correctly Classified Instances     1274       93.8144 %
Incorrectly Classified Instances     84        6.1856 %
Kappa statistic                       0.7453
K&B Relative Info Score           90035.977  %
K&B Information Score               540.2618 bits    0.3978 bits/instance
Class complexity | order 0          811.2275 bits    0.5974 bits/instance
Class complexity | scheme          1578.972  bits    1.1627 bits/instance
Complexity improvement (Sf)        -767.7445 bits   -0.5653 bits/instance
Mean absolute error                   0.0648
Root mean squared error               0.233
Relative absolute error              26.0612 %
Root relative squared error          66.1472 %
Total Number of Instances          1358

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.761    0.032    0.802      0.761   0.781      Toxic
0.968    0.239    0.96       0.968   0.964      Non-Toxic

=== Confusion Matrix ===

   a     b   <-- classified as
 150    47 |  a = Toxic
  37  1124 |  b = Non-Toxic

Page 47: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Best: Logistic with only 20 attributes
Classifier: Logistic -x 10 -v -o -i -k -t

=== Stratified cross-validation ===

Correctly Classified Instances     1319       97.1281 %
Incorrectly Classified Instances     39        2.8719 %
Kappa statistic                       0.8789
K&B Relative Info Score          107051.552  %
K&B Information Score               642.364  bits    0.473  bits/instance
Class complexity | order 0          811.2275 bits    0.5974 bits/instance
Class complexity | scheme           192.6762 bits    0.1419 bits/instance
Complexity improvement (Sf)         618.5513 bits    0.4555 bits/instance
Mean absolute error                   0.0465
Root mean squared error               0.1547
Relative absolute error              18.7318 %
Root relative squared error          43.9414 %
Total Number of Instances          1358

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.848    0.008    0.949      0.848   0.895      Toxic
0.992    0.152    0.975      0.992   0.983      Non-Toxic

=== Confusion Matrix ===

   a     b   <-- classified as
 167    30 |  a = Toxic
   9  1152 |  b = Non-Toxic

Page 48: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Principal Component Analysis

• Linear transformations
• Creates new dimensions that are linear combinations of the old attributes (dimensions)
• The new dimensions are chosen to maximize the variation of the data
• PC1 and PC2 contain the most variation of the data
• Typically one plots PC1 vs PC2 and shows class labels (however, they might not separate the classes best)
• There will be N components, where N is the minimum of the number of rows and columns of the data
• Closely related to Singular Value Decomposition - essentially finding the eigenvalues of a matrix

Page 49: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Principal Component Analysis

• Plotting one PC vs another PC can be considered "clustering" using a Euclidean distance measure
• One can then view the class labels to see how good the clustering was
• Possibly making it into a "visual classifier"
• It takes some work to know the "important" attributes
• It is also "feature reduction", since one might use only the first few PCs for classification
• One can even "mix" PCs with original attributes

Page 50: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


sarTox-dem4.csv PCA

Using all 158 keys and descriptors does not help.

Page 51: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


PCA-sarTox-dem4

Using only top 20 keys and descriptors clearly helps !

Page 52: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


PCA-sarTox-dem4 – Selected records

Page 53: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Selected Records Statistics for BAct-5

Class      Selected Count  Support    Confidence  Lift     Total Count  Total Density
Non-Toxic    2              0.14728%   1.36986%   0.01602  1161         85.49337%
Toxic      144             10.60383%  98.63014%   6.79897   197         14.50663%

Page 54: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Mixing PCs and original attributes

RadViz - a special "spring" paradigm clustering/classifying visualization.

Page 55: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

R-PCA Matrices

• PC Analysis Introduction
• Using R-project scripts to do standard PC analysis gives two output matrices. One we will call Z (in R it is called x), the projected matrix, and one called V, the rotation matrix. The matrix operation from the original data is:
• Z = X V, where X is the original data, Z is the new principal-component projected space, and V is a matrix containing the coefficients of the original attributes contributing to the new PCs.
• Both Z (the projected matrix) and V (the rotation matrix) can be used in the analysis.
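The Z = X V relation can be checked numerically. prcomp centers the columns first and returns an orthonormal V, so multiplying Z by V-transpose recovers the centered data. A pure-Python sketch using a stand-in rotation (a 30-degree rotation instead of real eigenvectors, since any orthonormal V demonstrates the algebra):

```python
import math

# Toy data matrix X: rows = records, columns = attributes
X = [[1.0, 2.0], [-1.0, 0.5], [0.0, -2.5]]

# prcomp centers each column before projecting
means = [sum(col) / len(X) for col in zip(*X)]
Xc = [[v - m for v, m in zip(row, means)] for row in X]

# Stand-in rotation matrix V (columns are unit-length, mutually orthogonal
# directions); prcomp's V would be the covariance eigenvectors instead
t = math.radians(30)
V = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]

# Z = X V : coordinates of each centered record in the new axes
Z = [[sum(row[k] * V[k][j] for k in range(2)) for j in range(2)] for row in Xc]

# Because V is orthonormal, Z V^T recovers the centered data exactly
back = [[sum(z[k] * V[j][k] for k in range(2)) for j in range(2)] for z in Z]
```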

Page 56: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

PCA R-code

data(iris)
csv <- iris[1:4]
ct <- as.data.frame(csv)
m <- as.matrix(ct)
# now do PCA
h <- prcomp(m)
dpca <- h$x          # projection matrix
drot <- h$rotation
dm <- as.data.frame(dpca)
ex <- as.data.frame( c(csv, dm) )
# Write the new table to a file.
write.table( ex, file = "out1.csv", append = FALSE, quote = FALSE, sep = ",",
             eol = "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE )
dm <- as.data.frame(drot)    # don't merge rotation matrix with old data
ex <- as.data.frame( c(dm) )
# Write the new table to a file.
write.table( ex, file = "runity2out.csv", append = FALSE, quote = FALSE, sep = ",",
             eol = "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE )
#jpeg(filename="Rplot1.jpg", width=800, height=800, pointsize=12, quality=75)
plot(h)    # plots variation for each PC
#jpeg(filename="Rplot2.jpg", width=800, height=800, pointsize=12, quality=75)
px = c(rep(4,50), rep(3,50), rep(2,50))   # the class labels for shape and color
plot(h$x[,1:2], xlab="first pc", ylab="2nd pc", pch=px, col=px)

Page 57: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Iris PC variation

Page 58: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Iris PCA plot from R

Page 59: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Multidimensional Scaling

• Similar to PCA
• One can choose the distance metric between points
• Sometimes used as an umbrella term for many methods (PCA, Sammon, etc.)
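Classical MDS, which R's cmdscale implements, double-centers the squared distance matrix into a Gram matrix B and then eigendecomposes B to get coordinates. A sketch of the double-centering step on three points whose true coordinates are known, so B can be checked by hand:

```python
# Double-centering step of classical MDS (what R's cmdscale does):
# B = -1/2 * J * D^2 * J turns squared distances into a Gram matrix,
# whose eigendecomposition gives the embedding coordinates.
# Toy data: three points on a line at 0, 3, 5, so we can check B by hand.
D = [[0.0, 3.0, 5.0],
     [3.0, 0.0, 2.0],
     [5.0, 2.0, 0.0]]
n = len(D)
D2 = [[d * d for d in row] for row in D]

row_mean = [sum(row) / n for row in D2]
grand_mean = sum(row_mean) / n
B = [[-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
      for j in range(n)] for i in range(n)]

# For these points, B[i][j] must equal xc[i]*xc[j], with xc the original
# coordinates centered at their mean (0, 3, 5 -> mean 8/3)
xc = [v - 8.0 / 3.0 for v in (0.0, 3.0, 5.0)]
```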

Page 60: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Iris - MDS - Scatter Plot

Page 61: Biological Data Mining  Cheminformatics 1 Patrick Hoffman


Iris – Sammon Plot

Page 62: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

MDS R code

csv <- read.csv(filename)        # filename and pam1 are set by the caller
N <- ncol(csv)
ct <- as.data.frame(csv)
m <- as.matrix(ct)
if (pam1 == "pearson") {
  d <- pdistC(m)                 # use 1 - pearson
} else {
  d <- dist(m, method = pam1)
}
# now do MDS
h <- cmdscale(d)
x <- h[,1]                       # get x and y coordinates
y <- h[,2]
hdt <- as.data.frame(h)
attr(hdt, "names") <- c(paste("MDSx", pam1, sep=""), "MDSy")
extendedTable <- as.data.frame( c(csv, hdt) )
# Write the new table to a file.
write.table( extendedTable, file = "runity1out.csv", append = FALSE, quote = FALSE,
             sep = ",", eol = "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE )
jpeg(filename="Rplot.jpg", width=800, height=800, pointsize=12, quality=75)
plot(x, y, xlab="", ylab="", main="Multi-dimensional scale")

Page 63: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

Sammon Plots (1969)

• Non-linear
• Projects N dimensions down to 2
• Seeks to preserve the n-dimensional distance from every point to every other point (minimize "stress")
• Uses techniques similar to "simulated annealing" - iterative
• Uses random jittering - one could get different pictures
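The "stress" Sammon's iteration minimizes weights each pair's squared error by the inverse of the original distance, so short original distances are preserved most faithfully. A sketch of the stress itself (the two distance lists run over the same i < j pairs):

```python
def sammon_stress(d_high, d_low):
    # Sammon's stress: each pair's squared error is weighted by 1/d_high,
    # so small original distances matter most; d_high and d_low list the
    # pairwise distances over the same (i, j) pairs
    total = sum(d_high)
    return sum((dh - dl) ** 2 / dh for dh, dl in zip(d_high, d_low)) / total

# A projection that preserves every pairwise distance has zero stress
perfect = sammon_stress([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# Distorting one distance raises it
distorted = sammon_stress([1.0, 2.0, 3.0], [1.5, 2.0, 3.0])
```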

Page 64: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

R code for Sammon

require(mva)
require(MASS)
csv <- read.csv(filename)        # filename and pam1 are set by the caller
N <- ncol(csv)
ct <- as.data.frame(csv)
m <- as.matrix(ct)
if (pam1 == "pearson") {
  d <- pdistC(m)                 # use 1 - pearson
} else {
  d <- dist(m, method = pam1)
  maxd <- max(d)
  # if the distance matrix has Inf we have to shift the data
  if (is.infinite(maxd)) {
    # shift data
    m <- ct - min(ct) + 1
    d <- dist(m, method = pam1)
  }
}
d[d == 0] <- .00001              # replace 0 distances
sam <- sammon(d)
# now get the points
samxy <- sam$points
dm <- as.data.frame(samxy)
attr(dm, "names") <- c("SAMx", "SAMy")
extendedTable <- as.data.frame( c(dm) )
#row.names(dout) <- row.names(d)
# The output table contains columns for each clustering.
# Write the new table to a file.
write.table( extendedTable, file = "runity1out.csv", append = FALSE, quote = FALSE,
             sep = ",", eol = "\n", na = "", dec = ".", row.names = FALSE, col.names = TRUE )
jpeg(filename="Rplot.jpg", width=800, height=800, pointsize=12, quality=75)
plot(sam$points)

Page 65: Biological Data Mining  Cheminformatics 1 Patrick Hoffman

R code - Pearson distance matrix calculation

pdistC <- function(x) {
  N <- nrow(x <- as.matrix(x))
  dd <- dim(x)
  sz <- dd[1]*dd[2]
  xmean <- sum(abs(x))/sz
  var <- xmean*0.000001
  cat("dd,mean,var = ", dd, xmean, var, "\n")
  # add tiny random jitter so constant rows don't give NA correlations
  pt <- rnorm(sz, 0, var)
  dim(pt) <- c(dd[1], dd[2])
  xx <- x + pt
  a0 <- cor(t(xx), use="pairwise.complete.obs")
  a <- 1 - a0
  d <- as.vector(a[lower.tri(a)])
  attr(d, "Size") <- N
  attr(d, "Labels") <- dimnames(x)[[1]]
  attr(d, "Diag") <- FALSE      # was `diag`, an undefined variable here
  attr(d, "Upper") <- FALSE     # was the string "upper"
  attr(d, "method") <- "pearson"
  attr(d, "call") <- match.call()
  class(d) <- "dist"
  return(d)
}