
International Journal on Computational Sciences & Applications (IJCSA) Vol. 3, No. 4, August 2013

DOI:10.5121/ijcsa.2013.3406

A ROBUST MISSING VALUE IMPUTATION METHOD MIFOIMPUTE FOR INCOMPLETE MOLECULAR DESCRIPTOR DATA AND COMPARATIVE ANALYSIS WITH OTHER MISSING VALUE IMPUTATION METHODS

Doreswamy1 and Chanabasayya M. Vastrad2

1Department of Computer Science, Mangalore University, Mangalagangotri-574199, Karnataka, India
doreswamyh@yahoo.com

2Department of Computer Science, Mangalore University, Mangalagangotri-574199, Karnataka, India
channu.vastrad@gmail.com

ABSTRACT

Missing data imputation is an important research topic in data mining. Large-scale molecular descriptor data may contain missing values (MVs). However, some methods for downstream analyses, including some prediction tools, require a complete descriptor data matrix. We propose and evaluate an iterative imputation method, MiFoImpute, based on a random forest. By averaging over many unpruned regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the NRMSE and NMAE estimates of random forest, we are able to estimate the imputation error. Evaluation is performed on two molecular descriptor datasets generated from a diverse selection of pharmaceutical fields, with artificially introduced missing values ranging from 10% to 30%. The experimental results demonstrate that missing values have a great impact on the effectiveness of imputation techniques and that our method, MiFoImpute, is more robust to missing values than the ten other imputation methods used as benchmarks. Additionally, MiFoImpute exhibits attractive computational efficiency and can cope with high-dimensional data.

KEYWORDS

Random Forest, normalized root mean squared error, normalized mean absolute error, missing values

1. INTRODUCTION

The nature of molecular descriptor data complicates the development of highly accurate predictive models. Molecular descriptor data are typically gathered inconsistently, leaving empty or unanswered entries in the datasets. These entries are called missing values (or missing data), and they are a problem most researchers face. Missing data may arise for various reasons, for instance by accident or because a molecular descriptor generator fails to produce descriptor data. Since many downstream analyses require a complete dataset, imputation of missing values (MVs) is a common and often necessary step in data preprocessing.



Many established analysis procedures require fully observed molecular descriptor datasets without any missing values. However, this is rarely the case in pharmaceutical and biological research today. The continuous development of new and improved measurement techniques in these fields confronts data analysts with high-dimensional multivariate descriptor data, in which continuous descriptors are present and the number of descriptors may greatly exceed the number of observations.

Many MV imputation methods have been developed in the literature. They generally belong to two categories. In the first category, the information for a missing entry is borrowed from neighbouring descriptors whose closeness is determined by a distance measure (e.g., correlation or Euclidean distance). These methods include k nearest neighbours [1], generalized boosted models [2], Locally Weighted Linear Imputation [3], Mean Imputation [4], SVD Imputation [5], SVT Imputation [6] and Approximate SVT Imputation [7]. All these methods rely on the fact that molecular descriptors do not function individually but are usually highly correlated with co-regulated descriptors. In the second category, dimension reduction techniques are applied to decompose the data matrix and iteratively reconstruct the missing entries. These methods include Bayesian Principal Component Analysis (BPCA) [8], Probabilistic PCA (PPCA) [9] and LLSimpute [10]. The above imputation methods are restricted to one type of variable. Furthermore, all of them make assumptions about the distribution of the data or of subsets of the variables (e.g., assuming normal distributions), which can be questionable in practice.
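To make the first category concrete, the following is a minimal Python sketch of neighbour-based imputation. It is an illustration under stated assumptions rather than the implementation of any of the cited methods: missing entries are encoded as `None`, neighbours are rows of the matrix, closeness is a Euclidean distance averaged over the columns observed in both rows, and the function name `knn_impute` is hypothetical.

```python
import math

def knn_impute(data, k=2):
    """Fill None entries with the mean of the k nearest rows that observe that column."""
    def distance(a, b):
        # Euclidean distance over columns observed in both rows, averaged so that
        # row pairs sharing different numbers of columns remain comparable.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows in which column j is observed.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:  # leave the entry as None if no usable neighbour exists
                result[i][j] = sum(nearest) / len(nearest)
    return result

example = [
    [1.0, 1.0, 1.0],
    [1.1, 1.0, None],
    [5.0, 5.0, 5.0],
    [0.9, 1.0, 1.2],
]
filled = knn_impute(example, k=2)
```

In this toy matrix the missing entry in the second row is filled from the two rows closest to it, while the distant third row is ignored.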

Our motivation is to introduce an imputation method that can handle any type of molecular descriptor data and makes as few assumptions as possible about structural aspects of the data. Random forest (RF) [11] is able to deal with real-valued data and, as a non-parametric method, allows for interactive and non-linear (regression) effects. We address the missing data problem with an iterative imputation scheme: an RF is first trained on the observed values, then used to predict the missing values, and the procedure is repeated iteratively. We choose RF because it is known to perform very well under difficult conditions such as high dimensions, complex interactions and non-linear data structures. Due to its accuracy and robustness, RF is well suited for applied research, which often involves such conditions.

Here we compare our method with the two categories of imputation methods mentioned in the paragraph above, applied to molecular descriptor datasets. Comparisons are performed on two molecular descriptor datasets generated with the Padel-Descriptor generator [12], using different proportions of missing values. Missing values are indicated by NAs in R [18]. We show that our approach is competitive with or outperforms the compared methods on the datasets used, irrespective of the variable type composition, the data dimensionality, the source of the data or the amount of missing values.

In some cases, the decrease in imputation error is up to 50%. This performance is typically reached within only a few iterations, which also makes our method computationally attractive. The NRMSE and NMAE error estimates give a very good approximation of the true imputation error, with an average proportional deviation of no more than 10-15%. In addition, our method needs no tuning parameter and is hence easy to use.



2. MATERIALS AND METHODS

2.1 The Data Sets

We investigate the following two molecular descriptor datasets. The first consists of the molecular descriptors of Oxazolines and Oxazoles derivatives [15-16], which are H37Rv inhibitors. The dataset covers a diverse set of molecular descriptors with a wide range of inhibitory activities against H37Rv and includes 100 observations with 254 descriptors. The second consists of the molecular descriptors of Thiolactomycin and related analogues [13], which are also H37Rv inhibitors. This dataset likewise covers a diverse set of molecular descriptors with a wide range of inhibitory activities against H37Rv and includes 200 observations with 255 descriptors.

2.2 Algorithmic Approach

Let us assume $X = (X_1, X_2, \ldots, X_p)$ to be an $n \times p$-dimensional descriptor dataset matrix. We propose a method that uses a Random Forest (RF) to impute the missing values, owing to its previously mentioned advantages as a regression method. The RF algorithm has a built-in routine for handling missing values: it weights the frequency of the observed values in a variable with the RF proximities after being trained on the initially mean-imputed descriptor dataset [14]. However, this approach requires a complete response variable for training the forest.

Instead, we directly predict the missing values using an RF trained on the observed parts of the descriptor dataset. For an arbitrary descriptor $X_s$ containing missing values at entries $i_{mis}^{(s)} \subseteq \{1, \ldots, n\}$, we can separate the dataset into four parts:

(a) The observed values of descriptor $X_s$, indicated by $y_{obs}^{(s)}$;

(b) The missing values of descriptor $X_s$, indicated by $y_{mis}^{(s)}$;

(c) The descriptors other than $X_s$ with observations $i_{obs}^{(s)} = \{1, \ldots, n\} \setminus i_{mis}^{(s)}$, indicated by $x_{obs}^{(s)}$; and

(d) The descriptors other than $X_s$ with observations $i_{mis}^{(s)}$, indicated by $x_{mis}^{(s)}$.

Note that $x_{obs}^{(s)}$ is commonly not completely observed, since the index $i_{obs}^{(s)}$ corresponds to the observed values of the descriptor $X_s$. Likewise, $x_{mis}^{(s)}$ is commonly not completely missing.

To start, build an initial guess for the missing values in $X$ using mean imputation or another imputation method. Then, sort the descriptors $X_s$, $s = 1, \ldots, p$, according to the amount of missing values, beginning with the lowest amount. For every descriptor $X_s$, the missing values are imputed by first fitting an RF with response $y_{obs}^{(s)}$ and predictors $x_{obs}^{(s)}$ and then predicting the missing values $y_{mis}^{(s)}$ by applying the trained RF to $x_{mis}^{(s)}$. The imputation procedure is repeated until a termination criterion is met. The algorithmic approach of the missing forest imputation (MiFoImpute) method is given below.



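Algorithm: Impute missing values with RF

INPUT: $X$, an $n \times p$ descriptor dataset matrix; termination criterion $\gamma$

Build initial guess for missing values;

$k \leftarrow$ vector of sorted indices of columns in $X$ with respect to increasing amount of missing values;

while not $\gamma$ do

    $X_{old}^{imp} \leftarrow$ store previously imputed matrix;

    for $s$ in $k$ do

        Fit a random forest: $y_{obs}^{(s)} \sim x_{obs}^{(s)}$;

        Predict $y_{mis}^{(s)}$ using $x_{mis}^{(s)}$;

        $X_{new}^{imp} \leftarrow$ update imputed matrix, using predicted $y_{mis}^{(s)}$;

    end for

    update $\gamma$;

end while

return the imputed matrix $X^{imp}$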

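The termination criterion $\gamma$ is met as soon as the difference between the newly imputed data matrix and the previous one increases for the first time with respect to both variable types, if present. Here, the difference for the set of continuous descriptor variables $N$ is defined as

$$\Delta_N = \frac{\sum_{j \in N} \left(X_{new}^{imp} - X_{old}^{imp}\right)^2}{\sum_{j \in N} \left(X_{new}^{imp}\right)^2}$$

where $X_{new}^{imp}$ and $X_{old}^{imp}$ denote the newly imputed and the previously imputed data matrices, respectively.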
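The loop and its stopping rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: missing entries are encoded as `None`, every column is assumed to have at least one observed value, the stopping difference is taken to be the squared change between successive imputed matrices normalized by the squared new matrix, and the names `iterative_impute`, `relative_change` and `knn_fit_predict` are hypothetical. To keep the sketch dependency-free, a small k-nearest-neighbour regressor stands in for the random forest; any regressor with the same call shape (for example a wrapper around a random forest implementation) could be passed as `fit_predict`.

```python
from statistics import mean

def knn_fit_predict(x_train, y_train, x_test, k=3):
    """Stand-in regressor (NOT a random forest): average the k nearest targets."""
    preds = []
    for row in x_test:
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, tr)), y)
            for tr, y in zip(x_train, y_train)
        )
        preds.append(mean(y for _, y in dists[:k]))
    return preds

def relative_change(new, old):
    """Squared change between two imputed matrices, normalized by the new one."""
    num = sum((a - b) ** 2 for rn, ro in zip(new, old) for a, b in zip(rn, ro))
    den = sum(a ** 2 for rn in new for a in rn)
    return num / den if den else 0.0

def iterative_impute(data, fit_predict=knn_fit_predict, max_iter=10):
    """Impute None entries column by column until the change between passes grows."""
    n, p = len(data), len(data[0])
    missing = [[data[i][j] is None for j in range(p)] for i in range(n)]
    # Initial guess: column means of the observed values.
    col_means = [mean(data[i][j] for i in range(n) if not missing[i][j]) for j in range(p)]
    imp = [[col_means[j] if missing[i][j] else data[i][j] for j in range(p)] for i in range(n)]
    # Visit columns in order of increasing number of missing values.
    order = sorted(range(p), key=lambda j: sum(missing[i][j] for i in range(n)))
    prev_delta = float("inf")
    for _ in range(max_iter):
        old = [row[:] for row in imp]
        for s in order:
            mis_rows = [i for i in range(n) if missing[i][s]]
            if not mis_rows:
                continue
            obs_rows = [i for i in range(n) if not missing[i][s]]
            others = [j for j in range(p) if j != s]
            x_obs = [[imp[i][j] for j in others] for i in obs_rows]
            y_obs = [imp[i][s] for i in obs_rows]
            x_mis = [[imp[i][j] for j in others] for i in mis_rows]
            for i, y_hat in zip(mis_rows, fit_predict(x_obs, y_obs, x_mis)):
                imp[i][s] = y_hat
        delta = relative_change(imp, old)
        if delta > prev_delta:
            return old  # the difference increased: keep the previous imputation
        prev_delta = delta
    return imp

toy = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0], [5.0, 10.0]]
completed = iterative_impute(toy)
```

Returning the previous matrix once the difference first increases, and capping the number of passes with `max_iter`, are choices made for this sketch.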

2.3 Performance Measure

After imputing the missing values, the performance is evaluated using the normalized root mean squared error (NRMSE) [14] for the descriptor variables, which is defined by

$$\mathrm{NRMSE} = \sqrt{\frac{\mathrm{mean}\left(\left(X^{true} - X^{imp}\right)^2\right)}{\mathrm{var}\left(X^{true}\right)}}$$

where $X^{true}$ is the complete descriptor data matrix and $X^{imp}$ the imputed descriptor data matrix. We use mean and var as short notation for the empirical mean and variance computed over the missing values only. Good performance leads to a value close to 0 and bad performance to a value around 1. To evaluate the precision of imputation, the normalized mean absolute error (NMAE) [17] is used, and its value at variable $v$ is calculated as follows:

$$\mathrm{NMAE}_v = \frac{1}{n_v} \sum_{i=1}^{n_v} \frac{\left| t_i - \hat{t}_i \right|}{\max_v - \min_v}$$

where $n_v$ is the number of missing values at $v$, $t_i$ and $\hat{t}_i$ denote the true value and the imputed value of the $i$-th missing entry respectively, and $\max_v$ and $\min_v$ are the maximum and minimum values at $v$. The value of NMAE on the whole dataset is the average over all the descriptor variables.
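Both error measures are straightforward to compute once the true and imputed values at the artificially removed positions are available. The sketch below assumes that NRMSE is the root of the mean squared error divided by the variance of the true values, and that the per-variable NMAE is the mean absolute error divided by the variable's range; the function names and the use of the sample variance (as R's `var` does) are assumptions of this illustration.

```python
from math import sqrt
from statistics import mean, variance

def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over the missing positions only."""
    sq_err = [(t - x) ** 2 for t, x in zip(true_vals, imputed_vals)]
    # Sample variance of the true values at the missing positions (assumption).
    return sqrt(mean(sq_err) / variance(true_vals))

def nmae_column(true_vals, imputed_vals, col_max, col_min):
    """Normalized MAE for one descriptor, scaled by that descriptor's range."""
    abs_err = [abs(t - x) for t, x in zip(true_vals, imputed_vals)]
    return mean(abs_err) / (col_max - col_min)

def nmae(columns):
    """Average the per-descriptor NMAE over all descriptors with missing values."""
    return mean(nmae_column(*col) for col in columns)

true_vals = [1.0, 2.0, 3.0, 4.0]
imputed_vals = [1.0, 2.0, 3.0, 6.0]
err_nrmse = nrmse(true_vals, imputed_vals)
err_nmae = nmae([(true_vals, imputed_vals, 4.0, 1.0)])
```

A perfect imputation gives 0 for both measures, which matches the statement that good performance leads to values close to 0.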

3. EXPERIMENTAL RESULTS AND ANALYSIS

3.1 Generation of missing values in the dataset

Given the two molecular descriptor datasets, a pattern of missing entries (NA) is produced randomly on a matrix of the size of the dataset with a pre-specified proportion of missing values. The proportion of missing entries may vary. We use the uniform random distribution to generate the missing positions, with proportions of 10%, 20% and 30% of the total number of entries.
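As an illustration of this step, the following Python sketch removes a fixed proportion of entries at positions drawn uniformly at random without replacement. The function name `make_missing`, the `None` encoding of NA and the fixed seed are assumptions of this sketch; note that on very small matrices such a procedure can leave a whole column empty.

```python
import random

def make_missing(data, proportion, seed=0):
    """Return a copy of data with a given proportion of entries set to None."""
    n, p = len(data), len(data[0])
    rng = random.Random(seed)
    n_missing = round(proportion * n * p)
    # Draw distinct flat positions uniformly at random over all n * p entries.
    positions = rng.sample(range(n * p), n_missing)
    masked = [row[:] for row in data]
    for pos in positions:
        i, j = divmod(pos, p)
        masked[i][j] = None
    return masked, sorted(positions)

complete = [[float(i * 5 + j) for j in range(5)] for i in range(10)]
masked, positions = make_missing(complete, 0.2)
```

Keeping the list of removed positions makes it possible to compare the imputed values with the original ones afterwards.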

3.2 Evaluation of results

Since the missing values are generated artificially, we can evaluate the quality of imputation by comparing the imputed values with the original values at the positions where missing entries (NA) were produced. We use the NRMSE and NMAE to measure the performance of an algorithm. The imputation results for MiFoImpute on the Oxazolines and Oxazoles derivatives and Thiolactomycin derivatives molecular descriptor data are discussed below. Figure 1 and Figure 2 present a comparative study of the performance (NRMSE) of the various imputation methods on the two descriptor datasets. Each curve in the figures represents the results from one imputation method.



Figure 1. Performance of the eleven imputation methods on Oxazolines and Oxazoles derivatives descriptor data. The percentage of entries missing in the complete dataset and the NRMSE of each missing value estimation method are shown on the horizontal and vertical axes, respectively.

Figure 1 shows that, among all imputation methods, the MiFoImpute method gives comparable NRMSE values. When the percentage of missing values in the dataset is 20%, MiFoImpute achieves the best results. When the percentage of missing values reaches 30%, the NRMSE of MiFoImpute remains small and is better than that of the bpca, SVDImpute and ppca methods. This shows that MiFoImpute is comparable with, if not better than, the previous methods on this dataset. The KNNImpute method performs worst of all eleven imputation methods.



Figure 2. Performance of the eleven imputation methods on Thiolactomycin derivatives descriptor data. The percentage of entries missing in the complete dataset and the NRMSE of each missing value estimation method are shown on the horizontal and vertical axes, respectively.

From Figure 2, we see that MiFoImpute starts to outperform the other imputation methods as the missing rate increases, especially on the Thiolactomycin descriptor dataset. In general, MiFoImpute performs stably across the missing rates. For example, all the other methods give NRMSE values between 0.1442081 and 2.538303 for 15% missing, whereas MiFoImpute gives 0.07245663. Consequently, MiFoImpute performs robustly as the percentage of missing values increases. MiFoImpute achieves better results than bpca and ppca. The llsImpute and KNNImpute methods perform worst of all eleven imputation methods.

Figure 3 and Figure 4 use the normalized mean absolute error (NMAE) to compare the various imputation methods on the two descriptor datasets. Each curve in the figures represents the results from one imputation method.

Figure 3. Performance of the eleven imputation methods on Oxazolines and Oxazoles derivatives descriptor data. The percentage of entries missing in the complete dataset and the NMAE of each missing value estimation method are shown on the horizontal and vertical axes, respectively.

First, different missing levels have different impacts on imputation accuracy. Generally speaking, the NMAE increases with the level of missing values for all the methods. This is understandable because, as more missing values are introduced into the datasets, the imputation results are more negatively affected. Comparing the results across the three missing percentages, the effect is most pronounced when the missing percentage is relatively high: introducing more missing values then deteriorates the imputation results dramatically. Take Figure 3 as an example: when the missing percentage increases from 10% to 20%, the error of MiFoImpute increases slightly from 0.00551459 to 0.01150088, but when the missing percentage reaches 30%, the error rises to 0.01913564. For the Oxazolines and Oxazoles derivatives descriptor data, when comparing different methods, MiFoImpute achieves better accuracy than the other three methods, and their accuracy difference is indiscernible. bpca and ppca appear to be the second-best methods. The lmImpute and knnImpute methods are the worst of all eleven imputation methods.

Figure 4. Performance of the eleven imputation methods on Thiolactomycin derivatives descriptor data. The percentage of entries missing in the complete dataset and the NMAE of each missing value estimation method are shown on the horizontal and vertical axes, respectively.

In Figure 4, when the missing percentage increases from 10% to 20%, the error (NMAE) of MiFoImpute increases slightly from 0.003833351 to 0.008735986, but when the missing percentage reaches 30%, the error rises to 0.01336493. For the Thiolactomycin derivatives descriptor data, when comparing different methods, MiFoImpute achieves better accuracy than the other three methods, and their accuracy difference is indiscernible. bpca and ppca appear to be the second-best methods. The lmImpute, gbmImpute and knnImpute methods are the worst of all eleven imputation methods.

3.3 Computational Efficiency

To determine the computational cost of MiFoImpute, we compare the runtimes of imputation on the previous two descriptor datasets.

Table 1 shows the runtimes in seconds of all methods on the analysed descriptor datasets.



We can see that SVTApproxImpute is by far the fastest method for the first dataset, and meanImpute is the fastest method for the second dataset. However, MiFoImpute runs considerably faster than gbmImpute and llsImpute for both datasets. There are two possible ways to speed up computation. The first is to reduce the number of trees grown in each forest. In all comparative studies, the number of trees was set to 100, which offers high precision but increased runtime.

Table 2. Average imputation error (NRMSE/NMAE in percent) and runtime (in seconds) with different numbers of trees ($n_{tree}$) grown in each forest and descriptors tried ($m_{try}$) at each node of the trees. Here, we consider the Oxazolines and Oxazoles derivatives descriptor dataset with artificially introduced 10% of missing values.

In Table 2 and Table 3, we can see that changing the number of trees in the forest has a stagnating influence on imputation error but a strong influence on computation time, which is approximately linear in the number of trees. The second way is to reduce the number of descriptors randomly selected at each node to set up the split.



Table 3. Average imputation error (NRMSE/NMAE in percent) and runtime (in seconds) with different numbers of trees ($n_{tree}$) grown in each forest and descriptors tried ($m_{try}$) at each node of the trees. Here, we consider the Thiolactomycin derivatives descriptor dataset with artificially introduced 10% of missing values.

Table 2 and Table 3 show that increasing $m_{try}$, the number of descriptors tried at each node, has a limited effect on imputation error, but computation time increases strongly. Note that for $m_{try} = 1$ we no longer have a Random Forest, since there is no longer any choice between descriptors to split on. This leads to a much higher imputation error, especially for the cases with low numbers of bootstrapped trees. We use the same default value of $m_{try}$ for all experiments.

4. CONCLUSIONS

We presented the MiFoImpute algorithm as an alternative and practical approach to the data imputation problem. It can handle multivariate data consisting of molecular descriptors. MiFoImpute has no need for tuning parameters, nor does it require assumptions about distributional aspects of the data. Different missing value imputation methods were examined, and each method was tested on two molecular descriptor datasets. By observing the behaviour of the different imputation methods at different missing levels, we conclude that missing values have a strong negative effect on imputation methods, especially when the missing level is high. NRMSE and NMAE were used to measure the performance of the algorithms. Comparative studies have shown that MiFoImpute performs quite well in comparison with ten other popular imputation methods in the presence of missing values. The goal of the MiFoImpute method is to provide an accurate way of estimating missing values in order to minimally bias the performance of imputation methods. MiFoImpute can be applied to high-dimensional datasets where the number of variables may greatly exceed the number of observations, and it still provides excellent imputation results.

ACKNOWLEDGEMENTS

We gratefully thank the Department of Computer Science, Mangalore University, Mangalore, India for technical support of this research.

REFERENCES

[1] Troyanskaya, O. et al. (2001) Missing value estimation methods for DNA microarrays, Bioinformatics, 17, 520-525.

[2] McCaffrey D, Ridgeway G, Morral A. (2004) Propensity score estimation with boosted regression for evaluating causal effects in observational studies, Psychol. Methods, 9:403-425.

[3] Pier Luigi Conti, Daniela Marella, Mauro Scanu (2008), Evaluation of matching noise for imputation techniques based on nonparametric local linear regression estimators, Computational Statistics and Data Analysis, doi:10.1016/j.csda.2008.07.041

[4] Soren Feodor Nielsen, Nonparametric Conditional Mean Imputation, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.7549

[5] Matthew Brand (2002), Incremental Singular value decomposition of uncertain data with missing values, A Mitsubishi Electric Research Laboratory, www.bradblock.com/Incremental_singular_value_decomposition_of_uncertain_data_with_missing_values.pdf

[6] Jian-Feng Cai, Emmanuel J. Candès, Zuowei Shen (2008) A Singular Value Thresholding Algorithm for Matrix Completion, SIAM Journal on Optimization

[7] Rahul Mazumder, Trevor Hastie, Robert Tibshirani (2010), Spectral Regularization Algorithms for Learning Large Incomplete Matrices, Journal of Machine Learning Research 11, 2287-2322

[8] Shigeyuki Oba, Masa-aki Sato, Ichiro Takemasa, Morito Monden, Ken-ichi Matsubara and Shin Ishii (2003) A Bayesian missing value estimation method for gene expression profile data, Bioinformatics, 19(16):2088-2096.

[9] Alexander Ilin, Tapani Raiko (2010) Practical Approaches to Principal Component Analysis in the Presence of Missing Values, Journal of Machine Learning Research 11, 1957-2000

[10] Kim H (2005) Missing value estimation for DNA microarray gene expression data: local least squares imputation, Bioinformatics, 21:187-198.

[11] Breiman, L. (2001) Random forests, Mach. Learn., 45, 5-32.

[12] Padel-Descriptor, http://padel.nus.edu.sg/software/padeldescriptor/

[13] Laurent Kremer, James D. Douglas, Alain R. Baulard (2000), Thiolactomycin and Related Analogues as Novel Anti-mycobacterial Agents Targeting KasA and KasB Condensing Enzymes in Mycobacterium tuberculosis, J Biol Chem. 2;275(22):16857-64.



[14] Oba, S. et al. (2003) A Bayesian missing value estimation method for gene expression profile data, Bioinformatics, 19, 2088-2096.

[15] Andrew J. Phillips, Yoshikazu Uto, Peter Wipf, Michael J. Reno, and David R. Williams (2000) Synthesis of Functionalized Oxazolines and Oxazoles with DAST and Deoxo-Fluor, Organic Letters, Vol. 2, No. 8, 1165-1168

[16] Moraski GC, Chang M, Villegas-Estrada A, Franzblau SG, Möllmann U, Miller MJ. (2010) Structure-activity relationship of new anti-tuberculosis agents derived from oxazoline and oxazole benzyl esters, Eur J Med Chem. 2010 May;45(5):1703-16. doi:10.1016/j.ejmech.2009.12.074.

[17] Bing Zhu, Changzheng He, Panos Liatsis (2010), A robust missing value imputation method for noisy data, Appl Intell, DOI 10.1007/s10489-010-0244-1

[18] The R Project for Statistical Computing, http://www.r-project.org

Authors

Doreswamy received the B.Sc degree in Computer Science and the M.Sc degree in Computer Science from the University of Mysore in 1993 and 1995 respectively, and the Ph.D degree in Computer Science from Mangalore University in 2007. After completing his post-graduate degree, he joined St. Joseph's College, Bangalore, and served as Lecturer in Computer Science from 1996 to 1999. He was then elevated to the position of Reader in Computer Science at Mangalore University in 2003. He was the Chairman of the Department of Post-Graduate Studies and Research in Computer Science from 2003-2005 and from 2009-2008 and has served in various capacities at Mangalore University; at present he is the Chairman of the Board of Studies and Associate Professor in Computer Science at Mangalore University. His areas of research interest include Data Mining and Knowledge Discovery, Artificial Intelligence and Expert Systems, Bioinformatics, Molecular Modelling and Simulation, Computational Intelligence, Nanotechnology, Image Processing and Pattern Recognition. He has been granted a Major Research Project entitled Scientific Knowledge Discovery Systems (SKDS) for Advanced Engineering Materials Design Applications by the funding agency University Grant Commission, New Delhi, India. He has published about 30 peer-reviewed papers in national/international journals and conferences. He received the SHIKSHA RATTAN PURASKAR for his outstanding achievements in 2009 and the RASTRIYA VIDYA SARASWATHI AWARD for outstanding achievement in his chosen field of activity in 2010.

Chanabasayya M. Vastrad received the B.E. degree and the M.Tech. degree in 2001 and 2006 respectively. He is currently working towards his Ph.D degree in Computer Science and Technology under the guidance of Dr. Doreswamy in the Department of Post-Graduate Studies and Research in Computer Science, Mangalore University.
