
Research Article
Missing Values and Optimal Selection of an Imputation Method and Classification Algorithm to Improve the Accuracy of Ubiquitous Computing Applications

Jaemun Sim,1 Jonathan Sangyun Lee,2 and Ohbyung Kwon2

1 SKKU Business School, Sungkyunkwan University, Seoul 110734, Republic of Korea
2 School of Management, Kyung Hee University, Seoul 130701, Republic of Korea

Correspondence should be addressed to Ohbyung Kwon; obkwon@khu.ac.kr

Received 18 June 2014; Revised 29 September 2014; Accepted 11 October 2014

Academic Editor: Jong-Hyuk Park

Copyright © 2015 Jaemun Sim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In a ubiquitous environment, high-accuracy data analysis is essential because it affects real-world decision-making. However, in the real world, user-related data from information systems are often missing due to users' concerns about privacy or lack of obligation to provide complete data. This data incompleteness can impair the accuracy of data analysis using classification algorithms, which can degrade the value of the data. Many studies have attempted to overcome these data incompleteness issues and to improve the quality of data analysis using classification algorithms. The performance of classification algorithms may be affected by the characteristics and patterns of the missing data, such as the ratio of missing data to complete data. We perform a concrete causal analysis of differences in the performance of classification algorithms based on various factors. The characteristics of missing values, datasets, and imputation methods are examined. We also propose imputation and classification algorithms appropriate to different datasets and circumstances.

1. Introduction

Ubiquitous computing has been the central focus of research and development in many studies; it is considered to be the third wave in the evolution of computer technology [1]. In ubiquitous computing, data must be collected and analyzed accurately in real time. For this process to be successful, data must be well organized and uncorrupted. Data preprocessing is an essential but time- and effort-consuming step in the process of data mining. Several preprocessing methods have been developed to overcome data inconsistencies [2].

Data incompleteness due to missing values is very common in datasets collected in real settings [3]; it presents a challenge in the data preprocessing phase. Data is often missing when user input is required. For example, in human-centric computing, systems often require user profile data for the purpose of personalization [4]. In the case of Twitter, text data is used for sentiment analysis in order to analyze user behaviors and attitudes [5]. As a final example, in ubiquitous commerce, customer data has been used to personalize services for users [6]. Values may be missing when users are reluctant to provide their personal data due to privacy concerns or lack of motivation. This is especially true for optional data requested by the system.

Missing values can also be present in sensor data. Sensor data is usually in quantitative form. Sensors provide physical information regarding temperature, sound, or trajectory. Sensor technology has advanced over the years; it is an essential source of data for ubiquitous computing and is used for situation awareness and circumstantial decision-making. For example, human interaction sensors read and react to current situations [7]. Analysis of image files for face recognition and object detection using sensors is widely used in ubiquitous computing [8]. However, incorrect data and missing values are possible even using advanced sensor technology, due to mechanical and network errors. Missing values can interfere with decision-making and personalization, which can ultimately lead to user dissatisfaction. In many cases, the impact of missing data is costly to users of data analysis methods such as classification algorithms.



Data incompleteness may have negative effects on data preprocessing and decision-making accuracy. Extra time and effort are required to compensate for missing data. Using uncertain or null data results in fatal errors in the classification algorithm, and deleting all records that contain missing data (i.e., using the listwise deletion method) reduces the sample size, which might decrease statistical power and introduce potential bias to the estimation [9]. Finally, unless the researcher can be sure that the data values are missing completely at random (MCAR), the conclusions resulting from a complete-case analysis are most likely to be biased.

In order to overcome issues related to data incompleteness, many researchers have suggested methods of supplementing or compensating for missing data. The missing data imputation method is the most frequently used statistical method developed to deal with missing data problems. It is defined as "a procedure that replaces the missing values in a dataset by some plausible values" [3]. Missing values occur when no data is stored for a given variable in the current observation.

Many studies have attempted to validate the missing data imputation method of supplementing or compensating for missing data by testing it with different types of data; other studies have attempted to develop the method further. Studies have also compared the performance of various imputation methods based on benchmark data. For example, Kang investigated the ratio of missing to complete data in various datasets and compared the average accuracy of several imputation methods, such as MNR, k-NN, CART, ANN, and LLR [10]. The results demonstrated that k-NN performed best on datasets with less than 10% of data missing, and LLR performed best on those with more than 10% of data missing.

However, after multiple tests using complete datasets, not much difference in performance was observed, and some datasets were linearly inferior. In Kang's study [10], many datasets with equivalent conditions yielded different results. Thus, the fit between the dataset characteristics and the imputation method must also be considered. Previous studies have compared imputation methods by varying the ratio of missing to complete data or evaluating performance differences between complete and incomplete datasets. However, the reasons for these different results between datasets under equivalent conditions remain unexplained. Various factors may affect the performance of classification algorithms. For example, the interrelationship or fitness between the dataset, imputation method, and characteristics of the missing values may be important to the success or failure of the analytical process.

The purpose of this study is to examine the influence of dataset characteristics and patterns of missing data on the performance of classification algorithms using various datasets. The moderating effects of different imputation methods, classification algorithms, and data characteristics on performance are also analyzed. The results are important because they can suggest which imputation method or classification algorithm to use depending on the data conditions. The goal is to improve the performance, accuracy, and time required for ubiquitous computing.

2. Treating Datasets Containing Missing Data

Missing information is an unavoidable aspect of data analysis. For example, responses may be missing to items on survey instruments intended to measure cognitive and affective factors. Various imputation methods have been developed and used for the treatment of datasets containing missing data. Some popular methods are listed below.

(1) Listwise Deletion. Listwise deletion (LD) involves the removal of all individuals with incomplete responses for any items. However, LD reduces the effective sample size (sometimes greatly, in cases with large amounts of missing data), which can in turn reduce statistical power for hypothesis testing to unacceptably low levels. LD assumes that the data are MCAR (i.e., their omission is unrelated to all measured variables). When the MCAR assumption is violated, as is often the case in real research settings, the resulting estimates will be biased.

(2) Zero Imputation. When data are omitted as incorrect, the zero imputation method is used, in which missing responses are assigned an incorrect value (or zero, in the case of dichotomously scored items).

(3) Mean Imputation. In this method, the mean of all values within the same attribute is calculated and then imputed in the missing data cells. The method works only if the attribute examined is not nominal.

(4) Multiple Imputations. Multiple imputations can incorporate information from all variables in a dataset to derive imputed values for those that are missing. This method has been shown to be an effective tool in a variety of scenarios involving missing data [11], including incomplete item responses [12].

(5) Regression Imputation. The linear regression function is calculated from the values within the same attribute, which is then used as the dependent variable; the other attributes (except the decision attribute) are used as independent variables. The estimated dependent variable is then imputed in the missing data cells. This method works only if all considered attributes are not nominal.

(6) Stochastic Regression Imputation. Stochastic regression imputation involves a two-step process in which the distribution of relative frequencies for each response category for each member of the sample is first obtained from the observed data.

In this paper, the details of the seven imputation methods used herein are as follows.

(i) Listwise Deletion. All instances are deleted that contain more than one missing cell in their attributes.


(ii) Mean Imputation. The missing values from each attribute (column or feature) are replaced with the mean of all known values of that attribute. That is, let $X_i^j$ be the $j$th missing attribute of the $i$th instance, which is imputed by

$$X_i^j = \sum_{k \in I(\text{complete})} \frac{X_k^j}{n_{|I(\text{complete})|}}, \tag{1}$$

where $I(\text{complete})$ is a set of indices that are not missing in $X_i$ and $n_{|I(\text{complete})|}$ is the total number of instances where the $j$th attribute is not missing.
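For concreteness, a minimal sketch of equation (1) follows. Python/NumPy and the use of np.nan to mark missing cells are our illustrative assumptions; the experiments in this paper were implemented in Java.

```python
import numpy as np

def mean_impute(X):
    """Equation (1): replace each missing cell with the mean of the
    observed (nonmissing) values of the same attribute (column)."""
    X = X.astype(float)
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if missing.any():
            # mean over the index set I(complete) for attribute j
            X[missing, j] = X[~missing, j].mean()
    return X

# usage: a small toy matrix with two missing cells
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [7.0, 6.0, 3.0]])
print(mean_impute(X))
```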

(iii) Group Mean Imputation. The process for this method is the same as that for mean imputation. However, the missing values are replaced with the group (or class) mean of all known values of that attribute. Each group represents a target class from among the instances (records) that have missing values. Let $X_{mi}^j$ be the $j$th missing attribute of the $i$th instance of the $m$th class, which is imputed by

$$X_{mi}^j = \sum_{k \in I(m\text{th class, complete})} \frac{X_{mk}^j}{n_{|I(m\text{th class, complete})|}}, \tag{2}$$

where $I(m\text{th class, complete})$ is a set of indices of the $m$th class that are not missing in $X^j$ and $n_{|I(m\text{th class, complete})|}$ is the total number of instances where the $j$th attribute of the $m$th class is not missing.
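A corresponding sketch of equation (2); the availability of class labels y alongside X is assumed for illustration.

```python
import numpy as np

def group_mean_impute(X, y):
    """Equation (2): replace each missing cell with the mean of the
    observed values of the same attribute within the same class."""
    X = X.astype(float)
    y = np.asarray(y)
    for m in np.unique(y):                     # each target class m
        rows = (y == m)
        for j in range(X.shape[1]):
            col = X[rows, j]
            missing = np.isnan(col)
            if missing.any() and (~missing).any():
                col[missing] = col[~missing].mean()
                X[rows, j] = col
    return X
```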

(iv) Predictive Mean Imputation. In this method, the functional relationship between multiple input variables and single or multiple target variables of the given data is represented in the form of a linear equation. This method sets attributes that have missing values as dependent variables and other attributes as independent variables, in order to allow prediction of missing values by creating a regression model using those variables. For a regression target $y_i$, the MLR equation with $d$ predictors and $n$ training instances can be written as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id}, \quad i = 1, \ldots, n. \tag{3}$$

This can be rewritten in matrix form such that $y = X\beta$, and the coefficient $\beta$ can be obtained explicitly by taking a derivative of the squared error function as follows:

$$\min E(\beta) = \frac{1}{2}(y - X\beta)^T (y - X\beta),$$
$$\frac{\partial E(\beta)}{\partial \beta} = X^T X \beta - X^T y = 0,$$
$$\beta = (X^T X)^{-1} X^T y. \tag{4}$$
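A sketch of equations (3)-(4) using the closed-form least-squares solution; restricting imputation to rows where only attribute j is missing is a simplification of ours, not a detail from the paper.

```python
import numpy as np

def predictive_mean_impute(X, j):
    """Equations (3)-(4): fit beta = (X^T X)^{-1} X^T y on the complete
    rows, with attribute j as target and the remaining attributes as
    predictors, then predict j's missing cells."""
    X = X.astype(float)
    others = [c for c in range(X.shape[1]) if c != j]
    complete = ~np.isnan(X).any(axis=1)               # training rows
    # impute only rows where attribute j alone is missing (simplification)
    target = np.isnan(X[:, j]) & ~np.isnan(X[:, others]).any(axis=1)
    A = np.c_[np.ones(complete.sum()), X[complete][:, others]]
    beta = np.linalg.lstsq(A, X[complete, j], rcond=None)[0]
    B = np.c_[np.ones(target.sum()), X[target][:, others]]
    X[target, j] = B @ beta
    return X
```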

(v) Hot-Deck. This method is the same in principle as case-based reasoning. In order for attributes that contain missing values to be utilized, values must be found from among the most similar instances with nonmissing values and used to replace the missing values. Therefore, each missing value is replaced with the value of the same attribute in the most similar instance, as follows:

$$X_i^j = X_k^j, \quad k = \arg\min_{P} \sqrt{\sum_{j \in I(\text{complete})} \text{Std}_j \left( X_i^j - X_P^j \right)^2}, \tag{5}$$

where $\text{Std}_j$ is the standard deviation of the $j$th attribute, which is not missing.
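A sketch of equation (5); the exact role of Std_j in (5) is ambiguous in the text, so the conventional choice of dividing the differences by the standard deviation is used here as an assumption, and at least one complete donor row is assumed to exist.

```python
import numpy as np

def hot_deck_impute(X):
    """Equation (5): fill each incomplete row from its most similar
    complete row ("donor"), measuring similarity over the attributes
    observed in that row, scaled by each attribute's Std_j."""
    X = X.astype(float)
    complete = ~np.isnan(X).any(axis=1)
    donors = X[complete]                     # assumes at least one donor
    std = donors.std(axis=0) + 1e-12         # guard against zero spread
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])                # attributes observed in row i
        d = np.sqrt((((donors[:, obs] - X[i, obs]) / std[obs]) ** 2).sum(axis=1))
        X[i, ~obs] = donors[d.argmin(), ~obs]
    return X
```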

(vi) k-NN. Attributes are found via a search among nonmissing attributes using the 3-NN method. Missing values are imputed based on the values of the attributes of the $k$ most similar instances, as follows:

$$X_i^j = \sum_{P \in k\text{-NN}(X_i)} k\left( X_i^{I(\text{complete})}, X_P^{I(\text{complete})} \right) \cdot X_P^j, \tag{6}$$

where $k\text{-NN}(X_i)$ is the index set of the $k$ nearest neighbors of $X_i$ based on the nonmissing attributes and $k(X_i, X_j)$ is a kernel function that is proportional to the similarity between the two instances $X_i$ and $X_j$ ($k = 4$).
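A sketch of equation (6); a Gaussian kernel is our illustrative choice, since the paper only requires a kernel proportional to similarity.

```python
import numpy as np

def knn_impute(X, k=4):
    """Equation (6): impute each missing cell as a kernel-weighted
    average over the k most similar complete rows, compared on the
    attributes observed in the incomplete row."""
    X = X.astype(float)
    complete = ~np.isnan(X).any(axis=1)
    donors = X[complete]                     # assumes at least k donors
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        d2 = ((donors[:, obs] - X[i, obs]) ** 2).sum(axis=1)
        nn = np.argsort(d2)[:k]              # index set k-NN(X_i)
        w = np.exp(-d2[nn])                  # Gaussian kernel weights
        w /= w.sum()
        X[i, ~obs] = w @ donors[nn][:, ~obs]
    return X
```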

(vii) k-Means Clustering. Attributes are found through formation of $k$ clusters from nonmissing data, after which missing values are imputed. The entire dataset is partitioned into $k$ clusters by maximizing the homogeneity within each cluster and the heterogeneity between clusters, as follows:

$$\arg\min_{C^{I(\text{complete})}} \sum_{i=1}^{k} \sum_{X_j^{I(\text{complete})} \in C_i^{I(\text{complete})}} \left\| X_j^{I(\text{complete})} - C_i^{I(\text{complete})} \right\|^2, \tag{7}$$

where $C_i^{I(\text{complete})}$ is the centroid of the $i$th cluster and $C^{I(\text{complete})}$ is the union of all clusters ($C^{I(\text{complete})} = C_1^{I(\text{complete})} \cup \cdots \cup C_k^{I(\text{complete})}$). For a missing value $X_i^j$, the mean value of the attribute for the instances in the same cluster as $X_i^{I(\text{complete})}$ is imputed, as follows:

$$X_i^j = \frac{1}{\left| C_k^{I(\text{complete})} \right|} \sum_{X_P^{I(\text{complete})} \in C_k^{I(\text{complete})}} X_P^j \quad \text{s.t. } k = \arg\min_{c} \left\| X_i^{I(\text{complete})} - C_c^{I(\text{complete})} \right\|. \tag{8}$$
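A sketch of equations (7)-(8) using Lloyd's algorithm on the complete rows; the random initialization, iteration count, and default k are our assumptions. Since a centroid is the mean of its members, assigning the centroid's coordinates to the missing cells implements the cluster-mean imputation of (8).

```python
import numpy as np

def kmeans_impute(X, k=3, iters=50, seed=0):
    """Equations (7)-(8): run k-means (Lloyd's algorithm) on the
    complete rows, then fill each incomplete row's missing cells with
    the corresponding coordinates of its nearest centroid."""
    rng = np.random.default_rng(seed)
    X = X.astype(float)
    complete = ~np.isnan(X).any(axis=1)
    D = X[complete]
    C = D[rng.choice(len(D), size=k, replace=False)]   # initial centroids
    for _ in range(iters):                             # minimize (7)
        labels = ((D[:, None, :] - C[None]) ** 2).sum(axis=2).argmin(axis=1)
        C = np.array([D[labels == c].mean(axis=0) if (labels == c).any()
                      else C[c] for c in range(k)])
    for i in np.where(~complete)[0]:                   # impute via (8)
        obs = ~np.isnan(X[i])
        c = ((C[:, obs] - X[i, obs]) ** 2).sum(axis=1).argmin()
        X[i, ~obs] = C[c, ~obs]
    return X
```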

3. Model

In this paper, we hypothesize an association between the performance of classification algorithms and the characteristics of missing data and datasets. Moreover, we assume that the chosen imputation method moderates the causality between these factors. Figure 1 illustrates the posited relationships.

3.1. Missing Data Characteristics. Table 1 describes the characteristics of missing data and how to calculate them. The pattern of missing data characteristics may be univariate, monotone, or arbitrary [11]. A univariate pattern of missing data occurs when missing values are observed for a single variable only; all other data are complete for all variables.


Table 1: The characteristics of missing data.

Variables | Meaning | Calculation
Missing data ratio | The number of missing values in the entire dataset as compared to the number of nonmissing values | The number of empty data cells / total cells
Patterns of missing data (univariate, monotone, arbitrary) | Ratio of missing to complete values for an existing feature compared to the values for all features | —
Horizontal scatteredness | Distribution of missing values within each data record | Determine the number of missing cells in each record and calculate the standard deviation
Vertical scatteredness | Distribution of missing values for each attribute | Determine the number of missing cells in each feature and calculate the standard deviation
Missing data spread | Larger standard deviations indicate stronger effects of missing data | Determine the weighted average of the standard deviations of features with missing data (weight: the ratio of missing to complete data for each feature)

Figure 1: Research model. Missing data characteristics and dataset features drive classification performance, with the imputation method as a moderator.

A monotone pattern occurs if variables can be arranged such that all $Y_{j+1}, \ldots, Y_k$ are missing for cases where $Y_j$ is missing. Another characteristic, missing data spread, is important because larger standard deviations for missing values within an existing feature indicate that the missing data has greater influence on the results of the analysis (Figure 2).

3.2. Dataset Features. Table 2 lists the features of datasets. Based on the research of Kwon and Sim [15], in which characteristics of datasets that influence classification algorithms were identified, we considered the following statistically significant features in this study: missing values, the number of cases, the number of attributes, and the degree of class imbalance. However, the discussion of missing values is omitted here because it has already been analyzed in detail by Kwon and Sim [15].

3.3. Imputation Methods. Table 3 lists the imputation methods used in this study. Since datasets with categorical decision attributes are included, imputation methods that do not accommodate categorical attributes (e.g., regression imputation) are excluded from this paper.

Table 2: Dataset features.

Variables | Description
Number of cases | Number of records in the dataset
Number of attributes | Number of features characteristic of the dataset
Degree of class imbalance | Ratio

3.4. Classification Algorithms. Many studies have compared classification algorithms in various areas. For example, the decision tree is known as the best algorithm for arrhythmia classification [16]. In Table 4, six types of representative classification algorithms for supervised learning are described: C4.5, SVM (support vector machine), Bayesian network, logistic classifier, k-nearest neighbor classifier, and regression.

4. Method

We conducted a performance evaluation of the imputation methods and classification algorithms described in the previous section using actual datasets taken from the UCI dataset archive. To ensure the accuracy of each method in cases with no missing values, datasets with missing values were not included. Among the selected datasets, six (Iris, Wine, Glass, Liver Disorder, Ionosphere, and Statlog Shuttle) were included for comparison with the results of Kang [10]. These datasets are popular and frequently utilized benchmarks in the literature, which makes them useful for demonstrating the superiority of the proposed idea.

Table 5 provides the names of the datasets, the numbers of cases, and the descriptions of features and classes. The numbers in parentheses in the last two columns represent the number of features and classes for the decision attributes. For example, in dataset Iris, "Numeric (4)" indicates that there are four numeric attributes, and "Categorical (3)" means that there are three classes in the decision attribute.

Since UCI datasets have no missing data, target values in each dataset were randomly omitted [10]. Based on


Figure 2: Missing data patterns. In the univariate pattern, all missing values are in the last feature; in the monotone pattern, missing values fall in the last three features (last, last−1, last−2); in the arbitrary pattern, missing values are in random features and records.

Table 3: Imputation methods.

Imputation methods | Description
Listwise deletion | Perhaps the most basic traditional technique for dealing with missing data. Cases with missing values are discarded, restricting the analyses to cases for which complete data are available.
Mean imputation | Involves replacing missing data with the overall mean for the observed data.
Group mean imputation | A missing value is replaced by the mean of a subset of the data, based on other observed variable(s) in the data.
Predictive mean imputation | Also called regression imputation. Predictive mean imputation involves imputing a missing value using an ordinary least-squares regression method to estimate missing data.
Hot-deck | Most similar records are imputed to missing values.
k-NN | The attribute value of k is imputed to the most similar instance from nonmissing data.
k-means clustering | k numbers of sets are created that are homogeneous on the inside and heterogeneous on the outside.

Table 4: Classification algorithms.

Algorithms | Description
C4.5 | Estimates the known data using learning rules. C4.5 gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method, until it comes to the end node.
SVM | Classifies the unknown class by finding the optimal hyperplane with the maximum margin that reduces the estimation error.
Bayesian network | A probability network with a high posterior probability given the instances. Such a network can provide insight into probabilistic dependencies among the variables in the training dataset.
Logistic classifier | Takes the functional form of logistic CDF (cumulative distribution function). This function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13].
k-nearest neighbor classifier | Simple instance-based learner that uses the class of the nearest k training instances for the class of the test instances.
Regression | The class is binarized, and one regression model is built for each class value [14].


Table 5: Datasets used in the experiments.

Dataset | Number of cases | Features | Decision attributes
Iris | 150 | Numeric (4) | Categorical (3)
Wine | 178 | Numeric (13) | Categorical (3)
Glass | 214 | Numeric (9) | Categorical (7)
Liver disorder | 345 | Numeric (6) | Categorical (2)
Ionosphere | 351 | Numeric (34) | Categorical (2)
Statlog Shuttle | 57999 | Numeric (7) | Categorical (7)

the list of missing data characteristics, three datasets with three different missing data ratios (5%, 10%, and 15%) and three sets representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations for each dataset. In total, 54 datasets were imputed for each imputation method, as 6 datasets were available. We repeated the experiment for each dataset 1000 times in order to minimize errors and bias. Thus, 5400 datasets were imputed in total for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied imputed datasets to the six classification algorithms listed in Table 4.
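A sketch of how such variations can be generated is given below; the function, its np.nan convention for empty cells, and the choice to spread the monotone pattern over the last three features (per Figure 2) are our illustration — the paper's own generator was written in Java.

```python
import numpy as np

def inject_missing(X, ratio, pattern, seed=0):
    """Blank out approximately `ratio` of the data cells (set to
    np.nan) following one of the three patterns of Figure 2."""
    rng = np.random.default_rng(seed)
    X = X.astype(float)
    n, d = X.shape
    n_cells = int(round(ratio * n * d))
    if pattern == "univariate":      # all missing cells in one feature
        rows = rng.choice(n, size=min(n_cells, n), replace=False)
        X[rows, -1] = np.nan
    elif pattern == "monotone":      # missing cells in the last features
        for col in (-1, -2, -3):
            rows = rng.choice(n, size=min(n_cells // 3, n), replace=False)
            X[rows, col] = np.nan
    else:                            # arbitrary: random feature and record
        cells = rng.choice(n * d, size=n_cells, replace=False)
        X[np.unravel_index(cells, (n, d))] = np.nan
    return X
```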

There are various indicators to measure performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in the imputation research. Therefore, we also adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.

RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equivalent to 1 when there are no missing data [10]. The no-missing-data condition was used as a baseline of performance. As the next step, we generated a missing dataset from the original no-missing dataset and then applied an imputation method to replace the null data. Then, a classification algorithm was conducted to estimate the results of the imputed dataset. With all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation to understand how the input factors (the characteristics of the missing data and those of the datasets) affected the performance of the selected classification algorithms:

$$y_p = \sum_{\forall j \in M} \beta_{pj} x_j + \sum_{\forall k \in D} \chi_{pk} z_k + \epsilon_p. \tag{9}$$

In this equation, $x_j$ is the value of a characteristic of the missing data (M), $z_k$ is the value of each dataset's characteristic in the set of dataset characteristics (D), and $y_p$ is a performance parameter. Note that M = {missing data ratio, patterns of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and D = {number of cases, number of attributes, degree of class imbalance}. In addition, $p = 1$ indicates relative prediction accuracy, $p = 2$ represents RMSE, and $p = 3$ means elapsed time. We performed the experiment using the Weka library source software (release 3.6) to determine the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to conduct the automatized experiment repeatedly.
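For illustration, both the RMSE indicator and a fit of equation (9) reduce to a few lines of ordinary least squares. The sketch below assumes the per-run characteristics have already been collected into a design matrix F (one row per experimental run); it is not the SPSS procedure actually used in the study.

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean square error between observed and predicted values."""
    return np.sqrt(np.mean((np.asarray(y_obs) - np.asarray(y_pred)) ** 2))

def fit_performance_model(F, y):
    """Equation (9): ordinary least squares of a performance measure y
    (e.g., RMSE per run) on a design matrix F whose columns stack the
    missing-data characteristics (M) and dataset features (D)."""
    A = np.c_[np.ones(len(F)), F]            # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef                              # [intercept, beta_pj..., chi_pk...]
```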

5. Results

In total, 32400 datasets (3 missing ratios × 3 imputation patterns × 6 imputation methods × 100 trials) were imputed for each of the 6 classifiers. Thus, in total, we tested 226800 datasets (32400 imputed datasets × 7 classifier methods). The results were divided by those for each dataset, classification algorithm, and imputation method for comparison in terms of performance.

5.1. Datasets. Figure 3 shows the performance of each imputation method for the six different datasets. On the x-axis, three missing ratios represent the characteristics of missing data, and on the y-axis, performance is indicated using the RMSE. All results of three different variations of the missing data patterns and tested classification algorithms were merged for each imputation method.

For Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best results.

For Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation was the best.

For Liver Disorder data (Figure 3(c)), k-NN was the least effective, and once again the predictive mean imputation method yielded the best results.

For Ionosphere data (Figure 3(d)), hot-deck was the worst and k-NN the best.

For Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.

For Statlog data (Figure 3(f)), unlike the other datasets, the results varied based on the missing data ratio. However, predictive mean imputation was still the best method overall, and hot-deck the worst.

Figure 3 illustrates that the predictive mean imputation method yielded the best results overall and hot-deck imputation the worst. However, no imputation method was generally superior in all cases with any given dataset. For example, the k-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was lowest. In another example, the group mean imputation method performed best for the Iris and Wine datasets, but its performance was only average for other datasets. Therefore, the results were inconsistent, and determining the best imputation method is impossible. Thus, the imputation method cannot be used as an accurate predictor of performance. Rather, the performance must be influenced by other factors, such as the interaction between the characteristics of the dataset in terms of missing data and the chosen imputation method.


Figure 3: Comparison of performances of imputation methods for each dataset: (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, and (f) Statlog Shuttle. Each panel plots RMSE (y-axis) against missing ratios of 0.01, 0.05, and 0.10 (x-axis) for the seven imputation methods (listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, k-NN, and k-means clustering).


Table 6: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −.076** | −.075** | −.178** | −.072** | .115** | .007
N cases | −.079** | −.049** | .012 | −.017 | −.032 | −.048**
C imbalance | .117** | .239** | .264** | .525** | .163** | .198**
R missing | .051* | .078** | .040 | .080** | .076** | .068**
SE HS | .249** | .285** | .186** | .277** | .335** | .245**
SE VS | −.009 | −.013 | −.006 | −.013 | −.016 | −.010
Spread | −.382** | −.430** | −.261** | −.436** | −.452** | −.363**
P missing dum1 | −.049 | −.038 | −.038 | −.037 | −.045 | −.038
P missing dum2 | −.002 | .014 | .002 | .011 | .001 | .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

Table 7: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): group mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −.068** | −.072** | −.179** | −.068** | .115** | .010
N cases | −.082** | −.050** | .011 | −.018 | −.034* | −.047**
C imbalance | .115** | .228** | .260** | .517** | .156** | .197**
R missing | .050** | .085** | .043 | .084** | .095** | .066**
SE HS | .230** | .268** | .178** | .273** | .300** | .248**
SE VS | −.008 | −.012 | −.006 | −.013 | −.013 | −.010
Spread | −.296** | −.439** | −.264** | −.443** | −.476** | −.382**
P missing dum1 | −.043 | −.032 | −.034 | −.035 | −.035 | −.041
P missing dum2 | .002 | .024 | .004 | .016 | .021 | .013

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary depending on the ratio of missing data, except for listwise deletion. For listwise deletion, as the ratio of missing to complete data increased, the performance deteriorated. In the listwise deletion method, all records are deleted that contain missing data; therefore, the number of deleted records increases as the ratio of missing data increases. The low performance of this method can be explained based on this fact.

The differences in performance between imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained. Thus, a regression analysis was conducted.

In Figure 4, the results suggest the following rules (see the sketch after this list):

(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.

(ii) IF the missing rate increases AND the logistic classifier method is used, THEN use the HOT DECK method.

(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.

(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.

(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
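These five rules can be encoded directly as a lookup table. The sketch below is merely a convenience wrapper around the rules, using the Weka classifier identifiers that appear in this paper; it is not part of the original experimental code.

```python
# Rules (i)-(v) encoded literally as a lookup from classifier to the
# imputation method suggested when the missing rate increases.
IMPUTATION_RULES = {
    "IBk":        "GROUP MEAN IMPUTATION",
    "Logistic":   "HOT DECK",
    "Regression": "GROUP MEAN IMPUTATION",
    "BayesNet":   "GROUP MEAN IMPUTATION",
    "trees.J48":  "k-NN",
}

def choose_imputation(classifier, missing_rate_increasing=True):
    """Return the imputation method suggested by rules (i)-(v), or None
    where the rules are silent (e.g., SMO, or a stable missing rate)."""
    if missing_rate_increasing:
        return IMPUTATION_RULES.get(classifier)
    return None
```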

5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing


Figure 4: Comparison of classifiers in terms of classification performance: (a) decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) regression, (e) logistic, and (f) IBk (k-nearest neighbor classifier). Each panel plots RMSE (y-axis) against missing ratios of 0.01, 0.05, and 0.10 (x-axis) for the seven imputation methods.


Table 8: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): predictive mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −.076** | −.076** | −.178** | −.063** | .123** | .016
N cases | −.084** | −.049** | .012 | −.017 | −.034* | −.047**
C imbalance | .117** | .242** | .263** | .523** | .153** | .198**
R missing | .050* | .079** | .043 | .085** | .080** | .068**
SE HS | .223** | .279** | .182** | .268** | .322** | .242**
SE VS | −.008 | −.013 | −.006 | −.013 | −.015 | −.009
Spread | −.328** | −.432** | −.262** | −.434** | −.465** | −.361**
P missing dum1 | −.042 | −.035 | −.034 | −.028 | −.044 | −.036
P missing dum2 | .008 | .012 | .004 | .018 | .007 | .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

Table 9: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): hot deck.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −.080** | −.073** | −.176** | −.071** | .115** | .007
N cases | −.081** | −.049** | .012 | −.018 | −.034* | −.047**
C imbalance | .135** | .237** | .261** | .524** | .133** | .211**
R missing | .062** | .083** | .044 | .084** | .075** | .070**
SE HS | .225** | .275** | .183** | .271** | .313** | .254**
SE VS | −.009 | −.013 | −.006 | −.013 | −.014 | −.010
Spread | −.365** | −.428** | −.265** | −.427** | −.441** | −.361**
P missing dum1 | −.035 | −.037 | −.034 | −.033 | −.048 | −.038
P missing dum2 | .012 | .015 | .004 | .012 | −.004 | .009

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

Table 10: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): k-NN.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −.085** | −.079** | −.181** | −.068** | .122** | .006
N cases | −.083** | −.049** | .011 | −.018 | −.034* | −.047**
C imbalance | .143** | .249** | .260** | .521** | .152** | .211**
R missing | .054* | .078** | .041 | .085** | .075** | .071**
SE HS | .234** | .290** | .182** | .269** | .328** | .255**
SE VS | −.010 | −.013 | −.006 | −.013 | −.014 | −.011
Spread | −.332** | −.427** | −.264** | −.431** | −.450** | −.369**
P missing dum1 | −.038 | −.041 | −.035 | −.029 | −.057 | −.035
P missing dum2 | .003 | .008 | .005 | .017 | .000 | .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.


Table 11: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): k-means clustering.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −.080** | −.078** | −.181** | −.068** | .117** | .009
N cases | −.079** | −.049** | .012 | −.017 | −.033 | −.047**
C imbalance | .136** | .240** | .263** | .524** | .145** | .206**
R missing | .057* | .079** | .041 | .084** | .079** | .057*
SE HS | .236** | .289** | .183** | .271** | .315** | .264**
SE VS | −.009 | −.013 | −.006 | −.013 | −.014 | −.011
Spread | −.362** | −.439** | −.262** | −.440** | −.474** | −.363**
P missing dum1 | −.037 | −.042 | −.036 | −.032 | −.038 | −.046
P missing dum2 | .002 | .013 | .001 | .014 | .009 | .004

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. Three types of missing ratios were treated as two dummy variables (P missing dum1/dum2: 00, 01, 10). Tables 6–11 illustrate the results of the regression analysis of the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected (a lookup sketch follows the list):

(i) IF N attributes increases, THEN use SMO.
(ii) IF N cases increases, THEN use trees.J48.
(iii) IF C imbalance increases, THEN use trees.J48.
(iv) IF R missing increases, THEN use SMO.
(v) IF SE HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
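As with the rules in Section 5.2, these can be written as a simple lookup; the sketch below, with our own abbreviations for the data characteristics of Tables 6–11, is illustrative only.

```python
# Rules (i)-(vi) as a lookup from the dominant (increasing) data
# characteristic to the suggested classification algorithm.
CLASSIFIER_RULES = {
    "N_attributes": "SMO",
    "N_cases":      "trees.J48",
    "C_imbalance":  "trees.J48",
    "R_missing":    "SMO",
    "SE_HS":        "SMO",
    "Spread":       "Logistic",
}

def choose_classifier(increasing_characteristic):
    """Return the classifier suggested by rules (i)-(vi), regardless of
    which imputation method is selected."""
    return CLASSIFIER_RULES.get(increasing_characteristic)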

Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are illustrated on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns seemed similar. However, significant differences were found in the coefficient patterns using other algorithms. For example, for all imputation methods, a higher beta coefficient of the number of attributes (N attributes) was observed for the logistics algorithm than for any other algorithm. Thus, the logistics algorithm exhibited the lowest performance (highest RMSE) in terms of the number of attributes. In terms of the number of cases (N cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective one. For the missing ratio, the regression method showed the lowest performance,

except in comparison to listwise deletion and mean imputation. For the horizontal scattered standard error (SE HS), SMO had the lowest performance. For missing data spread, the logistic classifier method had the lowest performance.

Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scattered standard error (SE HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.

The similar coefficient patterns shown in Figure 5 indicate that the differences in impact of each imputation method on performance were insignificant. In order to determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.

Figure 7 shows the RMSE based on the ratio of missing data for each imputation method. As the ratio increases, the performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.

Lastly, we estimate the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation with the performance of each classification in terms of RMSE. In total, 226800 datasets (3


Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis lists the data characteristics (attribute, instances, data imbalance, missing ratio, horizontal scatteredness, vertical scatteredness, spread, and missing patterns p1 and p2); the y-axis shows the regression coefficients (roughly −0.4 to 0.3) for the seven imputation methods.

Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). The same data characteristics as in Figure 5 are shown; the coefficients now range roughly from −0.6 to 0.8.

missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data only if the data characteristics can be obtained. Second, we can establish general rules for the selection of the optimal combination of a classification algorithm and imputation algorithm.

Figure 7: RMSE by ratio of missing data. RMSE (roughly 0.30 to 0.49, y-axis) rises as the missing ratio increases from 0.05 to 0.50 (x-axis), plotted for each of the seven imputation methods.

Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic | B | Data characteristic | B
(constant) | .060** | M imputation dum1 | .012**
R missing | .083** | M imputation dum2 | −.001*
SE HS | −.005** | M imputation dum4 | .000
SE VS | .000** | M imputation dum5 | .000
Spread | .017** | M imputation dum6 | .001**
N attributes | −.008** | M imputation dum7 | −.001*
C imbalance | −.003** | P missing dum1 | −.006**
N cases | .002** | P missing dum3 | .000

Note 1: Dummy variables related to imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0), MEAN IMPUTATION (M imputation dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0), HOT DECK (M imputation dum5 = 1, others = 0), k-NN (M imputation dum6 = 1, others = 0), and k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0), and arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standard beta coefficient.
Note 2: *P < 0.1, **P < 0.05.

6. Conclusion

So far, the prior research does not fully inform us of the fitness among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set which guides classification/recommender system developers to select the best classification algorithm based


on the datasets and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms with multiple dimensions (datasets, imputation data, and imputation methods) is discussed. Prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.

In this paper, factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE HS) was more sensitive in affecting the classification performance than the number of missing cells in each feature (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while other factors do not.

A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm must be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the most significant data feature to decrease the predicted performance of classification algorithms. In particular, SMO was second to none in predicting SE HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.

The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and choice of classification algorithm) improves the accuracy of ubiquitous computing applications. Also, a set of optimal combinations may be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.

Ubiquitous environments include a variety of forms of sensor data from limited service conditions such as location, time, and status, combining various different kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data changes dynamically. For practitioners, these rules for selection of the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of datasets and their missing values.

This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) to choose the imputation method and classification algorithm according to context while maintaining high prediction performance.

In future studies, the predicted performance of various methods can be tested with actual datasets. Although in prior research on classification algorithms multiple benchmark datasets from the UCI laboratory have been used to demonstrate the generality of the proposed method, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. For example, FP rate as well as TP rate is very crucial when it comes to investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the National Strategic RampDProgram for Industrial Technology (10041659) and funded bythe Ministry of Trade Industry and Energy (MOTIE)

References

[1] J Augusto V Callaghan D Cook A Kameas and I SatohldquoIntelligent environments a manifestordquo Human-Centric Com-puting and Information Sciences vol 3 no 12 pp 1ndash18 2013

[2] R Y Toledo Y C Mota andM G Borroto ldquoA regularity-basedpreprocessingmethod for collaborative recommender systemsrdquoJournal of Information Processing Systems vol 9 no 3 pp 435ndash460 2013

[3] G Batista and M Monard ldquoAn analysis of four missing datatreatment methods for supervised learningrdquo Applied ArtificialIntelligence vol 17 no 5-6 pp 519ndash533 2003

[4] R Shtykh and Q Jin ldquoA human-centric integrated approach toweb information search and sharingrdquoHuman-Centric Comput-ing and Information Sciences vol 1 no 1 pp 1ndash37 2011

[5] H Ihm ldquoMining consumer attitude and behaviorrdquo Journal ofConvergence vol 4 no 2 pp 29ndash35 2013

[6] Y Cho and S Moon ldquoWeighted mining frequent patternbased customers RFM score for personalized u-commercerecommendation systemrdquo Journal of Convergence vol 4 no 4pp 36ndash40 2013

[7] N Howard and E Cambria ldquoIntention awareness improvingupon situation awareness in human-centric environmentsrdquoHuman-Centric Computing and Information Sciences vol 3 no9 pp 1ndash17 2013

[8] L Liew B Lee Y Wang and W Cheah ldquoAerial images rectifi-cation using non-parametric approachrdquo Journal of Convergencevol 4 no 2 pp 15ndash21 2013

14 Mathematical Problems in Engineering

[9] K J Nishanth and V Ravi ldquoA computational intelligence basedonline data imputation method an application for bankingrdquoJournal of Information Processing Systems vol 9 no 4 pp 633ndash650 2013

[10] P Kang ldquoLocally linear reconstruction based missing valueimputation for supervised learningrdquo Neurocomputing vol 118pp 65ndash78 2013

[11] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002

[12] H Finch ldquoEstimation of item response theory parameters in thepresence of missing datardquo Journal of Educational Measurementvol 45 no 3 pp 225ndash245 2008

[13] S J Press and S Wilson ldquoChoosing between logistic regressionand discriminant analysisrdquo Journal of the American StatisticalAssociation vol 73 no 364 pp 699ndash705 1978

[14] E Frank YWang S Inglis G Holmes and I HWitten ldquoUsingmodel trees for classificationrdquo Machine Learning vol 32 no 1pp 63ndash76 1998

[15] O Kwon and J M Sim ldquoEffects of data set features on theperformances of classification algorithmsrdquo Expert Systems withApplications vol 40 no 5 pp 1847ndash1857 2013

[16] E Namsrai T Munkhdalai M Li J-H Shin O-E Namsraiand K H Ryu ldquoA feature selection-based ensemble methodfor arrhythmia classificationrdquo Journal of Information ProcessingSystems vol 9 no 1 pp 31ndash40 2013

[17] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Morgan Kaufmann San Fran-cisco Calif USA 2nd edition 2005

[18] M Galar A Fernandez E Barrenechea H Bustince and FHerrera ldquoA review on ensembles for the class imbalance prob-lem bagging- boosting- and hybrid-based approachesrdquo IEEETransactions on Systems Man and Cybernetics C Applicationsand Reviews vol 42 no 4 pp 463ndash484 2012

[19] Q Yang and X Wu ldquo10 challenging problems in data miningresearchrdquo International Journal of Information Technology ampDecision Making vol 5 no 4 pp 597ndash604 2006

[20] Z-H Zhou and X-Y Liu ldquoTraining cost-sensitive neural net-works with methods addressing the class imbalance problemrdquoIEEE Transactions on Knowledge and Data Engineering vol 18no 1 pp 63ndash77 2006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 2: Research Article Missing Values and Optimal Selection of an ...imputation method is the most frequently used statistical method developed to deal with missing data problems. It is

2 Mathematical Problems in Engineering

Data incompleteness may have negative effects on data preprocessing and decision-making accuracy. Extra time and effort are required to compensate for missing data. Using uncertain or null data results in fatal errors in the classification algorithm, and deleting all records that contain missing data (i.e., using the listwise deletion method) reduces the sample size, which might decrease statistical power and introduce potential bias to the estimation [9]. Finally, unless the researcher can be sure that the data values are missing completely at random (MCAR), the conclusions resulting from a complete-case analysis are most likely to be biased.

In order to overcome issues related to data incompleteness, many researchers have suggested methods of supplementing or compensating for missing data. The missing data imputation method is the most frequently used statistical method developed to deal with missing data problems. It is defined as "a procedure that replaces the missing values in a dataset by some plausible values" [3]. Missing values occur when no data is stored for a given variable in the current observation.

Many studies have attempted to validate the missing data imputation method of supplementing or compensating for missing data by testing it with different types of data; other studies have attempted to develop the method further. Studies have also compared the performance of various imputation methods based on benchmark data. For example, Kang investigated the ratio of missing to complete data in various datasets and compared the average accuracy of several imputation methods, such as MNR, k-NN, CART, ANN, and LLR [10]. The results demonstrated that k-NN performed best on datasets with less than 10% of data missing and that LLR performed best on those with more than 10% of data missing.

However, after multiple tests using complete datasets, little difference in performance was observed, and performance on some datasets was consistently inferior. In Kang's study [10], many datasets with equivalent conditions yielded different results. Thus, the fit between the dataset characteristics and the imputation method must also be considered. Previous studies have compared imputation methods by varying the ratio of missing to complete data or by evaluating performance differences between complete and incomplete datasets. However, the reasons for these different results between datasets under equivalent conditions remain unexplained. Various factors may affect the performance of classification algorithms. For example, the interrelationship, or fitness, between the dataset, the imputation method, and the characteristics of the missing values may be important to the success or failure of the analytical process.

The purpose of this study is to examine the influence of dataset characteristics and patterns of missing data on the performance of classification algorithms using various datasets. The moderating effects of different imputation methods, classification algorithms, and data characteristics on performance are also analyzed. The results are important because they can suggest which imputation method or classification algorithm to use depending on the data conditions. The goal is to improve the performance, accuracy, and time required for ubiquitous computing.

2. Treating Datasets Containing Missing Data

Missing information is an unavoidable aspect of data analysis. For example, responses may be missing to items on survey instruments intended to measure cognitive and affective factors. Various imputation methods have been developed and used for the treatment of datasets containing missing data. Some popular methods are listed below.

(1) Listwise Deletion. Listwise deletion (LD) involves the removal of all individuals with incomplete responses for any items. However, LD reduces the effective sample size (sometimes greatly, when large amounts of data are missing), which can in turn reduce statistical power for hypothesis testing to unacceptably low levels. LD assumes that the data are MCAR (i.e., their omission is unrelated to all measured variables). When the MCAR assumption is violated, as is often the case in real research settings, the resulting estimates will be biased.

(2) Zero Imputation. When data are omitted as incorrect, the zero imputation method is used, in which missing responses are assigned an incorrect value (or zero, in the case of dichotomously scored items).

(3) Mean Imputation. In this method, the mean of all values within the same attribute is calculated and then imputed in the missing data cells. The method works only if the attribute examined is not nominal.

(4) Multiple Imputation. Multiple imputation can incorporate information from all variables in a dataset to derive imputed values for those that are missing. This method has been shown to be an effective tool in a variety of scenarios involving missing data [11], including incomplete item responses [12].

(5) Regression Imputation. A linear regression function is calculated in which the attribute containing missing values serves as the dependent variable and the other attributes (except the decision attribute) serve as independent variables. The estimated value of the dependent variable is then imputed in the missing data cells. This method works only if all considered attributes are not nominal.

(6) Stochastic Regression Imputation. Stochastic regression imputation involves a two-step process, in which the distribution of relative frequencies for each response category for each member of the sample is first obtained from the observed data.

In this paper, the details of the seven imputation methods used herein are as follows.

(i) Listwise Deletion. All instances that contain one or more missing cells in their attributes are deleted.


(ii) Mean Imputation. The missing values of each attribute (column or feature) are replaced with the mean of all known values of that attribute. That is, let $X_i^j$ be the $j$th missing attribute of the $i$th instance, which is imputed by

$$X_i^j = \frac{\sum_{k \in I(\mathrm{complete})} X_k^j}{n_{|I(\mathrm{complete})|}}, \tag{1}$$

where $I(\mathrm{complete})$ is the set of instance indices for which the $j$th attribute is not missing and $n_{|I(\mathrm{complete})|}$ is the total number of instances for which the $j$th attribute is not missing.
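As a concrete illustration of (1), the following is a minimal sketch of column-wise mean imputation in Python with NumPy. The language and the NaN encoding of missing cells are our choices for illustration; the paper's experiments used Java-based packages.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN cell with the mean of the observed
    values in the same column, as in (1)."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        observed = col[~np.isnan(col)]   # I(complete) for attribute j
        if observed.size:                # skip all-missing columns
            col[np.isnan(col)] = observed.mean()
    return X

# Toy usage: the NaN in the first column becomes (1.0 + 3.0) / 2 = 2.0.
X = np.array([[1.0, 10.0], [np.nan, 12.0], [3.0, np.nan]])
print(mean_impute(X))
```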

(iii) Group Mean Imputation. The process for this method is the same as that for mean imputation; however, the missing values are replaced with the group (or class) mean of all known values of that attribute, where each group corresponds to a target class of the instances (records) that have missing values. Let $X_{mi}^j$ be the $j$th missing attribute of the $i$th instance of the $m$th class, which is imputed by

$$X_{mi}^j = \frac{\sum_{k \in I(m\text{th class complete})} X_{mk}^j}{n_{|I(m\text{th class complete})|}}, \tag{2}$$

where $I(m\text{th class complete})$ is the set of indices of instances of the $m$th class whose $j$th attribute is not missing and $n_{|I(m\text{th class complete})|}$ is the total number of instances of the $m$th class whose $j$th attribute is not missing.

(iv) Predictive Mean Imputation. In this method, the functional relationship between multiple input variables and single or multiple target variables of the given data is represented in the form of a linear equation. This method sets attributes that have missing values as dependent variables and the other attributes as independent variables, in order to allow prediction of missing values by creating a regression model using those variables. For a regression target $y_i$, the MLR equation with $d$ predictors and $n$ training instances can be written as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id}, \quad i = 1, \ldots, n. \tag{3}$$

This can be rewritten in matrix form such that $y = X\beta$, and the coefficient vector $\beta$ can be obtained explicitly by taking the derivative of the squared error function as follows:

$$\min E(\beta) = \frac{1}{2}(y - X\beta)^{T}(y - X\beta),$$
$$\frac{\partial E(\beta)}{\partial \beta} = X^{T}X\beta - X^{T}y = 0,$$
$$\beta = \left(X^{T}X\right)^{-1} X^{T}y. \tag{4}$$
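The closed-form solution (4) is straightforward to implement. Below is a minimal sketch (our illustration, not the paper's code): for one target column, rows with an observed value serve as training data, a least-squares model is fitted on the remaining attributes, and the fitted values replace the missing cells. For simplicity the sketch assumes the predictor columns themselves are complete; np.linalg.lstsq is used in place of the explicit inverse in (4) for numerical stability, and the two coincide when X^T X is invertible.

```python
import numpy as np

def predictive_mean_impute(X, target_col):
    """Impute NaNs in one column by OLS regression on the other
    columns, per beta = (X^T X)^{-1} X^T y in (4)."""
    X = X.astype(float).copy()
    y = X[:, target_col]
    mask = ~np.isnan(y)                         # rows with an observed target
    others = np.delete(X, target_col, axis=1)
    A = np.column_stack([np.ones(mask.sum()), others[mask]])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)        # stable OLS fit
    A_miss = np.column_stack([np.ones((~mask).sum()), others[~mask]])
    X[~mask, target_col] = A_miss @ beta        # fitted values fill the gaps
    return X
```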

(v) Hot-Deck. This method is the same in principle as case-based reasoning. For attributes that contain missing values, the most similar instance with nonmissing values is found, and its values are used to replace the missing values. Therefore, each missing value is replaced with the value of the corresponding attribute of the most similar instance, as follows:

$$X_i^j = X_k^j, \quad k = \arg\min_P \sqrt{\sum_{j \in I(\mathrm{complete})} \mathrm{Std}_j \left(X_i^j - X_P^j\right)^2}, \tag{5}$$

where $\mathrm{Std}_j$ is the standard deviation of the $j$th attribute, which is not missing.

(vi) k-NN. Similar instances are found via a search over the nonmissing attributes (here using the 3-NN method). Missing values are imputed based on the values of the attributes of the $k$ most similar instances, as follows:

$$X_i^j = \sum_{P \in k\text{-NN}(X_i)} k\left(X_i^{I(\mathrm{complete})}, X_P^{I(\mathrm{complete})}\right) \cdot X_P^j, \tag{6}$$

where $k\text{-NN}(X_i)$ is the index set of the $k$ nearest neighbors of $X_i$ based on the nonmissing attributes and $k(X_i, X_j)$ is a kernel function that is proportional to the similarity between the two instances $X_i$ and $X_j$ ($k = 4$).
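A minimal sketch of (6), under our own simplifying assumptions: distances are computed over the columns observed in both instances, the kernel is a normalized inverse-distance weight, and donors are restricted to instances whose target cell is observed. Hot-deck imputation (v) then corresponds to the special case of a single nearest donor (k = 1) with unit weight.

```python
import numpy as np

def knn_impute_cell(X, i, j, k=3):
    """Impute X[i, j] as a similarity-weighted mean of the jth
    attribute over the k nearest donors, as in (6)."""
    obs_i = ~np.isnan(X[i])
    donors, dists = [], []
    for p in range(X.shape[0]):
        if p == i or np.isnan(X[p, j]):
            continue                              # a donor must supply X[p, j]
        shared = obs_i & ~np.isnan(X[p])          # attributes observed in both
        if not shared.any():
            continue
        donors.append(p)
        dists.append(np.sqrt(np.mean((X[i, shared] - X[p, shared]) ** 2)))
    if not donors:
        raise ValueError("no donor instances available")
    order = np.argsort(dists)[:k]
    w = 1.0 / (np.array(dists)[order] + 1e-9)     # inverse-distance kernel
    w /= w.sum()                                  # weights sum to one
    return float(w @ X[np.array(donors)[order], j])
```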

(vii) k-Means Clustering. $k$ clusters are formed from the nonmissing data, after which the missing values are imputed. The entire dataset is partitioned into $k$ clusters by maximizing the homogeneity within each cluster and the heterogeneity between clusters, as follows:

$$\arg\min_{C^{I(\mathrm{complete})}} \sum_{i=1}^{k} \sum_{X_j^{I(\mathrm{complete})} \in C_i^{I(\mathrm{complete})}} \left\| X_j^{I(\mathrm{complete})} - C_i^{I(\mathrm{complete})} \right\|^2, \tag{7}$$

where $C_i^{I(\mathrm{complete})}$ is the centroid of the $i$th cluster and $C^{I(\mathrm{complete})}$ is the union of all clusters ($C^{I(\mathrm{complete})} = C_1^{I(\mathrm{complete})} \cup \cdots \cup C_k^{I(\mathrm{complete})}$). For a missing value $X_i^j$, the mean value of that attribute over the instances in the same cluster as $X_i^{I(\mathrm{complete})}$ is imputed, as follows:

$$X_i^j = \frac{1}{\left|C_k^{I(\mathrm{complete})}\right|} \sum_{X_P^{I(\mathrm{complete})} \in C_k^{I(\mathrm{complete})}} X_P^j, \quad \text{s.t. } k = \arg\min_i \left\| X_i^{I(\mathrm{complete})} - C_i^{I(\mathrm{complete})} \right\|. \tag{8}$$
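The following sketch implements (7)-(8) under our own assumptions: clustering runs Lloyd's algorithm on the fully observed columns only (so at least one such column is assumed), and each missing cell is then filled with the within-cluster mean of its attribute. A library routine (e.g., scikit-learn's KMeans) could replace the hand-rolled loop.

```python
import numpy as np

def kmeans_impute(X, k=3, iters=50, seed=0):
    """Cluster on fully observed columns (7), then impute each NaN
    with the mean of its attribute within the assigned cluster (8)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    complete_cols = ~np.isnan(X).any(axis=0)    # the I(complete) columns
    Z = X[:, complete_cols]
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):                      # Lloyd's algorithm
        labels = np.argmin(((Z[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = Z[labels == c].mean(axis=0)
    for i, j in zip(*np.where(np.isnan(X))):
        members = X[labels == labels[i], j]     # same-cluster values of attr j
        obs = members[~np.isnan(members)]
        X[i, j] = obs.mean() if obs.size else np.nanmean(X[:, j])
    return X
```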

3. Model

In this paper, we hypothesize an association between the performance of classification algorithms and the characteristics of missing data and datasets. Moreover, we assume that the chosen imputation method moderates the causality between these factors. Figure 1 illustrates the posited relationships.

3.1. Missing Data Characteristics. Table 1 describes the characteristics of missing data and how to calculate them. The pattern of missing data may be univariate, monotone, or arbitrary [11]. A univariate pattern of missing data occurs when missing values are observed for a single variable only; all other data are complete for all variables.

Table 1: The characteristics of missing data.

Variables | Meaning | Calculation
Missing data ratio | The number of missing values in the entire dataset as compared to the number of nonmissing values | The number of empty data cells / total cells
Patterns of missing data | Univariate, monotone, or arbitrary | Ratio of missing to complete values for an existing feature compared to the values for all features
Horizontal scatteredness | Distribution of missing values within each data record | Determine the number of missing cells in each record and calculate the standard deviation
Vertical scatteredness | Distribution of missing values for each attribute | Determine the number of missing cells in each feature and calculate the standard deviation
Missing data spread | Larger standard deviations indicate stronger effects of missing data | Determine the weighted average of the standard deviations of features with missing data (weight: the ratio of missing to complete data for each feature)

Figure 1: Research model (the characteristics of missing data and the dataset features are posited to affect classification performance, moderated by the imputation method).

A monotone pattern occurs if the variables can be arranged such that $Y_{j+1}, \ldots, Y_k$ are all missing for cases in which $Y_j$ is missing. Another characteristic, missing data spread, is important because larger standard deviations for missing values within an existing feature indicate that the missing data has greater influence on the results of the analysis (Figure 2).
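To make the quantities in Table 1 concrete, the sketch below computes them from a missing-data mask. The function name and the exact weighting used for the spread are our reading of the table's "Calculation" column, not code from the paper.

```python
import numpy as np

def missing_characteristics(X):
    """Compute the Table 1 characteristics from a data matrix
    whose missing cells are NaN."""
    miss = np.isnan(X)
    ratio = miss.sum() / miss.size              # missing data ratio
    hs = miss.sum(axis=1).std()                 # horizontal scatteredness
    vs = miss.sum(axis=0).std()                 # vertical scatteredness
    # Spread: std of each feature's observed values, weighted by that
    # feature's missing ratio (our interpretation of Table 1).
    col_ratio = miss.mean(axis=0)
    col_std = np.array([np.nanstd(X[:, j]) for j in range(X.shape[1])])
    has_miss = col_ratio > 0
    spread = (np.average(col_std[has_miss], weights=col_ratio[has_miss])
              if has_miss.any() else 0.0)
    return {"ratio": ratio, "SE_HS": hs, "SE_VS": vs, "spread": spread}
```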

3.2. Dataset Features. Table 2 lists the features of datasets. Based on the research of Kwon and Sim [15], in which characteristics of datasets that influence classification algorithms were identified, we considered the following statistically significant features in this study: missing values, the number of cases, the number of attributes, and the degree of class imbalance. However, the discussion of missing values is omitted here because it has already been analyzed in detail by Kwon and Sim [15].

3.3. Imputation Methods. Table 3 lists the imputation methods used in this study. Since datasets with categorical decision attributes are included, imputation methods that do not accommodate categorical attributes (e.g., regression imputation) are excluded from this paper.

Table 2: Dataset features.

Variables | Description
Number of cases | Number of records in the dataset
Number of attributes | Number of features characteristic of the dataset
Degree of class imbalance | Ratio

3.4. Classification Algorithms. Many studies have compared classification algorithms in various areas. For example, the decision tree is known as the best algorithm for arrhythmia classification [16]. In Table 4, six types of representative classification algorithms for supervised learning are described: C4.5, SVM (support vector machine), Bayesian network, logistic classifier, k-nearest neighbor classifier, and regression.

4. Method

We conducted a performance evaluation of the imputation methods and classification algorithms described in the previous section using actual datasets taken from the UCI dataset archive. To ensure the accuracy of each method in cases with no missing values, datasets with missing values were not included. Among the selected datasets, six (Iris, Wine, Glass, Liver Disorders, Ionosphere, and Statlog Shuttle) were included for comparison with the results of Kang [10]. These datasets are popular and frequently utilized benchmarks in the literature, which makes them useful for demonstrating the superiority of the proposed idea.

Table 5 provides the names of the datasets, the numbers of cases, and the descriptions of features and classes. The numbers in parentheses in the last two columns represent the number of features and the number of classes of the decision attributes. For example, for the Iris dataset, "Numeric (4)" indicates that there are four numeric attributes, and "Categorical (3)" means that there are three classes in the decision attribute.

Since the UCI datasets have no missing data, target values in each dataset were randomly omitted [10].

Figure 2: Missing data patterns. In the univariate pattern, all missing values fall in the last feature; in the monotone pattern, they fall in the last several features (last, last-1, and last-2); in the arbitrary pattern, missing values fall in random features and records.

Table 3: Imputation methods.

Imputation method | Description
Listwise deletion | Perhaps the most basic traditional technique for dealing with missing data. Cases with missing values are discarded, restricting the analyses to cases for which complete data are available.
Mean imputation | Involves replacing missing data with the overall mean for the observed data.
Group mean imputation | A missing value is replaced by the mean of a subset of the data, based on other observed variable(s) in the data.
Predictive mean imputation | Also called regression imputation. Involves imputing a missing value using an ordinary least-squares regression method to estimate missing data.
Hot-deck | The values of the most similar records are imputed to missing values.
k-NN | The attribute value of the k most similar instances from the nonmissing data is imputed.
k-means clustering | k sets are created that are homogeneous on the inside and heterogeneous on the outside.

Table 4: Classification algorithms.

Algorithm | Description
C4.5 | Estimates the known data using learning rules. C4.5 gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method, until it comes to the end node.
SVM | Classifies the unknown class by finding the optimal hyperplane with the maximum margin that reduces the estimation error.
Bayesian network | A probability network with a high posterior probability given the instances. Such a network can provide insight into probabilistic dependencies among the variables in the training dataset.
Logistic classifier | Takes the functional form of the logistic CDF (cumulative distribution function). This function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13].
k-nearest neighbor classifier | A simple instance-based learner that uses the class of the nearest k training instances as the class of the test instances.
Regression | The class is binarized, and one regression model is built for each class value [14].

Table 5: Datasets used in the experiments.

Dataset | Number of cases | Features | Decision attributes
Iris | 150 | Numeric (4) | Categorical (3)
Wine | 178 | Numeric (13) | Categorical (3)
Glass | 214 | Numeric (9) | Categorical (7)
Liver Disorders | 345 | Numeric (6) | Categorical (2)
Ionosphere | 351 | Numeric (34) | Categorical (2)
Statlog Shuttle | 57999 | Numeric (7) | Categorical (7)

Based on the list of missing data characteristics, three datasets with three different missing data ratios (5%, 10%, and 15%) and three sets representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. In total, 54 datasets were imputed for each imputation method, as 6 datasets were available. We repeated the experiment for each dataset 100 times in order to minimize errors and bias; thus, 5,400 datasets were imputed in total for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
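A sketch of how such variations can be generated (our illustration; the paper's Java tooling is not shown): given a complete dataset, cells are blanked at a target ratio following one of the three patterns of Figure 2.

```python
import numpy as np

def inject_missing(X, ratio, pattern, seed=0):
    """Blank out `ratio` of the cells of a complete matrix following
    a univariate, monotone, or arbitrary pattern (Figure 2)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    n, d = X.shape
    n_cells = int(round(ratio * n * d))
    if pattern == "univariate":            # all in the last feature
        rows = rng.choice(n, size=min(n_cells, n), replace=False)
        X[rows, d - 1] = np.nan
    elif pattern == "monotone":            # last features, nested rows
        rows, j = np.arange(n), d - 1
        while n_cells > 0 and j >= 0:
            take = min(n_cells, len(rows))
            rows = rng.choice(rows, size=take, replace=False)
            X[rows, j] = np.nan            # subset of the previous column's rows
            n_cells -= take
            j -= 1
    else:                                  # arbitrary: random cells
        flat = rng.choice(n * d, size=n_cells, replace=False)
        X[np.unravel_index(flat, (n, d))] = np.nan
    return X
```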

There are various indicators with which to measure performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.

RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equivalent to 1 when there are no missing data [10]. The no-missing-data condition was used as a baseline of performance. As the next step, we generated a missing dataset from the original no-missing dataset and then applied an imputation method to replace the null data. A classification algorithm was then run to estimate the results on the imputed dataset. For all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation in order to determine how the input factors, namely, the characteristics of the missing data and those of the datasets, affected the performance of the selected classification algorithms:

$$y_p = \sum_{\forall j \in \mathbf{M}} \beta_{pj} x_j + \sum_{\forall k \in \mathbf{D}} \chi_{pk} z_k + \varepsilon_p. \tag{9}$$

In this equation, $x_j$ is the value of a characteristic of the missing data ($\mathbf{M}$), $z_k$ is the value of a characteristic in the set of dataset features ($\mathbf{D}$), and $y_p$ is a performance parameter. Note that $\mathbf{M}$ = {missing data ratio, patterns of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and $\mathbf{D}$ = {number of cases, number of attributes, degree of class imbalance}. In addition, $p = 1$ indicates relative prediction accuracy, $p = 2$ represents RMSE, and $p = 3$ means elapsed time. We performed the experiment using the Weka open source library (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to conduct the automated experiment repeatedly.

5. Results

In total, 32,400 datasets (3 missing ratios × 3 missing patterns × 6 imputation methods × 100 trials × 6 datasets) were imputed for each of the classifiers. Thus, in total, we tested 226,800 datasets (32,400 imputed datasets × 7 classification methods). The results were divided by dataset, classification algorithm, and imputation method for comparison in terms of performance.

5.1. Datasets. Figure 3 shows the performance of each imputation method on the six datasets. On the x-axis, the three missing ratios represent the characteristics of missing data, and on the y-axis, performance is indicated using the RMSE. The results for the three variations of missing data patterns and all tested classification algorithms were merged for each imputation method.

For the Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best.

For the Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.

For the Liver Disorders data (Figure 3(c)), k-NN was the least effective and, once again, the predictive mean imputation method yielded the best results.

For the Ionosphere data (Figure 3(d)), hot-deck was the worst and k-NN the best.

For the Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.

For the Statlog data (Figure 3(f)), unlike the other datasets, the results varied with the missing data ratio. However, predictive mean imputation was still the best method overall, and hot-deck the worst.

Figure 3 illustrates that the predictive mean imputation method yielded the best results overall and hot-deck imputation the worst. However, no imputation method was generally superior in all cases for any given dataset. For example, the k-NN method yielded the best performance on the Ionosphere dataset, but on the Liver Disorders dataset its performance was lowest. In another example, the group mean imputation method performed best on the Iris and Wine datasets, but its performance was only average on the other datasets. Therefore, the results were inconsistent, and determining the single best imputation method is impossible. Thus, the imputation method alone cannot be used as an accurate predictor of performance. Rather, performance must be influenced by other factors, such as the interaction between the characteristics of the dataset in terms of missing data and the chosen imputation method.

Figure 3: Comparison of the performance of the imputation methods (RMSE, y-axis) at each missing ratio (x-axis) for each dataset: (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, and (f) Statlog Shuttle. Legend: listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, k-NN, and k-means clustering.

Table 6: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficients), mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | −.076** | −.075** | −.178** | −.072** | .115** | .007
N_cases | −.079** | −.049** | .012 | −.017 | −.032 | −.048**
C_imbalance | .117** | .239** | .264** | .525** | .163** | .198**
R_missing | .051* | .078** | .040 | .080** | .076** | .068**
SE_HS | .249** | .285** | .186** | .277** | .335** | .245**
SE_VS | −.009 | −.013 | −.006 | −.013 | −.016 | −.010
Spread | −.382** | −.430** | −.261** | −.436** | −.452** | −.363**
P_missing_dum1 | −.049 | −.038 | −.038 | −.037 | −.045 | −.038
P_missing_dum2 | −.002 | .014 | .002 | .011 | .001 | .011

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

Table 7: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficients), group mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | −.068** | −.072** | −.179** | −.068** | .115** | .010
N_cases | −.082** | −.050** | .011 | −.018 | −.034* | −.047**
C_imbalance | .115** | .228** | .260** | .517** | .156** | .197**
R_missing | .050** | .085** | .043 | .084** | .095** | .066**
SE_HS | .230** | .268** | .178** | .273** | .300** | .248**
SE_VS | −.008 | −.012 | −.006 | −.013 | −.013 | −.010
Spread | −.296** | −.439** | −.264** | −.443** | −.476** | −.382**
P_missing_dum1 | −.043 | −.032 | −.034 | −.035 | −.035 | −.041
P_missing_dum2 | .002 | .024 | .004 | .016 | .021 | .013

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

5.2. Classification Algorithms. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary with the ratio of missing data, except for listwise deletion. For listwise deletion, as the ratio of missing to complete data increased, performance deteriorated. In the listwise deletion method, all records that contain missing data are deleted; therefore, the number of deleted records increases as the ratio of missing data increases, which explains the low performance of this method.

The differences in performance between the imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.

In Figure 4, the results suggest the following rules (a minimal sketch encoding them appears after the list):

(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.
(ii) IF the missing rate increases AND the logistic classifier is used, THEN use the HOT-DECK method.
(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.
(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.
(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
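One way to operationalize these rules is a simple lookup from classifier to the imputation method recommended at high missing rates. The sketch below is ours: the 10% threshold and the fallback default are assumptions, not values fixed by the paper.

```python
# Recommended imputation method per classifier when the missing
# rate is high, per rules (i)-(v) of Section 5.2.
HIGH_MISSING_RULES = {
    "IBk": "group mean imputation",
    "logistic": "hot-deck",
    "regression": "group mean imputation",
    "BayesNet": "group mean imputation",
    "trees.J48": "k-NN",
}

def recommend_imputation(classifier, missing_rate, threshold=0.10):
    """Return the rule-based recommendation; the threshold is an
    assumed cutoff for a 'high' missing rate."""
    if missing_rate >= threshold and classifier in HIGH_MISSING_RULES:
        return HIGH_MISSING_RULES[classifier]
    return "predictive mean imputation"   # strong overall default (Figure 3)

print(recommend_imputation("trees.J48", 0.15))   # -> k-NN
```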

5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing patterns × 100 trials).

Figure 4: Comparison of the classifiers in terms of classification performance (RMSE, y-axis) at each missing ratio (x-axis) for each imputation method: (a) decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) Regression, (e) Logistic, and (f) IBk (k-nearest neighbor classifier).

Table 8: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficients), predictive mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | −.076** | −.076** | −.178** | −.063** | .123** | .016
N_cases | −.084** | −.049** | .012 | −.017 | −.034* | −.047**
C_imbalance | .117** | .242** | .263** | .523** | .153** | .198**
R_missing | .050* | .079** | .043 | .085** | .080** | .068**
SE_HS | .223** | .279** | .182** | .268** | .322** | .242**
SE_VS | −.008 | −.013 | −.006 | −.013 | −.015 | −.009
Spread | −.328** | −.432** | −.262** | −.434** | −.465** | −.361**
P_missing_dum1 | −.042 | −.035 | −.034 | −.028 | −.044 | −.036
P_missing_dum2 | .008 | .012 | .004 | .018 | .007 | .011

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

Table 9: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficients), hot-deck.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | −.080** | −.073** | −.176** | −.071** | .115** | .007
N_cases | −.081** | −.049** | .012 | −.018 | −.034* | −.047**
C_imbalance | .135** | .237** | .261** | .524** | .133** | .211**
R_missing | .062** | .083** | .044 | .084** | .075** | .070**
SE_HS | .225** | .275** | .183** | .271** | .313** | .254**
SE_VS | −.009 | −.013 | −.006 | −.013 | −.014 | −.010
Spread | −.365** | −.428** | −.265** | −.427** | −.441** | −.361**
P_missing_dum1 | −.035 | −.037 | −.034 | −.033 | −.048 | −.038
P_missing_dum2 | .012 | .015 | .004 | .012 | −.004 | .009

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

Table 10: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficients), k-NN.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | −.085** | −.079** | −.181** | −.068** | .122** | .006
N_cases | −.083** | −.049** | .011 | −.018 | −.034* | −.047**
C_imbalance | .143** | .249** | .260** | .521** | .152** | .211**
R_missing | .054* | .078** | .041 | .085** | .075** | .071**
SE_HS | .234** | .290** | .182** | .269** | .328** | .255**
SE_VS | −.010 | −.013 | −.006 | −.013 | −.014 | −.011
Spread | −.332** | −.427** | −.264** | −.431** | −.450** | −.369**
P_missing_dum1 | −.038 | −.041 | −.035 | −.029 | −.057 | −.035
P_missing_dum2 | .003 | .008 | .005 | .017 | .000 | .011

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

Table 11: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficients), k-means clustering.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | −.080** | −.078** | −.181** | −.068** | .117** | .009
N_cases | −.079** | −.049** | .012 | −.017 | −.033 | −.047**
C_imbalance | .136** | .240** | .263** | .524** | .145** | .206**
R_missing | .057* | .079** | .041 | .084** | .079** | .057*
SE_HS | .236** | .289** | .183** | .271** | .315** | .264**
SE_VS | −.009 | −.013 | −.006 | −.013 | −.014 | −.011
Spread | −.362** | −.439** | −.262** | −.440** | −.474** | −.363**
P_missing_dum1 | −.037 | −.042 | −.036 | −.032 | −.038 | −.046
P_missing_dum2 | .002 | .013 | .001 | .014 | .009 | .004

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.

Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test and training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and the imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three types of missing patterns were treated as two dummy variables (P_missing_dum1 and P_missing_dum2). Tables 6-11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected (see the sketch after the list):

(i) IF N_attributes increases, THEN use SMO.
(ii) IF N_cases increases, THEN use trees.J48.
(iii) IF C_imbalance increases, THEN use trees.J48.
(iv) IF R_missing increases, THEN use SMO.
(v) IF SE_HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
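These rules pick the classifier that fares best when a given data characteristic dominates. A minimal sketch of that selection follows; the mapping is taken directly from rules (i)-(vi), while how the dominant characteristic is determined is left to the caller.

```python
# Classifier recommended when a given data characteristic is the
# dominant one, per rules (i)-(vi) of Section 5.3.
CLASSIFIER_RULES = {
    "N_attributes": "SMO",
    "N_cases": "trees.J48",
    "C_imbalance": "trees.J48",
    "R_missing": "SMO",
    "SE_HS": "SMO",
    "Spread": "Logistic",
}

def recommend_classifier(dominant_characteristic):
    """Map the dominant data characteristic to the rule-based classifier."""
    return CLASSIFIER_RULES[dominant_characteristic]

print(recommend_classifier("C_imbalance"))   # -> trees.J48
```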

Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appeared similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N_attributes) was observed for the logistic algorithm than for any other algorithm; thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N_cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scatteredness standard error (SE_HS), SMO had the lowest performance. For missing data spread, the logistic classifier had the lowest performance.

Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness standard error (SE_HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.

The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. In order to determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.

Figure 7 shows the RMSE based on the ratio of missing data for each imputation method. As the ratio increases, the performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between the imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.

Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the imputation method and the performance of each classification in terms of RMSE. In total, 226,800 datasets (3 missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods × 6 datasets) were analyzed.

Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis lists the dataset characteristics (attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, and missing pattern dummies); the y-axis shows the standardized coefficients for each imputation method.

Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE).

The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for the selection of the optimal combination of a classification algorithm and an imputation algorithm.
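In that spirit, a hedged sketch of the second implication: given a regression fitted as in (9) (or with the Table 12 coefficients), the predicted RMSE can be evaluated for every (imputation method, classifier) pair and the minimizer chosen. Here `predict_rmse` is a stand-in for such a fitted model, not coefficients supplied by the paper.

```python
from itertools import product

IMPUTERS = ["listwise deletion", "mean", "group mean",
            "predictive mean", "hot-deck", "k-NN", "k-means"]
CLASSIFIERS = ["trees.J48", "BayesNet", "SMO",
               "regression", "logistic", "IBk"]

def best_combination(characteristics, predict_rmse):
    """Return the (imputer, classifier) pair minimizing predicted RMSE.
    `predict_rmse(imputer, classifier, characteristics)` is assumed to
    wrap a regression model fitted as in (9)."""
    return min(product(IMPUTERS, CLASSIFIERS),
               key=lambda pair: predict_rmse(*pair, characteristics))
```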

Figure 7: RMSE (0.300-0.490, y-axis) by ratio of missing data (0.05-0.50, x-axis) for each imputation method.

Table 12: Factors influencing accuracy (RMSE) of the classifier algorithms.

Data characteristic | B | Data characteristic | B
(constant) | .060** | M_imputation_dum1 | .012**
R_missing | .083** | M_imputation_dum2 | −.001*
SE_HS | −.005** | M_imputation_dum4 | .000
SE_VS | .000** | M_imputation_dum5 | .000
Spread | .017** | M_imputation_dum6 | .001**
N_attributes | −.008** | M_imputation_dum7 | −.001*
C_imbalance | −.003** | P_missing_dum1 | −.006**
N_cases | .002** | P_missing_dum3 | .000

Note 1: Dummy variables for the imputation methods: LISTWISE DELETION (M_imputation_dum1 = 1, others = 0), MEAN IMPUTATION (M_imputation_dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M_imputation_dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M_imputation_dum4 = 1, others = 0), HOT-DECK (M_imputation_dum5 = 1, others = 0), k-NN (M_imputation_dum6 = 1, others = 0), and k-MEANS CLUSTERING (M_imputation_dum7 = 1, others = 0). Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0, P_missing_dum3 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1, P_missing_dum3 = 0), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1, P_missing_dum3 = 1). B: standard beta coefficient.
Note 2: *P < 0.1, **P < 0.05.

6. Conclusion

So far, prior research has not fully informed us of the fitness among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides developers of classification and recommender systems in selecting the best classification algorithm based on the dataset and the imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.

In this paper, the factors affecting the performance of classification algorithms were identified as the characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE_HS) affected classification performance more strongly than the number of missing cells in each feature (SE_VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while the other factors do not.

A disadvantage of logistic regression is its lack of flexibility: the assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most significantly decreased the predicted performance of classification algorithms. In particular, SMO was second to none in handling high SE_HS under any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.

The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. Also, a set of optimal combinations may be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.

Ubiquitous environments include a variety of forms of sensor data under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data changes dynamically. For practitioners, these rules for the selection of the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.

This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.

In future studies, the predicted performance of the various methods can be tested with actual datasets. Although, in prior research on classification algorithms, multiple benchmark datasets from the UCI repository have been used to demonstrate the generality of the proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. For example, the FP rate as well as the TP rate is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and the total number of errors [20], more valuable findings may be generated from a study including these other metrics.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Strategic R&D Program for Industrial Technology (10041659) and funded by the Ministry of Trade, Industry and Energy (MOTIE).

References

[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 3: Research Article Missing Values and Optimal Selection of an ...imputation method is the most frequently used statistical method developed to deal with missing data problems. It is

Mathematical Problems in Engineering 3

(ii) Mean Imputation. The missing values of each attribute (column or feature) are replaced with the mean of all known values of that attribute. That is, let $X_i^j$ be the $j$th missing attribute of the $i$th instance, which is imputed by

$$X_i^j = \frac{\sum_{k \in I(\mathrm{complete})} X_k^j}{n_{|I(\mathrm{complete})|}}, \quad (1)$$

where $I(\mathrm{complete})$ is the set of indices at which the $j$th attribute is not missing and $n_{|I(\mathrm{complete})|}$ is the total number of instances in which the $j$th attribute is not missing.
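For illustration, equation (1) amounts to column-wise averaging. The following is a minimal NumPy sketch (ours, not the authors' Java implementation), with missing cells encoded as NaN:

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN in attribute j with the mean of the observed
    values of attribute j, as in equation (1)."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        observed = ~np.isnan(X[:, j])        # I(complete) for attribute j
        if observed.any():
            X[~observed, j] = X[observed, j].mean()
    return X
```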

(iii) Group Mean Imputation. The process for this method is the same as that for mean imputation; however, the missing values are replaced with the group (or class) mean of all known values of that attribute, where each group corresponds to a target class among the instances that have missing values. Let $X_{mi}^j$ be the $j$th missing attribute of the $i$th instance of the $m$th class, which is imputed by

$$X_{mi}^j = \frac{\sum_{k \in I(m\text{th class, complete})} X_{mk}^j}{n_{|I(m\text{th class, complete})|}}, \quad (2)$$

where $I(m\text{th class, complete})$ is the set of indices at which the $j$th attribute is not missing within the $m$th class and $n_{|I(m\text{th class, complete})|}$ is the total number of instances in which the $j$th attribute of the $m$th class is not missing.
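Under the same NaN encoding, equation (2) differs only in that the averages are taken within each target class. A sketch (ours; the class labels y are assumed to be available):

```python
import numpy as np

def group_mean_impute(X, y):
    """Replace a NaN in attribute j of an instance of class m with the
    mean of the observed values of attribute j within class m (eq. (2))."""
    X = X.astype(float).copy()
    for m in np.unique(y):
        rows = (y == m)
        for j in range(X.shape[1]):
            observed = rows & ~np.isnan(X[:, j])
            if observed.any():
                X[rows & np.isnan(X[:, j]), j] = X[observed, j].mean()
    return X
```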

(iv) Predictive Mean Imputation. In this method, the functional relationship between multiple input variables and single or multiple target variables of the given data is represented as a linear equation. The method sets attributes that have missing values as dependent variables and the other attributes as independent variables, so that missing values can be predicted from a regression model built on those variables. For a regression target $y_i$, the MLR equation with $d$ predictors and $n$ training instances can be written as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id}, \quad i = 1, \ldots, n. \quad (3)$$

This can be rewritten in matrix form as $y = X\beta$, and the coefficient vector $\beta$ can be obtained explicitly by setting the derivative of the squared-error function to zero:

$$\min E(\beta) = \frac{1}{2}(y - X\beta)^{T}(y - X\beta),$$
$$\frac{\partial E(\beta)}{\partial \beta} = X^{T}X\beta - X^{T}y = 0,$$
$$\beta = \left(X^{T}X\right)^{-1} X^{T}y. \quad (4)$$
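Equations (3)-(4) are ordinary least squares. A per-attribute sketch follows (ours), assuming for simplicity that incomplete rows are missing only attribute j; the general case needs one model per missingness pattern:

```python
import numpy as np

def predictive_mean_impute(X, j):
    """Impute attribute j from the remaining attributes with the OLS
    solution beta = (X'X)^-1 X'y of equation (4)."""
    X = X.astype(float).copy()
    others = [c for c in range(X.shape[1]) if c != j]
    train = ~np.isnan(X).any(axis=1)             # complete rows
    miss = np.isnan(X[:, j])
    A = np.column_stack([np.ones(train.sum()), X[train][:, others]])
    beta, *_ = np.linalg.lstsq(A, X[train, j], rcond=None)
    B = np.column_stack([np.ones(miss.sum()), X[miss][:, others]])
    X[miss, j] = B @ beta
    return X
```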

(v) Hot-Deck. This method is the same in principle as case-based reasoning: for attributes that contain missing values, the most similar instance with nonmissing values is found, and its values are used to replace the missing ones. Each missing value is therefore replaced with the value of the corresponding attribute in the most similar instance, as follows:

$$X_i^j = X_k^j, \quad k = \arg\min_{P} \sqrt{\sum_{j \in I(\mathrm{complete})} \mathrm{Std}_j \left(X_i^j - X_P^j\right)^2}, \quad (5)$$

where $\mathrm{Std}_j$ is the standard deviation of the $j$th attribute, which is not missing.
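A direct transcription of equation (5) into a sketch (ours), where the donors are the fully observed rows and the $\mathrm{Std}_j$ weighting follows the formula as printed:

```python
import numpy as np

def hot_deck_impute(X):
    """For each incomplete row, copy the missing attribute values from
    the most similar complete row under the distance of equation (5)."""
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    donors = X[complete]
    std = np.nanstd(X, axis=0)                   # Std_j over observed values
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        d = np.sqrt((std[obs] * (donors[:, obs] - X[i, obs]) ** 2).sum(axis=1))
        X[i, ~obs] = donors[np.argmin(d)][~obs]  # nearest donor, eq. (5)
    return X
```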

(vi) k-NN. Attributes are found via a search among the nonmissing attributes using the 3-NN method. Missing values are imputed based on the values of the attributes of the $k$ most similar instances, as follows:

$$X_i^j = \sum_{P \in k\text{-NN}(X_i)} k\left(X_i^{I(\mathrm{complete})}, X_P^{I(\mathrm{complete})}\right) \cdot X_P^j, \quad (6)$$

where $k\text{-NN}(X_i)$ is the index set of the $k$ nearest neighbors of $X_i$ based on the nonmissing attributes and $k(X_i, X_j)$ is a kernel function that is proportional to the similarity between the two instances $X_i$ and $X_j$ ($k = 4$).
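Equation (6) is a similarity-weighted average over the $k$ nearest complete rows. In the sketch below (ours), the inverse-distance kernel and the normalization of the weights are our choices; the paper only requires the kernel to be proportional to similarity:

```python
import numpy as np

def knn_impute(X, k=4):
    """Fill each missing cell with a kernel-weighted average of the same
    attribute in the k most similar complete rows (equation (6))."""
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    donors = X[complete]
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        d = np.linalg.norm(donors[:, obs] - X[i, obs], axis=1)
        nn = np.argsort(d)[:k]                   # k-NN(X_i)
        w = 1.0 / (1.0 + d[nn])                  # kernel ~ similarity (our choice)
        X[i, ~obs] = (w / w.sum()) @ donors[nn][:, ~obs]
    return X
```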

(vii) k-Means Clustering. Attributes are found through the formation of $k$ clusters from the nonmissing data, after which the missing values are imputed. The entire dataset is partitioned into $k$ clusters by maximizing the homogeneity within each cluster and the heterogeneity between clusters, as follows:

$$\arg\min_{C^{I(\mathrm{complete})}} \sum_{i=1}^{k} \sum_{X_j^{I(\mathrm{complete})} \in C_i^{I(\mathrm{complete})}} \left\| X_j^{I(\mathrm{complete})} - C_i^{I(\mathrm{complete})} \right\|^2, \quad (7)$$

where $C_i^{I(\mathrm{complete})}$ is the centroid of the $i$th cluster and $C^{I(\mathrm{complete})}$ is the union of all clusters ($C^{I(\mathrm{complete})} = C_1^{I(\mathrm{complete})} \cup \cdots \cup C_k^{I(\mathrm{complete})}$). For a missing value $X_i^j$, the mean value of that attribute over the instances in the same cluster as $X_i^{I(\mathrm{complete})}$ is imputed, as follows:

$$X_i^j = \frac{1}{\left|C_k^{I(\mathrm{complete})}\right|} \sum_{X_P^{I(\mathrm{complete})} \in C_k^{I(\mathrm{complete})}} X_P^j, \quad \text{s.t. } k = \arg\min_i \left\| X_i^{I(\mathrm{complete})} - C_i^{I(\mathrm{complete})} \right\|. \quad (8)$$
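Because the clusters in equation (7) contain only complete rows, the cluster attribute means in equation (8) coincide with the centroid coordinates, which the sketch below (ours, plain Lloyd iterations) exploits:

```python
import numpy as np

def kmeans_impute(X, k=3, iters=50, seed=0):
    """Cluster the complete rows (eq. (7)), assign each incomplete row to
    the nearest centroid over its observed attributes, and fill its gaps
    with that cluster's attribute means (eq. (8))."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    C = X[complete]
    cent = C[rng.choice(len(C), size=k, replace=False)]
    for _ in range(iters):                       # Lloyd iterations
        lab = np.argmin(((C[:, None, :] - cent[None, :, :]) ** 2).sum(axis=2), axis=1)
        for c in range(k):
            if (lab == c).any():
                cent[c] = C[lab == c].mean(axis=0)
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        c = np.argmin(((cent[:, obs] - X[i, obs]) ** 2).sum(axis=1))
        X[i, ~obs] = cent[c, ~obs]               # cluster mean of each gap
    return X
```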

3. Model

In this paper, we hypothesize an association between the performance of classification algorithms and the characteristics of missing data and datasets. Moreover, we assume that the chosen imputation method moderates the causality between these factors. Figure 1 illustrates the posited relationships.

3.1. Missing Data Characteristics. Table 1 describes the characteristics of missing data and how to calculate them. The pattern of missing data may be univariate, monotone, or arbitrary [11]. A univariate pattern of missing data occurs when missing values are observed for a single variable only; all other data are complete for all variables.

Table 1: The characteristics of missing data.

Variables | Meaning | Calculation
Missing data ratio | The number of missing values in the entire dataset as compared to the number of nonmissing values | The number of empty data cells / total cells
Patterns of missing data (univariate, monotone, arbitrary) | Ratio of missing to complete values for an existing feature compared to the values for all features | See Figure 2
Horizontal scatteredness | Distribution of missing values within each data record | Determine the number of missing cells in each record and calculate the standard deviation
Vertical scatteredness | Distribution of missing values for each attribute | Determine the number of missing cells in each feature and calculate the standard deviation
Missing data spread | Larger standard deviations indicate stronger effects of missing data | Determine the weighted average of the standard deviations of features with missing data (weight: the ratio of missing to complete data for each feature)

[Figure 1: Research model. Missing data characteristics and dataset features are posited to drive classification performance, with the imputation method as a moderator.]

A monotone pattern occurs if the variables can be arranged such that all $Y_{j+1}, \ldots, Y_k$ are missing for cases where $Y_j$ is missing. Another characteristic, missing data spread, is important because larger standard deviations for missing values within an existing feature indicate that the missing data have greater influence on the results of the analysis (Figure 2).
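The Table 1 measures are straightforward to compute. The sketch below is our reading of the definitions (in particular, using each feature's missing ratio as the weight for the spread), not code from the paper:

```python
import numpy as np

def missing_characteristics(X):
    """Missing data ratio, horizontal/vertical scatteredness, and spread
    (Table 1) for a data matrix whose gaps are encoded as NaN."""
    miss = np.isnan(X)
    ratio = miss.sum() / miss.size               # empty cells / total cells
    hs = miss.sum(axis=1).std()                  # std of gaps per record
    vs = miss.sum(axis=0).std()                  # std of gaps per feature
    col_ratio = miss.mean(axis=0)                # per-feature missing ratio
    has_gap = col_ratio > 0
    spread = (np.average(np.nanstd(X[:, has_gap], axis=0),
                         weights=col_ratio[has_gap])
              if has_gap.any() else 0.0)         # weighted average of stds
    return ratio, hs, vs, spread
```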

3.2. Dataset Features. Table 2 lists the features of datasets. Based on the research of Kwon and Sim [15], in which the characteristics of datasets that influence classification algorithms were identified, we considered the following statistically significant features in this study: missing values, the number of cases, the number of attributes, and the degree of class imbalance. However, the discussion of missing values is omitted here because it has already been analyzed in detail by Kwon and Sim [15].

3.3. Imputation Methods. Table 3 lists the imputation methods used in this study. Since datasets with categorical decision attributes are included, imputation methods that do not accommodate categorical attributes (e.g., regression imputation) are excluded from this paper.

Table 2: Dataset features.

Variables | Description
Number of cases | Number of records in the dataset
Number of attributes | Number of features characteristic of the dataset
Degree of class imbalance | Ratio

3.4. Classification Algorithms. Many studies have compared classification algorithms in various areas; for example, the decision tree is known as the best algorithm for arrhythmia classification [16]. Table 4 describes six types of representative classification algorithms for supervised learning: C4.5, SVM (support vector machine), Bayesian network, logistic classifier, k-nearest neighbor classifier, and regression.

4. Method

We conducted a performance evaluation of the imputation methods and classification algorithms described in the previous section using actual datasets taken from the UCI dataset archive. To ensure the accuracy of each method in cases with no missing values, datasets with missing values were not included. Among the selected datasets, six (Iris, Wine, Glass, Liver Disorders, Ionosphere, and Statlog Shuttle) were included for comparison with the results of Kang [10]. These datasets are popular and frequently utilized benchmarks in the literature, which makes them useful for demonstrating the superiority of the proposed idea.

Table 5 provides the names of the datasets, the numbers of cases, and descriptions of the features and classes. The numbers in parentheses in the last two columns give the number of features and the number of classes of the decision attribute. For example, for the Iris dataset, "Numeric (4)" indicates that there are four numeric attributes, and "Categorical (3)" means that the decision attribute has three classes.

Since the UCI datasets have no missing data, target values in each dataset were randomly omitted [10].

[Figure 2: Missing data patterns. Univariate: all missing values are in the last feature. Monotone: all missing values are in the last three features (last, last-1, last-2). Arbitrary: missing values fall in random features and records.]

Table 3: Imputation methods.

Imputation method | Description
Listwise deletion | Perhaps the most basic traditional technique for dealing with missing data; cases with missing values are discarded, restricting the analyses to cases for which complete data are available
Mean imputation | Replaces missing data with the overall mean of the observed data
Group mean imputation | A missing value is replaced by the mean of a subset of the data, based on other observed variable(s) in the data
Predictive mean imputation | Also called regression imputation; imputes a missing value using an ordinary least-squares regression method to estimate the missing data
Hot-deck | The most similar records are used to impute the missing values
k-NN | The attribute value of the most similar instance among the k nearest nonmissing instances is imputed
k-means clustering | k sets are created that are homogeneous on the inside and heterogeneous on the outside

Table 4: Classification algorithms.

Algorithm | Description
C4.5 | Estimates the known data using learning rules; gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method, until it reaches the end node
SVM | Classifies the unknown class by finding the optimal hyperplane with the maximum margin, which reduces the estimation error
Bayesian network | A probability network with a high posterior probability given the instances; such a network can provide insight into probabilistic dependencies among the variables in the training dataset
Logistic classifier | Takes the functional form of the logistic CDF (cumulative distribution function); this function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13]
k-nearest neighbor classifier | A simple instance-based learner that uses the class of the nearest k training instances as the class of the test instances
Regression | The class is binarized, and one regression model is built for each class value [14]

Table 5: Datasets used in the experiments.

Dataset | Number of cases | Features | Decision attributes
Iris | 150 | Numeric (4) | Categorical (3)
Wine | 178 | Numeric (13) | Categorical (3)
Glass | 214 | Numeric (9) | Categorical (7)
Liver Disorders | 345 | Numeric (6) | Categorical (2)
Ionosphere | 351 | Numeric (34) | Categorical (2)
Statlog Shuttle | 57,999 | Numeric (7) | Categorical (7)

Based on the list of missing data characteristics, datasets with three different missing data ratios (5%, 10%, and 15%) and three sets representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. In total, 54 datasets were available for each imputation method, as 6 source datasets were used. We repeated the experiment for each dataset 100 times in order to minimize errors and bias; thus, 5,400 datasets in total were imputed for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
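For concreteness, a minimal sketch of how pattern-controlled missingness can be injected into a complete matrix (our reconstruction of the Figure 2 patterns; the paper's generator was written in Java and is not shown):

```python
import numpy as np

def make_missing(X, ratio, pattern, seed=0):
    """Blank out `ratio` of the cells following a Figure 2 pattern."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    n, d = X.shape
    n_cells = int(round(ratio * n * d))
    if pattern == "univariate":                  # gaps only in the last feature
        X[rng.choice(n, size=min(n_cells, n), replace=False), -1] = np.nan
    elif pattern == "monotone":                  # nested gaps in the last features
        order = rng.permutation(n)
        remaining, col = n_cells, d - 1
        while remaining > 0 and col >= 0:
            take = min(remaining, n)
            X[order[:take], col:] = np.nan       # row sets are nested, so a gap
            remaining -= take                    # in column j implies gaps in
            col -= 1                             # all later columns
    else:                                        # arbitrary: random cells
        flat = rng.choice(n * d, size=n_cells, replace=False)
        X[np.unravel_index(flat, (n, d))] = np.nan
    return X
```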

There are various indicators for measuring performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.

RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which equals 1 when there are no missing data [10]; the no-missing-data condition was used as the performance baseline. As the next step, we generated a missing dataset from the original no-missing dataset and then applied an imputation method to replace the null data. A classification algorithm was then run to estimate the results on the imputed dataset. For all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation to understand how the input factors (the characteristics of the missing data and those of the datasets) affect the performance of the selected classification algorithms:

$$y_p = \sum_{\forall j \in \mathbf{M}} \beta_{pj} x_j + \sum_{\forall k \in \mathbf{D}} \chi_{pk} z_k + \varepsilon_p. \quad (9)$$

In this equation, $x_j$ is the value of a characteristic of the missing data (the set M), $z_k$ is the value of a characteristic of the dataset (the set D), and $y_p$ is a performance parameter, where M = {missing data ratio, patterns of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and D = {number of cases, number of attributes, degree of class imbalance}. In addition, $p = 1$ indicates relative prediction accuracy, $p = 2$ represents RMSE, and $p = 3$ means elapsed time. We performed the experiment using the Weka library source software (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to run the automated experiment repeatedly.
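Equation (9) is an ordinary multiple linear regression; a least-squares sketch is given below (the explicit intercept column is our addition, and the paper estimates the model in SPSS rather than in code):

```python
import numpy as np

def fit_performance_model(x_missing, z_dataset, y):
    """OLS estimate of equation (9): performance y_p regressed on the
    missing-data characteristics (set M) and dataset characteristics (D)."""
    A = np.column_stack([x_missing, z_dataset, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    beta = coef[:x_missing.shape[1]]             # beta_pj, weights on M
    chi = coef[x_missing.shape[1]:-1]            # chi_pk, weights on D
    return beta, chi
```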

5. Results

In total, 32,400 datasets (3 missing ratios × 3 missing patterns × 6 imputation methods × 100 trials, over the 6 source datasets) were imputed for each of the classifiers; thus, in total, we tested 226,800 datasets (32,400 imputed datasets × 7 classifier methods). The results were broken down by dataset, classification algorithm, and imputation method for comparison in terms of performance.

5.1. Datasets. Figure 3 shows the performance of each imputation method for the six datasets. On the x-axis, three missing ratios represent the characteristics of the missing data, and on the y-axis, performance is indicated by RMSE. The results for the three variations of missing data patterns and all tested classification algorithms were merged for each imputation method.

For the Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best.

For the Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.

For the Liver Disorders data (Figure 3(c)), k-NN was the least effective and, once again, predictive mean imputation yielded the best results.

For the Ionosphere data (Figure 3(d)), hot-deck was the worst and k-NN the best.

For the Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.

For the Statlog data (Figure 3(f)), unlike the other datasets, the results varied with the missing data ratio; however, predictive mean imputation was still the best method overall, and hot-deck the worst.

Figure 3 illustrates that the predictive mean imputation method yielded the best results overall and hot-deck imputation the worst. However, no imputation method was generally superior in all cases for any given dataset. For example, the k-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was lowest. In another example, the group mean imputation method performed best for the Iris and Wine datasets, but its performance was only average for the other datasets. The results are therefore inconsistent, and determining a single best imputation method is impossible; the imputation method alone cannot be used as an accurate predictor of performance. Rather, performance must be influenced by other factors, such as the interaction between the characteristics of the dataset, its missing data, and the chosen imputation method.

[Figure 3: Comparison of the performance of the imputation methods for each dataset. Panels: (a) Iris; (b) Glass Identification; (c) Liver Disorders; (d) Ionosphere; (e) Wine; (f) Statlog Shuttle. x-axis: missing data ratio (0.01, 0.05, 0.10); y-axis: RMSE. Legend: listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, k-NN, k-means clustering.]

Table 6: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.076** | -.075** | -.178** | -.072** | .115** | .007
N cases | -.079** | -.049** | .012 | -.017 | -.032 | -.048**
C imbalance | .117** | .239** | .264** | .525** | .163** | .198**
R missing | .051* | .078** | .040 | .080** | .076** | .068**
SE HS | .249** | .285** | .186** | .277** | .335** | .245**
SE VS | -.009 | -.013 | -.006 | -.013 | -.016 | -.010
Spread | -.382** | -.430** | -.261** | -.436** | -.452** | -.363**
P missing dum1 | -.049 | -.038 | -.038 | -.037 | -.045 | -.038
P missing dum2 | -.002 | .014 | .002 | .011 | .001 | .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.

Table 7: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): group mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.068** | -.072** | -.179** | -.068** | .115** | .010
N cases | -.082** | -.050** | .011 | -.018 | -.034* | -.047**
C imbalance | .115** | .228** | .260** | .517** | .156** | .197**
R missing | .050** | .085** | .043 | .084** | .095** | .066**
SE HS | .230** | .268** | .178** | .273** | .300** | .248**
SE VS | -.008 | -.012 | -.006 | -.013 | -.013 | -.010
Spread | -.296** | -.439** | -.264** | -.443** | -.476** | -.382**
P missing dum1 | -.043 | -.032 | -.034 | -.035 | -.035 | -.041
P missing dum2 | .002 | .024 | .004 | .016 | .021 | .013

Notes: abbreviations, dummy coding, and significance levels as in Table 6.

5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary with the ratio of missing data, except for listwise deletion. For listwise deletion, performance deteriorated as the ratio of missing to complete data increased: because this method deletes every record that contains missing data, the number of deleted records grows with the missing ratio, which explains its low performance.

The differences in performance between the imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.

In Figure 4, the results suggest the following rules:

(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.

(ii) IF the missing rate increases AND the logistic classifier is used, THEN use the HOT-DECK method.

(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.

(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.

(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.

A minimal encoding of these rules is sketched after this list.
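The sketch below is a hypothetical encoding of the Figure 4 rules, keyed by the classifier in use; the 0.10 threshold for a "high" missing rate and the fallback to the overall best performer in Figure 3 (predictive mean imputation) are our illustrative choices, not from the paper:

```python
# Rules (i)-(v) as a lookup from classifier to preferred imputation method.
RULES = {
    "IBk": "group mean imputation",
    "Logistic": "hot-deck",
    "Regression": "group mean imputation",
    "BayesNet": "group mean imputation",
    "trees.J48": "k-NN",
}

def pick_imputation(classifier: str, missing_rate: float) -> str:
    # Apply the rule only when the missing rate is high (threshold assumed).
    if missing_rate >= 0.10 and classifier in RULES:
        return RULES[classifier]
    return "predictive mean imputation"
```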

5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing patterns × 100 trials).

[Figure 4: Comparison of classifiers in terms of classification performance. Panels: (a) decision tree (J48); (b) BayesNet; (c) SMO (SVM); (d) regression; (e) logistic; (f) IBk (k-nearest neighbor classifier). x-axis: missing data ratio (0.01, 0.05, 0.10); y-axis: RMSE. Legend: the seven imputation methods.]

Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): predictive mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.076** | -.076** | -.178** | -.063** | .123** | .016
N cases | -.084** | -.049** | .012 | -.017 | -.034* | -.047**
C imbalance | .117** | .242** | .263** | .523** | .153** | .198**
R missing | .050* | .079** | .043 | .085** | .080** | .068**
SE HS | .223** | .279** | .182** | .268** | .322** | .242**
SE VS | -.008 | -.013 | -.006 | -.013 | -.015 | -.009
Spread | -.328** | -.432** | -.262** | -.434** | -.465** | -.361**
P missing dum1 | -.042 | -.035 | -.034 | -.028 | -.044 | -.036
P missing dum2 | .008 | .012 | .004 | .018 | .007 | .011

Notes: abbreviations, dummy coding, and significance levels as in Table 6.

Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): hot-deck.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.080** | -.073** | -.176** | -.071** | .115** | .007
N cases | -.081** | -.049** | .012 | -.018 | -.034* | -.047**
C imbalance | .135** | .237** | .261** | .524** | .133** | .211**
R missing | .062** | .083** | .044 | .084** | .075** | .070**
SE HS | .225** | .275** | .183** | .271** | .313** | .254**
SE VS | -.009 | -.013 | -.006 | -.013 | -.014 | -.010
Spread | -.365** | -.428** | -.265** | -.427** | -.441** | -.361**
P missing dum1 | -.035 | -.037 | -.034 | -.033 | -.048 | -.038
P missing dum2 | .012 | .015 | .004 | .012 | -.004 | .009

Notes: abbreviations, dummy coding, and significance levels as in Table 6.

Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.085** | -.079** | -.181** | -.068** | .122** | .006
N cases | -.083** | -.049** | .011 | -.018 | -.034* | -.047**
C imbalance | .143** | .249** | .260** | .521** | .152** | .211**
R missing | .054* | .078** | .041 | .085** | .075** | .071**
SE HS | .234** | .290** | .182** | .269** | .328** | .255**
SE VS | -.010 | -.013 | -.006 | -.013 | -.014 | -.011
Spread | -.332** | -.427** | -.264** | -.431** | -.450** | -.369**
P missing dum1 | -.038 | -.041 | -.035 | -.029 | -.057 | -.035
P missing dum2 | .003 | .008 | .005 | .017 | .000 | .011

Notes: abbreviations, dummy coding, and significance levels as in Table 6.

Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-means clustering.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.080** | -.078** | -.181** | -.068** | .117** | .009
N cases | -.079** | -.049** | .012 | -.017 | -.033 | -.047**
C imbalance | .136** | .240** | .263** | .524** | .145** | .206**
R missing | .057* | .079** | .041 | .084** | .079** | .057*
SE HS | .236** | .289** | .183** | .271** | .315** | .264**
SE VS | -.009 | -.013 | -.006 | -.013 | -.014 | -.011
Spread | -.362** | -.439** | -.262** | -.440** | -.474** | -.363**
P missing dum1 | -.037 | -.042 | -.036 | -.032 | -.038 | -.046
P missing dum2 | .002 | .013 | .001 | .014 | .009 | .004

Notes: abbreviations, dummy coding, and significance levels as in Table 6.

Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables; control variables, such as the type of classifier and the imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed, with the missing data patterns treated as two dummy variables (P missing dum1/dum2: 00, 01, 10). Tables 6-11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected:

(i) IF N attributes increases, THEN use SMO.
(ii) IF N cases increases, THEN use trees.J48.
(iii) IF C imbalance increases, THEN use trees.J48.
(iv) IF R missing increases, THEN use SMO.
(v) IF SE HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.

A compact encoding of these rules is sketched below.
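These rules also reduce to a lookup, now keyed by the dataset characteristic that dominates; the encoding below is illustrative (how to decide which characteristic "dominates" is left to the practitioner), with names following the paper's abbreviations:

```python
# Rules (i)-(vi): for each growing data characteristic, the classifier
# whose RMSE is least hurt by it according to Tables 6-11.
BEST_CLASSIFIER_FOR = {
    "N_attributes": "SMO",
    "N_cases": "trees.J48",
    "C_imbalance": "trees.J48",
    "R_missing": "SMO",
    "SE_HS": "SMO",
    "Spread": "Logistic",
}

def pick_classifier(dominant_characteristic: str) -> str:
    return BEST_CLASSIFIER_FOR[dominant_characteristic]
```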

Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appear similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N attributes) was observed for the logistic algorithm than for any other algorithm; thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scatteredness standard error (SE HS), SMO had the lowest performance, and for missing data spread, the logistic classifier had the lowest performance.

Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness (SE HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.

The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant; to isolate the impact of the classifiers, further tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant: very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.

Figure 7 shows the RMSE as a function of the ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases), which is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between the imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms, and that the patterns of these effects differ depending on the imputation methods and classifiers used.

Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data, the imputation method, and the performance of each classifier in terms of RMSE.

[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). x-axis: dataset characteristics (attributes, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing patterns); y-axis: standardized beta coefficients (approximately -0.4 to 0.3). Legend: the seven imputation methods.]

[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). Same axes and legend as Figure 5; coefficients range from approximately -0.6 to 0.8.]

In total, 226,800 datasets (3 missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data as long as the data characteristics can be obtained. Second, we can establish general rules for the selection of the optimal combination of a classification algorithm and an imputation algorithm.

[Figure 7: RMSE by ratio of missing data. x-axis: missing data ratio (0.05 to 0.50); y-axis: RMSE (approximately 0.30 to 0.49). Legend: the seven imputation methods.]

Table 12: Factors influencing accuracy (RMSE) of the classifier algorithms.

Data characteristic | B | Data characteristic | B
(constant) | .060** | M imputation dum1 | .012**
R missing | .083** | M imputation dum2 | -.001*
SE HS | -.005** | M imputation dum4 | .000
SE VS | .000** | M imputation dum5 | .000
Spread | .017** | M imputation dum6 | .001**
N attributes | -.008** | M imputation dum7 | -.001*
C imbalance | -.003** | P missing dum1 | -.006**
N cases | .002** | P missing dum3 | .000

Note 1: Dummy variables for the imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0); MEAN IMPUTATION (M imputation dum2 = 1, others = 0); GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0); PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0); HOT-DECK (M imputation dum5 = 1, others = 0); k-NN (M imputation dum6 = 1, others = 0); k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0); monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0); arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standardized beta coefficient.
Note 2: *P < .1; **P < .05.

6. Conclusion

So far, prior research does not fully inform us of the fit among datasets, imputation methods, and classification algorithms. This study therefore ultimately aims to establish a rule set that guides developers of classification/recommender systems in selecting the best classification algorithm based on the dataset and the imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.

In this paper, the factors affecting the performance of classification algorithms were identified as the characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the distribution of missing cells across records (SE HS) affects classification performance more strongly than the distribution of missing cells across features (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while the other factors do not.

A disadvantage of logistic regression is its lack of flexibility: the assumption of a linear dependency between the predictor variables and the log-odds ratio yields a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most significantly decreased the predicted performance of the classification algorithms. In particular, SMO was second to none with respect to SE HS in any imputation situation; that is, if a dataset has a high number of records in which many cells are missing, then SMO is the best classification algorithm to apply.

The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of the missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived from the estimated results. Moreover, we established a set of general rules based on the results of this study; these rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.

Ubiquitous environments include a variety of forms of sensor data under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.

This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) who must choose an imputation method and classification algorithm according to context while maintaining high prediction performance.

In future studies, the predicted performance of the various methods can be tested with actual datasets. Although, in prior research on classification algorithms, multiple benchmark datasets from the UCI repository have been used to demonstrate the generality of the proposed method, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate as well as the TP rate, for example, is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).

References

[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1-18, 2013.

[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435-460, 2013.

[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519-533, 2003.

[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1-37, 2011.

[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29-35, 2013.

[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36-40, 2013.

[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1-17, 2013.

[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15-21, 2013.

[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633-650, 2013.

[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65-78, 2013.

[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147-177, 2002.

[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225-245, 2008.

[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699-705, 1978.

[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63-76, 1998.

[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847-1857, 2013.

[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31-40, 2013.

[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.

[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463-484, 2012.

[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597-604, 2006.

[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63-77, 2006.


Page 4: Research Article Missing Values and Optimal Selection of an ...imputation method is the most frequently used statistical method developed to deal with missing data problems. It is

4 Mathematical Problems in Engineering

Table 1 The characteristics of missing data

Variables Meaning Calculation

Missing data ratioThe number of missing values in theentire dataset as compared to thenumber of nonmissing values

The number of empty data cellstotal cells

Patterns of missingdata

UnivariateRatio of missing to complete values for an existing feature comparedto the values for all featuresMonotone

ArbitraryHorizontalscatteredness

Distribution of missing values withineach data record

Determine the number of missing cells in each record and calculatethe standard deviation

Verticalscatteredness

Distribution of missing values for eachattribute

Determine the number of missing cells in each feature and calculatethe standard deviation

Missing dataspread

Larger standard deviations indicatestronger effects of missing data

Determine the weighted average of the standard deviations of featureswith missing data (weight the ratio of missing to complete data foreach feature)

Missing datacharacteristics

Dataset feature

Imputation method

Classificationperformance

Figure 1 Research model

A monotone pattern occurs if variables can be arranged suchthat all 119884119895+1 119884119896 are missing for cases where 119884119895 is missingAnother characteristic missing data spread is importantbecause larger standard deviations for missing values withinan existing feature indicate that the missing data has greaterinfluence on the results of the analysis (Figure 2)

32 Dataset Features Table 2 lists the features of datasetsBased on the research of Kwon and Sim [15] in which char-acteristics of datasets that influence classification algorithmswere identified we considered the following statisticallysignificant features in this study missing values the numberof cases the number of attributes and the degree of classimbalance However the discussion of missing values isomitted here because it has already been analyzed in detailby Kwon and Sim [15]

33 Imputation Methods Table 3 lists the imputation meth-ods used in this study Since datasets with categorical decisionattributes are included imputation methods that do notaccommodate categorical attributes (eg regression imputa-tion) are excluded from this paper

Table 2 Dataset features

Variables DescriptionNumber of cases Number of records in the dataset

Number of attributes Number of features characteristicof the dataset

Degree of class imbalance Ratio

34 Classification Algorithms Many studies have comparedclassification algorithms in various areas For example thedecision tree is known as the best algorithm for arrhythmiaclassification [16] In Table 4 six types of representative clas-sification algorithms for supervised learning are describedC45 SVM (support vector machine) Bayesian networklogistic classifier 119896-nearest neighbor classifier and regres-sion

4 Method

We conducted a performance evaluation of the imputationmethods and classification algorithms described in the previ-ous section using actual datasets taken from the UCI datasetarchive To ensure the accuracy of each method in caseswith no missing values datasets with missing values werenot included Among the selected datasets six (Iris WineGlass Liver Disorder Ionosphere and Statlog Shuttle) wereincluded for comparison with the results of Kang [10] Thesedatasets are popular and frequently utilized benchmarks inthe literature whichmakes themuseful for demonstrating thesuperiority of the proposed idea

Table 5 provides the names of the datasets the numbersof cases and the descriptions of features and classes Thenumbers in parentheses in the last two columns represent thenumber of features and classes for the decision attributes Forexample in dataset Iris ldquoNumeric (4)rdquo indicates that thereare four numeric attributes and ldquoCategorical (3)rdquo means thatthere are three classes in the decision attribute

Since UCI datasets have no missing data target valuesin each dataset were randomly omitted [10] Based on

Mathematical Problems in Engineering 5

Figure 2: Missing data patterns. In the univariate pattern, all missing values are in the last feature; in the monotone pattern, all missing values are in the last feature and the features immediately preceding it (last-1 and last-2); in the arbitrary pattern, missing values occur in random features and records.

Table 3: Imputation methods.

Listwise deletion: Perhaps the most basic traditional technique for dealing with missing data. Cases with missing values are discarded, restricting the analyses to cases for which complete data are available.
Mean imputation: Replaces missing data with the overall mean of the observed data.
Group mean imputation: Replaces a missing value with the mean of a subset of the data, based on other observed variable(s) in the data.
Predictive mean imputation: Also called regression imputation. Imputes a missing value using an ordinary least-squares regression estimate of the missing data.
Hot-deck: Values from the most similar records are imputed to the missing values.
k-NN: The attribute value is imputed from the k most similar instances among the nonmissing data.
k-means clustering: k sets are created that are homogeneous internally and heterogeneous externally.
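To make the simplest of these methods concrete, the following is a minimal sketch of mean imputation in Java, assuming missing cells are encoded as NaN in a numeric matrix; the class name and encoding are our own illustrative choices.

```java
// Illustrative mean imputation: replace each NaN with the mean of its
// column, computed over the observed (non-NaN) values.
public class MeanImputer {
    public static void impute(double[][] data) {
        int nFeatures = data[0].length;
        for (int j = 0; j < nFeatures; j++) {
            double sum = 0;
            int n = 0;
            for (double[] row : data) {
                if (!Double.isNaN(row[j])) { sum += row[j]; n++; }
            }
            double mean = (n > 0) ? sum / n : 0.0;  // fall back to 0 if the column is entirely missing
            for (double[] row : data) {
                if (Double.isNaN(row[j])) row[j] = mean;
            }
        }
    }
}
```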

Table 4: Classification algorithms.

C4.5: Estimates the known data using learned rules. C4.5 gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method until it reaches the end node.
SVM: Classifies the unknown class by finding the optimal hyperplane with the maximum margin, which reduces the estimation error.
Bayesian network: A probability network with a high posterior probability given the instances. Such a network can provide insight into probabilistic dependencies among the variables in the training dataset.
Logistic classifier: Takes the functional form of the logistic CDF (cumulative distribution function). This function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13].
k-nearest neighbor classifier: A simple instance-based learner that uses the class of the nearest k training instances as the class of the test instances.
Regression: The class is binarized, and one regression model is built for each class value [14].


Table 5: Datasets used in the experiments.

Dataset | Number of cases | Features | Decision attributes
Iris | 150 | Numeric (4) | Categorical (3)
Wine | 178 | Numeric (13) | Categorical (3)
Glass | 214 | Numeric (9) | Categorical (7)
Liver Disorders | 345 | Numeric (6) | Categorical (2)
Ionosphere | 351 | Numeric (34) | Categorical (2)
Statlog Shuttle | 57,999 | Numeric (7) | Categorical (7)

Based on the list of missing data characteristics, three variations with different missing data ratios (5%, 10%, and 15%) and three variations representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. Since 6 datasets were available, 54 datasets were prepared for each imputation method. We repeated the experiment for each dataset 100 times in order to minimize errors and bias; thus, 5,400 dataset variations (54 datasets x 100 trials) were imputed with each imputation method. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
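As an illustration of this setup, the following is a minimal sketch, in Java with the Weka 3.6 API the study used, of how missing cells might be injected into a complete dataset at a given ratio under the arbitrary pattern; the class and method names (MissingDataInjector, injectArbitrary) are our own illustrative choices, not code from the study.

```java
import java.util.Random;
import weka.core.Instances;

// Illustrative helper: randomly blanks out cells of non-class attributes
// to simulate the "arbitrary" missing data pattern at a given ratio.
public class MissingDataInjector {

    public static void injectArbitrary(Instances data, double missingRatio, long seed) {
        Random rnd = new Random(seed);
        int classIndex = data.classIndex();
        for (int i = 0; i < data.numInstances(); i++) {
            for (int j = 0; j < data.numAttributes(); j++) {
                if (j == classIndex) continue;          // never blank the decision attribute
                if (rnd.nextDouble() < missingRatio) {
                    data.instance(i).setMissing(j);     // mark the cell as missing
                }
            }
        }
    }
}
```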

There are various indicators for measuring performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we also adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.
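For reference, the standard definition of RMSE over $n$ predictions, with $\hat{y}_i$ the predicted and $y_i$ the observed value (not restated in the original), is

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2}.
\]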

RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equal to 1 when there are no missing data [10]; the no-missing-data condition was used as the performance baseline. As the next step, we generated a missing dataset from the original complete dataset and then applied an imputation method to replace the null data. A classification algorithm was then run to estimate the results on the imputed dataset. With all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation, with the characteristics of the missing data and those of the datasets as input factors, in order to determine how they affected the performance of the selected classification algorithms:

\[
y_p = \sum_{\forall j \in M} \beta_{pj} x_j + \sum_{\forall k \in D} \chi_{pk} z_k + \varepsilon_p. \tag{9}
\]

In this equation, $x_j$ is the value of a characteristic of the missing data (the set $M$), $z_k$ is the value of a characteristic of the dataset (the set $D$), and $y_p$ is a performance parameter. Note that $M$ = {missing data ratio, pattern of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and $D$ = {number of cases, number of attributes, degree of class imbalance}. In addition, $p = 1$ indicates relative prediction accuracy, $p = 2$ represents RMSE, and $p = 3$ means elapsed time. We performed the experiment using the Weka library source software (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to run the automated experiment repeatedly.
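A minimal sketch of such a library-based evaluation loop, assuming the standard Weka 3.6 classes (weka.classifiers.Evaluation, weka.classifiers.trees.J48); the file name and random seed are illustrative, not taken from the paper:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class EvaluationSketch {
    public static void main(String[] args) throws Exception {
        // Load an already-imputed dataset in ARFF format (file name is illustrative).
        Instances data = new Instances(new BufferedReader(new FileReader("iris-imputed.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // decision attribute is the last column

        // Random 3:7 test/training split, as in the paper's experimental design.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.7);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Train a J48 decision tree and report RMSE on the held-out test set.
        J48 tree = new J48();
        tree.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println("RMSE: " + eval.rootMeanSquaredError());
    }
}
```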

5. Results

In total, 37,800 datasets (3 missing ratios x 3 missing patterns x 7 imputation methods x 6 datasets x 100 trials) were imputed, and each imputed dataset was tested with the 6 classification algorithms of Table 4, for a total of 226,800 evaluations (37,800 imputed datasets x 6 classifiers). The results were broken down by dataset, classification algorithm, and imputation method for comparison of performance.

5.1. Datasets. Figure 3 shows the performance of each imputation method on the six different datasets. The x-axis shows the three missing ratios, which represent the characteristics of the missing data, and the y-axis indicates performance as RMSE. The results for the three variations of the missing data patterns and all tested classification algorithms were merged for each imputation method.

For Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best results.

For Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.

For Liver Disorders data (Figure 3(c)), k-NN was the least effective, and once again the predictive mean imputation method yielded the best results.

For Ionosphere data (Figure 3(d)), hot-deck was the worst and k-NN the best.

For Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.

For Statlog Shuttle data (Figure 3(f)), unlike the other datasets, the results varied with the missing data ratio. However, predictive mean imputation was still the best method overall, and hot-deck the worst.

Figure 3 illustrates that the predictive mean imputation method yielded the best results overall and hot-deck imputation the worst. However, no imputation method was consistently superior across all datasets. For example, the k-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was the lowest. In another example, the group mean imputation method performed best for the Iris and Wine datasets, but its performance was only average for the other datasets. The results are therefore inconsistent, and no single imputation method can be identified as best; the imputation method alone cannot be used as an accurate predictor of performance. Rather, performance must be influenced by other factors, such as the interaction between the characteristics of the dataset, the missing data, and the chosen imputation method.


Figure 3: Comparison of the performances of the imputation methods for each dataset. Panels: (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, (f) Statlog Shuttle. Each panel plots RMSE (y-axis) against the missing ratio (x-axis: 0.01, 0.05, 0.10) for listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck, k-NN, and k-means clustering.


Table 6: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.076** | -.075** | -.178** | -.072** | .115** | .007
N cases | -.079** | -.049** | .012 | -.017 | -.032 | -.048**
C imbalance | .117** | .239** | .264** | .525** | .163** | .198**
R missing | .051* | .078** | .040 | .080** | .076** | .068**
SE HS | .249** | .285** | .186** | .277** | .335** | .245**
SE VS | -.009 | -.013 | -.006 | -.013 | -.016 | -.010
Spread | -.382** | -.430** | -.261** | -.436** | -.452** | -.363**
P missing dum1 | -.049 | -.038 | -.038 | -.037 | -.045 | -.038
P missing dum2 | -.002 | .014 | .002 | .011 | .001 | .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.

Table 7: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): group mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.068** | -.072** | -.179** | -.068** | .115** | .010
N cases | -.082** | -.050** | .011 | -.018 | -.034* | -.047**
C imbalance | .115** | .228** | .260** | .517** | .156** | .197**
R missing | .050** | .085** | .043 | .084** | .095** | .066**
SE HS | .230** | .268** | .178** | .273** | .300** | .248**
SE VS | -.008 | -.012 | -.006 | -.013 | -.013 | -.010
Spread | -.296** | -.439** | -.264** | -.443** | -.476** | -.382**
P missing dum1 | -.043 | -.032 | -.034 | -.035 | -.035 | -.041
P missing dum2 | .002 | .024 | .004 | .016 | .021 | .013

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.

5.2. Classification Algorithms. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of the imputation methods was similar and did not vary with the ratio of missing data, with the exception of listwise deletion, whose performance deteriorated as the ratio of missing to complete data increased. In the listwise deletion method, all records that contain missing data are deleted; the number of deleted records therefore grows with the missing data ratio, which explains the low performance of this method.

The differences in performance between the imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.

In Figure 4, the results suggest the following rules (a code sketch encoding them follows the list):

(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.

(ii) IF the missing rate increases AND the logistic classifier method is used, THEN use the HOT DECK method.

(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.

(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.

(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
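As a minimal sketch of how an application might apply these rules at runtime, the following Java method maps a classifier name to the recommended imputation method when the missing rate is rising; the class, enum, and method names are our own illustrative choices, not code from the study.

```java
// Illustrative encoding of the rules read off Figure 4: given a classifier
// and a rising missing rate, return the recommended imputation method.
public class ImputationRuleSet {

    public enum Imputation { GROUP_MEAN, HOT_DECK, KNN }

    public static Imputation recommend(String classifier, boolean missingRateIncreasing) {
        if (!missingRateIncreasing) {
            throw new IllegalArgumentException("Rules apply only to a rising missing rate");
        }
        switch (classifier) {
            case "IBk":
            case "Regression":
            case "BayesNet":
                return Imputation.GROUP_MEAN;   // rules (i), (iii), (iv)
            case "Logistic":
                return Imputation.HOT_DECK;     // rule (ii)
            case "trees.J48":
                return Imputation.KNN;          // rule (v)
            default:
                throw new IllegalArgumentException("No rule for classifier: " + classifier);
        }
    }
}
```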

5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios x 3 missing patterns x 100 trials).


Figure 4: Comparison of the classifiers in terms of classification performance. Panels: (a) Decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) Regression, (e) Logistic, (f) IBk (k-nearest neighbor classifier). Each panel plots RMSE (y-axis) against the missing ratio (x-axis: 0.01, 0.05, 0.10) for listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck, k-NN, and k-means clustering.


Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): predictive mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.076** | -.076** | -.178** | -.063** | .123** | .016
N cases | -.084** | -.049** | .012 | -.017 | -.034* | -.047**
C imbalance | .117** | .242** | .263** | .523** | .153** | .198**
R missing | .050* | .079** | .043 | .085** | .080** | .068**
SE HS | .223** | .279** | .182** | .268** | .322** | .242**
SE VS | -.008 | -.013 | -.006 | -.013 | -.015 | -.009
Spread | -.328** | -.432** | -.262** | -.434** | -.465** | -.361**
P missing dum1 | -.042 | -.035 | -.034 | -.028 | -.044 | -.036
P missing dum2 | .008 | .012 | .004 | .018 | .007 | .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.

Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): hot deck.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.080** | -.073** | -.176** | -.071** | .115** | .007
N cases | -.081** | -.049** | .012 | -.018 | -.034* | -.047**
C imbalance | .135** | .237** | .261** | .524** | .133** | .211**
R missing | .062** | .083** | .044 | .084** | .075** | .070**
SE HS | .225** | .275** | .183** | .271** | .313** | .254**
SE VS | -.009 | -.013 | -.006 | -.013 | -.014 | -.010
Spread | -.365** | -.428** | -.265** | -.427** | -.441** | -.361**
P missing dum1 | -.035 | -.037 | -.034 | -.033 | -.048 | -.038
P missing dum2 | .012 | .015 | .004 | .012 | -.004 | .009

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.

Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.085** | -.079** | -.181** | -.068** | .122** | .006
N cases | -.083** | -.049** | .011 | -.018 | -.034* | -.047**
C imbalance | .143** | .249** | .260** | .521** | .152** | .211**
R missing | .054* | .078** | .041 | .085** | .075** | .071**
SE HS | .234** | .290** | .182** | .269** | .328** | .255**
SE VS | -.010 | -.013 | -.006 | -.013 | -.014 | -.011
Spread | -.332** | -.427** | -.264** | -.431** | -.450** | -.369**
P missing dum1 | -.038 | -.041 | -.035 | -.029 | -.057 | -.035
P missing dum2 | .003 | .008 | .005 | .017 | .000 | .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.


Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-means clustering.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | -.080** | -.078** | -.181** | -.068** | .117** | .009
N cases | -.079** | -.049** | .012 | -.017 | -.033 | -.047**
C imbalance | .136** | .240** | .263** | .524** | .145** | .206**
R missing | .057* | .079** | .041 | .084** | .079** | .057*
SE HS | .236** | .289** | .183** | .271** | .315** | .264**
SE VS | -.009 | -.013 | -.006 | -.013 | -.014 | -.011
Spread | -.362** | -.439** | -.262** | -.440** | -.474** | -.363**
P missing dum1 | -.037 | -.042 | -.036 | -.032 | -.038 | -.046
P missing dum2 | .002 | .013 | .001 | .014 | .009 | .004

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.

Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly splitting each dataset into test and training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables; control variables, such as the type of classifier and the imputation method, were also included. The effects of the various characteristics of the data and the missing values on classifier performance (RMSE) were analyzed. The three missing data patterns were treated as two dummy variables (P missing dum1 and P missing dum2; see the table notes for the coding). Tables 6-11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected (a sketch encoding them follows the list):

(i) IF N attributes increases, THEN use SMO.
(ii) IF N cases increases, THEN use trees.J48.
(iii) IF C imbalance increases, THEN use trees.J48.
(iv) IF R missing increases, THEN use SMO.
(v) IF SE HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
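A minimal illustrative encoding of these classifier-selection rules, again with our own class and method names; the mapping is exactly rules (i)-(vi), and how the dominant (increasing) characteristic is identified is left to the caller.

```java
// Illustrative encoding of rules (i)-(vi): given the dataset characteristic
// that is increasing, return the recommended classifier.
public class ClassifierRuleSet {

    public static String recommend(String increasingCharacteristic) {
        switch (increasingCharacteristic) {
            case "N_attributes":
            case "R_missing":
            case "SE_HS":
                return "SMO";        // rules (i), (iv), (v)
            case "N_cases":
            case "C_imbalance":
                return "trees.J48";  // rules (ii), (iii)
            case "Spread":
                return "Logistic";   // rule (vi)
            default:
                throw new IllegalArgumentException(
                    "No rule for characteristic: " + increasingCharacteristic);
        }
    }
}
```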

Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method, with the dataset characteristics on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the coefficient patterns were similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N attributes) was observed for the logistic algorithm than for any other algorithm; the logistic algorithm thus exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance,

except in comparison to listwise deletion and mean imputation. For horizontal scatteredness (SE HS), SMO had the lowest performance. For missing data spread, the logistic classifier had the lowest performance.

Moreover, for each single factor (e.g., spread), even when the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness (SE HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.

The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of the imputation methods on performance were insignificant, so further tests were needed to isolate the impact of the classifiers. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant: very high or very low beta coefficients are observed for most dataset characteristics, except the number of instances and class imbalance.

Figure 7 shows the RMSE as a function of the ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases), which is not unexpected. However, as the ratio of missing to complete data increases, the differences in performance between the imputation methods become significant. These results imply that the characteristics of the dataset and of the missing values affect the performance of the classifier algorithms; furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.

Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classifier in terms of RMSE, on the other.


Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis lists the dataset characteristics (attributes, instances, data imbalance, missing ratio, horizontal scatteredness, vertical scatteredness, spread, and the two missing pattern dummies); the y-axis shows the standardized coefficients (approximately -0.4 to 0.3) for listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck imputation, k-NN, and k-means clustering.

Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). Axes and legend as in Figure 5, with coefficients ranging from approximately -0.6 to 0.8.

In total, 226,800 datasets (3 missing ratios x 3 missing patterns x 6 datasets x 100 trials x 7 imputation methods x 6 classification algorithms) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for the selection of the optimal combination of a classification algorithm and an imputation algorithm.

Figure 7: RMSE by ratio of missing data for each method of imputation. The x-axis shows missing ratios from 0.05 to 0.50; the y-axis shows RMSE (approximately 0.30 to 0.49) for hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, and k-means clustering.

Table 12: Factors influencing the accuracy (RMSE) of the classifier algorithms.

Data characteristic | B
(constant) | .060**
R missing | .083**
SE HS | -.005**
SE VS | .000**
Spread | .017**
N attributes | -.008**
C imbalance | -.003**
N cases | .002**
M imputation dum1 | .012**
M imputation dum2 | -.001*
M imputation dum4 | .000
M imputation dum5 | .000
M imputation dum6 | .001**
M imputation dum7 | -.001*
P missing dum1 | -.006**
P missing dum3 | .000

Note 1: Dummy variables for the imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0), MEAN IMPUTATION (M imputation dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0), HOT DECK (M imputation dum5 = 1, others = 0), k-NN (M imputation dum6 = 1, others = 0), and k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0), and arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standardized beta coefficient.
Note 2: *P < .1; **P < .05.

6. Conclusion

Thus far, prior research has not fully informed us about the fit among datasets, imputation methods, and classification algorithms. This study therefore ultimately aims to establish a rule set that guides developers of classification and recommender systems in selecting the best classification algorithm based


on the dataset and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, imputed data, and imputation methods); prior research examined only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.

In this paper, factors affecting the performance of classification algorithms were identified as follows: characteristics of the missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found several factors significantly associated with the performance of the classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE HS) affected classification performance more strongly than the number of missing cells in each feature (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while the other factors do not.

A disadvantage of logistic regression is its lack of flexibility: the assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most strongly decreased the predicted performance of the classification algorithms. In particular, SMO was second to none with respect to SE HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.

The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of the missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived from the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow a temporally optimal combination of classification algorithm and imputation method to be chosen, thus increasing the agility of ubiquitous computing applications.

Ubiquitous environments include a variety of sensor data produced under limited service conditions, such as location, time, and status, by combinations of different kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which the data change dynamically. For practitioners, rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.

This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing an imputation method and classification algorithm according to context while maintaining high prediction performance.

In future studies, the predicted performance of the various methods can be tested on additional real datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of a proposed method, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate as well as the TP rate, for example, is crucial when investigating the effect of class imbalance, which this paper treats as an independent variable. Although the performance results would likely be very similar under other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated by a study including these other metrics.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).

References

[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1-18, 2013.

[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435-460, 2013.

[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519-533, 2003.

[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1-37, 2011.

[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29-35, 2013.

[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36-40, 2013.

[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1-17, 2013.

[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15-21, 2013.

[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633-650, 2013.

[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65-78, 2013.

[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147-177, 2002.

[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225-245, 2008.

[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699-705, 1978.

[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63-76, 1998.

[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847-1857, 2013.

[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31-40, 2013.

[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.

[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463-484, 2012.

[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597-604, 2006.

[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63-77, 2006.


Stochastic AnalysisInternational Journal of

Page 5: Research Article Missing Values and Optimal Selection of an ...imputation method is the most frequently used statistical method developed to deal with missing data problems. It is

Mathematical Problems in Engineering 5

Observed valuesMissing values

Observed valuesMissing values

Observed valuesMissing values

Univariate pattern Monotone pattern Arbitrary pattern

All missing values arein the last feature

n2n

3n

All missing values arein the last feature

and last-1 and last-2

Missing values are in random feature and record

Figure 2 Missing data patterns

Table 3 Imputation methods

Imputation methods Description

Listwise deletion Perhaps the most basic traditional technique for dealing with missing data Cases with missingvalues are discarded restricting the analyses to cases for which complete data are available

Mean imputation Involves replacing missing data with the overall mean for the observed dataGroup meanimputation

A missing value is replaced by the mean of a subset of the data based on other observed variable(s)in the data

Predictive meanimputation

Also called regression imputation Predictive mean imputation involves imputing a missing valueusing an ordinary least-squares regression method to estimate missing data

Hot-deck Most similar records are imputed to missing values119896-NN The attribute value of 119896 is imputed to the most similar instance from nonmissing data119896-means clustering 119896 numbers of sets are created that are homogeneous on the inside and heterogeneous on the outside

Table 4 Classification algorithms

Algorithms Description

C45Estimates the known data using learning rules C45 gradually expands the conditions of thealgorithm splitting the upper node into subnodes using a divide-and-conquer method until it comesto the end node

SVM Classifies the unknown class by finding the optimal hyperplane with the maximum margin thatreduces the estimation error

Bayesian network A probability network with a high posterior probability given the instances Such a network canprovide insight into probabilistic dependencies among the variables in the training dataset

Logistic classifierTakes the functional form of logistic CDF (cumulative distribution function) This function relatesthe probability of some event to attribute variables through regression coefficients and alpha andbeta parameters which are estimated from training data [13]

119896-nearest neighborclassifier

Simple instance-based learner that uses the class of the nearest 119896 training instances for the class ofthe test instances

Regression The class is binarized and one regression model is built for each class value [14]

6 Mathematical Problems in Engineering

Table 5 Datasets used in the experiments

Dataset Number of cases Features Decision attributesIris 150 Numeric (4) Categorical (3)Wine 178 Numeric (13) Categorical (3)Glass 214 Numeric (9) Categorical (7)Liver disorder 345 Numeric (6) Categorical (2)Ionosphere 351 Numeric (34) Categorical (2)Statlog Shuttle 57999 Numeric (7) Categorical (7)

the list of missing data characteristics three datasets withthree different missing data ratios (5 10 and 15) andthree sets representing each of the missing data patterns(univariate monotone and arbitrary) were created for atotal of nine variations for each dataset In total 54 datasetswere imputed for each imputation method as 6 datasetswere available We repeated the experiment for each dataset1000 times in order to minimize errors and bias Thus5400 datasets were imputed in total for our experimentAll imputation methods were implemented using packageswritten in Java In order to measure the performance of eachimputation method we applied imputed datasets to the sixclassification algorithms listed in Table 4

There are various indicators to measure performancesuch as accuracy relative accuracy MAE (mean absoluteerror) and RMSE (root mean square error) However RMSEis one of the most representative and widely used per-formance indicators in the imputation research Thereforewe also adopted RMSE as the performance indicator inthis study The performance of the selected classificationalgorithms was evaluated using SPSS 170

RMSE measures the difference between predicted andobserved values The term ldquorelative prediction accuracyrdquorefers to the relative ratio of accuracy which is equivalentto 1 when there are no missing data [10] The no-missing-data condition was used as a baseline of performance As thenext step we generated a missing dataset from the originalno-missing-dataset and then applied an imputation methodto replace the null data Then a classification algorithm wasconducted to estimate the results of the imputed datasetWithall combinations of imputation methods and classificationalgorithms a multiple regression analysis was conductedusing the following equation to understand the input factorsthe characteristics of missing data and those of the datasetsin order to determine how the selected classification algo-rithms affected performance

119910119901 = sum

forall119895isinM120573119901119895119909119895 + sum

forall119896isinD120594119901119896119911119896 + 120576119901 (9)

In this equation 119909 is the value of the characteristics ofthe missing data (M) 119911 is the value of each datasetrsquos char-acteristics in the set of dataset (D) and 119910 is a performanceparameter Note that M = missing data ratio patterns ofmissing data horizontal scatteredness vertical scatterednessmissing data spread and D = number of cases numberof attributes degree of class imbalance In addition 119901 =1 indicates relative prediction accuracy 119901 = 2 represents

RMSE and 119901 = 3 means elapsed time We performed theexperiment using the Weka library source software (release36) to determine the reliability of the implementation ofthe algorithms [17] We did not use the Weka GUI toolbut developed a Weka library-based performance evaluationprogram in order to conduct the automatized experimentrepeatedly

5 Results

In total 32400 datasets (3 missing ratios times 3 imputationpatterns times 6 imputation methods times 100 trials) were imputedfor each of the 6 classifiers Thus in total we tested 226800datasets (32400 imputed dataset times 7 classifier methods) Theresults were divided by those for each dataset classificationalgorithm and imputation method for comparison in termsof performance

51 Datasets Figure 3 shows the performance of each impu-tation method for the six different datasets On the 119909-axisthree missing ratios represent the characteristics of missingdata and on the 119910-axis performance is indicated usingthe RMSE All results of three different variations of themissing data patterns and tested classification algorithmswere merged for each imputation method

For Iris data (Figure 3(a)) the mean imputation methodyielded the worst results and the group mean imputationmethod the best results

For Glass Identification data (Figure 3(b)) hot-deckimputation was the least effective method and predictivemean imputation was the best

For Liver Disorder data (Figure 3(c)) 119896-NN was the leasteffective and once again the predictive mean imputationmethod yielded the best results

For Ionosphere data (Figure 3(d)) hot-deckwas theworstand 119896-NN the best

For Wine data (Figure 3(e)) hot-deck was once again theleast effective method and predictive mean imputation thebest

For Statlog data (Figure 3(f)) unlike the other datasetsthe results varied based on the missing data ratio Howeverpredictive mean imputation was still the best method overalland hot-deck the worst

Figure 3 illustrates that the predictive mean imputa-tion method yielded the best results overall and hot-deckimputation the worst However no imputation method wasgenerally superior in all cases with any given dataset Forexample the 119896-NN method yielded the best performancefor the Ionosphere dataset but for the Liver Disordersdataset its performance was lowest In another example thegroup mean imputation method performed best for the Irisand Wine datasets but its performance was only averagefor other datasets Therefore the results were inconsistentand determining the best imputation method is impossibleThus the imputation method cannot be used as an accuratepredictor of performance Rather the performance must beinfluenced by other factors such as the interaction betweenthe characteristics of the dataset in terms of missing data andthe chosen imputation method

Mathematical Problems in Engineering 7

10000000000

08000000000

06000000000

04000000000

02000000000

00000000000

001 005 010

(a) Iris

001 005 010

06000000000

05000000000

04000000000

03000000000

02000000000

01000000000

00000000000

(b) Glass Identification

001 005 010

6000000000000

4000000000000

2000000000000

00000000000

(c) Liver Disorders

001 005 010

06000000000

05000000000

04000000000

03000000000

02000000000

01000000000

00000000000

(d) Ionosphere

1250000000000000

1000000000000000

750000000000000

500000000000000

250000000000000

00000000000

k-NN

001 005 010

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

k-MEANS CLUSTERING

LISTWISE DELETION

imputation

(e) Wine

60000000000000

50000000000000

40000000000000

30000000000000

20000000000000

10000000000000

00000000000

001 005 010

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

imputation

k-NNk-MEANS CLUSTERING

(f) Statlog Shuttle

Figure 3 Comparison of performances of imputation methods for each dataset

8 Mathematical Problems in Engineering

Table 6 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) mean imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus076lowastlowast minus075lowastlowast minus178lowastlowast minus072lowastlowast 115lowastlowast 007N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus032 minus048lowastlowast

C imbalance 117lowastlowast 239lowastlowast 264lowastlowast 525lowastlowast 163lowastlowast 198lowastlowast

R missing 051lowast 078lowastlowast 040 080lowastlowast 076lowastlowast 068lowastlowast

SE HS 249lowastlowast 285lowastlowast 186lowastlowast 277lowastlowast 335lowastlowast 245lowastlowast

SE VS minus009 minus013 minus006 minus013 minus016 minus010Spread minus382lowastlowast minus430lowastlowast minus261lowastlowast minus436lowastlowast minus452lowastlowast minus363lowastlowast

P missing dum1 minus049 minus038 minus038 minus037 minus045 minus038P missing dum2 minus002 014 002 011 001 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 7 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) group mean imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus068lowastlowast minus072lowastlowast minus179lowastlowast minus068lowastlowast 115lowastlowast 010N cases minus082lowastlowast minus050lowastlowast 011 minus018 minus034lowast minus047lowastlowast

C imbalance 115lowastlowast 228lowastlowast 260lowastlowast 517lowastlowast 156lowastlowast 197lowastlowast

R missing 050lowastlowast 085lowastlowast 043 084lowastlowast 095lowastlowast 066lowastlowast

SE HS 230lowastlowast 268lowastlowast 178lowastlowast 273lowastlowast 300lowastlowast 248lowastlowast

SE VS minus008 minus012 minus006 minus013 minus013 minus010Spread minus296lowastlowast minus439lowastlowast minus264lowastlowast minus443lowastlowast minus476lowastlowast minus382lowastlowast

P missing dum1 minus043 minus032 minus034 minus035 minus035 minus041P missing dum2 002 024 004 016 021 013Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

52 Classification Algorithm Figure 4 shows the perfor-mance of the classification algorithms by imputation methodand ratio of missing data As shown in the figure theperformance of each imputation method was similar and didnot vary depending on the ratio of missing data except forlistwise deletion For listwise deletion as the ratio of missingto complete data increased the performance deterioratedIn the listwise deletion method all records are deletedthat contain missing data therefore the number of deletedrecords increases as the ratio of missing data increases Thelow performance of this method can be explained based onthis fact

The differences in performance between imputationmethods were minor The figure displays these differencesby classification algorithm Using the Bayesian network andlogistic classifier methods significantly improved perfor-mance compared to other classifiers However the relation-ships among missing data imputation methods and classi-fiers remained to be explainedThus a regression analysis wasconducted

In Figure 4 the results suggest the following rules

(i) IF themissing rate increases AND IBK is used THENuse the GROUP MEAN IMPUTATION method

(ii) IF the missing rate increases AND the logistic clas-sifier method is used THEN use the HOT DECKmethod

(iii) IF the missing rate increases AND the regressionmethod is used THEN use the GROUP MEAN IM-PUTATION method

(iv) IF the missing rate increases AND the BayesNetmethod is used THEN use the GROUP MEAN IM-PUTATION method

(v) IF the missing rate increases AND the treesJ48method is used THEN use the 119896-NN method

53 Regression The results of the regression analysis arepresented in Tables 6 7 8 9 10 and 11 The analysis wasconducted using 900 datasets (3 missing ratios times 3 missing

Mathematical Problems in Engineering 9

05000000000

04500000000

04000000000

03500000000

03000000000

001 005 010

(a) Decision tree (J48)

001 005 010

04000000000

03500000000

03000000000

(b) BayesNet

001 005 010

04400000000

04200000000

04000000000

03800000000

(c) SMO (SVM)

001 005 010

04200000000

04000000000

03800000000

03600000000

03400000000

03200000000

03000000000

(d) Regression

001 005 010

04200000000

04000000000

03800000000

03600000000

03400000000

03200000000

03000000000

02800000000

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

imputation

k-NNk-MEANS CLUSTERING

(e) Logistic

001 005 010

04250000000

04000000000

03750000000

03500000000

03250000000

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

imputation

k-MEANS CLUSTERINGk-NN

(f) IBk (119896-nearest neighbor classifier)

Figure 4 Comparison of classifiers in terms of classification performance

10 Mathematical Problems in Engineering

Table 8 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) Predictive Mean Imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus076lowastlowast minus076lowastlowast minus178lowastlowast minus063lowastlowast 123lowastlowast 016N cases minus084lowastlowast minus049lowastlowast 012 minus017 minus034lowast minus047lowastlowast

C imbalance 117lowastlowast 242lowastlowast 263lowastlowast 523lowastlowast 153lowastlowast 198lowastlowast

R missing 050lowast 079lowastlowast 043 085lowastlowast 080lowastlowast 068lowastlowast

SE HS 223lowastlowast 279lowastlowast 182lowastlowast 268lowastlowast 322lowastlowast 242lowastlowast

SE VS minus008 minus013 minus006 minus013 minus015 minus009Spread minus328lowastlowast minus432lowastlowast minus262lowastlowast minus434lowastlowast minus465lowastlowast minus361lowastlowast

P missing dum1 minus042 minus035 minus034 minus028 minus044 minus036P missing dum2 008 012 004 018 007 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 9 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) Hot deck

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus073lowastlowast minus176lowastlowast minus071lowastlowast 115lowastlowast 007N cases minus081lowastlowast minus049lowastlowast 012 minus018 minus034lowast minus047lowastlowast

C imbalance 135lowastlowast 237lowastlowast 261lowastlowast 524lowastlowast 133lowastlowast 211lowastlowast

R missing 062lowastlowast 083lowastlowast 044 084lowastlowast 075lowastlowast 070lowastlowast

SE HS 225lowastlowast 275lowastlowast 183lowastlowast 271lowastlowast 313lowastlowast 254lowastlowast

SE VS minus009 minus013 minus006 minus013 minus014 minus010Spread minus365lowastlowast minus428lowastlowast minus265lowastlowast minus427lowastlowast minus441lowastlowast minus361lowastlowast

P missing dum1 minus035 minus037 minus034 minus033 minus048 minus038P missing dum2 012 015 004 012 minus004 009Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN.

Data characteristic   trees.J48   BayesNet    SMO        Regression   Logistic    IBk
N attributes          -0.085**    -0.079**    -0.181**   -0.068**      0.122**     0.006
N cases               -0.083**    -0.049**     0.011     -0.018       -0.034*     -0.047**
C imbalance            0.143**     0.249**     0.260**    0.521**      0.152**     0.211**
R missing              0.054*      0.078**     0.041      0.085**      0.075**     0.071**
SE HS                  0.234**     0.290**     0.182**    0.269**      0.328**     0.255**
SE VS                 -0.010      -0.013      -0.006     -0.013       -0.014      -0.011
Spread                -0.332**    -0.427**    -0.264**   -0.431**     -0.450**    -0.369**
P missing dum1        -0.038      -0.041      -0.035     -0.029       -0.057      -0.035
P missing dum2         0.003       0.008       0.005      0.017        0.000       0.011

Note 1: N attributes = number of attributes; N cases = number of cases; C imbalance = degree of class imbalance; R missing = missing data ratio; SE HS = horizontal scatteredness; SE VS = vertical scatteredness; Spread = missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.


Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-means clustering.

Data characteristic   trees.J48   BayesNet    SMO        Regression   Logistic    IBk
N attributes          -0.080**    -0.078**    -0.181**   -0.068**      0.117**     0.009
N cases               -0.079**    -0.049**     0.012     -0.017       -0.033      -0.047**
C imbalance            0.136**     0.240**     0.263**    0.524**      0.145**     0.206**
R missing              0.057*      0.079**     0.041      0.084**      0.079**     0.057*
SE HS                  0.236**     0.289**     0.183**    0.271**      0.315**     0.264**
SE VS                 -0.009      -0.013      -0.006     -0.013       -0.014      -0.011
Spread                -0.362**    -0.439**    -0.262**   -0.440**     -0.474**    -0.363**
P missing dum1        -0.037      -0.042      -0.036     -0.032       -0.038      -0.046
P missing dum2         0.002       0.013       0.001      0.014        0.009       0.004

Note 1: N attributes = number of attributes; N cases = number of cases; C imbalance = degree of class imbalance; R missing = missing data ratio; SE HS = horizontal scatteredness; SE VS = vertical scatteredness; Spread = missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.

patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test and training sets at a 3 : 7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables; control variables, such as the type of classifier and the imputation method, were also included. The effects of the various characteristics of the data and of the missing values on classifier performance (RMSE) were then analyzed. The three types of missing patterns were encoded as two dummy variables (P missing dum1 and P missing dum2; see Note 1 of the tables for the coding). Tables 6-11 report the results of the regression analysis for each imputation method. The results suggest the following rules, which hold regardless of the imputation method selected (a code sketch encoding them follows the list):

(i) IF N attributes increases, THEN use SMO.
(ii) IF N cases increases, THEN use trees.J48.
(iii) IF C imbalance increases, THEN use trees.J48.
(iv) IF R missing increases, THEN use SMO.
(v) IF SE HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
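For developers who want to operationalize these rules, a minimal sketch is given below in Java, the language used for our Weka-based experiments. It is illustrative only: the class and method names are ours, the paper defines no numeric thresholds for "increases" (callers must judge this relative to their own data), and checking the rules in order with the first match winning is our design choice, not part of the rule set.

    // A hypothetical encoder for rules (i)-(vi) above; names and thresholds are ours.
    public final class ClassifierSelector {

        public enum Classifier { SMO, TREES_J48, LOGISTIC }

        // Each flag states whether the corresponding dataset characteristic is high;
        // rules are evaluated in the order (i)-(vi), and the first match wins.
        public static Classifier recommend(boolean nAttributesHigh, boolean nCasesHigh,
                                           boolean cImbalanceHigh, boolean rMissingHigh,
                                           boolean seHsHigh, boolean spreadHigh) {
            if (nAttributesHigh) return Classifier.SMO;        // rule (i)
            if (nCasesHigh)      return Classifier.TREES_J48;  // rule (ii)
            if (cImbalanceHigh)  return Classifier.TREES_J48;  // rule (iii)
            if (rMissingHigh)    return Classifier.SMO;        // rule (iv)
            if (seHsHigh)        return Classifier.SMO;        // rule (v)
            if (spreadHigh)      return Classifier.LOGISTIC;   // rule (vi)
            return Classifier.TREES_J48;                       // arbitrary default
        }
    }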

Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appeared similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N attributes) was observed for the logistic algorithm than for any other algorithm. Thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective one. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For horizontal scatteredness (SE HS), SMO had the lowest performance. For missing data spread, the logistic classifier had the lowest performance.

Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness (SE HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.

The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of the individual imputation methods on performance were insignificant. In order to determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant: very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.

Figure 7 shows the RMSE as a function of the ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between the imputation methods become significant. These results imply that the characteristics of the dataset and of the missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.

Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classification algorithm in terms of RMSE, on the other.


[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis lists the dataset characteristics (attributes, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing pattern p1, missing pattern p2); the y-axis shows regression coefficients ranging from -0.4 to 0.3, with one line per imputation method (listwise deletion, mean, group mean, predictive mean, hot deck, k-NN, k-means clustering).]

[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). The axes match Figure 5, without the missing-pattern dummies; coefficients range from -0.6 to 0.8.]

In total, 226,800 cases (3 missing ratios × 3 missing patterns × 100 trials × 6 datasets × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that its data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and imputation algorithm.
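As an illustration of the first implication, one can fit such a regression and use it to predict RMSE for a new dataset. The sketch below is ours, not part of the study: it assumes the Apache Commons Math library (which our experiments did not use), restricts itself to three of the characteristics from Table 12 for brevity, and uses placeholder observations rather than our experimental data.

    import java.util.Arrays;
    import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

    public class RmseRegressionSketch {
        public static void main(String[] args) {
            // One row per experiment; columns: R missing, Spread, C imbalance
            // (a subset of Table 12's characteristics; all values are placeholders).
            double[][] x = {
                {0.05, 0.20, 0.10},
                {0.05, 0.40, 0.30},
                {0.10, 0.25, 0.15},
                {0.10, 0.50, 0.35},
                {0.15, 0.30, 0.20},
                {0.15, 0.60, 0.40},
            };
            // Observed RMSE for each experiment (placeholders).
            double[] y = {0.31, 0.35, 0.34, 0.40, 0.37, 0.44};

            OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
            ols.newSampleData(y, x); // an intercept is included by default
            double[] b = ols.estimateRegressionParameters(); // [b0, bRmissing, bSpread, bCimbalance]
            System.out.println("Coefficients: " + Arrays.toString(b));

            // Predict the RMSE of a classifier on an unseen dataset whose
            // missing-data characteristics are known.
            double[] c = {0.12, 0.45, 0.25};
            double rmseHat = b[0] + b[1] * c[0] + b[2] * c[1] + b[3] * c[2];
            System.out.println("Predicted RMSE: " + rmseHat);
        }
    }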

[Figure 7: RMSE by ratio of missing data. The x-axis shows the missing ratio from 0.05 to 0.50; the y-axis shows RMSE from 0.300 to 0.490; one line per imputation method (hot deck, group mean, mean, predictive mean, listwise deletion, k-NN, k-means clustering).]

Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic   B           Data characteristic    B
(constant)             0.060**    M imputation dum1       0.012**
R missing              0.083**    M imputation dum2      -0.001*
SE HS                 -0.005**    M imputation dum4       0.000
SE VS                  0.000**    M imputation dum5       0.000
Spread                 0.017**    M imputation dum6       0.001**
N attributes          -0.008**    M imputation dum7      -0.001*
C imbalance           -0.003**    P missing dum1         -0.006**
N cases                0.002**    P missing dum3          0.000

Note 1: Dummy variables related to imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0), MEAN IMPUTATION (M imputation dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0), HOT DECK (M imputation dum5 = 1, others = 0), k-NN (M imputation dum6 = 1, others = 0), and k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0), and arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standardized beta coefficient.
Note 2: *P < 0.1; **P < 0.05.
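To make the coding scheme in Note 1 concrete, the following sketch (our illustration, not code from the study) maps each imputation method and missing pattern to its dummy vector. Note that M imputation dum3 (group mean imputation) does not appear in Table 12, so group mean imputation effectively serves as the reference category of the fitted model.

    // Dummy coding from Table 12, Note 1 (illustrative helper; names are ours).
    public final class DummyCoding {

        // Declaration order matches M imputation dum1..dum7 in Note 1.
        public enum Imputation {
            LISTWISE_DELETION, MEAN, GROUP_MEAN, PREDICTIVE_MEAN, HOT_DECK, KNN, KMEANS
        }

        // Returns [dum1..dum7]; exactly one entry is 1 for the chosen method.
        public static int[] imputationDummies(Imputation m) {
            int[] d = new int[7];
            d[m.ordinal()] = 1;
            return d;
        }

        public enum Pattern { UNIVARIATE, MONOTONE, ARBITRARY }

        // Returns [P missing dum1, dum2, dum3] exactly as defined in Note 1.
        public static int[] patternDummies(Pattern p) {
            switch (p) {
                case UNIVARIATE: return new int[] {1, 0, 0};
                case MONOTONE:   return new int[] {0, 1, 0};
                default:         return new int[] {1, 1, 1}; // ARBITRARY
            }
        }
    }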

6. Conclusion

So far, prior research does not fully inform us of the fit among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides developers of classification and recommender systems in selecting the best classification algorithm based


on the dataset and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, imputed data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.

In this paper, the factors affecting the performance of classification algorithms were identified as follows: the characteristics of the missing values, the features of the dataset, and the imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of the classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE HS) affected classification performance more strongly than the number of missing cells in each feature (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while it does not affect the other algorithms in this way.

A disadvantage of logistic regression is its lack of flexibility. Its assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most significantly decreased the predicted performance of the classification algorithms. In particular, SMO was second to none in coping with SE HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.

The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of the missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. Also, a set of optimal combinations may be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.

Ubiquitous environments involve a variety of forms of sensor data under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which the data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.

This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.

In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of a proposed method, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. However, the FP rate as well as the TP rate is crucial when investigating the effect of class imbalance, which this paper treats as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study that includes these other metrics.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).

References

[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1-18, 2013.

[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435-460, 2013.

[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519-533, 2003.

[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1-37, 2011.

[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29-35, 2013.

[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36-40, 2013.

[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1-17, 2013.

[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15-21, 2013.

[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633-650, 2013.

[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65-78, 2013.

[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147-177, 2002.

[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225-245, 2008.

[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699-705, 1978.

[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63-76, 1998.

[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847-1857, 2013.

[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31-40, 2013.

[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.

[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463-484, 2012.

[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597-604, 2006.

[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63-77, 2006.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article Missing Values and Optimal Selection of an ...imputation method is the most frequently used statistical method developed to deal with missing data problems. It is

6 Mathematical Problems in Engineering

Table 5 Datasets used in the experiments

Dataset Number of cases Features Decision attributesIris 150 Numeric (4) Categorical (3)Wine 178 Numeric (13) Categorical (3)Glass 214 Numeric (9) Categorical (7)Liver disorder 345 Numeric (6) Categorical (2)Ionosphere 351 Numeric (34) Categorical (2)Statlog Shuttle 57999 Numeric (7) Categorical (7)

the list of missing data characteristics three datasets withthree different missing data ratios (5 10 and 15) andthree sets representing each of the missing data patterns(univariate monotone and arbitrary) were created for atotal of nine variations for each dataset In total 54 datasetswere imputed for each imputation method as 6 datasetswere available We repeated the experiment for each dataset1000 times in order to minimize errors and bias Thus5400 datasets were imputed in total for our experimentAll imputation methods were implemented using packageswritten in Java In order to measure the performance of eachimputation method we applied imputed datasets to the sixclassification algorithms listed in Table 4

There are various indicators to measure performancesuch as accuracy relative accuracy MAE (mean absoluteerror) and RMSE (root mean square error) However RMSEis one of the most representative and widely used per-formance indicators in the imputation research Thereforewe also adopted RMSE as the performance indicator inthis study The performance of the selected classificationalgorithms was evaluated using SPSS 170

RMSE measures the difference between predicted andobserved values The term ldquorelative prediction accuracyrdquorefers to the relative ratio of accuracy which is equivalentto 1 when there are no missing data [10] The no-missing-data condition was used as a baseline of performance As thenext step we generated a missing dataset from the originalno-missing-dataset and then applied an imputation methodto replace the null data Then a classification algorithm wasconducted to estimate the results of the imputed datasetWithall combinations of imputation methods and classificationalgorithms a multiple regression analysis was conductedusing the following equation to understand the input factorsthe characteristics of missing data and those of the datasetsin order to determine how the selected classification algo-rithms affected performance

119910119901 = sum

forall119895isinM120573119901119895119909119895 + sum

forall119896isinD120594119901119896119911119896 + 120576119901 (9)

In this equation 119909 is the value of the characteristics ofthe missing data (M) 119911 is the value of each datasetrsquos char-acteristics in the set of dataset (D) and 119910 is a performanceparameter Note that M = missing data ratio patterns ofmissing data horizontal scatteredness vertical scatterednessmissing data spread and D = number of cases numberof attributes degree of class imbalance In addition 119901 =1 indicates relative prediction accuracy 119901 = 2 represents

RMSE and 119901 = 3 means elapsed time We performed theexperiment using the Weka library source software (release36) to determine the reliability of the implementation ofthe algorithms [17] We did not use the Weka GUI toolbut developed a Weka library-based performance evaluationprogram in order to conduct the automatized experimentrepeatedly

5 Results

In total 32400 datasets (3 missing ratios times 3 imputationpatterns times 6 imputation methods times 100 trials) were imputedfor each of the 6 classifiers Thus in total we tested 226800datasets (32400 imputed dataset times 7 classifier methods) Theresults were divided by those for each dataset classificationalgorithm and imputation method for comparison in termsof performance

51 Datasets Figure 3 shows the performance of each impu-tation method for the six different datasets On the 119909-axisthree missing ratios represent the characteristics of missingdata and on the 119910-axis performance is indicated usingthe RMSE All results of three different variations of themissing data patterns and tested classification algorithmswere merged for each imputation method

For Iris data (Figure 3(a)) the mean imputation methodyielded the worst results and the group mean imputationmethod the best results

For Glass Identification data (Figure 3(b)) hot-deckimputation was the least effective method and predictivemean imputation was the best

For Liver Disorder data (Figure 3(c)) 119896-NN was the leasteffective and once again the predictive mean imputationmethod yielded the best results

For Ionosphere data (Figure 3(d)) hot-deckwas theworstand 119896-NN the best

For Wine data (Figure 3(e)) hot-deck was once again theleast effective method and predictive mean imputation thebest

For Statlog data (Figure 3(f)) unlike the other datasetsthe results varied based on the missing data ratio Howeverpredictive mean imputation was still the best method overalland hot-deck the worst

Figure 3 illustrates that the predictive mean imputa-tion method yielded the best results overall and hot-deckimputation the worst However no imputation method wasgenerally superior in all cases with any given dataset Forexample the 119896-NN method yielded the best performancefor the Ionosphere dataset but for the Liver Disordersdataset its performance was lowest In another example thegroup mean imputation method performed best for the Irisand Wine datasets but its performance was only averagefor other datasets Therefore the results were inconsistentand determining the best imputation method is impossibleThus the imputation method cannot be used as an accuratepredictor of performance Rather the performance must beinfluenced by other factors such as the interaction betweenthe characteristics of the dataset in terms of missing data andthe chosen imputation method

Mathematical Problems in Engineering 7

10000000000

08000000000

06000000000

04000000000

02000000000

00000000000

001 005 010

(a) Iris

001 005 010

06000000000

05000000000

04000000000

03000000000

02000000000

01000000000

00000000000

(b) Glass Identification

001 005 010

6000000000000

4000000000000

2000000000000

00000000000

(c) Liver Disorders

001 005 010

06000000000

05000000000

04000000000

03000000000

02000000000

01000000000

00000000000

(d) Ionosphere

1250000000000000

1000000000000000

750000000000000

500000000000000

250000000000000

00000000000

k-NN

001 005 010

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

k-MEANS CLUSTERING

LISTWISE DELETION

imputation

(e) Wine

60000000000000

50000000000000

40000000000000

30000000000000

20000000000000

10000000000000

00000000000

001 005 010

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

imputation

k-NNk-MEANS CLUSTERING

(f) Statlog Shuttle

Figure 3 Comparison of performances of imputation methods for each dataset

8 Mathematical Problems in Engineering

Table 6 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) mean imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus076lowastlowast minus075lowastlowast minus178lowastlowast minus072lowastlowast 115lowastlowast 007N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus032 minus048lowastlowast

C imbalance 117lowastlowast 239lowastlowast 264lowastlowast 525lowastlowast 163lowastlowast 198lowastlowast

R missing 051lowast 078lowastlowast 040 080lowastlowast 076lowastlowast 068lowastlowast

SE HS 249lowastlowast 285lowastlowast 186lowastlowast 277lowastlowast 335lowastlowast 245lowastlowast

SE VS minus009 minus013 minus006 minus013 minus016 minus010Spread minus382lowastlowast minus430lowastlowast minus261lowastlowast minus436lowastlowast minus452lowastlowast minus363lowastlowast

P missing dum1 minus049 minus038 minus038 minus037 minus045 minus038P missing dum2 minus002 014 002 011 001 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 7 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) group mean imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus068lowastlowast minus072lowastlowast minus179lowastlowast minus068lowastlowast 115lowastlowast 010N cases minus082lowastlowast minus050lowastlowast 011 minus018 minus034lowast minus047lowastlowast

C imbalance 115lowastlowast 228lowastlowast 260lowastlowast 517lowastlowast 156lowastlowast 197lowastlowast

R missing 050lowastlowast 085lowastlowast 043 084lowastlowast 095lowastlowast 066lowastlowast

SE HS 230lowastlowast 268lowastlowast 178lowastlowast 273lowastlowast 300lowastlowast 248lowastlowast

SE VS minus008 minus012 minus006 minus013 minus013 minus010Spread minus296lowastlowast minus439lowastlowast minus264lowastlowast minus443lowastlowast minus476lowastlowast minus382lowastlowast

P missing dum1 minus043 minus032 minus034 minus035 minus035 minus041P missing dum2 002 024 004 016 021 013Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

52 Classification Algorithm Figure 4 shows the perfor-mance of the classification algorithms by imputation methodand ratio of missing data As shown in the figure theperformance of each imputation method was similar and didnot vary depending on the ratio of missing data except forlistwise deletion For listwise deletion as the ratio of missingto complete data increased the performance deterioratedIn the listwise deletion method all records are deletedthat contain missing data therefore the number of deletedrecords increases as the ratio of missing data increases Thelow performance of this method can be explained based onthis fact

The differences in performance between imputationmethods were minor The figure displays these differencesby classification algorithm Using the Bayesian network andlogistic classifier methods significantly improved perfor-mance compared to other classifiers However the relation-ships among missing data imputation methods and classi-fiers remained to be explainedThus a regression analysis wasconducted

In Figure 4 the results suggest the following rules

(i) IF themissing rate increases AND IBK is used THENuse the GROUP MEAN IMPUTATION method

(ii) IF the missing rate increases AND the logistic clas-sifier method is used THEN use the HOT DECKmethod

(iii) IF the missing rate increases AND the regressionmethod is used THEN use the GROUP MEAN IM-PUTATION method

(iv) IF the missing rate increases AND the BayesNetmethod is used THEN use the GROUP MEAN IM-PUTATION method

(v) IF the missing rate increases AND the treesJ48method is used THEN use the 119896-NN method

53 Regression The results of the regression analysis arepresented in Tables 6 7 8 9 10 and 11 The analysis wasconducted using 900 datasets (3 missing ratios times 3 missing

Mathematical Problems in Engineering 9

05000000000

04500000000

04000000000

03500000000

03000000000

001 005 010

(a) Decision tree (J48)

001 005 010

04000000000

03500000000

03000000000

(b) BayesNet

001 005 010

04400000000

04200000000

04000000000

03800000000

(c) SMO (SVM)

001 005 010

04200000000

04000000000

03800000000

03600000000

03400000000

03200000000

03000000000

(d) Regression

001 005 010

04200000000

04000000000

03800000000

03600000000

03400000000

03200000000

03000000000

02800000000

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

imputation

k-NNk-MEANS CLUSTERING

(e) Logistic

001 005 010

04250000000

04000000000

03750000000

03500000000

03250000000

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

imputation

k-MEANS CLUSTERINGk-NN

(f) IBk (119896-nearest neighbor classifier)

Figure 4 Comparison of classifiers in terms of classification performance

10 Mathematical Problems in Engineering

Table 8 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) Predictive Mean Imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus076lowastlowast minus076lowastlowast minus178lowastlowast minus063lowastlowast 123lowastlowast 016N cases minus084lowastlowast minus049lowastlowast 012 minus017 minus034lowast minus047lowastlowast

C imbalance 117lowastlowast 242lowastlowast 263lowastlowast 523lowastlowast 153lowastlowast 198lowastlowast

R missing 050lowast 079lowastlowast 043 085lowastlowast 080lowastlowast 068lowastlowast

SE HS 223lowastlowast 279lowastlowast 182lowastlowast 268lowastlowast 322lowastlowast 242lowastlowast

SE VS minus008 minus013 minus006 minus013 minus015 minus009Spread minus328lowastlowast minus432lowastlowast minus262lowastlowast minus434lowastlowast minus465lowastlowast minus361lowastlowast

P missing dum1 minus042 minus035 minus034 minus028 minus044 minus036P missing dum2 008 012 004 018 007 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 9 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) Hot deck

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus073lowastlowast minus176lowastlowast minus071lowastlowast 115lowastlowast 007N cases minus081lowastlowast minus049lowastlowast 012 minus018 minus034lowast minus047lowastlowast

C imbalance 135lowastlowast 237lowastlowast 261lowastlowast 524lowastlowast 133lowastlowast 211lowastlowast

R missing 062lowastlowast 083lowastlowast 044 084lowastlowast 075lowastlowast 070lowastlowast

SE HS 225lowastlowast 275lowastlowast 183lowastlowast 271lowastlowast 313lowastlowast 254lowastlowast

SE VS minus009 minus013 minus006 minus013 minus014 minus010Spread minus365lowastlowast minus428lowastlowast minus265lowastlowast minus427lowastlowast minus441lowastlowast minus361lowastlowast

P missing dum1 minus035 minus037 minus034 minus033 minus048 minus038P missing dum2 012 015 004 012 minus004 009Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 10 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-NN

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus085lowastlowast minus079lowastlowast minus181lowastlowast minus068lowastlowast 122lowastlowast 006N cases minus083lowastlowast minus049lowastlowast 011 minus018 minus034lowast minus047lowastlowast

C imbalance 143lowastlowast 249lowastlowast 260lowastlowast 521lowastlowast 152lowastlowast 211lowastlowast

R missing 054lowast 078lowastlowast 041 085lowastlowast 075lowastlowast 071lowastlowast

SE HS 234lowastlowast 290lowastlowast 182lowastlowast 269lowastlowast 328lowastlowast 255lowastlowast

SE VS minus010 minus013 minus006 minus013 minus014 minus011Spread minus332lowastlowast minus427lowastlowast minus264lowastlowast minus431lowastlowast minus450lowastlowast minus369lowastlowast

P missing dum1 minus038 minus041 minus035 minus029 minus057 minus035P missing dum2 003 008 005 017 000 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Mathematical Problems in Engineering 11

Table 11 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-MEANS CLUSTERING

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus078lowastlowast minus181lowastlowast minus068lowastlowast 117lowastlowast 009N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus033 minus047lowastlowast

C imbalance 136lowastlowast 240lowastlowast 263lowastlowast 524lowastlowast 145lowastlowast 206lowastlowast

R missing 057lowast 079lowastlowast 041 084lowastlowast 079lowastlowast 057lowast

SE HS 236lowastlowast 289lowastlowast 183lowastlowast 271lowastlowast 315lowastlowast 264lowastlowast

SE VS minus009 minus013 minus006 minus013 minus014 minus011Spread minus362lowastlowast minus439lowastlowast minus262lowastlowast minus440lowastlowast minus474lowastlowast minus363lowastlowast

P missing dum1 minus037 minus042 minus036 minus032 minus038 minus046P missing dum2 002 013 001 014 009 004Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

patternstimes 100 trials) Each dataset was generated randomly tomeet the preconditionsWe conducted the performance eval-uation by randomly assigning each dataset to testtrainingsets at a 3 7 ratio The regression analysis included thecharacteristics of the datasets and the patterns of the missingvalues as independent variables Control variables such asthe type of classifier and imputation method were alsoincludedThe effects of the various characteristics of the dataand missing values on classifier performance (RMSE) wereanalyzed Three types of missing ratios were treated as twodummy variables (P missing dum1 2 00 01 10) Tables 6ndash11illustrate the results of the regression analysis of the variousimputation methods The results suggest the following rulesregardless of which imputation method is selected

(i) IF N attributes increases THEN use SMO(ii) IF N cases increases THEN use treesJ48(iii) IF C imbalance increases THEN use treesJ48(iv) IF R missing increases THEN use SMO(v) IF SE HS increases THEN use SMO(vi) IF Spread increases THEN use Logistic

Figure 5 displays the coefficient pattern of the decisiontree classifier for each imputation method Dataset char-acteristics are illustrated on the 119909-axis and the regressioncoefficients for each imputationmethod on the 119910-axis For allimputation methods except listwise deletion the classifiersrsquocoefficient patterns seemed similar However significantdifferences were found in the coefficient patterns using otheralgorithms For example for all imputationmethods a higherbeta coefficient of the number of attributes (N attributes)was observed for the logistics algorithm than for any otheralgorithm Thus the logistics algorithm exhibited the lowestperformance (highest RMSE) in terms of the number ofattributes In terms of the number of cases (N cases) SMOperformed the worst When the data were imbalanced theregressionmethod was the least effective one For themissingratio the regression method showed the lowest performance

except in comparison to listwise deletion and mean impu-tation For the horizontal scattered standard error (SE HS)SMO had the lowest performance For missing data spreadthe logistic classifier method had the lowest performance

Moreover for each single factor (eg spread) even ifthe results for two algorithms were the same their perfor-mance differed depending on which imputation method wasapplied For example for the decision tree (J48) algorithmthe mean imputation method had the most negative effect onclassification performance for horizontal scattered standarderror (SE HS) and spread while the listwise deletion andgroupmean imputationmethods had the least negative effect

The similar coefficient patterns shown in Figure 5 indicatethat the differences in impact of each imputation method onperformance were insignificant In order to determine theimpact of the classifiers more tests were needed Figure 6illustrates the coefficient patterns when the ratio of missingto complete data is 90 Under these circumstances thedistinction between imputationmethods according to datasetcharacteristics is significant For example very high or verylow beta coefficients may be observed for most datasetcharacteristics except the number of instances and classimbalance

Figure 7 shows the RMSE based on the ratio of missingdata for each imputation method As the ratio increases theperformance drops (RMSE increases) this is not an unex-pected result However as the ratio of missing to completedata increases the differences in performance between impu-tation methods become significant These results imply thatthe characteristics of the dataset andmissing values affect theperformance of the classifier algorithms Furthermore thepatterns of these effects differ depending on the imputationmethods and classifiers used

Lastly we estimate the accuracy (RMSE) of each methodby conducting a multiple regression analysis As shownin Table 12 the results confirmed a significant associationbetween the characteristics of the missing data and themethod of imputation with the performance of each clas-sification in terms of RMSE In total 226800 datasets (3

12 Mathematical Problems in Engineering

0

01

02

03At

trib

ute

Inst

ance

s

Dat

a im

bala

nce

Miss

ing

ratio

Hsc

atte

redn

ess

Vsc

atte

redn

ess

Spre

ad

Miss

ing

p1

Miss

ing

p2

minus04

minus03

minus02

minus01

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

GROUP MEAN IMPUTATIONHOT DECK IMPUTATIONk-MEANS CLUSTERINGk-NN

Figure 5 Coefficient pattern of the decision tree algorithm (RMSE)

0

02

04

06

08

Attr

ibut

e

Inst

ance

s

Dat

a im

bala

nce

Miss

ing

ratio

Hsc

atte

redn

ess

Vsc

atte

redn

ess

Spre

ad

minus06

minus04

minus02

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

GROUP MEAN IMPUTATIONHOT DECK IMPUTATION

k-NNk-MEANS CLUSTERING

Figure 6 Coefficient pattern of the decision tree algorithm basedon a 90 missing ratio (RMSE)

missing ratiostimes 3missing patternstimes 100 trialstimes 6 imputationmethods times 7 classification methods) were analyzed Theresults have at least two implications First we can predict theclassification accuracy for an unknown dataset with missingdata only if the data characteristics can be obtained Secondwe can establish general rules for selection of the optimalcombination of a classification algorithm and imputationalgorithm

Method of imputation

0490

0480

0470

0460

0450

0440

0430

0420

0410

0400

0390

0380

0370

0360

0350

0340

0330

0320

0310

0300

005 010 015 020 025 030 035 040 045 050

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETIONk-NNk-MEANS CLUSTERING

Figure 7 RMSE by ratio of missing data

Table 12 Factors influencing accuracy (RMSE) of classifier algo-rithms

Data characteristic 119861 Data characteristic 119861

(constant) 060lowastlowast M imputation dum1 012lowastlowast

R missing 083lowastlowast M imputation dum2 minus001lowast

SE HS minus005lowastlowast M imputation dum4 000SE VS 000lowastlowast M imputation dum5 000Spread 017lowastlowast M imputation dum6 001lowastlowast

N attributes minus008lowastlowast M imputation dum7 minus001lowast

C imbalance minus003lowastlowast P missing dum1 minus006lowastlowast

N cases 002lowastlowast P missing dum3 000Note 1 Dummy variables related to imputation methods LIST-WISE DELETION (M imputation dum1 = 1 others = 0) MEAN IMPUTA-TION (M imputation dum2 = 1 others = 0) GROUP MEAN IMPUTA-TION (M imputation dum3 = 1 others = 0) PREDICTIVE MEAN IMPU-TATION (M imputation dum4 = 1 others = 0) HOT DECK (M imputa-tion dum5 = 1 others = 0) 119896-NN (M imputation dum6 = 1 others =0) and 119896-MEANS CLUSTERING (M imputation dum7 = 1 others = 0)Missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0P missing dum3 = 0) monotone (P missing dum1 = 0 P missing dum2 = 1P missing dum3 = 0) and arbitrary (P missing dum1 = 1 P missing dum2= 1 P missing dum3 = 1) 119861 standard beta coefficientNote 2 lowast119875 lt 01 lowastlowast119875 lt 005

6 Conclusion

So far the prior research does not fully inform us of the fit-ness among datasets imputation methods and classificationalgorithmsTherefore this study ultimately aims to establish arule set which guides the classificationrecommender systemdevelopers to select the best classification algorithm based

Mathematical Problems in Engineering 13

on the datasets and imputation method To the best of ourknowledge ours is the first study inwhich the performance ofclassification algorithms with multiple dimensions (datasetsimputation data and imputationmethods) is discussed Priorresearch examines only one dimension [15] In addition asshown in Figure 3 since the performance of each methoddiffers according to the dataset the results of prior studies onimputation methods or classification algorithms depend onthe datasets on which they are based

In this paper factors affecting the performance of classi-fication algorithms were identified as follows characteristicsof missing values dataset features and imputation methodsUsing benchmark data and thousands of variations we foundthat several factors were significantly associated with theperformance of classification algorithms First as expectedthe results show that the missing data ratio and spread arenegatively associated with the performance of the classifica-tion algorithms Second and as a new finding to our bestknowledge we observed that the number of missing cellsin each record (SE HS) was more sensitive in affecting theclassification performance than the number of missing cellsin each feature (SE VS) Further we found it interesting thatthe number of features negatively affects the performance ofthe logistic algorithm while other factors do not

A disadvantage of logistic regression is its lack of flexibil-ityThe assumption of a linear dependency between predictorvariables and the log-odds ratio results in a linear decisionboundary in the instance space which is not valid in manyapplications Hence in the case of data imputation thelogistic algorithm must be avoided Next in response toconcerns about class imbalance which has been discussed indatamining research [18 19] we found that the degree of classimbalance was the most significant data feature to decreasethe predicted performance of classification algorithms Inparticular SMO was second to none in predicting SE HSin any imputation situation that is if a dataset has a highnumber of records in which the number of missing cells islarge then SMO is the best classification algorithm to apply

The results of this study suggest that optimal selectionof the imputation method according to the characteristicsof the dataset (especially the patterns of missing values andchoice of classification algorithm) improves the accuracy ofubiquitous computing applications Also a set of optimalcombinations may be derived using the estimated resultsMoreover we established a set of general rules based on theresults of this study These rules allow us to choose a tem-porally optimal combination of classification algorithm andimputation method thus increasing the agility of ubiquitouscomputing applications

Ubiquitous environments include a variety of forms ofsensor data from limited service conditions such as locationtime and status combining various different kinds of sensorsUsing the rules deducted in this study it is possible to selectthe optimal combination of imputation method and classi-fication algorithm for environments in which data changesdynamically For practitioners these rules for selection ofthe optimal pair of imputation method and classificationalgorithm may be developed for each situation dependingon the characteristics of datasets and their missing values

This set of rules will be useful for users and developersof intelligent systems (recommenders mobile applicationsagent systems etc) to choose the imputation method andclassification algorithm according to context while maintain-ing high prediction performance

In future studies the predicted performance of variousmethods can be testedwith actual datasets Although in priorresearch on classification algorithms multiple benchmarkdatasets from the UCI laboratory have been used to demon-strate the generality of the proposed method performanceevaluations in real settings would strengthen the significanceof the results Further for brevity we used a single perfor-mance metric RMSE in this study For example FP rate aswell as TP rate is very crucial when it comes to investigatingthe effect of class imbalance which is considered in thispaper as an independent variable Although the performanceresults would be very similar when using other metrics suchasmisclassification cost and total number of errors [20]morevaluable findings may be generated from a study includingthese other metrics

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the National Strategic RampDProgram for Industrial Technology (10041659) and funded bythe Ministry of Trade Industry and Energy (MOTIE)

References

[1] J Augusto V Callaghan D Cook A Kameas and I SatohldquoIntelligent environments a manifestordquo Human-Centric Com-puting and Information Sciences vol 3 no 12 pp 1ndash18 2013

[2] R Y Toledo Y C Mota andM G Borroto ldquoA regularity-basedpreprocessingmethod for collaborative recommender systemsrdquoJournal of Information Processing Systems vol 9 no 3 pp 435ndash460 2013

[3] G Batista and M Monard ldquoAn analysis of four missing datatreatment methods for supervised learningrdquo Applied ArtificialIntelligence vol 17 no 5-6 pp 519ndash533 2003

[4] R Shtykh and Q Jin ldquoA human-centric integrated approach toweb information search and sharingrdquoHuman-Centric Comput-ing and Information Sciences vol 1 no 1 pp 1ndash37 2011

[5] H Ihm ldquoMining consumer attitude and behaviorrdquo Journal ofConvergence vol 4 no 2 pp 29ndash35 2013

[6] Y Cho and S Moon ldquoWeighted mining frequent patternbased customers RFM score for personalized u-commercerecommendation systemrdquo Journal of Convergence vol 4 no 4pp 36ndash40 2013

[7] N Howard and E Cambria ldquoIntention awareness improvingupon situation awareness in human-centric environmentsrdquoHuman-Centric Computing and Information Sciences vol 3 no9 pp 1ndash17 2013

[8] L Liew B Lee Y Wang and W Cheah ldquoAerial images rectifi-cation using non-parametric approachrdquo Journal of Convergencevol 4 no 2 pp 15ndash21 2013

14 Mathematical Problems in Engineering

[9] K J Nishanth and V Ravi ldquoA computational intelligence basedonline data imputation method an application for bankingrdquoJournal of Information Processing Systems vol 9 no 4 pp 633ndash650 2013

[10] P Kang ldquoLocally linear reconstruction based missing valueimputation for supervised learningrdquo Neurocomputing vol 118pp 65ndash78 2013

[11] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002

[12] H Finch ldquoEstimation of item response theory parameters in thepresence of missing datardquo Journal of Educational Measurementvol 45 no 3 pp 225ndash245 2008

[13] S J Press and S Wilson ldquoChoosing between logistic regressionand discriminant analysisrdquo Journal of the American StatisticalAssociation vol 73 no 364 pp 699ndash705 1978

[14] E Frank YWang S Inglis G Holmes and I HWitten ldquoUsingmodel trees for classificationrdquo Machine Learning vol 32 no 1pp 63ndash76 1998

[15] O Kwon and J M Sim ldquoEffects of data set features on theperformances of classification algorithmsrdquo Expert Systems withApplications vol 40 no 5 pp 1847ndash1857 2013

[16] E Namsrai T Munkhdalai M Li J-H Shin O-E Namsraiand K H Ryu ldquoA feature selection-based ensemble methodfor arrhythmia classificationrdquo Journal of Information ProcessingSystems vol 9 no 1 pp 31ndash40 2013

[17] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Morgan Kaufmann San Fran-cisco Calif USA 2nd edition 2005

[18] M Galar A Fernandez E Barrenechea H Bustince and FHerrera ldquoA review on ensembles for the class imbalance prob-lem bagging- boosting- and hybrid-based approachesrdquo IEEETransactions on Systems Man and Cybernetics C Applicationsand Reviews vol 42 no 4 pp 463ndash484 2012

[19] Q Yang and X Wu ldquo10 challenging problems in data miningresearchrdquo International Journal of Information Technology ampDecision Making vol 5 no 4 pp 597ndash604 2006

[20] Z-H Zhou and X-Y Liu ldquoTraining cost-sensitive neural net-works with methods addressing the class imbalance problemrdquoIEEE Transactions on Knowledge and Data Engineering vol 18no 1 pp 63ndash77 2006



Figure 3: Comparison of performances of imputation methods for each dataset ((a) Iris; (b) Glass Identification; (c) Liver Disorders; (d) Ionosphere; (e) Wine; (f) Statlog Shuttle). Each panel plots performance against the missing ratio (0.01, 0.05, 0.10) for listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck, k-NN, and k-means clustering.


Table 6: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): mean imputation.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.76**     −0.75**    −1.78**   −0.72**       1.15**     0.07
N cases               −0.79**     −0.49**     0.12     −0.17        −0.32      −0.48**
C imbalance            1.17**      2.39**     2.64**    5.25**       1.63**     1.98**
R missing              0.51*       0.78**     0.40      0.80**       0.76**     0.68**
SE HS                  2.49**      2.85**     1.86**    2.77**       3.35**     2.45**
SE VS                 −0.09       −0.13      −0.06     −0.13        −0.16      −0.10
Spread                −3.82**     −4.30**    −2.61**   −4.36**      −4.52**    −3.63**
P missing dum1        −0.49       −0.38      −0.38     −0.37        −0.45      −0.38
P missing dum2        −0.02        0.14       0.02      0.11         0.01       0.11

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.

Table 7: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): group mean imputation.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.68**     −0.72**    −1.79**   −0.68**       1.15**     0.10
N cases               −0.82**     −0.50**     0.11     −0.18        −0.34*     −0.47**
C imbalance            1.15**      2.28**     2.60**    5.17**       1.56**     1.97**
R missing              0.50**      0.85**     0.43      0.84**       0.95**     0.66**
SE HS                  2.30**      2.68**     1.78**    2.73**       3.00**     2.48**
SE VS                 −0.08       −0.12      −0.06     −0.13        −0.13      −0.10
Spread                −2.96**     −4.39**    −2.64**   −4.43**      −4.76**    −3.82**
P missing dum1        −0.43       −0.32      −0.34     −0.35        −0.35      −0.41
P missing dum2         0.02        0.24       0.04      0.16         0.21       0.13

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.

5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary depending on the ratio of missing data, except for listwise deletion: as the ratio of missing to complete data increased, performance deteriorated. In the listwise deletion method, all records that contain missing data are deleted, so the number of deleted records increases as the ratio of missing data increases; the low performance of this method can be explained by this fact.
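This mechanism is easy to see on toy data. The following sketch is our own illustration, not part of the study; all names are hypothetical. It contrasts listwise deletion with simple mean imputation and shows how the pool of usable records shrinks as the missing ratio grows:

    import numpy as np
    import pandas as pd

    def inject_missing(df: pd.DataFrame, ratio: float, seed: int = 0) -> pd.DataFrame:
        """Blank out roughly `ratio` of all cells at random (arbitrary pattern)."""
        rng = np.random.default_rng(seed)
        return df.mask(rng.random(df.shape) < ratio)

    # Toy complete dataset: 1,000 records, 8 numeric attributes.
    complete = pd.DataFrame(np.random.default_rng(1).normal(size=(1000, 8)),
                            columns=[f"x{i}" for i in range(8)])

    for ratio in (0.01, 0.05, 0.10):
        incomplete = inject_missing(complete, ratio)
        listwise = incomplete.dropna()                  # drop any record with a missing cell
        imputed = incomplete.fillna(incomplete.mean())  # replace by column means
        print(f"missing ratio {ratio:.2f}: listwise keeps {len(listwise)} of "
              f"{len(complete)} records; mean imputation keeps {len(imputed)}")

Even at a 10% cell-level missing ratio, listwise deletion discards most of the 8-attribute records, which is consistent with its deteriorating curve in Figure 4.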

The differences in performance between imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifiers significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.

In Figure 4, the results suggest the following rules (a sketch encoding them follows the list):

(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.

(ii) IF the missing rate increases AND the logistic classifier is used, THEN use the HOT DECK method.

(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.

(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.

(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
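As a minimal sketch of how a system developer might operationalize rules (i)–(v), the function below (our own illustration; the paper provides no code, and the function name is hypothetical) maps a classifier to the imputation method suggested when the missing rate is rising:

    from typing import Optional

    def recommend_imputation(classifier: str) -> Optional[str]:
        """Rules (i)-(v) of Section 5.2: imputation method to prefer for a
        given classifier when the missing rate increases; None if no rule."""
        rules = {
            "IBk": "GROUP MEAN IMPUTATION",
            "Logistic": "HOT DECK",
            "Regression": "GROUP MEAN IMPUTATION",
            "BayesNet": "GROUP MEAN IMPUTATION",
            "trees.J48": "k-NN",
        }
        return rules.get(classifier)

    print(recommend_imputation("Logistic"))  # -> HOT DECK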

5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing patterns × 100 trials).


Figure 4: Comparison of classifiers in terms of classification performance ((a) decision tree (J48); (b) BayesNet; (c) SMO (SVM); (d) Regression; (e) Logistic; (f) IBk (k-nearest neighbor classifier)). Each panel plots performance against the missing ratio (0.01, 0.05, 0.10) for listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck, k-NN, and k-means clustering.


Table 8: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): predictive mean imputation.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.76**     −0.76**    −1.78**   −0.63**       1.23**     0.16
N cases               −0.84**     −0.49**     0.12     −0.17        −0.34*     −0.47**
C imbalance            1.17**      2.42**     2.63**    5.23**       1.53**     1.98**
R missing              0.50*       0.79**     0.43      0.85**       0.80**     0.68**
SE HS                  2.23**      2.79**     1.82**    2.68**       3.22**     2.42**
SE VS                 −0.08       −0.13      −0.06     −0.13        −0.15      −0.09
Spread                −3.28**     −4.32**    −2.62**   −4.34**      −4.65**    −3.61**
P missing dum1        −0.42       −0.35      −0.34     −0.28        −0.44      −0.36
P missing dum2         0.08        0.12       0.04      0.18         0.07       0.11

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.

Table 9: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): hot deck.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.80**     −0.73**    −1.76**   −0.71**       1.15**     0.07
N cases               −0.81**     −0.49**     0.12     −0.18        −0.34*     −0.47**
C imbalance            1.35**      2.37**     2.61**    5.24**       1.33**     2.11**
R missing              0.62**      0.83**     0.44      0.84**       0.75**     0.70**
SE HS                  2.25**      2.75**     1.83**    2.71**       3.13**     2.54**
SE VS                 −0.09       −0.13      −0.06     −0.13        −0.14      −0.10
Spread                −3.65**     −4.28**    −2.65**   −4.27**      −4.41**    −3.61**
P missing dum1        −0.35       −0.37      −0.34     −0.33        −0.48      −0.38
P missing dum2         0.12        0.15       0.04      0.12        −0.04       0.09

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.

Table 10: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): k-NN.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.85**     −0.79**    −1.81**   −0.68**       1.22**     0.06
N cases               −0.83**     −0.49**     0.11     −0.18        −0.34*     −0.47**
C imbalance            1.43**      2.49**     2.60**    5.21**       1.52**     2.11**
R missing              0.54*       0.78**     0.41      0.85**       0.75**     0.71**
SE HS                  2.34**      2.90**     1.82**    2.69**       3.28**     2.55**
SE VS                 −0.10       −0.13      −0.06     −0.13        −0.14      −0.11
Spread                −3.32**     −4.27**    −2.64**   −4.31**      −4.50**    −3.69**
P missing dum1        −0.38       −0.41      −0.35     −0.29        −0.57      −0.35
P missing dum2         0.03        0.08       0.05      0.17         0.00       0.11

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.


Table 11: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): k-means clustering.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.80**     −0.78**    −1.81**   −0.68**       1.17**     0.09
N cases               −0.79**     −0.49**     0.12     −0.17        −0.33      −0.47**
C imbalance            1.36**      2.40**     2.63**    5.24**       1.45**     2.06**
R missing              0.57*       0.79**     0.41      0.84**       0.79**     0.57*
SE HS                  2.36**      2.89**     1.83**    2.71**       3.15**     2.64**
SE VS                 −0.09       −0.13      −0.06     −0.13        −0.14      −0.11
Spread                −3.62**     −4.39**    −2.62**   −4.40**      −4.74**    −3.63**
P missing dum1        −0.37       −0.42      −0.36     −0.32        −0.38      −0.46
P missing dum2         0.02        0.13       0.01      0.14         0.09       0.04

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.

Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables; control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed, with the three types of missing patterns treated as two dummy variables (P missing dum1/dum2: 00, 01, 10). Tables 6–11 present the results of the regression analysis for the various imputation methods. Regardless of which imputation method is selected, the results suggest the following rules (combined with the Section 5.2 rules in a sketch after the list):

(i) IF N attributes increases, THEN use SMO.

(ii) IF N cases increases, THEN use trees.J48.

(iii) IF C imbalance increases, THEN use trees.J48.

(iv) IF R missing increases, THEN use SMO.

(v) IF SE HS increases, THEN use SMO.

(vi) IF Spread increases, THEN use Logistic.
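These classifier rules compose naturally with the imputation rules of Section 5.2. The sketch below is again our own illustration, not the paper's code; the BayesNet fallback for factors without a rule is our assumption, motivated by the strong BayesNet results in Figure 4. It picks a classifier from the dominant dataset characteristic and then the imputation method suggested for that classifier:

    from typing import Optional, Tuple

    CLASSIFIER_RULES = {  # Section 5.3 rules (i)-(vi)
        "N attributes": "SMO", "N cases": "trees.J48",
        "C imbalance": "trees.J48", "R missing": "SMO",
        "SE HS": "SMO", "Spread": "Logistic",
    }
    IMPUTATION_RULES = {  # Section 5.2 rules (i)-(v)
        "IBk": "GROUP MEAN IMPUTATION", "Logistic": "HOT DECK",
        "Regression": "GROUP MEAN IMPUTATION",
        "BayesNet": "GROUP MEAN IMPUTATION", "trees.J48": "k-NN",
    }

    def recommend_pair(dominant_factor: str) -> Tuple[str, Optional[str]]:
        """Pick a classifier for the dominant dataset characteristic, then
        the imputation method suggested for it as missing data grows."""
        classifier = CLASSIFIER_RULES.get(dominant_factor, "BayesNet")  # fallback is ours
        return classifier, IMPUTATION_RULES.get(classifier)

    print(recommend_pair("Spread"))  # -> ('Logistic', 'HOT DECK')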

Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method, with the dataset characteristics on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appeared similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N attributes) was observed for the logistic algorithm than for any other algorithm; thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scattered standard error (SE HS), SMO had the lowest performance. For missing data spread, the logistic classifier had the lowest performance.

Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, mean imputation had the most negative effect on classification performance for horizontal scattered standard error (SE HS) and spread, while listwise deletion and group mean imputation had the least negative effect.

The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant; for example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.

Figure 7 shows the RMSE by ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms and, furthermore, that the patterns of these effects differ depending on the imputation methods and classifiers used.

Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classifier in terms of RMSE, on the other. In total, 226,800 datasets (3 missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed.


Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis lists the dataset characteristics (attributes, instances, data imbalance, missing ratio, horizontal scatteredness, vertical scatteredness, spread, and missing patterns p1 and p2); the y-axis (about −0.4 to 0.3) shows the regression coefficient for each imputation method (listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck imputation, k-NN, k-means clustering).

Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). The x-axis lists the dataset characteristics (attributes, instances, data imbalance, missing ratio, horizontal scatteredness, vertical scatteredness, spread); the y-axis (about −0.6 to 0.8) shows the regression coefficient for each imputation method (listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck imputation, k-NN, k-means clustering).

The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided its data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and an imputation algorithm.
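As a sketch of how such a prediction model could be fitted, the standardized betas reported in Table 12 correspond to an ordinary least squares fit on z-scored variables. This is not the paper's code: the file name experiment_runs.csv and its column names (chosen to mirror Table 12) are hypothetical, and z-scoring both sides is one common way to obtain standardized coefficients:

    import pandas as pd
    import statsmodels.api as sm

    # One row per experimental run: data characteristics, dummy-coded
    # imputation method and missing pattern, and the observed RMSE.
    runs = pd.read_csv("experiment_runs.csv")  # hypothetical file

    y = runs["RMSE"]
    X = runs[["R_missing", "SE_HS", "SE_VS", "Spread", "N_attributes",
              "C_imbalance", "N_cases",
              "M_imputation_dum1", "M_imputation_dum2", "M_imputation_dum4",
              "M_imputation_dum5", "M_imputation_dum6", "M_imputation_dum7",
              "P_missing_dum1", "P_missing_dum3"]]

    # Standardized (beta) coefficients: z-score predictors and response.
    Xz = (X - X.mean()) / X.std()
    yz = (y - y.mean()) / y.std()
    fit = sm.OLS(yz, sm.add_constant(Xz)).fit()
    print(fit.params)    # comparable to the B column of Table 12
    print(fit.pvalues)   # significance levels (* / ** in the table)

Once fitted, the same linear form can be evaluated on the characteristics of a new dataset to anticipate RMSE before any classifier is trained.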

Figure 7: RMSE by ratio of missing data (x-axis: 0.05 to 0.50; y-axis: RMSE, about 0.30 to 0.49) for each method of imputation (listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck, k-NN, k-means clustering).

Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic   B         Data characteristic     B
(constant)            0.60**    M imputation dum1       0.12**
R missing             0.83**    M imputation dum2      −0.01*
SE HS                −0.05**    M imputation dum4       0.00
SE VS                 0.00**    M imputation dum5       0.00
Spread                0.17**    M imputation dum6       0.01**
N attributes         −0.08**    M imputation dum7      −0.01*
C imbalance          −0.03**    P missing dum1         −0.06**
N cases               0.02**    P missing dum3          0.00

Note 1: Dummy variables related to imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0); MEAN IMPUTATION (M imputation dum2 = 1, others = 0); GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0); PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0); HOT DECK (M imputation dum5 = 1, others = 0); k-NN (M imputation dum6 = 1, others = 0); k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0); monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0); arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standard beta coefficient.
Note 2: *P < 0.1; **P < 0.05.

6. Conclusion

So far, prior research has not fully informed us of the fitness among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides developers of classification and recommender systems in selecting the best classification algorithm based on the dataset and the imputation method.


To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, imputation data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.

In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE HS) affected classification performance more strongly than the number of missing cells in each feature (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while the other factors do not.
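To make the distinction between SE HS and SE VS concrete, one plausible reading (our assumption; the paper's exact formula is not reproduced in this section) measures how missing cells are spread across records versus across features:

    import pandas as pd

    def scatteredness(df: pd.DataFrame) -> tuple:
        """Assumed proxies for horizontal (per-record) and vertical
        (per-feature) scatteredness of missing cells; not the paper's
        exact definition."""
        missing = df.isna()
        horizontal = missing.sum(axis=1).std()  # variation of missing counts per record
        vertical = missing.sum(axis=0).std()    # variation of missing counts per feature
        return float(horizontal), float(vertical)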

A disadvantage of logistic regression is its lack of flexibility: the assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm must be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most decreased the predicted performance of the classification algorithms. In particular, SMO was second to none under high SE HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.

The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study; these rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.

Ubiquitous environments include a variety of forms of sensor data gathered under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data changes dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.

This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.

In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI Machine Learning Repository to demonstrate the generality of a proposed method, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate, as well as the TP rate, is crucial when investigating the effect of class imbalance, which this paper treats as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated by a study that includes these other metrics.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Strategic R&D Program for Industrial Technology (10041659) and funded by the Ministry of Trade, Industry and Energy (MOTIE).

References

[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.

[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.

[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.

[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.

[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.

[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.

[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.

[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.

[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.

[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.

[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.

[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.

[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.

[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.

[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.

[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.

[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.

[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.

[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.


Page 8: Research Article Missing Values and Optimal Selection of an ...imputation method is the most frequently used statistical method developed to deal with missing data problems. It is

8 Mathematical Problems in Engineering

Table 6 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) mean imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus076lowastlowast minus075lowastlowast minus178lowastlowast minus072lowastlowast 115lowastlowast 007N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus032 minus048lowastlowast

C imbalance 117lowastlowast 239lowastlowast 264lowastlowast 525lowastlowast 163lowastlowast 198lowastlowast

R missing 051lowast 078lowastlowast 040 080lowastlowast 076lowastlowast 068lowastlowast

SE HS 249lowastlowast 285lowastlowast 186lowastlowast 277lowastlowast 335lowastlowast 245lowastlowast

SE VS minus009 minus013 minus006 minus013 minus016 minus010Spread minus382lowastlowast minus430lowastlowast minus261lowastlowast minus436lowastlowast minus452lowastlowast minus363lowastlowast

P missing dum1 minus049 minus038 minus038 minus037 minus045 minus038P missing dum2 minus002 014 002 011 001 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 7 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) group mean imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus068lowastlowast minus072lowastlowast minus179lowastlowast minus068lowastlowast 115lowastlowast 010N cases minus082lowastlowast minus050lowastlowast 011 minus018 minus034lowast minus047lowastlowast

C imbalance 115lowastlowast 228lowastlowast 260lowastlowast 517lowastlowast 156lowastlowast 197lowastlowast

R missing 050lowastlowast 085lowastlowast 043 084lowastlowast 095lowastlowast 066lowastlowast

SE HS 230lowastlowast 268lowastlowast 178lowastlowast 273lowastlowast 300lowastlowast 248lowastlowast

SE VS minus008 minus012 minus006 minus013 minus013 minus010Spread minus296lowastlowast minus439lowastlowast minus264lowastlowast minus443lowastlowast minus476lowastlowast minus382lowastlowast

P missing dum1 minus043 minus032 minus034 minus035 minus035 minus041P missing dum2 002 024 004 016 021 013Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

52 Classification Algorithm Figure 4 shows the perfor-mance of the classification algorithms by imputation methodand ratio of missing data As shown in the figure theperformance of each imputation method was similar and didnot vary depending on the ratio of missing data except forlistwise deletion For listwise deletion as the ratio of missingto complete data increased the performance deterioratedIn the listwise deletion method all records are deletedthat contain missing data therefore the number of deletedrecords increases as the ratio of missing data increases Thelow performance of this method can be explained based onthis fact

The differences in performance between imputationmethods were minor The figure displays these differencesby classification algorithm Using the Bayesian network andlogistic classifier methods significantly improved perfor-mance compared to other classifiers However the relation-ships among missing data imputation methods and classi-fiers remained to be explainedThus a regression analysis wasconducted

In Figure 4 the results suggest the following rules

(i) IF themissing rate increases AND IBK is used THENuse the GROUP MEAN IMPUTATION method

(ii) IF the missing rate increases AND the logistic clas-sifier method is used THEN use the HOT DECKmethod

(iii) IF the missing rate increases AND the regressionmethod is used THEN use the GROUP MEAN IM-PUTATION method

(iv) IF the missing rate increases AND the BayesNetmethod is used THEN use the GROUP MEAN IM-PUTATION method

(v) IF the missing rate increases AND the treesJ48method is used THEN use the 119896-NN method

53 Regression The results of the regression analysis arepresented in Tables 6 7 8 9 10 and 11 The analysis wasconducted using 900 datasets (3 missing ratios times 3 missing

Mathematical Problems in Engineering 9

05000000000

04500000000

04000000000

03500000000

03000000000

001 005 010

(a) Decision tree (J48)

001 005 010

04000000000

03500000000

03000000000

(b) BayesNet

001 005 010

04400000000

04200000000

04000000000

03800000000

(c) SMO (SVM)

001 005 010

04200000000

04000000000

03800000000

03600000000

03400000000

03200000000

03000000000

(d) Regression

001 005 010

04200000000

04000000000

03800000000

03600000000

03400000000

03200000000

03000000000

02800000000

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

imputation

k-NNk-MEANS CLUSTERING

(e) Logistic

001 005 010

04250000000

04000000000

03750000000

03500000000

03250000000

M

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

imputation

k-MEANS CLUSTERINGk-NN

(f) IBk (119896-nearest neighbor classifier)

Figure 4 Comparison of classifiers in terms of classification performance

10 Mathematical Problems in Engineering

Table 8 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) Predictive Mean Imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus076lowastlowast minus076lowastlowast minus178lowastlowast minus063lowastlowast 123lowastlowast 016N cases minus084lowastlowast minus049lowastlowast 012 minus017 minus034lowast minus047lowastlowast

C imbalance 117lowastlowast 242lowastlowast 263lowastlowast 523lowastlowast 153lowastlowast 198lowastlowast

R missing 050lowast 079lowastlowast 043 085lowastlowast 080lowastlowast 068lowastlowast

SE HS 223lowastlowast 279lowastlowast 182lowastlowast 268lowastlowast 322lowastlowast 242lowastlowast

SE VS minus008 minus013 minus006 minus013 minus015 minus009Spread minus328lowastlowast minus432lowastlowast minus262lowastlowast minus434lowastlowast minus465lowastlowast minus361lowastlowast

P missing dum1 minus042 minus035 minus034 minus028 minus044 minus036P missing dum2 008 012 004 018 007 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 9 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) Hot deck

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus073lowastlowast minus176lowastlowast minus071lowastlowast 115lowastlowast 007N cases minus081lowastlowast minus049lowastlowast 012 minus018 minus034lowast minus047lowastlowast

C imbalance 135lowastlowast 237lowastlowast 261lowastlowast 524lowastlowast 133lowastlowast 211lowastlowast

R missing 062lowastlowast 083lowastlowast 044 084lowastlowast 075lowastlowast 070lowastlowast

SE HS 225lowastlowast 275lowastlowast 183lowastlowast 271lowastlowast 313lowastlowast 254lowastlowast

SE VS minus009 minus013 minus006 minus013 minus014 minus010Spread minus365lowastlowast minus428lowastlowast minus265lowastlowast minus427lowastlowast minus441lowastlowast minus361lowastlowast

P missing dum1 minus035 minus037 minus034 minus033 minus048 minus038P missing dum2 012 015 004 012 minus004 009Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 10 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-NN

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus085lowastlowast minus079lowastlowast minus181lowastlowast minus068lowastlowast 122lowastlowast 006N cases minus083lowastlowast minus049lowastlowast 011 minus018 minus034lowast minus047lowastlowast

C imbalance 143lowastlowast 249lowastlowast 260lowastlowast 521lowastlowast 152lowastlowast 211lowastlowast

R missing 054lowast 078lowastlowast 041 085lowastlowast 075lowastlowast 071lowastlowast

SE HS 234lowastlowast 290lowastlowast 182lowastlowast 269lowastlowast 328lowastlowast 255lowastlowast

SE VS minus010 minus013 minus006 minus013 minus014 minus011Spread minus332lowastlowast minus427lowastlowast minus264lowastlowast minus431lowastlowast minus450lowastlowast minus369lowastlowast

P missing dum1 minus038 minus041 minus035 minus029 minus057 minus035P missing dum2 003 008 005 017 000 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Mathematical Problems in Engineering 11

Table 11 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-MEANS CLUSTERING

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus078lowastlowast minus181lowastlowast minus068lowastlowast 117lowastlowast 009N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus033 minus047lowastlowast

C imbalance 136lowastlowast 240lowastlowast 263lowastlowast 524lowastlowast 145lowastlowast 206lowastlowast

R missing 057lowast 079lowastlowast 041 084lowastlowast 079lowastlowast 057lowast

SE HS 236lowastlowast 289lowastlowast 183lowastlowast 271lowastlowast 315lowastlowast 264lowastlowast

SE VS minus009 minus013 minus006 minus013 minus014 minus011Spread minus362lowastlowast minus439lowastlowast minus262lowastlowast minus440lowastlowast minus474lowastlowast minus363lowastlowast

P missing dum1 minus037 minus042 minus036 minus032 minus038 minus046P missing dum2 002 013 001 014 009 004Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

patternstimes 100 trials) Each dataset was generated randomly tomeet the preconditionsWe conducted the performance eval-uation by randomly assigning each dataset to testtrainingsets at a 3 7 ratio The regression analysis included thecharacteristics of the datasets and the patterns of the missingvalues as independent variables Control variables such asthe type of classifier and imputation method were alsoincludedThe effects of the various characteristics of the dataand missing values on classifier performance (RMSE) wereanalyzed Three types of missing ratios were treated as twodummy variables (P missing dum1 2 00 01 10) Tables 6ndash11illustrate the results of the regression analysis of the variousimputation methods The results suggest the following rulesregardless of which imputation method is selected

(i) IF N attributes increases THEN use SMO(ii) IF N cases increases THEN use treesJ48(iii) IF C imbalance increases THEN use treesJ48(iv) IF R missing increases THEN use SMO(v) IF SE HS increases THEN use SMO(vi) IF Spread increases THEN use Logistic

Figure 5 displays the coefficient pattern of the decisiontree classifier for each imputation method Dataset char-acteristics are illustrated on the 119909-axis and the regressioncoefficients for each imputationmethod on the 119910-axis For allimputation methods except listwise deletion the classifiersrsquocoefficient patterns seemed similar However significantdifferences were found in the coefficient patterns using otheralgorithms For example for all imputationmethods a higherbeta coefficient of the number of attributes (N attributes)was observed for the logistics algorithm than for any otheralgorithm Thus the logistics algorithm exhibited the lowestperformance (highest RMSE) in terms of the number ofattributes In terms of the number of cases (N cases) SMOperformed the worst When the data were imbalanced theregressionmethod was the least effective one For themissingratio the regression method showed the lowest performance

except in comparison to listwise deletion and mean impu-tation For the horizontal scattered standard error (SE HS)SMO had the lowest performance For missing data spreadthe logistic classifier method had the lowest performance

Moreover for each single factor (eg spread) even ifthe results for two algorithms were the same their perfor-mance differed depending on which imputation method wasapplied For example for the decision tree (J48) algorithmthe mean imputation method had the most negative effect onclassification performance for horizontal scattered standarderror (SE HS) and spread while the listwise deletion andgroupmean imputationmethods had the least negative effect

The similar coefficient patterns shown in Figure 5 indicatethat the differences in impact of each imputation method onperformance were insignificant In order to determine theimpact of the classifiers more tests were needed Figure 6illustrates the coefficient patterns when the ratio of missingto complete data is 90 Under these circumstances thedistinction between imputationmethods according to datasetcharacteristics is significant For example very high or verylow beta coefficients may be observed for most datasetcharacteristics except the number of instances and classimbalance

Figure 7 shows the RMSE based on the ratio of missingdata for each imputation method As the ratio increases theperformance drops (RMSE increases) this is not an unex-pected result However as the ratio of missing to completedata increases the differences in performance between impu-tation methods become significant These results imply thatthe characteristics of the dataset andmissing values affect theperformance of the classifier algorithms Furthermore thepatterns of these effects differ depending on the imputationmethods and classifiers used

Lastly we estimate the accuracy (RMSE) of each methodby conducting a multiple regression analysis As shownin Table 12 the results confirmed a significant associationbetween the characteristics of the missing data and themethod of imputation with the performance of each clas-sification in terms of RMSE In total 226800 datasets (3

12 Mathematical Problems in Engineering

0

01

02

03At

trib

ute

Inst

ance

s

Dat

a im

bala

nce

Miss

ing

ratio

Hsc

atte

redn

ess

Vsc

atte

redn

ess

Spre

ad

Miss

ing

p1

Miss

ing

p2

minus04

minus03

minus02

minus01

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

GROUP MEAN IMPUTATIONHOT DECK IMPUTATIONk-MEANS CLUSTERINGk-NN

Figure 5 Coefficient pattern of the decision tree algorithm (RMSE)

0

02

04

06

08

Attr

ibut

e

Inst

ance

s

Dat

a im

bala

nce

Miss

ing

ratio

Hsc

atte

redn

ess

Vsc

atte

redn

ess

Spre

ad

minus06

minus04

minus02

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

GROUP MEAN IMPUTATIONHOT DECK IMPUTATION

k-NNk-MEANS CLUSTERING

Figure 6 Coefficient pattern of the decision tree algorithm basedon a 90 missing ratio (RMSE)

missing ratiostimes 3missing patternstimes 100 trialstimes 6 imputationmethods times 7 classification methods) were analyzed Theresults have at least two implications First we can predict theclassification accuracy for an unknown dataset with missingdata only if the data characteristics can be obtained Secondwe can establish general rules for selection of the optimalcombination of a classification algorithm and imputationalgorithm

Method of imputation

0490

0480

0470

0460

0450

0440

0430

0420

0410

0400

0390

0380

0370

0360

0350

0340

0330

0320

0310

0300

005 010 015 020 025 030 035 040 045 050

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETIONk-NNk-MEANS CLUSTERING

Figure 7 RMSE by ratio of missing data

Table 12 Factors influencing accuracy (RMSE) of classifier algo-rithms

Data characteristic 119861 Data characteristic 119861

(constant) 060lowastlowast M imputation dum1 012lowastlowast

R missing 083lowastlowast M imputation dum2 minus001lowast

SE HS minus005lowastlowast M imputation dum4 000SE VS 000lowastlowast M imputation dum5 000Spread 017lowastlowast M imputation dum6 001lowastlowast

N attributes minus008lowastlowast M imputation dum7 minus001lowast

C imbalance minus003lowastlowast P missing dum1 minus006lowastlowast

N cases 002lowastlowast P missing dum3 000Note 1 Dummy variables related to imputation methods LIST-WISE DELETION (M imputation dum1 = 1 others = 0) MEAN IMPUTA-TION (M imputation dum2 = 1 others = 0) GROUP MEAN IMPUTA-TION (M imputation dum3 = 1 others = 0) PREDICTIVE MEAN IMPU-TATION (M imputation dum4 = 1 others = 0) HOT DECK (M imputa-tion dum5 = 1 others = 0) 119896-NN (M imputation dum6 = 1 others =0) and 119896-MEANS CLUSTERING (M imputation dum7 = 1 others = 0)Missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0P missing dum3 = 0) monotone (P missing dum1 = 0 P missing dum2 = 1P missing dum3 = 0) and arbitrary (P missing dum1 = 1 P missing dum2= 1 P missing dum3 = 1) 119861 standard beta coefficientNote 2 lowast119875 lt 01 lowastlowast119875 lt 005

6 Conclusion

So far the prior research does not fully inform us of the fit-ness among datasets imputation methods and classificationalgorithmsTherefore this study ultimately aims to establish arule set which guides the classificationrecommender systemdevelopers to select the best classification algorithm based

Mathematical Problems in Engineering 13

on the datasets and imputation method To the best of ourknowledge ours is the first study inwhich the performance ofclassification algorithms with multiple dimensions (datasetsimputation data and imputationmethods) is discussed Priorresearch examines only one dimension [15] In addition asshown in Figure 3 since the performance of each methoddiffers according to the dataset the results of prior studies onimputation methods or classification algorithms depend onthe datasets on which they are based

In this paper factors affecting the performance of classi-fication algorithms were identified as follows characteristicsof missing values dataset features and imputation methodsUsing benchmark data and thousands of variations we foundthat several factors were significantly associated with theperformance of classification algorithms First as expectedthe results show that the missing data ratio and spread arenegatively associated with the performance of the classifica-tion algorithms Second and as a new finding to our bestknowledge we observed that the number of missing cellsin each record (SE HS) was more sensitive in affecting theclassification performance than the number of missing cellsin each feature (SE VS) Further we found it interesting thatthe number of features negatively affects the performance ofthe logistic algorithm while other factors do not

A disadvantage of logistic regression is its lack of flexibil-ityThe assumption of a linear dependency between predictorvariables and the log-odds ratio results in a linear decisionboundary in the instance space which is not valid in manyapplications Hence in the case of data imputation thelogistic algorithm must be avoided Next in response toconcerns about class imbalance which has been discussed indatamining research [18 19] we found that the degree of classimbalance was the most significant data feature to decreasethe predicted performance of classification algorithms Inparticular SMO was second to none in predicting SE HSin any imputation situation that is if a dataset has a highnumber of records in which the number of missing cells islarge then SMO is the best classification algorithm to apply

The results of this study suggest that optimal selectionof the imputation method according to the characteristicsof the dataset (especially the patterns of missing values andchoice of classification algorithm) improves the accuracy ofubiquitous computing applications Also a set of optimalcombinations may be derived using the estimated resultsMoreover we established a set of general rules based on theresults of this study These rules allow us to choose a tem-porally optimal combination of classification algorithm andimputation method thus increasing the agility of ubiquitouscomputing applications

Ubiquitous environments include a variety of forms ofsensor data from limited service conditions such as locationtime and status combining various different kinds of sensorsUsing the rules deducted in this study it is possible to selectthe optimal combination of imputation method and classi-fication algorithm for environments in which data changesdynamically For practitioners these rules for selection ofthe optimal pair of imputation method and classificationalgorithm may be developed for each situation dependingon the characteristics of datasets and their missing values

This set of rules will be useful for users and developersof intelligent systems (recommenders mobile applicationsagent systems etc) to choose the imputation method andclassification algorithm according to context while maintain-ing high prediction performance

In future studies the predicted performance of variousmethods can be testedwith actual datasets Although in priorresearch on classification algorithms multiple benchmarkdatasets from the UCI laboratory have been used to demon-strate the generality of the proposed method performanceevaluations in real settings would strengthen the significanceof the results Further for brevity we used a single perfor-mance metric RMSE in this study For example FP rate aswell as TP rate is very crucial when it comes to investigatingthe effect of class imbalance which is considered in thispaper as an independent variable Although the performanceresults would be very similar when using other metrics suchasmisclassification cost and total number of errors [20]morevaluable findings may be generated from a study includingthese other metrics

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the National Strategic RampDProgram for Industrial Technology (10041659) and funded bythe Ministry of Trade Industry and Energy (MOTIE)

References

[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.

[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.

[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.

[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.

[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.

[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.

[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.

[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.

[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.

[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.

[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.

[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.

[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.

[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.

[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.

[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.

[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.

[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.

[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.




Page 10: Research Article Missing Values and Optimal Selection of an ...imputation method is the most frequently used statistical method developed to deal with missing data problems. It is

10 Mathematical Problems in Engineering

Table 8 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) Predictive Mean Imputation

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus076lowastlowast minus076lowastlowast minus178lowastlowast minus063lowastlowast 123lowastlowast 016N cases minus084lowastlowast minus049lowastlowast 012 minus017 minus034lowast minus047lowastlowast

C imbalance 117lowastlowast 242lowastlowast 263lowastlowast 523lowastlowast 153lowastlowast 198lowastlowast

R missing 050lowast 079lowastlowast 043 085lowastlowast 080lowastlowast 068lowastlowast

SE HS 223lowastlowast 279lowastlowast 182lowastlowast 268lowastlowast 322lowastlowast 242lowastlowast

SE VS minus008 minus013 minus006 minus013 minus015 minus009Spread minus328lowastlowast minus432lowastlowast minus262lowastlowast minus434lowastlowast minus465lowastlowast minus361lowastlowast

P missing dum1 minus042 minus035 minus034 minus028 minus044 minus036P missing dum2 008 012 004 018 007 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 9 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) Hot deck

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus073lowastlowast minus176lowastlowast minus071lowastlowast 115lowastlowast 007N cases minus081lowastlowast minus049lowastlowast 012 minus018 minus034lowast minus047lowastlowast

C imbalance 135lowastlowast 237lowastlowast 261lowastlowast 524lowastlowast 133lowastlowast 211lowastlowast

R missing 062lowastlowast 083lowastlowast 044 084lowastlowast 075lowastlowast 070lowastlowast

SE HS 225lowastlowast 275lowastlowast 183lowastlowast 271lowastlowast 313lowastlowast 254lowastlowast

SE VS minus009 minus013 minus006 minus013 minus014 minus010Spread minus365lowastlowast minus428lowastlowast minus265lowastlowast minus427lowastlowast minus441lowastlowast minus361lowastlowast

P missing dum1 minus035 minus037 minus034 minus033 minus048 minus038P missing dum2 012 015 004 012 minus004 009Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Table 10 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-NN

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus085lowastlowast minus079lowastlowast minus181lowastlowast minus068lowastlowast 122lowastlowast 006N cases minus083lowastlowast minus049lowastlowast 011 minus018 minus034lowast minus047lowastlowast

C imbalance 143lowastlowast 249lowastlowast 260lowastlowast 521lowastlowast 152lowastlowast 211lowastlowast

R missing 054lowast 078lowastlowast 041 085lowastlowast 075lowastlowast 071lowastlowast

SE HS 234lowastlowast 290lowastlowast 182lowastlowast 269lowastlowast 328lowastlowast 255lowastlowast

SE VS minus010 minus013 minus006 minus013 minus014 minus011Spread minus332lowastlowast minus427lowastlowast minus264lowastlowast minus431lowastlowast minus450lowastlowast minus369lowastlowast

P missing dum1 minus038 minus041 minus035 minus029 minus057 minus035P missing dum2 003 008 005 017 000 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

Mathematical Problems in Engineering 11

Table 11 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-MEANS CLUSTERING

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus078lowastlowast minus181lowastlowast minus068lowastlowast 117lowastlowast 009N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus033 minus047lowastlowast

C imbalance 136lowastlowast 240lowastlowast 263lowastlowast 524lowastlowast 145lowastlowast 206lowastlowast

R missing 057lowast 079lowastlowast 041 084lowastlowast 079lowastlowast 057lowast

SE HS 236lowastlowast 289lowastlowast 183lowastlowast 271lowastlowast 315lowastlowast 264lowastlowast

SE VS minus009 minus013 minus006 minus013 minus014 minus011Spread minus362lowastlowast minus439lowastlowast minus262lowastlowast minus440lowastlowast minus474lowastlowast minus363lowastlowast

P missing dum1 minus037 minus042 minus036 minus032 minus038 minus046P missing dum2 002 013 001 014 009 004Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

patternstimes 100 trials) Each dataset was generated randomly tomeet the preconditionsWe conducted the performance eval-uation by randomly assigning each dataset to testtrainingsets at a 3 7 ratio The regression analysis included thecharacteristics of the datasets and the patterns of the missingvalues as independent variables Control variables such asthe type of classifier and imputation method were alsoincludedThe effects of the various characteristics of the dataand missing values on classifier performance (RMSE) wereanalyzed Three types of missing ratios were treated as twodummy variables (P missing dum1 2 00 01 10) Tables 6ndash11illustrate the results of the regression analysis of the variousimputation methods The results suggest the following rulesregardless of which imputation method is selected

(i) IF N attributes increases THEN use SMO(ii) IF N cases increases THEN use treesJ48(iii) IF C imbalance increases THEN use treesJ48(iv) IF R missing increases THEN use SMO(v) IF SE HS increases THEN use SMO(vi) IF Spread increases THEN use Logistic

Figure 5 displays the coefficient pattern of the decisiontree classifier for each imputation method Dataset char-acteristics are illustrated on the 119909-axis and the regressioncoefficients for each imputationmethod on the 119910-axis For allimputation methods except listwise deletion the classifiersrsquocoefficient patterns seemed similar However significantdifferences were found in the coefficient patterns using otheralgorithms For example for all imputationmethods a higherbeta coefficient of the number of attributes (N attributes)was observed for the logistics algorithm than for any otheralgorithm Thus the logistics algorithm exhibited the lowestperformance (highest RMSE) in terms of the number ofattributes In terms of the number of cases (N cases) SMOperformed the worst When the data were imbalanced theregressionmethod was the least effective one For themissingratio the regression method showed the lowest performance

except in comparison to listwise deletion and mean impu-tation For the horizontal scattered standard error (SE HS)SMO had the lowest performance For missing data spreadthe logistic classifier method had the lowest performance

Moreover for each single factor (eg spread) even ifthe results for two algorithms were the same their perfor-mance differed depending on which imputation method wasapplied For example for the decision tree (J48) algorithmthe mean imputation method had the most negative effect onclassification performance for horizontal scattered standarderror (SE HS) and spread while the listwise deletion andgroupmean imputationmethods had the least negative effect

The similar coefficient patterns shown in Figure 5 indicatethat the differences in impact of each imputation method onperformance were insignificant In order to determine theimpact of the classifiers more tests were needed Figure 6illustrates the coefficient patterns when the ratio of missingto complete data is 90 Under these circumstances thedistinction between imputationmethods according to datasetcharacteristics is significant For example very high or verylow beta coefficients may be observed for most datasetcharacteristics except the number of instances and classimbalance

Figure 7 shows the RMSE based on the ratio of missingdata for each imputation method As the ratio increases theperformance drops (RMSE increases) this is not an unex-pected result However as the ratio of missing to completedata increases the differences in performance between impu-tation methods become significant These results imply thatthe characteristics of the dataset andmissing values affect theperformance of the classifier algorithms Furthermore thepatterns of these effects differ depending on the imputationmethods and classifiers used

Lastly we estimate the accuracy (RMSE) of each methodby conducting a multiple regression analysis As shownin Table 12 the results confirmed a significant associationbetween the characteristics of the missing data and themethod of imputation with the performance of each clas-sification in terms of RMSE In total 226800 datasets (3

12 Mathematical Problems in Engineering

0

01

02

03At

trib

ute

Inst

ance

s

Dat

a im

bala

nce

Miss

ing

ratio

Hsc

atte

redn

ess

Vsc

atte

redn

ess

Spre

ad

Miss

ing

p1

Miss

ing

p2

minus04

minus03

minus02

minus01

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

GROUP MEAN IMPUTATIONHOT DECK IMPUTATIONk-MEANS CLUSTERINGk-NN

Figure 5 Coefficient pattern of the decision tree algorithm (RMSE)

0

02

04

06

08

Attr

ibut

e

Inst

ance

s

Dat

a im

bala

nce

Miss

ing

ratio

Hsc

atte

redn

ess

Vsc

atte

redn

ess

Spre

ad

minus06

minus04

minus02

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETION

GROUP MEAN IMPUTATIONHOT DECK IMPUTATION

k-NNk-MEANS CLUSTERING

Figure 6 Coefficient pattern of the decision tree algorithm basedon a 90 missing ratio (RMSE)

missing ratiostimes 3missing patternstimes 100 trialstimes 6 imputationmethods times 7 classification methods) were analyzed Theresults have at least two implications First we can predict theclassification accuracy for an unknown dataset with missingdata only if the data characteristics can be obtained Secondwe can establish general rules for selection of the optimalcombination of a classification algorithm and imputationalgorithm

Method of imputation

0490

0480

0470

0460

0450

0440

0430

0420

0410

0400

0390

0380

0370

0360

0350

0340

0330

0320

0310

0300

005 010 015 020 025 030 035 040 045 050

HOT DECKGROUP MEAN IMPUTATION

MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION

LISTWISE DELETIONk-NNk-MEANS CLUSTERING

Figure 7 RMSE by ratio of missing data

Table 12 Factors influencing accuracy (RMSE) of classifier algo-rithms

Data characteristic 119861 Data characteristic 119861

(constant) 060lowastlowast M imputation dum1 012lowastlowast

R missing 083lowastlowast M imputation dum2 minus001lowast

SE HS minus005lowastlowast M imputation dum4 000SE VS 000lowastlowast M imputation dum5 000Spread 017lowastlowast M imputation dum6 001lowastlowast

N attributes minus008lowastlowast M imputation dum7 minus001lowast

C imbalance minus003lowastlowast P missing dum1 minus006lowastlowast

N cases 002lowastlowast P missing dum3 000Note 1 Dummy variables related to imputation methods LIST-WISE DELETION (M imputation dum1 = 1 others = 0) MEAN IMPUTA-TION (M imputation dum2 = 1 others = 0) GROUP MEAN IMPUTA-TION (M imputation dum3 = 1 others = 0) PREDICTIVE MEAN IMPU-TATION (M imputation dum4 = 1 others = 0) HOT DECK (M imputa-tion dum5 = 1 others = 0) 119896-NN (M imputation dum6 = 1 others =0) and 119896-MEANS CLUSTERING (M imputation dum7 = 1 others = 0)Missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0P missing dum3 = 0) monotone (P missing dum1 = 0 P missing dum2 = 1P missing dum3 = 0) and arbitrary (P missing dum1 = 1 P missing dum2= 1 P missing dum3 = 1) 119861 standard beta coefficientNote 2 lowast119875 lt 01 lowastlowast119875 lt 005

6 Conclusion

So far the prior research does not fully inform us of the fit-ness among datasets imputation methods and classificationalgorithmsTherefore this study ultimately aims to establish arule set which guides the classificationrecommender systemdevelopers to select the best classification algorithm based

Mathematical Problems in Engineering 13

on the datasets and imputation method To the best of ourknowledge ours is the first study inwhich the performance ofclassification algorithms with multiple dimensions (datasetsimputation data and imputationmethods) is discussed Priorresearch examines only one dimension [15] In addition asshown in Figure 3 since the performance of each methoddiffers according to the dataset the results of prior studies onimputation methods or classification algorithms depend onthe datasets on which they are based

In this paper factors affecting the performance of classi-fication algorithms were identified as follows characteristicsof missing values dataset features and imputation methodsUsing benchmark data and thousands of variations we foundthat several factors were significantly associated with theperformance of classification algorithms First as expectedthe results show that the missing data ratio and spread arenegatively associated with the performance of the classifica-tion algorithms Second and as a new finding to our bestknowledge we observed that the number of missing cellsin each record (SE HS) was more sensitive in affecting theclassification performance than the number of missing cellsin each feature (SE VS) Further we found it interesting thatthe number of features negatively affects the performance ofthe logistic algorithm while other factors do not

A disadvantage of logistic regression is its lack of flexibil-ityThe assumption of a linear dependency between predictorvariables and the log-odds ratio results in a linear decisionboundary in the instance space which is not valid in manyapplications Hence in the case of data imputation thelogistic algorithm must be avoided Next in response toconcerns about class imbalance which has been discussed indatamining research [18 19] we found that the degree of classimbalance was the most significant data feature to decreasethe predicted performance of classification algorithms Inparticular SMO was second to none in predicting SE HSin any imputation situation that is if a dataset has a highnumber of records in which the number of missing cells islarge then SMO is the best classification algorithm to apply

The results of this study suggest that optimal selectionof the imputation method according to the characteristicsof the dataset (especially the patterns of missing values andchoice of classification algorithm) improves the accuracy ofubiquitous computing applications Also a set of optimalcombinations may be derived using the estimated resultsMoreover we established a set of general rules based on theresults of this study These rules allow us to choose a tem-porally optimal combination of classification algorithm andimputation method thus increasing the agility of ubiquitouscomputing applications

Ubiquitous environments include a variety of forms ofsensor data from limited service conditions such as locationtime and status combining various different kinds of sensorsUsing the rules deducted in this study it is possible to selectthe optimal combination of imputation method and classi-fication algorithm for environments in which data changesdynamically For practitioners these rules for selection ofthe optimal pair of imputation method and classificationalgorithm may be developed for each situation dependingon the characteristics of datasets and their missing values

This set of rules will be useful for users and developersof intelligent systems (recommenders mobile applicationsagent systems etc) to choose the imputation method andclassification algorithm according to context while maintain-ing high prediction performance

In future studies the predicted performance of variousmethods can be testedwith actual datasets Although in priorresearch on classification algorithms multiple benchmarkdatasets from the UCI laboratory have been used to demon-strate the generality of the proposed method performanceevaluations in real settings would strengthen the significanceof the results Further for brevity we used a single perfor-mance metric RMSE in this study For example FP rate aswell as TP rate is very crucial when it comes to investigatingthe effect of class imbalance which is considered in thispaper as an independent variable Although the performanceresults would be very similar when using other metrics suchasmisclassification cost and total number of errors [20]morevaluable findings may be generated from a study includingthese other metrics

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the National Strategic RampDProgram for Industrial Technology (10041659) and funded bythe Ministry of Trade Industry and Energy (MOTIE)

References

[1] J Augusto V Callaghan D Cook A Kameas and I SatohldquoIntelligent environments a manifestordquo Human-Centric Com-puting and Information Sciences vol 3 no 12 pp 1ndash18 2013

[2] R Y Toledo Y C Mota andM G Borroto ldquoA regularity-basedpreprocessingmethod for collaborative recommender systemsrdquoJournal of Information Processing Systems vol 9 no 3 pp 435ndash460 2013

[3] G Batista and M Monard ldquoAn analysis of four missing datatreatment methods for supervised learningrdquo Applied ArtificialIntelligence vol 17 no 5-6 pp 519ndash533 2003

[4] R Shtykh and Q Jin ldquoA human-centric integrated approach toweb information search and sharingrdquoHuman-Centric Comput-ing and Information Sciences vol 1 no 1 pp 1ndash37 2011

[5] H Ihm ldquoMining consumer attitude and behaviorrdquo Journal ofConvergence vol 4 no 2 pp 29ndash35 2013

[6] Y Cho and S Moon ldquoWeighted mining frequent patternbased customers RFM score for personalized u-commercerecommendation systemrdquo Journal of Convergence vol 4 no 4pp 36ndash40 2013

[7] N Howard and E Cambria ldquoIntention awareness improvingupon situation awareness in human-centric environmentsrdquoHuman-Centric Computing and Information Sciences vol 3 no9 pp 1ndash17 2013

[8] L Liew B Lee Y Wang and W Cheah ldquoAerial images rectifi-cation using non-parametric approachrdquo Journal of Convergencevol 4 no 2 pp 15ndash21 2013

14 Mathematical Problems in Engineering

[9] K J Nishanth and V Ravi ldquoA computational intelligence basedonline data imputation method an application for bankingrdquoJournal of Information Processing Systems vol 9 no 4 pp 633ndash650 2013

[10] P Kang ldquoLocally linear reconstruction based missing valueimputation for supervised learningrdquo Neurocomputing vol 118pp 65ndash78 2013

[11] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002

[12] H Finch ldquoEstimation of item response theory parameters in thepresence of missing datardquo Journal of Educational Measurementvol 45 no 3 pp 225ndash245 2008

[13] S J Press and S Wilson ldquoChoosing between logistic regressionand discriminant analysisrdquo Journal of the American StatisticalAssociation vol 73 no 364 pp 699ndash705 1978

[14] E Frank YWang S Inglis G Holmes and I HWitten ldquoUsingmodel trees for classificationrdquo Machine Learning vol 32 no 1pp 63ndash76 1998

[15] O Kwon and J M Sim ldquoEffects of data set features on theperformances of classification algorithmsrdquo Expert Systems withApplications vol 40 no 5 pp 1847ndash1857 2013

[16] E Namsrai T Munkhdalai M Li J-H Shin O-E Namsraiand K H Ryu ldquoA feature selection-based ensemble methodfor arrhythmia classificationrdquo Journal of Information ProcessingSystems vol 9 no 1 pp 31ndash40 2013

[17] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Morgan Kaufmann San Fran-cisco Calif USA 2nd edition 2005

[18] M Galar A Fernandez E Barrenechea H Bustince and FHerrera ldquoA review on ensembles for the class imbalance prob-lem bagging- boosting- and hybrid-based approachesrdquo IEEETransactions on Systems Man and Cybernetics C Applicationsand Reviews vol 42 no 4 pp 463ndash484 2012

[19] Q Yang and X Wu ldquo10 challenging problems in data miningresearchrdquo International Journal of Information Technology ampDecision Making vol 5 no 4 pp 597ndash604 2006

[20] Z-H Zhou and X-Y Liu ldquoTraining cost-sensitive neural net-works with methods addressing the class imbalance problemrdquoIEEE Transactions on Knowledge and Data Engineering vol 18no 1 pp 63ndash77 2006

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: Research Article Missing Values and Optimal Selection of an ...imputation method is the most frequently used statistical method developed to deal with missing data problems. It is

Mathematical Problems in Engineering 11

Table 11 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-MEANS CLUSTERING

Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus078lowastlowast minus181lowastlowast minus068lowastlowast 117lowastlowast 009N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus033 minus047lowastlowast

C imbalance 136lowastlowast 240lowastlowast 263lowastlowast 524lowastlowast 145lowastlowast 206lowastlowast

R missing 057lowast 079lowastlowast 041 084lowastlowast 079lowastlowast 057lowast

SE HS 236lowastlowast 289lowastlowast 183lowastlowast 271lowastlowast 315lowastlowast 264lowastlowast

SE VS minus009 minus013 minus006 minus013 minus014 minus011Spread minus362lowastlowast minus439lowastlowast minus262lowastlowast minus440lowastlowast minus474lowastlowast minus363lowastlowast

P missing dum1 minus037 minus042 minus036 minus032 minus038 minus046P missing dum2 002 013 001 014 009 004Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001

patternstimes 100 trials) Each dataset was generated randomly tomeet the preconditionsWe conducted the performance eval-uation by randomly assigning each dataset to testtrainingsets at a 3 7 ratio The regression analysis included thecharacteristics of the datasets and the patterns of the missingvalues as independent variables Control variables such asthe type of classifier and imputation method were alsoincludedThe effects of the various characteristics of the dataand missing values on classifier performance (RMSE) wereanalyzed Three types of missing ratios were treated as twodummy variables (P missing dum1 2 00 01 10) Tables 6ndash11illustrate the results of the regression analysis of the variousimputation methods The results suggest the following rulesregardless of which imputation method is selected

(i) IF N attributes increases THEN use SMO(ii) IF N cases increases THEN use treesJ48(iii) IF C imbalance increases THEN use treesJ48(iv) IF R missing increases THEN use SMO(v) IF SE HS increases THEN use SMO(vi) IF Spread increases THEN use Logistic

Figure 5 displays the coefficient pattern of the decisiontree classifier for each imputation method Dataset char-acteristics are illustrated on the 119909-axis and the regressioncoefficients for each imputationmethod on the 119910-axis For allimputation methods except listwise deletion the classifiersrsquocoefficient patterns seemed similar However significantdifferences were found in the coefficient patterns using otheralgorithms For example for all imputationmethods a higherbeta coefficient of the number of attributes (N attributes)was observed for the logistics algorithm than for any otheralgorithm Thus the logistics algorithm exhibited the lowestperformance (highest RMSE) in terms of the number ofattributes In terms of the number of cases (N cases) SMOperformed the worst When the data were imbalanced theregressionmethod was the least effective one For themissingratio the regression method showed the lowest performance

except in comparison to listwise deletion and mean impu-tation For the horizontal scattered standard error (SE HS)SMO had the lowest performance For missing data spreadthe logistic classifier method had the lowest performance

Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness (SE HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
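To make the distinction between two of the compared strategies concrete, the following is a minimal pandas sketch (an assumption for illustration, not the authors' code) of mean imputation versus group mean imputation, where missing cells are filled with the mean of the record's class rather than the global column mean:

```python
import numpy as np
import pandas as pd

# Toy data: two classes, one numeric feature with missing cells.
df = pd.DataFrame({"cls": ["a", "a", "b", "b"],
                   "x":   [1.0, np.nan, 3.0, np.nan]})

# Mean imputation: fill with the global column mean (2.0 here).
mean_imputed = df["x"].fillna(df["x"].mean())

# Group mean imputation: fill with the mean of the record's class.
group_mean_imputed = df.groupby("cls")["x"].transform(
    lambda s: s.fillna(s.mean()))

print(mean_imputed.tolist())        # [1.0, 2.0, 3.0, 2.0]
print(group_mean_imputed.tolist())  # [1.0, 1.0, 3.0, 3.0]
```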

The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.

Figure 7 shows the RMSE of each imputation method as a function of the ratio of missing data. As the ratio increases, performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
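An experiment of the kind summarized in Figure 7 can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' code: it assumes MCAR missingness, a public benchmark dataset, scikit-learn's mean and k-NN imputers standing in for two of the six methods, and a decision tree classifier.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

for ratio in (0.05, 0.25, 0.50):
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < ratio] = np.nan   # inject MCAR missingness
    for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                          ("k-NN", KNNImputer(n_neighbors=5))]:
        X_imp = imputer.fit_transform(X_miss)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_imp, y, test_size=0.3, random_state=0)  # 7:3 train/test split
        clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        p = clf.predict_proba(X_te)[:, 1]
        rmse = np.sqrt(np.mean((y_te - p) ** 2))      # lower is better
        print(f"ratio={ratio:.2f} imputer={name}: RMSE={rmse:.3f}")
```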

Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classification algorithm in terms of RMSE, on the other.


[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). x-axis: dataset characteristics (attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing p1, missing p2); y-axis: regression coefficients (approximately −0.4 to 0.3), one line per imputation method: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-means clustering, and k-NN.]

[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). x-axis: dataset characteristics (attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread); y-axis: regression coefficients (approximately −0.6 to 0.8) for the same seven imputation methods.]

In total, 226,800 datasets (3 missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and imputation algorithm.
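In outline, the Table 12 regression can be reproduced as follows. This is a hedged sketch: the results file `experiment_runs.csv`, its column names, and the library choice (pandas/statsmodels) are assumptions; in the study, each row would correspond to one simulated experiment run, with RMSE as the dependent variable and dummy variables for the imputation method and missing pattern.

```python
import pandas as pd
import statsmodels.api as sm

runs = pd.read_csv("experiment_runs.csv")  # hypothetical results file

# One-hot encode the categorical factors, dropping one baseline level each,
# mirroring the dummy-variable coding described in Table 12's Note 1.
X = pd.get_dummies(
    runs[["R_missing", "SE_HS", "SE_VS", "Spread",
          "N_attributes", "C_imbalance", "N_cases",
          "imputation", "pattern"]],
    columns=["imputation", "pattern"], drop_first=True, dtype=float)
X = sm.add_constant(X)

model = sm.OLS(runs["RMSE"], X).fit()
print(model.summary())  # coefficients analogous to column B in Table 12
```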

[Figure 7: RMSE by ratio of missing data. x-axis: missing ratio (0.05–0.50); y-axis: RMSE (0.300–0.490); one line per method of imputation: hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, and k-means clustering.]

Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic | B | Data characteristic | B
(constant) | 0.60** | M imputation dum1 | 0.12**
R missing | 0.83** | M imputation dum2 | −0.01*
SE HS | −0.05** | M imputation dum4 | 0.00
SE VS | 0.00** | M imputation dum5 | 0.00
Spread | 0.17** | M imputation dum6 | 0.01**
N attributes | −0.08** | M imputation dum7 | −0.01*
C imbalance | −0.03** | P missing dum1 | −0.06**
N cases | 0.02** | P missing dum3 | 0.00

Note 1: Dummy variables related to imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0), MEAN IMPUTATION (M imputation dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0), HOT DECK (M imputation dum5 = 1, others = 0), k-NN (M imputation dum6 = 1, others = 0), and k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0), and arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standardized beta coefficient.
Note 2: *P < 0.1; **P < 0.05.

6. Conclusion

Thus far, prior research has not fully informed us of the fit among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides developers of classification and recommender systems in selecting the best classification algorithm based


on the datasets and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, because the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.

In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE HS) affected classification performance more strongly than the number of missing cells in each feature (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while the other algorithms are not so affected.

A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most significantly decreased the predicted performance of classification algorithms. In particular, SMO was second to none with respect to SE HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.

The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values) and the choice of classification algorithm improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived from the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.

Ubiquitous environments include a variety of forms of sensor data collected under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data changes dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.

This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.

In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. However, the FP rate as well as the TP rate is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Strategic R&D Program for Industrial Technology (10041659) and funded by the Ministry of Trade, Industry and Energy (MOTIE).

References

[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.

[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.

[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.

[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.

[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.

[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.

[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.

[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.

[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.

[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.

[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.

[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.

[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.

[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.

[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.

[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.

[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.

[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.

[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.

[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
