CHAPTER 4
DATA PREPROCESSING
4.1 PREAMBLE
"Information quality is not an esoteric notion; it directly affects the
effectiveness and efficiency of business processes. Information quality
also plays a major role in customer satisfaction." - Larry P. English
As noted by Han and Kamber (2006), today's real-world databases are
highly susceptible to noisy, missing, and inconsistent data because of their
typically huge size (often several gigabytes or more) and their likely origin from
multiple, heterogeneous sources. Low-quality data will lead to low-quality mining
results. Incomplete, noisy, and inconsistent data are commonplace properties of
large real-world databases and data warehouses. Incomplete data can occur for a
number of reasons. Attributes of interest may not always be available. Other data
may not be included simply because they were not considered important at the time
of entry.
Relevant data may not be recorded due to misunderstanding, or because of
equipment malfunctions. Data that were inconsistent with other recorded data may
have been deleted. Furthermore, the recording of the history of modifications to
the data may have been overlooked. Missing data, particularly for tuples with
missing values for some attributes, may need to be inferred (Han and Kamber,
2006).
Data preprocessing is a data mining technique that involves transforming
raw data into an understandable format. Data preprocessing is a proven method of
resolving such issues.
4.2 PREPROCESSING
Data preprocessing prepares raw data for further processing. The traditional
data preprocessing approach is reactive: it starts with data that is assumed ready
for analysis, and provides no feedback to improve the way data are collected.
Inconsistency between data sets is the main difficulty in data preprocessing.
Figure 4.1 Preprocessing Task
The major tasks of preprocessing are the following.
Data Cleaning
Data cleaning is the process of filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.
Data Integration
Integration of multiple databases, data cubes, or files.
Data Transformation
Data transformation is the task of data normalization and aggregation.
Data Reduction
Data reduction is the process of obtaining a reduced representation of the data
that is smaller in volume yet produces the same or similar analytical results.
Data Discretization
Data discretization is part of data reduction, with particular importance for
numerical data.
The proposed model and tasks for preprocessing are described in the following
sections.
4.3 GENERAL MODEL FOR PREPROCESSING
The preprocessing tasks proposed in this research work are modeled in
Figure 4.2:
Treating missing values
o Rule-based outlier detection
o Imputation methods to treat missing values
o Attribute correction using data mining concepts
Data integration using a knowledge repository and Jaro-Winkler
Data discretization using the equal-width methodology
Data reduction
o Dimensionality reduction
o Numerosity reduction
Figure 4.2 Model for Proposed Preprocessing Task
4.4 DATA CLEANING
"Data cleaning is the number one problem in data warehousing" -
DCI (Discovery Corps, Inc.) survey.
Data quality is an essential characteristic that determines the reliability of
data for making decisions. High-quality data is
Complete: All relevant data, such as accounts, addresses and relationships
for a given customer, are linked.
Accurate: Common data problems like misspellings, typos, and random
abbreviations have been cleaned up.
Available: Required data are accessible on demand; users do not need to
search manually for the information.
Timely: Up-to-date information is readily available to support decisions.
In general, data quality is defined as an aggregated value over a set of
quality criteria [Naumann.F, 2002; Heiko and Johann, 2006]. Starting with the
quality criteria defined in [Naumann.F, 2002], the author describes the set of
criteria that are affected by comprehensive data cleansing and defines how to assess
scores for each of them for an existing data collection. To measure the quality
of a data collection, scores have to be assessed for each of the quality criteria. The
assessment of scores for quality criteria can be used to quantify the necessity of
data cleansing for a data collection, as well as the success of a performed data
cleansing process. Quality criteria can also be used in the
optimization of data cleansing by specifying priorities for each criterion,
which in turn influences the execution of the data cleansing methods affecting that
criterion.
Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies. The actual process of data cleansing may involve
removing typographical errors or validating and correcting values against a known
list of entities. The validation may be strict.
Data cleansing differs from data validation in that validation almost
invariably means data is rejected from the system at entry and is performed at
entry time, rather than on batches of data.
Data cleansing may also involve activities such as harmonization and
standardization of data. For example, harmonization converts short codes (St, Rd)
to actual words (Street, Road). Standardization of data is a means of changing a
reference data set to a new standard, e.g., the use of standard codes.
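Such a harmonization step can be sketched as a simple token replacement; the abbreviation table below is an illustrative assumption, not a list from this work:

```python
# A minimal harmonization sketch: map short codes to full words.
# The mapping table is illustrative, not taken from the thesis.
ABBREVIATIONS = {"St": "Street", "Rd": "Road", "Ave": "Avenue"}

def harmonize(text: str) -> str:
    """Replace known short codes with their full words, token by token."""
    tokens = text.split()
    return " ".join(ABBREVIATIONS.get(t.rstrip("."), t) for t in tokens)

print(harmonize("12 Main St"))  # -> 12 Main Street
```

A real system would typically drive such replacements from a maintained reference data set rather than a hard-coded table.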
The major data cleaning tasks include
Identify outliers and smooth out noisy data
Fill in missing values
Correct inconsistent data
Resolve redundancy caused by data integration
Among these tasks, missing values cause inconsistencies for data mining. Handling
the missing values is therefore a good way to overcome these inconsistencies.
In the medical domain, data might be missing because the value is not relevant
to a particular case, could not be recorded when the data was collected, is
ignored by users because of privacy concerns, is unfeasible to obtain because the
patient cannot undergo the clinical tests, or is lost to equipment malfunction.
Methods for resolving missing values are therefore needed in health care systems
to enhance the quality of diagnosis. The following sections describe the proposed
data cleaning methods.
Figure 4.3 Model for Data Cleaning
4.4.1. Outlier Detection
The method incorporated for outlier detection is the rule-based outlier detection
method. Outlier (or anomaly) detection is an important problem for many
domains, including fraud detection, risk analysis, network intrusion and medical
diagnosis, and the discovery of significant outliers is becoming an integral aspect
of data mining. Outlier detection is a mature field of research with its origins in
statistics.
Outlier detection techniques can operate in one of the following three modes:
(i) Supervised outlier detection
Techniques trained in supervised mode assume the availability
of a training data set which has labeled instances for normal as well
as outlier class. The typical approach in such cases is to build a
predictive model for normal vs. outlier classes. Any unseen data
instance is compared against the model to determine which class it
belongs to. There are two major issues that arise in supervised
outlier detection. First, the anomalous instances are few
compared to the normal instances in the training data. Second,
obtaining accurate and representative labels, especially for the
outlier class, is usually challenging.
(ii) Semi-Supervised outlier detection
Techniques that operate in a semi-supervised mode, assume that the
training data has labeled instances for only the normal class. Since
they do not require labels for the outlier class, they are more widely
applicable than supervised techniques. For example, in space craft
fault detection, an outlier scenario would signify an accident, which
is not easy to model. The typical approach used in such techniques is
to build a model for the class corresponding to normal behavior, and
use the model to identify outliers in the test data.
(iii) Unsupervised outlier detection
Techniques that operate in unsupervised mode do not require
training data, and thus are most widely applicable. The techniques in
this category make the implicit assumption that normal instances are
far more frequent than outliers in the test data. If this assumption does
not hold, such techniques suffer from a high false alarm rate.
Rule based techniques generate rules that capture the normal behavior of a
system [Skalak and Rissland 1990]. Any instance that is not covered by any such
rule is considered an anomaly. Several rule based anomaly detection techniques
operate in a semi-supervised mode where rules are learnt for the normal class(es) and
the confidence associated with the rule that "fires" for a test instance determines if
it is normal or anomalous [Fan et al. 2001; Helmer et al. 1998; Lee et al. 1997;
Salvador and Chan 2003; Teng et al. 2002].
4.4.1.1. Rule based method of outlier detection
The rule-based outlier detection is more appropriate for on-line inconsistency
testing. It works with data of a particular domain only and the consequence is its
simplicity and high execution speed. The approach is actually a set of logical tests
that must be satisfied by every patient record. If one or more of the tests is not
satisfied, the record is detected as an outlier. The logical tests are defined by the
set of rules that hold for the patient records in the domain [Gamberger et. al.,
2000].
In this concept, separate rules are constructed for the positive and negative
class cases. The confirmation rules for the positive class must be true for many
positive cases and for no negative case. If a negative case is detected true for any
confirmation rule developed for the positive class, it is a reliable sign that the case
is an outlier. In the same way, confirmation rules constructed for the negative class
can be used for outlier detection of positive patient records. Some preliminary
inductive learning results have been demonstrated [Gamberger et. al., 2000] that
explicit detection of outliers can be useful for maintaining the data quality of
medical records and that it might be a key for the improvement of medical
decisions and their reliability in regular medical practice. With the intention of
on-line detection of possible data inconsistencies, sets of confirmation rules have
been developed for the database and their test results are reported in this work. An
additional advantage of the approach is that the user receives information
about the rule which raised the alarm, which can be useful in the error detection
process.
Steps Involved in Rule-Based Outlier Detection
Get the input cardiac dataset.
Apply a set of logical tests (rules) to each record in the table.
Records which do not satisfy the rules are considered outliers.
Outliers are then removed from the table.
4.4.1.2. Procedure for outlier detection
Figure 4.4 describes the procedure for outlier detection.
Input: D /* the cardiology database */, k /* no. of desired outliers */
Output: k identified outliers
/* Phase 1 - initialization */
Begin
Step 1:
For each record t in D do
    Update hash table using t
    Label t as a non-outlier with flag "0"
/* Phase 2 - outlier identification using the rule-based outlier detection
method */
Counter = 0
Repeat
    Counter++
Step 2:
    While not end of the database do
        Read next record t which is labeled "0" // non-outlier
        Compute the characteristics by labeling t as an outlier
        If the computed characteristics do not match the characteristics in the rules then
            Update hash tables using t
            Label t as an outlier with flag "1"
Until (Counter = k)
End
Figure 4.4 Procedure for Outlier Detection
The outcome of the above-discussed algorithm is a dataset without outliers, based
on the rules. Missing data is another important issue in preprocessing; it is
discussed in the next section.
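The rule-based detection described above can be sketched in Python. The specific rules (plausible ranges for age and heart rate) and field names are illustrative assumptions, not the actual rule set developed for the cardiac database:

```python
# Hedged sketch of rule-based outlier detection: a record failing any
# logical test is flagged as an outlier. Rules and fields are examples.
RULES = [
    lambda rec: 0 < rec["age"] <= 120,           # plausible age range
    lambda rec: 30 <= rec["heart_rate"] <= 250,  # plausible heart rate
]

def detect_outliers(records):
    """Split records into (clean, outliers) by applying every rule."""
    clean, outliers = [], []
    for rec in records:
        if all(rule(rec) for rule in RULES):
            clean.append(rec)
        else:
            outliers.append(rec)
    return clean, outliers

data = [{"age": 63, "heart_rate": 72}, {"age": 190, "heart_rate": 72}]
clean, outliers = detect_outliers(data)
print(len(clean), len(outliers))  # -> 1 1
```

In the thesis's procedure the rules are confirmation rules learnt per class; here they are stated directly to keep the sketch self-contained.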
4.4.2 Handling Missing Values
The method used for treating missing values plays an important role in data
preprocessing. Missing data is a common problem in statistical analysis. The
tolerance level of missing data is classified by the percentage of missing values:
Up to 1% - Trivial
1-5% - Manageable
5-15% - Sophisticated methods required to handle
More than 15% - Severe impact on interpretation
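The classification above can be expressed as a small helper; how the exact boundary values (e.g. exactly 5%) are assigned is an assumption, since the text leaves it open:

```python
# Hedged sketch of the missing-data tolerance levels listed above.
# Boundary handling (<= vs <) is an assumption.
def missing_tolerance(pct: float) -> str:
    """Classify a percentage of missing data into the stated tolerance levels."""
    if pct <= 1:
        return "trivial"
    if pct <= 5:
        return "manageable"
    if pct <= 15:
        return "sophisticated methods to handle"
    return "severe impact on interpretation"

print(missing_tolerance(12.0))  # -> sophisticated methods to handle
```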
Several methods have been proposed in the literature to treat missing data. Those
methods are divided into three categories, as proposed by Dempster et
al. [1977]. The different patterns of missing values are discussed in the next
section.
4.4.2.1 Pattern of missing
Missing values in a database fall into three categories, viz., Missing
Completely at Random (MCAR), Missing at Random (MAR) and Non-Ignorable
(NI).
Missing Completely at Random (MCAR)
This is the highest level of randomness. It occurs when the probability of an
instance (case) having a missing value for an attribute does not depend on either
the known values or the missing data; the missing values are randomly distributed
across all observations. This is not a realistic assumption for much real-world data.
Missing at Random (MAR)
Missingness is MAR when it does not depend on the true value of the missing
variable, but may depend on the values of other variables that are observed.
This pattern occurs when missing values are not randomly distributed across all
observations, but are randomly distributed within one or more subsamples.
Non-Ignorable (NI)
NI exists when missing values are not randomly distributed across
observations. If the probability that a cell is missing depends on the unobserved
value of the missing response, then the process is non-ignorable.
In next section the theoretical framework for Handling the missing value is
discussed.
4.4.2.2 The theoretical framework
The classification of missing data is categorized in the following three
mechanisms:
• If the probability of an observation being missing does not depend on
observed or unobserved measurements, then the observation is MCAR. A
typical example is a patient moving to another city for non-health reasons.
Patients who drop-out of a study for this reason could be considered as a
random sample of the total study population and their characteristics are
similar.
• If the probability of an observation being missing depends only on
observed measurements, then the observation is MAR. This assumption
implies that the behavior of the post-drop-out observations can be predicted
from the observed variables, and therefore that the response can be estimated
without bias using exclusively the observed data. [For example, when a
patient drops out due to lack of efficacy (illness due to lack of vitamin
efficiency) reflected by a series of poor efficacy outcomes that have been
observed, the appropriate value to assign to the subsequent efficacy
endpoint for this patient can be calculated using the observed data.]
• When observations are neither MCAR nor MAR, they are classified as
Missing Not At Random (MNAR), or non-ignorable: the probability of
an observation being missing depends on unobserved measurements. In
this scenario, the value of the unobserved responses depends on
information not available for the analysis (i.e., not the values observed
previously on the analysis variable or the covariates being used), and thus
future observations cannot be predicted without bias by the model. For
example, it may happen that after a series of visits with good outcomes, a
patient drops out due to lack of efficacy. In this situation the analysis model
based on the observed data, including relevant covariates, is likely to
continue to predict a good outcome, but it is usually unreasonable to expect
the patient to continue to derive benefit from treatment. It is impossible to
be certain whether there is a relationship between missing values and the
unobserved outcome variable, or to judge whether the missing data can be
adequately predicted from the observed data. It is not possible to know
whether the MAR, never mind MCAR, assumption is appropriate in any
practical situation. A proposition that no data in a confirmatory clinical trial
are MNAR seems implausible. Because it must be assumed that some data are
MNAR, the properties (e.g. bias) of any method based on MCAR or MAR
assumptions cannot be reliably determined for any given dataset.
Therefore, the method chosen should not depend primarily on the properties of the
method under the MAR or MCAR assumptions, but on whether it is considered to
provide an appropriately conservative estimate in the circumstances of the trial
under consideration. The methods and procedures for handling missing values are
described in the next section.
4.4.2.3 Methods for handling missing values
The specific methods for handling missing values are mentioned below:
Method of ignoring instances with unknown feature values.
Most common feature value.
Method of treating missing feature values as special values (filling in a
global constant such as "Cardio" for missing values in character data types).
a. Ignoring or Discarding Data.
In this method there are two ways to discard data with missing values:
1. The first is complete case analysis, where any instance with missing
values is discarded entirely.
2. The second determines the level of missing values in each instance and
attribute, and discards the instances with a high level of missing data.
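The two discarding strategies can be sketched as follows; the 50% threshold in the second function is an assumed parameter, since the text does not fix a level:

```python
# Hedged sketch of the two discarding strategies described above.
def complete_case(records):
    """Way 1: drop any record containing at least one missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]

def drop_high_missing(records, max_missing_ratio=0.5):
    """Way 2: drop records whose fraction of missing fields exceeds the
    given level. The default threshold is an illustrative assumption."""
    kept = []
    for r in records:
        missing = sum(v is None for v in r.values())
        if missing / len(r) <= max_missing_ratio:
            kept.append(r)
    return kept

rows = [{"age": 63, "chol": 233}, {"age": None, "chol": None}]
print(len(complete_case(rows)))  # -> 1
```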
b. Parameter estimation
The maximum likelihood procedure is used to estimate the parameters of a
model defined for the complete data. Maximum likelihood procedures that
use variants of the Expectation-Maximization algorithm can handle parameter
estimation in the presence of missing data [Mehala et al., 2009; Dempster et
al., 1977].
c. Imputation techniques
Imputation is the substitution of some value for a missing data point or a
missing component of a data point. Once all missing values have been imputed,
the dataset can then be analyzed using standard techniques for complete data. The
analysis should ideally take into account that there is a greater degree of
uncertainty than if the imputed values had actually been observed, however, and
this generally requires some modification of the standard complete data analysis
methods. In this research work the estimation maximization method is
implemented.
ESTIMATION MAXIMIZATION (EM) METHOD FOR MISSING
VALUES
The algorithm used for handling missing values by the most common
feature method is the EM algorithm. The procedure is discussed in Figure 4.5.
The algorithm:
1. Estimates the most appropriate value to be filled in the missing field.
2. Maximizes the value of all the missing fields in the corresponding attribute.

Input: D /* the cardiology database */
Output: D with filled-in values for the missing fields
Begin
Step 1: For each record t in D do
Step 2: If the field is an integer then /* fill missing values by substituting
the mean for the integer field */
    compute the mean / average of the field values
Step 3: Update the field with the computed value
    if col name = Age
        calculate average of Age
        update col name with avg(Age)
Step 4: If the field is a character then /* fill a global constant for values
missing in the text field */
    identify the global constant used for the variable /* global constant used =
    "cardio" */
Step 5: Update the field with the global constant
End
Figure 4.5 Procedure for Estimation Maximization Method for Missing Values

4.4.3. Missing Value Imputation Methods
As an alternative to the EM model, missing data imputation is a
procedure that replaces the missing values with some plausible values. Imputed
values are treated as just as reliable as the truly observed data, but they are only as
good as the assumptions used to create them.
Imputation is a method of filling in the missing values by attributing to them
values derived from other available data. Imputation is defined as "the process of
estimating missing data of an observation, based on valid values of other
variables" (Hair et al., 1998). Imputation minimizes bias in the mining process and
preserves "expensive to collect" data that would otherwise be discarded (Marvin
et al., 2003). It is important that the estimates for the missing values are accurate,
as even a small number of biased estimates may lead to inaccurate and misleading
results in the mining process.
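The fill procedure of Figure 4.5, substituting the column mean for numeric fields and the global constant "cardio" for text fields, can be sketched as below; the field names and record layout are illustrative assumptions:

```python
# Hedged sketch of the Figure 4.5 fill procedure: mean for numeric
# fields, the global constant "cardio" for character fields.
GLOBAL_CONSTANT = "cardio"

def em_style_fill(records, numeric_fields, text_fields):
    """Return a copy of records with None values filled in per Figure 4.5."""
    filled = [dict(r) for r in records]  # work on a copy
    for field in numeric_fields:
        observed = [r[field] for r in filled if r[field] is not None]
        mean = sum(observed) / len(observed)
        for r in filled:
            if r[field] is None:
                r[field] = mean
    for field in text_fields:
        for r in filled:
            if r[field] is None:
                r[field] = GLOBAL_CONSTANT
    return filled

rows = [{"age": 60, "sex": "M"}, {"age": None, "sex": None}, {"age": 70, "sex": "F"}]
out = em_style_fill(rows, ["age"], ["sex"])
print(out[1])  # -> {'age': 65.0, 'sex': 'cardio'}
```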
Imputation is of several types, viz., single imputation, partial
imputation, multiple imputation and iterative imputation. Zhang.S.C [2010]
handled missing values in heterogeneous data sets using a semi-parametric
iterative imputation method.
Multiple imputation (MI) has several desirable features:
Introducing appropriate random error into the imputation process makes it
possible to get approximately unbiased estimates of all parameters. No
deterministic imputation method can do this in general settings.
Repeated imputation allows one to get good estimates of the standard
errors. Single imputation methods do not allow for the additional error
introduced by imputation (without specialized software of very limited
generality).
MI can be used with any kind of data and any kind of analysis without
specialized software.
4.4.3.1 Imputation in K-Nearest Neighbors (K-NN)
In this method, the missing values of an instance are imputed by considering a
given number of instances most similar to the instance of interest. The
similarity is calculated using a distance function.
The advantages of this method are
Prediction of both quantitative and qualitative attributes
Handling of multiple missing values in a record.
The disadvantages of this method are
(i) It searches through the whole dataset looking for the most similar
instances, which is time consuming.
(ii) The choice of distance function used to calculate the distance.
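A minimal sketch of k-NN imputation as described above; the choice of Euclidean distance over the shared non-missing numeric fields, and the value of k, are assumptions for illustration:

```python
# Hedged sketch of k-NN imputation: fill a missing field with the mean
# of that field over the k nearest complete cases.
import math

def knn_impute(target, complete_cases, field, k=3):
    """Impute target[field] from the k complete cases nearest to target,
    measuring distance over the fields both records have observed."""
    def distance(a, b):
        shared = [f for f in a
                  if f != field and a[f] is not None and b[f] is not None]
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in shared))
    neighbors = sorted(complete_cases, key=lambda c: distance(target, c))[:k]
    target[field] = sum(n[field] for n in neighbors) / len(neighbors)
    return target

cases = [{"age": 60, "bp": 120}, {"age": 62, "bp": 124}, {"age": 80, "bp": 160}]
imputed = knn_impute({"age": 61, "bp": None}, cases, "bp", k=2)
print(imputed)  # -> {'age': 61, 'bp': 122.0}
```

The scan over all complete cases illustrates the stated disadvantage: every imputation costs a full pass over the dataset.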
4.4.3.2 Mean based imputation (single imputation)
In mean imputation, the mean of the values of an attribute that contains
missing data is used to fill in the missing values. In the case of a categorical
attribute, the mode, which is the most frequent value, is used instead of the mean
[Liu et al., 2004]. The algorithm imputes missing values for each attribute
separately. Mean imputation can be conditional or unconditional, i.e., conditioned
or not on the values of other variables in the record. The conditional mean
method imputes a mean value that depends on the values of the complete
attributes for the incomplete record.
4.4.3.3 NORM, which implements missing value estimation
Based on the expectation maximization algorithm [Schafer J.L, 1999], multiple
imputation inference involves three distinct phases:
• The missing data are filled in m times to generate m complete data sets
• The m complete data sets are analyzed by using standard procedures
• The results from the m complete data sets are combined for the inference
4.4.3.4 LSImpute_Rows
The LSImpute_Rows method estimates missing values based on the least squares
error principle and the correlation between cases (rows in the input matrix) [Liu
et al., 2004; José et al., 2006].
4.4.3.5 EMImpute_Columns
EMImpute_Columns estimates missing values using the same
imputation model, but based on the correlation between the features (columns in
the input matrix) [Marisol et al., 2005]. LSImpute_Rows and EMImpute_Columns
involve multiple regressions to make their predictions.
4.4.3.6 Other imputation methods
Hot deck imputation
In this method the missing value is filled with a value from an estimated
distribution of the missing value in the data set. In random hot deck, a missing
value of an attribute is replaced by an observed value of the attribute chosen at
random.
Cold deck imputation
This is the same as hot deck imputation, except that the imputed value is
obtained from a different source.
Imputation using decision trees
All decision tree classifiers handle missing values using built-in
approaches.
GCFIT_MISS_IMPUTE, proposed by Ilango et al. [2009], imputes
the missing values in Type II diabetes databases and evaluates its
performance by estimating the average imputation error. The average imputation
error is a measure representing the degree of inconsistency between the observed
and imputed values. The approach was experimented on the PIMA Indian Type II
Diabetes data set, which originally does not have any missing data. All 8
attributes are considered for the experiments, as the decision attribute is derived
using these 8 attributes. Datasets with different percentages of missing data (from
5% to 85%) were generated using the random labeling feature. For each
percentage of missing data, 20 random simulations were conducted.
In each dataset, missing values were simulated by randomly labeling
feature values as missing. Datasets with different amounts of missing
values (from 5% to 35% of the total available data) were generated. For each
percentage of missing data, 20 random simulations were conducted. The data were
standardised using the maximum difference normalisation procedure, which
mapped the data into the interval [0, 1]. The estimated values were compared to
those in the original data set. The average estimation error E was calculated as
follows:
E = (1/m) Σ_{k=1..m} (1/n) Σ_{i=1..n} |O_ij - I_ij| / (max_j - min_j)    (4.1)

where n is the number of imputed values, m is the number of random
simulations for each missing value, O_ij is the original value to be imputed, I_ij is
the imputed value, and j is the corresponding feature to which O_i and I_i belong. The
result analysis of all these methods is discussed in the next section.
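Equation (4.1) can be computed directly; the layout of the input as one list of (original, imputed) pairs per simulation is an assumption for illustration:

```python
# Hedged sketch of Equation (4.1): normalised absolute error averaged
# over the n imputed values of each of the m random simulations.
def average_estimation_error(simulations, max_j, min_j):
    """simulations: list of m lists of (original, imputed) value pairs
    for a single feature j with range [min_j, max_j]."""
    m = len(simulations)
    total = 0.0
    for sim in simulations:
        n = len(sim)
        total += sum(abs(o - i) / (max_j - min_j) for o, i in sim) / n
    return total / m

sims = [[(10.0, 12.0), (20.0, 20.0)], [(10.0, 11.0), (20.0, 18.0)]]
print(round(average_estimation_error(sims, max_j=20.0, min_j=0.0), 4))  # -> 0.0625
```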
4.4.3.7 Result analysis
The estimated error results obtained from the different methods are
tabulated in Table 4.1. Several k-NN estimators were
implemented, but only the most accurate model is shown. The 10-NN model
produced an average estimation error that is consistently lower than those
obtained using the mean imputation, NORM and LSImpute_Rows methods.
Table 4.1 and Figure 4.6 show the average estimated errors and corresponding
standard deviations. The predictive performance of these methods depends on the
amount of missing values and of complete cases contained in the dataset.
Table 4.1 Average estimated error ± standard deviation

Methods                Percentage of Missing Data
                       5          10         15         20         25         30         35
10-NN                  10.5±9.4   11.1±9.7   11.7±10.2  12.6±10.6  13.7±11.6  14.7±12.2  15.5±12.7
Mean based Imputation  13.6±11.3  14.0±11.5  13.5±11.1  13.7±11.4  13.4±11.3  13.7±11.4  13.8±11.5
NORM                   12.4±13.5  13.3±14.8  12.7±13.9  14.0±14.4  14.6±15.3  14.7±15.3  15.3±15.2
EMImpute_Columns       8.5±22.7   9.2±22.5   9.1±22.4   9.3±22.3   9.2±22.2   7.8±23.2   7.7±23.1
LSImpute_Rows          12.3±22.7  13.6±22.7  14.4±22.6  14.3±22.6  14.6±22.7  13.1±23.7  12.9±23.6
Figure 4.6 Comparison of different methods using different percentages of missing values
From the analysis, it is clear that the 10-NN method produced
the least variability in results. However, when more than 30% of the data were
missing, the performance of k-NN started to deteriorate significantly. This
deterioration occurs when the number of complete cases (nearest neighbors)
available to impute a missing value is actually smaller than k. This is one of the
limitations of this study, because the k-NN models only considered complete cases
(nearest neighbors) for making estimations.
The k-NN was able to generate relatively accurate and less variable results
for different amounts of missing data, which were assessed using 20 missing value
random simulations. However, it is important to remark that, while on the one
hand, this study allowed us to assess the potential of different missing data
estimation methods, on the other hand it did not offer significant evidence to
describe a relationship between the amount of missing data and the accuracy of the
predictions. Attribute correction using data mining concepts is discussed in the
following section.
4.4.4 Attribute Correction Using Association Rule and Clustering
Techniques
In this section, the two proposed algorithms for attribute correction using
data mining techniques with an external reference are discussed: Context
Dependent Attribute Correction using Association Rule (CDACAR) and Context
Independent Attribute Correction using Clustering Technique (CIACCT). The
algorithms described in this section examine whether the data set itself can serve
as a source of reference data that could be used to identify incorrect entries and
enable their correction.
4.4.4.1 Framework
The framework for attribute correction is shown in Figure 4.7.
Figure 4.7 Framework for Attribute Correction: an imputed attribute is corrected
either context-dependently (association rules) or context-independently
(clustering), yielding the corrected attribute.
4.4.4.2 Context Dependent Attribute Correction using Association Rule
(CDACAR)
Context dependent attribute correction refers to correcting attribute values by
considering both the reference data values and the other attribute values of the
record.
In this algorithm, the association rules methodology is used to discover
validation rules for data sets. The frequent item sets are generated using the
Apriori algorithm [Webb.J, 2003].
The following two parameters are used in CDACAR:
Minsup is defined analogously to the parameter of the same name in the Apriori
algorithm.
Distthresh is the threshold distance between the value of the "suspicious"
attribute and a proposed value (the successor of the rule it violates) below which
a correction is made.
The Levenshtein distance (LD) is a measure of the similarity between two strings,
which we refer to as the source string (s) and the target string (t). The distance
is the number of deletions, insertions, or substitutions required to transform s
into t. For example,
• If s is "test" and t is "test", then LD(s,t) = 0, because no
transformations are needed; the strings are already identical.
• If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution
(changing "s" to "n") is sufficient to transform s into t.
The modified Levenshtein distance is defined as

Lev_m(s1, s2) = (1/2) · (Lev(s1, s2)/|s1| + Lev(s1, s2)/|s2|)    (4.2)

where Lev(s1, s2) denotes the Levenshtein distance between strings s1 and s2,
and |s| denotes the length of string s. The modified distance for strings may be
interpreted as the average fraction of one string that has to be modified to be
transformed into the other. For instance, the LD between "Articulation" and
"Articaulation" is 2, and the modified Levenshtein distance for these strings is
0.25. The modification was introduced to make the comparison independent of
the string length.
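Both distances can be sketched with a standard dynamic-programming implementation; this is a minimal illustration, not the thesis's code:

```python
# Hedged sketch: classical Levenshtein distance (one-row DP) and the
# length-normalised modified variant of Equation (4.2).
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def modified_levenshtein(s1: str, s2: str) -> float:
    """Equation (4.2): average fraction of each string to be modified."""
    d = levenshtein(s1, s2)
    return 0.5 * (d / len(s1) + d / len(s2))

print(levenshtein("test", "tent"))  # -> 1
```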
The algorithm is outlined below.
Step 1: Generate all the frequent item sets.
Step 2: Generate all the association rules from the generated sets. The
rules generated may have 1, 2 or 3 predecessors and only one
successor. The association rules form the set of validation rules.
Step 3: The algorithm discovers records whose attribute values are the
predecessors of a generated rule but whose corresponding attribute value
differs from the successor of that rule.
Step 4: The value of the suspicious attribute in a row is compared
with all the successors.
Step 5: If the relative Levenshtein distance is lower than the threshold
distance, the value may be corrected. If there are more values within the
accepted range of the parameter, the value most similar to the value in the
record is chosen.
The results are analyzed in Section 4.4.4.4.
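The comparison-and-correction step can be sketched as below; the threshold value and the candidate successor list are illustrative assumptions, and the Levenshtein helper is the standard definition:

```python
# Hedged sketch of the CDACAR correction step: replace a suspicious
# value with the most similar rule successor within the threshold.
def levenshtein(s, t):
    """Classical edit distance via one-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(curr[j - 1] + 1, prev[j] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def correct_value(value, successors, distthresh=0.3):
    """Return the closest successor if its modified Levenshtein distance
    is within distthresh; otherwise keep the original value."""
    def rel_dist(a, b):
        d = levenshtein(a, b)
        return 0.5 * (d / len(a) + d / len(b))
    best = min((rel_dist(value, s), s) for s in successors)
    return best[1] if best[0] <= distthresh else value

print(correct_value("strett", ["street", "road"]))  # -> street
```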
4.4.4.3 Context Independent Attribute Correction using Clustering
Technique (CIACCT)
Context-independent attribute correction implies that each record
attribute is examined and cleaned in isolation, without regard to the values of the
other attributes of the record. The main idea behind this algorithm is based on the
observation that in most data sets there is a certain number of values with a large
number of occurrences, and a very large number of values with a very low number
of occurrences. The most representative values may therefore be the source of
reference data, while the values with a low number of occurrences are noise or
misspelled instances of the reference data.
The same Levenshtein distance is used in these methods which were
discussed in the previous algorithm.
In this method the following two parameters are considered:
i. distthresh - the maximum distance between two values that allows them to
be marked as similar and related;
ii. occrel - used to determine whether both compared values belong to the
reference data set.
The CIACCT algorithm is described below.
Step 1: Initial cleaning: all attribute values are converted to upper
case and all non-alphanumeric characters are removed; then the number of
occurrences of every value in the cleaned data set is calculated.
Step 2: Each element is assigned to a separate cluster. The cluster element
with the highest number of occurrences is treated as the cluster representative.
Step 3: The cluster list is sorted in descending order of the number of
occurrences of each cluster representative.
Step 4: Starting from the first cluster, all clusters are compared pairwise
and the distance between them is calculated using the modified Levenshtein
distance.
Step 5: If the distance is lower than the distthresh parameter and the ratio of
occurrences of the cluster representatives is greater than or equal to the
occrel parameter, the clusters are merged.
Step 6: After all the clusters are compared, each cluster is examined for
values whose distance to the cluster representative is above the threshold
value; if found, they are removed from the cluster and added to the cluster
list as separate clusters.
Step 7: The process is repeated until there are no changes in the cluster list,
i.e. no clusters are merged and no new clusters are created. The cluster
representatives form the reference data set, and the clusters define
transformation rules: the values of a given cluster should be replaced with the
value of the cluster representative.
As far as the reference dictionary is concerned, it may happen that it contains
values whose number of occurrences is very small. These values may be marked as
noise and trimmed in order to preserve the compactness of the dictionary.
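A simplified, single-pass sketch of the CIACCT merging logic (Steps 1 to 5) might look as follows; the fixed-point iteration of Steps 6 and 7 is omitted for brevity, and all names and default parameter values are illustrative:

```python
from collections import Counter

def _lev(a, b):
    """Standard dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def rel_dist(a, b):
    """Modified Levenshtein distance of Equation (4.2)."""
    d = _lev(a, b)
    return 0.5 * (d / len(a) + d / len(b))

def ciacct(values, dist_thresh=0.25, occ_rel=3.0):
    """Single-pass CIACCT sketch: Step 1 cleans the values and counts
    occurrences, Step 3 sorts by frequency, and Steps 4-5 merge a rare
    value into a representative that is close enough (dist_thresh) and
    frequent enough (occ_rel).  Returns the transformation rules."""
    cleaned = [''.join(ch for ch in v.upper() if ch.isalnum()) for v in values]
    counts = Counter(cleaned)
    reps = sorted(counts, key=counts.get, reverse=True)   # Step 3
    rules = {}
    for v in reps:
        for rep in reps:
            if rep == v or rep in rules:
                continue
            if (rel_dist(v, rep) <= dist_thresh
                    and counts[rep] / counts[v] >= occ_rel):
                rules[v] = rep                            # Step 5: merge
                break
    return rules

print(ciacct(["Angial"] * 10 + ["Angal"] * 2))   # {'ANGAL': 'ANGIAL'}
```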
4.4.4.4 Results analysis of attribute correction
Context Dependent Attribute Correction using Association Rule (CDACAR)
The algorithm was tested using a sample Cardiology dataset drawn from
the Hungarian data. The rule-generation part of the algorithm is performed on
the whole data set, and the attribute-correction part on a random sample.
The following measures are used for checking the correctness of the
algorithm. Let
Pc - percentage of correctly altered values
Pi - percentage of incorrectly altered values
P0 - percentage of values marked during the review as incorrect, but not
altered during cleaning
The measures are defined as
Pc = nc / na * 100      (4.3)
Pi = ni / na * 100      (4.4)
P0 = n00 / n0 * 100     (4.5)
where
nc - number of correctly altered values
ni - number of incorrectly altered values
na - total number of altered values
n0 - number of values identified as incorrect
n00 - number of elements initially marked as incorrect that were not altered
during the cleaning process.
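Equations (4.3) to (4.5) can be computed directly from the counts; the helper below assumes na = nc + ni, which is consistent with Pc + Pi summing to 100 in Tables 4.2 and 4.4:

```python
def measures(n_c, n_i, n_0, n_00):
    """Quality measures of Equations (4.3)-(4.5), assuming the total
    number of altered values is n_a = n_c + n_i."""
    n_a = n_c + n_i
    p_c = n_c / n_a * 100 if n_a else 0.0    # Pc, Equation (4.3)
    p_i = n_i / n_a * 100 if n_a else 0.0    # Pi, Equation (4.4)
    p_0 = n_00 / n_0 * 100 if n_0 else 100.0 # P0, Equation (4.5)
    return p_c, p_i, p_0

print(measures(90, 10, 100, 74))   # (90.0, 10.0, 74.0)
```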
Table 4.2 shows the relationship between the measures and the
distthresh parameter. Figure 4.8 shows that the number of values marked as
incorrect and altered grows with increasing distthresh; this indicates that the
context-dependent algorithm performs better at identifying incorrect entries.
The number of incorrectly altered values also grows with the parameter.
However, a value of the distthresh parameter can be identified that gives
optimal results, i.e. the number of correctly altered values is high and the
number of incorrectly altered values is low.
Table 4.2 Dependency between the measures and the distthresh parameter for the context-dependent algorithm
Distthresh Pc Pi P0
0 0.0 0.0 100.0
0.1 90 10 73.68
0.2 68.24 31.76 46.62
0.3 31.7 68.3 36.09
0.4 17.26 82.74 33.83
0.5 11.84 88.16 31.33
0.6 10.2 89.8 31.08
0.7 9.38 90.62 30.33
0.8 8.6 91.4 28.82
0.9 8.18 91.82 27.32
1.0 7.77 92.23 17.79
Figure 4.8 Dependency between the measures and the distthresh parameter for the context-dependent algorithm
The result shows that the number of values marked as incorrect (Pi) and altered
grows with increasing DistThresh. Some attributes that may at first glance seem
incorrect are in fact correct in the context of the other attributes within the
same record. The percentage of correctly altered entries reaches its peak for a
small value of the DistThresh parameter (0.1 in Table 4.2).
Context Independent Attribute Correction using Clustering Techniques
(CIACCT)
The algorithm was tested using the sample Cardiology dataset drawn from the
Hungarian data. There are about 44,000 records, divided into 11 batches of
4,000 records. Among the values of the attribute CP (chest pain type), Angial
is one type, which occurs when an area of the heart muscle does not get enough
oxygen-rich blood. Using CIACCT, 4.22% of the whole data set (1,856 elements)
were identified as incorrect and hence subject to alteration. Table 4.3
contains example transformation rules discovered during the execution.
Table 4.3 Transformation Rules
Original value Correct value
Angail Angial
Anchail Angial
Angal Angial
Ancail Angial
The same measures are used:
Pc - percentage of correctly altered values
Pi - percentage of incorrectly altered values
P0 - percentage of values marked during the review as incorrect, but not
altered during cleaning
Table 4.4 and Figure 4.9 show the relationship between the measures and the
distthresh parameter. The results show that the number of values marked as
incorrect and altered grows with increasing distthresh; this indicates that the
context-independent algorithm also performs well at identifying incorrect
entries. However, a value of the distthresh parameter can be identified that
gives good results, i.e. the number of correctly altered values (Pc) is high
and the number of incorrectly altered values (Pi) is low.
Table 4.4 Dependency between the measures and the distthresh parameter for the context-independent algorithm
Distthresh Pc Pi P0
0 0.0 0.0 100.0
0.1 92.63 7.37 92.45
0.2 79.52 20.48 36.96
0.3 67.56 32.44 29.25
0.4 47.23 52.77 26.93
0.5 29.34 70.66 23.41
0.6 17.36 82.64 19.04
0.7 7.96 92.04 8.92
0.8 4.17 95.83 1.11
0.9 1.17 98.83 0.94
1.0 0.78 99.22 0
Figure 4.9 Dependency between the measures and the distthresh parameter for the context-independent algorithm
The algorithm performs better for longer strings, as short strings would
require a higher value of the parameter to discover a correct reference value.
High values of the distthresh parameter result in a larger number of
incorrectly altered elements. This algorithm achieves an efficiency of 92%
correctly altered elements, which is an acceptable value. The range of
application of this method is limited to elements that can be standardized and
for which reference data are available; using this method for cleaning, say,
last names would not yield good results.
4.5 DATA INTEGRATION
Data integration is the process of combining data residing at different
sources and providing the user with a unified view of these data. This process
arises in a variety of situations, both commercial (when two similar companies
need to merge their databases) and scientific (combining research results from
different bioinformatics repositories). In this work, combining two
cardiovascular databases from different hospitals is taken into consideration.
Ideally, the data consumed and/or produced by one component is the same as the
data produced and/or consumed by the other components. This description of
integration highlights the three primary types of system integration:
presentation, control and data integration.
4.5.1. Need for Data Integration
Data integration appears with increasing frequency as the volume of data and
the need to share existing data explode. As information systems grow in
complexity and volume, the need for scalability and versatility of data
integration increases. In management practice, data integration is frequently
called Enterprise Information Integration.
The rapid growth of distributed data has fueled significant interest in building
data integration systems. However, developing these systems today still requires
an enormous amount of labor from system builders. Several nontrivial tasks must
be performed, such as wrapper construction and mapping between schemas. Then,
in dynamic environments such as the Web, sources often undergo changes that
break the system, requiring the builder to continually invest maintenance effort.
This has resulted in very high cost of ownership for integration systems, and
severely limited their deployment in practice.
Health care providers collect and maintain large quantities of data. The major
issue in these data representations is dissimilarity in structure; very rarely
does the structure of the database remain the same. Yet data communication and
data sharing are becoming more important as organizations see the advantages
of integrating their activities and the cost benefits that accrue when data can
be reused rather than recreated from scratch.
The integration of heterogeneous data sources has a long research history
following the different evolutions of information systems. Integrating various
data sources is a major problem in knowledge management. It is a complex
activity that involves reconciliation at various levels: data models, data
schemas and data instances. Thus there arises a strong need for a viable
automation tool that organizes data into a common syntax.
Some of the current work in data integration research concerns the Semantic
Integration problem. This problem is not about how to structure the architecture of
the integration, but how to resolve semantic conflicts between heterogeneous data
sources. For example if two companies merge their databases, certain concepts
and definitions in their respective schemas like "earnings" inevitably have
different meanings. In one database it may mean profits in dollars (a floating point
number), while in the other it might be the number of sales (an integer). A
common strategy for the resolution of such problems is the use of an ontology,
which explicitly defines schema terms and thus helps resolve semantic conflicts.
4.5.2. Implementation of Data Integration
Data integration is done using the Jaro-Winkler algorithm.
The Jaro-Winkler distance is a measure of similarity between two strings. It
is a variant of the Jaro distance metric and is mainly used in the area of
record linkage. The higher the Jaro-Winkler distance of two strings, the more
similar the strings are. Given two strings s1 and s2, the Jaro metric counts
the number of matching characters, denoted m: two characters from s1 and s2,
respectively, are considered matching only if they are the same and not farther
apart than floor(max(|s1|, |s2|) / 2) - 1 positions.
While comparing the columns of one dataset with another for similarity, there
exist two kinds of similarities namely Exact matching and Statistical matching.
Exact matching involves the exact matching of strings to the column names, and
statistical matching involves partial matching of the strings present in the
column name. For example, "Pname" in one database column matching "Pname" in
another database column is exact matching, while "Pname" in one database column
matching "Patient name" in another database column is called statistical
matching.
This method has a limitation: two strings may represent the same thing but
differ as words; for example, "cost" and "price" are two different words with
the same meaning. It is not possible to match such words using the Jaro-Winkler
method. To avoid this, a knowledge repository is used in this research work to
hold all such word pairs that cannot be matched by the Jaro method. Words are
first compared against the repository; if they do not match, they are compared
using the Jaro-Winkler method.
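The Jaro and Jaro-Winkler similarities described above can be sketched as follows; the matching window of floor(max(|s1|, |s2|) / 2) - 1 and the prefix scaling factor p = 0.1 follow the standard definition of the metric:

```python
def jaro(s1, s2):
    """Jaro similarity: characters match if equal and at most
    floor(max(len) / 2) - 1 positions apart; t counts transpositions."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):                       # find matches
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                      # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for strings sharing a prefix (up to 4 chars)."""
    sim = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sim + l * p * (1 - sim)
```

For instance, jaro_winkler("MARTHA", "MARHTA") evaluates to about 0.961, reflecting one transposition in otherwise identical strings.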
After data integration is performed, the database may contain incomplete data
sets; hence it is efficient to perform data cleaning after every data
integration. This increases data reliability.
A sample of the knowledge repository maintained in this work is given below:

String                          Similar names
Patient identification number   p_id, pat_id, id, patient, pat_no, p_no, file_no, f_no
Address                         address, street, area
Blood pressure                  BP, pressure, stress
Medicine                        medicine, drug, medication
First, all the columns of one database are copied into the new database; then
each column of the other database to be integrated is compared with the
knowledge repository maintained in this research work. If the word matches one
of the words in the knowledge base, it is integrated into the corresponding
column of the new database; otherwise it is compared using the Jaro-Winkler
measure. If the columns still do not match, a new column is created in the
integrated database. Figure 4.10 describes the procedure.
Algorithm for Data Integration
Step 1: Get the two databases to be integrated as input.
Step 2: Check the attribute names in both tables and calculate the Jaro
distance metric.
Step 3: The higher the Jaro distance metric, the higher the similarity between
the two attributes; if the metric is high enough, the two attributes
are considered similar and their values are merged.
Step 4: If two attributes are dissimilar, check for their names in the
knowledge repository.
Step 5: If found, the two attributes' values are merged;
else the attribute is considered new and is added to the database.
Input: database1, database2.
Output: database3 (integrated database)
Method:
Step 1: Copy all the attributes and values of database1 into database3.
Step 2: For every attribute in database2:
        set flag = 0
        for each attribute in database3 {
            match it using the knowledge repository, e.g.
            String[] st3 = {"p_id, pat_id, id, patient", "address, street, area",
                            "amount, amt, cost", "phone no, mobile no, contact no"}
            IF it matches {
                set flag = 1
                copy all the values of that column into the
                corresponding matching column of database3
            }
            ELSE {
                check for similarity between the two attributes from
                database2 and database3 using the Jaro method (string comparison)
                IF it matches {
                    set flag = 1
                    copy all the values of that column into the
                    corresponding matching column of database3
                }
            }
        }
        IF (flag = 0) {
            create a new column in database3 with the same column name as in
            database2 and copy all the values of that column into the
            corresponding column of database3
        }
Step 3: End
Figure 4.10 Procedure for Data Integration
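The procedure of Figure 4.10 can be sketched in Python. Each database is represented as a dict from column names to value lists; difflib's ratio stands in for the Jaro-Winkler measure, and the repository entries and the 0.8 threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

# Illustrative knowledge repository (see the sample in Section 4.5.2).
REPOSITORY = [
    {"p_id", "pat_id", "id", "patient", "pat_no", "p_no"},
    {"address", "street", "area"},
]

def integrate(db1, db2, sim=None, thresh=0.8):
    """Figure 4.10 sketch: copy db1, then merge each db2 column into a
    matching db1 column found via the repository or string similarity,
    creating a new column when neither matches."""
    # SequenceMatcher.ratio() is a stand-in for the Jaro-Winkler measure.
    sim = sim or (lambda a, b: SequenceMatcher(None, a, b).ratio())
    db3 = {col: list(vals) for col, vals in db1.items()}
    for col2, vals2 in db2.items():
        target = None
        for col3 in db3:
            in_same_group = any(col2.lower() in g and col3.lower() in g
                                for g in REPOSITORY)
            if in_same_group or sim(col2.lower(), col3.lower()) >= thresh:
                target = col3
                break
        if target:
            db3[target].extend(vals2)   # merge values into matching column
        else:
            db3[col2] = list(vals2)     # flag = 0: create a new column
    return db3
```

For example, integrating {"p_id": [1, 2]} with {"pat_id": [3]} merges the two columns through the repository, while an unmatched column such as "dose" is simply appended.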
4.6 DATA DISCRETIZATION
Discretization is a process that transforms quantitative data into qualitative
data. Quantitative data are commonly involved in data mining applications.
Discretization significantly improves the quality of discovered knowledge and
also reduces the running time of various data mining tasks such as association
rule discovery, classification, clustering and prediction.
Discretization is a process that transforms data containing a quantitative
attribute so that the attribute in question is replaced by a qualitative
attribute. A many to one mapping function is created so that each value of
the original quantitative attributes is mapped onto a value of the new
qualitative attribute. First, discretization divides the value range of the
quantitative attribute into a finite number of intervals. The mapping function
associates all of the quantitative values in a single interval to a single qualitative
value.
Discrete data is information that can be categorized into classes.
Discrete data are based on counts: only a finite number of values are possible,
and the values cannot be subdivided meaningfully. Attribute (discrete) data
cannot be broken down into smaller units that add additional meaning; it
typically comprises things counted in whole numbers.
4.6.1. Need for Discretization
Reducing the number of values for an attribute is especially beneficial if
decision-tree-based methods of classification are to be applied to the pre-processed
data. The reason is that these methods are typically recursive, and a large amount
of time is spent on sorting the data at each step.
Before applying learning algorithms to data sets, practitioners often globally
discretize any numeric attributes. If the algorithm cannot handle numeric attributes
directly, prior discretization is essential. Even if it can, prior discretization often
accelerates induction, and may produce simpler and more accurate classification.
As it is generally done, global discretization denies the learning algorithm
the chance to take advantage of the ordering information implicit in numeric
attributes.
However, a simple transformation of discretized data preserves this
information in a form that learners can use. This work shows that, compared to
using the discretized data directly, the transformation significantly increases
the accuracy of decision trees built by C4.5, decision lists built by PART, and
decision tables built using the wrapper method, on several benchmark datasets.
Moreover, it can significantly reduce the size of the resulting classifiers.
This simple technique makes global discretization an even more useful tool for
data preprocessing.
Many algorithms developed in the machine learning community focus on
learning in nominal feature spaces. However, many real-world databases often
involve continuous features. Those features have to be discretized before using
such algorithms. Discretization methods can transform continuous features into a
finite number of intervals, where each interval is associated with a numerical
discrete value. Discretized intervals can then be treated as ordinal values
during induction and deduction.
4.6.2. Methods in Discretization
Discretization methods can be classified along three axes:
supervised versus unsupervised, global versus local, and static versus dynamic.
A supervised method uses the class information during the discretization
process, while an unsupervised method does not. Popular supervised
discretization algorithms fall into several categories: entropy-based
algorithms, including Ent-MDLP and Mantaras distance; dependence-based
algorithms, including ChiMerge and Chi2; and binning-based algorithms,
including 1R and Marginal Ent. Unsupervised algorithms include equal width,
equal frequency and some more recently proposed algorithms, such as one using
tree-based density estimation.
Local methods produce partitions that are applied to localized regions of the
instance space. Global methods, such as binning, produce a mesh over the entire
continuous instance space, where each feature is partitioned into regions
independent of the other attributes.
Many discretization methods require a parameter, n, indicating the maximum
number of partition intervals in discretizing a feature. Static methods, such as Ent-
MDLP, perform the discretization on each feature and determine the value of n for
each feature independent of the other features. However, the dynamic methods
search through the space of possible n values for all features simultaneously,
thereby capturing interdependencies in feature discretization. There is a wide
variety of discretization methods, starting with naive methods such as
equal-width and equal-frequency.
The simplest and most efficient discretization method is an unsupervised
direct method named equal width discretization, a binning methodology. It
calculates the maximum and the minimum of the feature being discretized and
partitions the observed range into k approximately equal-sized intervals.
4.6.3. Equal width Discretization Methodology
The equal width discretization methodology is described below
1. Get the input dataset which has to be discretized.
2. For each attribute calculate its minimum possible value and maximum
possible value.
3. Then divide the attribute's value range into k intervals of approximately equal size.
4. For each interval, replace the values with a class name.
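The steps above can be sketched as a small binning function; here the bin index plays the role of the class name, and the guard for a constant attribute is an added assumption:

```python
def equal_width(values, k):
    """Equal-width binning: split [min, max] into k equal intervals and
    return the bin index (0 .. k-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1          # guard against a constant attribute
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width([1, 2, 3, 4, 5, 6, 7, 8], 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
```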
The algorithm for the equal width discretization methodology is described in Figure 4.11.
Rules for Discretization
In this work, the following rules are used to transform the data in the database.

Systole
  90-130        Normal
  below 90      Hypotension
  above 130     Hypertension

Diastole
  60-80         Normal
  below 60      Hypotension
  above 80      Hypertension

Heart beat
  72            Normal for adult
  140-150       Normal for child

BMI (Body Mass Index)
  below 18.5    Underweight
  18.5-25       Normal range
  25-30         Overweight
  above 30      Obesity

Dose
  100-300       Low
  300-500       Medium
  above 500     Heavy dose

Anesthesia
  1-3           Normal
  4-8           Serious
Input: database to be discretized
Output: database (discretized database)
Begin
Step 1: Get each column from the input database.
Step 2: Check the column name against the column names present in the rules
        for discretization.
        set flag = 0
        IF it matches {
            set flag = 1
            do {
                check the conditions in the rules and transform each
                numerical value in the column to its corresponding
                categorical value
            } until all the values in the column are discretized
        }
        IF (flag = 0) {
            leave that column and go on to the next column (start from Step 1)
        }
End
Figure 4.11 Procedure for Data Discretization
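The rules of this section together with the procedure of Figure 4.11 can be sketched as follows; the interval tables follow the text, while the handling of boundary values (e.g. a systole of exactly 130) is an assumption:

```python
# Discretization rules as (lower, upper, label) interval tables;
# None means an open bound, and a value v falls in [lower, upper).
RULES = {
    "systole":  [(None, 90, "Hypotension"), (90, 130, "Normal"),
                 (130, None, "Hypertension")],
    "diastole": [(None, 60, "Hypotension"), (60, 80, "Normal"),
                 (80, None, "Hypertension")],
    "bmi":      [(None, 18.5, "Underweight"), (18.5, 25, "Normal range"),
                 (25, 30, "Overweight"), (30, None, "Obesity")],
}

def discretize(column, values):
    """Figure 4.11: map each numeric value of a known column to its class
    label; columns without a matching rule are left unchanged (flag = 0)."""
    if column not in RULES:
        return values
    out = []
    for v in values:
        for lo, hi, label in RULES[column]:
            if (lo is None or v >= lo) and (hi is None or v < hi):
                out.append(label)
                break
    return out

print(discretize("systole", [85, 120, 140]))
# ['Hypotension', 'Normal', 'Hypertension']
```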
4.7. DATA REDUCTION
Data warehouses store vast amounts of data, and mining such complete and
complex data sets takes a long time. Data reduction produces a smaller-volume
data set that yields results similar to those of the complete data set.
Working with data collected through a team effort or at multiple sites can be
both challenging and rewarding. The sheer size and complexity of the dataset
sometimes makes the analysis daunting, but a large data set may also yield richer
and more useful information. The benefits of data reduction techniques
increase as the data sets grow in size and complexity.
4.7.1. Methods for Data Reduction
Reduction can be handled by two methods, discussed as follows.
1. Dimensionality Reduction
2. Numerosity Reduction
Dimensionality Reduction
Dimensionality reduction is defined as the removal of unimportant attributes.
The method used here is feature selection: a process that chooses an optimal
subset of features according to an objective function, selecting the minimum
set of attributes sufficient for the data mining task. The algorithm for
dimensionality reduction is described in Figure 4.12.
Numerosity Reduction
Numerosity reduction fits the data into a model and can be handled by
parametric methods. The parameter on which the numerosity reduction is to take
place is obtained from the user; the values corresponding to the parameter are
stored and the remaining data are discarded. The algorithm for numerosity
reduction is described in Figure 4.13.
4.7.2. Implementation of Data Reduction
Dimensionality Reduction
1. Get the input dataset which has to be reduced.
2. According to the need of data mining algorithms, get the attribute names
that are necessary for the domain.
3. Remove the other attributes from the dataset which are not needed.
Numerosity Reduction
1. Get the input dataset for which numerosity reduction has to be done.
2. Get the attribute names and the parametric value according to which
numerosity reduction has to be done.
3. The records that satisfy the parameter value are stored and the remaining
data are discarded.
Dimensionality Reduction
Input: D /* the cardiology database */
K /* number of attributes to be reduced */
Output: The cardiology database with reduced dimensionality
Begin
Step 1: For each attribute in D
Step 2: Get the number of attributes and attribute name which has to be reduced
from database
Step 3: Delete the attribute from the database
Step 4: Repeat until all the attribute which need to be reduced are deleted
End
Figure 4.12 Procedure for Dimensionality Reduction
Numerosity Reduction
Input: D /* the cardiology database with discretized attributes*/,
K /* parameter according to which numerosity reduction has to be
performed */
Output: The cardiology database with reduced numerosity
Begin
Step 1: For each attribute in D
Step 2: Get the input parameter according to which reduction has to be performed
Step 3: For each record in the database, remove the records which do not
satisfy the parameter
End
Figure 4.13 Procedure for Numerosity Reduction
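Both procedures (Figures 4.12 and 4.13) can be sketched over a list of record dicts; the attribute names in the example are illustrative:

```python
def dimensionality_reduction(records, keep):
    """Figure 4.12: retain only the attributes needed by the mining task."""
    return [{a: r[a] for a in keep if a in r} for r in records]

def numerosity_reduction(records, attribute, value):
    """Figure 4.13: keep only the records satisfying the user parameter."""
    return [r for r in records if r.get(attribute) == value]

records = [{"id": 1, "cp": "Angial", "age": 50},
           {"id": 2, "cp": "Typical", "age": 60}]
print(dimensionality_reduction(records, ["id", "cp"]))
print(numerosity_reduction(records, "cp", "Angial"))
```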
4.8 SUMMARY
In this part of the research work, a new preprocessing technique is
implemented and the need for the proposed model is discussed in detail.
Randomly simulated missing values were estimated by five data imputation
methods, of which K-NN produced the most promising results. Attribute
correction algorithms for the context-dependent and context-independent cases
are proposed and implemented; a knowledge repository is implemented along with
Jaro-Winkler for data integration; the equal width discretization methodology
is used for data discretization; and dimensionality reduction and numerosity
reduction are used to reduce the data for better knowledge discovery.