CHAPTER 4
DATA PREPROCESSING
4.1 PREAMBLE
"Information quality is not an esoteric notion; it directly affects the
effectiveness and efficiency of business processes. Information quality
also plays a major role in customer satisfaction." - Larry P. English
As noted by Han and Kamber (2006), today's real-world databases are
highly susceptible to noisy, missing, and inconsistent data because of their
typically huge size (often several gigabytes or more) and their likely origin from
multiple, heterogeneous sources. Low-quality data will lead to low-quality mining
results. Incomplete, noisy, and inconsistent data are commonplace properties of
large real-world databases and data warehouses. Incomplete data can occur for a
number of reasons. Attributes of interest may not always be available. Other data
may not be included simply because they were not considered important at the time
of entry.
Relevant data may not be recorded due to misunderstanding, or because of
equipment malfunctions. Data that were inconsistent with other recorded data may
have been deleted. Furthermore, the recording of the history of modifications to
the data may have been overlooked. Missing data, particularly for tuples with
missing values for some attributes, may need to be inferred (Han and Kamber,
2006).
Data preprocessing is a data mining technique that involves transforming
raw data into an understandable format. Data preprocessing is a proven method of
resolving such issues.
4.2 PREPROCESSING
Data preprocessing prepares raw data for further processing. The traditional
data preprocessing approach is reactive: it starts with data that is assumed ready
for analysis, and provides no feedback to improve the way data are collected.
Inconsistency between data sets is the main difficulty in data preprocessing.
Figure 4.1 Preprocessing Task
The major tasks of preprocessing are the following.
Data Cleaning
Data cleaning is the process of filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.
Data Integration
Integration of multiple databases, data cubes, or files.
Data Transformation
Data transformation is the task of data normalization and aggregation.
Data Reduction
Data reduction is the process of obtaining a reduced representation of the data
that is smaller in volume yet produces the same or similar analytical results.
Data Discretization
Data discretization is part of data reduction, with particular importance for
numerical data.
The proposed model and tasks for preprocessing are described in the following
sections.
4.3 GENERAL MODEL FOR PREPROCESSING
The preprocessing tasks proposed in this research work are modeled in
Figure 4.2:
Treating missing values
o Rule-based outlier detection
o Imputation methods to treat missing values
o Attribute correction using data mining concepts
Data integration using a knowledge repository and Jaro-Winkler
Data discretization using the equal-width methodology
Data reduction
o Dimensionality reduction
o Numerosity reduction
Figure 4.2 Model for Proposed Preprocessing Task
4.4 DATA CLEANING
"Data cleaning is the number one problem in data warehousing" -
DCI (Discovery Corps, Inc.) survey.
Data quality is an essential characteristic that determines the reliability of
data for making decisions. High-quality data is
Complete: All relevant data, such as accounts, addresses and relationships
for a given customer, are linked.
Accurate: Common data problems like misspellings, typos, and random
abbreviations have been cleaned up.
Available: Required data are accessible on demand; users do not need to
search manually for the information.
Timely: Up-to-date information is readily available to support decisions.
In general, data quality is defined as an aggregated value over a set of
quality criteria [Naumann.F, 2002; Heiko and Johann, 2006]. Starting with the
quality criteria defined in [Naumann.F, 2002], the author describes the set of
criteria that are affected by comprehensive data cleansing and defines how to assess
scores for each of them for an existing data collection. To measure the quality
of a data collection, scores have to be assessed for each of the quality criteria. The
assessment of scores for quality criteria can be used to quantify the necessity of
data cleansing for a data collection, as well as the success of a performed data
cleansing process. Quality criteria can also be used in the
optimization of data cleansing by specifying priorities for each criterion,
which in turn influences the execution of the data cleansing methods affecting that
criterion.
Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies. The actual process of data cleansing may involve
removing typographical errors or validating and correcting values against a known
list of entities. The validation may be strict.
Data cleansing differs from data validation in that validation almost
invariably means data is rejected from the system at entry and is performed at
entry time, rather than on batches of data.
Data cleansing may also involve activities such as harmonization and
standardization of data. For example, harmonization converts short codes (St, Rd)
to actual words (Street, Road). Standardization of data is a means of changing a
reference data set to a new standard, e.g., the use of standard codes.
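Such a harmonization step can be sketched as a simple token replacement; the abbreviation table below is an illustrative assumption, not a list from this work:

```python
# A minimal harmonization sketch: map short codes to full words.
# The mapping table is illustrative, not taken from the thesis.
ABBREVIATIONS = {"St": "Street", "Rd": "Road", "Ave": "Avenue"}

def harmonize(text: str) -> str:
    """Replace known short codes with their full words, token by token."""
    tokens = text.split()
    return " ".join(ABBREVIATIONS.get(t.rstrip("."), t) for t in tokens)

print(harmonize("12 Main St"))  # -> 12 Main Street
```

A real system would typically drive such replacements from a maintained reference data set rather than a hard-coded table.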
The major data cleaning tasks include
Identify outliers and smooth out noisy data
Fill in missing values
Correct inconsistent data
Resolve redundancy caused by data integration
Among these tasks, missing values cause inconsistencies for data mining. Handling
the missing values is therefore a good way to overcome these inconsistencies.
In the medical domain, data might be missing because the value is not relevant
to a particular case, could not be recorded when the data was collected, is
ignored by users because of privacy concerns, is unfeasible to obtain because the
patient cannot undergo the clinical tests, or is lost to equipment malfunction.
Methods for resolving missing values are therefore needed in health care systems
to enhance the quality of diagnosis. The following sections describe the proposed
data cleaning methods.
Figure 4.3 Model for Data Cleaning
4.4.1. Outlier Detection
The method incorporated for outlier detection is the rule-based outlier detection
method. Outlier (or anomaly) detection is an important problem for many
domains, including fraud detection, risk analysis, network intrusion and medical
diagnosis, and the discovery of significant outliers is becoming an integral aspect
of data mining. Outlier detection is a mature field of research with its origins in
statistics.
Outlier detection techniques can operate in one of the following three modes:
(i) Supervised outlier detection
Techniques trained in supervised mode assume the availability
of a training data set which has labeled instances for normal as well
as outlier class. The typical approach in such cases is to build a
predictive model for normal vs. outlier classes. Any unseen data
instance is compared against the model to determine which class it
belongs to. There are two major issues that arise in supervised
outlier detection. First, the anomalous instances are few
compared to the normal instances in the training data. Second,
obtaining accurate and representative labels, especially for the
outlier class, is usually challenging.
(ii) Semi-Supervised outlier detection
Techniques that operate in a semi-supervised mode, assume that the
training data has labeled instances for only the normal class. Since
they do not require labels for the outlier class, they are more widely
applicable than supervised techniques. For example, in space craft
fault detection, an outlier scenario would signify an accident, which
is not easy to model. The typical approach used in such techniques is
to build a model for the class corresponding to normal behavior, and
use the model to identify outliers in the test data.
(iii) Unsupervised outlier detection
Techniques that operate in unsupervised mode do not require
training data, and thus are most widely applicable. The techniques in
this category make the implicit assumption that normal instances are
far more frequent than outliers in the test data. If this assumption does
not hold, such techniques suffer from a high false alarm rate.
Rule based techniques generate rules that capture the normal behavior of a
system [Skalak and Rissland 1990]. Any instance that is not covered by any such
rule is considered an anomaly. Several rule based anomaly detection techniques
operate in a semi-supervised mode where rules are learnt for the normal class(es) and
the confidence associated with the rule that "fires" for a test instance determines if
it is normal or anomalous [Fan et al. 2001; Helmer et al. 1998; Lee et al. 1997;
Salvador and Chan 2003; Teng et al. 2002].
4.4.1.1. Rule based method of outlier detection
The rule-based outlier detection is more appropriate for on-line inconsistency
testing. It works with data of a particular domain only and the consequence is its
simplicity and high execution speed. The approach is actually a set of logical tests
that must be satisfied by every patient record. If one or more of the tests is not
satisfied, the record is detected as an outlier. The logical tests are defined by the
set of rules that hold for the patient records in the domain [Gamberger et. al.,
2000].
In this concept, separate rules are constructed for the positive and negative
class cases. The confirmation rules for the positive class must be true for many
positive cases and for no negative case. If a negative case is detected true for any
confirmation rule developed for the positive class, it is a reliable sign that the case
is an outlier. In the same way, confirmation rules constructed for the negative class
can be used for outlier detection of positive patient records. Some preliminary
inductive learning results have been demonstrated [Gamberger et. al., 2000] that
explicit detection of outliers can be useful for maintaining the data quality of
medical records and that it might be a key for the improvement of medical
decisions and their reliability in regular medical practice. With the intention of
on-line detection of possible data inconsistencies, sets of confirmation rules have
been developed for the database and their test results are reported in this work. An
additional advantage of the approach is that the user receives information
about the rule which raised the alarm, which can be useful in the error detection
process.
Steps Involved in Rule-Based Outlier Detection
Get the input cardiac dataset.
Apply a set of logical tests (rules) to each record in the table.
Records which do not satisfy the rules are considered outliers.
Outliers are then removed from the table.
4.4.1.2. Procedure for outlier detection
Figure 4.4 describes the procedure for outlier detection.
Input: D /* the cardiology database */, k /* no. of desired outliers */
Output: k identified outliers
/* Phase 1 - initialization */
Begin
Step 1:
For each record t in D do
    Update hash table using t
    Label t as a non-outlier with flag "0"
/* Phase 2 - outlier identification using the rule-based outlier detection
method */
Counter = 0
Repeat
    Counter++
Step 2:
    While not end of the database do
        Read next record t which is labeled "0" // non-outlier
        Compute the characteristics by labeling t as an outlier
        If the computed characteristics do not match the characteristics in the rules then
            Update hash tables using t
            Label t as an outlier with flag "1"
Until (Counter = k)
End
Figure 4.4 Procedure for Outlier Detection
The outcome of the above-discussed algorithm is a dataset without outliers, based
on the rules. Missing data is another important issue in preprocessing; it is
discussed in the next section.
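The rule-based detection described above can be sketched in Python. The specific rules (plausible ranges for age and heart rate) and field names are illustrative assumptions, not the actual rule set developed for the cardiac database:

```python
# Hedged sketch of rule-based outlier detection: a record failing any
# logical test is flagged as an outlier. Rules and fields are examples.
RULES = [
    lambda rec: 0 < rec["age"] <= 120,           # plausible age range
    lambda rec: 30 <= rec["heart_rate"] <= 250,  # plausible heart rate
]

def detect_outliers(records):
    """Split records into (clean, outliers) by applying every rule."""
    clean, outliers = [], []
    for rec in records:
        if all(rule(rec) for rule in RULES):
            clean.append(rec)
        else:
            outliers.append(rec)
    return clean, outliers

data = [{"age": 63, "heart_rate": 72}, {"age": 190, "heart_rate": 72}]
clean, outliers = detect_outliers(data)
print(len(clean), len(outliers))  # -> 1 1
```

In the thesis's procedure the rules are confirmation rules learnt per class; here they are stated directly to keep the sketch self-contained.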
4.4.2 Handling Missing Values
The method used for treating missing values plays an important role in data
preprocessing. Missing data is a common problem in statistical analysis. The
tolerance level of missing data is classified by the percentage of missing values:
Up to 1% - Trivial
1-5% - Manageable
5-15% - Sophisticated methods required to handle
More than 15% - Severe impact on interpretation
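The classification above can be expressed as a small helper; how the exact boundary values (e.g. exactly 5%) are assigned is an assumption, since the text leaves it open:

```python
# Hedged sketch of the missing-data tolerance levels listed above.
# Boundary handling (<= vs <) is an assumption.
def missing_tolerance(pct: float) -> str:
    """Classify a percentage of missing data into the stated tolerance levels."""
    if pct <= 1:
        return "trivial"
    if pct <= 5:
        return "manageable"
    if pct <= 15:
        return "sophisticated methods to handle"
    return "severe impact on interpretation"

print(missing_tolerance(12.0))  # -> sophisticated methods to handle
```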
Several methods have been proposed in the literature to treat missing data. Those
methods are divided into three categories, as proposed by Dempster et
al. [1977]. The different patterns of missing values are discussed in the next
section.
4.4.2.1 Pattern of missing
Missing values in a database fall into three categories, viz., Missing
Completely at Random (MCAR), Missing at Random (MAR) and Non-Ignorable
(NI).
Missing Completely at Random (MCAR)
This is the highest level of randomness. It occurs when the probability of an
instance (case) having a missing value for an attribute does not depend on either
the known values or the missing data; the missing values are randomly distributed
across all observations. This is not a realistic assumption for much real-world data.
Missing at Random (MAR)
Missingness is MAR when it does not depend on the true value of the missing
variable, but may depend on the values of other variables that are observed.
This pattern occurs when missing values are not randomly distributed across all
observations, but are randomly distributed within one or more subsamples.
Non-Ignorable (NI)
NI exists when missing values are not randomly distributed across
observations. If the probability that a cell is missing depends on the unobserved
value of the missing response, then the process is non-ignorable.
In next section the theoretical framework for Handling the missing value is
discussed.
4.4.2.2 The theoretical framework
The classification of missing data is categorized in the following three
mechanisms:
• If the probability of an observation being missing does not depend on
observed or unobserved measurements, then the observation is MCAR. A
typical example is a patient moving to another city for non-health reasons.
Patients who drop-out of a study for this reason could be considered as a
random sample of the total study population and their characteristics are
similar.
• If the probability of an observation being missing depends only on
observed measurements, then the observation is MAR. This assumption
implies that the behavior of the post-drop-out observations can be predicted
from the observed variables, and therefore that the response can be estimated
without bias using exclusively the observed data. [For example, when a
patient drops out due to lack of efficacy (illness due to lack of vitamin
efficiency) reflected by a series of poor efficacy outcomes that have been
observed, the appropriate value to assign to the subsequent efficacy
endpoint for this patient can be calculated using the observed data.]
• When observations are neither MCAR nor MAR, they are classified as
Missing Not At Random (MNAR), or non-ignorable: the probability of
an observation being missing depends on unobserved measurements. In
this scenario, the value of the unobserved responses depends on
information not available for the analysis (i.e., not the values observed
previously on the analysis variable or the covariates being used), and thus
future observations cannot be predicted without bias by the model. For
example, it may happen that after a series of visits with good outcomes, a
patient drops out due to lack of efficacy. In this situation the analysis model
based on the observed data, including relevant covariates, is likely to
continue to predict a good outcome, but it is usually unreasonable to expect
the patient to continue to derive benefit from treatment. It is impossible to
be certain whether there is a relationship between missing values and the
unobserved outcome variable, or to judge whether the missing data can be
adequately predicted from the observed data. It is not possible to know
whether the MAR, never mind MCAR, assumption is appropriate in any
practical situation. A proposition that no data in a confirmatory clinical trial
are MNAR seems implausible. Because it must be assumed that some data are
MNAR, the properties (e.g. bias) of any method based on MCAR or MAR
assumptions cannot be reliably determined for any given dataset.
Therefore, the method chosen should not depend primarily on the properties of the
method under the MAR or MCAR assumptions, but on whether it is considered to
provide an appropriately conservative estimate in the circumstances of the trial
under consideration. The methods and procedures for handling missing values are
described in the next section.
4.4.2.3 Methods for handling missing values
The specific methods for handling missing values are mentioned below:
Method of ignoring instances with unknown feature values.
Most common feature value.
Method of treating missing feature values as special values (filling in a
global constant such as "Cardio" for missing values in character data types).
a. Ignoring or Discarding Data.
In this method there are two ways to discard data with missing values:
1. The first is complete case analysis, where any instance with missing
values is discarded entirely.
2. The second determines the level of missing values in each instance and
attribute, and discards the instances with a high level of missing data.
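The two discarding strategies can be sketched as follows; the 50% threshold in the second function is an assumed parameter, since the text does not fix a level:

```python
# Hedged sketch of the two discarding strategies described above.
def complete_case(records):
    """Way 1: drop any record containing at least one missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]

def drop_high_missing(records, max_missing_ratio=0.5):
    """Way 2: drop records whose fraction of missing fields exceeds the
    given level. The default threshold is an illustrative assumption."""
    kept = []
    for r in records:
        missing = sum(v is None for v in r.values())
        if missing / len(r) <= max_missing_ratio:
            kept.append(r)
    return kept

rows = [{"age": 63, "chol": 233}, {"age": None, "chol": None}]
print(len(complete_case(rows)))  # -> 1
```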
b. Parameter estimation
The maximum likelihood procedure is used to estimate the parameters of a
model defined for the complete data. Maximum likelihood procedures that
use variants of the Expectation-Maximization algorithm can handle parameter
estimation in the presence of missing data [Mehala et al., 2009; Dempster et
al., 1977].
c. Imputation techniques
Imputation is the substitution of some value for a missing data point or a
missing component of a data point. Once all missing values have been imputed,
the dataset can then be analyzed using standard techniques for complete data. The
analysis should ideally take into account that there is a greater degree of
uncertainty than if the imputed values had actually been observed, however, and
this generally requires some modification of the standard complete data analysis
methods. In this research work the estimation maximization method is
implemented.
ESTIMATION MAXIMIZATION (EM) METHOD FOR MISSING
VALUES
The algorithm used for handling missing values by the most common
feature method is the EM algorithm. The procedure is discussed in Figure 4.5.
The algorithm:
1. Estimates the most appropriate value to be filled in the missing field.
2. Maximizes the value of all the missing fields in the corresponding attribute.

Input: D /* the cardiology database */
Output: D with filled-in values for the missing fields
Begin
Step 1: For each record t in D do
Step 2: If the field is an integer then /* fill missing values by substituting
the mean for the integer field */
    compute the mean / average of the field values
Step 3: Update the field with the computed value
    if col name = Age
        calculate average of Age
        update col name with avg(Age)
Step 4: If the field is a character then /* fill a global constant for values
missing in the text field */
    identify the global constant used for the variable /* global constant used =
    "cardio" */
Step 5: Update the field with the global constant
End
Figure 4.5 Procedure for Estimation Maximization Method for Missing Values

4.4.3. Missing Value Imputation Methods
As an alternative to the EM model, missing data imputation is a
procedure that replaces the missing values with some plausible values. Imputed
values are treated as just as reliable as the truly observed data, but they are only as
good as the assumptions used to create them.
Imputation is a method of filling in the missing values by attributing to them
values derived from other available data. Imputation is defined as "the process of
estimating missing data of an observation, based on valid values of other
variables" (Hair et al., 1998). Imputation minimizes bias in the mining process and
preserves "expensive to collect" data that would otherwise be discarded (Marvin
et al., 2003). It is important that the estimates for the missing values are accurate,
as even a small number of biased estimates may lead to inaccurate and misleading
results in the mining process.
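The fill procedure of Figure 4.5, substituting the column mean for numeric fields and the global constant "cardio" for text fields, can be sketched as below; the field names and record layout are illustrative assumptions:

```python
# Hedged sketch of the Figure 4.5 fill procedure: mean for numeric
# fields, the global constant "cardio" for character fields.
GLOBAL_CONSTANT = "cardio"

def em_style_fill(records, numeric_fields, text_fields):
    """Return a copy of records with None values filled in per Figure 4.5."""
    filled = [dict(r) for r in records]  # work on a copy
    for field in numeric_fields:
        observed = [r[field] for r in filled if r[field] is not None]
        mean = sum(observed) / len(observed)
        for r in filled:
            if r[field] is None:
                r[field] = mean
    for field in text_fields:
        for r in filled:
            if r[field] is None:
                r[field] = GLOBAL_CONSTANT
    return filled

rows = [{"age": 60, "sex": "M"}, {"age": None, "sex": None}, {"age": 70, "sex": "F"}]
out = em_style_fill(rows, ["age"], ["sex"])
print(out[1])  # -> {'age': 65.0, 'sex': 'cardio'}
```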
Imputation is of several types, viz., single imputation, partial
imputation, multiple imputation and iterative imputation. Zhang.S.C [2010]
handled missing values in heterogeneous data sets using a semi-parametric
iterative imputation method.
Multiple imputation (MI) has several desirable features:
Introducing appropriate random error into the imputation process makes it
possible to get approximately unbiased estimates of all parameters. No
deterministic imputation method can do this in general settings.
Repeated imputation allows one to get good estimates of the standard
errors. Single imputation methods do not allow for the additional error
introduced by imputation (without specialized software of very limited
generality).
MI can be used with any kind of data and any kind of analysis without
specialized software.
4.4.3.1 Imputation in K-Nearest Neighbors (K-NN)
In this method, the missing values of an instance are imputed by considering a
given number of instances most similar to the instance of interest. The
similarity is calculated using a distance function.
The advantages of this method are
Prediction of both quantitative and qualitative attributes
Handling of multiple missing values in a record.
The disadvantages of this method are
(i) It searches through the whole dataset looking for the most similar
instances, which is time consuming.
(ii) The choice of distance function used to calculate the distance.
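A minimal sketch of k-NN imputation as described above; the choice of Euclidean distance over the shared non-missing numeric fields, and the value of k, are assumptions for illustration:

```python
# Hedged sketch of k-NN imputation: fill a missing field with the mean
# of that field over the k nearest complete cases.
import math

def knn_impute(target, complete_cases, field, k=3):
    """Impute target[field] from the k complete cases nearest to target,
    measuring distance over the fields both records have observed."""
    def distance(a, b):
        shared = [f for f in a
                  if f != field and a[f] is not None and b[f] is not None]
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in shared))
    neighbors = sorted(complete_cases, key=lambda c: distance(target, c))[:k]
    target[field] = sum(n[field] for n in neighbors) / len(neighbors)
    return target

cases = [{"age": 60, "bp": 120}, {"age": 62, "bp": 124}, {"age": 80, "bp": 160}]
imputed = knn_impute({"age": 61, "bp": None}, cases, "bp", k=2)
print(imputed)  # -> {'age': 61, 'bp': 122.0}
```

The scan over all complete cases illustrates the stated disadvantage: every imputation costs a full pass over the dataset.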
4.4.3.2 Mean based imputation (single imputation)
In mean imputation, the mean of the values of an attribute that contains
missing data is used to fill in the missing values. In the case of a categorical
attribute, the mode, which is the most frequent value, is used instead of the mean
[Liu et al., 2004]. The algorithm imputes missing values for each attribute
separately. Mean imputation can be conditional or unconditional, i.e., conditioned
or not on the values of other variables in the record. The conditional mean
method imputes a mean value that depends on the values of the complete
attributes for the incomplete record.
4.4.3.3 NORM, which implements missing value estimation
Based on the expectation maximization algorithm [Schafer J.L, 1999], multiple
imputation inference involves three distinct phases:
• The missing data are filled in m times to generate m complete data sets
• The m complete data sets are analyzed by using standard procedures
• The results from the m complete data sets are combined for the inference
4.4.3.4 LSImpute_Rows
The LSImpute_Rows method estimates missing values based on the least squares
error principle and the correlation between cases (rows in the input matrix) [Liu
et al., 2004; José et al., 2006].
4.4.3.5 EMImpute_Columns
EMImpute_Columns estimates missing values using the same
imputation model, but based on the correlation between the features (columns in
the input matrix) [Marisol et al., 2005]. LSImpute_Rows and EMImpute_Columns
involve multiple regressions to make their predictions.
4.4.3.6 Other imputation methods
Hot deck imputation
In this method the missing value is filled with a value from an estimated
distribution of the missing value in the data set. In random hot deck, a missing
value of an attribute is replaced by an observed value of the attribute chosen at
random.
Cold deck imputation
This is the same as hot deck imputation, except that the imputed value is
obtained from a different source.
Imputation using decision trees
All decision tree classifiers handle missing values using built-in
approaches.
GCFIT_MISS_IMPUTE, proposed by Ilango et al. [2009], imputes
the missing values in Type II diabetes databases and evaluates its
performance by estimating the average imputation error. The average imputation
error is a measure representing the degree of inconsistency between the observed
and imputed values. The approach was experimented on the PIMA Indian Type II
Diabetes data set, which originally does not have any missing data. All 8
attributes are considered for the experiments, as the decision attribute is derived
using these 8 attributes. Datasets with different percentages of missing data (from
5% to 85%) were generated using the random labeling feature. For each
percentage of missing data, 20 random simulations were conducted.
In each dataset, missing values were simulated by randomly labeling
feature values as missing. Datasets with different amounts of missing
values (from 5% to 35% of the total available data) were generated. For each
percentage of missing data, 20 random simulations were conducted. The data were
standardised using the maximum difference normalisation procedure, which
mapped the data into the interval [0, 1]. The estimated values were compared to
those in the original data set. The average estimation error E was calculated as
follows:
E = (1/m) Σ_{k=1..m} (1/n) Σ_{i=1..n} |O_ij - I_ij| / (max_j - min_j)    (4.1)

where n is the number of imputed values, m is the number of random
simulations for each missing value, O_ij is the original value to be imputed, I_ij is
the imputed value, and j is the corresponding feature to which O_i and I_i belong. The
result analysis of all these methods is discussed in the next section.
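Equation (4.1) can be computed directly; the layout of the input as one list of (original, imputed) pairs per simulation is an assumption for illustration:

```python
# Hedged sketch of Equation (4.1): normalised absolute error averaged
# over the n imputed values of each of the m random simulations.
def average_estimation_error(simulations, max_j, min_j):
    """simulations: list of m lists of (original, imputed) value pairs
    for a single feature j with range [min_j, max_j]."""
    m = len(simulations)
    total = 0.0
    for sim in simulations:
        n = len(sim)
        total += sum(abs(o - i) / (max_j - min_j) for o, i in sim) / n
    return total / m

sims = [[(10.0, 12.0), (20.0, 20.0)], [(10.0, 11.0), (20.0, 18.0)]]
print(round(average_estimation_error(sims, max_j=20.0, min_j=0.0), 4))  # -> 0.0625
```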
4.4.3.7 Result analysis
The estimated error results obtained from the different methods are
tabulated in Table 4.1. Several k-NN estimators were
implemented, but only the most accurate model is shown. The 10-NN model
produced an average estimation error that is consistently lower than those
obtained using the mean imputation, NORM and LSImpute_Rows methods.
Table 4.1 and Figure 4.6 show the average estimated errors and corresponding
standard deviations. The predictive performance of these methods depends on the
amount of missing values and of complete cases contained in the dataset.
Table 4.1 Average estimated error ± standard deviation

Methods                Percentage of Missing Data
                       5          10         15         20         25         30         35
10-NN                  10.5±9.4   11.1±9.7   11.7±10.2  12.6±10.6  13.7±11.6  14.7±12.2  15.5±12.7
Mean based Imputation  13.6±11.3  14.0±11.5  13.5±11.1  13.7±11.4  13.4±11.3  13.7±11.4  13.8±11.5
NORM                   12.4±13.5  13.3±14.8  12.7±13.9  14.0±14.4  14.6±15.3  14.7±15.3  15.3±15.2
EMImpute_Columns       8.5±22.7   9.2±22.5   9.1±22.4   9.3±22.3   9.2±22.2   7.8±23.2   7.7±23.1
LSImpute_Rows          12.3±22.7  13.6±22.7  14.4±22.6  14.3±22.6  14.6±22.7  13.1±23.7  12.9±23.6
Figure 4.6 Comparison of different methods using different percentages of missing values
From the analysis, it is clear that the 10-NN method produced
the least variability in results. However, when more than 30% of the data were
missing, the performance of k-NN started to deteriorate significantly. This
deterioration occurs when the number of complete cases (nearest neighbors)
available to impute a missing value is actually smaller than k. This is one of the
limitations of this study, because the k-NN models only considered complete cases
(nearest neighbors) for making estimations.
The k-NN was able to generate relatively accurate and less variable results
for different amounts of missing data, which were assessed using 20 missing value
random simulations. However, it is important to remark that, while on the one
hand, this study allowed us to assess the potential of different missing data
estimation methods, on the other hand it did not offer significant evidence to
describe a relationship between the amount of missing data and the accuracy of the
predictions. Attribute correction using data mining concepts is discussed in the
following section.
4.4.4 Attribute Correction Using Association Rule and Clustering
Techniques
In this section, the two proposed algorithms for attribute correction using
data mining techniques with an external reference are discussed: Context
Dependent Attribute Correction using Association Rule (CDACAR) and Context
Independent Attribute Correction using Clustering Technique (CIACCT). The
algorithms described in this section examine whether the data set itself can serve
as a source of reference data that could be used to identify incorrect entries and
enable their correction.
4.4.4.1 Framework
The framework for attribute correction is shown in Figure 4.7.
Figure 4.7 Framework for Attribute Correction: an imputed attribute is corrected
either context-dependently (association rules) or context-independently
(clustering), yielding the corrected attribute.
4.4.4.2 Context Dependent Attribute Correction using Association Rule
(CDACAR)
Context dependent attribute correction refers to correcting attribute values by
considering both the reference data values and the other attribute values of the
record.
In this algorithm, the association rules methodology is used to discover
validation rules for data sets. The frequent item sets are generated using the
Apriori algorithm [Webb.J, 2003].
The following two parameters are used in CDACAR:
Minsup is defined analogously to the parameter of the same name in the Apriori
algorithm.
Distthresh is the threshold distance between the value of the "suspicious"
attribute and a proposed value (the successor of the rule it violates) below which
a correction is made.
The Levenshtein distance (LD) is a measure of the similarity between two strings,
which we refer to as the source string (s) and the target string (t). The distance
is the number of deletions, insertions, or substitutions required to transform s
into t. For example,
• If s is "test" and t is "test", then LD(s,t) = 0, because no
transformations are needed; the strings are already identical.
• If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution
(changing "s" to "n") is sufficient to transform s into t.
The modified Levenshtein distance is defined as

Lev_m(s1, s2) = (1/2) · (Lev(s1, s2)/|s1| + Lev(s1, s2)/|s2|)    (4.2)

where Lev(s1, s2) denotes the Levenshtein distance between strings s1 and s2,
and |s| denotes the length of string s. The modified distance for strings may be
interpreted as the average fraction of one string that has to be modified to be
transformed into the other. For instance, the LD between "Articulation" and
"Articaulation" is 2, and the modified Levenshtein distance for these strings is
0.25. The modification was introduced to make the comparison independent of
the string length.
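Both distances can be sketched with a standard dynamic-programming implementation; this is a minimal illustration, not the thesis's code:

```python
# Hedged sketch: classical Levenshtein distance (one-row DP) and the
# length-normalised modified variant of Equation (4.2).
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def modified_levenshtein(s1: str, s2: str) -> float:
    """Equation (4.2): average fraction of each string to be modified."""
    d = levenshtein(s1, s2)
    return 0.5 * (d / len(s1) + d / len(s2))

print(levenshtein("test", "tent"))  # -> 1
```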
The algorithm is outlined below.
Step 1: Generate all the frequent item sets.
Step 2: Generate all the association rules from the generated sets. The
rules generated may have 1, 2 or 3 predecessors and only one
successor. The association rules form the set of validation rules.
Step 3: The algorithm discovers records whose attribute values are the
predecessors of a generated rule but whose corresponding attribute value
differs from the successor of that rule.
Step 4: The value of the suspicious attribute in a row is compared
with all the successors.
Step 5: If the relative Levenshtein distance is lower than the threshold
distance, the value may be corrected. If there are more values within the
accepted range of the parameter, the value most similar to the value in the
record is chosen.
The results are analyzed in Section 4.4.4.4.
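The comparison-and-correction step can be sketched as below; the threshold value and the candidate successor list are illustrative assumptions, and the Levenshtein helper is the standard definition:

```python
# Hedged sketch of the CDACAR correction step: replace a suspicious
# value with the most similar rule successor within the threshold.
def levenshtein(s, t):
    """Classical edit distance via one-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(curr[j - 1] + 1, prev[j] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def correct_value(value, successors, distthresh=0.3):
    """Return the closest successor if its modified Levenshtein distance
    is within distthresh; otherwise keep the original value."""
    def rel_dist(a, b):
        d = levenshtein(a, b)
        return 0.5 * (d / len(a) + d / len(b))
    best = min((rel_dist(value, s), s) for s in successors)
    return best[1] if best[0] <= distthresh else value

print(correct_value("strett", ["street", "road"]))  # -> street
```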
4.4.4.3 Context Independent Attribute Correction using Clustering
Technique (CIACCT)
Context-independent attribute correction implies that each record
attribute is examined and cleaned in isolation, without regard to the values of the
other attributes of the record. The main idea behind this algorithm is based on the
observation that in most data sets there is a certain number of values with a large
number of occurrences, and a very large number of values with a very low number
of occurrences. The most representative values may therefore be the source of
reference data, while the values with a low number of occurrences are noise or
misspelled instances of the reference data.
The same Levenshtein distance is used in these methods which were
discussed in the previous algorithm.
In this method the following two parameters are considered:
i. distthresh - the maximum distance between two values that allows them to
be marked as similar and related;
ii. occrel - used to determine whether both compared values belong to the
reference data set.
The CIACCT algorithm is described below.
Step 1: Initial cleaning: all attribute values are converted to upper
case and all non-alphanumeric characters are removed; then the number of
occurrences of every value in the cleaned data set is calculated.
Step 2: Each element is assigned to a separate cluster. The cluster element
with the highest number of occurrences is treated as the cluster representative.
Step 3: The cluster list is sorted in descending order of the number of
occurrences of each cluster representative.
Step 4: Starting from the first cluster, all clusters are compared pairwise
and the distance between them is calculated using the modified Levenshtein
distance.
Step 5: If the distance is lower than the distthresh parameter and the ratio of
occurrences of the cluster representatives is greater than or equal to the
occrel parameter, the clusters are merged.
Step 6: After all the clusters are compared, each cluster is examined for
values whose distance to the cluster representative is above the threshold
value; if found, they are removed from the cluster and added to the cluster
list as separate clusters.
Step 7: The process is repeated until there are no changes in the cluster list,
i.e. no clusters are merged and no new clusters are created. The cluster
representatives form the reference data set, and the clusters define
transformation rules: the values of a given cluster should be replaced with the
value of the cluster representative.
As far as the reference dictionary is concerned, it may happen that it contains
values whose number of occurrences is very small. These values may be marked as
noise and trimmed in order to preserve the compactness of the dictionary.
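A simplified, single-pass sketch of the CIACCT merging logic (Steps 1 to 5) might look as follows; the fixed-point iteration of Steps 6 and 7 is omitted for brevity, and all names and default parameter values are illustrative:

```python
from collections import Counter

def _lev(a, b):
    """Standard dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def rel_dist(a, b):
    """Modified Levenshtein distance of Equation (4.2)."""
    d = _lev(a, b)
    return 0.5 * (d / len(a) + d / len(b))

def ciacct(values, dist_thresh=0.25, occ_rel=3.0):
    """Single-pass CIACCT sketch: Step 1 cleans the values and counts
    occurrences, Step 3 sorts by frequency, and Steps 4-5 merge a rare
    value into a representative that is close enough (dist_thresh) and
    frequent enough (occ_rel).  Returns the transformation rules."""
    cleaned = [''.join(ch for ch in v.upper() if ch.isalnum()) for v in values]
    counts = Counter(cleaned)
    reps = sorted(counts, key=counts.get, reverse=True)   # Step 3
    rules = {}
    for v in reps:
        for rep in reps:
            if rep == v or rep in rules:
                continue
            if (rel_dist(v, rep) <= dist_thresh
                    and counts[rep] / counts[v] >= occ_rel):
                rules[v] = rep                            # Step 5: merge
                break
    return rules

print(ciacct(["Angial"] * 10 + ["Angal"] * 2))   # {'ANGAL': 'ANGIAL'}
```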
4.4.4.4 Results analysis of attribute correction
Context Dependent Attribute Correction using Association Rule (CDACAR)
The algorithm was tested using a sample Cardiology dataset drawn from
the Hungarian data. The rule-generation part of the algorithm is performed on
the whole data set, and the attribute-correction part on a random sample.
The following measures are used for checking the correctness of the
algorithm. Let
Pc - percentage of correctly altered values
Pi - percentage of incorrectly altered values
P0 - percentage of values marked during the review as incorrect, but not
altered during cleaning
The measures are defined as
Pc = nc / na * 100      (4.3)
Pi = ni / na * 100      (4.4)
P0 = n00 / n0 * 100     (4.5)
where
nc - number of correctly altered values
ni - number of incorrectly altered values
na - total number of altered values
n0 - number of values identified as incorrect
n00 - number of elements initially marked as incorrect that were not altered
during the cleaning process.
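Equations (4.3) to (4.5) can be computed directly from the counts; the helper below assumes na = nc + ni, which is consistent with Pc + Pi summing to 100 in Tables 4.2 and 4.4:

```python
def measures(n_c, n_i, n_0, n_00):
    """Quality measures of Equations (4.3)-(4.5), assuming the total
    number of altered values is n_a = n_c + n_i."""
    n_a = n_c + n_i
    p_c = n_c / n_a * 100 if n_a else 0.0    # Pc, Equation (4.3)
    p_i = n_i / n_a * 100 if n_a else 0.0    # Pi, Equation (4.4)
    p_0 = n_00 / n_0 * 100 if n_0 else 100.0 # P0, Equation (4.5)
    return p_c, p_i, p_0

print(measures(90, 10, 100, 74))   # (90.0, 10.0, 74.0)
```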
Table 4.2 shows the relationship between the measures and the
distthresh parameter. Figure 4.8 shows that the number of values marked as
incorrect and altered grows with increasing distthresh; this indicates that the
context-dependent algorithm performs better at identifying incorrect entries.
The number of incorrectly altered values also grows with the parameter.
However, a value of the distthresh parameter can be identified that gives
optimal results, i.e. the number of correctly altered values is high and the
number of incorrectly altered values is low.
Table 4.2 Dependency between the measures and the distthresh parameter for the context-dependent algorithm
Distthresh Pc Pi P0
0 0.0 0.0 100.0
0.1 90 10 73.68
0.2 68.24 31.76 46.62
0.3 31.7 68.3 36.09
0.4 17.26 82.74 33.83
0.5 11.84 88.16 31.33
0.6 10.2 89.8 31.08
0.7 9.38 90.62 30.33
0.8 8.6 91.4 28.82
0.9 8.18 91.82 27.32
1.0 7.77 92.23 17.79
Figure 4.8 Dependency between the measures and the distthresh parameter for the context-dependent algorithm
The result shows that the number of values marked as incorrect (Pi) and altered
grows with increasing DistThresh. Some attributes that may at first glance seem
incorrect are in fact correct in the context of the other attributes within the
same record. The percentage of correctly altered entries reaches its peak for a
small value of the DistThresh parameter (0.1 in Table 4.2).
Context Independent Attribute Correction using Clustering Techniques
(CIACCT)
The algorithm was tested using the sample Cardiology dataset drawn from the
Hungarian data. There are about 44,000 records, divided into 11 batches of
4,000 records. Among the values of the attribute CP (chest pain type), Angial
is one type, which occurs when an area of the heart muscle does not get enough
oxygen-rich blood. Using CIACCT, 4.22% of the whole data set (1,856 elements)
were identified as incorrect and hence subject to alteration. Table 4.3
contains example transformation rules discovered during the execution.
Table 4.3 Transformation Rules
Original value Correct value
Angail Angial
Anchail Angial
Angal Angial
Ancail Angial
The same measures are used:
Pc - percentage of correctly altered values
Pi - percentage of incorrectly altered values
P0 - percentage of values marked during the review as incorrect, but not
altered during cleaning
Table 4.4 and Figure 4.9 show the relationship between the measures and the
distthresh parameter. The results show that the number of values marked as
incorrect and altered grows with increasing distthresh; this indicates that the
context-independent algorithm also performs well at identifying incorrect
entries. However, a value of the distthresh parameter can be identified that
gives good results, i.e. the number of correctly altered values (Pc) is high
and the number of incorrectly altered values (Pi) is low.
Table 4.4 Dependency between the measures and the distthresh parameter for the context-independent algorithm
Distthresh Pc Pi P0
0 0.0 0.0 100.0
0.1 92.63 7.37 92.45
0.2 79.52 20.48 36.96
0.3 67.56 32.44 29.25
0.4 47.23 52.77 26.93
0.5 29.34 70.66 23.41
0.6 17.36 82.64 19.04
0.7 7.96 92.04 8.92
0.8 4.17 95.83 1.11
0.9 1.17 98.83 0.94
1.0 0.78 99.22 0
Figure 4.9 Dependency between the measures and the distthresh parameter for the context-independent algorithm
The algorithm performs better for longer strings, as short strings would
require a higher value of the parameter to discover a correct reference value.
High values of the distthresh parameter result in a larger number of
incorrectly altered elements. This algorithm achieves an efficiency of 92%
correctly altered elements, which is an acceptable value. The range of
application of this method is limited to elements that can be standardized and
for which reference data are available; using this method for cleaning, say,
last names would not yield good results.
4.5 DATA INTEGRATION
Data integration is the process of combining data residing at different
sources and providing the user with a unified view of these data. This process
arises in a variety of situations, both commercial (when two similar companies
need to merge their databases) and scientific (combining research results from
different bioinformatics repositories). In this work, combining two
cardiovascular databases from different hospitals is taken into consideration.
Ideally, the data consumed and/or produced by one component is the same as the
data produced and/or consumed by the other components. This description of
integration highlights the three primary types of system integration:
presentation, control and data integration.
4.5.1. Need for Data Integration
Data integration appears with increasing frequency as the volume of data and
the need to share existing data explode. As information systems grow in
complexity and volume, the need for scalability and versatility of data
integration increases. In management practice, data integration is frequently
called Enterprise Information Integration.
The rapid growth of distributed data has fueled significant interest in building
data integration systems. However, developing these systems today still requires
an enormous amount of labor from system builders. Several nontrivial tasks must
be performed, such as wrapper construction and mapping between schemas. Then,
in dynamic environments such as the Web, sources often undergo changes that
break the system, requiring the builder to continually invest maintenance effort.
This has resulted in very high cost of ownership for integration systems, and
severely limited their deployment in practice.
Health care providers collect and maintain large quantities of data. The major
issue in these data representations is dissimilarity in structure; very rarely
does the structure of the database remain the same. Yet data communication and
data sharing are becoming more important as organizations see the advantages
of integrating their activities and the cost benefits that accrue when data can
be reused rather than recreated from scratch.
The integration of heterogeneous data sources has a long research history
following the different evolutions of information systems. Integrating various
data sources is a major problem in knowledge management. It is a complex
activity that involves reconciliation at various levels: data models, data
schemas and data instances. Thus there arises a strong need for a viable
automation tool that organizes data into a common syntax.
Some of the current work in data integration research concerns the Semantic
Integration problem. This problem is not about how to structure the architecture of
the integration, but how to resolve semantic conflicts between heterogeneous data
sources. For example if two companies merge their databases, certain concepts
and definitions in their respective schemas like "earnings" inevitably have
different meanings. In one database it may mean profits in dollars (a floating point
number), while in the other it might be the number of sales (an integer). A
common strategy for the resolution of such problems is the use of an ontology,
which explicitly defines schema terms and thus helps resolve semantic conflicts.
4.5.2. Implementation of Data Integration
Data integration is done using the Jaro-Winkler algorithm.
The Jaro-Winkler distance is a measure of similarity between two strings. It
is a variant of the Jaro distance metric and is mainly used in the area of
record linkage. The higher the Jaro-Winkler distance of two strings, the more
similar the strings are. Given two strings s1 and s2, the Jaro metric counts
the number of matching characters, denoted m: two characters from s1 and s2,
respectively, are considered matching only if they are the same and not farther
apart than floor(max(|s1|, |s2|) / 2) - 1 positions.
While comparing the columns of one dataset with another for similarity, there
exist two kinds of similarities namely Exact matching and Statistical matching.
Exact matching involves the exact matching of strings to the column names, and
statistical matching involves partial matching of the strings present in the
column name. For example, "Pname" in one database column matching "Pname" in
another database column is exact matching, while "Pname" in one database column
matching "Patient name" in another database column is called statistical
matching.
This method has a limitation: two strings may represent the same thing but
differ as words; for example, "cost" and "price" are two different words with
the same meaning. It is not possible to match such words using the Jaro-Winkler
method. To avoid this, a knowledge repository is used in this research work to
hold all such word pairs that cannot be matched by the Jaro method. Words are
first compared against the repository; if they do not match, they are compared
using the Jaro-Winkler method.
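The Jaro and Jaro-Winkler similarities described above can be sketched as follows; the matching window of floor(max(|s1|, |s2|) / 2) - 1 and the prefix scaling factor p = 0.1 follow the standard definition of the metric:

```python
def jaro(s1, s2):
    """Jaro similarity: characters match if equal and at most
    floor(max(len) / 2) - 1 positions apart; t counts transpositions."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):                       # find matches
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                      # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for strings sharing a prefix (up to 4 chars)."""
    sim = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sim + l * p * (1 - sim)
```

For instance, jaro_winkler("MARTHA", "MARHTA") evaluates to about 0.961, reflecting one transposition in otherwise identical strings.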
After data integration is performed, the database may contain incomplete data
sets; hence it is efficient to perform data cleaning after every data
integration. This increases data reliability.
A sample of the knowledge repository maintained in this work is given below:

String                          Similar names
Patient identification number   p_id, pat_id, id, patient, pat_no, p_no, file_no, f_no
Address                         address, street, area
Blood pressure                  BP, pressure, stress
Medicine                        medicine, drug, medication
First, all the columns of one database are copied into the new database; then
each column of the other database to be integrated is compared with the
knowledge repository maintained in this research work. If the word matches one
of the words in the knowledge base, it is integrated into the corresponding
column of the new database; otherwise it is compared using the Jaro-Winkler
measure. If the columns still do not match, a new column is created in the
integrated database. Figure 4.10 describes the procedure.
Algorithm for Data Integration
Step 1: Get the two databases to be integrated as input.
Step 2: Check the attribute names in both tables and calculate the Jaro
distance metric.
Step 3: The higher the Jaro distance metric, the higher the similarity between
the two attributes; if the metric is high enough, the two attributes
are considered similar and their values are merged.
Step 4: If two attributes are dissimilar, check for their names in the
knowledge repository.
Step 5: If found, the two attributes' values are merged;
else the attribute is considered new and is added to the database.
Input: database1, database2.
Output: database3 (integrated database)
Method:
Step 1: Copy all the attributes and values of database1 into database3.
Step 2: For every attribute in database2:
        set flag = 0
        for each attribute in database3 {
            match it using the knowledge repository, e.g.
            String[] st3 = {"p_id, pat_id, id, patient", "address, street, area",
                            "amount, amt, cost", "phone no, mobile no, contact no"}
            IF it matches {
                set flag = 1
                copy all the values of that column into the
                corresponding matching column of database3
            }
            ELSE {
                check for similarity between the two attributes from
                database2 and database3 using the Jaro method (string comparison)
                IF it matches {
                    set flag = 1
                    copy all the values of that column into the
                    corresponding matching column of database3
                }
            }
        }
        IF (flag = 0) {
            create a new column in database3 with the same column name as in
            database2 and copy all the values of that column into the
            corresponding column of database3
        }
Step 3: End
Figure 4.10 Procedure for Data Integration
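The procedure of Figure 4.10 can be sketched in Python. Each database is represented as a dict from column names to value lists; difflib's ratio stands in for the Jaro-Winkler measure, and the repository entries and the 0.8 threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

# Illustrative knowledge repository (see the sample in Section 4.5.2).
REPOSITORY = [
    {"p_id", "pat_id", "id", "patient", "pat_no", "p_no"},
    {"address", "street", "area"},
]

def integrate(db1, db2, sim=None, thresh=0.8):
    """Figure 4.10 sketch: copy db1, then merge each db2 column into a
    matching db1 column found via the repository or string similarity,
    creating a new column when neither matches."""
    # SequenceMatcher.ratio() is a stand-in for the Jaro-Winkler measure.
    sim = sim or (lambda a, b: SequenceMatcher(None, a, b).ratio())
    db3 = {col: list(vals) for col, vals in db1.items()}
    for col2, vals2 in db2.items():
        target = None
        for col3 in db3:
            in_same_group = any(col2.lower() in g and col3.lower() in g
                                for g in REPOSITORY)
            if in_same_group or sim(col2.lower(), col3.lower()) >= thresh:
                target = col3
                break
        if target:
            db3[target].extend(vals2)   # merge values into matching column
        else:
            db3[col2] = list(vals2)     # flag = 0: create a new column
    return db3
```

For example, integrating {"p_id": [1, 2]} with {"pat_id": [3]} merges the two columns through the repository, while an unmatched column such as "dose" is simply appended.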
4.6 DATA DISCRETIZATION
Discretization is a process that transforms quantitative data into qualitative
data. Quantitative data are commonly involved in data mining applications.
Discretization significantly improves the quality of discovered knowledge and
also reduces the running time of various data mining tasks such as association
rule discovery, classification, clustering and prediction.
Discretization is a process that transforms data containing a quantitative
attribute so that the attribute in question is replaced by a qualitative
attribute. A many to one mapping function is created so that each value of
the original quantitative attributes is mapped onto a value of the new
qualitative attribute. First, discretization divides the value range of the
quantitative attribute into a finite number of intervals. The mapping function
associates all of the quantitative values in a single interval to a single qualitative
value.
Discrete data is information that can be categorized into classes.
Discrete data are based on counts: only a finite number of values are possible,
and the values cannot be subdivided meaningfully. Attribute (discrete) data
cannot be broken down into smaller units that add additional meaning; it
typically comprises things counted in whole numbers.
4.6.1. Need for Discretization
Reducing the number of values for an attribute is especially beneficial if
decision-tree-based methods of classification are to be applied to the pre-processed
data. The reason is that these methods are typically recursive, and a large amount
of time is spent on sorting the data at each step.
Before applying learning algorithms to data sets, practitioners often globally
discretize any numeric attributes. If the algorithm cannot handle numeric attributes
directly, prior discretization is essential. Even if it can, prior discretization often
accelerates induction, and may produce simpler and more accurate classification.
As it is generally done, global discretization denies the learning algorithm
the chance to take advantage of the ordering information implicit in numeric
attributes.
However, a simple transformation of discretized data preserves this
information in a form that learners can use. This work shows that, compared to
using the discretized data directly, the transformation significantly increases
the accuracy of decision trees built by C4.5, decision lists built by PART, and
decision tables built using the wrapper method, on several benchmark datasets.
Moreover, it can significantly reduce the size of the resulting classifiers.
This simple technique makes global discretization an even more useful tool for
data preprocessing.
Many algorithms developed in the machine learning community focus on
learning in nominal feature spaces. However, many real-world databases often
involve continuous features. Those features have to be discretized before using
such algorithms. Discretization methods can transform continuous features into a
finite number of intervals, where each interval is associated with a numerical
discrete value. Discretized intervals can then be treated as ordinal values
during induction and deduction.
4.6.2. Methods in Discretization
Discretization methods can be classified along three axes:
supervised versus unsupervised, global versus local, and static versus dynamic.
A supervised method uses the class information during the discretization
process, while an unsupervised method does not. Popular supervised
discretization algorithms fall into several categories: entropy-based
algorithms, including Ent-MDLP and Mantaras distance; dependence-based
algorithms, including ChiMerge and Chi2; and binning-based algorithms,
including 1R and Marginal Ent. Unsupervised algorithms include equal width,
equal frequency and some more recently proposed algorithms, such as one using
tree-based density estimation.
Local methods produce partitions that are applied to localized regions of the
instance space. Global methods, such as binning, produce a mesh over the entire
continuous instance space, where each feature is partitioned into regions
independent of the other attributes.
Many discretization methods require a parameter, n, indicating the maximum
number of partition intervals in discretizing a feature. Static methods, such as Ent-
MDLP, perform the discretization on each feature and determine the value of n for
each feature independent of the other features. However, the dynamic methods
search through the space of possible n values for all features simultaneously,
thereby capturing interdependencies in feature discretization. There is a wide
variety of discretization methods, starting with naive methods such as
equal-width and equal-frequency.
The simplest and most efficient discretization method is an unsupervised
direct method named equal width discretization, a binning methodology. It
calculates the maximum and the minimum of the feature being discretized and
partitions the observed range into k approximately equal-sized intervals.
4.6.3. Equal width Discretization Methodology
The equal width discretization methodology is described below
1. Get the input dataset which has to be discretized.
2. For each attribute calculate its minimum possible value and maximum
possible value.
3. Then divide the attribute's value range into k intervals of approximately equal size.
4. For each interval, replace the values with a class name.
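The steps above can be sketched as a small binning function; here the bin index plays the role of the class name, and the guard for a constant attribute is an added assumption:

```python
def equal_width(values, k):
    """Equal-width binning: split [min, max] into k equal intervals and
    return the bin index (0 .. k-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1          # guard against a constant attribute
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width([1, 2, 3, 4, 5, 6, 7, 8], 4))   # [0, 0, 1, 1, 2, 2, 3, 3]
```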
The algorithm for the equal width discretization methodology is described in Figure 4.11.
Rules for Discretization
In this work, the following rules are used to transform the data in the database.

Systole
  90-130        Normal
  below 90      Hypotension
  above 130     Hypertension

Diastole
  60-80         Normal
  below 60      Hypotension
  above 80      Hypertension

Heart beat
  72            Normal for adult
  140-150       Normal for child

BMI (Body Mass Index)
  below 18.5    Underweight
  18.5-25       Normal range
  25-30         Overweight
  above 30      Obesity

Dose
  100-300       Low
  300-500       Medium
  above 500     Heavy dose

Anesthesia
  1-3           Normal
  4-8           Serious
Input: database to be discretized
Output: database (discretized database)
Begin
Step 1: Get each column from the input database.
Step 2: Check the column name against the column names present in the rules
        for discretization.
        set flag = 0
        IF it matches {
            set flag = 1
            do {
                check the conditions in the rules and transform each
                numerical value in the column to its corresponding
                categorical value
            } until all the values in the column are discretized
        }
        IF (flag = 0) {
            leave that column and go on to the next column (start from Step 1)
        }
End
Figure 4.11 Procedure for Data Discretization
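The rules of this section together with the procedure of Figure 4.11 can be sketched as follows; the interval tables follow the text, while the handling of boundary values (e.g. a systole of exactly 130) is an assumption:

```python
# Discretization rules as (lower, upper, label) interval tables;
# None means an open bound, and a value v falls in [lower, upper).
RULES = {
    "systole":  [(None, 90, "Hypotension"), (90, 130, "Normal"),
                 (130, None, "Hypertension")],
    "diastole": [(None, 60, "Hypotension"), (60, 80, "Normal"),
                 (80, None, "Hypertension")],
    "bmi":      [(None, 18.5, "Underweight"), (18.5, 25, "Normal range"),
                 (25, 30, "Overweight"), (30, None, "Obesity")],
}

def discretize(column, values):
    """Figure 4.11: map each numeric value of a known column to its class
    label; columns without a matching rule are left unchanged (flag = 0)."""
    if column not in RULES:
        return values
    out = []
    for v in values:
        for lo, hi, label in RULES[column]:
            if (lo is None or v >= lo) and (hi is None or v < hi):
                out.append(label)
                break
    return out

print(discretize("systole", [85, 120, 140]))
# ['Hypotension', 'Normal', 'Hypertension']
```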
4.7. DATA REDUCTION
Data warehouses store vast amounts of data, and mining such complete and
complex data sets takes a long time. Data reduction produces a smaller-volume
data set that yields results similar to those of the complete data set.
Working with data collected through a team effort or at multiple sites can be
both challenging and rewarding. The sheer size and complexity of the dataset
sometimes makes the analysis daunting, but a large data set may also yield richer
and more useful information. The benefits of data reduction techniques
increase as the data sets grow in size and complexity.
4.7.1. Methods for Data Reduction
Reduction can be handled by two methods, discussed as follows.
1. Dimensionality Reduction
2. Numerosity Reduction
Dimensionality Reduction
Dimensionality reduction is defined as the removal of unimportant attributes.
The method used here is feature selection: a process that chooses an optimal
subset of features according to an objective function, selecting the minimum
set of attributes sufficient for the data mining task. The algorithm for
dimensionality reduction is described in Figure 4.12.
Numerosity Reduction
Numerosity reduction fits the data into a model and can be handled by
parametric methods. The parameter on which the numerosity reduction is to take
place is obtained from the user; the values corresponding to the parameter are
stored and the remaining data are discarded. The algorithm for numerosity
reduction is described in Figure 4.13.
4.7.2. Implementation of Data Reduction
Dimensionality Reduction
1. Get the input dataset which has to be reduced.
2. According to the need of data mining algorithms, get the attribute names
that are necessary for the domain.
3. Remove the other attributes from the dataset which are not needed.
Numerosity Reduction
1. Get the input dataset for which numerosity reduction has to be done.
2. Get the attribute names and the parametric value according to which
numerosity reduction has to be done.
3. The records that satisfy the parameter value are stored and the remaining
data are discarded.
Dimensionality Reduction
Input: D /* the cardiology database */
K /* number of attributes to be reduced */
Output: The cardiology database with reduced dimensionality
Begin
Step 1: For each attribute in D
Step 2: Get the number of attributes and attribute name which has to be reduced
from database
Step 3: Delete the attribute from the database
Step 4: Repeat until all the attribute which need to be reduced are deleted
End
Figure 4.12 Procedure for Dimensionality Reduction
Numerosity Reduction
Input: D /* the cardiology database with discretized attributes*/,
K /* parameter according to which numerosity reduction has to be
performed */
Output: The cardiology database with reduced numerosity
Begin
Step 1: For each attribute in D
Step 2: Get the input parameter according to which reduction has to be performed
Step 3: For each record in the database, remove the records which do not
satisfy the parameter
End
Figure 4.13 Procedure for Numerosity Reduction
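Both procedures (Figures 4.12 and 4.13) can be sketched over a list of record dicts; the attribute names in the example are illustrative:

```python
def dimensionality_reduction(records, keep):
    """Figure 4.12: retain only the attributes needed by the mining task."""
    return [{a: r[a] for a in keep if a in r} for r in records]

def numerosity_reduction(records, attribute, value):
    """Figure 4.13: keep only the records satisfying the user parameter."""
    return [r for r in records if r.get(attribute) == value]

records = [{"id": 1, "cp": "Angial", "age": 50},
           {"id": 2, "cp": "Typical", "age": 60}]
print(dimensionality_reduction(records, ["id", "cp"]))
print(numerosity_reduction(records, "cp", "Angial"))
```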
4.8 SUMMARY
In this part of the research work, a new preprocessing technique is
implemented and the need for the proposed model is discussed in detail.
Randomly simulated missing values were estimated by five data imputation
methods, of which K-NN produced the most promising results. Attribute
correction algorithms for the context-dependent and context-independent cases
are proposed and implemented; a knowledge repository is implemented along with
Jaro-Winkler for data integration; the equal width discretization methodology
is used for data discretization; and dimensionality reduction and numerosity
reduction are used to reduce the data for better knowledge discovery.