data m lqlqj¶v r ole in m ining m edical d atasets for d ... · data m lqlqj¶v r ole in m ining m...
TRANSCRIPT
Data Mining’s Role in Mining Medical Datasets for Disease
Assessments – a Case Study
Ms. A. Malarvizhi
Research Scholar and Assistant Professor, PG and Research Department of Computer Science, H.H. The Rajah’s College (A), Pudukkottai, Tamilnadu- 622001
Email:[email protected]
Dr. S. Ravichandran Assistant Professor and Head, PG and Research Department of Computer Science,
H.H. The Rajah’s College (A), Pudukkottai, Tamilnadu- 622001
Email:[email protected]
Abstract—Healthcare organizations conserve their
patient’s data digitally for medical evaluations,
references and treatments. These data sets are complex, voluminous and heterogeneous, due to
varying value types, fields of evaluations and medical
disparities in humans. Clinical decision support
systems (CDSS) can help in deep analysis of these
large data values and offer opportunities to improve diagnosis by extracting knowledge from them, where
data mining techniques have been very helpful in
discovering relationships and patters in large data
sets. Also, Data mining techniques have been proved
to be useful in extracting inferential knowledge from complex biomedical datasets for deep insights into
healthcare helping clinicians. Further, complexities
involved in investigating healthcare data for intrinsic
details have been minimized by constructing models
from data mining methods and techniques. Thus, this paper aims to describe clustering techniques for
analyzing patient data sets for disease predictions.
Clustering techniques FCM,K-means++ and K-
Means are detailed in this paper. Further, this paper
also proposes a different set of parameters for use in disease assessment of patients from medical data sets.
Keywords—Data Mining, Health Care, K-Means, K-
Means++
I. INTRODUCTION
Technology advances has made it possible to
store large amounts of medical data. Medical and
healthcare data contain valuable information for
diagnosing diseases. Data mining techniques can
extract intelligible knowledge from medical data
for diagnosis. Healthcare system need automated
tools for identifying and disseminating required
healthcare information. Interest in expert medical
capable of taking decisions independently is
growing in studies of external symptoms and lab
tests, mainly due to the ease in the availability of
medical data of patients. Non-enveloping internal
examinations have enhanced observations. Medical
diagnosis is identifying or establishing supportive
evidences from a patient's symptoms [1]. The
success of data Mining Techniques in many
application domains like Marketing, e-business and
Retail management has led its extension in the field
of health care.
A. Data mining in the Medical Domain
Medical data mining has been explored and
utilized in clinical evaluations and diagnosis.
Available raw medical data is distributed and has to
be compiled into for accuracy and efficiency in
evaluations proving that Healthcare environment is
International Journal of Pure and Applied MathematicsVolume 119 No. 12 2018, 16255-16260ISSN: 1314-3395 (on-line version)url: http://www.ijpam.euSpecial Issue ijpam.eu
16255
information rich, but poor in knowledge [2].
Classifications, grouping, neural networks,
association rule and decision trees have been used
in different healthcare applications [3]. Data
mining techniques in healthcare analysis work with
the primary objective of predicting diseases from
stored medical information. A wide range of
algorithms have predicted diseases including Heart
Diseases, Diabetics, Liver diseases and cancer.
Neural network technique was used in the analysis
of heart diseases [4]. The study in [5] used
association rules-based sequence numbers for
converting the Cleveland heart data set into a
binary dataset. Heart diseases were predicted using
an enhanced k-means clustering algorithm [6].
Limited attributes were also used for assessing
heart diseases with fuzzy techniques [7]. The useful
aspects of big data techniques in healthcare were
explained in [8]. Rules were discovered from
medical transcripts including association of
disease, medications, symptoms and prominent age
groups for diseases[9].The availability of enormous
amount of data in healthcare enables creation of
real data sets. Analyzing these data sets poses
challenges like missing values, high dimensional
values or noise, which makes it inefficient for
classifications. Clustering techniques are a solution
for data analytics[10]. Naïve Bayes, C4.5, back
propagation and decision trees were used to predict
survivability in breast cancer patients [11]. This
paper provides details on three clustering
techniques of data mining in the aspect of
healthcare, in addition to proposing parameters that
can be used by data mining techniques in analysis
of medical data. Figure 1 depicts data mining’s role
in Disease Predictions.
Fig. 1 – Data Mining Technique and Disease Prediction
Accuracy
B. Clustering
Clustering is grouping similar data into groups
called clusters, where objects are similar within and
dissimilar outside groups. Large amount of data
like medical data is summarized into smaller
number of groups or classes for facilitating analysis
or evaluations. Clustering is a machine learning
technique and unsupervised classification, which
can be implemented using partitionsor Hierarchical
grouping or based on data densities. Clustering can
also be implemented based on constraints or
modeled. In partitioning, clustering data objects are
divided into several subsets. In hierarchical
clustering, a connected hierarchy of datasets is
generated. Hierarchical clustering is a frequent
phenomenon in detecting clustering structures [12].
Density based clustering forms clusters based on
the density of data points in region.
The k-means algorithm is a commonly used
clustering algorithm in machine learning for
clustering and quantization, mainly due to its
simplicity. k-means++ initialization is an
improvement of k-means. Fuzzy partitioning is
employed by Fuzzy C Means algorithm, where data
points can belong to many groups with 0 and 1 as
membership grades. In clustering the Euclidean
distance, data vector p and centroid q is computed
using equation (1)
……..(1)
Figure 2 depicts Clustering Types
Fig. 2 - Clustering Types
The k-means algorithm, partitions n objects into
k clusters (The input Value). Once clustered, intra
cluster similarity is high and is measured in
comparison to the mean value of cluster objects
(Cluster centroid or center of gravity). Figure 3
depicts the k-means algorithm.
Fig. 3 – K-Means Algorithm and its Output
Figure 4 illustrates K-means run one step at a
time, where data elementsare depicted by colors
International Journal of Pure and Applied Mathematics Special Issue
16256
(a) (b) (c)
(a) Initial centroid (b) centroids Repositioning (c) Convergence.
Fig. 4 - K-Means Clustering
C. K-Means++ Algorithm
The k-means++ is an enhanced K-means.
Sometimes because of poor initialization on
cetroids, k-means may give poor results. K-
means++ overcomes this problem by proposing a
balanced initial value and is said to successfully
overcome problems associated with initial defining
of cluster-centers for k-means [13]. K_means++ is
better than other proposed methods [14] for
initializing value in k-means and defined as a best
algorithm. The prime objective of K-means++ is to
defeat the difficulty in k-means clustering
algorithm that is sensitive to initial condition as
different initial values may result in inefficient
results as the optimum value may not be precise.
For the same medical dataset, different cluster may
lead to different sets of rules for building a
classifier model. The accuracy of the system again
depends on the optimal value. To improve the
accuracy, efficiency and consistency in the cluster
members, K-means++ is used. The first step in K-
means++ algorithm is calculating the sum of the
distance from origin to each attribute of a data
point in a data set calculated using equation (2).
Figure 5 depicts the K-means++ Algorithm
U =
…… (2)
Fig. 5 – K-Means++ Algorithm
D. Fuzzy C Means Algorithm
In FCM Data points are considered based on
membership levels and can be a part of more than
one group. The membership level defines the
association strength within the group. Thus, data
points are assigned to groups based on membership
levels. Figure 6 depicts the FCM Algorithm
Fig. 6 – FCM Clustering Algorithm
II. PROPOSED METHODOLOGY
There has been an increase in the number of
diabetics (World Health Organization and India is
number one in diabetes. Diabetes need proper care
and management and lack of proper care leads to
other medical complications in humans . Every 8th
woman in US is likely to develop breast cancer,
according to studies. Early diagnosis of breast
cancer is crucial and the diagnosis and staging for
prognosis is based on clinical examination. This
study considers the diabetic and breast cancer
patients details for evaluating clustering algorithms
that can be used in early diagnosis. The Pima
Indian diabetes data set is taken from the UCI
machine learning repository. The proposed
technique was implemented in MATLAB and is
depicted in Figure 7..
Fig. 7 – Flow Chart of the Proposed Methodology
A. Major Parameters used in the study
The input data set has 768 instances with 9
attributes as listed in Table 1 and two class
problems for testing diabetic incidences. Table 2
International Journal of Pure and Applied Mathematics Special Issue
16257
details on 9 attributes of the breast cancer dataset
taken for the study with 106 instances.
Table I
Attributes of Pima Indian Dataset
No. Attribute Description Missing Values
1 pregnancy Pregnancy Count 109
2 glucose Level
Plasma glucose levels 6
3 Blood Pressure
Diastolic blood pressure (mm Hg)
34
4 triceps Triceps skin fold thickness (mm)
224
5 insulin serum insulin (mu U/ml) 373
6 mass Body mass index (weight in kg/(height in m)̂ 2)
12
7 pedigree Diabetes pedigree function 0 8 age Age (years) 0
9 diabetes Class variable (test for diabetes)
0
Table II
Breast T issue Dataset
Attribute
Description
I0
Ohmat 0 frequency
PA500 phase angle (500 KHz)
AREA spectrum area
A/DA DA normalized area
HFS
high-frequency slope PA
DA
spectral impedance distance
MAX IP Max. spectrum
P spectral curve length
DR distance between I0 and real part of the maximum
frequency point
B. Validity Measures used in the Study
Clustering algorithms try to find the best fit for a
fixed number of clusters and parameterized cluster
shapes. Sometimes the number of clusters might be
wrong or the cluster shapes might not correspond
to the groups in the data, if the data can be grouped
in a meaningful way at all. It can be overcome with
compatible cluster merging, where more number of
clusters are used first and successively reduced by
merging clusters that are similar based on a
predefined criterion. It can also be overcome by
using a validity function with upper and lower
bounds or using a validity measure for grouping
into clusters. Though many scalar validity
measures have been proposed in literature, this
paper uses a set of new validity measures in
clustering to assess the comparative performances
of clustering techniques taken for the study. The
measures are detailed below
1. Partition Coefficient (PC): measures
overlapping in clusters.
Where µijis the membership of data point j in
cluster I and te optimal number of cluster is at the
maximum value.
2. Classification Entropy (CE): measures the
fuzzyness of a cluster.
3. Partition Index (SC): is the ratio between sum
of compactness and separation in clusters. It is used
for comparing partitions with same number of
clusters.
4. Separation Index (S):It is the minimumdistance
between clusters in partitionvalidity.
5. Xie and Beni's Index (XB): ratio between
variation and separation of clusters.
6. Dunn's Index (DI):is the identification of
compactness and separation between clusters.
7. Alternative Dunn Index (ADI):Original Dunn's
index is modified to rate dissimilarity between two
clusters (minxϵ Ci,yϵ Cj d(x, y))
where vj is the cluster center of the j-th cluster.
III. RESULTS AND DISCUSSIONS
All the three algorithms were run on Matlab for
the above discussed parameters. Table 3 lists the
parameters output of the cluster parameters for all
the three algorithms taken for the study.
International Journal of Pure and Applied Mathematics Special Issue
16258
TABLE III
CLUSTERING OUTPUT
Dataset
s
Algorithms
PC CE SC S XB DI ADI
Dia
betes
K-Mean
s
1 NaN
0.726
5
0.001
4
3.178
4
0.6481
0.563
6
K-Means++
1 NaN
0.7237
0.0014
Inf 0.6481
0.5501
FCM 0.8
088
0.3
311
0.9
548
0.0
018
2.6
115
0.6
4808
0.5
471
Breast
Tissue
K-Mean
s
1 NaN
0.804
7
0.007
6
4.099
6
0.1448
0.034
1
K-Means++
1 NaN
0.8317
0.0078
Inf 0.1448
0.0338
FCM 0.8151
0.3126
1.0106
0.0095
3.6623
0.1448
0.0317
A.Time Factor
This part describes the amount of time needed
for predicting diseases and the comparative time
taken for processing is listed in Table 4.
TABLE IV
TIME FACTOR
DISEASES K MEANS (in msec)
FUZZY C MEANS (in msec)
K-Means ++ (in
msec) Diabe tes 2160 3080 617
Brea st Tissue 2060 2200 618
Table 4 shows that the time taken by each
algorithm for predicting the diseases and it can also
be inferred that K means++ clustering algorithm
takes lesser time when compared to other
algorithms. The comparative accuracy of the
clustering algorithm taken for study is listed in
Table 5. K-means++ achieves better accuracy when
compared to the other two algorithms, proving it is
a better algorithm for clustering and prediction of
diseases.
TABLE V ACCURACY
DISEASES K MEANS
(in msec)
FUZZY C
MEANS (in msec)
K-Means
++ (in
msec) Diabe tes 85 80 90
Brea st Tissue 88 88 93
IV. CONCLUSION
Data mining techniques in medical domains
sometimes fall short of effectiveness or in
discovering hidden relationships and trends. Thus,
there is a need to create effective and efficient
analysis tools using data mining techniques in
healthcare domains. Clustering is technique by
which large datasets are dividing in to small data
collections that are turned into information. This
paper has detailed on clustering technique, while
proposing a different set of parameters for
improving accuracy in disease predictions. This
paper has also analyzedpredictions in healthcare
domain using clustering to discover new range of
information retrieval, specifically proving the
usefulness of k-means++ clustering..
REFERENCES
[1]E. Barati, M. Saraee, A. Mohammadi, N. Adibi and M. R.
Ahamadzadeh “ A Survey On Predictive Data mining Approaches for Medical Informatics” (JSHI): March Edition, 2011. [2]R., Zhang, Y., Katta, "Medical Data Mining, Data Mining
and Knowledge Discovery”, pp. 305-308, 2002 [3]Jiawei Han and Micheline Kamber,” Data Mining Concepts and Techniques”. San Francisco, CA: Elsevier Inc, 2006 [4]Rani U. Analysis of heart diseases dataset using neural net -
work approach. IJDKP. 2011 Sep; 1(5): 1–8. [5]Jabbar MA, Chandra P, Deekshatult BL. Cluster based asso-ciation rule mining for heart attack prediction. JTAIT. 2011 Oct; 32(2):196–201
[6]Sumathi, Kirubakaran. Enhanced Weighted K-Means Clustering Based Risk Level Prediction for Coronary Heart Disease. European Journal of Scientific Research. 2012;
71(4):490–500. [7]Ephzibah EP. A hybrid genetic-fuzzy expert system for effective heart disease diagnosis. ACITY CCIS. 2011; 198:115–21.
[8]S. Vijayarani and S. Sudha, “An Efficient Clustering Algorithm for Predicting Diseases from Hemogram Blood Test Samples”, Indian Journal of Science and Technology, vol.8, pp. 1–8, Aug. 2015.
[9]Jyoti Soni, “Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction”, International Journal of Computer Applications, vol.17, pp. 43–48, Mar. 2011.
[10]Akhilesh Kumar Yadav, Divya Tomar, Sonali Agarwal, “Clustering of Lung Cancer Data Using Foggy K-Means”, International Conference on Recent Trends in Information
Technology (ICRTIT) 2013 [11]Abdelghani Bellaachia, Erhan Guven, “Predicting Breast Cancer Survivability Using Data Mining Techniques”, Washington DC 20052, 2010
[12] R. Karpagam, 2Dr. S. Suganya, “APPLICATIONS OF DATA MINING AND ALGORITHMS IN EDUCATION – A SURVEY”, International Journal of Innovations in Scientific and Engineering Research (IJISER), Vol.3, No.4, pp.38-
46,2016. [13]R.Nithya1, P.Manikandan2, Dr.D.Ramyachitra, Analysis of clustering technique for the diabetes dataset using the training set parameter, International Journal of Advanced Research in
Computer and Communication Engineering, Vol. 4, Issue 9, September 2015. [14]D. Arthur, “Analysing and improving local search: k-means
and ICP,” Stanford University, 2009. [15]Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra,A systematic evaluation of different methods for initializing the K-means clustering algorithm 2010.
International Journal of Pure and Applied Mathematics Special Issue
16259
16260