data m lqlqj¶v r ole in m ining m edical d atasets for d ... · data m lqlqj¶v r ole in m ining m...

Data Mining’s Role in Mining Medical Datasets for Disease

Assessments – a Case Study

Ms. A. Malarvizhi

Research Scholar and Assistant Professor, PG and Research Department of Computer Science, H.H. The Rajah’s College (A), Pudukkottai, Tamilnadu- 622001

Email:[email protected]

Dr. S. Ravichandran Assistant Professor and Head, PG and Research Department of Computer Science,

H.H. The Rajah’s College (A), Pudukkottai, Tamilnadu- 622001

Email:[email protected]

Abstract—Healthcare organizations conserve their

patient’s data digitally for medical evaluations,

references and treatments. These data sets are complex, voluminous and heterogeneous, due to

varying value types, fields of evaluations and medical

disparities in humans. Clinical decision support

systems (CDSS) can help in deep analysis of these

large data values and offer opportunities to improve diagnosis by extracting knowledge from them, where

data mining techniques have been very helpful in

discovering relationships and patters in large data

sets. Also, Data mining techniques have been proved

to be useful in extracting inferential knowledge from complex biomedical datasets for deep insights into

healthcare helping clinicians. Further, complexities

involved in investigating healthcare data for intrinsic

details have been minimized by constructing models

from data mining methods and techniques. Thus, this paper aims to describe clustering techniques for

analyzing patient data sets for disease predictions.

Clustering techniques FCM,K-means++ and K-

Means are detailed in this paper. Further, this paper

also proposes a different set of parameters for use in disease assessment of patients from medical data sets.

Keywords—Data Mining, Health Care, K-Means, K-

Means++

I. INTRODUCTION

Technology advances has made it possible to

store large amounts of medical data. Medical and

healthcare data contain valuable information for

diagnosing diseases. Data mining techniques can

extract intelligible knowledge from medical data

for diagnosis. Healthcare system need automated

tools for identifying and disseminating required

healthcare information. Interest in expert medical

capable of taking decisions independently is

growing in studies of external symptoms and lab

tests, mainly due to the ease in the availability of

medical data of patients. Non-enveloping internal

examinations have enhanced observations. Medical

diagnosis is identifying or establishing supportive

evidences from a patient's symptoms [1]. The

success of data Mining Techniques in many

application domains like Marketing, e-business and

Retail management has led its extension in the field

of health care.

A. Data mining in the Medical Domain

Medical data mining has been explored and

utilized in clinical evaluations and diagnosis.

Available raw medical data is distributed and has to

be compiled into for accuracy and efficiency in

evaluations proving that Healthcare environment is

International Journal of Pure and Applied MathematicsVolume 119 No. 12 2018, 16255-16260ISSN: 1314-3395 (on-line version)url: http://www.ijpam.euSpecial Issue ijpam.eu

16255

information rich, but poor in knowledge [2].

Classifications, grouping, neural networks,

association rule and decision trees have been used

in different healthcare applications [3]. Data

mining techniques in healthcare analysis work with

the primary objective of predicting diseases from

stored medical information. A wide range of

algorithms have predicted diseases including Heart

Diseases, Diabetics, Liver diseases and cancer.

Neural network technique was used in the analysis

of heart diseases [4]. The study in [5] used

association rules-based sequence numbers for

converting the Cleveland heart data set into a

binary dataset. Heart diseases were predicted using

an enhanced k-means clustering algorithm [6].

Limited attributes were also used for assessing

heart diseases with fuzzy techniques [7]. The useful

aspects of big data techniques in healthcare were

explained in [8]. Rules were discovered from

medical transcripts including association of

disease, medications, symptoms and prominent age

groups for diseases[9].The availability of enormous

amount of data in healthcare enables creation of

real data sets. Analyzing these data sets poses

challenges like missing values, high dimensional

values or noise, which makes it inefficient for

classifications. Clustering techniques are a solution

for data analytics[10]. Naïve Bayes, C4.5, back

propagation and decision trees were used to predict

survivability in breast cancer patients [11]. This

paper provides details on three clustering

techniques of data mining in the aspect of

healthcare, in addition to proposing parameters that

can be used by data mining techniques in analysis

of medical data. Figure 1 depicts data mining’s role

in Disease Predictions.

Fig. 1 – Data Mining Technique and Disease Prediction

Accuracy

B. Clustering

Clustering is grouping similar data into groups

called clusters, where objects are similar within and

dissimilar outside groups. Large amount of data

like medical data is summarized into smaller

number of groups or classes for facilitating analysis

or evaluations. Clustering is a machine learning

technique and unsupervised classification, which

can be implemented using partitionsor Hierarchical

grouping or based on data densities. Clustering can

also be implemented based on constraints or

modeled. In partitioning, clustering data objects are

divided into several subsets. In hierarchical

clustering, a connected hierarchy of datasets is

generated. Hierarchical clustering is a frequent

phenomenon in detecting clustering structures [12].

Density based clustering forms clusters based on

the density of data points in region.

The k-means algorithm is a commonly used

clustering algorithm in machine learning for

clustering and quantization, mainly due to its

simplicity. k-means++ initialization is an

improvement of k-means. Fuzzy partitioning is

employed by Fuzzy C Means algorithm, where data

points can belong to many groups with 0 and 1 as

membership grades. In clustering the Euclidean

distance, data vector p and centroid q is computed

using equation (1)

……..(1)

Figure 2 depicts Clustering Types

Fig. 2 - Clustering Types

The k-means algorithm, partitions n objects into

k clusters (The input Value). Once clustered, intra

cluster similarity is high and is measured in

comparison to the mean value of cluster objects

(Cluster centroid or center of gravity). Figure 3

depicts the k-means algorithm.

Fig. 3 – K-Means Algorithm and its Output

Figure 4 illustrates K-means run one step at a

time, where data elementsare depicted by colors

International Journal of Pure and Applied Mathematics Special Issue

16256

(a) (b) (c)

(a) Initial centroid (b) centroids Repositioning (c) Convergence.

Fig. 4 - K-Means Clustering

C. K-Means++ Algorithm

The k-means++ is an enhanced K-means.

Sometimes because of poor initialization on

cetroids, k-means may give poor results. K-

means++ overcomes this problem by proposing a

balanced initial value and is said to successfully

overcome problems associated with initial defining

of cluster-centers for k-means [13]. K_means++ is

better than other proposed methods [14] for

initializing value in k-means and defined as a best

algorithm. The prime objective of K-means++ is to

defeat the difficulty in k-means clustering

algorithm that is sensitive to initial condition as

different initial values may result in inefficient

results as the optimum value may not be precise.

For the same medical dataset, different cluster may

lead to different sets of rules for building a

classifier model. The accuracy of the system again

depends on the optimal value. To improve the

accuracy, efficiency and consistency in the cluster

members, K-means++ is used. The first step in K-

means++ algorithm is calculating the sum of the

distance from origin to each attribute of a data

point in a data set calculated using equation (2).

Figure 5 depicts the K-means++ Algorithm

U =

…… (2)

Fig. 5 – K-Means++ Algorithm

D. Fuzzy C Means Algorithm

In FCM Data points are considered based on

membership levels and can be a part of more than

one group. The membership level defines the

association strength within the group. Thus, data

points are assigned to groups based on membership

levels. Figure 6 depicts the FCM Algorithm

Fig. 6 – FCM Clustering Algorithm

II. PROPOSED METHODOLOGY

There has been an increase in the number of

diabetics (World Health Organization and India is

number one in diabetes. Diabetes need proper care

and management and lack of proper care leads to

other medical complications in humans . Every 8th

woman in US is likely to develop breast cancer,

according to studies. Early diagnosis of breast

cancer is crucial and the diagnosis and staging for

prognosis is based on clinical examination. This

study considers the diabetic and breast cancer

patients details for evaluating clustering algorithms

that can be used in early diagnosis. The Pima

Indian diabetes data set is taken from the UCI

machine learning repository. The proposed

technique was implemented in MATLAB and is

depicted in Figure 7..

Fig. 7 – Flow Chart of the Proposed Methodology

A. Major Parameters used in the study

The input data set has 768 instances with 9

attributes as listed in Table 1 and two class

problems for testing diabetic incidences. Table 2


16257

details on 9 attributes of the breast cancer dataset

taken for the study with 106 instances.

Table I

Attributes of Pima Indian Dataset

No. Attribute Description Missing Values

1 pregnancy Pregnancy Count 109

2 glucose Level

Plasma glucose levels 6

3 Blood Pressure

Diastolic blood pressure (mm Hg)

34

4 triceps Triceps skin fold thickness (mm)

224

5 insulin serum insulin (mu U/ml) 373

6 mass Body mass index (weight in kg/(height in m)̂ 2)

12

7 pedigree Diabetes pedigree function 0 8 age Age (years) 0

9 diabetes Class variable (test for diabetes)

0

Table II

Breast T issue Dataset

Attribute

Description

I0

Ohmat 0 frequency

PA500 phase angle (500 KHz)

AREA spectrum area

A/DA DA normalized area

HFS

high-frequency slope PA

DA

spectral impedance distance

MAX IP Max. spectrum

P spectral curve length

DR distance between I0 and real part of the maximum

frequency point

B. Validity Measures used in the Study

Clustering algorithms try to find the best fit for a

fixed number of clusters and parameterized cluster

shapes. Sometimes the number of clusters might be

wrong or the cluster shapes might not correspond

to the groups in the data, if the data can be grouped

in a meaningful way at all. It can be overcome with

compatible cluster merging, where more number of

clusters are used first and successively reduced by

merging clusters that are similar based on a

predefined criterion. It can also be overcome by

using a validity function with upper and lower

bounds or using a validity measure for grouping

into clusters. Though many scalar validity

measures have been proposed in literature, this

paper uses a set of new validity measures in

clustering to assess the comparative performances

of clustering techniques taken for the study. The

measures are detailed below

1. Partition Coefficient (PC): measures

overlapping in clusters.

Where µijis the membership of data point j in

cluster I and te optimal number of cluster is at the

maximum value.

2. Classification Entropy (CE): measures the

fuzzyness of a cluster.

3. Partition Index (SC): is the ratio between sum

of compactness and separation in clusters. It is used

for comparing partitions with same number of

clusters.

4. Separation Index (S):It is the minimumdistance

between clusters in partitionvalidity.

5. Xie and Beni's Index (XB): ratio between

variation and separation of clusters.

6. Dunn's Index (DI):is the identification of

compactness and separation between clusters.

7. Alternative Dunn Index (ADI):Original Dunn's

index is modified to rate dissimilarity between two

clusters (minxϵ Ci,yϵ Cj d(x, y))

where vj is the cluster center of the j-th cluster.

III. RESULTS AND DISCUSSIONS

All the three algorithms were run on Matlab for

the above discussed parameters. Table 3 lists the

parameters output of the cluster parameters for all

the three algorithms taken for the study.


16258

TABLE III

CLUSTERING OUTPUT

Dataset

s

Algorithms

PC CE SC S XB DI ADI

Dia

betes

K-Mean

s

1 NaN

0.726

5

0.001

4

3.178

4

0.6481

0.563

6

K-Means++

1 NaN

0.7237

0.0014

Inf 0.6481

0.5501

FCM 0.8

088

0.3

311

0.9

548

0.0

018

2.6

115

0.6

4808

0.5

471

Breast

Tissue

K-Mean

s

1 NaN

0.804

7

0.007

6

4.099

6

0.1448

0.034

1

K-Means++

1 NaN

0.8317

0.0078

Inf 0.1448

0.0338

FCM 0.8151

0.3126

1.0106

0.0095

3.6623

0.1448

0.0317

A.Time Factor

This part describes the amount of time needed

for predicting diseases and the comparative time

taken for processing is listed in Table 4.

TABLE IV

TIME FACTOR

DISEASES K MEANS (in msec)

FUZZY C MEANS (in msec)

K-Means ++ (in

msec) Diabe tes 2160 3080 617

Brea st Tissue 2060 2200 618

Table 4 shows that the time taken by each

algorithm for predicting the diseases and it can also

be inferred that K means++ clustering algorithm

takes lesser time when compared to other

algorithms. The comparative accuracy of the

clustering algorithm taken for study is listed in

Table 5. K-means++ achieves better accuracy when

compared to the other two algorithms, proving it is

a better algorithm for clustering and prediction of

diseases.

TABLE V ACCURACY

DISEASES K MEANS

(in msec)

FUZZY C

MEANS (in msec)

K-Means

++ (in

msec) Diabe tes 85 80 90

Brea st Tissue 88 88 93

IV. CONCLUSION

Data mining techniques in medical domains

sometimes fall short of effectiveness or in

discovering hidden relationships and trends. Thus,

there is a need to create effective and efficient

analysis tools using data mining techniques in

healthcare domains. Clustering is technique by

which large datasets are dividing in to small data

collections that are turned into information. This

paper has detailed on clustering technique, while

proposing a different set of parameters for

improving accuracy in disease predictions. This

paper has also analyzedpredictions in healthcare

domain using clustering to discover new range of

information retrieval, specifically proving the

usefulness of k-means++ clustering..

REFERENCES

[1]E. Barati, M. Saraee, A. Mohammadi, N. Adibi and M. R.

Ahamadzadeh “ A Survey On Predictive Data mining Approaches for Medical Informatics” (JSHI): March Edition, 2011. [2]R., Zhang, Y., Katta, "Medical Data Mining, Data Mining

and Knowledge Discovery”, pp. 305-308, 2002 [3]Jiawei Han and Micheline Kamber,” Data Mining Concepts and Techniques”. San Francisco, CA: Elsevier Inc, 2006 [4]Rani U. Analysis of heart diseases dataset using neural net -

work approach. IJDKP. 2011 Sep; 1(5): 1–8. [5]Jabbar MA, Chandra P, Deekshatult BL. Cluster based asso-ciation rule mining for heart attack prediction. JTAIT. 2011 Oct; 32(2):196–201

[6]Sumathi, Kirubakaran. Enhanced Weighted K-Means Clustering Based Risk Level Prediction for Coronary Heart Disease. European Journal of Scientific Research. 2012;

71(4):490–500. [7]Ephzibah EP. A hybrid genetic-fuzzy expert system for effective heart disease diagnosis. ACITY CCIS. 2011; 198:115–21.

[8]S. Vijayarani and S. Sudha, “An Efficient Clustering Algorithm for Predicting Diseases from Hemogram Blood Test Samples”, Indian Journal of Science and Technology, vol.8, pp. 1–8, Aug. 2015.

[9]Jyoti Soni, “Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction”, International Journal of Computer Applications, vol.17, pp. 43–48, Mar. 2011.

[10]Akhilesh Kumar Yadav, Divya Tomar, Sonali Agarwal, “Clustering of Lung Cancer Data Using Foggy K-Means”, International Conference on Recent Trends in Information

Technology (ICRTIT) 2013 [11]Abdelghani Bellaachia, Erhan Guven, “Predicting Breast Cancer Survivability Using Data Mining Techniques”, Washington DC 20052, 2010

[12] R. Karpagam, 2Dr. S. Suganya, “APPLICATIONS OF DATA MINING AND ALGORITHMS IN EDUCATION – A SURVEY”, International Journal of Innovations in Scientific and Engineering Research (IJISER), Vol.3, No.4, pp.38-

46,2016. [13]R.Nithya1, P.Manikandan2, Dr.D.Ramyachitra, Analysis of clustering technique for the diabetes dataset using the training set parameter, International Journal of Advanced Research in

Computer and Communication Engineering, Vol. 4, Issue 9, September 2015. [14]D. Arthur, “Analysing and improving local search: k-means

and ICP,” Stanford University, 2009. [15]Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra,A systematic evaluation of different methods for initializing the K-means clustering algorithm 2010.


16259

data m lqlqj¶v r ole in m ining m edical d atasets for d ... · data m lqlqj¶v r ole in m ining m...

Documents