
An Efficient Algorithm for Multimodal Epidemic Liability Prediction over Big Data

1 R. Shiva Shankar, 2* M. Mounica Devi, 3 J. Rajanikanth and 4 G. Mahesh

1 Department of CSE, SRKR Engineering College, Bhimavaram, AP, India.
2* Department of CST, Bhimavaram, AP, India. [email protected]
3 Department of CSE, Bhimavaram, AP, India.
4 Department of CSE, Bhimavaram, AP, India.

Abstract

The healthcare industry holds large and complex data that can be mined to discover patterns of disease and to make effective decisions with the help of different machine learning techniques. Advanced data mining techniques are used to discover knowledge in databases and for medical research. Big data analytics provides tools for gathering, managing, analyzing and assimilating the large volumes of structured and unstructured data produced by current healthcare systems. In this paper, we discuss some of the major challenges, with a focus on three upcoming and promising areas of medical research: image, signal, and genomics based analytics. Recent research that targets the use of large volumes of medical data while combining multimodal data from disparate sources is discussed. We experiment on a regional chronic disease, cerebral infarction. We propose a new convolutional neural network (CNN)-based multimodal disease risk prediction algorithm using structured and unstructured data from a hospital. To the best of our knowledge, none of the existing work has focused on both data types in the area of medical big data analytics.

Key Words: Image, signal, genomics based analytics, convolutional neural network, multimodal data.

International Journal of Pure and Applied Mathematics, Special Issue, Volume 119, No. 18, 2018, 207-223. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/


1. Introduction

With the growth of big data research, more attention has been paid to disease prediction with the help of big data analytics tools. To improve the accuracy of risk classification, various studies have selected features automatically from large amounts of data rather than relying on previously selected characteristics. However, that existing work mostly considered structured data. The following challenges remain for big data analysis in risk classification. How should missing data be addressed? How should the main chronic diseases in a certain region, and the main characteristics of the disease in that region, be determined? How can big data analysis be used to assess the disease and build a better model?

To solve these problems, the system draws on both structured and unstructured data in the healthcare field to estimate the risk of disease. First, the system uses a Decision Tree Map algorithm to generate the patterns and causes of disease; it plainly shows the diseases and sub-diseases. Second, a MapReduce algorithm partitions the medical data based on the output of the Decision Tree Map algorithm, so that a query is analyzed only in a specific partition, which increases operational efficiency and reduces query retrieval time. The prediction accuracy of our proposed algorithm is higher than that of several typical prediction algorithms.
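The partitioning idea described above can be illustrated with a small in-memory sketch. It is not the MapReduce job used in this work: the class `PartitionedRecords`, its methods and the record strings are hypothetical, and the partition key is assumed to be the disease name produced by the decision tree step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: records are grouped by disease so that a query
// scans one partition instead of the whole data set.
final class PartitionedRecords {
    private final Map<String, List<String>> partitions = new HashMap<>();

    // Stores a record in the partition of its disease (assumed partition key).
    void add(String disease, String record) {
        partitions.computeIfAbsent(disease, k -> new ArrayList<>()).add(record);
    }

    // Returns the records of one disease that contain the keyword;
    // records of other diseases are never examined.
    List<String> query(String disease, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String record : partitions.getOrDefault(disease, List.of())) {
            if (record.contains(keyword)) {
                hits.add(record);
            }
        }
        return hits;
    }
}
```

The work done by a query then grows with the size of one partition rather than with the size of the full data set, which is the efficiency gain the paragraph refers to.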

The concept of big data is not new; however, the way it is defined is continuously changing. Attempts to define big data essentially characterize it as a collection of data elements whose size, speed, type, and/or complexity require one to seek, adopt, and invent new hardware and software mechanisms in order to store, analyze and visualize the data successfully [9-10]. Healthcare data is spread among multiple healthcare systems, researchers, government entities, health insurers, and so forth. Each of these data repositories is inherently incapable of providing a platform for global data transparency. The accuracy of healthcare data is also important for developing translational research.

With the growth of big data technology, more attention has been paid to disease prediction from the perspective of big data analysis. To improve the accuracy of risk classification, many studies have selected the characteristics automatically. However, that existing work mostly considered structured data [11]. For unstructured data, using a convolutional neural network (CNN) to extract text characteristics automatically has achieved very good results. However, none of the previous work handles medical text data with a CNN. In addition, diseases differ greatly between regions because of climate and living habits. How should the main chronic diseases in a certain region, and the main characteristics of the disease in that region, be determined? How can big data analysis technology be used to build a better model for analyzing the disease? To overcome these problems in the healthcare field, we combine the structured and unstructured data to assess the risk of disease.

2. Literature Survey

“Mining electronic health records: towards better research applications and clinical care,” P. B. Jensen, L. J. Jensen, and S. Brunak [2]. Clinical data describing the treatment of patients is an underused data source with much greater research potential. Mining electronic health records (EHRs) can disclose unknown disease correlations. Integrating EHR data with genetic data will also give a better understanding of genotype-phenotype relationships. However, a broad range of ethical, legal and technical reasons currently hinder the systematic deposition of these data in EHRs and their mining. The authors consider the potential for furthering medical research and clinical care using EHR data and the challenges that must be overcome before this is a reality.

“Wearable 2.0: Enable Human-Cloud Integration in Next Generation Healthcare System,” M. Chen, Y. Zhang, Y. Ma, C. Youn, Y. Li, D. Wu [3]. With the fast growth of the Internet of Things, big data and cloud computing, many comprehensive applications have become available. At the same time, people pay more attention to higher QoE and QoS in a terminal-cloud integrated system. In particular, both advanced terminal technologies and advanced cloud technologies (e.g. big data analytics and cognitive computing in clouds) are expected to provide people with more reliable, authentic and intelligent services. Therefore, the authors propose a Wearable 2.0 healthcare system to improve the QoE and QoS of the next generation healthcare system.

“Smart Clothing: Connecting Human with Clouds and Big Data for Sustainable Health Monitoring,” M. Chen, C. Lai, B. Hu, Y. Ma, J. Song [4]. This paper introduces the key technologies, design details and practical implementation methods of a smart clothing system. Existing wearable devices have various drawbacks, such as insufficient accuracy and insufficient comfort for long-term wearing, so health monitoring through traditional wearable devices is hard to sustain. The authors designed “Smart Clothing” to obtain healthcare big data through sustainable health monitoring, facilitating unobtrusive collection of various physiological indicators of the human body. To provide extensive intelligence for the smart clothing system, a mobile healthcare cloud platform is constructed using mobile internet, cloud computing and big data analytics.

“Enabling real-time information service on telehealth system over cloud-based big data platform,” J. Wang, M. Qiu, and B. Guo [5]. The authors propose a flow estimating algorithm for the telehealth cloud system and design a data coherence protocol for the PHR-based distributed system. The telehealth system covers both clinical and nonclinical uses. It also provides store-and-forward data services so that data can be studied offline by relevant specialists. The paper proposes a probability-based bandwidth model that helps the cloud broker to provide a high-performance allocation of computing nodes and links. This brokering process considers the location protocol of the Personal Health Record (PHR) in the cloud.

“Big data in health care: using analytics to identify and manage high-risk and high-cost patients,” S. Saria, D. W. Bates, A. Shah, L. Ohno-Machado, and G. Escobar, Health Affairs [6]. The US health care system is rapidly adopting electronic health records. At the same time, rapid progress has been made in clinical analytics, that is, techniques for analyzing large quantities of data and gleaning new insights from that analysis, which is part of what is known as big data. The authors present six use cases, key examples where some of the clearest opportunities exist to reduce costs through the use of big data: high-cost patients, readmissions, triage, decompensation (when a patient's condition worsens), adverse events, and treatment optimization for diseases affecting multiple organ systems. They also discuss the types of insights that are likely to emerge from clinical analytics and the types of data needed to obtain such insights.

“Healthcps: Healthcare cyber-physical system assisted by cloud and big data,” Y. Zhang, M. M. Hassan, M. Qiu, C.-W. Tsai, and A. Alamri [7]. This paper proposes a cyber-physical system for patient-centric healthcare applications and services, named Health-CPS, built on cloud and big data analytics technologies. The system consists of three layers: a data collection layer with a unified standard, a data management layer for distributed storage and parallel computing, and a data-oriented service layer. The study shows that cloud and big data technologies can be used to improve the performance of the healthcare system, so that people can enjoy several smart healthcare applications and services.

“Localization based on social big data analysis in the vehicular networks,” M. S. Hossain, K. Lin, J. Luo, A. Ghoneim, and L. Hu [8]. Location-based services, especially for vehicular localization, are an indispensable component of most technologies and applications related to vehicular networks. In this paper, an overlapping and hierarchical social clustering model (OHSC) is designed to classify vehicles into different social clusters by exploring the social relationships between them. Using the results of the OHSC model, the authors propose a social-based localization algorithm (SBL) that uses location prediction to assist global localization in vehicular networks. The experimental results show the accuracy and performance of the OHSC model.

“Risk factors and risk assessment tools for falls in hospital in-patients: a systematic review,” F. Daly, D. Oliver, M. E. McMurdo, F. C. Martin [9]. Despite the heterogeneity of settings, a small number of significant falls risk factors emerged consistently, namely gait instability, agitated confusion, urinary incontinence/frequency, falls history and prescription of 'culprit' drugs (especially sedative/hypnotics). Simple risk assessment tools constructed of similar variables have been shown to predict falls with sensitivity and specificity in excess of 70%, although validation in a variety of settings and in routine clinical use is lacking. Effective falls interventions in this population may require the use of better-validated risk assessment tools or, alternatively, attention to common reversible falls risk factors in all patients.

3. Implementation Procedure

Disease Risk Prediction

The aim of this study is to predict whether a patient is among the cerebral infarction high-risk population according to their medical history. More formally, we treat risk prediction for cerebral infarction as a supervised learning problem: the input is the attribute vector of the patient, $X = (x_1, x_2, \ldots, x_n)$, which includes the patient's personal information such as age, gender, the prevalence of symptoms and living habits (smoking or not), together with other structured and unstructured data.

Structured data (S-data): the patient's structured data are used to predict whether the patient is at high risk of cerebral infarction.

Text data (T-data): the patient's unstructured text data are used to predict whether the patient is at high risk of cerebral infarction.

Structured and text data (S&T-data): the S-data and T-data are fused, combining the structured data and unstructured text data to predict whether the patient is at high risk of cerebral infarction.

System Architecture

Fig 1: System Architecture



4. Hospital Data

The hospital dataset used in this study contains real-life hospital data and is stored in the data center. To protect the patients' privacy and security, we created a security access mechanism. The data provided by the hospital include EHR, medical image data and gene data. We use a three-year dataset covering 2013 to 2015. Our data focus on inpatient department data, which include 31,919 hospitalized patients with 20,320,848 records in total. The inpatient department data consist mainly of structured data and unstructured text data. The structured data encompass laboratory data and the patient's basic information, such as age, gender and life habits. The unstructured text data comprise the patient's account of his or her illness, the doctor's inquiry records and diagnosis, and so on. Common EHR data are summarized in Table 1.

Table 1: Summary of Common EHR Data

| | ICD | CPT | LAB | MEDICATION | CLINICAL NOTES |
|---|---|---|---|---|---|
| Availability | High | High | High | Medium | Medium |
| Recall | Medium | Poor | Medium | Inpatient: High; Outpatient: Variable | Medium |
| Precision | Medium | High | High | Inpatient: High; Outpatient: Variable | Medium high |
| Format | Structured | Structured | Mostly structured | Structured and unstructured | Unstructured |
| Pros | Easy to work with, a good approximation of disease status | Easy to work with, high precision | High data validity | High data validity | More details about doctors' thoughts |
| Cons | Disease code often used for screening, therefore disease might not be there | Missing data | Data normalization and ranges | Prescribed, not necessarily taken | Difficult to process |

5. Proposed Methodology

Disease diagnosis based on the features that have the greatest impact on recognition is one of the interesting and important subjects among researchers in medicine and computer science. This subject has given rise to a concept called Medical Data Mining (MDM). Data mining methods use techniques such as classification and clustering to categorize diseases and their symptoms, which helps diagnosis, as shown in Fig 2.



A disease diagnosis system is designed to predict different diseases such as diabetes, kidney disease and liver disease. The system's workflow is described below:

Step 1: A user of the proposed application (doctor, patient, physician, etc.) inputs the attribute values of a disease and submits them to the decision support system for analysis.

Step 2: At the decision support system, datasets of different diseases are loaded and data mining algorithms are applied to the training dataset. User input requests are collected and processed on the server to estimate the diagnosis result.

Step 3: Healthcare data are processed through the usual data mining steps: data preprocessing, replacement of missing values, feature selection, machine learning, and decision making on the trained dataset. On the decision support system, different classification algorithms are then executed on the training dataset and used to classify the test dataset.

Step 4: Proposed algorithms such as Support Vector Machine and Random Forest are used to give a cluster hierarchy for different subspaces. The voting model ensembles all these results and outputs the final classification result. Finally, the predicted results are compared with the true labels in the testing phase.
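The voting model of Step 4 can be sketched as a simple majority vote over the labels returned by the individual classifiers. This is an illustration under assumptions: the class `MajorityVote` and the label strings are hypothetical, and the paper does not state how ties are broken, so the sketch keeps the label that reaches the top count first.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a majority-vote ensemble.
final class MajorityVote {
    // Returns the label predicted by the most classifiers.
    // On a tie, the label that reached the top count first is kept (an assumption).
    static String vote(List<String> predictions) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : predictions) {
            int count = counts.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                best = label;
                bestCount = count;
            }
        }
        return best; // null when no predictions were given
    }
}
```

For instance, if three member classifiers return "high-risk", "low-risk" and "high-risk" for one patient, `MajorityVote.vote` returns "high-risk".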

Fig 2: Test Model Learn Model

Naive Bayes: It is a classification technique based on Bayes' Theorem. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all these properties independently contribute to the probability that this fruit is an apple.
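The independence assumption means that the score of a class is the class prior multiplied by one conditional probability per observed feature. A minimal sketch follows; the class `NaiveBayesScore`, the "feature=value" keys and the floor of 1e-6 for unseen values are assumptions made for illustration, not part of the original method.

```java
import java.util.Map;

// Hypothetical sketch of the Naive Bayes score for a single class.
final class NaiveBayesScore {
    // prior: P(class); conditional: P(feature=value | class), keyed by "feature=value".
    static double score(double prior, String[] observed, Map<String, Double> conditional) {
        double total = prior;
        for (String featureValue : observed) {
            // Unseen feature values get a small floor instead of zero (an assumption).
            total *= conditional.getOrDefault(featureValue, 1e-6);
        }
        return total;
    }
}
```

The same score is computed for every class, and the class with the largest score is predicted. With many features, summing logarithms instead of multiplying probabilities avoids numerical underflow.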

K-Nearest Neighbor (KNN): It compares each value with its neighboring, or nearest, values. It is a non-parametric method used for classification. Here we apply classification techniques to the test data set and categorize the data into different departments.
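A KNN classifier can be sketched in a few lines. The sketch assumes Euclidean distance and a plain majority vote among the k nearest training rows; the paper does not fix either choice here, and the class `KnnSketch` and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical KNN sketch: Euclidean distance and majority vote are assumptions.
final class KnnSketch {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Classifies query by the most frequent label among its k nearest training rows.
    static String classify(double[][] train, String[] labels, double[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort training row indices by distance to the query, nearest first.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> distance(train[i], query)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (int n = 0; n < Math.min(k, order.length); n++) {
            String label = labels[order[n]];
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                best = label;
                bestCount = count;
            }
        }
        return best;
    }
}
```

Because the method stores the training rows instead of fitting parameters, all of the cost is paid at prediction time, when every training row is compared with the query.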



Decision Tree: It is a tree structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label.
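The structure just described maps directly onto a small recursive data type. The sketch below assumes numeric attributes and binary threshold tests; the class `TreeNode` is hypothetical, and any thresholds and labels used with it are made-up values rather than clinical rules.

```java
// Hypothetical sketch of a decision tree with binary threshold tests.
final class TreeNode {
    final int attribute;     // index of the attribute tested at an internal node
    final double threshold;  // follow the left branch when value <= threshold
    final TreeNode left;
    final TreeNode right;
    final String label;      // class label, non-null only at a leaf

    // Leaf node holding a class label.
    TreeNode(String label) {
        this(-1, 0.0, null, null, label);
    }

    // Internal node holding a test on one attribute.
    TreeNode(int attribute, double threshold, TreeNode left, TreeNode right) {
        this(attribute, threshold, left, right, null);
    }

    private TreeNode(int attribute, double threshold, TreeNode left, TreeNode right, String label) {
        this.attribute = attribute;
        this.threshold = threshold;
        this.left = left;
        this.right = right;
        this.label = label;
    }

    // Walks from this node to a leaf and returns the leaf's class label.
    String classify(double[] x) {
        if (label != null) {
            return label;
        }
        return (x[attribute] <= threshold ? left : right).classify(x);
    }
}
```

Classification follows one branch per test, so its cost is proportional to the depth of the tree and not to the number of training records.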

Test Dataset and Training Dataset

Separating data into training and test datasets is an important part of evaluating data mining models. Splitting the total data set in two lets us minimize the effects of data inconsistency and better understand the characteristics of the model. The training data set is used to build the model, and the test data set is held out to evaluate its predictions. We select 706 patients in total as the experiment data, and the data are randomly divided into training data and test data. The ratio between the training set and the test set is approximately 6:1, i.e., 606 patients in the training data set and 100 patients in the test data set. The C++ language is used to implement the machine learning and deep learning algorithms, which are run in parallel in the data center. In this paper, for S-data, the patient's demographic characteristics, cerebral infarction characteristics and living habits (such as smoking) are extracted according to discussions with doctors and Pearson's correlation analysis. We thus obtain a total of 79 features per patient. For T-data, we first extract 815,073 words to learn the Word Embedding. Then we use independent feature extraction by CNN.
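A random division into 606 training and 100 test patients can be sketched as a seeded shuffle followed by a cut. The class `TrainTestSplit`, the use of integer patient ids and the seed value are assumptions for illustration; the paper does not describe how the random division was implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a random train/test split.
final class TrainTestSplit {
    // Shuffles a copy of ids and returns [training ids, test ids],
    // holding out testSize ids for the test set.
    static List<List<Integer>> split(List<Integer> ids, int testSize, long seed) {
        List<Integer> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split repeatable
        List<Integer> test = new ArrayList<>(shuffled.subList(0, testSize));
        List<Integer> train = new ArrayList<>(shuffled.subList(testSize, shuffled.size()));
        return List.of(train, test);
    }
}
```

Calling `TrainTestSplit.split(ids, 100, 42L)` on 706 patient ids yields 606 training ids and 100 test ids with no patient in both sets.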

Algorithm to Calculate Accuracy, Precision, Recall and F1-Measure

Step 1: Initialize the attributes

String[] attribute = {"", "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses"};

Step 2: Read the data from the file

BufferedReader bufferedReader = new BufferedReader(new FileReader("/home/user/Desktop/breast-cancer.csv"));

Step 3: Repeat the loop until the data is null

String data;
while ((data = bufferedReader.readLine()) != null) {
    String[] split = data.split(",");
    double total = 1.0;
    // Stop at attribute.length so that columns without a named attribute are skipped.
    for (int i = 1; i < split.length && i < attribute.length; i++) {
        // Re-read the MapReduce output, which holds one probability per attribute value.
        BufferedReader bufferedReader2 = new BufferedReader(new FileReader("/home/user/Desktop/output13/part-r-00000"));
        String data1;
        // Assumption: each output line is a key and a probability separated by a tab.
        while ((data1 = bufferedReader2.readLine()) != null) {
            String[] splite = data1.split("\t");
            String[] key = splite[0].split(" ");
            // Match the attribute name first, then the attribute value of this record.
            if (attribute[i].toLowerCase().equals(key[0].replaceAll("_", " ").toLowerCase())) {
                if (key[2].equals(split[i])) {
                    total *= Double.parseDouble(splite[1]);
                }
            }
        }
        bufferedReader2.close();
    }
}

Step 4: Write the data into the file

// Runs once per record, inside the while loop of Step 3.
fileWriter.write(data + "," + "2\n");

Step 5: Calculate the summary

// Cast to double so that integer counts are not truncated by integer division.
double Accuracy = (double) (tp + tn) / (tp + tn + fp + fn);
double precision = (double) tp / (tp + fp);
double Recall = (double) tp / (tp + fn);
double f1_measure = (2 * precision * Recall) / (precision + Recall);
FileWriter fileWriter2 = new FileWriter("/var/www/html/graph/data.txt");
fileWriter2.write("Accuracy\t" + Accuracy + "\n");
fileWriter2.write("precision\t" + precision + "\n");
fileWriter2.write("Recall\t" + Recall + "\n");
fileWriter2.write("f1_measure\t" + f1_measure);
fileWriter2.close();
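The summary measures of Step 5 can be checked on a small confusion matrix. The sketch below is self-contained; the class `Metrics` is hypothetical and the counts in the usage note are made-up values, not results from this study. It assumes the denominators are non-zero.

```java
// Hypothetical helper that computes the summary measures from confusion-matrix counts.
final class Metrics {
    // Fraction of all records that were classified correctly.
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    // Fraction of predicted positives that are true positives.
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    // Fraction of actual positives that were found.
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }
}
```

With tp = 40, tn = 45, fp = 5 and fn = 10, accuracy is 85/100 = 0.85, precision is 40/45, recall is 40/50 = 0.8, and the F1-measure is 80/95. The casts to double matter: with integer counts, 85/100 would otherwise evaluate to 0.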

Methods Used

In this paper, we introduce data imputation and the CNN-based multimodal disease risk prediction (CNN-MDRP) algorithm.

Data imputation

The structured data contain a large number of missing values due to human error, so we need to fill them in. Before data imputation, we first identify incomplete medical data and then update or delete them to improve data quality. Next, we use data integration for data pre-processing. We can integrate the medical data to guarantee data atomicity: for example, height and weight are integrated to obtain body mass index (BMI). For data imputation we use the latent factor model, which explains the observable variables in terms of latent variables.
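The height and weight integration mentioned above can be sketched with the standard BMI formula, weight divided by height squared. The class `Bmi`, the units (kilograms and metres) and the use of NaN to mark a missing value are assumptions; the paper does not state how the records encode these fields.

```java
// Hypothetical sketch: derives BMI from two structured fields.
final class Bmi {
    // BMI = weight (kg) / height (m)^2.
    // Returns NaN when an input is missing (NaN) or the height is not positive,
    // so that the value can be handled later by data imputation.
    static double bmi(double weightKg, double heightM) {
        if (Double.isNaN(weightKg) || Double.isNaN(heightM) || heightM <= 0) {
            return Double.NaN;
        }
        return weightKg / (heightM * heightM);
    }
}
```

For a patient of 70 kg and 1.75 m the derived feature is 70 / 3.0625, about 22.86, and a single BMI column then replaces the two raw columns.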

CNN-Based Multimodal Disease Risk Prediction (CNN-MDRP) Algorithm

CNN-UDRP uses only the text data to predict whether the patient is at high risk of cerebral infarction. For structured and unstructured text data together, we design the CNN-MDRP algorithm based on CNN-UDRP. The processing of text data is similar to CNN-UDRP and extracts 100 features from the text data set. We extract 79 features from the structured data. Then, we perform feature-level fusion using the 100 features from the T-data and the 79 features from the S-data.
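One common way to realize feature-level fusion is to concatenate the two feature vectors before the classifier. The sketch below assumes plain concatenation, which the text does not spell out; the class `FeatureFusion` and its method name are hypothetical.

```java
// Hypothetical sketch: feature-level fusion by concatenation (an assumption).
final class FeatureFusion {
    // Returns the structured features followed by the text features.
    static double[] fuse(double[] structured, double[] text) {
        double[] fused = new double[structured.length + text.length];
        System.arraycopy(structured, 0, fused, 0, structured.length);
        System.arraycopy(text, 0, fused, structured.length, text.length);
        return fused;
    }
}
```

With 79 S-data features and 100 T-data features, the fused vector has 179 entries per patient.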

6. Experimental Results

We perform the execution in a virtual environment.

First, we use the following command to start all the Hadoop daemons for our command-line environment.

$ start-all.sh

To create a directory, we use the command

$ hadoop fs -mkdir /user/info


Here info is the name of the directory we create.

To view the created directory:

$ hadoop fs -ls /user/

To upload the input file to HDFS (Hadoop Distributed File System):

The input files (datasets) used here are

breast-cancer.csv

breast-cancer1.csv

$ hadoop jar dise.jar disease.driver /user/breast-cancer.csv /user/xyz

To change to the Desktop directory:

$ cd Desktop

To list the items on the desktop, so that we can select a jar file for the main command (Java functionality is run in Hadoop through a jar file):

$ ls

Main command:

$ hadoop jar abc.jar comm.mapred /user/info /user/x

Here “abc.jar” is the jar file, “comm.mapred” is the main class to be called, and “x” is the output directory.

To copy the output to the desktop:

$ hadoop fs -get /user/x

To display the output in the user node:

$ hadoop fs -cat /user/x/part-r-00000

Fig 3: Screenshot shows the command to start all scripts related to our command line environment.

Fig 4: Screenshot shows the list of items on the desktop, from which we select a jar file for the main command.

Fig 5: Screenshot shows the main command, where “abc.jar” is the jar file, “comm.mapred” is the main class to be called and “x” is the output directory.

Fig 6: Screenshot shows the number of splits and tokens used to accomplish the job.

Fig 7: Screenshot shows the job counters and Map-Reduce Framework data.

Fig 8: Screenshot shows the final result in the user-node environment.

Fig 9: Screenshot shows the output file in the desktop environment.

Fig 10: Screenshot shows the final output in the desktop environment.

From Fig 3 to Fig 10, we observe that the training dataset is uploaded to the HDFS environment. A job (process/thread) is created and split in order to perform MapReduce.

The KNN classifier used with the CNN-MDRP algorithm takes a record and finds its k nearest instances in the training data. For this algorithm it is necessary to choose the distance measure and the value of k. In this experiment the data are first normalized, and a simple distance measure is then used. Varying the parameter k, we find that the model performs best when k = 10. The results are compared using a different type of decision tree algorithm as the base classifier. The algorithmic approach determines the computational efficiency, the data representation, and the quality of the resulting program.
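Normalization matters for KNN because a feature with a large numeric range would otherwise dominate the distance. The text does not say which normalization was used; the sketch below assumes min-max scaling of each column to [0, 1], and the class `MinMax` is hypothetical.

```java
// Hypothetical sketch: min-max normalization of each feature column (an assumption).
final class MinMax {
    // Rescales every column of rows to [0, 1] in place.
    // A constant column has no spread and is set to 0.
    static void normalize(double[][] rows) {
        if (rows.length == 0) {
            return;
        }
        int cols = rows[0].length;
        for (int c = 0; c < cols; c++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : rows) {
                min = Math.min(min, row[c]);
                max = Math.max(max, row[c]);
            }
            double range = max - min;
            for (double[] row : rows) {
                row[c] = range == 0 ? 0.0 : (row[c] - min) / range;
            }
        }
    }
}
```

In practice the minimum and maximum would be taken from the training set and reused for the test records, so that both sets are scaled the same way.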

Fig 11: Overall comparison of the existing and proposed methods

The existing CNN-MDRP method does not give an exact result for the user, whereas our proposed KNN-based machine learning algorithm gives the best result. Hence, the proposed method is an easy way to analyze and obtain the data, as shown in Fig 11.



7. Conclusion

In this paper, the advantages and disadvantages of several machine learning methods in the biomedical literature are discussed. KNN performs best both for two-class classification tasks and for multi-class classification problems. To improve the accuracy and efficiency of the system, we propose a multiple classifier system. The member classifiers used in the multi-classifier system should have independent errors and perform better than a minimum level. That is, the member classifiers should maintain a minimum degree of disagreement for the method to be successful. By averaging the results of a large number of such classifiers, the decision boundary can be approximated with some accuracy. Hence the uncorrelated errors of individual classifiers can be eliminated by averaging.

References

[1] P. Groves, B. Kayyali, D. Knott, and S. V. Kuiken, “The big data revolution in healthcare: Accelerating value and innovation.”

[2] P. B. Jensen, L. J. Jensen, and S. Brunak, “Mining electronic health records: towards better research applications and clinical care.”

[3] M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, C. Youn, “Wearable 2.0: Enable Human-Cloud Integration in Next Generation Healthcare System,” IEEE Communications, vol. 55, no. 1, pp. 54–61, Jan. 2017.

[4] M. Chen, Y. Ma, J. Song, C. Lai, B. Hu, “Smart Clothing: Connecting Human with Clouds and Big Data for Sustainable Health Monitoring,” ACM/Springer Mobile Networks and Applications, vol. 21, no. 5, pp. 825–845, 2016.

[5] J. Wang, M. Qiu, and B. Guo, “Enabling real-time information service on telehealth system over cloud-based big data platform,” Journal of Systems Architecture, vol. 72, pp. 69–79, 2017.

[6] D. W. Bates, S. Saria, L. Ohno-Machado, A. Shah, and G. Escobar, “Big data in health care: using analytics to identify and manage high-risk and high-cost patients,” Health Affairs, vol. 33, no. 7, pp. 1123–1131, 2014.

[7] Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri, “Healthcps: Healthcare cyber-physical system assisted by cloud and big data.”

[8] K. Lin, J. Luo, L. Hu, M. S. Hossain, and A. Ghoneim, “Localization based on social big data analysis in the vehicular networks,” IEEE Transactions on Industrial Informatics, 2016.



[9] D. Oliver, F. Daly, F. C. Martin, and M. E. McMurdo, “Risk factors and risk assessment tools for falls in hospital in-patients: a systematic review,” Age and ageing, vol. 33, no. 2, pp. 122–130, 2004.

[10] B. Qian, X. Wang, N. Cao, H. Li, and Y.-G. Jiang, “A relative similarity based method for interactive patient risk prediction,” Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 1070–1093, 2015.

[11] S. Zhai, K.-h. Chang, R. Zhang, and Z. M. Zhang, “Deep intent: Learning attentions for online advertising with recurrent neural networks.”

