An Efficient Algorithm for Multimodal Epidemic Liability Prediction over Big Data
1R. Shiva Shankar, 2*M. Mounica Devi, 3J. Rajanikanth and 4G. Mahesh
1Department of CSE, SRKR Engineering College, Bhimavaram, AP, India.
2*Department of CST, SRKR Engineering College, Bhimavaram, AP, India.
3Department of CSE, SRKR Engineering College, Bhimavaram, AP, India.
4Department of CSE, SRKR Engineering College, Bhimavaram, AP, India.
International Journal of Pure and Applied Mathematics, Volume 119 No. 18 2018, 207-223. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/ Special Issue
Abstract The healthcare industry holds large and complex data that can be
mined to discover interesting disease patterns and to make effective
decisions with the help of different machine learning techniques.
Advanced data mining techniques are used to discover knowledge in
databases and to support medical research. Big data analytics provides tools
for gathering, managing, analyzing, and assimilating the large volumes of
structured and unstructured data produced by current healthcare systems. In
this paper, we discuss some of the major challenges, with a focus on three
upcoming and promising areas of medical research: image-, signal-, and
genomics-based analytics. Recent research that targets the use of large
volumes of medical data while combining multimodal data from disparate
sources is discussed. We experiment on a regional chronic disease,
cerebral infarction. We propose a new convolutional neural network
(CNN)-based multimodal disease risk prediction algorithm using structured and
unstructured data from a hospital. To the best of our knowledge, no
existing work has focused on both data types in the area of medical big data
analytics.
Key Words: Image, signal, genomics based analytics, convolution neural
network, multimodal data.
1. Introduction
With the rise of big data research, more attention has been paid to disease
prediction with the help of big data analytics tools. To improve the accuracy
of risk classification, various studies have selected features automatically
from large volumes of data rather than relying on previously chosen
characteristics. However, existing work has mostly considered structured data.
The following challenges remain for risk classification based on big data
analysis. How should missing data be handled? How should the main chronic
diseases in a certain region, and the main characteristics of the disease in
that region, be determined? How can big data analysis be used to assess the
disease and produce a better model?
To solve these problems, the system draws on both structured and unstructured
data from the healthcare field to estimate the risk of disease. First, the
system uses a Decision Tree Map algorithm to derive the patterns and causes of
disease; it clearly shows the diseases and sub-diseases. Second, a MapReduce
algorithm partitions the medical data, based on the output of the Decision
Tree Map algorithm, so that a query is analyzed only within a specific
partition, which increases operational efficiency and reduces query retrieval
time. Compared with several typical prediction algorithms, the prediction
accuracy of our proposed algorithm is higher.
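The partitioning idea can be illustrated with a small in-memory sketch. It is
not the authors' MapReduce job: the class name, the {patientId, diseaseLabel}
record layout (standing in for the Decision Tree Map output), and the use of a
HashMap in place of partitions on a cluster are all assumptions made for
illustration. It shows only why a query that names one disease needs to scan
one partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group patient ids by disease label so that a query
// touches a single partition instead of the whole data set.
class DiseasePartitioner {
    // Each record is {patientId, diseaseLabel}; this layout is an assumption.
    static Map<String, List<String>> partition(List<String[]> records) {
        Map<String, List<String>> partitions = new HashMap<>();
        for (String[] r : records) {
            partitions.computeIfAbsent(r[1], k -> new ArrayList<>()).add(r[0]);
        }
        return partitions;
    }

    // Answer a query by reading only the partition for the requested disease.
    static List<String> query(Map<String, List<String>> partitions, String diseaseLabel) {
        return partitions.getOrDefault(diseaseLabel, new ArrayList<>());
    }
}
```

In a real deployment the grouping would be done by the shuffle phase of the
MapReduce job rather than by a single in-memory map.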
The concept of big data is not new; however, the way it is defined is
continuously changing. Big data is essentially characterized as a collection
of data elements whose size, speed, type, and/or complexity require one to
seek, adopt, and invent new hardware and software mechanisms in order to
store, analyze, and visualize the data successfully [9-10]. In healthcare,
this data is spread among multiple healthcare systems, researchers, government
entities, health insurers, and so forth. Each of these data repositories is
inherently incapable of providing a platform for global data transparency. The
accuracy of healthcare data is also important for developing translational
research.
With the growth of big data technology, more attention has been paid to
disease prediction from the perspective of big data analysis. To improve the
accuracy of risk classification, many studies have selected the
characteristics automatically. However, existing work has mostly considered
structured data [11]. For unstructured data, using a convolutional neural
network (CNN) to extract text characteristics automatically has achieved very
good results. However, none of the previous work handles medical text data
with a CNN. In addition, diseases differ greatly between regions because of
climate and living habits. How should the main chronic diseases in a certain
region, and the main characteristics of the disease in that region, be
determined? How can big data analysis technology be used to analyze the
disease and create a better model? To overcome these problems in the
healthcare field, we combine structured and unstructured data to assess the
risk of disease.
2. Literature Survey
“Mining electronic health records: towards better research applications and
clinical care,” P. B. Jensen, L. J. Jensen, and S. Brunak [2]. Clinical data
describing the treatment of patients represent an underused data source with
much greater research potential. Mining electronic health records (EHRs) can
reveal unknown disease correlations. Integrating EHR data with genetic data
will also give a better understanding of genotype-phenotype relationships.
However, a broad range of ethical, legal, and technical reasons currently
hinder the systematic deposition of these data in EHRs and their mining. The
authors consider the potential for furthering medical research and clinical
care using EHR data and the challenges that must be overcome before this
becomes a reality.
“Wearable 2.0: Enable Human-Cloud Integration in Next Generation Healthcare
System,” M. Chen, Y. Zhang, Y. Ma, C. Youn, Y. Li, D. Wu [3]. Many
comprehensive applications have become available with the fast growth of the
Internet of Things, big data, and cloud computing. At the same time, people
pay more attention to higher QoE and QoS in a terminal-cloud integrated
system. In particular, both advanced terminal technologies and advanced cloud
technologies (e.g., big data analytics and cognitive computing in clouds) are
expected to provide people with more reliable and intelligent services.
Therefore, to improve the QoE and QoS of the next generation healthcare
system, the authors propose a Wearable 2.0 healthcare system.
“Smart Clothing: Connecting Human with Clouds and Big Data for Sustainable
Health Monitoring,” M. Chen, C. Lai, B. Hu, Y. Ma, J. Song [4]. This paper
introduces the key technologies, design details, and practical implementation
methods of a smart clothing system. Existing wearable devices have various
drawbacks, such as insufficient accuracy and insufficient comfort for
long-term wear. Hence, health monitoring through traditional wearable devices
is hard to sustain. The authors design “Smart Clothing” to obtain healthcare
big data through sustainable health monitoring, facilitating the unobtrusive
collection of various physiological indicators of the human body. To provide
pervasive intelligence for the smart clothing system, a mobile healthcare
cloud platform is constructed using mobile internet, cloud computing, and big
data analytics.
“Enabling real-time information service on telehealth system over cloud-based
big data platform,” J. Wang, M. Qiu, and B. Guo [5]. The authors propose a
flow estimating algorithm for the telehealth cloud system and design a data
coherence protocol for the PHR-based distributed system. The telehealth system
covers both clinical and nonclinical uses. It also provides store-and-forward
data services so that the data can be studied offline by relevant specialists.
The paper proposes a probability-based bandwidth model that helps the cloud
broker provide a high-performance allocation of computing nodes and links.
This brokering process considers the location protocol of the Personal Health
Record (PHR) in the cloud.
“Big data in health care: using analytics to identify and manage high-risk and
high-cost patients,” S. Saria, D. W. Bates, A. Shah, L. Ohno-Machado, and G.
Escobar, Health Affairs [6]. The US health care system is rapidly adopting
electronic health records. At the same time, rapid progress has been made in
clinical analytics, that is, techniques for analyzing large quantities of data
and gleaning new insights from that analysis, which is part of what is known
as big data. The authors present six use cases, key examples where some of the
clearest opportunities exist to reduce costs through the use of big data:
high-cost patients, readmissions, triage, decompensation (when a patient's
condition worsens), adverse events, and treatment optimization for diseases
affecting multiple organ systems. The paper also discusses the types of
insights that are likely to emerge from clinical analytics and the types of
data needed to obtain such insights.
“Healthcps: Healthcare cyber-physical system assisted by cloud and big data,”
Y. Zhang, M. M. Hassan, M. Qiu, C.-W. Tsai, and A. Alamri [7]. This paper
proposes a cyber-physical system, named Health-CPS, for patient-centric
healthcare applications and services. It is built on cloud and big data
analytics technologies. The system consists of three layers: a data collection
layer with a unified standard, a data management layer for distributed storage
and parallel computing, and a data-oriented service layer. The study shows
that cloud and big data technologies can be used to improve the performance of
the healthcare system, so that people can enjoy a variety of smart healthcare
applications and services.
“Localization based on social big data analysis in the vehicular networks,” M.
S. Hossain, K. Lin, J. Luo, A. Ghoneim, L. Hu [8]. Location-based services,
especially for vehicular localization, are an indispensable component of most
technologies and applications related to vehicular networks. In this paper, an
overlapping and hierarchical social clustering model (OHSC) is designed to
classify the vehicles into different social clusters by exploring the social
relationships between them. Using the results of the OHSC model, the authors
propose a social-based localization algorithm (SBL) that uses location
prediction to assist global localization in vehicular networks. The
experimental results show the accuracy and performance of the OHSC model.
“Risk factors and risk assessment tools for falls in hospital in-patients: a
systematic review,” F. Daly, D. Oliver, M. E. McMurdo, F. C. Martin [9]. A
small number of significant falls risk factors emerged consistently, despite
the heterogeneity of settings, namely gait instability, agitated confusion,
urinary incontinence/frequency, falls history, and prescription of 'culprit'
drugs (especially sedatives/hypnotics). Simple risk assessment tools
constructed from similar variables have been shown to predict falls with
sensitivity and specificity in excess of 70%, although validation in a variety
of settings and in routine clinical use is lacking. Effective falls
interventions in this population may require the use of better-validated risk
assessment tools or, alternatively, attention to common reversible fall risk
factors in all patients.
3. Implementation Procedure
Disease Risk Prediction
The aim of this study is to predict whether a patient belongs to the cerebral
infarction high-risk population according to their medical history. More
formally, we treat risk prediction for cerebral infarction as a supervised
machine learning problem, i.e., the input is the patient's attribute vector
X = (x1, x2, ..., xn), which includes the patient's personal information such
as age, gender, the prevalence of symptoms, and living habits (smoking or
not), together with other structured and unstructured data. The output
indicates whether the patient is at high risk of cerebral infarction. Three
kinds of input are considered:
Structured data (S-data): uses the patient's structured data to predict
whether the patient is at high risk of cerebral infarction.
Text data (T-data): uses the patient's unstructured text data to predict
whether the patient is at high risk of cerebral infarction.
Structured and text data (S&T-data): fuses the structured data (S-data) and
the unstructured text data (T-data) to predict whether the patient is at high
risk of cerebral infarction.
System Architecture
Fig 1: System Architecture
4. Hospital Data
The hospital dataset used in this study contains real-life hospital data,
which are stored in the data center. To protect the patients' privacy and
security, we created a security access mechanism. The data provided by the
hospital include EHR, medical image data, and gene data. We use a three-year
dataset from 2013 to 2015. Our data focus on inpatient department data, which
include 31919 hospitalized patients with 20320848 records in total. The
inpatient department data mainly consist of structured data and unstructured
text data. The structured data encompass laboratory data and the patient's
basic information, such as age, gender, and life habits. The unstructured text
data comprise the patient's account of his/her illness, the doctor's diagnosis
and inquiry records, etc. A summary of common EHR data is shown in Table 1.
Table 1: Summary of Common EHR Data
             | ICD | CPT | Lab | Medication | Clinical notes
Availability | High | High | High | Medium | Medium
Recall       | Medium | Poor | Medium | Inpatient: High; Outpatient: Variable | Medium
Precision    | Medium | High | High | Inpatient: High; Outpatient: Variable | Medium-high
Format       | Structured | Structured | Mostly structured | Structured and unstructured | Unstructured
Pros         | Easy to work with, a good approximation of disease status | Easy to work with, high precision | High data validity | High data validity | More details about doctors' thoughts
Cons         | Disease code often used for screening, therefore disease might not be there | Missing data | Data normalization and ranges | Prescribed, not necessarily taken | Difficult to process
5. Proposed Methodology
Diagnosing disease by considering the features that have the most impact on
recognition is one of the interesting and important subjects among researchers
in medicine and computer science. This subject has given rise to a new concept
called Medical Data Mining (MDM). Data mining methods use techniques such as
classification and clustering to categorize diseases and their symptoms, which
helps diagnosis, as shown in Fig 2.
A disease diagnosis system is designed to predict different diseases such as
diabetes, as well as kidney and liver diseases. The system's workflow is
described below:
Step 1: Using the proposed application, the user (doctor, patient, physician,
etc.) inputs the attribute values of the disease and sends them to the
decision support system for analysis.
Step 2: At the decision support system, datasets of different diseases are
loaded and data mining algorithms are applied to train on them. User input
requests are collected and processed on the server to estimate the diagnosis
result.
Step 3: Healthcare data is processed using the steps of data mining, such as
data preprocessing, replacement of missing values, feature selection, machine
learning, and decision making on the trained dataset. At the decision support
system end, different classification algorithms are executed on the training
dataset and then classify the test dataset.
Step 4: Proposed algorithms such as Support Vector Machine and Random Forest
are used to give a cluster hierarchy for different subspaces. The voting model
ensembles all these results and outputs the final classification result.
Finally, the predicted results are compared with the true labels of the
testing phase.
Fig 2: Test Model Learn Model
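The voting step in Step 4 can be sketched as follows. The paper does not
state which voting rule is used, so plain majority voting is assumed here, and
the class name, the label strings, and the tie-breaking rule are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a majority-vote ensemble over classifier outputs.
class MajorityVote {
    // Return the label predicted by the most classifiers.
    // Assumption: on a tie, the label that reached the top count first wins.
    static String vote(List<String> predictions) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : predictions) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = label;
            }
        }
        return best; // null when no predictions were supplied
    }
}
```

Each element of the input list would be the output of one member classifier
(for example SVM, Random Forest, or KNN) for the same patient.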
Naive Bayes: It is a classification technique based on Bayes' Theorem. A Naive
Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about 3 inches in diameter.
Even if these features depend on each other or upon the existence of the other
features, the classifier treats all of these properties as contributing
independently to the probability that this fruit is an apple.
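The independence assumption means that a class score is simply the class prior
multiplied by one conditional probability per feature. A minimal sketch for
the fruit example follows; the class name and all probability values are
invented for illustration and do not come from the paper.

```java
// Hypothetical sketch: unnormalized Naive Bayes score for one class.
class NaiveBayesScore {
    // score = P(class) * product over features of P(feature_i | class)
    static double score(double prior, double[] likelihoods) {
        double total = prior;
        for (double p : likelihoods) {
            total *= p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Invented probabilities for the features: red, round, about 3 inches.
        double apple = score(0.5, new double[] {0.8, 0.9, 0.7});
        double other = score(0.5, new double[] {0.3, 0.4, 0.2});
        // The class with the larger score is the predicted class.
        System.out.println(apple > other ? "apple" : "other");
    }
}
```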
K-Nearest Neighbor (KNN): It compares each value with its neighboring
(nearest) values. It is a non-parametric method used for classification. Here
we apply classification techniques to the test data set and categorize the
data into different departments.
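A minimal KNN sketch is given below. The paper does not name the distance
measure at this point, so Euclidean distance is assumed; the class name, the
method signatures, and the tie-breaking rule are likewise hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of k-nearest-neighbor classification.
class KnnClassifier {
    // Euclidean distance between two feature vectors of equal length (assumed measure).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Predict the label of `query` by majority vote among its k nearest training points.
    static String predict(double[][] train, String[] labels, double[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort training indices by their distance to the query point.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> distance(train[i], query)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (int j = 0; j < Math.min(k, order.length); j++) {
            String label = labels[order[j]];
            int c = votes.merge(label, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = label;
            }
        }
        return best;
    }
}
```

Because the method stores the training set and defers all work to prediction
time, it has no parameters to fit, which is why it is called non-parametric.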
Decision Tree: It is a tree structure that includes a root node, branches, and
leaf nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label.
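The structure just described maps directly onto a small recursive data type.
The sketch below assumes binary tests of the form "feature value <= threshold",
which the paper does not specify; the class name, fields, and example labels
are hypothetical.

```java
// Hypothetical sketch of a binary decision tree with numeric threshold tests.
class DecisionNode {
    final int featureIndex;   // attribute tested at an internal node
    final double threshold;   // values <= threshold follow the left branch
    final DecisionNode left;  // branch for the outcome "test is true"
    final DecisionNode right; // branch for the outcome "test is false"
    final String label;       // class label; non-null only at a leaf

    // Internal node: a test on one attribute with two outcome branches.
    DecisionNode(int featureIndex, double threshold, DecisionNode left, DecisionNode right) {
        this.featureIndex = featureIndex;
        this.threshold = threshold;
        this.left = left;
        this.right = right;
        this.label = null;
    }

    // Leaf node: holds a class label and performs no test.
    DecisionNode(String label) {
        this.featureIndex = -1;
        this.threshold = 0.0;
        this.left = null;
        this.right = null;
        this.label = label;
    }

    // Walk from this node down to a leaf and return the leaf's label.
    String classify(double[] features) {
        DecisionNode node = this;
        while (node.label == null) {
            node = features[node.featureIndex] <= node.threshold ? node.left : node.right;
        }
        return node.label;
    }
}
```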
Test Dataset and Training Dataset
Separating data into training and test datasets is an important part of
evaluating data mining models. By separating the total data set into two data
sets, we can minimize the effects of data inconsistency and better understand
the characteristics of the model. The training data set is used to build the
model, and the test data set is held out to evaluate its predictions. We
select 706 patients in total as the experiment data, and the data is randomly
divided into training data and test data. The ratio between the training set
and the test set is about 6:1, i.e., 606 patients form the training data set
and 100 patients form the test data set. The C++ language is used to implement
the machine learning and deep learning algorithms, which are run in parallel
in the data center. In this paper, for S-data, we extract the patient's
demographic characteristics, cerebral infarction characteristics, and living
habits (such as smoking), selected according to discussions with doctors and
Pearson's correlation. We thus obtain a total of 79 features per patient. For
T-data, we first extract 815073 words to learn the word embedding. Then we use
CNN for independent feature extraction.
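The random 606/100 division can be sketched as a shuffle followed by a cut.
This is only an illustration: the class name, the use of integer patient ids,
and the fixed random seed are assumptions, and the paper does not describe how
its own split was drawn.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a random train/test split over patient ids.
class TrainTestSplit {
    // Shuffle a copy of the ids and return [training ids, test ids];
    // the last `testSize` ids of the shuffled copy form the test set.
    static List<List<Integer>> split(List<Integer> ids, int testSize, long seed) {
        List<Integer> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, new Random(seed)); // seed fixed for repeatability
        int cut = shuffled.size() - testSize;
        List<Integer> train = new ArrayList<>(shuffled.subList(0, cut));
        List<Integer> test = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return List.of(train, test);
    }
}
```

With 706 ids and a test size of 100, the first list holds 606 ids and the
second holds 100, and no id appears in both.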
Algorithm to Calculate Accuracy, Precision, Recall, and F1-Measure
Step 1: Initialize the attributes
String[] attribute = {"", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size",
    "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses"};
Step 2: Read the data from the file
BufferedReader bufferedReader = new BufferedReader(
    new FileReader("/home/user/Desktop/breast-cancer.csv"));
Step 3: Repeat the loop until no data remains
String data;
while ((data = bufferedReader.readLine()) != null) {
    String[] split = data.split(",");
    double total = 1.0;
    for (int i = 1; i < split.length; i++) {
        // Re-read the MapReduce output for every attribute of the current row
        BufferedReader bufferedReader2 = new BufferedReader(
            new FileReader("/home/user/Desktop/output13/part-r-00000"));
        String data1;
        // The original listing omits this inner read loop; a tab between the
        // key and the value of each output line is assumed here
        while ((data1 = bufferedReader2.readLine()) != null) {
            String[] splite = data1.split("\t");
            // Match the attribute name, then the attribute value
            if (attribute[i].toLowerCase().equals(
                    splite[0].split(" ")[0].replaceAll("_", " ").toLowerCase())) {
                if (splite[0].split(" ")[2].equals(split[i])) {
                    total *= Double.parseDouble(splite[1]);
                }
            }
        }
        bufferedReader2.close();
    }
}
bufferedReader.close();
Step 4: Write the data into the file
fileWriter.write(data + "," + "2\n");
Step 5: Calculate the summary
// Cast to double so that integer counts are not truncated by integer division
Accuracy = (double) (tp + tn) / (tp + tn + fp + fn);
precision = (double) tp / (tp + fp);
Recall = (double) tp / (tp + fn);
f1_measure = (2 * precision * Recall) / (precision + Recall);
FileWriter fileWriter2 = new FileWriter("/var/www/html/graph/data.txt");
fileWriter2.write("Accuracy\t" + Accuracy + "\n");
fileWriter2.write("precision\t" + precision + "\n");
fileWriter2.write("Recall\t" + Recall + "\n");
fileWriter2.write("f1_measure\t" + f1_measure);
fileWriter2.close();
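As a self-contained illustration of Step 5, the four measures can be computed
from the confusion-matrix counts as below. The class name is hypothetical, and
the counts in the usage note are invented for the example rather than taken
from the experiment.

```java
// Hypothetical sketch: evaluation measures from confusion-matrix counts.
class ClassificationMetrics {
    // Fraction of all predictions that are correct.
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    // Fraction of predicted positives that are truly positive.
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    // Fraction of true positives that were found.
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }
}
```

For example, with the invented counts tp = 40, tn = 45, fp = 5, and fn = 10,
the accuracy is 85/100 = 0.85, the precision is 40/45, the recall is
40/50 = 0.8, and the F1-measure is 80/95.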
Methods Used
In this paper, we introduce data imputation and the CNN-based multimodal
disease risk prediction (CNN-MDRP) algorithm.
Data imputation
Because of human error, a large amount of the structured data is missing, so
we need to fill it in. Before data imputation, we first identify incomplete
medical data and then update or delete them to improve the quality of the
data. Next, we use data integration for data pre-processing. We can integrate
the medical data for data atomicity: e.g., height and weight are integrated to
obtain the body mass index (BMI). We use the latent factor model for data
imputation; it explains the observable variables in terms of latent variables.
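The height-and-weight integration step amounts to one formula, BMI = weight /
height^2, with weight in kilograms and height in meters. The sketch below
assumes those units, which the paper does not state; the class name is
hypothetical, and the latent factor imputation itself is not shown.

```java
// Hypothetical sketch: derive one integrated feature (BMI) from two raw fields.
class BmiFeature {
    // BMI = weight in kilograms divided by the square of height in meters.
    static double bmi(double weightKg, double heightM) {
        return weightKg / (heightM * heightM);
    }
}
```

For a patient weighing 70 kg with a height of 1.75 m, this gives
70 / 3.0625, or about 22.86.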
CNN-Based Multimodal Disease Risk Prediction (CNN-MDRP) Algorithm
CNN-UDRP uses only the text data to predict whether the patient is at high
risk of cerebral infarction. For structured and unstructured text data
together, we design the CNN-MDRP algorithm based on CNN-UDRP. The processing
of text data is similar to that in CNN-UDRP: it extracts 100 features from the
text data set. We extract 79 features from the structured data. Then, we
perform feature-level fusion using the 100 features from the T-data and the 79
features from the S-data.
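One simple way to realize feature-level fusion is to concatenate the two
feature vectors into a single 179-dimensional input. The paper does not say
how the fusion is performed, so concatenation is an assumption here, as are
the class and method names.

```java
// Hypothetical sketch: feature-level fusion by concatenating two feature vectors.
class FeatureFusion {
    // Place the text features first, followed by the structured features.
    static double[] fuse(double[] textFeatures, double[] structuredFeatures) {
        double[] fused = new double[textFeatures.length + structuredFeatures.length];
        System.arraycopy(textFeatures, 0, fused, 0, textFeatures.length);
        System.arraycopy(structuredFeatures, 0, fused, textFeatures.length,
                structuredFeatures.length);
        return fused;
    }
}
```

With 100 T-data features and 79 S-data features, the fused vector has 179
entries, which would then feed the final classification layer.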
6. Experimental Results
We perform the execution in a virtual environment.
First, we use the command below to start all the scripts (the Hadoop daemons)
for our command-line environment.
$ start-all.sh
To create a directory, we use the command
$ hadoop fs -mkdir /user/info
Here info is the name of the directory to be created.
To view the created directory
$ hadoop fs -ls /user/
To upload the input file to HDFS (Hadoop Distributed File System)
The input files (datasets) used here are
breast-cancer.csv
breast-cancer1.csv
$ hadoop jar dise.jar disease.driver /user/breast-cancer.csv /user/xyz
To change to the Desktop directory
$ cd Desktop
To list the items on the desktop, so that we can select a jar file to run the
main command (Java functionality is run in Hadoop through a jar file)
$ ls
Main command
$ hadoop jar abc.jar comm.mapred /user/info /user/x
Here “abc.jar” is the jar file, “comm.mapred” is the class to be called,
“/user/info” is the input directory, and “x” is the output directory.
To copy the output to the desktop
$ hadoop fs -get /user/x
To view the output on the user node
$ hadoop fs -cat /user/x/part-r-00000
Fig 3: Screenshot shows the command to start all scripts related to our
command-line environment.
Fig 4: Screenshot shows the list of items on the desktop, so that we can
select a jar file to run the main command.
Fig 5: Screenshot shows “abc.jar”, the jar file, “comm.mapred”, the class to
be called, and “x”, the output directory.
Fig 6: Screenshot shows the number of splits and tokens used to accomplish the job.
Fig 7: Screenshot shows the job counters and MapReduce framework data.
Fig 8: Screenshot shows the final result in the user-node environment.
Fig 9: Screenshot shows the output file in the desktop environment.
Fig 10: Screenshot shows the final output in the desktop environment.
From Fig 3 to Fig 10, we observe that the training dataset is uploaded to the
HDFS environment. A job (process/thread) is created and split in order to
perform MapReduce.
The KNN classification used alongside the CNN-MDRP algorithm takes a training
set and finds the k nearest instances in it. For this algorithm, it is
necessary to choose the distance measure and the value of k. In this
experiment, the data is first normalized, and a simple distance measure is
then used. From the selection of the parameter k, we find that the model
performs best when k = 10. The results are compared using a different type of
decision tree algorithm as the base classifier. The algorithmic approach
determines the computational efficiency, the data representation, and the
quality of the resulting program.
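The normalization step that precedes the distance computation can be sketched
as below. The paper does not say which normalization is used; min-max scaling
to [0, 1] is assumed here, and the class name and the handling of constant
columns are hypothetical choices.

```java
// Hypothetical sketch: min-max normalization of one feature column.
class MinMaxNormalizer {
    // Rescale the values to [0, 1]; a constant column maps to all zeros.
    static double[] normalize(double[] column) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = range == 0.0 ? 0.0 : (column[i] - min) / range;
        }
        return out;
    }
}
```

Scaling every feature to the same range keeps a feature with large raw values,
such as a laboratory measurement, from dominating the distance used by KNN.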
Figure 11: Overall comparison of the existing and proposed methods
The existing CNN-MDRP method does not give exact results, whereas our proposed
KNN-based machine learning algorithm gives the best result. Hence, the
proposed method offers an easy way to analyze and obtain the data, as shown in
Fig 11.
7. Conclusion
In this paper, the advantages and disadvantages of several machine learning
methods in the biomedical literature are discussed. KNN performs best for
two-class classification tasks and for multi-class classification problems. To
improve the accuracy and efficiency of the system, a multiple classifier
system is proposed. The member classifiers used in the multi-classifier system
should have independent errors and perform better than a minimum level. That
is, the member classifiers should maintain a minimum degree of disagreement
for the method to be successful. By averaging the results of a large number of
such classifiers, the decision boundary can be approximated with some
accuracy. Hence, the uncorrelated errors of individual classifiers can be
reduced by averaging.
References
[1] P. Groves, B. Kayyali, D. Knott, and S. V. Kuiken, “The big data revolution in healthcare: Accelerating value and innovation.”
[2] P. B. Jensen, L. J. Jensen, and S. Brunak, “Mining electronic health records: towards better research applications and clinical care.”
[3] M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, C. Youn, “Wearable 2.0: Enable Human-Cloud Integration in Next Generation Healthcare System,” IEEE Communications, vol. 55, no. 1, pp. 54–61, Jan. 2017.
[4] M. Chen, Y. Ma, J. Song, C. Lai, B. Hu, “Smart Clothing: Connecting Human with Clouds and Big Data for Sustainable Health Monitoring,” ACM/Springer Mobile Networks and Applications, vol. 21, no. 5, pp. 825–845, 2016.
[5] J. Wang, M. Qiu, and B. Guo, “Enabling real-time information service on telehealth system over cloud-based big data platform,” Journal of Systems Architecture, vol. 72, pp. 69–79, 2017.
[6] D. W. Bates, S. Saria, L. Ohno-Machado, A. Shah, and G. Escobar, “Big data in health care: using analytics to identify and manage high-risk and high-cost patients,” Health Affairs, vol. 33, no. 7, pp. 1123–1131, 2014.
[7] Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri, “Healthcps: Healthcare cyber-physical system assisted by cloud and big data.”
[8] K. Lin, J. Luo, L. Hu, M. S. Hossain, and A. Ghoneim, “Localization based on social big data analysis in the vehicular networks,” IEEE Transactions on Industrial Informatics, 2016.
[9] D. Oliver, F. Daly, F. C. Martin, and M. E. McMurdo, “Risk factors and risk assessment tools for falls in hospital in-patients: a systematic review,” Age and Ageing, vol. 33, no. 2, pp. 122–130, 2004.
[10] B. Qian, X. Wang, N. Cao, H. Li, and Y.-G. Jiang, “A relative similarity based method for interactive patient risk prediction,” Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 1070–1093, 2015.
[11] S. Zhai, K.-h. Chang, R. Zhang, and Z. M. Zhang, “Deep intent: Learning attentions for online advertising with recurrent neural networks.”