
A Generative Framework for Prediction and Informative Risk Factor Selection of Bone Diseases

Hui Li∗, Xiaoyi Li∗, Yuan Zhang†, Murali Ramanathan‡, Aidong Zhang∗

∗Department of Computer Science and Engineering, State University of New York at Buffalo, USA
{hli24,xiaoyili,azhang}@buffalo.edu

†College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China
{zhangyuan}@emails.bjut.edu.cn

‡Department of Pharmaceutical Sciences, State University of New York at Buffalo, USA
{murali}@buffalo.edu

Abstract: With the rapid development of the healthcare industry, overwhelming amounts of electronic health records (EHRs) have been documented and shared by healthcare institutions and practitioners. It is important to take advantage of EHR data to develop an effective disease risk management model that not only predicts the progression of the disease, but also provides a candidate list of informative risk factors (RFs) in order to prevent the disease. Although EHRs are valuable sources because of their comprehensive patient information, it is difficult to pinpoint the underlying causes of a disease in order to assess a patient's risk of developing it. Because EHR data are entangled, it is also challenging to discriminate between patients with and without the disease for the purpose of selecting the RFs that cause the disease. To tackle these challenges, we propose a disease memory (DM) framework which extracts integrated features by modeling the relationships among RFs and, more importantly, between RFs and the target disease, by establishing a deep graphical model with two types of labels. Variants of DM model the characteristics of patients with and without the disease, respectively, by training deep networks on different samples. Experiments on a real bone disease data set show that the proposed framework can successfully predict bone disease and select informative RFs that are useful for clinical decision support. Most of the selected RFs are validated by the medical literature, and some new RFs may attract interest in medical research. The stable and promising performance on the evaluation metrics confirms the effectiveness of our model.

I. INTRODUCTION

The Electronic Health Record (EHR) is a longitudinal electronic record of patient health information, including diverse information such as demographics, medications, past medical history, laboratory data, and lifestyle. EHRs are valuable sources for exploratory analysis and statistics to assist clinical decision-making and further medical research. Researchers have been converting EHR data into risk factors (RFs) for disease risk analysis, which includes two crucial tasks: disease risk prediction and informative RF selection. When both tasks succeed, patients can avoid unnecessary tests, the cost of public health care can be reduced, and patients can change their modifiable RFs for disease control or prevention. Usually, numerous potential RFs need to be considered simultaneously, since the observed and hidden reasons behind all RFs are worth learning for the exploration of disease progression. However, it is extremely challenging to capture the disease characteristics and clinical nuances needed for predicting disease progression and detecting the informative RFs, because of the complexity and diversity of EHR data. The difficulties arise in several respects. First, it is hard to find a good RF representation so that the salient integrated features can be disentangled from heterogeneous information. Second, it is difficult to discriminate the different roles of independent features for both healthy and diseased patients.

Osteoporosis and bone fractures are common bone diseases associated with aging. They may be clinically silent but can cause significant mortality and morbidity after onset. Over the past few decades, osteoporosis has been recognized as a common bone disease that affects more than 75 million people in the United States, Europe and Japan, and it causes more than 8.9 million fractures annually worldwide [1]. It is reported that 20-25% of people with a hip fracture are unable to return to independent living and 12-20% die within one year. Although the diagnosis of osteoporosis is usually based on the assessment of bone mineral density (BMD) using dual-energy X-ray absorptiometry (DXA), the World Health Organization (WHO) embarked on a project to integrate information on RFs to better predict the risk of bone disease in men and women worldwide [2]. In this paper, we propose a novel approach for the study of bone diseases in two aspects: bone disease prediction and the selection of disease RFs according to their significance.

Existing models usually fall into two categories: expert-knowledge-based models and handcrafted-feature-set-based models. The first kind relies mainly on a small number of well-known RFs that have been validated by experts in the field, as in [3]. However, the information based on expert knowledge is limited, so some important features might be discarded, which affects the predictive performance. The second kind tries to find the informative RFs by calculating their statistical significance and then measuring their predictive power. The relationship between a disease and a handcrafted RF is assessed using a regression model [4], an Artificial Neural Network (ANN) [5], or association rules and decision trees [6]. Although these models are theoretically acceptable for analyzing the risk dependence of several variables, they pay little attention to the relationships among RFs and between RFs and the target disease. Furthermore, they usually select statistically significant features from an expert-supported candidate list,

2013 IEEE International Conference on Bioinformatics and Biomedicine

978-1-4799-1310-7/13/$31.00 ©2013 IEEE

[Figure 1: framework diagram. Upper path: Original Dataset (672 RFs), CDM, Integrated Risk Features (11 RFs), Phase 1, Phase 2. Lower path: Disease Samples and Non-Disease Samples, training process for BDM and NDM, Candidate Informative RFs, validated against Medical Knowledge.]

Fig. 1: Overview of our framework for bone health.

which means that useful information is still lost if the list is not comprehensive. Recently, mining the causal relationships between RFs and a specific disease has attracted considerable research attention. In [7], a limited number of RFs are used to construct a Bayesian network, and the RFs are assumed to be conditionally independent of one another. However, learning Bayesian networks becomes difficult, and even infeasible, as the number of RFs increases.

II. PROBLEM DEFINITION

In this section, we define our problem by describing the pipeline of the whole framework. Our proposed system is a two-task framework, as shown in Fig. 1. The upper component of Fig. 1 shows the roadmap for the first task: bone disease prediction based on integrated RFs. The bottom component of Fig. 1 shows the roadmap for the second task: informative RF selection. Given patients' information, our system can not only predict the risk of osteoporosis and bone fractures, but also rank the informative RFs and explain the semantics of each RF. Each component is described below.

Task 1 – The Bone Disease Prediction Component. In this component, we feed the original data set to the comprehensive disease memory (CDM). The training of CDM includes two procedures: pre-training and fine-tuning. In the pre-training step, we train CDM in an unsupervised fashion. Pre-training aims at capturing the characteristics among all RFs, with the ultimate goal of guiding the learning towards basins of optima that support better generalization. In the fine-tuning step, we take advantage of two types of labeled information (osteoporosis and bone loss rate) in order to focus on these two prediction tasks. We use a greedy layer-wise learning algorithm to train a two-layer Deep Belief Network (DBN), which is the structure of CDM. In addition, all RFs in the original data are projected onto a new space of lower dimensionality by restricting the number of units in the output layer of the DBN. The integrated risk features are thus extracted by the CDM module from the original data set. These lower-dimensional integrated risk features are a new representation of the original higher-dimensional RFs and are examined by a two-phase prediction module. The prediction module consists of two classifiers, Logistic Regression (LR) and Support Vector Machine (SVM). In Phase 1, we predict the risk of osteoporosis for all test samples. We regard osteoporotic bone as the positive output and normal bone as the negative output, because osteoporotic patients tend to have more severe bone fractures. In Phase 2, we further predict the risk of a high bone loss rate for all positive samples from Phase 1. A high bone loss rate, as the positive output, indicates a higher likelihood of bone fractures.
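The two-phase prediction described above can be illustrated with a minimal sketch. The function name `two_phase_predict`, the toy classifiers, and their thresholds are hypothetical stand-ins for the trained LR or SVM models; only the cascade logic (Phase 2 runs on Phase 1 positives) is taken from the text.

```python
from typing import Callable, List, Optional, Sequence

Features = Sequence[float]
Classifier = Callable[[Features], bool]  # True means "positive"


def two_phase_predict(
    samples: List[Features],
    phase1: Classifier,
    phase2: Classifier,
) -> List[Optional[bool]]:
    """Return one entry per sample.

    None  -> Phase 1 negative (normal bone), Phase 2 is not run
    False -> Phase 1 positive, low bone loss rate
    True  -> Phase 1 positive, high bone loss rate
    """
    results: List[Optional[bool]] = []
    for x in samples:
        if not phase1(x):
            results.append(None)
        else:
            results.append(phase2(x))
    return results


# Toy stand-ins for the trained classifiers (hypothetical thresholds).
def toy_phase1(x: Features) -> bool:
    return x[0] > 0.5


def toy_phase2(x: Features) -> bool:
    return x[1] > 0.5


toy_features = [[0.9, 0.8], [0.9, 0.1], [0.2, 0.9]]
toy_output = two_phase_predict(toy_features, toy_phase1, toy_phase2)
```

In this toy run the third sample is filtered out in Phase 1, so its second feature is never examined.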

Task 2 – The Informative RF Selection Component. Since we are not able to explain the semantics of the integrated RFs extracted by the first component, we need to select the meaningful and significant RFs from all candidates in the second component. Instead of using all samples in the training procedure, we first split the original data set into two parts: diseased samples and non-diseased samples. During training, we separately train the bone disease memory (BDM) using diseased samples and the non-disease memory (NDM) using non-diseased samples, shown as dashed arrows in the bottom component of Fig. 1. Once training is complete, each memory is used to reconstruct data from the contrast group of samples. A two-layer DBN, which is the structure of both NDM and BDM, is able to reconstruct samples. However, it yields large reconstruction errors if we use BDM to reconstruct non-diseased samples, because of the mismatch between the input data and the memory module. These contrasts are valuable information for explaining why a non-diseased person may develop the disease. Similarly, the differences are obvious when reconstructing diseased samples using NDM. All RFs cumulatively lead to the reconstruction errors. Our ultimate goal is to find the top-N individual RFs which contribute most to the reconstruction errors. The top-N selected RFs form a candidate informative RF list that is validated against medical knowledge, such as medical reports from WHO and the National Osteoporosis Foundation (NOF), as well as biomedical literature from PubMed.

III. METHODOLOGY

In this section, we first introduce the single-layer and multi-layer learning approaches which are preliminaries to our proposed method. We then propose our model, focusing on prediction and informative RF selection for bone diseases.

A. Single-Layer Learning for the Latent Reasons

To obtain a good RF representation of the latent reasons behind the data, we propose to use the Restricted Boltzmann Machine (RBM) [8]. An RBM is a generative stochastic graphical model that can learn a probability distribution over its set of inputs, with the restriction that its visible units and hidden units must form a fully connected bipartite graph. Specifically, it has a single layer of hidden units that are not connected to each other and have undirected, symmetrical connections to a layer of visible units. We show a shallow RBM in Fig. 2(a). The model defines the following energy function $E : \{0, 1\}^{D+F} \to \mathbb{R}$:

$$E(v, h; \theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j, \qquad (1)$$

where $\theta = \{a, b, W\}$ are the model parameters, and $D$ and $F$ are the numbers of visible units and hidden units. The joint distribution over the visible and hidden units is defined by:

$$P(v, h; \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta)), \qquad (2)$$

where $Z(\theta)$ is the partition function, which plays the role of a normalizing constant for the distribution defined by the energy function.
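To make Eqs. (1) and (2) concrete, the following sketch evaluates the energy and the joint probability of a tiny binary RBM by brute-force enumeration. The parameter values are arbitrary toy numbers, not values from our model, and the enumeration is only feasible for very small D and F, which is why exact learning is not used in practice.

```python
import itertools
import math


def energy(v, h, W, b, a):
    """E(v, h) = -sum_ij v_i W_ij h_j - sum_i b_i v_i - sum_j a_j h_j."""
    interaction = sum(
        v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    visible_bias = sum(b[i] * v[i] for i in range(len(v)))
    hidden_bias = sum(a[j] * h[j] for j in range(len(h)))
    return -interaction - visible_bias - hidden_bias


def partition_function(W, b, a):
    """Z = sum over all binary (v, h) of exp(-E(v, h)); exponential cost."""
    D, F = len(b), len(a)
    return sum(
        math.exp(-energy(v, h, W, b, a))
        for v in itertools.product([0, 1], repeat=D)
        for h in itertools.product([0, 1], repeat=F)
    )


def joint_probability(v, h, W, b, a):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h, W, b, a)) / partition_function(W, b, a)


# Toy parameters: D = 2 visible units, F = 2 hidden units.
W = [[0.5, -0.3], [0.2, 0.1]]
b = [0.1, -0.2]
a = [0.0, 0.3]

# Summing P(v, h) over every configuration should give 1.
total = sum(
    joint_probability(v, h, W, b, a)
    for v in itertools.product([0, 1], repeat=2)
    for h in itertools.product([0, 1], repeat=2)
)
```

The sum over all 16 configurations equals one up to floating-point error, which is exactly the normalizing role of Z.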

[Figure 2: (a) shallow RBM; (b) 2-layer DBN.]

Fig. 2: (a) Shallow Restricted Boltzmann Machine, which contains a layer of visible units v that represent the data and a layer of hidden units h that learn to represent features that capture higher-order correlations in the data. The two layers are connected by a matrix of symmetrically weighted connections, W, and there are no connections within a layer. (b) A 2-layer DBN in which the top two layers form an RBM and the bottom layer forms a multi-layer perceptron. It contains a layer of visible units v and two layers of hidden units h1 and h2.

Exact maximum likelihood learning is intractable in an RBM. In practice, efficient learning is performed using Contrastive Divergence (CD) [9]. In addition, to encourage sparse hidden representations, each hidden unit activation is penalized in the form $\sum_{j=1}^{F} \mathrm{KL}(\rho \,|\, h_j)$, where $F$ is the total number of hidden units, $h_j$ is the activation of unit $j$, and $\rho$ is a predefined sparsity parameter, typically a small value close to zero (we use 0.05 in our model). The overall cost of the sparse RBM used in our model is therefore:

$$E(v, h; \theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j + \beta \sum_{j=1}^{F} \mathrm{KL}(\rho \,|\, h_j) + \lambda \|W\|, \qquad (3)$$

where $\|W\|$ is the regularizer, and $\beta$ and $\lambda$ are hyper-parameters (footnote 1).

The advantage of the RBM is that it provides an expressive representation of the input risk factors. Each hidden unit in the RBM is able to encode at least one high-order interaction among the input variables. Given a specific number of latent reasons in the input, an RBM requires fewer hidden units to represent the problem complexity. Under this scenario, RFs can be analyzed by an RBM model with the efficient CD learning algorithm. In this paper, we use the RBM for unsupervised greedy layer-wise pre-training. Specifically, each sample describes a state of the visible units in the model. The goal of learning is to minimize the overall energy so that the data distribution is better captured by the single-layer model.
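As an illustration of CD learning for a binary RBM, the sketch below performs CD-1 updates on single samples in plain Python. It is a minimal sketch under stated assumptions, not our implementation: the learning rate, layer sizes, and toy data are arbitrary, the weight regularizer is omitted, and `kl_sparsity` assumes the common Bernoulli form of the KL penalty, which the text does not spell out. The penalty is only evaluated here; its gradient is not added to the update.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, a):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [
        sigmoid(a[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(a))
    ]


def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit i."""
    return [
        sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
        for i in range(len(b))
    ]


def sample_binary(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_step(v0, W, b, a, lr, rng):
    """One CD-1 update on a single binary sample v0 (updates in place)."""
    ph0 = hidden_probs(v0, W, a)      # positive phase
    h0 = sample_binary(ph0, rng)
    pv1 = visible_probs(h0, W, b)     # one Gibbs step: reconstruction
    ph1 = hidden_probs(pv1, W, a)     # negative phase
    for i in range(len(b)):
        for j in range(len(a)):
            W[i][j] += lr * (v0[i] * ph0[j] - pv1[i] * ph1[j])
        b[i] += lr * (v0[i] - pv1[i])
    for j in range(len(a)):
        a[j] += lr * (ph0[j] - ph1[j])
    return pv1


def kl_sparsity(rho, mean_activation):
    """KL between Bernoulli(rho) and Bernoulli(mean_activation) (assumed form)."""
    q = min(max(mean_activation, 1e-8), 1.0 - 1e-8)
    return rho * math.log(rho / q) + (1 - rho) * math.log((1 - rho) / (1 - q))


rng = random.Random(0)
D, F = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(F)] for _ in range(D)]
b = [0.0] * D
a = [0.0] * F
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(200):
    for v in data:
        cd1_step(v, W, b, a, lr=0.1, rng=rng)

recon = visible_probs(hidden_probs(data[0], W, a), W, b)
penalty = kl_sparsity(0.05, 0.5)
```

The penalty is zero when a unit's mean activation equals ρ and grows as the activation moves away from it, which is how the term pushes hidden units towards sparse activity.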

B. Multi-Layer Learning for Mining Abstractive Reasons

The new representations learned by a shallow RBM (one-layer RBM) can model some directed hidden causalities behind the RFs. But there are more abstractive reasons behind them (i.e. the reasons of the reasons). To sufficiently model reasons at different levels of abstraction, we can stack more layers onto the shallow RBM to form a deep graphical model, namely a DBN [10]. A DBN is a probabilistic generative model that is composed of multiple layers of stochastic latent variables. The latent variables typically have binary values and are often called hidden units. The top two layers form an RBM which can be viewed as an associative memory. The lower layer forms a multi-layer perceptron (MLP) [11] which receives top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.

We show a two-layer DBN in Fig. 2(b), in which pre-training follows a greedy layer-wise training procedure. Specifically, one layer is added on top of the network at each step, and only that top layer is trained as an RBM using the CD strategy [9]. After each RBM has been trained, its weights are clamped, a new layer is added, and the procedure is repeated. After pre-training, the values of the latent variables in every layer can be inferred by a single bottom-up pass that starts with an observed data vector in the bottom layer and uses the generative weights in the reverse direction. The top layer of the DBN forms a compressed manifold of the input data, in which each unit has a distinct weighted non-linear relationship with all of the input factors.

Footnote 1: We tried different settings for both β and λ and found that our model is not very sensitive to these parameters. We fixed β to 0.1 and λ to 0.0001.

[Figure 3: Whole Data Set, CDM (two-layer DBN), Classifiers, with outputs Osteoporosis Prediction and Bone Loss Rate Prediction.]

Fig. 3: Bone disease prediction using a two-layer DBN model.
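The greedy layer-wise procedure and the single bottom-up pass can be sketched as follows. This is a simplified illustration, not our implementation: each RBM is trained with a mean-field variant of CD-1 (probabilities are propagated instead of sampled states), and the layer sizes, epochs, learning rate, and toy data are arbitrary assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def up(v, W, a):
    """Bottom-up pass through one layer: P(h_j = 1 | v)."""
    return [
        sigmoid(a[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(a))
    ]


def down(h, W, b):
    """Top-down pass through one layer: P(v_i = 1 | h)."""
    return [
        sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
        for i in range(len(b))
    ]


def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM with mean-field CD-1; returns (W, b, a)."""
    n_visible = len(data[0])
    W = [
        [rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
        for _ in range(n_visible)
    ]
    b = [0.0] * n_visible
    a = [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            h0 = up(v0, W, a)
            v1 = down(h0, W, b)
            h1 = up(v1, W, a)
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
                b[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                a[j] += lr * (h0[j] - h1[j])
    return W, b, a


def pretrain_dbn(data, layer_sizes, epochs, lr, rng):
    """Greedy layer-wise pre-training: train, clamp, feed activations upward."""
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, b, a = train_rbm(layer_input, n_hidden, epochs, lr, rng)
        layers.append((W, b, a))  # these weights are not changed again
        layer_input = [up(v, W, a) for v in layer_input]
    return layers


def encode(v, layers):
    """Single bottom-up pass from the data vector to the top layer."""
    for W, _b, a in layers:
        v = up(v, W, a)
    return v


rng = random.Random(0)
toy_data = [[1, 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1], [1, 1, 0, 0, 0, 0]]
dbn = pretrain_dbn(toy_data, layer_sizes=[4, 2], epochs=50, lr=0.1, rng=rng)
code = encode(toy_data[0], dbn)
```

Here a 6-dimensional input is compressed to a 2-dimensional code, in the same way that restricting the top layer size yields a low-dimensional representation of the input factors.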

C. Bone Disease Prediction Using CDM

Our goal now is to disentangle the salient integrated features from the complex EHR data for bone disease prediction. We define an integrated RF learning model based on the given data set for two types of bone disease prediction, osteoporosis and bone loss rate, and on the DBN introduced in the previous section. Our general idea is shown in Fig. 3, where a good RF representation for predicting osteoporosis and bone loss rate is achieved by learning a set of intermediate representations using a DBN structure at the bottom, with a regression layer (the classifiers) appended on top. This multi-learning model can capture the characteristics of both the observed input (bottom-up learning) and the labeled information (top-down learning). The internal model, which memorizes the parameters trained on the whole training data and preserves the information for both normal and abnormal patients, is termed the comprehensive disease memory (CDM). That is, the learned representation model CDM discovers good intermediate representations that can be shared across the two prediction tasks by combining knowledge from the input layer, with the original training data, and from the output layer, with two types of class labels. The training procedure for CDM focuses on two specific prediction tasks (osteoporosis and bone loss rate), with all risk factors as the input and the model parameters as the output. It includes a pre-training stage and a fine-tuning stage. In the first stage, the unsupervised pre-training stage, we apply the layer-wise CD learning procedure to put the parameter values in an appropriate range for further supervised training. It guides the learning towards basins of attraction of minima that support better risk factor generalization from the training data set. The result of pre-training thus establishes an initialization point for fine-tuning, inside a region of parameter space to which the parameters are henceforth restricted.

In the second stage, the fine-tuning (FT) stage, we take advantage of the two types of labeled information to train our model in a supervised fashion. In this way, the prediction errors for both prediction tasks are minimized. Specifically, we use the parameters from the pre-training stage to calculate the prediction result for each sample and then back-propagate the errors between the predicted result and the ground truth about osteoporosis from top to bottom to update the model parameters to a better state. Since we have another type of labeled information, we then repeat the fine-tuning stage by calculating errors between the predicted result and the other ground truth, the bone loss rate. After this two-stage training procedure, our CDM is well trained and can be used to predict bone diseases.

[Figure 4: (a) Normal data set passed through the bone disease memory (BDM) for data reconstruction. (b) Abnormal data set passed through the non-disease memory (NDM) for data reconstruction.]

Fig. 4: Informative RF selection based on (a) BDM and (b) NDM.
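The two-stage fine-tuning can be illustrated with a small sketch in which one shared hidden layer is tuned first on one label type and then on the other. Several details are assumptions made only for this illustration: a single hidden layer stands in for the pre-trained DBN, each label type gets its own logistic output unit, the loss is cross-entropy, and the data, labels, and hyper-parameters are toy values.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, W, a, u, c):
    """Shared hidden layer (W, a) followed by one logistic output (u, c)."""
    h = [
        sigmoid(a[j] + sum(x[i] * W[i][j] for i in range(len(x))))
        for j in range(len(a))
    ]
    y = sigmoid(c + sum(u[j] * h[j] for j in range(len(h))))
    return h, y


def mean_cross_entropy(data, labels, W, a, u, c):
    total = 0.0
    for x, t in zip(data, labels):
        _h, y = forward(x, W, a, u, c)
        y = min(max(y, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(data)


def fine_tune(data, labels, W, a, u, c, epochs, lr):
    """Back-propagate one label type; W, a, u are updated in place."""
    for _ in range(epochs):
        for x, t in zip(data, labels):
            h, y = forward(x, W, a, u, c)
            delta_out = y - t  # error signal at the output unit
            for j in range(len(a)):
                delta_h = delta_out * u[j] * h[j] * (1.0 - h[j])
                u[j] -= lr * delta_out * h[j]
                a[j] -= lr * delta_h
                for i in range(len(x)):
                    W[i][j] -= lr * delta_h * x[i]
            c -= lr * delta_out
    return c  # c is a float, so the updated value is returned


rng = random.Random(0)
data = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
osteoporosis = [1, 1, 0, 0]    # first label type (toy values)
high_bone_loss = [1, 0, 0, 0]  # second label type (toy values)

# Stand-in for pre-trained weights: 4 inputs, 3 hidden units.
W = [[rng.gauss(0.0, 0.1) for _ in range(3)] for _ in range(4)]
a = [0.0, 0.0, 0.0]

# Stage 1: fine-tune on the osteoporosis labels.
u1, c1 = [0.0, 0.0, 0.0], 0.0
loss1_before = mean_cross_entropy(data, osteoporosis, W, a, u1, c1)
c1 = fine_tune(data, osteoporosis, W, a, u1, c1, epochs=300, lr=0.5)
loss1_after = mean_cross_entropy(data, osteoporosis, W, a, u1, c1)

# Stage 2: repeat with the bone loss labels, starting from the tuned W, a.
u2, c2 = [0.0, 0.0, 0.0], 0.0
loss2_before = mean_cross_entropy(data, high_bone_loss, W, a, u2, c2)
c2 = fine_tune(data, high_bone_loss, W, a, u2, c2, epochs=300, lr=0.5)
loss2_after = mean_cross_entropy(data, high_bone_loss, W, a, u2, c2)
```

Because the hidden weights are shared, the second stage starts from features already shaped by the first label type, which is the sense in which the representation serves both prediction tasks.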

D. Informative Risk Factor Selection Using BDM and NDM

In the previous section, we used CDM to model diseased and healthy patients together and established a comprehensive disease memory which captures the salience of all RFs in a limited number of integrated RFs for predicting osteoporosis and bone loss rate. In this section, we model the diseased patients and healthy patients separately, based on their unique characteristics, and identify the RFs that cause the disease (osteoporosis). We first define a pair of disease memory models with a contrast pattern (diseased patients vs. non-diseased patients). We term the bone disease memory (BDM) model a type of DM model which is trained on the diseased samples, so it only memorizes the characteristics of those patients who suffer from osteoporosis or have a high bone loss rate. BDM differs from CDM in that it is a disease-targeted model that captures possible latent reasons for those abnormal patients. Given an abnormal sample, our goal is to represent the latent reasons leading to his or her disease. The top block of Fig. 4(a) shows a hierarchical latent structure underlying the observed RFs, which is trained using the abnormal samples. To find informative RFs, we apply this model with the normal samples as the input data and their reconstruction as the output, as illustrated in Fig. 4(a). Note that there are obvious contrasts between the input and the output, since the data reconstructed by BDM reflect abnormal cases, contrary to the input. Under this scenario, the differences between the two sides help us find the informative RFs. Similarly, we term the non-disease memory (NDM) model a model which is trained on the non-diseased samples, who have normal bone and a low bone loss rate, and memorizes their attributes. The structure of NDM is similar to that of BDM, as shown in Fig. 4(b), but NDM is a non-disease-targeted model that keeps information about normal patients.

In contrast to BDM, the top block of NDM memorizes the characteristics of normal patients, since it is trained entirely on the normal samples. It serves the same function as BDM in finding the informative RFs. It can also serve as a cross-validation for the informative RFs provided by BDM.

Distance Metrics. To find the informative RFs which cause a normal case to become abnormal, we track the distance for each column pair (each column is a risk factor) between the original data and the reconstructed data. Note that unreliable information also yields a large distance. To remove the unreliable information and purify the informative RF list, we first examine the validity of BDM. We calculate the distance $d^{(k)}_{dB}$ for the $k$th RF between the original disease samples and the data generated by BDM. We denote the distance for the $k$th RF between the original non-disease data and the data generated by BDM as $d^{(k)}_{nB}$. The cumulative distance for the $k$th RF is $d^{(k)}_{cB} = |d^{(k)}_{nB} - d^{(k)}_{dB}|$. We use the Root Mean Square Error (RMSE) to calculate both $d^{(k)}_{nB}$ and $d^{(k)}_{dB}$, and the absolute distance for $d^{(k)}_{cB}$. The sum of distances over all RFs is large, since BDM and the input data follow diverse distributions.

NDM has a similar function to BDM. The only difference is that NDM is used to generate samples with the disease samples as input, and the distance between the reconstructed data and the original data is $d^{(k)}_{dN}$. The validation for NDM is the distance between the original non-disease samples and the data generated by NDM, $d^{(k)}_{nN}$. The cumulative distance is calculated as $d^{(k)}_{cN} = |d^{(k)}_{dN} - d^{(k)}_{nN}|$. We then rank $d^{(k)}_{cB}$ and $d^{(k)}_{cN}$ in descending order, respectively, and find the top-N informative RFs. Ideally, the candidate informative RFs produced by BDM and by NDM are consistent and close to one another, because only the informative RFs cause a larger distance once the unreliable data are removed.
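The distance computation and ranking can be sketched as below for one memory model (BDM). The function names and the toy matrices are hypothetical; in particular, the "reconstructed" rows are made-up numbers standing in for the output of a trained model.

```python
import math


def column_rmse(original, reconstructed, k):
    """RMSE between column k of the original and the reconstructed data."""
    n = len(original)
    return math.sqrt(
        sum((original[r][k] - reconstructed[r][k]) ** 2 for r in range(n)) / n
    )


def rank_risk_factors(disease, disease_recon, normal, normal_recon, top_n):
    """Rank RFs by d_cB = |d_nB - d_dB| for one memory model (here BDM).

    disease / normal hold the original samples (rows = patients,
    columns = RFs); *_recon hold the same rows reconstructed by the model.
    """
    n_rf = len(disease[0])
    cumulative = []
    for k in range(n_rf):
        d_dB = column_rmse(disease, disease_recon, k)  # validity check
        d_nB = column_rmse(normal, normal_recon, k)    # contrast group
        cumulative.append((abs(d_nB - d_dB), k))
    cumulative.sort(key=lambda pair: pair[0], reverse=True)
    return [k for _dist, k in cumulative[:top_n]]


# Toy matrices with 3 RFs; the reconstructions are made-up numbers.
disease = [[1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
disease_recon = [[0.9, 0.5, 1.0], [0.9, 0.5, 1.0]]
normal = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
normal_recon = [[0.9, 0.5, 1.0], [0.9, 0.5, 1.0]]

top = rank_risk_factors(disease, disease_recon, normal, normal_recon, top_n=2)
```

In this toy case the second RF is reconstructed equally badly for both groups, so subtracting the validity distance removes it from consideration, while the first RF, which is reconstructed well only for diseased samples, is ranked first.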

IV. EXPERIMENTS

A. Data Set

The Study of Osteoporotic Fractures (SOF) is the largest and most comprehensive study of risk factors (RFs) for bone diseases, and includes 9704 Caucasian women aged 65 years and older. It contains 20 years of prospective data about osteoporosis, bone fractures, breast cancer, and so on. Potential RFs and confounders were classified into 20 categories such as demographics, family history, lifestyle, and medical history [12]. A number of potential RFs are grouped and organized at the first and second visits, which include 672 variables scattered across 20 categories that serve as the input of our model. The remaining visits contain time-series dual-energy X-ray absorptiometry (DXA) scan results for bone mineral density (BMD), which are extracted and processed as the labels for our data set. Based on the WHO standard, a T-score of less than -1 (footnote 2) indicates osteopenia, the precursor to osteoporosis, which is used as the first type of label. The second type of label is the annual rate of BMD variation. We use at least two BMD values in the data set to calculate the bone loss rate, and define a high bone loss rate as more than 0.84% bone loss per year [13].
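The two label definitions can be illustrated with a short sketch. The reference BMD and standard deviation are the values given in the footnote on T-scores; the function names are hypothetical, and computing the annual loss as a constant yearly percentage of the first measurement is an assumption, since the text only states that at least two BMD values are used.

```python
REFERENCE_BMD = 0.942        # reference BMD from the footnote
REFERENCE_SD = 0.122         # reference standard deviation from the footnote
HIGH_LOSS_THRESHOLD = 0.84   # percent of BMD lost per year [13]


def t_score(bmd):
    """Number of reference standard deviations from the reference BMD."""
    return (bmd - REFERENCE_BMD) / REFERENCE_SD


def is_low_bone_mass(bmd):
    """First label type: T-score below -1."""
    return t_score(bmd) < -1.0


def annual_bone_loss_percent(bmd_first, bmd_last, years_between):
    """Average yearly BMD loss as a percentage of the first measurement."""
    return (bmd_first - bmd_last) / bmd_first * 100.0 / years_between


def is_high_bone_loss(bmd_first, bmd_last, years_between):
    """Second label type: more than 0.84% bone loss per year."""
    rate = annual_bone_loss_percent(bmd_first, bmd_last, years_between)
    return rate > HIGH_LOSS_THRESHOLD


score_at_threshold = t_score(0.82)               # approximately -1
label_low = is_low_bone_mass(0.70)               # True
label_fast = is_high_bone_loss(0.80, 0.76, 4.0)  # 1.25% per year
label_slow = is_high_bone_loss(0.80, 0.79, 4.0)  # about 0.31% per year
```

A BMD of 0.82 sits one reference standard deviation below the reference value, which matches the correspondence stated in the footnote.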

B. Evaluation Metric

The error rate on a test data set is commonly used to evaluate classification performance. Nevertheless, for most skewed medical data sets, the error rate can still be low even when all minority samples are misclassified into the majority class. Thus, two alternative measurements are used in this paper. First, Receiver Operating Characteristic (ROC) curves are plotted to capture how the number of correctly classified abnormal cases varies with the number of normal cases incorrectly classified as abnormal. Second, since in most medical problems researchers attach great importance to the fraction of examples classified as abnormal that are truly abnormal, Precision-Recall (PR) curves are also plotted to show this property.

Footnote 2: A T-score of -1 corresponds to a BMD of 0.82 if the reference BMD is 0.942 and the reference standard deviation is 0.122.
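The weakness of the error rate on skewed data, and the two alternative measurements, can be seen in a small example. The labels and scores below are made up for illustration, and the AUC of ROC is computed with the rank-based pairwise formulation rather than by integrating a plotted curve.

```python
def error_rate(labels, predictions):
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold count as abnormal."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Skewed toy set: 9 normal cases (0) and 1 abnormal case (1).
labels = [0] * 9 + [1]
all_normal = [0] * 10
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.6, 0.5]

naive_error = error_rate(labels, all_normal)              # 0.1
auc = roc_auc(labels, scores)                             # 8/9
precision, recall = precision_recall(labels, scores, 0.5)  # (0.5, 1.0)
```

A classifier that labels every case as normal reaches a 10% error rate while detecting no abnormal case at all, whereas precision and recall expose the cost of each false alarm and each missed case.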

[Figure 5: ROC curves (False Positive Rate vs. True Positive Rate) and PR curves (Recall vs. Precision). The values shown in the plot legends are listed below.]

(a) ROC and PR curve for single-layer learning.

Model                ROC      PR
LR (Expert)          0.729    0.458
SVM (Expert)         0.601    0.343
LR (RBM)             0.638    0.379
SVM (RBM)            0.591    0.358
LR (RBM with FT)     0.795    0.594
SVM (RBM with FT)    0.785    0.581

(b) ROC and PR curve for multi-layer learning.

Model                ROC      PR
LR (Expert)          0.729    0.458
SVM (Expert)         0.601    0.343
LR (DBN)             0.662    0.393
SVM (DBN)            0.631    0.386
LR (DBN with FT)     0.878    0.718
SVM (DBN with FT)    0.879    0.72

Fig. 5: Performance Comparison

C. Experiments and Results for Task 1

RFs are extracted based on expert opinion [3], [14] and summarized using the following variables: age, weight, height, BMI, parent fall, smoke, excess alcohol, rheumatoid arthritis, and physical activity. We apply two basic classifiers, LR and SVM, and choose their parameters by cross-validation for fairness. Note that this is a supervised learning process, since all samples for this expert-knowledge-based model are labeled. For a fair comparison with the classification results obtained using expert knowledge, we fix the number of output dimensions to be equal to the number of expert-selected RFs. Specifically, we fix the number of units in the output layer to 11, where each unit represents a new integrated feature describing complex relationships among all 672 input factors, rather than one of a set of typical RFs selected by experts.

Since the sample size is large and highly imbalanced in Phase 1, we evaluate the performance using the area under the curve (AUC) of the ROC and PR curves. The AUC indicates the performance of a classifier: the larger the better (an AUC of 1.0 indicates perfect performance). The number of samples in Phase 2 is small and balanced, so we evaluate the performance using only the classification error rate. We present and discuss the experimental results for both phases.

1) Phase 1 – Osteoporosis Prediction: From Fig. 5(a), we observe that the shallow RBM without FT, “LR (RBM)” and “SVM (RBM)”, gets a sense of how the data are distributed, which represents the basic characteristics of the data itself. Although its performance is not always higher than that of the expert model, “LR (Expert)” and “SVM (Expert)”, this is a completely unsupervised process that borrows no knowledge from any type of labeled information. Achieving such comparable performance is not easy, since the expert model is trained in a supervised way. However, we find that the model lacks focus on a specific task, which leads to poor performance. Further improvement is possible with a two-stage fine-tuning. We therefore take advantage of the labeled information and move from an unsupervised task to a semi-supervised task, because the data are only partially labeled. Fig. 5(a) shows the classification results after using the two-stage fine-tuning to boost the performance of all classifiers, “LR (RBM with FT)” and “SVM (RBM with FT)”. In particular, the AUC of PR of our model outperforms the expert system.

Since the capacity of an RBM model with one hidden layer is usually small, a more expressive model is needed for the complex data. To satisfy this need, we add a new layer of non-linear perceptrons at the bottom of the RBM, which forms a DBN as shown in Fig. 2(b). This newly added layer greatly enlarges the overall model expressiveness. More importantly, the deeper structure is able to extract more abstractive reasons. As expected, using a deeper structure without labeled information, both LR (DBN) and SVM (DBN) yield better performance than the shallow RBM model, as illustrated in Fig. 5(b). The models “LR (DBN with FT)” and “SVM (DBN with FT)” improve further because of the two-stage fine-tuning. The performance using the DBN model improves by an average of 32% under the ROC measure and by an average of 80% under the PR measure.

2) Phase 2 – Bone Loss Rate Prediction: In this section, we show the bone loss rate prediction using the abnormal cases from Phase 1. A high bone loss rate is an important predictor of higher fracture risk. Our integrated risk features are good at detecting this property, since they integrate the characteristics of the data itself and are tuned with the help of two kinds of labels. We compare the results of the expert-knowledge-based model with those of our DBN with fine-tuning, which yields the best performance in Phase 1. Since our result is also fine-tuned by the bone loss rate, we can directly feed the 11 new integrated features into Phase 2. Table I shows that our model achieves high predictive power when predicting bone loss rate. In this case, the expert model fails because its limited features are not sufficient to describe the bone loss rate, which may interact with other RFs. This highlights the need for a more complex model to extract the precise attributes from a large number of potential RFs. Moreover, our CDM module takes the whole data set into account, not only keeping the 672 risk factor dimensions but also utilizing two types of labeled data.

TABLE I: Classification error rate comparison

Model          LR-Error    SVM-Error
Expert         0.383       0.326
DBN with FT    0.107       0.094

D. Experiments and Results for Task 2

In this section, we show experiments and results on informative RF selection. Based on the proposed method shown in Fig. 1, we present a case study which lists the top 20 informative RFs selected using BDM and NDM in Table II. Variable descriptions are taken from the data provider [12].

In this study osteoporosis appears to be associated withseveral known risk factors that are well described in theliterature. Based on the universal rule used by FRAX [3]that is a popular fracture risk assessment tool developed byWHO, some of the selected RFs have already been usedto evaluate fracture risk of patients such as age, fracturehistory, family history, BMD and extra alcohol intake. Someresearchers find that not only well-known RFs are associated

TABLE II: Informative risk factors generated by BDM and NDM

Category              Variable  Description
Demographics          AGE       The patient’s age at this visit
Fracture history      IFX14     Vertebral fractures
                      INTX      Intertrochanteric fractures
                      FACEF     Face fracture
                      ANYF      Follow-up time to 1st any fracture since current visit
History               MHIP80    Mom hip fracture after age 80
Exam                  DSTBMC    Distal radius bone mass content (gm/cm)
                      PRXBMD    Proximal radius bone mass density (gm/cm^2)
Physical performance  TURNUM    Number of steps in turn
                      STEADY    Steadiness of turn
                      STEPUP    Ability to step up one step
                      STDARM    Does participant use arms to stand up?
                      GAID      Aid used for pace tests (i.e. crutch, cane, walker)
Exercise              50TMWT    Total number of times of activity/year at age 50
Life style            DR30      How often did you have 5 or more drinks one day
Breast cancer         BRSTCA    Breast cancer status
Blood pressure        LISYS     Systolic blood pressure lying down (mmHg)
                      DIZZY     Dizziness upon standing up
Vision                CSHAVG    Average contrast sensitivity

[Figure 6: two panels, (a) AUC of ROC curve and (b) AUC of Precision-Recall curve, each plotted against the number of informative RFs (0 to 50) for three feature sets: Informative RF, Integrated RF and Expert RF.]

Fig. 6: Osteoporosis prediction based on informative RFs. AUC of ROC curve (on left) and AUC of Precision-Recall curve (on right)

with osteoporosis and more falls, but lifestyle-related behavioral and environmental risk factors are also important causes of falls in older women. In Table II, some selected RFs, such as DIZZY, GAID, STDARM and 50TMWT, have been well studied [15], [16]. The remaining RFs may attract the interest of medical researchers and draw attention to monitoring bone disease progression.

In general, it is probably not practical to acquire many features from all participants. So which questions are the most important for the physician to ask, and how many features are needed to achieve good predictive performance? Using the proposed approach, we selected the top 50 informative RFs, instead of using all of them, and fed them directly to the Logistic Regression classifier for osteoporosis prediction. Fig. 6 shows that only the top 20 informative RFs are needed to improve both the ROC and PR curves. The area under the ROC curve and under the precision-recall curve (AUC) for our selected RFs (denoted as Informative RF) is even better than that for the RFs selected using expert knowledge (denoted as Expert RF) when the number of selected RFs is fixed to 20. The proposed informative RF selection method predicts osteoporosis well because the selected RFs are more significant than the remaining RFs. However, the prediction performance of the top 50 RFs selected by BDM and NDM is always inferior to that of the integrated RFs extracted by CDM (denoted as Integrated RF), because some information is discarded that might still contribute to predictive performance.
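The two summary metrics plotted in Fig. 6 can be computed from classifier scores as sketched below. This is a minimal, dependency-free illustration: the ROC AUC is computed as the fraction of positive/negative pairs ranked correctly, and the area under the precision-recall curve is approximated by average precision, which is one common estimator and an assumption here, since the text does not state how the areas were computed. The labels and scores are made-up toy values, not data from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC: probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision, a step-wise estimate of the area under the PR curve.

    ASSUMPTION: scores contain no ties; tied scores would make the
    result depend on the sort order.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy example: 1 = osteoporosis, 0 = no osteoporosis (hypothetical values).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc_roc = roc_auc(labels, scores)
auc_pr = average_precision(labels, scores)
print(f"AUC of ROC: {auc_roc:.3f}, AUC of PR (average precision): {auc_pr:.3f}")
# AUC of ROC: 0.778, AUC of PR (average precision): 0.806
```

To reproduce a curve like those in Fig. 6, one would repeat this evaluation on held-out predictions from a classifier trained on the top k RFs, for each k from 1 to 50.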

V. CONCLUSIONS

We proposed to tackle the problem of bone disease prediction and informative risk factor (RF) selection by modeling the observed and latent reasons behind RFs using a deep graphical model pre-trained by the CD algorithm. We found an effective way of modeling the comprehensiveness and uniqueness of different samples. First, we combined two types of bone disease label information to train our model for the prediction task. Second, we formulated a reconstruction pattern comparison framework to select the informative RFs for bone diseases. In addition, a group of “disease memories” (DMs), including comprehensive disease memory (CDM), bone disease memory (BDM) and non-disease memory (NDM), were defined and applied in our experiments. Our extensive experimental results showed that the proposed method improves the prediction performance and has great potential to select the informative RFs for bone diseases.

VI. ACKNOWLEDGMENTS

The materials published in this paper are partially supported by the National Science Foundation under Grants No. 1218393, No. 1016929, and No. 0101244.

REFERENCES

[1] World Health Organization et al., “WHO scientific group on the assessment of osteoporosis at primary health care level,” 2004.

[2] WHO Scientific Group on the Prevention and Management of Osteoporosis, Prevention and management of osteoporosis: report of a WHO scientific group. WHO, 2003.

[3] http://www.shef.ac.uk/FRAX/.

[4] R. Bender, “Introduction to the use of regression models in epidemiology,” Methods Mol Biol, vol. 471, pp. 179–195, 2009.

[5] G. Lemineur, R. Harba, N. Kilic, O. Ucan, O. Osman, and L. Benhamou, “Efficient estimation of osteoporosis using artificial neural networks,” in Industrial Electronics Society. IEEE, 2007, pp. 3039–3044.

[6] C. Ordonez and K. Zhao, “Evaluating association rules and decision trees to predict multiple target attributes,” Intelligent Data Analysis, vol. 15, no. 2, pp. 173–192, 2011.

[7] H. Li, C. Buyea, X. Li, M. Ramanathan, L. Bone, and A. Zhang, “3D bone microarchitecture modeling and fracture risk prediction,” in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2012, pp. 361–368.

[8] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, 2006.

[9] M. A. Carreira-Perpinan and G. E. Hinton, “On contrastive divergence learning,” 2005.

[10] G. E. Hinton, “Deep belief networks,” Scholarpedia, vol. 4, no. 5, p. 5947, 2009.

[11] F. Rosenblatt, Principles of neurodynamics; perceptrons and the theory of brain mechanisms. Washington: Spartan Books, 1962.

[12] http://www.sof.ucsf.edu/interface/.

[13] J. Sirola, A.-K. Koistinen, K. Salovaara, T. Rikkonen, M. Tuppurainen, J. S. Jurvelin, R. Honkanen, E. Alhava, and H. Kroger, “Bone loss rate may interact with other risk factors for fractures among elderly women: A 15-year population-based study,” Journal of Osteoporosis, vol. 2010, 2010.

[14] S. R. Cummings, M. C. Nevitt, W. S. Browner, K. Stone, K. M. Fox, K. E. Ensrud, J. Cauley, D. Black, and T. M. Vogt, “Risk factors for hip fracture in white women,” Study of Osteoporotic Fractures Research Group, vol. 332, pp. 767–773, 1995.

[15] K. A. Faulkner, J. A. Cauley, S. A. Studenski, D. Landsittel, S. Cummings, K. E. Ensrud, M. Donaldson, and M. Nevitt, “Lifestyle predicts falls independent of physical risk factors,” Osteoporosis International, vol. 20, no. 12, pp. 2025–2034, 2009.

[16] R. Bensen, J. D. Adachi, A. Papaioannou, G. Ioannidis, W. P. Olszynski, R. J. Sebaldt, T. M. Murray, R. G. Josse, J. P. Brown, D. A. Hanley et al., “Evaluation of easily measured risk factors in the prediction of osteoporotic fractures,” BMC Musculoskeletal Disorders, vol. 6, no. 1, p. 47, 2005.
