

Dynamic Bayesian Networks:

Class Discriminative Structure Search

with an Application to Functional Magnetic Resonance Imaging Data

John Burge

PhD Thesis Proposal


Table of Contents

1 Introduction
2 Background
  2.1 Bayesian Networks
  2.2 Notation
  2.3 Bayesian Multinets
  2.4 Functional Magnetic Resonance Imaging
  2.5 Talairach Database
  2.6 Related Neuroimaging Analysis Work
3 Current Work – Analyzing Neuroanatomical Relationship Dynamics
  3.1 Experimental Design for fMRI Analysis
  3.2 DBN Analysis
  3.3 Results
  3.4 Confidence Measure
  3.5 Experimental Conclusions
4 Bayesian Networks – Research Questions
  4.1 Feature Extraction
  4.2 Structure Search
  4.3 Parameter Identification
  4.4 Structure Scoring
  4.5 Analyzing Network Structures
  4.6 Phase Transitions
5 Bayesian Networks – Proposed Issues to be Addressed
  5.1 Comparative Scoring Methods and ACL
  5.2 Maximum ACL Parameters
  5.3 Hierarchical Information
  5.4 Shrinkage
6 Research Contributions
7 Acknowledgements
8 References
9 Appendices
  9.1 Bayesian networks
  9.2 Dynamic Bayesian Networks
  9.3 Gaussian Naïve Bayesian Networks
  9.4 Lagrange Multipliers for Maximizing the Binomial Ratio Test


1 Introduction

The neuroanatomical behavior and underlying root causes of many mental illnesses are only moderately understood. What changes in the neural activity of patients with schizophrenia? How do the relationships among neuroanatomical regions change due to the onset of senile dementia? Questions such as these are frequently addressed in the work of many neuroscientists. Answers to these questions could result in better diagnoses and treatment options for patients, ultimately leading to an improved livelihood for many people.

But to answer these questions, a fundamental issue must be addressed: how can the relationships among neuroanatomical regions be modeled? Neuroscientists have several standardized analytical methods which have yielded much success in furthering the current understanding of mental illness (Section 2.6). However, many of their modeling techniques make overly restrictive assumptions, such as the linearity and Gaussian distribution assumptions. These assumptions could result in the inability of their models to represent many important relationships. The machine learning community, and the AI community in general, have developed many modeling techniques that can be applied to the neuroanatomical domain that do not require these assumptions.

I propose a method based on Bayesian networks (BNs) (Jensen 2001, Pearl 1986) capable of modeling the temporal relationships among neuroanatomical regions. Specifically, I use a subset of BNs referred to as dynamic Bayesian networks (DBNs) (Murphy 2002) that are capable of modeling temporal processes. BNs are graphical models that explicitly represent statistical relationships among random variables (RVs). They are composed of two primary components: a topological structure indicating which RVs are statistically dependent on each other and a set of parameters detailing the particular behavior among interdependent RVs. In many systems, it is not known a priori which RVs are dependent on each other. For these systems, a BN’s topological structure will be unknown and must be searched for—an NP-hard task. In conjunction with finding a structure, a method for determining the BN’s parameters must also be devised. My research focuses on novel techniques capable of addressing these issues.

First, I observe that under certain circumstances, learning multiple Bayesian networks, one for each possible class of data, is a better approach than the frequently employed method of learning a single BN with a class node. Doing this falls under the general multinet framework (Heckerman 1991; Geiger and Heckerman 1996) (Section 2.3). Second, I propose a method for determining which structures are representative of class-discriminating relationships. Third, I propose a structure search heuristic that takes advantage of the relationships among hierarchically related RVs. As an extension to the third proposal, I propose to incorporate shrinkage, a statistical technique previously used in machine learning to improve classification efficacy (Section 5.4).

As structure search is an unsupervised learning problem, validation of experimental results is an issue. For validation, I compare the classification efficacy of my methods with other commonly used machine learning classification techniques. I also employ a bootstrapping method for quantitatively measuring the confidence in the results which


indicates how likely the results were due to chance (Section 3.4). Further, I show that my methods are capable of extracting known relationships in simulated data. Finally, I analyze functional magnetic resonance imaging data collected in an experiment designed to identify neuroanatomical differences between healthy and demented elderly patients. A collaborating neuroscientist, Vincent P. Clark, compares my structural results with putative neuroanatomical behavior.

2 Background

2.1 Bayesian Networks

For an introductory overview of BNs, I refer the reader to Charniak's aptly titled "Bayesian Networks without Tears" (1991) and, for a detailed analysis, to (Heckerman, Geiger & Chickering 1995; Jensen 2001). I will briefly introduce BNs and the notation I use to describe them. See Appendix 9.1 for a significantly more detailed overview.

Given a system composed of a set of RVs, it is desirable to assign a probability to every possible event, where an event is a unique setting of the RVs. Modeling a system with a joint probability distribution (JPD) does this. However, as the number of RVs in a system increases, the number of JPD parameters increases exponentially. As such, using a JPD becomes intractable for all but simplistic models. However, if some of the RVs are statistically independent, many of the JPD’s parameters will be redundant. BNs take advantage of this redundancy and provide a method to succinctly represent the non-redundant parameters within the JPD.
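To make the savings concrete, here is an illustrative count (the specific numbers are mine, not the proposal's): for n binary RVs, the full JPD has 2^n − 1 free parameters, while a BN in which every node has at most k parents needs at most n·2^k:

\[
\underbrace{2^{n}-1}_{\text{full JPD}} \quad \text{vs.} \quad \underbrace{n \, 2^{k}}_{\text{BN, at most } k \text{ parents per node}}, \qquad \text{e.g. } n = 20,\ k = 3: \quad 1{,}048{,}575 \text{ vs. } 160 .
\]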

A BN can be represented as a directed acyclic graph (DAG). For each RV in the system, the DAG contains a single node. If two nodes are statistically dependent, the DAG contains a directed link connecting the nodes. The link’s origin is the parent node and the link’s destination is the child. Since the link is directed, a sense of causality is often semantically associated with the relation such that the parent causally influences the child. Associated with each child node is a conditional probability distribution (CPD) describing the statistical relationship between the node and its parents.

A subclass of BNs, dynamic Bayesian networks (DBNs), are used to model temporal processes. Ideally, the DBN’s topological structure would contain a single column of nodes for each timeframe in the system. Each column would include one node for every RV. Unfortunately, for all but the simplest domains, such models quickly become intractably large. To deal with this intractability, two assumptions are frequently made: the Markov assumption and the stationary assumption (Appendix 9.2). Under these assumptions, the system can be modeled with a DBN containing two columns of RVs (see Figure 13 in Appendix 9.2 for an example of such a DBN).
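Stated as formulas (a standard rendering of these two assumptions, added for clarity): the first-order Markov assumption lets the unrolled joint distribution factor over consecutive columns, and the stationary assumption requires the same transition distribution at every time step, which is why two columns of RVs suffice:

\[
P\!\left(X^{0:T}\right) \;=\; P\!\left(X^{0}\right) \prod_{t=0}^{T-1} P\!\left(X^{t+1} \mid X^{t}\right), \qquad P\!\left(X^{t+1} \mid X^{t}\right) \text{ identical for all } t .
\]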

Unlike other BNs with fixed topologies, such as the Gaussian naïve Bayesian network (Appendix 9.3) and the hidden Markov model, a DBN’s topology is frequently not known. Finding the correct DBN topology, a process referred to as structure search, is a well-understood problem (Heckerman 1995) that is NP-Hard (Chickering, Geiger & Heckerman 1994). Much of my research focuses on the details for finding the structures that best describe a set of data.


2.2 Notation

A Bayesian network is described by the pair (B_S, B_O), where B_S is the graphical structure of the network and B_O specifies the conditional probability parameters of the BN. B_S contains a node for every RV in the system, X = {X_1, X_2, …, X_n} with arities {r_1, r_2, …, r_n}, and a link for every causal dependency between RVs.

Y is a RV used to represent the class associated with a set of observed RVs. To simplify the math, I will assume binary classification such that the domain of Y is {1, 2}. A data point consists of observable RVs and a class RV: d = {X, Y}. X_d and Y_d refer to the observable RVs and the class of data point d. A dataset, D, is a collection of m data points, {d_1, …, d_m}. D_α denotes a dataset containing all of (and only) the data points for a specific class, i.e., D_α = {d : Y_d = α}.

B denotes a Bayesian network containing nodes for RVs {X, Y}. The parent set for a RV X_i in B is denoted Pa_B(X_i), or just Pa(X_i) if the BN can be inferred. q_i is the number of possible configurations of the RVs in Pa(X_i). The substructure in a BN containing X_i and Pa(X_i) is referred to as a family. For convenience, the family that contains X_i as a child is referred to as family X_i.

Unless specified otherwise, all BNs are assumed to be discrete and their conditional probability distributions are represented by conditional probability tables (CPTs). The joint probability distribution represented by B is given in Equation (1),

\[
P_B(\mathbf{X}, Y) \;=\; P_B\big(Y \mid Pa(Y)\big) \prod_{i=1}^{n} P_B\big(X_i \mid Pa(X_i)\big) . \qquad (1)
\]

Θ^B is the set of CPT parameters for BN B. Θ^B_i is the CPT for node X_i in BN B, Θ^B_{i,j} is the multinomial P_B(X_i | Pa(X_i) = j) and Θ^B_{i,j,k} is the single CPT parameter P_B(X_i = k | Pa(X_i) = j).
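As a minimal illustration of Equation (1), the sketch below evaluates the joint probability of a full assignment by multiplying one CPT entry per node. The dictionary-based CPT representation and the toy two-node network are hypothetical, used only for illustration:

import math

# Hypothetical toy network: a class node Y (domain {1, 2}) and one binary observable X1.
# Each CPT maps (parent configuration tuple, child value) -> probability.
network = {
    "Y":  {"parents": [],    "cpt": {((), 1): 0.5, ((), 2): 0.5}},
    "X1": {"parents": ["Y"], "cpt": {((1,), 0): 0.7, ((1,), 1): 0.3,
                                     ((2,), 0): 0.4, ((2,), 1): 0.6}},
}

def log_joint(assignment, network):
    """Equation (1): log P(X, Y) = sum over nodes of log P(node | its parents)."""
    total = 0.0
    for name, node in network.items():
        parent_cfg = tuple(assignment[p] for p in node["parents"])
        total += math.log(node["cpt"][(parent_cfg, assignment[name])])
    return total

print(math.exp(log_joint({"Y": 1, "X1": 0}, network)))  # P(Y=1) * P(X1=0 | Y=1) = 0.35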

Notation for DBNs is slightly modified. X_i^t (also denoted X_i^{t+0}) and X_i^{t+1} represent the i-th RV in columns t and t+1, respectively. The set of DBN RVs is denoted X^{0:1} = {X_i^t, X_i^{t+1} : 1 ≤ i ≤ n}. The parameters for P_B(X_i^{t+e} = k | Pa(X_i^{t+e}) = j) are denoted Θ^B_{e,i,j,k}, e ∈ {0, 1}. Y represents the class associated with the entire time series.

2.3 Bayesian Multinets

Frequently, when learning a BN for the purposes of class discrimination, a single BN with a class node is trained. However, if the conditional independence assumptions among RVs change based on the class, the resulting BN will have an overly complex structure with many redundant parameters. Instead of learning a single BN, a BN for each value of the class node should be learnt. This reduces the maximum number of simultaneous parents needed for the families whose parent sets change across classes and results in a better use of sparse data. A multinet (Heckerman 1991; Geiger and Heckerman 1996) results from the use of multiple BNs to represent a single more complex BN. Figure 1 illustrates a simple example.

Using multiple BNs also increases the comprehensibility of learned models. If a single BN were used to simultaneously model both classes, the relationships applicable for each individual class would have to be teased out from the parameters—a task significantly



more difficult than analyzing the structure. For example, observing the topology for the BN in Figure 1, it is not apparent that X1 and X3 are only correlated when Y = 1.

The BN used to model class α is referred to as B_α. D_α is B_α's intra-class data and D_β, β ≠ α, is B_α's extra-class data. An event ξ^β_{i,j,k} is defined to be the occurrence of a data point d ∈ D_β where X_i = k and Pa_B(X_i) is in its j-th configuration. The value η^β_{i,j,k} is the number of times ξ^β_{i,j,k} occurs.

Multinets have previously been employed with DBNs for speech recognition by Bilmes (2000). Bilmes employed multinets as alternatives to hidden Markov models, achieving higher classification accuracies from models with fewer parameters.

2.4 Functional Magnetic Resonance Imaging

Recently, magnetic resonance imaging (MRI) techniques have become widely used in the study and diagnosis of mental illness. They are non-invasive and allow properties of small cubic regions of brain tissue (voxels) to be measured. Depending on the type of MRI used, different properties of tissue can be measured. Of interest to my work is functional MRI (fMRI). fMRI measures the amount of oxygenated hemoglobin present in the blood surrounding neural tissue. This measurement is referred to as the blood oxygen level dependent (BOLD) response. The level of the measured BOLD response is believed to be positively correlated with the amount of neural activation within the voxel.

In order to test a hypothesis on the behavior of a mental illness, psychologists design experiments for patients to undergo while receiving an fMRI scan. These experiments are believed to elicit different responses between healthy controls and afflicted patients. Typically, these experiments consist of a simple recognition task in which the patient is exposed to an audio or visual stimulus to which they respond with a signal indicating that they have observed the stimulus.

A typical fMRI session generates a tremendous amount of information which frequently renders direct analysis intractable. Instead, the analysis must be performed on features

extracted from the data. For example, a standard fMRI scanning session can result in a single 3D image every few seconds and will last on the order of an hour, yielding hundreds of images. Each image will typically contain on the order of tens of thousands of voxels. In total, an fMRI session can result in millions or even tens of millions of data values per patient.

Figure 1. A system where X3 is correlated with X1 when Y = 1 and with X2 when Y = 2. (left) The BN required to model the system; the CPT contains Θ(r^4) parameters, where r is the arity of the nodes. (right) The multinet required to represent the system; the CPTs contain only Θ(r^3) parameters.

2.5 Talairach Database

To reduce the enormous quantity of data produced by an fMRI scan, I abstract from the voxel level to the neuroanatomical region level. Several databases describe such regions of interest (ROIs), with the most widely used being the Talairach database (Lancaster et al. 2000).

The Talairach database separates voxels into 150 different regions layered into four hierarchical levels. Each hierarchical level corresponds to a different level of granularity of the brain. The topmost level contains gross regions such as the left and right cerebrum, cerebellum and brain stem, while the bottommost level contains smaller, more specialized regions such as the Brodmann areas and individual nuclei. The Talairach database maps every voxel into one region per hierarchical level. I use the mapping to compute the aggregate mean voxel activation (AMVA) for each ROI. Analysis can then be performed on the resulting AMVA time series.
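A minimal sketch of the AMVA computation, assuming the fMRI data is already a (time × voxel) array and that a voxel-to-ROI label array for one Talairach hierarchical level is available; the array and function names are illustrative, not from the proposal:

import numpy as np

def amva_time_series(fmri, roi_of_voxel):
    """Aggregate mean voxel activation: for each ROI, average its voxels at every time step.
    fmri: (time steps x voxels) array of BOLD values.
    roi_of_voxel: (voxels,) array of ROI labels for one Talairach hierarchical level."""
    return {roi: fmri[:, roi_of_voxel == roi].mean(axis=1)
            for roi in np.unique(roi_of_voxel)}

# Toy usage: 5 time steps, 6 voxels grouped into 2 ROIs.
fmri = np.random.rand(5, 6)
roi_of_voxel = np.array([0, 0, 0, 1, 1, 1])
print({roi: ts.shape for roi, ts in amva_time_series(fmri, roi_of_voxel).items()})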

2.6 Related Neuroimaging Analysis Work

Much of the previous work in the analysis of fMRI images uses statistical regression and hypothesis testing, frequently via statistical software packages such as SPM (Friston 2002) or AFNI (Cox 1996). The general linear model (GLM) is frequently used to determine if a stimulus is correlated with voxel activity (Friston et al. 1995). Derived from the GLM, the t-test is also used (Worsley & Friston 1995).

Discriminant analysis techniques (e.g., multivariate regression) are commonly used to analyze neuroimaging data and they share many modeling similarities with BNs. The main difference lies in the methods used to measure correlation between RVs. Discriminant analysis techniques generally measure correlation with an analysis of variance (ANOVA), whereas the generality of the BN framework does not constrain the means of measuring correlation to any single method. Instead, the BN framework allows RVs to be modeled in an arbitrary fashion, with many possible methods for measuring correlation. Of course, for any particular method chosen, there will be assumptions made that limit the representative power of the BN.

The most commonly employed method in neuroscience for modeling connectivity between ROIs is structural equation modeling (SEM), also known as path analysis. The goal in SEM is to use a graphical structure to model the covariance between groups of RVs. Statistical inference is used to measure the goodness of fit of proposed covariance structures to data and, by selecting the model that maximizes the goodness of fit, a set of causal relationships can be elicited from the data. As an example of using SEM to analyze fMRI data, Büchel and Friston (1997) were able to show significant changes in the relationships between regions in the dorsal visual system caused by shifts in attention. The most significant difference between SEM and BNs is that SEM assumes the random


variables can be accurately modeled with first- and second-order moments (i.e., that their values are Gaussian distributed) and frequently makes linearity assumptions. Again, the BN framework does not require these assumptions (though they may be present in some BN applications) and can model relationships SEMs typically cannot.

Mine is not the first BN technique applied within the neuroanatomical domain. Højen-Sørensen, Hansen and Rasmussen (2000) use Hidden Markov Models (HMMs), a special case of DBNs, to learn a model of activity within the visual cortex from visual stimuli. They were capable of reconstructing the patterns of visual stimulus presented to patients based solely on fMRI data.

However, the work most similar to mine is that of Mitchell et al. (2003). They use a BN approach to analyze fMRI data with the goal of classifying instantaneous cognitive states such as 'reading a book' or 'looking at a picture'. They use three separate algorithms for analyzing fMRI data: a Gaussian naïve Bayesian network (GNBN) classifier, a support vector machine (SVM) and a K-nearest neighbor (KNN) classifier. Their work significantly differs from mine as they solely attempt to classify fMRI data, a supervised learning task, while I attempt to both classify and extract neuroanatomical relationships, a more difficult unsupervised learning task.

Friston, Harrison and Penny (2003) employed Bayesian network modeling techniques to attempt reconstruction of activation networks. Like me, they employ dynamic Bayesian networks, which they refer to as dynamic causal models (DCMs). I refer the reader to Penny et al. (2004) for a review and comparison of SEMs and DCMs. They restrict the correlational model among RVs to bilinear relationships. Given the simplicity of their correlational model, they do not need to perform a structure search. Further, using their extensive knowledge of neurophysiology, they incorporate a detailed hemodynamic response into their model. I instead assume a multinomial correlational model and do not explicitly model the hemodynamic response. Though my correlational model is more general and can capture many nonlinear relationships theirs cannot, my approach leads to a vastly more complex model identification problem and requires a number of assumptions and approximations not required by Friston, Harrison and Penny.

3 Current Work – Analyzing Neuroanatomical Relationship Dynamics

I have experimented with the ability of DBNs to model changing relationships among neuroanatomical regions caused by the onset of dementia. In this section, I will describe the DBN analysis method, the experimental results, the validation methods and the conclusions I reached during the experiment.

3.1 Experimental Design for fMRI Analysis

Figure 2 gives a flowchart for the steps performed in the analysis of the fMRI data. First, raw fMRI data is obtained for each of the patients. Then, for each individual fMRI scan, voxels are grouped into the ROIs they compose via the Talairach database. For each structure, the aggregate mean voxel activation (AMVA) is computed at each time step, resulting in a time series for each ROI. The collection of AMVA time series for every ROI is collated into a single data point. The collection of all the data points is contained


in dataset D = {d_1, d_2, …, d_40}, where d_i contains the data of the i-th patient. Each d_i can be thought of as a table, with the columns representing ROIs and the rows representing time frames. Each cell in the table is the AMVA of a single ROI at a single point in time.

D is divided into four datasets, Dh, Dd, De and Dy, containing data for the healthy elderly, the demented elderly, all the elderly and the young subjects, respectively. De and Dy form a partition of D. Dh and Dd form a partition of De.

At this point, an SVM and a GNBN can be applied to the data. However, the DBN I employ is discrete and requires the continuous data to be quantized. Each data point is separated into sequential non-overlapping temporal windows of eight time points. A trend is then defined per window as the mean of the AMVA values within the window. The difference between the AMVA value and trend value can then be used to assign a region into a high state or a low state.

Three quantization methods are employed¹: HighLow, HighLow2 and HighLow4. The first method results in two quantized states: one state above the trend and one state below the trend. The second method results in four equally sized states, two above and two below the trend. The third method results in eight equally sized states, four above and four below the trend. Figure 2e illustrates the HighLow quantization for the AMVA time series in Figure 2c. After the data is quantized, a DBN is learnt for each of the datasets.

¹ I examined other methods of quantization, such as using medians and quartiles as opposed to means and extrema, and creating a separate region surrounding the trend constituting a 'noisy' state, but found no significant impact on the results.
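A minimal sketch of the HighLow quantization under the design described above (non-overlapping eight-point windows, trend = window mean, one state above and one below the trend); HighLow2 and HighLow4 would further split each side of the trend into equally sized bins. Names are illustrative:

import numpy as np

def highlow(amva, window=8):
    """Two-state HighLow quantization: 1 if the value is above its window's trend
    (the mean of the window), 0 otherwise."""
    states = np.zeros(len(amva), dtype=int)
    for start in range(0, len(amva), window):
        chunk = amva[start:start + window]
        states[start:start + len(chunk)] = (chunk > chunk.mean()).astype(int)
    return states

print(highlow(np.array([1., 2., 3., 4., 5., 6., 7., 8.])))  # [0 0 0 0 1 1 1 1]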


Figure 2. Experimental design flowchart. (a) Raw data is first acquired. (b) The data is then grouped into ROIs with the Talairach database. (c) The aggregate mean voxel activation (AMVA) is computed for each ROI over time as well as the sliding window trend. (d, f) The GNBN and SVM models are learnt from the AMVA time series. (e) The AMVA time series is quantized (HighLow quantization shown). (g) The discrete dynamic Bayesian network model is learnt from the quantized data.



3.2 DBN Analysis

The DBNs make both the first-order Markov and stationary assumptions. To perform the structure search, I use the BDE score (Heckerman, Geiger and Chickering 1995) and employ the following greedy algorithm. For each node in column t+1, the algorithm finds the parent that increases the BN's score the most. It then finds the next parent that increases the score the most in conjunction with the previously added parent. This process is repeated until a predetermined number of parents has been added.

The search algorithm is parameterized by two variables, numParents and numBestToKeep. numParents indicates the maximum number of parents each ROI should have. This is required since adding a parent will result in a higher BDE score but will eventually result in a loss of model generality and lower classification accuracy due to overfitting. I set numParents empirically, as the highest value possible before leave-one-out cross-validation (LOOCV) accuracies begin to drop. numBestToKeep indicates how many nodes should retain their parents during classification. Generally, classification accuracy improves when only a certain subset of nodes is allowed to retain their parents. This is because many nodes contribute only noise or overfit the data, reducing model generalization.
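An outline of the greedy parent-selection loop described above, assuming a hypothetical bde_score(child, parents, data) function that returns a family's BDE score; this is a sketch of the procedure, not the proposal's implementation (numBestToKeep, applied at classification time, is omitted):

def greedy_parent_search(children, candidates, data, num_parents, bde_score):
    """For each column-(t+1) node, greedily add the column-t parent that most
    improves the family's BDE score, up to num_parents parents per family."""
    structure = {}
    for child in children:
        parents = []
        while len(parents) < num_parents:
            current = bde_score(child, parents, data)
            scored = [(bde_score(child, parents + [c], data), c)
                      for c in candidates if c not in parents]
            if not scored:
                break
            best_score, best_parent = max(scored)
            if best_score <= current:      # no candidate improves the family score
                break
            parents.append(best_parent)
        structure[child] = parents
    return structure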

The BDE score is a Bayesian approach that allows prior information to guide the results of the structure search. I employ an uninformed structural prior and an uninformed conditional probability prior. This makes the assertion that, prior to seeing any data on the patients, there is no bias towards believing any two ROIs are more or less likely to appear causally related. Nor is there a bias towards how any discovered relationships should behave.

Using this algorithm, four DBNs were learnt: DBNh, DBNd, DBNe and DBNy, obtained by training on the Dh, Dd, De and Dy datasets, respectively. To classify a new subject's data, Dnew (which has not been used to train the DBNs), the posterior likelihood of each model given the subject's data, P(DBNψ | Dnew), ψ ∈ {h, d, e, y}, was calculated. The model with the highest posterior likelihood was used to determine the class of the new subject. Datasets Dh and Dd were used to differentiate between healthy and demented patients while De and Dy were used to differentiate between elderly and young patients.
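Classification then amounts to picking the model that assigns the held-out subject's data the highest likelihood (equivalent to the highest posterior under equal class priors); a minimal sketch with a hypothetical log_likelihood function:

def classify(d_new, dbns, log_likelihood):
    """dbns maps a class label (e.g. 'h', 'd', 'e', 'y') to its trained DBN.
    Returns the label whose model assigns d_new the highest log-likelihood,
    i.e. the highest posterior under equal class priors."""
    return max(dbns, key=lambda label: log_likelihood(dbns[label], d_new))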

3.3 Results

Analyzing the structural differences between DBNs can illuminate how the relationships between neuroanatomical ROIs change due to dementia. One way to analyze the structural differences is to examine how the parent set for a child changes between the healthy and demented DBNs. Another telling feature is the list of the highest scoring families. Figure 3 gives a sample of the top families for both the healthy and demented DBNs. By looking at which families have the best scores, one can determine which ROIs are most strongly correlated with other regions in the brain.

However, through interaction with my collaborating neuroscientist, I have found the most informative structural features tend to be information on which ROIs are commonly parents in one DBN, but not the other. Finding such regions can indicate significant changes in the way patients use their brain. Figure 3b gives the structural results for seven of the top ten demented families. It’s striking that a region known as the amygdala


appears as a parent in all of them. In fact, of all the demented families, the amygdala is a parent of nearly fifty percent of them. This is profound given that in the healthy DBN, the amygdala is a parent of only a single family. These results strongly suggest that the behavior of the amygdala has significantly changed in demented patients. Further analysis of the structural results is outside the scope of this proposal. For a more detailed discussion, see (Burge et al. 2004).

To validate the structure search method, I have compared the classification performance of the DBNs against two other powerful machine learning classifiers, a support vector machine (SVM) and a Gaussian naïve Bayesian network (GNBN). For the classification between healthy elderly and demented elderly patients, both the DBN and the GNBN achieved a 73% classification accuracy. Of the healthy subjects, the DBN had a 93% accuracy, while the GNBN had an 84% accuracy. Thus, while the overall classification accuracy of the DBN and GNBN were the same, they show different biases towards classifying healthy patients correctly vs. classifying demented patients correctly. Both the DBN and GNBN achieved higher accuracies than the SVM, which at its best only achieved a classification accuracy of 65%. These results verify that the DBN is capable of learning discriminative qualities between healthy and demented patients with at least equal strength compared to two other powerful and commonly used machine learning techniques.


Figure 3. DBN structural results. a) Highest scoring family in the healthy DBN. b) Seven of the highest scoring families in the demented DBN with the amygdala as a parent (highlighted to emphasize its prevalence in the structural results). Of all the demented families, the amygdala is a parent in 49% of them. Of all the healthy families, the amygdala is only a parent in one of them.



I have also attempted to compare the classification efficacy of the DBN with that of typical neuroscience approaches. Unfortunately, as classification is not the primary goal in many neuroscience applications, Buckner et al. did not attempt classification on the demented data. However, they found few functional differences between groups, which suggests that classification would be difficult. There have been other studies that have attempted to classify dementia based on neuroanatomical data. Using multivariate linear regression on positron emission tomography (PET) data, Azari et al. (1993) examined the neurophysiological effects of dementia in elderly patients. They were able to attain a significantly higher classification accuracy than I have on my initial DBNs, correctly classifying dementia with 87% accuracy. However, while the subject populations of Buckner et al.'s study and Azari et al.'s study are similar, Buckner et al.'s demented patients suffered only from very mild or mild dementia as opposed to the moderate dementia in Azari et al.'s study. The comparison between my results and theirs is also complicated by the different imaging techniques used (fMRI vs. PET) and different experimental paradigms. I am not aware of other more-compatible experiments to compare my classification results with.

To show that the results of the DBN structure search would not be found with linear regression models (such as the one employed by Azari et al.), I compared the strength of the linear fit in the relationships of the top five demented and healthy families with the linear fit of 200,000 families whose structures were randomly chosen. For each family, a linear regressor minimizing the sum of squared residuals was fit with the dependent RV set as the family's child RV and the independent RVs set as the family's parent RVs. The r values² for the top five demented families ranged from 0.19 to 0.59 and for the top five healthy families, from 0.26 to 0.65. However, many of the randomly generated families had significantly higher r values, with the maximum r value being 0.74. Further, within the top five families, the best r values did not correspond to the best BDE scores. For instance, of the top five healthy families, the family with the lowest BDE score had the highest r value and thus the strongest linear fit. These results indicate both that the DBN structure search does not simply maximize linear fit and that linear regression based models would not have found the relationships within the DBN (which could explain why Buckner et al. found few differences between healthy and demented subjects).
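A sketch of the linear-fit comparison: regress a family's child ROI on its parent ROIs by ordinary least squares and compute r as defined in footnote 2. The (time × ROI) data layout follows Section 3.1; the function name is illustrative:

import numpy as np

def family_r(data, child, parents):
    """Least-squares fit of the child ROI on its parent ROIs; returns r as defined
    in footnote 2. data: (time frames x ROIs) table; child/parents are column indices."""
    X = np.column_stack([data[:, parents], np.ones(len(data))])   # parents plus intercept
    y = data[:, child]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ coef
    return np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)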

3.4 Confidence Measure

I have applied a technique to measure the confidence in the individual DBN structural results. It is based on the bootstrap method used by Friedman et al. (2000) in their analysis of gene expression data. The bootstrap method measures the likelihood that the score of a DBN family was due to random chance.

Recall that the data for each patient is stored in a table where the columns represent the activation values of neuroanatomical regions and the rows represent points in time. Normally, the rows are ordered chronologically and the temporal correlations between time frames are intact. To destroy the temporal correlations, the rows can be randomly

² r is the coefficient of determination, a measure of the goodness of fit of a linear regressor: r = Σ(Ŷ − ⟨Y⟩)² / Σ(Y − ⟨Y⟩)², where Y is the dependent variable and Ŷ is the regressor's estimate of Y.


permuted. For any given row in the table, the relationships among regions are not affected. DBNs are then learnt for the randomly permuted datasets to calibrate BDE scores for networks based on data with no temporal relationships. I will refer to these DBNs as random DBNs and the DBNs learnt on the actual data as true DBNs.

This process is repeated 20 times and a distribution of BDE scores for each family in the random DBNs is created. I then assign a measure of confidence to the families in the true DBNs based on the difference between the true DBN’s BDE score and the random DBNs’ BDE scores. Equation (2) gives the confidence measure,

\[
\mathrm{Conf}(X_i) \;=\; \frac{BDE(X_{i,\mathrm{true}}) - \mu_{i,\mathrm{rand}}}{\sigma_{i,\mathrm{rand}}} , \qquad (2)
\]

where BDE(X_{i,true}) is the BDE score for family X_i in the true DBN, μ_{i,rand} is the mean BDE score for the X_i family in the random DBNs and σ_{i,rand} is the standard deviation of the BDE scores for the X_i family in the random DBNs. Figure 4 gives a visualization of the confidence measure.
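A sketch of the permutation procedure behind Equation (2), assuming a hypothetical family_bde_scores(data) function that runs the structure search and returns a BDE score per family; names are illustrative:

import numpy as np

def family_confidences(data, family_bde_scores, n_permutations=20, seed=0):
    """Equation (2): Conf(X_i) = (BDE_true(X_i) - mean_rand(X_i)) / std_rand(X_i).
    data: (time frames x ROIs) table; shuffling its rows destroys temporal order."""
    rng = np.random.default_rng(seed)
    true_scores = family_bde_scores(data)                  # {family: BDE score}
    rand_scores = {f: [] for f in true_scores}
    for _ in range(n_permutations):
        permuted = data[rng.permutation(len(data))]        # randomly permute the rows
        for family, score in family_bde_scores(permuted).items():
            rand_scores[family].append(score)
    return {f: (true_scores[f] - np.mean(rand_scores[f])) / np.std(rand_scores[f])
            for f in true_scores}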

For all of the highest BDE scoring families, there was high confidence that the relationships were not due to random chance. For instance, the top demented family had a confidence of 7.64. This indicates that family’s BDE score in the true DBN was 7.64 standard deviations higher than the average BDE score in the random DBNs. The top five families (and many others), all had confidences of at least 4.9. Thus, it is extremely unlikely the relationships in the DBN families were due to chance alone.

Figure 5 illustrates a “difference histogram” which shows that the BDE score distribution in the actual data was significantly higher than the BDE score distribution in the randomized data. This indicates that overall, the true DBN relationships are significantly stronger than random DBN relationships—as was expected.


Figure 4. Visualization of the confidence score. The families in a DBN trained on a true dataset vary in their reliability. Computing the BDE scores of families trained on randomly permuted datasets yields a distribution, assumed to be Gaussian with mean μ and standard deviation σ, of BDE scores occurring due to random chance. The larger the distance of a true family's BDE score from μ, the more likely the true family did not occur by chance.


3.5 Experimental Conclusions

By comparing the structural differences between DBNs learnt on healthy and demented subjects, I found many changes in neuroanatomical relationships congruent with known neuroanatomical behavior (as confirmed by my collaborating neuroscientist). I also uncovered a significant increase in the correlation of the amygdala to many other neuroanatomical regions, especially within the parietal lobe. This is the first report that my collaborating neuroscientist was aware of showing an increased correlation of the amygdala in an emotionally neutral task due to the onset of dementia. Further, it agrees with a previous study (Grady et al. 2001) which found an increased role of the amygdala in demented patients on an emotionally charged task (facial recognition).

It is my hope that the results of this experiment will foster further research into the extent of amygdalar involvement in dementia, ultimately leading to better diagnosis and treatment. I plan to continue my analysis of the demented dataset with the DBN methods I am proposing in Section 5, which build on the techniques and lessons learned in this experiment.

4 Bayesian Networks – Research Questions

I have uncovered too many BN research problems in my fMRI experiments to cover in a single thesis. In this section I will briefly describe the open areas of research and specify which of them I will pursue in my doctoral research. I will not be directly addressing all of these issues in my thesis.

Figure 5. A "difference histogram" showing, for each bin of BDE scores (x-axis: healthy family's BDE score, increasing to the right), the number of times the true BDE scores fell in the bin minus the number of times the randomized BDE scores fell in the bin. Positive bins indicate ranges of BDE scores seen more often in the true DBNs. Negative bins indicate ranges seen more often in the random DBNs. The positive bins are located on the right side of the graph, corresponding to higher scores and indicating that the scores were on average higher for the true DBNs.

Much of the BN research can be put into two main categories. First, given a BN and its parameters, how can the values of hidden variables be inferred from a set of observed variables? This process is referred to as BN inference. While inference on typically


employed BNs (e.g., those with common probability distribution functions such as multinomials, Gaussians, etc.) is a well understood problem (Jensen 2001), many modifications to the BN framework require modifications to inference techniques.

Second, given a collection of data, what is the most appropriate Bayesian network model that describes it? My research focuses on this second category, which I will refer to as model identification. BN model identification can be separated into two tasks: structure search and parameter identification.

Model identification is also a well-understood problem (Heckerman 1995). Typically, it involves searching many hypothesis structures, calculating the parameters for each structure (frequently with a maximum likelihood estimate) and then scoring the model with a function that measures the goodness of fit of the proposed BN to the data. While this is a fairly straightforward procedure, many details need to be specified and many issues arise.

4.1 Feature Extraction

Within many domains, the amount of data available is so large that learning directly on the data is intractable. This is the case with fMRI data. In order to deal with this intractability, learning is applied to features extracted from the data. The method for extracting features is generally specific to the domain at hand. I use the Talairach database (Lancaster et al. 2000) to convert a series of fMRI images into a time series detailing the activation of neuroanatomical regions, but other methods for extracting features could prove promising.

Of particular interest is the notion of a probabilistic 4D atlas proposed by Mazziotta et al. (2001) in which individual voxels are not assigned absolute labels, as they are in the Talairach database. Instead, each voxel is assigned a probability of being located within a set of ROIs. This system could adequately deal with the differing anatomical structures among patients due to age, disease, normal spatial differences, etc. Bayesian methods such as the DBNs I employ offer a natural method for working with such a probabilistic database.

4.2 Structure Search

Given a method for extracting features from a dataset, the next issue is structure search. Ideally, every possible BN structure would be searched. However, the size of the structure space is prohibitively large for most real-world domains. For example, the BNs that I use in the fMRI domain have approximately 10^47 possible structures—clearly too many to search exhaustively. Search heuristics must be employed.

The most common heuristics applied are iterative greedy algorithms that repetitively make small changes to a hypothesis structure, accepting the change only if it results in an increase in the structure’s score. Other more complex heuristics have also been applied to BN structure search such as simulated annealing (Heckerman, Geiger and Chickering 1995) and genetic algorithms (Larrañaga et al. 1996) as well as techniques that attempt to increase implementation efficiency such as Moore and Wong’s (2003) optimal


reinsertion. Techniques for searching structures with very large sets of RVs have also been developed (Goldberg and Moore 2004).

Further, in many real-world domains, RVs do not interact with each other haphazardly or chaotically, but instead participate in meaningful semantic relationships, such as the is-a and has-a relationships from the object-oriented programming paradigm. Within the neuroanatomical domain, ROIs are related to each other via the contains and contained-by relationships. Each ROI contains smaller regions or is contained by larger regions (or both). A hierarchy can be built as the conglomeration of these semantic relationships. Structure search can benefit from a hierarchical relationship between RVs. This is one of my main focuses for research and I will defer further discussion to Section 5.3.

An issue important in modeling temporal processes is time delay. In the fMRI DBNs, I make the first-order Markov assumption. Doing so causes the DBN to ignore all correlations among RVs that take more than a single time frame to develop. An obvious approach to taking longer temporal relations into account is to simply have more columns of RVs in the BN—however, this dramatically increases the search space. For example, simply adding a t+2 column of RVs in the fMRI DBNs increases the number of possible structures from 10^47 to 10^92. Further exacerbating the problem, many of these structures have larger maximum family sizes, which exponentially increases the number of parameters. One possible avenue for research would be to determine how structure search in such large state spaces could progress.

Structure search has been shown to be an NP-hard problem (Chickering 1994). If it can also be shown to be NP-easy, then it could be reduced to other NP-complete problems such as integer linear programming (ILP). The tremendous amount of research developed for solving ILP problems could possibly be leveraged for solving structure search problems. All that would be needed is a method for converting a BN structure search into an ILP.

Another significant issue is the incorporation of hidden variables. I have constructed BNs that are built solely upon observed data points. Others have built models that assume a ROI’s state cannot be directly measured and must be inferred from observed data (Friston, Harrison and Penny 2003). Unfortunately, performing a structure search on BNs with hidden variables is not a well understood problem. The relatively obvious approach of estimating the distribution of values for hidden variables (perhaps via EM) for each proposed structure is not tractable. However, Friedman (1998) has made significant advancement in this area with his Structural EM algorithm, which calculates expected scores for hypothesis BNs as opposed to actual scores.

4.3 Parameter Identification

For each hypothesized BN structure, a set of parameters must be determined. The method for doing so is generally tied to the method for scoring a BN. In discrete BNs, following Heckerman, Geiger and Chickering (1995), the parameters are typically calculated by simply counting the number of times RVs take on certain values in the data. As I propose a new BN scoring method, I must also propose an appropriate parameter identification method, which I discuss in Section 5.1. As was the case with structure


search, I conjecture that a hierarchical relationship between RVs can be used to more effectively determine the BN parameters. I will discuss this in Section 5.4.

4.4 Structure Scoring

When dealing with Bayesian scores such as Heckerman’s BDE score, prior information needs to be specified. I use uninformative priors. This means that all structures and parameters are considered, a priori, as equally likely as possible. However, a wealth of information exists in neuroscience that could be used to help guide BN structure search and parameter learning. This information could be incorporated as prior information for Bayesian methods.

For BN models that attempt to discriminate among separate classes of data, using likelihood-based scores such as the BDE score is not ideal. Instead, the score should favor structures that discriminate among the classes. This constitutes a significant portion of my research and I will defer further discussion to Section 5.1.

4.5 Analyzing Network Structures

After identifying separate BNs for differing classes of data, analysis of the structural differences in the BNs can indicate fundamental differences among the classes. I have found in the analysis of DBN structures learnt from fMRI data that the most informative structural information is the number of times ROIs appear as parents in each DBN. For instance, if a RV X is a parent 50% of the time in class C1, but only a parent 5% of the time in class C2, then it is likely X plays a more significant role in C1 than in C2.
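A minimal sketch of this analysis, counting the fraction of families in which each ROI appears as a parent, given a learnt structure represented as a {child: list-of-parents} mapping (an assumed representation, not the proposal's):

from collections import Counter

def parent_fractions(structure):
    """Fraction of families in which each ROI appears as a parent.
    structure: {child ROI: list of parent ROIs} from a learnt DBN."""
    counts = Counter(p for parents in structure.values() for p in parents)
    return {roi: count / len(structure) for roi, count in counts.items()}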

In many domains, including the neuroanatomical domain, it would also be useful to identify feedback loops—networks of causal influences that are cyclic in nature. Unfortunately, a method for inferring feedback loops from DBNs is not immediately obvious. If a feedback loop is simply defined as a cyclic set of temporal correlations between variables, there will be many possible feedback loops, perhaps too many to even enumerate. A method for determining which feedback loops are important is necessary and constitutes an interesting avenue for research.

4.6 Phase Transitions

A common problem in machine learning is determining under what conditions individual algorithms do well. One possible approach for addressing this problem is phase transition analysis (PTA). PTA is not frequently applied in machine learning but Baskiotis and Sebag (2004) have introduced a PTA-based method for analyzing the behavior of the C4.5 decision tree algorithm (Hastie, Tibshirani and Friedman 2001). Using a similar technique, PTA could help determine which categories of problems DBNs should be applied to.

5 Bayesian Networks – Proposed Issues to be Addressed

In this section, I describe in depth the issues identified in Section 4 that constitute the bulk of my doctoral research.


5.1 Comparative Scoring Methods and ACL

During a structure search, the quality of a proposed structure is measured via a scoring function. Likelihood-derived scores, such as Heckerman's BDE score detailed in Appendix 9.1, result in parameters being trained based only on data for a single class. In the multinet framework, where one BN is learnt for each class of data, this corresponds to each BN being trained solely on its intra-class data. Such BNs capture the highly likely features of a single class of data.

If the goal is class discrimination, this is not ideal. The score should not only increase as the likelihood of a BN given its intra-class data increases, but it should also decrease as the likelihood with respect to its extra-class data increases. Such a score will produce networks that discriminate between classes since high-scoring structures will be indicative of relationships that are strong in one class and weak in another. I will refer to these scores as comparative scores. Figure 6 provides motivation.

Using a comparative score is not a new idea in training BNs. Bilmes et al. (2001) use comparative scores with the multinet framework in their analysis of speech recognition. Grossman and Domingos (2004) suggest maximizing class-conditional likelihood (CCL). In fact, if classification accuracy is the goal, the optimal discriminative classifier is the one that maximizes CCL (Duda and Hart 1973).

There are, however, limitations to BNs trained with comparative scores. Most prominently, the resulting models will not be generative. Distributions of samples drawn from the BNs will not match the distributions in the underlying system. Thus, if data generation is the goal, comparative scores should not be employed.

Figure 6. A graphical structure giving the conditional dependencies between RVs X1, X2, X3, X4, for two classes of data, C1 and C2. Darker lines indicate a stronger correlation between RVs. In C1, the correlation between X1 and X4 is stronger than the correlation between X2 and X4. Further, assume the dark-lined correlation does not change between C1 and C2. Likelihood-based scores will favor the dark-lined relationship and fail to differentiate between the classes. Comparative scores will favor the weaker but changing relationships and would succeed in class discrimination.

There are also significant issues when using the CCL score in particular. The CCL score cannot be maximized by independently maximizing a score from each of the DBN's families, as can be done with most likelihood-derived scores (see Equations (21) and (22) in Appendix 9.1). This has two prominent effects. 1) The iterative greedy structure search heuristic cannot cache family scores between iterations. This increases the


search’s polynomial time complexity, yielding it intractable for larger domains. 2) Individual BN parameters cannot be isolated in the score, preventing the parameters from being easily maximized. In fact, at this time there is no known closed form solution for the maximal CCL parameters.

Grossman and Domingos (2004) address the second issue by using the maximum likelihood estimate for the parameters while scoring each hypothesized network with the CCL score. For classification, they find that this technique significantly outperformed standard scoring techniques such as the BDE score. However, by using the MLE parameters, discriminative information is being discarded that could potentially be beneficial for discriminative tasks.

I propose a scoring method that favors discriminative structures but also decomposes into individual family scores and has a closed-form solution for maximizing parameters. Using the notation given in Section 2.2, the CCL score is derived in Equation (3),

\[
CCL(B \mid D) \;=\; \prod_{d \in D} P_B(Y_d \mid \mathbf{X}_d)
\;=\; \prod_{d \in D} \frac{P_B(\mathbf{X}_d, Y_d)}{P_B(\mathbf{X}_d)}
\;=\; \prod_{d \in D} \frac{P_B(\mathbf{X}_d, Y_d)}{P_B(\mathbf{X}_d, Y_d{=}1) + P_B(\mathbf{X}_d, Y_d{=}2)} . \qquad (3)
\]

It will be useful to separate the data into two datasets, D1 and D2, each containing only elements of class 1 or class 2, respectively. This results in Equation (4),

\[
CCL(B \mid D) \;=\; CCL(B \mid D_1) \cdot CCL(B \mid D_2),
\]
\[
CCL(B \mid D_1) \;=\; \prod_{d \in D_1} \frac{P_B(\mathbf{X}_d, Y_d{=}1)}{P_B(\mathbf{X}_d, Y_d{=}1) + P_B(\mathbf{X}_d, Y_d{=}2)},
\]
\[
CCL(B \mid D_2) \;=\; \prod_{d \in D_2} \frac{P_B(\mathbf{X}_d, Y_d{=}2)}{P_B(\mathbf{X}_d, Y_d{=}1) + P_B(\mathbf{X}_d, Y_d{=}2)} . \qquad (4)
\]

Replacing BN B with the two multinets B1 and B2, results in Equation (5),

$$ CCL(B \mid D) = \prod_{d \in D_1} \frac{P_{B_1}(\mathbf{X}_d)}{P_{B_1}(\mathbf{X}_d) + P_{B_2}(\mathbf{X}_d)} \cdot \prod_{d \in D_2} \frac{P_{B_2}(\mathbf{X}_d)}{P_{B_1}(\mathbf{X}_d) + P_{B_2}(\mathbf{X}_d)}. \qquad (5) $$

It will also be useful to swap the denominators, such that the CCL score is the product of two ratios, each between differing datasets, as seen in Equation (6),

$$ CCL(B \mid D) = \frac{\prod_{d \in D_1} P_{B_1}(\mathbf{X}_d)}{\prod_{d \in D_2} \left( P_{B_1}(\mathbf{X}_d) + P_{B_2}(\mathbf{X}_d) \right)} \cdot \frac{\prod_{d \in D_2} P_{B_2}(\mathbf{X}_d)}{\prod_{d \in D_1} \left( P_{B_1}(\mathbf{X}_d) + P_{B_2}(\mathbf{X}_d) \right)} = PCCL(B_1 \mid D) \cdot PCCL(B_2 \mid D), \qquad (6) $$

where PCCL(B1|D) and PCCL(B2|D) are loosely said to represent the partial contributions of B1 and B2 to the CCL score for BN B. Since there are parameters for both Bayesian networks in each partial CCL term, the structure for each of the two Bayesian networks cannot be determined independently. Further, the summations in the


PCCL’s denominators prevent the score from decomposing into the aggregate score of individual DBN families.

As an alternative to using CCL, I propose the approximate conditional likelihood (ACL) score given in Equation (7),

$$ ACL(B \mid D) = \frac{\prod_{d \in D_1} P_{B_1}(\mathbf{X}_d)}{\prod_{d \in D_2} P_{B_1}(\mathbf{X}_d)} \cdot \frac{\prod_{d \in D_2} P_{B_2}(\mathbf{X}_d)}{\prod_{d \in D_1} P_{B_2}(\mathbf{X}_d)} = PACL(B_1 \mid D) \cdot PACL(B_2 \mid D). \qquad (7) $$

Each PACL term is a likelihood ratio test of a BN on both the classes of data: its intra-class data in the numerator and its extra-class data in the denominator. In order to maximize each PACL term, the numerator should be as large as possible while the denominator should be as small as possible. Thus, high ACL scores occur for BNs whose intra-class data likelihood is high but whose extra-class data likelihood is low. The only difference between the PACL terms in Equation (7) and the PCCL terms in Equation (6) is that the summations in the PCCL denominators have been replaced by single probabilities in the PACL denominators. This similarity is why I refer to ACL as the approximate conditional likelihood. However, if ACL is considered as an approximation to CCL, it is an unbounded one.

There are consequences for not including in the denominators of the ACL score both probability terms that are present in the CCL score. If the likelihood of B1 given D1 is high, the $P_{B_1}(\mathbf{X}_d)$ term in the PCCL denominator (Equation (6)) would lower the $PCCL(B_2 \mid D)$ score. As this term does not exist in the ACL score, the $PACL(B_2 \mid D)$ score is not lowered and may score the proposed structure too highly.

Even so, ACL has significant computational advantages over CCL. It is decomposable while still favoring discriminating structures over high-likelihood structures, a feature important to class discrimination. In addition, closed-form solutions exist for the parameters that maximize ACL.
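As a concreteness check (this is not part of the proposal, and all function and variable names are hypothetical), the following sketch shows how the ACL score of Equation (7) could be computed in log space from per-data-point likelihoods under the two multinets.

```python
import numpy as np

def log_acl(loglik_B1, loglik_B2, labels):
    """Log of the ACL score in Equation (7) for two multinets B1 and B2.

    loglik_B1[d], loglik_B2[d] : log P_B1(X_d) and log P_B2(X_d) for data point d.
    labels[d]                  : 1 or 2, the class of data point d.
    """
    loglik_B1 = np.asarray(loglik_B1)
    loglik_B2 = np.asarray(loglik_B2)
    d1 = np.asarray(labels) == 1
    d2 = ~d1

    # PACL(B1 | D): B1's likelihood on its intra-class data (D1)
    # divided by its likelihood on its extra-class data (D2).
    log_pacl_B1 = loglik_B1[d1].sum() - loglik_B1[d2].sum()
    # PACL(B2 | D): the analogous ratio with the roles of D1 and D2 swapped.
    log_pacl_B2 = loglik_B2[d2].sum() - loglik_B2[d1].sum()
    return log_pacl_B1 + log_pacl_B2

# Toy check: B1 fits class-1 points better and B2 fits class-2 points better,
# so the log-ACL score comes out large and positive.
labels    = [1, 1, 1, 2, 2, 2]
loglik_B1 = [-1.0, -1.2, -0.9, -3.0, -2.8, -3.1]
loglik_B2 = [-2.9, -3.2, -3.0, -1.1, -0.8, -1.0]
print(log_acl(loglik_B1, loglik_B2, labels))
```

Because each PACL term involves only one multinet's likelihoods, the score also decomposes over that multinet's families, which is the computational point made above.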

5.2 Maximum ACL parameters

To calculate the parameters that maximize the ACL score, notice that the BN parameters compose sets of multinomials and that the binomial is a special case of the multinomial. I conjecture that calculating the maximizing parameters is closely related to calculating the maximizing parameters for what I will refer to as a binomial ratio test (BRT).

In a standard binomial, there are two parameters³, $\Theta_a$ and $\Theta_b = 1 - \Theta_a$. Each gives the probability the RV takes on one of two possible values, a or b. The likelihood of the binomial model, M, given a set of data, D, is $\mathcal{L}(M \mid D) = \Theta_a^{\eta_a} \cdot \Theta_b^{\eta_b}$, where $\eta_a$ is the number of times a was seen in D and $\eta_b$ is the number of times b was seen in D. The MLE parameters are given by $MLE(\Theta_a) = \eta_a / (\eta_a + \eta_b)$ and $MLE(\Theta_b) = \eta_b / (\eta_a + \eta_b)$.

3 Since a binomial has a constraint that the sum of its two parameters = 1, it can be represented with a single parameter. Since I will be generalizing to a multinomial, it’s helpful to represent it with two.


In the BRT, the data is divided into two classes of data, D={D1, D2}. The likelihood of the BRT is given by Equation (8),

$$ \mathcal{L}(M \mid D) = \frac{\left(\Theta_a\right)^{\eta_a^1} \cdot \left(\Theta_b\right)^{\eta_b^1}}{\left(\Theta_a\right)^{\eta_a^2} \cdot \left(\Theta_b\right)^{\eta_b^2}}, \qquad (8) $$

where $\eta_\alpha^y$ is the number of times $\alpha \in \{a, b\}$ was seen in class $y$. $\Theta_a$ and $\Theta_b$ must both be greater than $\varepsilon$, for some small value $\varepsilon$, and $\Theta_a + \Theta_b$ must equal 1. The BRT is a ratio of two binomial likelihoods. The parameters that maximize the BRT can be determined using Lagrange multipliers (see Appendix 9.4 for the derivation). This results in three possible settings for the parameters: in the first case, $\Theta_a = \varepsilon$ and $\Theta_b = 1 - \varepsilon$; in the second case, $\Theta_a = 1 - \varepsilon$ and $\Theta_b = \varepsilon$; in the third case, $\Theta_a = (\eta_a^2 - \eta_a^1) / (\eta_a^2 - \eta_a^1 + \eta_b^2 - \eta_b^1)$ and $\Theta_b = (\eta_b^2 - \eta_b^1) / (\eta_a^2 - \eta_a^1 + \eta_b^2 - \eta_b^1)$. The case that yields the highest BRT likelihood corresponds to the maximizing parameters.
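The following minimal sketch (hypothetical names; an arbitrary choice of ε) evaluates the three candidate settings just listed and keeps whichever maximizes the BRT of Equation (8).

```python
import math

def max_brt_params(na1, nb1, na2, nb2, eps=1e-3):
    """Return (theta_a, theta_b) maximizing the binomial ratio test of Eq. (8).

    na1, nb1 : counts of outcomes a and b in class-1 data.
    na2, nb2 : counts of outcomes a and b in class-2 data.
    """
    def log_brt(ta, tb):
        return (na1 - na2) * math.log(ta) + (nb1 - nb2) * math.log(tb)

    candidates = [(eps, 1.0 - eps), (1.0 - eps, eps)]          # the two boundary cases
    denom = (na2 - na1) + (nb2 - nb1)
    if denom != 0:
        ta = (na2 - na1) / denom                                # interior stationary point
        tb = (nb2 - nb1) / denom
        if eps <= ta <= 1.0 - eps and eps <= tb <= 1.0 - eps:   # respect the constraints
            candidates.append((ta, tb))

    return max(candidates, key=lambda p: log_brt(*p))

# Toy usage: outcome a is over-represented in class 1, so theta_a is pushed up.
print(max_brt_params(na1=8, nb1=2, na2=3, nb2=7))
```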

A similar ratio test with more than two parameters can also be devised. I will refer to this as the multinomial ratio test (MRT). The likelihood function for the MRT is given by Equation (9). I conjecture that the maximizing parameters for the MRT are given by Equation (10). I will prove (or disprove) this in my thesis.

$$ \mathcal{L}(M \mid D) = \prod_{\alpha \in \Psi} \frac{\left(\Theta_\alpha\right)^{\eta_\alpha^1}}{\left(\Theta_\alpha\right)^{\eta_\alpha^2}}, \qquad (9) \qquad\qquad MLE(\Theta_\alpha) = \frac{\eta_\alpha^2 - \eta_\alpha^1}{\sum_{\varpi \in \Psi} \left( \eta_\varpi^2 - \eta_\varpi^1 \right)}, \qquad (10) $$

where $\Psi$ is the domain of the multinomial's RV. Equation (10) is actually only one case of the MLE; the other cases remove an element $\delta$ from $\Psi$ and set $\Theta_\delta$ to 0 if $\eta_\delta^2 > \eta_\delta^1$. As a BN's CPT is merely a collection of multinomials, Equation (10) can be generalized for CPT parameters, resulting in Equation (11),

$$ \Theta^{\alpha}_{i,j,k} = \begin{cases} \frac{1}{\omega}\left(\eta^{\alpha}_{i,j,k} - \eta^{\beta}_{i,j,k} + \varepsilon\right) & \text{if } \eta^{\alpha}_{i,j,k} > \eta^{\beta}_{i,j,k}, \\ \varepsilon & \text{otherwise,} \end{cases} \qquad (11) $$

where ω is a normalizing constant and ε is a Laplacian smoothing constant. I will refer to setting the parameters via Equation (11) as the ACL-ML parameters.
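A small sketch of Equation (11) follows (hypothetical names; counts are assumed to be arranged as parent-configuration-by-child-value tables). It is meant only to illustrate how the ACL-ML parameters redistribute probability mass toward events that are frequent in the intra-class data but rare in the extra-class data.

```python
import numpy as np

def acl_ml_cpt(counts_intra, counts_extra, eps=1e-3):
    """ACL-ML parameters of Equation (11) for one node's CPT.

    counts_intra[j, k] : intra-class counts of child value k under parent configuration j.
    counts_extra[j, k] : the corresponding extra-class counts.
    Returns a CPT of the same shape whose rows (one multinomial per parent
    configuration j) each sum to 1.
    """
    counts_intra = np.asarray(counts_intra, dtype=float)
    counts_extra = np.asarray(counts_extra, dtype=float)

    diff = counts_intra - counts_extra
    raw = np.where(diff > 0, diff + eps, eps)    # Eq. (11) before normalization
    return raw / raw.sum(axis=1, keepdims=True)  # omega normalizes each multinomial

# Toy usage mirroring Figure 8: a cell with intra-class events but no
# extra-class events receives most of the mass in its row.
intra = [[2, 0], [2, 2]]
extra = [[0, 1], [2, 0]]
print(acl_ml_cpt(intra, extra))
```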

I conjecture that maximizing the parameters can be thought of as increasing a margin in likelihood-space between the two classes. See Figure 7 for an example. Each point on the graph indicates a single data point in likelihood space. Boxes represent the data points d∈D1 and ×’s represent data points d∈D2. The x-axis is B1’s likelihood given d and the y-axis is B2’s likelihood given d. If the BN with the highest score is used to classify a data point, the classification boundary is the diagonal dashed line. Anything below the line would be classified as a box and above, classified as an ×. The classification boundary in the example would misclassify two points: the box above the boundary and the × below the boundary.

Assume that an event occurs more frequently in a BN's intra-class data than in its extra-class data, i.e., $\eta^1_{i,j,k} > \eta^2_{i,j,k}$, where D1 is the intra-class data for B1. I conjecture that increasing the probability of the CPT element $\Theta^1_{i,j,k}$ will increase the BN's likelihood


given its intra-class data more than the likelihood given its extra-class data. Intuitively, it makes sense to increase these elements since the intra-class events occur more frequently than the extra-class events. Using the ACL-ML parameters does exactly this. I also conjecture that using the ACL-ML parameters significantly moves data points in likelihood space away from the diagonal classification boundary and increases the chance a data point will be correctly classified. I plan on proving (or disproving) this conjecture in my thesis.

As motivation, Figure 8 gives an example for the CPT in BN B1 for node b (which is assumed to be the 1st node). Take the CPT element (b2, a4). This element corresponds to two events in the intra-class data, but not to any events in the extra-class data. As such, it should have high probability mass. On the other hand, the element (b3, a3) corresponds to two intra-class events but also to two extra-class events and the probability mass assigned to it should be low. ACL-ML does this.

5.3 Hierarchical Information

Within the neuro-anatomical domain, regions of interest (ROIs) are hierarchically related via the contains and contained by relationships. In a strict hierarchy, an element may only be contained by a single larger element. In a semi-hierarchy (Figure 9), an element may be contained by more than one larger region.

Figure 7. Class likelihood space. Boxes indicate the data points d∈D1 and ex’s indicate the data points d∈D2. If the BN with the maximum likelihood given a data point is used to classify the data point, the dashed diagonal line is the classifier. Points crossing the line would be misclassified. Two data points, one in D1 and one in D2, would be misclassified in this example.



Figure 8. Example CPT for the BN B1, with structure A → B. Node B is assumed to be the node with index 1, and the CPT represents P(B = k | A = j). An e1 in the CPT element (aj, bk) indicates the occurrence of the event $\xi^1_{1,j,k}$ in B1's intra-class data. An e2 indicates the occurrence of the event $\xi^2_{1,j,k}$ in B1's extra-class data. When training B1's parameters, the CPT elements with many e1's but few e2's should be given more probability mass than those with many e2's and few e1's. This allows the parameters to incorporate class-discriminative information.


Elements that are contained by other elements are h-children and elements that contain other elements are h-parents4. The set of RVs at the same hierarchical level is referred to as an h-level.

This hierarchy allows a system to be decomposed into multiple layers of granularity. The system described in Figure 9 can be represented at a coarse level by RV A since all other RVs are contained by A. At finer levels, the system can be represented as either {B C} or {D E F}.

I propose that a hierarchical structure can be used to improve BN structure search since it allows a system to be represented with varying numbers of RVs. Instead of learning one BN for all the RVs, a single BN could be learned for the RVs in a single h-level.

The top h-level’s BN would be trained first in an unconstrained manner. The candidate parents for each node would consist of all other nodes in the top h-level. BN structures for subsequent levels would be constrained by structures at the previous level. The only links allowed would be between RVs whose h-parents were also linked. Using this strategy makes the assumption that if two RVs are independent, their h-children will also be independent. While this assumption is certainly not true in the general case, it suggests a reasonable method to constrain structure searches at h-levels that cannot be performed exhaustively.
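To illustrate the constraint, here is a sketch (hypothetical names; it assumes, for simplicity, a strict hierarchy with one h-parent per RV, and it makes the additional design choice of keeping RVs that share an h-parent as allowable candidates) of how candidate parents at a lower h-level could be restricted by the arcs found at the level above.

```python
def constrained_candidates(nodes, h_parent, upper_links):
    """Candidate parent sets for a lower h-level, constrained by the level above.

    nodes       : RVs at the current h-level.
    h_parent    : dict mapping each RV to its (single) h-parent at the level above.
    upper_links : set of (parent, child) arcs found by the search at the level above.
    Returns a dict mapping each child RV to the list of RVs allowed as its parents.
    """
    candidates = {}
    for child in nodes:
        allowed = []
        for parent in nodes:
            if parent == child:
                continue
            # Allow a link only if the corresponding h-parents were linked at the
            # higher level (or, by assumption here, are the same region).
            hp, hc = h_parent[parent], h_parent[child]
            if hp == hc or (hp, hc) in upper_links:
                allowed.append(parent)
        candidates[child] = allowed
    return candidates

# Toy usage loosely based on Figure 9: D and E are h-children of B, F of C,
# and the hypothetical top-level search linked B -> C.
h_parent = {"D": "B", "E": "B", "F": "C"}
upper_links = {("B", "C")}
print(constrained_candidates(["D", "E", "F"], h_parent, upper_links))
```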

One issue that arises is the order in which structure search is performed at each h-level. While I will not thoroughly investigate this issue in my proposed research, it is worth mentioning. Should the structure search for a BN at a high h-level be fully completed before searches at subsequent h-levels begin?

4 This “h-” notation is used because child and parent are terms used in Bayesian networks and a single ROI may be part of both a hierarchy and a BN.

Figure 9. An example of a semi-hierarchy. This is not a strict hierarchy since E has two h-parents. The dashed lines separate the h-levels in the hierarchy. Since the hierarchical relationships are contains and contained by, the sets of nodes {A}, {B C} and {D E F} can each be thought of as a decomposition of the system at differing levels of granularity.


Figure 10. Two methods for searching through the structure state-space in a hierarchical search. Each line represents a change in a structure—long lines represent significant changes which correspond to structure searches at high h-levels. a) For each change made at the top h-level: propagate the constraints to lower h-level and perform structure searches at the lower levels. b) Completely finish the structure search at each level of the hierarchy before proceeding to the next.


Or, should the constraints implied by changes made to the structures at higher h-levels be propagated to the lower-level searches before the higher-level searches are fully complete? The two proposed methods admit significantly different strategies for searching through structure space, as visualized in Figure 10. Future work may find that each search pattern is better suited to different domains.

A second advantage of using a hierarchy is the ability to set the nodes’ CPT parameters at subsequent h-levels based on CPT parameters at previous h-levels. For instance, in the example in Figure 9, assume there is a wealth of information available on the RV B but not much for RV D. Since B is D’s h-parent, B’s well specified behavior can be used to infer the relatively unspecified behavior of D. One previously used technique to achieve this goal is shrinkage (Section 5.4).

There also seem to be parallels between what I suggest for hierarchical learning and relational learning, e.g. (Friedman, Getoor, Koller and Pfeffer 1999; Pfeffer 2001; Jaeger 1997; Poole 1993; Ng and Subrahmanian 1992; Kersting and Raedt 2000; Muggleton 2000; McGovern and Jensen 2003). I plan to further explore the similarities between my hierarchical proposals and these methods in my thesis.

5.4 Shrinkage

To infer information about one RV based on another, I propose using shrinkage (Stein 1955). Shrinkage is a statistical technique that allows for the combination of low-variance, high-bias data with high-variance, low-bias data. Take an example in the neuroanatomical domain: the amygdala. It is a small region embedded within the uncus, which is in turn embedded within the limbic lobe. As the amygdala is a small structure, measurements of it are based on few voxels resulting in relatively high variance measurements.

Conversely, both the uncus and the limbic lobe are composed of a far greater number of voxels, resulting in low-variance measurements. However, the activation of these structures only roughly corresponds to the activation of the amygdala; i.e., they are a biased source of information for the amygdala. Since the limbic lobe is both larger and further abstracted from the amygdala than the uncus, it has an even smaller variance and higher bias. Intuitively, it seems reasonable to complement high-variance data with copious amounts of closely-related low-variance data. Shrinkage gives a mechanism for doing this.

I have found three significant papers describing methods using shrinkage in the machine learning literature. Anderson, Domingos and Weld (2002) employed shrinkage with Relational Markov Models (an extension of standard Markov Models) in order to classify a user’s activity on internet web pages. McCallum et al. (1998) applied shrinkage to multinomial models in order to improve text classification. Freitag and McCallum (1999) applied shrinkage to hidden Markov Models (HMMs) in order to extract information out of text documents.

Mathematically, shrinkage models are similar to mixture models. Both techniques combine simpler models together to form a single more complex model. But they do so in a significantly different fashion. An example differentiating the two techniques based on a simple exponential model follows.


A single exponential model, parameterized by $\lambda$, is given by $\lambda e^{-\lambda x}$. For a mixture of m exponential models, the collection of m parameters is $\Theta = \{\lambda_1, \ldots, \lambda_m\}$. Equation (12) shows the probability of a data point given a single model in the mixture. Equation (13) shows the probability of all n data points given a single model. Equation (14) shows the probability of all the data points given the full mixture of exponentials.

$$ P(x_i \mid \lambda_j) = \lambda_j e^{-\lambda_j x_i}. \qquad (12) $$
$$ P(X \mid \lambda_j) = \prod_{i=1}^{n} \lambda_j e^{-\lambda_j x_i}. \qquad (13) $$
$$ P(X \mid \Theta) = \prod_{i=1}^{n} \sum_{j=1}^{m} \alpha_j \lambda_j e^{-\lambda_j x_i}, \qquad \sum_{j=1}^{m} \alpha_j = 1. \qquad (14) $$

Here, i indexes the n data points and j indexes the m mixture components. The $\lambda$ term in a single exponential model would be trained via MLE or maximum a posteriori (MAP) estimation. Generally, the $\lambda$ and $\alpha$ terms in the mixture model would be learnt via expectation maximization (EM).

In a shrinkage model, the equations for the probability of a single data point given a single model and the probability of all the data given a single model remain the same as in Equations (12) and (13). However, the probability of all the data given all of the models changes, and is seen in Equation (15).

$$ P(X \mid \Theta) = \prod_{i=1}^{n} \tilde{\lambda} e^{-\tilde{\lambda} x_i}. \qquad (15) $$

Equation (15) is identical to Equation (13) except that the $\lambda_j$'s have been replaced with $\tilde{\lambda}$'s. Within a shrinkage model, the probability of the data given all of the models is computed as the probability of all the data given a single model. Equation (16) details how to compute the $\tilde{\lambda}$ values.

$$ \tilde{\lambda}_j = \alpha_{j}^{ml} \lambda_{j}^{ml} + \sum_{k : \lambda_k \in HAnc(\lambda_j)} \alpha_k \lambda_k, \qquad (16) $$

where $\lambda_j^{ml}$ is the MLE (or the MAP estimate, if working in the Bayesian domain) for a single exponential model based solely on the data, and $HAnc(\lambda_j)$ is the set of all h-ancestors of the jth exponential model in the hierarchy of exponential models. The $\alpha$ values are learnt via EM or some other parameter-estimation procedure.

One key difference between mixture and shrinkage models is that shrinkage models are of the same form as the underlying model whereas mixture models aren’t. For example, with a mixture of exponentials, the resulting model is not an exponential model but instead a linear combination of exponential models. The resulting shrinkage model remains an exponential.
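The contrast can be made concrete with a toy sketch (hypothetical names; fixed rather than EM-learnt α values) of the exponential example: the mixture of Equation (14) sums densities, while the shrinkage form of Equations (15)-(16) blends the rate parameters and therefore stays an exponential.

```python
import numpy as np

def mixture_density(x, lambdas, alphas):
    """Eq. (14): a mixture of exponentials is a weighted sum of densities."""
    x = np.asarray(x)[:, None]
    return (alphas * lambdas * np.exp(-lambdas * x)).sum(axis=1)

def shrinkage_density(x, lam_ml, ancestor_lams, alphas):
    """Eqs. (15)-(16): shrinkage blends the *parameters* first, then uses a
    single exponential with the blended rate, so the result stays exponential."""
    lam_tilde = alphas[0] * lam_ml + np.dot(alphas[1:], ancestor_lams)
    x = np.asarray(x)
    return lam_tilde * np.exp(-lam_tilde * x)

x = np.array([0.1, 0.5, 1.0])
lambdas = np.array([0.5, 2.0])   # e.g. a region's own MLE rate and an h-ancestor's rate
alphas  = np.array([0.7, 0.3])   # weights (learnt via EM in the proposal)

print(mixture_density(x, lambdas, alphas))                      # not itself exponential
print(shrinkage_density(x, lambdas[0], lambdas[1:], alphas))    # still an exponential
```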

If a strict hierarchy relates a BN’s RVs, the calculation of the CPT parameters is given by Equation (17),

$$ \Theta = \lambda_{raw} \, P\left(X_i \mid Pa(X_i)\right) + \sum_{\ell=1}^{h-1} \lambda_\ell \, P\left( HAnc_\ell(X_i) \mid HAnc_\ell\left(Pa(X_i)\right) \right), \qquad (17) $$


where the $\lambda$'s are shrinkage coefficients, h is the h-level that $X_i$ is in, and $HAnc_\ell(X)$ is the $\ell$th h-ancestor of X. This form of shrinkage makes the assumption that RV $X_i$'s relationship with its parents will be similar to the relationship between its h-ancestors and its parents' h-ancestors. The extent to which a RV's parent relationship is similar to each of its h-ancestors' relationships is quantified by the $\lambda_\ell$ shrinkage coefficients.
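A minimal sketch of Equation (17) under a strict hierarchy follows (hypothetical names; it assumes the h-ancestors' CPTs have already been mapped onto the same parent-configuration layout as the node's own CPT, and that the λ coefficients sum to 1 so each row remains a distribution).

```python
import numpy as np

def shrinkage_cpt(raw_cpt, ancestor_cpts, lambdas):
    """Eq. (17): blend a node's raw CPT with its h-ancestors' CPTs.

    raw_cpt       : CPT estimated from the node's own (possibly sparse) data.
    ancestor_cpts : list of CPTs for the node's h-ancestors, assumed here to be
                    aligned with the same parent-configuration layout.
    lambdas       : shrinkage coefficients [lambda_raw, lambda_1, ..., lambda_{h-1}],
                    assumed to sum to 1.
    """
    cpts = [np.asarray(raw_cpt)] + [np.asarray(c) for c in ancestor_cpts]
    return sum(lam * cpt for lam, cpt in zip(lambdas, cpts))

# Toy usage: a high-variance amygdala CPT shrunk toward a low-variance uncus CPT.
amygdala_raw = [[0.9, 0.1], [0.2, 0.8]]
uncus        = [[0.7, 0.3], [0.4, 0.6]]
print(shrinkage_cpt(amygdala_raw, [uncus], lambdas=[0.6, 0.4]))
```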

A significant issue arises when dealing with semi-hierarchies. In a semi-hierarchy, the h-parent of a RV may actually be a set of other RVs (Figure 11). Equation (18) gives the calculation of shrinkage CPTs for a semi-hierarchy.

$$ \Theta_{X_i} = \lambda_{raw} \, P\left(X_i \mid Pa(X_i)\right) + \sum_{\ell=1}^{h-1} \sum_{(S,R) \in H_\ell} \lambda_{S,R} \, P(S \mid R), \qquad H_\ell = HAnc_\ell(X_i) \times HAnc_\ell\left(Pa(X_i)\right). \qquad (18) $$

There is a single term in (18) for every possible combination of X's h-ancestors and Pa(X)'s h-ancestors. In this case, the shrinkage model assumes a RV's behavior is similar to the behaviors of all possible combinations of its h-parents and its parents' h-parents.


Figure 11. a) An example DBN. The dashed line separates the DBN RVs between the first and second h-levels of a hierarchy. b) An example hierarchy for the RVs in the DBN. The boxes are square to indicate they are a hierarchical structure as opposed to a BN. c) A Venn diagram depicting the relationship between the RVs in the hierarchy. Assume that shrinkage is being applied to calculate the CPT for node Dt+1. The possible corresponding CPTs that would be included as terms in Equation (18) would contain the cross product between any two parent elements from { At, Bt, Ct} and any one child element from { At+1, Bt+1} resulting in a total of six terms. If the terms were constrained by the results of a structure search on the first level, then there is only a single valid term: the CPT corresponding to the family with At and Bt as parents and At+1 as a child. All other possible combinations of parents and children are not represented in the structure at the first h-level.


Depending on the number of h-parents for the nodes in the semi-hierarchy, the number of $H_\ell$ terms could increase quickly.

There are several ways to deal with this issue. First, a method (yet to be determined) to reduce the number of terms in $H_\ell$ could be devised by selecting which h-ancestors are most important. Second, if the structure search for one level is constrained by the structure search from a previous level (as described in Section 5.3), the $H_\ell$ terms in (18) could be limited to terms containing RVs that were linked at previous h-levels.

6 Research Contributions

In this section, I succinctly enumerate the contributions to neuroscience and machine learning I will make in my thesis. My main contribution to neuroscience will be to devise a method capable of indicating how the temporal relationships among neuroanatomical regions change due to the onset of mental illness. The method will be applied to fMRI data, but will be robust enough to be applied to other datasets relatively easily.

To accomplish this goal, I will model the fMRI data with dynamic Bayesian networks (DBNs). I will augment DBN structure search with the three main methods proposed in Section 5.

1) The approximate conditional likelihood (ACL) structure score and methods for training parameters to maximize this score.

2) A hierarchical modification to standard BN structure search.

3) The incorporation of shrinkage in CPT parameter estimation.

I will validate the results of my techniques with the following four methods.

1) Domain experts will verify that results in the neuroscience domain are meaningful and will hopefully find previously unknown and interesting relationships among RVs.

2) A bootstrap confidence measure will be applied to show that the results are not due to chance.

3) Simulated data with controllable RV relationships will be created and the ability for my techniques to find the known relationships will be gauged.

4) My algorithms will be tested on well-known and commonly accepted examples from data repositories. This validation method is complicated by the apparent absence of commonly used DBN benchmark data, but I will attempt to locate appropriate databases.


The hierarchical search method I propose makes a specific assumption: the absence of a correlation among RVs at a high h-level implies the absence of a correlation at lower h-levels. I will test whether or not this assumption holds in the fMRI data.

7 Acknowledgements

I would like to thank my advisor, Terran Lane, for the large amount of guidance he provided in my research; Vincent Clark, the collaborating neuroscientist I have been working with, who has spent significant amounts of time and effort with me on my research; and Shibin Qiu and Hamilton Link for developing the SVM and GNBN classifiers (respectively) against which I compared the DBN classification accuracy in the experimental work described in Section 3.3.

8 References

Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In Second International Symposium on Information Theory. B.N. Petrov and F. Csaki (eds.), Academiai Kiado, Budapest, pp. 267-281.

Azari, N. P., Pettigrew, K. D., Schapiro, M. B., Haxby, J. V., Grady, C. L., Pietrini, P., Salerno, J. A., Heston, L. L., Rapoport, S. I. and Horwitz, B. (1993). Early Detection of Alzheimer's Disease: A Statistical Approach Using Positron Emission Tomographic Data. Journal of Cerebral Blood Flow and Metabolism, v. 13, pg. 438-447.

Anderson, C., Domingos, P. and Weld, D. (2002) Relational Markov models and their application to adaptive web navigation. International Conference on Knowledge Discovery and Data Mining (KDD). 143-152.

Baskiotis, N., Sebag, M. (2004) C4.5 Competence Map: a Phase Transition-inspired Approach. International Conference on Machine Learning, 21, 73-80.

Bilmes, J.A. (2000). Dynamic Bayesian Multinets. Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. pg. 38-45.

Bilmes, J., Zweig, G., Richardson, T., Filali, K., Livescu, K., Xu, P., Jackson, K., Brandman, Y., Sandness, E., Holtz, E., Torres, J., & Byrne, B. (2001). Discriminatively structured graphical models for speech recognition (Tech. Report). Center for Language and Speech Processing, Johns Hopkins Univ., Baltimore, MD.

Buchel, C., Friston, K.J. (1997) Modulation of connectivity in visual pathways by attention: cortical interactions evaluated with structural equation modelling and fMRI. (sic) Cerebral Cortex, Vol 7, 768-778.

Buntine, W. (1991). Theory Refinement on Bayesian Networks. Proceedings of the Seventh Conference on UAI. 52-60.

Buckner, R. L., Snyder, A., Sanders, A., Marcus, R., Morris, J. (2000). Functional Brain Imaging of Young, Nondemented, and Demented Older Adults. Journal of Cognitive Neuroscience, 12, 2. 24-34.

Burge, J., Clark, V.P., Lane, T., Link, H., Qiu, S. (2004). Bayesian Classification of FMRI Data: Evidence for Altered Neural Networks in Dementia. Submitted to Human Brain Mapping. Temporary alternate: Tech Report TR-CS-2004-28, University of New Mexico.


Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.

Cox, R.W. (1996) Software for Analysis and Visualization of Functional Magnetic Resonance Imaging Neuroimages. Computers and Biomedical Research. v. 29, pg. 162-173.

Charniak, E. (1991). Bayesian Networks Without Tears. AI Magazine, 12(4).

Chickering, D., Geiger, D., Heckerman, D. (1994). Learning Bayesian Networks is NP-Hard (Technical Report MSR-TR-94-17). Microsoft.

Duda, R. O., & Hart, P. E. (1973). Pattern Classification and Scene Analysis. New York, NY: Wiley.

Friston, K.J., Harrison, L. and Penny, W. (2003). Dynamic causal modeling. NeuroImage, 19, 1273-1302.

Friston, J. et al. (1995). Statistical Parametric Maps in Functional Imaging: A General Linear Approach. Human Brain Mapping, 2, 189-210.

Friston, J. (2002). Statistical Parametric Mapping. Online document, available at http://www.fil.ion.ucl.ac.uk/spm/papers/SPM-Chapter.pdf.

Freitag, D., McCallum, A. (1999). Information Extraction with HMMs and Shrinkage. Proceedings of the Sixteenth National Conference on Artificial Intelligence: Workshop on Machine Learning for Information Extraction. Orlando, FL, pp. 31-36.

Friedman, N. (1998). The Bayesian Structural EM Algorithm. Uncertainty in Artificial Intelligence.

Friedman, N., Getoor, L., Koller, D., Pfeffer, A. (1999) Learning Probabilistic Relational Models. International Joint Conferences on Artificial Intelligence. 1300-1309

Friedman, N., Linial, M., Nachman, I. and Pe'er, D. (2000) Using Bayesian networks to analyze expression data. Research in Computational Molecular Biology (RECOMB). p. 127-135.

Geiger, D., Heckerman, D. (1996). Knowledge representation and inference in similarity networks and Bayesian Multinets. Artificial Intelligence. v. 82, pg. 45-74.

Grady, C. L., Furey, M. L., Pietrini, P., Horwitz, B., & Rapoport, S. L. (2001). Altered brain functional connectivity and impaired short-term memory in Alzheimer’s disease. Brain, 124, 739-756.

Goldenberg, A., Moore, A. (2004) Tractable Learning of Large Bayes Net Structures from Sparse Data. International Conference on Machine Learning, 21, 345-352

Grossman, D., Domingos, P. (2004) Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood. International Conference on Machine Learning, 21. 361-368

Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag.

Heckerman, D. (1991). Probability Similarity Networks. MIT Press.

Heckerman, D., Geiger, D., Chickering, D.M. (1995). Learning Bayesian Networks: the Combination of Knowledge and Statistical Data. Machine Learning, 20, 197-243.

Højen-Sørensen, P., Hansen, L., Rasmussen, C. (2000). Bayesian Modelling of fMRI Time Series. In S.A. Solla, T.K. Leen and K. R. Muller (eds.), Advances in Neural Information Processing Systems. MIT Press.


Jaeger, M. (1997) Relational Bayesian Networks. Uncertainty in Artificial Intelligence. pg. 266-273.

Jensen, F. V., (2001). Bayesian Networks and Decision Graphs. Springer-Verlag, New York.

Kersting, K., Raedt, L.D. (2000) Bayesian Logic Programs. Proceedings of the Work-in-Progress Track at the 10th International Conference on Inductive Logic Programming. Pg. 138-155.

Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R.H., Kuijpers, C.M.H. (1996). Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance Analysis of Control Parameters. IEEE Journal on Pattern Analysis and Machine Intelligence. v 18, s. 9, 912-926.

Lancaster JL, Woldorff MG, Parsons LM, Liotti M, Freitas CS, Rainey L, Kochunov PV, Nickerson D, Mikiten SA, Fox PT. (2000). Automated Talairach Atlas labels for functional brain mapping. Human Brain Mapping 10,120-131.

Mazziotta, J., Toga, A., Evans, A., Fox, P., Lancaster, J., Zilles, K., Woods, R., Paus, T., Simpson, G., Pike, B., Holmes, C., Collins, L., Thompson, P., MacDonald, D., Iacoboni, M., Schormann, T., Amunts, K., Palomero-Gallagher, N., Geyer, S., Parsons, L., Narr, K., Kabani, N., Goualher, G., Feidler, J., Smith, K., Boomsma, D., Pol, H. H., Cannon, T., Kawashima, R., Mazoyer, B. (2001) A Four-Dimensional Probabilistic Atlas of the Human Brain. Journal of the American Medical Informatics Association, v. 8, n. 5, p. 401-430.

McCallum, A., Rosenfeld, R., Mitchell, T. and Ng, A. Y. (1998) Improving Text Classification by Shrinkage in a Hierarchy of Classes. International Conference on Machine Learning.

McGovern, A., Jensen, D. (2003) Identifying Predictive Structures in Relational Data Using Multiple Instance Learning. International Conference on Machine Learning.

Mitchell, T., Hutchinson, R., Just, M., Niculescu, R.S., Pereira, F., Wang, X. (2003). Classifying Instantaneous Cognitive States from fMRI Data. American Medical Informatics Association Annual Symposium.

Moore, A., Wong, W. (2003) Optimal Reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. Proceedings of the 20th International Conference on Machine Learning, 2003

Muggleton, S. (2000) Learning Stochastic Logic Programs. Proceedings of the AAAI 2000 Workshop on Learning Statistical Models from Relational Data.

Murphy, K. (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. PhD dissertation, Berkeley, University of California, Computer Science Division.

Ng, R., Subrahmanian, V.S. (1992). Probabilistic Logic Programming. Information and Computation, v. 101, n. 2, pg 150-201.

Pearl, J. (1986) Fusion, Propagation and Structuring in Belief Networks. Artificial Intelligence, v. 29, n. 3, 241-288

Penny, W.D., Stephan K.E., Mechelli, A., Friston, K.J. (2004) Modelling Functional Integration: A Comparison of Structural Equation and Dynamic Causal Models. (sic) Neuroimage v 23, supplement 1, pg. S264-S274.


Pfeffer, A. (2001) IBAL: An Integrated Bayesian Agent Language. International Joint Conference on Artificial Intelligence.

Poole, D. (1993) Probabilistic Horn Abduction and Bayesian Networks. Artificial Intelligence, v. 61, n. 1, pg. 81-129.

Rabiner, L.R. (1989) A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, v. 77, n. 2, pg. 257-286.

Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Stein, C. (1955) Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. v. 1, pg. 197-206. University of California Press.

Worsley, K. and Friston, K. (1995). Analysis of fMRI times-series revisited – again. Neuroimage, 2, 173-181.

9 Appendices

9.1 Bayesian networks

A Bayesian network is a compact representation of a joint probability distribution (JPD) over a set of n random variables (RVs), $X_1, X_2, \ldots, X_n$. The JPD is the function $P(X_1, X_2, \ldots, X_n)$ giving the probability that the RVs jointly take on each combination of values in their domains. When the variables have a finite discrete domain, as is the case with the Bayesian networks I employ, the JPD can be represented by an n-dimensional table.

Unfortunately, as the number of RVs in a system increases, the size of the JPD grows exponentially, quickly becoming too large to represent explicitly. Using the chain rule of probability, the JPD can be rewritten as a multiplicative series of conditional probabilities, as seen in Equation (19).

$$ P(X_1, X_2, \ldots, X_n) = P(X_n) \prod_{i=1}^{n-1} P(X_i \mid X_{i+1}, X_{i+2}, \ldots, X_n). \qquad (19) $$

Recall that if two variables, $X_a$ and $X_b$, are statistically independent, then $P(X_a \mid X_b) = P(X_a)$. Using the chain rule and assumptions of independence, the number of terms required to represent the JPD can be dramatically reduced. Take, for instance, a system with three variables: A, B and C. Using the chain rule of probability, $P(A, B, C) = P(A \mid B, C) \, P(B \mid C) \, P(C)$. If A and B were assumed to be independent given C, then $P(A \mid B, C) = P(A \mid C)$ and the JPD could be represented by the expression $P(A \mid C) \, P(B \mid C) \, P(C)$, which requires fewer independent parameters than the full JPD. In this example, the reduction is minor, but in a system with many variables and (more importantly) many conditional independence assertions, the reduction can be dramatic.
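As a back-of-the-envelope illustration (the numbers are hypothetical), the following snippet compares the number of free parameters in a full JPD with those of a factored representation for a larger system.

```python
# Free parameters needed to specify the full JPD versus a factored BN, for a
# hypothetical system of n binary RVs in which each node has at most two parents.
n = 20
full_jpd = 2 ** n - 1              # one free probability per joint assignment
factored = n * (2 - 1) * 2 ** 2    # per node: (arity - 1) * (# parent configurations)
print(full_jpd, factored)          # 1048575 versus 80
```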

A Bayesian network is a graphical system that encodes conditional dependencies between variables and can be used as a compact alternate representation of a JPD. Specifically, it is a directed graph with nodes and arcs connecting the nodes. The nodes represent variables in the system being modeled and the arcs represent causal relationships between them. Figure 12 gives a hypothetical example of a Bayesian network that models the price of tea in China. The arcs in the graph indicate that the price is directly influenced (stochastically) by both the supply and the demand for tea in China. The absence of an


arc from the Current Lunar Cycle to any of the other nodes in the graph indicates that the current lunar cycle does not statistically influence the price of tea in China. The Price of Tea in China node is considered a child node with a parent set that includes the nodes Demand for tea in China and Supply of tea in China.

Each node represents a random variable with a specific domain. In the example, these domains could be {high, low} for the demand, supply and price of tea in China, {less than 50", more than 50"} for the annual rain fall and {new, waxing, full, waning} for the current lunar cycle. Each of these domains is finite and discrete, though a Bayesian network can model continuous or infinitely-sized discrete RVs as well.

Associated with each node is a set of conditional probabilities. In the discrete case, these probabilities are typically represented by a conditional probability table (CPT) and in the continuous case, are typically represented by conditional probability distribution functions, such as the conditional Gaussian distribution. The CPT for the Price of Tea in China node is shown in Figure 12b. CPTs can also be thought of as collections of multinomial distributions, with a single multinomial describing a node’s behavior given a unique setting of the node’s parents.

The structure for a Bayesian network can also be thought of as a collection of families, with one family for every node. Included in a node’s family are all of the parents of that node (each family contains a single child node and zero or more parent nodes). For instance, the family for the Price of Tea in China node in the BN in Figure 12 contains the child node Price of Tea in China as well as the parent nodes { Demand for tea in China, Supply of tea in China} .

Formally, a Bayesian network is described by the pair (BS, BO), where BS is the graphical structure of the network and BO specifies the conditional probability information of the BN. BS contains a node for every RV in the system, X = {X1, X2, ..., Xn}, and an arc for every causal dependency between RVs.

The CPT for the Price of tea in China node (Figure 12b), giving P(Price | Demand, Supply):

    Supply = High          Demand = High   Demand = Low
      Price = High              0.50            0.25
      Price = Low               0.50            0.75

    Supply = Low           Demand = High   Demand = Low
      Price = High              0.95            0.35
      Price = Low               0.05            0.65

Figure 12. a) Example of a hypothetical Bayesian network modeling the price of tea in China. The directed arcs indicate that the price of tea is directly influenced by both the demand and the supply of tea, and indirectly influenced by the annual rain fall in China. Conversely, the absence of an arc indicates that the current lunar cycle has no statistical effect on the price of tea in China. b) The conditional probability table for the Price of tea in China node. The table indicates, among other things, that if the supply is high and the demand is low then 25% of the time the price will be high and 75% of the time the price will be low. Note that the Price node has two parents, resulting in a three-dimensional conditional probability table. There are also CPTs for all other nodes in the graph (not shown).


In the example given in Figure 12, BS = {Annual rain fall in China, Demand for tea in China, Supply of tea in China, Price of tea in China, Current Lunar Cycle}. It would also contain the three arcs connecting these nodes. The parents of node Xi are denoted Pa(Xi), which can assume q possible configurations, where q is given by Equation (20),

$$ q = \prod_{j=1}^{k} r_j, \qquad (20) $$

where k is the number of parents and $r_j$ is the arity of the jth parent. For the node Price of tea in China, Pa(Xi) = {Demand for tea in China, Supply of tea in China}.

BO is the set of CPTs for all of the nodes in BS and encodes the degree and direction in which the nodes in Pa(Xi) influence Xi. If a node has k parents, i.e., $|Pa(X_i)| = k$, then the CPT for that node will have k+1 dimensions, specifying $P(X_i \mid Pa(X_i)) = P(X_i \mid Pa_1(X_i), Pa_2(X_i), \ldots, Pa_k(X_i))$, where $Pa_\ell(X_i)$ is the $\ell$th parent of $X_i$. A significant advantage to using CPTs is the ability to model complex non-linear relationships. However, as the number of parents in Pa(Xi) grows, the number of parameters in the CPT grows exponentially, fundamentally limiting the number of parents each random variable can reasonably have.

The formulation of the DBN search, and BN search in general, is a well-understood problem (Heckerman 1995). Ideally, every possible structure in the DBN would be individually constructed and the probability that the structure represented the data would be determined. Two significant problems arise. First, a network score must be devised that is capable of indicating how well a given structure represents the relationships embedded among variables in the data. Second, as structure search is NP-hard, it is not feasible to construct every possible structure.

Several scoring methods have been proposed including AIC (Akaike 1973), BIC (Schwarz 1978), BDEU (Buntine 1991), BD (Cooper and Herskovits 1992) and BDE (Heckerman 1995). Likelihood itself can also be used as a score. One of the most commonly used scores is Heckerman's BDE score, which has several significant advantages, including the ability to easily incorporate prior knowledge and the property that two structures encoding the same set of conditional dependencies receive the same BDE score (commonly referred to as likelihood equivalence). Like most scoring methods, the BDE score is composed of the aggregation of the scores of each of the families in the network. Setting up a score this way has the significant advantage of allowing the BN structure search to be broken up into smaller independent structure searches for each of the families.

Equation (21) gives the equation for calculating the scoring of the entire network, and (22) gives the score for a single family in the network,

$$ P(D, B_S \mid \xi) \propto \prod_{i=1}^{n} \rho\left(X_i, Pa(X_i)\right) \cdot score\left(X_i, Pa(X_i), D\right), \qquad (21) $$
$$ score\left(X_i, Pa(X_i), D\right) = \prod_{j=1}^{q_i} \left[ \frac{\Gamma(N'_{i,j})}{\Gamma(N'_{i,j} + N_{i,j})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{i,j,k} + N_{i,j,k})}{\Gamma(N'_{i,j,k})} \right], \qquad (22) $$

where ρ(Xi,Pa(Xi)) is a structural prior allowing certain structures to be considered more likely than other structures. Γ(⋅) is the gamma function. Ni,j is a count for the number of times the parent set of Xi is in configuration j. Ni,j,k is the count of the number of times


the parent set of Xi is in configuration j while Xi = k. $\xi$ refers to the set of prior information, $\{N'_{i,j}, N'_{i,j,k}, \rho(X_i, Pa(X_i))\}$ for all i, j, k.

Prior information influencing conditional probabilities is allowed in the form of the N’ virtual counts, which are analogous to N except they are not counted from dataset D but instead represent a hypothetical prior dataset that describes how variables in the system are thought to behave in relation to each other. For instance, if we wanted to indicate that the Price of tea in China should be low when the Demand for tea in China is high and the Supply of tea in China is high, we could set the N’ ijk count to a positive number where Xi = Price of tea in China, j = { high, high} and k = low. The magnitude of the number would indicate the degree to which we believed this would be the case. Networks connecting the Price of tea in China to either the Supply of tea in China or the Demand for tea in China would then receive a higher score even without sufficient supporting data. Specifying an uninformative prior is achieved by setting ρ(Xi, Pa(Xi))=1, N’ ijk to 1 and

$N'_{i,j} = \sum_{k=1}^{r} N'_{i,j,k}$ for all $i, j$, where $r$ is the number of states node $X_i$ can be in.

While Heckerman’s derivation of the BDE is somewhat complex, in a sense, the BDE score measures the amount of uniformity (or entropy) in the CPTs’ multinomial distributions. The less uniform the distribution, the more informative the parents are and the higher the BDE score will be.
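For concreteness, here is a sketch of the family score of Equation (22) computed in log space (hypothetical names; it assumes SciPy's gammaln for the log-gamma function).

```python
import numpy as np
from scipy.special import gammaln

def log_bde_family_score(N, N_prime):
    """Log of the BDE family score of Equation (22) for one node X_i.

    N[j, k]       : count N_{i,j,k} of X_i = k with its parents in configuration j.
    N_prime[j, k] : virtual prior counts N'_{i,j,k} (all ones for an uninformative prior).
    """
    N = np.asarray(N, dtype=float)
    N_prime = np.asarray(N_prime, dtype=float)
    Nj, Nj_prime = N.sum(axis=1), N_prime.sum(axis=1)   # N_{i,j} and N'_{i,j}

    per_config = (gammaln(Nj_prime) - gammaln(Nj_prime + Nj)
                  + (gammaln(N_prime + N) - gammaln(N_prime)).sum(axis=1))
    return per_config.sum()

# Toy usage: an informative parent (skewed rows) scores higher than a
# useless one (uniform rows), matching the entropy intuition above.
informative = [[9, 1], [1, 9]]
uniform     = [[5, 5], [5, 5]]
prior       = np.ones((2, 2))
print(log_bde_family_score(informative, prior), log_bde_family_score(uniform, prior))
```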

Given a fully parameterized Bayesian network, it is possible to compute the posterior likelihood that the Bayesian network describes a set of data. The posterior likelihood is given in Equation (23),

$$ \mathcal{L}(B \mid D) = \prod_{j=1}^{m} \prod_{i=1}^{n} P\left(X_i = D_i^j \mid Pa(X_i) = D_{Pa(X_i)}^j\right), \qquad (23) $$

where D is a set of m data points in which each data point contains a value for each node in the DBN, $D_i^j$ is the value of the ith node in the jth data point, and $D_{Pa(X_i)}^j$ is the set of values of the parent variables of the ith node in the jth data point. Given two Bayesian networks and a set of data, the Bayesian network that more accurately describes that data is defined to be the one with the higher posterior likelihood.

9.2 Dynamic Bayesian Networks

The class of BNs that explicitly models temporal processes is referred to as dynamic Bayesian networks (DBNs). The topological structure for a DBN is divided into columns of RVs, where each column represents a time frame in the process being modeled and each system variable is represented by a single node in each of the columns. Causal relations are allowed to connect nodes between columns, provided the direction of the causality goes from a past time step to a future time step. Figure 13 (left) gives an example of the structure of a DBN. The example lists a single possible structure, but there are a super-exponential5 number of other structural possibilities.

5 In the number of nodes in the BN.


Ideally, a DBN would contain one column of variables for each time step in the system being modeled. However, this is computationally intractable for all but the simplest models. To reduce the complexity of the DBN model, several assumptions are typically made. The first assumption is the Markov assumption which requires that the values a variable takes on at time t can be accurately predicted (in a stochastic sense) from only knowledge of its parent variables’ values from time t-m to time t, for some bounded history length m6. The Markov assumption is expressed in Equation (24).

The second assumption is the stationary assumption, which requires that the dynamic behavior of a variable from time t to time t+1 be independent of t, as given in Equation (25),

where $\mathbf{X}_t$ is the set of all variables in the column representing time step t. Adhering to both the stationary and first-order Markov assumptions results in a DBN composed of two columns of n variables where each of the columns no longer models a specific time

6 Frequently, m is set to 1. When it is, the assumption is referred to as a first order Markov assumption.

$$ P(\mathbf{X}_t \mid \mathbf{X}_{t-1}, \mathbf{X}_{t-2}, \ldots, \mathbf{X}_1) = P(\mathbf{X}_t \mid \mathbf{X}_{t-m}, \mathbf{X}_{t-m+1}, \ldots, \mathbf{X}_{t-1}), \qquad (24) $$

$$ P(\mathbf{X}_t \mid \mathbf{X}_{t-1}) = P(\mathbf{X}_{t'} \mid \mathbf{X}_{t'-1}), \quad \forall\, t, t', \qquad (25) $$


Figure 13. (left) A possible DBN topology for two columns of RVs. The left column represents the values of RVs at time t and the right column represents the values of neuroanatomical regions at time t+1. The dashed lines indicate that DBN topology is generally not known, and must be determined. The horizontal ellipses indicate this is just a portion of a bigger DBN. (right) The static structure for the GNBN. There is a single discrete class variable whose CPT is represented with a multinomial, and n observation variables whose CPTs are represented as a set of Gaussian distributions (one Gaussian for each value the Class node can take on). The GNBN does not require a structure search.


point in the system, but instead models the average dynamic behavior of the system across all time points.

The assumptions made by the DBN are far less restrictive than the assumptions made by the GNBN and allow for a more sophisticated representation of the data. This causes DBNs to be significantly more complex than GNBNs and introduces new challenges when DBNs are used as a modeling framework. Most significantly, if the causal structure of the DBN is unknown, as is the case in my work, the structure must be learnt. Structure learning is a known NP-Hard problem (Chickering, Geiger and Heckerman 1994).

Figure 14 gives a simple example of how DBNs can be used to model fMRI data. In the example are three ROIs: the amygdala, parietal lobe and BA 40. A hypothetical time series based on the Highlow quantization is given for each region. The DBN family with the BA 40 Time t+1 random variable node as a child is shown with three Time t candidate parent nodes. Proposing each possible Time t node as the sole parent of the BA 40 Time t+1 node results in the corresponding CPTs describing the statistical relationship between the parent’s values and the child’s values at the next time step, averaged across all consecutive pairs of time points. For example, in the top CPT describing the relationship

Figure 14 panel c) CPTs for the BA 40 Time t+1 node (one CPT per candidate parent; the probabilities condition on the parent's value at time t):

    Parent = Amygdala time t:  P(High|High) = .66, P(High|Low) = 0,  P(Low|High) = .33, P(Low|Low) = 1
    Parent = BA 40 time t:     P(High|High) = .33, P(High|Low) = .5, P(Low|High) = .66, P(Low|Low) = .5
    Parent = BA 7 time t:      P(High|High) = 1,   P(High|Low) = 0,  P(Low|High) = 0,   P(Low|Low) = 1

Figure 14. Example of a DBN modeling three ROIs. a) Hypothetical ROI average voxel activation time series for each of the regions. b) DBN family for the BA 40 Time t+1 node with three candidate parents: Amygdala Time t, BA 7 Time t, and BA 40 Time t. c) The CPTs for the BA 40 Time t+1 node corresponding to each of the three candidate parents. Each entry in the CPTs is the probability the child node takes on a value based on the previous value of its parent, averaged over all time points. The columns correspond to parent values and the rows correspond to child values. The less uniform the probabilities in a CPT are, the better a predictor the parent makes for the child. In this case, the best predictor for the child node is the BA 7 Time t node, as its value gives perfect predictive ability of the BA 40 Time t+1 node.


between the Amygdala Time t node and the BA 40 Time t+1 node, the bottom-left entry of .33 indicates that when the amygdala is high, BA 40 has a 33% chance of being low in the next time step.

Each column in a CPT is a single multinomial distribution detailing the behavior of the child for a specific setting of the parents. The less uniform each of the multinomials is, the better predictor the parent is of the child. As can be readily seen in the example, the least uniform CPT results when the BA 7 Time t node is the parent and thus BA 7 is considered the best predictor of the subsequent value of BA 40.
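A minimal sketch of this estimation step follows (the function name and data are hypothetical); it counts first-order transitions from a quantized parent series to a child series, in the spirit of the CPTs shown in Figure 14.

```python
import numpy as np

def pairwise_cpt(parent_series, child_series, arity=2):
    """Estimate P(child_{t+1} = k | parent_t = j) from quantized time series.

    Counts transitions over all consecutive time points (the stationary,
    first-order Markov assumption). Row j of the returned table is the
    multinomial over child values given parent value j.
    """
    counts = np.zeros((arity, arity))
    for t in range(len(parent_series) - 1):
        counts[parent_series[t], child_series[t + 1]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)

# Hypothetical High/Low series (1 = High, 0 = Low): here the parent's value
# perfectly predicts the child's next value, giving a maximally non-uniform CPT.
ba7  = [1, 0, 1, 1, 0, 0, 1]
ba40 = [0, 1, 0, 1, 1, 0, 0]
print(pairwise_cpt(ba7, ba40))
```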

This simple example only includes pairwise relationships, but CPTs are capable of modeling more complex, higher-order interactions among many parent and child nodes. It is the ability of a multinomial CPT to describe complex non-linear relationships that sets it apart from many continuous models (such as linear, bi-linear and Gaussian models).

Ideally, a DBN modeling the temporal interaction among ROIs in fMRI data would contain T columns of n nodes, where n is the number of ROIs and T is the number of time steps in the experiment (i.e., the number of fMRI images). Each node would represent the average activation of a single ROI at a specific time step and an arc between the nodes would indicate a temporal causal relationship between the two ROIs. For example, an arc from node $X_i^t$ to node $X_j^{t+1}$ would indicate a relationship between the value of the ith ROI at time step t and the value of the jth ROI at time step t+1. Unfortunately, learning the structure (i.e., which links to include) for such a DBN is intractable. To reduce the size of the model, I make both the first order Markov and stationary assumptions (Section 9.2). This reduces the DBN to two columns, each with n nodes. Nodes in each of the columns no longer represent the values of ROIs at absolute time frames, but instead represent the statistical dynamics of ROIs averaged across all time frames.

A commonly used special case of DBNs is the hidden Markov model (HMM) which models the behavior of a hidden RV based on an observed RV (or a set of observed RVs). The structure for a HMM is fixed. The hidden RV in the i th column is a parent of the observed RV in the i th column and the hidden RV in the i+1 column. Further details of the HMM are outside the scope of this paper, and I refer the interested reader to (Rabiner 1989).

9.3 Gaussian Naïve Bayesian Networks

In a Gaussian naïve Bayesian network (GNBN) there are two types of nodes: a single class node and multiple observable nodes. The class node is set as a parent of each of the observable nodes, which have no other parents. The conditional probability distribution within each observable node is a conditional Gaussian distribution; the distribution of the class node is a multinomial.

A GNBN makes several strong assumptions about the underlying system being modeled: first, that a node's values are Gaussian distributed, and second, that the values in each observed node are statistically independent of the values in every other observed node given the value of the class node. Both assumptions will almost certainly be violated to some degree; however, GNBNs have often been found to perform quite well


in many classification applications and are commonly used as a point of reference in machine learning literature. Figure 13 (right) gives an example of a GNBN.
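As an illustration only (hypothetical names and toy parameters), a GNBN classification decision can be sketched as the class maximizing the class prior times the product of per-observable Gaussian likelihoods.

```python
import numpy as np

def gnbn_log_posterior(x, class_priors, means, variances):
    """Unnormalized log posterior over classes for one observation vector x.

    class_priors[c]              : multinomial prior P(Class = c).
    means[c, i], variances[c, i] : Gaussian parameters of observable i given class c.
    Observables are conditionally independent given the class (the naive assumption).
    """
    x = np.asarray(x)
    log_post = []
    for c in range(len(class_priors)):
        ll = -0.5 * np.sum(np.log(2 * np.pi * variances[c])
                           + (x - means[c]) ** 2 / variances[c])
        log_post.append(np.log(class_priors[c]) + ll)
    return np.array(log_post)

# Toy usage with two classes and three observables (e.g. three ROI activations).
priors    = [0.5, 0.5]
means     = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
variances = np.ones((2, 3))
print(gnbn_log_posterior([0.9, 1.1, 0.8], priors, means, variances).argmax())  # -> 1
```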

9.4 Lagrange Multipliers for Maximizing the Binomial Ratio Test

Assume each class of data is represented by a bucket of red and blue colored balls. Let r1 = the number of red balls in class 1's bucket, r2 = the number of red balls in class 2's bucket, b1 = the number of blue balls in class 1's bucket and b2 = the number of blue balls in class 2's bucket. The buckets are modeled by what I refer to as the binomial ratio test (BRT). The BRT is parameterized by Θ = {Θr, Θb}, where Θr is the parameter for red balls and Θb is the parameter for blue balls. The likelihood of the model, M, given the data, D, is given in Equation (26), and the log-likelihood in Equation (27),

$$ \mathcal{L}(M \mid D, \Theta) = \frac{(\Theta_r)^{r_1} \cdot (\Theta_b)^{b_1}}{(\Theta_r)^{r_2} \cdot (\Theta_b)^{b_2}}, \qquad (26) $$
$$ \mathcal{LL}(M \mid D, \Theta) = r_1 \log(\Theta_r) + b_1 \log(\Theta_b) - r_2 \log(\Theta_r) - b_2 \log(\Theta_b), \qquad (27) $$

given the constraints Θr ≥ ε and Θb ≥ ε, where ε is a small non-zero positive number, and Θr + Θb = 1. To maximize the log-likelihood function under these constraints, one Lagrange multiplier term is added to the log-likelihood function for each constraint. The multipliers are:

Constraint            Lagrange term

Θr ≥ ε                λ1(Θr − (S1)² − ε)

Θb ≥ ε                λ2(Θb − (S2)² − ε)

Θr + Θb = 1           λ3(Θr + Θb − 1)

The λ’s are the Lagrange multipliers and the S’s are slack variables. Adding these terms into the log likelihood results in Equation (28).

$$ \mathcal{LL}(M \mid D, \Theta) = r_1 \log(\Theta_r) + b_1 \log(\Theta_b) - r_2 \log(\Theta_r) - b_2 \log(\Theta_b) + \lambda_1\left(\Theta_r - (S_1)^2 - \varepsilon\right) + \lambda_2\left(\Theta_b - (S_2)^2 - \varepsilon\right) + \lambda_3\left(\Theta_r + \Theta_b - 1\right). \qquad (28) $$

Taking the partial derivatives of (28) results in the following system of equations.

Partial Derivative        Result                                    Solved Form

a) ∂LL/∂Θr                r1/Θr − r2/Θr + λ1 + λ3 = 0               Θr = (r2 − r1)/(λ1 + λ3)

b) ∂LL/∂Θb                b1/Θb − b2/Θb + λ2 + λ3 = 0               Θb = (b2 − b1)/(λ2 + λ3)

c) ∂LL/∂λ1                Θr − (S1)² − ε = 0                        Θr = (S1)² + ε

d) ∂LL/∂λ2                Θb − (S2)² − ε = 0                        Θb = (S2)² + ε

e) ∂LL/∂λ3                Θr + Θb − 1 = 0                           Θr + Θb = 1

f) ∂LL/∂S1                −2λ1S1 = 0                                λ1 = 0 or (S1)² = 0

g) ∂LL/∂S2                −2λ2S2 = 0                                λ2 = 0 or (S2)² = 0

To find the values for the parameters, this system of equations is solved for Θr and Θb. However, the f) and g) partial derivatives each result in two possible cases. As such, there are a total of four possible parameter settings that must be considered to find the parameters that maximize the likelihood.

Case                      Parameter Values                                          Resulting Log-Likelihood

(S1)² = 0, λ2 = 0         Θr = ε, Θb = 1 − ε                                        LL = (r1 − r2) log(ε) + (b1 − b2) log(1 − ε)

λ1 = 0, (S2)² = 0         Θr = 1 − ε, Θb = ε                                        LL = (r1 − r2) log(1 − ε) + (b1 − b2) log(ε)

λ1 = 0, λ2 = 0            Θr = (r2 − r1)/(r2 − r1 + b2 − b1),                       LL = (r1 − r2) log(Θr) + (b1 − b2) log(Θb),
                          Θb = (b2 − b1)/(r2 − r1 + b2 − b1)                        with Θr and Θb as given in this case

(S1)² = 0, (S2)² = 0      Θr = ε, Θb = ε                                            Violates the constraint: ε + ε ≠ 1

The final parameters are then set by the case with the highest resulting log likelihood. If the highest likelihood is given by the third case, λ1=0 and λ2=0, the resulting parameters must be checked against the constraints. If they violate the constraints, the maximum likelihood of the first two cases determines the parameter values.

The first case will result if there are more blue balls in class 1 than class 2 and fewer red balls in class 1 than in class 2. This case results in the maximum value for Θb = 1-ε. The second case will result if there are more red balls in class 1 than class 2 and fewer blue balls in class 1 than class 2. This case results in the maximum value for Θr = 1-ε. If there are more blue balls and more red balls in class 1 than class 2, the parameters will be ranked based on which balls had the largest increase in class 1.