
arXiv:1508.01023v1 [cs.LG] 5 Aug 2015

Noname manuscript No. (will be inserted by the editor)

A review of heterogeneous data mining for brain disorders

Bokai Cao · Xiangnan Kong · Philip S. Yu

Abstract With rapid advances in neuroimaging techniques, the research on brain disorder identification has become an emerging area in the data mining community. Brain disorder data poses many unique challenges for data mining research. For example, the raw data generated by neuroimaging experiments is in tensor representations, with typical characteristics of high dimensionality, structural complexity and nonlinear separability. Furthermore, brain connectivity networks can be constructed from the tensor data, embedding subtle interactions between brain regions. Other clinical measures are usually available reflecting the disease status from different perspectives. It is expected that integrating complementary information in the tensor data and the brain network data, and incorporating other clinical parameters, will be potentially transformative for investigating disease mechanisms and for informing therapeutic interventions. Many research efforts have been devoted to this area, and they have achieved great success in various applications, such as tensor-based modeling, subgraph pattern mining and multi-view feature analysis. In this paper, we review some recent data mining methods that are used for analyzing brain disorders.

Keywords Data mining · Brain diseases · Tensor analysis · Subgraph patterns · Feature selection

B. Cao · P.S. Yu
Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607.
E-mail: [email protected], [email protected]

X. Kong
Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA 01609.
E-mail: [email protected]

1 Introduction

Many brain disorders are characterized by ongoing injury that is clinically silent for prolonged periods and irreversible by the time symptoms first present. New approaches for the detection of early changes in subclinical periods will afford powerful tools for aiding clinical diagnosis, clarifying underlying mechanisms and informing neuroprotective interventions to slow or reverse neural injury for a broad spectrum of brain disorders, including bipolar disorder, HIV infection of the brain, Alzheimer's disease and Parkinson's disease. Early diagnosis has the potential to greatly alleviate the burden of brain disorders and the ever-increasing costs to families and society.

As the identification of brain disorders is extremely challenging, many different diagnostic tools and methods have been developed to obtain a large number of measurements from various examinations and laboratory tests. In particular, recent advances in neuroimaging technology have provided an efficient and noninvasive way of studying the structural and functional connectivity of the human brain, either normal or in a diseased state [48]. This can be attributed in part to advances in magnetic resonance imaging (MRI) capabilities [33]. Techniques such as diffusion MRI, also referred to as diffusion tensor imaging (DTI), produce in vivo images of the diffusion process of water molecules in biological tissues. By leveraging the fact that water molecule diffusion patterns reveal microscopic details about tissue architecture, DTI can be used to perform tractography within the white matter and construct structural connectivity networks [2,36,12,38,40]. Functional MRI (fMRI) is a functional neuroimaging procedure that identifies localized patterns of brain activation by detecting associated changes in the cerebral blood flow.



The primary form of fMRI uses the blood oxygenation level dependent (BOLD) response extracted from the gray matter [3,42,43]. Another neuroimaging technique is positron emission tomography (PET). Using different radioactive tracers (e.g., fluorodeoxyglucose), PET produces a three-dimensional image of various physiological, biochemical and metabolic processes [68].

A variety of data representations can be derived from these neuroimaging experiments, which present many unique challenges for the data mining community. Conventional data mining algorithms are usually developed to tackle data in one specific representation, a majority of which are designed particularly for vector-based data. However, the raw neuroimaging data is in the form of tensors, from which we can further construct brain networks connecting regions of interest (ROIs). Both of them are highly structured, considering the correlations between adjacent voxels in the tensor data and those between connected brain regions in the brain network data. Moreover, it is critical to explore interactions between measurements computed from the neuroimaging and other clinical experiments, which describe subjects in different vector spaces. In this paper, we review some recent data mining methods for (1) mining tensor imaging data; (2) mining brain networks; and (3) mining multi-view feature vectors.

2 Tensor Imaging Analysis

For brain disorder identification, the raw data generated by neuroimaging experiments are in tensor representations [15,22,68]. For example, in contrast to two-dimensional X-ray images, an fMRI sample corresponds to a four-dimensional array by recording the sequential changes of traceable signals in each voxel¹.

Tensors are higher-order arrays that generalize the concepts of vectors (first-order tensors) and matrices (second-order tensors), whose elements are indexed by more than two indices. Each index expresses a mode of variation of the data and corresponds to a coordinate direction. In an fMRI sample, the first three modes usually encode the spatial information, while the fourth mode encodes the temporal information. The number of variables in each mode indicates the dimensionality of a mode. The order of a tensor is determined by the number of its modes. An $m$th-order tensor can be represented as $\mathcal{X} = (x_{i_1, \cdots, i_m}) \in \mathbb{R}^{I_1 \times \cdots \times I_m}$, where $I_i$ is the dimension of $\mathcal{X}$ along the $i$-th mode.

Definition 1 (Tensor product) The tensor product of three vectors $\mathbf{a} \in \mathbb{R}^{I_1}$, $\mathbf{b} \in \mathbb{R}^{I_2}$ and $\mathbf{c} \in \mathbb{R}^{I_3}$, denoted by $\mathbf{a} \otimes \mathbf{b} \otimes \mathbf{c}$, represents a third-order tensor with the elements $(\mathbf{a} \otimes \mathbf{b} \otimes \mathbf{c})_{i_1, i_2, i_3} = a_{i_1} b_{i_2} c_{i_3}$.

¹ A voxel is the smallest three-dimensional point volume referenced in a neuroimage of the brain.

Fig. 1 Tensor factorization of a third-order tensor.

The tensor product is also referred to as the outer product in some literature. An $m$th-order tensor is a rank-one tensor if it can be defined as the tensor product of $m$ vectors.

Definition 2 (Tensor factorization) Given a third-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ and an integer $R$, as illustrated in Figure 1, a tensor factorization of $\mathcal{X}$ can be expressed as
$$\mathcal{X} = \mathcal{X}_1 + \mathcal{X}_2 + \cdots + \mathcal{X}_R = \sum_{r=1}^{R} \mathbf{a}_r \otimes \mathbf{b}_r \otimes \mathbf{c}_r \quad (1)$$
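Equation (1) can be checked numerically; below is a minimal NumPy sketch in which all dimensions, the rank and the factor values are illustrative, not taken from the paper:

```python
import numpy as np

# Build a rank-R third-order tensor as a sum of R tensor (outer) products,
# as in Eq. (1). All sizes and values here are illustrative.
I1, I2, I3, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((I1, R))   # columns are the vectors a_r
B = rng.standard_normal((I2, R))   # columns are the vectors b_r
C = rng.standard_normal((I3, R))   # columns are the vectors c_r

# Sum of R rank-one tensors, computed in one step with einsum
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# The same sum written explicitly, term by term, as in Eq. (1)
X_explicit = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)
assert X.shape == (I1, I2, I3)
assert np.allclose(X, X_explicit)
```

The `einsum` form and the explicit sum of outer products agree entry by entry, matching the element-wise definition in Definition 1.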

One of the major difficulties brought by tensor data is the curse of dimensionality. The total number of voxels contained in a multi-mode tensor, say $\mathcal{X} = (x_{i_1, \cdots, i_m}) \in \mathbb{R}^{I_1 \times \cdots \times I_m}$, is $I_1 \times \cdots \times I_m$, which is exponential in the number of modes. If we unfold the tensor into a vector, the number of features will be extremely high [69]. This makes traditional data mining methods prone to overfitting, especially with a small sample size. Both the computational scalability and the theoretical guarantees of traditional models are compromised by such high dimensionality [22].
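To make the growth concrete, the vectorized feature count is the product of the mode dimensionalities; the grid size below is an illustrative fMRI-like example, not one from the paper:

```python
import numpy as np

# Number of entries in an illustrative 4th-order fMRI tensor:
# three spatial modes plus one temporal mode (dimensions are made up).
dims = (91, 109, 91, 200)
n_features = int(np.prod(dims))
print(n_features)  # 180525800 features after unfolding into a vector
```

With sample sizes in the tens or hundreds of subjects, this is the regime where overfitting becomes unavoidable without structural assumptions such as the low-rank factorization above.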

On the other hand, complex structural information is embedded in the tensor data. For example, in neuroimaging data, the values of adjacent voxels are usually correlated with each other [33]. Such spatial relationships among different voxels in a tensor image can be very important in neuroimaging applications. Conventional tensor-based approaches focus on reshaping the tensor data into matrices/vectors, and thus the original spatial relationships are lost. The integration of structural information is expected to improve the accuracy and interpretability of tensor models.

2.1 Classification

Suppose we have a set of tensor data $\mathcal{D} = \{(\mathcal{X}_i, y_i)\}_{i=1}^{n}$ for a classification problem, where $\mathcal{X}_i \in \mathbb{R}^{I_1 \times \cdots \times I_m}$ is the neuroimaging data represented as an $m$th-order tensor and $y_i \in \{-1, +1\}$ is the corresponding binary class label of $\mathcal{X}_i$. For example, if the $i$-th subject has Alzheimer's disease, the subject is associated with a positive label, i.e., $y_i = +1$. Otherwise, if the subject is in the control group, the subject is associated with a negative label, i.e., $y_i = -1$.
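On synthetic data, this setup can be sketched with the naive vectorize-then-classify baseline that the tensor-aware methods discussed below are designed to improve upon; a simple nearest-centroid rule stands in for a full SVM here, and all data is randomly generated:

```python
import numpy as np

# Naive baseline: unfold each tensor sample into a vector and classify
# with a nearest-centroid rule. All spatial structure is discarded by the
# reshape, which is exactly the limitation the tensor methods address.
rng = np.random.default_rng(1)
n, dims = 40, (8, 8, 5)            # 40 subjects, small illustrative tensors
y = np.repeat([-1, 1], n // 2)
X = rng.standard_normal((n, *dims))
X[y == 1] += 0.5                    # shift the positive class to separate it

Xv = X.reshape(n, -1)               # vectorization: tensor -> flat features
mu_pos = Xv[y == 1].mean(axis=0)
mu_neg = Xv[y == -1].mean(axis=0)

def predict(x):
    # assign the label of the closer class centroid
    return 1 if np.linalg.norm(x - mu_pos) < np.linalg.norm(x - mu_neg) else -1

acc = np.mean([predict(x) == yi for x, yi in zip(Xv, y)])
assert acc > 0.9
```

Even this crude baseline separates well-shifted synthetic classes, but it cannot exploit voxel adjacency; the STM and tensor-kernel approaches below keep the multi-mode structure instead.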

Supervised tensor learning can be formulated as the optimization problem of support tensor machines (STMs) [55], which are a generalization of the standard support vector machines (SVMs) from vector data to tensor data. The objective of such learning algorithms is to learn a hyperplane by which the samples with different labels are separated as widely as possible. However, tensor data may not be linearly separable in the input space. To achieve a better performance on finding the most discriminative biomarkers or identifying infected subjects from the control group, in many neuroimaging applications, nonlinear transformations of the original tensor data should be considered. He et al. study the problem of supervised tensor learning with nonlinear kernels which can preserve the structure of tensor data [22]. The proposed kernel is an extension of kernels in the vector space to the tensor space, which can take the multidimensional structural complexity into account.

2.2 Regression

Slightly different from classifying disease status (a discrete label), another family of problems uses tensor neuroimages to predict cognitive outcome (a continuous label). These problems can be formulated in a regression setup by treating the clinical outcome as a real-valued label, i.e., $y_i \in \mathbb{R}$, and treating the tensor neuroimages as the input. However, most classical regression methods take vectors as input features, and simply reshaping a tensor into a vector is clearly an unsatisfactory solution.

Zhou et al. exploit the tensor structure in imaging data and integrate tensor decomposition within a statistical regression paradigm to model multidimensional arrays [69]. By imposing a low-rank approximation on the extremely high-dimensional complex imaging data, the curse of dimensionality is greatly alleviated, thereby allowing the development of a fast estimation algorithm and regularization. Numerical analysis demonstrates its potential applications in identifying regions of interest in the brain that are relevant to a particular clinical response.
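A minimal sketch of the low-rank idea, assuming a rank-one coefficient tensor over second-order (matrix) inputs and fitting by alternating least squares; this is an illustration in the spirit of [69] on synthetic noiseless data, not the authors' algorithm:

```python
import numpy as np

# Rank-1 "tensor" regression sketch: y_i = <a ⊗ b, X_i>, fit by alternating
# least squares. Only I1 + I2 parameters are estimated instead of I1 * I2,
# which is how the low-rank constraint fights the curse of dimensionality.
rng = np.random.default_rng(4)
n, I1, I2 = 200, 10, 12
X = rng.standard_normal((n, I1, I2))
a_true, b_true = rng.standard_normal(I1), rng.standard_normal(I2)
y = np.einsum('nij,i,j->n', X, a_true, b_true)   # noiseless responses

a, b = np.ones(I1), np.ones(I2)
for _ in range(50):                       # alternate over the two factors
    Za = np.einsum('nij,j->ni', X, b)     # design matrix for a, with b fixed
    a = np.linalg.lstsq(Za, y, rcond=None)[0]
    Zb = np.einsum('nij,i->nj', X, a)     # design matrix for b, with a fixed
    b = np.linalg.lstsq(Zb, y, rcond=None)[0]

pred = np.einsum('nij,i,j->n', X, a, b)
rel_err = np.linalg.norm(pred - y) / np.linalg.norm(y)
assert rel_err < 1e-2
```

Each alternating step is an ordinary least-squares problem, which is what makes the low-rank formulation computationally cheap compared with fitting all $I_1 \times I_2$ coefficients jointly.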

2.3 Network Discovery

Modern imaging techniques have allowed us to study the human brain as a complex system by modeling it as a network [1]. For example, fMRI scans consist of activations of thousands of voxels over time, embedding a complex interaction of signals and noise [19], which naturally presents the problem of eliciting the underlying network from brain activities in the spatio-temporal tensor data. A brain connectivity network, also called a connectome [52], consists of nodes (gray matter regions) and edges (white matter tracts in structural networks, or correlations between two BOLD time series in functional networks).

Although the anatomical atlases of the brain have been extensively studied for decades, task- or subject-specific networks have still not been completely explored with consideration of functional or structural connectivity information. An anatomically parcellated region may contain subregions that are characterized by dramatically different functional or structural connectivity patterns, thereby significantly limiting the utility of the constructed networks. There are usually trade-offs between reducing noise and preserving utility in brain parcellation [33]. Thus, investigating how to directly construct brain networks from tensor imaging data, and understanding how they develop, deteriorate and vary across individuals, will benefit disease diagnosis [15].

Davidson et al. pose the problem of network discovery from fMRI data, which involves simplifying spatio-temporal data into regions of the brain (nodes) and relationships between those regions (edges) [15]. Here the nodes represent collections of voxels that are known to behave cohesively over time; the edges can indicate a number of properties between nodes, such as facilitation/inhibition (increases/decreases activity) or probabilistic (synchronized activity) relationships; and the weight associated with each edge encodes the strength of the relationship.

A tensor can be decomposed into several factors. However, unconstrained tensor decomposition results on fMRI data may not be good for node discovery, because each factor is typically neither a spatially contiguous region nor necessarily matched to an anatomical region. That is to say, many spatially adjacent voxels in the same structure are not active in the same factor, which is anatomically impossible. Therefore, to achieve the purpose of discovering nodes while preserving anatomical adjacency, known anatomical regions in the brain are used as masks, and constraints are added to enforce that the discovered factors should closely match these masks [15].

Overall, current research on tensor imaging analysis presents two directions: (1) supervised: for a particular brain disorder, a classifier can be trained by modeling the relationship between a set of neuroimages and their associated labels (disease status or clinical response); (2) unsupervised: regardless of brain disorders, a brain network can be discovered from a given neuroimage.


3 Brain Network Analysis

We have briefly introduced that brain networks can be constructed from neuroimaging data, where nodes correspond to brain regions, e.g., insula, hippocampus, thalamus, and links correspond to the functional/structural connectivity between brain regions. The linkage structure in brain networks can encode tremendous information about the mental health of human subjects. For example, in brain networks derived from functional magnetic resonance imaging (fMRI), functional connections can encode the correlations between the functional activities of brain regions, while structural links in diffusion tensor imaging (DTI) brain networks can capture the number of neural fibers connecting different brain regions. The complex structures and the lack of vector representations for the brain network data raise major challenges for data mining.

Next, we will discuss different approaches to conducting further analysis on constructed brain networks, which are also referred to as graphs hereafter.

Definition 3 (Binary graph) A binary graph is represented as $G = (V, E)$, where $V = \{v_1, \cdots, v_{n_v}\}$ is the set of vertices and $E \subseteq V \times V$ is the set of deterministic edges.

3.1 Kernel Learning on Graphs

In the setting of supervised learning on graphs, the target is to train a classifier using a given set of graph data $\mathcal{D} = \{(G_i, y_i)\}_{i=1}^{n}$, so that we can predict the label $y$ for a test graph $G$. With applications to brain networks, it is desirable to identify the disease status of a subject based on his/her uncovered brain network. Recent developments in brain network analysis have made the characterization of brain disorders possible at a whole-brain connectivity level, thus providing a new direction for brain disease classification.

Due to the complex structures and the lack of vector representations, graph data cannot be directly used as the input for most data mining algorithms. A straightforward solution that has been extensively explored is to first derive features from brain networks and then construct a kernel on the feature vectors.

Wee et al. use brain connectivity networks for disease diagnosis on mild cognitive impairment (MCI), which is an early phase of Alzheimer's disease (AD) and usually regarded as a good target for early diagnosis and therapeutic interventions [61,62,63]. In the feature extraction step, weighted local clustering coefficients of each ROI in relation to the remaining ROIs are extracted from all the constructed brain networks to quantify the prevalence of clustered connectivity around the ROIs. To select the most discriminative features for classification, a statistical t-test is performed, and features with p-values smaller than a predefined threshold are selected to construct a kernel matrix. Through the employment of a multi-kernel SVM, Wee et al. integrate information from DTI and fMRI and achieve accurate early detection of brain abnormalities [63].
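The screen-then-kernelize pipeline can be sketched as follows; as a simplification, this sketch thresholds the t statistic directly rather than converting to p-values, and all data, dimensions and thresholds are illustrative:

```python
import numpy as np

# Feature screening sketch: a per-feature two-sample t statistic (e.g., one
# per-ROI clustering coefficient), keep the most discriminative features,
# then build a linear kernel matrix on the selected features.
rng = np.random.default_rng(2)
n_pos, n_neg, d = 20, 20, 90        # 90 ROI features, illustrative
pos = rng.standard_normal((n_pos, d))
neg = rng.standard_normal((n_neg, d))
pos[:, :5] += 1.0                    # make the first 5 features informative

def t_stats(a, b):
    # Welch's t statistic for each feature (column)
    va, vb = a.var(axis=0, ddof=1), b.var(axis=0, ddof=1)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(va / len(a) + vb / len(b))

t = t_stats(pos, neg)
selected = np.flatnonzero(np.abs(t) > 2.0)   # roughly p < 0.05 at these sizes

# Linear kernel on the surviving features, ready for an SVM
Xsel = np.vstack([pos, neg])[:, selected]
K = Xsel @ Xsel.T
assert selected.size > 0
assert K.shape == (n_pos + n_neg, n_pos + n_neg)
```

As the next paragraph notes, this treats each feature independently, which is precisely why whole-graph kernels were introduced.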

However, such a strategy simply treats a graph as a collection of nodes/links, and then extracts local measures (e.g., the clustering coefficient) for each node or performs statistical analysis on each link, thereby disregarding the connectivity structures of brain networks. Motivated by the fact that some data in real-world applications are naturally represented by means of graphs, while compressing and converting them to vectorial representations would inevitably lose structural information, kernel methods for graphs have been extensively studied for a decade [5].

A graph kernel maps the graph data from the original graph space to a feature space and further measures the similarity between two graphs by comparing their topological structures [49]. For example, the product graph kernel is based on the idea of counting the number of walks in product graphs [18]; the marginalized graph kernel works by comparing the label sequences generated by synchronized random walks on labeled graphs [30]; and cyclic pattern kernels for graphs count pairs of matching cyclic/tree patterns in two graphs [23].

To identify individuals with AD/MCI from healthy controls, instead of using only a single property of brain networks, Jie et al. integrate multiple properties of fMRI brain networks to improve the disease diagnosis performance [27]. Two different yet complementary network properties, i.e., local connectivity and global topological properties, are quantified by computing two different types of kernels, i.e., a vector-based kernel and a graph kernel. As a local network property, weighted clustering coefficients are extracted to compute a vector-based kernel. As a topology-based graph kernel, the Weisfeiler-Lehman subtree kernel [49] is used to measure the topological similarity between paired fMRI brain networks. It is shown that this type of graph kernel can effectively capture the topological information in fMRI brain networks. A multi-kernel SVM is employed to fuse these two heterogeneous kernels for distinguishing individuals with MCI from healthy controls.
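The core of the Weisfeiler-Lehman idea can be sketched in a few lines: repeatedly relabel each node by its own label plus the multiset of its neighbours' labels, and sum the dot products of the resulting label histograms. This toy version is an illustration of the mechanism, not the exact formulation of [49]:

```python
from collections import Counter

# Toy Weisfeiler-Lehman subtree kernel on small labelled graphs given as
# adjacency lists. Labels are refined h times; at each round the kernel
# adds the dot product of the two graphs' label histograms.
def wl_subtree_kernel(adj1, lab1, adj2, lab2, h=2):
    def refine(adj, lab):
        # new label = (own label, sorted multiset of neighbour labels)
        return {v: (lab[v], tuple(sorted(lab[u] for u in adj[v]))) for v in adj}
    k = 0
    for _ in range(h + 1):
        c1, c2 = Counter(lab1.values()), Counter(lab2.values())
        k += sum(c1[l] * c2[l] for l in c1)      # histogram dot product
        lab1, lab2 = refine(adj1, lab1), refine(adj2, lab2)
    return k

# Two toy "brain networks": a triangle and a 3-node path, all nodes
# carrying the same initial label 'r'
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: 'r', 1: 'r', 2: 'r'}

k_tt = wl_subtree_kernel(triangle, dict(labels), triangle, dict(labels))
k_tp = wl_subtree_kernel(triangle, dict(labels), path, dict(labels))
print(k_tt, k_tp)  # prints 27 12
assert k_tt > k_tp
```

Although the two graphs start with identical label histograms, one refinement round already distinguishes the degree-1 path endpoints from the triangle's nodes, which is why the self-similarity exceeds the cross-similarity.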

3.2 Subgraph Pattern Mining

In brain network analysis, the ideal patterns we want to mine from the data should take care of both local and global graph topological information. Graph kernel methods seem promising, but they are not interpretable. Subgraph patterns are more suitable for brain networks, as they can simultaneously model the network connectivity patterns around the nodes and capture the changes in local areas [33].

Fig. 2 An example of discriminative subgraph patterns in brain networks.

Definition 4 (Subgraph) Let $G' = (V', E')$ and $G = (V, E)$ be two binary graphs. $G'$ is a subgraph of $G$ (denoted as $G' \subseteq G$) iff $V' \subseteq V$ and $E' \subseteq E$. If $G'$ is a subgraph of $G$, then $G$ is a supergraph of $G'$.

A subgraph pattern in a brain network represents a collection of brain regions and their connections. For example, as shown in Figure 2, three brain regions should work collaboratively in normal people, and the absence of any connection between them can result in Alzheimer's disease to different degrees. Therefore, it is valuable to understand which connections collectively play a significant role in the disease mechanism by finding discriminative subgraph patterns in brain networks.

Mining subgraph patterns from graph data has been extensively studied by many researchers [29,13,56,66]. In general, a variety of filtering criteria have been proposed. A typical evaluation criterion is frequency, which aims at searching for frequently appearing subgraph features in a graph dataset satisfying a prespecified threshold. Most of the frequent subgraph mining approaches are unsupervised. For example, Yan and Han develop a depth-first search algorithm, gSpan [67]. This algorithm builds a lexicographic order among graphs and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Many other approaches for frequent subgraph mining have also been proposed, e.g., AGM [26], FSG [34], MoFa [4], FFSM [24], and Gaston [41].

Fig. 3 An example of fMRI brain networks (left) and all possible instantiations of linkage structures between red nodes (right) [10].

Moreover, the problem of supervised subgraph mining has been studied in recent work, which examines how to improve the efficiency of searching for discriminative subgraph patterns for graph classification. Yan et al. introduce the two concepts of structural leap search and frequency-descending mining, and propose LEAP [66], one of the first works on discriminative subgraph mining. Thoma et al. propose CORK, which can yield a near-optimal solution using greedy feature selection [56]. Ranu and Singh propose a scalable approach, called GraphSig, that is capable of mining discriminative subgraphs with a low frequency threshold [46]. Jin et al. propose COM, which takes into account the co-occurrences of subgraph patterns, thereby facilitating the mining process [28]. Jin et al. further propose an evolutionary computation method, called GAIA, to mine discriminative subgraph patterns using a randomized search strategy [29]. Zhu et al. design a diversified discrimination score based on the log ratio, which can reduce the overlap between selected features by considering the embedding overlaps in the graphs [70].

Conventional graph mining approaches are best suited for binary edges, where the structure of graph objects is deterministic and the binary edges represent the presence of linkages between the nodes [33]. In fMRI brain network data, however, the edges in the graph linkage structure are inherently weighted, as shown in Figure 3 (left). A straightforward solution is to threshold the weighted networks to yield binary networks. However, such simplification will result in a great loss of information. Ideal data mining methods for brain network analysis should be able to overcome these methodological problems by generalizing the network edges to positively and negatively weighted cases, e.g., probabilistic weights in fMRI brain networks and integral weights in DTI brain networks.

Definition 5 (Weighted graph) A weighted graph is represented as $G = (V, E, p)$, where $V = \{v_1, \cdots, v_{n_v}\}$ is the set of vertices, $E \subseteq V \times V$ is the set of nondeterministic edges, and $p : E \to (0, 1]$ is a function that assigns a probability of existence to each edge in $E$.

fMRI brain networks can be modeled as weighted graphs where each edge $e \in E$ is associated with a probability $p(e)$ indicating the likelihood of whether this edge should exist or not [32,10]. It is assumed that the $p(e)$ of different edges in a weighted graph are independent of each other. Therefore, by enumerating the possible existence of all edges in a weighted graph, we can obtain a set of binary graphs. For example, in Fig. 3 (right), consider the three red nodes and the links between them as a weighted graph. There are $2^3 = 8$ binary graphs that can be implied, each with a different probability. For a weighted graph $G$, the probability of $G$ containing a subgraph feature $G'$ is defined as the probability that a binary graph implied by $G$ contains the subgraph $G'$. Kong et al. propose a discriminative subgraph feature selection method based on dynamic programming to compute the probability distribution of the discrimination scores for each subgraph pattern within a set of weighted graphs [32].
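The enumeration of implied binary graphs can be sketched directly; the three-node weighted graph below mirrors the setting of Fig. 3 (right), but the edge probabilities are illustrative values, not the figure's:

```python
from itertools import product

# A weighted graph on three nodes with independent edge probabilities;
# enumerate all 2^3 binary instantiations and compute the probability that
# an implied binary graph contains a given subgraph.
edges = [(0, 1), (1, 2), (0, 2)]
p = {(0, 1): 0.8, (1, 2): 0.6, (0, 2): 0.9}   # illustrative probabilities

def prob_contains(subgraph_edges):
    """P(an implied binary graph contains every edge of the subgraph)."""
    total = 0.0
    for bits in product([0, 1], repeat=len(edges)):
        present = {e for e, b in zip(edges, bits) if b}
        pr = 1.0
        for e, b in zip(edges, bits):
            pr *= p[e] if b else 1 - p[e]     # independence assumption
        if set(subgraph_edges) <= present:
            total += pr
    return total

# By independence this equals the product of the subgraph's edge
# probabilities: 0.8 * 0.6 = 0.48
print(round(prob_contains([(0, 1), (1, 2)]), 10))
```

Brute-force enumeration is exponential in the number of edges, which is why Kong et al. resort to dynamic programming for the distribution of discrimination scores [32].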

For brain network analysis, we usually only have a small number of graph instances [32]. In these applications, the graph view alone is not sufficient for mining important subgraphs. Fortunately, side information is available along with the graph data for brain disorder identification. For example, in neurological studies, hundreds of clinical, immunologic, serologic and cognitive measures may be available for each subject, apart from brain networks. These measures compose multiple side views which contain a tremendous amount of supplemental information for diagnostic purposes. It is desirable to extract valuable information from a plurality of side views to guide the process of subgraph mining in brain networks.

Figure 4(a) illustrates the process of selecting subgraph patterns in conventional graph classification approaches. Obviously, the valuable information embedded in side views is not fully leveraged in the feature selection process. To tackle this problem, Cao et al. introduce an effective algorithm for discriminative subgraph selection using multiple side views [9], as illustrated in Figure 4(b). Side information consistency is first validated via statistical hypothesis testing, which suggests that the similarity of side view features between instances with the same label has a higher probability of being larger than that between instances with different labels. Based on such observations, it is assumed that the similarity/distance between instances in the space of subgraph features should be consistent with that in the space of a side view. That is to say, if two instances are similar in the space of a side view, they should also be close to each other in the space of subgraph features. Therefore, the target is to minimize the distance between the subgraph features of each pair of similar instances in each side view [9]. In contrast to existing subgraph mining approaches that focus on the graph view alone, the proposed method can explore multiple vector-based side views to find an optimal set of subgraph features for graph classification.

Fig. 4 Two strategies of leveraging side views in the feature selection process for graph classification: (a) treating side views and subgraph patterns separately; (b) using side views as guidance for the process of selecting subgraph patterns.

For graph classification, brain network analysis approaches can generally be put into three groups: (1) extracting local measures (e.g., the clustering coefficient) to train a standard vector-based classifier; (2) directly adopting graph kernels for classification; (3) finding discriminative subgraph patterns. Different types of methods model the connectivity embedded in brain networks in different ways.

4 Multi-view Feature Analysis

In medical studies, measurements from a series of examinations are documented for each subject, including clinical, imaging, immunologic, serologic and cognitive measures [7], as shown in Figure 5. Each group of measures characterizes the health state of a subject from a different aspect. This type of data is referred to as multi-view data, and each group of measures forms a distinct view quantifying subjects in one specific feature space. It is therefore critical to combine the views to improve learning performance; simply concatenating features from all views and transforming the multi-view data into single-view data, as in method (a) of Figure 6, fails to leverage the underlying correlations between different views.


Fig. 5 An example of multi-view learning in medical studies [6], where six views (two MRI sequences, and clinical, immunologic, serologic and cognitive measures) characterize HIV/seronegative subjects.

4.1 Multi-view Learning

Suppose we have a multi-view classification task with $n$ labeled instances represented from $m$ different views: $\mathcal{D} = \{(\mathbf{x}_i^{(1)}, \mathbf{x}_i^{(2)}, \cdots, \mathbf{x}_i^{(m)}, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i^{(v)} \in \mathbb{R}^{I_v}$, $I_v$ is the dimensionality of the $v$-th view, and $y_i \in \{-1, +1\}$ is the class label of the $i$-th instance.
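In code, such a dataset is simply a collection of per-view feature vectors plus a label per instance; the sketch below uses random toy data with hypothetical view dimensionalities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dims = 5, [4, 6, 3]  # n instances, dimensionality I_v of each of m=3 views

# D = {(x_i^(1), ..., x_i^(m), y_i)}: one vector per view plus a +/-1 label
D = [tuple(rng.normal(size=I) for I in dims) + ((-1) ** i,)
     for i in range(n)]

x1, x2, x3, y = D[0]
print(len(D), x1.shape, x2.shape, x3.shape, y)  # 5 (4,) (6,) (3,) 1
```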

Representative methods for multi-view learning can

be categorized into three groups: co-training, multiple

kernel learning, and subspace learning [65]. Generally,

the co-training style algorithm is a classic approach

for semi-supervised learning, which trains in alterna-

tion to maximize the mutual agreement on different

views. Multiple kernel learning algorithms combine ker-

nels that naturally correspond to different views, either

linearly [35] or nonlinearly [58,14], to improve learn-

ing performance. Subspace learning algorithms learn a

latent subspace, from which multiple views are gener-

ated. Multiple kernel learning and subspace learning are

generalized as co-regularization style algorithms [53],

where the disagreement between the functions of differ-

ent views is taken as a part of the objective function to

be minimized. Overall, by exploring the consistency and

complementary properties of different views, multi-view

learning is more effective than single-view learning.
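The multiple kernel learning idea can be sketched minimally as follows, assuming fixed rather than learned kernel weights and an RBF kernel per view; learning the weights is precisely what the cited algorithms add:

```python
import numpy as np

def combined_rbf_kernel(views, weights, gamma=1.0):
    """Linearly combine one RBF kernel per view with nonnegative
    weights summing to one (the linear MKL setting of [35])."""
    n = views[0].shape[0]
    K = np.zeros((n, n))
    for X, w in zip(views, weights):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K += w * np.exp(-gamma * sq)  # weighted per-view Gram matrix
    return K

rng = np.random.default_rng(1)
views = [rng.normal(size=(6, 3)), rng.normal(size=(6, 5))]  # toy 2-view data
K = combined_rbf_kernel(views, weights=[0.7, 0.3])
print(K.shape)                                            # (6, 6)
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))  # True True
```

The combined matrix K remains a valid kernel (a convex combination of kernels), so it can be plugged into any kernel classifier.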

In the multi-view setting for brain disorders, or for medical studies in general, a critical problem is that only a limited number of subjects may be available (i.e., a small $n$) while a large number of measurements is introduced (i.e., a large $\sum_{v=1}^{m} I_v$). Within the multi-view data, not all

features in different views are relevant to the learning

task, and some irrelevant features may introduce un-

expected noise. The irrelevant information can even be

exaggerated after view combinations thereby degrad-

ing performance. Therefore, it is necessary to take care

of feature selection in the learning process. Feature se-

lection results can also be used by researchers to find

biomarkers for brain diseases. Such biomarkers are clin-

ically imperative for detecting injury to the brain in the

earliest stage before it is irreversible. Valid biomarkers

can be used to aid diagnosis, monitor disease progres-

sion and evaluate effects of intervention [32].

Conventional feature selection approaches can be di-

vided into three main directions: filter, wrapper, and

embedded methods [20]. Filter methods compute a dis-

crimination score of each feature independently of the

other features based on the correlation between the

feature and the label, e.g., information gain, Gini in-

dex, Relief [44,47]. Wrapper methods measure the use-

fulness of feature subsets according to their predictive

power, optimizing the subsequent induction procedure

that uses the respective subset for classification [21,45,

50,37,6]. Embedded methods perform feature selection

in the process of model training based on sparsity reg-

ularization [17,16,59,60]. For example, Miranda et al.

add a regularization term that penalizes the size of the

selected feature subset to the standard cost function of

SVM, thereby optimizing the new objective function to

conduct feature selection [39]. Essentially, in embedded methods the feature selection process and the learning algorithm interact and cannot be separated, whereas wrapper methods utilize the learning algorithm as a black box.
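A filter method in this taxonomy can be as simple as a per-feature Fisher-style score, computed independently of any classifier; the toy sketch below is one of many filter criteria in the spirit of those cited above (e.g., [44,47]), not a specific method from the reviewed work:

```python
import numpy as np

def fisher_scores(X, y):
    """Filter-style scores: rank each feature independently by
    between-class mean separation over within-class variance."""
    scores = []
    for j in range(X.shape[1]):
        f = X[:, j]
        m1, m0 = f[y == 1].mean(), f[y == -1].mean()
        v1, v0 = f[y == 1].var(), f[y == -1].var()
        scores.append((m1 - m0) ** 2 / (v1 + v0 + 1e-12))
    return np.array(scores)

rng = np.random.default_rng(2)
y = np.array([1] * 10 + [-1] * 10)
X = rng.normal(size=(20, 3))
X[:, 0] += 2 * y              # feature 0 is informative, 1 and 2 are noise
print(np.argmax(fisher_scores(X, y)))  # 0
```

Because each feature is scored in isolation, filter methods are fast but blind to feature interactions, which wrapper and embedded methods attempt to capture.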

However, directly applying these feature selection

approaches to each separate view would fail to lever-

age multi-view correlations. By taking into account the

latent interactions among views and the redundancy

triggered by multiple views, it is desirable to combine

multi-view data in a principled manner and perform

feature selection to obtain consensus and discrimina-

tive low dimensional feature representations.

4.2 Modeling View Correlations

Recent years have witnessed many research efforts de-

voted to the integration of feature selection and multi-

view learning. Tang et al. study multi-view feature se-

lection in the unsupervised setting by constraining that

similar data instances from each view should have sim-

ilar pseudo-class labels [54]. Considering brain disorder

identification, different neuroimaging features may cap-

ture different but complementary characteristics of the


data. For example, the voxel-based tensor features con-

vey the global information, while the ROI-based Auto-

mated Anatomical Labeling (AAL) [57] features sum-

marize the local information from multiple represen-

tative brain regions. Incorporating these data and ad-

ditional non-imaging data sources can potentially im-

prove the prediction. For Alzheimer’s disease (AD) clas-

sification, Ye et al. propose a kernel-based method for

integrating heterogeneous data, including tensor and

AAL features from MRI images, demographic informa-

tion and genetic information [68]. The kernel framework

is further extended for selecting features (biomarkers)

from heterogeneous data sources that play more signif-

icant roles than others in AD diagnosis.

Huang et al. propose a sparse composite linear dis-

criminant analysis model for identification of disease-

related brain regions of AD from multiple data sources

[25]. Two sets of parameters are learned: one represents

the common information shared by all the data sources

about a feature, and the other represents the specific

information only captured by a particular data source

about the feature. Experiments are conducted on the

PET and MRI data, which measure functional and structural aspects of the same AD pathology, respectively. However, the proposed approach requires the same set of variables as input from all data sources.

Xiang et al. investigate multi-source incomplete data

for AD and introduce a unified feature learning model

to handle block-wise missing data which achieves simul-

taneous feature-level and source-level selection [64].

For modeling view correlations, in general, a coeffi-

cient is assigned for each view, either at the view-level

or feature-level. For example, in multiple kernel learn-

ing, a kernel is constructed from each view and a set of

kernel coefficients are learned to obtain an optimal com-

bined kernel matrix. These approaches, however, fail to

explicitly consider correlations between features.

4.3 Modeling Feature Correlations

One of the key issues for multi-view classification is to

choose an appropriate tool to model features and their

correlations hidden in multiple views, since this directly

determines how information will be used. In contrast

to modeling on views, another direction for modeling

multi-view data is to directly consider the correlations

between features from multiple views. Since taking the

tensor product of their respective feature spaces cor-

responds to the interaction of features from multiple

views, the concept of tensor serves as a backbone for

incorporating multi-view features into a consensus rep-

resentation by means of tensor product, where the com-

plex multiple relationships among views are embedded

Fig. 6 Schematic view of the key differences among three strategies of multi-view feature selection [6].

within the tensor structures. By mining structural in-

formation contained in the tensor, knowledge of multi-

view features can be extracted and used to establish a

predictive model.

Smalter et al. formulate the problem of feature selec-

tion in the tensor product space as an integer quadratic

programming problem [51]. However, this method is computationally intractable on many views, since it directly selects features in the tensor product space and thus suffers from the curse of dimensionality, as in method (b) of Figure 6. Cao et al. propose a tensor-

based approach to model features and their correlations

hidden in the original multi-view data [6]. The opera-

tion of tensor product can be used to bring m-view

feature vectors of each instance together, leading to

a tensorial representation for common structure across

multiple views, and allowing us to adequately diffuse re-

lationships and encode information among multi-view

features. In this manner, the multi-view classification

task is essentially transformed from an independent do-

main of each view to a consensus domain as a tensor

classification problem.
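The tensor product construction can be illustrated in a few lines with hypothetical two-view feature vectors; with $m$ views, repeated outer products yield an $m$th-order tensor in the same way:

```python
import numpy as np

# hypothetical per-view feature vectors for one subject
x1 = np.array([1.0, 2.0])        # view 1, I1 = 2
x2 = np.array([0.5, -1.0, 3.0])  # view 2, I2 = 3

# X_i = x^(1) (x) x^(2): a 2nd-order tensor in R^{I1 x I2} whose
# entries are all cross-view feature products
X = np.einsum("a,b->ab", x1, x2)
print(X.shape)  # (2, 3)
print(X[1, 2])  # 2.0 * 3.0 = 6.0
```

Every entry of X is an interaction between one feature from each view, which is exactly the consensus representation the tensor-based methods operate on.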

By using $\mathcal{X}_i$ to denote $\bigotimes_{v=1}^{m} \mathbf{x}_i^{(v)}$, the dataset of labeled multi-view instances can be represented as $\mathcal{D} = \{(\mathcal{X}_i, y_i)\}_{i=1}^{n}$. Note that each multi-view instance $\mathcal{X}_i$ is an $m$th-order tensor that lies in the tensor product space $\mathbb{R}^{I_1 \times \cdots \times I_m}$. Based on the definitions of inner product and tensor norm, multi-view classification can be

formulated as a global convex optimization problem in

the framework of supervised tensor learning [55]. This

model is named multi-view SVM [6], and it can be

solved with the use of optimization techniques devel-

oped for SVM.

Furthermore, a dual method for multi-view feature

selection is proposed in [6] that leverages the relation-

ship between original multi-view features and recon-

structed tensor product features to facilitate the implementation of feature selection, as in method (c) of

Figure 6. It is a wrapper model which selects useful

features in conjunction with the classifier and simulta-

neously exploits the correlations among multiple views.

Following the idea of SVM-based recursive feature elim-

ination [21], multi-view feature selection is consistently

formulated and implemented in the framework of multi-

view SVM. This idea can be extended to include lower-order

feature interactions and to employ a variety of loss func-

tions for classification or regression [11].
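The recursive elimination idea of [21] can be sketched with a least-squares stand-in for the SVM; this illustrates the elimination loop only, not the multi-view SVM formulation of [6]:

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination in the spirit of SVM-RFE [21]:
    fit a linear model, drop the feature with the smallest absolute
    weight, and repeat (least squares stands in for the SVM here)."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        active.pop(int(np.argmin(np.abs(w))))
    return active

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 1] + 0.1 * rng.normal(size=100))  # only feature 1 matters
print(rfe(X, y, n_keep=1))  # [1]
```

Ranking features by the magnitude of the fitted weights and pruning the weakest at each step is the core of the SVM-based recursive feature elimination that the multi-view variant builds on.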

5 Future Work

The human brain is one of the most complicated bi-

ological structures in the known universe. While it is

very challenging to understand how it works, especially

when disorders and diseases occur, dozens of leading

technology firms, academic institutions, scientists, and

other key contributors to the field of neuroscience have

devoted themselves to this area and made significant

improvements along various dimensions.2 Data mining on

brain disorder identification has become an emerging

area and a promising research direction.

This paper provides an overview of data mining ap-

proaches with applications to brain disorder identifica-

tion which have attracted increasing attention in both

data mining and neuroscience communities in recent

years. A taxonomy is built based upon data represen-

tations, i.e., tensor imaging data, brain network data

and multi-view data, following which the relationships

between different data mining algorithms and different

neuroimaging applications are summarized. We briefly

present some potential topics of interest in the future.

Bridging heterogeneous data representations.

As introduced in this paper, we can usually derive data

from neuroimaging experiments in three representations,

including raw tensor imaging data, brain network data

and multi-view vector-based data. It is critical to study

how to train a model on a mixture of data representa-

tions, although it is very challenging to combine data

that are represented in tensor space, vector space and

graph space, respectively. A straightforward idea is to define different kernels on different feature spaces and combine them through multi-kernel algorithms; however, the results are usually hard to interpret. The concept of side view has been introduced to

facilitate the process of mining brain networks, which

may also be used to guide supervised tensor learning.

It is even more interesting if we can learn on tensors

and graphs simultaneously.

2 http://www.whitehouse.gov/BRAIN

Fig. 7 A bioinformatics heterogeneous information network schema.

Integrating multiple neuroimaging modalities.

There are a variety of neuroimaging techniques avail-

able characterizing subjects from different perspectives

and providing complementary information. For exam-

ple, DTI contains local microstructural characteristics

of water diffusion; structural MRI can be used to delin-

eate brain atrophy; fMRI records BOLD response re-

lated to neural activity; PET measures metabolic pat-

terns [63]. Based on such multimodality representation,

it is desirable to find useful patterns with rich seman-

tics. For example, it is important to know which connec-

tivity between brain regions is significant in the sense

of both structure and functionality. On the other hand,

by leveraging the complementary information embed-

ded in the multimodality representation, better perfor-

mance on disease diagnosis can be expected.

Mining bioinformatics information networks.

A bioinformatics network is a rich source of heterogeneous

information involving disease mechanisms, as shown in

Figure 7. The problems of gene-disease association and

drug-target binding prediction have been studied in the

setting of heterogeneous information networks [8,31].

For example, in gene-disease association prediction, dif-

ferent gene sequences can lead to certain diseases. Re-

searchers would like to predict the association relation-

ships between genes and diseases. Understanding the

correlations between brain disorders and other diseases

and the causality between certain genes and brain dis-

eases can be transformative for yielding new insights

concerning risk and protective relationships, for clar-

ifying disease mechanisms, for aiding diagnostics and

clinical monitoring, for biomarker discovery, for iden-

tification of new treatment targets and for evaluating

effects of intervention.


References

1. O. Ajilore, L. Zhan, J. GadElkarim, A. Zhang, J. D. Feusner, S. Yang, P. M. Thompson, A. Kumar, and A. Leow. Constructing the resting state structural connectome. Frontiers in Neuroinformatics, 7, 2013.

2. P. J. Basser and C. Pierpaoli. Microstructural and phys-iological features of tissues elucidated by quantitative-diffusion-tensor MRI. Journal of Magnetic Resonance, Se-ries B, 111(3):209–219, 1996.

3. B. Biswal, F. Zerrin Yetkin, V. M. Haughton, and J. S. Hyde. Functional connectivity in the motor cortex of resting human brain using echo-planar MRI. Magnetic Resonance in Medicine, 34(4):537–541, 1995.
4. C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In ICDM, pages 51–58. IEEE, 2002.

5. F. Camastra and A. Petrosino. Kernel methods forgraphs: A comprehensive approach. In Knowledge-Based

Intelligent Information and Engineering Systems, pages662–669. Springer, 2008.

6. B. Cao, L. He, X. Kong, P. S. Yu, Z. Hao, and A. B.Ragin. Tensor-based multi-view feature selection withapplications to brain diseases. In ICDM, pages 40–49.IEEE, 2014.

7. B. Cao, X. Kong, C. Kettering, P. S. Yu, and A. B. Ragin.Determinants of HIV-induced brain changes in three dif-ferent periods of the early clinical course: A data mininganalysis. NeuroImage: Clinical, 2015.

8. B. Cao, X. Kong, and P. S. Yu. Collective predictionof multiple types of links in heterogeneous informationnetworks. In ICDM, pages 50–59. IEEE, 2014.

9. B. Cao, X. Kong, J. Zhang, P. S. Yu, and A. B. Ragin.Mining brain networks using multiple side views for neu-rological disorder identification.

10. B. Cao, L. Zhan, X. Kong, P. S. Yu, N. Vizueta, L. L.Altshuler, and A. D. Leow. Identification of discrimina-tive subgraph patterns in fMRI brain networks in bipo-lar affective disorder. In Brain Informatics and Health.Springer, 2015.

11. B. Cao, H. Zhou, and P. S. Yu. Multi-view machines.arXiv, 2015.

12. T. L. Chenevert, J. A. Brunberg, and J. Pipe. Anisotropicdiffusion in human white matter: demonstration with mrtechniques in vivo. Radiology, 177(2):401–405, 1990.

13. H. Cheng, D. Lo, Y. Zhou, X. Wang, and X. Yan. Identi-fying bug signatures using discriminative graph mining.In ISSTA, pages 141–152. ACM, 2009.

14. C. Cortes, M. Mohri, and A. Rostamizadeh. Learningnon-linear combinations of kernels. In NIPS, pages 396–404, 2009.

15. I. Davidson, S. Gilpin, O. Carmichael, and P. Walker.Network discovery via constrained tensor analysis offMRI data. In KDD, pages 194–202. ACM, 2013.

16. Z. Fang and Z. M. Zhang. Discriminative feature selectionfor multi-view cross-domain learning. In CIKM, pages1321–1330. ACM, 2013.

17. Y. Feng, J. Xiao, Y. Zhuang, and X. Liu. Adaptive un-supervised multi-view feature selection for visual conceptrecognition. In ACCV, pages 343–357, 2012.

18. T. Gartner, P. Flach, and S. Wrobel. On graph kernels:Hardness results and efficient alternatives. In Learning

Theory and Kernel Machines, pages 129–143. Springer,2003.

19. C. R. Genovese, N. A. Lazar, and T. Nichols. Threshold-ing of statistical maps in functional neuroimaging usingthe false discovery rate. Neuroimage, 15(4):870–878, 2002.

20. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
21. I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.

22. L. He, X. Kong, P. S. Yu, A. B. Ragin, Z. Hao, andX. Yang. DuSK: A dual structure-preserving kernel forsupervised tensor learning with applications to neuroim-ages. In SDM. SIAM, 2014.

23. T. Horvath, T. Gartner, and S. Wrobel. Cyclic patternkernels for predictive graph mining. In KDD, pages 158–167. ACM, 2004.

24. J. Huan, W. Wang, and J. Prins. Efficient mining offrequent subgraphs in the presence of isomorphism. InICDM, pages 549–552. IEEE, 2003.

25. S. Huang, J. Li, J. Ye, T. Wu, K. Chen, A. Fleisher,and E. Reiman. Identifying Alzheimer’s disease-relatedbrain regions from multi-modality neuroimaging data us-ing sparse composite linear discrimination analysis. InNIPS, pages 1431–1439, 2011.

26. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Principles of Data Mining and Knowledge Discovery, pages 13–23. Springer, 2000.
27. B. Jie, D. Zhang, W. Gao, Q. Wang, C. Wee, and D. Shen. Integration of network topological and connectivity properties for neuroimaging classification. Biomedical Engineering, 61(2):576, 2014.
28. N. Jin, C. Young, and W. Wang. Graph classification based on pattern co-occurrence. In CIKM, pages 573–582. ACM, 2009.

29. N. Jin, C. Young, and W. Wang. GAIA: graph classifica-tion using evolutionary computation. In SIGMOD, pages879–890. ACM, 2010.

30. H. Kashima, K. Tsuda, and A. Inokuchi. Marginalizedkernels between labeled graphs. In ICML, volume 3, pages321–328, 2003.

31. X. Kong, B. Cao, and P. S. Yu. Multi-label classificationby mining label and instance correlations from hetero-geneous information networks. In KDD, pages 614–622.ACM, 2013.

32. X. Kong, A. B. Ragin, X. Wang, and P. S. Yu. Discrimi-native feature selection for uncertain graph classification.In SDM, pages 82–93. SIAM, 2013.

33. X. Kong and P. S. Yu. Brain network analysis: a datamining perspective. ACM SIGKDD Explorations Newslet-ter, 15(2):30–38, 2014.

34. M. Kuramochi and G. Karypis. Frequent subgraph dis-covery. In ICDM, pages 313–320. IEEE, 2001.

35. G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27–72, 2004.
36. D. Le Bihan, E. Breton, D. Lallemand, P. Grenier, E. Cabanis, and M. Laval-Jeantet. MR imaging of intravoxel incoherent motions: application to diffusion and perfusion in neurologic disorders. Radiology, 161(2):401–407, 1986.

37. S. Maldonado and R. Weber. A wrapper method for feature selection using support vector machines. Information Sciences, 179(13):2208–2217, 2009.
38. M. J. McKeown, S. Makeig, G. G. Brown, T.-P. Jung, S. S. Kindermann, A. J. Bell, and T. J. Sejnowski. Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping, 6:160–188, 1998.


39. J. Miranda, R. Montoya, and R. Weber. Linear penal-ization support vector machines for feature selection. InPattern Recognition and Machine Intelligence, pages 188–192. Springer, 2005.

40. M. E. Moseley, Y. Cohen, J. Kucharczyk, J. Mintorovitch,H. Asgari, M. Wendland, J. Tsuruda, and D. Norman.Diffusion-weighted mr imaging of anisotropic water diffu-sion in cat central nervous system. Radiology, 176(2):439–445, 1990.

41. S. Nijssen and J. N. Kok. A quickstart in frequent struc-ture mining can make a difference. In KDD, pages 647–652. ACM, 2004.

42. S. Ogawa, T. Lee, A. Kay, and D. Tank. Brain magneticresonance imaging with contrast dependent on blood oxy-genation. Proceedings of the National Academy of Sciences,87(24):9868–9872, 1990.

43. S. Ogawa, T.-M. Lee, A. S. Nayak, and P. Glynn. Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high magnetic fields. Magnetic Resonance in Medicine, 14(1):68–78, 1990.
44. H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.

45. A. Rakotomamonjy. Variable selection using SVM-based criteria. The Journal of Machine Learning Research,3:1357–1370, 2003.

46. S. Ranu and A. K. Singh. Graphsig: A scalable approachto mining significant subgraphs in large graph databases.In ICDE, pages 844–855. IEEE, 2009.

47. M. Robnik-Sikonja and I. Kononenko. Theoretical andempirical analysis of relieff and rrelieff. Machine learning,53(1-2):23–69, 2003.

48. M. Rubinov and O. Sporns. Complex network measures of brain connectivity: uses and interpretations. Neuroimage, 52(3):1059–1069, 2010.
49. N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-Lehman graph kernels. The Journal of Machine Learning Research, 12:2539–2561, 2011.

50. M.-D. Shieh and C.-C. Yang. Multiclass SVM-RFE for product form feature selection. Expert Systems with Applications, 35(1):531–541, 2008.
51. A. Smalter, J. Huan, and G. Lushington. Feature selection in the tensor product feature space. In ICDM, pages 1004–1009, 2009.

52. O. Sporns, G. Tononi, and R. Kotter. The human connec-tome: a structural description of the human brain. PLoScomputational biology, 1(4):e42, 2005.

53. S. Sun. A survey of multi-view machine learning. NeuralComputing and Applications, 23(7-8):2031–2038, 2013.

54. J. Tang, X. Hu, H. Gao, and H. Liu. Unsupervised featureselection for multi-view data in social media. In SDM,pages 270–278. SIAM, 2013.

55. D. Tao, X. Li, X. Wu, W. Hu, and S. J. Maybank. Super-vised tensor learning. Knowledge and Information Systems,13(1):1–42, 2007.

56. M. Thoma, H. Cheng, A. Gretton, J. Han, H.-P. Kriegel,A. J. Smola, L. Song, S. Y. Philip, X. Yan, and K. M.Borgwardt. Near-optimal supervised feature selectionamong frequent subgraphs. In SDM, pages 1076–1087.SIAM, 2009.

57. N. Tzourio-Mazoyer, B. Landeau, D. Papathanassiou,F. Crivello, O. Etard, N. Delcroix, B. Mazoyer, andM. Joliot. Automated anatomical labeling of activa-tions in SPM using a macroscopic anatomical parcella-tion of the MNI MRI single-subject brain. Neuroimage,15(1):273–289, 2002.

58. M. Varma and B. R. Babu. More generality in efficientmultiple kernel learning. In ICML, pages 1065–1072,2009.

59. H. Wang, F. Nie, and H. Huang. Multi-view clusteringand feature learning via structured sparsity. In ICML,pages 352–360, 2013.

60. H. Wang, F. Nie, H. Huang, and C. Ding. Heterogeneousvisual features fusion via sparse multimodal machine. InCVPR, pages 3097–3102, 2013.

61. C.-Y. Wee, P.-T. Yap, K. Denny, J. N. Browndyke,G. G. Potter, K. A. Welsh-Bohmer, L. Wang, andD. Shen. Resting-state multi-spectrum functional con-nectivity networks for identification of mci patients. PloSone, 7(5):e37828, 2012.

62. C.-Y. Wee, P.-T. Yap, W. Li, K. Denny, J. N. Browndyke,G. G. Potter, K. A. Welsh-Bohmer, L. Wang, andD. Shen. Enriched white matter connectivity networksfor accurate identification of mci patients. Neuroimage,54(3):1812–1822, 2011.

63. C.-Y. Wee, P.-T. Yap, D. Zhang, K. Denny, J. N.Browndyke, G. G. Potter, K. A. Welsh-Bohmer, L. Wang,and D. Shen. Identification of mci individuals using struc-tural and functional connectivity networks. Neuroimage,59(3):2045–2056, 2012.

64. S. Xiang, L. Yuan, W. Fan, Y. Wang, P. M. Thompson,and J. Ye. Multi-source learning with block-wise missingdata for Alzheimer’s disease prediction. In KDD, pages185–193. ACM, 2013.

65. C. Xu, D. Tao, and C. Xu. A survey on multi-view learn-ing. arXiv, 2013.

66. X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining signif-icant graph patterns by leap search. In SIGMOD, pages433–444. ACM, 2008.

67. X. Yan and J. Han. gspan: Graph-based substructurepattern mining. In ICDM, pages 721–724. IEEE, 2002.

68. J. Ye, K. Chen, T. Wu, J. Li, Z. Zhao, R. Patel, M. Bae,R. Janardan, H. Liu, G. Alexander, et al. Heterogeneousdata fusion for Alzheimer’s disease study. In KDD, pages1025–1033. ACM, 2008.

69. H. Zhou, L. Li, and H. Zhu. Tensor regression with ap-plications in neuroimaging data analysis. Journal of theAmerican Statistical Association, 108(502):540–552, 2013.

70. Y. Zhu, J. X. Yu, H. Cheng, and L. Qin. Graph classi-fication: a diversified discriminative feature selection ap-proach. In CIKM, pages 205–214. ACM, 2012.