1
Computational BioMedical Informatics
SCE 5095: Special Topics Course
Instructor: Jinbo Bi, Computer Science and Engineering Dept.
2
Course Information
Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: [email protected]
– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Mon / Wed 2:00pm – 3:15pm
– Location: CAST 204
– Office hours: Mon. 3:30–4:30pm

HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password
– Illustration
3
Review of previous classes
– Introduced unsupervised learning, particularly cluster analysis techniques
– Discussed one important application of cluster analysis: cardiac ultrasound view recognition
– Reviewed papers on medical, health, or public health topics

In this class, we start to discuss dimension reduction.
4
Topics today
– What is feature reduction?
– Why feature reduction?
– More general motivations of component analysis
– Feature reduction algorithms
  – Principal component analysis
  – Canonical correlation analysis
  – Independent component analysis
5
What is feature reduction?
Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
– The criterion for feature reduction can differ based on the problem setting:
  – Unsupervised setting: minimize the information loss
  – Supervised setting: maximize the class discrimination
Given a set of data points $x_1, x_2, \ldots, x_n$ of $p$ variables, compute the linear transformation (projection)

$$G \in \mathbb{R}^{p \times d}: \quad x \in \mathbb{R}^p \;\mapsto\; y = G^T x \in \mathbb{R}^d, \qquad d < p$$
6
Linear transformation from original data to reduced data:

$$Y = G^T X, \qquad X \in \mathbb{R}^{p \times n} \;(\text{original data}), \quad G \in \mathbb{R}^{p \times d}, \quad Y \in \mathbb{R}^{d \times n} \;(\text{reduced data})$$
What is feature reduction?
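As a concrete illustration of the transformation above, here is a minimal NumPy sketch; the projection matrix G and the toy data are arbitrary, chosen only to show the shapes involved.

```python
import numpy as np

# Toy data: n = 5 points in p = 3 dimensions, stored as the columns of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                    # original data, shape (p, n)

# Any p x d matrix with orthonormal columns defines a linear projection.
G, _ = np.linalg.qr(rng.normal(size=(3, 2)))   # G has shape (p, d) with d = 2

Y = G.T @ X                                    # reduced data, shape (d, n)
print(Y.shape)                                 # (2, 5)
```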
7
High-dimensional data
– Gene expression
– Face images
– Handwritten digits
8
Feature reduction versus feature selection
Feature reduction
– All original features are used.
– The transformed features are linear combinations of the original features.

Feature selection
– Only a subset of the original features is used.
9
Why feature reduction?
Most machine learning and data mining techniques may not be effective for high-dimensional data.
– Curse of dimensionality
– Query accuracy and efficiency degrade rapidly as the dimension increases.

The intrinsic dimension may be small.
– For example, the number of genes responsible for a certain type of disease may be small.
10
– Understanding: reason about or obtain insights from data; combinations of observed variables may be more effective bases for insights, even if their physical meaning is obscure.
– Visualization: projection of high-dimensional data onto 2D or 3D.
– Data compression: there is too much noise in the data; "reduce" it to a smaller set of factors for efficient storage and retrieval, giving a better representation of the data without losing much information.
– Noise removal: can build more effective data analyses (classification, clustering, pattern recognition) on the reduced-dimensional space, with a positive effect on query accuracy.
Why feature reduction?
11
Application of feature reduction
– Face recognition
– Handwritten digit recognition
– Text mining
– Image retrieval
– Microarray data analysis
– Protein classification
12
• We study phenomena that cannot be directly observed
– ego, personality, intelligence in psychology
– underlying factors that govern the observed data
• We want to identify and operate with underlying latent factors rather than the observed data
– e.g., topics in news articles
– transcription factors in genomics
• We want to discover and exploit hidden relationships
– "beautiful car" and "gorgeous automobile" are closely related
– so are "driver" and "automobile"
– but does your search engine know this?
– reduces noise and error in results
More general motivations
13
• Discover a new set of factors/dimensions/axes against which to represent, describe, or evaluate the data
– for more effective reasoning, insights, or better visualization
– reduce noise in the data
– typically a smaller set of factors: dimension reduction
– better representation of data without losing much information
– can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition
• Factors are combinations of observed variables
– may be more effective bases for insights, even if their physical meaning is obscure
– observed data are described in terms of these factors rather than in terms of the original variables/dimensions
More general motivations
14
Feature reduction algorithms
Unsupervised
– Principal component analysis (PCA)
– Canonical correlation analysis (CCA)
– Independent component analysis (ICA)

Supervised
– Linear discriminant analysis (LDA)
– Sparse support vector machine (SSVM)

Semi-supervised
– Research topic
15
Feature reduction algorithms
Linear
– Principal Component Analysis (PCA)
– Linear Discriminant Analysis (LDA)
– Canonical Correlation Analysis (CCA)
– Independent Component Analysis (ICA)

Nonlinear
– Nonlinear feature reduction using kernels: kernel PCA, kernel CCA, …
– Manifold learning: LLE, Isomap
16
Basic Concept
– Areas of variance in data are where items can be best discriminated and key underlying phenomena observed: areas of greatest "signal" in the data.
– If two items or dimensions are highly correlated or dependent, they are likely to represent highly related phenomena. If they tell us about the same underlying variance in the data, combining them to form a single measure is reasonable (parsimony, reduction in error).
– So we want to combine related variables, and focus on uncorrelated or independent ones, especially those along which the observations have high variance.
– We want a smaller set of variables that explain most of the variance in the original data, in a more compact and insightful form.
17
Basic Concept
What if the dependences and correlations are not so strong or direct?
And suppose you have 3 variables, or 4, or 5, or 10000?
Look for the phenomena underlying the observed covariance/co-dependence in a set of variables
– Once again, phenomena that are uncorrelated or independent, and especially those along which the data show high variance
These phenomena are called "factors," "principal components," or "independent components," depending on the method used.
– Factor analysis: based on variance/covariance/correlation
– Independent Component Analysis: based on independence
18
What is Principal Component Analysis?
Principal component analysis (PCA)
– reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
– retains most of the sample's information
– is useful for the compression and classification of data

By information we mean the variation present in the sample, given by the correlations between the original variables.
19
Principal Component Analysis
Most common form of factor analysis. The new variables/dimensions
– are linear combinations of the original ones
– are uncorrelated with one another (orthogonal in the original dimension space)
– capture as much of the original variance in the data as possible
– are called Principal Components
– are ordered by the fraction of the total information each retains
20
Some Simple Demos
http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html
21
What are the new axes?
[Figure: data plotted against Original Variable A and Original Variable B, with the new axes PC 1 and PC 2 overlaid]

• Orthogonal directions of greatest variance in data
• Projections along PC 1 discriminate the data most along any one axis
22
Principal Components
The first principal component is the direction of greatest variability (covariance) in the data.
The second is the next orthogonal (uncorrelated) direction of greatest variability
– so first remove all the variability along the first component, and then find the next direction of greatest variability.
And so on …
23
Principal Components Analysis (PCA)

Principle
– Linear projection method to reduce the number of parameters
– Transforms a set of correlated variables into a new set of uncorrelated variables
– Maps the data into a space of lower dimensionality
– A form of unsupervised learning

Properties
– It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables.
– The new axes are orthogonal and represent the directions of maximum variability.
24
Computing the Components
Data points are vectors in a multidimensional space.
The projection of a vector $x$ onto an axis (dimension) $u$ is $u^T x$.
The direction of greatest variability is the one in which the average square of the projection is greatest
– i.e., the $u$ such that $E((u^T x)^2)$ over all $x$ is maximized
– (we subtract the mean along each dimension, and center the original axis system at the centroid of all data points, for simplicity)
– this direction of $u$ is the direction of the first Principal Component.
25
Computing the Components
$$E((u^T x)^2) = E((u^T x)(u^T x)) = E(u^T x\, x^T u)$$

The matrix $C = x\,x^T$ contains the correlations (similarities) of the original axes based on how the data values project onto them.

So we are looking for the unit-length $u$ that maximizes $u^T C u$.

It is maximized when $u$ is the principal eigenvector of the matrix $C$, in which case
– $u^T C u = u^T \lambda u = \lambda$ if $u$ is unit-length, where $\lambda$ is the principal (largest) eigenvalue of the covariance matrix $C$
– the eigenvalue denotes the amount of variability captured along that dimension.
26
Why the Eigenvectors?
Maximize $u^T x x^T u$ s.t. $u^T u = 1$.
Construct the Lagrangian $u^T x x^T u - \lambda u^T u$ and set the vector of partial derivatives to zero:
$$x x^T u - \lambda u = (x x^T - \lambda I)\, u = 0$$
Since $u \neq 0$, $u$ must be an eigenvector of $x x^T$ with eigenvalue $\lambda$.
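A minimal sketch of this computation on toy data (NumPy only; the data are random and only illustrate the steps: center, form the covariance matrix, take the principal eigenvector):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))             # 100 samples, p = 3 dimensions (rows = samples)

Xc = X - X.mean(axis=0)                   # subtract the mean along each dimension
C = Xc.T @ Xc / (len(Xc) - 1)             # sample covariance matrix, p x p

eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending eigenvalues for symmetric C
u1 = eigvecs[:, -1]                       # principal eigenvector (unit length)

print(np.isclose(u1 @ C @ u1, eigvals[-1]))   # u' C u equals the principal eigenvalue
```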
27
Singular Value Decomposition
The first root is called the principal eigenvalue, which has an associated orthonormal ($u^T u = 1$) eigenvector $u$.
Subsequent roots are ordered such that $\lambda_1 > \lambda_2 > \dots > \lambda_M$, with rank($D$) non-zero values.
Eigenvectors form an orthonormal basis, i.e. $u_i^T u_j = \delta_{ij}$.
The eigenvalue decomposition: $C = x x^T = U \Sigma U^T$, where $U = [u_1, u_2, \dots, u_M]$ and $\Sigma = \mathrm{diag}[\lambda_1, \lambda_2, \dots, \lambda_M]$.
Similarly, the eigenvalue decomposition of $x^T x = V \Sigma V^T$.
The SVD is closely related to the above: $x = U \Sigma^{1/2} V^T$.
The left singular vectors are $U$, the right singular vectors are $V$, and the singular values are the square roots of the eigenvalues.
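The link between the eigendecomposition of $x x^T$ and the SVD of $x$ can be checked numerically; a small sketch with random, centered data (NumPy only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 50))                 # p = 4 variables (rows), n = 50 samples
X = X - X.mean(axis=1, keepdims=True)        # center each variable

eigvals, U_eig = np.linalg.eigh(X @ X.T)     # eigendecomposition of C = X X^T
eigvals = eigvals[::-1]                      # reorder so lambda_1 > lambda_2 > ...

U_svd, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(sigma) V^T

print(np.allclose(sigma ** 2, eigvals))      # singular values are the square roots of the eigenvalues
```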
28
Computing the Components
Similarly for the next axis, and so on.
So, the new axes are the eigenvectors of the covariance matrix of the original variables, which captures the similarities of the original variables based on how data samples project onto them.
• Geometrically: centering followed by rotation
– a linear transformation
29
PCs, Variance and Least-Squares
The first PC retains the greatest amount of variation in the sample
The kth PC retains the kth greatest fraction of the variation in the sample
The kth largest eigenvalue of the correlation matrix C is the variance in the sample along the kth PC
The least-squares view: PCs are a series of linear least squares fits to a sample, each orthogonal to all previous ones
30
How Many PCs?
For n original dimensions, correlation matrix is nxn, and has up to n eigenvectors. So n PCs.
Where does dimensionality reduction come from?
31
Can ignore the components of lesser significance.
You do lose some information, but if the eigenvalues are small, you don't lose much:
– p dimensions in the original data
– calculate p eigenvectors and eigenvalues
– choose only the first d eigenvectors, based on their eigenvalues
– the final data set has only d dimensions
[Figure: "Dimensionality Reduction" — bar chart of the variance (%) captured by PC1 through PC10]
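A small sketch of this selection step, assuming scikit-learn is available (the 95% variance threshold below is an arbitrary illustrative choice, not part of the slides):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))                     # 200 samples, p = 10 dimensions

ratios = PCA().fit(X).explained_variance_ratio_   # variance fraction per PC, descending

# keep the first d components that together explain at least 95% of the variance
d = int(np.searchsorted(np.cumsum(ratios), 0.95) + 1)
X_reduced = PCA(n_components=d).fit_transform(X)   # final data set has only d dimensions

print(d, X_reduced.shape)
```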
32
Eigenvectors of a Correlation Matrix
33
Geometric picture of principal components
[Figure: data cloud with the first two principal component directions $z_1$ and $z_2$]

• The 1st PC ($z_1$) is a minimum-distance fit to a line in X space.
• The 2nd PC ($z_2$) is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.

PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
34
Optimality property of PCA
Dimension reduction: $Y = G^T X \in \mathbb{R}^{d \times n}$, where $X \in \mathbb{R}^{p \times n}$ is the original data and $G \in \mathbb{R}^{p \times d}$.

Reconstruction: $\tilde{X} = G\,Y = G\,G^T X \in \mathbb{R}^{p \times n}$.
35
Optimality property of PCA
Main theoretical result: the matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following minimization problem:

$$\min_{G \in \mathbb{R}^{p \times d}} \; \| X - G G^T X \|_F^2 \quad \text{subject to} \quad G^T G = I_d$$

where $\| X - G G^T X \|_F^2$ is the reconstruction error.

The PCA projection minimizes the reconstruction error among all linear projections of size d.
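This claim can be checked numerically; a sketch on toy data, comparing the first-d-eigenvector projection against a random orthonormal projection of the same size:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 200))                       # p = 5 variables (rows), n = 200 samples
X = X - X.mean(axis=1, keepdims=True)
d = 2

_, eigvecs = np.linalg.eigh(X @ X.T)                # eigenvectors, ascending eigenvalue order
G_pca = eigvecs[:, ::-1][:, :d]                     # first d eigenvectors of the covariance
G_rand, _ = np.linalg.qr(rng.normal(size=(5, d)))   # a random orthonormal p x d matrix

def reconstruction_error(G):
    return np.linalg.norm(X - G @ G.T @ X, 'fro') ** 2

print(reconstruction_error(G_pca) <= reconstruction_error(G_rand))   # True
```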
36
Applications of PCA
Eigenfaces for recognition. Turk and Pentland. 1991.
Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
37
PCA applications -Eigenfaces
The principal eigenface looks like a bland, androgynous average human face.
http://en.wikipedia.org/wiki/Image:Eigenfaces.png
38
Eigenfaces – Face Recognition
When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face.
Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces
Hence eigenfaces provide a means of applying data compression to faces for identification purposes.
Similarly, Expert Object Recognition in Video
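A hedged sketch of the pipeline just described, with random arrays standing in for a real face dataset (the image size, number of identities, and number of retained eigenfaces are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
faces = rng.random(size=(40, 64 * 64))        # stand-in for 40 gray-scale 64x64 face images
labels = np.repeat(np.arange(10), 4)          # 10 people, 4 images each (hypothetical)

pca = PCA(n_components=20).fit(faces)         # the 20 leading eigenfaces
codes = pca.transform(faces)                  # each face compressed to 20 eigenface weights

probe = pca.transform(faces[:1])              # a face to identify (here: the first image)
nearest = np.argmin(np.linalg.norm(codes - probe, axis=1))
print(labels[nearest])                        # identity of the closest stored face
```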
39
PCA for image compression
[Figure: the original image alongside PCA reconstructions using d = 1, 2, 4, 8, 16, 32, 64, and 100 components]
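A sketch of this kind of compression; a random matrix stands in for the image, and the rows are treated as samples, which is one common way to apply PCA to a single image:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
image = rng.random(size=(100, 100))                # stand-in for a gray-scale image

def compress(img, d):
    # keep only the first d principal components of the rows, then reconstruct
    pca = PCA(n_components=d).fit(img)
    return pca.inverse_transform(pca.transform(img))

for d in (1, 2, 4, 8, 16, 32, 64):
    approx = compress(image, d)
    err = np.linalg.norm(image - approx) / np.linalg.norm(image)
    print(f"d={d}: relative reconstruction error {err:.3f}")
```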
40
Topics today
– Feature reduction algorithms
– Principal component analysis– Canonical correlation analysis– Independent component analysis
41
Canonical correlation analysis (CCA)
CCA was developed first by H. Hotelling.
– H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.
CCA measures the linear relationship between two multidimensional variables.
CCA finds two bases, one for each variable, that are optimal with respect to correlations.
Applications in economics, medical studies, bioinformatics and other areas.
42
Canonical correlation analysis (CCA)
Two multidimensional variables
– Two different measurements on the same set of objects:
  – Web images and associated text
  – Protein (or gene) sequences and related literature (text)
  – Protein sequence and corresponding gene expression
  – In classification: feature vector and class label
– Two measurements on the same object are likely to be correlated. This may not be obvious in the original measurements. Find the maximum correlation in a transformed space.
43
Canonical Correlation Analysis (CCA)
Measurement → transformation → transformed data:

$$W_X^T X \quad \text{and} \quad W_Y^T Y$$

CCA maximizes the correlation between the two transformed data sets.
44
Problem definition
Find two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are maximized.

Given paired observations $(x, y)$, compute two basis vectors $w_x$ and $w_y$ and project: $x \to w_x^T x$, $y \to w_y^T y$.
45
Problem definition
Compute the two basis vectors so that the correlations of the projections onto these vectors are maximized.
46
Algebraic derivation of CCA
The optimization problem is equivalent to

$$\max_{w_x, w_y} \; \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}}$$

where $C_{xx} = X X^T$, $C_{xy} = X Y^T$, $C_{yx} = Y X^T$, and $C_{yy} = Y Y^T$.
47
Geometric interpretation of CCA
The Geometry of CCA
$$a = X^T w_x, \qquad b = Y^T w_y$$
$$\|a\|^2 = w_x^T X X^T w_x = 1, \qquad \|b\|^2 = w_y^T Y Y^T w_y = 1$$
$$\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a^T b = 2 - 2\,\mathrm{corr}(a, b)$$
Maximization of the correlation is equivalent to the minimization of the distance.
48
Algebraic derivation of CCA
The optimization problem is equivalent to

$$\max_{w_x, w_y} \; w_x^T C_{xy} w_y \quad \text{s.t.} \quad w_x^T C_{xx} w_x = 1, \quad w_y^T C_{yy} w_y = 1$$
49
Algebraic derivation of CCA
The solution is given by the generalized eigenvalue problem

$$C_{xy} w_y = \lambda\, C_{xx} w_x, \qquad C_{yx} w_x = \lambda\, C_{yy} w_y$$
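In practice the canonical directions are rarely computed by hand; a short sketch assuming scikit-learn's CCA implementation, on synthetic paired views that share one hidden factor:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(7)
n = 500
shared = rng.normal(size=(n, 1))                   # hidden factor common to both views
X = np.hstack([shared + 0.5 * rng.normal(size=(n, 1)) for _ in range(4)])   # view 1, 4-dim
Y = np.hstack([shared + 0.5 * rng.normal(size=(n, 1)) for _ in range(3)])   # view 2, 3-dim

cca = CCA(n_components=1).fit(X, Y)
a, b = cca.transform(X, Y)                         # projections w_x^T x and w_y^T y
print(np.corrcoef(a[:, 0], b[:, 0])[0, 1])         # the (high) first canonical correlation
```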
50
Applications in bioinformatics
CCA can be extended to multiple views of the data
– multiple (more than 2) data sources

Two different ways to combine different data sources
– Multiple CCA: consider all pairwise correlations
– Integrated CCA: divide into two disjoint sources
51
Applications in bioinformatics
Source: Extraction of Correlated Gene Clusters from Multiple Genomic Data by Generalized Kernel Canonical Correlation Analysis. ISMB’03 http://cg.ensmp.fr/~vert/publi/ismb03/ismb03.pdf
52
Topics today
– Feature reduction algorithms
– Principal component analysis– Canonical correlation analysis– Independent component analysis
53
Independent component analysis (ICA)
[Figure: PCA (orthogonal coordinates) vs. ICA (non-orthogonal coordinates)]
54
Cocktail-party problem
– Multiple sound sources in a room (independent)
– Multiple sensors receiving signals that are mixtures of the original signals
– Estimate the original source signals from the mixture of received signals
– Can be viewed as blind source separation, since the mixing parameters are not known
55
http://www.cis.hut.fi/projects/ica/cocktail/cocktail_en.cgi
DEMO: Blind Source Separation
56
Cocktail party or blind source separation (BSS) problem
– an ill-posed problem, unless assumptions are made!

The most common assumption is that the source signals are statistically independent: knowing the value of one of them gives no information about the others.

Methods based on this assumption are called Independent Component Analysis methods
– statistical techniques for decomposing a complex data set into independent parts.

It can be shown that under some reasonable conditions, if the ICA assumption holds, then the source signals can be recovered up to permutation and scaling.
BSS and ICA
57
Source Separation Using ICA
[Figure: Microphone 1 and Microphone 2 record mixtures of two sources through mixing weights W11, W12, W21, W22; the ICA network outputs Separation 1 and Separation 2]
58
Original signals (hidden sources) s1(t), s2(t), s3(t), s4(t), t=1:T
59
The ICA model
[Figure: four sources s1, …, s4 connected to four mixtures x1, …, x4 through mixing weights a_ij]

$$x_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + a_{i3} s_3(t) + a_{i4} s_4(t), \qquad i = 1, \dots, 4$$

In vector-matrix notation, and dropping the index t, this is $x = A s$.
60
This is recorded by the microphones: a linear mixture of the sources
$$x_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + a_{i3} s_3(t) + a_{i4} s_4(t)$$
61
Recovered signals
62
BSS
If we knew the mixing parameters $a_{ij}$, we would just need to solve a linear system of equations.
We know neither $a_{ij}$ nor $s_i$.
ICA was initially developed to deal with problems closely related to the cocktail-party problem.
Later it became evident that ICA has many other applications
– e.g., recovering underlying components of brain activity from electrical recordings at different locations on the scalp (EEG signals).

Problem: determine the source signals s, given only the mixtures x.
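A minimal sketch of the problem and one standard solver, assuming scikit-learn's FastICA is available (the two sources and the mixing matrix below are synthetic stand-ins):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(8)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # true sources: sinusoid + square wave

A = np.array([[1.0, 0.5],
              [0.5, 2.0]])                         # unknown mixing matrix
X = S @ A.T + 0.05 * rng.normal(size=S.shape)      # observed mixtures x = A s (plus noise)

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                       # sources recovered up to permutation/scaling
A_hat = ica.mixing_                                # estimate of the mixing matrix
print(S_hat.shape, A_hat.shape)
```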
63
ICA Solution and Applicability
ICA is a statistical method, the goal of which is to decompose given multivariate data into a linear sum of statistically independent components.
For example, given the two-dimensional vector $x = [x_1 \; x_2]^T$, ICA aims at finding the decomposition

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} \\ a_{21} \end{bmatrix} s_1 + \begin{bmatrix} a_{12} \\ a_{22} \end{bmatrix} s_2, \qquad \text{i.e. } x = a_1 s_1 + a_2 s_2$$

where $a_1, a_2$ are basis vectors and $s_1, s_2$ are basis coefficients.

Constraint: the basis coefficients $s_1$ and $s_2$ are statistically independent.
64
Approaches to ICA
Unsupervised learning
– Factorial coding (minimum entropy coding, redundancy reduction)
– Maximum likelihood learning
– Nonlinear information maximization (entropy maximization)
– Negentropy maximization
– Bayesian learning

Statistical signal processing
– Higher-order moments or cumulants
– Joint approximate diagonalization
– Maximum likelihood estimation
65
Application domains of ICA
– Blind source separation
– Image denoising
– Medical signal processing: fMRI, ECG, EEG
– Modelling of the hippocampus and visual cortex
– Feature extraction, face recognition
– Compression, redundancy reduction
– Watermarking
– Clustering
– Time series analysis (stock market, microarray data)
– Topic extraction
– Econometrics: finding hidden factors in financial data
66
Image denoising
[Figure: original image, noisy image, and denoised results from Wiener filtering vs. ICA filtering]