Object Orie’d Data Analysis, Last Time • Gene Cell Cycle Data • Microarrays and HDLSS visualization • DWD bias adjustment • NCI 60 Data Today: More NCI 60 Data & Detailed (math’cal) look at PCA

Upload: april-jordan

Post on 27-Dec-2015



Object Orie’d Data Analysis, Last Time

• Gene Cell Cycle Data

• Microarrays and HDLSS visualization

• DWD bias adjustment

• NCI 60 Data

Today: More NCI 60 Data &

Detailed (math’cal) look at PCA


Last Time: Checked Data Combo, using DWD Dir’ns


DWD Views of NCI 60 Data

Interesting Question:

Which clusters are really there?

Issues:

• DWD great at finding dir’ns of separation

• And will do so even if no real structure

• Is this happening here?

• Or: which clusters are important?

• What does “important” mean?


Real Clusters in NCI 60 Data

Simple Visual Approach:

• Randomly relabel data (Cancer Types)

• Recompute DWD dir’ns & visualization

• Get heuristic impression from this

Deeper Approach

• Formal Hypothesis Testing

(Done later)
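
The random relabelling check can be sketched on synthetic data. This is only an illustration under loud assumptions: the data are pure Gaussian noise (not the NCI 60 data), the labels are assigned at random, and the mean difference direction stands in for DWD, which needs a dedicated optimizer. The names `separating_direction` and `X_noise` are hypothetical.

```python
import numpy as np

def separating_direction(X, labels):
    """Unit vector along the difference of the two class means.

    Assumption: a simple stand-in for the DWD direction.
    X has one column per data object (d x n); labels are 0/1.
    """
    diff = X[:, labels == 1].mean(axis=1) - X[:, labels == 0].mean(axis=1)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
d, n = 2000, 40                         # HDLSS: dimension far above sample size
X_noise = rng.standard_normal((d, n))   # no real cluster structure at all

# Randomly relabel: half the objects get label 1, the rest label 0
labels = rng.permutation(np.repeat([0, 1], n // 2))

w = separating_direction(X_noise, labels)
proj = w @ X_noise                      # 1-d projections onto the direction

# Positive gap: every label-1 projection lies above every label-0 projection
gap = proj[labels == 1].min() - proj[labels == 0].max()
print("gap between the projected classes:", gap)
```

A positive gap means the two random classes look cleanly separated in this view although the data contain no structure, which is why several relabellings (or a formal test) are needed before trusting a cluster.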


Random Relabelling #1


Random Relabelling #2


Random Relabelling #3


Random Relabelling #4


Revisit Real Data


Revisit Real Data (Cont.)

Heuristic Results:

Strong Clust’s Weak Clust’s Not Clust’s

Melanoma C N S NSCLC

Leukemia Ovarian Breast

Renal Colon

Later: will find way to quantify these ideas

i.e. develop statistical significance


NCI 60 Controversy

• Can NCI 60 Data be normalized?

• Negative Indication:

• Kuo, et al (2002) Bioinformatics, 18, 405-412.

– Based on Gene by Gene Correlations

• Resolution: Gene by Gene Data View vs. Multivariate Data View


Resolution of Paradox: Toy Data, Gene View


Resolution: Correlations suggest “no chance”


Resolution: Toy Data, PCA View


Resolution: PCA & DWD direct’ns


Resolution: DWD Adjusted


Resolution: DWD Adjusted, PCA view


Resolution: DWD Adjusted, Gene view


Resolution: Correlations & PC1 Projection Correl’n


Needed final verification of Cross-platform Normal’n

• Is statistical power actually improved?

• Will study later


DWD: Why does it work?

Rob Tibshirani Query:

• Really need that complicated stuff? (DWD is complex)

• Can’t we just use means?

• Empirical Fact (Joel Parker): (DWD better than simple methods)


DWD: Why does it work?

Xuxin Liu Observation:

• Key is unbalanced sub-sample sizes (e.g. biological subtypes)

• Mean methods strongly affected

• DWD much more robust

• Toy Example


DWD: Why does it work?


Xuxin Liu Example

• Goals:

– Bring colors together

– Keep symbols distinct (interesting biology)

• Study varying sub-sample proportions:

– Ratio = 1: Both methods great

– Ratio = 0.61: Mean degrades, DWD good

– Ratio = 0.35: Mean poor, DWD still OK

– Ratio = 0.11: DWD degraded, still better

• Later: will find underlying theory
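
Why unbalanced sub-sample sizes hurt a mean based adjustment can be seen in a minimal sketch. Everything here is assumed for illustration: two batches (the colors) differ by a shift along the first coordinate, two subtypes (the symbols) differ along the second coordinate, there is no noise, and the helpers `make_batch`, `mean_adjust` and `subtype_misalignment` and the proportions used are hypothetical, not the settings of the toy example in the slides. DWD itself is not implemented; only the behaviour of the mean method is shown.

```python
import numpy as np

def make_batch(n, frac_subtype1, batch_shift):
    """Noise-free toy batch: rows are cases, columns are 2 coordinates."""
    n1 = int(round(n * frac_subtype1))
    subtype = np.r_[np.ones(n1, dtype=int), np.zeros(n - n1, dtype=int)]
    X = np.zeros((n, 2))
    X[:, 0] = batch_shift                          # batch effect (to be removed)
    X[:, 1] = np.where(subtype == 1, 1.0, -1.0)    # biology (to be kept)
    return X, subtype

def mean_adjust(X):
    """Mean method: shift the batch so that its mean sits at the origin."""
    return X - X.mean(axis=0)

def subtype_misalignment(frac_a, frac_b, n=100):
    """Distance between the subtype-1 centroids of two mean-adjusted batches."""
    Xa, sa = make_batch(n, frac_a, batch_shift=0.0)
    Xb, sb = make_batch(n, frac_b, batch_shift=5.0)
    ca = mean_adjust(Xa)[sa == 1].mean(axis=0)
    cb = mean_adjust(Xb)[sb == 1].mean(axis=0)
    return np.linalg.norm(ca - cb)

balanced = subtype_misalignment(0.5, 0.5)
unbalanced = subtype_misalignment(0.5, 0.9)
print(balanced, unbalanced)
```

With equal proportions the mean method removes the batch shift and keeps each subtype in the same place. With unequal proportions the batch means also absorb part of the subtype difference, so the same subtype lands in different places in the two batches.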


PCA: Rediscovery – Renaming

Statistics: Principal Component Analysis (PCA)

Social Sciences: Factor Analysis (PCA is a subset)

Probability / Electrical Eng: Karhunen – Loeve expansion

Applied Mathematics: Proper Orthogonal Decomposition (POD)

Geo-Sciences: Empirical Orthogonal Functions (EOF)


An Interesting Historical Note

The 1st (?) application of PCA to Functional Data Analysis:

Rao, C. R. (1958) Some statistical methods for comparison of growth curves, Biometrics, 14, 1-17.

1st Paper with “Curves as Data” viewpoint


Detailed Look at PCA

Three important (and interesting) viewpoints:

1. Mathematics

2. Numerics

3. Statistics

1st: Review linear alg. and multivar. prob.


Review of Linear Algebra

Vector Space:

• set of “vectors”, $x$,

• and “scalars” (coefficients), $a$

• “closed” under “linear combination”

($\sum_i a_i x_i$ in space)

e.g. $\mathbb{R}^d = \left\{ x = (x_1, \dots, x_d)^t : x_1, \dots, x_d \in \mathbb{R} \right\}$,

“$d$ dim Euclid’n space”


Review of Linear Algebra (Cont.)

Subspace:

• subset that is again a vector space

• i.e. closed under linear combination

• e.g. lines through the origin

• e.g. planes through the origin

• e.g. subsp. “generated by” a set of vectors (all linear combos of them = containing hyperplane through origin)


Review of Linear Algebra (Cont.)

Basis of subspace: set of vectors that:

• span, i.e. everything is a lin. com. of them

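• are linearly indep’t, i.e. lin. com. is unique

• e.g. “unit vector basis” of $\mathbb{R}^d$: $(1, 0, \dots, 0)^t, (0, 1, \dots, 0)^t, \dots, (0, 0, \dots, 1)^t$

• since $(x_1, x_2, \dots, x_d)^t = x_1 (1, 0, \dots, 0)^t + x_2 (0, 1, \dots, 0)^t + \cdots + x_d (0, 0, \dots, 1)^t$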


Review of Linear Algebra (Cont.)

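Basis Matrix, of subspace of $\mathbb{R}^d$

Given a basis, $v_1, \dots, v_n$,

create matrix of columns:

$B = \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix} = \begin{pmatrix} v_{11} & \cdots & v_{n1} \\ \vdots & \ddots & \vdots \\ v_{1d} & \cdots & v_{nd} \end{pmatrix}_{d \times n}$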


Review of Linear Algebra (Cont.)

Then “linear combo” is a matrix multiplicat’n:

$\sum_{i=1}^{n} a_i v_i = B a$ where $a = (a_1, \dots, a_n)^t$

Check sizes: $d \times 1 = (d \times n)(n \times 1)$
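
A small numerical check that a linear combination of basis vectors is the same thing as multiplying the basis matrix by the coefficient vector. The vectors `v1`, `v2` and the coefficients are made up for illustration.

```python
import numpy as np

# Hypothetical basis of a 2-dim subspace of R^3, stored as columns of B (d x n)
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 2.0, 1.0])
B = np.column_stack([v1, v2])        # shape (3, 2), i.e. d x n
a = np.array([3.0, -1.0])            # coefficients, length n

combo_sum = a[0] * v1 + a[1] * v2    # sum_i a_i v_i
combo_mat = B @ a                    # B a

print(B.shape, a.shape, combo_mat.shape)
print(combo_sum, combo_mat)
```

The shapes follow the size check: a (3 x 2) matrix times a length 2 vector gives a length 3 vector.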


Review of Linear Algebra (Cont.)

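Aside on matrix multiplication: (linear transformat’n)

For matrices

$A = \begin{pmatrix} a_{1,1} & \cdots & a_{1,m} \\ \vdots & \ddots & \vdots \\ a_{k,1} & \cdots & a_{k,m} \end{pmatrix}$, $B = \begin{pmatrix} b_{1,1} & \cdots & b_{1,n} \\ \vdots & \ddots & \vdots \\ b_{m,1} & \cdots & b_{m,n} \end{pmatrix}$

Define the “matrix product”

$AB = \begin{pmatrix} \sum_{i=1}^{m} a_{1,i} b_{i,1} & \cdots & \sum_{i=1}^{m} a_{1,i} b_{i,n} \\ \vdots & \ddots & \vdots \\ \sum_{i=1}^{m} a_{k,i} b_{i,1} & \cdots & \sum_{i=1}^{m} a_{k,i} b_{i,n} \end{pmatrix}$

(“inner products” of rows of $A$ with columns of $B$)

(composition of linear transformations)

Often useful to check sizes: $k \times n = (k \times m)(m \times n)$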


Review of Linear Algebra (Cont.)

Matrix trace:

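• For a square matrix $A = \begin{pmatrix} a_{1,1} & \cdots & a_{1,m} \\ \vdots & \ddots & \vdots \\ a_{m,1} & \cdots & a_{m,m} \end{pmatrix}$

• Define $\operatorname{tr}(A) = \sum_{i=1}^{m} a_{i,i}$

• Trace commutes with matrix multiplication: $\operatorname{tr}(AB) = \operatorname{tr}(BA)$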
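
The size check for matrix products and the trace identity can be verified numerically. The matrices below are random and purely illustrative; note that the two products have different sizes and still share the same trace.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 2, 4
A = rng.standard_normal((k, m))      # k x m
B = rng.standard_normal((m, k))      # m x k

AB = A @ B                           # (k x m)(m x k) = k x k
BA = B @ A                           # (m x k)(k x m) = m x m

# Entry (0, 1) of AB is the inner product of row 0 of A with column 1 of B
entry = sum(A[0, i] * B[i, 1] for i in range(m))

print(AB.shape, BA.shape)
print(np.trace(AB), np.trace(BA))
```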


Review of Linear Algebra (Cont.)

Dimension of subspace (a notion of “size”):

• number of elements in a basis (unique)

• $\dim(\mathbb{R}^d) = d$ (use basis above)

• e.g. dim of a line is 1

• e.g. dim of a plane is 2

• dimension is “degrees of freedom”


Review of Linear Algebra (Cont.)

Norm of a vector:

• in $\mathbb{R}^d$, $\|x\| = \left( \sum_{j=1}^{d} x_j^2 \right)^{1/2} = (x^t x)^{1/2}$

• Idea: “length” of the vector

• Note: strange properties for high $d$,

e.g. “length of diagonal of unit cube” = $\sqrt{d}$


Review of Linear Algebra (Cont.)

Norm of a vector (cont.):

• “length normalized vector”: $x / \|x\|$

(has length one, thus on surf. of unit sphere & is a direction vector)

• get “distance” as: $d(x, y) = \|x - y\| = \sqrt{(x - y)^t (x - y)}$


Review of Linear Algebra (Cont.)

Inner (dot, scalar) product:

• for vectors $x$ and $y$, $\langle x, y \rangle = \sum_{j=1}^{d} x_j y_j = x^t y$

• related to norm, via $\|x\| = \sqrt{\langle x, x \rangle} = \sqrt{x^t x}$


Review of Linear Algebra (Cont.)

Inner (dot, scalar) product (cont.):

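• measures “angle between $x$ and $y$” as:

$\operatorname{angle}(x, y) = \cos^{-1}\left( \frac{\langle x, y \rangle}{\|x\| \, \|y\|} \right) = \cos^{-1}\left( \frac{x^t y}{\sqrt{x^t x} \sqrt{y^t y}} \right)$

• key to “orthogonality”, i.e. “perpendicul’ty”:

$x \perp y$ if and only if $\langle x, y \rangle = 0$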
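
Norm, distance, and angle all come from the inner product, which a few lines of NumPy make concrete. The helper names `norm` and `angle` and the example vectors are chosen here for illustration only.

```python
import numpy as np

def norm(x):
    """Euclidean length (x^t x)^(1/2)."""
    return np.sqrt(x @ x)

def angle(x, y):
    """Angle between x and y in radians, from the inner product."""
    c = (x @ y) / (norm(x) * norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))   # clip guards against rounding

d = 100
diagonal = np.ones(d)                # diagonal of the unit cube in R^d
x = np.array([1.0, 1.0, 0.0])
y = np.array([1.0, -1.0, 0.0])

print(norm(diagonal))                # sqrt(d), grows with the dimension
print(norm(x - y))                   # distance d(x, y)
print(norm(x / norm(x)))             # length normalized vector has length one
print(angle(x, y))                   # <x, y> = 0, so the angle is a right angle
```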


Review of Linear Algebra (Cont.)

Orthonormal basis $v_1, \dots, v_n$:

• All ortho to each other, i.e. $\langle v_i, v_{i'} \rangle = 0$, for $i \neq i'$

• All have length 1, i.e. $\langle v_i, v_i \rangle = 1$, for $i = 1, \dots, n$


Review of Linear Algebra (Cont.)

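Orthonormal basis $v_1, \dots, v_n$ (cont.):

• “Spectral Representation”: $x = \sum_{i=1}^{n} a_i v_i$ where $a_i = \langle x, v_i \rangle$

check: $\langle x, v_i \rangle = \left\langle \sum_{i'=1}^{n} a_{i'} v_{i'}, v_i \right\rangle = \sum_{i'=1}^{n} a_{i'} \langle v_{i'}, v_i \rangle = a_i$

• Matrix notation: $x = B a$ where $a^t = x^t B$, i.e. $a = B^t x$

$a$ is called “transform (e.g. Fourier, wavelet) of $x$”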


Review of Linear Algebra (Cont.)

Parseval identity, for $x$ in subsp. gen’d by o. n. basis $v_1, \dots, v_n$:

$\|x\|^2 = \sum_{i=1}^{n} \langle x, v_i \rangle^2 = \sum_{i=1}^{n} a_i^2 = \|a\|^2$

• Pythagorean theorem

• “Decomposition of Energy”

• ANOVA - sums of squares

• Transform, $a$, has same length as $x$, i.e. “rotation in $\mathbb{R}^d$”
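
The spectral representation and the Parseval identity can be checked with any orthonormal basis. The sketch below assumes a rotated unit vector basis of the plane as a hypothetical example; `theta` and `x` are arbitrary.

```python
import numpy as np

# Hypothetical orthonormal basis of R^2: a rotation of the unit vector basis
theta = 0.3
B = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # columns are v_1, v_2

x = np.array([2.0, -1.0])
a = B.T @ x                          # transform: coefficients a_i = <x, v_i>
x_back = B @ a                       # spectral representation sum_i a_i v_i

print(a, x_back)
print(x @ x, a @ a)                  # Parseval: ||x||^2 and ||a||^2 agree
```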


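Review of Linear Algebra (Cont.)

Gram-Schmidt Ortho-normalization

Idea: Given a basis $v_1, \dots, v_n$, find an orthonormal version, by subtracting non-ortho part

$u_1 = v_1 / \|v_1\|$

$u_2 = (v_2 - \langle v_2, u_1 \rangle u_1) / \|v_2 - \langle v_2, u_1 \rangle u_1\|$

$u_3 = (v_3 - \langle v_3, u_1 \rangle u_1 - \langle v_3, u_2 \rangle u_2) / \|v_3 - \langle v_3, u_1 \rangle u_1 - \langle v_3, u_2 \rangle u_2\|$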
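
The Gram-Schmidt recursion can be written as a short loop. This is a minimal sketch of the classical version for teaching purposes, assuming linearly independent columns; the function name `gram_schmidt` and the example matrix are hypothetical. For numerical work a QR factorization such as `np.linalg.qr` is the usual choice.

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the columns of V (assumed linearly independent).

    Classical Gram-Schmidt: subtract from each v_k its components along
    the already built u_1, ..., u_{k-1}, then rescale to length one.
    """
    d, n = V.shape
    U = np.zeros((d, n))
    for k in range(n):
        r = V[:, k].copy()
        for j in range(k):
            r -= (V[:, k] @ U[:, j]) * U[:, j]   # remove non-ortho part
        U[:, k] = r / np.linalg.norm(r)
    return U

V = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])      # hypothetical basis, one vector per column
U = gram_schmidt(V)
print(np.round(U.T @ U, 12))         # identity matrix up to rounding
```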


Review of Linear Algebra (Cont.)

Projection of a vector $x$ onto a subspace $V$:

• Idea: member of $V$ that is closest to $x$ (i.e. “approx’n”)

• Find $P_V x \in V$ that solves: $\min_{v \in V} \|x - v\|$ (“least squares”)

• For inner product (Hilbert) space: $P_V x$ exists and is unique


Review of Linear Algebra (Cont.)

Projection of a vector onto a subspace (cont.):

• General solution in $\mathbb{R}^d$: for basis matrix $B_V$, $P_V x = B_V (B_V^t B_V)^{-1} B_V^t x$

• So “proj’n operator” is “matrix mult’n”: $P_V = B_V (B_V^t B_V)^{-1} B_V^t$

(thus projection is another linear operation)

(note same operation underlies least squares)
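
The projection formula and its link to least squares can be checked directly. The basis matrix `B_V` and the vector `x` below are hypothetical, and `projection_matrix` is a name chosen for this sketch; it assumes the basis matrix has full column rank.

```python
import numpy as np

def projection_matrix(B):
    """P_V = B (B^t B)^{-1} B^t for a basis matrix B (d x n, full column rank)."""
    return B @ np.linalg.solve(B.T @ B, B.T)

B_V = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])         # hypothetical basis of a plane in R^3
x = np.array([1.0, 2.0, 3.0])

P_V = projection_matrix(B_V)
proj = P_V @ x

# Same answer from least squares: minimize ||x - B_V c|| over c
coef, *_ = np.linalg.lstsq(B_V, x, rcond=None)

print(proj, B_V @ coef)
print(B_V.T @ (x - proj))            # residual is orthogonal to V
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is a design choice of this sketch rather than something stated in the slides.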


Review of Linear Algebra (Cont.)

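Projection using orthonormal basis $v_1, \dots, v_n$:

• Basis matrix is “orthonormal”:

$B_V^t B_V = \begin{pmatrix} v_1^t \\ \vdots \\ v_n^t \end{pmatrix} \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix} = \begin{pmatrix} \langle v_1, v_1 \rangle & \cdots & \langle v_1, v_n \rangle \\ \vdots & \ddots & \vdots \\ \langle v_n, v_1 \rangle & \cdots & \langle v_n, v_n \rangle \end{pmatrix} = \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix} = I_{n \times n}$

• So $P_V x = B_V B_V^t x$ = Recon(Coeffs of $x$ “in $V$ dir’n”)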


Review of Linear Algebra (Cont.)

Projection using orthonormal basis (cont.):

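• For “orthogonal complement”, $V^\perp$,

$x = P_V x + P_{V^\perp} x$ and $\|x\|^2 = \|P_V x\|^2 + \|P_{V^\perp} x\|^2$

• Parseval inequality:

$\|x\|^2 \ge \|P_V x\|^2 = \sum_{i=1}^{n} \langle x, v_i \rangle^2 = \sum_{i=1}^{n} a_i^2 = \|a\|^2$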
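
With an orthonormal basis the projection needs no matrix inverse, and the squared lengths split as in the Pythagorean theorem. The plane and the vector below are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical orthonormal basis of a plane V in R^3 (columns of B_V)
B_V = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.0]])
x = np.array([3.0, 4.0, 12.0])

a = B_V.T @ x                        # coefficients of x in the V directions
P_x = B_V @ a                        # projection onto V: B_V B_V^t x
P_perp_x = x - P_x                   # projection onto the orthogonal complement

print(a @ a, P_x @ P_x)              # ||a||^2 equals ||P_V x||^2
print(P_x @ P_x + P_perp_x @ P_perp_x, x @ x)   # squared lengths add up
```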


Review of Linear Algebra (Cont.)

(Real) Unitary Matrices: $U_{d \times d}$ with $U^t U = I$

• Orthonormal basis matrix (so all of above applies)

• Follows that $U U^t = I$ (since have full rank, so $U^{-1}$ exists …)

• Lin. trans. (mult. by $U$) is like “rotation” of $\mathbb{R}^d$

• But also includes “mirror images”
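
A rotation and a mirror image in the plane illustrate both kinds of real unitary matrix. The two matrices are hypothetical examples; the determinant is used here only as a convenient way to tell them apart.

```python
import numpy as np

theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
mirror = np.array([[1.0,  0.0],
                   [0.0, -1.0]])     # reflection across the first axis

x = np.array([3.0, 4.0])
for U in (rotation, mirror):
    print(np.allclose(U.T @ U, np.eye(2)),   # U^t U = I
          np.allclose(U @ U.T, np.eye(2)),   # U U^t = I
          np.linalg.norm(U @ x),             # length is preserved
          np.linalg.det(U))                  # +1 for rotation, -1 for mirror
```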


Review of Linear Algebra (Cont.)

Singular Value Decomposition (SVD):

For a matrix $X_{d \times n}$

Find a diagonal matrix $S_{d \times n}$, with entries $s_1, \dots, s_{\min(d,n)}$ called singular values

And unitary (rotation) matrices $U_{d \times d}$, $V_{n \times n}$ (recall $U^t U = V^t V = I$)

so that $X = U S V^t$


Review of Linear Algebra (Cont.)

Intuition behind Singular Value Decomposition:

• For a “linear transf’n” $X$ (via matrix multi’n): $X v = U S V^t v = U (S (V^t v))$

• First rotate

• Second rescale coordinate axes (by $s_i$)

• Third rotate again

• i.e. have diagonalized the transformation
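
The rotate, rescale, rotate reading of the SVD can be followed step by step with `np.linalg.svd`. The matrix and vector are random and illustrative; NumPy returns the singular values as a vector, so the diagonal $d \times n$ matrix is rebuilt by hand.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 3
X = rng.standard_normal((d, n))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: d x d, Vt: n x n
S = np.zeros((d, n))
S[:len(s), :len(s)] = np.diag(s)                  # diagonal d x n matrix

v = rng.standard_normal(n)
step1 = Vt @ v                       # first rotate
step2 = S @ step1                    # second rescale coordinate axes by s_i
step3 = U @ step2                    # third rotate again

print(np.allclose(U @ S @ Vt, X), np.allclose(step3, X @ v))
```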


Review of Linear Algebra (Cont.)

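SVD Compact Representation:

Useful Labeling: Singular Values in Decreasing Order $s_1 \ge \cdots \ge s_{\min(n,d)}$

Note: singular values = 0 can be omitted

Let $r$ = # of positive singular values

Then: $X = U_{d \times r} S_{r \times r} V_{n \times r}^t$

Where $U_{d \times r}, S_{r \times r}, V_{n \times r}$ are truncations of $U, S, V$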
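
The compact representation keeps only the columns that belong to positive singular values. A minimal sketch, assuming a small matrix built to have rank 2 and a hypothetical relative threshold `tol` for deciding which computed singular values count as zero:

```python
import numpy as np

# Hypothetical rank 2 matrix: the third column is the sum of the first two
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(X)          # s is sorted from largest to smallest
tol = 1e-10 * s.max()                # assumed threshold for "numerically zero"
r = int(np.sum(s > tol))             # number of positive singular values

U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]
print(r, U_r.shape, S_r.shape, Vt_r.shape)
print(np.allclose(U_r @ S_r @ Vt_r, X))
```

Because the singular values come sorted from largest to smallest, truncating to the first `r` columns is all that is needed.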


Review of Linear Algebra (Cont.)

Eigenvalue Decomposition:

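For a (symmetric) square matrix $X_{d \times d}$

Find a diagonal matrix $D = \begin{pmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_d \end{pmatrix}$

And an orthonormal matrix $B_{d \times d}$ (i.e. $B^t B = B B^t = I_{d \times d}$)

So that: $X B = B D$, i.e. $X = B D B^t$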
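
For a symmetric matrix the decomposition can be computed with `np.linalg.eigh`, which returns the eigenvalues in ascending order together with an orthonormal matrix of eigenvectors. The 2 x 2 matrix is a hypothetical example.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # hypothetical symmetric matrix

eigvals, B = np.linalg.eigh(X)       # eigh is meant for symmetric matrices
D = np.diag(eigvals)

print(eigvals)
print(np.allclose(X @ B, B @ D), np.allclose(X, B @ D @ B.T))
```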


Review of Linear Algebra (Cont.)

Eigenvalue Decomposition (cont.):

• Relation to Singular Value Decomposition (looks similar?):

• Eigenvalue decomposition “harder”

• Since needs $U = V$

• Price is eigenvalue decomp’n is generally complex

• Except for $X$ square and symmetric

• Then eigenvalue decomp. is real valued

• Thus is the sing’r value decomp. with: $U = V = B$
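
The link between the two decompositions can be checked numerically. The sketch assumes a symmetric matrix with non-negative eigenvalues, built here as `A @ A.T` from a random hypothetical `A`; in that case the singular values equal the eigenvalues and the two rotation matrices coincide.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))
X = A @ A.T                          # symmetric, non-negative definite (3 x 3)

eigvals, B = np.linalg.eigh(X)       # ascending eigenvalues
U, s, Vt = np.linalg.svd(X)          # descending singular values

print(np.sort(eigvals)[::-1], s)     # same numbers, listed largest first
print(np.allclose(U, Vt.T))          # U = V up to rounding
```

For a symmetric matrix with negative eigenvalues the singular values are the absolute eigenvalues, and the corresponding columns of $U$ and $V$ differ in sign.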