
Page 1: Jason Morton - UCL risi/AML08/Morton-AML08.pdf

Algebraic models for multilinear dependence

Jason Morton

Stanford University

December 11, 2008, NIPS

Joint work with Lek-Heng Lim of Berkeley

J. Morton (Stanford University) · Algebraic models for multilinear dependence · December 11, 2008, NIPS · 1 / 27

Page 2

Univariate cumulants

Mean, variance, skewness and kurtosis describe the shape of a univariate distribution.

Page 3

Covariance matrices

The covariance matrix partly describes the dependence structure of a multivariate distribution.

PCA

Gaussian graphical models

Optimization: the bilinear form w⊤K₂w computes the variance of any linear combination w⊤x

But if the variables are not multivariate Gaussian, the covariance matrix is not the whole story.

Page 4

Even if the marginals are normal, the dependence might not be

[Figure: scatter plot titled "1000 Simulated Clayton(3)-Dependent N(0,1) Values"; axes X1 ∼ N(0,1) (horizontal, −5 to 5) and X2 ∼ N(0,1) (vertical, −5 to 5).]

Page 5

Covariance matrix analogs: multivariate cumulants

The cumulant tensors are the multivariate analog of skewness and kurtosis.

They describe higher order dependence among random variables.

1 Definitions: tensors and cumulants

2 Properties of cumulant tensors

3 Insights and models from algebraic geometry

4 Algorithms from Riemannian geometry

5 Potential applications

Page 6

Symmetric tensors and actions

A tensor in coordinates is a multi-way array with a multilinear action.

A tensor ⟦a_ijk⟧ ∈ R^{r×r×r} is symmetric if it is invariant under all permutations of its indices:

a_ijk = a_ikj = a_jik = a_jki = a_kij = a_kji.

Comes with an action:

Symmetric multilinear matrix multiplication: if Q is an n × r matrix and T an r × r × r tensor, form an n × n × n tensor K = (Q, Q, Q) · T, or just Q · T, where

K_αβγ = Σ_{i,j,k=1}^{r} q_αi q_βj q_γk t_ijk.

If T is r × r and Q is n × r, we have Q · T = Q T Q⊤; for d > 2, multiply on the 3rd, 4th, . . . "sides" of the multi-way array.
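In NumPy the action (Q, Q, Q) · T is a single `einsum` call; a minimal sketch, with illustrative names (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 3, 5

# A symmetric r x r x r tensor: average over all six index permutations
T = rng.standard_normal((r, r, r))
T = sum(T.transpose(p) for p in
        [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]) / 6

Q = rng.standard_normal((n, r))

# K_{abc} = sum_{ijk} Q_{ai} Q_{bj} Q_{ck} T_{ijk}  -- the action (Q,Q,Q).T
K = np.einsum('ai,bj,ck,ijk->abc', Q, Q, Q, T)

# Matrix case (d = 2): the same action reduces to Q M Q^T
M = rng.standard_normal((r, r))
assert np.allclose(np.einsum('ai,bj,ij->ab', Q, Q, M), Q @ M @ Q.T)
```

Symmetry survives the action: K inherits invariance under index permutations because all three multiplying matrices are equal.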

Page 7

Moments and Cumulants are symmetric tensors

Vector-valued random variable x = (X1, . . . , Xn). Three natural d-way tensors are:

The dth non-central moment s_{i1···id} of x:

S_d(x) = [ E(x_{i1} x_{i2} · · · x_{id}) ]_{i1,...,id=1}^{n}.

The dth central moment S_d(x − E[x]), and

the dth cumulant κ_{i1···id} of x:

K_d(x) = [ Σ_{A1 ⊔ ··· ⊔ Aq = {i1,...,id}} (−1)^{q−1} (q − 1)! s_{A1} · · · s_{Aq} ]_{i1,...,id=1}^{n},

where the sum runs over all partitions of {i1, . . . , id} into nonempty blocks A1, . . . , Aq.
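For d = 3 the partition sum has just three block shapes, giving κ_ijk = s_ijk − s_ij s_k − s_ik s_j − s_jk s_i + 2 s_i s_j s_k, which coincides with the third central moment. A quick numerical check of that identity (NumPy, with sample moments in place of population ones):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((10000, 2)) ** 3   # a non-Gaussian 2-d sample

m1 = x.mean(axis=0)                                # s_i
m2 = np.einsum('ti,tj->ij', x, x) / len(x)         # s_ij
m3 = np.einsum('ti,tj,tk->ijk', x, x, x) / len(x)  # s_ijk

# kappa_ijk = s_ijk - s_ij s_k - s_ik s_j - s_jk s_i + 2 s_i s_j s_k
k3 = (m3
      - np.einsum('ij,k->ijk', m2, m1)
      - np.einsum('ik,j->ijk', m2, m1)
      - np.einsum('jk,i->ijk', m2, m1)
      + 2 * np.einsum('i,j,k->ijk', m1, m1, m1))

# For d = 3 the cumulant equals the third central moment -- exact identity
c = x - m1
m3_central = np.einsum('ti,tj,tk->ijk', c, c, c) / len(x)
assert np.allclose(k3, m3_central)
```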

Page 8

Measuring useful properties.

For univariate x, the cumulants K_d(x) for d = 1, 2, 3, 4 give

expectation κ_i = E[x],

variance κ_ii = σ²,

skewness κ_iii / κ_ii^{3/2}, and

kurtosis κ_iiii / κ_ii².

The tensor versions κ_ijk, κ_ijkl, . . . are the multivariate generalizations; they provide a natural measure of non-Gaussianity.
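A univariate sanity check of these ratios (NumPy): for an Exp(1) sample the cumulants are κ2 = 1, κ3 = 2, κ4 = 6, so the skewness should come out near 2 and the kurtosis near 6.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(size=100_000)   # Exp(1): skewness 2, excess kurtosis 6

c = x - x.mean()
k2 = np.mean(c**2)                  # variance, kappa_ii
k3 = np.mean(c**3)                  # third cumulant = third central moment
k4 = np.mean(c**4) - 3 * k2**2      # fourth cumulant

skew = k3 / k2**1.5
kurt = k4 / k2**2
```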

Page 9

Alternative Definitions of Cumulants

In terms of the log characteristic function,

κ_{j1···jd}(x) = (−i)^d ∂^d / (∂t_{j1} · · · ∂t_{jd}) log E(exp(i⟨t, x⟩)) |_{t=0}.

In terms of the Edgeworth series,

log E(exp(i⟨t, x⟩)) = Σ_α i^{|α|} κ_α(x) t^α / α!,

where α = (α1, . . . , αn) is a multi-index, t^α = t1^{α1} · · · tn^{αn}, and α! = α1! · · · αn!.

Page 10

Properties of cumulants: Multilinearity

Multilinearity: if x is an R^n-valued random variable and A ∈ R^{m×n}, then

K_d(Ax) = A · K_d(x),

where · is the multilinear action.

This makes factor models work: y = Ax implies K_d^y = A · K_d^x; for example, K_2^y = A K_2^x A⊤.

Independent Components Analysis finds an A to approximately diagonalize K_d^x.

Page 11

Properties of cumulants: Independence

Independence:

If x1, . . . , xk are mutually independent of y1, . . . , yk, we have K_d(x1 + y1, . . . , xk + yk) = K_d(x1, . . . , xk) + K_d(y1, . . . , yk).

K_{i1,...,in}(x) = 0 whenever there is a partition of {i1, . . . , in} into two nonempty sets I and J such that x_I and x_J are independent.

This is why we want to diagonalize in independent component analysis

Exploitable in other sparse cumulant techniques

Page 12

Properties of cumulants: Vanishing and Extending

Gaussian: If x is multivariate normal, then K_d(x) = 0 for all d ≥ 3.

▶ Why you might not have heard of them: for Gaussians, the covariance matrix does tell the whole story.

Support: there is no distribution with a finite bound n ≥ 3 such that

K_d(x) ≠ 0 for 3 ≤ d ≤ n, and K_d(x) = 0 for d > n

(a theorem of Marcinkiewicz).

▶ Parametrization is trickier when K_2 doesn't tell the whole story.

Page 13

Making cumulants useful, tractable and estimable

Cumulant tensors are a useful generalization, but too big. A symmetric d-way tensor over #vars variables has C(#vars + d − 1, d) quantities, too many to

learn with a reasonable amount of data,

store, and

optimize.
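Concretely (Python's `math.comb`), for 100 variables the symmetric d-way cumulant already has:

```python
from math import comb

n_vars = 100
for d in (2, 3, 4):
    # distinct entries of a symmetric d-way tensor: C(n + d - 1, d)
    print(d, comb(n_vars + d - 1, d))
# 2 5050
# 3 171700
# 4 4421275
```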

Needed: small, implicit models analogous to PCA

PCA: eigenvalue decomposition of a positive semidefinite real symmetric matrix. We need a tensor analog.

But, it isn’t as easy as it looks. . .

Page 14

Tensor decomposition

Three possible generalizations are the same in the matrix case but not in the tensor case. For an n × n × n tensor T:

Name             | minimum r such that                                  | status
Tensor rank      | T = Σ_{i=1}^{r} u_i ⊗ v_i ⊗ w_i                      | not closed
Border rank      | T = lim_{ε→0} S_ε, with rank(S_ε) = r                | closed but hard to represent; defining equations unknown
Multilinear rank | T = (A, B, C) · K, K ∈ R^{r×r×r}, A, B, C ∈ R^{n×r}  | closed and understood
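A multilinear rank approximation can be sketched with truncated SVDs of the three unfoldings (the HOSVD recipe; NumPy, illustrative names). Truncated HOSVD is only quasi-optimal in general, but for a tensor of exact multilinear rank (r, r, r) the projection is exact:

```python
import numpy as np

def multilinear_approx(T, r):
    """Rank-(r, r, r) approximation T ~ (A, B, C) . K via truncated HOSVD."""
    factors = []
    for mode in range(3):
        M = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)  # unfolding
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        factors.append(U[:, :r])                                # top-r basis
    A, B, C = factors
    K = np.einsum('ia,jb,kc,ijk->abc', A, B, C, T)              # core tensor
    return np.einsum('ia,jb,kc,abc->ijk', A, B, C, K)

# Exactness check on a tensor built to have multilinear rank 2
rng = np.random.default_rng(4)
G = rng.standard_normal((2, 2, 2))
U0, U1, U2 = (np.linalg.qr(rng.standard_normal((6, 2)))[0] for _ in range(3))
T = np.einsum('ia,jb,kc,abc->ijk', U0, U1, U2, G)
assert np.allclose(multilinear_approx(T, 2), T)
```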

Page 15

Multilinear rank factor model

Let y = (Y1, . . . , Yn) be a random vector. Write the dth-order cumulant K_d(y) as a best multilinear rank-r approximation in terms of the cumulant K_d(x) of a smaller set of r factors x:

K_d^y ≈ Q · K_d^x,

where

Q is orthonormal, and Q⊤ projects to the factors;

the column space of Q defines the r-dimensional subspace which best explains the dth-order dependence;

in place of eigenvalues, we have the core tensor K_d^x, the cumulant of the factors.

Have model, need loss and algorithm.

Page 16

Principal cumulant components analysis

Want factors/principal components that account for variation in all cumulants simultaneously:

min_{Q ∈ O(n,r), C_d ∈ S^d(R^r)}  Σ_{d=1}^{∞} α_d ‖K̂_d(y) − Q · C_d‖²,

with C_d ≈ K̂_d(x) not necessarily diagonal.

Appears intractable: optimization over the infinite-dimensional manifold

O(n, r) × Π_{d=1}^{∞} S^d(R^r).

Reduces to optimization over a single Grassmannian Gr(n, r) of dimension r(n − r):

max_{Q ∈ Gr(n,r)}  Σ_{d=1}^{∞} α_d ‖Q⊤ · K̂_d(y)‖².

In practice ∞ = 3 or 4.
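The reduced objective is cheap to evaluate for a candidate subspace, and it depends only on the column space of Q, i.e. on a point of the Grassmannian. A sketch for d = 2, 3 (NumPy; `pcca_objective` is an illustrative name, not code from the talk):

```python
import numpy as np

def pcca_objective(Q, K2, K3, alpha=(1.0, 1.0)):
    """sum_d alpha_d || Q^T . K_d ||_F^2 for d = 2, 3."""
    P2 = Q.T @ K2 @ Q                                  # project the matrix
    P3 = np.einsum('ia,jb,kc,ijk->abc', Q, Q, Q, K3)   # project the 3-tensor
    return alpha[0] * np.sum(P2**2) + alpha[1] * np.sum(P3**2)

rng = np.random.default_rng(5)
n, r = 6, 2
K2 = rng.standard_normal((n, n)); K2 = K2 + K2.T
K3 = rng.standard_normal((n, n, n))
Q = np.linalg.qr(rng.standard_normal((n, r)))[0]
R = np.linalg.qr(rng.standard_normal((r, r)))[0]   # an orthogonal r x r matrix

# Grassmann invariance: Q and QR represent the same subspace
assert np.isclose(pcca_objective(Q, K2, K3), pcca_objective(Q @ R, K2, K3))
```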

Page 17

Geometric insights

Secants of the Veronese in S^d(R^n) and rank subsets: difficult to study.

Symmetric subspace variety in S^d(R^n): closed, easy to study.

The Stiefel manifold O(n, r) is the set of n × r real matrices with orthonormal columns.

The Grassmann manifold Gr(n, r) is the set of equivalence classes of O(n, r) under right multiplication by O(r).

Parametrization of S^d(R^n) via

Gr(n, r) × S^d(R^r) → S^d(R^n).

Page 18

Coordinate-cycling heuristics

Alternating Least Squares (i.e. Gauss–Seidel) is commonly used for optimizing

Ψ(X, Y, Z) = ‖A · (X, Y, Z)‖²_F

for A ∈ R^{l×m×n}, cycling between X, Y, Z and solving a least squares problem at each iteration.

What if A ∈ S³(R^n) and

Φ(X) = ‖A · (X, X, X)‖²_F ?

Present approach: disregard the symmetry of A, solve for Ψ(X, Y, Z), and set

X* = Y* = Z* = (X* + Y* + Z*)/3

upon the final iteration.
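A sketch of this coordinate-cycling scheme for a symmetric tensor (NumPy; it maximizes the projected energy ‖A · (X, Y, Z)‖_F over orthonormal factors, then averages as above; `als_symmetric` and its update rule are illustrative, not the authors' exact code):

```python
import numpy as np

def als_symmetric(A, r, iters=20, seed=0):
    """Cycle X, Y, Z to increase ||A.(X,Y,Z)||_F over orthonormal factors,
    then set X* = Y* = Z* = (X* + Y* + Z*)/3 as on the slide."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    facs = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(3)]
    for _ in range(iters):
        for m in range(3):
            Y, Z = facs[(m + 1) % 3], facs[(m + 2) % 3]
            # contract the other two modes, then take the top-r left
            # singular vectors of the mode-m unfolding
            B = np.einsum('ijk,jb,kc->ibc', np.moveaxis(A, m, 0), Y, Z)
            U, _, _ = np.linalg.svd(B.reshape(n, -1), full_matrices=False)
            facs[m] = U[:, :r]
    X = sum(facs) / 3                    # average the three factors
    return np.linalg.qr(X)[0]            # re-orthonormalize

# On an exactly rank-one symmetric tensor v (x) v (x) v, the scheme
# recovers the line spanned by v
rng = np.random.default_rng(1)
v = rng.standard_normal(5); v /= np.linalg.norm(v)
A = np.einsum('i,j,k->ijk', v, v, v)
X = als_symmetric(A, 1)
```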

Better: L-BFGS on the Grassmannian.

Page 19

Newton/quasi-Newton on a Grassmannian [Savas–Lim]

Objective Φ : Gr(n, r) → R, Φ(X) = ‖A · (X, X, X)‖²_F.

T_X is the tangent space at X ∈ Gr(n, r):

Δ ∈ R^{n×r} lies in T_X ⟺ Δ⊤X = 0.

1. Compute the Grassmann gradient ∇Φ ∈ T_X.
2. Compute the Hessian, or update a Hessian approximation, H : Δ ∈ T_X ↦ HΔ ∈ T_X.
3. At X ∈ Gr(n, r), solve HΔ = −∇Φ for the search direction Δ.
4. Update the iterate X: move along the geodesic from X in the direction given by Δ.

Page 20

L-BFGS on Grassmannian

The BFGS update must be adjusted: on the Grassmannian, the vectors are defined at different points, belonging to different tangent spaces.

Parallel transport along a geodesic to new position.

Limited memory version.

Page 21

Convergence

Compares favorably with Alternating Least Squares.

Page 22

Mean-variance portfolio optimization

Markowitz mean-variance portfolio optimization defines risk to bevariance.

min_w  w⊤ K_2(x) w   s.t.   w⊤ E[x] > r

Evidence indicates that investors optimizing variance with respect tothe covariance matrix accept unwanted skewness and kurtosis risk.

Extreme example: selling out-of-the-money puts looks safe and uncorrelated

Many hedge funds essentially do this
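With an equality target return and a budget constraint (the textbook variant of the program above; the slide's version uses an inequality), the minimum-variance weights come from a linear KKT system; a sketch with hypothetical numbers (NumPy):

```python
import numpy as np

def min_variance_weights(Sigma, mu, r_target):
    """Minimize w' Sigma w  s.t.  w' mu = r_target and sum(w) = 1,
    by solving the KKT (Lagrange multiplier) linear system."""
    n = len(mu)
    KKT = np.zeros((n + 2, n + 2))
    KKT[:n, :n] = 2 * Sigma
    KKT[:n, n] = KKT[n, :n] = mu
    KKT[:n, n + 1] = KKT[n + 1, :n] = 1.0
    rhs = np.concatenate([np.zeros(n), [r_target, 1.0]])
    return np.linalg.solve(KKT, rhs)[:n]

# Illustrative inputs: three assets, diagonal covariance
Sigma = np.diag([1.0, 2.0, 3.0])
mu = np.array([0.10, 0.20, 0.30])
w = min_variance_weights(Sigma, mu, 0.20)
```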

Page 23

Multi-moment portfolio optimization

So, take skewness and kurtosis into account in the objective.

Need to use skewness K3 and kurtosis K4 tensors.

Use a low multilinear rank model to regularize and make the optimization computable with many assets (linear vs. cubic).

With mean-zero returns in a #assets = m × n = #periods matrix A:

Choose an s; we need an m × s orthonormal projector Q.

Approximate the cumulant: n K_d = A⊤ · Δ_{d,n} ≈ Q · C.

The multilinear forms w⊤ · K_d ≈ (w⊤ Q) · (1/n) C then give variance, skewness, and kurtosis.

Page 24

Analogously to Eigenfaces,

Cumulants give features supplementing the PCA varimax subspace.

In eigenfaces, we have a centered #pixels = n × m = #images matrix A, m ≪ n.

The eigenvectors of the covariance matrix K_2^P of the pixels are the eigenfaces.

For efficiency, we compute the covariance matrix K_2^{Images} of the images instead. The SVD gives both implicitly:

U S V⊤ = svd(A⊤),

m K_2^{Images} = A⊤A = U Λ U⊤,

n K_2^{Pixels} = A A⊤ = V Λ V⊤.

The orthonormal columns of V, eigenvectors of K_2^P, are the eigenfaces.
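The trick is that A⊤A (small, m × m) and AA⊤ (huge, n × n) share their nonzero eigenvalues, and a single SVD exposes both eigenbases; a numerical check (NumPy, with random data standing in for images):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 400, 20                        # pixels x images, m << n
A = rng.standard_normal((n, m))
A -= A.mean(axis=1, keepdims=True)    # center each pixel across images

U, S, Vt = np.linalg.svd(A.T, full_matrices=False)  # SVD of the m x n matrix
V = Vt.T                                            # n x m

# m K2_images = A^T A = U Lambda U^T  (small), with Lambda = S^2
assert np.allclose(A.T @ A, (U * S**2) @ U.T)
# n K2_pixels = A A^T = V Lambda V^T  (huge) -- columns of V are eigenfaces
assert np.allclose((A @ A.T) @ V, V * S**2)
```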

Page 25

We can compute Skewfaces.

Centered #pixels = n × m = #images matrix A.

Let K_3^P be the (huge) third cumulant tensor of the pixels.

Analogously, we want to compute it implicitly.

We just need the projector Π onto the subspace of skewfaces that best explains K_3^P.

Let A⊤ = U S V⊤ with dimensions (m × m, m × m, m × n).

n S_3^I = A⊤ · Δ_n = U · S · V⊤ · Δ_n,

m K_3^P = A · Δ_m = V · (S · U⊤ · Δ_m).

Pick a small multilinear rank s. If (S · U⊤ · Δ_m) ≈ Q · C_3 for some m × s matrix Q and non-diagonal core tensor C_3, then

m K_3^P ≈ V · Q · C_3 = (V Q) · C_3,

and Π = V Q is our orthonormal-column projection matrix onto the 'skewmax' subspace.

Page 26

and combine Eigen-, Skew-, and Kurto-faces.

Combine the information from multiple cumulants:

Do the same procedure for the kurtosis tensor (a little more complicated).

Say we keep the first r principal components (columns of V), s skewfaces, and t kurtofaces. Their span is our optimal subspace.

These three subspaces may overlap; orthogonalize the resulting r + s + t column vectors to get a final projector.

This gives an orthonormal projector basis W for the column space of A; its

first r vectors best explain the covariance matrix K_2^P,

next s vectors, with W_{1:r}, best explain the big skewness tensor K_3^P of the pixels, and

last t vectors, with W_{1:r+s}, best explain the pixel kurtosis K_4^P.
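The orthogonalization step can be sketched with a QR factorization that drops near-dependent columns (NumPy; `combine_subspaces` is an illustrative name, and a column-pivoted QR or SVD would be more robust in degenerate cases):

```python
import numpy as np

def combine_subspaces(V_r, W_s, W_t, tol=1e-10):
    """Stack eigen-, skew-, and kurto-directions and orthogonalize,
    dropping directions already explained by earlier ones."""
    B = np.hstack([V_r, W_s, W_t])       # n x (r + s + t)
    Q, R = np.linalg.qr(B)
    keep = np.abs(np.diag(R)) > tol      # near-zero diagonal => dependent
    return Q[:, keep]

rng = np.random.default_rng(7)
n = 10
V = np.linalg.qr(rng.standard_normal((n, 3)))[0]     # 3 "eigen" directions
W = combine_subspaces(V, V[:, :1], rng.standard_normal((n, 2)))
# the duplicated direction is absorbed: 3 + 1 + 2 columns -> 5 kept
```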

Page 27

[email protected]
