jason morton - uclrisi/aml08/morton-aml08.pdfthey describe higher order dependence among random...

Algebraic models for multilinear dependence

Jason Morton

Stanford University

December 11, 2008NIPS

Joint work with Lek-Heng Lim of Berkeley

J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 1 / 27

Univariate cumulants

Mean, variance, skewness and kurtosis describe the shape of aunivariate distribution.


Covariance matrices

The covariance matrix partly describes the dependence structure of amultivariate distribution.

PCA

Gaussian graphical models

Optimization—bilinear form computes variance

But if the variables are not multivariate Gaussian, not the whole story.


Even if marginals normal, dependence might not be

−5 0 5−5

−4

−3

−2

−1

0

1

2

3

4

51000 Simulated Clayton(3)−Dependent N(0,1) Values

X1 ~ N(0,1)

X2

~ N

(0,1

)


Covariance matrix analogs: multivariate cumulants

The cumulant tensors are the multivariate analog of skewnessand kurtosis.

They describe higher order dependence among random variables.

1 Definitions: tensors and cumulants

2 Properties of cumulant tensors

3 Insights and models from algebraic geometry

4 Algorithms from Riemannian geometry

5 Potential applications


Symmetric tensors and actionsA tensor in coordinates is a multi-way array with a multilinear action.

Tensor JaijkK ∈ Rr×r×r is symmetric if it is invariant under allpermutations of indices

aijk = aikj = ajik = ajki = akij = akji .

Comes with an action:

Symmetric multilinear matrix multiplication. If Q is an n×rmatrix, T an r×r×r tensor, make an n×n ×n tensorK = (Q, Q, Q) · T or just Q · T where

Kαβγ =∑r ,r ,r

i ,j ,k=1qαiqβjqγktijk .

If T is r×r and Q is n×r , we have Q · T = QTQ>; for d > 2multiply on 3, 4, . . . “sides” of the multi-way array.


Moments and Cumulants are symmetric tensors

Vector-valued random variable x = (X1, . . . , Xn).Three natural d -way tensors are:

The dth non-central moment si1,...,sd of x:

Sd(x) =[E(xi1xi2 · · · xid)

]n

i1,...,id=1.

The dth central moment Sd(x− E[x]), and

The dth cumulant κi1...id of x:

Kd(x) =

∑A1t···tAq={i1,...,id}

(−1)q−1(q − 1)!sA1 . . . sAq

n

i1,...,id=1

.


Measuring useful properties.

For univariate x , the cumulants Kd(x) for d = 1, 2, 3, 4 are

expectation κi = E[x ],

variance κii = σ2,

skewness κiii/κ3/2ii , and

kurtosis κiiii/κ2ii .

The tensor versions are the multivariate generalizations

κijk

they provide a natural measure of non-Gaussianity.


Alternative Definitions of Cumulants

In terms of log characteristic function,

κj1···jd (x) = (−1)d ∂d

∂tα1j1· · · ∂tαd

jd

log E(exp(i〈t, x〉)∣∣∣∣t=0

.

In terms of Edgeworth series,

log E(exp(i〈t, x〉) =∞∑

α=0

i |α|κα(x)tα

α!

where α = (α1, . . . , αd) is a multi-index, tα = tα11 · · · tαd

d , andα! = α1! · · ·αd !.


Properties of cumulants: Multilinearity

Multilinearity: if x is a Rn-valued random variable and A ∈ Rm×n

Kd(Ax) = A · Kd(x),

where · is the multilinear action.

This makes factor models work: y = Ax implies KYd = A · KX

d ;

For example, KY2 = AKX

2 A> .

Independent Components Analysis finds an A to approximatelydiagonalize KX

d .


Properties of cumulants: Independence

Independence:

x1, . . . , xk are mutually independent of variables y1, . . . , yk , wehaveKd(x1 + y1, . . . , xk + yk) = Kd(x1, . . . , xk) + Kd(y1, . . . , yk).

Ki1,...,in(x) = 0 whenever there is a partition of {i1, . . . , in} intotwo nonempty sets I and J such that xI and xJ are independent.

Why we want to diagonalize in independent component analysis

Exploitable in other sparse cumulant techniques


Properties of cumulants: Vanishing and Extending

Gaussian: If x is multivariate normal, then Kd(x) = 0 for alld ≥ 3.

I Why you might not have heard of them: for Gaussians, thecovariance matrix does tell the whole story.

Support: There are no distributions with a bound n so that

Kd(x)

{6= 0 3 ≤ d ≤ n,

= 0 d > n.

I Parametrization is trickier when K2 doesn’t tell the whole story.


Making cumulants useful, tractable and estimable

Cumulant tensors are a useful generalization, but too big. They have(#vars+d−1

d

)quantities, too many to

learn with a reasonable amount of data,

store, and

optimize.

Needed: small, implicit models analogous to PCA

PCA: eigenvalue decomposition of a positive semidefinite realsymmetric matrix. We need a tensor analog.

But, it isn’t as easy as it looks. . .


Tensor decomposition

Three possible generalizations are the same in the matrix case butnot in the tensor case. For a n × n × n tensor T ,

Name minimum r such that

Tensor rank T =∑r

i=1 ui ⊗ vi ⊗ wi

not closed

Border rank T = limε→0(Sε), Trank(Sε) = rclosed but hard to represent;defining equations unknown.

Multilinear rank T = (A, B , C ) · K , K ∈ Rr×r×r , A, B , C ∈ Rn×r ,closed and understood.


Multilinear rank factor model

Let y = Y1, . . . , Yn be a random vector. Write the dth ordercumulant Kd(y) as a best r -multilinear rank approximation in termsof the cumulant Kd(x) of a smaller set of r factors x:

KYd ≈ Q · KX

d .

where

Q is orthonormal , and Q> projects to the factors

The column space of Q defines the s-dim subspace which bestexplains the dth order dependence.

In place of eigenvalues, we have the core tensor KXd , the

cumulant of the factors.

Have model, need loss and algorithm.


Principal cumulant components analysisWant factors/principal components that account for variation inall cumulants simultaneously

minQ∈O(n,r), Cd∈Sd (Rr )

∞∑d=1

αd‖K̂d(y)− Q · Cd‖2,

Cd ≈ K̂d(x) not necessarily diagonal.

Appears intractable: optimization over infinite-dimensionalmanifold

O(n, r)×∏∞

d=1Sd(Rr ).

Reduces to optimization over a single Grassmannian Gr(n, r) ofdimension r(n − r),

maxQ∈Gr(n,r)

∑∞

d=1αd‖Q> · K̂d(y)‖2.

In practice ∞ = 3 or 4.


Geometric insights

Secants of Veronese in Sd(Rn) and rank subsets— difficult tostudy.

Symmetric subspace variety in Sd(Rn) — closed, easy to study.

Stiefel manifold O(n, r) is set of n × r real matrices withorthonormal columns.

Grassman manifold Gr(n, r) is set of equivalence classes ofO(n, r) under left multiplication by O(n).

Parametrization of Sd(Rn) via

Gr(n, r)× Sd(Rr ) → Sd(Rn).


Coordinate-cycling heuristicsAlternating Least Squares (i.e. Gauss-Seidel) is commonly usedfor minimizing

Ψ(X , Y , Z ) = ‖A · (X , Y , Z )‖2F

for A ∈ Rl×m×n cycling between X , Y , Z and solving a leastsquares problem at each iteration.

What if A ∈ S3(Rn) and

Φ(X ) = ‖A · (X , X , X )‖2F?

Present approach: disregard symmetry of A, solve Ψ(X , Y , Z ),set

X∗ = Y∗ = Z∗ = (X∗ + Y∗ + Z∗)/3

upon final iteration.

Better: L-BFGS on Grassmannian.J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 18 / 27

Newton/quasi-Newton on a Grassmannian[Savas-Lim]

Objective Φ : Gr(n, r) → R, Φ(X ) = ‖A · (X , X , X )‖2F .

TX tangent space at X ∈ Gr(n, r)

Rn×r 3 ∆ ∈ TX ⇐⇒ ∆>X = 0

1 Compute Grassmann gradient ∇Φ ∈ TX .2 Compute Hessian or update Hessian approximation

H : ∆ ∈ TX → H∆ ∈ TX .

3 At X ∈ Gr(n, r), solve

H∆ = −∇Φ

for search direction ∆.4 Update iterate X : Move along geodesic from X in the direction

given by ∆.


L-BFGS on Grassmannian

BFGS update must be adjusted: on the Grassmannian, thevectors are defined on different points belonging to differenttangent spaces.

Parallel transport along a geodesic to new position.

Limited memory version.


ConvergenceCompares favorably with Alternating Least Squares.


Mean-variance portfolio optimization

Markowitz mean-variance portfolio optimization defines risk to bevariance.

min w>K2(x)w s.t. w>E[x] > r

Evidence indicates that investors optimizing variance with respect tothe covariance matrix accept unwanted skewness and kurtosis risk.

Extreme example: selling out-of-the-money puts looks safe anduncorrelated

Many hedge funds essentially do this


Muti-moment portfolio optimization

So, take skewness and kurtosis into account in the objective.

Need to use skewness K3 and kurtosis K4 tensors.

Use low multilinear rank model to regularize and makeoptimization computable with many assets (linear vs. cubic)

With mean-zero returns in a #assets=m × n=#periods matrix A,

Choose an s, need m × s orthonormal projector Q

Approximate cumulant nKd = A> ·∆d ,n ≈ Q · CMultilinear forms w> · Kd ≈ w>Q · 1

nC give variance, skewness

and kurtosis


Analogously to Eigenfaces,

Cumulants give features supplementing the PCA varimax subspace.

In eigenfaces, we have a centered #pixels= n ×m =#imagesmatrix A, m � n.

The eigenvectors of the covariance matrix KP2 of the pixels are

the eigenfaces.

For efficiency, we compute the covariance matrix K Images2 of the

images instead. The SVD gives both implicitly.

USV> = svd(A>)

mK Images2 = A>A = UΛU>

nKPixels2 = AA> = VΛV>

Orthonormal columns of V , eigenvectors of KP2 , are the eigenfaces.


we can compute Skewfaces,Centered #pixels= n ×m =#images matrix A.

Let KP3 be the (huge) third cumulant tensor of the pixels.

Analogously, we want to compute it implicitly

We just need the projector Π onto the subspace of skewfacesthat best explain KP

3 .

Let A> = USV> with dims (m2, m2, m × n).

nS I3 = A> ·∆n = U · S · V> ·∆n

mKP3 = A ·∆m = V · (S · U> ·∆m)

Pick a small multilinear rank s. If (S · U> ·∆m) ≈ Q · C3 for somem × s matrix Q and NON-diagonal core tensor C3,

mKP3 ≈ V · Q · C3 = VQ · C3

and Π = VQ is our orthonormal-column projection matrix onto the’skewmax’ subspace.


and combine Eigen-, Skew-, and Kurto-faces.

Combine the information from multiple cumulants:

Do the same for procedure for the kurtosis tensor (a little morecomplicated).

Say we keep the first r principal components (columns of V ), sskewfaces, and t kurtofaces. Their span is our optimal subspace.

These three subspaces may overlap; orthogonalize the resultingr + s + t column vectors to get a final projector.

This gives an orthonormal projector basis W for the column space ofA; its

first r vectors best explain the covariance matrix KP2 ,

next s vectors, with W1:r , best explain the big skewness tensorKP

3 of the pixels, and

last t vectors, with W1:r+s , best explain pixel kurtosis KP4 .


[email protected]


jason morton - uclrisi/aml08/morton-aml08.pdfthey describe higher order dependence among random...

Documents