jason morton - uclrisi/aml08/morton-aml08.pdfthey describe higher order dependence among random...
TRANSCRIPT
Algebraic models for multilinear dependence
Jason Morton
Stanford University
December 11, 2008NIPS
Joint work with Lek-Heng Lim of Berkeley
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 1 / 27
Univariate cumulants
Mean, variance, skewness and kurtosis describe the shape of aunivariate distribution.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 2 / 27
Covariance matrices
The covariance matrix partly describes the dependence structure of amultivariate distribution.
PCA
Gaussian graphical models
Optimization—bilinear form computes variance
But if the variables are not multivariate Gaussian, not the whole story.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 3 / 27
Even if marginals normal, dependence might not be
−5 0 5−5
−4
−3
−2
−1
0
1
2
3
4
51000 Simulated Clayton(3)−Dependent N(0,1) Values
X1 ~ N(0,1)
X2
~ N
(0,1
)
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 4 / 27
Covariance matrix analogs: multivariate cumulants
The cumulant tensors are the multivariate analog of skewnessand kurtosis.
They describe higher order dependence among random variables.
1 Definitions: tensors and cumulants
2 Properties of cumulant tensors
3 Insights and models from algebraic geometry
4 Algorithms from Riemannian geometry
5 Potential applications
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 5 / 27
Symmetric tensors and actionsA tensor in coordinates is a multi-way array with a multilinear action.
Tensor JaijkK ∈ Rr×r×r is symmetric if it is invariant under allpermutations of indices
aijk = aikj = ajik = ajki = akij = akji .
Comes with an action:
Symmetric multilinear matrix multiplication. If Q is an n×rmatrix, T an r×r×r tensor, make an n×n ×n tensorK = (Q, Q, Q) · T or just Q · T where
Kαβγ =∑r ,r ,r
i ,j ,k=1qαiqβjqγktijk .
If T is r×r and Q is n×r , we have Q · T = QTQ>; for d > 2multiply on 3, 4, . . . “sides” of the multi-way array.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 6 / 27
Moments and Cumulants are symmetric tensors
Vector-valued random variable x = (X1, . . . , Xn).Three natural d -way tensors are:
The dth non-central moment si1,...,sd of x:
Sd(x) =[E(xi1xi2 · · · xid)
]n
i1,...,id=1.
The dth central moment Sd(x− E[x]), and
The dth cumulant κi1...id of x:
Kd(x) =
∑A1t···tAq={i1,...,id}
(−1)q−1(q − 1)!sA1 . . . sAq
n
i1,...,id=1
.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 7 / 27
Measuring useful properties.
For univariate x , the cumulants Kd(x) for d = 1, 2, 3, 4 are
expectation κi = E[x ],
variance κii = σ2,
skewness κiii/κ3/2ii , and
kurtosis κiiii/κ2ii .
The tensor versions are the multivariate generalizations
κijk
they provide a natural measure of non-Gaussianity.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 8 / 27
Alternative Definitions of Cumulants
In terms of log characteristic function,
κj1···jd (x) = (−1)d ∂d
∂tα1j1· · · ∂tαd
jd
log E(exp(i〈t, x〉)∣∣∣∣t=0
.
In terms of Edgeworth series,
log E(exp(i〈t, x〉) =∞∑
α=0
i |α|κα(x)tα
α!
where α = (α1, . . . , αd) is a multi-index, tα = tα11 · · · tαd
d , andα! = α1! · · ·αd !.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 9 / 27
Properties of cumulants: Multilinearity
Multilinearity: if x is a Rn-valued random variable and A ∈ Rm×n
Kd(Ax) = A · Kd(x),
where · is the multilinear action.
This makes factor models work: y = Ax implies KYd = A · KX
d ;
For example, KY2 = AKX
2 A> .
Independent Components Analysis finds an A to approximatelydiagonalize KX
d .
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 10 / 27
Properties of cumulants: Independence
Independence:
x1, . . . , xk are mutually independent of variables y1, . . . , yk , wehaveKd(x1 + y1, . . . , xk + yk) = Kd(x1, . . . , xk) + Kd(y1, . . . , yk).
Ki1,...,in(x) = 0 whenever there is a partition of {i1, . . . , in} intotwo nonempty sets I and J such that xI and xJ are independent.
Why we want to diagonalize in independent component analysis
Exploitable in other sparse cumulant techniques
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 11 / 27
Properties of cumulants: Vanishing and Extending
Gaussian: If x is multivariate normal, then Kd(x) = 0 for alld ≥ 3.
I Why you might not have heard of them: for Gaussians, thecovariance matrix does tell the whole story.
Support: There are no distributions with a bound n so that
Kd(x)
{6= 0 3 ≤ d ≤ n,
= 0 d > n.
I Parametrization is trickier when K2 doesn’t tell the whole story.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 12 / 27
Making cumulants useful, tractable and estimable
Cumulant tensors are a useful generalization, but too big. They have(#vars+d−1
d
)quantities, too many to
learn with a reasonable amount of data,
store, and
optimize.
Needed: small, implicit models analogous to PCA
PCA: eigenvalue decomposition of a positive semidefinite realsymmetric matrix. We need a tensor analog.
But, it isn’t as easy as it looks. . .
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 13 / 27
Tensor decomposition
Three possible generalizations are the same in the matrix case butnot in the tensor case. For a n × n × n tensor T ,
Name minimum r such that
Tensor rank T =∑r
i=1 ui ⊗ vi ⊗ wi
not closed
Border rank T = limε→0(Sε), Trank(Sε) = rclosed but hard to represent;defining equations unknown.
Multilinear rank T = (A, B , C ) · K , K ∈ Rr×r×r , A, B , C ∈ Rn×r ,closed and understood.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 14 / 27
Multilinear rank factor model
Let y = Y1, . . . , Yn be a random vector. Write the dth ordercumulant Kd(y) as a best r -multilinear rank approximation in termsof the cumulant Kd(x) of a smaller set of r factors x:
KYd ≈ Q · KX
d .
where
Q is orthonormal , and Q> projects to the factors
The column space of Q defines the s-dim subspace which bestexplains the dth order dependence.
In place of eigenvalues, we have the core tensor KXd , the
cumulant of the factors.
Have model, need loss and algorithm.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 15 / 27
Principal cumulant components analysisWant factors/principal components that account for variation inall cumulants simultaneously
minQ∈O(n,r), Cd∈Sd (Rr )
∞∑d=1
αd‖K̂d(y)− Q · Cd‖2,
Cd ≈ K̂d(x) not necessarily diagonal.
Appears intractable: optimization over infinite-dimensionalmanifold
O(n, r)×∏∞
d=1Sd(Rr ).
Reduces to optimization over a single Grassmannian Gr(n, r) ofdimension r(n − r),
maxQ∈Gr(n,r)
∑∞
d=1αd‖Q> · K̂d(y)‖2.
In practice ∞ = 3 or 4.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 16 / 27
Geometric insights
Secants of Veronese in Sd(Rn) and rank subsets— difficult tostudy.
Symmetric subspace variety in Sd(Rn) — closed, easy to study.
Stiefel manifold O(n, r) is set of n × r real matrices withorthonormal columns.
Grassman manifold Gr(n, r) is set of equivalence classes ofO(n, r) under left multiplication by O(n).
Parametrization of Sd(Rn) via
Gr(n, r)× Sd(Rr ) → Sd(Rn).
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 17 / 27
Coordinate-cycling heuristicsAlternating Least Squares (i.e. Gauss-Seidel) is commonly usedfor minimizing
Ψ(X , Y , Z ) = ‖A · (X , Y , Z )‖2F
for A ∈ Rl×m×n cycling between X , Y , Z and solving a leastsquares problem at each iteration.
What if A ∈ S3(Rn) and
Φ(X ) = ‖A · (X , X , X )‖2F?
Present approach: disregard symmetry of A, solve Ψ(X , Y , Z ),set
X∗ = Y∗ = Z∗ = (X∗ + Y∗ + Z∗)/3
upon final iteration.
Better: L-BFGS on Grassmannian.J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 18 / 27
Newton/quasi-Newton on a Grassmannian[Savas-Lim]
Objective Φ : Gr(n, r) → R, Φ(X ) = ‖A · (X , X , X )‖2F .
TX tangent space at X ∈ Gr(n, r)
Rn×r 3 ∆ ∈ TX ⇐⇒ ∆>X = 0
1 Compute Grassmann gradient ∇Φ ∈ TX .2 Compute Hessian or update Hessian approximation
H : ∆ ∈ TX → H∆ ∈ TX .
3 At X ∈ Gr(n, r), solve
H∆ = −∇Φ
for search direction ∆.4 Update iterate X : Move along geodesic from X in the direction
given by ∆.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 19 / 27
L-BFGS on Grassmannian
BFGS update must be adjusted: on the Grassmannian, thevectors are defined on different points belonging to differenttangent spaces.
Parallel transport along a geodesic to new position.
Limited memory version.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 20 / 27
ConvergenceCompares favorably with Alternating Least Squares.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 21 / 27
Mean-variance portfolio optimization
Markowitz mean-variance portfolio optimization defines risk to bevariance.
min w>K2(x)w s.t. w>E[x] > r
Evidence indicates that investors optimizing variance with respect tothe covariance matrix accept unwanted skewness and kurtosis risk.
Extreme example: selling out-of-the-money puts looks safe anduncorrelated
Many hedge funds essentially do this
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 22 / 27
Muti-moment portfolio optimization
So, take skewness and kurtosis into account in the objective.
Need to use skewness K3 and kurtosis K4 tensors.
Use low multilinear rank model to regularize and makeoptimization computable with many assets (linear vs. cubic)
With mean-zero returns in a #assets=m × n=#periods matrix A,
Choose an s, need m × s orthonormal projector Q
Approximate cumulant nKd = A> ·∆d ,n ≈ Q · CMultilinear forms w> · Kd ≈ w>Q · 1
nC give variance, skewness
and kurtosis
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 23 / 27
Analogously to Eigenfaces,
Cumulants give features supplementing the PCA varimax subspace.
In eigenfaces, we have a centered #pixels= n ×m =#imagesmatrix A, m � n.
The eigenvectors of the covariance matrix KP2 of the pixels are
the eigenfaces.
For efficiency, we compute the covariance matrix K Images2 of the
images instead. The SVD gives both implicitly.
USV> = svd(A>)
mK Images2 = A>A = UΛU>
nKPixels2 = AA> = VΛV>
Orthonormal columns of V , eigenvectors of KP2 , are the eigenfaces.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 24 / 27
we can compute Skewfaces,Centered #pixels= n ×m =#images matrix A.
Let KP3 be the (huge) third cumulant tensor of the pixels.
Analogously, we want to compute it implicitly
We just need the projector Π onto the subspace of skewfacesthat best explain KP
3 .
Let A> = USV> with dims (m2, m2, m × n).
nS I3 = A> ·∆n = U · S · V> ·∆n
mKP3 = A ·∆m = V · (S · U> ·∆m)
Pick a small multilinear rank s. If (S · U> ·∆m) ≈ Q · C3 for somem × s matrix Q and NON-diagonal core tensor C3,
mKP3 ≈ V · Q · C3 = VQ · C3
and Π = VQ is our orthonormal-column projection matrix onto the’skewmax’ subspace.
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 25 / 27
and combine Eigen-, Skew-, and Kurto-faces.
Combine the information from multiple cumulants:
Do the same for procedure for the kurtosis tensor (a little morecomplicated).
Say we keep the first r principal components (columns of V ), sskewfaces, and t kurtofaces. Their span is our optimal subspace.
These three subspaces may overlap; orthogonalize the resultingr + s + t column vectors to get a final projector.
This gives an orthonormal projector basis W for the column space ofA; its
first r vectors best explain the covariance matrix KP2 ,
next s vectors, with W1:r , best explain the big skewness tensorKP
3 of the pixels, and
last t vectors, with W1:r+s , best explain pixel kurtosis KP4 .
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 26 / 27
J. Morton (Stanford University) Algebraic models for multilinear dependence December 11, 2008NIPS 27 / 27