TRANSCRIPT
Low Rank Tucker Approximation of a Tensor from Streaming Data
Madeleine Udell
Operations Research and Information Engineering, Cornell University
Based on joint work with Yiming Sun (Cornell), Yang Guo (UW Madison),
Charlene Luo (Columbia), and Joel Tropp (Caltech)
April 1, 2019
Madeleine Udell, Cornell. Streaming Tucker Approximation. 1
Outline
Applications
Tucker factorization
Sketching
Reconstruction
Numerics
Big data, small laptop
X = H1 + · · ·+ HT
Distributed data
X = H1 + · · ·+ HT
Streaming data
X(t) = H1 + · · ·+ Ht
Streaming multilinear algebra
turnstile model:
X = H1 + · · ·+ HT
- tensor X presented as a sum of smaller, simpler tensors Ht
- must discard Ht after it is processed
- Goal: without storing X, approximate X after seeing all updates (with guaranteed accuracy)
applications:
- scientific simulation
- sensor measurements
- memory- or communication-limited computing
- low memory optimization
Linear sketch
X = H1 + · · ·+ HT
L(X) = L(H1) + · · ·+ L(HT )
- select a linear map L independent of X
- sketch L(X) is much smaller than the input tensor X
- use randomness so the sketch works for an arbitrary input
- essentially the only way to handle the turnstile model [Li, Nguyen & Woodruff 2014]
examples:
- L(X) = X ×n Ω for some matrix Ω
- L(X) = {X ×n Ωn}n∈[N] for some matrices {Ωn}n∈[N]
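The linearity of L is exactly what makes turnstile updates sketchable one at a time. A toy numpy illustration (the shapes and the mode-0 sketch map are my own choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
# two streaming updates H1, H2 of a 30 x 40 x 50 tensor X = H1 + H2
H1 = rng.standard_normal((30, 40, 50))
H2 = rng.standard_normal((30, 40, 50))
# linear map L: multiply the mode-0 unfolding by a fixed random matrix
Omega = rng.standard_normal((40 * 50, 5))
L = lambda X: X.reshape(30, -1) @ Omega
# linearity: sketch each update as it arrives, add to the running sketch,
# then discard the update -- X itself is never stored
assert np.allclose(L(H1 + H2), L(H1) + L(H2))
```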
Main idea
sketch suffices for (Tucker) approximation:
- compute (randomized) linear sketch of tensor
- recover low rank (Tucker) approximation from sketch
- (optional) improve approximation by revisiting data
Big data, small laptop: sketch
X = H1 + · · ·+ HT −→ L(X)
- (+) reduced communication
- (+) sketch of data fits on laptop
Distributed data: sketch
L(X(T )) = L(H1 + · · ·+ HT−1) + L(HT )
- (+) reduced communication
- (+) no PITI (personally identifiable toast information)
- (+) sketch of data fits on laptop
Streaming data: sketch
L(X(t)) = L(H1 + · · ·+ Ht−1) + L(Ht)
- (+) even a toaster can form the sketch
Notation
tensor to compress:
- tensor X ∈ RI1×···×IN with N modes
- sometimes assume I1 = · · · = IN = I for simplicity
indexing:
- [N] = {1, . . . , N}
- I(−n) = I1 × · · · × In−1 × In+1 × · · · × IN
tensor operations:
- mode-n product: for A ∈ Rk×In , X ×n A ∈ RI1×···×In−1×k×In+1×···×IN
- unfolding X(n) ∈ RIn×I(−n) stacks the mode-n fibers of X as columns of a matrix
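Both operations are one-liners in numpy; a minimal sketch under the definitions above (the function names are mine):

```python
import numpy as np

def unfold(X, n):
    # X_(n): stack the mode-n fibers of X as columns of an I_n x I_(-n) matrix
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_prod(X, A, n):
    # X x_n A for A of shape (k, I_n): contract A against mode n of X
    return np.moveaxis(np.tensordot(A, X, axes=(1, n)), 0, n)
```

For X of shape (3, 4, 5) and A of shape (2, 4), mode_prod(X, A, 1) has shape (3, 2, 5), matching the dimension count in the mode-n product definition.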
Tucker factorization
rank r = (r1, . . . , rN) Tucker factorization of X ∈ RI1×···×IN :
X = G×1 U1 · · · ×N UN =: JG; U1, . . . ,UNK
where
- G ∈ Rr1×···×rN is the core tensor
- Un ∈ RIn×rn is the factor matrix for each mode n ∈ [N]
(sometimes assume r1 = · · · = rN = r for simplicity)
Tucker is useful for compression: when N is small,
- Tucker stores O(r^N + rNI) numbers for a rank-(r, . . . , r) approximation
- CP stores O(rNI) numbers for a rank-r approximation
future work: one pass ST-HOSVD / tensor train?
Computing Tucker: HOSVD
Algorithm: Higher order singular value decomposition (HOSVD) [De Lathauwer, De Moor & Vandewalle 2000, Tucker 1966]
Given: tensor X, rank r = (r1, . . . , rN)
1. Factors. Compute the top rn left singular vectors Un of the unfolding X(n) for each n ∈ [N].
2. Core. Contract these with X to form the core
G = X ×1 U⊤1 · · · ×N U⊤N .
Return: Tucker approximation XHOSVD = JG; U1, . . . ,UNK
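The two steps translate directly into numpy. A minimal, unoptimized sketch of HOSVD (the helper and function names are mine):

```python
import numpy as np

def mode_prod(X, A, n):
    # X x_n A: contract A (k x I_n) against mode n of X
    return np.moveaxis(np.tensordot(A, X, axes=(1, n)), 0, n)

def hosvd(X, ranks):
    # 1. Factors: top r_n left singular vectors of each unfolding X_(n)
    Us = []
    for n, r in enumerate(ranks):
        Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
        Us.append(np.linalg.svd(Xn, full_matrices=False)[0][:, :r])
    # 2. Core: G = X x_1 U1^T ... x_N UN^T
    G = X
    for n, U in enumerate(Us):
        G = mode_prod(G, U.T, n)
    return G, Us
```

On a tensor of exact multilinear rank r, reconstructing JG; U1, . . . , UNK from this output recovers X exactly.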
Two pass HOSVD
HOSVD can be computed in two passes over the tensor:
- Factors. Use randomized linear algebra: we need to find the span of the fibers of X along the nth mode:
  range(Un) ≈ range(X(n))
- if rank(Ω) ≥ rank(X(n)), then whp for a random Ω,
  range(X(n)) = range(X(n)Ω)
- algorithm:
  1. compute the sketch L(X) = {X(n)Ωn}n∈[N]
  2. use QR on each sketch to approximate range(X(n))
- Core. Computation is linear in X:
  G = X ×1 U⊤1 · · · ×N U⊤N .
Source: [Halko, Martinsson & Tropp 2011, Zhou, Cichocki & Xie 2014, Battaglino, Ballard & Kolda 2019]
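The range-capture step is easy to sanity-check numerically. A toy example with an exactly rank-3 stand-in for an unfolding (all sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for an unfolding X_(n) of exact rank 3
Xn = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 200))
Omega = rng.standard_normal((200, 5))   # k = 5 >= rank(Xn)
Q = np.linalg.qr(Xn @ Omega)[0]         # orthonormal basis for range(Xn @ Omega)
# whp range(Xn @ Omega) = range(Xn): projecting onto Q loses nothing
assert np.allclose(Q @ (Q.T @ Xn), Xn)
```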
Computing Tucker: HOOI
Algorithm: Higher order orthogonal iteration (HOOI) [De Lathauwer et al. 2000]
Given: tensor X, rank r = (r1, . . . , rN)
Initialize: compute X ≈ JG; U1, . . . ,UNK using HOSVD
Repeat:
1. Factors. For n ∈ [N], Un ← argminUn ‖JG; U1, . . . ,UNK− X‖2F .
2. Core. G ← argminG ‖JG; U1, . . . ,UNK− X‖2F .
Return: Tucker approximation XHOOI = JG; U1, . . . ,UNK
- core update has closed form G ← X ×1 U⊤1 · · · ×N U⊤N
Previous work: one pass algorithm via HOOI
[Malik & Becker 2018]:
- (+) sketch the design matrix to reduce the size of the HOOI subproblems
- (+) exploit the Tucker structure of the design matrix
- (−) slow, expensive reconstruction (via iterative optimization)
- (−) no error guarantees for the one pass algorithm
Background: randomized sketches
idea: a random matrix Ω is (whp) not orthogonal to the range of interest:
range(X(n)) = range(X(n)Ω)
a dimension reduction map (DRM) (approximately) preserves the range of its argument
examples of DRMs: multiplication by a random matrix Ω that is
- Gaussian
- sparse [Achlioptas 2003, Li, Hastie & Church 2006]
- SSRFT [Woolfe, Liberty, Rokhlin & Tygert 2008]
- a tensor random projection (TRP) [Sun, Guo, Tropp & Udell 2018]
- . . .
The sketch
approximate factor matrices and core:
- Factor sketch (k). For each n ∈ [N], fix a random DRM Ωn ∈ RI(−n)×kn and compute the sketch
  Vn = X(n)Ωn ∈ RIn×kn .
- Core sketch (s). For each n ∈ [N], fix a random DRM Φn ∈ RIn×sn and compute the sketch
  H = X ×1 Φ⊤1 · · · ×N Φ⊤N ∈ Rs1×···×sN .
- Rule of thumb. Pick k as big as you can afford; pick s = 2k.
- define (H,V1, . . . ,VN) = Sketch(X; {Φn,Ωn}n∈[N])
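The whole sketch is a few lines of numpy; a minimal sketch of the Sketch map above (names are mine, and the unfolding fixes one particular ordering of I(−n)):

```python
import numpy as np

def mode_prod(X, A, n):
    # X x_n A: contract A (k x I_n) against mode n of X
    return np.moveaxis(np.tensordot(A, X, axes=(1, n)), 0, n)

def sketch(X, Omegas, Phis):
    # factor sketches V_n = X_(n) Omega_n, with Omega_n of shape (I_(-n), k_n)
    Vs = [np.moveaxis(X, n, 0).reshape(X.shape[n], -1) @ Om
          for n, Om in enumerate(Omegas)]
    # core sketch H = X x_1 Phi_1^T ... x_N Phi_N^T, with Phi_n of shape (I_n, s_n)
    H = X
    for n, Phi in enumerate(Phis):
        H = mode_prod(H, Phi.T, n)
    return H, Vs
```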
Low memory DRMs
factor sketch DRMs are big!
- I(−n) × kn for each n ∈ [N]
how to store?
- don't store the DRMs; instead, use a pseudorandom number generator to generate (parts of) the DRMs as needed
- use a structured DRM:
  - the TRP generates a DRM as a Khatri–Rao product of simpler, smaller DRMs
  - behaves approximately like a Gaussian sketch
Source: [Sun et al. 2018, Rudelson 2012]
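A sketch of the TRP construction: store only the small factors, and form the Khatri–Rao (columnwise Kronecker) product when needed. Implementation details here are my own, not from the talk:

```python
import numpy as np

def khatri_rao(mats):
    # columnwise Kronecker product: column j of the result is
    # kron(mats[0][:, j], ..., mats[-1][:, j])
    k = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = (out[:, None, :] * M[None, :, :]).reshape(-1, k)
    return out

# a (40*50) x 5 DRM stored as two small factors: 40*5 + 50*5 numbers, not 2000*5
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 5))
B = rng.standard_normal((50, 5))
Omega = khatri_rao([A, B])
assert Omega.shape == (2000, 5)
```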
Recovery: factor matrices
- compute a QR factorization of each factor sketch Vn:
  Vn = QnRn,
  where Qn is orthonormal and Rn is triangular
Two pass algorithm
Algorithm Two Pass Sketch and Low Rank Recovery
Given: tensor X, rank r = (r1, . . . , rN), DRMs {Φn,Ωn}n∈[N]
- Sketch. (H,V1, . . . ,VN) = Sketch(X; {Φn,Ωn}n∈[N])
- Recover factor matrices. For n ∈ [N], (Qn,∼) ← QR(Vn)
- Recover core. W ← X ×1 Q⊤1 · · · ×N Q⊤N
Return: Tucker approximation X̂ = JW; Q1, . . . ,QNK
accesses X twice: 1) to sketch 2) to recover core
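A minimal numpy sketch of the two pass recovery (helper names mine); the final loop is the second pass over X:

```python
import numpy as np

def mode_prod(X, A, n):
    # X x_n A: contract A (k x I_n) against mode n of X
    return np.moveaxis(np.tensordot(A, X, axes=(1, n)), 0, n)

def two_pass_recover(X, Vs):
    # factor matrices: orthonormal bases for the factor sketches
    Qs = [np.linalg.qr(V)[0] for V in Vs]
    # core: second pass over X, W = X x_1 Q1^T ... x_N QN^T
    W = X
    for n, Q in enumerate(Qs):
        W = mode_prod(W, Q.T, n)
    return W, Qs
```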
Intuition: one pass core recovery
- we want W: the compression of X using the factor range approximations Qn
- we observe H: the compression of X using the random projections Φn
how to approximate W?
X ≈ X ×1 Q1Q⊤1 · · · ×N QNQ⊤N
  = (X ×1 Q⊤1 · · · ×N Q⊤N) ×1 Q1 · · · ×N QN
  = W ×1 Q1 · · · ×N QN

H = X ×1 Φ⊤1 · · · ×N Φ⊤N ≈ W ×1 Φ⊤1 Q1 · · · ×N Φ⊤NQN

we can solve for W: since s > k, each Φ⊤n Qn has a left inverse (whp):

W ≈ H ×1 (Φ⊤1 Q1)† · · · ×N (Φ⊤NQN)†
One pass algorithm
Algorithm One Pass Sketch and Low Rank Recovery
Given: tensor X, rank r = (r1, . . . , rN), DRMs {Φn,Ωn}n∈[N]
- Sketch. (H,V1, . . . ,VN) = Sketch(X; {Φn,Ωn}n∈[N])
- Recover factor matrices. For n ∈ [N], (Qn,∼) ← QR(Vn)
- Recover core. W ← H ×1 (Φ⊤1 Q1)† · · · ×N (Φ⊤NQN)†
Return: Tucker approximation X̂ = JW; Q1, . . . ,QNK
accesses X only once, to sketch
Source: [Sun, Guo, Luo, Tropp & Udell 2019]
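End to end, the one pass method fits in a few lines. A minimal numpy sketch on a synthetic, exactly low rank tensor (all names and sizes are mine; a real implementation would accumulate the sketches over streaming updates rather than hold X):

```python
import numpy as np

def mode_prod(X, A, n):
    # X x_n A: contract A (k x I_n) against mode n of X
    return np.moveaxis(np.tensordot(A, X, axes=(1, n)), 0, n)

rng = np.random.default_rng(0)
# exactly rank-(2,2,2) tensor with I = 20, so recovery should be exact (whp)
X = rng.standard_normal((2, 2, 2))
for n in range(3):
    X = mode_prod(X, rng.standard_normal((20, 2)), n)
k, s = 5, 11                                   # s = 2k + 1
Omegas = [rng.standard_normal((X.size // 20, k)) for _ in range(3)]
Phis = [rng.standard_normal((20, s)) for _ in range(3)]
# sketch (the only pass over X)
Vs = [np.moveaxis(X, n, 0).reshape(20, -1) @ Omegas[n] for n in range(3)]
H = X
for n in range(3):
    H = mode_prod(H, Phis[n].T, n)
# recover: QR for the factors, left inverses for the core
Qs = [np.linalg.qr(V)[0] for V in Vs]
W = H
for n in range(3):
    W = mode_prod(W, np.linalg.pinv(Phis[n].T @ Qs[n]), n)
Xhat = W
for n in range(3):
    Xhat = mode_prod(Xhat, Qs[n], n)
assert np.allclose(X, Xhat)
```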
Fixed rank approximation
to truncate reconstruction to rank r, truncate core:
Lemma. For a tensor W ∈ Rk1×···×kN and orthogonal matrices Qn ∈ RIn×kn ,
JW ×1 Q1 · · · ×N QNKr = JWKr ×1 Q1 · · · ×N QN ,
where J·Kr denotes the best rank r Tucker approximation.
=⇒ compute the fixed rank approximation using, e.g., HOOI on the (small) core approximation W
Tail energy
For each unfolding X(n), define its ρth tail energy as

(τ (n)ρ )2 := ∑k>ρ σ2k(X(n)),

where the sum runs over ρ < k ≤ min(In, I(−n)) and σk(X(n)) is the kth largest singular value of X(n).
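Tail energy comes straight from the singular values of the unfolding; a minimal sketch (the function name is mine):

```python
import numpy as np

def tail_energy_sq(X, n, rho):
    # (tau_rho^(n))^2: sum of sigma_k(X_(n))^2 over k > rho
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    sig = np.linalg.svd(Xn, compute_uv=False)
    return float(np.sum(sig[rho:] ** 2))
```

With ρ = 0 this equals ‖X‖2F, and it vanishes once ρ reaches rank(X(n)), consistent with exact recovery of truly rank-r tensors.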
Guarantees (I)
Theorem (Recommended parameters [Sun et al. 2019])
Sketch X with Gaussian DRMs with parameters k and s = 2k + 1. Form a rank r Tucker approximation X̂ using the one pass algorithm. Then

E‖X− X̂‖2F ≤ 4 ∑Nn=1 (τ (n)rn )2.

If X is truly rank r, we obtain the true Tucker factorization!
Guarantees (II)
Theorem (Detailed guarantee [Sun et al. 2019])
Sketch X with Gaussian DRMs with parameters k and s. Form a rank r Tucker approximation X̂ using the one pass algorithm. Then

E‖X− X̂‖2F ≤ (1 + ∆) min1≤ρn<kn−1 ∑Nn=1 (1 + ρn/(kn − ρn − 1)) (τ (n)ρn )2,

where ∆ = maxNn=1 kn/(sn − kn − 1).
Different DRMs perform similarly
[Figure: difference in relative error vs. compression factor δ1 = k/I, varying k at I = 600; five panels: Low Rank γ = 0.01, Sparse Low Rank γ = 0.01, Polynomial Decay, Low Rank γ = 0.1, Low Rank γ = 1; legend: SSRFT, Gaussian TRP, Sparse TRP.]
Comments: Synthetic data, I = 600 and r = (5, 5, 5). k/I = 0.4 =⇒ 20× compression.
Sensible reconstruction at practical compression level
[Figure: difference in relative error vs. memory use, varying memory size at I = 300; five panels: Low Rank γ = 0.01, Sparse Low Rank γ = 0.01, Polynomial Decay, Low Rank γ = 0.1, Low Rank γ = 1; legend: Two Pass, One Pass, TS.]
Comments: Error of the fixed-rank approximation relative to HOOI for r = 10, I = 300 using the TRP. Total memory use is ((2k + 1)N + kIN) and (Kr2N + K·r2N−2). Low-rank data uses γ = 0.01, 0.1, 1.
Combustion simulation
[Figure: 128 × 128 y-vs-x slices of the combustion data; panels: Original, HOOI, Two Pass, One Pass.]
Comments: 1408 × 128 × 128 simulated combustion data from [Lapointe, Savard & Blanquart 2015].
Video scene classification
[Figure: scene classification over ~2000 frames; rows: Linear Sketch (k = 20), Two-Pass Tucker (k = 20, r = 10), One-Pass Tucker (k = 20, r = 10), One-Pass Tucker (k = 300, r = 10).]
Comments: Video data 2200 × 1080 × 1980. Classify scenes using k-means on: (1) the linear sketch along the time dimension, k = 20 (Row 1); (2) the Tucker factor along the time dimension, computed via our two pass (Row 2) and one pass (Row 3) sketching algorithms, (r, k, s) = (10, 20, 41); (3) the Tucker factor along the time dimension, computed via our one pass sketching algorithm (Row 4), (r, k, s) = (10, 300, 601).
Summary
Streaming Tucker approximation compresses a tensor without storing it.
useful for:
- streaming data
- distributed data
- low memory compute
key ideas:
- form a linear sketch of the tensor and recover the approximation from the sketch
- a random projection of the tensor preserves its dominant information
Future work + references
let’s talk!
- bigger tensors to compress?
- streaming compression for 〈your research〉?
references:
- Sun, Y., Guo, Y., Tropp, J. A., and Udell, M. (2018). Tensor random projection for low memory dimension reduction. In NeurIPS Workshop on Relational Representation Learning.
- Sun, Y., Guo, Y., Luo, C., Tropp, J. A., and Udell, M. (2019). Low rank Tucker approximation of a tensor from streaming data. In preparation.
- Tropp, J. A., Yurtsever, A., Udell, M., and Cevher, V. (2019). Streaming low-rank matrix approximation with an application to scientific simulation. Submitted to SISC.
References
Achlioptas, D. (2003). Database-friendly random projections: Johnson–Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687.
Battaglino, C., Ballard, G., & Kolda, T. G. (2019). Faster parallel Tucker tensor decomposition using randomization.
De Lathauwer, L., De Moor, B., & Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4), 1253–1278.
Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217–288.
Lapointe, S., Savard, B., & Blanquart, G. (2015). Differential diffusion effects, distributed burning, and local extinctions in high Karlovitz premixed flames. Combustion and Flame, 162(9), 3341–3355.
Li, P., Hastie, T. J., & Church, K. W. (2006). Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 287–296). ACM.
Li, Y., Nguyen, H. L., & Woodruff, D. P. (2014). Turnstile streaming algorithms might as well be linear sketches. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, (pp. 174–183). ACM.
Malik, O. A. & Becker, S. (2018). Low-rank Tucker decomposition of large tensors using TensorSketch. In Advances in Neural Information Processing Systems, (pp. 10116–10126).
Rudelson, M. (2012). Row products of random matrices. Advances in Mathematics, 231(6), 3199–3231.
Sun, Y., Guo, Y., Luo, C., Tropp, J. A., & Udell, M. (2019). Low rank Tucker approximation of a tensor from streaming data. In preparation.
Sun, Y., Guo, Y., Tropp, J. A., & Udell, M. (2018). Tensor random projection for low memory dimension reduction. In NeurIPS Workshop on Relational Representation Learning.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis.Psychometrika, 31(3), 279–311.
Woolfe, F., Liberty, E., Rokhlin, V., & Tygert, M. (2008). A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3), 335–366.
Zhou, G., Cichocki, A., & Xie, S. (2014). Decomposition of big tensors with low multilinear rank. arXiv preprint arXiv:1412.1885.