distinguishing the signal from noise in an svd of simulation data
DESCRIPTION
My talk at the massive data and signal processing workshop at ICASSP 2012 in Kyoto, Japan.

TRANSCRIPT
Distinguishing signal from noise in an SVD of simulation data
DAVID F. GLEICH, PURDUE UNIVERSITY COMPUTER SCIENCE DEPARTMENT
PAUL G. CONSTANTINE, STANFORD UNIVERSITY
1
ICASSP David Gleich · Purdue
Large scale non-linear, time dependent heat transfer problem
10^5 nodes, 10^3 time steps; 30 minutes on 16 cores; ~1 GB.
Questions: What is the probability of failure? Which input values cause failure?
2
Gaining insight and confidence requires multiple runs and hits the curse of dimensionality.
The problem: a simulation run is time-consuming!
Our solution: use "big-data" techniques and platforms.
3
We store a few runs …
[Diagram: supercomputer → data computing cluster → engineer]
Run 100-1000 simulations
Store them on the MapReduce cluster
Run 10000-100000 interpolated simulations for approximate statistics
… and build an interpolant from the data for computational steering.
4
Input parameters s: 5-10 of them.
Time history of simulation f: "a few gigabytes".
The Database: s1 -> f1, s2 -> f2, ..., sk -> fk

The simulation as a vector (each block $q(x_1, t_j, s), \ldots, q(x_n, t_j, s)$ is a single simulation snapshot at one time step):

$$f(s) = \begin{bmatrix} q(x_1, t_1, s) \\ \vdots \\ q(x_n, t_1, s) \\ q(x_1, t_2, s) \\ \vdots \\ q(x_n, t_2, s) \\ \vdots \\ q(x_n, t_k, s) \end{bmatrix}$$

The database as a matrix, 100 GB - 100 TB:

$$X = \begin{bmatrix} f(s_1) & f(s_2) & \cdots & f(s_p) \end{bmatrix}$$
5
One-dimensional test problem
6

[Figure: "plot(X)" shows the columns f_1, f_2, ..., f_5 as functions of x; "imagesc(X)" shows the matrix X as an image.]

$$f(x, s) = \frac{1}{8s}\log\left[1 + 4s(x^2 - x)\right], \qquad X_{i,j} = f(x_i, s_j)$$
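A minimal numpy sketch of building the data matrix for this test problem; the grid sizes and the parameter range are illustrative choices, with the range picked so the log argument stays positive.

```python
import numpy as np

def f(x, s):
    # The slide's test function: f(x, s) = 1/(8s) * log(1 + 4s(x^2 - x)).
    return 1.0 / (8.0 * s) * np.log1p(4.0 * s * (x**2 - x))

# On [0, 1], x^2 - x >= -1/4, so 1 + 4s(x^2 - x) >= 1 - s; keeping s < 1
# keeps the logarithm well defined.
x = np.linspace(0.0, 1.0, 200)    # spatial points x_i
s = np.linspace(0.05, 0.95, 50)   # parameter samples s_j

# X_{i,j} = f(x_i, s_j): each column is one "simulation".
X = f(x[:, None], s[None, :])
print(X.shape)
```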
The interpolant
Motivation: let the data give you the basis, then find the right combination.

$$X = \begin{bmatrix} f(s_1) & f(s_2) & \cdots & f(s_p) \end{bmatrix}, \qquad f(s) \approx \sum_{j=1}^{r} u_j \alpha_j(s)$$

This idea was inspired by the success of other reduced order models like POD, and Paul's residual minimizing idea.
The $u_j$ are the left singular vectors from X!
7
Why the SVD? It splits “space-time” from “parameters”
Treat each right singular vector as samples of the unknown basis functions.
Split x from s, a general parameter.
Interpolate v any way you wish.
8
$$f(x_i, s_j) = \sum_{\ell=1}^{r} U_{i,\ell}\,\sigma_\ell\,V_{j,\ell} = \sum_{\ell=1}^{r} u_\ell(x_i)\,\sigma_\ell\,v_\ell(s_j)$$

$$f(x_i, s) = \sum_{\ell=1}^{r} u_\ell(x_i)\,\sigma_\ell\,v_\ell(s), \qquad v_\ell(s) \approx \sum_{j=1}^{p} v_\ell(s_j)\,\phi_j^{(\ell)}(s)$$

x is the "space-time" index … and it has a "smoothness" property.
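This split-and-interpolate recipe can be sketched in numpy on the one-dimensional test problem. Linear interpolation of each $v_\ell(s)$ is one simple choice among many; the grids, rank, and evaluation point are illustrative.

```python
import numpy as np

# Build the test-problem data matrix X_{i,j} = f(x_i, s_j).
x = np.linspace(0.0, 1.0, 200)
s = np.linspace(0.05, 0.95, 40)
X = 1.0 / (8.0 * s[None, :]) * np.log1p(
    4.0 * s[None, :] * (x[:, None] ** 2 - x[:, None]))

# Split "space-time" (left vectors) from "parameters" (right vectors).
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

r = 5        # truncation rank; an illustrative choice
s_new = 0.5  # a new parameter value, off the sample grid

# Interpolate each v_ell(s) at s_new (linear interpolation, one simple option).
v_new = np.array([np.interp(s_new, s, Vt[ell]) for ell in range(r)])

# Surrogate: f(s_new) ~= sum_ell u_ell * sigma_ell * v_ell(s_new).
f_approx = U[:, :r] @ (sigma[:r] * v_new)

# Compare against the true f(x, s_new).
f_true = 1.0 / (8.0 * s_new) * np.log1p(4.0 * s_new * (x**2 - x))
print(np.max(np.abs(f_approx - f_true)))
```

Because the data depend smoothly on s, a small rank plus simple interpolation already reproduces the new simulation closely.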
MapReduce and Interpolation

[Diagram: the workflow.]
The Database (on the MapReduce cluster): s1 -> f1, s2 -> f2, ..., sk -> fk. Use the SVD on the MapReduce cluster to get the singular vector basis.
The Surrogate (just one machine): form a linear combination of singular vectors.
New Samples (on the MapReduce cluster): interpolate and sample to get sa -> fa, sb -> fb, sc -> fc.
9
A quiz: which section would you rather try to interpolate, A or B?
[Figure: an interpolated curve with two marked sections, A and B.]
10
How predictable is a singular vector?
Folk Theorem (O'Leary 2011): the singular vectors of a matrix of "smooth" data become more oscillatory as the index increases.
Implication: the gradient of the singular vectors increases as the index increases.
$v_1(s), v_2(s), \ldots, v_t(s)$: predictable signal.
$v_{t+1}(s), \ldots, v_r(s)$: unpredictable noise.
11
[Figure: four panels plotting the singular vectors $v_1$, $v_2$, $v_3$, $v_7$ over $s \in [-1, 1]$.]
Fig. 1. An example of when the functions $v_\ell$ become difficult to interpolate. Each plot shows a singular vector from the example in Section 3, which we interpret as a function $v_\ell(s)$. While we might have some confidence in an interpolation of $v_1(s)$ and $v_2(s)$, interpolating $v_3(s)$ for $s$ near 1 is problematic, and interpolating $v_7(s)$ anywhere is dubious.
[Figure: the same four singular vectors on a finer discretization.]
Fig. 2. For reference, we show a finer discretization of the functions above, which shows that interpolating $v_7(s)$ near 1 is difficult.
Once we have determined the predictable bases, we interpolate them using the procedures discussed above to create the $\alpha_\ell(s)$. From the singular values and left singular vectors corresponding to the unpredictable bases, we can statistically characterize the noise in the surrogate function. This statistical characterization provides a time/space-varying prediction variance, which is related to the errors in the surrogate.
4. COMPUTING AN SVD WITH MAPREDUCE
Recall that X is m-by-p, where m is the product of the number of timesteps and spatial points, and p is the number of samples; the biggest computational bottleneck in this algorithm is computing the SVD of this matrix. The matrix is extremely tall-and-skinny because there are usually millions to billions of rows and around 1000 columns. Consequently, we can use an R-SVD procedure [6] to compute the truncated SVD of the matrix X by first doing a QR factorization of X, then an SVD on the small matrix R that results. Let
$$X = QR$$
be a QR factorization; then $R = U_R \Sigma V^T$, and
$$X = \underbrace{Q U_R}_{U}\, \Sigma V^T$$
is the SVD.
In practice, we use an approach in the MapReduce paradigm [7], which first computes the R in the QR factorization, and then computes $U = X V \Sigma^+$. This approach, although economical, may result in low accuracy if $\Sigma$ is highly ill-conditioned, and we continue to seek alternatives, although we do not seem to observe the worst-case loss of accuracy. For the QR factorization, we use a MapReduce implementation [8] of the communication-avoiding QR scheme [9].
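A serial numpy sketch of the two routes to U, with a random tall-and-skinny matrix standing in for X; the MapReduce versions distribute exactly these products, but the algebra is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 50))  # tall-and-skinny stand-in for X

# R-SVD: QR first, then an SVD of the small p-by-p factor R.
Q, R = np.linalg.qr(A)
U_R, sigma, Vt = np.linalg.svd(R)
U = Q @ U_R  # left singular vectors of A, via Q

# The indirect route used on MapReduce: only R is formed there, and
# U is recovered as U = A V Sigma^+ (accuracy can degrade when Sigma
# is ill-conditioned; this Gaussian example is well-conditioned).
U_indirect = A @ Vt.T @ np.diag(1.0 / sigma)

# Both routes share V and Sigma, so they produce the same U here.
print(np.max(np.abs(U - U_indirect)))
```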
Initially, each row of the matrix X is a record in the MapReduce paradigm, as is each record of the left singular vectors U. Thus, after the SVD, the interpolation just involves distributing the coefficients a via the distributed cache and performing the inner products. Moreover, we can compute the result for many interpolants simultaneously, a computational blocking technique that can amortize the effects of system overhead.
5. RESULTS
We now briefly present some results from a thermal-heating simulation of a complex geometry to illustrate the performance of this method on a real-world problem. There are three parameters s for this simulation, each of which controls a material property. The simulation is done with the Aria package in the SIERRA mechanics toolkit, both developed by Sandia National Laboratories for their simulations. An individual simulation has 240 time steps and 32768 spatial points and takes about 30 minutes to complete on a 32-core machine. Our database contained the output of 1000 simulations.
The SVD of this data took 30 minutes using the Dumbo python wrapper [10] with Hadoop 0.21 [11]. In Figure 4, we show a singular vector as a function. Subsequently, computing the data a for a single interpolant took about 4 seconds on a laptop. To evaluate 1000 separate interpolants took 8 minutes using a C++ code to do the matrix-vector products in a Hadoop streaming code.
The Hadoop cluster had 62 nodes, with 4 cores on each node. Thus, neglecting the cost of the SVD, the model reduction procedure takes 8 minutes · (62 nodes · 4 cores/node)/1000 simulations = 1.98 core-minutes per simulation; whereas the original simulation took 32 cores · 30 minutes = 960 core-minutes, for a speedup of around 450.
A refined method with an error model
Predictable / Unpredictable, with $\eta_j \sim N(0, 1)$.
Don't even try to interpolate the unpredictable modes.
$$f(s) \approx \sum_{j=1}^{t(s)} u_j \alpha_j(s) + \sum_{j=t(s)+1}^{r} u_j \sigma_j \eta_j$$

But now, how to choose t(s)?
12
$$\mathrm{Variance}[f] = \mathrm{diag}\left( \sum_{j=t(s)+1}^{r} \sigma_j^2\, u_j u_j^T \right)$$
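The diagonal of this variance can be formed without building any outer products. A small numpy check, with synthetic orthonormal vectors and decaying singular values standing in for the real factors:

```python
import numpy as np

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((100, 8)))  # orthonormal u_j stand-ins
sigma = np.logspace(0, -3, 8)                       # decaying singular values
t = 4                                               # predictability cutoff

# diag(sum_{j>t} sigma_j^2 u_j u_j^T) as an elementwise computation:
var = (U[:, t:] ** 2) @ (sigma[t:] ** 2)

# Check against the explicit (dense) construction.
M = sum(sigma[j] ** 2 * np.outer(U[:, j], U[:, j]) for j in range(t, 8))
print(np.max(np.abs(var - np.diag(M))))
```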
Our current approach to choosing the predictability:
t(s) is the largest τ such that the condition below holds. Better ideas? Come talk to me!
13
$$\frac{1}{\sigma_1} \sum_{i=1}^{\tau} \sigma_i \left\lVert \frac{\partial v_i}{\partial s} \right\rVert < \text{threshold}$$
We can use more black gradients than red gradients, so error will be higher for red.
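A sketch of this selection rule; the finite-difference gradient, the max-norm, and the threshold value are illustrative assumptions, and the toy right singular vectors simply mimic the increasing oscillation predicted by the folk theorem.

```python
import numpy as np

def choose_t(sigma, V, s, threshold):
    """Largest tau such that (1/sigma_1) * sum_{i<=tau} sigma_i * ||dv_i/ds||
    stays below the threshold. V holds v_i(s_j) in its columns."""
    grads = np.gradient(V, s, axis=0)                  # dv_i/ds per column
    weights = sigma * np.max(np.abs(grads), axis=0) / sigma[0]
    cumulative = np.cumsum(weights)
    below = np.nonzero(cumulative < threshold)[0]
    return below[-1] + 1 if below.size else 0

# Toy singular vectors v_i(s) = cos(i*pi*s): increasingly oscillatory in i.
s = np.linspace(-1.0, 1.0, 200)
V = np.cos(np.pi * np.outer(s, np.arange(1, 9)))   # columns v_1, ..., v_8
sigma = np.logspace(0, -2, 8)                      # decaying singular values
t = choose_t(sigma, V, s, threshold=5.0)
print(t)
```

Raising the threshold admits more (noisier) modes into the predictable set; lowering it keeps only the smoothest ones.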
An experimental test case
A heat equation problem with two parameters that control the material properties.
14
Our Reduced Order Model vs. The Truth
[Figure: the reduced order model beside the true simulation, error maps (scales 10^-3 to 10^-2) showing where the error is the worst, and a histogram of errors.]
15
A Large Scale Example
Nonlinear heat transfer model.
80k nodes, 300 time-steps.
104 basis runs.
SVD of a 24M x 104 data matrix.
500x reduction in wall-clock time (100x including the SVD).
16
SVD from QR: R-SVD
Old algorithm … helps when A is tall and skinny.
17
Let $A = QR$; then $A = Q U_R \Sigma_R V_R^T$, where $R = U_R \Sigma_R V_R^T$.
Intro to MapReduce Originated at Google for indexing web pages and computing PageRank.
The idea Bring the computations to the data.
Express algorithms in data-local operations. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.
[Diagram: MapReduce dataflow from numbered input splits through maps, the shuffle, and reducers. Input stored in triplicate; map output persisted to disk before the shuffle; reduce input/output on disk.]
Data scalable
Fault-tolerance by design
18
MapReduce TSQR summary
MapReduce is great for TSQR!
Data: a tall and skinny (TS) matrix, stored by rows.
Map: QR factorization of local rows. Reduce: QR factorization of local rows.
Input: 500,000,000-by-100 matrix. Each record: 1-by-100 row. HDFS size: 423.3 GB.
Time to compute the norm of each column: 161 sec. Time to compute R in qr( ): 387 sec.
On a 64-node Hadoop cluster with 4x2TB disks, one Core i7-920, and 12GB RAM per node.
Demmel et al. showed that this construction works to compute a QR factorization with minimal communication.
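The map and reduce steps above can be sketched serially in numpy: QR each block of rows (map), stack the small R factors, and QR the stack (reduce). The block count and matrix sizes are illustrative.

```python
import numpy as np

def tsqr_r(A, n_blocks=4):
    """TSQR sketch: local QR per row block, then QR of the stacked R
    factors. Returns only R, as in the MapReduce implementation."""
    blocks = np.array_split(A, n_blocks, axis=0)
    stacked = np.vstack([np.linalg.qr(block)[1] for block in blocks])
    return np.linalg.qr(stacked)[1]

rng = np.random.default_rng(2)
A = rng.standard_normal((4000, 20))
R_tsqr = tsqr_r(A)
R_direct = np.linalg.qr(A)[1]

# R is unique up to the signs of its rows; normalize before comparing.
fix = lambda R: R * np.sign(np.diag(R))[:, None]
print(np.max(np.abs(fix(R_tsqr) - fix(R_direct))))
```

The recursion can continue across more levels of the reduce tree, which is what makes the scheme communication-avoiding at scale.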
19
Key Limitations
Computes only R and not Q.
Can get Q via Q = A R^+ with another MR iteration (we currently use this for computing the SVD).
Not numerically orthogonal; iterative refinement helps.
We are working on better ways to compute Q (with Austin Benson, Jim Demmel).
20
Our vision: to enable analysts and engineers to hypothesize from data computations instead of expensive HPC computations.
Paul G. Constantine
21