Deterministic Random Walk Preconditioning for Power Grid Analysis

Jia Wang

Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, Illinois, United States

November 2012


Agenda

Power Grid Analysis

Overview of IC and Stochastic Preconditionings

Experimental Results

The Key Problem

Solve the linear system $Ax = b$ for the vector x.

Properties of the LHS matrix A: square, sparse, nonsingular, and symmetric; diagonally dominant with nonpositive off-diagonal elements. Together these imply that A is a symmetric M-matrix and symmetric positive definite (S.P.D.).

Such systems arise from many power grid analysis problems: DC analysis, and transient simulation without mutual inductances after discretization [Chen and Chen DAC'01].
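To make these properties concrete, the sketch below builds the conductance matrix of a small resistive mesh and checks them numerically. The mesh model, the supply ("pad") conductance, and the helper names `grid_matrix` and `check_properties` are illustrative assumptions, not taken from the IBM benchmarks; dense NumPy storage is used only for readability, since real grids need sparse storage.

```python
import numpy as np


def grid_matrix(nx, ny, g=1.0, pad_nodes=(), g_pad=10.0):
    """Conductance matrix of an nx-by-ny resistive mesh (hypothetical model).

    Every mesh edge has conductance g. Each node in pad_nodes is tied to an
    ideal supply through conductance g_pad, which only adds to its diagonal,
    so those rows become strictly diagonally dominant.
    """
    n = nx * ny
    A = np.zeros((n, n))
    for y in range(ny):
        for x in range(nx):
            u = y * nx + x
            for dx, dy in ((1, 0), (0, 1)):
                if x + dx < nx and y + dy < ny:
                    v = (y + dy) * nx + (x + dx)
                    A[u, v] -= g
                    A[v, u] -= g
                    A[u, u] += g
                    A[v, v] += g
    for p in pad_nodes:
        A[p, p] += g_pad
    return A


def check_properties(A, tol=1e-10):
    """Numerically check symmetry, sign pattern, dominance, and definiteness."""
    off = A - np.diag(np.diag(A))
    return {
        "symmetric": bool(np.allclose(A, A.T)),
        "nonpositive_offdiag": bool((off <= 0).all()),
        "diag_dominant": bool((np.diag(A) >= -off.sum(axis=1)).all()),
        "spd": bool(np.linalg.eigvalsh(A).min() > tol),
    }


A = grid_matrix(4, 4, pad_nodes=(0, 15))
print(check_properties(A))
```

Without any pad connection the same mesh gives a singular Laplacian, which is why at least one node must be tied to the supply for A to be nonsingular.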


Challenges and Solutions

The problem size is extremely large.

Recent public IBM power grid benchmarks can have more than 1 million variables.

Storage of A is not a concern as A is sparse.

Direct method: Cholesky factorization. Although the factor remains relatively sparse under minimum degree orderings, it could still require tremendous storage. Computing the factor is wasted effort if it is used only a few times, e.g. for DC analysis.

Fast approximations: exploit certain aspects of the power grid, e.g. its mesh structure. Desirable if accuracy can be traded off for solution time.

Iterative methods, e.g. based on Conjugate Gradient.

Hierarchical analysis.

Iterative Methods

Preferred when accuracy becomes a concern, e.g. for the recent power grid contests (TAU'11, TAU'12). The preconditioner matters: classical ones or fast approximators.

Approximating factors: incomplete Cholesky (IC) [Chen and Chen DAC'01]; stochastic preconditioning [Qian and Sapatnekar ICCAD'05].

Multigrid: with successive over-relaxation [Zhong and Wong ICCAD'05], on GPU [Feng et al. TCAD'11]; algebraic multigrid (AMG) [PowerRush, Yang et al. ICCAD'11].

Many others: domain decomposition [Sun et al. ICCAD'07], sub-circuit [Chou et al. ICCAD'11], Poisson solver [Yang et al. ICCAD'11], support graph [Zhao et al. ICCAD'11].


fastest convergence among explicit preconditioners

fastest for DC analysis on IBM benchmarks
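The preconditioners listed above are all applied inside preconditioned conjugate gradient (PCG). Below is a minimal textbook PCG sketch, assuming dense NumPy arrays and a caller-supplied `apply_prec(r)` (a hypothetical name) that returns $M^{-1} r$; the Jacobi preconditioner and the tridiagonal test matrix in the usage lines are placeholders, not any of the cited methods.

```python
import numpy as np


def pcg(A, b, apply_prec, tol=1e-8, max_iter=1000):
    """Textbook preconditioned conjugate gradient for SPD A.

    apply_prec(r) must return M^{-1} r for an SPD preconditioner M.
    Stops when ||r|| <= tol * ||b||; returns (x, iterations used).
    """
    x = np.zeros(b.shape[0])
    r = b - A @ x
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return x, it
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p   # new search direction, M-conjugate to the old ones
        rz = rz_new
    return x, max_iter


# Placeholder data: a 1-D chain with a grounding term, Jacobi preconditioner.
n = 50
A = 2.1 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
diag = np.diag(A).copy()
x, iters = pcg(A, b, lambda r: r / diag)
print(iters, np.linalg.norm(A @ x - b))
```

Replacing `apply_prec` with triangular solves against approximate factors $L D L^T$ turns the same loop into IC- or stochastic-preconditioned CG.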

Stochastic Preconditioning

Utilizes random walks on the power grid: fast approximation of solutions [Qian et al. DAC'03] and preconditioning with approximated factors [Qian and Sapatnekar ICCAD'05].

Advantage: fast convergence. We observed that the residual can be halved in each PCG iteration with fewer than 2X as many non-zeros in the approximated factors as in A.

Disadvantage: long setup time. Actual random walks must be performed via Monte Carlo simulations to generate the preconditioner, which remains relatively slow despite speed-up techniques [Qian and Sapatnekar '08].

Can we speed up the setup process without affecting preconditioner quality?

Hierarchical Analysis

Divide-and-conquer [Zhao et al. TCAD'02]: partition the whole grid into local grids and a global grid, and approximate the Schur complement, since the exact one is usually dense. Closely related to domain decomposition methods.

Hierarchical random walks: use random walks to approximate the Schur complement, reusing local random walks at pre-defined partitioning interfaces [Qian et al. TCAD'05]. Alternatively, form the hierarchy by random sampling instead of partitioning [Li DAC'05].

Can we use hierarchical random walks to speed up stochastic preconditioning? This is not trivial, since factorizing the approximated Schur complement may affect preconditioner quality.
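The density of the exact Schur complement is easy to see on a small example. This sketch, with hypothetical helpers `grid_laplacian` and `schur_complement`, keeps the left and right node columns of a 5x5 mesh as the global grid and eliminates the 15 nodes in between; the small diagonal shift stands in for supply connections.

```python
import numpy as np


def grid_laplacian(nx, ny, g=1.0):
    """Weighted Laplacian of an nx-by-ny mesh (no supply connections)."""
    n = nx * ny
    A = np.zeros((n, n))
    for y in range(ny):
        for x in range(nx):
            u = y * nx + x
            nbrs = ([u + 1] if x + 1 < nx else []) + ([u + nx] if y + 1 < ny else [])
            for v in nbrs:
                A[u, v] = A[v, u] = -g
                A[u, u] += g
                A[v, v] += g
    return A


def schur_complement(A, keep):
    """Exact Schur complement S = A_CC - A_CB A_BB^{-1} A_BC of symmetric A.

    C = keep is the retained (global) set; B is everything else (local).
    """
    keep = np.asarray(keep)
    elim = np.setdiff1d(np.arange(A.shape[0]), keep)
    A_cc = A[np.ix_(keep, keep)]
    A_cb = A[np.ix_(keep, elim)]
    A_bb = A[np.ix_(elim, elim)]
    return A_cc - A_cb @ np.linalg.solve(A_bb, A_cb.T)


A = grid_laplacian(5, 5) + 0.01 * np.eye(25)
keep = [y * 5 for y in range(5)] + [y * 5 + 4 for y in range(5)]
S = schur_complement(A, keep)
print(np.count_nonzero(A[np.ix_(keep, keep)]), np.count_nonzero(np.abs(S) > 1e-12))
# 26 100
```

The kept block of A has 26 non-zeros, but every pair of kept nodes is connected through the eliminated region, so all 100 entries of S are non-zero.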


Our Contributions

We design an algorithm that estimates random walk probabilities in a deterministic manner in order to build the preconditioner. It exploits the probabilities of previous random walks to avoid expensive Monte Carlo simulations, and eliminates them entirely in a simplified setting.

It generates preconditioners of similar quality to stochastic preconditioning in less time, competes with AMG-PCG for DC analysis, and is potentially efficient for transient simulations when the direct method is prohibitive due to its storage requirement.

In some sense, we implicitly factor the Schur complement without explicitly approximating it first, similar to traditional incomplete factorizations.


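Incomplete Cholesky Factorization

The exact root-free Cholesky factorization $A = LDL^T$ (L unit lower triangular, D diagonal) computes, for each column k and each i > k,
$d_{k,k} = a_{k,k} - \sum_{j<k} l_{k,j}^2 d_{j,j}$ and $l_{i,k} = \left(a_{i,k} - \sum_{j<k} l_{i,j} l_{k,j} d_{j,j}\right) / d_{k,k}$.

Incomplete Cholesky [Meijerink and van der Vorst '77] drops fill-ins during this factorization.

Correctness is guaranteed for M-matrices [Manteuffel '80].

Dropping removes nonnegative terms $l_{k,j}^2 d_{j,j}$ and $l_{i,j} l_{k,j} d_{j,j}$ from these sums (for an M-matrix, $l_{i,j} \le 0$ and $d_{j,j} > 0$), so IC tends to overestimate $d_{k,k}$ and $l_{i,k}$. This systematic difference could contribute to slow convergence for large problems.

Yet $d_{k,k}$ and $l_{i,k}$ cannot be decreased to compensate without affecting the correctness guarantee.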
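The standard left-looking root-free Cholesky recurrences (restated in the docstring) can be sketched densely, with an option to drop every fill-in, i.e. zero-fill IC(0). The function name `ic_ldl` and the `drop_fill` switch are our own; practical IC codes store L sparsely and often use threshold dropping instead of a fixed pattern.

```python
import numpy as np


def ic_ldl(A, drop_fill=True):
    """Root-free (incomplete) Cholesky A ~= L D L^T, left-looking, dense storage.

    d_k  = a_kk - sum_{j<k} l_kj^2 d_j
    l_ik = (a_ik - sum_{j<k} l_ij l_kj d_j) / d_k
    With drop_fill=True every fill-in (l_ik where a_ik == 0) is dropped,
    giving IC(0); with drop_fill=False this is the exact factorization.
    """
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        d[k] = A[k, k] - np.sum(L[k, :k] ** 2 * d[:k])
        for i in range(k + 1, n):
            if drop_fill and A[i, k] == 0:
                continue  # fill-in position: leave l_ik = 0
            L[i, k] = (A[i, k] - np.sum(L[i, :k] * L[k, :k] * d[:k])) / d[k]
    return L, d


# 2-D 3x3 mesh with a small grounding term.
T3 = 2 * np.eye(3) - np.eye(3, k=1) - np.eye(3, k=-1)
A = np.kron(np.eye(3), T3) + np.kron(T3, np.eye(3)) + 0.1 * np.eye(9)
L, d = ic_ldl(A)
print(np.abs(L @ np.diag(d) @ L.T - A).max())  # non-zero: IC(0) is inexact here
```

For a tridiagonal matrix the exact factor has no fill, so IC(0) and the exact factorization coincide; on a 2-D mesh they do not.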


Matrix, Graph, and Random Walks

Construct a graph G from A: edges in G represent non-zeros in A, and edges from/to an extra vertex represent row/column sums.

Define a random walk game on G: the transfer probability is proportional to the edge weight, e.g. from vertex 1 there is a 1/3 chance to transfer to each of 2, 3, and 4.

Consider random walks that start at k, pass only vertices no greater than k, and end at some i > k, e.g. walks like 3,1,2,1,4 but not 3,1,4,1,4.
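A sketch of the random walk game: from vertex u the walk moves to neighbor v with probability $-a_{u,v}/a_{u,u}$ and to the extra vertex with probability (row sum of u)$/a_{u,u}$. Vertices are 0-based in the code, so the slide's vertex 1 is index 0; the 4x4 star matrix, the `GROUND` label, and the helper names are assumptions chosen to reproduce the 1/3 example.

```python
import numpy as np

GROUND = -1  # hypothetical label for the extra vertex that carries row sums


def transfer_probabilities(A):
    """Transfer probabilities of the random walk game on the graph G of A.

    From u the walk moves to a neighbor v with probability -a_uv / a_uu and
    to the extra vertex with probability (row sum of u) / a_uu, so the
    probabilities out of every vertex add up to one.
    """
    n = A.shape[0]
    probs = {}
    for u in range(n):
        out = {v: -A[u, v] / A[u, u] for v in range(n) if v != u and A[u, v] != 0}
        to_ground = A[u].sum() / A[u, u]
        if to_ground > 0:
            out[GROUND] = to_ground
        probs[u] = out
    return probs


def is_column_walk(walk, k):
    """Walk starts at k, every internal vertex is <= k, and it ends at some i > k."""
    return walk[0] == k and all(v <= k for v in walk[1:-1]) and walk[-1] > k


# Star example: index 0 (the slide's vertex 1) links to 1, 2 and 3,
# which are also tied to the extra vertex.
A = np.array([[3., -1, -1, -1], [-1, 2, 0, 0], [-1, 0, 2, 0], [-1, 0, 0, 2]])
P = transfer_probabilities(A)
print(P[0])                                            # three entries, each 1/3
print(is_column_walk([3, 1, 2, 1, 4], 3), is_column_walk([3, 1, 4, 1, 4], 3))  # True False
```

`is_column_walk` only compares labels, so it accepts the slide's 1-based walks directly.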


Stochastic Preconditioning

Random walks vs. root-free Cholesky factorization: $E_k$ is the expected number of times those random walks pass k [Qian and Sapatnekar '08].

One can perform Monte Carlo simulations to compute Cholesky factorizations! There is no error accumulation or systematic difference as in IC, and many properties of D and L can be easily preserved:

Diagonal elements of D are positive. Off-diagonal elements of L are nonpositive. L is column-wise diagonally dominant.

However, the length of the random walks cannot be bounded, e.g. 3,1,2,1,2,1,2,...,1,4; this is even worse if the edge from 1 to 2 has a much larger weight.
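The sketch below estimates one column of the factor by Monte Carlo, in the spirit of stochastic preconditioning. It assumes the convention $l_{i,k} = -\Pr[\text{walk ends at } i]$ and $d_{k,k} = a_{k,k}/E_k$, with $E_k$ counting the starting visit; we derived this convention from the Schur complement rather than taking it from [Qian and Sapatnekar '08], and `mc_column` is a hypothetical name.

```python
import numpy as np


def mc_column(A, k, n_walks=20000, seed=0):
    """Monte Carlo estimate of column k of the root-free factor A = L D L^T.

    Each walk starts at k, moves to v with probability -a_uv/a_uu and to the
    extra vertex with probability rowsum_u/a_uu, and stops at the first vertex
    i > k (counted towards l_ik) or when absorbed at the extra vertex.
    Assumed convention: l_ik = -Pr[walk ends at i], d_k = a_kk / E_k.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    diag = np.diag(A)
    T = -A / diag[:, None]                     # transfer probabilities to vertices
    np.fill_diagonal(T, 0.0)
    ground = np.maximum(A.sum(axis=1), 0.0) / diag
    cum = np.cumsum(np.hstack([T, ground[:, None]]), axis=1)
    visits = 0
    ends = np.zeros(n)
    for _ in range(n_walks):
        u = k
        visits += 1
        while True:
            v = min(int(np.searchsorted(cum[u], rng.random(), side="right")), n)
            if v == n:              # absorbed at the extra vertex
                break
            if v > k:               # first vertex above k: the walk ends here
                ends[v] += 1
                break
            if v == k:              # a return to the starting vertex
                visits += 1
            u = v
    E_k = visits / n_walks
    return A[k, k] / E_k, {i: -ends[i] / n_walks for i in range(k + 1, n)}


A = np.array([[3., -1, -1, -1], [-1, 2, 0, 0], [-1, 0, 2, 0], [-1, 0, 0, 2]])
print(mc_column(A, 2))
```

For this matrix the exact values are $d_{2,2} = 1.6$ and $l_{3,2} = -0.25$ (0-based indices), and the estimates approach them as `n_walks` grows; the unbounded walk length is exactly what makes this loop expensive.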


Decomposition of Random Walks

If we can decompose an arbitrarily long walk into a bounded number of sub-walks, we may be able to bound the worst-case Monte Carlo simulation time, and may even be able to reduce the average Monte Carlo simulation time.

Exclude walks returning to the starting point: walks like 3,1,2,1,3,1,2,1,3,... would be difficult to decompose.

Leverage monotonicity: once walks returning to the starting point are excluded, we may construct a walk starting at k from walks starting at j < k.

The total number of sub-walks is then bounded, as the ending points are monotonically increasing.

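Excluding Returning Walks

Consider the first time a walk starting at k returns to k. Either the walk never returns to k, in which case all its internal vertices are less than k, or it travels only through vertices less than k before returning to k for the first time.

For probabilities, let $P_k(i)$ be the probability that a walk from k whose internal vertices are all at most k ends at i > k, and let $\hat{P}_k(j)$, $j \ge k$, be the probability that a walk from k whose internal vertices are all less than k first reaches a vertex in {k, k+1, ...} at j. Then
$P_k(i) = \hat{P}_k(i) + \hat{P}_k(k)\,P_k(i)$, i.e. $P_k(i) = \hat{P}_k(i) / (1 - \hat{P}_k(k))$.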

The same reasoning can be applied to the expectation.
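Excluding returning walks also makes the probabilities computable exactly on small cases. In the sketch below (hypothetical helper names), the probabilities for walks whose internal vertices are all below k come from one linear solve on the block of A indexed by those vertices (the standard absorbing-chain identity), and summing the geometric series over returns to k recovers column k of the exact root-free factor. It is a dense check of the reasoning, not the DRW algorithm itself.

```python
import numpy as np


def first_exit_probs(A, k):
    """Exact probabilities for walks from k whose internal vertices are all < k.

    Returns h of length n with h[i] (i >= k) = probability that the walk first
    reaches {k, ..., n-1} at vertex i; h[k] is the probability of returning to
    k. Walks absorbed at the extra (row-sum) vertex are counted nowhere.
    """
    n = A.shape[0]
    p = -A[k] / A[k, k]              # one-step transfer probabilities out of k
    p[k] = 0.0
    h = np.zeros(n)
    h[k:] = p[k:]
    if k > 0:
        # From each start in B = {0..k-1}, the probability of first leaving B
        # at each vertex of C = {k..n-1} is A_BB^{-1} (-A_BC).
        hit = np.linalg.solve(A[:k, :k], -A[:k, k:])
        h[k:] += p[:k] @ hit
    return h


def column_from_walks(A, k):
    """d_k and the below-diagonal part of column k of L, summing over returns."""
    h = first_exit_probs(A, k)
    ret = h[k]                           # return probability, must be < 1
    expected_visits = 1.0 / (1.0 - ret)  # geometric series over returns to k
    d_k = A[k, k] / expected_visits
    l_k = -h[k + 1:] * expected_visits
    return d_k, l_k


A = np.array([[3., -1, -1, -1], [-1, 2, 0, 0], [-1, 0, 2, 0], [-1, 0, 0, 2]])
print(column_from_walks(A, 2))   # approximately (1.6, [-0.25])
```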


Monotonic Decomposition

Consider the second vertex j in a walk that starts at k but does not return to k, e.g. 3,5 or 3,1,2,1,2,1,4. Let $p_j$ be the probability of this single transfer step. (More steps may be considered; see the paper for details.)

If j > k, the walk stops here.

If j < k, the remaining part of the walk can be decomposed monotonically, e.g. 1,2,1,2,1,4 decomposes into 1,2 and 2,1,2,1,4.

The probabilities can therefore be calculated from those of walks starting at vertices less than k.
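One way to write the resulting recurrence, in our own notation (a reconstruction, not the paper's formulas): let $q_j$, $j < k$, be the probability that j is the endpoint of one of the monotone sub-walks, and reuse the columns of L already computed, since $-l_{j,m}$ is the probability that a walk from m through vertices no greater than m first reaches a vertex above m at j.

```latex
% p_j = -a_{k,j}/a_{k,k} for j \ne k, and p_k = 0.
\begin{aligned}
q_j &= p_j + \sum_{m<j} q_m\,(-l_{j,m}), && j < k,\\
q_i &= p_i + \sum_{m<k} q_m\,(-l_{i,m}), && i \ge k,\\
d_{k,k} &= a_{k,k}\,(1 - q_k), \qquad l_{i,k} = -\frac{q_i}{1 - q_k}, && i > k.
\end{aligned}
```

Here $q_k$ is the probability of returning to k and $q_i$ ($i > k$) the probability of ending at i without returning, so the last line is the geometric sum over returns.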


Deterministic Random Walk

We no longer need Monte Carlo simulations.

The computation of probabilities can be simplified.


Or, even simpler, let q represent the vector of the above common expression and the LHS probabilities; computing q is then a partial forward substitution.

Compute L column by column, just like IC.
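A dense sketch of this column-by-column computation, under our reading of the recurrence: q starts as the one-step probabilities out of k, a forward substitution with the already computed columns of L turns it into sub-walk and exit probabilities, and $q_k$ is the return probability. `drw_factor` and its `f` hook are hypothetical names; without dropping, the result coincides with the exact root-free Cholesky factor.

```python
import numpy as np


def drw_factor(A, f=None):
    """Column-by-column root-free factorization from walk probabilities.

    For column k (our reconstruction of the recurrence):
      q    starts as the one-step probabilities out of k, with q_k = 0
      q    partial forward substitution with the columns of L left of k
      q_k  probability of returning to k (must stay below 1)
      d_k = a_kk (1 - q_k),   l_ik = -q_i / (1 - q_k) for i > k
    Only L is read; f (optional) post-processes each new column, e.g. dropping.
    """
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        q = -A[k] / A[k, k]
        q[k] = 0.0
        for i in range(k):                  # sub-walk endpoint probabilities, i < k
            q[i] -= L[i, :i] @ q[:i]
        q[k:] -= L[k:, :k] @ q[:k]          # return/exit probabilities, i >= k
        if q[k] >= 1.0:
            raise ValueError(f"return probability at vertex {k} is not below 1")
        d[k] = A[k, k] * (1.0 - q[k])
        col = -q[k + 1:] / (1.0 - q[k])
        L[k + 1:, k] = col if f is None else f(col)
    return L, d


A = np.array([[3., -1, -1, -1], [-1, 2, 0, 0], [-1, 0, 2, 0], [-1, 0, 0, 2]])
L, d = drw_factor(A)
print(np.allclose(L @ np.diag(d) @ L.T, A))   # True: no dropping, exact factor
```

Note that D is never read when computing later columns; this matches the remark that DRW only uses L.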

The DRW Algorithm

The forward substitution should be done in a sparse manner using the algorithm proposed by Gilbert and Peierls.

Correctness is guaranteed by choosing a vertex ordering that ensures $q_k < 1$, together with a proper dropping function f.
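The idea behind the Gilbert and Peierls sparse triangular solve is to find, by depth-first search, the entries reachable from the non-zeros of the right-hand side in the graph of L (edge j to i whenever $l_{i,j} \ne 0$), and to touch only those columns, in topological order. A pure-Python sketch, assuming L is stored as per-column lists of strictly lower (row, value) pairs with an implicit unit diagonal; `sparse_lower_solve` is a hypothetical name.

```python
def sparse_lower_solve(cols, b):
    """Solve L x = b for unit lower-triangular L, visiting only reachable columns.

    cols[j] lists (i, l_ij) pairs with i > j (strictly lower part of column j).
    b is a dict {index: value}; the result is a dict holding the nonzero
    pattern of x, which is the set of vertices reachable from the keys of b.
    """
    visited, order = set(), []
    for s in b:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(cols[s]))]
        while stack:                     # iterative DFS recording post-order
            j, it = stack[-1]
            for i, _ in it:
                if i not in visited:
                    visited.add(i)
                    stack.append((i, iter(cols[i])))
                    break
            else:
                stack.pop()
                order.append(j)
    x = dict(b)
    for j in reversed(order):            # reverse post-order is topological
        xj = x.get(j, 0.0)
        for i, lij in cols[j]:
            x[i] = x.get(i, 0.0) - lij * xj
    return x


cols = [[(2, -0.5)], [(2, -0.25), (3, -0.5)], [(3, -0.2)], []]
print(sparse_lower_solve(cols, {0: 1.0}))   # {0: 1.0, 2: 0.5, 3: 0.1}
```

In a DRW-style use, the lists for columns k and beyond would still be empty when column k is computed, so the same routine performs only the partial forward substitution.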

Implementation Details

Vertex ordering: Reverse Cuthill-McKee (RCM) ordering guarantees $q_k < 1$ and is memory friendly.

Dropping and compensation scheme f: globally specify a bound on the number of fill-ins; keep all entries larger than a given threshold; keep the largest few entries until the per-column bound is met; and compensate for the dropped entries by scaling up the remaining entries accordingly.

Differences from an IC preconditioning algorithm with a similar dropping scheme: DRW only uses L, while IC uses both D and L; to compute the next column, DRW usually visits more columns of L via forward substitution than IC does; and compensation of L is not allowed in IC, since it would break IC's correctness guarantee.
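A sketch of one possible f, assuming "compensate" means rescaling the kept entries so that the column sum is unchanged; the slides do not give the exact rule, and the function name and parameters are hypothetical. The global fill-in bound is assumed to have already been turned into the per-column limit `max_keep`.

```python
import numpy as np


def drop_and_compensate(col, max_keep, threshold):
    """Hypothetical dropping function f for one new column of L.

    Keeps every entry whose magnitude exceeds `threshold`, then adds the
    largest remaining entries until `max_keep` entries are kept, and finally
    scales the kept entries so the column sum equals the original sum.
    """
    col = np.asarray(col, dtype=float)
    mag = np.abs(col)
    keep = mag > threshold                       # always keep large entries
    n_kept = int(keep.sum())
    for idx in np.argsort(-mag, kind="stable"):  # then largest first
        if n_kept >= max_keep or mag[idx] == 0.0:
            break
        if not keep[idx]:
            keep[idx] = True
            n_kept += 1
    out = np.where(keep, col, 0.0)
    kept_sum = out.sum()
    if kept_sum != 0.0:
        out *= col.sum() / kept_sum              # scale up to restore the column sum
    return out


col = np.array([-0.5, -0.05, -0.2, -0.01, 0.0])
print(drop_and_compensate(col, max_keep=2, threshold=0.3))
```

Preserving the column sum keeps the total probability mass of the column, which is one way to read "scaling up the remaining entries accordingly"; other compensation rules are possible.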


DC Analysis on IBM Benchmarks

Conjugate Gradient with DRW preconditioners. Running times are measured on a 64-bit Linux workstation with a 2.4GHz Intel Q6600 processor and 8GB of memory. Preconditioners have the same size (number of non-zeros) as A.

PowerRush running times are obtained by scaling those from [Yang et al. ICCAD'11].

Detailed Comparisons

All preconditioners are 1.7 times the size of A.

All results are averaged over 100 random RHS vectors.

IC uses the same vertex ordering and dropping scheme as DRW, but does not compensate for dropped entries.

Stochastic preconditioning uses a different ordering than RCM, which adversely affects PCG solution time.

Impact of Memory Usage

For IBM DC benchmarks, DRW can be both memory and running time efficient at the same time.

The highly optimized direct-method implementation (CHOLMOD) is very efficient once a factorization is obtained. However, it consumes a lot of resources for setup and storage.

Conclusion & Future Work

By estimating random walk probabilities in a deterministic manner, we avoided costly Monte Carlo simulations in building a high quality preconditioner.

Our proposed techniques could be extended to handle asymmetric matrices and matrices with positive off-diagonal elements.

We now have an updated version of the DRW algorithm that accepts any vertex ordering while remaining correct. This enables orderings like AMD when RCM is not suitable, e.g. for the highly coupled IBM transient simulation benchmarks published at TAU'12.

Thanks!
