TRANSCRIPT
Stochastic Optimization of Complex Energy Systems on High-Performance Computers
Cosmin G. Petra
Mathematics and Computer Science Division
Argonne National Laboratory, [email protected]
SIAM CSE 2013
Joint work with Olaf Schenk (USI Lugano), Miles Lubin (MIT), Klaus Gaertner (WIAS Berlin)
Outline
Application of HPC to power-grid optimization under uncertainty
Parallel interior-point solver (PIPS-IPM) – structure exploiting
Revisiting linear algebra
Experiments on BG/P with the new features
Stochastic unit commitment with wind power
Wind forecast – WRF (Weather Research and Forecasting) model
– Real-time grid-nested 24h simulation
– 30 samples require 1h on 500 CPUs (Jazz@Argonne)
\min \; \frac{1}{N_S} \sum_{s \in S} \sum_{k \in T} \sum_{j \in N} \left( c_j \, p_{sjk} + c^u_j \, u_{jk} + c^d_j \, d_{jk} \right)

s.t. \quad \sum_{j \in N} p_{sjk} + \sum_{j} p^{wind}_{sjk} = D_k, \quad \forall s \in S, \; k \in T

\qquad \sum_{j \in N} \bar p_j \, u_{jk} \ge D_k + R_k, \quad \forall s \in S, \; k \in T

\qquad 0 \le p^{wind}_{sjk} \le \bar p^{wind}_{sjk}

\qquad ramping constr., min. up/down constr.

(S: scenario set, N: generator set, T: time horizon; p: dispatch levels, u/d: start-up/shut-down decisions)
Slide courtesy of V. Zavala & E. Constantinescu
(Figure: wind farm and thermal generator)
Stochastic Formulation
Discrete distribution leads to block-angular LP
Large-scale (dual) block-angular LPs
• In terminology of stochastic LPs:
• First-stage variables (decision now): x0
• Second-stage variables (recourse decision): x1, …, xN
• Each diagonal block is a realization of a random variable (scenario)
Extensive form
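The extensive form can be sketched with SciPy's sparse block assembly. The toy data below (A, Ti, Wi and their sizes) are made up for illustration and are not from the talk:

```python
import numpy as np
import scipy.sparse as sp

def extensive_form(A, Ts, Ws):
    """Assemble the dual block-angular constraint matrix
         [ A              ]
         [ T1  W1         ]
         [ T2      W2     ]
         [ T3          W3 ]
    where A constrains only the first-stage variables x0 and scenario i
    contributes the coupled block row  Ti x0 + Wi xi = hi."""
    N = len(Ws)
    rows = [[A] + [None] * N]
    for i, (T, W) in enumerate(zip(Ts, Ws)):
        row = [T] + [None] * N
        row[1 + i] = W
        rows.append(row)
    return sp.bmat(rows, format="csr")

# Toy instance: 2 first-stage variables, 3 scenarios, 2 recourse variables each.
A = sp.csr_matrix([[1.0, 1.0]])
Ts = [sp.csr_matrix([[1.0, 0.0], [0.0, 1.0]]) for _ in range(3)]
Ws = [sp.identity(2, format="csr") for _ in range(3)]
M = extensive_form(A, Ts, Ws)
print(M.shape)  # (7, 8): 1 + 3*2 rows, 2 + 3*2 columns
```

Note how `None` blocks encode the zeros: each scenario column is touched only by its own Wi, which is exactly the block-angular sparsity that PIPS-IPM exploits.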
Computational challenges and difficulties
May require many scenarios (100s, 1,000s, 10,000s, …) to accurately model uncertainty
"Large" scenarios (Wi up to 250,000 x 250,000)
"Large" 1st stage (1,000s, 10,000s of variables)
Easy to build a practical instance that requires 100+ GB of RAM to solve
Requires distributed memory
Real-time solution needed in our applications
Linear algebra of primal-dual interior-point methods (IPM)
Convex quadratic problem:

\min_x \; \tfrac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax = b, \; x \ge 0

IPM linear system:

\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} = \text{rhs}
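A minimal dense sketch of one such linear-system solve, on a made-up toy QP (a real primal-dual IPM augments Q with the barrier term X^{-1}S and iterates with predictor/corrector right-hand sides):

```python
import numpy as np

# Hypothetical toy QP (data invented for illustration):
#   min 1/2 x^T Q x + c^T x   s.t.  Ax = b, x >= 0
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Solve the saddle-point system  [Q A^T; A 0] [dx; dy] = rhs.
n, m = Q.shape[0], A.shape[0]
K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, b])
sol = np.linalg.solve(K, rhs)
dx, dy = sol[:n], sol[n:]
print(dx, dy)  # [0. 1.] [2.]
```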
Two-stage SP: arrow-shaped linear system (modulo a permutation)

\begin{bmatrix}
K_1 &     &        &     & B_1 \\
    & K_2 &        &     & B_2 \\
    &     & \ddots &     & \vdots \\
    &     &        & K_N & B_N \\
B_1^T & B_2^T & \cdots & B_N^T & K_0
\end{bmatrix}

Each diagonal block K_i is itself a saddle-point matrix built from the scenario Hessian H_i and Jacobian A_i; K_0 is the first-stage block.

Multi-stage SP: nested arrow-shaped systems

N is the number of scenarios

2 solves per IPM iteration:
– predictor directions
– corrector directions
Special Structure of KKT System (Arrow-shaped)
Parallel Solution Procedure for KKT System
The solve proceeds in five steps: 1) compute the scenario contributions B_i^T K_i^{-1} B_i, 2) form the Schur complement C, 3) factorize C, 4) solve the first-stage (Schur) system, 5) back-substitute for the scenario variables.
Steps 1 and 5 trivially parallel – "scenario-based decomposition"
Steps 1, 2, 3 are >95% of total execution time.
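The five steps can be sketched serially in NumPy on toy dense blocks (in PIPS-IPM each scenario's steps 1 and 5 run on their own MPI process and the blocks are sparse; all sizes and data below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n0, ni = 3, 2, 4   # scenarios, first-stage size, per-scenario size (toy)

# Symmetric indefinite toy blocks: K_i on the diagonal, borders B_i,
# first-stage block K_0, and a right-hand side (r_1, ..., r_N, r_0).
Ks = [np.diag(rng.uniform(1.0, 2.0, ni) * np.array([1, 1, -1, -1])) for _ in range(N)]
Bs = [rng.standard_normal((ni, n0)) for _ in range(N)]
K0 = np.diag([5.0, -5.0])
rs = [rng.standard_normal(ni) for _ in range(N)]
r0 = rng.standard_normal(n0)

# Steps 1-2: scenario contributions and the Schur complement
#   C = K_0 - sum_i B_i^T K_i^{-1} B_i
# (each loop iteration is independent -> one scenario per process).
C = K0.copy()
rhs0 = r0.copy()
for Ki, Bi, ri in zip(Ks, Bs, rs):
    C -= Bi.T @ np.linalg.solve(Ki, Bi)
    rhs0 -= Bi.T @ np.linalg.solve(Ki, ri)

# Steps 3-4: factorize/solve the dense first-stage system C x0 = rhs0.
x0 = np.linalg.solve(C, rhs0)

# Step 5: recover scenario unknowns x_i = K_i^{-1} (r_i - B_i x0), in parallel.
xs = [np.linalg.solve(Ki, ri - Bi @ x0) for Ki, Bi, ri in zip(Ks, Bs, rs)]
```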
Components of Execution Time
Notice break in y-axis scale
Scenario Calculations – Steps 1 and 5
Each scenario is assigned to an MPI process, which locally performs steps 1 and 5.
Matrices are sparse and symmetric indefinite (symmetric with positive and negative eigenvalues).
Computing B_i^T K_i^{-1} B_i is very expensive: it requires solving with the factors of K_i against the non-zero columns of B_i and multiplying from the left with B_i^T.

4 hours 10 minutes wall time to solve a 4h-horizon problem with 8k scenarios on 8k nodes.

Need to run under strict time requirements
– For example, solve 24h-horizon problem in less than 1h
Revisiting scenario computations for shared-memory
"Compute SC" (Step 1): B_i^T K_i^{-1} B_i
– Multiple sparse right-hand sides
– Triangular-solve phase hard to parallelize in shared memory (multi-core)
– Factorization phase speeds up very well and achieves considerable peak performance

Our approach: incomplete factorization of the augmented matrix

M_i = \begin{bmatrix} K_i & B_i \\ B_i^T & 0 \end{bmatrix}

Stop the factorization after the elimination of the (1,1) block; -B_i^T K_i^{-1} B_i will sit in the (2,2) block (Schur complement).
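The trick can be checked with a dense stand-in for the incomplete factorization: block elimination of the (1,1) block of M_i leaves exactly -B_i^T K_i^{-1} B_i in the (2,2) position. Toy data invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
ni, n0 = 5, 2
K = rng.standard_normal((ni, ni))
K = K + K.T + 10.0 * np.eye(ni)      # symmetric, safely nonsingular toy K_i
B = rng.standard_normal((ni, n0))    # toy border block B_i

# Augmented matrix M_i = [[K_i, B_i], [B_i^T, 0]].
M = np.block([[K, B], [B.T, np.zeros((n0, n0))]])

# Dense stand-in for PARDISO-SC's incomplete factorization: eliminate
# only the first ni pivots (the (1,1) block).  What remains in the
# (2,2) position is the Schur complement  0 - B^T K^{-1} B.
schur_22 = M[ni:, ni:] - M[ni:, :ni] @ np.linalg.solve(M[:ni, :ni], M[:ni, ni:])

# Identical to the "naive" route: solve with the factors of K against
# the columns of B, then multiply by B^T from the left.
naive = -B.T @ np.linalg.solve(K, B)
print(np.allclose(schur_22, naive))  # True
```

The payoff is that the work moves from the hard-to-parallelize triangular solves into the factorization phase, which scales well on multi-core nodes.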
Implementation
Requires modification of the linear solver PARDISO (Schenk) -> PARDISO-SC
Pivot perturbations during factorization are needed to maintain numerical stability.
Errors due to perturbations are normally absorbed by iterative refinement.
This would be extremely expensive in our case (many right-hand sides).
We let errors propagate into the "global" Schur complement C (Step 2):

C = K_0 - \sum_{i=1}^{N} B_i^T \tilde K_i^{-1} B_i

Factorize the perturbed C (denoted \tilde C) (Step 3).
After Steps 1, 2 and 3, we have the factorization of an approximation matrix

\tilde K = \begin{bmatrix}
\tilde K_1 &        &            & B_1 \\
           & \ddots &            & \vdots \\
           &        & \tilde K_N & B_N \\
B_1^T      & \cdots & B_N^T      & K_0
\end{bmatrix}, \qquad \tilde K_i = K_i + \text{pivot perturbations}
Pivot error absorption by preconditioned BiCGStab
We still have to solve with the unperturbed matrix K.
"Absorb errors" by solving Kz = r using preconditioned BiCGStab
– Numerical experiments showed it is more robust than iterative refinement.
The preconditioner is \tilde K. Each BiCGStab iteration requires
– 2 mat-vecs: Kz
– 2 applications of the preconditioner: \tilde K^{-1} z
One application of the preconditioner amounts to performing "solve" steps 4 and 5 for \tilde K.
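A minimal sketch of this error absorption with SciPy's BiCGStab, using a toy dense K and a diagonally perturbed K~ as the preconditioner (stand-ins, not the actual PIPS matrices):

```python
import numpy as np
import scipy.linalg as sla
from scipy.sparse.linalg import LinearOperator, bicgstab

rng = np.random.default_rng(2)
n = 50
K = rng.standard_normal((n, n))
K = K + K.T + n * np.eye(n)          # toy stand-in for the exact matrix K

# K~: the matrix we actually factorized, with small diagonal shifts
# playing the role of PARDISO's pivot perturbations.
Ktilde = K + np.diag(1e-4 * rng.uniform(size=n))
lu, piv = sla.lu_factor(Ktilde)

# One preconditioner application = one solve with the factors of K~
# (in PIPS this is the Schur-complement "solve" steps 4 and 5).
M = LinearOperator((n, n), matvec=lambda v: sla.lu_solve((lu, piv), v))

r = rng.standard_normal(n)
z, info = bicgstab(K, r, M=M, atol=1e-12)
print(info)  # 0 -> converged
```

Because K~ differs from K only by the tiny perturbations, the preconditioned operator is close to the identity and BiCGStab converges in very few iterations.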
Summary of the new approach
1. Calculate B_i^T \tilde K_i^{-1} B_i, i = 1, …, N  ("Compute S.C.")
2. Form C = K_0 - \sum_{i=1}^{N} B_i^T \tilde K_i^{-1} B_i  ("Form S.C.")
3. Factorize C = L_C D_C L_C^T  ("Factor S.C.")
do BiCGStab iteration on Kz = r
   apply preconditioner \tilde K^{-1} (steps 4 and 5)
   mat-vec Kz
   apply preconditioner (steps 4 and 5)
   mat-vec Kz
while BiCGStab has not converged
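The whole loop can be sketched end-to-end on a toy arrow-shaped system: steps 1-3 build the Schur complement from the perturbed blocks K~_i, steps 4-5 implement the preconditioner application, and BiCGStab runs against the exact K. All sizes and data are invented for illustration:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, bicgstab

rng = np.random.default_rng(3)
N, ni, n0 = 4, 6, 3                               # toy sizes
Ks = [rng.standard_normal((ni, ni)) for _ in range(N)]
Ks = [Ki + Ki.T + 10.0 * np.eye(ni) for Ki in Ks]  # exact diagonal blocks
Bs = [rng.standard_normal((ni, n0)) for _ in range(N)]
K0 = 10.0 * np.eye(n0)

# K~_i: pivot-perturbed blocks (stand-in for PARDISO-SC's perturbations).
Kts = [Ki + np.diag(1e-3 * rng.uniform(size=ni)) for Ki in Ks]

# Steps 1-3 on the perturbed blocks: form the (perturbed) Schur complement.
C = K0 - sum(Bi.T @ np.linalg.solve(Kti, Bi) for Kti, Bi in zip(Kts, Bs))

def apply_Ktilde_inv(r):
    """Steps 4 and 5: solve K~ z = r through the Schur complement C."""
    rs = [r[i * ni:(i + 1) * ni] for i in range(N)]
    r0 = r[N * ni:]
    rhs0 = r0 - sum(Bi.T @ np.linalg.solve(Kti, ri)
                    for Kti, Bi, ri in zip(Kts, Bs, rs))
    z0 = np.linalg.solve(C, rhs0)
    zs = [np.linalg.solve(Kti, ri - Bi @ z0)
          for Kti, Bi, ri in zip(Kts, Bs, rs)]
    return np.concatenate(zs + [z0])

# Exact arrow-shaped K (unperturbed blocks), used only for the mat-vecs Kz.
dim = N * ni + n0
K = np.zeros((dim, dim))
for i, (Ki, Bi) in enumerate(zip(Ks, Bs)):
    K[i * ni:(i + 1) * ni, i * ni:(i + 1) * ni] = Ki
    K[i * ni:(i + 1) * ni, N * ni:] = Bi
    K[N * ni:, i * ni:(i + 1) * ni] = Bi.T
K[N * ni:, N * ni:] = K0

b = rng.standard_normal(dim)
x, info = bicgstab(K, b, M=LinearOperator((dim, dim), matvec=apply_Ktilde_inv),
                   atol=1e-12)
```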
Test architecture
"Intrepid" Blue Gene/P supercomputer
– 40,960 nodes
– Custom interconnect
– Each node has a quad-core 850 MHz PowerPC processor, 2 GB RAM
DOE INCITE Award 2012-2013 – 24 million core hours
Numerical experiments
4h (UC4), 12h (UC12), 24h (UC24) horizon problems

1 scenario per node (4 cores per scenario)

Large-scale: 12h horizon, up to 32k scenarios and 128k cores (k = 1,024)
– 16k scenarios – 2.08 billion variables, 1.81 billion constraints, KKT system size = 3.89 billion

LAPACK + SMP ESSL BLAS for first-stage linear systems
PARDISO-SC for second-stage linear systems
Compute SC Times
Time per IPM iteration UC12, 32k scenarios, 32k nodes (128k cores)
BiCGStab iteration count ranges from 0 to 1.5
Cost of absorbing factorization perturbation errors is between 10 and 30% of total iteration cost
Solve to completion – UC12
Nodes/scens   Wall time (sec)   IPM iterations   Time per IPM iteration (sec)
4096          3548.5            103              33.57
8192          3883.7            112              34.67
16384         4208.8            123              34.80
32768         4781.7            133              35.95
Before: 4 hours 10 minutes wall time to solve the UC4 problem with 8k scenarios on 8k nodes
Now: the 3x-longer-horizon UC12 problem with 8k scenarios on 8k nodes solves in about 65 minutes (table above)
Weak scaling
Strong scaling
Conclusions and Future Considerations
Multicore-friendly reformulation of sparse linear algebra computations leads to one order of magnitude faster execution times:
– Fast factorization-based computation of the SC
– Robust and cheap absorption of pivot errors via Krylov iterative methods
Parallel efficiency of PIPS remains good.
Performance evaluation on today's supercomputers
– IBM BG/Q
– Cray XK7, XC30