Scalable Stochastic Programming
Cosmin G. Petra
Mathematics and Computer Science Division, Argonne National Laboratory
petra@mcs.anl.gov
Joint work with Mihai Anitescu and Miles Lubin
Motivation
Sources of uncertainty in complex energy systems
– Weather
– Consumer demand
– Market prices
Applications @Argonne – Anitescu, Constantinescu, Zavala
– Stochastic Unit Commitment with Wind Power Generation
– Energy management of Co-generation
– Economic Optimization of a Building Energy System
2
Stochastic Unit Commitment with Wind Power
Wind Forecast – WRF (Weather Research and Forecasting) Model
– Real-time grid-nested 24h simulation
– 30 samples require 1h on 500 CPUs (Jazz@Argonne)
3
$$\min \;\; \frac{1}{N_S}\sum_{s\in S}\sum_{j\in N}\sum_{k\in T}\left(c^p_j\, p^s_{jk} + c^u_j\, u_{jk} + c^d_j\, d_{jk}\right)$$
$$\text{s.t.}\quad \sum_{j\in N} p^s_{jk} + \sum_{j\in N^{wind}} p^{wind,s}_{jk} \,=\, D_k, \qquad \forall\, s\in S,\ k\in T,$$
$$\qquad\;\; \sum_{j\in N} \bar p^{\,s}_{jk} + \sum_{j\in N^{wind}} p^{wind,s}_{jk} \,\ge\, D_k + R_k, \qquad \forall\, s\in S,\ k\in T,$$
ramping constr., min. up/down constr.
(commitment decisions $u_{jk}$, $d_{jk}$ are shared across scenarios; dispatch levels $p^s_{jk}$ are per wind scenario $s$)
Slide courtesy of V. Zavala & E. Constantinescu
[Figure: wind farms and thermal generators on the grid]
Economic Optimization of a Building Energy System
4
$$\min \;\; \frac{1}{S}\sum_{j=1}^{S}\int_0^{T}\Big(C^{elec}(t)\,W^{elec}_j(t) + C^{gas}(t)\,W^{gas}_j(t)\Big)\,dt$$
s.t. heat-conduction dynamics for the building temperature field $T_j(x,t)$ (with initial condition $T_j(x,0)$ and boundary conditions), comfort bounds on the zone temperature, and bounds $I^{min} \le I_j \le I^{max}$ on the heating/cooling inputs, for each temperature/price scenario $j\in\{1,\dots,S\}$.
Proactive management – temperature forecasting & electricity prices
Minimize daily energy costs
[Figure: relative cost [%] vs. forecast horizon (1 hr, 3 hr, 6 hr, 9 hr, 12 hr, 16 hr, 24 hr)]
Slide courtesy of V. Zavala
Optimization under Uncertainty Two-stage stochastic programming with recourse (“here-and-now”)
5
Two-stage problem (“here-and-now” decision $x_0$, recourse decision $x$):
$$\min_{x_0}\;\; f(x_0) + \mathbb{E}\big[\,\mathcal{Q}(x_0,\xi)\,\big] \qquad \text{subj. to}\quad A_0 x_0 = b_0,\quad x_0 \ge 0,$$
where, for a realization of the random data $\xi(\omega) := \big(A(\omega), B(\omega), b(\omega), Q(\omega), c(\omega)\big)$, the recourse problem is
$$\mathcal{Q}(x_0,\xi) = \min_{x}\;\; f(x;\xi) \qquad \text{subj. to}\quad B(\omega)\,x_0 + A(\omega)\,x = b(\omega),\quad x \ge 0.$$
Sampling (continuous distribution $\to$ discrete distribution, $M$ samples, inference analysis) yields the sample average approximation (SAA):
$$\min_{x_0,x_1,\dots,x_S}\;\; f(x_0) + \frac{1}{S}\sum_{i=1}^{S} f_i(x_i)$$
$$\text{subj. to}\quad A_0 x_0 = b_0,\qquad B_i x_0 + A_i x_i = b_i,\quad i=1,\dots,S,\qquad x_0 \ge 0,\ x_i \ge 0.$$
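To make the SAA construction concrete, here is a minimal Python/SciPy sketch (toy, randomly generated data; not the PIPS implementation) that samples S scenario realizations and assembles the constraint matrix of the deterministic equivalent. The dual block-angular shape of this matrix is exactly the structure the linear algebra below exploits.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n0, n1, m0, m1, S = 4, 6, 2, 3, 8        # toy sizes: stage-0/1 variables, constraints, scenarios

A0 = sp.random(m0, n0, density=0.8, random_state=1, format="csr")
b0 = rng.standard_normal(m0)

# Sample S realizations of the random data xi_i = (A_i, B_i, b_i).
scenarios = []
for i in range(S):
    A_i = sp.random(m1, n1, density=0.6, random_state=10 + i, format="csr")
    B_i = sp.random(m1, n0, density=0.6, random_state=100 + i, format="csr")
    b_i = rng.standard_normal(m1)
    scenarios.append((A_i, B_i, b_i))

# Deterministic equivalent (SAA): every scenario couples to x0 through B_i,
# giving the dual block-angular structure [A0; B_i ... A_i].
blocks = [[A0] + [None] * S]
for i, (A_i, B_i, _) in enumerate(scenarios):
    row = [B_i] + [None] * S
    row[1 + i] = A_i
    blocks.append(row)

J = sp.bmat(blocks, format="csr")
b = np.concatenate([b0] + [b_i for _, _, b_i in scenarios])
print(J.shape, b.shape)                  # constraints J [x0; x1; ...; xS] = b, x >= 0
```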
Linear Algebra of Primal-Dual Interior-Point Methods
6
Convex quadratic problem
$$\min_x\;\; \tfrac{1}{2}\,x^T Q x + c^T x \qquad \text{subj. to}\quad Ax = b,\quad x \ge 0.$$
IPM linear system (solved at each interior-point iteration)
$$\begin{bmatrix} Q + D & A^T\\ A & 0\end{bmatrix}\begin{bmatrix} \Delta x\\ \Delta y\end{bmatrix} = \mathrm{rhs},$$
where $D$ is the diagonal scaling matrix coming from the log-barrier terms.
Two-stage SP $\to$ arrow-shaped linear system (via a permutation):
$$H = \begin{bmatrix}
H_1 & & & & B_1\\
& H_2 & & & B_2\\
& & \ddots & & \vdots\\
& & & H_S & B_S\\
B_1^T & B_2^T & \cdots & B_S^T & H_0
\end{bmatrix}$$
($H_i$ = scenario KKT blocks, $B_i$ = couplings to the first stage, $H_0$ = first-stage KKT block).
Multi-stage SP $\to$ the same structure, nested.
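A small sketch of the arrow shape itself, with made-up symmetric toy blocks standing in for the scenario KKT blocks $H_i$, the couplings $B_i$, and the first-stage block $H_0$ (sizes are illustrative only):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
n0, n1, S = 3, 4, 5                       # toy first-/second-stage sizes, scenario count

def sym(n):                               # random symmetric toy block
    M = rng.standard_normal((n, n))
    return (M + M.T) / 2

H0 = sym(n0)                              # first-stage block
Hs = [sym(n1) for _ in range(S)]          # per-scenario blocks H_i
Bs = [rng.standard_normal((n1, n0)) for _ in range(S)]   # couplings B_i

rows = []
for i in range(S):
    row = [None] * (S + 1)
    row[i] = sp.csr_matrix(Hs[i])         # H_i on the diagonal
    row[S] = sp.csr_matrix(Bs[i])         # coupling B_i in the border column
    rows.append(row)
rows.append([sp.csr_matrix(B.T) for B in Bs] + [sp.csr_matrix(H0)])  # border row + corner H0

H = sp.bmat(rows, format="csr")
print(H.shape)                            # (S*n1 + n0) x (S*n1 + n0), symmetric arrow shape
```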
7
The Direct Schur Complement Method (DSC) Uses the arrow shape of H
1. Implicit factorization (block $LDL^T$ following the arrow shape)
$$H_i = L_i D_i L_i^T,\qquad G_i = D_i^{-1} L_i^{-1} B_i,\qquad i=1,\dots,S,$$
$$C = H_0 - \sum_{i=1}^{S} G_i^T D_i G_i \;=\; H_0 - \sum_{i=1}^{S} B_i^T H_i^{-1} B_i,\qquad C = L_c D_c L_c^T.$$
2. Solving $Hz = r$
2.1. Back substitution
$$w_i = L_i^{-1} r_i,\quad i=1,\dots,S,\qquad w_0 = L_c^{-1}\Big(r_0 - \sum_{i=1}^{S} G_i^T w_i\Big).$$
2.2. Diagonal solve
$$v_0 = D_c^{-1} w_0,\qquad v_i = D_i^{-1} w_i,\quad i=1,\dots,S.$$
2.3. Forward substitution
$$z_0 = L_c^{-T} v_0,\qquad z_i = L_i^{-T}\big(v_i - G_i z_0\big),\quad i=1,\dots,S.$$
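A compact, single-process sketch of these factor/solve steps on a toy arrow system. To keep it short it uses Cholesky on symmetric positive definite stand-ins for the $H_i$ (the real KKT blocks are indefinite and PIPS uses sparse $LDL^T$ factorizations), and the couplings are scaled down so the Schur complement stays positive definite:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)
n0, n1, S = 3, 4, 5

def spd(n):                                   # SPD stand-in for a scenario block H_i
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

H0 = spd(n0)
Hs = [spd(n1) for _ in range(S)]
Bs = [0.1 * rng.standard_normal((n1, n0)) for _ in range(S)]   # small couplings keep C SPD
r0 = rng.standard_normal(n0)
rs = [rng.standard_normal(n1) for _ in range(S)]

# 1. Implicit factorization: factor each scenario block, form and factor
#    the Schur complement C = H0 - sum_i B_i^T H_i^{-1} B_i.
facts = [cho_factor(Hi) for Hi in Hs]
C = H0 - sum(B.T @ cho_solve(f, B) for f, B in zip(facts, Bs))
C_fact = cho_factor(C)

# 2. Solve H z = r without ever assembling H.
r0_hat = r0 - sum(B.T @ cho_solve(f, ri) for f, B, ri in zip(facts, Bs, rs))
z0 = cho_solve(C_fact, r0_hat)                                       # first-stage unknowns
zs = [cho_solve(f, ri - B @ z0) for f, B, ri in zip(facts, Bs, rs)]  # scenario unknowns

# Check against a direct solve of the assembled arrow matrix.
top = np.block([[Hs[i] if j == i else np.zeros((n1, n1)) for j in range(S)] + [Bs[i]]
                for i in range(S)])
bottom = np.block([[B.T for B in Bs] + [H0]])
H = np.vstack([top, bottom])
z = np.linalg.solve(H, np.concatenate(rs + [r0]))
assert np.allclose(z, np.concatenate(zs + [z0]))
```

Only C is dense and coupled across scenarios; every other factorization and triangular solve involves a single scenario, which is what makes the scenario-parallel distribution on the next slides possible.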
Parallelizing DSC – 1. Factorization phase
8
Each process $q = 1,\dots,p$ works on its own subset $S_q$ of scenarios (sparse linear algebra):
$$L_i D_i L_i^T = H_i,\qquad G_i = D_i^{-1} L_i^{-1} B_i,\qquad C_q = \sum_{i\in S_q} G_i^T D_i G_i,\qquad i\in S_q.$$
Process 1 then assembles and factorizes the first-stage Schur complement (dense linear algebra):
$$C = H_0 - \sum_{q=1}^{p} C_q,\qquad L_c D_c L_c^T = C.$$
Factorization of the 1st stage Schur complement matrix = BOTTLENECK
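A hedged mpi4py sketch of this phase (toy random data generated per rank, hypothetical sizes; not the PIPS code): each rank factors its own scenarios, accumulates its local contribution $C_q$, and a single reduction brings the sum to the rank that factors the dense Schur complement.

```python
import numpy as np
from mpi4py import MPI
from scipy.linalg import cho_factor, cho_solve

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
rng = np.random.default_rng(rank)
n0, n1, local_scen = 3, 4, 2                 # toy sizes; 2 scenarios owned per rank

H0 = 10.0 * np.eye(n0)                       # first-stage block (same on every rank)
local_C = np.zeros((n0, n0))
local_factors = []
for _ in range(local_scen):
    M = rng.standard_normal((n1, n1))
    Hi = M @ M.T + n1 * np.eye(n1)           # SPD stand-in for a sparse scenario block
    Bi = 0.1 * rng.standard_normal((n1, n0))
    fi = cho_factor(Hi)                      # per-scenario (sparse, in the real code) factorization
    local_factors.append((fi, Bi))
    local_C += Bi.T @ cho_solve(fi, Bi)      # this rank's share of sum_i B_i^T H_i^{-1} B_i

# One reduction brings all contributions to rank 0, which owns the dense bottleneck.
C_sum = np.zeros_like(local_C) if rank == 0 else None
comm.Reduce(local_C, C_sum, op=MPI.SUM, root=0)
if rank == 0:
    C = H0 - C_sum
    C_fact = cho_factor(C)                   # the serial dense factorization (the bottleneck)
```

Run with, e.g., `mpiexec -n 4 python sketch.py` (file name hypothetical).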
9
Parallelizing DSC – 2. Backsolve
Each process $q = 1,\dots,p$ (sparse linear algebra):
$$w_i = L_i^{-1} r_i,\qquad r_q = \sum_{i\in S_q} G_i^T w_i,\qquad i\in S_q.$$
Process 1 (dense linear algebra):
$$w_0 = L_c^{-1}\Big(r_0 - \sum_{q=1}^{p} r_q\Big),\qquad v_0 = D_c^{-1} w_0,\qquad z_0 = L_c^{-T} v_0.$$
Each process $q$ (sparse linear algebra):
$$v_i = D_i^{-1} w_i,\qquad z_i = L_i^{-T}\big(v_i - G_i z_0\big),\qquad i\in S_q.$$
1st stage backsolve = BOTTLENECK
Scalability of DSC
10
– Unit commitment: 76.7% efficiency (on Fusion @ Argonne)
– but not always the case – large number of 1st stage variables: 38.6% efficiency
11
BOTTLENECK SOLUTION 1: STOCHASTIC PRECONDITIONER
Preconditioned Schur Complement (PSC)
12
Scenario factorizations and backsolves are as in DSC, but the first-stage system is solved iteratively with a preconditioned Krylov method:
$$L_i D_i L_i^T = H_i,\qquad G_i = D_i^{-1} L_i^{-1} B_i,\qquad i=1,\dots,N,$$
$$w_i = L_i^{-1} r_i,\qquad r_0 \leftarrow r_0 - \sum_{i=1}^{N} G_i^T w_i,$$
$$z_0 = \mathrm{Krylov}\big(C,\, r_0,\, M\big),\qquad C = H_0 - \sum_{i=1}^{N} G_i^T D_i G_i,$$
$$v_i = D_i^{-1} w_i,\qquad z_i = L_i^{-T}\big(v_i - G_i z_0\big),\qquad i=1,\dots,N.$$
The preconditioner $M$ is built and factorized on a separate process: $L_M D_M L_M^T = M$.
REMOVES the factorization bottleneck
Slightly larger backsolve bottleneck
The Stochastic Preconditioner
The exact structure of $C$ is
$$C = \begin{bmatrix} Q_0 + \frac{1}{S}\sum_{i=1}^{S} B_i^T\big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i & A_0^T\\ A_0 & 0\end{bmatrix}.$$
IID subset of $n$ scenarios: $K = \{k_1, k_2, \dots, k_n\}$.
The stochastic preconditioner (Petra & Anitescu, 2010)
$$S_n = Q_0 + \frac{1}{n}\sum_{i=1}^{n} B_{k_i}^T\big(A_{k_i} Q_{k_i}^{-1} A_{k_i}^T\big)^{-1} B_{k_i}.$$
For $C$ use the constraint preconditioner (Keller et al., 2000)
$$M = \begin{bmatrix} S_n & A_0^T\\ A_0 & 0\end{bmatrix}.$$
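A toy numpy sketch of why subsampling works, using the (1,1) blocks written above with synthetic scenario data (all sizes and distributions are made up): the eigenvalues of $S_n^{-1} S_S$ already cluster near 1 for $n$ much smaller than $S$.

```python
import numpy as np

rng = np.random.default_rng(3)
n0, n1, m1, S, n = 5, 8, 4, 200, 20     # toy sizes; n << S subsampled scenarios

def scenario_term():
    """One term B^T (A Q^{-1} A^T)^{-1} B from randomly drawn scenario data."""
    M = rng.standard_normal((n1, n1))
    Q = M @ M.T + n1 * np.eye(n1)
    A = rng.standard_normal((m1, n1))
    B = rng.standard_normal((m1, n0))
    return B.T @ np.linalg.solve(A @ np.linalg.solve(Q, A.T), B)

Q0 = 5.0 * np.eye(n0)
terms = [scenario_term() for _ in range(S)]

S_S = Q0 + sum(terms) / S                            # "exact" dense block of C
idx = rng.choice(S, size=n, replace=False)           # subset of n scenarios
S_n = Q0 + sum(terms[i] for i in idx) / n            # stochastic preconditioner block

# Eigenvalues of S_n^{-1} S_S should cluster around 1 when n is moderately large.
eig = np.linalg.eigvals(np.linalg.solve(S_n, S_S))
print(np.sort(eig.real))
```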
The “Ugly” Unit Commitment Problem
14
– DSC on P processes vs. PSC on P+1 processes
– Optimal use of PSC – linear scaling
– Factorization of the preconditioner cannot be hidden anymore.
• 120 scenarios
Quality of the Stochastic Preconditioner
“Exponentially” better preconditioning (Petra & Anitescu 2010)
Proof: Hoeffding inequality
Assumptions on the problem's random data:
1. Boundedness
2. Uniform full rank of $A(\omega)$ and $B(\omega)$
(not restrictive)
With
$$S_n = Q_0 + \frac{1}{n}\sum_{i=1}^{n} B_{k_i}^T\big(A_{k_i} Q_{k_i}^{-1} A_{k_i}^T\big)^{-1} B_{k_i},\qquad S_S = Q_0 + \frac{1}{S}\sum_{i=1}^{S} B_i^T\big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i,$$
the bound is of Hoeffding type,
$$\Pr\Big(\big|\lambda\big(S_n^{-1} S_S\big) - 1\big| \le \varepsilon\Big) \;\ge\; 1 - 2\,p\,\exp\big(-\kappa\, n\,\varepsilon^2\big),$$
i.e., the eigenvalues of $S_n^{-1} S_S$ cluster around 1 with a probability that improves exponentially in the number of sampled scenarios $n$ (here $p$ is the dimension of $S_S$ and $\kappa$ collects the boundedness constants).
Quality of the Constraint Preconditioner
$M^{-1}C$ has the eigenvalue 1 with order of multiplicity $2r$.
The rest of the $p - r$ eigenvalues satisfy
$$0 < \lambda_{\min}\big(S_n^{-1}S_S\big) \;\le\; \lambda\big(M^{-1}C\big) \;\le\; \lambda_{\max}\big(S_n^{-1}S_S\big),$$
where
$$C = \begin{bmatrix} S_S & A_0^T\\ A_0 & 0\end{bmatrix},\qquad M = \begin{bmatrix} S_n & A_0^T\\ A_0 & 0\end{bmatrix},\qquad A_0 \in \mathbb{R}^{r\times p}.$$
Proof: based on Bergamaschi et al., 2004.
The Krylov Methods Used for $Cz_0 = r_0$
– BiCGStab using the constraint preconditioner $M$
– Preconditioned Projected CG (PPCG) (Gould et al., 2001)
  – Preconditioned projection onto $\operatorname{Ker} A_0$
  – Does not compute a basis $Z_0$ for $\operatorname{Ker} A_0$. Instead, the projection
$$P = Z_0\big(Z_0^T S_n Z_0\big)^{-1} Z_0^T$$
  is applied by solving
$$\begin{bmatrix} S_n & A_0^T\\ A_0 & 0\end{bmatrix}\begin{bmatrix} g\\ u\end{bmatrix} = \begin{bmatrix} r\\ 0\end{bmatrix}.$$
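A small SciPy sketch of the BiCGStab option (not PIPS code): the constraint preconditioner $M$ is factorized once and applied at every Krylov iteration through a LinearOperator. Here `S_n` is simply a perturbed copy of `S_S` standing in for the subsampled matrix, and all data is synthetic.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, bicgstab, splu

rng = np.random.default_rng(4)
n0, m0 = 40, 10                                  # toy first-stage sizes

M1 = rng.standard_normal((n0, n0))
S_S = M1 @ M1.T + n0 * np.eye(n0)                # "exact" dense (1,1) block
S_n = S_S + 0.3 * rng.standard_normal((n0, n0))  # cheaper approximation of it
S_n = (S_n + S_n.T) / 2
A0 = rng.standard_normal((m0, n0))               # full-rank first-stage Jacobian

C = np.block([[S_S, A0.T], [A0, np.zeros((m0, m0))]])     # system matrix
M = np.block([[S_n, A0.T], [A0, np.zeros((m0, m0))]])     # constraint preconditioner
rhs = rng.standard_normal(n0 + m0)

Minv = splu(sp.csc_matrix(M))                    # factor M once, reuse every iteration
prec = LinearOperator(C.shape, matvec=Minv.solve)
z, info = bicgstab(C, rhs, M=prec)
print(info, np.linalg.norm(C @ z - rhs))         # info == 0 means convergence
```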
Performance of the preconditioner
Eigenvalues clustering & Krylov iterations
Affected by the well-known ill-conditioning of IPMs.
18
[Slide detail: the matrices $S_n$ and $S_S$ evaluated with the IPM scaling matrices $D$, and plots of the resulting eigenvalue clustering and Krylov iteration counts across interior-point iterations.]
19
SOLUTION 2: PARALLELIZATION OF STAGE 1 LINEAR ALGEBRA
20
Parallelizing the 1st stage linear algebra
We distribute the 1st stage Schur complement system.
C is treated as dense.
Alternative to PSC for problems with large number of 1st stage variables.
Removes the memory bottleneck of PSC and DSC.
We investigated ScaLapack and Elemental (successor of PLAPACK)
– Neither has a solver for symmetric indefinite matrices (Bunch-Kaufman);
– LU or Cholesky only.
– So we had to think of modifying either.
$$C = \begin{bmatrix} Q & A_0^T\\ A_0 & 0\end{bmatrix},\qquad Q \ \text{dense, symm. pos. def.},\quad A_0\ \text{sparse, full rank}.$$
21
ScaLapack (ORNL)
– Classical block distribution of the matrix
– Blocked “down-looking” Cholesky – algorithmic blocks
– Size of algorithmic block = size of distribution block!
– For cache performance – large algorithmic blocks
– For good load balancing – small distribution blocks
– Must trade off cache performance for load balancing
– Communication: basic MPI calls
– Inflexible in working with sub-blocks
22
Elemental (UT Austin)
– Unconventional “elemental” distribution: blocks of size 1
– Size of algorithmic block ≠ size of distribution block
– Both cache performance (large alg. blocks) and load balancing (distrib. blocks of size 1)
– Communication
  – More sophisticated MPI calls
  – Overhead O(log(sqrt(p))), p is the number of processors
– Sub-blocks friendly
– Better performance in a hybrid approach, MPI+SMP, than ScaLapack
23
Cholesky-based $LDL^T$-like factorization
Can be viewed as an “implicit” normal equations approach.
In-place implementation inside Elemental: no extra memory needed.
Idea: modify the Cholesky factorization, by changing the sign after processing p columns.
It is much easier to do in Elemental, since this distributes elements, not blocks.
Twice as fast as LU
Works for more general saddle-point linear systems, i.e., pos. semi-def. (2,2) block.
$$\begin{bmatrix} Q & A^T\\ A & 0\end{bmatrix} = \begin{bmatrix} L & 0\\ A L^{-T} & \widetilde{L}\end{bmatrix}\begin{bmatrix} I & 0\\ 0 & -I\end{bmatrix}\begin{bmatrix} L^T & L^{-1}A^T\\ 0 & \widetilde{L}^T\end{bmatrix},\quad \text{where}\ \ Q = LL^T,\ \ \widetilde{L}\,\widetilde{L}^T = A Q^{-1} A^T.$$
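A numpy sketch of the sign-switch idea on a small dense saddle-point matrix (toy data; the production version works in place inside Elemental on the distributed matrix): ordinary Cholesky on the first p columns, then a Cholesky of $AQ^{-1}A^T$ with the sign flipped in D.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(5)
p, m = 6, 3                                      # sizes of Q (p x p) and A (m x p)

M1 = rng.standard_normal((p, p))
Q = M1 @ M1.T + p * np.eye(p)                    # dense SPD (1,1) block
A = rng.standard_normal((m, p))
K = np.block([[Q, A.T], [A, np.zeros((m, m))]])  # saddle-point matrix

# "Cholesky with a sign change after the first p columns":
L1 = cholesky(Q, lower=True)                     # Q = L1 L1^T
L2 = solve_triangular(L1, A.T, lower=True).T     # L2 = A L1^{-T}
L3 = cholesky(L2 @ L2.T, lower=True)             # L3 L3^T = A Q^{-1} A^T
L = np.block([[L1, np.zeros((p, m))], [L2, L3]])
D = np.diag(np.r_[np.ones(p), -np.ones(m)])      # +1 for the first p columns, -1 after

assert np.allclose(L @ D @ L.T, K)               # K = L D L^T with D = diag(I, -I)
```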
24
Distributing the 1st stage Schur complement matrix
All processors contribute to all of the elements of the (1,1) dense block
A large amount of inter-process communication occurs.
Possibly more costly than the factorization itself.
Solution: use a buffer to reduce the number of messages when doing a Reduce_scatter.
The $LDL^T$ approach also reduces the communication by half – only the lower triangle needs to be sent.
$$C = \begin{bmatrix} Q_0 + \frac{1}{S}\sum_{i=1}^{S} B_i^T\big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i & A_0^T\\ A_0 & 0\end{bmatrix},\qquad \text{(1,1) block dense.}$$
25
Reduce operations
– Streamlined copying procedure – Lubin and Petra (2010)
  – Loop over contiguous memory and copy elements into the send buffer
  – Avoids the division and modulus ops needed to compute the positions
– “Symmetric” reduce for $LDL^T$
  – Only the lower triangle is reduced
  – Fixed buffer size; a variable number of columns reduced
– Effectively halves the communication (both data & # of MPI calls)
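A simplified mpi4py sketch of the symmetric reduce (assumptions: one contiguous buffer for the whole lower triangle rather than the fixed-size, column-batched buffering described above; toy random data per rank):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n0 = 500                                           # first-stage dimension (toy)
il = np.tril_indices(n0)                           # index set of the lower triangle

rng = np.random.default_rng(comm.Get_rank())
M = rng.standard_normal((n0, n0))
local_C = M @ M.T                                  # this rank's symmetric contribution

sendbuf = np.ascontiguousarray(local_C[il])        # pack lower triangle into one contiguous buffer
recvbuf = np.empty_like(sendbuf) if comm.Get_rank() == 0 else None
comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)  # single reduction, ~half the data of a full matrix

if comm.Get_rank() == 0:
    C = np.zeros((n0, n0))
    C[il] = recvbuf                                # unpack; only the lower triangle is needed
```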
26
Large-scale performance
First-stage linear algebra: ScaLapack (LU), Elemental (LU), and $LDL^T$
Strong scaling of PIPS with $LDL^T$ and LU:
– 90.1% from 64 to 1024 cores
– 75.4% from 64 to 2048 cores
– > 4,000 scenarios
SAA problem:
– 1st stage variables: 82,000
– Total #: 189 million
– Thermal units: 1,000
– Wind farms: 1,200
27
Towards real-life models – UC with transmission constraints
Current status: ISOs (independent system operators) use UC with
– deterministic wind profiles, market prices and demand
– network (transmission) constraints
– outer 1-h timestep, 24-h horizon simulation
– inner 5-min timestep, 1-h horizon corrections
Stochastic UC with transmission constraints (V. Zavala 2010)
– Stochastic wind profiles & transmission constraints
– Deterministic market prices and demand
– 24-h horizon with 1-h timestep
– Kirchhoff's laws are part of the constraints
– The problem is huge: KKT systems are 1.8 billion x 1.8 billion
[Figure: transmission network with generator and load nodes]
28
Solving UC with transmission constraints
– 32k wind scenarios (k = 1024)
– 32k nodes (131,072 cores) on Intrepid BG/P
– Hybrid programming model: SMP inside MPI
  – Sparse 2nd-stage linear algebra: WSMP (IBM)
  – Dense 1st-stage linear algebra: Elemental in SMP mode
– Time to solution: 20h = 4h for loading + 16h for solving
– Flop count is approx. 1% of the peak
– Very good strong scaling for a 4h-horizon problem
29
Real-time solution – Exploiting the structure
– Time-coupling in the second stage -> block tridiagonal linear systems
– Sparse backsolves must be used
– Speedup in the backsolve phase > 20x is obtained
– Preliminary tests for 24h horizon: time to solution < 1h (after loading)
– Flop count does not necessarily increase
– Strong scaling does not change by much in the case of the 24h horizon problem
The structure of the KKT system for a general 2-stage SP problem
The second-stage KKT systems have a block tridiagonal structure given by the time-coupling
Concluding remarks
PIPS – parallel interior-point solver for stochastic SAA problems
Specialized linear algebra layer
– Small-sized 1st-stage subproblems → DSC
– Medium-sized 1st stage → PSC
– Large-sized 1st stage → Distributed SC
Scenario parallelization in a hybrid programming model MPI+SMP
30
Thank you for your attention!
Questions?
31