Carnegie Mellon
Optimized Parallel Distribution Load Flow Solver on Commodity Multi-core CPU
Tao Cui (Presenter), Franz Franchetti
Dept. of ECE, Carnegie Mellon University, Pittsburgh, PA
[email protected]
This work is supported by NSF 0931978 & 1116802



Smart Grids

Image by Dr. M. Sanchez


Smart Grids

New players in the grid

Challenges: undispatchable, with large variance and great impact on the grid; large populations exhibit stochastic properties.

Images from Wikipedia

Sources: LBNL-3884E, ORNL/TM-2004/291, Pantos 2011


Conventional distribution system: passively receives power; few real-time monitoring or control capabilities.

Challenges in the distribution system: solar and wind are stochastic, with large variance and large impact.

Smart distribution system: new sensors (smart meters) and a high-performance deskside supercomputer.

A Computational Tool for Probabilistic Grid Monitoring

Motivation

(Figure: conventional grid schematic, from hydro, coal, and nuclear plants through extra-high- and high-voltage transmission down to factories; image from Wikipedia.)

(Figure: a commodity deskside server delivers ~Tflop/s for ~$1000 at ~1 kW of power; image from Dell.)


Outline

Motivation

Distribution System Load Flow Analysis

Code Optimization

Real Time Implementation

Conclusion


Core: Distribution Load Flow

Distribution systems are radial, with high R/X ratios, varying impedances, and unbalanced phases, so transmission load flow methods are NOT suitable. We use Forward/Backward Sweep (FBS): implicit Z-matrix, detailed models, good convergence, based on the generalized component model [Kersting2006].

One-terminal node model, constant PQ load:

[S_abc] = [V_abc] [I_abc]^*

Two-terminal link model:

Forward: [V_abc]_m = [A] [V_abc]_n - [B] [I_abc]_m
Backward: [I_abc]_n = [c] [V_abc]_m + [d] [I_abc]_m

(Figure: the two-terminal link model connects phases A, B, C of node n to node m, relating [Vabc]n, [Iabc]n to [Vabc]m, [Iabc]m; the one-terminal node model relates [Vabc], [Iabc], and [Sabc] at a single node.)

IEEE 37 Node Test Feeder: based on an actual feeder in California. (Source: IEEE PES Distribution System Analysis Subcommittee.)


Core: Distribution Load Flow

Forward / Backward Sweep [Kersting2006]

Branch-current based. Input: substation voltage and loads; output: all node voltages. Steps:

1. Initialize currents to 0 and voltages to V = V0;
2. Compute node currents In using the node model;
3. Backward sweep: compute branch currents Ib using the link model and KCL;
4. Forward sweep: update V(k+1) from V(k) based on Ib over the link model;
5. Check convergence (|dS| < error limit): stop, or go to step 2.

(Figure: IEEE 4 node test feeder: an infinite bus at V1 = 1 feeds node 2 over line Z12 (2000 ft), transformer Xfm23 to node 3, and line Z34 (2500 ft) to load S4 at node 4; branch currents Ib1-Ib3 are computed in the backward sweep, voltages V2-V4 in the forward sweep.)

IEEE 4 Node Test Feeder Example


Core: Distribution Load Flow

3-Phase Voltage on the IEEE 37 Node Test Feeder

(Figure: voltage profiles for phases A, B, and C, plotted for every node of the IEEE 37 node test feeder from the substation node 799 out to the feeder ends. Voltage axis: 0.90 to 1.1 of nominal. ANSI C84.1: nominal 115 V; Range A: 110-126 V; Range B: 107-127 V.)


Our Approach

Random number generator: basic uniform RNG plus transformations for different PDFs; a parallel strategy for the multi-threaded implementation.

Optimized parallel distribution load flow solver: code optimizations and a highly parallel implementation for Monte Carlo applications.

Density estimation & visualization: kernel density estimation.

(Figure: pipeline: random variable sampling (inverse CDF, hypercube sampling) feeds a parallel high-performance power flow solver (a main thread dispatching vectorized solvers across 2~64 parallel threads), which feeds density estimation & visualization.)


Outline

Motivation

Distribution System Load Flow Analysis

Code Optimization

Real Time Implementation

Conclusion


Optimization: Data Structures

Data structure optimization. Baseline: C++ object-oriented code built around a tree object. Translate it to array accesses to exploit spatial and temporal locality.

Other techniques: unroll innermost loops, use scalar replacement, and precompute as much as possible.

(Figure: the C++ tree structure (substation N0 feeding load nodes N1-N5) is flattened into pointer arrays ordered for the forward sweep (pN0 ... pN5) and the backward sweep (pN5 ... pN0); each node's parameters, current, and voltage are stored consecutively in the data array.)

GridLab-D: the smart grid simulator (www.gridlabd.org, open source since 2009).
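A minimal sketch of the array-based sweep ordering, under the assumption of a simplified scalar node record (the real solver stores 3-phase complex quantities): a precomputed index array gives the leaf-to-root visit order, so each sweep becomes a linear pass over memory instead of pointer-chasing through a tree.

```c
typedef struct {
    int parent;      /* index of the upstream node, -1 for the substation */
    double z;        /* simplified series impedance parameter */
    double current;  /* branch current accumulator */
    double voltage;  /* node voltage */
} Node;

/* order[] lists node indices leaf-to-root; traverse it forward for the
 * backward sweep (current accumulation by KCL) ... */
static void backward_sweep(Node *n, const int *order, int count) {
    for (int i = 0; i < count; i++) {
        int k = order[i];
        if (n[k].parent >= 0)
            n[n[k].parent].current += n[k].current;
    }
}

/* ... and backward for the forward sweep (voltage update root-to-leaf). */
static void forward_sweep(Node *n, const int *order, int count) {
    for (int i = count - 1; i >= 0; i--) {
        int k = order[i];
        if (n[k].parent >= 0)
            n[k].voltage = n[n[k].parent].voltage - n[k].z * n[k].current;
    }
}
```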

Optimization: Pattern Based Synthesis

Algorithm-level optimization: pattern-based matrix-vector multiplication.

For the A, B, c, d matrices: a multi-grounded cable yields a diagonal matrix; ignoring shunt and coupling gives c = 0, d = I, A = I.

This reduces unnecessary operations and unnecessary storage for better memory access (similar to [Belgin2009]).

Backward: [I_abc]_n = [c] [V_abc]_m + [d] [I_abc]_m
Forward: [V_abc]_m = [A] [V_abc]_n - [B] [I_abc]_m

(Figure: each matrix pattern, case 1 ... case N, dispatches to its own specialized C code, code 1 ... code N.)

switch (mat_type) {
case real_diag_equal_mat:
    output[0] = *constant * input[0];
    ...
    output[5] = *constant * input[5];
    break;
case imag_diag_equal_mat:
    output[0] = -*constant * input[3];
    output[1] = -*constant * input[4];
    output[2] = -*constant * input[5];
    output[3] = *constant * input[0];
    output[4] = *constant * input[1];
    output[5] = *constant * input[2];
    break;
...
}

(Figure: a full diagonal matrix (a,0,0; 0,b,0; 0,0,c) is compressed to (a,b,c) plus a mat_type tag.)


Data Parallelism (SIMD)

SIMD parallelization. SIMD: Single Instruction, Multiple Data.

SSE: Streaming SIMD Extensions, 128-bit registers holding 4 floats, e.g. 4 "fadd" operations at the cost of one "addps".

AVX: Advanced Vector eXtensions (256-bit, 8 floats); Larrabee (512-bit, 16 floats).

The solver is vectorized at the SIMD level for Monte Carlo simulation. Assumption & limitation: all samples in a vector must converge in the same number of steps.

(Figure: 4-way SSE example: the vector operation addps xmm0, xmm1 adds the four lanes of xmm0 = [1 2 4 4] and xmm1 = [5 1 1 3] in one instruction, giving [6 3 5 7].)

(Figure: the scalar solver computes one sample FBS load flow per scalar register (1 float); the vectorized solver computes four sample FBS load flows at once using SIMD instructions on vector registers (4 floats in SSE).)
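The 4-way SSE example from the slide, written with compiler intrinsics (add4 is an illustrative helper; the real solver applies the same idea across the whole FBS kernel):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* One addps adds four float lanes at once, i.e. four Monte Carlo
 * load flow samples advance per arithmetic instruction. */
static void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* addps: 4 adds, 1 instruction */
    _mm_storeu_ps(out, vc);          /* store 4 results */
}
```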


Variant Synthesis with SPIRAL


Symbolic process [Puschel2005]: generating the pattern-based matrix-vector product code.

# Use Print_SWBody() to print code cases for all matrix patterns
Print_SWBody := function(case, MMr, MMi, opts, unparser)
  local MatAll, st, cs;
  MatAll := NullMat(6,6);
  MatAll{ [ 1 .. 3 ] }{ [ 1 .. 3 ] } := MMr;
  MatAll{ [ 4 .. 6 ] }{ [ 4 .. 6 ] } := MMr;
  MatAll{ [ 4 .. 6 ] }{ [ 1 .. 3 ] } := MMi;
  MatAll{ [ 1 .. 3 ] }{ [ 4 .. 6 ] } := -MMi;
  st := Blk(MatAll);            # SPL description of patterned matrix
  cs := CodeSums(st, opts);     # compiled to intermediate code
  unparser.opts := opts;        # options: targeted language, intrinsics
  Print("case ", case, ":");
  Print("\n{\n");
  Unparse(cs, unparser, 0, 1);  # unparse intermediate code to actual code
  Print("}\n");
end;

SPIRAL Script

switch(pattern){
case zero_matrix: {
  *(Y)       = 0.0;
  *((Y + 1)) = 0.0;
  *((Y + 2)) = 0.0;
  *((Y + 3)) = 0.0;
  *((Y + 4)) = 0.0;
  *((Y + 5)) = 0.0;
}
case real_diagonal_equal_matrix: {
  *(Y)       = (*(u1) * *(X));
  *((Y + 1)) = (*(u1) * *((X + 1)));
  *((Y + 2)) = (*(u1) * *((X + 2)));
  *((Y + 3)) = (*(u1) * *((X + 3)));
  *((Y + 4)) = (*(u1) * *((X + 4)));
  *((Y + 5)) = (*(u1) * *((X + 5)));
}}
<...hundreds of lines of code>

Scalar Code

switch(mat_type){
case zero_matrix: {
  *(Y)       = _mm256_set1_ps(0.0f);
  *((Y + 1)) = _mm256_set1_ps(0.0f);
  *((Y + 2)) = _mm256_set1_ps(0.0f);
  *((Y + 3)) = _mm256_set1_ps(0.0f);
  *((Y + 4)) = _mm256_set1_ps(0.0f);
  *((Y + 5)) = _mm256_set1_ps(0.0f);
}
case real_diagonal_equal_matrix: {
  __m256 a21, s43, s44, s45, s46, s47, s48;
  a21 = *(u1);
  s43 = _mm256_mul_ps(a21, *(X));        *(Y)       = s43;
  s44 = _mm256_mul_ps(a21, *((X + 1)));  *((Y + 1)) = s44;
  s45 = _mm256_mul_ps(a21, *((X + 2)));  *((Y + 2)) = s45;
  s46 = _mm256_mul_ps(a21, *((X + 3)));  *((Y + 3)) = s46;
  s47 = _mm256_mul_ps(a21, *((X + 4)));  *((Y + 4)) = s47;
  s48 = _mm256_mul_ps(a21, *((X + 5)));  *((Y + 5)) = s48;
}
<...hundreds of lines of code>

AVX Code

(Figure: the SPL compiler generates one specialized code variant per pattern, case 1 ... case N.)


Multithreading, Run across All CPUs

A vectorized load flow solver runs in each thread; each thread is pinned to a physical core exclusively to fully utilize the computational power of the multi-core CPU. Double buffering provides automatic load balancing for the Monte Carlo application.

(Figure: double-buffered real-time scheduling. Within each real-time (SCADA) interval, scheduling thread 0 sends a sync signal; computing threads 1 ... N run RNG & load flow into buffers A1 ... AN while KDE runs over all B buffers and the results are output; at the interval boundary, buffers A and B are switched and the roles reverse. RNG: random number generator; KDE: kernel density estimation.)


Performance Results: Across Sizes

Performance of the optimized code on mass amounts of load flow computations.

(Figure: performance [Gflop/s] on a Core i7-2670QM 2.2 GHz quad-core for 4 to 2048 buses; series: optimized scalar with pattern, optimized AVX with pattern, optimized multicore AVX with pattern, and multicore AVX. Pseudo-flop/s exceed 60% of peak; flop/s reach 50% of peak.)


Details: Performance Gains


(Figure: performance impact of optimization & parallelization on Core i7 [Gflop/s, log scale 1-128] over 4 to 256 buses; series: C++ baseline, scalar, scalar pattern, AVX, AVX pattern, multithread AVX, and multithread AVX pattern (fully optimized). Code optimization alone gains >20x; with parallelization, >50x overall.)


Performance Results: Across Machines


Problem size (IEEE test feeders) | Approx. flops | Approx. time, Core2 Extreme | Approx. time, Core i7 | Baseline C++, ICC -O3 (~300x faster than pure Matlab scripts) | Comments
IEEE37: one iteration | 12 K | ~0.3 us | ~0.3 us | |
IEEE37: one load flow (5 iter.) | 60 K | ~1.5 us | ~1.5 us | | 0.01 kVA error
IEEE37: 1 million load flows | 60 G | < 2 s | < 1 s | ~60 s (>5 hrs Matlab) | SCADA interval: 4 seconds
IEEE123: 1 million load flows | 200 G | < 10 s | < 3.5 s | ~200 s (>15 hrs Matlab) |

(Figure: performance on different machines for IEEE37 [Gflop/s, 0-120]: Core2 Extreme 2.66 GHz (4-core, SSE), Xeon X5680 3.33 GHz (6-core, SSE), Core i7-2670QM 2.2 GHz (4-core, AVX), and 2x Xeon 7560 2.27 GHz (16-core, SSE); bars show optimized scalar, SIMD (SSE or AVX), and multi-core.)


Accuracy

Convergence of Monte Carlo (a very crude check): MCG59 RNG + inverse CDF (ICDF), 50 trials seeded with time(NULL).

(Figure: histograms of the output voltage [p.u.] on node 738, phases A, B, and C, for 100, 1000, 10^4, 10^5, 10^6, and 10^7 Monte Carlo runs; the estimated densities visibly converge as the run count grows. Input: active power P ~ N(mean 0, std 100 kW) on phase A of nodes 738, 711, and 741.)


Outline

Motivation

Distribution System Load Flow Analysis

Code Optimization

Real Time Implementation

Conclusion


System Implementation

Distribution System Probabilistic Monitoring System (DSPMS) system structure:

MCS solver running on a multi-core desktop server (code optimization). Results published via the ECE web server (TCP/IP socket). Web-based dynamic user interface built with client-side scripts (JavaScript). Smart meters in campus buildings (MODBUS/TCP).

(Figure: smart meters feed an aggregator over MODBUS/TCP; the Monte Carlo solver runs on the multi-core desktop computing server; results flow through the campus network to the web server @ ECE CMU, which serves web UIs over the Internet.)


System Implementation

Distribution System Probabilistic Monitoring System (DSPMS): web server and user interface.

Link: www.ece.cmu.edu/~tcui/test/DistSim/DSPMS.htm

(Figure: screenshot of the web UI, served by the web server @ ECE CMU over the Internet.)


Conclusion

Smart distribution network: renewables and stochastic loads have real impact.

Commodity HPC & code optimization: millions of load flow cases per second on a $1K-class machine.

Distribution System Probabilistic Monitor: a proof-of-concept real-time application.



References

[LBNL-3884E] A. Mills, "Implications of Wide-Area Geographic Diversity for Short-Term Variability of Solar Power," LBNL-3884E, Lawrence Berkeley National Laboratory, Berkeley.

[ORNL/TM2004/291] B. Kirby, "Frequency Regulation Basics and Trends," ORNL/TM-2004/291, Oak Ridge National Laboratory, December 2004.

[Pantos2011] M. Pantoš, "Stochastic optimal charging of electric-drive vehicles with renewable energy," Energy, vol. 36, no. 11, November 2011.

[Ghosh97] A. K. Ghosh, D. L. Lubkeman, M. J. Downey, and R. H. Jones, "Distribution Circuit State Estimation Using a Probabilistic Approach," IEEE Transactions on Power Systems, vol. 12, no. 1, pp. 45-51, 1997.

[Belgin2009] M. Belgin, G. Back, and C. J. Ribbens, "Pattern-based sparse matrix representation for memory-efficient SMVM kernels," in Proceedings of the 23rd International Conference on Supercomputing (ICS '09), New York, NY, USA: ACM, 2009, pp. 100-109.

[Puschel2005] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, "SPIRAL: Code generation for DSP transforms," Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation," vol. 93, no. 2, pp. 232-275, 2005.

[Kersting2006] W. Kersting, Distribution System Modeling and Analysis. CRC Press, 2006.


The End

Thank You! Q&A
