TRANSCRIPT
Carnegie Mellon
Optimized Parallel Distribution Load Flow Solver on Commodity Multi-core CPU
Tao Cui (Presenter), Franz Franchetti
Dept. of ECE, Carnegie Mellon University, Pittsburgh
[email protected]
This work is supported by NSF 0931978 & 1116802
Smart Grids
New players in the grid
Challenges: undispatchable, large variance, great impact on the grid; a large population exhibits stochastic properties
Images from wikipedia
Source: LBNL-3884e Source: ORNL/TM2004/291 Source: Pantos 2011
Conventional Distribution System: passively receiving power; few real-time monitoring or controls
Challenges in Distribution System: solar and wind are stochastic, with large variance and impact
Smart Distribution System: new sensors (smart meters) plus a high-performance deskside supercomputer
A Computational Tool for Probabilistic Grid Monitoring
Motivation
[Figure: the conventional power grid: hydro, coal, nuclear, and medium-size power plants feeding factories over extra-high- and high-voltage lines. Image from Wikipedia]
[Figure: a deskside multi-core server: ~1 Tflop/s for ~$1000 and ~1 kW of power. Image from Dell]
Outline
Motivation
Distribution System Load Flow Analysis
Code Optimization
Real Time Implementation
Conclusion
Core: Distribution Load Flow
Distribution System: radial, high R/X ratio, varying Z, unbalanced: NOT suitable for transmission load flow methods. Use Forward/Backward Sweep (FBS)
Implicit Z-matrix, detailed models, good convergence; Generalized Component Model [Kersting2006]
One Terminal Node Model, Constant PQ:
  [S_abc] = [V_abc] [I_abc]*

Two Terminal Link Model:
  [I_abc]_n = c [V_abc]_m + d [I_abc]_m
  [V_abc]_m = A [V_abc]_n - B [I_abc]_m
[Figure: Two Terminal Link Model between node n and node m (phases A, B, C; [V_abc]_n, [V_abc]_m, [I_abc]_n, [I_abc]_m) and One Terminal Node Model ([V_abc], [I_abc], [S_abc])]
IEEE 37 Node Test Feeder: based on an actual feeder in California
Source: IEEE PES Distribution System Analysis Subcommittee
Core: Distribution Load Flow
Forward / Backward Sweep [Kersting2006]
Branch-current based. Input: substation voltage, loads; output: all node voltages. Steps:
1. Initialize branch currents to 0, node voltages V = V0
2. Compute node injection currents In using the Node model
3. Backward sweep: compute branch currents Ib using the Link model & KCL
4. Forward sweep: update Vk+1 from Vk based on Ib over the Link model
5. Check convergence (|dS| < error limit): stop, or go to step 2
[Figure: IEEE 4 Node Test Feeder example: infinite bus (V1 = 1) feeding nodes 2, 3, 4 through Z12, transformer Xfm23(V,I), and Z34 (2000 ft and 2500 ft segments); the backward sweep accumulates branch currents Ib1, Ib2, Ib3 from loads S2, S3, S4, and the forward sweep updates V2, V3, V4]
Core: Distribution Load Flow
3-Phase Voltage on IEEE 37 Node Test Feeder
[Figure: three feeder maps (Phase A, Phase B, Phase C) over nodes 701-744, 775, and 799, with per-unit node voltages color-coded between 0.90 and 1.10 around nominal]
ANSI C84.1: Nominal: 115 V, Range A: 110~126 V, Range B: 107~127 V
Our Approach
Random Number Generator: basic uniform RNG + transformation for different PDFs; parallel strategy for multi-threaded implementation
Optimized Parallel Distribution Load Flow Solver: code optimizations; highly parallel implementation for Monte Carlo applications
Density Estimation & Visualization: kernel density estimation
[Figure: random variable sampling via inverse CDF and hypercube sampling; a main thread dispatches 2~64 parallel threads, each running a vectorized solver]
Outline
Motivation
Distribution System Load Flow Analysis
Code Optimization
Real Time Implementation
Conclusion
Optimization: Data Structures
Data Structure Optimization. Baseline: C++ object-oriented code operating on a tree object
Translate tree traversal into array accesses; exploit spatial/temporal locality
Other techniques: unroll innermost loops, scalar replacement, precompute as much as possible
[Figure: the C++ tree structure (substation N0 feeding loads N1~N5) translated to C arrays: a forward-sweep pointer array pN0..pN5, a backward-sweep pointer array pN5..pN0, and per-node data (parameters, current, voltage) stored consecutively in one data array]
GridLab-D: the Smart Grid Simulator, www.gridlabd.org, open source since 2009
Optimization: Pattern Based Synthesis
Algorithm-level Optimization: pattern-based matrix-vector multiplication
For the A, B, c, d matrices: a multi-grounded cable gives a diagonal matrix; ignoring shunt & coupling gives c = 0, d = I, A = I
Reduces unnecessary operations; reduces unnecessary storage for better memory access; similar to [Belgin2009]
  [I_abc]_n = c [V_abc]_m + d [I_abc]_m
  [V_abc]_m = A [V_abc]_n - B [I_abc]_m
(C Pattern) Dispatch: case 1 / case 2 / ... / case N, each with specialized code:

switch (mat_type) {
case real_diag_equal_mat:
    output[0] = *constant * input[0];
    ...
    output[5] = *constant * input[5];
    break;
case imag_diag_equal_mat:
    output[0] = -*constant * input[3];
    output[1] = -*constant * input[4];
    output[2] = -*constant * input[5];
    output[3] = *constant * input[0];
    output[4] = *constant * input[1];
    output[5] = *constant * input[2];
    break;
...
}

Compressed storage keyed by mat_type: the full diagonal matrix [a,0,0; 0,b,0; 0,0,c] is stored as just [a, b, c].
Data Parallelism (SIMD)
SIMD parallelization. SIMD: Single Instruction Multiple Data
SSE: Streaming SIMD Extensions: 128-bit registers, 4 floats each, e.g. 4 "fadd" at the cost of 1 "addps"
AVX: Advanced Vector eXtensions (256-bit, 8 floats); Larrabee (512-bit, 16 floats)
Vectorized solver at the SIMD level for Monte Carlo simulation
Assumptions & limitations: all load flows in a vector must converge in the same number of steps
[Figure: 4-way SSE example: the vector operation addps xmm0, xmm1 computes [1 2 4 4] + [5 1 1 3] = [6 3 5 7] in one instruction. Scalar solver: 1 float per scalar register, scalar instructions, one load flow sample at a time. Vectorized solver: 4 floats per SSE vector register, SIMD instructions, 4 independent load flow samples at a time]
Variant Synthesis with SPIRAL
Symbolic process [Puschel2005]: generating pattern-based matrix-vector product code
# Use Print_SWBody() to print code cases for all matrix patterns
Print_SWBody := function(case, MMr, MMi, opts, unparser)
    local MatAll, st, cs;
    MatAll := NullMat(6,6);
    MatAll{[1..3]}{[1..3]} := MMr;
    MatAll{[4..6]}{[4..6]} := MMr;
    MatAll{[4..6]}{[1..3]} := MMi;
    MatAll{[1..3]}{[4..6]} := -MMi;
    st := Blk(MatAll);              # SPL description of patterned matrix
    cs := CodeSums(st, opts);       # compiled to intermediate code
    unparser.opts := opts;          # options: targeted language, intrinsics
    Print("case ", case, ":");
    Print("\n{\n");
    Unparse(cs, unparser, 0, 1);    # unparse intermediate code to actual code
    Print("}\n");
end;

SPIRAL Script
switch (pattern) {
case zero_matrix: {
    *(Y)     = 0.0;
    *(Y + 1) = 0.0;
    *(Y + 2) = 0.0;
    *(Y + 3) = 0.0;
    *(Y + 4) = 0.0;
    *(Y + 5) = 0.0;
}
case real_diagonal_equal_matrix: {
    *(Y)     = (*(u1) * *(X));
    *(Y + 1) = (*(u1) * *(X + 1));
    *(Y + 2) = (*(u1) * *(X + 2));
    *(Y + 3) = (*(u1) * *(X + 3));
    *(Y + 4) = (*(u1) * *(X + 4));
    *(Y + 5) = (*(u1) * *(X + 5));
}
<...hundreds of lines of code>
}

Scalar Code
switch (mat_type) {
case zero_matrix: {
    *(Y)     = _mm256_set1_ps(0.0f);
    *(Y + 1) = _mm256_set1_ps(0.0f);
    *(Y + 2) = _mm256_set1_ps(0.0f);
    *(Y + 3) = _mm256_set1_ps(0.0f);
    *(Y + 4) = _mm256_set1_ps(0.0f);
    *(Y + 5) = _mm256_set1_ps(0.0f);
}
case real_diagonal_equal_matrix: {
    __m256 a21, s43, s44, s45, s46, s47, s48;
    a21 = *(u1);
    s43 = _mm256_mul_ps(a21, *(X));     *(Y)     = s43;
    s44 = _mm256_mul_ps(a21, *(X + 1)); *(Y + 1) = s44;
    s45 = _mm256_mul_ps(a21, *(X + 2)); *(Y + 2) = s45;
    s46 = _mm256_mul_ps(a21, *(X + 3)); *(Y + 3) = s46;
    s47 = _mm256_mul_ps(a21, *(X + 4)); *(Y + 4) = s47;
    s48 = _mm256_mul_ps(a21, *(X + 5)); *(Y + 5) = s48;
}
<...hundreds of lines of code>
}

AVX Code
[Diagram: the SPL compiler unparses each matrix pattern into case 1 ... case N of the generated switch]
Multithreading: Run across All CPUs
Vectorized load flow solver in each thread; each thread pinned to a physical core exclusively
Fully utilizes the computational power of multi-core CPUs
Double buffering (automatic load balancing for the Monte Carlo application)
[Figure: within each real-time (SCADA) interval, scheduling thread 0 sends a sync signal; computing threads 1..N run RNG & load flow into buffers A1..AN while KDE runs over all B buffers and outputs results; on the next sync signal, buffers A and B are switched. RNG: Random Number Generator; KDE: Kernel Density Estimation]
Performance Results: Across Sizes
Performance of Optimized Code: Mass Amounts of Load Flows
[Figure: Performance [Gflop/s], 0 to 90, vs. bus number (4 to 2048) on a Core i7 2670QM 2.2 GHz quad-core; series: Optimized Scalar with Pattern, Optimized AVX with Pattern, Optimized Multicore AVX with Pattern, Multicore AVX]
Pseudo flop/s: >60% of peak
Flop/s: 50% of peak
Details: Performance Gains
[Figure: Performance Impact of Optimization & Parallelization on Core i7. Performance [Gflop/s], log scale 1 to 128, vs. bus numbers 4 to 256; series: C++ Baseline, Scalar, Scalar Pattern, AVX, AVX Pattern, MultiThread AVX, MultiThread AVX Pattern (fully optimized); annotated gains: >20x and >50x]
Performance Results: Across Machines
Problem Size (IEEE Test Feeders)  Approx. flops  Approx. Time (Core2 Extreme)  Approx. Time (Core i7)  Baseline: C++ ICC -O3 (~300x faster than pure Matlab scripts)  Comments
IEEE37: one iteration             12 K           ~0.3 us                       ~0.3 us
IEEE37: one load flow (5 iter.)   60 K           ~1.5 us                       ~1.5 us                                                                                0.01 kVA error
IEEE37: 1 million load flows      60 G           < 2 s                         < 1 s                   ~60 s (>5 hrs Matlab)                                          SCADA interval: 4 seconds
IEEE123: 1 million load flows     200 G          < 10 s                        < 3.5 s                 ~200 s (>15 hrs Matlab)
[Figure: Performance [Gflop/s], 0 to 120, on different machines for IEEE37: Core2 Extreme 2.66 GHz (4-core, SSE), Xeon X5680 3.33 GHz (6-core, SSE), Core i7-2670QM 2.2 GHz (4-core, AVX), 2x Xeon 7560 2.27 GHz (16-core, SSE); bars: Optimized Scalar, SIMD (SSE or AVX), Multi-Core]
Accuracy
Convergence of Monte Carlo (very crude): MCG59 + inverse CDF, 50 trials with time(NULL) seeds
[Figure: histograms of the output voltage on node 738 (Phases A, B, C, in p.u.) for 100, 1000, 10000, 100000, 1000000, and 10000000 runs; the estimated densities visibly converge as the run count grows. Input: active power P ~ (u = 0, std = 100 kW) on Phase A of nodes 738, 711, 741]
Outline
Motivation
Distribution System Load Flow Analysis
Code Optimization
Real Time Implementation
Conclusion
System Implementation
Distribution System Probabilistic Monitoring System (DSPMS), system structure:
MCS solver running on a multi-core desktop server (code optimization)
Results published via an ECE web server (TCP/IP socket)
Web-based dynamic user interface via client-side scripts (JavaScript)
Smart meters in campus buildings (MODBUS/TCP)
[Figure: smart meters feed an aggregator over MODBUS/TCP; the Monte Carlo solver on a multi-core desktop computing server publishes through the campus-network web server @ ECE CMU to web UIs over the Internet]
System Implementation
Distribution System Probabilistic Monitoring System (DSPMS): Web Server and User Interface
Link: www.ece.cmu.edu/~tcui/test/DistSim/DSPMS.htm
[Figure: web UI screenshots, served from the web server @ ECE CMU over the Internet]
Conclusion
Smart distribution network: impact of renewable and stochastic resources
Commodity HPC & code optimization: millions of load flow cases per second on a $1K-class machine
Distribution System Probabilistic Monitor: a proof-of-concept real-time application
References
[LBNL-3884E] A. Mills, "Implications of Wide-Area Geographic Diversity for Short-Term Variability of Solar Power," LBNL-3884E, Lawrence Berkeley National Laboratory, Berkeley.
[ORNL/TM2004/291] B. Kirby, "Frequency Regulation Basics and Trends," ORNL/TM-2004/291, Oak Ridge National Laboratory, December 2004.
[Pantos2011] M. Pantoš, "Stochastic optimal charging of electric-drive vehicles with renewable energy," Energy, vol. 36, no. 11, November 2011.
[Ghosh97] A. K. Ghosh, D. L. Lubkeman, M. J. Downey, and R. H. Jones, "Distribution Circuit State Estimation Using a Probabilistic Approach," IEEE Transactions on Power Systems, vol. 12, no. 1, pp. 45-51, 1997.
[Belgin2009] M. Belgin, G. Back, and C. J. Ribbens, "Pattern-based sparse matrix representation for memory-efficient SMVM kernels," in Proceedings of the 23rd International Conference on Supercomputing (ICS '09), New York, NY, USA: ACM, 2009, pp. 100-109.
[Puschel2005] M. Puschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, "SPIRAL: Code generation for DSP transforms," Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation," vol. 93, no. 2, pp. 232-275, 2005.
[Kersting2006] W. Kersting, Distribution System Modeling and Analysis. CRC Press, 2006.