
Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

George Bosilca (1), Hatem Ltaief (2), Jack Dongarra (1, 3, 4)

(1) Innovative Computing Laboratory, University of Tennessee, Knoxville

(2) KAUST Supercomputing Laboratory, Thuwal, Saudi Arabia

(3) Oak Ridge National Lab

(4) University of Manchester

International Conference on Energy-Aware High Performance Computing

Hamburg, Germany, Sept 12, 2012

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 1 / 26

Outline

1 Motivations

2 From LAPACK to PLASMA

3 From PLASMA to DPLASMA

4 Power Measurements Technique

5 Power Measurements Results

6 Summary and Future Work


Motivations

The Top500 List

Rank  Name              Vendor                     Cores    Rmax (Tflop/s)  Power (kW)
1     Sequoia           IBM BG/Q                   1572864  16324.75 (81%)  7890.0
2     K Computer        Fujitsu SPARC64            705024   10510.00 (93%)  12659.9
3     Mira              IBM BG/Q                   786432   8162.38 (81%)   3945.0
4     SuperMUC          Intel Xeon E5              147456   2897.00 (91%)   3422.7
5     Tianhe-1A         Intel Xeon X5670 + M2050   186368   2566.00 (55%)   4040.0
6     Jaguar            Cray XK6 Opteron + M2090   298592   1941.00 (74%)   5142.0
7     Fermi             IBM BG/Q                   163840   1725.49 (82%)   821.9
8     JuQUEEN           IBM BG/Q                   131072   1380.39 (82%)   657.5
9     Curie thin nodes  Intel Xeon E5              77184    1359.00 (81%)   2251.0
10    Nebulae           Intel Xeon X5670 + M2050   120640   1271.00 (42%)   2580.0


Human brain: 20 PetaFLOPS! (cf. Kurzweil)


Today’s Special Meal

- 8 MW needed to feed the baby
- Exascale roadmap says up to 20 MW Power Envelope
- Huge challenge: achieving 2 orders of magnitude in performance by only doubling the power rate
- High level of concurrency
- Ingredients: Fine-grain parallelism, Dynamic runtime systems, Power Efficiency
- Flops are cheap, Data movement is expensive
- Co-designed Hardware and Software solutions
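To put the power envelope in numbers, the short calculation below compares the energy efficiency of the top-ranked system in the Top500 table with what an exaflop machine would need inside 20 MW. It is only a back-of-the-envelope sketch: the exaflop target (1e18 flop/s) is taken as the usual meaning of "exascale", and the variable names are illustrative.

```python
# Sequoia figures from the Top500 table; 20 MW is the exascale power envelope.
sequoia_rmax_tflops = 16324.75
sequoia_power_kw = 7890.0

# Tflop/s divided by kW equals Gflop/s per watt (both scaled by 1e3).
sequoia_gflops_per_watt = sequoia_rmax_tflops / sequoia_power_kw

exaflop_in_gflops = 1e9        # 1 Eflop/s expressed in Gflop/s
exascale_power_watts = 20e6    # 20 MW
exascale_gflops_per_watt = exaflop_in_gflops / exascale_power_watts

required_improvement = exascale_gflops_per_watt / sequoia_gflops_per_watt

print(f"Sequoia:  {sequoia_gflops_per_watt:.2f} Gflop/s per watt")
print(f"Exascale: {exascale_gflops_per_watt:.2f} Gflop/s per watt")
print(f"Required efficiency gain: about {required_improvement:.0f}x")
```

Under these assumptions the efficiency has to improve by a factor in the low twenties, which is why power efficiency appears among the ingredients.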


From LAPACK to PLASMA

Block Algorithms

- Panel-Update Sequence
- Transformations are blocked/accumulated within the Panel (Level 2 BLAS)
- Transformations applied at once on the trailing submatrix (Level 3 BLAS)
- Parallelism hidden inside the BLAS
- Fork-join Model
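The panel-update sequence can be sketched for the Cholesky factorization as follows. This is a plain-Python illustration, not LAPACK code: the explicit loops stand in for the BLAS calls, and the function name `cholesky_blocked` and the block width `nb` are chosen here for the example.

```python
import math

def cholesky_blocked(a, nb):
    """Return lower-triangular L with L * L^T = a, using panels of width nb."""
    n = len(a)
    a = [row[:] for row in a]  # work on a copy of the input
    for k in range(0, n, nb):
        kend = min(k + nb, n)
        # Panel, step 1: unblocked factorization of the diagonal block.
        for j in range(k, kend):
            for p in range(k, j):
                for i in range(j, kend):
                    a[i][j] -= a[i][p] * a[j][p]
            a[j][j] = math.sqrt(a[j][j])
            for i in range(j + 1, kend):
                a[i][j] /= a[j][j]
        # Panel, step 2: triangular solve for the rows below the diagonal block.
        for i in range(kend, n):
            for j in range(k, kend):
                s = a[i][j]
                for p in range(k, j):
                    s -= a[i][p] * a[j][p]
                a[i][j] = s / a[j][j]
        # Update: apply the whole panel at once to the trailing submatrix.
        for i in range(kend, n):
            for j in range(kend, i + 1):
                a[i][j] -= sum(a[i][p] * a[j][p] for p in range(k, kend))
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]

# Small symmetric positive definite test matrix (illustrative values).
A = [[4.0, 2.0, 2.0, 1.0],
     [2.0, 5.0, 3.0, 2.0],
     [2.0, 3.0, 6.0, 3.0],
     [1.0, 2.0, 3.0, 7.0]]
L = cholesky_blocked(A, 2)
```

Each pass of the outer loop finishes its panel before touching the trailing submatrix, and the next panel cannot start until that update is complete. That is the synchronization point the fork-join model imposes.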



One-Sided Block Algorithms: LU



Block Algorithms: Fork-Join Paradigm



Tile Data Layout Format

LAPACK: column-major format
PLASMA: tile format
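The difference between the two layouts can be illustrated with a small conversion routine. It is a sketch under stated assumptions: tiles are taken to be ordered column by column, each tile is stored contiguously in column-major order, and the function name `colmajor_to_tiles` is hypothetical rather than a PLASMA routine.

```python
def colmajor_to_tiles(a, m, n, nb):
    """Reorder a flat column-major m x n matrix into contiguous nb x nb tiles."""
    out = []
    for tj in range(0, n, nb):          # tile column
        for ti in range(0, m, nb):      # tile row
            for j in range(tj, min(tj + nb, n)):
                for i in range(ti, min(ti + nb, m)):
                    out.append(a[i + j * m])   # column-major address of (i, j)
    return out

# 4 x 4 example where entry (i, j) holds the value 10*i + j.
m = n = 4
a = [10 * i + j for j in range(n) for i in range(m)]
tiles = colmajor_to_tiles(a, m, n, 2)
print(tiles[:4])   # the first tile: entries (0,0), (1,0), (0,1), (1,1)
```

In the column-major input, the four entries of a 2 x 2 tile are separated by a stride of `m`. After conversion they sit next to each other in memory, so a kernel working on one tile touches one contiguous region.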



PLASMA: Tile Algorithms

PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures ⇒ http://icl.cs.utk.edu/plasma/

- Parallelism is brought to the fore
- May require the redesign of linear algebra algorithms
- Tile data layout translation
- Remove unnecessary synchronization points between Panel-Update sequences
- DAG execution where nodes represent tasks and edges define dependencies between them
- Dynamic runtime system environment QUARK
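To make the DAG view concrete, the sketch below lists the tasks of a tile Cholesky factorization on an `nt` x `nt` grid of tiles and derives the dependency edges from the tiles each task reads and writes. The kernel names follow the usual LAPACK/BLAS naming (POTRF, TRSM, SYRK, GEMM). The loop nest is the commonly described tile Cholesky and is an illustration, not PLASMA source code.

```python
def tile_cholesky_tasks(nt):
    """List (kernel, output_tile, input_tiles) in sequential program order."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", (k, k), []))
        for i in range(k + 1, nt):
            tasks.append(("TRSM", (i, k), [(k, k)]))
        for i in range(k + 1, nt):
            tasks.append(("SYRK", (i, i), [(i, k)]))
            for j in range(k + 1, i):
                tasks.append(("GEMM", (i, j), [(i, k), (j, k)]))
    return tasks

def build_dag(tasks):
    """Add an edge from the last writer of a tile to each later task using it.

    Only read-after-write and write-after-write dependencies are tracked;
    in this algorithm a tile is final by the time it is read as an input.
    """
    last_writer = {}
    edges = set()
    for t, (kernel, out, ins) in enumerate(tasks):
        for tile in ins + [out]:
            if tile in last_writer:
                edges.add((last_writer[tile], t))
        last_writer[out] = t
    return edges

tasks = tile_cholesky_tasks(3)
edges = build_dag(tasks)
print(len(tasks), "tasks,", len(edges), "edges")
```

Any two tasks not connected by a path in this graph may run concurrently, for example the two TRSM tasks of the first step.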



Dynamic Runtime System QUARK

Basic Ideas:

- Conceptually similar to out-of-order processor scheduling
- Dynamic runtime DAG scheduler
- Out-of-order execution flow of fine-grained tasks
- Task scheduling as soon as dependencies are satisfied
- Producer-Consumer

Similar projects: SuperMatrix, OmpSs, StarPU
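The idea of releasing a task as soon as its dependencies are satisfied can be sketched with a dependency counter per task. This is a sequential toy model, not the QUARK API: a real runtime would hand ready tasks to worker threads, and the names `run_dataflow` and `execute` are hypothetical.

```python
from collections import deque

def run_dataflow(num_tasks, edges, execute):
    """Run tasks 0..num_tasks-1, honoring (producer, consumer) edges."""
    pending = [0] * num_tasks                 # unsatisfied inputs per task
    consumers = [[] for _ in range(num_tasks)]
    for src, dst in edges:
        pending[dst] += 1
        consumers[src].append(dst)
    ready = deque(t for t in range(num_tasks) if pending[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        execute(t)
        order.append(t)
        for c in consumers[t]:                # producer done: notify consumers
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    if len(order) != num_tasks:
        raise ValueError("dependency cycle: some tasks never became ready")
    return order

# Diamond-shaped graph: task 0 feeds tasks 1 and 2, which both feed task 3.
log = []
order = run_dataflow(4, [(0, 1), (0, 2), (1, 3), (2, 3)], log.append)
print(order)
```

After task 0 completes, tasks 1 and 2 are both in the ready queue at once. That is the point where a parallel runtime would execute them on different cores.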



DataFlow Programming Model

- Five-decade-old concept
- Programming paradigm that models a program as a directed graph of the data flowing between operations (cf. Wikipedia)
- Think "how things connect" rather than "how things happen"
- Assembly line
- Inherently parallel


From PLASMA to DPLASMA

2D Block Cyclic Distribution

Figure: Two-Dimensional Block Cyclic Data Distribution. Panels: (a) Column-major data layout format within a block, (b) Tile data layout format within a block. Blocks are labeled with the owning process, 0 to 3.
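A two-dimensional block cyclic distribution assigns block (i, j) to a process by taking the block indices modulo the process grid dimensions. The sketch below assumes a `p` x `q` grid with row-major rank numbering. The exact numbering used in the figure cannot be recovered from the extracted text, so the printed map is illustrative only.

```python
def owner(i, j, p, q):
    """Rank owning block (i, j) on a p x q grid (row-major ranks, assumed)."""
    return (i % p) * q + (j % q)

def local_index(i, j, p, q):
    """Position of block (i, j) inside the local storage of its owner."""
    return (i // p, j // q)

# 4 x 4 blocks spread over a 2 x 2 process grid (4 processes).
p, q = 2, 2
ownership = [[owner(i, j, p, q) for j in range(4)] for i in range(4)]
for row in ownership:
    print(" ".join(str(rank) for rank in row))

blocks_per_rank = [sum(row.count(r) for row in ownership) for r in range(p * q)]
```

Every process receives the same number of blocks from every region of the matrix. This keeps the load balanced as the factorization shrinks the active trailing submatrix.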



DAGuE Dynamic Runtime Scheduler

Bosilca et al., UTK



Power Measurements Technique

The PowerPack Framework

- Dual-socket quad-core Intel Xeon system from Virginia Tech, clocked at 2.8 GHz with 8 GB of memory
- Measurements from power meters attached to the hardware of the system
- Fine-grain measurement (100 ms) allows power consumption to be measured on a per-component basis
- CPU, memory, hard disk, motherboard and System (as a whole)
- N = 40000 for all experiments
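Power samples taken at a fixed interval can be turned into an energy figure by integrating over time. The sketch below uses the trapezoidal rule with the 100 ms sampling interval mentioned above. The function name and the sample values are hypothetical, and this is not the PowerPack interface.

```python
def energy_joules(power_watts, dt=0.1):
    """Integrate equally spaced power samples (watts) into energy (joules)."""
    if len(power_watts) < 2:
        return 0.0
    return sum(0.5 * (a + b) * dt for a, b in zip(power_watts, power_watts[1:]))

# Hypothetical per-component traces covering one second (11 samples each).
traces = {
    "CPU": [200.0] * 11,
    "Memory": [30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 38.0, 36.0, 34.0, 32.0, 30.0],
}
energy = {name: energy_joules(samples) for name, samples in traces.items()}
for name, joules in energy.items():
    print(f"{name}: {joules:.1f} J")
```

Because energy is power multiplied by time, a run that draws more power can still use less energy if it finishes sooner.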


K. Cameron et al., Virginia Tech


Power Measurements Results

Figure: Impact of the block size on the power profiles (Watts) of the ScaLAPACK Cholesky. Panels: (a) block size = 32, (b) block size = 128, (c) block size = 512. Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.

Figure: Impact of the tile size on the power profiles (Watts) of the DPLASMA Cholesky. Panels: (a) tile size = 48, (b) tile size = 192, (c) tile size = 768. Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.



Figure: Impact of the block size on the power profiles (Watts) of the ScaLAPACK QR Factorization. Panels: (a) block size = 32, (b) block size = 128, (c) block size = 512. Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.

Figure: Impact of the tile size on the power profiles (Watts) of the DPLASMA QR Factorization. Panels: (a) tile size = 48, (b) tile size = 192, (c) tile size = 768. Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network. The time axis extends to 450 seconds in panel (a) and to 100 seconds in panels (b) and (c).



Figure: Impact of the number of cores on the power profiles (Watts) of the ScaLAPACK Cholesky Factorization. Panels: (a) Number of cores = 128, (b) Number of cores = 512. Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.

Figure: Impact of the number of cores on the power profiles (Watts) of the DPLASMA Cholesky Factorization. Two panels (panel titles not recoverable). Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.



Figure: Impact of the number of cores on the power profiles (Watts) of the ScaLAPACK QR. Two panels (panel titles not recoverable). Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.

Figure: Impact of the number of cores on the power profiles (Watts) of the DPLASMA QR. Two panels (panel titles not recoverable). Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.



Figure: Power Profiles of the Cholesky Factorization. Panels: (a) ScaLAPACK, (b) DPLASMA. Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.



Figure: Power Profiles of the QR Factorization. Panels: (a) ScaLAPACK, (b) DPLASMA. Axes: Power (Watts) versus Time (seconds); curves: System, CPU, Memory, Network.



# Cores  Library    Cholesky  QR
128      ScaLAPACK  192000    672000
128      DPLASMA    128000    540000

Figure: Total amount of energy (joules) used for each test, based on the number of cores.
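The relative energy saving for the 128-core row can be worked out directly from the table. The snippet below is plain arithmetic on the values shown; the dictionary layout and function name are illustrative.

```python
# Energy in joules for the 128-core runs, copied from the table.
measured_energy = {
    "Cholesky": {"ScaLAPACK": 192000, "DPLASMA": 128000},
    "QR": {"ScaLAPACK": 672000, "DPLASMA": 540000},
}

def saving_percent(baseline, improved):
    """Energy saved relative to the baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline

savings = {
    name: saving_percent(e["ScaLAPACK"], e["DPLASMA"])
    for name, e in measured_energy.items()
}
for name, pct in savings.items():
    print(f"{name}: DPLASMA uses {pct:.1f}% less energy than ScaLAPACK")
```

For this row the saving is about one third for Cholesky and about one fifth for QR. The larger "up to" savings quoted in the conclusion cannot be derived from this row alone.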


Summary and Future Work

Conclusion

- Stressing the system’s components
- DPLASMA Cholesky algorithms decrease the energy consumption by up to 62% compared to ScaLAPACK Cholesky
- DPLASMA QR algorithms decrease the energy consumption by up to 40% compared to ScaLAPACK QR
- Asynchronous execution runtimes and adapted algorithms can lead to significantly improved efficiency and power savings


What’s next?

- Power analysis of advanced numerical algorithms on distributed systems (two-sided transformations, tree reduction, mixed precision)
- Comparisons with other DLA libraries: Elemental, Eigen-K, ELPA
- Distributed heterogeneous architectures
- Scheduler interaction through DVFS/Intel RAPL technology
- Running on IBM BG/P and BG/Q and exploiting embedded power collection hardware tools

