
Slide 1

On Data Intensive Computing and Exascale

Alok Choudhary
John G. Searle Professor
Dept. of Electrical Engineering and Computer Science, and Professor, Kellogg School of Management
Northwestern University
[email protected]

Slide 2

Science and Society Transformed by Data


Slide 3

Discovering Knowledge from Massive Data – Data Driven?

Crowd sourcing, devices/sensors, and simulations all feed the DATA DRIVEN paradigm.


Slide 4

“Data Intensive” vs. “Data Driven”

Data Intensive (DI)
●  Depends on the perspective – processor, memory, application, storage?
●  An application can be data intensive without (necessarily) being I/O intensive.

Data Driven (DD)
●  Operations are driven (and defined) by data
   –  Massive transactions
   –  BIG analytics
      ◦  Top-down query (well-defined operations)
      ◦  Bottom-up discovery (unpredictable time-to-result)
   –  BIG data processing
   –  Predictive computing
●  Usage model further differentiates these
   –  Single app, few users
   –  Large numbers of users, sharing, historical/temporal data

Very few large-scale applications of practical importance are NOT data intensive.


Slide 5

It is a Process: Transactional to Temporal

Raw Data → Target Data → Transformed Data → Patterns → Knowledge, reached through integration and understanding.


Slide 6

Process Illustration: Predictive Insights into Extreme Events

Steps for discovery of multivariate non-linear interactions and for predictive modeling of hurricanes (e.g., the impact of SST on hurricane frequency & intensity); a small prototype of the network steps is sketched below.

1.  Pre-process ancillary climate model outputs
    –  IPCC AR4 models, CMIP3 datasets: monthly mean sea surface temperature, monthly mean atmospheric temperature, daily horizontal wind at 250/850 hPa
2.  Construct a multivariate nonlinear climate network
3.  Detect & track communities; find non-linear relationships
4.  Validate with hindcasts
5.  Determine non-stationary, non-i.i.d. climate states & build hurricane models
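Steps 2–3 can be prototyped at toy scale. A minimal sketch, assuming synthetic anomaly series in place of the CMIP3 fields, Spearman rank correlation as the (rank-based, nonlinear-friendly) association measure, and connected components as a crude stand-in for community detection:

```python
# Minimal sketch of steps 2-3: build a climate network from grid-point time
# series and extract its connected components as crude "communities".
# Synthetic data stands in for the CMIP3 monthly-mean fields.
import numpy as np
import networkx as nx
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_points, n_months = 50, 240            # 50 grid points, 20 years of monthly data
series = rng.normal(size=(n_points, n_months))

# Step 2: pairwise rank-correlation network between grid points
rho, _ = spearmanr(series, axis=1)      # n_points x n_points correlation matrix
adj = np.abs(rho) > 0.3                 # keep only strong links (threshold is arbitrary)
np.fill_diagonal(adj, False)

# Step 3: detect communities; connected components are the simplest proxy
G = nx.from_numpy_array(adj.astype(int))
communities = list(nx.connected_components(G))
print(f"{G.number_of_edges()} links, {len(communities)} components")
```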


Slide 7

CMIP3 → CMIP5

●  Coupled Model Intercomparison Project
●  Spatial resolution: 1 – 0.25 degrees
●  Temporal resolution: 6 hours – 3 hours
●  Models: 24 – 37
●  Simulation experiments: 10s – 100s
   –  Control runs & hindcasts
   –  Decadal & centennial-scale forecasts
●  Covers 1000s of simulation years
●  100+ variables
●  10s of TBs to 10s of PBs

[Figure: summary of CMIP5 model experiments, grouped into three tiers.]
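To make the data handling concrete, here is a minimal sketch of pulling one variable out of such an archive with xarray. The filename, the variable name "tos" (sea surface temperature), and the dimension names are placeholders; real CMIP files vary by model:

```python
import xarray as xr

# Placeholder CMIP-style file and variable names (assumptions, not from the slide).
ds = xr.open_dataset("tos_Omon_somemodel_historical_r1i1p1_195001-200012.nc")
sst = ds["tos"]

monthly_clim = sst.groupby("time.month").mean("time")   # monthly-mean climatology
anomalies = sst.groupby("time.month") - monthly_clim    # deseasonalized series
print(anomalies.mean(dim=["lat", "lon"]))               # naive area-mean time series
```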


Slide 8

A “DATA DRIVEN DISCOVERY” WORTH A THOUSAND SIMULATIONS?

A different way of thinking?


Slide 9

Discovering Materials: Simulations → Analytics

Construction of FE prediction database
●  Consists of compounds with known formation energy (FE)
●  Empirical periodic-table information added (e.g., electronegativity, mass, atomic radii, number of valence s, p, d, f electrons)

Predictive Modeling (sketched below)
●  Construct data mining models to predict formation energy using the chemical formula and derivable empirical information

Model Evaluation
●  Test the model on unseen data
●  10-fold cross validation (data divided into 10 segments, model built on 9 segments and tested on the remaining segment; process repeated 10 times with a different test segment)

Large-scale FE Prediction
●  Run a combinatorial list of compounds through the FE model

Screening
●  Thermodynamic stability and heuristics

Validation
●  Structure prediction
●  Quantum mechanical modeling

[Flow diagram, panels (a)/(b): combinatorial list of ternary compounds, list of predictions, shortlisted high-potential candidates, QM models, stable discovered structures.]
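A minimal sketch of the modeling and evaluation steps, with synthetic composition features standing in for the database described above and scikit-learn's random forest standing in for whatever data-mining model was actually used:

```python
# Sketch of "Predictive Modeling" and "Model Evaluation": fit a regressor that
# maps composition-derived features to formation energy (FE) and score it with
# 10-fold cross-validation. Features and targets are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
n_compounds, n_features = 2000, 20          # e.g. electronegativity, mass, radii, ...
X = rng.normal(size=(n_compounds, n_features))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=n_compounds)   # stand-in FE values

model = RandomForestRegressor(n_estimators=200, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
print("10-fold CV MAE:", -scores.mean())

# "Large-scale FE prediction": score a (here tiny) combinatorial candidate list
model.fit(X, y)
candidates = rng.normal(size=(100, n_features))
predicted_fe = model.predict(candidates)
shortlist = candidates[np.argsort(predicted_fe)[:10]]   # screen by lowest predicted FE
```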


Slide 10

The Data Driven Discovery Ecosystem

A feedback loop connects three stages: transactional (data generation), historical (data processing/organization), and relational (discovery/predictive modeling). Learning models built over the historical data respond to triggers/questions and predict; feedback refines the models and drives new experiments and control. Data reduction, data management, and data query support the pipeline throughout.


Slide 11

COMPUTE CENTRIC TO DISCOVERY CENTRIC: HOW MANY FLOPS IS A LOOKUP WORTH?

Slide 12

Potential System Architecture (IESP)

Systems                       2011 (K computer)     2019                   Difference (today vs. 2019)
System peak                   10.5 Pflop/s          1 Eflop/s              O(100)
Power                         12.7 MW               ~20 MW
System memory                 1.6 PB                32 – 64 PB             O(10)
Node performance              128 GF                1, 2, or 15 TF         O(10) – O(100)
Node memory BW                64 GB/s               2 – 4 TB/s             O(100)
Node concurrency              8                     O(1k) or 10k           O(100) – O(1000)
Total node interconnect BW    20 GB/s               200 – 400 GB/s         O(10)
System size (nodes)           88,124                O(100,000) or O(1M)    O(10) – O(100)
Total concurrency             705,024               O(billion)             O(1,000)
MTTI                          days                  O(1 day)               -O(10)
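The shift in balance is easier to see as ratios. A quick derived calculation from the table above (the bytes-per-flop and memory-per-thread figures are computed here, not taken from the slide; the 2019 memory and concurrency values use representative points of the listed ranges):

```python
# Derived balance ratios from the 2011 vs. 2019 projection table.
PFLOP, EFLOP, PB, GB = 1e15, 1e18, 1e15, 1e9

peak_2011, peak_2019 = 10.5 * PFLOP, 1 * EFLOP
mem_2011, mem_2019 = 1.6 * PB, 48 * PB          # 48 PB: midpoint of 32-64 PB
conc_2011, conc_2019 = 705_024, 1e9             # O(billion) taken as 1e9

print("bytes/flop:     %.3f -> %.3f" % (mem_2011 / peak_2011, mem_2019 / peak_2019))
print("memory/thread:  %.2f GB -> %.3f GB" % (mem_2011 / conc_2011 / GB,
                                              mem_2019 / conc_2019 / GB))
```

Memory per unit of compute and per thread shrinks by one to two orders of magnitude, which is what motivates the "balanced approach" slides that follow.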

Slide 13

Potential System Architecture with a cap of $200M and 20 MW

(Same 2011-vs-2019 projection table as the previous slide, now under a cost cap of $200M and a power cap of 20 MW.)


Slide 14

Balanced Approach to Architecture

[Diagram: architectural balance among FLOPS/OPS, memory, and storage, shown from both the compute-centric perspective (FLOPS → memory → storage) and the data-centric perspective (storage → memory → FLOPS/OPS).]


Slide 15

I/O software stack in HPC – Narrow Universe

[Diagram: application/compute nodes (A) forward I/O through a smaller set of I/O delegate/discovery nodes (D) to the I/O servers (S).]

Software stack on the compute nodes: Applications → high-level I/O libraries (Parallel netCDF, HDF5, ...) → MPI-IO → client-side file system, connected over the network to the I/O servers. A minimal sketch of the MPI-IO layer follows.
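A hedged sketch of that layer, assuming mpi4py is available: each rank writes its own block of a shared file with a collective MPI-IO call, the kind of operation Parallel netCDF and HDF5 issue underneath (run under mpiexec; the file name is arbitrary):

```python
# Collective MPI-IO write: every rank writes a contiguous block of a shared
# file at a rank-dependent offset.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1024, rank, dtype=np.float64)        # this rank's data block
nbytes = local.nbytes

fh = MPI.File.Open(comm, "checkpoint.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * nbytes, local)                # collective write, offset by rank
fh.Close()
```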


Slide 16

Supercomputers (Current): Illustration of Simulation Dataset Sizes (dated!)

Application                                                               On-Line Data   Off-Line Data
FLASH: Buoyancy-Driven Turbulent Nuclear Burning                          75 TB          300 TB
Reactor Core Hydrodynamics                                                2 TB           5 TB
Computational Nuclear Structure                                           4 TB           40 TB
Computational Protein Structure                                           1 TB           2 TB
Performance Evaluation and Analysis                                       1 TB           1 TB
Kinetics and Thermodynamics of Metal and Complex Hydride Nanoparticles    5 TB           100 TB
Climate Science                                                           10 TB          345 TB
Parkinson's Disease                                                       2.5 TB         50 TB
Plasma Microturbulence                                                    2 TB           10 TB
Lattice QCD                                                               1 TB           44 TB
Thermal Striping in Sodium Cooled Reactors                                4 TB           8 TB
Gating Mechanisms of Membrane Proteins                                    10 TB          10 TB

© Alok Choudhary


Slide 17

Balanced Approach to Architecture: I/O + Analytics

[Diagram: the same FLOPS/OPS–memory–storage balance as before, with the I/O software layers (Parallel netCDF over a parallel file system) placed in the picture.]


Slide 18

Scaling I/O and Analytics: Accelerating Time to Discovery

I/O was previously a major bottleneck; execution at this scale became possible only because I/O was scaled.

●  Global Cloud Resolving Model (GCRM)
   –  Simulates the circulation associated with large convective clouds
   –  Developed by David Randall (Colorado State U.) & Karen Schuchardt (PNNL)
●  Geodesic grid model
●  1.4 PB of data per simulation
   –  4 km resolution, 3-hourly output, 1 simulated year
   –  1.5 TB per checkpoint


Slide 19

Analytics and I/O: Accelerating Time to Discovery

●  Improved I/O throughput
   –  PnetCDF optimizations give massive scalability
   –  For a 3.5 km grid resolution, the grid has 41.9M cells with 256 vertical layers
   –  Measured for both data-analysis reads and simulation checkpoints


Slide 20

ARCHITECTURE, ALGORITHMS, BENCHMARKING, CO-DESIGN


Slide 21

Large-scale Analytics: Data Analysis Kernels

Performance is typically dominated by a few kernels (illustrative; a k-means example is sketched below):

Application      Kernel 1 (%)      Kernel 2 (%)      Kernel 3 (%)      Σ Top-3 (%)
K-means          Distance (68)     Center (21)       minDist (10)      99
Fuzzy K-means    Center (58)       Distance (39)     fuzzySum (1)      98
BIRCH            Distance (54)     Variance (22)     Redist (10)       86
HOP              Density (39)      Search (30)       Gather (23)       92
Naïve Bayesian   probCal (49)      Variance (38)     dataRead (10)     97
ScalParC         Classify (37)     giniCalc (36)     Compare (24)      97
Apriori          Subset (58)       dataRead (14)     Increment (8)     80
Eclat            Intersect (39)    addClass (23)     invertC (10)      72
SVMlight         quotMatrix (57)   quadGrad (38)     quotUpdate (2)    97
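The k-means row is easy to reproduce in spirit: in a minimal NumPy implementation, essentially all the time goes to the distance and minDist steps. A sketch on synthetic data (toy sizes, not the benchmark itself):

```python
# Minimal k-means loop; the pairwise-distance step ("Distance"/"minDist" in
# the table above) dominates the runtime.
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Distance kernel: n x k squared Euclidean distances
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)                      # minDist kernel
        # Center-update kernel
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

X = np.random.default_rng(1).normal(size=(100_000, 8))
centers, labels = kmeans(X, k=16)
```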


Slide 22

Data mining Programs: Do they have different characteristics?

[Figure: clustering of benchmark programs by execution characteristics (y-axis: cluster number, 0–11). The NU-MineBench data mining codes (apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe) fall into different clusters than the SPEC INT (gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser), SPEC FP (apsi, art, equake, lucas, mesa, mgrid, swim, wupwise), MediaBench (rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast), and TPC-H (Q17, Q3, Q4, Q6) workloads.]


Slide 23

NU-MineBench versus Others

●  The number of data references per instruction is significantly higher.
●  L2 miss rates are considerably higher, due to the inherently streaming nature of data retrieval.
●  ALU operations per instruction are also high, indicating the extensive amount of computation performed in data mining applications.

Parameter†                SPECINT   SPECFP   MediaBench   TPC-H    MineBench
Data References           0.81      0.55     0.56         0.48     1.10
Bus Accesses              0.030     0.034    0.002        0.010    0.037
Instruction Decodes       1.17      1.02     1.28         1.08     0.78
Resource Related Stalls   0.66      1.04     0.14         0.69     0.43
CPI                       1.43      1.66     1.16         1.36     1.54
ALU Instructions          0.25      0.29     0.27         0.30     0.31
L1 Misses                 0.023     0.008    0.010        0.029    0.016
L2 Misses                 0.003     0.003    0.0004       0.002    0.006
Branches                  0.13      0.03     0.16         0.11     0.14
Branch Mispredictions     0.009     0.0008   0.016        0.0006   0.006

† The numbers shown are values per instruction.


Slide 24

Scalable Data Analytics Kernels

●  Parallel hierarchical clustering (the parallel pattern is sketched below)
   –  Speedup of 18,000 on 16k processors
   –  I/O becomes significant at large scale
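The slide does not give the algorithm's details; as a generic illustration of the pattern such kernels rely on, here is a hedged mpi4py sketch in which each rank owns a block of points, computes its share of the pairwise distances, and a reduction finds the globally closest pair (one merge step of agglomerative clustering). Toy sizes; at scale one would distribute the distance matrix rather than replicate the points as done here:

```python
# Parallel distance work plus a global reduction, the core pattern of
# large-scale clustering kernels.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)
n, d = 200, 8
local = rng.normal(size=(n, d))                 # this rank's block of points

allpts = np.vstack(comm.allgather(local))       # replicate points (toy sizes only)
offset = rank * n

# Local distance kernel for the rows owned by this rank
d2 = ((local[:, None, :] - allpts[None, :, :]) ** 2).sum(-1)
d2[np.arange(n), offset + np.arange(n)] = np.inf    # mask self-distances
local_min = float(d2.min())

global_min = comm.allreduce(local_min, op=MPI.MIN)  # reduction across ranks
if rank == 0:
    print("globally closest pair, squared distance:", global_min)
```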


Slide 25

Power-aware Data Analytics: Approximation is a TOP Option in analytics (Co-design)

[Figure: K-means clustering, error vs. energy. X-axis: bits used to represent the input data (4, 8, 12, 16, 20, 32); left axis: relative energy consumed (w.r.t. no compression); right axis: average clustering error (%) on a log scale, for data sets A and B.]

Power-aware analytics
●  Reduced-bit fixed-point representations (sketched below)
●  Pearson correlation
   –  2.5–3.5 times faster
   –  50–70% less energy
●  K-means
   –  ~44% less energy with an error of only 0.03% using a 12-bit representation

[Charts: energy consumption and speedup of the correlation kernels.]
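The error side of this trade-off can be prototyped without power instrumentation: quantize the inputs to b-bit fixed point, recluster, and compare against the full-precision assignment. A minimal sketch on synthetic data (scikit-learn's KMeans; energy is not modeled here, only clustering agreement):

```python
# Reduced-bit fixed-point inputs for k-means: quantize to b bits, recluster,
# and measure agreement with the full-precision labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def quantize(X, bits):
    lo, hi = X.min(), X.max()
    levels = 2 ** bits - 1
    q = np.round((X - lo) / (hi - lo) * levels)   # b-bit fixed-point codes
    return q / levels * (hi - lo) + lo            # back to floats for clustering

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 16))

ref = KMeans(n_clusters=32, n_init=5, random_state=0).fit_predict(X)
for bits in (4, 8, 12, 16, 20):
    lab = KMeans(n_clusters=32, n_init=5, random_state=0).fit_predict(quantize(X, bits))
    print(f"{bits:2d} bits: agreement with full precision (ARI) = "
          f"{adjusted_rand_score(ref, lab):.4f}")
```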


Slide 26

In-memory K-means Clustering

●  The data set fits into device memory → only a small amount of transfer over the PCI bus
●  Data set A: 5.7 million records; data set B: 8 million records; attributes = 60; clusters = 32
●  Energy savings: 44% for 12-bit quantization

[Charts: (left) energy consumed in data transfer (kJ) and (right) energy consumption of in-order GPU K-means (kJ, broken into quantization, kernel, and GPU-active energy), each for data sets A and B at 4, 8, 12, 16, 20, and 32 bits.]


Slide 27

Accelerators: Principal Component Analysis / Data Reduction
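A minimal sketch of PCA used as a data-reduction step (shown here with scikit-learn on the CPU; the slide's point is that the same dense linear algebra offloads well to accelerators):

```python
# PCA as data reduction: project high-dimensional records onto the leading
# principal components and keep only the reduced representation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 64))            # stand-in for simulation output

pca = PCA(n_components=8)
X_reduced = pca.fit_transform(X)              # 100k x 64  ->  100k x 8
print("retained variance fraction:", pca.explained_variance_ratio_.sum())
```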


Slide 28

Co-Design: Analytics Algorithms at Scale with Power Efficiency and New Memory Hierarchies

Large-scale data-driven science for complex, multivariate, spatio-temporal, non-linear, and dynamic systems rests on:

●  High Performance Computing – efficient analytics on future-generation exascale HPC platforms with complex memory hierarchies (kernels, features, dependencies)
●  Relationship Mining – discovery of complex dependence structures such as non-linear relationships
●  Predictive Modeling – modeling typical and extreme behavior from multivariate spatio-temporal data
●  Complex Networks – studying the collective behavior of interacting subsystems (relationships, community structure, function, dynamics)

The traditional model of developing algorithms on small data and then parallelizing them WILL NOT WORK in most cases because the characteristics of the algorithm and solution itself will depend on the data: You don’t want to “remove” the needle from the haystack, you want to find it!


Slide 29

Data Analytics – Broad Impact

Analytics task (with illustrative applications) and representative data analysis kernels:

●  Clustering (Chemistry, Climate, Combustion, Cosmology, Fusion, Materials science, Plasma): k-means, fuzzy k-means, BIRCH, MAFIA, DBSCAN, HOP, SNN, Dynamic Time Warping, Random Walk
●  Statistics (Biology, Climate, Combustion, Cosmology, Plasma, Renewable energy): extrema, mean, quantiles, standard deviation, copulas, value-based extraction, sampling
●  Feature selection (Biology, Climate, Fusion, Plasma): data slicing, LVF, SFG, SBG, ABB, RELIEF
●  Data transformations (Chemistry, Materials science, Plasma, Climate): Fourier transform, wavelet transform, PCA/SVD/EOF analysis, multidimensional scaling, differentiation, integration
●  Topology (Combustion, Earth science): Morse-Smale complexes, Reeb graphs, level set decomposition
●  Geometry (Earth science): fractal dimension, curvature, torsion
●  Classification (Biology, Climate, Cosmology, Fusion): ScalParC, decision trees, Naïve Bayes, SVMlight, RIPPER
●  Data compression (Chemistry, Climate, Combustion, Cosmology, Fusion, Plasma): PPM, LZW, JPEG, wavelet compression, PCA, fixed-point representation
●  Anomaly detection (Climate): entropy, LOF, GBAD
●  Similarity / distance (Climate, Earth science): cosine similarity, correlation (TAPER), mutual information, Student's t-test, Euclidean distance, Mahalanobis distance, Jaccard coefficient, Tanimoto coefficient, shortest paths
●  Halos and sub-halos (Cosmology): SUBFIND, AHF


Slide 30

Network Effect and Precise Interest Targeting



Slide 34

Thank You!

Alok Choudhary, Northwestern University

[email protected]