SimMatrix: SIMulator for MAny-Task computing execution fabRIc at eXascale

Ke Wang
Data-Intensive Distributed Systems Laboratory
Computer Science Department
Illinois Institute of Technology

April 8th, 2013
ACM HPC Symposium

Outline

• Introduction & Motivation
• Long-Term Aims and Contributions
• SimMatrix Architecture
• Implementation
• Evaluation
• Related Work
• Conclusion & Future Work


Manycore Computing

[Figure: Number of Cores (axis 0 to 300) and Manufacturing Process (axis 0 to 100) by year, 2004 to 2018; legend: Number of Cores, Processing. Source: Pat Helland, Microsoft, The Irresistible Forces Meet the Movable Objects, November 9th, 2007]

• Today (2013): Multicore Computing
  – O(10) cores commodity architectures
  – O(100) cores proprietary architectures
  – O(1000) GPU hardware threads
• Near future (~2019): Manycore Computing
  – ~1000 cores/threads commodity architectures

Exascale Computing

[Figure: Top500 Performance Development, http://top500.org/static/lists/2011/11/TOP500_201111_Poster.pdf]

• Today (2013): 10 Petascale Computing
  – O(100K) nodes
  – O(1M) cores
• Near future (~2019): Exascale Computing
  – ~1M nodes (10X)
  – ~1B processor-cores/threads (1000X)

Major Challenges of Exascale Computing

• Memory and Storage
  – Minimizing data movement through the memory hierarchy (e.g. persistent storage, solid state memory, volatile memory, caches, and registers)
• Concurrency and Locality
  – Harnessing the many orders of magnitude of increased parallelism fueled by the many-core computing era (Accelerator, GPU, MIC)
• Resiliency
  – Making both the infrastructure (hardware) and applications fault tolerant in the face of a decreasing mean-time-to-failure (MTTF)
• Energy and Power
  – 20MW limitation

MTC: Many-Task Computing

[Figure: problem space by Number of Tasks (1, 1K, 1M) and Input Data Size (Low, Med, Hi). Regions: HPC (Heroic MPI Tasks); HTC/MTC (Many Loosely Coupled Tasks); MapReduce/MTC (Data Analysis, Mining); MTC (Big Data and Many Tasks)]

• Bridge the gap between HPC and HTC
• Applied in clusters, grids, and supercomputers
• Loosely coupled apps with HPC orientations
• Many activities coupled by file system ops
• Many resources over short time periods

MTC Middleware

• Falkon
  – Fast and Lightweight Task Execution Framework
  – http://datasys.cs.iit.edu/projects/Falkon/index.html
• Swift
  – Parallel Programming System
  – http://www.ci.uchicago.edu/swift/index.php


Long-Term Aims

• Address major exascale computing challenges:
  – Memory and Storage
  – Concurrency and Locality
  – Resiliency
• Explore scheduling architecture and techniques to enable MTC at exascale
• Analyze, design and implement a distributed data-aware execution fabric (MATRIX) supporting HPC/MTC workloads at exascale
• Integrate MATRIX with parallel programming systems (e.g. Swift, Charm++, MapReduce) and with the FusionFS distributed file system

This Work’s Contributions

– Architect, design and implement a job scheduling system simulator, SimMatrix, at the node/core level
– Evaluate performance across SimMatrix, SimGrid and GridSim; evaluation done up to millions of nodes, billions of cores, and tens of billions of tasks
– Support homogeneous/heterogeneous systems, various programming models (HPC/MTC), and scheduling strategies (centralized/distributed/hierarchical)


Overview: Job Scheduling Systems

• Efficiently manage the distributed computing power of workstations, servers, and supercomputers in order to maximize job throughput and system utilization
  – Load balancing is critical
• Different scheduling strategies
  – Centralized scheduling hinders scalability
  – Hierarchical scheduling has long job turnaround time
  – Distributed scheduling is a promising approach to exascale
• Work Stealing: a distributed scheduling strategy
  – Starved processors steal tasks from overloaded ones
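The work-stealing idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the SimMatrix code: the class name `WorkStealingSketch`, the choice of the most loaded neighbor as the victim, and the steal-half policy are all hypothetical choices made for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of work stealing between per-node task queues. */
class WorkStealingSketch {

    /** A simulated node that only tracks the ids of its waiting tasks. */
    static final class Node {
        final Deque<Integer> waiting = new ArrayDeque<>();

        int load() {
            return waiting.size();
        }
    }

    /**
     * The thief picks the most loaded neighbor and moves half of that
     * neighbor's waiting tasks into its own queue (assumed policy).
     * Returns the number of tasks stolen; 0 means the steal failed.
     */
    static int steal(Node thief, List<Node> neighbors) {
        Node victim = null;
        for (Node n : neighbors) {
            if (n != thief && (victim == null || n.load() > victim.load())) {
                victim = n;
            }
        }
        if (victim == null || victim.load() < 2) {
            return 0; // no neighbor has enough waiting tasks to share
        }
        int toSteal = victim.load() / 2;
        for (int i = 0; i < toSteal; i++) {
            // Take from the tail so the victim keeps its oldest tasks.
            thief.waiting.addLast(victim.waiting.pollLast());
        }
        return toSteal;
    }

    public static void main(String[] args) {
        Node idle = new Node();
        Node busy = new Node();
        for (int t = 0; t < 10; t++) {
            busy.waiting.addLast(t);
        }
        int stolen = steal(idle, Arrays.asList(idle, busy));
        System.out.println("stolen=" + stolen + ", idle=" + idle.load() + ", busy=" + busy.load());
    }
}
```

A failed steal (return value 0) is the case where a starved node would wait and poll again later.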


SimMatrix Architecture

[Figure 1: SimMatrix architectures. Left: the centralized architecture, where a client submits tasks to a single dispatcher (head node) that talks to all compute nodes. Right: the distributed topology, where a dispatcher sits in each compute node and a client submits tasks to an arbitrary node.]

Simulations

• Continuous time simulation
  – Abandoned the idea of creating a separate thread per simulated node: on our 48-core system with 256GB of memory, we were limited to 32K threads
• Discrete event simulation
  – A viable approach (today) to explore scheduling techniques at exascale (millions of nodes and billions of cores)
  – Created a unique object per simulated node, and converted any behavior (state change) to an event


At the Heart of SimMatrix: Global Event Queue

Figure 2: Event State Transition Diagram

• All events are inserted into the queue, sorted by occurrence time in ascending order
• Handle the first event, advance the simulation time, and update the event queue
• Implemented as a red-black tree based “TreeSet” in Java, which ensures Θ(log n) time for insert & remove
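The event-queue bullets can be sketched with `java.util.TreeSet` as follows. This is an illustration under stated assumptions, not the SimMatrix source: the `Event` fields and the sequence-number tie-breaker are hypothetical. The tie-breaker matters because a `TreeSet` treats elements that compare as equal as duplicates, so two events with the same timestamp would otherwise collapse into one.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical sketch of a global event queue backed by java.util.TreeSet. */
class EventQueueSketch {

    /** One state change: when it occurs, what kind it is, and on which node. */
    static final class Event {
        final double time;
        final String type;
        final int nodeId;
        final long seq; // insertion order, used only to break ties on time

        Event(double time, String type, int nodeId, long seq) {
            this.time = time;
            this.type = type;
            this.nodeId = nodeId;
            this.seq = seq;
        }
    }

    // Sorted by occurrence time ascending, then by insertion order.
    private final TreeSet<Event> queue = new TreeSet<>(
            Comparator.comparingDouble((Event e) -> e.time)
                      .thenComparingLong(e -> e.seq));
    private long nextSeq = 0;
    private double now = 0.0;

    /** Inserts an event that will occur at the given simulation time. */
    void insert(double time, String type, int nodeId) {
        queue.add(new Event(time, type, nodeId, nextSeq++));
    }

    /** Removes the earliest event and advances the simulation clock to it. */
    Event handleNext() {
        Event e = queue.pollFirst(); // null when the queue is empty
        if (e != null) {
            now = e.time;
        }
        return e;
    }

    double now() {
        return now;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }

    public static void main(String[] args) {
        EventQueueSketch q = new EventQueueSketch();
        q.insert(5.0, "TaskEnd", 1);
        q.insert(2.0, "Steal", 2);
        q.insert(2.0, "TaskEnd", 3);
        while (!q.isEmpty()) {
            Event e = q.handleNext();
            System.out.println("t=" + q.now() + " " + e.type + " on node " + e.nodeId);
        }
    }
}
```

A real handler would typically insert follow-up events while processing the current one; the loop in `main` only drains the queue to show the ordering.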

[Figure 2 labels: the Global Event Queue is sorted by time and receives Insert Event(time: t). Event types: TaskEnd, Steal, TaskRec, TaskDispStart, Log, Visual. Transition conditions: available cores; has tasks; no tasks; failed; dispatch tasks; first node needs tasks; first node needs more tasks; no waiting tasks; has waiting tasks and available cores.]

Simulator Features

• Node load information
  – Load: number of busy cores
  – Nested hash map groups nodes based on load, providing extremely fast lookup of the next available nodes
• Dynamic Task Submission
  – Aims to reduce the task waiting time and the memory footprint
• Dynamic Poll Interval
  – Exponential backoff to reduce the number of messages and increase the speed of simulation
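One way to picture the load-grouped lookup is the sketch below. It assumes the outer map is keyed by the number of busy cores and each bucket is a hash set of node ids; the class name `LoadIndexSketch` and its methods are hypothetical, and the real SimMatrix data structure may differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: nodes grouped into buckets by number of busy cores. */
class LoadIndexSketch {
    private final int coresPerNode;
    private final Map<Integer, Set<Integer>> nodesByLoad = new HashMap<>();
    private final Map<Integer, Integer> loadOfNode = new HashMap<>();

    /** All nodes start idle, so they all go into the load-0 bucket. */
    LoadIndexSketch(int numNodes, int coresPerNode) {
        this.coresPerNode = coresPerNode;
        for (int id = 0; id < numNodes; id++) {
            loadOfNode.put(id, 0);
            nodesByLoad.computeIfAbsent(0, k -> new HashSet<>()).add(id);
        }
    }

    /** Moves a known node from its old load bucket to the new one. */
    void setLoad(int nodeId, int busyCores) {
        int old = loadOfNode.put(nodeId, busyCores);
        Set<Integer> bucket = nodesByLoad.get(old);
        bucket.remove(nodeId);
        if (bucket.isEmpty()) {
            nodesByLoad.remove(old);
        }
        nodesByLoad.computeIfAbsent(busyCores, k -> new HashSet<>()).add(nodeId);
    }

    /**
     * Returns a node from the least-loaded non-empty bucket that still has a
     * free core, or -1 if every node is fully busy. The scan touches at most
     * coresPerNode buckets, independent of the number of nodes.
     */
    int nextAvailable() {
        for (int load = 0; load < coresPerNode; load++) {
            Set<Integer> bucket = nodesByLoad.get(load);
            if (bucket != null) {
                return bucket.iterator().next();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        LoadIndexSketch index = new LoadIndexSketch(2, 4);
        index.setLoad(0, 4); // node 0 is now fully busy
        System.out.println("next available node: " + index.nextAvailable());
    }
}
```

The point of the grouping is that finding an available node does not require scanning all simulated nodes, which matters at a million nodes.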


Implementation

• SimMatrix is developed in Java
  – Sun 64-bit JDK version 1.7.0_03
  – Code accessible at: http://datasys.cs.iit.edu/~kewang/software.html
• SimMatrix has no other dependencies


Experiment Environment

• Fusion system:
  – fusion.cs.iit.edu
  – 48 AMD Opteron cores at 800MHz (only one core is needed)
  – 256GB RAM
  – 64-bit Linux kernel 2.6.31.5
  – Sun 64-bit JDK version 1.7.0_23


Metrics

• Throughput
  – Number of tasks finished per second, calculated as total-number-of-tasks / simulation-time
• Efficiency
  – The ratio between the ideal simulation time of completing a given workload and the real simulation time. The ideal simulation time is the average task execution time multiplied by the number of tasks per core.
• CPU Time / Time per task
• Memory / Memory per task
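The throughput and efficiency definitions translate directly into arithmetic. The helper below is a minimal sketch; the class and method names are hypothetical, and the numbers in `main` are made up for illustration rather than taken from the evaluation.

```java
/** Hypothetical helpers for the throughput and efficiency metrics. */
class MetricsSketch {

    /** Tasks finished per second: total number of tasks / simulation time. */
    static double throughput(long totalTasks, double simulationTimeSec) {
        return totalTasks / simulationTimeSec;
    }

    /** Ideal time: average task execution time * number of tasks per core. */
    static double idealTime(double avgTaskTimeSec, long totalTasks, long totalCores) {
        return avgTaskTimeSec * ((double) totalTasks / totalCores);
    }

    /** Efficiency: ideal simulation time / real simulation time. */
    static double efficiency(double idealTimeSec, double realTimeSec) {
        return idealTimeSec / realTimeSec;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 1000 tasks averaging 5 s on 100 cores,
        // finishing after 62.5 s of simulated time.
        double ideal = idealTime(5.0, 1000, 100);
        System.out.println("ideal time = " + ideal + " s");
        System.out.println("efficiency = " + efficiency(ideal, 62.5));
        System.out.println("throughput = " + throughput(1000, 62.5) + " tasks/s");
    }
}
```

With these made-up inputs the ideal time is 50 s, so a 62.5 s run corresponds to 80% efficiency and 16 tasks per second.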


Workloads (Sleep tasks)

• Synthetic workloads:
  – Uniform distribution with an average task execution time of 5000s (AVE_5K); also a homogeneous workload with all tasks having 1 sec execution time (ALL_1)
• Realistic application workloads:
  – Obtained from real traces taken from running MTC applications on Blue Gene/P over a 17-month period
  – 34.8M tasks with a minimum runtime of 0 seconds, maximum runtime of 1469.62 seconds, average runtime of 95.20 seconds, and standard deviation of 188.08 seconds
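The two synthetic workloads could be generated along these lines. The slides give only the mean of the uniform distribution, so the range [0, 2 × mean) used here is an assumption, as are the class and method names.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical generators for AVE_5K-style and ALL_1-style sleep workloads. */
class WorkloadSketch {

    /**
     * Task lengths drawn uniformly from [0, 2 * meanSec), so the expected
     * length equals meanSec. The range is an assumption; the slides state
     * only the average.
     */
    static double[] uniform(int numTasks, double meanSec, long seed) {
        Random rng = new Random(seed);
        double[] lengths = new double[numTasks];
        for (int i = 0; i < numTasks; i++) {
            lengths[i] = rng.nextDouble() * 2.0 * meanSec;
        }
        return lengths;
    }

    /** Homogeneous workload: every task has the same length. */
    static double[] constant(int numTasks, double lengthSec) {
        double[] lengths = new double[numTasks];
        Arrays.fill(lengths, lengthSec);
        return lengths;
    }

    /** Arithmetic mean of the task lengths. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] ave5k = uniform(100000, 5000.0, 42L);
        double[] all1 = constant(100000, 1.0);
        System.out.println("AVE_5K sample mean: " + mean(ave5k));
        System.out.println("ALL_1 sample mean: " + mean(all1));
    }
}
```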


Validation

Validate SimMatrix against state-of-the-art MTC systems (e.g. Falkon, MATRIX)

1. The simulator makes simplifying assumptions, for example about the network.
2. It is also difficult to model communication congestion, resource sharing and its effects on performance, and the variability that comes with real systems.
3. We believe the relatively small differences (2.8% and 5.85%) demonstrate that SimMatrix is accurate enough to produce convincing results (at least at modest scales).

Resource Requirement up to Exascale

1M nodes, 1B cores and 10B tasks

• Memory
  – Centralized: 14.1GB
  – Distributed: 142.1GB
• CPU Time
  – Centralized: 17.4 hours
  – Distributed: 162.8 hours

Still relatively moderate

Centralized vs. Distributed Scheduling

1. AVE_5K: efficiency drops to 0.05% for centralized, but remains 90%+ for distributed at exascale
2. ALL_1: centralized saturates at 8 nodes with an upper-bound throughput of 1000 tasks/sec; distributed starts to saturate at 32K nodes, and finally achieves a throughput of 75M tasks/sec
3. Reason for saturation: in the final stage, work stealing requires too many messages as the system scales up, to the point where the number of messages saturates the network and/or the processing capacity
4. Solution: set an upper bound on the poll interval, and use sufficiently long tasks to amortize the cost of so many messages (AVE_12 tasks can achieve 90% efficiency at exascale with a throughput of 75M tasks/sec)
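The poll-interval remedy in point 4 can be sketched as a capped exponential backoff. The slides mention exponential backoff and an upper bound on the poll interval; the doubling factor, the reset on a successful steal, and all names below are assumptions made for this example.

```java
/** Hypothetical sketch of a poll interval with exponential backoff and a cap. */
class PollBackoffSketch {
    private final double initialSec;
    private final double maxSec;
    private double currentSec;

    PollBackoffSketch(double initialSec, double maxSec) {
        this.initialSec = initialSec;
        this.maxSec = maxSec;
        this.currentSec = initialSec;
    }

    /**
     * Called after a failed steal. Returns how long to wait before the next
     * attempt, then doubles the interval without exceeding the upper bound.
     */
    double onStealFailed() {
        double wait = currentSec;
        currentSec = Math.min(currentSec * 2.0, maxSec);
        return wait;
    }

    /** Called after a successful steal: start again from the initial interval. */
    void onStealSucceeded() {
        currentSec = initialSec;
    }

    public static void main(String[] args) {
        PollBackoffSketch backoff = new PollBackoffSketch(1.0, 8.0);
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println("attempt " + attempt + ": wait " + backoff.onStealFailed() + " s");
        }
    }
}
```

Each failed steal halves the message rate of an idle node until the cap is reached, and the cap keeps an idle node from waiting arbitrarily long once new work appears.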

SimMatrix vs. SimGrid and GridSim

1. Comparison: centralized scheduling
2. Scale: GridSim 256 nodes, SimGrid 65K nodes, SimMatrix 1M nodes
3. Time per task: GridSim increases, SimGrid stays constant, SimMatrix decreases and then stays almost constant
4. Memory per task: GridSim and SimGrid decrease and then stay constant; SimMatrix keeps decreasing
5. Conclusion: SimMatrix is more resource efficient at large scales

Application Domains of SimMatrix

• Data Centers: large-scale data centers (e.g. Google, Amazon) are composed of thousands of servers (10× to 100× more in the near future) geographically distributed around the world. Load balancing among all the servers with data-intensive workloads is very important, yet non-trivial. SimMatrix can be used to study different network topologies connecting the servers as well as data-aware scheduling, both of which could be applied to scheduling in data centers.
• Grid Environment: SimMatrix can be configured not only as a homogeneous scheduling system but also as a heterogeneous one. Different Grids could each configure SimMatrix and do scheduling individually, without interacting with each other.


• Workflow System: SimMatrix currently relies on high-level workflow systems (Swift, Charm++) to manage data flow and task dependencies, but it could be extended to simulate workflow systems with dependent tasks. We have already run SimMatrix with an MTC workload obtained from the Swift workflow system up to exascale, and achieved ~87% efficiency.


• Many-core Simulation: instead of configuring SimMatrix as an exascale system, we also configured it as a single many-core chip node with up to thousands of cores in a 2D/3D mesh topology. We applied work stealing at the core level within one many-core node, and found that, at the scale of a thousand cores, a 2D mesh topology needs at least 13 hops of neighbors, while a 3D mesh needs at least 5, in order to achieve high system efficiency.
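To get a feel for what a hop limit means in a mesh, the sketch below counts how many cores lie within a given number of hops of one core. It assumes that "h hops of neighbors" means every core within h mesh hops is a candidate steal victim, that hops are counted as Manhattan distance, and that the mesh has no wraparound links; these are interpretations for illustration, and the class name is hypothetical.

```java
/** Hypothetical sketch: count cores within a hop limit in a 2D or 3D mesh. */
class MeshNeighborsSketch {

    /**
     * Counts the cores within `hops` mesh hops of `origin`, excluding the
     * origin itself, in a mesh without wraparound links. `dims` and `origin`
     * hold one entry per dimension.
     */
    static long neighborsWithin(int[] dims, int[] origin, int hops) {
        long total = 1;
        for (int d : dims) {
            total *= d;
        }
        long count = 0;
        for (long index = 0; index < total; index++) {
            // Decode the linear index into per-dimension coordinates and
            // accumulate the Manhattan distance to the origin.
            long rest = index;
            int distance = 0;
            for (int k = 0; k < dims.length; k++) {
                int coord = (int) (rest % dims[k]);
                rest /= dims[k];
                distance += Math.abs(coord - origin[k]);
            }
            if (distance > 0 && distance <= hops) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Roughly a thousand cores arranged as a 2D and as a 3D mesh.
        long in2d = neighborsWithin(new int[] {32, 32}, new int[] {16, 16}, 13);
        long in3d = neighborsWithin(new int[] {10, 10, 10}, new int[] {5, 5, 5}, 5);
        System.out.println("2D mesh, 13 hops: " + in2d + " candidate cores");
        System.out.println("3D mesh, 5 hops: " + in3d + " candidate cores");
    }
}
```

For a 32×32 mesh, a core at (16, 16) has 364 cores within 13 hops. A 3D mesh reaches a comparable fraction of the chip in fewer hops because each core has more direct links.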


Related Work

• Real Job Scheduling Systems:
  – Condor (University of Wisconsin), Bradley et al., 2013
  – PBS (NASA Ames), Corbatto et al., 2013
  – SLURM (LLNL), Danny et al., 2013
  – Falkon (University of Chicago), Raicu et al., SC07
• Job Scheduling System Simulators:
  – SimJava (University of Edinburgh), Wheeler et al., 2004 (thread-based)
  – GridSim (University of Melbourne, Australia), Buyya et al., 2010 (thread-based)
  – SimGrid (INRIA), Lucas et al., 2013 (parallel DES)


Conclusion & Future Work

• Conclusion:
  – Exascale computing will bring several challenges, which need to be solved by new programming models
  – MTC could potentially address the exascale challenges; however, efficient job scheduling systems at extreme scales are needed
  – SimMatrix is lightweight enough to enable the study of different scheduling strategies and architectures at exascale
• Future Work:
  – Explore different network topologies (fat tree, 3D/4D, InfiniBand)
  – Workflow and task dependency simulation
  – Different workloads for both HPC and MTC simulation


More Information

• More information:
  – http://datasys.cs.iit.edu/~kewang/
  – http://datasys.cs.iit.edu/projects/SimMatrix/
• Contact:
  – [email protected]
• Questions?