Buddy Scheduling in Distributed Scientific Applications on SMT Processors Nikola Vouk December 20, 2004 In Fulfillment of Requirement for Master’s of Science at NCSU


Buddy Scheduling in Distributed Scientific Applications on SMT Processors

Nikola Vouk

December 20, 2004

In Fulfillment of Requirement for Master’s of Science at NCSU


Introduction

• Background of Simultaneous Multi-Threading

• Goals

• Modified MPICH Communication Library Modifications

• Low-latency Synchronization Primitives

• Promotional Thread Scheduler

• Modified MPICH Evaluation

• Conclusions/Future Work


Background

• Simultaneous Multi-threading Architecture
– Modern super-scalar processor
– Multiple program contexts per core
– Can fetch from any context and execute from all contexts simultaneously


Problems

– Not enough TLP in programs to fully utilize modern processors
– Increase in TLP through multiple threads
– On average faster


Parallel Scientific Applications

– Designed for multi-processors
– MPI+OpenMP - CPU intensive
– Legacy applications assume sole use of whole processor and cache
– Compute/Communicate Model
– Many suffer performance drops due to hardware sharing


Solutions

• Take advantage of inherent concurrency in Compute/Communicate Model

• Provide private access to whole processor

• Minimize thread synchronization

• Transparent to application


Modified MPICH

• MPI Communications Library

• Channel P4 library - TCP/IP

• Serial library

• Application must stop computing to send/receive data synchronously or asynchronously


Modified MPICH

• Asynchronous Communication functions handled by helper thread

• New connections handled by helper thread

• No longer uses signals, nor forked processes

• Isend/Irecv handle the setup of each call, then continue without waiting for the call to complete (short circuit)

• Threads communicate through a Request Queue


Normal MPICH Design

CPU: Main Thread, Listener Thread

Main Thread

    void main( ) {
        MPI_Isend(args);
        ...
    }

    MPI_Isend(args) {
        ....
        return 0;
    }

Listener Thread

    void Listener( ) {
        while (!done) {
            waitForNetworkActivity();
            SignalMainThread();
        }
    }


Modified MPICH Design

CPU 1, CPU 2: Primary/Master Thread and Helper Thread run on separate CPUs

Primary/Master Thread

    void main() {
        MPI_Isend(args);
        ...
    }

    MPI_Isend(args) {
        return enqueue(args);
    }

Helper Thread (SMT Thread)

    void SMT_LOOP( ) {
        while (!done) {
            do {
                checkForRequest();
                checkForNetwork();
            } while (no_request);
            action = dequeueRequest();
            retvalue = action->function(args);
        }
    }

    MPI_Isend_SMT(args) {
        .....
        return 0;
    }
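The hand-off in the Modified MPICH design, where MPI_Isend only enqueues a request and the helper thread later runs it, can be sketched in C with POSIX threads. This is a minimal sketch under assumptions, not the thesis code: `request_t`, `queue_t`, `enqueue_request`, `helper_loop`, `fake_send`, and `run_queue_demo` are hypothetical names, the queue is a mutex-protected ring buffer with a condition variable instead of the polling loop shown on the slide, and no real MPI call is made.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16

/* One queued operation: a function pointer plus its argument. */
typedef struct {
    int (*function)(void *arg);
    void *arg;
} request_t;

/* Fixed-size ring buffer shared by the main thread and the helper. */
typedef struct {
    request_t slots[QUEUE_CAP];
    size_t head, tail, count;
    bool done;
    pthread_mutex_t lock;
    pthread_cond_t changed;
} queue_t;

void queue_init(queue_t *q) {
    assert(q != NULL);
    q->head = q->tail = q->count = 0;
    q->done = false;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->changed, NULL);
}

/* Main-thread side: queue the request and return at once (short circuit).
   Returns 0 on success, -1 if the queue is full. */
int enqueue_request(queue_t *q, int (*function)(void *), void *arg) {
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {
        q->slots[q->tail] = (request_t){ function, arg };
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        rc = 0;
        pthread_cond_signal(&q->changed);
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Helper-thread side: dequeue and run requests until shut down and drained. */
void *helper_loop(void *p) {
    queue_t *q = p;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0 && !q->done)
            pthread_cond_wait(&q->changed, &q->lock);
        if (q->count == 0) {               /* done and nothing left to run */
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        request_t r = q->slots[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        r.function(r.arg);                 /* the queued work runs here */
    }
}

/* Ask the helper to exit once the queue has drained. */
void queue_shutdown(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    q->done = true;
    pthread_cond_broadcast(&q->changed);
    pthread_mutex_unlock(&q->lock);
}

/* Stand-in for a real send: adds 1 to a counter owned by the caller. */
int fake_send(void *arg) {
    (*(int *)arg)++;
    return 0;
}

/* Demo: queue n requests, stop the helper, return how many were executed. */
int run_queue_demo(int n) {
    queue_t q;
    pthread_t helper;
    int sent = 0;
    queue_init(&q);
    pthread_create(&helper, NULL, helper_loop, &q);
    for (int i = 0; i < n; i++)
        while (enqueue_request(&q, fake_send, &sent) != 0)
            ;                              /* queue full: retry */
    queue_shutdown(&q);
    pthread_join(helper, NULL);
    pthread_cond_destroy(&q.changed);
    pthread_mutex_destroy(&q.lock);
    return sent;
}
```

Calling `run_queue_demo(10)` from a test program queues ten requests and returns the number the helper executed. The counter is read only after `pthread_join`, so it needs no extra locking.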


Synchronization Primitives


Evaluation Kernel 1 - Latency

MWAIT User: 41,000 ns

MWAIT Kernel: 2,900 ns

Condition Variable: 10,700 ns

PAUSE Spin-Lock: 500 ns

NOP Spin-Lock: 500 ns


Evaluation Kernel 2 - Notification Call Overhead

MWAIT User: 450 ns

MWAIT Kernel: 3,400 ns

Condition Variable: 5,200 ns

PAUSE Spin-Lock: 500 ns

NOP Spin-Lock: 500 ns


Evaluation Kernel 3 - Resource Impact

MWAIT User: 61,000 ns

MWAIT Kernel: 57,000 ns

Condition Variable: 63,000 ns

PAUSE Spin-Lock: 129,500 ns

NOP Spin-Lock: 131,000 ns
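In the kernels above, the spin-lock variants have the lowest latency in Kernel 1 and the largest figures in Kernel 3. For reference, a PAUSE-based spin-wait notification might look like the sketch below. It assumes GCC or Clang, since it uses the `__builtin_ia32_pause()` builtin on x86 and falls back to a plain busy loop elsewhere. The names `mailbox_t`, `cpu_relax`, `spin_waiter`, `spin_notify`, and `run_spin_demo` are hypothetical and not taken from the thesis.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* PAUSE hint on x86; a no-op on other targets (assumes GCC or Clang). */
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

typedef struct {
    atomic_int flag;   /* 0 = nothing posted, 1 = notification posted */
    int payload;       /* data handed from notifier to waiter */
    int received;      /* what the waiter observed */
} mailbox_t;

/* Waiter: spin until the flag is set, then read the payload. */
void *spin_waiter(void *p) {
    mailbox_t *m = p;
    assert(m != NULL);
    while (atomic_load_explicit(&m->flag, memory_order_acquire) == 0)
        cpu_relax();
    m->received = m->payload;
    return NULL;
}

/* Notifier: publish the payload, then set the flag with release ordering
   so the waiter is guaranteed to see the payload once it sees the flag. */
void spin_notify(mailbox_t *m, int value) {
    m->payload = value;
    atomic_store_explicit(&m->flag, 1, memory_order_release);
}

/* Demo: start a waiter, notify it, and return the value it received. */
int run_spin_demo(int value) {
    mailbox_t m = { .payload = 0, .received = -1 };
    pthread_t t;
    atomic_init(&m.flag, 0);
    pthread_create(&t, NULL, spin_waiter, &m);
    spin_notify(&m, value);
    pthread_join(t, NULL);
    return m.received;
}
```

The waiter never enters the kernel, which is why the wake-up is fast, but it keeps issuing instructions on its hardware context the whole time it waits.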


Promotional Buddy Scheduling

• Master and Buddy Tasks

• Master task is a regular task scheduled as normal

• Buddy task is a regular task when the master is not scheduled

• Whenever the Master Task runs, the Buddy Task is co-scheduled ahead of all other tasks

• Buddy remains scheduled until it blocks or the master gets unscheduled

Purpose:

• Provide low-latency IPC

• Minimize scheduling latency

• Isolate master and buddy on the processor
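The promotion rule described above can be illustrated with a small selection function for the sibling hardware context. This is a sketch under assumptions, not the kernel scheduler from the thesis: `task_t`, `pick_sibling_task`, and `run_buddy_demo` are hypothetical names, priorities are plain integers where a larger value runs first, and a task table stands in for the real run queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int id;
    int priority;    /* larger value = scheduled earlier (assumption) */
    int master_id;   /* id of this task's master, or -1 if not a buddy */
    bool runnable;   /* false when the task is blocked */
} task_t;

/* Choose the task for the sibling hardware context.
   running_id is the id of the task on the other context, or -1 if idle.
   Returns an index into tasks, or -1 if nothing can run. */
int pick_sibling_task(const task_t *tasks, size_t n, int running_id) {
    assert(tasks != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        const task_t *t = &tasks[i];
        if (!t->runnable || t->id == running_id)
            continue;
        /* Promotion: a runnable buddy of the running master wins outright. */
        if (running_id != -1 && t->master_id == running_id)
            return (int)i;
        /* Otherwise the buddy competes as a regular task, by priority. */
        if (best == -1 || t->priority > tasks[best].priority)
            best = (int)i;
    }
    return best;
}

/* Demo: task 1 is the master, task 2 its buddy, task 3 a background task
   with the highest priority. Returns the id picked for the sibling context. */
int run_buddy_demo(int running_id, bool buddy_runnable) {
    task_t tasks[] = {
        { .id = 1, .priority = 5, .master_id = -1, .runnable = true },
        { .id = 2, .priority = 1, .master_id = 1,  .runnable = buddy_runnable },
        { .id = 3, .priority = 9, .master_id = -1, .runnable = true },
    };
    int i = pick_sibling_task(tasks, 3, running_id);
    return i < 0 ? -1 : tasks[i].id;
}
```

With the master (task 1) running, the low-priority buddy is picked ahead of the high-priority background task. If the buddy is blocked, or the master is not the running task, selection falls back to plain priority order.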


Promotional Thread Scheduling Kernel

• 0 to 5 background tasks

• Measure latency of notification
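A user-space approximation of such a measurement is sketched below: it starts a chosen number of busy background threads, then times how long a polling waiter takes to observe a notification. Everything here is an assumption for illustration. The names `measure_latency_ns`, `background_task`, and `waiter_task` are hypothetical, the timer is C11 `timespec_get` rather than a cycle counter, and the threads are ordinary pthreads with no promotional scheduling, so the numbers will not match the thesis results.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define MAX_BACKGROUND 5

static atomic_int stop_background;  /* set to 1 to end the busy loops */
static atomic_int waiter_ready;     /* set to 1 once the waiter is polling */
static atomic_int notified;         /* set to 1 by the notifier */
static struct timespec t_wake;      /* written by the waiter on wake-up */

static long long to_ns(struct timespec ts) {
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Background load: burn CPU until asked to stop. */
static void *background_task(void *unused) {
    (void)unused;
    while (!atomic_load(&stop_background))
        ;
    return NULL;
}

/* Waiter: announce readiness, poll for the notification, timestamp it. */
static void *waiter_task(void *unused) {
    (void)unused;
    atomic_store(&waiter_ready, 1);
    while (!atomic_load(&notified))
        ;
    timespec_get(&t_wake, TIME_UTC);
    return NULL;
}

/* Returns the notification latency in ns with n_background busy threads. */
long long measure_latency_ns(int n_background) {
    assert(n_background >= 0 && n_background <= MAX_BACKGROUND);
    pthread_t bg[MAX_BACKGROUND], waiter;
    struct timespec t_notify;

    atomic_store(&stop_background, 0);
    atomic_store(&waiter_ready, 0);
    atomic_store(&notified, 0);
    for (int i = 0; i < n_background; i++)
        pthread_create(&bg[i], NULL, background_task, NULL);
    pthread_create(&waiter, NULL, waiter_task, NULL);

    while (!atomic_load(&waiter_ready))
        ;                                /* do not time thread start-up */
    timespec_get(&t_notify, TIME_UTC);   /* timestamp just before notifying */
    atomic_store(&notified, 1);
    pthread_join(waiter, NULL);

    atomic_store(&stop_background, 1);
    for (int i = 0; i < n_background; i++)
        pthread_join(bg[i], NULL);
    return to_ns(t_wake) - to_ns(t_notify);
}
```

Calling `measure_latency_ns(k)` for k from 0 to 5 gives one sample per load level. A real experiment would repeat each measurement many times and report a distribution rather than a single value.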


Modified MPICH Evaluation

Benchmarks

• sPHOT
– Very light network activity (Gather/Scatter 2x)

• sPPM
– Exclusively Isend/Irecv, few messages, large packets (30 MB total/thread)

• SMG2000
– Isend, Irecv - High message volume, small packets

• IRS
– Gather/Scatter/Bcast predominate

• Sweep3D
– Send/Recv - Medium volume of messages


Conclusions

• Benchmarks reflect strengths and weaknesses of threading model

• Promotional Scheduler Kernel Successful

• Synchronization Kernels reflected in code


Future Work

• Investigate gang scheduling to further evaluate promotional scheduling

• Modify MPICH for page-locked requests, to prevent page faults for MWAIT User

• Red-Black Kernels

• Allow sub-32k isends to complete normally