An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Page 1: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Jie Cai and Peter Strazdins

Research School of Computer Science

The Australian National University

ICPP 2012

Pittsburgh, PA, USA


Page 2: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Outline

• Introduction
• Background
• Related Work on Existing Prefetch Techniques
• Stride-augmented Run-length Encoding Method (sRLE)
• Dynamic Region-based Prefetch Technique
• Evaluation Results
• Conclusion

ICPP 2012 @ Pittsburgh, PA

Page 3: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Introduction

• Software Distributed Shared Memory (sDSM) systems provide programming environments that enable the use of shared-memory programming models such as OpenMP on clusters.
• sDSM systems inherit the good programmability of shared-memory programming models.
  • They remove explicit control of data exchange from the programmer.
• However, sDSM suffers from significant system overheads.
  • Prefetch techniques, which fit well with lazy release consistency (LRC), can be used to improve performance.
• Prefetch techniques for sDSM face two major challenges:
  • Applications' dynamic memory access patterns
  • Page misses caused by non-global synchronization operations

Page 4: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Introduction (Cont.)

• In this talk, we address the challenges of prefetch techniques for sDSM systems:
  • Reconstruct page miss records using the stride-augmented run-length encoding (sRLE) method.
  • Design a dynamic region-based prefetch (DReP) technique that uses the sRLE-encoded records to predict and issue prefetches.
• DReP is implemented in the only commercialized sDSM system, Intel Cluster OpenMP (CLOMP).
• DReP and sRLE with CLOMP are evaluated using the NPB-OMP benchmark suite, LINPACK, and a memory consistency cost micro-benchmark (MCBENCH).

Page 5: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Background (1)

[Figure: fork-join execution timeline. Labels: Single Thread Sequential Regions #1 to #3; Parallel Regions #1 to #3, each run by Thread0 to Thread3; "Start parallel region, implicit barrier"; "Explicit barrier"; "End parallel region, implicit barrier".]

• Fork-join type shared-memory programming models:
  • Regions are separated by global synchronizations, e.g. implicit and explicit barriers.
  • Region-executions are the multiple executions of the same region when that region is enclosed in a loop.
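To make the region and region-execution vocabulary concrete, here is a minimal Python sketch. The class name `RegionLog`, its methods, and the region label are hypothetical; the sketch only illustrates how each execution of a looped region could keep its own page-fault record, and is not how CLOMP tracks regions.

```python
from collections import defaultdict


class RegionLog:
    """Keeps one page-fault list per region-execution, keyed by region ID."""

    def __init__(self):
        self._runs = defaultdict(list)  # region ID -> list of fault lists

    def record_execution(self, region_id, faulted_pages):
        # Each pass through the same region adds one more region-execution.
        self._runs[region_id].append(list(faulted_pages))

    def execution_count(self, region_id):
        return len(self._runs.get(region_id, []))

    def last_two(self, region_id):
        """Return (previous, before_previous) fault lists, or None if fewer than two exist."""
        runs = self._runs.get(region_id, [])
        return (runs[-1], runs[-2]) if len(runs) >= 2 else None


log = RegionLog()
for step in range(3):  # the region sits inside a loop: three region-executions
    log.record_execution("parallel_region_1", [100 + step, 200 + step])
```

Keeping the two most recent records per region is a design choice made here because the region-based techniques discussed in this talk compare consecutive region-executions.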

Page 6: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Background (2)

• sDSM memory consistency model
  • Each process has a local view of the shared pages.
  • The shared pages are kept consistent via mprotect (please refer to the page state machine for details).

[Figure: sDSM architecture. Labels: Shared Memory Programming Model (OpenMP); Virtual Shared Memory; Local Memory ... Local Memory; MPI. Global memory is managed in blocks (pages).]

Page 7: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Background (3)

• sDSM memory consistency costs
  • The major sDSM system overhead is the memory consistency cost.
  • MCBENCH is an in-house micro-benchmark that measures this cost for different OpenMP implementations, including cluster-enabled OpenMP implementations.

Page 8: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Related Work

• Dynamic Aggregation (C. Amza et al. 1997)
  • Simple assumption of temporal paging behavior before and after a barrier.
• B+ and Adaptive++ (R. Bianchini et al. 1996 & 1998)
  • B+: simple assumption of temporal paging behavior before and after a barrier.
  • Adaptive++: assumes that page misses that occurred before a barrier, or even before the previous barrier, will occur again after the barrier.
• Third order differential finite context method (TODFCM) (E. Speight et al. 2002)
  • Generic technique that prefetches a page when the three previous consecutive misses have happened before.

Page 9: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Related Work (Cont.)

• Temporal region-based prefetch (TReP) technique (J. Cai et al. 2010)
  • Deployed the idea of regions and region-executions.
  • Assumes that page misses in the previous region-execution will occur in the current region-execution.
  • Considered temporal paging behaviour between consecutive region-executions.
• Hybrid region-based prefetch (HReP) technique (J. Cai et al. 2010)
  • Deployed the idea of regions and region-executions.
  • Combined TReP and Adaptive++.
  • Addressed temporal paging behaviour between consecutive region-executions and spatial paging behaviour within a region-execution.

Page 10: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

sRLE Method -- Observation


• LINPACK dynamic page access pattern with 4 processes

• Corresponding dynamic page miss pattern

Page 11: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

sRLE Method

• Step (a): group sub-lists with a common stride.
• Step (b): encode the sub-lists into the first-level format:
  • (start page, stride, run length)
• Step (c): group consecutive encoded sub-lists with a common stride into the second-level encoding format:
  • (first-level encoded record, stride, run length)
• An ordinary page fault list can be converted to 2D fault regions with sRLE.
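Steps (a) to (c) can be sketched in Python as below. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the grouping is a simple greedy left-to-right scan, a lone page is encoded with stride 0 and run length 1, and the second level is assumed to merge first-level records that share stride and run length and whose start pages are evenly spaced.

```python
def encode_level1(pages):
    """Steps (a)+(b): greedily split a page-ID list into (start, stride, run_length)."""
    runs, i, n = [], 0, len(pages)
    while i < n:
        start = pages[i]
        if i + 1 == n:
            runs.append((start, 0, 1))  # lone page: stride 0 by convention here
            break
        stride = pages[i + 1] - start
        length = 2
        while i + length < n and pages[i + length] - pages[i + length - 1] == stride:
            length += 1
        runs.append((start, stride, length))
        i += length
    return runs


def encode_level2(runs):
    """Step (c): merge consecutive level-1 runs into (level1_record, stride, run_length)."""
    out, i, n = [], 0, len(runs)
    while i < n:
        first = runs[i]
        # Only runs with the same first-level stride and run length can be merged.
        if i + 1 == n or runs[i + 1][1:] != first[1:]:
            out.append((first, 0, 1))
            i += 1
            continue
        stride2 = runs[i + 1][0] - first[0]
        length = 2
        while (i + length < n and runs[i + length][1:] == first[1:]
               and runs[i + length][0] - runs[i + length - 1][0] == stride2):
            length += 1
        out.append((first, stride2, length))
        i += length
    return out


def decode(records):
    """Expand second-level records back into the page-ID list (a 2D walk)."""
    pages = []
    for (start, s1, n1), s2, n2 in records:
        for row in range(n2):
            for col in range(n1):
                pages.append(start + row * s2 + col * s1)
    return pages


# Three rows of four consecutive pages, rows 16 pages apart: a 2D fault region.
faults = [100, 101, 102, 103, 116, 117, 118, 119, 132, 133, 134, 135]
level1 = encode_level1(faults)   # [(100, 1, 4), (116, 1, 4), (132, 1, 4)]
level2 = encode_level2(level1)   # [((100, 1, 4), 16, 3)]
```

In this example twelve page IDs collapse into a single second-level record, which is what lets a fault list be treated as a 2D region.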

Page 12: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

DReP Technique Designs

• All page fault records (per region) have been encoded twice with the sRLE method.
• Each record contains a list of second-level encoded entries.

Page 13: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

DReP Technique Designs (cont.)

• At the beginning of each region-execution, DReP predicts and prefetches pages.

Flowchart, starting at the beginning of a region-execution:

• Previously executed twice?
  • No: no prefetch is issued.
  • Yes: compare every entry between the two records. Issue prefetches ONLY for the following three cases.
• Case 1: prefetch the entry if it is common to both lists.
• Case 2: when strides and run lengths are common to both lists, predict a start page, and prefetch with the common strides and run length.
  • pred.l1_en_col.start_page = p_list.l1_en_col.start_page + (p_list.l1_en_col.start_page − bp_list.l1_en_col.start_page)
• Case 3: when strides are common and run lengths are highly similar in both lists, predict a start page and the run lengths, then prefetch with the common strides.
  • pred.l1_en_col.run_len = p_list.l1_en_col.run_len + (p_list.l1_en_col.run_len − bp_list.l1_en_col.run_len)
  • pred.run_len = p_list.run_len + (p_list.run_len − bp_list.run_len)
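The three cases can be sketched in Python as follows. Several details are assumptions rather than facts from the talk: records are tuples of the form ((start_page, l1_stride, l1_run_len), stride, run_len), `p_list` is taken to be the previous region-execution's record and `bp_list` the one before it, entries are paired by position, and "highly similar" is modelled with a hypothetical `tolerance` parameter.

```python
def predict(p_list, bp_list, tolerance=2):
    """Return the second-level entries to prefetch for the next region-execution."""
    if p_list is None or bp_list is None:
        return []  # region not yet executed twice: no prefetch issued
    predictions = []
    for p, bp in zip(p_list, bp_list):  # pairing by position is an assumption
        (p_start, p_s1, p_n1), p_s2, p_n2 = p
        (bp_start, bp_s1, bp_n1), bp_s2, bp_n2 = bp
        if p == bp:
            # Case 1: the entry is common to both lists.
            predictions.append(p)
        elif (p_s1, p_s2) == (bp_s1, bp_s2):
            # Linear extrapolation of the start page, as in the slide's formula.
            start = p_start + (p_start - bp_start)
            if (p_n1, p_n2) == (bp_n1, bp_n2):
                # Case 2: common strides and run lengths.
                predictions.append(((start, p_s1, p_n1), p_s2, p_n2))
            elif abs(p_n1 - bp_n1) <= tolerance and abs(p_n2 - bp_n2) <= tolerance:
                # Case 3: common strides, similar run lengths; extrapolate both.
                n1 = p_n1 + (p_n1 - bp_n1)
                n2 = p_n2 + (p_n2 - bp_n2)
                if n1 > 0 and n2 > 0:
                    predictions.append(((start, p_s1, n1), p_s2, n2))
    return predictions


# A region whose 2D fault block shrinks and shifts between executions (case 3).
before_previous = [((100, 1, 8), 16, 8)]
previous = [((116, 1, 7), 16, 7)]
next_prefetch = predict(previous, before_previous)  # [((132, 1, 6), 16, 6)]
```

Entries that match none of the three cases are skipped, which mirrors the "ONLY" in the flowchart and is one way such a scheme can trade coverage for efficiency.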

Page 14: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

DReP Implementation

• DReP has been implemented in the Intel Cluster OpenMP runtime.
  • New user-facing region notification interface:
    • KMP_USER_NOTIFY_NEW_REGION(1): 1 indicates that this is a parallel region
    • KMP_USER_NOTIFY_NEW_REGION(0): 0 indicates that this is a sequential region
  • Flush filtering removes duplicated records, which solves the problem that a single page can be missed multiple times within one region-execution.
  • The message header of the communication layer is enlarged to accommodate 128 page IDs, making better use of network bandwidth.
  • Each process first communicates with its right neighbor, which avoids network congestion.
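The three communication-side ideas on this slide can be illustrated with a small Python sketch. All function names are hypothetical, and the ring order after the first (right) neighbor is an assumption; the talk only states that each process starts with its right neighbor.

```python
def flush_filter(fault_record):
    """Drop repeated page IDs from one region-execution's record, keeping first-seen order."""
    seen, filtered = set(), []
    for page in fault_record:
        if page not in seen:
            seen.add(page)
            filtered.append(page)
    return filtered


def batch_page_ids(page_ids, max_ids=128):
    """Split prefetch requests into messages carrying at most max_ids page IDs each."""
    return [page_ids[i:i + max_ids] for i in range(0, len(page_ids), max_ids)]


def neighbour_order(rank, nprocs):
    """Visit peers starting with the right neighbor, then continue around the ring."""
    return [(rank + k) % nprocs for k in range(1, nprocs)]


filtered = flush_filter([5, 3, 5, 7, 3])       # [5, 3, 7]
batches = batch_page_ids(list(range(300)))     # batches of 128, 128 and 44 IDs
order = neighbour_order(2, 4)                  # [3, 0, 1]
```

Starting each process at a different peer means that no two processes target the same node in the same step, which is the usual motivation for this kind of staggered schedule.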

Page 15: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

DReP Implementation (Cont.)

• DReP has been implemented in the Intel Cluster OpenMP runtime.
  • The page state machine has been updated with two newly introduced page states:
    • Prefetched_diff
    • Prefetched_page

Page 16: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Evaluation

• Experimental setup
  • Software and benchmarks
    • NPB-OMP suite
    • LINPACK OpenMP implementation (n=8196, nb=64)
    • MCBENCH (a = 4MB, c = 4B and 4KB)
  • Hardware platform
    • 8-node Intel cluster
    • Each node consists of 2 Intel E5472 3.0GHz CPUs
    • 16GB memory
    • Gigabit Ethernet
    • DDR InfiniBand

Page 17: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Efficiency and Coverage

• Nf: total number of page faults
• Np: number of prefetches
• Nu: number of useful prefetches, Nu = Nf*C
• C = Nu/Nf, coverage
• E = Nu/Np, efficiency
• Bold font represents best results
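As a quick worked example of the two metrics, the Python sketch below computes coverage and efficiency from raw counts. The function name and the sample numbers are made up for illustration and are not results from the talk.

```python
def prefetch_metrics(n_faults, n_prefetches, n_useful):
    """Return (coverage, efficiency) with C = Nu/Nf and E = Nu/Np."""
    coverage = n_useful / n_faults if n_faults else 0.0
    efficiency = n_useful / n_prefetches if n_prefetches else 0.0
    return coverage, efficiency


# Hypothetical run: 200 page faults, 100 prefetches issued, 80 of them useful.
coverage, efficiency = prefetch_metrics(n_faults=200, n_prefetches=100, n_useful=80)
```

Here coverage is 0.4 (40% of the faults were anticipated) and efficiency is 0.8 (80% of the issued prefetches were useful); the two can move independently, which is why both are reported.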

Page 18: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Efficiency and Coverage (Cont.)

• Bold font represents best results
• MCBENCH: DReP vs TReP and HReP
  • c = 4B: extreme false sharing
  • c = 4KB: no false sharing

Page 19: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Memory Consistency Cost

• Measured using MCBENCH, a = 4MB, c = 4B and 4KB
  • c = 4B: extreme false sharing (cost reduced by ~86%)
  • c = 4KB: no false sharing

Page 20: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Memory Consistency Cost (Cont.)

• LINPACK OpenMP implementation with n=8196 and nb=64
  • The DReP result is reported as a reduction rate relative to the original CLOMP implementation, i.e. (Orig-DReP)/Orig.
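The reduction rate is simple arithmetic; a minimal Python sketch with made-up timings (not measurements from the talk) shows how it is read. The function name is hypothetical.

```python
def reduction_rate(orig_cost, drep_cost):
    """(Orig - DReP) / Orig: the fraction of the original cost removed by DReP."""
    return (orig_cost - drep_cost) / orig_cost


# Hypothetical costs in seconds: 10.0 s without DReP, 5.5 s with DReP.
rate = reduction_rate(10.0, 5.5)
```

With these sample numbers the rate is 0.45, i.e. a 45% reduction; a negative value would mean that prefetching made the cost worse.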

Page 21: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Memory Consistency Cost (Cont.)

• NPB-OMP
  • Rates are reported as an average over classes A to C.

Page 22: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Overhead Analysis of DReP

• NPB-OMP IS.C
  • Tsegv: total memory consistency cost in seconds for original CLOMP and DReP-enabled CLOMP.
  • TMK Comm (% of Tsegv): communication time spent in the DSM layer of CLOMP (TMK)
  • TMK local (% of Tsegv): the local software overhead of the TMK layer
  • DReP Comm (% of Tsegv): communication cost of data prefetching
  • DReP local (% of Tsegv): the local software cost introduced by DReP
  • Communication costs are further broken down into the costs of transferring diffs and pages.

Page 23: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Conclusions

• With the assistance of sRLE, DReP accurately analyses the paging behaviour of applications exhibiting both static and dynamic memory access patterns, such as NPB-OMP and LINPACK.
• Averaged over NPB and LINPACK, DReP improves efficiency by 34% and coverage by 47% over existing prefetch techniques. In detail:
  • 55% and 5% better efficiency compared to Adaptive++ and TODFCM; 55% and 44% better coverage compared to Adaptive++ and TODFCM.
  • 47% and 30% better efficiency compared to TReP and HReP; 56% and 34% better coverage compared to TReP and HReP.
• DReP reduces the memory consistency cost by 86% for the false sharing scenario, and by ~45% and ~38% for LINPACK and NPB on GigE and IB respectively.
• A detailed breakdown analysis showed that DReP introduces ~2% overhead.

Page 24: An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory

Acknowledgement

• Australian Research Council Grant LP0669726
• ANU CECS Faculty Research Grant
• Intel Corp.
• Sun Microsystems
• NCI National Facility / ANU Supercomputer Facility