Zhongkai Chen, 3/25/2010. Jinglei Wang; Yibo Xue; Haixia Wang; Dongsheng Wang, Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China

NETWORK VICTIM CACHE: LEVERAGING NETWORK-ON-CHIP FOR MANAGING SHARED CACHES IN CHIP MULTIPROCESSORS Zhongkai Chen 3/25/2010


Page 1:

NETWORK VICTIM CACHE: LEVERAGING NETWORK-ON-CHIP FOR MANAGING SHARED CACHES IN CHIP MULTIPROCESSORS

Zhongkai Chen

3/25/2010

Page 2:

PAPER INFORMATION

Jinglei Wang; Yibo Xue; Haixia Wang; Dongsheng Wang

Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China

This paper appears in: Embedded and Multimedia Computing, 2009. EM-Com 2009. 4th International

Publication Date: 10-12 Dec. 2009

Page 3:

OUTLINE

Introduction
  - Problems
  - Network on Chip (NoC)
  - Victim Cache
Network Victim Cache Design
  - Baseline Architecture
  - NVC Scheme
Performance Evaluation

Page 4:

INTRODUCTION

The large working sets of commercial and scientific workloads favor a shared L2 cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests in Chip Multiprocessors (CMPs).

Two important hurdles restrict the scalability of these chip multiprocessors:
  - the on-chip memory cost of the directory
  - the long L1 miss latencies

Page 5:

INTRODUCTION Network on Chip (NoC)

In an NoC system, modules such as processor cores, memories and specialized IP blocks exchange data using a network as a "public transportation" sub-system for the information traffic.

An NoC is constructed from multiple point-to-point data links interconnected by routers, such that messages can be relayed from any source module to any destination module over several links, by making routing decisions at the routers.
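The slides do not name a routing algorithm; dimension-order (XY) routing is a common deadlock-free choice for 2D meshes, and the sketch below assumes it (the function name and coordinates are illustrative, not from the paper):

```python
# Hypothetical sketch of dimension-order (XY) routing on a 2D mesh NoC:
# a message moves fully along the X dimension first, then along Y.

def xy_route(src, dst):
    """Return the (x, y) router coordinates a message visits from src to dst."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                      # route along the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then along the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# A message from tile (0, 0) to tile (2, 1) crosses 3 links (4 routers).
print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Each router along the path makes a purely local decision from its own coordinates and the destination, which is what lets messages be relayed hop by hop without global state.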

Page 6:

INTRODUCTION Victim Cache

A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The victim cache lies between the main cache and its refill path.

The victim cache is usually fully associative, and is intended to reduce the number of conflict misses. Only a small fraction of the memory accesses of the program require high associativity. The victim cache exploits this property by providing high associativity to only these accesses.
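The conflict-miss argument above can be illustrated with a toy model (not from the paper): a direct-mapped cache whose victims fall into a small fully associative, LRU-replaced victim cache. All class and parameter names are hypothetical:

```python
from collections import OrderedDict

# Toy model of a direct-mapped cache backed by a small fully
# associative victim cache with LRU replacement.

class VictimCache:
    def __init__(self, l1_sets, vc_entries):
        self.l1 = [None] * l1_sets          # direct-mapped L1: one tag per set
        self.vc = OrderedDict()             # fully associative victim cache
        self.sets, self.cap = l1_sets, vc_entries

    def access(self, block):
        idx = block % self.sets
        if self.l1[idx] == block:
            return "L1 hit"
        if block in self.vc:                # conflict victim found in the VC:
            del self.vc[block]              # swap it back into the L1
            self._fill(idx, block)
            return "VC hit"
        self._fill(idx, block)              # true miss: fill from next level
        return "miss"

    def _fill(self, idx, block):
        victim = self.l1[idx]
        self.l1[idx] = block
        if victim is not None:              # evicted line goes to the VC
            self.vc[victim] = True
            if len(self.vc) > self.cap:
                self.vc.popitem(last=False)  # evict the LRU VC entry

# Blocks 0 and 4 conflict in a 4-set direct-mapped L1; the VC turns
# the repeated conflict misses into VC hits.
c = VictimCache(l1_sets=4, vc_entries=2)
print([c.access(b) for b in (0, 4, 0, 4)])
# ['miss', 'miss', 'VC hit', 'VC hit']
```

Only the few blocks that ping-pong in the same set need the VC's full associativity, which is why a handful of entries is enough.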

Page 7:

NETWORK VICTIM CACHE DESIGN Baseline Architecture

The tiled CMP is organized as a 2D array of replicated tiles, each with a core, a private L1 cache, an L2 cache slice, and a router that connects the tile to the network on chip.

The L2 cache slices form a logically shared L2. L1 cache misses are sent to the corresponding home tile, which looks up the directory information and performs the actions needed to ensure coherence.

L1 caches are kept coherent using a directory-based cache coherence protocol.
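The slides do not specify how a block address is mapped to its home tile; a common static policy, assumed here with illustrative parameters, interleaves blocks across tiles on low-order block-address bits:

```python
# Sketch of a static home-tile mapping often used in tiled CMPs.
# The block size and tile count below are illustrative, not the
# paper's configuration.

BLOCK_BITS = 6     # 64-byte cache blocks
NUM_TILES = 16     # e.g. a 4x4 tiled CMP

def home_tile(addr):
    block = addr >> BLOCK_BITS          # drop the block-offset bits
    return block % NUM_TILES            # interleave blocks across tiles

# Consecutive blocks map to consecutive home tiles, spreading
# directory lookups over the whole chip.
print([home_tile(a) for a in range(0, 4 * 64, 64)])  # [0, 1, 2, 3]
```

Because the mapping is a pure function of the address, any tile can compute the home of a missing block locally and send the request straight there.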


Page 8:

NETWORK VICTIM CACHE DESIGN Baseline Router Architecture

In the tiled CMP, the L1 and L2 caches are attached to the router through a Network Interface Component (NIC). Routers are connected to their neighbors through interfaces in the four directions to form a 2D network on chip.

Page 9:

NVC SCHEME The Network Victim Cache (NVC)

The only difference from the baseline router architecture is the network interface component, into which a Victim Cache (VC) and a Directory Cache (DC) are added.

Directory information is removed from the L2 caches and stored in Directory Caches (DC) in the network interface components to save memory space.

The saved directory space is used as Victim Caches (VC) to capture and store evictions from local L1 caches to reduce subsequent L1 miss latencies.

Page 10:

NVC SCHEME

At the home tile, the DC captures the L1 miss request in the network interface component and looks up the directory information of the requested block. It fetches the data block from the local L2 cache and sends the reply back to the requestor.

If an L1 cache line is evicted because of a conflict or capacity miss, the scheme attempts to keep a copy of the victim line in the VC to reduce subsequent access latency to the same line.

[Figure: an L1 miss request is captured by the DC and satisfied with a data block fetched from the local L2 cache; a line evicted from the L1 by a conflict or capacity miss is placed in the VC]

Page 11:

NVC SCHEME

All L1 misses first check the VC as they flow through the network interface component, in case it holds a valid copy of the block. On a VC miss, the request continues on to the home tile. On a VC hit, the block is invalidated in the VC and moved back into the L1 cache.

[Figure: an L1 miss request checks the VC in the network interface component; on a hit the block is invalidated and moved back into the L1, on a miss the request continues to the DC at the home tile]
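The flows on pages 10 and 11 can be sketched together as follows; the class and function names are illustrative, and the directory lookup and coherence actions at the home tile are elided:

```python
# Sketch of the L1 miss path through the NVC network interface
# component (NIC). Names are illustrative, not from the paper.

class NvcNic:
    def __init__(self):
        self.vc = {}                        # local victim cache: block -> data

    def capture_eviction(self, block, data):
        """Keep a copy of a conflict/capacity victim in the local VC."""
        self.vc[block] = data

    def l1_miss(self, block, forward_to_home):
        """Handle an L1 miss as it flows through the local NIC."""
        if block in self.vc:
            data = self.vc.pop(block)       # VC hit: invalidate the VC copy
            return ("VC hit", data)         # and move the block back into L1
        return ("VC miss", forward_to_home(block))  # continue to the home tile

def home_lookup(block):
    # At the home tile the DC would supply directory state and the data
    # would come from the local L2 slice; stubbed out here.
    return f"data({block:#x}) from home L2"

nic = NvcNic()
nic.capture_eviction(0x40, "victim data")
print(nic.l1_miss(0x40, home_lookup))       # ('VC hit', 'victim data')
print(nic.l1_miss(0x80, home_lookup))       # ('VC miss', 'data(0x80) from home L2')
```

Invalidating on a VC hit keeps a single valid copy of the line per tile, so the coherence protocol never has to track the VC as a separate sharer.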

Page 12:

PERFORMANCE EVALUATION Simulation Environment

The GEMS simulator is used to evaluate the performance of NVC against the baseline CMP.

The VC has the same number of entries as the L1 cache, and the DC has twice as many.

[Table: detailed system parameters]

8 workloads from the SPLASH-2 and PARSEC benchmarks, run on the Solaris 10 operating system

Page 13:

PERFORMANCE EVALUATION Impact on L1 cache miss latency

NVC decreases the L1 cache miss latencies by 21-49%, and by 31% on average. For the water benchmark, the small working set lets most L1 misses be satisfied in the local victim cache, reducing the L1 miss latencies by 49%.

[Figure: normalized L1 cache average miss latency]

Page 14:

PERFORMANCE EVALUATION Impact on execution time

NVC reduces the execution time of each benchmark by 10-34%. The execution times of lu and water are reduced by 34%. For the water benchmark, the small working set lets most L1 misses be satisfied in the local victim cache, which leads to better performance. NVC improves the performance of the CMP by 23% on average.

[Figure: execution time]

Page 15:

PERFORMANCE EVALUATION On-Chip Network Traffic Reduction

An additional benefit of NVC is the reduction of on-chip coherence traffic. NVC reduces the number of coherence messages of each benchmark by 16-48%, and by 28% on average. NVC eliminates some inter-tile messages when accesses can be resolved in local victim caches.

Page 16:

PERFORMANCE EVALUATION Scalability

Compared to the conventional shared L2 cache design, NVC increases on-chip storage by only 0.18%. As the number of cores increases, the directory storage saved from the L2 cache grows significantly, while the storage overhead of the proposed scheme grows far more slowly. NVC can therefore provide much better scalability than the conventional shared L2 cache design as the number of cores increases.
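The scalability argument can be made concrete with back-of-the-envelope arithmetic, assuming a full-map bit-vector directory (one sharer bit per core for every L2 line); the line counts below are illustrative, not the paper's configuration:

```python
# Back-of-the-envelope sketch of how full-map directory storage in the
# L2 grows with core count. Assumes one bit-vector entry (one bit per
# core) per L2 line; all numbers are illustrative.

def directory_kib(num_cores, l2_lines_per_tile):
    total_lines = num_cores * l2_lines_per_tile
    bits = total_lines * num_cores      # one sharer bit per core per line
    return bits / 8 / 1024              # total directory storage in KiB

for cores in (4, 16, 64):
    print(cores, "cores ->", directory_kib(cores, 4096), "KiB of directory")
```

With a fixed per-tile L2, full-map directory storage grows quadratically with the core count (8 KiB, 128 KiB, then 2048 KiB here), while the NVC's DC is sized relative to the much smaller L1, which is the scaling gap the slide describes.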

Page 17:

Thank you