Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

http://iacoma.cs.uiuc.edu

Karin Strauss, AMD Advanced Architecture and Technology Lab

Xiaowei Shen, IBM Research

Josep Torrellas, University of Illinois at Urbana-Champaign



Page 2: Motivation

• CMPs are ubiquitous

• Shared memory + caches = cache coherence

• Traditional cache coherence solutions

  • shared bus-based: electrical, layout issues

  • directory-based: indirection, storage

Page 3: Embedded-ring cache coherence [ISCA 2006]

• Novel snoopy cache coherence for mid-sized machines

  • logical ring is embedded in network

  • control messages use ring

  • data messages use any path

Simple and inexpensive to implement

Snoop requests can have long latencies

Page 4: Contributions

• Propose invariant for transaction serialization

• Propose performance enhancements

  • Uncorq: unconstrained snoop request delivery

    • reduces cache-to-cache transfer latency

  • Simple hardware data prefetching technique

    • reduces memory-to-cache transfer latency

Page 5: Embedded-ring terminology

• Snoopy, invalidate protocol

• Single supplier protocol

• Types of messages:

  • control messages: snoop request, snoop response, snoop request + response

  • data

[Figure: logical ring with nodes A and B. Legend: request, response, request + response, positive response (+), snoop op. outcome, positive snoop op. outcome, data]

Page 6: Ordering invariant

Page 7: Transaction serialization

[Figure: timeline of a transaction under an MSI protocol. Labels: inv, ack, read, data; cache states S and I; nodes A and B; old value, new value; time axis]

Page 8: Serialization enforcement with embedded-ring

• Logical unidirectional ring provides partial ordering

• Distributed algorithm establishes global order for same-address transactions

• On simultaneous transactions to same address:

  • one is declared the “winner” (first to reach supplier)

  • others have to retry

[Figure: node A on the ring with request and response messages]

Page 9: How to serialize transactions

[Figure: ring with nodes A, B and supplier S, showing A’s request and response and B’s request and response; positive responses marked +]

No clear “first” transaction

B’s request reaches S first

A receives B’s positive response before its own

A retries: B → A

Ring guarantees responses are forwarded in the order S performed snoop operations

Page 10: Enforcing transaction serialization

• Node whose request arrives at supplier node first is the “winner”

• What we need to enforce transaction serialization:

  • loser node sees other node’s positive response before its own

Ordering Invariant: the order in which responses travel the ring after leaving the supplier must be the same as the order in which the supplier processed their corresponding requests.

[Figure: ring with supplier S. Legend: request, response, positive response (+)]
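The winner/loser rule and the Ordering Invariant above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' implementation: the function names are hypothetical, and it assumes that only the first request processed by the supplier receives a positive response and that every requester observes the responses in the order they left the supplier.

```python
def invariant_holds(processed_order, response_order):
    """Ordering Invariant: responses leave the supplier and travel the ring
    in the same order in which the supplier processed the requests."""
    return list(processed_order) == list(response_order)


def classify_requesters(arrival_order):
    """Return (winner, losers) for simultaneous requests to one address.

    arrival_order: requester names, in the order their requests reached
    the supplier. Assumption: only the first request gets a positive
    response, and responses are observed in supplier order.
    """
    responses = [(node, i == 0) for i, node in enumerate(arrival_order)]
    winner, losers = None, []
    for node in arrival_order:
        for owner, positive in responses:
            if owner == node:
                # The node saw its own response first.
                if positive:
                    winner = node
                break
            if positive:
                # Another node's positive response arrived first: retry.
                losers.append(node)
                break
    return winner, losers


# B's request reaches the supplier before A's, as in the example slide.
winner, losers = classify_requesters(["B", "A"])
print(winner, losers)
```

In this model A is classified as the loser because B's positive response precedes A's own response, which matches the retry behavior described on the previous slide.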

Page 11: Uncorq: Unconstrained snoop request delivery

Page 12: Uncorq idea

Idea: requests do not have to follow the ring (but responses do)

[Figure: Baseline vs. Uncorq message paths. Legend: request, response]

Page 13: Benefit of Uncorq

Reduced cache-to-cache transfer latency

[Figure: Baseline vs. Uncorq timelines (request, snoop, data) along a time axis, marking when the request reaches the supplier node and the resulting savings]

Page 14: Implications of Uncorq

• Uncorq no longer restricts order of requests

• Nodes may receive and process requests in any order

• Responses may also get reordered

Problem: distributed algorithm relies on the fact that response order reflects order of requests at supplier

Page 15: Example: incorrect transaction ordering

[Figure: ring with nodes A, B and supplier S; positive responses marked +. Legend: request, response]

Ordering invariant

A node cannot forward any other response if it has an outstanding positive snoop outcome

Page 16: How Uncorq stalls responses

• Local transaction table (per-node structure)

  • records messages that node is currently processing

[Figure: local transaction table with entries per address (A, B, …, C) holding requests and responses; positive outcomes marked +. Legend: request, response]
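A minimal sketch of how a per-node table could stall responses is shown below. The class and method names are hypothetical, and the model assumes stalling is tracked per address: while the node holds a positive snoop outcome that has not yet been merged into its matching response, other responses for that address wait.

```python
class LocalTransactionTable:
    """Hypothetical per-node table; a sketch, not the paper's hardware."""

    def __init__(self):
        self.pending_positive = {}  # addr -> requester owed a positive outcome
        self.stalled = {}           # addr -> responses held back, in arrival order

    def snoop(self, addr, requester, positive):
        """Record the local snoop outcome for a request (which may have
        arrived off the ring, ahead of its response)."""
        if positive:
            self.pending_positive[addr] = requester

    def on_response(self, addr, requester, positive=False):
        """A response arrives on the ring. Return the list of
        (addr, requester, positive) responses forwarded now."""
        owner = self.pending_positive.get(addr)
        if owner is None:
            return [(addr, requester, positive)]
        if owner == requester:
            # Merge the positive outcome, forward, then release stalled ones.
            del self.pending_positive[addr]
            released = self.stalled.pop(addr, [])
            return [(addr, requester, True)] + released
        # Outstanding positive outcome for another request: stall.
        self.stalled.setdefault(addr, []).append((addr, requester, positive))
        return []


table = LocalTransactionTable()
table.snoop("C", "B", positive=True)   # node can supply the line to B
print(table.on_response("C", "A"))     # A's response is stalled
print(table.on_response("C", "B"))     # B's positive response goes first
```

With this rule, B's positive response always leaves the node before A's response, so downstream nodes see the responses in the order the supplier processed the requests.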

Page 17: Optimization: prefetching from memory

• Goal: reduce latency of memory-to-cache transfers

• Predict when no node will supply data

• Access memory in parallel with ring snoop

[Figure: requester R and memory. Unoptimized: ring snoop (1), then memory access (2). Optimized: ring snoop and memory access both start at (1)]
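The effect of overlapping the memory access with the ring snoop can be illustrated with a simple latency model. The function name and the cycle counts below are made up for illustration, and the model assumes a correct prediction and ignores any misprediction cost.

```python
def memory_to_cache_latency(ring_snoop, memory_access, prefetch):
    """Latency of a memory-to-cache transfer in an idealized model.

    Without prefetching, memory is accessed only after the ring snoop
    finds no supplier, so the two latencies add up. With prefetching,
    both start together, so the slower one dominates.
    """
    if prefetch:
        return max(ring_snoop, memory_access)
    return ring_snoop + memory_access


# Hypothetical latencies, in cycles.
print(memory_to_cache_latency(300, 200, prefetch=False))  # 500
print(memory_to_cache_latency(300, 200, prefetch=True))   # 300
```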

Page 18: Evaluation

Page 19: Experimental setup

• 64 nodes in a single CMP

• SESC simulator (sesc.sourceforge.net)

• SPLASH-2, SPECjbb and SPECweb workloads

• Interconnection network: 2D torus with embedded-ring
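To give a feel for why letting requests leave the ring can help, the sketch below embeds a logical ring in a torus and compares hop counts. The 8x8 layout and the snake-shaped embedding are assumptions made for illustration; the slides state only that the 64 nodes are connected by a 2D torus with an embedded ring.

```python
def snake_ring(rows, cols):
    """Embed a ring in a rows x cols torus: even rows run left to right,
    odd rows right to left. With an even number of rows, the last node
    connects back to the first through the torus wraparound link."""
    ring = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        ring.extend((r, c) for c in cs)
    return ring


def torus_hops(a, b, rows, cols):
    """Shortest-path hop count between two nodes of a 2D torus."""
    dr, dc = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)


def ring_hops(ring, src, dst):
    """Hops from src to dst along the unidirectional ring."""
    return (ring.index(dst) - ring.index(src)) % len(ring)


ROWS, COLS = 8, 8  # assumed layout for 64 nodes
ring = snake_ring(ROWS, COLS)
src, dst = (0, 0), (7, 0)
print(ring_hops(ring, src, dst))          # 63 hops following the ring
print(torus_hops(src, dst, ROWS, COLS))   # 1 hop over the wraparound link
```

In this worst case a request that must follow the ring needs 63 hops to reach a node that is a direct torus neighbor, which is the kind of gap that unconstrained request delivery is meant to close.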

Page 20: Cache-to-cache transfer latency

[Figure: two panels, Uncorq and Baseline. x-axis: cache-to-cache transfer latency (0 to 600); left y-axis: distribution (%), 0 to 10; right y-axis: cumulative distribution (%), 0 to 100]

substantial reduction in latency

Page 21: Execution Time

[Figure: normalized execution time (0 to 1) for SPLASH-2, SPECjbb and SPECweb. Bars: Baseline, Uncorq, Uncorq+Pref]

• Uncorq significantly reduces execution time (reduction: 5-23%)

• Uncorq + Pref performs the best (reduction: 13-26%)

Page 22: Also in the paper

• Serialization mechanism for case with no supplier

• System and node forward progress

• Fences and memory consistency issues

• Characterization of prefetching mechanism

• Comparison against ccHyperTransport

Page 23: Conclusion

• Propose invariant for transaction serialization

• Propose performance enhancements

  • Uncorq: unconstrained snoop request delivery

  • Simple hardware data prefetching technique

• Reduce execution time by 13-26%
