
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000

Presented by Garver Moore
ECE259, Spring 2006
Professor Daniel Sorin

Motivation

• Economic: high demand for OLTP machines
• Disconnect between the industry's ILP focus and this demand
• OLTP workload characteristics:

-- High memory latency
-- Little ILP (get, process, store)
-- Large TLP

• OLTP is underserved by aggressive ILP machines
• Use "old" cores and an ASIC design methodology for "glueless," scalable OLTP machines with low development cost and short time to market
• Amdahl's Law (sketched below)
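A sketch of the Amdahl's Law argument for CMP on OLTP; the parallel fraction p = 0.99 below is an assumed illustrative value, not a number from the paper. For parallel fraction p on N cores:

\[
\mathrm{Speedup}(N) = \frac{1}{(1-p) + p/N},
\qquad \mathrm{Speedup}(8)\Big|_{p=0.99} = \frac{1}{0.01 + 0.99/8} \approx 7.5
\]

So eight half-speed cores still deliver roughly 3.7x the throughput of one full-speed core on such a workload, which is why trading single-thread performance for core count pays off for OLTP.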

The Piranha Processing Node*

*Directly from Barroso et al.

+ Separate I/D L1 caches for each CPU
+ Logically shared, interleaved L2 cache
+ Eight memory controllers, each interfacing to a bank of up to 32 Rambus DRAM chips
+ Aggregate maximum memory bandwidth of 12.8 GB/sec (arithmetic sketched below)

+ 180 nm process (2000)
+ Almost entirely ASIC design: roughly 50% of the clock speed and 200% of the area of a full-custom methodology

+ CPU: single-issue, in-order Alpha core with an 8-stage pipeline (ECE152-level work)
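The aggregate bandwidth is consistent with eight independent Rambus channels; the 1.6 GB/s per-channel peak is my assumption from RDRAM of that era, not a figure on the slide:

\[
8\ \text{channels} \times 1.6\ \mathrm{GB/s} = 12.8\ \mathrm{GB/s}
\]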

Communication Assist

+ Home Engine and Remote Engine support shared memory across multiple nodes
+ System Control handles system miscellany: interrupts, exceptions, initialization, monitoring, etc.
+ OQ, Router, IQ, and Switch are standard interconnect blocks
+ Total inter-node I/O bandwidth: 32 GB/sec
+ Each link and block here corresponds to actual wiring and a module; this allows rapid parallel development under a semi-custom design methodology
+ Also facilitates multiple clock domains

THERE IS NO INHERENT I/O CAPABILITY.
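As a reading aid for the block diagram, a compilable C sketch of the node's block inventory; all type and field names are illustrative placeholders, not any real codebase:

/* Block inventory of one Piranha processing node (illustrative). */
enum { PIRANHA_CPUS = 8, PIRANHA_MCS = 8 };

struct cpu_core    { int id; };            /* single-issue, in-order Alpha */
struct l1_cache    { unsigned size_kb; };  /* separate iL1/dL1 per CPU */
struct l2_bank     { unsigned size_kb; };  /* logically shared, interleaved */
struct mem_ctrl    { int rdram_chips; };   /* bank of up to 32 RDRAM chips */
struct prot_engine { int is_home; };       /* Home or Remote Engine (the CA) */
struct sys_ctrl    { int state; };         /* interrupts, exceptions, init, monitoring */
struct router_blk  { int links; };         /* Router with its IQ/OQ queues */
struct ics_blk     { int ports; };         /* intra-chip switch */

struct piranha_node {
    struct cpu_core    cpu[PIRANHA_CPUS];
    struct l1_cache    il1[PIRANHA_CPUS], dl1[PIRANHA_CPUS];
    struct l2_bank     l2[PIRANHA_CPUS];
    struct mem_ctrl    mc[PIRANHA_MCS];
    struct prot_engine home, remote;       /* inter-node shared memory */
    struct sys_ctrl    sc;
    struct router_blk  router;
    struct ics_blk     ics;
    /* Note: no inherent I/O capability; I/O lives on a separate node type. */
};

Each field maps onto one block above, mirroring the one-block-one-module property that enabled parallel development and multiple clock domains.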

I/O Organization

+ Smaller than the processing node
+ Router has only 2 links, which eliminates the need for a routing table
+ I/O-node memory is globally visible and part of the coherence scheme
+ CPU placement is optimized for drivers, translations, etc., which need low-latency access to I/O
+ Reused dL1 design provides the interface to PCI/X
+ Supports an arbitrary I/O-to-processor ratio and network topology
+ Glueless scaling up to 1024 nodes of any type supports application-specific customization

Coherence: Local

+ Each L2 bank and its associated controller holds directory data for intra-chip requests: a centralized directory
+ The intra-chip switch (ICS) is responsible for all on-chip communication
+ L2 is "non-inclusive": it acts as a "large victim buffer" for the L1s and keeps duplicate tags and state for all L1 data
+ The L2 controller can determine whether data is cached remotely, and if so whether exclusively; the majority of L1 requests therefore need no CA assist
+ On a request, L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch from memory (see the dispatch sketch below)
+ While a request has been forwarded, L2 blocks conflicting requests to the same line
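A minimal C sketch of the L2 dispatch decision just described; the state names and hit/exclusivity inputs are my simplification of the duplicate L1 tag/state lookup, not the paper's actual tables:

/* Dispatch an L1 request at an L2 bank: service it directly, forward
 * to the owning L1, hand off to a protocol engine (the CA), or fetch
 * from local memory. Illustrative only. */
enum line_state { LS_INVALID, LS_SHARED_LOCAL, LS_OWNED_LOCAL,
                  LS_SHARED_REMOTE, LS_OWNED_REMOTE };

enum l2_action { SERVICE_FROM_L2, FORWARD_TO_OWNER_L1,
                 FORWARD_TO_PROTOCOL_ENGINE, FETCH_FROM_MEMORY };

enum l2_action l2_dispatch(enum line_state st, int l2_hit, int want_exclusive)
{
    /* Remote owner, or remote sharers on a write: the CA must act. */
    if (st == LS_OWNED_REMOTE ||
        (st == LS_SHARED_REMOTE && want_exclusive))
        return FORWARD_TO_PROTOCOL_ENGINE;

    /* An on-chip owner L1 supplies the line directly. */
    if (st == LS_OWNED_LOCAL)
        return FORWARD_TO_OWNER_L1;

    /* Otherwise the (victim-buffer) L2 or local memory services it. */
    return l2_hit ? SERVICE_FROM_L2 : FETCH_FROM_MEMORY;
}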

Coherence: Global

• Trades ECC granularity for "free" directory storage: computing ECC at 4x granularity frees 44 bits per 64-byte line (arithmetic below)
• Invalidation-based distributed directory protocol, with some optimizations
• No NACKing: deadlock avoidance through I/O, L, and H priority virtual lanes (L: requests to the home node, low priority; H: forwarded requests and replies)
• Forwards are guaranteed to be serviced by their targets: e.g., an owner writing back to home holds the data until home acknowledges
• Removes NACK/retry traffic, as well as "ownership change" (DASH), retry counts (Origin), and "No, seriously" (Token)
• Routing old messages toward empty buffers keeps buffer requirements only linear in N; lanes share buffer space, and "CMI" (cruise-missile) invalidations avoid deadlock
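Where the 44 bits come from, assuming standard SECDED codes (the specific code is my assumption, but the totals match the slide): SECDED needs 8 check bits per 64-bit word but only 10 per 256-bit word, so for a 64-byte (512-bit) line:

\[
\underbrace{8 \times 8}_{\text{ECC at 64-bit granularity}} \;-\; \underbrace{2 \times 10}_{\text{ECC at 4}\times\text{ granularity}} \;=\; 64 - 20 \;=\; 44\ \text{bits free per line}
\]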

Evaluation Methodology

• Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D)
• Simulated and compared against an aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
• Results "fudged" to estimate the effect of full-custom design
• Four configurations evaluated: P1 (single-core Piranha @ 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue @ 1 GHz), and P8 (the eight-core Piranha system as specified)

Results

Questions/Discussion

• Deadlock avoidance w/o NACK

• CMP vs SMP

• “Fishy” evaluation methodology?

• Specialized computing

• Buildability?
