Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000

Presented by Garver Moore
ECE 259, Spring 2006, Professor Daniel Sorin
Motivation
• Economic: high demand for OLTP machines
• Disconnect between ILP focus and this demand
• OLTP workloads:
-- High memory latency
-- Little ILP (get, process, store)
-- Large TLP
• OLTP unserved by aggressive ILP machines
• Use “old” cores and an ASIC design methodology for “glueless,” scalable OLTP machines with low development cost and short time to market
• Amdahl’s Law
The Piranha Processing Node*
*Directly from Barroso et al.
Separate I/D L1 caches for each CPU. Logically shared, interleaved L2 cache. Eight memory controllers interface to banks of up to 32 Rambus DRAM chips each, for an aggregate max bandwidth of 12.8 GB/sec.
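As a quick sanity check on the aggregate figure above, a minimal sketch. The ~1.6 GB/sec per-channel rate is the standard Direct Rambus figure and is an assumption here; the slide states only the 12.8 GB/sec aggregate:

```python
# Aggregate memory bandwidth across the node's eight memory controllers.
# The 1.6 GB/sec per-channel rate is the standard Direct Rambus figure,
# assumed here; the slide gives only the 12.8 GB/sec total.
channels = 8
gb_per_sec_per_channel = 1.6
aggregate = channels * gb_per_sec_per_channel
print(aggregate)  # 12.8 GB/sec, matching the slide
```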
180 nm process (2000). Almost entirely ASIC design: roughly 50% clock speed and 200% area versus a full-custom methodology.
CPU: Alpha (ECE 152 work); single-issue, in-order, 8-stage pipeline
Communication Assist+ Home Engine and Remote Engine support shared memory across multiple nodes
+ System Control tackles system miscellany: interrupts, exceptions, init, monitoring, etc.
+ OQ, Router, IQ, Switch standard
+ Total inter-node I/O bandwidth: 32 GB/sec
+ Each link and block here corresponds to actual wiring and modules
+ This allows rapid parallel development and a semi-custom design methodology
+ Also facilitates multiple clock domains
THERE IS NO INHERENT I/O CAPABILITY.
I/O Organization
+ Smaller than the processing node
+ Router has only 2 links, alleviating the need for a routing table
+ Memory is globally visible and part of the coherence scheme
+ CPU placement optimized for code with low-latency I/O access needs: drivers, translations, etc.
+ Re-used dL1 design provides the interface to PCI/X
+ Supports arbitrary I/O-to-processor ratios and network topologies
+ Glueless scaling up to 1024 nodes of either type supports application-specific customization
Coherence: Local
+ Each L2 bank and its associated controller contain the directory data for intra-chip requests (a centralized on-chip directory)
+ The intra-chip switch (ICS) is responsible for all on-chip communication
+ L2 is “non-inclusive”.
+ Acts as a “large victim buffer” for the L1s; keeps duplicate copies of the L1 tags and state
+ The L2 controller can thus determine whether data is cached remotely and, if so, whether exclusively. The majority of L1 requests then require no assistance from the CA.
+ On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch the data from memory.
+ While a forward is outstanding, the L2 blocks conflicting requests.
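The four L2 actions above can be sketched as a dispatch function. This is an illustrative sketch only, not the actual Piranha controller logic; the state names and the duplicate-L1-tag/remote-sharer lookups are assumptions made for illustration:

```python
def l2_dispatch(l2_lines, l1_owner, remote_sharers, addr):
    """Sketch of the L2 controller's choice for an incoming request.
    l1_owner models the duplicate L1 tag/state copies kept at the L2;
    remote_sharers models directory knowledge of off-chip caching."""
    if l1_owner.get(addr) == "exclusive":
        return "forward_to_owner_L1"         # owner L1 supplies the data
    if addr in l2_lines:
        return "service_from_L2"             # L2 holds a valid copy
    if addr in remote_sharers:
        return "forward_to_protocol_engine"  # CA handles inter-node coherence
    return "fetch_from_memory"

# Example: line "A" lives only in the L2, so the L2 services it directly
print(l2_dispatch({"A"}, {}, set(), "A"))  # service_from_L2
```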
Coherence: Global
• Trades ECC granularity for “free” directory-data storage (4x granularity leaves 44 bits per 64-byte line)
• Invalidation-based distributed directory protocol
• Some optimizations:
• No NACKing: deadlock avoidance through I/O, L, and H priority virtual lanes. L: requests to the home node, low priority. H: forwarded requests and replies, high priority.
• Also guarantees that forwards are always serviced by their targets: e.g., an owner writing back to home holds the data until home acknowledges.
• Removes NACK/retry traffic, as well as “ownership change” (DASH), retry counts (Origin), and “No, seriously” (Token).
• Routing old messages toward empty buffers keeps buffer dependence linear in N; buffer space is shared among lanes, and “cruise missile invalidations” (CMI) avoid deadlock.
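The ECC-granularity trade can be checked with SECDED bit counts: computing ECC over 256-bit chunks (4x the usual 64-bit word) frees exactly the 44 bits per 64-byte line cited above. A minimal sketch, assuming standard Hamming SECDED overhead:

```python
def secded_bits(k):
    # SECDED over k data bits: smallest r with 2**r >= k + r + 1 (Hamming
    # check bits), plus 1 overall parity bit for double-error detection
    r = 0
    while 2 ** r < k + r + 1:
        r += 1
    return r + 1

line_bits = 512                                   # 64-byte cache line
per_word = (line_bits // 64) * secded_bits(64)    # 8 words x 8 bits = 64
coarse = (line_bits // 256) * secded_bits(256)    # 2 chunks x 10 bits = 20
print(per_word - coarse)  # 44 bits freed per line for directory state
```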
Evaluation Methodology
• Admittedly favorable OLTP benchmarks chosen (TPC-B and TPC-D modifications)
• Simulated and compared to performance of aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
• “Fudged” for full-custom effect
• Four evaluations: P1 (single-CPU Piranha @ 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue, 1 GHz out-of-order), and P8 (full eight-CPU Piranha system)
Results
Questions/Discussion
• Deadlock avoidance w/o NACK
• CMP vs SMP
• “Fishy” evaluation methodology?
• Specialized computing
• Buildability?