Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000
Presented by Garver Moore
ECE259, Spring 2006
Professor Daniel Sorin
Motivation
• Economic: High demand for OLTP machines
• Disconnect between ILP focus and this demand
• OLTP
-- High memory latency
-- Little ILP (Get, process, store)
-- Large TLP
• OLTP unserved by aggressive ILP machines
• Use “old” cores, ASIC design methodology for “glueless,” scalable OLTP machines and low development costs and time to market
• Amdahl’s Law
The Piranha Processing Node*
*Directly from Barroso et al.
Separate I/D L1 for each CPU. Logically shared, interleaved L2 cache. Eight memory controllers interface to a bank of up to 32 Rambus DRAM chips. Aggregate max bandwidth of 12.8 GB/sec.
180 nm process (2000). Almost entirely ASIC design. 50% clock speed, 200% area versus full-custom methodology.
CPU: Alpha. ECE152 work. Single-issue, in-order, 8-stage pipeline.
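As a rough illustration of the memory-side numbers on this slide, the sketch below divides the stated 12.8 GB/sec aggregate across the eight memory controllers and shows one common way an interleaved cache maps addresses to banks. The bank count, the 64-byte line size, and the low-order-bits mapping are assumptions made for the example; the slide only says the L2 is "interleaved".

```python
# Figures taken from the slide.
NUM_MEMORY_CONTROLLERS = 8
AGGREGATE_BANDWIDTH_GB_S = 12.8

# Assumptions for illustration only (not stated on the slide).
NUM_L2_BANKS = 8   # hypothetical: one bank per memory controller
LINE_BYTES = 64    # hypothetical cache-line size


def per_controller_bandwidth_gb_s() -> float:
    """Aggregate bandwidth split evenly across the memory controllers."""
    return AGGREGATE_BANDWIDTH_GB_S / NUM_MEMORY_CONTROLLERS


def l2_bank_for(address: int) -> int:
    """Pick a bank from the low-order bits of the line number.

    Consecutive cache lines land in consecutive banks, which is what
    spreads sequential traffic across all banks.
    """
    line_number = address // LINE_BYTES
    return line_number % NUM_L2_BANKS


if __name__ == "__main__":
    print(per_controller_bandwidth_gb_s())
    print([l2_bank_for(i * LINE_BYTES) for i in range(10)])
```

Under these assumptions, each controller would need to sustain about 1.6 GB/sec, and any run of eight consecutive lines touches every bank once.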
Communication Assist
+ Home Engine and Remote Engine support shared memory across multiple nodes
+ System Control tackles system miscellany: interrupts, exceptions, init, monitoring, etc.
+ OQ, Router, IQ, Switch standard
+ Total inter-node I/O bandwidth: 32 GB/sec
+ Each link and block here corresponds to actual wiring and module.
+ This allows for rapid parallel development and a semi-custom design methodology
+ Also facilitates multiple clock domains
THERE IS NO INHERENT I/O CAPABILITY.
I/O Organization
+ Smaller than processing node
+ Router has 2 links, which alleviates the need for a routing table
+ Memory is globally visible and part of coherency scheme
+ CPU is an optimized placement for drivers, translations, etc. that need low-latency access to I/O.
+ Re-used dL1 design provides the interface to PCI/X
+ Supports arbitrary I/O:P ratio, network topology
+ Glueless scaling up to 1024 nodes of any type supports application-specific customization
Coherence: Local
+ L2 bank and associated controller contain directory data for intra-chip requests – Centralized directory
+ Chip ICS responsible for all on-chip communication
+ L2 is “non-inclusive”.
+ “Large victim buffer” for L1s. Keeps tags and state copies of L1 data
+ The L2 controller can determine whether data is cached remotely and, if so, whether exclusively. Majority of L1 requests then require no CA assist.
+ On a request, L2 can service directly, forward to owner L1, forward to protocol engine, or get from memory.
+ On forwards, L2 blocks conflicting requests
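The four outcomes listed above, plus the blocking of conflicting requests, can be sketched as a small dispatch function. This is only an illustrative model, not the controller described in the paper: the `LineState` fields, the `Action` names, and the order in which the cases are checked are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    SERVICE_FROM_L2 = auto()
    FORWARD_TO_OWNER_L1 = auto()
    FORWARD_TO_PROTOCOL_ENGINE = auto()
    FETCH_FROM_MEMORY = auto()
    BLOCK_UNTIL_FORWARD_COMPLETES = auto()


@dataclass
class LineState:
    """Hypothetical per-line bookkeeping kept by an L2 bank controller."""
    in_l2: bool = False                # L2 bank holds a valid copy
    owner_l1: Optional[int] = None     # CPU id whose L1 owns the line, if any
    held_by_remote_node: bool = False  # another node holds it exclusively
    forward_pending: bool = False      # an earlier request was forwarded


def dispatch(state: LineState, requester: int) -> Action:
    """Decide how to handle an L1 miss for one line (assumed ordering)."""
    if state.forward_pending:
        # A forward is in flight, so conflicting requests wait.
        return Action.BLOCK_UNTIL_FORWARD_COMPLETES
    if state.owner_l1 is not None and state.owner_l1 != requester:
        state.forward_pending = True
        return Action.FORWARD_TO_OWNER_L1
    if state.in_l2:
        return Action.SERVICE_FROM_L2
    if state.held_by_remote_node:
        # Only this case needs the communication assist.
        state.forward_pending = True
        return Action.FORWARD_TO_PROTOCOL_ENGINE
    return Action.FETCH_FROM_MEMORY


if __name__ == "__main__":
    line = LineState(owner_l1=3)
    print(dispatch(line, requester=0))  # forwarded to the owning L1
    print(dispatch(line, requester=1))  # blocked behind the forward
```

In this model only the remote-exclusive case reaches the protocol engine, which mirrors the slide's point that most L1 requests need no CA assist.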
Coherence: Global
• Trades ECC granularity for “free” directory data storage (4x granularity leaves 44 bits per 64-byte line)
• Invalidation-based distributed directory protocol
• Some optimizations
• No NACKing: Deadlock avoidance through I/O, L, H priority virtual lanes. L: Home node, low priority. H: Forwarded requests, replies
• Also guarantee forwards always serviced by targets: e.g. owner writes back to home, holds data until home acknowledges.
• Removes NACK/Retry traffic, as well as “ownership change” (DASH), retry-counts (Origin), “No, seriously” (Token).
• Routing toward empty buffers for old messages; linear buffer dependence on N. Share buffer space among lanes, and “CMI” invalidations avoid deadlock.
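The 44-bit figure in the first bullet can be reproduced with a short calculation. The sketch below assumes a standard SECDED Hamming code, a 64-byte (512-bit) line, and a move from 64-bit to 256-bit ECC words as the "4x" change; none of these details are spelled out on the slide, so treat them as assumptions that happen to match its number.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a SECDED Hamming code over `data_bits` bits."""
    r = 1
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall-parity bit adds double-error detection.
    return r + 1


def ecc_bits_per_line(line_bits: int, granularity_bits: int) -> int:
    """Total ECC bits when a line is protected in words of the given size."""
    words = line_bits // granularity_bits
    return words * secded_check_bits(granularity_bits)


LINE_BITS = 64 * 8  # assumed 64-byte line

fine = ecc_bits_per_line(LINE_BITS, 64)     # 8 words x 8 check bits
coarse = ecc_bits_per_line(LINE_BITS, 256)  # 2 words x 10 check bits
freed = fine - coarse

if __name__ == "__main__":
    print(fine, coarse, freed)
```

Under these assumptions the line needs 64 ECC bits at 64-bit granularity and 20 at 256-bit granularity, so 44 bits become available for directory state without adding any storage.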
Evaluation Methodology
• Admittedly favorable OLTP benchmarks chosen (TPC-B and TPC-D modifications)
• Simulated and compared to performance of aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
• “Fudged” for full-custom effect
• Four evaluations: P1 (One-core Piranha @ 500MHz), INO (1GHz single-issue in-order aggressive core), OOO (4-issue 1GHz) and P8 (Spec. system)
Results
Questions/Discussion
• Deadlock avoidance w/o NACK
• CMP vs SMP
• “Fishy” evaluation methodology?
• Specialized computing
• Buildability?