CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips

Ciprian Seiculescu, Stavros Volos, Naser Khosro Pour, Babak Falsafi, Giovanni De Micheli
LSI, Integrated Systems Laboratory

Post on 20-Feb-2016

NoCs: Major Power Consumer
- Move towards manycore
  - Tiled architectures
- Network-on-Chip (NoC)
  - Significant power consumer
  - 40% in MIT RAW
  - 30% in Intel Tera-scale
- Cache coherent CMP
  - Server workloads
[Figure: tiled CMP; labels: Core (C), $, Crossbar]

Proposals to Reduce NoC Power
- Multiple networks
  - Better area and power [Balfour & Dally ICS 2006]

- Commercial server workloads
  - Traffic patterns are different
- Run on cache coherent CMPs
  - Strong relation between coherence protocol and NoC
- Not optimized for commercial server workload traffic

Contributions
- Commercial server workloads
  - Optimized for reuse in L1, little sharing
- Full-blown coherence protocol in CMPs
  - Only some transitions are frequent

- Duality in Request/Response message size
- CCNoC
  - Full advantage of heterogeneity
  - Same number of buffers
  - 16% less power, same performance as Mesh

Outline
- Overview
- Why CCNoC?
- Dual-router design
- Evaluation
- Conclusions

Dual Router is More Efficient
- Dual router
  - Two crossbars per routing node

- Wires are less expensive on-chip
  - Use more wires for better performance
- Area and power grow faster than connectivity [Balfour & Dally ICS 2006]
- Dual router: better performance, power and area
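The claim that crossbar area and power grow faster than connectivity can be illustrated with a first-order model. The sketch below assumes that crossbar cost scales with the square of (ports x width); that scaling law, the function names and the 5-port example are assumptions made for illustration and do not come from the slides.

```python
def crossbar_cost(ports: int, width_bits: int) -> int:
    """Relative crossbar cost under the ASSUMED (ports * width)^2 scaling."""
    return (ports * width_bits) ** 2


def dual_to_single_ratio(ports: int, width_bits: int) -> float:
    """Cost of two half-width crossbars relative to one full-width crossbar."""
    single = crossbar_cost(ports, width_bits)
    dual = 2 * crossbar_cost(ports, width_bits // 2)
    return dual / single


if __name__ == "__main__":
    # Hypothetical 5-port mesh router with a 128-bit datapath.
    print(dual_to_single_ratio(ports=5, width_bits=128))  # 0.5
```

Under this assumed quadratic model, two N/2-bit crossbars offer the same total link width as one N-bit crossbar at half the crossbar cost, which is the intuition behind the dual-router argument.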

[Figure: N bit wide; N/2 bit wide; N/2 bit wide]

Right Dual Router Design
- Avoid protocol-level deadlock
  - Separate requests and responses
  - Use virtual channels
- CCNoC sub-networks
  - Request / Response
  - No VCs needed
  - Same number of buffers
  - Buffers are power hungry [H.S. Wang & L.S. Peh, MICRO 2003]

Protocol Activity
- CMPs implement a full-blown coherence protocol

- Some transitions are frequent [Hardavellas ISCA 2009]
  - Read clean block
  - Evict clean block
  - Write to unshared block

- Other transitions needed for correctness (infrequent)
  - Read dirty block
  - Evict dirty
  - Write to shared block

Frequent Read Protocol Activity
[Figure: message exchange between Reader, Directory and Writer; messages: Read Req, Read Resp, Evict Clean Req; legend: Short Req, Short Req, Short Resp, Long Resp]

Frequent Write Protocol Activity
[Figure: message exchange between Writer and Directory; messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp; legend: Short Req, Short Req, Short Resp, Long Resp]

Infrequent Read Protocol Activity
[Figure: message exchange between Reader, Directory and Writer; messages: Read Req, Read Resp, Downgrade Req, Downgrade Resp; legend: Short Req, Short Req, Short Resp, Long Resp]

Infrequent Write Protocol Activity
[Figure: message exchange between Writer, Directory, Reader 1 and Reader 2; messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp, Inv Req (twice), Inv Resp (twice), Evict Dirty Req; legend: Short Req, Short Req, Short Resp, Long Resp]

Traffic Analysis
- Request: 93% short
- Response: 86% long

CCNoC Router
- Request network narrow: optimized for short messages
- Response network wide: optimized for long messages
[Figure: labels: Request Switch, Response Switch, NI, Router]

Previous Work
- Balfour et al. ICS 2006
  - Better than single large router
  - Read/Write traffic
  - Same number of reads and writes

- Yoon et al. DAC 2010
  - Physical channels better than virtual channels
- Not optimized for cache coherent CMPs running commercial server workloads

Outline
- Overview
- Why CCNoC?
- Dual-router design
- Evaluation
- Conclusions

Evaluation Methodology
- FLEXUS
  - Full system simulation
  - 16 or 8 UltraSPARC III ISA cores
  - Split I/D, 64KB L1
  - 1 or 2 MB L2

- ORION 2.0
  - Power estimation
  - Area estimation
- Workloads
  - OLTP: TPC-C (IBM DB2 and Oracle)
  - DSS: TPC-H (IBM DB2; Q1, Q6, Q13, Q16)
  - Web: SPECweb99 (Apache and Zeus)
  - Scientific: EM3D
  - Multiprogrammed: SPEC2K 2x: gcc, twolf, art, mcf

Evaluation NoCs
- Mesh-128 (baseline): 128-bit flit width
- Torus (reference): 128-bit flit width
- Mesh-176 (high performance): 176-bit flit width
- CCNoC
  - Request: 48-bit flit width
  - Response: 128-bit flit width
- Switches
  - Wormhole flow control
  - Input queued
- Transmission protocol: On/Off
- Input buffers: 2 entries
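To make the flit widths above concrete, the following sketch counts how many flits one message occupies on each evaluated network. The flit widths are the ones listed on this slide; the message sizes (a 48-bit short message and a long message carrying a 64-byte cache block plus 48 control bits) are assumptions chosen for illustration, since the slides do not state them.

```python
from math import ceil

# Flit widths in bits, as listed on the Evaluation NoCs slide.
FLIT_WIDTHS = {
    "Mesh-128": {"request": 128, "response": 128},
    "Torus": {"request": 128, "response": 128},
    "Mesh-176": {"request": 176, "response": 176},
    "CCNoC": {"request": 48, "response": 128},
}

# ASSUMPTIONS (not from the slides): illustrative message sizes.
SHORT_BITS = 48           # control-only message
LONG_BITS = 48 + 64 * 8   # control bits plus a 64-byte cache block


def flits(message_bits: int, flit_width: int) -> int:
    """Number of flits needed to carry one message."""
    return ceil(message_bits / flit_width)


def unused_bits(message_bits: int, flit_width: int) -> int:
    """Link bits left empty after packing the message into whole flits."""
    return flits(message_bits, flit_width) * flit_width - message_bits


def summarize() -> dict:
    """Per-network flit counts and padding for the two assumed sizes."""
    summary = {}
    for name, width in FLIT_WIDTHS.items():
        summary[name] = {
            "short_request_flits": flits(SHORT_BITS, width["request"]),
            "short_request_unused_bits": unused_bits(SHORT_BITS, width["request"]),
            "long_response_flits": flits(LONG_BITS, width["response"]),
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize().items():
        print(name, row)
```

With these assumed sizes, a short request fills one 48-bit CCNoC request flit exactly, while the same message leaves 80 bits unused in a 128-bit flit and 128 bits unused in a 176-bit flit.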

Performance
- Performance loss: 2% against Torus, 8% against Mesh-176

Power Savings
- Power savings: 16% against Mesh-128, 22% against Torus, 38% against Mesh-176

Conclusions
- Duality in Request/Response traffic
  - Request: dominated by short messages
  - Response: dominated by long messages

- Proposed CCNoC
  - Narrow request network
  - Wide response network
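The request/response split can be sketched as a steering step at the network interface. The message names below are the ones shown on the protocol-activity slides; the `SubNetwork` enum, the `sub_network_for` function and the suffix-based rule are hypothetical names and a simplification made for illustration, not the authors' implementation.

```python
from enum import Enum


class SubNetwork(Enum):
    REQUEST = "request"    # narrow network
    RESPONSE = "response"  # wide network


# Message names as they appear on the protocol-activity slides.
PROTOCOL_MESSAGES = (
    "Read Req", "Read Resp", "Evict Clean Req", "Evict Dirty Req",
    "Fetch/Upgrade Req", "Fetch Resp", "Upgrade Resp",
    "Downgrade Req", "Downgrade Resp", "Inv Req", "Inv Resp",
)


def sub_network_for(message: str) -> SubNetwork:
    """Pick a sub-network from the message class (ASSUMED suffix rule)."""
    if message.endswith(" Req"):
        return SubNetwork.REQUEST
    if message.endswith(" Resp"):
        return SubNetwork.RESPONSE
    raise ValueError(f"unknown message class: {message!r}")


if __name__ == "__main__":
    for name in PROTOCOL_MESSAGES:
        print(f"{name:18s} -> {sub_network_for(name).value}")
```

Because requests and responses never share a sub-network in this sketch, a response cannot be blocked behind a request, which is the property the slides rely on to avoid protocol-level deadlock without virtual channels.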

- Showed significant power savings
  - 22% against Torus
  - 38% against Mesh-176

Thank you!
Q&A