Region-based Hierarchical Operation Partitioning for Multicluster Processors
Michael Chu, Kevin Fan, Scott Mahlke
University of Michigan
Presented by Cristian Petrescu-Prahova
Clustered Register Files
Why?
Register file cost and access time grow with the square of the number of register ports
Bypass logic grows quadratically with the number of operations issued per cycle
The distance separating FUs from the register file increases with a large number of FUs
=> Clustered register files
Decentralized architecture with several small register files
Each register file supplies operands to a subset of FUs
Examples: Multiflow Trace, Alpha 21264, TI C6x, Analog Devices TigerSHARC (two clusters); reconfigurable meshes?
Goal
Partition operations across the resources available on each cluster to maximize ILP
Minimize inter-cluster communication
Rule of thumb:
A processor with 2 identical clusters loses ~20% performance
A processor with 4 identical clusters loses ~30% performance
Nonidentical clusters lead to even more performance loss
Well-Known Technique: Bottom-Up Greedy (BUG)
Recurse along the DFG, critical path first
Assign each operation to a cluster based on estimates (from the scheduler) of when the operation and its predecessors can complete earliest
Problem 1: makes local decisions (see figure)
Problem 2: is slow; it needs to query accurate cluster status information for each operation considered
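To make the "local decisions" problem concrete, here is a minimal Python sketch of a greedy cluster assignment in the spirit of the slide. It is not the BUG algorithm itself: it visits operations in a given topological order rather than recursing critical path first, and it assumes a toy machine model (one issue slot per cluster per cycle, a fixed `move_latency` for inter-cluster operand transfers). The names `greedy_assign`, `preds`, `latency` and `move_latency` are hypothetical.

```python
def greedy_assign(ops, preds, latency, num_clusters, move_latency=1):
    """Assign each op to the cluster where it is estimated to finish earliest.

    ops must be in topological order. Assumed toy model: each cluster issues
    one op per cycle, and an operand produced on another cluster arrives
    move_latency cycles late.
    """
    cluster_of, finish = {}, {}
    free_at = [0] * num_clusters  # next free issue cycle per cluster
    for op in ops:
        best = None
        for c in range(num_clusters):
            # Cycle at which all operands are available on cluster c.
            ready = max(
                (finish[p] + (0 if cluster_of[p] == c else move_latency)
                 for p in preds.get(op, [])),
                default=0,
            )
            start = max(ready, free_at[c])
            done = start + latency[op]
            # Purely local choice: earliest completion of this op only.
            if best is None or done < best[0]:
                best = (done, c, start)
        done, c, start = best
        cluster_of[op], finish[op] = c, done
        free_at[c] = start + 1
    return cluster_of, finish
```

In this toy model, two independent producers `a` and `b` feeding a consumer `c` are split across clusters because each placement looks best in isolation; `c` then pays the move latency and finishes no earlier than it would on a single cluster.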
Region-Based Hierarchical Operation Partitioning (RHOP)
Works on acyclic DFGs extracted from the complete program based on region decomposition. (Presenter's note: I assume a region is roughly a loop; this is unconfirmed.)
Two phases:
Weight calculation: node and edge
Partitioning: coarsening and refining
Node Weight Calculation
Reflects the quantity of resources per operation
Ignores dependencies
Individual weight (FUs)
Shared weight (ports, buses)
Edge Weight Calculation
Measure of criticality
Based on the notion of slack
First-come, first-served slack distribution
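As an illustration of the slack notion, the following Python sketch computes a per-edge slack on an acyclic DFG from ASAP and ALAP start times. This is the textbook definition of slack, not necessarily the exact formula used in RHOP, and it does not model the first-come, first-served distribution of slack along a path. The function name `edge_slack` and its arguments are hypothetical.

```python
from collections import defaultdict


def edge_slack(nodes, edges, latency):
    """Return {(u, v): slack} for a DAG.

    nodes must be in topological order; latency maps op -> cycles.
    Slack of edge (u, v) is how many cycles v's operand could be delayed
    without lengthening the critical path.
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Earliest start time of each op (ASAP).
    asap = {}
    for n in nodes:
        asap[n] = max((asap[p] + latency[p] for p in preds[n]), default=0)
    length = max(asap[n] + latency[n] for n in nodes)  # critical path length

    # Latest start time of each op that still meets the critical path (ALAP).
    alap = {}
    for n in reversed(nodes):
        alap[n] = min((alap[s] for s in succs[n]), default=length) - latency[n]

    return {(u, v): alap[v] - (asap[u] + latency[u]) for u, v in edges}
```

Edges with zero slack lie on a critical path; a partitioner would presumably give them the highest weight so that cutting them (and paying an inter-cluster move) is discouraged.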
Coarsening Partitioning
Multilevel graph partitioning algorithm (Chaco, Metis)
Works by coarsening highly related nodes into partitions; takes into account only edge weights
Takes a snapshot of each step for the refining step
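The coarsening idea can be sketched in Python as repeated heavy-edge matching: at each step, the pairs of groups joined by the heaviest edges are merged, and every intermediate grouping is saved as a snapshot. This is a generic multilevel-coarsening sketch, not the Chaco or Metis implementation and not necessarily RHOP's exact matching rule; `coarsen`, `coarsen_once` and their parameters are hypothetical names.

```python
def coarsen_once(groups, edge_weight, max_merges):
    """Merge disjoint pairs of groups, heaviest connecting edge first."""
    owner = {op: i for i, g in enumerate(groups) for op in g}
    # Total edge weight between each pair of distinct groups.
    between = {}
    for (u, v), w in edge_weight.items():
        a, b = owner[u], owner[v]
        if a != b:
            key = (min(a, b), max(a, b))
            between[key] = between.get(key, 0) + w

    matched, merged = set(), []
    for (a, b), _w in sorted(between.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(merged) >= max_merges:
            break
        if a not in matched and b not in matched:
            matched.update((a, b))
            merged.append(groups[a] | groups[b])
    # Groups that found no partner carry over unchanged.
    merged.extend(g for i, g in enumerate(groups) if i not in matched)
    return merged


def coarsen(ops, edge_weight, num_clusters):
    """Return the list of snapshots, finest (one op per group) first."""
    levels = [[frozenset([op]) for op in ops]]
    while len(levels[-1]) > num_clusters:
        budget = len(levels[-1]) - num_clusters
        nxt = coarsen_once(levels[-1], edge_weight, budget)
        if len(nxt) == len(levels[-1]):  # nothing left to merge
            break
        levels.append(nxt)
    return levels
```

For a chain a-b-c-d with weights 10, 1, 8 and two clusters, the sketch merges a with b and c with d, leaving the light b-c edge as the only cut.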
Refinement Partitioning
Traverse back through the coarsening stages, making improvements to the initial partition
At each stage, the coarsened nodes available at that point are considered for movement to another cluster
Highly related operations are grouped together at each stage because we follow the coarsening process backwards
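The backward traversal can be sketched in Python as follows. The cost function here is a stand-in chosen for illustration (load of the heaviest cluster plus the weight of cut edges); it is an assumption, not RHOP's cluster weight, system load, or gain metrics. The names `refine`, `cost`, `levels` and `assign` are hypothetical, and `levels` is assumed to be ordered finest first.

```python
def cost(assign, node_weight, edge_weight):
    """Stand-in cost (assumption): heaviest cluster load plus cut edge weight."""
    loads = {}
    for op, c in assign.items():
        loads[c] = loads.get(c, 0) + node_weight[op]
    cut = sum(w for (u, v), w in edge_weight.items() if assign[u] != assign[v])
    return max(loads.values()) + cut


def refine(levels, assign, node_weight, edge_weight, num_clusters):
    """Walk snapshots from coarsest to finest, moving whole groups when it helps."""
    assign = dict(assign)
    for groups in reversed(levels):
        for g in groups:
            best_c = assign[next(iter(g))]  # all ops of g share one cluster
            best_cost = cost(assign, node_weight, edge_weight)
            for c in range(num_clusters):
                trial = {**assign, **{op: c for op in g}}
                trial_cost = cost(trial, node_weight, edge_weight)
                if trial_cost < best_cost:
                    best_c, best_cost = c, trial_cost
            for op in g:
                assign[op] = best_c
    return assign
```

Because coarse groups are tried first, tightly coupled operations move together; single operations are reconsidered only at the finest snapshot, where moving one alone would usually cut a heavy edge.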
Metrics
Cluster weight: estimate of the load per cluster; the cluster with the highest weight is denoted 'the imbalanced cluster'
System load: estimates the load across all clusters
Gain: the gain of moving operations into other clusters
Cluster Weight
Individual resource constraint per cluster, per cycle (op groups)
Total node weight per cluster, per cycle (shared constraints)
Cycle weight per cluster
Cluster weight
System Load
Inter-cluster move overhead
Total load, based on cycle-by-cycle estimation
Gain
Load gain
Edge gain
Move gain
Example
Evaluation
Implemented using the Trimaran tool set
Compared with the BUG algorithm
Benchmarks: 5 DSP benchmarks (high ILP), SPECint2000 (low ILP)
5 configurations; functional units: integer (I), float (F), memory (M), branch (B)
Improvement in dynamic total cycles of RHOP over BUG
Comparison of BUG and RHOP clustering performance versus a 1-cluster machine
2-1111 processor 4-1111 processor
Histogram of RHOP versus BUG
Achieved schedule length versus critical path length. Numbers on top are dynamic execution percentages
Compiling performance: number of calls to the resource table