Post on 21-Dec-2015
Burleson, UMASS 1
Adaptive System on a Chip (ASOC): A Backbone for Power-Aware
Signal Processing Cores
Andrew Laffely, Jian Liang, Russ Tessier and Wayne Burleson
Electrical and Computer Engineering
University of Massachusetts Amherst
{burleson}@ecs.umass.edu
This material is based upon work supported by the National Science Foundation under Grant No. 9988238 and SRC Tasks 766 and 1075
Burleson/UMASS 2
Challenges in Media Processing
• Increasingly complex, heterogeneous algorithms
  • Variable run-times (e.g. data-dependent iterations)
  • Variable quality
  • Variable power consumption
• Large data-sets, usually streaming
  • Memory size, ports and latency issues
• Advancing semiconductor technology (Moore’s Law)
  • Interconnect (on-chip and I/O)
  • Clocking
  • Power (consumption and distribution)
  • Design and Verification
Burleson/UMASS 3
aSoC: adaptive System on a Chip
• Tiled SoC architecture
[Figure: tile array populated with cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation]
Burleson/UMASS 4
aSoC: adaptive System on a Chip
• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
  • Pick and place cores which best perform the given application
    • Increase performance
    • Save power
  • Cores may be any number of tiles in size
[Figure: tile array populated with cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation]
Burleson/UMASS 5
aSoC: adaptive System on a Chip
• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
• Connected with an interconnect mesh
  • Restricted to near-neighbor communications
  • Creates pipeline
  • Decreases cycle time
[Figure: tile array populated with cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation]
Burleson/UMASS 6
aSoC: adaptive System on a Chip
• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
• Connected with an optimized fixed interconnect mesh
• Using a communication interface (CI) to manage data
  • Network port (Coreport) for each core, I/O queues, handshake
  • Each CI uses a memory and FSM to repetitively process a predefined (static) schedule of communications
  • High-speed 5x5 bidirectional crossbar
[Figure: tile array populated with cores: DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation]
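The statically scheduled communication interface can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' hardware: the class name, the port names and the dictionary encoding of an instruction (output port mapped to input port) are hypothetical. It shows an instruction memory that is replayed cyclically by a program counter, with each instruction configuring a 5x5 crossbar (four neighbors plus the local core) for one cycle.

```python
from collections import deque

# 5x5 crossbar: four mesh neighbors plus the local core port (names are assumptions)
PORTS = ("north", "south", "east", "west", "core")


class CommunicationInterface:
    """Toy model of one tile's CI: instruction memory, PC and crossbar."""

    def __init__(self, schedule):
        # schedule: list of instructions, each mapping output port -> input port
        for instr in schedule:
            for dst, src in instr.items():
                if dst not in PORTS or src not in PORTS:
                    raise ValueError(f"unknown port in {dst} <- {src}")
        self.schedule = schedule
        self.pc = 0
        self.inputs = {p: deque() for p in PORTS}
        self.outputs = {p: deque() for p in PORTS}

    def step(self):
        """Execute one schedule entry, then advance the PC cyclically."""
        instr = self.schedule[self.pc]
        # Read each source at most once so a word can fan out to several outputs.
        words = {src: self.inputs[src].popleft()
                 for src in set(instr.values()) if self.inputs[src]}
        for dst, src in instr.items():
            if src in words:
                self.outputs[dst].append(words[src])
        self.pc = (self.pc + 1) % len(self.schedule)


# Two-entry schedule: cycle 0 delivers west traffic to the core,
# cycle 1 forwards the core's result to the east neighbor.
ci = CommunicationInterface([{"core": "west"}, {"east": "core"}])
ci.inputs["west"].append("pixel0")
ci.step()
ci.inputs["core"].append("coef0")
ci.step()
```

After the two steps the program counter has wrapped back to entry 0, which is the "repetitively process a predefined schedule" behavior named on the slide. An empty source queue simply transfers nothing in that cycle, which stands in for the handshake.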
Burleson/UMASS 7
Communication Interface
• Custom design to maximize speed and reduce power
  • Core-ports
  • Crossbar
  • Controller
  • Instruction memory
  • Local frequency and voltage supply
[Figure: communication interface block diagram. Blocks: Core, Core-ports, Controller, PC, Instruction Memory (example entry: North to South & East), Decoder, Local Config., Local Frequency & Voltage, and a Crossbar with North, South, East and West inputs and outputs]
Burleson/UMASS 9
Research Thrusts
• aSoC Infrastructure¹,³
  • Communication Interface
  • Interconnect³
  • Power Distribution
  • Clock System
  • Power Management
• Design Technology
  • Compiler¹,³ (Partitioner, Mapper, Placer, Scheduler)
  • Simulator¹
• Cores
  • Motion estimation²,³
  • Discrete Cosine Transform²,³
  • AES Cryptography³
  • Huffman Coding
  • Adaptive Viterbi²,³
  • 3D Graphics¹,²,³
  • Smart Card²,³
  • MP3
  • ARM
  • DSP
  • Cache²,³
  • FPGA
  • MAC
¹ PhD Dissertation  ² Masters Thesis  ³ Publications
Burleson/UMASS 10
Voltage Scaling Approach
• Core-ports
  • Single buffer for each stream to cross clock/voltage barrier between core and interface
  • Reading/Writing success rates indicate core utilization
    • Input blocked: Core too slow
    • Output blocked: Core too fast
• Controller
  • Interprets core-port success rates to adjust local clock and voltage
[Figure: processing pipeline. Interconnect, buffered input and output core-ports around the core, "Blocked" signals from both core-ports, and a clock and supply controller driving the local Vdd and local clock]
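The control rule on this slide (input blocked means the core is too slow, output blocked means it is too fast) can be sketched as a small decision function. This is an illustration only: the function name, the blocked-rate inputs and the 0.2 threshold are hypothetical, and the four (clock fraction, Vdd) pairs assume the speed steps and supply voltages quoted on the later Vdd and power distribution slides.

```python
# (clock fraction of max speed, Vdd in volts); pairing is an assumption
# based on the speed steps and supply levels quoted elsewhere in the deck.
LEVELS = [(1 / 8, 0.60), (1 / 4, 0.73), (1 / 2, 1.16), (1.0, 1.80)]


def next_level(level, input_blocked_rate, output_blocked_rate, threshold=0.2):
    """Return the new index into LEVELS for one core.

    input_blocked_rate:  fraction of interconnect writes that found the
                         input core-port buffer still full (core too slow).
    output_blocked_rate: fraction of core writes that found the output
                         core-port buffer still full (core too fast).
    The threshold is a hypothetical tuning parameter.
    """
    if input_blocked_rate > threshold and input_blocked_rate >= output_blocked_rate:
        return min(level + 1, len(LEVELS) - 1)  # speed up, raise Vdd
    if output_blocked_rate > threshold:
        return max(level - 1, 0)                # slow down, lower Vdd
    return level                                # utilization is balanced


level = 1                                       # start at 1/4 speed, 0.73 V
level = next_level(level, input_blocked_rate=0.5, output_blocked_rate=0.0)
clock_fraction, vdd = LEVELS[level]
```

Stepping one level at a time keeps the sketch simple; a real controller would also need hysteresis so that it does not oscillate between two adjacent levels.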
Burleson/UMASS 11
Vdd Selection Criteria
[Figure: Normalized Core Critical Path Delay vs. Vdd. x-axis: Voltage (0.4 to 2); y-axis: Normalized Delay (0 to 12). Marked operating points: Max Speed, 1/2 Speed, 1/4 Speed and 1/8 Speed; marked voltages: 1.16 and 0.73]
• As Vdd decreases delay increases exponentially
• Use curve to match available clock frequencies to voltages
• The voltage and frequency change reduces power by 79%, 96%, and 98.7%
  • P = C·Vdd²·f
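The quoted power reductions follow from the dynamic power relation, power proportional to C times Vdd squared times f. The short calculation below is a sketch that assumes the operating points are 1.16 V at 1/2 speed, 0.73 V at 1/4 speed and 0.6 V at 1/8 speed relative to 1.8 V at full speed; the 0.6 V and 1.8 V values are taken from the power distribution slide rather than from this one.

```python
def relative_dynamic_power(vdd, freq_fraction, vdd_max=1.8):
    """Dynamic power relative to full speed at vdd_max, from P = C * Vdd^2 * f.

    The capacitance C cancels because the same core is compared with itself.
    """
    return (vdd / vdd_max) ** 2 * freq_fraction


# Assumed pairing of supply voltage and clock fraction (see lead-in).
points = {
    "1/2 speed": (1.16, 1 / 2),
    "1/4 speed": (0.73, 1 / 4),
    "1/8 speed": (0.60, 1 / 8),
}

reductions = {
    name: 100.0 * (1.0 - relative_dynamic_power(vdd, frac))
    for name, (vdd, frac) in points.items()
}

for name, percent in reductions.items():
    print(f"{name}: {percent:.1f}% lower dynamic power")
```

Under these assumptions the reductions come out at roughly 79.2%, 95.9% and 98.6%, in line with the slide's 79%, 96% and 98.7%; the small gap on the last value may come from rounding of the assumed 0.6 V.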
Burleson/UMASS 12
Architecture Evaluation (Motion Estimation)
• Array-based architecture
  • Pipelined ME
• Parameterized search window size
  • Full search
  • Choose 16x16 or 8x8 windows
  • Reduce power
[Figure: motion estimation core with Address Generation Unit, Processing Element Array, Memory and FIFOs]
Burleson/UMASS 13
Power Aware Core
• Custom motion estimation core
• Choose search method
  • Full search
    • 960-600 mW (bit width and pel sub-sampling)
  • Spiral search
    • 76 mW
  • Three step search
    • 25 mW
Data taken with Synopsys™ Power Compiler at the RTL level
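To make the difference between the search methods concrete, the sketch below compares a full search with a textbook three step search using the sum of absolute differences (SAD). It is a generic software illustration, not the core's RTL: the function names, the synthetic 32x32 frame, the 8x8 block and the ±7 search range are all assumptions chosen for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block at (bx, by) in cur and the block
    displaced by (dx, dy) in ref."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def inside(ref, bx, by, dx, dy, n):
    """True if the displaced block lies entirely inside the reference frame."""
    return (0 <= by + dy and by + dy + n <= len(ref)
            and 0 <= bx + dx and bx + dx + n <= len(ref[0]))


def full_search(cur, ref, bx, by, n, radius):
    """Evaluate every displacement in the +/-radius window."""
    best, evals = None, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if inside(ref, bx, by, dx, dy, n):
                cost = sad(cur, ref, bx, by, dx, dy, n)
                evals += 1
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return (best[1], best[2]), evals


def three_step_search(cur, ref, bx, by, n, first_step=4):
    """Check 8 neighbors at step 4, 2, 1, re-centering on the best each time."""
    cx = cy = 0
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    evals, step = 1, first_step
    while step >= 1:
        centre = (cx, cy)  # neighbors are taken around the centre of this step
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                mx, my = centre[0] + dx, centre[1] + dy
                if (dx or dy) and inside(ref, bx, by, mx, my, n):
                    cost = sad(cur, ref, bx, by, mx, my, n)
                    evals += 1
                    if cost < best_cost:
                        best_cost, cx, cy = cost, mx, my
        step //= 2
    return (cx, cy), evals


# Synthetic frames: cur is ref shifted so the true motion vector is (2, 1).
SIZE = 32
ref = [[x * x + 3 * y * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[min(y + 1, SIZE - 1)][min(x + 2, SIZE - 1)] for x in range(SIZE)]
       for y in range(SIZE)]

fs_vector, fs_evals = full_search(cur, ref, 12, 12, 8, 7)
tss_vector, tss_evals = three_step_search(cur, ref, 12, 12, 8)
```

Here the full search evaluates 225 candidate positions while the three step search evaluates 25, which is the kind of workload reduction behind the power figures above. The three step search can settle on a local minimum, so it trades some match quality for that saving.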
Burleson/UMASS 14
aSoC Support
• Multiple streams in and out through dedicated core ports
  • Easy to manage on both sides of the port
• Schedule configuration streams in with the data
  • Stream A: Input Frame
  • Stream B: Configuration (Choose search mode and size)
  • Stream C: Motion Vectors
[Figure: Motion Estimation core with coreports in1, in2, out1 and out2 carrying Stream A, Stream B and Stream C]
Burleson/UMASS 15
Reconfigurable Interconnect
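• P-frame
• I-frame
[Figure: two dataflow configurations built from Input Frame, ME, MC, summation (−/+) and DCT blocks, one for P-frames and one for I-frames]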
Burleson/UMASS 16
aSoC Support
• Lumped ME, MC and Summation into one double core
[Figure: DCT core and the double-size Motion Estimation & Compensation core]
Burleson/UMASS 17
aSoC Support: P-Frame
[Figure: DCT and Motion Estimation & Compensation cores with Input Frame (Stream A) and Difference Frame (Stream B)]
Burleson/UMASS 18
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation cores with Input Frame (Stream A), Difference Frame (Stream B) and Configuration Streams (C & D)]
Burleson/UMASS 19
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation cores with Input Frame (Stream A), Difference Frame (Stream B) and Configuration (Stream C); Schedule 1, Schedule 2 and the PC are shown]
Burleson/UMASS 21
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation cores with Input Frame (Stream A) and Configuration (Stream D); Schedule 1, Schedule 2 and the PC are shown]
Burleson/UMASS 22
aSoC Support: Schedule Change
[Figure: DCT and Motion Estimation & Compensation cores with Input Frame (Stream A’) and Configuration (Stream D); Schedule 1, Schedule 2 and the PC are shown]
Burleson/UMASS 23
aSoC Support: I-Frame
[Figure: DCT core with Input Frame (Stream A’); Motion Estimation & Compensation core OFF]
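The schedule change sequence on the preceding slides can be read as a communication interface that holds two schedules in its instruction memory and moves its PC from one to the other when a configuration stream arrives. The sketch below illustrates that reading only: the class and method names, the string encoding of the instructions and the choice to switch at the end of the current schedule pass are all assumptions, not details given on the slides.

```python
class ScheduledInterface:
    """Toy CI holding several schedules; the PC loops over the active one."""

    def __init__(self, schedules):
        self.schedules = schedules           # name -> list of instructions
        self.active = next(iter(schedules))  # first schedule is active at start
        self.pc = 0
        self.pending = None

    def configure(self, name):
        """Model a configuration stream word requesting another schedule."""
        if name not in self.schedules:
            raise KeyError(name)
        self.pending = name

    def step(self):
        """Issue one instruction; switch schedules only at a loop boundary."""
        instr = self.schedules[self.active][self.pc]
        self.pc += 1
        if self.pc == len(self.schedules[self.active]):
            self.pc = 0
            if self.pending is not None:
                self.active, self.pending = self.pending, None
        return instr


ci = ScheduledInterface({
    # Schedule 1 (P-frame): input frame to ME/MC, difference frame to DCT
    "p_frame": ["A: input -> ME/MC", "B: ME/MC -> DCT"],
    # Schedule 2 (I-frame): input frame straight to the DCT, ME/MC idle
    "i_frame": ["A': input -> DCT"],
})

trace = [ci.step()]
ci.configure("i_frame")          # configuration stream arrives mid-schedule
trace += [ci.step(), ci.step(), ci.step()]
```

In this run the P-frame schedule finishes its current pass before the I-frame schedule takes over, so no transfer is cut off halfway.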
Burleson/UMASS 24
Operating Frequency?
• Interconnect synchronized
  • H-tree clock distribution
• Core frequencies depend on critical path
  • Tile provides clock reference
  • Coreport provides asynchronous boundary
• Dynamic core configuration requires dynamic clock configuration
  • aSoC clock reference provides multiples of interconnect clock (… 4x, 2x, 1x, 0.5x, 0.25x, …)
  • Configured through the tile controller
Burleson/UMASS 25
Clock Distribution
64-tile aSoC    70nm         100nm        130nm        180nm
Chip Area       (9.24mm)²    (13.3mm)²    (17.2mm)²    (23.8mm)²
Frequency       5 GHz        2 GHz        1 GHz        0.5 GHz
Power           126 mW       240 mW       445 mW       784 mW
Mean Skew       41 ps        50 ps        92 ps        70.6 ps
Percent Skew    21 %         10 %         9 %          4 %

• Tiled architecture extends life of globally synchronous systems
• Precise H-tree implementation
  • Load is small and equal at each branch
• Skew can be reduced by 70% with advanced deskew circuits¹
¹ S. Tan et al., “Clock Generation and Distribution for the First IA-64 Microprocessor,” IEEE JSSC, Nov. 2000
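The percent skew row appears to be the mean skew expressed as a fraction of the clock period at each node. The short check below is a sketch under that assumption; the function and variable names are illustrative.

```python
# node -> (clock frequency in GHz, mean skew in ps), copied from the table
nodes = {
    "70nm": (5.0, 41.0),
    "100nm": (2.0, 50.0),
    "130nm": (1.0, 92.0),
    "180nm": (0.5, 70.6),
}


def percent_skew(freq_ghz, skew_ps):
    """Mean skew as a percentage of the clock period."""
    period_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps period
    return 100.0 * skew_ps / period_ps


skew_percent = {node: percent_skew(f, s) for node, (f, s) in nodes.items()}
for node, value in skew_percent.items():
    print(f"{node}: {value:.2f}% of the clock period")
```

The computed values are 20.5%, 10.0%, 9.2% and about 3.5%, which match the table's 21%, 10%, 9% and 4% after rounding. The trend shows why skew becomes harder to absorb as the clock period shrinks at smaller nodes.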
Burleson/UMASS 26
Mixed vs. Fixed Core Frequencies
• Cores not designed with clock gating
• Core power from Synopsys RTL simulation
• Interconnect from SPICE
• Assumes 10 cycle schedule, 4 pixels/word

                         Optimal Independent Frequencies    Fixed Worst Case 105 MHz
Core: Mode               Frequency (MHz)    Power (mW)      Power (mW)
ME: Full Search          105                973             973
ME: Spiral               9.9                76              659
ME: Three Step Search    2.75               25              580
DCT                      9.6                54              349
Interconnect             6.34               0.14            0.81
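The benefit of independent clocking can be read off the table by comparing the two power columns row by row. The sketch below only does that arithmetic on the tabulated numbers; the dictionary layout and names are illustrative.

```python
# row -> (power in mW at its optimal independent frequency,
#         power in mW at the fixed worst-case 105 MHz clock)
power_mw = {
    "ME: Full Search": (973.0, 973.0),
    "ME: Spiral": (76.0, 659.0),
    "ME: Three Step Search": (25.0, 580.0),
    "DCT": (54.0, 349.0),
    "Interconnect": (0.14, 0.81),
}

# Percentage of power saved by running each block at its own frequency.
savings = {
    row: 100.0 * (1.0 - optimal / fixed)
    for row, (optimal, fixed) in power_mw.items()
}

for row, percent in savings.items():
    print(f"{row}: {percent:.1f}% saved with independent clocking")
```

Full search needs the worst-case clock anyway, so it saves nothing, while the lighter modes save between roughly 83% and 96% because a core without clock gating keeps burning power at 105 MHz even when its workload needs only a few MHz.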
Burleson/UMASS 27
Current Density and Clocking
• Red: fixed worst case clocking
• Short spikes of high current
• Green: optimal independent clocking
• Slow and low
• Optimal clocking eliminates current spikes (also improved battery life)
[Figure: Current vs. Time between Process Start and Deadline for ME: Full Search, ME: Spiral, ME: Three Step Search and DCT]
Burleson/UMASS 28
Power Distribution
64-tile aSoC        Vh        Vmh       Vml       Vl
Voltage             1.8V      1.16V     0.73V     0.6V
Current per Core    110mA     25mA      13mA      7mA
Total Power         12.1 W    1.86 W    607 mW    269 mW

• Heterogeneous power-aware cores require multiple power supply voltages
• Tile structure enables uniform interwoven grid
• Larger grid for higher current demands
  • Reduced resistance
  • Higher capacitance
[Figure: interwoven power grid with Gnd, Vh, Vmh, Vml and Vl rails]
Burleson/UMASS 29
Advanced Signaling Techniques (building on SRC-funded work)
Differential current sensing
Booster Insertion
Multi-level current signaling
Phase coding
Burleson/UMASS 30
Interconnect Characterization
Comparing delay and power of signaling techniques for different tile sizes at 250nm, 180nm, 130nm and 100nm (available via the web-based tool Network on Chip Interconnect Calculator, NoCIC)
Burleson/UMASS 31
Conclusions
• Regular Tiled Architecture
  • Task-based parallelism using heterogeneous cores
  • Predictable interconnect
  • Regular core interface, Vdd and clock control, and configuration control
• Static scheduling
  • High-level global schedule of inter-core communication
  • Accommodates dynamic workloads with queues and local handshakes
• Demonstration using Motion Estimation and DCT
  • Variable search window and search algorithm provide power/quality tradeoff
• Power savings using scalable approaches to dynamic clock and power variation
  • Simple clock dividers leveraging existing clock distribution methods
  • Route multiple power supplies to allow rapid switching and avoid overhead of on-chip power regulation
Burleson/UMASS 32
Ongoing Work
• Satellite Set-top Box application
  • Developed at Hughes Networks using 7 distinct RISC cores. Compare ASOC with in-house shared memory approach for interconnections.
• New and more complete wireless and multimedia systems
  • JPEG2000, MPEG-4, 3D Graphics, …
• ASOC parameter optimization
  • Tile sizes, bus widths, clocks, VDDs
• Coping with Core irregularity
  • Size, I/O positions, shapes, bus widths, communication interfaces
• Interconnect circuit optimization (NoCIC)
• Leakage Power issues
• Reliability, Test, Fault-Tolerance and Security
• Compilation: especially Partitioning, Mapping
• Prototypes: 0.18 µm MOSIS of communication interface, ~25K transistors, verification of interface logic and timing
• ASOC in Education: Circuits, architecture and core design projects
Burleson/UMASS 33
Implications (perhaps controversial)
• Multi-core architectures will be needed to maintain Moore’s law (interconnect, memory, parallelism)
• Task-based parallelism may be easier to program, extract and implement than data parallelism (think multi-core rather than instruction level parallelism)
• Global coarse synchronization provides an approach to hard real-time computing for dynamic workloads (e.g. video coding).
• Dynamic power savings exploiting fine-grain workload variations can be achieved through straightforward clock and power scaling methods.
• Interconnect standards will be specified by silicon foundries, similar to cell libraries and memories.
Burleson/UMASS 34
Design Flow
http://vsp2.ecs.umass.edu/vspg/658/TA_Tools/design_flow.html
• Architecture to Layout
  • Architecture: Block diagram of system and behavioral description
  • Logic: Gate level or schematic description
  • Circuit: Transistor configurations and sizings
  • Layout: Floorplanning, clock and power distribution
• Tools
  • VerilogXL: behavioral representation
  • VTVT: standard cell library
  • Synopsys: standard cell gate level netlist generation
  • Silicon Ensemble: standard cell netlist to layout
  • Cadence LayoutPlus: schematic and layout design
  • NCSU CDK: design and extraction rules
  • Cadence Layout vs. Schematic: layout verification
  • HSPICE: circuit simulator