©Alex Doboli 2006
Performance Improvement through Customization
Alex Doboli, Ph.D.
Department of Electrical and Computer Engineering
State University of New York at Stony Brook
Email: [email protected]
Overview of the Chapter
• Design methods for optimizing the system performance through architecture customization to the application characteristics
• Design methodology for execution time speedup of time critical applications
• Considered architecture: one processor and one co-processor
• Steps: specification, profiling, identification of performance-critical blocks, functional partitioning, hardware-software partitioning, hardware resource allocation, mapping to resources, scheduling
• PSoC programmable blocks supporting customization:
  – Programmable digital blocks
  – Blocks with dedicated functionality: PWM, MAC, Decimator
Application Specific Customization
• Design methods for optimizing the system performance through architecture customization to the application characteristics
• A block (subroutine) is critical with respect to performance P if P changes significantly when the block is modified
  – Customization to reduce the system execution time
• Related design steps:
  – Profiling
  – Selecting the blocks to implement in HW
  – Finding the nature of the HW circuits used
  – Implementing the system
Implementing the System
• Layered implementation for reusability:
  – Circuit layer
  – Low-level firmware layer: ISRs, drivers, physical addresses and data (bits, registers)
  – High-level firmware layer (API): used in applications; symbolic names, abstract data
  – Application layer
Design tasks
• Architectures with a single CPU and associated coprocessors:
  – Critical parts are implemented as dedicated coprocessors
  – Tasks:
    • Hardware-software partitioning
    • Hardware resource allocation
• Architectures with a single CPU and shared coprocessors:
  – Critical parts share hardware (lower cost, some performance loss)
  – Tasks:
    • Hardware-software partitioning
    • Hardware resource allocation
    • Binding (mapping) of blocks to hardware
    • Scheduling
Design tasks
• Architectures with multiple CPUs and shared coprocessors:
  – Multimedia, image processing, telecommunications
  – Tasks:
    • Hardware-software partitioning
    • Allocating CPUs
    • Allocating interconnect structure
    • Hardware resource allocation
    • Binding (mapping) of critical blocks to hardware
    • Mapping software to CPUs
    • Mapping data communications to interconnect
    • Scheduling
    • Etc.
Design Flow
Discussed Design Methodology
Related design tasks:
• Specification and performance profiling
• Hardware-software partitioning
• Hardware resource allocation
• Binding (mapping) of critical blocks to hardware
• Scheduling
Processor-Coprocessor Architecture
Data Intensive Application
Application characteristics:
  – Data dominated: large number of computations
  – Known number of loop iterations
  – Iterations are uncorrelated, or have few correlations (which can be eliminated through transformations)
Array Organization in Memory
• Two-dimensional arrays are stored at consecutive memory addresses
• Element A[m][n] is at index i = m × SZ + n, where SZ is the number of elements in one row
• Assumption: data length = memory word length (e.g., one byte for PSoC)
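The index formula can be checked with a short C sketch. The function names `flat_index` and `read_element` are illustrative, and the one-byte element type follows the slide's assumption that the data length equals the memory word length.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index of A[m][n] when each row holds sz elements stored consecutively. */
size_t flat_index(size_t m, size_t n, size_t sz)
{
    assert(n < sz);               /* the column must lie inside the row */
    return m * sz + n;
}

/* Read A[m][n] from a byte array stored at consecutive addresses. */
unsigned char read_element(const unsigned char *base,
                           size_t m, size_t n, size_t sz)
{
    return base[flat_index(m, n, sz)];
}
```

For a 2 x 3 array, element A[1][2] lands at index 1 × 3 + 2 = 5, the last byte of the array.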
Profiling
• Related steps:
  – Produce assembly code
    • More precise determination of execution time and memory requirements (code and data)
    • Can be done statically
  – Functional partitioning
    • Hardware-related code reorganization
    • Code organized as blocks, where each block is a well-defined HW circuit
  – Find the performance-critical blocks
    • Find the sensitivity of the system performance with respect to the individual blocks
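One common way to quantify this sensitivity is Amdahl's law, shown here as a sketch; the slides do not say which measure the author uses. The function names and the cycle counts in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of the total execution time spent in block i,
   given the profiled cycle count of every block. */
double block_share(const unsigned long *cycles, size_t nblocks, size_t i)
{
    assert(i < nblocks);
    unsigned long total = 0;
    for (size_t k = 0; k < nblocks; ++k)
        total += cycles[k];
    return total ? (double)cycles[i] / (double)total : 0.0;
}

/* Overall speedup when a block with the given share of the execution
   time runs `factor` times faster (Amdahl's law). */
double overall_speedup(double share, double factor)
{
    assert(share >= 0.0 && share <= 1.0 && factor > 0.0);
    return 1.0 / ((1.0 - share) + share / factor);
}
```

With made-up profile data of 100, 800, and 100 cycles, the second block accounts for 80% of the time, and making it four times faster in hardware gives an overall speedup of 2.5. A block with a small share barely moves the total, so it is not a good candidate for hardware.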
Block Structure
Refined Block Structure
Customized Hardware
Customized Hardware
• Hardware for Blocks 10-1 and 11-1:
Customized Hardware
• Data path for Blocks 10-1 and 11-1
Customized Hardware
• Controller circuits for Blocks 10-1 and 11-1
Controller
Customized Hardware
• Controller circuit for Blocks 10-1 and 11-1
Counter Functionality
Data Flow Graph
Scheduling
Programmable Digital Blocks
Programmable Digital Block
Programmable Digital Block
Programmable digital block inputs and outputs:
• DATA (primary input):
  – RI (4 bits): connections to GPIO, DB
  – RO (4 bits)
  – Broadcast: BCROW (4 rows)
  – ACMP (comparator outputs)
  – Input of the previous block
  – High, Low
• AUX (auxiliary input): RI (4 bits)
• PO (primary output): RO (4 bits), GOO, GOE
Programmable Digital Block
Programmable digital block outputs and clock:
• AO (auxiliary output): RO (4 bits)
• CLK (separate for each digital block): SYSCLKX2, CLK32, VC1, VC2, VC3, Broadcast, RI, RO, Low, High, CLK of previous digital block
Programmable Digital Blocks
• Related registers:
• Data: DR0, DR1, DR2
• Function: CR0, FN
• Inputs: IN
• Outputs: OU
• Interrupts: INT
• Functions: timer, counter, deadband, CRC
Related Registers: registers DRx
• Related registers: DR0, DR1, DR2
Register DBB00 DBB01 DCB02 DCB03
DR0 0,20H 0,24H 0,28H 0,2CH
DR1 0,21H 0,25H 0,29H 0,2DH
DR2 0,22H 0,26H 0,2AH 0,2EH
Related Registers: registers DRx
• Related registers: DR0, DR1, DR2
Register DBB10 DBB11 DCB12 DCB13
DR0 0,30H 0,34H 0,38H 0,3CH
DR1 0,31H 0,35H 0,39H 0,3DH
DR2 0,32H 0,36H 0,3AH 0,3EH
Related Registers: registers DRx
• Related registers: DR0, DR1, DR2
Register DBB20 DBB21 DCB22 DCB23
DR0 0,40H 0,44H 0,48H 0,4CH
DR1 0,41H 0,45H 0,49H 0,4DH
DR2 0,42H 0,46H 0,4AH 0,4EH
Related Registers: registers DRx
• Related registers: DR0, DR1, DR2
Register DBB30 DBB31 DCB32 DCB33
DR0 0,50H 0,54H 0,58H 0,5CH
DR1 0,51H 0,55H 0,59H 0,5DH
DR2 0,52H 0,56H 0,5AH 0,5EH
Related Registers: registers CR0 & FN
• Related registers: CR0 and FN
Register DBB00 DBB01 DCB02 DCB03
CR0 0,23H 0,27H 0,2BH 0,2FH
FN 1,20H 1,24H 1,28H 1,2CH
Related Registers: registers CR0 & FN
• Related registers: CR0 and FN
Register DBB10 DBB11 DCB12 DCB13
CR0 0,33H 0,37H 0,3BH 0,3FH
FN 1,30H 1,34H 1,38H 1,3CH
Related Registers: registers CR0 & FN
• Related registers: CR0 and FN
Register DBB20 DBB21 DCB22 DCB23
CR0 0,43H 0,47H 0,4BH 0,4FH
FN 1,40H 1,44H 1,48H 1,4CH
Related Registers: registers CR0 & FN
• Related registers: CR0 and FN
Register DBB30 DBB31 DCB32 DCB33
CR0 0,53H 0,57H 0,5BH 0,5FH
FN 1,50H 1,54H 1,58H 1,5CH
Related Registers: registers IN & OU
• Related registers: IN and OU
Register DBB00 DBB01 DCB02 DCB03
IN 1,21H 1,25H 1,29H 1,2DH
OU 1,22H 1,26H 1,2AH 1,2EH
Related Registers: registers IN & OU
• Related registers: IN and OU
Register DBB10 DBB11 DCB12 DCB13
IN 1,31H 1,35H 1,39H 1,3DH
OU 1,32H 1,36H 1,3AH 1,3EH
Related Registers: registers IN & OU
• Related registers: IN and OU
Register DBB20 DBB21 DCB22 DCB23
IN 1,41H 1,45H 1,49H 1,4DH
OU 1,42H 1,46H 1,4AH 1,4EH
Related Registers: registers IN & OU
• Related registers: IN and OU
Register DBB30 DBB31 DCB32 DCB33
IN 1,51H 1,55H 1,59H 1,5DH
OU 1,52H 1,56H 1,5AH 1,5EH
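The register tables follow a regular pattern: each digital block occupies four consecutive addresses, blocks in a row are 4 apart, and rows are 10H apart, starting at 20H. DR0, DR1, DR2, and CR0 sit in bank 0; FN, IN, and OU sit in bank 1. The C sketch below encodes that pattern as read off the tables; the type and function names (`dig_reg_t`, `reg_loc_t`, `dig_block_reg`) are illustrative and not part of any PSoC library.

```c
#include <assert.h>
#include <stdint.h>

/* Registers of one digital block, in table order. */
typedef enum { REG_DR0, REG_DR1, REG_DR2, REG_CR0,
               REG_FN, REG_IN, REG_OU } dig_reg_t;

/* A register location written as "bank,address" in the tables. */
typedef struct { uint8_t bank; uint8_t addr; } reg_loc_t;

/* Location of a register of the digital block at (row, col), both 0..3. */
reg_loc_t dig_block_reg(unsigned row, unsigned col, dig_reg_t reg)
{
    assert(row < 4 && col < 4);
    uint8_t base = (uint8_t)(0x20u + 0x10u * row + 4u * col);
    reg_loc_t loc;
    if (reg <= REG_CR0) {            /* bank 0: DR0, DR1, DR2, CR0 */
        loc.bank = 0;
        loc.addr = (uint8_t)(base + (unsigned)reg);
    } else {                         /* bank 1: FN, IN, OU */
        loc.bank = 1;
        loc.addr = (uint8_t)(base + ((unsigned)reg - (unsigned)REG_FN));
    }
    return loc;
}
```

For example, `dig_block_reg(3, 3, REG_CR0)` yields bank 0, address 5FH, matching the CR0 entry for DCB33.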
Interconnect
Interconnect
Row Digital Interconnect (RDI)
Programmable Clocks
• Clocks are programmed using register IN (bits 3-0)
• Synchronized with SYSCLK2 or SYSCLKx2
  – Programmed using register OU (bits 7-6)
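Updating these bit fields without disturbing the other bits of the register is a read-modify-write operation, sketched below in C. The slides do not list the selection codes, so the functions take the raw field value as a parameter; the function names are illustrative, and the registers are modelled as plain byte values rather than real hardware accesses.

```c
#include <assert.h>
#include <stdint.h>

/* Return in_value with bits 3-0 replaced by a clock selection code. */
uint8_t in_set_clock(uint8_t in_value, uint8_t clock_code)
{
    assert(clock_code <= 0x0Fu);     /* the field is 4 bits wide */
    return (uint8_t)((in_value & 0xF0u) | (clock_code & 0x0Fu));
}

/* Return ou_value with bits 7-6 replaced by a synchronization code. */
uint8_t ou_set_sync(uint8_t ou_value, uint8_t sync_code)
{
    assert(sync_code <= 0x03u);      /* the field is 2 bits wide */
    return (uint8_t)((ou_value & 0x3Fu) | ((unsigned)(sync_code & 0x03u) << 6));
}
```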
Timer Block
Timer Functionality
• Timer block functionality:
  – Terminal count:
    • Generates a signal with programmable timing frequency
    • Write function
    • Main timer function
    • Generate interrupt
  – Compare functionality:
    • Write compare value
    • Read compare value
    • Compare function
    • Interrupts
  – Capture functionality:
    • Read value in register DR0
Timer Block Data Flow
Main Timer Functionality
Timing Diagram
• Useful for generating interrupts after a certain amount of time
• Hardware support for implementing timing constraints
  – Maximum time range, minimum time range
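Choosing the timer period for a desired interrupt interval can be sketched in C as follows. This assumes an 8-bit down counter that reaches terminal count every (period + 1) clock ticks, which is not stated on the slides; the function name `timer8_period_for` is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Period value for an 8-bit timer so that terminal count occurs roughly
   every interval_us microseconds with a clock of clock_hz.
   Assumption: terminal count is reached every (period + 1) clock ticks.
   Returns -1 when the interval is outside the range of an 8-bit timer.
   Fractional ticks are truncated. */
int timer8_period_for(uint32_t clock_hz, uint32_t interval_us)
{
    assert(clock_hz > 0);
    uint64_t ticks = ((uint64_t)clock_hz * interval_us) / 1000000u;
    if (ticks < 1 || ticks > 256)
        return -1;                   /* too short or too long for 8 bits */
    return (int)(ticks - 1);
}
```

The two failure cases correspond to the minimum and maximum time ranges mentioned above: an interval shorter than one clock tick or longer than 256 ticks cannot be produced at that clock rate.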
Terminal Count Firmware Routines
Compare Functionality
Compare Function Firmware Routines
Capture Related Functionality
Counter Functionality
Dead Band Circuit
Pulse Width Modulation (PWM)
• Function: produces signal with programmable period and pulse width
Duty cycle = Pulse width / Period
[Figure: PWM block diagram; labels: counter, interrupt (TC/compare), compare value, enable, BC, inverted/noninverted, enabled, standalone]
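The duty cycle relation can be turned into a small C helper that converts a desired duty cycle into a pulse width, both measured in counter ticks. This is a sketch that uses the definition on the slide (duty cycle = pulse width / period); the function names are illustrative and are not the PSoC PWM API.

```c
#include <assert.h>
#include <stdint.h>

/* Pulse width (in counts) that gives duty_percent of a period of
   `period` counts, rounded to the nearest count. */
uint16_t pwm_pulse_width(uint16_t period, unsigned duty_percent)
{
    assert(duty_percent <= 100);
    return (uint16_t)(((uint32_t)period * duty_percent + 50u) / 100u);
}

/* Duty cycle in percent = pulse width / period. */
double pwm_duty_percent(uint16_t pulse_width, uint16_t period)
{
    return period ? 100.0 * pulse_width / period : 0.0;
}
```

For a period of 200 counts, a 25% duty cycle corresponds to a pulse width of 50 counts.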
PWM Firmware Functions
• Possible functions:
  – PWM_Start
  – PWM_Stop
  – PWM_Write_PulseWidth
  – PWM_WritePeriod
  – PWM_bReadCounter
  – PWM_bReadPulseWidth
  – PWM_EnableInterrupts
  – PWM_DisableInterrupts
• Programmable blocks: counter
PWM Firmware Functions
PWM Firmware Functions
PWM Firmware Function
PWM API
Software PWM
[Figure: software PWM waveform; labels: pulse width, off time]
Software PWM
• Tuning capabilities:
  – Fine (in steps of 12 cycles)
  – Coarse (in steps of 30 cycles)
• Minimum pulse width is 30 clock cycles
• Minimum period is 60 clock cycles
• Not a solution if faster signals are needed
Multiply-Accumulate Circuit (MAC)
• Functionalities (selected using operands):
  – Fast multiplication (uses registers MUL_X and MUL_Y)
  – Multiplication followed by summing (uses registers MAC_X and MAC_Y)
[Figure: data paths for fast multiplication and for multiply-accumulate]
Exercise: Scalar Product (C code)
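The C listing from this slide is not preserved in the transcript. A minimal sketch of a scalar product in C is given here instead; the signed 8-bit element type, the 32-bit sum, and the function name `scalar_product` are assumptions, not the author's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar (dot) product of two vectors of n signed 8-bit elements.
   Each loop iteration is one multiply-accumulate step, which is the
   operation a MAC circuit performs in hardware. */
int32_t scalar_product(const int8_t *a, const int8_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```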
Object Code from C Compiler
Object Code from C Compiler
Assembly Code without MAC
Assembly Code using MAC
Experimental Results
• Execution time for different implementations (in clock cycles)
Vector size   C code without MAC   C code with MAC   Assembly code without MAC   Assembly code with MAC
16            8958                 6043              2861                        390
64            45177                23659             11932                       1580
256           -                    -                 52268                       6188
• In addition, 1494 clock cycles for initialization
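The relative gain between two implementations is the ratio of their cycle counts. The C sketch below computes that ratio from the measured values; the function name `speedup` is illustrative, and the 1494 initialization cycles are left out because the slide does not say which implementations they apply to.

```c
#include <assert.h>

/* Speedup of an improved implementation over a baseline,
   both given as execution times in clock cycles. */
double speedup(unsigned long baseline_cycles, unsigned long improved_cycles)
{
    assert(improved_cycles > 0);
    return (double)baseline_cycles / (double)improved_cycles;
}
```

For vectors of size 16, `speedup(8958, 390)` is about 23, comparing C code without the MAC against assembly code with the MAC. Comparing the two assembly versions, `speedup(2861, 390)` is about 7.3.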
Decimator Blocks
• Digital low-pass filtering and down-conversion
• Used in down-sampling after modulators
• Incremental ADC
• Functionality: integration at rate B, differentiation at rate B/M

$$H(z) = \left(\frac{1}{M}\right)^{2} \left[\frac{1}{1 - z^{-1}}\right]^{2} \left(1 - z^{-M}\right)^{2}$$
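The transfer function describes two integrators running at the input rate B followed by two differentiators running at the decimated rate B/M. The C sketch below is a software model of that structure, not the PSoC decimator hardware; the names `sinc2_t`, `sinc2_init`, and `sinc2_push` are illustrative. The output is left unscaled, so it is M² times the value given by H(z).

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned m;            /* decimation rate M */
    unsigned count;        /* inputs received since the last output */
    uint32_t acc1, acc2;   /* two integrators, updated at rate B */
    uint32_t prev1, prev2; /* delay elements of the two differentiators */
} sinc2_t;

void sinc2_init(sinc2_t *f, unsigned m)
{
    assert(m > 0);
    f->m = m;
    f->count = 0;
    f->acc1 = f->acc2 = 0;
    f->prev1 = f->prev2 = 0;
}

/* Feed one input sample. Returns 1 and stores an output in *out once
   every M inputs, otherwise returns 0. Unsigned arithmetic is used so
   that accumulator wrap-around is well defined and cancels in the
   differences. Divide the output by M*M to apply the (1/M)^2 factor. */
int sinc2_push(sinc2_t *f, int32_t x, int32_t *out)
{
    f->acc1 += (uint32_t)x;          /* first integration  */
    f->acc2 += f->acc1;              /* second integration */
    if (++f->count < f->m)
        return 0;
    f->count = 0;
    uint32_t d1 = f->acc2 - f->prev1;   /* first differentiation  */
    f->prev1 = f->acc2;
    uint32_t d2 = d1 - f->prev2;        /* second differentiation */
    f->prev2 = d1;
    *out = (int32_t)d2;
    return 1;
}
```

With M = 4 and a constant input of 1, the first output is still settling, and from the second output onward the filter produces 16 = M², which corresponds to a DC gain of 1 after the (1/M)² scaling.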
Resolution vs. M
Type 1 Decimator Circuit
• Functionality: single or double integration; differentiation is in software
[Figure: Type 1 decimator circuit; labels: input, first integration, second integration]
Type 2 Decimator Circuit
• Realizes integration and differentiation
• Results in registers DEC_DH (0, E4H) and DEC_DL (0, E5H)
• Writing to registers DEC_DH and DEC_DL clears the accumulator
• Programming using registers DEC_CR0, DEC_CR1, and DEC_CR2
Type 2 Decimator
• Register DEC_CR0 (0, E6H):
  – Selects analog comparator column that is gated (bits IGEN)
    • For incremental ADC
  – Selects gating signals from digital block (bit ICLKS0)
    • For incremental ADC
  – Selects analog comparator column (bits DCOL)
  – Selects clock for decimator registers (bit DCLKS0)
• Register DEC_CR1 (0, E7H):
  – Used for incremental ADC or DS ADC (bit ECNT)
  – Selects gating signals (bits ICLKS)
  – Selects clock for decimator registers (bit DCLKS)
  – Selects digital block latch (bit IDEC)
Type 2 Decimator
• Register DEC_CR2 (1, E7H):
  – Selects mode: type 1, incremental ADC, type 2 (bits Mode)
  – Data output shifting: no shift, one position shift, two positions shift, four positions shift (bits Data out shift)
  – Semantics of input data (bits Data format) for addition
    • Input '1' is always interpreted as '1'
    • Input '0' is interpreted as '-1' or '0'
  – Selects the decimator rate M (bits Decimation rate)
    • Off, 32, 50, 64, 125, 128, 250, 256
Decimator Circuit