ECSE 436
DSP architecture
Review of basic computer architecture concepts
C6000 architecture:
• VLIW principle and scheduling
• Addressing
• Assembly and linear assembly
• Pipelining
Instruction Set Architecture (ISA)
Computers run programs made of simple operations called “instructions”
The list of instructions offered by the machine is the “instruction set”
The instruction set is what is visible to the programmer (really the compiler, although humans can directly program in “assembly language”)
Many different DSPs can share the same ISA but have different hardware (i.e. the implementation of the ISA is different)
Instructions
Two kinds of information in a computer:
• instructions
• data
Instructions are stored as numbers, just like data
Instructions and data are stored in the memory
Basic Computer Organization
[Figure: CPU containing the registers, the PC, and the IR (OPCODE and OPERANDS fields), connected to memory by load and store paths]
Registers: a limited number of fast registers for temporary storage
Memory: a large amount of slow memory, arranged as an array of bytes
Instructions are loaded into the instruction register (IR) from the address pointed to by the program counter (PC). The PC is incremented by the instruction size (in bytes) for each new instruction, e.g. PC ← PC + 4
Load/Store Architecture (Reg-Reg)
[Figure: CPU with registers, PC, and IR, connected to memory by load and store paths]
• Instructions can ONLY get their data and store their data from/to registers.
• The register numbers are specified in the operand fields of the instruction
• Since data is stored in memory, we need special “load” and “store” instructions for transfers between registers and memory. These two instructions are the ONLY ones allowed to access memory
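The load/store rules above can be sketched with a toy machine. This is only an illustration: the three-opcode encoding, the tuple layout, and the name `run` are assumptions made for the example, not a real ISA.

```python
# Toy load/store machine (hypothetical encoding, not a real ISA).
def run(program, memory, num_regs=8):
    """Execute (opcode, operands...) tuples and return the register file."""
    regs = [0] * num_regs
    pc = 0                       # program counter, in bytes
    instr_size = 4               # assume every instruction occupies 4 bytes
    while pc // instr_size < len(program):
        opcode, *operands = program[pc // instr_size]   # fetch into the "IR"
        pc += instr_size                                # PC <- PC + 4
        if opcode == "load":      # load rd, addr: memory -> register
            rd, addr = operands
            regs[rd] = memory[addr]
        elif opcode == "store":   # store rs, addr: register -> memory
            rs, addr = operands
            memory[addr] = regs[rs]
        elif opcode == "add":     # add rs1, rs2, rd: registers only
            rs1, rs2, rd = operands
            regs[rd] = regs[rs1] + regs[rs2]
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return regs


memory = {0: 5, 4: 7, 8: 0}
program = [
    ("load", 0, 0),     # r0 <- mem[0]
    ("load", 1, 4),     # r1 <- mem[4]
    ("add", 0, 1, 2),   # r2 <- r0 + r1
    ("store", 2, 8),    # mem[8] <- r2
]
regs = run(program, memory)
```

Only `load` and `store` touch `memory`; `add` sees registers only, which is the defining property of a reg-reg architecture.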
C6000 Architecture
TMS320C62x/C64x: 16-bit fixed-point DSPs
TMS320C67x: 32-bit floating-point DSP
• its instruction set is a superset of the C62x instruction set
VLIW architecture: Very Long Instruction Word
VLIW
VLIW is an architecture that exploits instruction level parallelism (ILP) in the code
What is ILP?
An instruction is dependent on another if it uses (produces) a value produced (used) by the other instruction
Example
add c,d,e
mult b,e,a
The mult instruction must wait for the add instruction to finish before it can execute (sequential data flow)
[Figure: data-flow graph in which the add passes e to the mult]
Example
add a,b,e
add c,d,f
add e,f,g
The first two adds have no data dependency and could even be switched in the code with no effect on the correctness of the answer
The first two adds could be executed in parallel if we had the hardware to do it (two adders)
[Figure: data-flow graph. Two independent adders compute e = a + b and f = c + d; a third adder computes g = e + f]
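A minimal sketch of the dependence test used in these examples, assuming each instruction is written as a tuple `(op, src1, src2, dst)` with the destination last, as in the slides. The function name `depends_on` is made up for the illustration, and only the "uses a value produced by the other" direction (read after write) is checked.

```python
def depends_on(later, earlier):
    """True if `later` reads the value that `earlier` writes (read after write)."""
    _, *later_srcs, _ = later      # drop the opcode and the destination
    *_, earlier_dst = earlier      # the destination is the last field
    return earlier_dst in later_srcs


add1 = ("add", "a", "b", "e")
add2 = ("add", "c", "d", "f")
add3 = ("add", "e", "f", "g")

independent = not depends_on(add2, add1) and not depends_on(add1, add2)
```

Here `independent` is true for the first two adds, so they may run in parallel or be swapped, while `add3` depends on both of them.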
Scheduling
Given a set of hardware resources (functional units), e.g. a number of adders, multipliers, etc., instruction scheduling is the process of determining which instructions can be executed in parallel and which functional units to use on any given clock cycle
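As an illustration of scheduling, here is a small greedy scheduler. It is a sketch under simplifying assumptions that are not from the slides: every instruction takes one cycle, each value name is written only once, and only read-after-write dependences matter. The name `schedule` and the tuple format `(op, src1, src2, dst)` are hypothetical.

```python
def schedule(instrs, units):
    """Greedy scheduling of instrs onto functional units.

    instrs: list of (op, src1, src2, dst); units: {op: number of units}.
    Returns a list of cycles, each a list of instructions issued together.
    """
    ready_at = {}          # value name -> first cycle in which it can be read
    cycles = []            # cycles[i] = instructions issued in cycle i
    for instr in instrs:
        op, *srcs, dst = instr
        # earliest cycle in which all source values are available
        cycle = max((ready_at.get(s, 0) for s in srcs), default=0)
        # advance until a functional unit of the right kind is free
        while True:
            while len(cycles) <= cycle:
                cycles.append([])
            used = sum(1 for other in cycles[cycle] if other[0] == op)
            if used < units[op]:
                break
            cycle += 1
        cycles[cycle].append(instr)
        ready_at[dst] = cycle + 1
    return cycles


prog = [("add", "a", "b", "e"), ("add", "c", "d", "f"), ("add", "e", "f", "g")]
two_adders = schedule(prog, {"add": 2})
one_adder = schedule(prog, {"add": 1})
```

With two adders the three adds fit in two cycles; with a single adder they need three, which shows how the available functional units bound the exploitable parallelism.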
VLIW
VLIW is an architecture that depends on the user (compiler) to do the scheduling
Instructions are packed into a very long instruction word (256 bits)
There is no scheduling hardware on the chip, unlike a Pentium 4, which uses hardware (dynamic) scheduling
Benefits:
• simple hardware
Drawbacks:
• requires sophisticated compilers
• code compatibility: need to recompile if you use a different DSP, even one with the same ISA
C6713 Architecture
Maximum Performance
C6713:
• 8 functional units, two MACs per cycle
• 8 units × 225 MHz = 1800 MIPS
• 6 of the 8 units are floating point: 6 × 225 MHz = 1350 MFLOPS
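The peak figures follow from multiplying the number of units by the clock rate. A short check, assuming one instruction per functional unit per cycle (the variable names are illustrative):

```python
CLOCK_MHZ = 225
FUNCTIONAL_UNITS = 8
FLOATING_POINT_UNITS = 6

# Peak rates: every unit issues one operation on every clock cycle.
peak_mips = FUNCTIONAL_UNITS * CLOCK_MHZ
peak_mflops = FLOATING_POINT_UNITS * CLOCK_MHZ
```

These are upper bounds; real code reaches them only when the scheduler keeps every unit busy on every cycle.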
Addressing Modes
Load/store architecture: must load registers from memory, process data, and store back to memory
Linear (indirect) addressing:
• all 32 registers (A0-A15, B0-B15) can act as pointers
• *R: register R contains the address of the memory location where a data value is stored
Linear Addressing
*R++(d): R contains the address. After R is used, it is postincremented by displacement d (default d = 1); *R--(d) postdecrements
*++R(d): preincrement: R is incremented by d before it is used as the address; *--R(d) predecrements
*+R(d): the address is R + d, but R itself is not modified
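The three addressing forms can be modelled in a few lines. This is a sketch: the register file is a plain dictionary, the function names are invented for the example, and the displacement is counted in abstract address units, ignoring how the hardware scales it by the access size.

```python
def post_increment(regs, r, d=1):
    """*R++(d): use R as the address, then R <- R + d."""
    address = regs[r]
    regs[r] += d
    return address


def pre_increment(regs, r, d=1):
    """*++R(d): R <- R + d, then use the new R as the address."""
    regs[r] += d
    return regs[r]


def offset(regs, r, d=1):
    """*+R(d): the address is R + d; R itself is unchanged."""
    return regs[r] + d


regs = {"A4": 100}
first = post_increment(regs, "A4")       # address 100, A4 becomes 101
second = pre_increment(regs, "A4", 2)    # A4 becomes 103, address 103
third = offset(regs, "A4", 5)            # address 108, A4 stays 103
```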
Circular Addressing
Address Mode Register (AMR)
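The slides only name the AMR, so the following is a generic sketch of what circular addressing does rather than a description of the AMR bit fields. It assumes the buffer is a block whose size is a power of two and which is aligned to its size, so that only the low address bits wrap; the function name `circular_update` is hypothetical. The exact block-size encoding should be checked against the TI CPU reference guide.

```python
def circular_update(address, step, block_size):
    """Advance `address` by `step`, wrapping inside an aligned power-of-two block."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    mask = block_size - 1
    base = address & ~mask                     # upper bits stay fixed
    return base | ((address + step) & mask)    # only the low bits wrap


wrapped_up = circular_update(0x1006, 2, 8)     # runs off the end of the block
wrapped_down = circular_update(0x1000, -2, 8)  # runs off the start of the block
```

Stepping past either end of the block lands back inside it, which is what makes a delay line for a FIR filter cheap to maintain: no explicit bounds test is needed in the loop.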
TMS320 Assembly Language
[label][:] mnemonic [operand list] [; comment]
[x] means that x is optional
label: symbolic name for the address of the program line
mnemonic: instruction, assembler directive, or macro; cannot start in column 1
operands:
• constants: binary (e.g. 010101b), decimal, hexadecimal (e.g. 0x9f or 9fh)
• register names
• symbols defined by assembler directives
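To make the three constant notations concrete, here is a small parser for them. It is an illustrative sketch, not the assembler's actual lexer: the function name `parse_constant` is invented, and only the forms listed above are handled.

```python
def parse_constant(text):
    """Parse binary (010101b), hexadecimal (0x9f or 9fh) or decimal constants."""
    t = text.strip().lower()
    if t.startswith("0x"):       # C-style hexadecimal prefix
        return int(t[2:], 16)
    if t.endswith("h"):          # hexadecimal suffix; checked before 'b',
        return int(t[:-1], 16)   # since 'b' is also a hex digit (e.g. 1bh)
    if t.endswith("b"):          # binary suffix
        return int(t[:-1], 2)
    return int(t, 10)            # plain decimal
```

For example, `parse_constant("0x9f")` and `parse_constant("9fh")` both give 159, and `parse_constant("010101b")` gives 21.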
Assembler Directives
The assembler produces COFF (Common Object File Format) files
COFF files are divided into sections that contain instructions or data
Assembler directives are instructions to the assembler on how to manipulate these sections or to define constants
• they are not machine instructions
• see Section 4.1 in the text for more details
C6000 ISA
[Figure: instruction format, with the labels: parallel, conditional execution, functional unit]
Instruction Packing
    Instruction 1    ; instructions 1 and 2
    Instruction 2    ; are executed sequentially
    Instruction 3    ; instructions 3, 4, and 5
 || Instruction 4    ; are executed in parallel
 || Instruction 5
VelociTI: a fetch packet contains 1 to 8 execute packets
Sample Instructions
    ADD  .L1 A3,A7,A7   ; add A3+A7->A7
    SUB  .S1 A1,1,A1    ; subtract 1 from A1
    MPY  .M2 A7,B7,B6   ; mult 16 LSBs of A7,B7->B6
 || MPYH .M1 A7,B7,A6   ; mult 16 MSBs of A7,B7->A6
    LDH  .D2 *B2++,B7   ; load (B2) -> B7, inc B2
 || LDH  .D1 *A2++,A7   ; load (A2) -> A7, inc A2
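The difference between MPY and MPYH is only which halves of the 32-bit registers are multiplied. A sketch of that behaviour, assuming register contents are given as unsigned 32-bit integers and the halves are treated as signed 16-bit values (the function names are illustrative models, not TI APIs):

```python
def to_signed16(value):
    """Interpret the low 16 bits of `value` as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mpy(src1, src2):
    """Model of MPY: signed product of the 16 LSBs of both registers."""
    return to_signed16(src1) * to_signed16(src2)


def mpyh(src1, src2):
    """Model of MPYH: signed product of the 16 MSBs of both registers."""
    return to_signed16(src1 >> 16) * to_signed16(src2 >> 16)


a7 = 0x00030002     # high half 3, low half 2
b7 = 0x0005FFFF     # high half 5, low half -1
low_product = mpy(a7, b7)
high_product = mpyh(a7, b7)
```

Issuing the two multiplies in parallel on .M1 and .M2, as in the slide, produces both products in the same cycle.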
Loop:   MVKL .S1 x,A4     ; move 16 LSBs of x addr->A4
        MVKH .S1 x,A4     ; move 16 MSBs of x addr->A4
        SUB  .S1 A1,1,A1  ; decrement A1
 [A1]   B    .S2 Loop     ; branch to Loop if A1 != 0
        NOP  5            ; 5 NOP instructions
        STW  .D1 A3,*A7   ; store A3 into (A7)
Linear Assembly
To program a DSP effectively in assembly language, you need to do the scheduling by hand!
• need to account for the number of clock cycles each functional unit takes, etc.
This is difficult, so TI provides linear assembly:
• you don't have to schedule it; the compiler does it for you
• you can use CPU resources without worrying about scheduling, register allocation, etc.
Pipelining
Key technique to make fast CPUs
Multiple instructions are overlapped in execution
E.g. Automotive assembly line
Pipelining: principle
Building a car takes three steps:
• body (B): 1 hour
• paint (P): 1 hour
• wheels (W): 1 hour
Pipelining: principle (II)
[Figure: timeline from 0 to 6 hours. Bob works alone and does B1, P1, W1, B2, P2, W2 in sequence]
2 cars / 6 hours = 1/3 car / hour
Pipelining: principle (III)
[Figure: timeline from 0 to 6 hours with three workers (Bob, Alice, Bill), one per step. Bodies B1 to B6 start at hour 0, paint jobs P1 to P5 at hour 1, wheels W1 to W4 at hour 2]
1 car / hour (3 x speedup)
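The car example generalizes to any number of stages. A minimal sketch of the arithmetic, assuming every stage takes exactly one hour and the stages never stall (the function names are made up for the illustration):

```python
def total_time(n_items, n_stages, pipelined):
    """Hours needed to finish n_items when every stage takes one hour."""
    if n_items == 0:
        return 0
    if pipelined:
        # fill the pipeline once, then one item completes every hour
        return n_stages + (n_items - 1)
    # each item passes through all stages before the next one starts
    return n_stages * n_items


def speedup(n_items, n_stages):
    """Ratio of unpipelined to pipelined completion time."""
    return total_time(n_items, n_stages, False) / total_time(n_items, n_stages, True)
```

For the slides' numbers, `total_time(2, 3, False)` is 6 hours for two cars, while `total_time(4, 3, True)` is also 6 hours but for four cars. As the number of cars grows, `speedup(n, 3)` approaches 3 without ever reaching it, because of the time spent filling the pipeline.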
Pipelining: principle (IV)
[Figure: a block of combinational logic with its cycle time marked, shown unpipelined and pipelined]
Performance Gain
Pipelining a datapath into m stages can result in up to an m-fold improvement in cycle time
• e.g. a 5-stage pipelined processor is potentially 5 times faster than an unpipelined processor
In reality, the gain is less than m because of restrictions on overlapping instructions
5-Stage RISC Pipeline
16-Stage C6713 Pipeline
Fetch (4 stages): calculate address, send address, wait, receive
Decode (2 stages): separate fetch packets into execute packets
Execute (10 stages): different instructions require different numbers of cycles to execute
Software and I/O
Code efficiency and programming techniques
• Loop unrolling
• Software pipelining
I/O considerations
• Interrupts
• DMA
• Block processing
Code Efficiency
Intrinsic functions
• e.g. _add2, _mpy, _sadd
• see the TMS320C62x/C67x Programmer's Guide
Packed data
• use word accesses to operate on 16-bit data stored in the high and low parts of a 32-bit register
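The idea behind packed data can be shown with a model of a packed 16-bit add, in the spirit of the _add2 intrinsic. This is a Python model for illustration, not TI's implementation; the helper names `add2` and `pack` are assumptions made for the example.

```python
def pack(high, low):
    """Place two 16-bit values in the high and low halves of a 32-bit word."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def add2(a, b):
    """Two independent 16-bit sums carried in one 32-bit word."""
    low = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF            # carry out is dropped
    high = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (high << 16) | low


packed_sum = add2(pack(1, 2), pack(3, 4))
no_carry = add2(pack(0, 0xFFFF), pack(0, 1))
```

One word-wide load fetches two samples and one operation adds both pairs; note that a carry out of the low half does not leak into the high half.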
Loop Unrolling
A loop is a compact way of representing a repetitive sequence of instructions, but the loop condition test is overhead
To remove the loop overhead, unroll the loop (make copies of the loop code)
• a key way of exposing parallelism
• the compiler can now look across loop iterations to find parallel instructions
• parallelism is increased, but so is code size
Example
; program A: code without unrolling
        MVK  4,B0
loop:   LDH  *A5++,A0
 ||     LDH  *A6++,A1
        ADD  A0,A1,A2    ; add 4 times
        …
        SUB  B0,1,B0
 [B0]   B    loop
Example
; program B: code with unrolling once
        MVK  2,B0
loop:   LDH  *A5++,A0
 ||     LDH  *A6++,A1    ; add first 2 numbers
        ADD  A0,A1,A2
        …
        LDH  *A5++,A0
 ||     LDH  *A6++,A1    ; add other 2 numbers
        ADD  A0,A1,A2
        …
        SUB  B0,1,B0
 [B0]   B    loop
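The same transformation in a high-level form, as a sketch: programs A and B are mirrored by two hypothetical functions that add two arrays element by element. The unrolled version assumes an even number of elements, as program B does with its count of 2.

```python
def add_rolled(x, y):
    """One add per loop iteration (like program A)."""
    out = []
    i = 0
    count = len(x)               # like MVK 4,B0
    while count:                 # one test-and-branch per element
        out.append(x[i] + y[i])
        i += 1
        count -= 1
    return out


def add_unrolled_once(x, y):
    """Two adds per loop iteration (like program B)."""
    if len(x) % 2:
        raise ValueError("sketch assumes an even number of elements")
    out = []
    i = 0
    count = len(x) // 2          # like MVK 2,B0
    while count:                 # one test-and-branch per two elements
        out.append(x[i] + y[i])
        out.append(x[i + 1] + y[i + 1])
        i += 2
        count -= 1
    return out
```

Both functions return the same result; the unrolled one executes half as many loop tests and exposes two independent adds per iteration that a scheduler could place in parallel.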
Software Pipelining
Software pipelining
• a compiler technique (don't confuse it with hardware pipelining)
• schedules multiple iterations of a loop together to fill any empty cycles and maximize functional unit usage
• compiler options: -O2, -O3
The general idea of this optimization is to uncover long sequences of statements without branch statements
Reorganize loops to interleave instructions from different iterations
• dependent instructions within a single loop iteration are then separated from one another by an entire loop body
• this increases the possibilities for scheduling
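A minimal sketch of the interleaving, assuming a loop body made of three one-cycle stages that use different functional units. The stage names `load`, `multiply`, and `store` and the function `software_pipeline` are hypothetical choices for the example, not taken from the slides.

```python
def software_pipeline(n_iterations, stages=("load", "multiply", "store")):
    """Return, cycle by cycle, which (stage, iteration) pairs run together.

    Iteration i starts in cycle i, so stage s of iteration i runs in cycle i + s.
    Assumes one cycle per stage and no resource conflicts between stages.
    """
    n_cycles = n_iterations + len(stages) - 1
    schedule = [[] for _ in range(n_cycles)]
    for iteration in range(n_iterations):
        for s, stage in enumerate(stages):
            schedule[iteration + s].append((stage, iteration))
    return schedule


sched = software_pipeline(5)
```

In `sched`, the first cycles ramp up (only `load` of iteration 0 runs in cycle 0), the middle cycles each hold one stage from three different iterations, and the last cycles drain the loop. Within a steady-state cycle, no instruction depends on another one in the same cycle, since they belong to different iterations.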
[Figure: iterations 0 to 4 overlapped in time; one software-pipelined iteration spans several of the original iterations]
Advantage: yields shorter code than loop unrolling and uses fewer registers
Software pipelining is crucial for VLIW processors
Often, both software pipelining and loop unrolling are used
Interrupts
A signal that causes the processor to suspend its current program and execute a special subroutine, the interrupt service routine (ISR)
Sources:
• on-chip peripherals: timers, serial ports
• external: resets, external peripherals
• software interrupts: arithmetic exceptions (divide by zero, overflow)
Direct Memory Access
Data transfer without intervention of the processor
• memory and CPU
• peripherals and CPU
DMA channel:
• source address
• destination address
• element count in a frame
• number of frames in a block
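The four channel parameters can be pictured as a small descriptor. The sketch below is a software model only: the class `DmaChannel`, its field names, and `transfer_block` are hypothetical, memories are Python lists, and frames are assumed to be stored back to back.

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    source: int               # start index in the source memory
    destination: int          # start index in the destination memory
    elements_per_frame: int   # element count in a frame
    frames_per_block: int     # number of frames in a block


def transfer_block(channel, src_mem, dst_mem):
    """Copy one block (frames x elements) and return the number of elements moved."""
    count = channel.elements_per_frame * channel.frames_per_block
    for i in range(count):
        dst_mem[channel.destination + i] = src_mem[channel.source + i]
    return count


src = list(range(16))
dst = [0] * 16
moved = transfer_block(DmaChannel(2, 4, 3, 2), src, dst)
```

On real hardware the CPU only programs the descriptor; the copy itself proceeds in the background while the CPU executes other code.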
Block Processing
Ping-Pong Buffering
Ping-pong buffer (double buffer)
DMA channel delivers N samples of data in and out of buffers while the DSP operates on data in the current buffer
For the next block, the roles of the two buffers are swapped
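A sequential sketch of ping-pong buffering, assuming the input is a list of samples and `process` is any per-block function. The names `process_stream`, `fill`, and `work` are made up for the example; on real hardware the fill step would be done by the DMA channel concurrently with the processing, which a single-threaded model can only imitate.

```python
def process_stream(samples, block_size, process):
    """Process `samples` block by block with two alternating buffers."""
    buffers = [[0] * block_size, [0] * block_size]
    fill = 0                     # index of the buffer the "DMA" fills next
    output = []
    n_blocks = len(samples) // block_size
    # Prime: the DMA fills the first buffer before any processing starts.
    buffers[fill][:] = samples[0:block_size]
    for block in range(n_blocks):
        work = fill              # the buffer just filled becomes the work buffer
        fill = 1 - fill          # the roles swap for the next block
        nxt = (block + 1) * block_size
        if block + 1 < n_blocks:
            # On hardware this copy would overlap with process() below.
            buffers[fill][:] = samples[nxt:nxt + block_size]
        output.extend(process(buffers[work]))
    return output


doubled = process_stream([1, 2, 3, 4, 5, 6], 2, lambda buf: [2 * v for v in buf])
```

Because the DSP never reads the buffer that is being filled, each block must be processed in less than the time it takes to receive the next one, or data will be overwritten.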