![Page 1: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/1.jpg)
Design with Microprocessors
Tajana Simunic Rosing Department of Computer Science and Engineering University of California, San Diego.
ES Design
Verification and Validation
Hardware Hardware components
Embedded System Hardware
• Embedded system hardware is frequently designed in a loop (“hardware in a loop”):
  → cyber-physical systems
Hardware platform architecture
System-on-Chip platforms
Nvidia Tegra 2 die photo
Qualcomm Snapdragon block diagram
General processor
• Application processor (CPU)
Specialized units
• Graphics processing unit (GPU)
• Various digital signal processors (DSPs)
Processor comparison metrics
Clock frequency
• Computation speed
Memory subsystem
• Indeterminacy in execution
• Cache miss: compulsory, conflict, capacity
Power consumption
• Idle power draw, dynamic range, sleep modes
Chip area
• Can be critical for embedded form factor
Versatility/specialization
• FPGAs, ASICs
Non-technical
• Development environment, prior expertise, licensing
E.g. ARM Cortex-A, Intel Atom, TI C54x, TI 60x DSPs, Xilinx Virtex-7, single-purpose controller
Processor comparison metrics
Parallelism
• Superscalar pipeline
  • Depth & width → latency & throughput
• Multithreading
  • GPU workload requires different programming effort
Instruction set architecture
• Complex instruction set computer (CISC):
  • Many addressing modes
  • Many operations per instruction
  • E.g. TI C54x
• Reduced instruction set computer (RISC):
  • Load/store
  • Easy to pipeline
  • E.g. ARM
• Very Long Instruction Word (VLIW)
Processor comparison metrics
• How do you define “speed”?
  • Clock speed, but instructions per cycle may differ
  • Instructions per second, but work per instruction may differ
• Practical evaluation
  • Dhrystone: synthetic benchmark, developed in 1984
    • Dhrystones/sec, reported as Dhrystone MIPS (DMIPS), normalizes the difference in instruction count between RISC and CISC
    • 1 DMIPS = 1757 Dhrystones per second (based on Digital’s VAX 11/780)
  • SPEC: set of more realistic benchmarks, but oriented to desktops
  • EEMBC: EDN Embedded Microprocessor Benchmark Consortium, www.eembc.org
    • Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications
    • E.g. CoreMark, intended to replace Dhrystone
  • 0xbench: integrated Android benchmarks
    • Covers C library, system calls, JavaScript (web performance), graphics, Dalvik VM garbage collection
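The Dhrystone normalization can be sketched in a few lines of C. This is only an illustration of the arithmetic on the slide (1 MIPS = 1757 Dhrystones per second); the function names `dmips` and `dmips_per_mhz` are hypothetical and not part of any benchmark suite.

```c
/* Reference score of the VAX 11/780, as quoted on the slide. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystone score (iterations per second) to Dhrystone MIPS. */
double dmips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Normalize by clock frequency so cores at different clocks can be compared. */
double dmips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    return dmips(dhrystones_per_sec) / clock_mhz;
}
```

For example, a core that completes 3,514,000 Dhrystone iterations per second at 1000 MHz scores 2000 DMIPS, or 2.0 DMIPS/MHz.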
PARALLEL ARCHITECTURES
Parallelism in programs
• Exists in several levels of granularity
  • Task
  • Data
  • Instruction
[Figure: task-level parallelism among processes P1, P2, P3; instruction-level parallelism among Ld r1,r2 / Add r3,r4 / Sub r5,r6]
Parallelism extraction
• Static
  • Use compiler to analyze program code
  • Can make use of high-level language constructs
  • Cannot inspect data values
  • Simpler CPU control
• Dynamic
  • Use hardware to identify opportunities
  • Can make use of data values
  • More complex CPU
Superscalar
• Instruction-level parallelism
• Replicated execution resources
• RISC instructions are pipelined
  • n instructions/cycle → n² hardware
[Figure: register file connected to multiple execution units, with the links labelled n]
Simple VLIW architecture
• Compile-time assignment of instructions to function units (FUs)
• Large register file feeds multiple function units
[Figure: register file feeding the EBOX (execution unit); one instruction word "Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP" maps onto the units ALU, ALU, Load/store, Load/store, FU]
Clustered VLIW architecture
• Register file and function units are divided into clusters
[Figure: two clusters, each with its own execution units and register file, connected by a cluster bus]
Embedded processor trends
| | ARM11 | ARM Cortex-A8 | ARM Cortex-A9 | Qualcomm Scorpion | Qualcomm Krait[1] | ARM Cortex-A15 MPCore |
|---|---|---|---|---|---|---|
| Decode | single-issue | 2-wide | 2-wide | 2-wide | 3-wide | 3-wide |
| Pipeline depth | 8 stages | 13 stages | 8 stages | 10 stages | 11 stages | 15/17-25 stages |
| Out of Order Execution | No | No | Yes | Yes, non-speculative [2] | Yes | Yes |
| FPU | VFPv2 (pipelined) | VFPv3 (not pipelined) | VFPv3-D16 or VFPv3-D32 (typical) (pipelined) | VFPv3 (pipelined) | VFPv4 (pipelined) [3] | VFPv4 (pipelined) |
| NEON | None | Yes (Partially 128-bit wide) | Optional (Partially 128-bit wide) | Yes (128-bit wide) | Yes (128-bit wide) | Yes (128-bit wide) |
| Process Technology | 90 nm | 65/45 nm | 45/40/32/28 nm | 65/45 nm | 28 nm | 32/28 nm |
| L0 Cache | | | | | 4 kB + 4 kB direct mapped | |
| L1 Cache | Varying, typically 16 kB + 16 kB | 32 kB + 32 kB | 32 kB + 32 kB | 32 kB + 32 kB | 16 kB + 16 kB 4-way set associative | 32 kB + 32 kB per core |
| L2 Cache | Varying, typically none | 256 or 512 (typical) kB | 1 MB | 256 kB (Single-core)/512 kB (Dual-core) | 1 MB 8-way set associative (Dual-core)/2 MB (Quad-core) | up to 4 MB per cluster, up to 8 MB per chip |
| Core Configurations | 1 | 1 | 1, 2, 4 | 1, 2 | 2, 4 | 2, 4, 8 (4×2) |
| DMIPS/MHz speed per core | 1.25 | 2.0 | 2.5 | 2.1 | 3.3 | 3.5 |
NEON: Advanced SIMD extension, a combined 64- and 128-bit instruction set that provides standardized acceleration for media and DSP applications
Parallelization
• Multiple cores
• Multiple issue per core
• Out-of-order execution
• SIMD (NEON)
Process technology
• Higher density allows more parallel structures
• Higher power density, thermal issues
Why isn’t everything just massively parallel?
• Types of architectural hazards
  • Data (e.g. read-after-write, pointer aliasing)
  • Structural
  • Control flow
• Difficult to fully utilize parallel structures
  • Programs have real dependencies that limit ILP
  • Utilization of parallel structures depends on the programming model
  • Limited window size during instruction issue
  • Memory delays
• High cost of errors in prediction/speculation
  • Performance: stalls introduced to wait for reissue
  • Energy: power wasted going down the wrong execution path
ARMV7 ARCHITECTURE General/applications processing
ARMv7
• ARM assembly language: RISC
• ARM programming model
  • Audio players, pagers etc.; 130 MIPS
• ARM memory organization
• ARM data operations (32 bit)
• ARM flow of control
• Hardware-based floating point unit
ARM programming model
[Figure: register set r0 to r14 plus r15 (PC), each 32 bits wide (bits 31..0), and the Current Program Status Register (CPSR) holding the N, Z, C, V flags]
ARM status bits
• Arithmetic, logical, and shifting operations can set CPSR bits (in ARM state, when the instruction carries the S suffix, e.g. ADDS):
  • N (negative), Z (zero), C (carry), V (overflow)
• Examples:
  • -1 + 1 = 0: NZCV = 0110
  • 2^31 - 1 + 1 = -2^31: NZCV = 1001
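The two flag examples can be checked with a small C model of a 32-bit addition. This is a sketch of the flag definitions only, not of any particular core; the type `nzcv_t` and the functions `add_flags` and `nzcv_bits` are names invented for this illustration.

```c
#include <stdint.h>

/* The four condition flags, one field each (values 0 or 1). */
typedef struct {
    unsigned n, z, c, v;
} nzcv_t;

/* Flags produced by a 32-bit addition a + b. */
nzcv_t add_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;              /* wraps modulo 2^32 */
    nzcv_t f;
    f.n = r >> 31;                   /* sign bit of the result */
    f.z = (r == 0);                  /* result is zero */
    f.c = (r < a);                   /* unsigned carry out of bit 31 */
    /* Signed overflow: operands share a sign, result has the other sign. */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;
    return f;
}

/* Pack the flags as a 4-bit value in the order N Z C V. */
unsigned nzcv_bits(nzcv_t f)
{
    return (f.n << 3) | (f.z << 2) | (f.c << 1) | f.v;
}
```

With `a = 0xFFFFFFFF` (that is, -1) and `b = 1` the packed flags are binary 0110; with `a = 0x7FFFFFFF` (2^31 - 1) and `b = 1` they are binary 1001, matching the examples.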
ARM pipeline execution
[Figure: three-stage pipeline over time, with time steps 1, 2, 3 marked on the axis]
time →
add r0,r1,#5    fetch    decode   execute
sub r2,r3,r6             fetch    decode   execute
cmp r2,#3                         fetch    decode   execute
ARM data instructions
• ADD, ADC : add (w. carry)
• SUB, SBC : subtract (w. carry)
• MUL, MLA : multiply (and accumulate)
• AND, ORR, EOR
• BIC : bit clear
• LSL, LSR : logical shift left/right
• ASL, ASR : arithmetic shift left/right
• ROR : rotate right
• RRX : rotate right extended with C
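The less familiar operations in this list (BIC, ASR, ROR, RRX) can be modelled in portable C as below. This is a behavioural sketch on 32-bit values; the function names are hypothetical, and shift amounts are assumed to lie in 1..31 for `asr32`.

```c
#include <stdint.h>

/* BIC: clear the bits of x that are set in mask. */
uint32_t bic32(uint32_t x, uint32_t mask)
{
    return x & ~mask;
}

/* ASR: arithmetic shift right, replicating the sign bit (n in 1..31). */
uint32_t asr32(uint32_t x, unsigned n)
{
    uint32_t r = x >> n;
    if (x & 0x80000000u)
        r |= ~(0xFFFFFFFFu >> n);    /* fill vacated bits with ones */
    return r;
}

/* ROR: rotate right by n bits; bits leaving at the bottom re-enter at the top. */
uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* RRX: rotate right by one bit through the carry flag.
 * The old carry becomes bit 31; the old bit 0 would become the new carry. */
uint32_t rrx32(uint32_t x, unsigned carry_in)
{
    return ((uint32_t)(carry_in & 1u) << 31) | (x >> 1);
}
```

For instance, `ror32(1, 1)` gives `0x80000000`, while `rrx32(2, 1)` gives `0x80000001` because the carry is shifted in at the top.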
ARM flow of control
• All operations can be performed conditionally, testing CPSR:
  • EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE
• Branch operation: B #100
  • Can be performed conditionally
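How each condition suffix tests the CPSR flags can be written out as a C predicate. The mapping below follows the standard ARM condition-code definitions; the enum `cond_t` and the function `cond_passes` are names made up for this sketch.

```c
#include <stdbool.h>

typedef enum { EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE } cond_t;

/* Return true if an instruction with the given condition suffix would execute
 * for the given flag values. */
bool cond_passes(cond_t cond, bool n, bool z, bool c, bool v)
{
    switch (cond) {
    case EQ: return z;               /* equal */
    case NE: return !z;              /* not equal */
    case CS: return c;               /* carry set (unsigned >=) */
    case CC: return !c;              /* carry clear (unsigned <) */
    case MI: return n;               /* negative */
    case PL: return !n;              /* positive or zero */
    case VS: return v;               /* overflow */
    case VC: return !v;              /* no overflow */
    case HI: return c && !z;         /* unsigned > */
    case LS: return !c || z;         /* unsigned <= */
    case GE: return n == v;          /* signed >= */
    case LT: return n != v;          /* signed < */
    case GT: return !z && (n == v);  /* signed > */
    case LE: return z || (n != v);   /* signed <= */
    }
    return false;
}
```

As a usage example, comparing 5 with 3 computes 5 - 3 = 2 and leaves N=0, Z=0, C=1, V=0, so a following BGT or BHI is taken while BLT and BEQ are not.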
ARM comparison instructions
• CMP : compare
• CMN : negated compare
• TST : bit-wise AND
• TEQ : bit-wise XOR
• These instructions set only the NZCV bits of CPSR.
ARM load/store/move instructions
• LDR, LDRH, LDRB : load (half-word, byte)
• STR, STRH, STRB : store (half-word, byte)
• Addressing modes:
  • register indirect : LDR r0,[r1]
  • with second register : LDR r0,[r1,-r2]
  • with constant : LDR r0,[r1,#4]
• MOV, MVN : move (negated)
  MOV r0, r1 ; sets r0 to r1
Addressing modes
• Base-plus-offset addressing: LDR r0,[r1,#16]
  • Loads from location r1+16
• Auto-indexing increments base register: LDR r0,[r1,#16]!
• Post-indexing fetches, then does offset: LDR r0,[r1],#16
  • Loads r0 from r1, then adds 16 to r1
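The difference between the three forms is which address is used for the load and whether the base register is updated. The C model below is a sketch under simplifying assumptions: memory is a hypothetical word array indexed by byte address divided by 4, and each function returns both the loaded value and the resulting base register. All names (`ldr_t`, `ldr_offset`, `ldr_preindex`, `ldr_postindex`, `demo_mem`) are illustrative.

```c
#include <stdint.h>

/* Result of a modelled load: the value placed in r0 and the base register r1 afterwards. */
typedef struct {
    uint32_t value;
    uint32_t base;
} ldr_t;

/* Hypothetical demo memory: word 11 at byte address 0, word 44 at byte address 16. */
const uint32_t demo_mem[8] = { 11, 0, 0, 0, 44, 0, 0, 0 };

/* LDR r0,[r1,#off] : load from r1+off, r1 unchanged. */
ldr_t ldr_offset(const uint32_t *mem, uint32_t r1, uint32_t off)
{
    ldr_t out = { mem[(r1 + off) / 4], r1 };
    return out;
}

/* LDR r0,[r1,#off]! : load from r1+off, then write r1+off back to r1. */
ldr_t ldr_preindex(const uint32_t *mem, uint32_t r1, uint32_t off)
{
    ldr_t out = { mem[(r1 + off) / 4], r1 + off };
    return out;
}

/* LDR r0,[r1],#off : load from r1, then add off to r1. */
ldr_t ldr_postindex(const uint32_t *mem, uint32_t r1, uint32_t off)
{
    ldr_t out = { mem[r1 / 4], r1 + off };
    return out;
}
```

Starting from r1 = 0 with an offset of 16, the offset and auto-indexed forms both load 44, but only the auto-indexed form leaves r1 = 16; the post-indexed form loads 11 and then leaves r1 = 16.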
ARM subroutine linkage
• Branch and link instruction: BL foo
  • Copies the return address (the address of the instruction following the BL) to r14, the link register
• To return from subroutine: MOV r15,r14
ARM Summary
• Load/store architecture
• Most instructions are RISCy, operate in single cycle
  • Some multi-register operations take longer
• All instructions can be executed conditionally
• ARMv7-A is deployed in:
  • Cortex-A15 (Snapdragon Krait, Nvidia Tegra, TI OMAP), Cortex-A5 (AMD Fusion), Atmel microcontrollers
Next generation: ARMv8
• Addition of 64-bit support
  • Larger virtual address space
• New instruction set (A64)
  • Fewer conditional instructions
  • New instructions to support 64-bit operands
  • No arbitrary length load/store multiple instructions
• Enhanced cryptography (both 32 and 64-bit)
• Mostly backwards compatible with ARMv7
• Enable expansion into traditional/higher performance markets
  • Mobile phones, servers, supercomputers
  • Cortex-A53, Apple Cyclone, Nvidia Denver
GRAPHICS PROCESSING (GPU)
Graphics pipeline
• Primary architectural elements:
  • Vertex shaders, pixel shaders (fragment shaders)
• Performance measured in GFLOPS
Source: http://m.iopscience.iop.org/1742-5468/2009/06/P06016/figures
GPU programming
• Compute Unified Device Architecture (CUDA)
  • Enables GPGPUs: GPUs can be used for general purpose processing (i.e., not exclusively graphics)
• Single program multiple data (SPMD)
  • Programming has to be explicit
  • Threading directives and memory addressing
GPU programming
• A kernel scales across any number of parallel processors
  • Contains grid of thread blocks
• Thread block shape can be 1D, 2D, 3D
  • All threads in a thread block run the same code, over the same data in shared memory space
  • Threads have thread ID numbers within block to select work and address shared data
  • Threads in different blocks cannot cooperate
• Threads are grouped by warps (scheduling units)
[Figure: kernel queue (KQueue) holding Kernel #1 and Kernel #2; a thread block split into warps 0 to 5]
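The way a thread uses its block and thread IDs to select work can be emulated sequentially in plain C. This is a sketch for a 1D grid, not CUDA code: the two nested loops stand in for the hardware scheduler, the warp size of 32 is an assumption not stated on the slide, and every name here is hypothetical.

```c
#include <stddef.h>

#define WARP_SIZE 32u   /* assumption: warp width, not given on the slide */

/* Global index of a thread in a 1D grid of 1D blocks. */
size_t global_thread_id(size_t block_id, size_t block_dim, size_t thread_id)
{
    return block_id * block_dim + thread_id;
}

/* Warp that a thread belongs to within its block. */
size_t warp_of_thread(size_t thread_id)
{
    return thread_id / WARP_SIZE;
}

/* Number of warps needed to cover one thread block. */
size_t warps_per_block(size_t block_dim)
{
    return (block_dim + WARP_SIZE - 1) / WARP_SIZE;
}

/* Sequential emulation of a kernel launch: every (block, thread) pair runs the
 * same code and uses its global ID to pick one array element. */
void scale_kernel_emulated(float *data, size_t n, float factor,
                           size_t grid_dim, size_t block_dim)
{
    for (size_t b = 0; b < grid_dim; b++) {
        for (size_t t = 0; t < block_dim; t++) {
            size_t i = global_thread_id(b, block_dim, t);
            if (i < n)               /* guard: the grid may overshoot the array */
                data[i] *= factor;
        }
    }
}
```

Under these assumptions a block of 192 threads splits into six warps, the same count as in the figure, and thread 5 of block 2 handles element 389.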
Nvidia GeForce Block Diagram
• Ultra low power (ULP) version of GeForce used in Tegra 3, 4
• Desktop GTX 280 shown; embedded versions have different number of TPCs, SMs, etc.
Rajib Nath
[Figure: a Thread Processing Cluster (TPC) contains a geometry controller, an SMC, three Streaming Multiprocessors (SMs), and a texture L1 cache; each SM contains an instruction cache, multi-thread issue logic, a constant cache, eight Streaming Processors (SPs), two SFUs, and shared memory; each SP contains a register file, a floating point unit, and an integer unit]
Qualcomm Adreno 2xx
• Snapdragon S1-4 chipsets (e.g. MSM8x60)
  • Newest generation Adreno 5xx released in 2015+
• Unified shaders
  • Same instruction set for fragment and vertex processing
  • More versatile hardware
• 5-way VLIW
• Also used for non-gaming apps, e.g. browser
  • Behavior in non-gaming is not as well-understood
Qualcomm Adreno 2xx
• Snapdragon S4 chipset (MSM8960)
• “Digital core rail power” (red) includes GPU, video decode, & modem digital blocks
• GLBenchmark: high-end gaming content
  • CPU power up to 750 mW @ 1.5 GHz
  • GPU power up to 1.6 W @ 400 MHz
Source: http://www.anandtech.com/print/5559/qualcomm-snapdragon-s4-krait-performance-preview-msm8960-adreno-225-benchmarks
DIGITAL SIGNAL PROCESSING (DSP)
Digital signal processing (DSP)
• Processing or manipulation of signals using digital techniques
• Interfacing with the physical world
  • E.g. audio, digital images, speech processing, medical monitoring (EKG)
[Figure: Input Signal → ADC (Analog to Digital Converter) → Digital Signal Processor → DAC (Digital to Analog Converter) → Output Signal]
Source: Dr D. H. Crawford
Fundamental DSP Operations
• Filtering
  • Finite impulse response (FIR)
• Frequency transforms
  • Fast Fourier (FFT)
  • Discrete cosine (DCT)
  • Inverse discrete cosine (IDCT)
Source: Dr D. H. Crawford
$$y(n) = \sum_{i=0}^{L-1} a_i \, x(n-i)$$
Pseudo C code:
    for (n = 0; n < N; n++) {
        s = 0;
        for (i = 0; i < L; i++) {
            s += a[i] * x[n-i];
        }
        y[n] = s;
    }
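A runnable variant of the FIR pseudo code needs a rule for the first outputs, where `n - i` would index before the start of the input. The sketch below assumes the input is zero before sample 0, which is one common convention and not something the slide specifies; `fir_sample`, `fir_filter`, and the demo arrays are hypothetical names.

```c
#include <stddef.h>

/* One output sample: y[n] = sum over i of a[i] * x[n - i],
 * treating x[k] as 0 for k < 0 (assumed start-up convention). */
double fir_sample(const double *a, size_t L, const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < L && i <= n; i++)
        s += a[i] * x[n - i];        /* multiply-accumulate */
    return s;
}

/* Filter N input samples with an L-tap FIR filter. */
void fir_filter(const double *a, size_t L, const double *x, double *y, size_t N)
{
    for (size_t n = 0; n < N; n++)
        y[n] = fir_sample(a, L, x, n);
}

/* Hypothetical demo data: a 2-tap filter applied to a constant input. */
const double demo_taps[2]  = { 1.0, 2.0 };
const double demo_input[3] = { 1.0, 1.0, 1.0 };
```

With the demo data, the first output is 1 (only the first tap sees input) and later outputs settle at 1 + 2 = 3. The inner loop is a chain of multiply-accumulate steps, which is why DSPs provide dedicated hardware for that operation.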
DSP architectural features
• Fixed-point vs floating point
• VLIW or specialized SIMD techniques
  • E.g. Qualcomm Hexagon DSP (VLIW) dispatches up to 4 instructions to 4 execution units per cycle
• No virtual memory or context switching
• Separate instruction and data storage
  • Harvard architecture vs Von Neumann
• Pipelined FUs are integrated into the datapath
• Main DSP manufacturers:
  • Texas Instruments (http://www.ti.com)
  • Motorola (http://www.motorola.com)
  • Analog Devices (http://www.analog.com)
Source: Dr D. H. Crawford
What is DSP Used For?
• Speech processing
• Audio effects
• Image compression
• Video encoding
• Noise cancellation
• Virtual/augmented reality
• Actuation error detection
Speech processing
• Encoding
• Compression
• Synthesis
• Recognition
[Figure: speech waveform of “The blue spot is on the key again”, with segments labelled: “oo” in “blue”, “o” in “spot”, “ee” in “key”, “e” in “again”, “s” in “spot”, “k” in “key”]
Source: Dr D. H. Crawford
Image processing
• Trade off “good enough” quality in “essential” regions
• Enable higher transmission bandwidth, minimal storage, media interactivity
• Still image encoding: JPEG (Joint Photographic Experts Group)
  • JPEG2000: wavelet transform based
• Video encoding: MPEG (Moving Picture Experts Group)
  • MPEG-4 Part 10 (a.k.a. H.264/AVC)
    • Variable macroblock sizes (4x4 to 16x16)
    • Enhanced to allow lossless regions
Source: Dr D. H. Crawford
JPEG Codec
Encoding: Original pixel data → DCT → Coefficient quantization → Zig-zag run-length encoding → Huffman encoding → Compressed data
Decoding: Compressed data → Huffman decoding → Zig-zag run-length expansion → Coefficient denormalization → IDCT → Reconstructed pixel data
(The DCT/quantization stages are marked lossy; the run-length/Huffman stages are marked lossless.)
Source: Xilinx
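MPEG-2 Codec
Encoding
[Figure: original video → subtraction of the prediction (−) → DCT → quantization → variable length coder (VLC) → bitstream out. Other blocks shown in the prediction loop: inverse quantization, IDCT, adder (+), loop filter, frame store, motion estimation, motion compensation. Intraframe and interframe paths are marked.]
Source: Xilinx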
MPEG-2 Codec
Decoding
[Figure: bitstream buffer → variable length decoder (VLD) → inverse quantization → IDCT → adder (+) → output. Other blocks shown: motion compensation, frame store. Intraframe and interframe paths are marked.]
Source: Xilinx
DCT/IDCT Concept
• Used for lossy compression
• Discrete cosine transform (DCT)
  • Represent a finite sequence of data points with a sum of cosine (even) functions of different frequencies
  • Similar to the discrete Fourier transform (DFT), but using only real numbers
  • If a coefficient has a lot of variance over a set, then it cannot be removed without affecting the picture quality
• Inverse DCT
  • Reconstruct sequence from frequency coefficients
Source: Xilinx
2D DCT & IDCT
• Image divided into macroblocks of 8x8 pixels
• “The DCT” of each group is an 8x8 transform coefficient array; entries represent spatial frequencies
Source: Xilinx
DCT & IDCT operations
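• Dedicated functional unit: fused multiply-add
• Common to DCT/IDCT and many other DSP operations
Source: Xilinx
[The DCT and IDCT equations appeared as images on the original slide and were not preserved in the extracted text.]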
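For reference, the textbook 8x8 two-dimensional DCT and its inverse, in the form commonly used for JPEG and MPEG, are given below. This is the standard definition and is offered as an assumption about what the slide showed; the symbols $f(x,y)$ for pixel values, $F(u,v)$ for coefficients, and $C(\cdot)$ for the normalization factor are chosen here for illustration.

$$F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)\, \cos\!\left[\frac{(2x+1)u\pi}{16}\right] \cos\!\left[\frac{(2y+1)v\pi}{16}\right]$$

$$f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\, C(v)\, F(u,v)\, \cos\!\left[\frac{(2x+1)u\pi}{16}\right] \cos\!\left[\frac{(2y+1)v\pi}{16}\right]$$

$$C(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & k > 0 \end{cases}$$

Each coefficient is a sum of products of a data value and a constant cosine factor, which is why a multiply-add unit maps so directly onto both transforms.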
Group of Pictures (GOP)
• A structure of consecutive frames that can be decoded without any other reference frames
• Transmitted sequence is not the same as displayed sequence
• Inter-frame prediction/compression exploits temporal redundancy between neighboring frames
• Intra-frame coding is applied only to the current frame
Types of frames
• I frame (intra-coded)
  • Coded without reference to other frames
  • Begins each GOP
• P frame (predictive-coded)
  • Coded with reference to a previous reference frame (either I or P)
  • Size is usually about 1/3rd of an I frame
• B frame (bi-directional predictive-coded)
  • Coded with reference to both previous and future reference frames (either I or P)
  • Size is usually about 1/6th of an I frame
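Because a B frame needs its future reference frame before it can be decoded, the transmitted order differs from the display order. The C sketch below illustrates one simple reordering rule (hold B frames back until the next I or P frame has been emitted) and the rough size ratios quoted above; it is a toy model on frame-type strings, not a description of any real encoder, and the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Reorder a GOP from display order to a possible coded (transmitted) order.
 * 'display' is a string of frame types, e.g. "IBBPBBP".
 * 'out' must have room for strlen(display) + 1 characters. Returns 'out'. */
char *coded_order(const char *display, char *out)
{
    size_t n = strlen(display), w = 0, pending_b = 0;
    for (size_t i = 0; i < n; i++) {
        if (display[i] == 'B') {
            pending_b++;             /* wait for the future reference frame */
        } else {
            out[w++] = display[i];   /* emit the reference frame (I or P) first */
            while (pending_b > 0) {  /* then the B frames that depend on it */
                out[w++] = 'B';
                pending_b--;
            }
        }
    }
    while (pending_b > 0) {          /* trailing B frames with no later reference */
        out[w++] = 'B';
        pending_b--;
    }
    out[w] = '\0';
    return out;
}

/* Approximate GOP size in units of 1/6 of an I frame,
 * using the rough ratios I = 1, P = 1/3, B = 1/6. */
unsigned gop_size_sixths(const char *gop)
{
    unsigned total = 0;
    for (; *gop; gop++)
        total += (*gop == 'I') ? 6u : (*gop == 'P') ? 2u : 1u;
    return total;
}
```

For the display order IBBPBBP this rule yields IPBBPBB, and the estimated size is 14 sixths of an I frame, compared with 42 sixths if all seven frames were intra-coded.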
Fixed-Point Design
[Figure: design flow. Algorithm level: Idea → Floating-Point Algorithm → Quantization (with Range Estimation) → Fixed-Point Algorithm. Implementation level: Code Generation → Target System]
• Digital signal processing algorithms
  • Early development in floating point
  • Converted into fixed point for production to gain efficiency
• Fixed-point digital hardware
  • Lower area
  • Lower power
  • Lower per unit production cost
Copyright Kyungtae Han [2]
Fixed-Point Design
- All variables have to be annotated manually
  - Value ranges are well known
  - Avoid overflow
  - Minimize quantization effects
  - Find optimum wordlength
- Manual process supported by simulation
  - Time-consuming
  - Error prone
Copyright Kyungtae Han [2]
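One part of the manual annotation is choosing an integer wordlength large enough to avoid overflow. A minimal sketch of simulation-based range estimation follows; it assumes a signed two's-complement format whose integer wordlength includes the sign bit, and the function name `integer_wordlength` is hypothetical. It only covers the values actually observed in simulation, which is one reason the process is error prone.

```python
def integer_wordlength(samples):
    """Smallest signed integer wordlength (sign bit included) that holds all samples.

    A wordlength of iwl covers the half-open range [-2**(iwl-1), 2**(iwl-1)).
    Negative integer wordlengths (for very small signals) are not considered.
    """
    lo, hi = min(samples), max(samples)
    iwl = 1  # sign bit only: covers [-1, 1)
    while not (-(2.0 ** (iwl - 1)) <= lo and hi < 2.0 ** (iwl - 1)):
        iwl += 1
    return iwl

# Values a variable took during a (hypothetical) floating-point simulation run.
observed = [-3.2, 0.4, 5.9, 1.1]
print(integer_wordlength(observed))  # 4 bits cover [-8, 8)
```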
![Page 54: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/54.jpg)
Fixed-Point Representation
- Fixed-point type
  - Wordlength
  - Integer wordlength
- Quantization modes
  - Round
  - Truncation
- Overflow modes
  - Saturation
  - Saturation to zero
  - Wrap-around
[Figure: SystemC format (www.systemc.org). A word "S X X X X X" with sign bit S, annotated with its wordlength and integer wordlength; a second word "X X X X X" annotated with its wordlength and integer wordlength = -2]
Copyright Kyungtae Han [2]
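The quantization and overflow modes listed above can be sketched in a few lines. This is an illustrative model only: it assumes a signed two's-complement word whose integer wordlength includes the sign bit, takes "round" to mean round-half-up and "truncation" to mean rounding toward minus infinity, and uses hypothetical names (`quantize`, `"saturate_zero"`, `"wrap"`) that are not SystemC identifiers.

```python
import math

def quantize(value, wl, iwl, rounding="round", overflow="saturate"):
    """Quantize `value` to a signed fixed-point word of `wl` bits with `iwl` integer bits."""
    fwl = wl - iwl                          # fractional wordlength
    scaled = value * 2 ** fwl               # value expressed in units of one LSB
    if rounding == "round":
        code = math.floor(scaled + 0.5)     # round half up
    elif rounding == "truncate":
        code = math.floor(scaled)           # drop the extra fraction bits
    else:
        raise ValueError(f"unknown quantization mode: {rounding}")

    lo, hi = -(1 << (wl - 1)), (1 << (wl - 1)) - 1
    if code < lo or code > hi:              # overflow handling
        if overflow == "saturate":
            code = max(lo, min(hi, code))   # clamp to the nearest representable code
        elif overflow == "saturate_zero":
            code = 0                        # replace the result with zero
        elif overflow == "wrap":
            code = (code - lo) % (1 << wl) + lo   # two's-complement wrap-around
        else:
            raise ValueError(f"unknown overflow mode: {overflow}")
    return code / 2 ** fwl

print(quantize(0.1, 6, 1, rounding="truncate"))   # 0.09375
print(quantize(1.5, 6, 1, overflow="saturate"))   # 0.96875
print(quantize(1.5, 6, 1, overflow="wrap"))       # -0.5
```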
![Page 55: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/55.jpg)
Fixed-Point Representation
x = 0.5 × 0.125 + 0.25 × 0.125
  = 0.0625 + 0.03125
  = 0.09375
- For integer word length iwl = 1 and fractional word length fwl = 3 decimal digits, the less significant digits are automatically chopped off: x = 0.093
- Like a floating-point system with numbers ∈ (-1, 1), with no stored exponent (bits used to increase precision).
- ⇒ Automatic scaling: shifting after multiplications and divisions in order to maintain the binary point.
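The worked example can be reproduced with decimal arithmetic, and the automatic-scaling remark can be illustrated with a binary fixed-point multiply. The helper names `chop` and `q_mul` and the choice of 15 fraction bits are assumptions made for this sketch.

```python
from decimal import Decimal, ROUND_DOWN

def chop(value: Decimal, fwl: int) -> Decimal:
    """Keep `fwl` decimal fraction digits and drop the rest without rounding."""
    return value.quantize(Decimal(1).scaleb(-fwl), rounding=ROUND_DOWN)

x = Decimal("0.5") * Decimal("0.125") + Decimal("0.25") * Decimal("0.125")
print(x, chop(x, 3))                 # 0.09375 0.093

def q_mul(a: int, b: int, fwl: int) -> int:
    """Multiply two fixed-point codes that each carry `fwl` fraction bits.

    The raw product carries 2 * fwl fraction bits, so it is shifted right
    by fwl to put the binary point back where it was.
    """
    return (a * b) >> fwl

FWL = 15                             # assumed number of fraction bits
a = int(0.5 * 2 ** FWL)              # 0.5   -> 16384
b = int(0.125 * 2 ** FWL)            # 0.125 -> 4096
print(q_mul(a, b, FWL) / 2 ** FWL)   # 0.0625
```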
![Page 56: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/56.jpg)
TI C54x family (CISC)
- Modified Harvard architecture: separate buses for program code and data
  - PB: program read bus
  - CB, DB: data read buses
  - EB: data write bus
  - PAB, CAB, DAB, EAB: address buses
- Can generate two data memory addresses per cycle
  - Stored in auxiliary register address units
- High performance, reproducible behavior, optimized for different memory structures
![Page 57: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/57.jpg)
TI C54x architectural features
- 40-bit ALU & barrel shifter
  - Input from accumulator or data memory
  - Output to ALU
- 17 x 17 multiplier
- Single-cycle exponent encoder
- Two address generators with dedicated registers
- Accumulators
  - Low-order (0-15), high-order (16-31), guard (32-39)
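The accumulator layout above can be made concrete by splitting a 40-bit value into its three fields. This is a sketch of the bit ranges only; the function name `split_accumulator` is hypothetical and nothing here models actual C54x register access.

```python
def split_accumulator(acc: int):
    """Split a 40-bit accumulator value into (low, high, guard) fields."""
    acc &= (1 << 40) - 1             # keep only the 40 accumulator bits
    low = acc & 0xFFFF               # bits 0-15
    high = (acc >> 16) & 0xFFFF      # bits 16-31
    guard = (acc >> 32) & 0xFF       # bits 32-39
    return low, high, guard

# Adding three large 32-bit values overflows 32 bits; the guard bits absorb the carry.
total = 3 * 0x7FFFFFFF
low, high, guard = split_accumulator(total)
print(hex(low), hex(high), hex(guard))   # 0xfffd 0x7fff 0x1
```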
![Page 58: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/58.jpg)
TI C54x instruction set features
- Compare, select and store unit (CSSU)
  - Compares high and low accumulator words
  - Accelerates Viterbi operations
- Repeat and block repeat instructions
- Instructions that read 2 or 3 operands simultaneously
- Three IDLE instructions
  - Selectively shut down CPU, on-chip peripherals, whole chip including phase-locked loop
![Page 59: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/59.jpg)
TI C54x pipeline
- Prefetch: Send PC address on program address bus
- Fetch: Load instruction from program bus to IR
- Decode
- Access: Put operand addresses on busses
- Read: Get operands from busses
- Execute
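To see how the six stages overlap, the following sketch builds an idealized schedule in which one instruction enters the pipeline per cycle. It is a simplification under stated assumptions: no stalls, no branches, and every stage takes exactly one cycle; the function names are hypothetical.

```python
STAGES = ["Prefetch", "Fetch", "Decode", "Access", "Read", "Execute"]

def pipeline_schedule(n_instructions: int):
    """Map each cycle to the (instruction, stage) pairs active in it (ideal pipeline)."""
    schedule = {}
    for instr in range(n_instructions):
        for offset, stage in enumerate(STAGES):
            # Instruction i enters the pipeline in cycle i and advances one stage per cycle.
            schedule.setdefault(instr + offset, []).append((instr, stage))
    return schedule

def total_cycles(n_instructions: int) -> int:
    """Cycles needed to finish n instructions: fill the pipeline once, then one per cycle."""
    return n_instructions + len(STAGES) - 1 if n_instructions else 0

for cycle, active in sorted(pipeline_schedule(3).items()):
    print(cycle, active)
print(total_cycles(3))   # 8
```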
![Page 60: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/60.jpg)
Addressing Modes

| Addressing mode | Operand field | Register-file contents | Memory contents |
|---|---|---|---|
| Immediate | Data | | |
| Register-direct | Register address | Data | |
| Register indirect | Register address | Memory address | Data |
| Direct | Memory address | | Data |
| Indirect | Memory address | | Memory address, then Data |
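The five addressing modes differ only in how many lookups separate the operand field from the data. A minimal sketch, with the register file and memory modeled as dictionaries and a hypothetical helper `fetch_operand`:

```python
def fetch_operand(mode, operand, registers, memory):
    """Return the data selected by `operand` under the given addressing mode."""
    if mode == "immediate":
        return operand                          # operand field is the data
    if mode == "register-direct":
        return registers[operand]               # operand field names a register
    if mode == "register-indirect":
        return memory[registers[operand]]       # register holds a memory address
    if mode == "direct":
        return memory[operand]                  # operand field is a memory address
    if mode == "indirect":
        return memory[memory[operand]]          # memory holds the final address
    raise ValueError(f"unknown addressing mode: {mode}")

registers = {1: 20}            # register 1 holds the value 20
memory = {20: 99, 30: 20}      # address 30 points at address 20

for mode, operand in [("immediate", 7), ("register-direct", 1),
                      ("register-indirect", 1), ("direct", 30), ("indirect", 30)]:
    print(mode, fetch_operand(mode, operand, registers, memory))
```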
![Page 61: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/61.jpg)
TI C55x pipeline
Fetch (4 stages):
- Prefetch 1: Send address to memory
- Prefetch 2: Wait for response
- Fetch: Get instruction from memory and put in IBQ
- Predecode: Identify where instructions begin and end; identify parallel instructions
Execute (7-8 stages):
- Decode: Decode an instruction pair or single instruction
- Address: Perform address calculations
- Access 1/2: Send address to memory; wait
- Read: Read data from memory. Evaluate condition registers
- Execute: Read/modify registers. Set conditions
- W/W+: Write data to MMR-addressed registers, memory; finish
![Page 62: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/62.jpg)
C55x organization
Figure (block diagram): instruction unit, program flow unit, address unit, and data unit, connected by:
- 3 data read busses (16 bits each)
- 3 data read address busses (24 bits each)
- Program address bus (24 bits)
- Program read bus (32 bits), used for instruction fetch
- 2 data write busses (16 bits each), used for writes
- 2 data write address busses (24 bits each)
Data read from memory:
- D bus: single operand read
- C, D busses: dual operand read
- B bus: dual-multiply coefficient
![Page 63: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/63.jpg)
C55x hardware extensions
- Target image/video applications
  - DCT/IDCT
  - Pixel interpolation
  - Motion estimation
- Available in 5509 and 5510
  - Equivalent C-callable functions for other devices
![Page 64: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/64.jpg)
TI C62/C67 (VLIW)
- Up to 8 instructions/cycle
- 32 32-bit registers
- Function units
  - Two multipliers
  - Six ALUs
- Data operations
  - 8/16/32-bit arithmetic
  - 40-bit operations
  - Bit manipulation operations
![Page 65: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/65.jpg)
Partitioned register files
- Many memory ports are required to supply enough operands per cycle.
- Memories with many ports are expensive.
- ⇒ Registers are partitioned into sets, e.g. for TI C60x:
[Figure: register file A serves functional units L1, S1, M1, D1 (data path A); register file B serves D2, M2, S2, L2 (data path B); both data paths connect to the data bus and the address bus]
![Page 66: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/66.jpg)
C6x data paths
- General-purpose register files (A and B, 16 words each)
- Eight functional units: .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2
- Two load units (LD1, LD2)
- Two store units (ST1, ST2)
- Two register file cross paths (1X and 2X)
- Two data address paths (DA1 and DA2)
![Page 67: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/67.jpg)
C6x functional units
- .L
  - 32/40-bit arithmetic
  - Leftmost 1 counting
  - Logical ops
- .S
  - 32-bit arithmetic
  - 32/40-bit shift and 32-bit field
  - Branches
  - Constants
- .M
  - 16 x 16 multiply
- .D
  - 32-bit add, subtract, circular address
  - Load, store with 5/15-bit constant offset
![Page 68: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/68.jpg)
C6x system
- On-chip RAM
- 32-bit external memory: SDRAM, SRAM
- Host port
- Multiple serial ports
- Multichannel DMA
- 32-bit timer
![Page 69: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/69.jpg)
PROGRAMMABLE LOGIC DEVICES (PLD)
![Page 70: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/70.jpg)
Programmable Logic Devices (PLD)
- Simple PLD (SPLD)
  - Programmable logic array (PLA)
  - Programmable array logic (PAL): fixed OR plane
- Complex PLD (CPLD)
  - Building block: macrocells
- Variations:
  - Antifuse PLD
  - Erasable EPLD & EEPLD

| Name | Re-programmable | Volatile | Technology |
|---|---|---|---|
| Fuse | No | No | Bipolar |
| EPROM | Yes (out of circuit) | No | UVCMOS |
| EEPROM | Yes (in circuit) | No | EECMOS |
| SRAM | Yes (in circuit) | Yes | CMOS |
| Antifuse | No | No | CMOS+ |
![Page 71: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/71.jpg)
Antifuse PLDs
- Actel Axcelerator family
- Antifuse:
  - Open when not programmed
  - Low resistance when programmed
![Page 72: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/72.jpg)
Erasable Programmable Logic Devices
- EPLDs: CMOS erasable programmable ROM (EPROM) erased by UV light
- Altera’s building block is a MACROCELL
[Figure: macrocell built from an 8-product-term AND-OR array plus programmable MUXes: AND array, invert control (programmable polarity), sequential logic block with clock MUX (CLK), output MUX, feedback MUX (programmable feedback), and I/O pin pad]
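The combinational part of such a macrocell can be modeled as a sum of products followed by an optional inversion. The sketch below is a simplified model, not Altera's actual macrocell: it leaves out the flip-flop, clock MUX, and feedback path, and the names `macrocell` and `MAX_PRODUCT_TERMS` are assumptions.

```python
MAX_PRODUCT_TERMS = 8   # size of the AND-OR array shown in the macrocell figure

def macrocell(inputs, product_terms, invert=False):
    """Evaluate the AND-OR array of one macrocell.

    inputs:        dict mapping signal name -> bool
    product_terms: list of dicts, each mapping signal name -> required level;
                   a term is true when every listed signal has its required level
    invert:        programmable output polarity
    """
    if len(product_terms) > MAX_PRODUCT_TERMS:
        raise ValueError("function does not fit into one macrocell")
    or_out = any(
        all(inputs[name] == level for name, level in term.items())
        for term in product_terms
    )
    return (not or_out) if invert else or_out

# XOR of a and b as two product terms: a AND NOT b, NOT a AND b.
xor_terms = [{"a": True, "b": False}, {"a": False, "b": True}]
print(macrocell({"a": True, "b": False}, xor_terms))               # True
print(macrocell({"a": True, "b": False}, xor_terms, invert=True))  # False (XNOR)
```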
![Page 73: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/73.jpg)
Complex Programmable Logic Devices (CPLD)
- Altera Multiple Array Matrix (MAX) architecture
- AND-OR structures are relatively limited, cannot share signals/product terms among macrocells
- Logic Array Blocks (similar to macrocells)
- Global routing: Programmable Interconnect Array (PIA)
- EPM5128: 8 fixed inputs, 52 I/O pins, 8 LABs, 16 macrocells/LAB, 32 expanders/LAB
[Figure: LABs A through H arranged around the central PIA]
![Page 74: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/74.jpg)
Altera MAX 7k (EEPLD) Logic Block
![Page 75: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/75.jpg)
SRAM based PLD
- Altera Flex 10k Block Diagram
![Page 76: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/76.jpg)
SRAM based PLD
- Altera Flex 10k Logic Array Block (LAB)
![Page 77: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/77.jpg)
SRAM based PLD
- Altera Flex 10k Logic Element (LE)
![Page 78: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/78.jpg)
Field-Programmable Gate Arrays (FPGA)
- Unlike PLA/PAL, configuration is stored in volatile SRAM
- External memory (ROM) holds the configuration, which is loaded at power-up
- Used to emulate/prototype ASICs

| Component | Description |
|---|---|
| Logic blocks | Implement combinational and sequential logic; based on lookup tables (LUT) |
| Interconnect | Wires connecting I/O to logic blocks |
| I/O blocks | Special logic blocks at periphery of device for external connections |
| Specialized blocks | e.g. DSP |
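A lookup table implements any Boolean function of its inputs by storing the function's truth table and using the inputs as an address. A minimal sketch, with hypothetical helpers `make_lut` and `lut_read` and a 3-input majority function chosen as the example:

```python
def make_lut(func, k):
    """Precompute the 2**k-entry truth table of a k-input Boolean function."""
    table = []
    for address in range(2 ** k):
        bits = [(address >> i) & 1 for i in range(k)]   # input i is address bit i
        table.append(int(bool(func(*bits))))
    return table

def lut_read(table, *bits):
    """Evaluate the function by indexing the table with the input bits."""
    address = sum(bit << i for i, bit in enumerate(bits))
    return table[address]

# "Configuring" the LUT means filling the table; here with a majority function.
majority = make_lut(lambda a, b, c: a + b + c >= 2, 3)
print(majority)                     # [0, 0, 0, 1, 0, 1, 1, 1]
print(lut_read(majority, 1, 0, 1))  # 1
```

Changing the stored table changes the implemented function without changing any wiring, which is what makes the logic block programmable.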
![Page 79: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/79.jpg)
FPGA with DSP
- Altera Stratix II: Block Diagram
![Page 80: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/80.jpg)
FPGA with DSP
- Altera Stratix II DSP block
![Page 81: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/81.jpg)
Xilinx Virtex-7
- Up to 2M logic cells
- High block RAM-to-logic cell ratios (up to 68 Mb with 1,139K logic cells)
- Configurable logic blocks (CLB) organized into 2 slices
  - Lookup tables, carry chains, registers
  - Distributed memory and shift register logic
- Target market:
  - Test and measurement (T&M) applications, bridging/switch fabric, RADAR, ASIC emulation
Source: http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
![Page 82: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/82.jpg)
Xilinx Virtex-7 Slice
- Modern slices come in two varieties
  - SLICEM implementing logical, shift register, and memory functions
  - SLICEL implementing logical functions only
- SLICEM placed closer to DSP slices for easy access to coefficients
Source: http://www.xilinx.com/support/documentation/white_papers/wp405-7Series-Logical-Advantage.pdf
[Figure: Packing 2 independent logical functions into 1 LUT]
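One way to picture packing two functions into a single LUT is to split the table into two halves that are read at the same address. The sketch below is a simplified model under that assumption (a 6-input table used as two 5-input tables with shared inputs and two outputs); it is not a description of the exact Xilinx primitive, and the function names are hypothetical.

```python
def pack_two_functions(f, g, k=5):
    """Build a 2**(k+1)-entry table holding f in the lower half and g in the upper half."""
    half = 2 ** k
    table = [0] * (2 * half)
    for address in range(half):
        bits = [(address >> i) & 1 for i in range(k)]
        table[address] = int(bool(f(*bits)))          # lower half: first function
        table[half + address] = int(bool(g(*bits)))   # upper half: second function
    return table

def read_both(table, *bits):
    """Return (f(bits), g(bits)) by reading both halves at the shared address."""
    half = len(table) // 2
    address = sum(bit << i for i, bit in enumerate(bits))
    return table[address], table[half + address]

# Two unrelated functions of the same five inputs: a 5-input AND and a parity bit.
table = pack_two_functions(lambda *b: all(b), lambda *b: sum(b) % 2)
print(len(table))                        # 64 entries, the size of a 6-input LUT
print(read_both(table, 1, 1, 1, 1, 1))   # (1, 1)
print(read_both(table, 1, 0, 0, 0, 0))   # (0, 1)
```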
![Page 83: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/83.jpg)
Application-specific integrated circuits (ASICs)
- Custom integrated circuits that have been designed for a single use or application
- Standard single-purpose processors
  - “Off-the-shelf”, pre-designed for a common task (e.g. peripherals)
  - Serial transmission
  - Analog/digital conversions
![Page 84: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/84.jpg)
Combined system: CPU+FPGA+ASICs
- Actel Fusion Family
  - ARM7 CPU with FPGA and ASIC implementations of “smart peripherals” for analog functions
![Page 85: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/85.jpg)
PROCESSOR COMPARISON THROUGH APPLICATIONS
![Page 86: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/86.jpg)
Stereo Vision
- Moving window with data sharing between threads to avoid reprocessing
  - Optimized implementation per device
- GPU: slow data sharing
  - Limited speed due to memory access conflicts
- FPGA: 242 data windows (121 per image)
  - Stepwise reduction after window size becomes too large
  - Pipelined execution on logic units
![Page 87: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/87.jpg)
Stereo Vision
![Page 88: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/88.jpg)
SURF Feature Detection
- Efficiently detect and describe interest points in images
- Enables applications such as object recognition and 3D reconstruction

| | CPU | GPU | FPGA |
|---|---|---|---|
| Stage 1: Detector speed [ms] | 5200 | 105 | 100 |
| Stage 2: Descriptor speed [ms] | 1.4 | 0.1 | 0.7 |
| Power consumption [W] | 24 | 24 | 6 |
| Mass [g] | 850 | 850 | 210 |
| Volume [cm3] | 600 | 600 | 180 |

Source: Krajník, Tomáš, et al. "FPGA-based module for SURF extraction." Machine Vision and Applications 25.3 (2014).
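The table invites a comparison of speed and energy for the detector stage. The sketch below uses only the numbers listed above and assumes that the stated power figure is drawn for the full duration of the detector stage, which the table does not confirm; the dictionary and function names are hypothetical.

```python
# Values copied from the SURF comparison table.
DETECTOR_MS = {"CPU": 5200, "GPU": 105, "FPGA": 100}
POWER_W = {"CPU": 24, "GPU": 24, "FPGA": 6}

def detector_energy_joules(device: str) -> float:
    """Energy for one detector pass, assuming constant power during the stage."""
    return DETECTOR_MS[device] / 1000 * POWER_W[device]

def detector_speedup_vs_cpu(device: str) -> float:
    """How many times faster the detector stage runs than on the CPU."""
    return DETECTOR_MS["CPU"] / DETECTOR_MS[device]

for device in ("CPU", "GPU", "FPGA"):
    print(device,
          round(detector_speedup_vs_cpu(device), 1),
          round(detector_energy_joules(device), 2))
```

Under this assumption the GPU and the FPGA are both roughly 50 times faster than the CPU on the detector stage, while the FPGA needs about a quarter of the GPU's energy per pass.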
![Page 89: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/89.jpg)
Summary
- Processor metrics, trends
- Architectures and functions
  - CPUs
  - GPU
  - DSP
- Implementations
  - Programmable logic: PLDs and FPGAs
  - Custom ASICs
![Page 90: Design with Microprocessors - University of California ...cseweb.ucsd.edu/classes/wi15/cse237A-a/handouts/8_cpus.pdf · $ Simpler CPU control ! Dynamic $ ... Compile time assignment](https://reader031.vdocument.in/reader031/viewer/2022030505/5ab342c17f8b9aea528e231b/html5/thumbnails/90.jpg)
Sources and References
- Peter Marwedel, “Embedded Systems Design,” 2004.
- Frank Vahid, Tony Givargis, “Embedded System Design,” Wiley, 2002.
- Wayne Wolf, “Computers as Components,” Morgan Kaufmann, 2001.