design with microprocessors - university of california...
Post on 22-Mar-2018
217 Views
Preview:
TRANSCRIPT
1
Design with Microprocessors
Tajana Simunic Rosing Department of Computer Science and Engineering University of California, San Diego.
2 Tajana Simunic Rosing
ES Design
Verification and Validation
Hardware Hardware components
Embedded System Hardware n Embedded system hardware is frequently
designed in a loop (“hardware in a loop“):
F cyber-physical systems
Hardware platform architecture
4 Tajana Simunic Rosing
System-on-Chip platforms
5 Tajana Simunic Rosing
Nvidia Tegra 2 die photo
Qualcomm Snapdragon block diagram
General processor n Application processor (CPU) Specialized units n Graphics processing unit (GPU) n Various digital signals processing (DSPs)
Processor comparison metrics
6 Tajana Simunic Rosing
Clock frequency Computation speed Memory subsystem • Indeterminacy in execution
• Cache miss: compulsory, conflict, capacity Power consumption Idle power draw, dynamic range, sleep modes Chip area Can be critical for embedded form factor Versatility/specialization FPGAs, ASICs Non-technical Development environment, prior expertise,
licensing
n E.g. ARM Cortex-A, Intel Atom, TI C54x, TI 60x DSPs, Xilinx Virtex-7, single purpose controller
Processor comparison metrics
7 Tajana Simunic Rosing
Parallelism Superscalar pipeline • Depth & width à latency & throughput Multithreading • GPU workload requires different programming effort
Instruction set architecture
Complex instruction set computer (CISC): • Many addressing modes • Many operations per instruction • E.g. TI C54x
Reduced instruction set computer (RISC): • Load/store • Easy to pipeline • E.g. ARM
Very Long Instruction Word (VLIW)
Processor comparison metrics n How do you define “speed”?
¨ Clock speed – but instructions per cycle may differ ¨ Instructions per second – but work per instr. may differ
n Practical evaluation ¨ Dhrystone: Synthetic benchmark, developed in 1984
n Dhrystones/sec (a.k.a. Dhrystone MIPS) to normalize difference in instruction count between RISC/CISC
n MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital’s VAX 11/780) ¨ SPEC: set of more realistic benchmarks, but oriented to desktops ¨ EEMBC: EDN Embedded Benchmark Consortium, www.eembc.org
n Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications
n E.g. CoreMark, intended to replace Dhrystone ¨ 0xbench: Integrated Android benchmarks
n Covers C library, system calls, Javascript (web performance), graphics, Dalvik VM garbage collection
8 Tajana Simunic Rosing
PARALLEL ARCHITECTURES
9 Tajana Simunic Rosing
10 Tajana Simunic Rosing
Parallelism in programs n Exists in several levels
of granularity ¨ Task ¨ Data ¨ Instruction
P1 P2
P3
Ld r1, r2 Add r3,r4 Sub r5,r6
11 Tajana Simunic Rosing
Parallelism extraction n Static
¨ Use compiler to analyze program code
¨ Can make use of high-level language constructs
¨ Cannot inspect data values ¨ Simpler CPU control
n Dynamic ¨ Use hardware to identify
opportunities ¨ Can make use of data
values ¨ More complex CPU
12 Tajana Simunic Rosing
Superscalar n Instruction-level parallelism n Replicated execution resources n RISC instructions are pipelined
¨ n inst/cycle à n2 HW
Register file
Execution unit
Execution unit
n
n
n
13 Tajana Simunic Rosing
Simple VLIW architecture n Compile time assignment of instructions to FUs n Large register file feeds multiple function units.
Register file
EBOX (execution unit) Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP
ALU ALU Load/store Load/store FU
14 Tajana Simunic Rosing
Clustered VLIW architecture
n Register file, function units divided into clusters.
Execution
Register file
Execution
Register file
Cluster bus
Embedded processor trends
15
ARM11 ARM Cortex-A8 ARM Cortex-A9 Qualcomm Scorpion Qualcomm Krait[1] ARM Cortex-A15 MPCore
Decode single-issue 2-wide 2-wide 2-wide 3-wide 3-wide Pipeline depth 8 stages 13 stages 8 stages 10 stages 11 stages 15/17-25 stages Out of Order
Execution No No Yes Yes, non-speculative [2] Yes Yes
FPU VFPv2 (pipelined) VFPv3 (not pipelined)
VFPv3-D16 or VFPv3-D32 (typical)
(pipelined) VFPv3 (pipelined) VFPv4 (pipelined) [3] VFPv4 (pipelined)
NEON None Yes (Partially 128-bit wide)
Optional (Partially 128-bit wide) Yes (128-bit wide) Yes (128-bit wide) Yes (128-bit wide)
Process Technology 90 nm 65/45 nm 45/40/32/28 nm 65/45 nm 28 nm 32/28 nm
L0 Cache 4kB + 4kB direct mapped
L1 Cache Varying, typically 16 kB + 16 kB 32 kB + 32 kB 32 kB + 32 kB 32 kB + 32 kB 16 kB + 16 kB 4-
way set associative 32 kB + 32 kB per
core
L2 Cache Varying, typically none
256 or 512 (typical) kB 1 MB
256 kB (Single-core)/512 kB (Dual-
core)
1 MB 8-way set associative (Dual-core)/2 MB (Quad-
core)
up to 4 MB per cluster, up to 8 MB
per chip
Core Configurations 1 1 1, 2, 4 1, 2 2, 4 2, 4, 8(4×2) DMIPS/MHz
speed per core 1.25 2.0 2.5 2.1 3.3 3.5
NEON: Advanced SIMD extension is a combined 64- and 128-bit instruction set that provides standardized acceleration for media and DSP apps
Parallelization • Multiple cores • Multiple-issue per core • Out of order execution • SIMD (NEON)
Process technology • Higher density allows more parallel structures • Higher power density, thermal issues
Why isn’t everything just massively parallel?
n Types of architectural hazards ¨ Data (e.g. read-after-write, pointer aliasing) ¨ Structural ¨ Control flow
n Difficult to fully utilize parallel structures ¨ Programs have real dependencies that limit ILP ¨ Utilization of parallel structures is dependent on
programming model ¨ Limited window size during instruction issue ¨ Memory delays
n High cost of errors in prediction/speculation ¨ Performance: Stalls introduced to wait for reissue ¨ Energy: Wasted power going down wrong execution path
16 Tajana Simunic Rosing
ARMV7 ARCHITECTURE General/applications processing
17 Tajana Simunic Rosing
ARMv7
n ARM assembly language - RISC n ARM programming model
¨ Audio players, pagers etc.; 130 MIPS
n ARM memory organization n ARM data operations (32 bit) n ARM flow of control n Hardware-based floating point unit
18 Tajana Simunic Rosing
19 Tajana Simunic Rosing
ARM programming model
r0 r1 r2 r3 r4 r5 r6 r7
r8 r9
r10 r11 r12 r13 r14
r15 (PC)
CPSR
31 0
N Z C V
Current Program Status Register
20 Tajana Simunic Rosing
ARM status bits n Every arithmetic, logical, or shifting
operation sets CPSR bits: ¨ N (negative), Z (zero), C (carry), V (overflow).
n Examples: ¨ -1 + 1 = 0: NZCV = 0110. ¨ 231-1+1 = -231: NZCV = 1001.
21 Tajana Simunic Rosing
ARM pipeline execution
add r0,r1,#5
sub r2,r3,r6
cmp r2,#3
fetch
time
decode
fetch
execute
decode
fetch
execute
decode execute
1 2 3
22 Tajana Simunic Rosing
ARM data instructions n ADD, ADC : add (w.
carry) n SUB, SBC : subtract
(w. carry) n MUL, MLA : multiply
(and accumulate)
n AND, ORR, EOR n BIC : bit clear n LSL, LSR : logical
shift left/right n ASL, ASR : arithmetic
shift left/right n ROR : rotate right n RRX : rotate right
extended with C
23 Tajana Simunic Rosing
ARM flow of control n All operations can be performed
conditionally, testing CPSR: ¨ EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE
n Branch operation: B #100 ¨ Can be performed conditionally.
24 Tajana Simunic Rosing
ARM comparison instructions n CMP : compare n CMN : negated compare n TST : bit-wise AND n TEQ : bit-wise XOR n These instructions set only the NZCV bits
of CPSR.
25 Tajana Simunic Rosing
ARM load/store/move instructions
n LDR, LDRH, LDRB : load (half-word, byte) n STR, STRH, STRB : store (half-word,
byte) n Addressing modes:
¨ register indirect : LDR r0,[r1] ¨ with second register : LDR r0,[r1,-r2] ¨ with constant : LDR r0,[r1,#4]
n MOV, MVN : move (negated)MOV r0, r1 ; sets r0 to r1
26 Tajana Simunic Rosing
Addressing modes n Base-plus-offset addressing:
LDR r0,[r1,#16] ¨ Loads from location r1+16
n Auto-indexing increments base register: LDR r0,[r1,#16]!
n Post-indexing fetches, then does offset: LDR r0,[r1],#16 ¨ Loads r0 from r1, then adds 16 to r1
27 Tajana Simunic Rosing
ARM subroutine linkage n Branch and link instruction:
BL foo ¨ Copies current PC to r14.
n To return from subroutine: MOV r15,r14
ARM Summary n Load/store architecture n Most instructions are RISCy, operate in
single cycle. ¨ Some multi-register operations take longer.
n All instructions can be executed conditionally
n ARMv7-A is deployed in: ¨ Cortex-A15 (Snapdragon Krait, Nvidia Tegra,
TI OMAP), Cortex-A5 (AMD Fusion), Atmel microcontrollers
28 Tajana Simunic Rosing
Next generation: ARMv8 n Addition of 64-bit support
¨ Larger virtual address space n New instruction set (A64)
¨ Fewer conditional instructions ¨ New instructions to support 64-bit operands ¨ No arbitrary length load/store multiple instructions
n Enhanced cryptography (both 32 and 64-bit) n Mostly backwards compatible with ARMv7
n Enable expansion into traditional/higher performance markets ¨ Mobile phones, servers, supercomputers ¨ Cortex-A53, Apple Cyclone, Nvidia Denver
29 Tajana Simunic Rosing
GRAPHICS PROCESSING (GPU)
30 Tajana Simunic Rosing
Graphics pipeline
n Primary architectural elements: ¨ Vertex shaders, pixel shaders (fragment shaders)
n Performance measured in GFLOPS
31 http://m.iopscience.iop.org/1742-5468/2009/06/P06016/figures
GPU programming n Compute Unified Device Architecture (CUDA)
¨ Enables GPGPUs: GPUs can be used for general purpose processing (i.e., not exclusively graphics)
n Single program multiple data (SPMD) ¨ Programming has to be explicit ¨ Threading directives and memory addressing
32 Tajana Simunic Rosing
GPU programming n A kernel scales across any number
of parallel processors ¨ Contains grid of thread blocks
n Thread block shape can be 1D, 2D, 3D ¨ All threads in a thread block run
the same code, over the same data in shared memory space
¨ Threads have thread ID numbers within block to select work and address shared data
¨ Threads in different blocks cannot cooperate
n Threads are grouped by warps (scheduling units)
33 Tajana Simunic Rosing
KQueue Kernel # 1 Kernel # 2
Thread Block
warp 1
warp 2 warp 5
warp 4
warp 3 warp 0
Nvidia GeForce Block Diagram n Ultra low power (ULP) version of GeForce used in Tegra 3, 4 n Desktop GTX 280 shown; embedded versions have different
number of TPCs, SMs, etc
Rajib Nath
Geometry Controller
SMC
Texture L1 Cache
SM SM SM
Thread Processing Cluster (TPC)
Streaming Multiprocessor (SM)
Shared Memory
SFU SFU SP SP
SP SP
SP SP
SP SP
Constant Cache Mul; Thread Issue Instruc;on Cache
Register File
Floa;ng Point Unit
Integer Unit
Streaming Processor (SP)
Qualcomm Adreno 2xx n Snapdragon S1-4 chipsets (e.g. MSM8x60)
¨ Newest generation Adreno 5xx released in 2015+ n Unified shaders
¨ Same instruction set for fragment and vertex processing
¨ More versatile hardware n 5-way VLIW n Also used for non-gaming apps, e.g. browser
¨ Behavior in non-gaming is not as well-understood
Qualcomm Adreno 2xx
n Snapdragon S4 chipset (MSM8960) n “Digital core rail power” (red) includes GPU, video decode, &
modem digital blocks n GLBenchmark: high-end gaming content
¨ CPU power up to 750mW @ 1.5 GHz ¨ GPU power up to 1.6W @ 400Mhz
36 http://www.anandtech.com/print/5559/qualcomm-snapdragon-s4-krait-performance-preview-msm8960-adreno-225-benchmarks
DIGITAL SIGNAL PROCESSING (DSP)
37 Tajana Simunic Rosing
Digital signal processing (DSP) n Processing or manipulation of signals using digital
techniques n Interfacing with the physical world
¨ E.g. audio, digital images, speech processing, medical monitoring (EKG)
Source: Dr D. H. Crawford
ADC DAC Digital Signal
Processor Analog to Digital
Converter
Digital to Analog
Converter
Input Signal
Output Signal
Fundamental DSP Operations n Filtering
¨ Finite impulse response (FIR)
n Frequency transforms ¨ Fast Fourier (FFT) ¨ Discrete cosine (DCT) ¨ Inverse discrete
cosine (IDCT)
Source: Dr D. H. Crawford
∑−
=−=
1
0)()(
L
ii inxany
for (n=0; n<N; n++) { s=0; for (i=0; i<L; i++) { s += a[i] * x[n-i]; } y[n] = s; }
Pseudo C code
DSP architectural features n Fixed-point vs Floating point n VLIW or specialized SIMD techniques
¨ E.g. Qualcomm Hexagon DSP (VLIW) dispatches up to 4 instructions to 4 execution units per cycle
n No virtual memory or context switching n Separate instruction and data storage
¨ Harvard architecture vs Von Neumann n Pipelined FUs are integrated into the datapath
n Main DSP Manufacturers: ¨ Texas Instruments (http://www.ti.com) ¨ Motorola (http://www.motorola.com) ¨ Analog Devices (http://www.analog.com)
Source: Dr D. H. Crawford
Speech processing Audio effects
Image compression Video encoding
Noise cancellation Virtual/augmented reality Actuation error detection
What is DSP Used For?
Speech processing n Encoding n Compression n Synthesis n Recognition
Source: Dr D. H. Crawford
The blue--- s---p--o---------t i-s--on--the-- k--ey a---g--ai----n------
“oo” in “blue” “o” in “spot” “ee” in “key” “e” in “again” “s” in “spot” “k” in “key”
Image processing n Trade off “good enough” quality in “essential” regions n Enable higher transmission bandwidth, minimal
storage, media interactivity n Still image encoding: JPEG (Joint Photographic
Experts Group) ¨ JPEG2000: Wavelet Transform based
n Video encoding: MPEG (Moving Pictures Experts Group) ¨ MPEG-4 (aka H.264)
n Variable macroblock sizes (4x4 to 16x16) n Enhanced to allow lossless regions
Source: Dr D. H. Crawford
JPEG Codec
44 Source: Xilinx
Coefficient quantization DCT
Zig-zag run-length encoding
Huffman encoding
Original pixel data
Compressed data
Coefficient denormalization IDCT
Zig-zag run-length expansion
Huffman decoding
Encoding
Decoding
lossy lossless
Reconstructed pixel data
MPEG-2 Codec
45 Source: Xilinx
Quantization DCT Variable length coder (VLC)
Original video Encoding
- Bitstream out
Motion compensation
Motion estimation Frame store
Inverse quantization
IDCT
+ Loop filter
Intraframe
Interframe
MPEG-2 Codec
46 Source: Xilinx
Decoding
Motion compensation
Frame store
+Inverse
quantization DCT Variable length decoder (VLD)
Bitstream buffer Output IDCT
Intraframe
Interframe
DCT/IDCT Concept n Used for lossy compression n Discrete cosine transform (DCT)
¨ Represent a finite sequence of data points with a sum of cosine (even) functions of different frequencies
¨ Similar to the discrete Fourier transform (DFT), but using only real numbers
¨ If a coefficient has a lot of variance over a set, then it cannot be removed without affecting the picture quality
n Inverse DCT ¨ Reconstruct sequence from frequency
coefficients
47 Source: Xilinx
2D DCT & IDCT n Image divided into macroblocks of 8x8 pixels n “The DCT” of each group is an 8x8 transform
coefficient array; entries represent spatial frequencies
48 Source: Xilnx
DCT & IDCT operations
n Dedicated functional unit: fused multiply-add n Common to DCT/IDCT and many other DSP
operations 49
Source: Xilinx
DCT:
IDCT:
Group of Pictures (GOP) n A structure of consecutive frames that can be decoded
without any other reference frames n Transmitted sequence is not the same as displayed sequence n Inter-frame prediction/compression exploits temporal
redundancy between neighboring frames n Intra-frame coding is applied only to the current frame
Types of frames n I frame (intra-coded)
¨ Coded without reference to other frames ¨ Begins each GOP
n P frame (predictive-coded) ¨ Coded with reference to a previous reference
frame (either I or P) ¨ Size is usually about 1/3rd of an I frame
n B frame (bi-directional predictive-coded) ¨ Coded with reference to both previous and future
reference frames (either I or P) ¨ Size is usually about 1/6th of an I frame
Fixed-Point Design
Idea
Floating-Point Algorithm
Quantization
Fixed-Point Algorithm
Code Generation
Target System
Algorithm
LevelIm
plementation
Level
Range Estimation
n Digital signal processing algorithms ¨ Early development in floating point ¨ Converted into fixed point for
production to gain efficiency n Fixed-point digital hardware
¨ Lower area ¨ Lower power ¨ Lower per unit production cost
Copyright Kyungtae Han [2]
Fixed-Point Design n All variables have to be annotated manually
¨ Value ranges are well known ¨ Avoid overflow ¨ Minimize quantization effects ¨ Find optimum wordlength
n Manual process supported by simulation ¨ Time-consuming ¨ Error prone
Copyright Kyungtae Han [2]
Fixed-Point Representation n Fixed point type
¨ Wordlength ¨ Integer wordlength
n Quantization modes ¨ Round ¨ Truncation
n Overflow modes ¨ Saturation ¨ Saturation to zero ¨ Wrap-around
S X X X X X
Wordlength
Integer wordlength
SystemC format www.systemc.org
X X X X X
Wordlength
Integer wordlength = -2
Copyright Kyungtae Han [2]
Fixed-Point Representation
x = 0.5 x 0.125 + 0.25 x 0.125
= 0.0625 + 0.03125
= 0.09375
n For integer word length iwl=1 and fractional word length fwl=3 decimal digits, the less significant digits are automatically chopped off: x = 0.093
n Like a floating point system with numbers ∈ (-1..1), with no stored exponent (bits used to increase precision).
Ø Automatic scaling: Shifting after multiplications and divisions in order to maintain binary point.
TI C54x family (CISC) n Modified Harvard architecture: separate buses
for program code and data ¨ PB: program read bus ¨ CB, DB: data read busses ¨ EB: data write bus ¨ PAB, CAB, DAB, EAB: address busses
n Can generate two data memory addresses per cycle ¨ Stored in auxiliary register address units
n High performance, reproducible behavior, optimized for different memory structures
56 Tajana Simunic Rosing
57 Tajana Simunic Rosing
TI C54x architectural features
n 40-bit ALU & Barrel shifter ¨ Input from accumulator or data memory ¨ Output to ALU
n 17 x 17 multiplier n Single-cycle exponent encoder n Two address generators with dedicated
registers n Accumulators
¨ Low-order (0-15), high-order (16-31), guard (32-39)
58 Tajana Simunic Rosing
TI C54x instruction set features
n Compare, select and store unit (CSSU) unit ¨ Compares high and low accumulator words ¨ Accelerates Viterbi operations
n Repeat and block repeat instructions n Instructions that read 2, 3 operands
simultaneously n Three IDLE instructions
¨ Selectively shut down CPU, on-chip peripherals, whole chip including phase-locked loop
59 Tajana Simunic Rosing
TI C54x pipeline n Prefetch: Send PC address on program address bus n Fetch: Load instruction from program bus to IR n Decode n Access: Put operand addresses on busses n Read: Get operands from busses n Execute
60 Tajana Simunic Rosing
Addressing Modes
Data
Immediate
Register-direct
Register indirect
Direct
Indirect
Data
Operand field
Register address
Register address
Memory address
Memory address
Memory address Data
Data
Memory address
Data
Addressing mode
Register-file contents
Memory contents
TIs C55x Pipeline
n Prefetch 1: ¨ Send address to memory
n Prefetch 2: ¨ Wait for response
n Fetch: ¨ Get instruction from memory
and put in IBQ n Predecode:
¨ Identify where instructions begin and end
¨ Identify parallel instructions
n Decode: ¨ Decode an instruction pair or single
instruction. n Address:
¨ Perform address calculations. n Access 1/2:
¨ Send address to memory; wait. n Read:
¨ Read data from memory. Evaluate condition registers.
n Execute: ¨ Read/modify registers. Set conditions.
n W/W+: ¨ Write data to MMR-addressed
registers, memory; finish.
61 Tajana Simunic Rosing
fetch execute
4 7-8
62 Tajana Simunic Rosing
C55x organization
Instruction unit
Program flow unit
Address unit
Data unit
3 data read busses 3 data read address busses program address bus
program read bus
2 data write busses 2 data write address busses
16
24
24
16
24
32
Instruction fetch Data read from memory
D bus
Single operand read
C, D busses
Dual operand read
B bus
Dual-multiply coefficient Writes
63 Tajana Simunic Rosing
C55x hardware extensions
n Target image/video applications ¨ DCT/IDCT ¨ Pixel interpolation ¨ Motion estimation
n Available in 5509 and 5510 ¨ Equivalent C-callable functions for other devices.
64 Tajana Simunic Rosing
TI C62/C67 (VLIW) n Up to 8 instructions/cycle n 32 32-bit registers n Function units
¨ Two multipliers ¨ Six ALUs
n Data operations ¨ 8/16/32-bit arithmetic ¨ 40-bit operations ¨ Bit manipulation operations
Partitioned register files n Many memory ports are required to supply enough
operands per cycle. n Memories with many ports are expensive. Ø Registers are partitioned into sets, e.g. for TI
C60x:
65 Tajana Simunic Rosing
register file A register file B
L1 S1 M1 D1 D2 M2 S2 L2
Data bus
Address bus
Data path A Data path B
66 Tajana Simunic Rosing
C6x data paths n General-purpose register files (A and B,
16 words each) n Eight functional units:
¨ .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2 n Two load units (LD1, LD2) n Two store units (ST1, ST2) n Two register file cross paths (1X and 2X) n Two data address paths (DA1 and DA2)
67 Tajana Simunic Rosing
C6x functional units n .L
¨ 32/40-bit arithmetic ¨ Leftmost 1 counting ¨ Logical ops
n .S ¨ 32-bit arithmetic ¨ 32/40-bit shift and 32-bit field ¨ Branches ¨ Constants
n .M ¨ 16 x 16 multiply
n .D ¨ 32-bit add, subtract, circular address ¨ Load, store with 5/15-bit constant offset
68 Tajana Simunic Rosing
C6x system n On-chip RAM n 32-bit external memory: SDRAM, SRAM n Host port n Multiple serial ports n Multichannel DMA n 32-bit timer
PROGRAMMABLE LOGIC DEVICES (PLD)
69 Tajana Simunic Rosing
Programmable Logic Devices (PLD) n Simple PLD (SPLD) �
¨ Programmable logic array (PLA) ¨ Programmable array logic (PAL) –
fixed OR plane n Complex PLD (CPLD)
¨ Building block: macrocells n Variations:
¨ Antifuse PLD ¨ Erasable EPLD & EEPLD
70 Tajana Simunic Rosing
Name Re-programmable Volatile Technology Fuse No No Bipolar
EPROM Yes – out of circuit No UVCMOS
EEPROM Yes – in circuit No EECMOS
SRAM Yes – in circuit Yes CMOS
Antifuse No No CMOS+
Antifuse PLDs n Actel Axcelerator family
• Antifuse: – open when not programmed – Low resistance when programmed
Clk MUX
Output MUXQ
F/B MUX
Invert Control
AND ARRAY
CLK
pad
8 Product Term AND-OR Array + Programmable MUX's
Programmable polarity
I/O Pin
Seq. Logic Block
Programmable feedback
Erasable Programmable Logic Devices
n EPLDs: CMOS erasable programmable ROM (EPROM) erased by UV light
n Altera’s building block is a MACROCELL
Logic Array
Blocks
(similar to macrocells)
Global Routing: Programmable Interconnect Array (PIA)
8 Fixed Inputs 52 I/O Pins 8 LABs 16 Macrocells/LAB 32 Expanders/LAB
EPM5128:
Complex Programmable Logic Devices (CPLD)
n Altera Multiple Array Matrix (MAX) architecture n AND-OR structures are relatively limited, cannot share
signals/product terms among macrocells
LAB A LAB H
LAB B LAB G
LAB C LAB F
LAB D LAB E
P I A
74
Altera MAX 7k (EEPLD) Logic Block
75
SRAM based PLD n Altera Flex 10k Block Diagram
76
SRAM based PLD n Altera Flex 10k Logic Array Block (LAB)
77
SRAM based PLD n Altera Flex 10k Logic Element (LE)
Field-Programmable Gate Arrays (FPGA) n Unlike PLA/PAL, configuration is stored in volatile SRAM n External memory (ROM) n Used to emulate/prototype ASICs
78 Tajana Simunic Rosing
Components Logic blocks Implement combinational and sequential logic
Based on lookup tables (LUT) Interconnect Wires connecting I/O to logic blocks I/O blocks Special logic blocks at periphery of device for
external connections Specialized blocks e.g. DSP
79
FPGA with DSP n Altera Stratix II: Block Diagram
FPGA with DSP n Altera Stratix II DSP
block
80
Xilinx Virtex-7 n Up to 2M logic cells n High block RAM-to-logic cell ratios
(up to 68 Mb with 1,139K logic cells)
n Configurable logic blocks (CLB) organized into 2 slices ¨ Lookup tables, carry chains,
registers ¨ Distributed memory and shift
register logic n Target market:
¨ Test and measurement (T&M) applications, bridging/switch fabric, RADAR, ASIC emulation
81 http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
Xilinx Virtex-7 Slice n Modern slices come in two varieties
¨ SLICEM implementing logical, shift register, and memory functions
¨ SLICEL implementing logical functions only
n SLICEM placed closer to DSP slices for easy access to coefficients
82 http://www.xilinx.com/support/documentation/white_papers/wp405-7Series-Logical-Advantage.pdf
Packing 2 independent logical functions into 1 LUT
Application-specific integrated circuit (ASICs)
n Custom integrated circuits that have been designed for a single use or application
n Standard single-purpose processors ¨ “Off-the-shelf”, pre-designed for a common task (e.g.
peripherals) ¨ serial transmission ¨ analog/digital conversions
83 Tajana Simunic Rosing
Combined system: CPU+FPGA+ASICs
n Actel Fusion Family ¨ ARM7 CPU with FPGA and ASIC implementations of
“smart peripherals” for analog functions
84
PROCESSOR COMPARISON THROUGH APPLICATIONS
85 Tajana Simunic Rosing
Stereo Vision n Moving window with data sharing between
threads to avoid reprocessing ¨ Optimized implementation per device
n GPU – slow data sharing ¨ Limited speed due to memory access conflicts
n FPGA – 242 data windows (121 per image) ¨ Stepwise reduction after window size
becomes too large ¨ Pipelined execution on logic units
Stereo Vision
Krajník, Tomáš, et al. "FPGA-based module for SURF extraction." Machine vision and applications 25.3 (2014)
SURF Feature Detection n Efficiently detect and describe interest points in
images n Enables applications such as object recognition
and 3D reconstruction
88
CPU GPU FPGA Stage 1: Detector speed [ms] 5200 105 100 Stage 2: Descriptor speed [ms] 1.4 0.1 0.7 Power consumption [W] 24 24 6 Mass [g] 850 850 210 Volume [cm3] 600 600 180
Summary n Processor metrics, trends n Architectures and functions
¨ CPUs ¨ GPU ¨ DSP
n Implementations ¨ Programmable logic – PLDs and FPGAs ¨ Custom ASICs
89 Tajana Simunic Rosing
Sources and References n Peter Marwedel, “Embedded Systems
Design,” 2004. n Frank Vahid, Tony Givargis, “Embedded
System Design,” Wiley, 2002. n Wayne Wolf, “Computers as
Components,” Morgan Kaufmann, 2001.
90 Tajana Simunic Rosing
top related