design with microprocessors - university of california...

Design with Microprocessors

Tajana Simunic Rosing Department of Computer Science and Engineering University of California, San Diego.

2 Tajana Simunic Rosing

ES Design

Verification and Validation

Hardware Hardware components

Embedded System Hardware n  Embedded system hardware is frequently

designed in a loop (“hardware in a loop“):

F cyber-physical systems

Hardware platform architecture

System-on-Chip platforms

Nvidia Tegra 2 die photo

Qualcomm Snapdragon block diagram

General processor n  Application processor (CPU) Specialized units n  Graphics processing unit (GPU) n  Various digital signals processing (DSPs)

Processor comparison metrics

Clock frequency Computation speed Memory subsystem •  Indeterminacy in execution

•  Cache miss: compulsory, conflict, capacity Power consumption Idle power draw, dynamic range, sleep modes Chip area Can be critical for embedded form factor Versatility/specialization FPGAs, ASICs Non-technical Development environment, prior expertise,

licensing

n  E.g. ARM Cortex-A, Intel Atom, TI C54x, TI 60x DSPs, Xilinx Virtex-7, single purpose controller

Processor comparison metrics

Parallelism Superscalar pipeline •  Depth & width à latency & throughput Multithreading •  GPU workload requires different programming effort

Instruction set architecture

Complex instruction set computer (CISC): •  Many addressing modes •  Many operations per instruction •  E.g. TI C54x

Reduced instruction set computer (RISC): •  Load/store •  Easy to pipeline •  E.g. ARM

Very Long Instruction Word (VLIW)

Processor comparison metrics n  How do you define “speed”?

¨  Clock speed – but instructions per cycle may differ ¨  Instructions per second – but work per instr. may differ

n  Practical evaluation ¨  Dhrystone: Synthetic benchmark, developed in 1984

n  Dhrystones/sec (a.k.a. Dhrystone MIPS) to normalize difference in instruction count between RISC/CISC

n  MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital’s VAX 11/780) ¨  SPEC: set of more realistic benchmarks, but oriented to desktops ¨  EEMBC: EDN Embedded Benchmark Consortium, www.eembc.org

n  Suites of benchmarks: automotive, consumer electronics, networking, office automation, telecommunications

n  E.g. CoreMark, intended to replace Dhrystone ¨  0xbench: Integrated Android benchmarks

n  Covers C library, system calls, Javascript (web performance), graphics, Dalvik VM garbage collection

PARALLEL ARCHITECTURES

Parallelism in programs n  Exists in several levels

of granularity ¨ Task ¨ Data ¨  Instruction

Ld r1, r2 Add r3,r4 Sub r5,r6

Parallelism extraction n  Static

¨  Use compiler to analyze program code

¨  Can make use of high-level language constructs

¨  Cannot inspect data values ¨  Simpler CPU control

n  Dynamic ¨  Use hardware to identify

opportunities ¨  Can make use of data

values ¨  More complex CPU

Superscalar n  Instruction-level parallelism n  Replicated execution resources n  RISC instructions are pipelined

¨  n inst/cycle à n2 HW

Register file

Execution unit

Simple VLIW architecture n  Compile time assignment of instructions to FUs n  Large register file feeds multiple function units.

Register file

EBOX (execution unit) Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP

ALU ALU Load/store Load/store FU

Clustered VLIW architecture

n  Register file, function units divided into clusters.

Execution

Register file

Execution

Register file

Cluster bus

Embedded processor trends

ARM11 ARM Cortex-A8 ARM Cortex-A9 Qualcomm Scorpion Qualcomm Krait[1] ARM Cortex-A15 MPCore

Decode single-issue 2-wide 2-wide 2-wide 3-wide 3-wide Pipeline depth 8 stages 13 stages 8 stages 10 stages 11 stages 15/17-25 stages Out of Order

Execution No No Yes Yes, non-speculative [2] Yes Yes

FPU VFPv2 (pipelined) VFPv3 (not pipelined)

VFPv3-D16 or VFPv3-D32 (typical)

(pipelined) VFPv3 (pipelined) VFPv4 (pipelined) [3] VFPv4 (pipelined)

NEON None Yes (Partially 128-bit wide)

Optional (Partially 128-bit wide) Yes (128-bit wide) Yes (128-bit wide) Yes (128-bit wide)

Process Technology 90 nm 65/45 nm 45/40/32/28 nm 65/45 nm 28 nm 32/28 nm

L0 Cache 4kB + 4kB direct mapped

L1 Cache Varying, typically 16 kB + 16 kB 32 kB + 32 kB 32 kB + 32 kB 32 kB + 32 kB 16 kB + 16 kB 4-

way set associative 32 kB + 32 kB per

L2 Cache Varying, typically none

256 or 512 (typical) kB 1 MB

256 kB (Single-core)/512 kB (Dual-

1 MB 8-way set associative (Dual-core)/2 MB (Quad-

up to 4 MB per cluster, up to 8 MB

per chip

Core Configurations 1 1 1, 2, 4 1, 2 2, 4 2, 4, 8(4×2) DMIPS/MHz

speed per core 1.25 2.0 2.5 2.1 3.3 3.5

NEON: Advanced SIMD extension is a combined 64- and 128-bit instruction set that provides standardized acceleration for media and DSP apps

Parallelization •  Multiple cores •  Multiple-issue per core •  Out of order execution •  SIMD (NEON)

Process technology •  Higher density allows more parallel structures •  Higher power density, thermal issues

Why isn’t everything just massively parallel?

n  Types of architectural hazards ¨  Data (e.g. read-after-write, pointer aliasing) ¨  Structural ¨  Control flow

n  Difficult to fully utilize parallel structures ¨  Programs have real dependencies that limit ILP ¨  Utilization of parallel structures is dependent on

programming model ¨  Limited window size during instruction issue ¨  Memory delays

n  High cost of errors in prediction/speculation ¨  Performance: Stalls introduced to wait for reissue ¨  Energy: Wasted power going down wrong execution path

ARMV7 ARCHITECTURE General/applications processing

n ARM assembly language - RISC n ARM programming model

¨ Audio players, pagers etc.; 130 MIPS

n ARM memory organization n ARM data operations (32 bit) n ARM flow of control n Hardware-based floating point unit

ARM programming model

r0 r1 r2 r3 r4 r5 r6 r7

r10 r11 r12 r13 r14

r15 (PC)

N Z C V

Current Program Status Register

ARM status bits n Every arithmetic, logical, or shifting

operation sets CPSR bits: ¨ N (negative), Z (zero), C (carry), V (overflow).

n Examples: ¨ -1 + 1 = 0: NZCV = 0110. ¨ 231-1+1 = -231: NZCV = 1001.

ARM pipeline execution

add r0,r1,#5

sub r2,r3,r6

cmp r2,#3

decode

execute

decode

execute

decode execute

ARM data instructions n  ADD, ADC : add (w.

carry) n  SUB, SBC : subtract

(w. carry) n  MUL, MLA : multiply

(and accumulate)

n  AND, ORR, EOR n  BIC : bit clear n  LSL, LSR : logical

shift left/right n  ASL, ASR : arithmetic

shift left/right n  ROR : rotate right n  RRX : rotate right

extended with C

ARM flow of control n All operations can be performed

conditionally, testing CPSR: ¨ EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE

n Branch operation: B #100 ¨ Can be performed conditionally.

ARM comparison instructions n CMP : compare n CMN : negated compare n TST : bit-wise AND n TEQ : bit-wise XOR n These instructions set only the NZCV bits

of CPSR.

ARM load/store/move instructions

n  LDR, LDRH, LDRB : load (half-word, byte) n STR, STRH, STRB : store (half-word,

byte) n Addressing modes:

¨ register indirect : LDR r0,[r1] ¨ with second register : LDR r0,[r1,-r2] ¨ with constant : LDR r0,[r1,#4]

n MOV, MVN : move (negated)MOV r0, r1 ; sets r0 to r1

Addressing modes n Base-plus-offset addressing:

LDR r0,[r1,#16] ¨ Loads from location r1+16

n Auto-indexing increments base register: LDR r0,[r1,#16]!

n Post-indexing fetches, then does offset: LDR r0,[r1],#16 ¨ Loads r0 from r1, then adds 16 to r1

ARM subroutine linkage n Branch and link instruction:

BL foo ¨ Copies current PC to r14.

n To return from subroutine: MOV r15,r14

ARM Summary n  Load/store architecture n Most instructions are RISCy, operate in

single cycle. ¨ Some multi-register operations take longer.

n All instructions can be executed conditionally

n ARMv7-A is deployed in: ¨ Cortex-A15 (Snapdragon Krait, Nvidia Tegra,

TI OMAP), Cortex-A5 (AMD Fusion), Atmel microcontrollers

Next generation: ARMv8 n  Addition of 64-bit support

¨ Larger virtual address space n  New instruction set (A64)

¨ Fewer conditional instructions ¨ New instructions to support 64-bit operands ¨ No arbitrary length load/store multiple instructions

n  Enhanced cryptography (both 32 and 64-bit) n  Mostly backwards compatible with ARMv7

n  Enable expansion into traditional/higher performance markets ¨ Mobile phones, servers, supercomputers ¨ Cortex-A53, Apple Cyclone, Nvidia Denver

GRAPHICS PROCESSING (GPU)

Graphics pipeline

n  Primary architectural elements: ¨ Vertex shaders, pixel shaders (fragment shaders)

n  Performance measured in GFLOPS

31 http://m.iopscience.iop.org/1742-5468/2009/06/P06016/figures

GPU programming n  Compute Unified Device Architecture (CUDA)

¨ Enables GPGPUs: GPUs can be used for general purpose processing (i.e., not exclusively graphics)

n  Single program multiple data (SPMD) ¨ Programming has to be explicit ¨ Threading directives and memory addressing

GPU programming n  A kernel scales across any number

of parallel processors ¨  Contains grid of thread blocks

n  Thread block shape can be 1D, 2D, 3D ¨  All threads in a thread block run

the same code, over the same data in shared memory space

¨  Threads have thread ID numbers within block to select work and address shared data

¨  Threads in different blocks cannot cooperate

n  Threads are grouped by warps (scheduling units)

KQueue Kernel # 1 Kernel # 2

Thread Block

warp 1

warp 2 warp 5

warp 4

warp 3 warp 0

Nvidia GeForce Block Diagram n  Ultra low power (ULP) version of GeForce used in Tegra 3, 4 n  Desktop GTX 280 shown; embedded versions have different

number of TPCs, SMs, etc

Rajib Nath

Geometry Controller

Texture L1 Cache

SM SM SM

Thread Processing Cluster (TPC)

Streaming Multiprocessor (SM)

Shared Memory

SFU SFU SP SP

Constant Cache Mul; Thread Issue Instruc;on Cache

Register File

Floa;ng Point Unit

Integer Unit

Streaming Processor (SP)

Qualcomm Adreno 2xx n  Snapdragon S1-4 chipsets (e.g. MSM8x60)

¨ Newest generation Adreno 5xx released in 2015+ n  Unified shaders

¨ Same instruction set for fragment and vertex processing

¨ More versatile hardware n  5-way VLIW n  Also used for non-gaming apps, e.g. browser

¨ Behavior in non-gaming is not as well-understood

Qualcomm Adreno 2xx

n  Snapdragon S4 chipset (MSM8960) n  “Digital core rail power” (red) includes GPU, video decode, &

modem digital blocks n  GLBenchmark: high-end gaming content

¨  CPU power up to 750mW @ 1.5 GHz ¨  GPU power up to 1.6W @ 400Mhz

36 http://www.anandtech.com/print/5559/qualcomm-snapdragon-s4-krait-performance-preview-msm8960-adreno-225-benchmarks

DIGITAL SIGNAL PROCESSING (DSP)

Digital signal processing (DSP) n  Processing or manipulation of signals using digital

techniques n  Interfacing with the physical world

¨ E.g. audio, digital images, speech processing, medical monitoring (EKG)

Source: Dr D. H. Crawford

ADC DAC Digital Signal

Processor Analog to Digital

Converter

Digital to Analog

Converter

Input Signal

Output Signal

Fundamental DSP Operations n Filtering

¨ Finite impulse response (FIR)

n Frequency transforms ¨ Fast Fourier (FFT) ¨ Discrete cosine (DCT) ¨ Inverse discrete

cosine (IDCT)

∑−

ii inxany

for (n=0; n<N; n++) { s=0; for (i=0; i<L; i++) { s += a[i] * x[n-i]; } y[n] = s; }

Pseudo C code

DSP architectural features n  Fixed-point vs Floating point n  VLIW or specialized SIMD techniques

¨ E.g. Qualcomm Hexagon DSP (VLIW) dispatches up to 4 instructions to 4 execution units per cycle

n  No virtual memory or context switching n  Separate instruction and data storage

¨ Harvard architecture vs Von Neumann n  Pipelined FUs are integrated into the datapath

n  Main DSP Manufacturers: ¨ Texas Instruments (http://www.ti.com) ¨ Motorola (http://www.motorola.com) ¨ Analog Devices (http://www.analog.com)

Speech processing Audio effects

Image compression Video encoding

Noise cancellation Virtual/augmented reality Actuation error detection

What is DSP Used For?

Speech processing n Encoding n Compression n Synthesis n Recognition

The blue--- s---p--o---------t i-s--on--the-- k--ey a---g--ai----n------

“oo” in “blue” “o” in “spot” “ee” in “key” “e” in “again” “s” in “spot” “k” in “key”

Image processing n  Trade off “good enough” quality in “essential” regions n  Enable higher transmission bandwidth, minimal

storage, media interactivity n  Still image encoding: JPEG (Joint Photographic

Experts Group) ¨  JPEG2000: Wavelet Transform based

n  Video encoding: MPEG (Moving Pictures Experts Group) ¨  MPEG-4 (aka H.264)

n  Variable macroblock sizes (4x4 to 16x16) n  Enhanced to allow lossless regions

JPEG Codec

44 Source: Xilinx

Coefficient quantization DCT

Zig-zag run-length encoding

Huffman encoding

Original pixel data

Compressed data

Coefficient denormalization IDCT

Zig-zag run-length expansion

Huffman decoding

Encoding

Decoding

lossy lossless

Reconstructed pixel data

MPEG-2 Codec

45 Source: Xilinx

Quantization DCT Variable length coder (VLC)

Original video Encoding

- Bitstream out

Motion compensation

Motion estimation Frame store

Inverse quantization

+ Loop filter

Intraframe

Interframe

MPEG-2 Codec

46 Source: Xilinx

Decoding

Motion compensation

Frame store

+Inverse

quantization DCT Variable length decoder (VLD)

Bitstream buffer Output IDCT

Intraframe

Interframe

DCT/IDCT Concept n  Used for lossy compression n  Discrete cosine transform (DCT)

¨ Represent a finite sequence of data points with a sum of cosine (even) functions of different frequencies

¨ Similar to the discrete Fourier transform (DFT), but using only real numbers

¨ If a coefficient has a lot of variance over a set, then it cannot be removed without affecting the picture quality

n  Inverse DCT ¨ Reconstruct sequence from frequency

coefficients

47 Source: Xilinx

2D DCT & IDCT n  Image divided into macroblocks of 8x8 pixels n  “The DCT” of each group is an 8x8 transform

coefficient array; entries represent spatial frequencies

48 Source: Xilnx

DCT & IDCT operations

n  Dedicated functional unit: fused multiply-add n  Common to DCT/IDCT and many other DSP

operations 49

Source: Xilinx

Group of Pictures (GOP) n  A structure of consecutive frames that can be decoded

without any other reference frames n  Transmitted sequence is not the same as displayed sequence n  Inter-frame prediction/compression exploits temporal

redundancy between neighboring frames n  Intra-frame coding is applied only to the current frame

Types of frames n  I frame (intra-coded)

¨ Coded without reference to other frames ¨ Begins each GOP

n  P frame (predictive-coded) ¨ Coded with reference to a previous reference

frame (either I or P) ¨ Size is usually about 1/3rd of an I frame

n  B frame (bi-directional predictive-coded) ¨ Coded with reference to both previous and future

reference frames (either I or P) ¨ Size is usually about 1/6th of an I frame

Fixed-Point Design

Floating-Point Algorithm

Quantization

Fixed-Point Algorithm

Code Generation

Target System

Algorithm

LevelIm

plementation

Range Estimation

n  Digital signal processing algorithms ¨  Early development in floating point ¨  Converted into fixed point for

production to gain efficiency n  Fixed-point digital hardware

¨  Lower area ¨  Lower power ¨  Lower per unit production cost

Copyright Kyungtae Han [2]

Fixed-Point Design n  All variables have to be annotated manually

¨ Value ranges are well known ¨ Avoid overflow ¨ Minimize quantization effects ¨ Find optimum wordlength

n  Manual process supported by simulation ¨ Time-consuming ¨ Error prone

Fixed-Point Representation n  Fixed point type

¨  Wordlength ¨  Integer wordlength

n  Quantization modes ¨  Round ¨  Truncation

n  Overflow modes ¨  Saturation ¨  Saturation to zero ¨  Wrap-around

S X X X X X

Wordlength

Integer wordlength

SystemC format www.systemc.org

X X X X X

Wordlength

Integer wordlength = -2

Fixed-Point Representation

x = 0.5 x 0.125 + 0.25 x 0.125

= 0.0625 + 0.03125

= 0.09375

n  For integer word length iwl=1 and fractional word length fwl=3 decimal digits, the less significant digits are automatically chopped off: x = 0.093

n  Like a floating point system with numbers ∈ (-1..1), with no stored exponent (bits used to increase precision).

Ø  Automatic scaling: Shifting after multiplications and divisions in order to maintain binary point.

TI C54x family (CISC) n  Modified Harvard architecture: separate buses

for program code and data ¨ PB: program read bus ¨ CB, DB: data read busses ¨ EB: data write bus ¨ PAB, CAB, DAB, EAB: address busses

n  Can generate two data memory addresses per cycle ¨ Stored in auxiliary register address units

n  High performance, reproducible behavior, optimized for different memory structures

TI C54x architectural features

n  40-bit ALU & Barrel shifter ¨ Input from accumulator or data memory ¨ Output to ALU

n  17 x 17 multiplier n  Single-cycle exponent encoder n  Two address generators with dedicated

registers n  Accumulators

¨ Low-order (0-15), high-order (16-31), guard (32-39)

TI C54x instruction set features

n  Compare, select and store unit (CSSU) unit ¨ Compares high and low accumulator words ¨ Accelerates Viterbi operations

n  Repeat and block repeat instructions n  Instructions that read 2, 3 operands

simultaneously n  Three IDLE instructions

¨ Selectively shut down CPU, on-chip peripherals, whole chip including phase-locked loop

TI C54x pipeline n  Prefetch: Send PC address on program address bus n  Fetch: Load instruction from program bus to IR n  Decode n  Access: Put operand addresses on busses n  Read: Get operands from busses n  Execute

Addressing Modes

Immediate

Register-direct

Register indirect

Direct

Indirect

Operand field

Register address

Memory address

Memory address Data

Memory address

Addressing mode

Register-file contents

Memory contents

TIs C55x Pipeline

n  Prefetch 1: ¨  Send address to memory

n  Prefetch 2: ¨  Wait for response

n  Fetch: ¨  Get instruction from memory

and put in IBQ n  Predecode:

¨  Identify where instructions begin and end

¨  Identify parallel instructions

n  Decode: ¨  Decode an instruction pair or single

instruction. n  Address:

¨  Perform address calculations. n  Access 1/2:

¨  Send address to memory; wait. n  Read:

¨  Read data from memory. Evaluate condition registers.

n  Execute: ¨  Read/modify registers. Set conditions.

n  W/W+: ¨  Write data to MMR-addressed

registers, memory; finish.

fetch execute

C55x organization

Instruction unit

Program flow unit

Address unit

Data unit

3 data read busses 3 data read address busses program address bus

program read bus

2 data write busses 2 data write address busses

Instruction fetch Data read from memory

Single operand read

C, D busses

Dual operand read

Dual-multiply coefficient Writes

C55x hardware extensions

n Target image/video applications ¨ DCT/IDCT ¨ Pixel interpolation ¨ Motion estimation

n Available in 5509 and 5510 ¨ Equivalent C-callable functions for other devices.

TI C62/C67 (VLIW) n Up to 8 instructions/cycle n  32 32-bit registers n Function units

¨ Two multipliers ¨ Six ALUs

n Data operations ¨ 8/16/32-bit arithmetic ¨ 40-bit operations ¨ Bit manipulation operations

Partitioned register files n  Many memory ports are required to supply enough

operands per cycle. n  Memories with many ports are expensive. Ø  Registers are partitioned into sets, e.g. for TI

register file A register file B

L1 S1 M1 D1 D2 M2 S2 L2

Data bus

Address bus

Data path A Data path B

C6x data paths n General-purpose register files (A and B,

16 words each) n Eight functional units:

¨ .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2 n Two load units (LD1, LD2) n Two store units (ST1, ST2) n Two register file cross paths (1X and 2X) n Two data address paths (DA1 and DA2)

C6x functional units n  .L

¨  32/40-bit arithmetic ¨  Leftmost 1 counting ¨  Logical ops

n  .S ¨  32-bit arithmetic ¨  32/40-bit shift and 32-bit field ¨  Branches ¨  Constants

n  .M ¨  16 x 16 multiply

n  .D ¨  32-bit add, subtract, circular address ¨  Load, store with 5/15-bit constant offset

C6x system n On-chip RAM n  32-bit external memory: SDRAM, SRAM n Host port n Multiple serial ports n Multichannel DMA n  32-bit timer

PROGRAMMABLE LOGIC DEVICES (PLD)

Programmable Logic Devices (PLD) n  Simple PLD (SPLD) �

¨  Programmable logic array (PLA) ¨  Programmable array logic (PAL) –

fixed OR plane n  Complex PLD (CPLD)

¨  Building block: macrocells n  Variations:

¨  Antifuse PLD ¨  Erasable EPLD & EEPLD

Name Re-programmable Volatile Technology Fuse No No Bipolar

EPROM Yes – out of circuit No UVCMOS

EEPROM Yes – in circuit No EECMOS

SRAM Yes – in circuit Yes CMOS

Antifuse No No CMOS+

Antifuse PLDs n  Actel Axcelerator family

•  Antifuse: –  open when not programmed –  Low resistance when programmed

Clk MUX

Output MUXQ

F/B MUX

Invert Control

AND ARRAY

8 Product Term AND-OR Array + Programmable MUX's

Programmable polarity

I/O Pin

Seq. Logic Block

Programmable feedback

Erasable Programmable Logic Devices

n  EPLDs: CMOS erasable programmable ROM (EPROM) erased by UV light

n  Altera’s building block is a MACROCELL

Logic Array

Blocks

(similar to macrocells)

Global Routing: Programmable Interconnect Array (PIA)

8 Fixed Inputs 52 I/O Pins 8 LABs 16 Macrocells/LAB 32 Expanders/LAB

EPM5128:

Complex Programmable Logic Devices (CPLD)

n  Altera Multiple Array Matrix (MAX) architecture n  AND-OR structures are relatively limited, cannot share

signals/product terms among macrocells

LAB A LAB H

LAB B LAB G

LAB C LAB F

LAB D LAB E

Altera MAX 7k (EEPLD) Logic Block

SRAM based PLD n  Altera Flex 10k Block Diagram

SRAM based PLD n  Altera Flex 10k Logic Array Block (LAB)

SRAM based PLD n Altera Flex 10k Logic Element (LE)

Field-Programmable Gate Arrays (FPGA) n  Unlike PLA/PAL, configuration is stored in volatile SRAM n  External memory (ROM) n  Used to emulate/prototype ASICs

Components Logic blocks Implement combinational and sequential logic

Based on lookup tables (LUT) Interconnect Wires connecting I/O to logic blocks I/O blocks Special logic blocks at periphery of device for

external connections Specialized blocks e.g. DSP

FPGA with DSP n Altera Stratix II: Block Diagram

FPGA with DSP n  Altera Stratix II DSP

Xilinx Virtex-7 n  Up to 2M logic cells n  High block RAM-to-logic cell ratios

(up to 68 Mb with 1,139K logic cells)

n  Configurable logic blocks (CLB) organized into 2 slices ¨  Lookup tables, carry chains,

registers ¨  Distributed memory and shift

register logic n  Target market:

¨  Test and measurement (T&M) applications, bridging/switch fabric, RADAR, ASIC emulation

81 http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf

Xilinx Virtex-7 Slice n  Modern slices come in two varieties

¨  SLICEM implementing logical, shift register, and memory functions

¨  SLICEL implementing logical functions only

n  SLICEM placed closer to DSP slices for easy access to coefficients

82 http://www.xilinx.com/support/documentation/white_papers/wp405-7Series-Logical-Advantage.pdf

Packing 2 independent logical functions into 1 LUT

Application-specific integrated circuit (ASICs)

n  Custom integrated circuits that have been designed for a single use or application

n  Standard single-purpose processors ¨  “Off-the-shelf”, pre-designed for a common task (e.g.

peripherals) ¨  serial transmission ¨  analog/digital conversions

Combined system: CPU+FPGA+ASICs

n  Actel Fusion Family ¨ ARM7 CPU with FPGA and ASIC implementations of

“smart peripherals” for analog functions

PROCESSOR COMPARISON THROUGH APPLICATIONS

Stereo Vision n Moving window with data sharing between

threads to avoid reprocessing ¨ Optimized implementation per device

n GPU – slow data sharing ¨ Limited speed due to memory access conflicts

n FPGA – 242 data windows (121 per image) ¨ Stepwise reduction after window size

becomes too large ¨ Pipelined execution on logic units

Stereo Vision

Krajník, Tomáš, et al. "FPGA-based module for SURF extraction." Machine vision and applications 25.3 (2014)

SURF Feature Detection n  Efficiently detect and describe interest points in

images n  Enables applications such as object recognition

and 3D reconstruction

CPU GPU FPGA Stage 1: Detector speed [ms] 5200 105 100 Stage 2: Descriptor speed [ms] 1.4 0.1 0.7 Power consumption [W] 24 24 6 Mass [g] 850 850 210 Volume [cm3] 600 600 180

Summary n Processor metrics, trends n Architectures and functions

¨ CPUs ¨ GPU ¨ DSP

n  Implementations ¨ Programmable logic – PLDs and FPGAs ¨ Custom ASICs

Sources and References n Peter Marwedel, “Embedded Systems

Design,” 2004. n Frank Vahid, Tony Givargis, “Embedded

System Design,” Wiley, 2002. n Wayne Wolf, “Computers as

Components,” Morgan Kaufmann, 2001.

design with microprocessors - university of california...

Documents

nvidia cuda video encoder - university of...

quiz1% - cseweb.ucsd.edu

the college classroom (wi15) session 6: cooperative learning...

wine recommendation for cellartracker course assignment...

the college classroom (wi15) session 2: developing expertise

course information - cseweb.ucsd.edu

cse 190 lecture 1 - cseweb.ucsd.edu

cse 127: computer security - cseweb.ucsd.edu

cse140 midterm1 solution - cseweb.ucsd.edu

fastbert - cseweb.ucsd.edu

real time operating systems - computer science and...

lecture 18 - cseweb.ucsd.edu

presentation for mini project 1 - cseweb.ucsd.edu

large&scale*sor,ng:**...

cse 237a - home | computer science and...

polymorphism - cseweb.ucsd.edu

distributed programming and remote procedure...

cse 127: introduction to security - cseweb.ucsd.edu

c meaning in product reviews with c -l generative...

udp, tcp, and unix sockets - university of california, san...