Topic III – Microcontrollers and Microprocessors

Roberto Gutiérrez Mazón

¨  Introduction
¨  Processor Architectural Features. Datapath & pipeline.
¨  Data Representation: Fixed-point vs Floating-point
¨  Interrupts, Exceptions, Watch-Dog, …
¨  32-bit microcontroller. ARM Cortex-M3
    ¤  ARM Cortex-M3 Architecture. Programmer's Model.
¨  32/64-bit microprocessor.
    ¤  Intel x86, UltraSparc Architecture. Programmer's Model.

Processor Architectural Features

What is "Computer Architecture"?

[Figure: abstraction layers of a computer system, from top to bottom]
Applications
Operating System
Compiler, Firmware
Instruction Set Architecture
Instr. Set Proc., I/O system
Datapath & Control
Digital Design
Circuit Design
Layout & fab
Semiconductor Materials

Introduction

¨  Moore's Law
¨  "Cramming More Components onto Integrated Circuits"
    ¤  Gordon Moore, Electronics, 1965
¨  The number of transistors on a cost-effective integrated circuit doubles about every 18 months.

¨  Prehistoric Computer Architecture:
    ¤  The Z1 was the first freely programmable mechanical computer in the world; it used Boolean logic and binary floating-point numbers.
    ¤  Memory: 64 words of 22 bits.
    ¤  Clock frequency: 1 Hz.
    ¤  Registers: two 22-bit floating-point registers.
    ¤  ALU: add (5 s), sub, mult (16 s), div (18 s).
    ¤  Weight: 1000 kg.

¨  The IBM zEC12 (zSeries) Microprocessor:
    ¤  5.5 GHz in IBM 32 nm PD-SOI CMOS technology
    ¤  2.75 billion transistors in 597 mm²
    ¤  64-bit virtual addressing
        n  the original S/360 was 24-bit, and S/370 was a 31-bit extension
    ¤  Six-core design
    ¤  Three-issue out-of-order superscalar pipeline
    ¤  Out-of-order memory accesses
    ¤  Redundant datapaths
        n  every instruction is performed in two parallel datapaths and the results are compared
    ¤  64 KB L1 I-cache, 128 KB L1 D-cache on-chip
    ¤  1 MB private L2 unified instruction and data cache per core, on-chip
    ¤  On-chip 48 MB eDRAM L3 cache
    ¤  Scales to a 120-core multiprocessor with 384 MB of shared L4 eDRAM

[Figure: timeline of computing technology]
Babbage's Difference Engine (1832)
ENIAC (1946)
First transistor (Shockley, Bardeen, Brattain) (1947)
Intel 4004 IC (1971)
Intel 486DX2 IC (1989)
MEMS (2000)
Cell (2005)
Intel Quad (2007)
Nanotechnology (?)
Optical processors (?)

[Figure: University of Michigan wireless sensor node (layers A, B, C), wirelessly networked into large-scale sensor arrays (Wireless Sensor Network)]
¨  Battery and solar cells
¨  Processor, SRAM and PMU: Cortex-M0 + 16 KB RAM, 65 nm
¨  Sensors, timers
¨  UWB radio antenna
¨  10 kB storage memory, ~3 fW/bit
¨  12 µAh Li-ion battery
¨  Cortex-M0; 65¢

¨  4200 ARM-powered neutrino detectors
    ¤  70 bore holes, 2.5 km deep
    ¤  60 detectors per string, starting 1.5 km down
    ¤  1 km³ of active telescope
    ¤  Work supported by the National Science Foundation and University of Wisconsin-Madison

Programming Model

¨  Microprocessors can be programmed directly using an assembly language.
¨  Differences from high-level languages:
    ¤  Use commands to execute data movements, arithmetic, logic and program control operations.
    ¤  Use registers to hold data for operations.
¨  Programmers need to know not only the assembly language for the microprocessor, but also the internal configuration of the microprocessor.

[Figure: levels of abstraction]
Level 5: High-Level Language
Level 4: Assembly Language
Level 3: Operating System
Level 2: Instruction Set Architecture
Level 1: Microarchitecture
Level 0: Digital Logic

A Basic Processor

¨  The basic components:
    ¤  Processor with its associated temporary memory (registers and cache, if available) for code execution
    ¤  Main memory and secondary memory, where code and data are stored temporarily and permanently
    ¤  Input and output modules that provide the interface between the processor and the user
¨  Connected through an interface bus that consists of:
    ¤  Address, data, and control signals, e.g. the AMBA bus for ARM-based processors

[Figure: processor core with registers and cache/SRAM memory, connected to main memory, storage memory and the I/O interface through the address bus, data bus, and bus control signals]

The gap widens between DRAM, disk, and CPU speeds.

[Figure: access time in ns (log scale, 1 to 100,000,000) versus year (1980 to 2000) for disk seek time, DRAM access time, SRAM access time and CPU cycle time]

Access time (cycles):
    register    1
    cache       1-10
    memory      50-100
    disk        20,000,000

Memory Hierarchy

¨  A typical processor is supported by:
    ¤  on-board main memory (e.g. SDRAM, up to GB)
    ¤  on-chip or on-die cache memory (e.g. SRAM, KB to MB)
    ¤  on-die registers
¨  Some processors also provide general-purpose on-chip SRAM (e.g. embedded processors)
    ¤  which may be configured as an SRAM/cache combination (e.g. TI's DSPs)
¨  Typically, a processor also uses secondary non-volatile memory
    ¤  For permanent code and data storage, like Flash-based memory and hard disks

[Figure: memory hierarchy pyramid. Towards the top: smaller, faster, and more expensive (per byte) storage devices. Towards the bottom: larger, slower, and cheaper (per byte) storage devices]
L0: registers
L1: on-chip L1 cache (SRAM)
L2: off-chip L2 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (virtual memory) (local disks)
L5: remote secondary storage (tapes, distributed file systems, Web servers)

¨  Multiple machine cycles are required when reading from memory, because it responds much more slowly than the CPU (e.g. 33 MHz). The wasted clock cycles are called wait states.

Pentium III cache hierarchy
¨  Processor chip
    ¤  Regs.
    ¤  L1 Data: 16 KB, 4-way assoc., write-through, 32 B lines, 1 cycle latency
    ¤  L1 Instruction: 16 KB, 4-way, 32 B lines
¨  L2 Unified: 128 KB to 2 MB, 4-way assoc., write-back, write allocate, 32 B lines
¨  Main memory: up to 4 GB

Address Space

¨  The address space of a processor depends on its address decoding mechanism.
    ¤  Its size depends on the number of address bits used.
¨  Depending on the processor design, there may be two types of address space:
    ¤  One is used by normal memory accesses.
    ¤  Another is reserved for I/O peripheral registers (control, status, and data).
    ¤  An extra control signal or special means is needed to access the alternate address space.
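The relation between address bits and address space size can be sketched in a few lines. This is only an illustration; the function names are hypothetical and byte-addressable memory is assumed.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width.

    Assumes byte-addressable memory (one address per byte).
    """
    return 1 << address_bits


def top_address_hex(address_bits: int) -> str:
    """Highest address of the space, formatted as hexadecimal."""
    top = (1 << address_bits) - 1
    digits = (address_bits + 3) // 4  # hex digits needed for this width
    return f"0x{top:0{digits}X}"


if __name__ == "__main__":
    for bits in (16, 20, 32):
        print(bits, "address bits:", address_space_bytes(bits), "bytes, top address", top_address_hex(bits))
```

With 16 address bits the space ends at 0xFFFF (64 KB); with 32 bits it ends at 0xFFFFFFFF (4 GB).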


Address Space

¨  Refers to the range of addresses that can be accessed by the processor, determined by the number of address bits used in the processor architecture.
¨  Some processor families (e.g. ARM) use only one address space for both memory and I/O devices
    ¤  i.e. everything is mapped in the same address space

[Figure: processor with a single address space, 0x00000000 to 0xFFFFFFFF, holding code, data and I/O registers]

Memory mapped vs I/O mapped

¨  Some processor families have two address spaces.
¨  E.g., for the x86 processor, memory and I/O devices can be mapped in two different address spaces:
    ¤  Memory address space and I/O address space

[Figure: processor with a memory address space (0x00000000 to 0xFFFFFFFF, code and data) and a separate I/O address space (0x0000 to 0xFFFF, I/O registers)]


Memory System Architectures

¨  Two types of information are found in typical program code:
    ¤  Instruction codes for execution
    ¤  Data used by the instruction codes
¨  Two classes of memory system design store this information:
    ¤  Von Neumann architecture
    ¤  Harvard architecture

[Figure: Von Neumann: the processor reaches memory holding code, data and data tables over a single path (bus) for both code and data. Harvard: the processor uses separate buses for code and data. Address labels shown: 0000h, 7FFFh, 8000h, FFFFh]


Processor Size

¨  The processor size is described in terms of bits (e.g. an 8-bit or a 32-bit processor).
    ¤  Corresponds to the data size that can be manipulated at a time by the processor.
    ¤  Typically reflected in the size of the processor's (internal) data path and register bank.
¨  Hence an 8-bit processor can only manipulate byte-size data at a time, while a 32-bit processor can handle 32-bit word-size data at a time.
    ¤  Even though the data content may only be of single-byte size.


Registers

¨  The most fundamental storage area in the processor
    ¤  is located close to the processor core
    ¤  provides very fast access, operating at the processor clock
    ¤  but is limited in number (typically fewer than 100)
¨  Most are general purpose and can store any type of information:
    ¤  data, e.g. timer value, constants
    ¤  address, e.g. ASCII table, stack
¨  Some are reserved for a specific purpose
    ¤  program counter (PC; called IP on x86).
    ¤  program status register (SR).

[Figure: instruction execution datapath. The program counter (PC) selects instructions I-1 to I-4 from program memory; the fetched instruction passes through the instruction queue to the instruction register, is decoded, its operands (op1, op2) are read from the registers, the ALU executes and sets the flags, and the result is written to the registers or to the output]

Data Organization in Memory

¨  A typical memory consists of storage locations that each store data of a fixed size (most commonly 8 bits, a byte). Each location has a unique address.
¨  Depending on the data path size of the processor, the memory content is accessible as an 8-bit byte, a 16-bit half word, a 32-bit word, or even a 64-bit double word.
¨  A 32-bit datum consists of four bytes, stored in four successive memory locations. Data and code must be aligned to the respective size boundary.
    ¤  E.g. for a 32-bit system, align to the word boundary, with the lowest two address bits equal to zero
¨  But what is the order of the four bytes of data? It depends on the endianness adopted.
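The alignment rule (lowest address bits equal to zero) can be checked with a bit mask. A minimal sketch, assuming the access size is a power of two; `is_aligned` is a hypothetical helper name.

```python
def is_aligned(address: int, size_bytes: int) -> bool:
    """True if `address` is a multiple of `size_bytes`.

    Assumes size_bytes is a power of two (1, 2, 4, 8), so the test reduces
    to checking that the low log2(size_bytes) address bits are zero.
    """
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("size_bytes must be a power of two")
    return address & (size_bytes - 1) == 0


if __name__ == "__main__":
    for addr in (0x1000, 0x1002, 0x1003, 0x1004):
        print(hex(addr), "word-aligned:", is_aligned(addr, 4), "halfword-aligned:", is_aligned(addr, 2))
```

For a 32-bit word the mask is 0b11, which is exactly the "lowest two address bits equal to zero" condition.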


Data Endianness

¨  In the Little Endian format, the least significant byte (LSB) is stored at the lowest address, and the most significant byte (MSB) at the highest address.
¨  In the Big Endian format, the least significant byte (LSB) is stored at the highest address, and the most significant byte (MSB) at the lowest address.

[Figure: a word stored from address 0x000000 upwards; Big Endian places the MSB at the lowest address, Little Endian places the LSB there]
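To make the two byte orders concrete, the sketch below lays out the 32-bit value 0x12345678 in memory under each convention. The value, the base address and the helper name `memory_layout` are illustrative assumptions.

```python
def memory_layout(value: int, byteorder: str, base: int = 0) -> dict:
    """Map each byte address (starting at `base`) to the byte stored there.

    `byteorder` is "little" or "big", as accepted by int.to_bytes.
    """
    data = value.to_bytes(4, byteorder)
    return {base + offset: byte for offset, byte in enumerate(data)}


if __name__ == "__main__":
    value = 0x12345678  # MSB = 0x12, LSB = 0x78
    for order in ("little", "big"):
        layout = memory_layout(value, order)
        print(order, {hex(a): hex(b) for a, b in layout.items()})
```

In the little-endian layout address 0 holds 0x78 (the LSB); in the big-endian layout address 0 holds 0x12 (the MSB).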


Top Boot and Bottom Boot

¨  Different processor families use different locations for the reset vector used at boot-up.
¨  Examples:
    ¤  x86 boots up from the top of the memory space
    ¤  ARM boots up from the bottom of its memory space

[Figure: memory maps from 00..00h to FF..FFh holding program and data. x86: reset vector at the top of the map. ARM: reset vector at the bottom of the map]

CISC – Complex Instruction Set Computer

¨  Philosophy: hardware is always faster than software.
¨  Objective: the instruction set should be as powerful as possible.
¨  With a powerful instruction set, fewer instructions (and less memory) are needed to complete the same task than with RISC.
¨  CISC was developed at a time (early 60's) when memory technology was not so advanced: memory was small (in terms of kilobytes) and expensive.
¨  For embedded systems, especially Internet appliances, memory efficiency comes into play again, especially in chip area and power.
¨  Many instructions
¨  Complex instructions
    ¤  Each instruction can execute several low-level operations
¨  Complex addressing modes
    ¤  Smaller number of registers needed
¨  A semantically rich instruction set is accommodated by allowing instructions of variable length.


RISC – Reduced Instruction Set Computer

¨  By reducing the number of instructions that a processor supports, and thereby the complexity of the chip, it is possible to make individual instructions execute faster and achieve a net gain in performance, even though more instructions might be required to accomplish a task.
¨  RISC trades off instruction set complexity for instruction execution time.
¨  Large register set: having more registers allows memory accesses to be minimized.
¨  Load/Store architecture: operating on data in memory directly is one of the most expensive operations in terms of clock cycles.
¨  Fixed-length instruction encoding: this simplifies the instruction fetch and decode logic and allows easy implementation of pipelining.
    ¤  All instructions use the register-to-register format, except Load/Store, which access memory.
    ¤  All instructions execute in a single cycle, except branch instructions, which require two.
    ¤  Almost all instructions have the same size and format.


Limitations of CISC
¨  A highly encoded instruction set needs to be decoded by microcode or hardwired circuitry.
    ¤  More complex hardware design
    ¤  Slower instruction decoding/execution
¨  Variable-length instructions and different execution times among instructions affect pipelined operation.

Advantages of CISC
¨  As each instruction can execute several low-level operations, the code size is reduced, saving on memory requirements. Fewer main memory accesses are required, and hence execution is faster.
¨  Backward code compatibility is maintained.
    ¤  New (and more powerful) instructions can be added while retaining the "old" instruction set for code compatibility (i.e. legacy programs can still run)
¨  Easy to program.
    ¤  Direct support of high-level language constructs.
    ¤  Complex instructions that fit well with high-level language expressions.


Limitations of RISC
¨  Fewer instructions than CISC:
    ¤  Compared to CISC, RISC needs more instructions to execute one task.
    ¤  Code density is lower.
    ¤  More memory is needed.
¨  No complex instructions:
    ¤  No hardware support for division or floating-point arithmetic operations.
    ¤  Needs a more complex compiler and a longer compile time.
¨  But ARM also adds DSP-like instructions to support commonly used signal processing functions.

Advantages of RISC
¨  Simpler instructions:
    ¤  One clock per instruction gives faster execution than on a CISC processor with the same clock speed
¨  Simpler addressing modes:
    ¤  Faster decoding
¨  Fixed-length instructions:
    ¤  Faster decoding and better pipeline performance
¨  Simpler hardware:
    ¤  Less silicon area
    ¤  Less power consumption


CISC                                      RISC
Any instruction may reference memory      Only load/store references memory
Many instructions & addressing modes      Few instructions & addressing modes
Variable instruction formats              Fixed instruction formats
Single register set                       Multiple register sets
Multi-clock cycle instructions            Single-clock cycle instructions
Micro-program interprets instructions     Hardware (FSM) executes instructions
Complexity is in the micro-program        Complexity is in the compiler
Less to no pipelining                     Highly pipelined
Program code size small                   Program code size large


RISC vs CISC

¨  RISC machines: SUN SPARC, SGI MIPS, HP PA-RISC, ARM
¨  CISC machines: Intel 80x86, Motorola 680x0
¨  What really distinguishes RISC from CISC these days lies in the architecture and not in the instruction set.
¨  CISC occurs whenever there is a disparity in speed between CPU operations and memory accesses due to technology or cost.
¨  What about combining both ideas?
    ¤  The Intel x86 Pentium (P6) architecture is externally CISC but internally RISC & CISC!
    ¤  Intel IA-64 executes many instructions in parallel.

                      CISC (Intel 486)    RISC (MIPS R4000)
#instructions         235                 94
Addr. modes           11                  1
Inst. size (bytes)    1-12                4
GP registers          8                   32

Instruction Code Format

¨  Opcode encoding depends on the number of bits used.
    ¤  Example: for ARM, all instructions are 32 bits long, but only 8 bits (bits 20 to 27) are used to encode the instruction. Hence a total of 2^8 = 256 different instructions are possible.
¨  A typical instruction is encoded with a specific bit pattern that consists of the following:
    ¤  Opcode field specifying the operation to be performed.
    ¤  Operand(s) identification (address) field that depends on the addressing mode;
        n  this provides the address of the register/memory location(s) that store the operand(s), or the operand itself.
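As an illustration of opcode and operand fields, the sketch below splits a 32-bit ARM data-processing instruction word into its fields. The field positions follow the classic 32-bit ARM data-processing encoding; the function name and the sample word 0xE0810002 (`ADD r0, r1, r2`) are illustrative choices.

```python
def decode_data_processing(word: int) -> dict:
    """Split a 32-bit ARM data-processing instruction word into its fields.

    Layout assumed: cond[31:28], 00[27:26], I[25], opcode[24:21], S[20],
    Rn[19:16], Rd[15:12], operand2[11:0].
    """
    return {
        "cond": (word >> 28) & 0xF,        # condition code (0xE = always)
        "bits_27_20": (word >> 20) & 0xFF,  # the 8 bits that select the instruction
        "immediate": (word >> 25) & 0x1,    # 1 if operand2 is an immediate
        "opcode": (word >> 21) & 0xF,       # 0b0100 = ADD
        "set_flags": (word >> 20) & 0x1,    # S bit
        "rn": (word >> 16) & 0xF,           # first operand register
        "rd": (word >> 12) & 0xF,           # destination register
        "operand2": word & 0xFFF,           # register (with shift) or immediate
    }


if __name__ == "__main__":
    fields = decode_data_processing(0xE0810002)  # ADD r0, r1, r2
    for name, value in fields.items():
        print(f"{name:10s} = {value:#x}")
```

Here the opcode field (ADD) says what to do, and the Rn, Rd and operand2 fields identify where the operands and the result live.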


Operand Addressing Types:
¨  Immediate addressing.
    ¤  Operand is given in the instruction.
¨  Register addressing.
    ¤  Operand is stored in a register.
¨  Direct addressing.
    ¤  Operand is stored in memory, with the address given in the instruction.
¨  Indirect (Index) addressing.
    ¤  Operand is stored in memory, with the address given in a register (the address is added to an offset given in the instruction).
¨  Implied addressing
    ¤  Implicit location like the stack or the program counter.

Instruction Opcode Types:
¨  General categories of instruction operations:
    ¤  Data transfer, e.g. move, load, and store
    ¤  Data manipulation, e.g. add, subtract, logical operation
    ¤  Program control, e.g. branch, subroutine call


Instruction Execution

¨  Multiple stages are involved in executing an instruction. Example:
    1)  Fetching the instruction code: reads the instruction from memory.
    2)  Decoding the instruction code: determines which instruction is to be executed.
    3)  Executing the instruction code: performs the operations necessary to complete what the instruction is supposed to do (read data from memory, write data to memory or an I/O device, perform operations only within the CPU, or a combination of those).
¨  Hence multiple processor clock cycles are needed to execute one single instruction.

[Figure: timeline without pipelining. The 1st instruction is fetched, decoded and executed; only then is the 2nd instruction fetched, decoded and executed]


Instruction Pipeline

¨  A pipeline allows concurrent execution of multiple different instructions at the same time
¨  During normal operation
    ¤  While one instruction is being executed,
    ¤  the next instruction is being decoded,
    ¤  and a third instruction is being fetched from memory.
    ¤  This allows the effective throughput to increase to one instruction per clock cycle.


Pipeline Architecture

¨  A longer pipeline can be used to further break down the operation carried out in each individual stage: simpler logic per stage allows a higher system clock.
¨  Example: a 5-stage instruction pipeline
    1)  Fetch Instruction
    2)  Decode Instruction
    3)  Fetch Operand
    4)  Execute Instruction
    5)  Store Result
¨  Parallel execution of multiple instructions, assuming the instructions are completely independent!

[Figure: timing diagram of the 1st to 5th instructions, each entering the five stages one clock cycle after the previous one]

¨  Maximum speedup ≤ number of stages
¨  Speedup ≈ (time for unpipelined operation) / (time for longest stage)
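The speedup bound can be checked numerically with an idealized model. This is a sketch under stated assumptions: every stage takes exactly one clock cycle, instructions are completely independent (no stalls or hazards), and the function names are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages cycles; each later one completes
    one cycle after its predecessor (ideal pipeline, no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined execution time."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


if __name__ == "__main__":
    for n in (1, 5, 100, 10_000):
        print(f"{n:6d} instructions, 5 stages: speedup = {speedup(n, 5):.3f}")
```

For a long instruction stream the speedup approaches, but never exceeds, the number of stages.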



ARM Cortex-A15





Data representation. Fixed-point vs Floating-point

¨  Numerical values are represented as binary fractions: -1.0 ≤ value < 1.0
¨  Why a fractional representation?
    ¤  Multiplying a fraction by a fraction always results in a fraction and will not produce an overflow (e.g., 0.99 x 0.9999 is less than 1). Successive additions may cause overflow.
    ¤  Normalized representation is convenient. Signal processing is multiplication-intensive.
    ¤  Coefficients from digital filter designs are typically already in fractional form.

[Figure: n-bit fractional word. The sign bit has weight -2^0, the radix point follows it, and the remaining bits have weights 2^-1, 2^-2, 2^-3, ..., 2^-(n-1)]

¨  Fixed-point Notation:
    ¤  The decimal point is always in a fixed location (e.g., 0.74, 0.34, etc.).
    ¤  Fixed-point notation prevents overflow (useful with a small dynamic range).
    ¤  Fixed-point notation is less expensive.
¨  How is fixed-point notation realized in a DSP?
    ¤  Most fixed-point DSPs are 16 bits.
    ¤  The range of numbers that can be represented is 2^15 - 1 to -2^15.
    ¤  The most common fixed-point format is Q15.

Q15 Notation
    Bit 15:        sign
    Bits 14 to 0:  two's complement number

Dynamic range in Q15

    Number                    Biggest    Smallest
    Fractional number         0.999      -1.000
    Scaled integer for Q15    32767      -32768

Number representations in Q15

    Decimal    Q15 = Decimal x 2^15    Q15 integer
    0.5        0.5 x 32768             16384
    0.05       0.05 x 32768            1638
    0.0012     0.0012 x 32768          39

Rules for operations
¨  Avoid operations with numbers larger than 1:
    2.0 x (0.5 x 0.45) = (0.2 x 0.5 x 0.45) x 10
                       = (0.5 x 0.45) + (0.5 x 0.45)
¨  Scale numbers before the operation: 0.5 in Q15 = 0.5 x 32768 = 16384

Addition

    Decimal                Q15                       Scale back (Q15 / 32768)
    0.5 + 0.05 = 0.55      16384 + 1638 = 18022      0.55
    0.5 – 0.05 = 0.45      16384 – 1638 = 14746      0.45

Multiplication: 2 x 0.5 x 0.45

    Decimal                  Q15                           Back to Q15 (product / 32768)    Scale back (Q15 / 32768)
    0.5 x 0.45 = 0.225       16384 x 14746 = 241598464     7373
    0.225 + 0.225 = 0.45     7373 + 7373 = 14746                                            0.45
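The Q15 arithmetic in the tables can be reproduced with a few helper functions. This is a minimal sketch, not a DSP library: the names `to_q15`, `from_q15` and `q15_mul` are hypothetical, scaling uses 2^15 = 32768 with rounding to nearest, and results are saturated to the 16-bit range.

```python
Q15_SCALE = 1 << 15              # 32768
Q15_MAX, Q15_MIN = 32767, -32768


def to_q15(x: float) -> int:
    """Convert a fraction in [-1.0, 1.0) to a Q15 integer (round, then saturate)."""
    n = round(x * Q15_SCALE)
    return max(Q15_MIN, min(Q15_MAX, n))


def from_q15(n: int) -> float:
    """Scale a Q15 integer back to a fraction."""
    return n / Q15_SCALE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values: the 32-bit product is Q30, so shift right by 15."""
    return (a * b) >> 15


if __name__ == "__main__":
    half, x = to_q15(0.5), to_q15(0.45)
    product = q15_mul(half, x)         # 0.5 x 0.45
    doubled = product + product        # 2 x 0.5 x 0.45 done as an addition
    print(half, x, product, doubled, round(from_q15(doubled), 3))
```

The multiplication by 2 is carried out as an addition of two Q15 values, so no operand larger than 1 is ever needed.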

¨  Floating-point Notation: single-precision floating-point format

    Bit No    31 ... 24       23          22 ... 0
    Field     e (exponent)    s (sign)    f (fraction)
    Width     8 bits          1 bit       23 bits

    e = exponent; a signed two's complement 8-bit field that determines the location of the binary point
    s = sign of mantissa (s = 0 positive, s = 1 negative)
    f = fractional part of the mantissa; an implied 1.0 is added to this fraction but is not allocated in the bit field since this value is always present

    Exponent (e)
    Hex two's comp.    00    01    7F     FF    80
    Decimal            0     1     127    -1    -128

    Conversion equations
             Binary              Decimal                   Equation
    s = 0    X = 01.f x 2^e      X = 01.f x 2^e            1
    s = 1    X = 10.f x 2^e      X = (-2 + 0.f) x 2^e      2

    Special case: s = 0, e = -128 gives X = 0

¨  Floating-point Numbers:

    Calculate 1.0e0
        In hex:    00 00 00 00
        In binary: 00000000 0 00000000000000000000000
        e = 0, s = 0, f = 0
        s = 0, so Equation 1 applies: X = 01.f x 2^e = 01.0 x 2^0 = 1.0

    Calculate 1.5e01
        In hex:    03 70 00 00
        In binary: 00000011 0 11100000000000000000000
        e = 3 (00000011), s = 0, f = 0.5 + 0.25 + 0.125 = 0.875 (111)
        s = 0, so Equation 1 applies: X = 01.f x 2^e = 01.875 x 2^3 = 15.0 decimal

    Calculate -2.0e0
        In hex:    00 80 00 00
        In binary: 00000000 1 00000000000000000000000
        e = 0, s = 1, f = 0
        s = 1, so Equation 2 applies: X = (-2.0 + 0.f) x 2^e = (-2.0 + 0.0) x 2^0 = -2.0

    Addition:       1.5 + (-2.0) = -0.5
    Multiplication: 1.5e00 x 1.5e01 = 2.25e01 = 22.5
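The worked examples can be verified with a small decoder for the format described here (8-bit two's complement exponent in bits 31 to 24, sign in bit 23, 23-bit fraction in bits 22 to 0). This is a sketch: the function name is hypothetical, and it treats every word with e = -128 as zero, which is a simplifying assumption.

```python
def decode_float(word: int) -> float:
    """Decode a 32-bit word: e = bits 31..24, s = bit 23, f = bits 22..0."""
    e = (word >> 24) & 0xFF
    if e >= 0x80:                 # interpret the exponent as two's complement
        e -= 0x100
    s = (word >> 23) & 0x1
    f = (word & 0x7FFFFF) / (1 << 23)   # fraction 0.f in [0, 1)
    if e == -128:                 # special case: zero (assumed for any s, f)
        return 0.0
    mantissa = (1.0 + f) if s == 0 else (-2.0 + f)   # Equation 1 or Equation 2
    return mantissa * 2.0 ** e


if __name__ == "__main__":
    for word in (0x00000000, 0x03700000, 0x00800000, 0x80000000):
        print(f"{word:#010x} -> {decode_float(word)}")
```

The three example words decode to 1.0, 15.0 and -2.0.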

¨  Dynamic Range
    ¤  Ranges of number systems

    Numbers                    Base 2                   Decimal               Two's complement hex
    Largest integer            2^31 - 1                 2 147 483 647         7F FF FF FF
    Smallest integer           -2^31                    -2 147 483 648        80 00 00 00
    Largest Q15                2^15 - 1                 32 767                7F FF
    Smallest Q15               -2^15                    -32 768               80 00
    Largest floating point     (2 - 2^-23) x 2^127      3.402823 x 10^38      7F 7F FF FF
    Smallest floating point    -2 x 2^127               -3.402823 x 10^38     7F 80 00 00

    ¤  The dynamic range of floating-point representation is very large
    ¤  Conclusion
        n  Largest integer x (1.5 x 10^29) ≈ largest floating point
        n  Largest Q15 x (1.03 x 10^34) ≈ largest floating point

¨  DSP devices are designed as floating point or fixed point.
¨  Floating-point devices usually have a full set of fixed-point instructions.
¨  Floating-point devices are easier to program.
¨  Fixed-point devices can emulate floating point in software.

    Characteristic         Floating point    Fixed point
    Dynamic range          much larger       smaller
    Resolution             comparable        comparable
    Speed                  comparable        comparable
    Ease of programming    much easier       more difficult
    Compiler efficiency    more efficient    less efficient
    Power consumption      comparable        comparable
    Chip cost              comparable        comparable
    System cost            comparable        comparable
    Design cost            less              more
    Time to market         faster            slower

¨  Applications which require:
    ¤  High precision.
    ¤  Wide dynamic range.
    ¤  High signal-to-noise ratio.
    ¤  Ease of use.
   need a floating-point processor.
¨  Drawbacks of floating-point processors:
    ¤  Higher power consumption.
    ¤  Can be more expensive.
    ¤  Can be slower than fixed-point counterparts and larger in size.

Interrupts, Exceptions, Watch-Dog, …

¨  Exceptions:
    ¤  Exception handling is a combination of hardware behaviors and software constructs designed to manage a unique condition.
        n  Related to the current program flow.
        n  Result of unexpected error conditions (such as a bus error).
        n  Result of illegal operations (guarded memory access).
        n  Some exceptions can be programmed to occur (FIT, PIT).
        n  A software routine could not execute properly (divide by 0).
    ¤  Exception handling changes the normal flow of software execution.

¨  Interrupts:
    ¤  A hardware interrupt is an asynchronous signal from hardware, originating either outside the SoC or from the programmable logic within the SoC, indicating a peripheral's need for attention.
        n  Embedded processor peripheral (FIT, PIT, for example).
        n  External bus peripheral (UART, EMAC, for example).
        n  External interrupts enter via hardware pin(s).
        n  Multiple hardware interrupts can use the general interrupt controller of the PS.
    ¤  A software interrupt is a synchronous event in software, often referred to as an exception, indicating the need for a change in execution.
        n  Examples: divide by zero, illegal instruction, user-generated software interrupt.


¨  Cortex-A9 Modes and Registers:
    ¤  Cortex-A9 has seven execution modes
        n  Five are exception modes.
        n  Each mode has its own stack space and a different subset of registers.
        n  System mode uses the user mode registers.
    ¤  Cortex-A9 has 37 registers
        n  Up to 18 are visible at any one time.
        n  Execution modes have some private registers that are banked in when the mode is changed.
        n  Non-banked registers are shared between modes.

¨  Cortex-A9 Exceptions:
    ¤  In the Cortex-A9 processor, interrupts are handled as exceptions.
    ¤  Each Cortex-A9 processor core accepts two different levels of interrupts.
        n  nFIQ interrupts from secure sources (serviced first).
        n  nIRQ interrupts from either secure or non-secure sources.


Interrupt Servicing in Cortex-A9:

¨  When an interrupt is received, the currently executing instruction completes.
¨  Save processor status
    ¤  Copies CPSR into SPSR_irq.
    ¤  Stores the return address in LR_irq.
¨  Change processor status for the exception
    ¤  Mode field bits.
    ¤  ARM or Thumb (T2) state.
    ¤  Interrupt disable bits (if appropriate).
    ¤  Sets PC to the vector address (either FIQ or IRQ).
¨  The above steps are performed automatically by the core.


General Interrupt Controller (GIC)

¨  Supports interrupt prioritization
¨  Handles up to 16 software-generated interrupts (SGI)
¨  Supports 64 shared peripheral interrupts (SPI) starting at ID 32
¨  Processes both level-sensitive interrupts and edge-sensitive interrupts
¨  Five private peripheral interrupts (PPI) dedicated for each core
    ¤  The global timer, private watchdog timer, private timer, and FIQ/IRQ from the PL.

32-bit microcontroller ARM Cortex-M3. Architecture. Programmer's Model.

A microcontroller combines onto the same microchip:
¨  The CPU core
¨  Memory (both ROM and RAM)
¨  I/O: parallel, serial, analog, digital

ARM Ltd

¨  Founded in November 1990
    ¤  Spun out of Acorn Computers
    ¤  Initial funding from Apple, Acorn and VLSI
¨  Designs the ARM range of RISC processor cores
    ¤  Licenses ARM core designs to semiconductor partners, who fabricate and sell to their customers
    ¤  ARM does not fabricate silicon itself
¨  Also develops technologies to assist with the design-in of the ARM architecture
    ¤  Software tools, boards, debug hardware
    ¤  Application software
    ¤  Bus architectures
    ¤  Peripherals, etc.

[Figure: equipment adopting 32-bit ARM microcontrollers: energy-efficient appliances, IR fire detectors, intelligent vending, tele-parking, utility meters, exercise machines, intelligent toys]

Cortex Family

¨  ARM Cortex-A family (v7-A):
    ¤  Applications processors for full OS and 3rd party applications
¨  ARM Cortex-R family (v7-R):
    ¤  Embedded processors for real-time signal processing and control applications
¨  ARM Cortex-M family (v7-M):
    ¤  Microcontroller-oriented processors for MCU and SoC applications

[Figure: Cortex processor range, from 12k gates... up to ...2.5 GHz: Cortex-M0, Cortex-M1, Cortex-M3, Cortex-M4, SC300, Cortex-R4, Cortex-A5 (x1-4), Cortex-A8, Cortex-A9 (x1-4), Cortex-A15 (x1-4)]

ARM Cortex Family

Cortex-A8
§  Architecture v7A
§  MMU
§  AXI
§  VFP & NEON support

Cortex-R4
§  Architecture v7R
§  MPU (optional)
§  AXI
§  Dual Issue

Cortex-M3
§  Architecture v7M
§  MPU (optional)
§  AHB Lite & APB

Relative Performance

    Processor               Max Freq (MHz)    Min Power (mW/MHz)
    Cortex-M0               50                0.012
    Cortex-M3               150               0.06
    ARM7                    184               0.35
    ARM926                  470               0.235
    ARM1026                 540               0.36
    ARM1136                 610               0.335
    ARM1176                 750               0.568
    Cortex-A8               1100              0.43
    Cortex-A9 Dual-core     2000              0.5

[Figure: bar chart of maximum frequency (MHz, scale 0 to 2500) per processor]

ARM architecture

¨  Load/store architecture
¨  A large array of uniform registers
¨  Fixed-length 32-bit instructions
¨  3-address instructions


Data Sizes and Instruction Sets

¨  The ARM is a 32-bit architecture.
¨  When used in relation to the ARM:
    ¤  Byte means 8 bits
    ¤  Halfword means 16 bits (two bytes)
    ¤  Word means 32 bits (four bytes)
¨  Most ARMs implement two instruction sets
    ¤  32-bit ARM Instruction Set
    ¤  16-bit Thumb Instruction Set
¨  Jazelle cores can also execute Java bytecode

[Figure: ARM and Thumb performance in Dhrystone 2.1/sec @ 20 MHz (scale 0 to 30000) for memory widths (zero wait state) of 32-bit, 16-bit, and 16-bit with 32-bit stack]

[Figure: a 32-bit word (bits 31 to 0) contains two 16-bit half words and four 8-bit bytes]

The Thumb-2 instruction set

¨  Variable-length instructions
    ¤  ARM instructions are a fixed length of 32 bits.
    ¤  Thumb instructions are a fixed length of 16 bits.
    ¤  Thumb-2 instructions can be either 16-bit or 32-bit.
¨  Thumb-2 gives approximately 26% improvement in code density over ARM.
¨  Thumb-2 gives approximately 25% improvement in performance over Thumb.


Cortex-M Programmer's Model

¨  Fully programmable in C
¨  Stack-based exception model
¨  Only two processor modes
    ¤  Thread Mode for User tasks
    ¤  Handler Mode for OS tasks and exceptions
¨  Vector table contains addresses

[Figure: register set. r0 to r12, sp (banked: Main and Process), lr, r15 (pc), and xPSR. Other labels: 32 bits, endianness, address space]

Address Space


Cortex-M3 Processor Privilege

[Figure: ARM Cortex-M3 privilege model. Labels: application code (User, Thread Mode, non-privileged); OS (Supervisor, Handler Mode, privileged); entry events System Call (SVCall), Undefined Instruction, Aborts, Interrupts, Reset; memory (instructions & data)]

Cortex-M3 Interrupt Handling

¨  One Non-Maskable Interrupt (INTNMI) supported
¨  1-240 prioritizable interrupts supported
    ¤  Interrupts can be masked
    ¤  Implementation option selects the number of interrupts supported
¨  Nested Vectored Interrupt Controller (NVIC) is tightly coupled with the processor core
¨  Interrupt inputs are active HIGH

[Figure: INTNMI and the 1-240 interrupts INTISR[239:0] enter the NVIC, which sits next to the Cortex-M3 processor core inside the Cortex-M3]

Cortex-M3 Exception Handling

¤  Reset: power-on or system reset
¤  NMI: cannot be stopped or preempted by any exception other than reset
¤  Faults
    n  Hard Fault: default fault, or any fault unable to activate
    n  Memory Manage: MPU violations
    n  Bus Fault: prefetch and memory access violations
    n  Usage Fault: undefined instructions, divide by zero, etc.
¤  SVCall: privileged OS requests
¤  Debug Monitor: debug monitor program
¤  PendSV: pending SVCalls
¤  SysTick Interrupt: internal system timer, e.g. used by an RTOS to periodically check resources or peripherals
¤  External Interrupt: e.g. external peripherals


Cortex-M3 Program Status Register

¨  One status register consisting of
    ¤  APSR - Application Program Status Register: ALU flags
    ¤  IPSR - Interrupt Program Status Register: Interrupt/Exception No.
    ¤  EPSR - Execution Program Status Register
        n  IT field: If/Then block information
        n  ICI field: Interruptible-Continuable Instruction information
¨  xPSR
    ¤  Composite of the 3 PSRs
    ¤  Stored on the stack on exception entry

xPSR bit layout
    Bits 31 to 27:  N Z C V Q
    Bits 26 to 25:  IT
    Bit 24:         T
    Bits 15 to 10:  IT/ICI
    Bits 7 to 0:    ISR Number
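A composite xPSR value can be taken apart with shifts and masks. The sketch below is illustrative only: `decode_xpsr` is a hypothetical helper, the flag and T-bit positions follow the figure, and the exception number is read from the low 9 bits as defined by the ARMv7-M architecture (the figure marks bit 7).

```python
def decode_xpsr(xpsr: int) -> dict:
    """Extract the flags, Thumb bit and exception number from an xPSR value."""
    return {
        "N": (xpsr >> 31) & 1,   # negative
        "Z": (xpsr >> 30) & 1,   # zero
        "C": (xpsr >> 29) & 1,   # carry
        "V": (xpsr >> 28) & 1,   # overflow
        "Q": (xpsr >> 27) & 1,   # saturation (sticky)
        "T": (xpsr >> 24) & 1,   # Thumb state bit
        "exception_number": xpsr & 0x1FF,  # IPSR field, 0 in Thread mode
    }


if __name__ == "__main__":
    # Hypothetical stacked value: Z and C set, Thumb state, exception 15 (SysTick)
    print(decode_xpsr(0x6100000F))
```

Exception number 0 would indicate Thread mode, i.e. no exception being serviced.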

69

Conditional Execution ¨  If – Then (IT) instruction added (16 bit)

¤  Up to 3 additional “then” or “else” conditions maybe specified (T or E) ¤  Makes up to 4 following instructions conditional

¨ Any normal ARM condition code can be used ¨ 16-bit instructions in block do not affect condition code flags

¤ Apart from comparison instruction. ¤ 32 bit instructions may affect flags (normal rules apply)

¨  Current “if-then status” stored in CPSR ¤ Conditional block maybe safely interrupted and returned to ¤ Must NOT branch into or out of ‘if-then’ block


ITTET EQ
Inst 1    MOVEQ
Inst 2    ADDEQ
Inst 3    SUBNE
Inst 4    ORREQ
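The way an IT mnemonic maps to per-instruction conditions can be sketched as follows. This is an illustrative helper, not an assembler: the name `expand_it` and the `INVERSE` table are assumptions introduced here. The first instruction always takes the base condition; each following T repeats it and each E takes its inverse.

```python
# Each condition code paired with its logical inverse
INVERSE = {
    "EQ": "NE", "NE": "EQ", "CS": "CC", "CC": "CS",
    "MI": "PL", "PL": "MI", "VS": "VC", "VC": "VS",
    "HI": "LS", "LS": "HI", "GE": "LT", "LT": "GE",
    "GT": "LE", "LE": "GT",
}


def expand_it(mnemonic: str, cond: str) -> list:
    """Return the condition applied to each instruction of an IT block."""
    suffix = mnemonic[2:]
    if not mnemonic.startswith("IT") or len(suffix) > 3:
        raise ValueError("expected IT followed by at most three T/E letters")
    if any(letter not in "TE" for letter in suffix):
        raise ValueError("only T and E may follow IT")
    conditions = [cond]  # the first instruction is always the 'then' case
    for letter in suffix:
        conditions.append(cond if letter == "T" else INVERSE[cond])
    return conditions


print(expand_it("ITTET", "EQ"))  # conditions for Inst 1..4 of the slide's example
```

For `ITTET EQ` this yields EQ, EQ, NE, EQ, matching the suffixes of MOVEQ, ADDEQ, SUBNE and ORREQ.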

70

Classes of Instructions (v4T)

[Figure: four instruction classes (Load/Store, Miscellaneous, Data Operations, Change of Flow) with example mnemonics LDR, STR, ADR, CMP, SWI, SWP, ADD, MUL, LSL, AND, MOV PC, Rm, Bcc, BL, BLX.]

71

Data Processing Instructions
¨  Consist of :
¤  Arithmetic: ADD ADC SUB SBC RSB RSC
¤  Logical: AND ORR EOR BIC
¤  Comparisons: CMP CMN TST TEQ
¤  Data movement: MOV MVN
¨  These instructions only work on registers, NOT memory.
¨  Syntax:
<Operation>{<cond>}{S} Rd, Rn, Operand2
    ▪  Comparisons set flags only - they do not specify Rd
    ▪  Data movement does not specify Rn
    ▪  Second operand is sent to the ALU via barrel shifter.


72

Using a Barrel-shifter: The 2nd Operand


Register, optionally with shift operation
•  Shift value can be either:
    •  5-bit unsigned integer
    •  Specified in bottom byte of another register
•  Used for multiplication by constant

Immediate value
•  8-bit number, with a range of 0-255
•  Rotated right through even number of positions
•  Allows increased range of 32-bit constants to be loaded directly into registers

[Figure: Operand 1 feeds the ALU directly; Operand 2 passes through the Barrel Shifter before reaching the ALU, which produces the Result.]
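Whether a given 32-bit constant fits the "8-bit value rotated right by an even amount" rule can be checked by brute force. The sketch below assumes exactly the rule stated on the slide (the classic ARM immediate form; Thumb-2 code on the Cortex-M3 uses a different modified-immediate scheme). The names `rotate_right` and `encode_immediate` are hypothetical.

```python
def rotate_right(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by 'amount' bit positions."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encode_immediate(constant: int):
    """Return (imm8, rotation) such that imm8 rotated right by rotation
    equals the constant, or None if no such pair exists."""
    constant &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):          # only even rotations are allowed
        # Undo a right-rotation by 'rotation' (i.e. rotate left by it)
        imm8 = rotate_right(constant, (32 - rotation) % 32)
        if imm8 <= 0xFF:
            return imm8, rotation
    return None


for value in (0xFF, 0xFF000000, 0x3FC, 0x101):
    print(hex(value), encode_immediate(value))
```

Constants such as 0x101 need nine significant bits, so they cannot be expressed this way and would have to be loaded from memory or built in two steps.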

73

Single Register Data Transfer
LDR      STR      Word
LDRB     STRB     Byte
LDRH     STRH     Halfword
LDRSB             Signed byte load
LDRSH             Signed halfword load

¨  Memory system must support all access sizes

¨  Syntax:
¤  LDR{<cond>}{<size>} Rd, <address>
¤  STR{<cond>}{<size>} Rd, <address>
e.g. LDREQB


74

Cortex-M3 Datapath


[Figure: Cortex-M3 datapath. Blocks: Register Bank, Mul/Div, Barrel Shifter, ALU (operand buses A and B), Writeback, Instruction Decode, Read Data Register, Write Data Register, and two Address Register / Address Incrementer pairs. Signals: INTADDR, I_HADDR, I_HRDATA, D_HADDR, D_HWDATA, D_HRDATA.]

75

Cortex-M3 Pipeline
¨  Cortex-M3 has 3-stage fetch-decode-execute pipeline
¤  Similar to ARM7
¤  Cortex-M3 does more in each stage to increase overall performance

[Figure: pipeline stages.
1st Stage - Fetch: Fetch (Prefetch)
2nd Stage - Decode: Instruction Decode & Register Read, AGU, Branch (branch forwarding & speculation)
3rd Stage - Execute: Address Phase & Write Back, Data Phase Load/Store & Branch, Multiply & Divide, Shift, ALU & Branch, Write; execute stage branch (ALU branch & Load Store Branch)]

76

SW Development


[Figure: Keil™ uVision® flow. Editor (source code) → Build Target (F7) → object code → Start Debug Session, either on a Simulated Microcontroller or, after Download, on a Real Microcontroller (each with Processor, Memory, I/O).]

Source code:
Start   ; direction register
        LDR R1,=GPIO_PORTD_DIR_R
        LDR R0,[R1]
        ORR R0,R0,#0x0F   ; make PD3-0 output
        STR R0,[R1]

Object code:
Address       Data
0x00000142    4912
0x00000144    6808
0x00000146    F040000F
0x0000014A    6008

¨  GNU compiler and binutils
¤  gcc: GNU C compiler
¤  as: GNU assembler
¤  ld: GNU linker
¤  gdb: GNU project debugger
¤  COFF (Common Object File Format)
¤  ELF (Executable and Linkable Format)
¤  Segments in the object file
    ▪  Text: code
    ▪  Data: initialized global variables
    ▪  BSS: uninitialized global variables

[Figure: GNU build flow. C source (.c) → gcc → asm source (.s) → as → object file (.coff) → ld → executable (.elf) → Simulator, Debugger, …]

77





78

Intel x86 Processor Evolution:
Name          Date    Transistors    MHz
8086          1978    29K            5-10
    ▪  First 16-bit processor. Basis for IBM PC & DOS
    ▪  1MB address space
386           1985    275K           16-33
    ▪  First 32-bit processor, referred to as IA32
    ▪  Added “flat addressing”
    ▪  Capable of running Unix
    ▪  Until recently, 32-bit Linux/gcc used no instructions introduced in later models
Pentium 4F    2005    230M           2800-3800
    ▪  First 64-bit x86 processor from Intel
    ▪  Meanwhile, Pentium 4s (Netburst arch.) phased out in favor of “Core” line

32/64bit microprocessor. Intel x86, UltraSparc

79

Intel x86 Processor Evolution:
Machine Evolution
Name           Date    Transistors
486            1989    1.9M
Pentium        1993    3.1M
Pentium/MMX    1997    4.5M
PentiumPro     1995    6.5M
Pentium III    1999    8.2M
Pentium 4      2001    42M
Core 2 Duo     2006    291M

Added Features
    ▪  Instructions to support multimedia operations
        •  Parallel operations on 1, 2, and 4-byte data, both integer & FP
    ▪  Instructions to enable more efficient conditional operations

Linux/GCC Evolution
    ▪  Very limited, needs to get better – trying to maintain compatibility


80

Intel x86 Processor Evolution:
Name                   Date    Transistors
Itanium                2001    10M
    ▪  First shot at 64-bit architecture: first called IA64
    ▪  Radically new instruction set designed for high performance
    ▪  Can run existing IA32 programs
        •  On-board “x86 engine”
    ▪  Joint project with Hewlett-Packard - Boat Anchor
Itanium 2              2002    221M
    ▪  Big performance boost
Itanium 2 Dual-Core    2006    1.7B
¤  Itanium has not taken off in marketplace
    ▪  Lack of backward compatibility, no good compiler support, Pentium 4 too good.


81

¨  IA-32 architecture
¤  Lots of architecture improvements: pipelining, superscalar, branch prediction, hyperthreading and multi-core.
¤  From programmer’s point of view, IA-32 has not changed substantially except the introduction of a set of high-performance instructions

¨  Modes of operation
¤  Protected mode
    ▪  Native mode (Windows, Linux), full features, separate memory
¤  Real-address mode
    ▪  Native MS-DOS
¤  System management mode
    ▪  Power management, system security, diagnostics
•  Virtual-8086 mode
    •  hybrid of Protected mode
    •  each program has its own 8086 computer

¨  Addressable Memory
¤  Protected mode
    ▪  4 GB
    ▪  32-bit address
¤  Real-address and Virtual-8086 modes
    ▪  1 MB space
    ▪  20-bit address

82

¨  General Purpose Registers


[Figure: IA-32 basic register set]
32-bit General-Purpose Registers: EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI
16-bit Segment Registers: CS, SS, DS, ES, FS, GS
Also shown: EIP, EFLAGS

83

¨  Accessing parts of registers
¤  Use 8-bit name, 16-bit name, or 32-bit name
¤  Applies to EAX, EBX, ECX, and EDX
¤  The 16-bit registers are usually used only in real-address mode.

[Figure: EAX (32 bits) contains AX (16 bits) in its low half; AX is split into AH and AL (8 bits + 8 bits).]
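The overlap between EAX, AX, AH and AL can be illustrated with plain masking. This is a small sketch with hypothetical helper names (`split_eax`, `write_al`); it models only the aliasing of the register names, not any instruction behaviour.

```python
def split_eax(eax: int) -> dict:
    """Return the values seen through the 32-, 16- and 8-bit register names."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF          # low 16 bits of EAX
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,  # high byte of AX
        "AL": ax & 0xFF,         # low byte of AX
    }


def write_al(eax: int, value: int) -> int:
    """Writing AL changes only the lowest byte of EAX."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


parts = split_eax(0x12345678)
print({name: hex(v) for name, v in parts.items()})
print(hex(write_al(0x12345678, 0xFF)))
```

The same pattern applies to EBX, ECX and EDX with their BX/BH/BL, CX/CH/CL and DX/DH/DL names.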

84

¨  Floating-point, MMX, XMM registers.
¤  Eight 80-bit floating-point data registers
    ▪  ST(0), ST(1), . . . , ST(7)
    ▪  arranged in a stack
    ▪  used for all floating-point arithmetic
¤  Eight 64-bit MMX registers.
¤  Eight 128-bit XMM registers for single-instruction multiple-data (SIMD) operations.

[Figure: FPU register set. 80-bit Data Registers: ST(0) to ST(7). 48-bit Pointer Registers: FPU Instruction Pointer, FPU Data Pointer. 16-bit Control Registers: Tag Register, Control Register, Status Register. Also shown: Opcode Register.]

85

¨  Programmer’s Model


86

¨  IA-32 addressing Modes



87

¨  IA-32 Memory Management
¤  Real-address mode
    ▪  1 MB RAM maximum addressable (20-bit address)
    ▪  Application programs can access any area of memory
    ▪  Single tasking
    ▪  Supported by MS-DOS operating system

[Figure: the 1 MB real-mode address space drawn as linear addresses 00000 to F0000 in steps of 10000. One segment (64K) spans 8000:0000 to 8000:FFFF; in the address 8000:0250, 8000 is the segment (seg) and 0250 the offset (ofs).]

Segmented memory addressing: the absolute (linear) address is the 16-bit segment value multiplied by 10h (shifted left by 4 bits) plus a 16-bit offset; for example, 8000:0250 gives 80250h.
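The segment:offset arithmetic is short enough to write out. The sketch below uses a hypothetical function name, `real_mode_linear`, and assumes a 20-bit address bus so that results wrap at 1 MB.

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit linear address."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)  # segment * 10h + offset
    return linear & 0xFFFFF   # assumption: only 20 address lines, so wrap at 1 MB


print(hex(real_mode_linear(0x8000, 0x0250)))  # address used in the figure
print(hex(real_mode_linear(0x8000, 0xFFFF)))  # last byte of that 64K segment
print(hex(real_mode_linear(0x8025, 0x0000)))  # a different pair, same location
```

Because segments start every 16 bytes but span 64 KB, many different segment:offset pairs name the same byte, as the last line shows.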

88

¨  IA-32 Memory Management
¤  Protected mode
    ▪  4 GB addressable RAM (32-bit address)
    ▪  (00000000 to FFFFFFFFh)
    ▪  Each program assigned a memory partition which is protected from other programs
    ▪  Designed for multitasking
    ▪  Supported by Linux & MS-Windows
    ▪  Segment descriptor tables
    ▪  Program structure
        •  code, data, and stack areas
        •  CS, DS, SS segment descriptors
        •  global descriptor table (GDT)
    ▪  MASM Programs use the Microsoft flat memory model.

[Figure: flat segmentation model.]

[Figure: multi-segment model. Local Descriptor Table entries point to RAM locations 3000, 8000 and 26000; each limit is multiplied by 1000h.]
base        limit    access
00026000    0010
00008000    000A
00003000    0002

89

¨  IA-32 Memory Management
¤  Translating Addresses
    ▪  The IA-32 processor uses a one- or two-step process to convert a variable's logical address into a unique memory location.
    ▪  The first step combines a segment value with a variable’s offset to create a linear address.
    ▪  The second optional step, called page translation, converts a linear address to a physical address.

[Figure: a Logical address is a Selector plus an Offset. The Selector picks a Segment Descriptor from the Descriptor table (GDTR/LDTR contains the base address of the descriptor table); the descriptor's base is added to the Offset to give the Linear address.]

90

¨  IA-32 Memory Management
¤  Indexing into a Descriptor Table
    ▪  Each segment selector (the value held in a segment register) indexes into the program's local descriptor table (LDT). Each table entry is mapped to a linear address.

[Figure: the LDTR register locates the Local Descriptor Table, whose entries point into the linear address space (DRAM); part of that space is marked (unused).]
Logical addresses (selector, offset):
SS, ESP       0018    0000003A
DS, offset    0010    000001B6
IP            0008    00002CD3
Local Descriptor Table:
(index)    base
18         001A0000
10         0002A000
08         0001A000
00         00003000
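The first translation step (selector plus offset to linear address) can be sketched with the values from the figure. This is a simplified model under stated assumptions: the selector is used directly as the table index, as the slide draws it, whereas real selectors also carry a table-indicator bit and a privilege level in their low three bits, and real descriptors include a limit that is checked. The names `LDT` and `to_linear` are hypothetical.

```python
# Descriptor bases taken from the figure, keyed by the (index) column
LDT = {
    0x00: 0x00003000,
    0x08: 0x0001A000,
    0x10: 0x0002A000,
    0x18: 0x001A0000,
}


def to_linear(selector: int, offset: int, table=LDT) -> int:
    """Look up the segment base for a selector and add the offset."""
    base = table[selector]               # KeyError if the selector has no entry
    return (base + offset) & 0xFFFFFFFF  # linear addresses are 32 bits wide


for selector, offset in ((0x0018, 0x0000003A),
                         (0x0010, 0x000001B6),
                         (0x0008, 0x00002CD3)):
    print(f"{selector:04X}:{offset:08X} -> {to_linear(selector, offset):08X}")
```

If paging is disabled, the linear address produced here is already the physical address; otherwise it goes through page translation.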

91

¨  IA-32 Memory Management
¤  Paging
    ▪  Virtual memory uses disk as part of the memory, so the sum of all programs can be larger than physical memory. Only part of a program must be kept in memory, while the remaining parts are kept on disk.
    ▪  The memory used by the program is divided into small units called pages (4096 bytes each).
    ▪  As the program runs, the processor selectively unloads inactive pages from memory and loads other pages that are immediately required.
    ▪  OS maintains page directory and page tables
    ▪  Page translation: CPU converts the linear address into a physical address
    ▪  Page fault: occurs when a needed page is not in memory, and the CPU interrupts the program
    ▪  Virtual memory manager (VMM) – OS utility that manages the loading and unloading of pages
    ▪  OS copies the page into memory, program resumes execution

[Figure: page translation. The 32-bit Linear Address is split into Directory (10 bits), Table (10 bits) and Offset (12 bits). CR3 locates the Page Directory; the Directory Entry locates a Page Table; the Page-Table Entry locates the Page Frame; the Offset selects the Physical Address within it.]
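The 10/10/12 split of a linear address, and the two-level lookup that follows, can be sketched as below. The page directory is modelled as nested Python dictionaries purely for illustration; the names `split_linear` and `translate`, and the sample frame address, are hypothetical, and a missing entry stands in for a page fault.

```python
def split_linear(addr: int):
    """Split a 32-bit linear address into (directory, table, offset)."""
    addr &= 0xFFFFFFFF
    directory = (addr >> 22) & 0x3FF   # top 10 bits: page-directory index
    table = (addr >> 12) & 0x3FF       # next 10 bits: page-table index
    offset = addr & 0xFFF              # low 12 bits: byte within the 4096-byte page
    return directory, table, offset


def translate(addr: int, page_directory: dict):
    """Return the physical address, or None to signal a page fault."""
    directory, table, offset = split_linear(addr)
    page_table = page_directory.get(directory)
    if page_table is None:
        return None                    # no page table present
    frame_base = page_table.get(table)
    if frame_base is None:
        return None                    # page not in memory
    return frame_base + offset


# Toy mapping: directory entry 1, table entry 3 -> page frame at 0x0009A000
toy_directory = {1: {3: 0x0009A000}}
print(split_linear(0x00403025))
print(translate(0x00403025, toy_directory))
print(translate(0x00800000, toy_directory))  # unmapped, so a page fault
```

On a page fault the OS would load the missing page, fill in the table entry and let the program retry the access.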

92

¨  Interrupt Handling
¨  Processor generates interrupts that index into an Interrupt Descriptor Table, whose base is stored in IDTR and loaded using the privileged instruction LIDT.
¨  The descriptors in IDT can be
¤  Interrupt gate: ISR handled as a normal call subroutine – uses the interrupted processor stack to save EIP, CS (and SS, ESP in case of a stack switch – the new stack is taken from the TSS).
¤  Task gate: ISR handled as a task switch
    ▪  Needed for stack fault in CPL = 0 and double faults.


93

Intel® Core® Micro-architecture Blocks


[Figure: Intel Core micro-architecture block diagram.
Fetch / Decode: 32 KB Instruction Cache, Next IP, Branch Target Buffer, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT).
Scheduler / Dispatch Ports: Reservation Stations (RS), 32 entry.
Execute: ports feeding Integer Arithmetic, Integer Shift/Rotate, SIMD Integer Arithmetic, FP Add, FP Div/Mul, Load, Store Addr and Store Data units; Memory Order Buffer (MOB); 32 KB Data Cache.
Retire: Re-Order Buffer (ROB), 96 entry; IA Register Set.
Bus Unit: to L2 Cache.]

94

Intel® Core® Micro-architecture Blocks
¨ Intel® Wide Dynamic Execution
¤  14-stage efficient pipeline
    ▪  Wider decoding capacity
    ▪  Advanced branch prediction
    ▪  Wider execution path
¤  64-Bit Support
    ▪  Merom, Conroe, and Woodcrest support EM64T
¨ Intel® Advanced Smart Cache
¤  Multi-core optimization
    ▪  Shared between the two cores
    ▪  Advanced Transfer Cache architecture
    ▪  Reduced bus traffic
    ▪  Both cores have full access to the entire cache
    ▪  Dynamic Cache sizing
¤  Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache

Execution Unit Overview
Execute 6 operations/cycle
•  3 Memory Operations
    •  1 Load
    •  1 Store Address
    •  1 Store Data
•  3 “Computational” Operations

Unified Reservation Station
•  Schedules operations to Execution units
•  Single Scheduler for all Execution Units
•  Can be used by all integer, all FP, etc.

[Figure: the Unified Reservation Station dispatches to Port 0 through Port 5. Units shown: Load, Store Address, Store Data, Integer ALU & Shift (twice), Integer ALU & LEA, Branch, FP Add, FP Multiply, FP Shuffle, Complex Integer, Divide, SSE Integer ALU / Integer Shuffles (twice), SSE Integer Multiply.]

95


¨ Instruction Decode
¤  Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation

Micro-op fusion effectively widens the pipeline

96

Intel® Core® Micro-architecture Blocks
¨ Intel® Advanced Digital Media Boost
¤  Single Cycle SSE
    ▪  8 Single Precision Flops/cycle
    ▪  4 Double Precision Flops/cycle
¤  Wide Operations
    ▪  128-bit packed Add
    ▪  128-bit packed Multiply
    ▪  128-bit packed Load
    ▪  128-bit packed Store
¤  Support for Intel® EM64T instructions

[Figure: SSE Operation (SSE/SSE2/SSE3) on 128-bit operands (bits 0 to 127). SOURCE and DEST hold the packed elements X4, X3, X2, X1 and Y4, Y3, Y2, Y1; the SSE/2/3 OP produces X4opY4, X3opY3, X2opY2, X1opY1. The previous micro-architecture needs CLOCK CYCLE 1 and CLOCK CYCLE 2 to produce the result; the Core™ µarch produces all four elements in CLOCK CYCLE 1.]

97


¨  Hyperthreading
¤  Ability of processor to run multiple threads
    ▪  Duplicate architecture state creates illusion to SW of Dual Processor (DP).
    ▪  Execution unit shared between two threads, but dedicated if one stalls.
¤  Almost two Logical Processors.
¤  Architecture state (registers) and APIC duplicated.
¤  Share execution units, caches, branch prediction, control logic and buses.

[Figure: two copies of Architecture State and Adv. Programmable Interrupt Control share one Processor Execution Resource, the On-Die Caches and the System Bus.]

98


¨ Power Efficient Support
¤  Advanced power gating & Dynamic power coordination
    ▪  Multi-point demand-based switching
    ▪  Voltage-Frequency switching separation
    ▪  Supports transitions to deeper sleep modes
    ▪  Event blocking
    ▪  Clock partitioning and recovery
    ▪  Dynamic Bus Parking
    ▪  During periods of high performance execution, many parts of the chip core can be shut off

[Figure: four cores, each with its own Vcc, Freq., Sensors and PLL, plus the Uncore / LLC with its own PLL; the PCU, fed by BCLK and Vcc, coordinates them.]

99

X86-64 Architecture

¨  Full support for 64-bit integers
¤  All general-purpose registers are expanded from 32 bits to 64 bits
¤  All arithmetic and logical operations, memory-to-register, and register-to-memory operations are now directly supported for 64-bit integers
¤  Pushes and pops on the stack are always in eight-byte strides, and pointers are eight bytes wide
¨  Additional registers
¤  The number of named registers is increased from 8 (i.e. eax, ebx, ecx, edx, ebp, esp, esi, edi) to 16.
¤  Compilers can keep more local variables in registers rather than on the stack.
¤  Can use registers for frequently accessed constants.
¤  Arguments for small and fast subroutines may also be passed in registers to a greater extent.


100

X86-64 Architecture

¨  Larger virtual address space
¤  Current models can address up to 256 terabytes
¤  Expandable in the future to 16 exabytes
¤  Compared to just 4 gigabytes for 32-bit x86

¨  Larger physical address space
¤  Current models can address up to 1 terabyte
¤  Expandable in the future to 4 petabytes
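The sizes quoted above follow directly from the number of address bits. The sketch below is a quick check; the bit widths (48 and 64 virtual, 40 and 52 physical, 32 for plain x86) are not stated on the slide and are inferred here as the powers of two that reproduce its figures, and the helper name `address_space` is hypothetical.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]


def address_space(bits: int) -> str:
    """Return the size of a byte-addressed space with the given address width."""
    size = 1 << bits                       # 2**bits addressable bytes
    unit = 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024                      # step up one binary unit
        unit += 1
    return f"{size} {UNITS[unit]}"


for label, bits in (("32-bit x86", 32),
                    ("virtual, current", 48),
                    ("virtual, future", 64),
                    ("physical, current", 40),
                    ("physical, future", 52)):
    print(f"{label:18s} {bits} bits -> {address_space(bits)}")
```

Each extra address bit doubles the reachable space, which is why going from 32 to 48 bits multiplies it by 65,536.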


101

UltraSparc (RISC)

¨  Sun Microsystems (ORACLE)
¨  Sparc = Scalable Processor Architecture. Open processor architecture
¨  SUN UltraSparc v9:
¤  RISC Architecture big-endian.
¤  64 bit address and data.
¤  Memory Management Unit (MMU).
¤  Superscalar.
¤  OpenSparc (open-source)
¤  LEON (soft-core). Space rated. VHDL

Begin developing Sparc – 1984
First Sparc Processor – 1986
SuperSparc – 1992
UltraSparc I – 1995
UltraSparc II – 1997
UltraSparc III – 2001
UltraSparc IV – 2004
UltraSparc IV+ – 2005
UltraSparc T1 – 2005
UltraSparc T2 – 2007
Sparc T3 – 2010
Sparc T4 – 2011
Sparc T5 – 2013

102

UltraSparc (RISC)

¨  Registers
¤  ~160 general-purpose registers
¤  Any procedure can access only 32 registers (r0~r31)
    ▪  First 8 registers (r0~r7) are global, i.e. they can be accessed by all procedures on the system (r0 is zero)
    ▪  Other 24 registers can be visualized as a window through which part of the register file can be seen
¤  Program counter (PC)
    ▪  The address of the next instruction to be executed
¤  Condition code registers
¤  Other control registers

¨  Data Formats
¤  Integers are 8-, 16-, 32-, 64-bit binary numbers
¤  2’s complement is used for negative values
¤  Support both big-endian and little-endian byte orderings
    ▪  (big-endian means the most significant part of a numeric value is stored at the lowest-numbered address)
¤  Three different floating-point data formats
    ▪  Single-precision, 32 bits long (23 + 8 + 1)
    ▪  Double-precision, 64 bits long (52 + 11 + 1)
    ▪  Quad-precision, 128 bits long (112 + 15 + 1)
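Byte ordering and the 23 + 8 + 1 split of a single-precision value can both be shown with the standard library. This is an illustrative sketch; the helper names `byte_orders` and `split_single` are hypothetical, and the field split assumes the usual layout of sign bit, then exponent, then fraction.

```python
import struct


def byte_orders(value: int):
    """Return the 4-byte big-endian and little-endian images of a 32-bit integer."""
    return value.to_bytes(4, "big"), value.to_bytes(4, "little")


def split_single(x: float):
    """Split a single-precision value into (sign, exponent, fraction) = 1 + 8 + 23 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the float as an integer
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


big, little = byte_orders(0x12345678)
print(big.hex(), little.hex())   # most significant byte first vs. last
print(split_single(-1.5))
```

In the big-endian image the most significant byte (0x12) sits at the lowest address, matching the definition given in the slide.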

103

UltraSparc (RISC)

¨  Addressing Modes
¤  Immediate mode
¤  Register direct mode
¤  Memory addressing

Mode                                   Target address calculation
PC-relative*                           TA = (PC) + displacement {30 bits, signed}
Register indirect with displacement    TA = (register) + displacement {13 bits, signed}
Register indirect indexed              TA = (register-1) + (register-2)

*PC-relative is used only for branch instructions
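The two register-based target-address calculations can be sketched as follows, including the sign extension that a 13-bit signed displacement requires. The function names are hypothetical, and wrapping results to 64 bits is an assumption based on the 64-bit addresses mentioned for SPARC v9; the PC-relative form is left out because its displacement scaling is not given here.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low 'bits' bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def indirect_with_displacement(register: int, disp13: int) -> int:
    """TA = (register) + displacement, with a 13-bit signed displacement."""
    return (register + sign_extend(disp13, 13)) & MASK64


def indirect_indexed(register_1: int, register_2: int) -> int:
    """TA = (register-1) + (register-2)."""
    return (register_1 + register_2) & MASK64


print(hex(indirect_with_displacement(0x1000, 8)))       # small positive displacement
print(hex(indirect_with_displacement(0x1000, 0x1FFF)))  # 0x1FFF encodes -1
print(hex(indirect_indexed(0x2000, 0x30)))
```

A 13-bit signed field covers displacements from -4096 to +4095; anything larger needs the indexed form with the offset held in a second register.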


¨  Instruction Set
¤  <150 instructions
¤  Pipelined execution
    ▪  While one instruction is being executed, the next one is fetched from memory and decoded
¤  Delayed branches
    ▪  The instruction immediately following the branch instruction is actually executed before the branch is taken
¤  Special-purpose instructions
    ▪  High-bandwidth block load and store operations
    ▪  Special “atomic” instructions to support multi-processor systems

¨  Input and Output
¤  A range of memory locations is logically replaced by device registers
¤  Each I/O device has a unique address, or set of addresses
¤  No special I/O instructions are needed

104

UltraSparc T2 (RISC)

¨  Multi-threaded (8), multi-core (8) CPU
¨  Frequency ranges from 900MHz to 1.4GHz
¨  Powered by less than 95 watts (nominal) with less than 2 watts per thread
¨  Integrated
¤  10 Gb Ethernet networking
¤  PCI Express I/O expansion
¤  FPU and cryptographic processing units per core

¨  Codename Niagara2
¨  Member of SPARC family
¨  2 previous multi-core processors
¤  UltraSPARC IV
¤  UltraSPARC IV+
¨  UltraSPARC T1 (first multi-core and multi-threaded)
¤  Released 14 November 2005
¤  4, 6, or 8 cores with 4 threads each
¨  UltraSPARC T2 Released 7 August 2007
¤  Now 8 threads per core (instead of 4)

105


¨  8 Fully pipelined FPUs
¨  8 SPUs
¨  2 integer ALUs per core, each one shared by a group of four threads
¨  4MB L2 Cache (8-banks, 16-way associative)
¨  8 KB data cache and 16 KB instruction cache
¨  Two 10Gb Ethernet ports and one PCIe port


106



107

UltraSparc T2.Core Architecture


108

UltraSparc T2.Core Architecture


109

UltraSparc T2 Pipeline

¨  Eight-stage integer pipeline
Fetch  Cache  Pick  Decode  Execute  Mem  Bypass  W
¤  Pick is for selecting 2 threads for execution (Added this stage for T2)
¤  In the bypass stage, the load/store unit (LSU) forwards data to the integer register files (IRFs) with sufficient write timing margin. All integer operations pass through the bypass stage.

¨  12-stage floating point pipeline
Fetch  Cache  Pick  Decode  Execute  Fx1 . . . Fx5  FB  FW
¤  6-cycle latency for dependent FP ops
¤  Integer multiplies are pipelined between different threads. Integer multiplies block within the same thread.
¤  Integer divide is a long latency operation. Integer divides are not pipelined between different threads.

110

MIPS (ARM) vs x86


                    MIPS (ARM)            x86
Address:            32/64-bit             32/64-bit
Page size:          4KB                   4KB
Data:               aligned               unaligned
Destination reg:    Left                  Right
                    add $rd,$rs1,$rs2     add %rs1,%rs2,%rd
Regs:               $0, $1, ..., $31      %r0, %r1, ..., %r7
Reg = 0:            $0                    (n.a.)
Return address:     $31                   (n.a.)

MIPS: “Three-address architecture”
•  Arithmetic-logic instructions specify all 3 operands
   add $s0,$s1,$s2 # s0=s1+s2
Benefit: fewer instructions → higher performance

x86: “Two-address architecture”
•  Only 2 operands, so the destination is also one of the sources
   add $s1,$s0 # s0=s0+s1
   Often true in C statements: c += b;
Benefit: smaller instructions → smaller code

111

MIPS (ARM) vs x86


MIPS: “load-store architecture”
•  Only Load/Store access memory; the remaining operations are register-register; e.g.,
   lw $t0, 12($gp)
   add $s0,$s0,$t0 # s0=s0+Mem[12+gp]
Benefit: simpler hardware → easier to pipeline, higher performance

x86: “register-memory architecture”
•  All operations can have an operand in memory; other operand is a register; e.g.,
   add 12(%gp),%s0 # s0=s0+Mem[12+gp]
Benefit: fewer instructions → smaller code

MIPS: “fixed-length instructions”
•  All instructions same size, e.g., 4 bytes
•  Simple hardware → higher performance
•  Branches can be multiples of 4 bytes

x86: “variable-length instructions”
•  Instructions are a whole number of bytes long: 1 to 15
   → small code size (30% smaller?)
•  More Recent Performance Benefit: better instruction cache hit rates
•  Instructions can include 8- or 32-bit immediates
