Topic III – Microcontrollers and Microprocessors
1
Roberto Gutiérrez Mazón
2
¨ Introduction
¨ Processor Architectural Features. Datapath & pipeline.
¨ Data Representation: Fixed-point vs Floating-point
¨ Interrupts, Exceptions, Watchdog, …
¨ 32-bit microcontroller. ARM Cortex-M3
¤ ARM Cortex-M3 Architecture. Programmer's Model.
¨ 32/64-bit microprocessor.
¤ Intel x86, UltraSPARC Architecture. Programmer's Model
3
What is “Computer Architecture”??
Processor Architectural Features
Instruction Set Architecture
Applications
Compiler
Operating System
Firmware
I/O system
Instr. Set Proc.
Digital Design
Circuit Design
Datapath & Control
Layout & fab
Semiconductor Materials
4
Introduction
¨ Moore's Law
¨ "Cramming More Components onto Integrated Circuits"
¤ Gordon Moore, Electronics, 1965
¨ The number of transistors on a cost-effective integrated circuit doubles every 18 months
5
Introduction
¨ Prehistoric Computer Architecture:
¤ The Z1 was the first freely programmable mechanical computer in the world; it used Boolean logic and binary floating-point numbers.
¤ Memory: 64 words of 22 bits.
¤ Clock frequency: 1 Hz
¤ Registers: two 22-bit floating-point registers.
¤ ALU: add (5 s), sub, mult. (16 s), div (18 s).
¤ Weight: 1000 kg
6
Introduction
¨ The IBM zEC12 (zSeries) microprocessor:
¤ 5.5 GHz in IBM 32 nm PD-SOI CMOS technology
¤ 2.75 billion transistors in 597 mm²
¤ 64-bit virtual addressing
• the original S/360 was 24-bit, and S/370 was a 31-bit extension
¤ Six-core design
¤ Three-issue out-of-order superscalar pipeline
¤ Out-of-order memory accesses
¤ Redundant datapaths
• every instruction is performed in two parallel datapaths and the results are compared
¤ 64 KB L1 I-cache, 128 KB L1 D-cache, on-chip
¤ 1 MB private L2 unified instruction and data cache per core, on-chip
¤ On-chip 48 MB eDRAM L3 cache
¤ Scales to a 120-core multiprocessor with 384 MB of shared L4 eDRAM
Babbage's Difference Engine (1832)
First transistor (Shockley, Bardeen, Brattain) (1947)
Intel 4004 IC (1971)
Intel 486DX2 IC (1989)
Intel Quad (2007)
Cell (2005)
Nanotechnology (?)
MEMS (2000)
Optical processors (?)
7
Introduction
ENIAC(1946)
8
Introduction
[Figure: wireless sensor node, University of Michigan]
Wirelessly networked into large-scale sensor arrays (wireless sensor network)
Battery, solar cells
Processor, SRAM and PMU
Sensors, timers
Cortex-M0 + 16 KB RAM, 65 nm; UWB radio antenna
10 kB storage memory, ~3 fW/bit
12 µAh Li-ion battery
Cortex-M0; 65¢
9
Introduction
4200 ARM powered Neutrino Detectors
Work supported by the National Science Foundation and University of Wisconsin-Madison
70 bore holes, 2.5 km deep; 60 detectors per string, starting 1.5 km down; 1 km³ of active telescope
10
Introduction
11
12
Programming Model
¨ Microprocessors can be programmed directly using an assembly language.
¨ Differences from high-level languages:
¤ Assembly uses instructions that perform data movement, arithmetic, logic and program-control operations.
¤ Assembly uses registers to hold the data being operated on.
¨ Programmers need to know not only the assembly language of the microprocessor, but also its internal configuration.
Processor Architectural Features
Level 5: High-Level Language
Level 4: Assembly Language
Level 3: Operating System
Level 2: Instruction Set Architecture
Level 1: Microarchitecture
Level 0: Digital Logic
13
A Basic Processor
¨ The basic components:
¤ Processor with its associated temporary memory (registers, and cache if available) for code execution
¤ Main memory and secondary memory, where code and data are stored temporarily and permanently, respectively
¤ Input and output modules that provide the interface between the processor and the user
¨ These are connected through an interface bus that consists of:
¤ Address, data, and control signals, e.g. the AMBA bus for ARM-based processors
Processor Architectural Features
[Figure: processor core with registers, cache/SRAM memory, main memory, storage memory and I/O interface, connected by the address bus, data bus, and bus control signals]
14
Processor Architectural Features
[Figure: the gap widens between DRAM, disk, and CPU speeds. Log-scale plot (ns) of disk seek time, DRAM access time, SRAM access time and CPU cycle time against year, 1980 to 2000]
Access time (cycles) | register | cache | memory | disk
 | 1 | 1-10 | 50-100 | 20,000,000
15
Memory Hierarchy
¨ A typical processor is supported by:
¤ on-board main memory (e.g. SDRAM, up to GB)
¤ on-chip or on-die cache memory (e.g. SRAM, KB to MB)
¤ on-die registers
¨ Some processors also provide general-purpose on-chip SRAM (e.g. embedded processors)
¤ It may be configured as an SRAM/cache combination (e.g. TI's DSPs)
¨ Typically, a processor also uses secondary non-volatile memory
¤ For permanent code and data storage, such as Flash-based memory and hard disks
Processor Architectural Features
Smaller, faster, and more expensive (per byte) storage devices
L0: registers
L1: on-chip L1 cache (SRAM)
L2: off-chip L2 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks, virtual memory)
L5: remote secondary storage (tapes, distributed file systems, Web servers)
Larger, slower, and cheaper (per byte) storage devices
16
Processor Architectural Features
¨ Multiple machine cycles are required when reading from memory, because memory responds much more slowly than the CPU (e.g. 33 MHz). The wasted clock cycles are called wait states.
[Figure: Pentium III cache hierarchy]
Processor chip: Regs.
L1 Data: 16 KB, 4-way assoc., write-through, 32 B lines, 1-cycle latency
L1 Instruction: 16 KB, 4-way, 32 B lines
L2 Unified: 128 KB to 2 MB, 4-way assoc., write-back, write-allocate, 32 B lines
Main memory: up to 4 GB
17
Address Space
¨ The address space of a processor depends on its address decoding mechanism.
¤ Its size depends on the number of address bits used.
¨ Depending on the processor design, there may be two types of address space:
¤ One is used for normal memory accesses.
¤ Another is reserved for I/O peripheral registers (control, status, and data).
¤ Accessing the alternate address space needs an extra control signal or other special means.
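The relation between the number of address bits and the size of the address space can be sketched in a few lines. This is a minimal illustration; the function names are hypothetical and not part of the slides.

```python
def address_space_size(address_bits: int) -> int:
    """Number of distinct addresses reachable with `address_bits` address lines."""
    if address_bits < 0:
        raise ValueError("address_bits must be non-negative")
    return 1 << address_bits  # 2 ** address_bits


def highest_address(address_bits: int) -> str:
    """Highest address as a hexadecimal string, e.g. 0xFFFF for 16 bits."""
    hex_digits = (address_bits + 3) // 4
    top = address_space_size(address_bits) - 1
    return "0x{:0{width}X}".format(top, width=hex_digits)


# 16-bit I/O space, 20-bit and 32-bit memory spaces
for bits in (16, 20, 32):
    print(bits, "bits:", address_space_size(bits), "locations, top =", highest_address(bits))
```

With 32 address bits the range is 0x00000000 to 0xFFFFFFFF (4 GiB of byte locations), and with 16 bits it is 0x0000 to 0xFFFF.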
Processor Architectural Features
18
Address Space
¨ The address space is the range of addresses that the processor can access, determined by the number of address bits used in the processor architecture.
¨ Some processor families (e.g. ARM) use only one address space for both memory and I/O devices
¤ i.e. everything is mapped into the same address space
[Figure: processor with a memory map from 0x00000000 to 0xFFFFFFFF holding code, data and I/O registers in one address space; a separate I/O address space runs from 0x0000 to 0xFFFF]
Processor Architectural Features
19
Memory mapped vs I/O mapped
¨ Some processor families have two address spaces.
¨ E.g., for the x86 processor, memory and I/O devices can be mapped into two different address spaces:
¤ Memory address space and I/O address space
[Figure: processor with code and data in the memory address space, and I/O registers either in the same space or in a separate I/O address space]
Processor Architectural Features
20
Memory System Architectures
¨ Two types of information are found in typical program code:
¤ Instruction codes for execution
¤ Data used by the instruction codes
¨ Two classes of memory system design store this information:
¤ Von Neumann architecture
¤ Harvard architecture
[Figure: Von Neumann: a single path (bus) for both code and data, which share one memory from 0000h to FFFFh. Harvard: separate buses for code and data, held in separate regions (0000h to 7FFFh and 8000h to FFFFh)]
Processor Architectural Features
21
Processor Size
¨ The processor size is described in terms of bits (e.g. an 8-bit or a 32-bit processor).
¤ It corresponds to the data size that the processor can manipulate at a time.
¤ It is typically reflected in the size of the processor's internal data path and register bank.
¨ Hence an 8-bit processor can only manipulate byte-size data at a time, while a 32-bit processor can handle 32-bit word-size data at a time.
• This holds even when the data content is only a single byte.
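The effect of the data path width can be illustrated with a small sketch that adds two unsigned values on a data path of a given width. The function name and the choice of an unsigned add with carry-out are assumptions made for illustration only.

```python
def add_n_bit(a: int, b: int, bits: int):
    """Add two unsigned values on a `bits`-wide data path.

    Returns (result, carry_out): the result is truncated to `bits` bits,
    and carry_out is True when the true sum did not fit.
    """
    mask = (1 << bits) - 1
    total = (a & mask) + (b & mask)
    return total & mask, total > mask


# The same addition on an 8-bit and on a 32-bit data path
print(add_n_bit(200, 100, 8))   # result wraps around, carry is set
print(add_n_bit(200, 100, 32))  # result fits in one operation
```

On the 8-bit path the sum 300 does not fit, so software must handle the carry with a second operation; the 32-bit path completes the addition in one step.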
Processor Architectural Features
22
Registers
¨ Registers are the most fundamental storage area in the processor. They are located close to the processor core and provide very fast access, operating at the processor clock, but are limited in number (typically fewer than 100).
¨ Most are general-purpose and can store any type of information:
¤ data, e.g. timer values, constants
¤ addresses, e.g. of an ASCII table or the stack
¨ Some are reserved for a specific purpose:
¤ program counter (PC; called the instruction pointer, IP, on x86).
¤ program status register (SR).
Processor Architectural Features
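[Figure: instruction cycle. The program counter (PC) steps through instructions I-1 to I-4 of the program in memory; each instruction is fetched into the instruction register (through an instruction queue), decoded, its operands (op1, op2) read from registers, executed in the ALU, and the result written to registers, flags or output]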
23
Data Organization in Memory
¨ A typical memory consists of storage locations that each store data of a fixed size (most commonly an 8-bit byte). Each location has a unique address.
¨ Depending on the data path size of the processor, the memory content is accessible as an 8-bit byte, a 16-bit halfword, a 32-bit word, or even a 64-bit doubleword.
¨ A 32-bit datum consists of four bytes, which are stored in four successive memory locations. Data and code must be aligned to the boundary of their access size.
¤ E.g. for a 32-bit system, align to the word boundary, with the lowest two address bits equal to zero
¨ But what is the order of the four bytes of data? It depends on the endianness adopted.
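The alignment rule on this slide (lowest two address bits equal to zero for a 32-bit word) can be checked with a bit mask. This is a minimal sketch; the function name and the example addresses are hypothetical.

```python
def is_aligned(address: int, size_bytes: int) -> bool:
    """True if `address` lies on a `size_bytes` boundary.

    `size_bytes` must be a power of two (1, 2, 4, 8, ...), so the test
    reduces to checking that the low address bits are all zero.
    """
    if size_bytes <= 0 or size_bytes & (size_bytes - 1):
        raise ValueError("size_bytes must be a power of two")
    return address & (size_bytes - 1) == 0


for addr in (0x20000000, 0x20000002, 0x20000004):
    print(hex(addr), "word-aligned:", is_aligned(addr, 4),
          "halfword-aligned:", is_aligned(addr, 2))
```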
Processor Architectural Features
24
Data Endianness
¨ In the little-endian format, the least significant byte (LSB) is stored at the lowest address, and the most significant byte (MSB) at the highest address.
¨ In the big-endian format, the least significant byte (LSB) is stored at the highest address, and the most significant byte (MSB) at the lowest address.
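The two byte orders can be seen by laying out the same 32-bit word both ways. The sketch below uses Python's built-in `int.to_bytes`; the helper name `store_word` and the sample value 0x12345678 are chosen for illustration only.

```python
def store_word(value: int, byteorder: str) -> bytes:
    """Return the four bytes of a 32-bit word in memory order (lowest address first)."""
    return (value & 0xFFFFFFFF).to_bytes(4, byteorder=byteorder)


word = 0x12345678
for order in ("little", "big"):
    layout = store_word(word, order)
    cells = ", ".join(f"+{offset}: 0x{byte:02X}" for offset, byte in enumerate(layout))
    print(f"{order:>6}-endian  {cells}")
```

In the little-endian layout the byte 0x78 sits at offset +0; in the big-endian layout the byte 0x12 does.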
[Figure: byte order of a word in the memory address space, starting at 0x000000, for big endian (MSB at the lowest address) and little endian (LSB at the lowest address)]
Processor Architectural Features
25
Top Boot and Bottom Boot
¨ Different processor families place the reset vector used for boot-up at different locations.
¨ Examples:
¤ x86 boots from the top of the memory space
¤ ARM boots from the bottom of its memory space
[Figure: memory maps from 00..00h to FF..FFh holding program and data. x86: reset vector at the top; ARM: reset vector at the bottom]
Processor Architectural Features
26
CISC – Complex Instruction Set Computer
Philosophy: hardware is always faster than software.
Objective: the instruction set should be as powerful as possible.
With a powerful instruction set, fewer instructions (and less memory) are needed to complete the same task than on RISC.
CISC was developed at a time (the early 1960s) when memory technology was not so advanced. Memory was small (measured in kilobytes) and expensive. For embedded systems, especially Internet appliances, memory efficiency comes into play again, especially for chip area and power.
¨ Many instructions
¨ Complex instructions
¤ Each instruction can execute several low-level operations
¨ Complex addressing modes
¤ Fewer registers are needed
¨ A semantically rich instruction set is accommodated by allowing instructions of variable length
Processor Architectural Features
27
RISC – Reduced Instruction Set Computer
By reducing the number of instructions that a processor supports, and thereby the complexity of the chip, it is possible to make individual instructions execute faster and achieve a net gain in performance, even though more instructions might be required to accomplish a task. RISC trades instruction set complexity for instruction execution time.
¨ Large register set: having more registers allows memory accesses to be minimized.
¨ Load/store architecture: operating on data in memory directly is one of the most expensive operations in terms of clock cycles.
¨ Fixed-length instruction encoding: this simplifies the instruction fetch and decode logic and allows easy implementation of pipelining.
¤ All instructions are in register-to-register format, except loads and stores, which access memory.
¤ All instructions execute in a single cycle, except branch instructions, which require two.
¤ Almost all instructions have a single size and the same format.
Processor Architectural Features
28
Limitations of CISC
¨ A highly encoded instruction set needs to be decoded by hardwired microcode circuitry.
¤ More complex hardware design
¤ Slower instruction decoding/execution
¨ Variable-length instructions
¤ Different execution times among instructions affect pipelined operation.
Advantages of CISC
¨ As each instruction can execute several low-level operations, the code size is reduced, which saves on memory requirements.
¤ Fewer main memory accesses are required, and hence execution is faster.
¨ Backward code compatibility is maintained.
¤ New (and more powerful) instructions can be added while retaining the 'old' instruction set for code compatibility (i.e. legacy programs can still run)
¨ Easy to program.
¤ Direct support for high-level language constructs.
¤ Complex instructions that fit well with high-level language expressions.
Processor Architectural Features
29
Limitations of RISC
¨ Fewer instructions than CISC:
¤ Compared to CISC, RISC needs more instructions to execute one task.
¤ Code density is lower.
¤ More memory is needed.
¨ No complex instructions:
¤ Often no hardware support for division or floating-point arithmetic operations.
¤ Needs a more complex compiler and a longer compile time.
¨ But ARM also adds DSP-like instructions to support commonly used signal processing functions.
Advantages of RISC
¨ Simpler instructions:
¤ One clock per instruction gives faster execution than on a CISC processor with the same clock speed
¨ Simpler addressing modes:
¤ Faster decoding
¨ Fixed-length instructions:
¤ Faster decoding and better pipeline performance
¨ Simpler hardware:
¤ Less silicon area
¤ Less power consumption
Processor Architectural Features
30
CISC | RISC
Any instruction may reference memory | Only load/store references memory
Many instructions & addressing modes | Few instructions & addressing modes
Variable instruction formats | Fixed instruction formats
Single register set | Multiple register sets
Multi-clock cycle instructions | Single-clock cycle instructions
Micro-program interprets instructions | Hardware (FSM) executes instructions
Complexity is in the micro-program | Complexity is in the compiler
Little to no pipelining | Highly pipelined
Program code size small | Program code size large
Processor Architectural Features
31
RISC vs CISC
¨ RISC machines: Sun SPARC, SGI MIPS, HP PA-RISC, ARM
¨ CISC machines: Intel 80x86, Motorola 680x0
¨ What really distinguishes RISC from CISC these days lies in the architecture and not in the instruction set.
¨ CISC occurs whenever there is a disparity in speed between CPU operations and memory accesses due to technology or cost.
¨ What about combining both ideas?
¤ The Intel 80x86 Pentium P6 architecture is externally CISC but internally RISC & CISC!
¤ Intel IA-64 executes many instructions in parallel.
Processor Architectural Features
 | CISC (Intel 486) | RISC (MIPS R4000)
# instructions | 235 | 94
Addressing modes | 11 | 1
Instruction size (bytes) | 1-12 | 4
GP registers | 8 | 32
32
Instruction Code Format
¨ Opcode encoding depends on the number of bits used.
¤ Example: for ARM, all instructions are 32 bits long, but only 8 bits (bits 20 to 27) are used to encode the instruction. Hence a total of 2^8 = 256 different instructions are possible.
¨ A typical instruction is encoded with a specific bit pattern that consists of the following:
¤ Opcode field specifying the operation to be performed.
¤ Operand identification (address) field(s), which depend on the addressing mode;
• this provides the address of the register or memory location(s) that store the operand(s), or the operand itself.
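Pulling a field out of a 32-bit instruction word is a shift followed by a mask. The sketch below assumes the 8-bit instruction field occupies bits 27 to 20, and uses 0xE0810002 (the ARM encoding of ADD r0, r1, r2) as a sample word; the helper name `extract_bits` is hypothetical.

```python
def extract_bits(word: int, high: int, low: int) -> int:
    """Return the bit field word[high:low], both bit positions inclusive."""
    if not 0 <= low <= high <= 31:
        raise ValueError("expected 0 <= low <= high <= 31")
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


instruction = 0xE0810002  # ADD r0, r1, r2 in the 32-bit ARM instruction set

print("condition  :", hex(extract_bits(instruction, 31, 28)))
print("8-bit field:", hex(extract_bits(instruction, 27, 20)))
print("Rn, Rd     :", extract_bits(instruction, 19, 16), extract_bits(instruction, 15, 12))
print("encodings available in an 8-bit field:", 1 << 8)
```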
Processor Architectural Features
33
Operand Addressing Types:
¨ Immediate addressing.
¤ Operand is given in the instruction.
¨ Register addressing.
¤ Operand is stored in a register.
¨ Direct addressing.
¤ Operand is stored in memory, with the address given in the instruction.
¨ Indirect (indexed) addressing.
¤ Operand is stored in memory, with the address given in a register (the address is added to an offset given in the instruction).
¨ Implied addressing
¤ Implicit location, such as the stack or the program counter.
Instruction Opcode Types:
¨ General categories of instruction operations:
¤ Data transfer, e.g. move, load, and store
¤ Data manipulation, e.g. add, subtract, logical operations
¤ Program control, e.g. branch, subroutine call
Processor Architectural Features
34
Instruction Execution
¨ Multiple stages are involved in executing an instruction. Example:
1) Fetching the instruction code: reads the instruction from memory.
2) Decoding the instruction code: determines which instruction is to be executed.
3) Executing the instruction code: performs the operations necessary to complete what the instruction is supposed to do. This may read data from memory, write data to memory or an I/O device, perform operations only within the CPU, or a combination of these.
¨ Hence multiple processor clock cycles are needed to execute a single instruction.
[Figure: timeline without pipelining. The 1st instruction is fetched, decoded and executed; only then is the 2nd instruction fetched, decoded and executed]
Processor Architectural Features
35
Instruction Pipeline
¨ A pipeline allows several different instructions to be in execution at the same time
¨ During normal operation:
• While one instruction is being executed,
• the next instruction is being decoded,
• and a third instruction is being fetched from memory.
• This allows the effective throughput to increase to one instruction per clock cycle.
Processor Architectural Features
36
Pipeline Architecture: a longer pipeline can be used to break the operation carried out in each stage down further. Simpler logic in each stage allows a higher system clock.
Example: a 5-stage instruction pipeline
1) Fetch Instruction
2) Decode Instruction
3) Fetch Operand
4) Execute Instruction
5) Store Result
Parallel execution of multiple instructions. Assume instructions are completely independent!
[Figure: five instructions (1st to 5th) overlapped in time, each starting one stage after the previous one]
Maximum speedup ≤ number of stages
Speedup ≈ (time for unpipelined operation) / (time for longest stage)
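The speedup bound can be checked with a simple cycle count. This is a sketch under idealized assumptions that are not stated on the slide: every stage takes exactly one clock cycle, instructions are completely independent, and there are no stalls. The function names are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction fills the pipeline; each later one completes one cycle after the previous."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts (requires at least one instruction)."""
    if n_instructions < 1:
        raise ValueError("need at least one instruction")
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


for n in (5, 100, 1000):
    print(n, "instructions on 5 stages:", round(speedup(n, 5), 2))
```

Under these assumptions the speedup approaches, but never exceeds, the number of stages as the instruction count grows.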
Processor Architectural Features
37
ARM Cortex-A15
Processor Architectural Features
38
39
¨ Numerical values are represented as binary fractions: -1.0 ≤ value < 1.0
¨ Why a fractional representation?
¤ Multiplying a fraction by a fraction always results in a fraction and will not produce an overflow (e.g., 0.99 x 0.9999 is less than 1). Successive additions may still cause overflow.
¤ A normalized representation is convenient. Signal processing is multiplication-intensive.
¤ Coefficients from digital filter designs are typically already in fractional form.
Data representation. Fixed-point vs Floating-point
Bit weights, starting at the sign bit (immediately left of the radix point): -2^0, 2^-1, 2^-2, 2^-3, …, 2^-(n-1)
40
¨ Fixed-point notation:
¤ The radix point is always in a fixed location (e.g., 0.74, 0.34, etc.).
¤ Fractional fixed-point notation prevents overflow in multiplication (useful with a small dynamic range).
¤ Fixed-point notation is less expensive.
¨ How is fixed-point notation realized in a DSP?
¤ Most fixed-point DSPs are 16 bits.
¤ The range of numbers that can be represented is -2^15 to 2^15 - 1.
¤ The most common fixed-point format is Q15.
Data representation. Fixed-point vs Floating-point
Q15 notation: bit 15 is the sign; bits 14 to 0 hold the fraction; the word is a two's complement number
41
Data representation. Fixed-point vs Floating-point
Dynamic range in Q15
Number | Biggest | Smallest
Fractional number | 0.999 | -1.000
Scaled integer for Q15 | 32767 | -32768
Number representations in Q15
Decimal | Q15 = Decimal x 2^15 | Q15 integer
0.5 | 0.5 x 32768 | 16384
0.05 | 0.05 x 32768 | 1638
0.0012 | 0.0012 x 32768 | 39
Rules for operations
Avoid operations with numbers larger than 1:
2.0 x (0.5 x 0.45) = (0.2 x 0.5 x 0.45) x 10
= (0.5 x 0.45) + (0.5 x 0.45)
Scale numbers before the operation:
0.5 in Q15 = 0.5 x 32768 = 16384
Addition
Decimal | Q15 | Scale back (Q15 / 32768)
0.5 + 0.05 = 0.55 | 16384 + 1638 = 18022 | 0.55
0.5 – 0.05 = 0.45 | 16384 – 1638 = 14746 | 0.45
Multiplication: 2 x 0.5 x 0.45
Decimal | Q15 | Back to Q15 (product / 32768) | Scale back (Q15 / 32768)
0.5 x 0.45 = 0.225 | 16384 x 14745 = 241582080 | 7373 | 0.225
0.225 + 0.225 = 0.45 | 7373 + 7373 = 14746 | 14746 | 0.45
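The Q15 arithmetic in these tables can be reproduced with integer operations. The sketch below assumes scaling by 2^15 = 32768, truncation when converting a decimal to Q15, saturation at the ends of the range, and rounding when a 30-bit product is scaled back to Q15; these rounding choices and all function names are assumptions, not something the slides specify.

```python
Q15_SCALE = 1 << 15        # 32768
Q15_MAX = Q15_SCALE - 1    # 32767, just under +1.0
Q15_MIN = -Q15_SCALE       # -32768, exactly -1.0


def saturate(value: int) -> int:
    """Clamp an integer to the 16-bit Q15 range."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x: float) -> int:
    """Convert a fraction in [-1.0, 1.0) to a Q15 integer (truncating, saturating)."""
    return saturate(int(x * Q15_SCALE))


def from_q15(q: int) -> float:
    """Scale a Q15 integer back to a decimal fraction."""
    return q / Q15_SCALE


def q15_add(a: int, b: int) -> int:
    """Add two Q15 values; the sum may overflow, so it is saturated."""
    return saturate(a + b)


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values and scale the product back to Q15 with rounding."""
    return saturate((a * b + (1 << 14)) >> 15)


half, p45 = to_q15(0.5), to_q15(0.45)
product = q15_mul(half, p45)
doubled = q15_add(product, product)   # 2 x 0.5 x 0.45 without using a factor larger than 1
print(half, p45, product, doubled, round(from_q15(doubled), 3))
```

Computing 2 x (0.5 x 0.45) as a sum of two products keeps every operand inside the Q15 range, which is the rule the slide illustrates.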
42
¨ Floating-point notation:
Data representation. Fixed-point vs Floating-point
Single-precision floating-point format
Bit no. | 31 ... 24 | 23 | 22 ... 0
Field | e | s | f
Width | 8 bits | 1 bit | 23 bits
e = exponent, a signed two's complement 8-bit field that determines the location of the binary point
s = sign of the mantissa (s = 0 positive, s = 1 negative)
f = fractional part of the mantissa; an implied 1.0 is added to this fraction but is not allocated in the bit field, since this value is always present
Exponent (e)
Hex two's comp. | 00 | 01 | 7F | FF | 80
Decimal | 0 | 1 | 127 | -1 | -128
Conversion equations
Case | Binary | Decimal | Equation
s = 0 | X = 01.f x 2^e | X = 01.f x 2^e | 1
s = 1 | X = 10.f x 2^e | X = ( -2 + 0.f ) x 2^e | 2
Special case
s = 0, e = -128 | X = 0
43
¨ Floating-point numbers:
Data representation. Fixed-point vs Floating-point
Calculate 1.0e0
In hex: 00 00 00 00
In binary (e | s | f): 00000000 | 0 | 00000000000000000000000
e = 0, s = 0, f = 0
Equation 1 applies: X = 01.f x 2^e = 01.0 x 2^0 = 1.0
Calculate 1.5e01
In hex: 03 70 00 00
In binary (e | s | f): 00000011 | 0 | 11100000000000000000000
e = 3, s = 0, f = 0.5 + 0.25 + 0.125 = 0.875
Equation 1 applies: X = 01.f x 2^e = 01.875 x 2^3 = 15.0 decimal
...
Calculate -2.0e0
In hex: 00 80 00 00
In binary (e | s | f): 00000000 | 1 | 00000000000000000000000
e = 0, s = 1, f = 0
Equation 2 applies: X = ( -2.0 + 0.f ) x 2^e = ( -2.0 + 0.0 ) x 2^0 = -2.0
Addition: 1.5 + (-2.0) = -0.5
Multiplication: 1.5e00 x 1.5e01 = 2.25e01 = 22.5
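The conversion equations can be applied mechanically to a 32-bit word. The sketch below decodes the format described on these slides (8-bit two's complement exponent in bits 31 to 24, sign in bit 23, 23-bit fraction in bits 22 to 0), which is not the IEEE 754 layout; the function name `decode_float` is hypothetical.

```python
def decode_float(word: int) -> float:
    """Decode a 32-bit word in the e | s | f format used on these slides."""
    e = (word >> 24) & 0xFF
    if e >= 0x80:                 # 8-bit two's complement exponent
        e -= 0x100
    s = (word >> 23) & 1          # sign of the mantissa
    f = (word & 0x7FFFFF) / (1 << 23)   # fraction 0.f in [0, 1)

    if e == -128:                 # special case: zero
        return 0.0
    if s == 0:
        mantissa = 1.0 + f        # equation 1: X = 01.f x 2^e
    else:
        mantissa = -2.0 + f       # equation 2: X = (-2 + 0.f) x 2^e
    return mantissa * 2.0 ** e


for word in (0x00000000, 0x03700000, 0x00800000, 0x80000000):
    print(f"0x{word:08X} -> {decode_float(word)}")
```

The three worked examples (1.0, 15.0 and -2.0) and the zero special case all follow from the same two equations.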
44
¨ Dynamic range
¤ Ranges of number systems
Data representation. Fixed-point vs Floating-point
Numbers | Base 2 | Decimal | Two's complement hex
Largest integer | 2^31 - 1 | 2 147 483 647 | 7F FF FF FF
Smallest integer | -2^31 | -2 147 483 648 | 80 00 00 00
Largest Q15 | 2^15 - 1 | 32 767 | 7F FF
Smallest Q15 | -2^15 | -32 768 | 80 00
Largest floating point | ( 2 - 2^-23 ) x 2^127 | 3.402823 x 10^38 | 7F 7F FF FF
Smallest floating point | -2 x 2^127 | -3.402823 x 10^38 | 7F 80 00 00
¤ The dynamic range of the floating-point representation is very large
¤ Conclusion
• Largest integer x (1.5 x 10^29) ≈ largest floating point
• Largest Q15 x (1.03 x 10^34) ≈ largest floating point
45
¨ DSP devices are designed as floating point or fixed point.
¨ Floating-point devices usually have a full set of fixed-point instructions.
¨ Floating-point devices are easier to program.
¨ Fixed-point devices can emulate floating point in software.
Data representation. Fixed-point vs Floating-point
Characteristic | Floating point | Fixed point
Dynamic range | much larger | smaller
Resolution | comparable | comparable
Speed | comparable | comparable
Ease of programming | much easier | more difficult
Compiler efficiency | more efficient | less efficient
Power consumption | comparable | comparable
Chip cost | comparable | comparable
System cost | comparable | comparable
Design cost | less | more
Time to market | faster | slower
46
¨ Applications that require:
¤ High precision.
¤ Wide dynamic range.
¤ High signal-to-noise ratio.
¤ Ease of use.
need a floating-point processor.
¨ Drawbacks of floating-point processors:
¤ Higher power consumption.
¤ Can be more expensive.
¤ Can be slower than fixed-point counterparts and larger in size.
DSP Data representation. Fixed-point vs Floating-point
47
Interrupt, Exceptions, Watch-Dog, … 48
¨ Exceptions:
¤ Exception handling is a combination of hardware behaviors and software constructs designed to manage a unique condition.
• Related to the current program flow.
• Result of unexpected error conditions (such as a bus error).
• Result of illegal operations (guarded memory access).
• Some exceptions can be programmed to occur (FIT, PIT).
• A software routine could not execute properly (divide by 0).
¤ Exception handling changes the normal flow of software execution.
Interrupt, Exceptions, Watch-Dog, … 49
¨ Interrupts:
¤ A hardware interrupt is an asynchronous signal from hardware, originating either outside the SoC or from the programmable logic within the SoC, indicating a peripheral's need for attention.
• Embedded processor peripheral (FIT, PIT, for example).
• External bus peripheral (UART, EMAC, for example).
• External interrupts enter via hardware pin(s).
• Multiple hardware interrupts can use the general interrupt controller of the PS.
¤ A software interrupt is a synchronous event in software, often referred to as an exception, indicating the need for a change in execution. Examples:
• Divide by zero.
• Illegal instruction.
• User-generated software interrupt.
Interrupt, Exceptions, Watch-Dog, … 50
¨ Cortex-A9 Modes and Registers:
Cortex-A9 has seven execution modes
¤ Five are exception modes.
¤ Each mode has its own stack space and a different subset of registers.
¤ System mode uses the user mode registers.
Cortex-A9 has 37 registers
¤ Up to 18 are visible at any one time.
¤ Execution modes have some private registers that are banked in when the mode is changed.
¤ Non-banked registers are shared between modes.
Interrupt, Exceptions, Watch-Dog, … 51
¨ Cortex-A9 Exceptions:
In the Cortex-A9 processor, interrupts are handled as exceptions
¤ Each Cortex-A9 processor core accepts two different levels of interrupts.
• nFIQ interrupts from secure sources (serviced first).
• nIRQ interrupts from either secure or non-secure sources.
Interrupt, Exceptions, Watch-Dog, … 52
Interrupt Servicing in Cortex-A9:
¨ When an interrupt is received, the currently executing instruction completes.
¨ Save processor status
¤ Copies CPSR into SPSR_irq.
¤ Stores the return address in LR_irq.
¨ Change processor status for the exception
¤ Mode field bits.
¤ ARM or Thumb (T2) state.
¤ Interrupt disable bits (if appropriate).
¤ Sets PC to the vector address (either FIQ or IRQ).
¨ The above steps are performed automatically by the core
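The automatic IRQ entry sequence can be modelled as a small state update. This is a simplified sketch, not the core's actual behavior in every configuration: it assumes the handler runs in ARM state, the low exception vector base (IRQ vector at 0x18), and a return address of "next instruction + 4". The CPSR bit positions (mode in bits 4 to 0, T in bit 5, I in bit 7) and the mode numbers follow the ARM architecture; the dictionary layout and function name are hypothetical.

```python
MODE_MASK = 0x1F
MODE_USR = 0x10      # User mode
MODE_IRQ = 0x12      # IRQ mode
T_BIT = 1 << 5       # Thumb state bit
I_BIT = 1 << 7       # IRQ disable bit
IRQ_VECTOR = 0x18    # IRQ entry in the (low) exception vector table


def take_irq(cpu: dict) -> dict:
    """Return the CPU state after IRQ entry, given 'cpsr' and 'next_pc' in `cpu`."""
    state = dict(cpu)
    # 1. Save processor status and return address in the banked IRQ registers
    state["spsr_irq"] = cpu["cpsr"]
    state["lr_irq"] = cpu["next_pc"] + 4
    # 2. Change processor status: IRQ mode, ARM state, further IRQs disabled
    cpsr = (cpu["cpsr"] & ~MODE_MASK) | MODE_IRQ
    cpsr &= ~T_BIT
    cpsr |= I_BIT
    state["cpsr"] = cpsr
    # 3. Branch to the IRQ vector
    state["pc"] = IRQ_VECTOR
    return state


after = take_irq({"cpsr": MODE_USR, "next_pc": 0x8000})
print({name: hex(value) for name, value in after.items()})
```

The handler later returns by restoring CPSR from SPSR_irq and the PC from LR_irq, which is why both are saved before the mode changes.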
Interrupt, Exceptions, Watch-Dog, … 53
General Interrupt Controller (GIC)
¨ Supports interrupt prioritization
¨ Handles up to 16 software-generated interrupts (SGI)
¨ Supports 64 shared peripheral interrupts (SPI) starting at ID 32
¨ Processes both level-sensitive interrupts and edge-sensitive interrupts
¤ Five private peripheral interrupts (PPI) dedicated for each.
¤ The global timer, private watchdog timer, private timer, and FIQ/IRQ from the PL.
54
55
A microcontroller combines on the same microchip:
¨ The CPU core
¨ Memory (both ROM and RAM)
¨ I/O: parallel, serial, analog, digital
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
56
ARM Ltd
¨ Founded in November 1990
¤ Spun out of Acorn Computers
¤ Initial funding from Apple, Acorn and VLSI
¨ Designs the ARM range of RISC processor cores
¤ Licenses ARM core designs to semiconductor partners, who fabricate and sell to their customers
¤ ARM does not fabricate silicon itself
¨ Also develops technologies to assist with the design-in of the ARM architecture
¤ Software tools, boards, debug hardware
¤ Application software
¤ Bus architectures
¤ Peripherals, etc.
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
Equipment adopting 32-bit ARM microcontrollers: energy-efficient appliances, IR fire detectors, intelligent vending, tele-parking, utility meters, exercise machines, intelligent toys
57
Cortex Family
¨ ARM Cortex-A family (v7-A):
¤ Applications processors for full OS and 3rd party applications
¨ ARM Cortex-R family (v7-R):
¤ Embedded processors for real-time signal processing and control applications
¨ ARM Cortex-M family (v7-M):
¤ Microcontroller-oriented processors for MCU and SoC applications
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
[Figure: Cortex processor range, from 12k gates up to 2.5 GHz: Cortex-M0, Cortex-M1, Cortex-M3, Cortex-M4, SC300, Cortex-R4, Cortex-A5 (x1-4), Cortex-A8, Cortex-A9 (x1-4), Cortex-A15 (x1-4)]
58
ARM Cortex Family
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
Cortex-A8
§ Architecture v7A
§ MMU
§ AXI
§ VFP & NEON support
Cortex-R4
§ Architecture v7R
§ MPU (optional)
§ AXI
§ Dual Issue
Cortex-M3
§ Architecture v7M
§ MPU (optional)
§ AHB Lite & APB
59
Relative Performance
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
Processor | Max freq (MHz) | Min power (mW/MHz)
Cortex-M0 | 50 | 0.012
Cortex-M3 | 150 | 0.06
ARM7 | 184 | 0.35
ARM926 | 470 | 0.235
ARM1026 | 540 | 0.36
ARM1136 | 610 | 0.335
ARM1176 | 750 | 0.568
Cortex-A8 | 1100 | 0.43
Cortex-A9 Dual-core | 2000 | 0.5
[Figure: bar chart of maximum frequency (MHz) per processor, axis from 0 to 2500]
60
ARM architecture
¨ Load/store architecture
¨ A large array of uniform registers
¨ Fixed-length 32-bit instructions
¨ 3-address instructions
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
61
Data Sizes and Instruction Sets
¨ The ARM is a 32-bit architecture.
¨ When used in relation to the ARM:
¤ Byte means 8 bits
¤ Halfword means 16 bits (two bytes)
¤ Word means 32 bits (four bytes)
¨ Most ARM cores implement two instruction sets
¤ 32-bit ARM instruction set
¤ 16-bit Thumb instruction set
¨ Jazelle cores can also execute Java bytecode
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
[Figure: ARM and Thumb performance in Dhrystone 2.1/sec at 20 MHz (axis 0 to 30000) for three memory widths (zero wait state): 32-bit, 16-bit, and 16-bit with 32-bit stack]
[Figure: data sizes within a 32-bit word: 8-bit byte (bits 7 to 0), 16-bit halfword (bits 15 to 0), 32-bit word (bits 31 to 0)]
62
The Thumb-2 instruction set
¨ Variable-length instructions
¤ ARM instructions are a fixed length of 32 bits.
¤ Thumb instructions are a fixed length of 16 bits.
¤ Thumb-2 instructions can be either 16-bit or 32-bit.
¨ Thumb-2 gives approximately 26% improvement in code density over ARM
¨ Thumb-2 gives approximately 25% improvement in performance over Thumb.
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
63
Cortex-M Programmer's Model
¨ Fully programmable in C
¨ Stack-based exception model
¨ Only two processor modes
¤ Thread Mode for user tasks
¤ Handler Mode for OS tasks and exceptions
¨ Vector table contains addresses
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
[Figure: Cortex-M register set, 32 bits wide: r0 to r7, r8 to r12, sp (banked as Main and Process), lr, r15 (pc), and xPSR]
Address space: 32 bits. Endianness.
64
Address Space
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
65
Cortex-M3 Processor Privilege
32-bit microcontroller ARM Cortex-M3. Architecture. Programmers Model.
[Figure: ARM Cortex-M3 privilege model. Application code runs non-privileged (User) in Thread Mode; the OS runs privileged (Supervisor) in Handler Mode. System calls (SVCall), undefined instructions, memory aborts (instructions and data), interrupts and reset lead to the privileged side]
66
Cortex-M3 Interrupt Handling
¨ One Non-Maskable Interrupt (INTNMI) supported
¨ 1-240 prioritizable interrupts supported
  ¤ Interrupts can be masked
  ¤ Implementation option selects number of interrupts supported
¨ Nested Vectored Interrupt Controller (NVIC) is tightly coupled with processor core
¨ Interrupt inputs are active HIGH
[Figure: INTNMI and INTISR[239:0] (1-240 interrupts) feed the NVIC, which sits next to the Cortex-M3 Processor Core inside the Cortex-M3]
Cortex-M3 Exception Handling
  ¤ Reset: power-on or system reset
  ¤ NMI: cannot be stopped or preempted by any exception other than reset
  ¤ Faults
    ▪ Hard Fault: default fault, or any fault whose own handler cannot be activated
    ▪ Memory Manage: MPU violations
    ▪ Bus Fault: prefetch and memory access violations
    ▪ Usage Fault: undefined instructions, divide by zero, etc.
  ¤ SVCall: privileged OS requests
  ¤ Debug Monitor: debug monitor program
  ¤ PendSV: pendable request for system service, typically used by an OS for context switching
  ¤ SysTick Interrupt: internal system timer, e.g., used by an RTOS to periodically check resources or peripherals
  ¤ External Interrupt: e.g., external peripherals
Cortex-M3 Program Status Register
¨ One Status Register consisting of
  ¤ APSR - Application Program Status Register: ALU flags
  ¤ IPSR - Interrupt Program Status Register: Interrupt/Exception No.
  ¤ EPSR - Execution Program Status Register
    ▪ IT field: If/Then block information
    ▪ ICI field: Interruptible-Continuable Instruction information
¨ xPSR
  ¤ Composite of the 3 PSRs
  ¤ Stored on the stack on exception entry
[Figure: xPSR bit layout: N, Z, C, V, Q flags in bits 31:27; IT/ICI in bits 26:25 and 15:10; T in bit 24; ISR Number in the low-order bits]
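The composite xPSR can be picked apart with shifts and masks. A minimal Python sketch, assuming the ARMv7-M layout (flags in bits 31:27, T in bit 24, exception number in bits 8:0); `decode_xpsr` is a hypothetical helper name, and the IT/ICI bits are left out for brevity:

```python
def decode_xpsr(xpsr):
    """Split a 32-bit xPSR value into flag bits, the T bit and the exception number."""
    return {
        "N": (xpsr >> 31) & 1,       # APSR: negative
        "Z": (xpsr >> 30) & 1,       # APSR: zero
        "C": (xpsr >> 29) & 1,       # APSR: carry
        "V": (xpsr >> 28) & 1,       # APSR: overflow
        "Q": (xpsr >> 27) & 1,       # APSR: sticky saturation
        "T": (xpsr >> 24) & 1,       # EPSR: Thumb state bit
        "exception": xpsr & 0x1FF,   # IPSR: exception number (assumed bits 8:0)
    }
```

For example, a stacked value of 0x6100000F would decode as Z and C set, Thumb state, exception number 15.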
Conditional Execution
¨ If-Then (IT) instruction added (16 bit)
  ¤ Up to 3 additional “then” or “else” conditions may be specified (T or E)
  ¤ Makes up to 4 following instructions conditional
¨ Any normal ARM condition code can be used
¨ 16-bit instructions in block do not affect condition code flags
  ¤ Apart from comparison instructions
  ¤ 32-bit instructions may affect flags (normal rules apply)
¨ Current “if-then status” is stored in the EPSR (the IT field of xPSR)
  ¤ Conditional block may be safely interrupted and returned to
  ¤ Must NOT branch into or out of an ‘if-then’ block
Example: ITTET EQ makes the next four instructions conditional
  Inst 1  MOVEQ
  Inst 2  ADDEQ
  Inst 3  SUBNE
  Inst 4  ORREQ
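How the T and E letters map to per-instruction conditions can be sketched in Python. This is only an illustration of the naming rule (the first instruction uses the base condition, T repeats it, E uses its opposite); `it_conditions` and `OPPOSITE` are hypothetical names, not an assembler API:

```python
# Each ARM condition code paired with its logical opposite.
OPPOSITE = {
    "EQ": "NE", "NE": "EQ", "CS": "CC", "CC": "CS",
    "MI": "PL", "PL": "MI", "VS": "VC", "VC": "VS",
    "HI": "LS", "LS": "HI", "GE": "LT", "LT": "GE",
    "GT": "LE", "LE": "GT",
}

def it_conditions(mnemonic, cond):
    """Return the condition attached to each instruction of an IT block."""
    if not mnemonic.startswith("IT") or len(mnemonic) > 5:
        raise ValueError("expected IT followed by up to three T/E letters")
    conditions = [cond]                      # first instruction: base condition
    for letter in mnemonic[2:]:
        if letter == "T":
            conditions.append(cond)          # "then": same condition
        elif letter == "E":
            conditions.append(OPPOSITE[cond])  # "else": inverted condition
        else:
            raise ValueError("only T and E may follow IT")
    return conditions
```

`it_conditions("ITTET", "EQ")` yields EQ, EQ, NE, EQ, which lines up with MOVEQ, ADDEQ, SUBNE, ORREQ on the slide.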
Classes of Instructions (v4T)
¨ Load/Store: LDR, STR, ADR
¨ Miscellaneous: SWI, SWP
¨ Data Operations: ADD, MUL, LSL, AND, CMP
¨ Change of Flow: MOV PC, Rm; Bcc; BL; BLX
Data Processing Instructions
¨ Consist of:
  ¤ Arithmetic: ADD ADC SUB SBC RSB RSC
  ¤ Logical: AND ORR EOR BIC
  ¤ Comparisons: CMP CMN TST TEQ
  ¤ Data movement: MOV MVN
¨ These instructions only work on registers, NOT memory.
¨ Syntax: <Operation>{<cond>}{S} Rd, Rn, Operand2
    ▪ Comparisons set flags only - they do not specify Rd
    ▪ Data movement does not specify Rn
    ▪ Second operand is sent to the ALU via barrel shifter.
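What the optional S suffix does to the ALU flags can be modelled in a few lines. A minimal Python sketch of a 32-bit ADDS, assuming the usual definitions of N, Z, C and V; the function name `adds` is hypothetical and the barrel shifter is not modelled:

```python
MASK32 = 0xFFFFFFFF

def adds(rn, operand2):
    """Model a 32-bit ADDS: return (result, flags) for Rd = Rn + Operand2."""
    rn &= MASK32
    operand2 &= MASK32
    full = rn + operand2                 # unbounded sum, used to detect carry
    result = full & MASK32               # value written to Rd
    flags = {
        "N": result >> 31,                            # sign bit of the result
        "Z": int(result == 0),                        # result is zero
        "C": int(full > MASK32),                      # unsigned carry out
        # Signed overflow: both operands differ in sign from the result.
        "V": ((rn ^ result) & (operand2 ^ result)) >> 31,
    }
    return result, flags
```

Adding 1 to 0x7FFFFFFF sets N and V (signed overflow), while adding 1 to 0xFFFFFFFF sets Z and C (unsigned wrap-around).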
Using a Barrel-shifter: The 2nd Operand
Register, optionally with shift operation
• Shift value can be either:
  • a 5-bit unsigned integer
  • specified in the bottom byte of another register
• Used for multiplication by a constant
Immediate value
• 8-bit number, with a range of 0-255
• Rotated right through an even number of positions
• Allows an increased range of 32-bit constants to be loaded directly into registers
[Figure: Operand 1 feeds the ALU directly; Operand 2 passes through the Barrel Shifter before the ALU; the ALU produces the Result]
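Whether a given constant fits the "8-bit value rotated right by an even amount" rule can be checked by brute force over the 16 even rotations. A minimal Python sketch of that rule as stated on the slide (the classic ARM-state immediate; Thumb-2 uses a different immediate scheme); `ror32` and `encode_arm_immediate` are hypothetical helper names:

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def encode_arm_immediate(constant):
    """Return (imm8, rotation) with ror32(imm8, rotation) == constant, or None."""
    constant &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):          # only even rotations are allowed
        # Undo the rotate-right by rotating right by the complementary amount.
        imm8 = ror32(constant, (32 - rotation) % 32)
        if imm8 <= 0xFF:
            return imm8, rotation
    return None                               # not expressible as an immediate
```

0xFF000000 is encodable (0xFF rotated right by 8), whereas 0x101 is not, because its two set bits cannot fit inside one 8-bit window.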
Single Register Data Transfer
  LDR    STR    Word
  LDRB   STRB   Byte
  LDRH   STRH   Halfword
  LDRSB         Signed byte load
  LDRSH         Signed halfword load
¨ Memory system must support all access sizes
¨ Syntax:
  ¤ LDR{<cond>}{<size>} Rd, <address>
  ¤ STR{<cond>}{<size>} Rd, <address>
  e.g. LDREQB
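The difference between the plain and the signed loads is zero extension versus sign extension into the 32-bit destination register. A minimal Python sketch, assuming little-endian memory; `load` is a hypothetical helper that stands in for LDRB/LDRH/LDR (signed=False) and LDRSB/LDRSH (signed=True):

```python
def load(memory, address, size, signed=False):
    """Read `size` bytes at `address` and return the 32-bit register value."""
    raw = memory[address:address + size]
    value = int.from_bytes(raw, "little", signed=signed)
    return value & 0xFFFFFFFF    # negative values become sign-extended bit patterns

memory = bytes([0x80, 0xFE, 0xFF, 0x00])
ldrb = load(memory, 0, 1)                  # zero-extended byte
ldrsb = load(memory, 0, 1, signed=True)    # sign-extended byte
ldrsh = load(memory, 1, 2, signed=True)    # sign-extended halfword
```

Here the byte 0x80 loads as 0x00000080 with LDRB but as 0xFFFFFF80 with LDRSB.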
Cortex-M3 Datapath
[Figure: Cortex-M3 datapath. Blocks: Instruction Decode, Register Bank (ports A and B), Mul/Div, Barrel Shifter, ALU, Writeback, Read Data Register, Write Data Register, and an Address Register with Address Incrementer on both the instruction side (I_HADDR, I_HRDATA) and the data side (D_HADDR, D_HWDATA, D_HRDATA); INTADDR is also shown]
Cortex-M3 Pipeline
¨ Cortex-M3 has 3-stage fetch-decode-execute pipeline
  ¤ Similar to ARM7
  ¤ Cortex-M3 does more in each stage to increase overall performance
[Figure: 3-stage pipeline. 1st Stage - Fetch: Fetch (Prefetch). 2nd Stage - Decode: Instruction Decode & Register Read, AGU, Branch (branch forwarding & speculation). 3rd Stage - Execute: Address Phase & Write Back, Data Phase Load/Store & Branch, Multiply & Divide, Shift, ALU & Branch, Write (execute-stage branch: ALU branch & load/store branch)]
SW Development
Source code:
  Start   ; direction register
          LDR R1,=GPIO_PORTD_DIR_R
          LDR R0,[R1]
          ORR R0,R0,#0x0F    ; make PD3-0 output
          STR R0,[R1]
Object code (Address  Data):
  0x00000142  4912
  0x00000144  6808
  0x00000146  F040000F
  0x0000014A  6008
[Figure: development flow in the Keil uVision editor: Build Target (F7) turns source code into object code; Start Debug Session downloads it either to a simulated microcontroller or to a real microcontroller (each with Processor, Memory, I/O)]
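The 16-bit halfwords in the object-code listing can be checked by hand against the source. A minimal Python sketch, assuming the standard 16-bit Thumb encodings for LDR (literal) and LDR/STR (immediate offset, word); `decode_thumb16` is a hypothetical helper, not part of any toolchain, and it does not attempt the 32-bit ORR encoding (F040000F):

```python
def decode_thumb16(halfword):
    """Decode a few 16-bit Thumb load/store encodings; return None otherwise."""
    if halfword >> 11 == 0b01001:
        # LDR Rt, [PC, #imm8*4]  (literal load, as generated for LDR Rt,=label)
        rt = (halfword >> 8) & 0x7
        offset = (halfword & 0xFF) * 4
        return f"LDR R{rt}, [PC, #{offset}]"
    if halfword >> 12 == 0b0110:
        # STR/LDR Rt, [Rn, #imm5*4]  (bit 11 selects load vs store)
        op = "LDR" if halfword & (1 << 11) else "STR"
        offset = ((halfword >> 6) & 0x1F) * 4
        rn = (halfword >> 3) & 0x7
        rt = halfword & 0x7
        return f"{op} R{rt}, [R{rn}, #{offset}]"
    return None
```

Under these assumptions 0x4912 decodes to a PC-relative load into R1, 0x6808 to LDR R0,[R1] and 0x6008 to STR R0,[R1], which matches the listing.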
¨ GNU compiler and binutils
  ¤ gcc: GNU C compiler
  ¤ as: GNU assembler
  ¤ ld: GNU linker
  ¤ gdb: GNU project debugger
  ¤ COFF (Common Object File Format)
  ¤ ELF (Executable and Linkable Format)
  ¤ Segments in the object file
    ▪ Text: code
    ▪ Data: initialized global variables
    ▪ BSS: uninitialized global variables
[Figure: toolchain flow: C source (.c) → gcc → asm source (.s) → as → object file (.coff) → ld → executable (.elf) → Simulator, Debugger, …]
¨ Introduction ¨ Processor Architectural Features. Datapath &
pipeline. ¨ Data Representation: Fixed-point vs Floating-point ¨ Interrupts, Exceptions, Watch-Dog, … ¨ 32-bit microcontroller. ARM Cortex-M3
¤ ARM Cortex-M3 Architecture. Programmers Model.
¨ 32/64bit microprocessor. ¤ Intel x86, UltraSparc Architecture. Programmers Model
Intel x86 Processor Evolution:
Name         Date   Transistors   MHz
8086         1978   29K           5-10
    ▪ First 16-bit processor. Basis for IBM PC & DOS
    ▪ 1 MB address space
386          1985   275K          16-33
    ▪ First 32-bit processor, referred to as IA32
    ▪ Added “flat addressing”
    ▪ Capable of running Unix
    ▪ Until recently, 32-bit Linux/gcc used no instructions introduced in later models
Pentium 4F   2005   230M          2800-3800
    ▪ First 64-bit processor
    ▪ Meanwhile, Pentium 4s (Netburst arch.) phased out in favor of “Core” line
32/64bit microprocessor. Intel x86, UltraSparc
Intel x86 Processor Evolution: Machine Evolution
Name          Date   Transistors
486           1989   1.9M
Pentium       1993   3.1M
Pentium/MMX   1997   4.5M
PentiumPro    1995   6.5M
Pentium III   1999   8.2M
Pentium 4     2001   42M
Core 2 Duo    2006   291M
Added Features
    ▪ Instructions to support multimedia operations
      • Parallel operations on 1, 2, and 4-byte data, both integer & FP
    ▪ Instructions to enable more efficient conditional operations
Linux/GCC Evolution
    ▪ Very limited, needs to get better; trying to maintain compatibility
Intel x86 Processor Evolution:
Name                  Date   Transistors
Itanium               2001   10M
    ▪ First shot at 64-bit architecture: first called IA64
    ▪ Radically new instruction set designed for high performance
    ▪ Can run existing IA32 programs
      • On-board “x86 engine”
    ▪ Joint project with Hewlett-Packard - Boat Anchor
Itanium 2             2002   221M
    ▪ Big performance boost
Itanium 2 Dual-Core   2006   1.7B
  ¤ Itanium has not taken off in marketplace
    ▪ Lack of backward compatibility, no good compiler support, Pentium 4 too good.
¨ IA-32 architecture
  ¤ Lots of architecture improvements: pipelining, superscalar execution, branch prediction, hyperthreading and multi-core.
  ¤ From the programmer’s point of view, IA-32 has not changed substantially except for the introduction of a set of high-performance instructions
¨ Modes of operation
  ¤ Protected mode
    ▪ Native mode (Windows, Linux), full features, separate memory
  ¤ Real-address mode
    ▪ Native MS-DOS
  ¤ System management mode
    ▪ Power management, system security, diagnostics
  ¤ Virtual-8086 mode
    ▪ hybrid of Protected
    ▪ each program has its own 8086 computer
¨ Addressable Memory
  ¤ Protected mode
    ▪ 4 GB
    ▪ 32-bit address
  ¤ Real-address and Virtual-8086 modes
    ▪ 1 MB space
    ▪ 20-bit address
¨ General Purpose Registers
[Figure: IA-32 basic register set: 32-bit general-purpose registers EAX, EBX, ECX, EDX, EBP, ESP, ESI, EDI; 16-bit segment registers CS, SS, DS, ES, FS, GS; plus EIP and EFLAGS]
¨ Accessing parts of registers
  ¤ Use 8-bit name, 16-bit name, or 32-bit name
  ¤ Applies to EAX, EBX, ECX, and EDX
  ¤ The 16-bit registers are usually used only in real-address mode.
[Figure: EAX is 32 bits wide; its low 16 bits are AX, which splits into AH (high 8 bits) and AL (low 8 bits)]
¨ Floating-point, MMX, XMM registers
  ¤ Eight 80-bit floating-point data registers
    ▪ ST(0), ST(1), . . . , ST(7)
    ▪ arranged in a stack
    ▪ used for all floating-point arithmetic
  ¤ Eight 64-bit MMX registers
  ¤ Eight 128-bit XMM registers for single-instruction multiple-data (SIMD) operations
[Figure: FPU registers: 80-bit data registers ST(0)-ST(7); 48-bit pointer registers (FPU Instruction Pointer, FPU Data Pointer); 16-bit control registers (Tag Register, Control Register, Status Register); Opcode Register]
¨ Programmer’s Model
¨ IA-32 addressing Modes
¨ IA-32 Memory Management
  ¤ Real-address mode
    ▪ 1 MB RAM maximum addressable (20-bit address)
    ▪ Application programs can access any area of memory
    ▪ Single tasking
    ▪ Supported by MS-DOS operating system
[Figure: the 1 MB real-mode address space drawn as sixteen 64K segments with linear addresses 00000 to F0000; one segment spans 8000:0000 to 8000:FFFF, and the segment:offset pair 8000:0250 selects offset 0250 inside it]
Segmented memory addressing: the absolute (linear) address is formed by multiplying the 16-bit segment value by 16 (10h) and adding the 16-bit offset, e.g. 8000:0250 → 80250h
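The segment:offset arithmetic is a shift and an add. A minimal Python sketch, assuming 8086-style behaviour where the result wraps at 20 bits; `real_mode_linear` is a hypothetical helper name:

```python
def real_mode_linear(segment, offset):
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit linear address."""
    # The segment is shifted left by 4 bits (multiplied by 16) before the add.
    return ((segment << 4) + offset) & 0xFFFFF

start = real_mode_linear(0x8000, 0x0000)   # first byte of the segment
inside = real_mode_linear(0x8000, 0x0250)  # the 8000:0250 example
end = real_mode_linear(0x8000, 0xFFFF)     # last byte of the 64K segment
```

Because segments start every 16 bytes, many different segment:offset pairs name the same linear address.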
¨ IA-32 Memory Management
  ¤ Protected mode
    ▪ 4 GB addressable RAM (32-bit address)
    ▪ (00000000 to FFFFFFFFh)
    ▪ Each program assigned a memory partition which is protected from other programs
    ▪ Designed for multitasking
    ▪ Supported by Linux & MS-Windows
    ▪ Segment descriptor tables
    ▪ Program structure
      • code, data, and stack areas
      • CS, DS, SS segment descriptors
      • global descriptor table (GDT)
    ▪ MASM programs use the Microsoft flat memory model.
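[Figure: flat segmentation model and multi-segment model. In the multi-segment model a Local Descriptor Table holds base, limit, and access fields for each segment (bases 00003000, 00008000, 00026000 with limits 0002, 000A, 0010; the limit is multiplied by 1000h), and each descriptor points to a segment in RAM at 3000, 8000, or 26000]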
¨ IA-32 Memory Management
  ¤ Translating Addresses
    ▪ The IA-32 processor uses a one- or two-step process to convert a variable's logical address into a unique memory location.
    ▪ The first step combines a segment value with a variable’s offset to create a linear address.
    ▪ The second optional step, called page translation, converts a linear address to a physical address.
[Figure: a logical address consists of a Selector and an Offset; the selector picks a Segment Descriptor from the descriptor table (GDTR/LDTR contains the base address of the descriptor table); the descriptor's base is added (+) to the offset to give the linear address]
¨ IA-32 Memory Management
  ¤ Indexing into a Descriptor Table
    ▪ Each segment register holds a selector that indexes into the program's local descriptor table (LDT). Each table entry is mapped to a linear address.
[Figure: logical addresses 0018:0000003A (SS:ESP), 0010:000001B6 (DS:offset) and 0008:00002CD3 (IP) index the Local Descriptor Table, which is located through the LDTR register and has entries at indexes 00, 08, 10, 18 holding 00003000, 0001A000, 0002A000, 001A0000; these map into the linear address space (DRAM), part of which is unused]
¨ IA-32 Memory Management
  ¤ Paging
    ▪ Virtual memory uses disk as part of the memory, so the sum of all running programs can be larger than physical memory. Only part of a program must be kept in memory, while the remaining parts are kept on disk.
    ▪ The memory used by the program is divided into small units called pages (4096 bytes each).
    ▪ As the program runs, the processor selectively unloads inactive pages from memory and loads other pages that are immediately required.
    ▪ OS maintains page directory and page tables
    ▪ Page translation: CPU converts the linear address into a physical address
    ▪ Page fault: occurs when a needed page is not in memory, and the CPU interrupts the program
    ▪ Virtual memory manager (VMM): OS utility that manages the loading and unloading of pages
    ▪ OS copies the page into memory, program resumes execution
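[Figure: page translation. The 32-bit linear address is split into Directory (10 bits), Table (10 bits), and Offset (12 bits). CR3 locates the Page Directory; the directory entry selects a Page Table; the page-table entry selects a Page Frame; the offset within that frame gives the physical address]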
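The 10/10/12 split of the linear address and the two-level lookup can be sketched as a toy model. This Python sketch uses nested dictionaries in place of the real page directory and page tables and ignores entry flags; `split_linear`, `translate` and `PageFault` are hypothetical names:

```python
class PageFault(Exception):
    """Raised when the needed page is not present in the toy tables."""

def split_linear(linear):
    """Split a 32-bit linear address into (directory, table, offset)."""
    directory = (linear >> 22) & 0x3FF   # top 10 bits
    table = (linear >> 12) & 0x3FF       # middle 10 bits
    offset = linear & 0xFFF              # low 12 bits (4096-byte page)
    return directory, table, offset

def translate(linear, page_directory):
    """Walk directory -> table -> frame and add the offset."""
    directory, table, offset = split_linear(linear)
    try:
        frame_base = page_directory[directory][table]
    except KeyError:
        raise PageFault(hex(linear)) from None
    return frame_base + offset

# One mapped page: directory entry 1, table entry 3 -> frame at 0x0009A000.
toy_directory = {1: {3: 0x0009A000}}
physical = translate(0x00403004, toy_directory)
```

In a real system the OS would respond to the page fault by loading the page and retrying, as described in the paging bullets.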
¨ Interrupt Handling
¨ Processor generates interrupts that index into an Interrupt Descriptor Table (IDT), whose base is stored in IDTR and loaded using the privileged instruction LIDT.
¨ The descriptors in the IDT can be
  ¤ Interrupt gate: ISR handled like a normal subroutine call; uses the interrupted processor stack to save EIP, CS (and SS, ESP in case of a stack switch, with the new stack obtained from the TSS).
  ¤ Task gate: ISR handled as a task switch
    ▪ Needed for stack fault in CPL = 0 and double faults.
Intel® Core® Micro-architecture Blocks
[Figure: Intel Core micro-architecture block diagram. Fetch / Decode: 32 KB Instruction Cache, Next IP, Branch Target Buffer, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT). Execute: Reservation Stations (RS, 32 entry), Scheduler / Dispatch Ports, execution ports for Integer Arithmetic, Shift/Rotate, SIMD Integer Arithmetic, FP Add, FP Div/Mul, Load, Store Addr, Store Data; Memory Order Buffer (MOB); 32 KB Data Cache. Retire: Re-Order Buffer (ROB, 96 entry), IA Register Set. Bus Unit to L2 Cache]
Intel® Core® Micro-architecture Blocks
¨ Intel® Wide Dynamic Execution
  ¤ 14-stage efficient pipeline
    ▪ Wider decoding capacity
    ▪ Advanced branch prediction
    ▪ Wider execution path
  ¤ 64-Bit Support
    ▪ Merom, Conroe, and Woodcrest support EM64T
¨ Intel® Advanced Smart Cache
  ¤ Multi-core optimization
    ▪ Shared between the two cores
    ▪ Advanced Transfer Cache architecture
    ▪ Reduced bus traffic
    ▪ Both cores have full access to the entire cache
    ▪ Dynamic Cache sizing
  ¤ Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache
Execution Unit Overview
Execute 6 operations/cycle
• 3 Memory Operations
  • 1 Load
  • 1 Store Address
  • 1 Store Data
• 3 “Computational” Operations
Unified Reservation Station
• Schedules operations to Execution units
• Single Scheduler for all Execution Units
• Can be used by all integer, all FP, etc.
[Figure: the Unified Reservation Station dispatches to Ports 0-5. Memory ports: Load, Store Address, Store Data. Computational ports: Integer ALU & Shift, Integer ALU & LEA, Integer ALU & Shift, together with Branch, FP Add, FP Multiply, Complex Integer, Divide, SSE Integer ALU, Integer Shuffles, SSE Integer Multiply, FP Shuffle]
Intel® Core® Micro-architecture Blocks
¨ Instruction Decode
  ¤ Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation
  ¤ Micro-op fusion effectively widens the pipeline
Intel® Core® Micro-architecture Blocks
¨ Intel® Advanced Digital Media Boost
  ¤ Single Cycle SSE
    ▪ 8 Single Precision Flops/cycle
    ▪ 4 Double Precision Flops/cycle
  ¤ Wide Operations
    ▪ 128-bit packed Add
    ▪ 128-bit packed Multiply
    ▪ 128-bit packed Load
    ▪ 128-bit packed Store
  ¤ Support for Intel® EM64T instructions
[Figure: SSE Operation (SSE/SSE2/SSE3) on 128-bit operands (bits 0-127). SOURCE holds X4..X1, DEST holds Y4..Y1, and the result is X4opY4..X1opY1. The Core™ µarch completes the operation in clock cycle 1; the previous micro-architecture needs clock cycles 1 and 2]
Intel® Core® Micro-architecture Blocks
¨ Hyperthreading
  ¤ Ability of processor to run multiple threads
    ▪ Duplicate architecture state creates illusion to SW of Dual Processor (DP).
    ▪ Execution unit shared between two threads, but dedicated if one stalls.
  ¤ Almost two Logical Processors.
  ¤ Architecture state (registers) and APIC duplicated.
  ¤ Share execution units, caches, branch prediction, control logic and buses.
[Figure: two logical processors, each with its own Architecture State and Adv. Programmable Interrupt Control, sharing the Processor Execution Resource, On-Die Caches, and System Bus]
Intel® Core® Micro-architecture Blocks
¨ Power Efficient Support
  ¤ Advanced power gating & Dynamic power coordination
    ▪ Multi-point demand-based switching
    ▪ Voltage-Frequency switching separation
    ▪ Supports transitions to deeper sleep modes
    ▪ Event blocking
    ▪ Clock partitioning and recovery
    ▪ Dynamic Bus Parking
    ▪ During periods of high performance execution, many parts of the chip core can be shut off
[Figure: power control. A PCU, fed by BCLK and Vcc, coordinates four cores (each with its own Vcc, frequency, sensors, and PLL) and the Uncore/LLC with its own PLL]
X86-64 Architecture
¨ Full support for 64-bit integers
  ¤ All general-purpose registers are expanded from 32 bits to 64 bits
  ¤ All arithmetic and logical operations, memory-to-register, and register-to-memory operations are now directly supported for 64-bit integers
  ¤ Pushes and pops on the stack are always in eight-byte strides, and pointers are eight bytes wide
¨ Additional registers
  ¤ The number of named registers is increased from 8 (i.e. eax, ebx, ecx, edx, ebp, esp, esi, edi) to 16.
  ¤ Compilers can keep more local variables in registers rather than on the stack.
  ¤ Can use registers for frequently accessed constants.
  ¤ Arguments for small and fast subroutines may also be passed in registers to a greater extent.
X86-64 Architecture
¨ Larger virtual address space
  ¤ Current models can address up to 256 terabytes
  ¤ Expandable in the future to 16 exabytes
  ¤ Compared to just 4 gigabytes for 32-bit x86
¨ Larger physical address space
  ¤ Current models can address up to 1 terabyte
  ¤ Expandable in the future to 4 petabytes
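Each of the quoted sizes is a power of two. A minimal Python sketch of the arithmetic; the address widths of 48, 40 and 52 bits are not stated on the slide and are inferred here from the sizes, and `human` is a hypothetical formatting helper that uses binary units:

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]

def human(n_bytes):
    """Format an exact power-of-two byte count with a binary unit."""
    index = 0
    while n_bytes >= 1024 and index < len(UNITS) - 1:
        n_bytes //= 1024
        index += 1
    return f"{n_bytes} {UNITS[index]}"

sizes = {
    "32-bit x86": human(2 ** 32),
    "48-bit virtual (assumed width)": human(2 ** 48),
    "64-bit virtual": human(2 ** 64),
    "40-bit physical (assumed width)": human(2 ** 40),
    "52-bit physical (assumed width)": human(2 ** 52),
}
```

The slide's "terabytes", "petabytes" and "exabytes" correspond to the binary units TiB, PiB and EiB in this sketch.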
UltraSparc (RISC)
¨ Sun Microsystems (ORACLE)
¨ Sparc = Scalable Processor Architecture
¨ Open processor architecture
¨ SUN UltraSparc v9:
  ¤ RISC architecture, big-endian
  ¤ 64-bit address and data
  ¤ Memory Management Unit (MMU)
  ¤ Superscalar
  ¤ OpenSparc (open-source)
  ¤ LEON (soft-core, VHDL). Space rated.
Timeline
  1984  Begin developing Sparc
  1986  First Sparc Processor
  1992  SuperSparc
  1995  UltraSparc I
  1997  UltraSparc II
  2001  UltraSparc III
  2004  UltraSparc IV
  2005  UltraSparc IV+
  2005  UltraSparc T1
  2007  UltraSparc T2
  2010  Sparc T3
  2011  Sparc T4
  2013  Sparc T5
UltraSparc (RISC)
¨ Registers
  ¤ ~160 general-purpose registers
  ¤ Any procedure can access only 32 registers (r0~r31)
    ▪ First 8 registers (r0~r7) are global, i.e. they can be accessed by all procedures on the system (r0 always reads as zero)
    ▪ Other 24 registers can be visualized as a window through which part of the register file can be seen
  ¤ Program counter (PC)
    ▪ The address of the next instruction to be executed
  ¤ Condition code registers
  ¤ Other control registers
¨ Data Formats
  ¤ Integers are 8-, 16-, 32-, 64-bit binary numbers
  ¤ 2’s complement is used for negative values
  ¤ Support both big-endian and little-endian byte orderings
    ▪ (big-endian means the most significant part of a numeric value is stored at the lowest-numbered address)
  ¤ Three different floating-point data formats
    ▪ Single-precision, 32 bits long (23 + 8 + 1)
    ▪ Double-precision, 64 bits long (52 + 11 + 1)
    ▪ Quad-precision, 128 bits long (112 + 15 + 1)
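The byte-order definition and the 23 + 8 + 1 split of the single-precision format can both be inspected with Python's standard `struct` module. A minimal sketch, assuming the IEEE 754 layout (sign in bit 31, exponent in bits 30:23, fraction in bits 22:0); `single_fields` is a hypothetical helper name:

```python
import struct

def single_fields(x):
    """Return (sign, biased exponent, fraction) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

# Same 32-bit integer stored in the two byte orders.
big = struct.pack(">I", 0x12345678)      # most significant byte first
little = struct.pack("<I", 0x12345678)   # least significant byte first
```

For -2.5 (that is, -1.25 × 2^1) the fields are sign 1, exponent 128 and fraction 0x200000.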
UltraSparc (RISC)
¨ Addressing Modes
  ¤ Immediate mode
  ¤ Register direct mode
  ¤ Memory addressing
    Mode                                  Target address calculation
    PC-relative*                          TA = (PC) + displacement {30 bits, signed}
    Register indirect with displacement   TA = (register) + displacement {13 bits, signed}
    Register indirect indexed             TA = (register-1) + (register-2)
    *PC-relative is used only for branch instructions
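The two register-indirect calculations can be sketched directly from the table. This Python sketch assumes 64-bit registers and a two's-complement 13-bit displacement field; `sign_extend`, `register_indirect_disp` and `register_indirect_indexed` are hypothetical helper names, and the PC-relative case is omitted:

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):      # sign bit set: value is negative
        return value - (1 << bits)
    return value

def register_indirect_disp(register, disp13):
    """TA = (register) + displacement, with a signed 13-bit displacement."""
    return (register + sign_extend(disp13, 13)) & MASK64

def register_indirect_indexed(register_1, register_2):
    """TA = (register-1) + (register-2)."""
    return (register_1 + register_2) & MASK64
```

A displacement field of 0x1FFF is -1, so a base register of 0x1000 gives a target address of 0xFFF.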
¨ Instruction Set
  ¤ <150 instructions
  ¤ Pipelined execution
    ▪ While one instruction is being executed, the next one is fetched from memory and decoded
  ¤ Delayed branches
    ▪ The instruction immediately following the branch instruction is actually executed before the branch is taken
  ¤ Special-purpose instructions
    ▪ High-bandwidth block load and store operations
    ▪ Special “atomic” instructions to support multi-processor systems
¨ Input and Output
  ¤ A range of memory locations is logically replaced by device registers
  ¤ Each I/O device has a unique address, or set of addresses
  ¤ No special I/O instructions are needed
UltraSparc T2 (RISC)
¨ Multi-threaded (8), multi-core (8) CPU
¨ Frequency ranges from 900 MHz to 1.4 GHz
¨ Powered by less than 95 watts (nominal) with less than 2 watts per thread
¨ Integrated
  ¤ 10 Gb Ethernet networking
  ¤ PCI Express I/O expansion
  ¤ FPU and cryptographic processing units per core
¨ Codename Niagara2
¨ Member of SPARC family
¨ 2 previous multi-core processors
  ¤ UltraSPARC IV
  ¤ UltraSPARC IV+
¨ UltraSPARC T1 (first multi-core and multi-threaded)
  ¤ Released 14 November 2005
  ¤ 4, 6, or 8 cores with 4 threads each
¨ UltraSPARC T2 released 7 August 2007
  ¤ Now 8 threads per core (instead of 4)
UltraSparc T2 (RISC)
¨ 8 fully pipelined FPUs
¨ 8 SPUs
¨ 2 integer ALUs per core, each one shared by a group of four threads
¨ 4 MB L2 cache (8 banks, 16-way associative)
¨ 8 KB data cache and 16 KB instruction cache
¨ Two 10 Gb Ethernet ports and one PCIe port
UltraSparc T2 (RISC)
UltraSparc T2.Core Architecture
UltraSparc T2.Core Architecture
UltraSparc T2 Pipeline
¨ Eight-stage integer pipeline
  Fetch | Cache | Pick | Decode | Execute | Mem | Bypass | W
  ¤ Pick selects 2 threads for execution (stage added for the T2)
  ¤ In the bypass stage, the load/store unit (LSU) forwards data to the integer register files (IRFs) with sufficient write timing margin. All integer operations pass through the bypass stage.
¨ 12-stage floating point pipeline
  Fetch | Cache | Pick | Decode | Execute | Fx1 | . . . | Fx5 | FB | FW
  ¤ 6-cycle latency for dependent FP ops
  ¤ Integer multiplies are pipelined between different threads. Integer multiplies block within the same thread.
  ¤ Integer divide is a long latency operation. Integer divides are not pipelined between different threads.
MIPS (ARM) vs x86
                   MIPS (ARM)              x86
Address:           32/64-bit               32/64-bit
Page size:         4KB                     4KB
Data:              aligned                 unaligned
Destination reg:   Left                    Right
                   add $rd,$rs1,$rs2       add %rs1,%rs2,%rd
Regs:              $0, $1, ..., $31        %r0, %r1, ..., %r7
Reg = 0:           $0                      (n.a.)
Return address:    $31                     (n.a.)
MIPS: “Three-address architecture”
• Arithmetic-logic instructions specify all 3 operands
    add $s0,$s1,$s2   # s0=s1+s2
  Benefit: fewer instructions → higher performance
x86: “Two-address architecture”
• Only 2 operands, so the destination is also one of the sources
    add $s1,$s0   # s0=s0+s1
  Often true in C statements: c += b;
  Benefit: smaller instructions → smaller code
MIPS (ARM) vs x86
MIPS: “load-store architecture”
• Only Load/Store access memory; all other operations are register-register; e.g.,
    lw $t0, 12($gp)
    add $s0,$s0,$t0   # s0=s0+Mem[12+gp]
  Benefit: simpler hardware → easier to pipeline, higher performance
x86: “register-memory architecture”
• All operations can have an operand in memory; the other operand is a register; e.g.,
    add 12(%gp),%s0   # s0=s0+Mem[12+gp]
  Benefit: fewer instructions → smaller code
MIPS: “fixed-length instructions”
• All instructions same size, e.g., 4 bytes
• Simple hardware → performance
• Branches can be multiples of 4 bytes
x86: “variable-length instructions”
• Instructions are a whole number of bytes: 1 to 15
  → small code size (30% smaller?)
• More recent performance benefit: better instruction cache hit rates
• Instructions can include 8- or 32-bit immediates