Gary Marsden Slide 1University of Cape Town
Interfacing Processor and Peripherals
Overview
[Diagram: typical system organisation. Labels: Processor, Cache, Memory–I/O bus, Main memory, I/O controller (x3), Disk (x2), Graphics output, Network, Interrupts]
Gary Marsden Slide 2University of Cape Town
Introduction
I/O often viewed as second class to processor design
– Processor research is cleaner
– System performance given in terms of processor
– Courses often ignore peripherals
– Writing device drivers is not fun
This is crazy - a computer with no I/O is pointless
Gary Marsden Slide 3University of Cape Town
Peripheral design
As with processors, characteristics of I/O driven by technology advances
– E.g. properties of disk drives affect how they should be connected to the processor
– PCs and supercomputers now share the same architectures, so I/O can make all the difference
Different requirements from processors
– Performance
– Expandability
– Resilience
Gary Marsden Slide 4University of Cape Town
Peripheral performance
Harder to measure than for the processor
– Device characteristics
• Latency / Throughput
– Connection between system and device
– Memory hierarchy
– Operating System
Assume 100 secs to execute a benchmark
– 90 secs CPU and 10 secs I/O
– If processors get 50% faster per year for the next 5 years, what is the impact?
Gary Marsden Slide 5University of Cape Town
Relative performance
CPU time + IO time = total time (% of IO time)
Year 0: 90 + 10 = 100 (10%)
Year 1: 60 + 10 = 70 (14%)
:
Year 5: 12 + 10 = 22 (45%) !
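The table above can be reproduced with a few lines of Python. This is only a sketch: the function name `io_share` and its parameters are made up for illustration, and "50% faster per year" is taken to mean that CPU time is divided by 1.5 each year while I/O time stays fixed at 10 secs. The slide rounds the Year 5 CPU time to 12 secs, so its percentage differs slightly from the unrounded value.

```python
def io_share(cpu_secs, io_secs, speedup_per_year, years):
    """Return (cpu_time, total_time, io_fraction) after `years` of CPU speedup.

    Assumes I/O time does not improve at all (the slide's scenario).
    """
    cpu = cpu_secs / (speedup_per_year ** years)
    total = cpu + io_secs
    return cpu, total, io_secs / total


if __name__ == "__main__":
    for year in range(6):
        cpu, total, fraction = io_share(90.0, 10.0, 1.5, year)
        print(f"Year {year}: {cpu:.1f} + 10 = {total:.1f} ({fraction:.1%} I/O)")
```

The point of the exercise is that the I/O share grows from 10% to nearly half of the run time even though I/O itself did not get any slower.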
Gary Marsden Slide 6University of Cape Town
IO bandwidth
Measured in 2 ways depending on application
– How much data can we move through the system in a given time
• Important for supercomputers with large amounts of data for, say, weather prediction
– How many IO operations can we do in a given time
• An ATM transaction is a small amount of data but needs to be handled rapidly
So comparison is hard. Generally
– Response time lowered by handling requests early
– Throughput increased by handling multiple requests together
Gary Marsden Slide 7University of Cape Town
I/O Performance Measures
Look at some examples from the world of disks: all sorts of factors and uses
Different examples of performance benchmarks for different applications:
Supercomputers
– I/O dominated by access to large files
– Batch jobs of several hours
– Large read followed by many writes (snapshots in case process fails)
– Main measure: throughput
Gary Marsden Slide 8University of Cape Town
More Measures
Transaction processing (TP)
– TP involves both a response time requirement and a level of throughput performance
– Most accesses are small, so chiefly concerned with I/O rate (# disk accesses / second) as opposed to data rate (bytes of data per second)
– Usually related to databases: graceful failure required, and reliability essential
– Benchmark: TPC-C
• 128 pages long
• Measures T/s
• Includes other system elements (e.g. terminals)
Gary Marsden Slide 9University of Cape Town
File system I/O benchmarks
File systems stored on disk have different access patterns (each OS stores files differently)
Can ‘profile’ accesses to create synthetic file system benchmarks
E.g. for Unix in an engineering environment:
– 80% of accesses to files < 100k
– 90% of accesses to sequential addresses
– 67% reads
– 27% writes
– 6% read-modify-write
Gary Marsden Slide 10University of Cape Town
Typical File benchmark
5 phases using 70 files, totalling 200k
– Make dir
– Copy
– Scan Dir (recursive for all attributes)
– Read all
– Make
Gary Marsden Slide 11University of Cape Town
Device Types and characteristics
Key characteristics
– Behaviour: Input / Output / Storage (read & write)
– Partner: Human / Machine
– Data rate: Peak data transfer rate
Device          Behaviour   Partner   Kb / sec
Mouse           Input       Human     0.02
Laser Printer   Output      Human     200
Hard Disk       Storage     Machine   2000 - 10000
Gary Marsden Slide 12University of Cape Town
Mouse
Communicates with
– Pulses from LED
– Increment / decrement counters
Mice have at least 1 button
– Need click and hold
Movement is smooth, slower than processor
– Polling
– No submarining
– Software configuration
[Diagram: counter changes for moving away from the initial position of the mouse in each of eight directions, each a combination of +20 or –20 in X and +20 or –20 in Y]
Gary Marsden Slide 13University of Cape Town
Mouse guts
Gary Marsden Slide 14University of Cape Town
Hard disk
Rotating rigid platters with magnetic surfaces
Data read/written via head on armature
– Think record player
Storage is non-volatile
Surface divided into tracks
– Several thousand concentric circles
Track divided in sectors
– 128 or so sectors per track
Gary Marsden Slide 15University of Cape Town
Diagram
[Diagram: disk structure. Labels: Platters, Platter, Tracks, Track, Sectors]
Gary Marsden Slide 16University of Cape Town
Access time
Three parts
1. Perform a seek to position arm over correct track
2. Wait until desired sector passes under head. Called rotational latency or delay
3. Transfer time to read information off disk
– Usually a sector at a time at 2~4 Mb / sec
– Control is handled by a disk controller, which can add its own delays.
Gary Marsden Slide 17University of Cape Town
Calculating time
Seek time:
– Measure max and divide by two
– More formally: (sum of all possible seeks) / number of possible seeks
Latency time:
– Average is half of a complete spin
– 0.5 rotations / spin speed (3600~5400 rpm)
– 0.5 / (3600 / 60)
– 0.0083 secs
– 8.3 ms
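As a quick check of the latency arithmetic, the following sketch computes the average rotational latency for a given spin speed. The function name is invented for illustration; the only assumption is the one on the slide, that on average the head waits for half a rotation.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency in milliseconds: half a rotation."""
    rotations_per_sec = rpm / 60.0
    seconds = 0.5 / rotations_per_sec
    return seconds * 1000.0


if __name__ == "__main__":
    for rpm in (3600, 5400):
        print(f"{rpm} rpm: {avg_rotational_latency_ms(rpm):.1f} ms")
```

At 3600 rpm this gives the 8.3 ms quoted above; the faster 5400 rpm drive waits a little over 5.5 ms on average.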
Gary Marsden Slide 18University of Cape Town
Comparison
Gary Marsden Slide 19University of Cape Town
More faking
Disk drive hides internal optimisations from external world
Gary Marsden Slide 20University of Cape Town
Networks
Currently very important
Factors
– Distance: 0.01m to 10 000 km
– Speed: 0.001 Mb/sec to 1Gb/sec
– Topology: Bus, ring, star, tree
– Shared lines: None (point to point) or shared (multidrop)
Gary Marsden Slide 21University of Cape Town
Example 1: RS-232
For very simple terminal networks
– From olden times when 80x24 text terminals connected to mainframes over dedicated lines
– 0.3 to 19.2 Kbps
– Point to point, star
– 10 to 100m
Gary Marsden Slide 22University of Cape Town
Example 2: Ethernet LAN
10 - 100 Mbps
One wire bus with no central control
– Multiple masters
– Only one sender at a time which limits bandwidth but ok as utilisation is low
– Messages or packets are sent in blocks of 64 bytes (0.1 ms to send) to 1518 bytes (1.5 ms)
Listen, start, collision detection, backoff
Gary Marsden Slide 23University of Cape Town
Example 3: WAN ARPANET
10 to n thousand km
ARPANET first and most famous WAN (56 Kbps)
– Point to point dedicated lines
– Host computer communicated with interface message processor (IMP)
– IMPs used the phone lines to communicate
– IMPs packetised messages into 1Kbit chunks
– Packet switched delivery (store and forward)
– Packets reassembled at receiving IMP
Gary Marsden Slide 24University of Cape Town
Some thoughts
Currently 100 Mbps over copper is common and we have Gbps technology in place
This is fast!
We need systems that process data as quickly as it arrives
Hard to know where the bottlenecks lie
Network at UCT has / will have Gigabit backplane
Our international connection is only 10 Mbps
Gary Marsden Slide 25University of Cape Town
Buses: Connecting I/O devices
Interfacing subsystems in a computer system is commonly done with a bus: “a shared communication link, which uses one set of wires to connect multiple sub-systems”
Gary Marsden Slide 26University of Cape Town
Why a bus?
Main benefits:
– Versatility: new devices easily added
– Low cost: reusing a single set of wires many ways
Problems:
– Creates a bottleneck
– Tries to be all things to all subsystems
Comprised of
– Control lines: signal requests and acknowledgements, and show what type of information is on the data lines
– Data lines: data, destination / source address
Gary Marsden Slide 27University of Cape Town
Controlling a bus
As the bus is shared, need a protocol to manage usage
Bus transaction consists of
– Sending the address
– Sending / receiving the data
Note that in buses, we talk about what the bus does to memory
– During a read, a bus will ‘receive’ data
Gary Marsden Slide 28University of Cape Town
Bus transaction 1
[Figure: Patterson Fig. 8.07, parts a to c. Each part shows Memory, Processor and Disks connected by control lines and data lines]
Gary Marsden Slide 29University of Cape Town
Bus transaction 2
[Figure: Patterson Fig. 8.08, parts a and b. Each part shows Memory, Processor and Disks connected by control lines and data lines]
Gary Marsden Slide 30University of Cape Town
Types of Bus
Processor-memory bus
– Short and high speed
– Matched to memory system (usually proprietary)
I/O buses
– Lengthy
– Connected to a wide range of devices
– Usually connected to the processor using 1 or 3 (a processor-memory or backplane bus)
Backplane bus
– Processors, memory and devices on single bus
– Has to balance proc-memory with I/O-memory
– Usually requires extra logic to do this
Gary Marsden Slide 31University of Cape Town
Bus type diagram
[Diagram: a. A single backplane bus connecting Processor, Memory and I/O devices. b. A processor-memory bus with bus adapters to separate I/O buses. c. A processor-memory bus with a bus adapter to a backplane bus, which in turn has bus adapters to I/O buses]
Gary Marsden Slide 32University of Cape Town
Synchronous and Asynchronous buses
Synchronous bus has a clock attached to the control lines and a fixed protocol for communicating that is relative to the pulse
Advantages
– Easy to implement (CC1 read, CC5 return value)
– Requires little logic (FSM to specify)
Disadvantages
– All devices must run at same rate
– If fast, cannot be long due to clock skew
Most proc-mem buses are clocked
Gary Marsden Slide 33University of Cape Town
Asynchronous buses
No clock, so it can accommodate a variety of devices (no clock = no skew)
Needs a handshaking protocol to coordinate different devices
– Agreed steps to progress through by sender and receiver
– Harder to implement - needs more control lines
Gary Marsden Slide 34University of Cape Town
Example handshake - device wants a word from memory
Gary Marsden Slide 35University of Cape Town
FSM control
[Diagram: two finite state machines, one for the I/O device and one for Memory. Transitions are triggered by ReadReq, DataRdy and Ack or their complements]
I/O device: Put address on data lines; assert ReadReq
1 (Memory): Record from data lines and assert Ack
2 (I/O device): Release data lines; deassert ReadReq
3, 4 (Memory): Drop Ack; put memory data on data lines; assert DataRdy
5 (I/O device): Read memory data from data lines; assert Ack
6 (Memory): Release data lines and DataRdy
7 (I/O device): Deassert Ack
New I/O request
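The handshake steps in the FSM figure can be walked through in software. The sketch below is an illustration only, not a hardware model: the function `async_read`, the `lines` dictionary and the use of a Python dict for memory are all assumptions made here. Real devices would each wait for a signal level; this version runs the steps in order and merely checks that each signal has the level the protocol expects.

```python
def async_read(memory, address):
    """Walk one word read through the seven handshake steps.

    `memory` is a dict of address -> word (a hypothetical stand-in for RAM).
    Returns the word the device received and a log of the steps taken.
    """
    lines = {"ReadReq": False, "DataRdy": False, "Ack": False, "data": None}
    log = []

    def expect(signal, level):
        # Hardware would wait here; we only verify the protocol state.
        if lines[signal] != level:
            raise RuntimeError(f"protocol error: {signal} is not {level}")

    # Device: put the address on the data lines and assert ReadReq.
    lines["data"], lines["ReadReq"] = address, True
    log.append("device: address on data lines, assert ReadReq")

    # 1. Memory sees ReadReq, records the address and asserts Ack.
    expect("ReadReq", True)
    latched_address = lines["data"]
    lines["Ack"] = True
    log.append("1 memory: record address, assert Ack")

    # 2. Device sees Ack, releases the data lines and deasserts ReadReq.
    expect("Ack", True)
    lines["data"], lines["ReadReq"] = None, False
    log.append("2 device: release data lines, deassert ReadReq")

    # 3. Memory sees ReadReq low and drops Ack.
    expect("ReadReq", False)
    lines["Ack"] = False
    log.append("3 memory: drop Ack")

    # 4. Memory puts the requested word on the data lines, asserts DataRdy.
    lines["data"], lines["DataRdy"] = memory[latched_address], True
    log.append("4 memory: data on data lines, assert DataRdy")

    # 5. Device sees DataRdy, reads the word and asserts Ack.
    expect("DataRdy", True)
    word = lines["data"]
    lines["Ack"] = True
    log.append("5 device: read data, assert Ack")

    # 6. Memory sees Ack, releases the data lines and DataRdy.
    expect("Ack", True)
    lines["data"], lines["DataRdy"] = None, False
    log.append("6 memory: release data lines and DataRdy")

    # 7. Device sees DataRdy low and deasserts Ack; the bus is idle again.
    expect("DataRdy", False)
    lines["Ack"] = False
    log.append("7 device: deassert Ack")

    return word, log


if __name__ == "__main__":
    word, steps = async_read({0x10: 0xBEEF}, 0x10)
    print(hex(word))
    print("\n".join(steps))
```

Note how every action by one side is triggered by a signal change from the other; that interlocking is what lets devices of different speeds share the bus without a clock.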
Gary Marsden Slide 36University of Cape Town
Increasing bus bandwidth
Key factors
– Data bus width: Wider = fewer cycles for transfer
– Separate vs multiplexed data and address lines
• Separating allows transfer in one bus cycle
– Block transfer: Transfer multiple blocks of data in consecutive cycles without resending addresses and control signals etc.
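To get a feel for how these three factors interact, here is a deliberately crude cost model. Everything in it is an assumption for illustration: one bus cycle per bus-width of data, one extra address cycle per transaction when address and data lines are multiplexed, no extra cycle when they are separate, and a transfer size that is a whole number of blocks. Real buses add arbitration, wait states and turnaround cycles that are ignored here.

```python
import math


def bus_cycles(total_bytes, width_bytes, multiplexed, block_bytes):
    """Estimate bus cycles for a transfer under a toy cost model.

    Each transaction moves `block_bytes`; a multiplexed bus spends one
    cycle sending the address before the data cycles of each transaction.
    """
    transactions = math.ceil(total_bytes / block_bytes)
    data_cycles = math.ceil(block_bytes / width_bytes)
    address_cycles = 1 if multiplexed else 0
    return transactions * (address_cycles + data_cycles)


if __name__ == "__main__":
    print("baseline      :", bus_cycles(64, 4, True, 4))
    print("wider bus     :", bus_cycles(64, 8, True, 8))
    print("separate lines:", bus_cycles(64, 4, False, 4))
    print("block transfer:", bus_cycles(64, 4, True, 64))
```

Under these assumptions each of the three techniques roughly halves the cycle count for a 64-byte transfer compared with the multiplexed, 4-byte-wide, word-at-a-time baseline.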
Gary Marsden Slide 37University of Cape Town
Obtaining bus access
Need one, or more, bus masters to prevent chaos
Processor is always a bus master as it needs to access memory
– Memory is always a slave
Simplest system has a single master (CPU)
Problems
– Every transfer needs CPU time
– As peripherals become smarter, this is a waste of time
But, multiple masters can cause problems
Gary Marsden Slide 38University of Cape Town
Bus Arbitration
Deciding which master gets to go next
– Master issues ‘bus request’ and awaits ‘granted’
Two key properties
– Bus priority (highest first)
– Bus fairness (even the lowest get a go, eventually)
Arbitration is an overhead, so good to reduce it
– Dedicated lines, grant lines, release lines etc.
Gary Marsden Slide 39University of Cape Town
Different arbitration schemes
Daisy chain: Bus grant line runs through devices from highest to lowest
Very simple, but cannot guarantee fairness
[Diagram: the bus arbiter's Grant line passes from Device 1 (highest priority) through Device 2 to Device n (lowest priority); Release and Request lines run back to the arbiter]
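The lack of fairness in a daisy chain is easy to demonstrate. In this sketch (function names are invented for illustration) the grant is passed down the chain and the first requesting device keeps it, so a device further down only wins when nobody ahead of it is asking.

```python
def daisy_chain_grant(requests):
    """Return the index of the device that receives the grant, or None.

    `requests[0]` is the device closest to the arbiter (highest priority).
    A requesting device keeps the grant instead of passing it on.
    """
    for position, wants_bus in enumerate(requests):
        if wants_bus:
            return position
    return None


def count_grants(request_pattern, rounds):
    """Count how often each device wins if the same requests repeat."""
    wins = [0] * len(request_pattern)
    for _ in range(rounds):
        winner = daisy_chain_grant(request_pattern)
        if winner is not None:
            wins[winner] += 1
    return wins


if __name__ == "__main__":
    # Device 0 and device 2 both request the bus on every round.
    print(count_grants([True, False, True], rounds=10))
```

With the highest-priority device requesting on every round, the lowest-priority device is never granted the bus, which is the starvation the slide warns about.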
Gary Marsden Slide 40University of Cape Town
Centralised Arbitration
Centralised, parallel: All devices have separate connections to the bus arbiter
– This is how the PCI backplane bus works (found in most PCs)
– Can guarantee fairness
– Arbiter can become congested
Gary Marsden Slide 41University of Cape Town
Distributed
Distributed arbitration by self selection:
Each device contains information about relative importance
A device places its ID on the bus when it wants access
If there is a conflict, the lower priority devices back down
Requires separate lines and complex devices
Used on the Macintosh II series (NuBus)
Gary Marsden Slide 42University of Cape Town
Collision detection
Distributed arbitration by collision detection:
Basically Ethernet
Everyone tries to grab the bus at once
If there is a ‘collision’ everyone backs off a random amount of time
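A minimal sketch of the idea follows. It is a simplification and not the actual Ethernet algorithm: here every colliding station picks a random slot from a fixed range, and the station with the earliest unique slot gets the bus; if the earliest slot is shared, that is another collision and everyone tries again. The function name, the `max_slots` parameter and the slot model are assumptions for illustration.

```python
import random


def resolve_collision(stations, max_slots=8, rng=None):
    """Repeat random backoff until exactly one station picks the earliest slot.

    Returns (winning_station, number_of_backoff_rounds).
    """
    if not stations:
        raise ValueError("need at least one station")
    rng = rng or random.Random()
    rounds = 0
    while True:
        rounds += 1
        # Every station independently backs off a random number of slots.
        slots = {station: rng.randrange(max_slots) for station in stations}
        earliest = min(slots.values())
        first = [s for s, slot in slots.items() if slot == earliest]
        if len(first) == 1:
            return first[0], rounds
        # Two or more stations chose the same earliest slot: collide again.


if __name__ == "__main__":
    winner, rounds = resolve_collision(["A", "B", "C"], rng=random.Random(1))
    print(f"station {winner} won after {rounds} round(s)")
```

No arbiter and no priority lines are needed, which is why this scheme suits a long shared wire, but the time to win the bus is not bounded.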
Gary Marsden Slide 43University of Cape Town
Bus standards
To ensure machine expansion and peripheral re-use, there are various standard buses
– IBM PC-AT bus (de-facto standard)
– SCSI (needs controller)
– PCI (started at Intel, now maintained by the PCI-SIG)
– Ethernet
Bus bandwidth depends on size of transfer and memory speed
Gary Marsden Slide 44University of Cape Town
PCI
Type               Backplane
Data width         32-64
Address/data       Multiplexed
Bus masters        Multiple
Arbitration        Central parallel
Clocking           Synch. 33-66 MHz
Theoretical peak   133-512 MB/sec
Achievable peak    80 MB/sec
Max devices        1024
Max length         50 cm
Bananas            none
Gary Marsden Slide 45University of Cape Town
My Macintosh
[Diagram: Processor and Main memory attached to a PCI interface/memory controller. The PCI bus connects I/O controllers for Graphics output, Ethernet, the SCSI bus (CDROM, Disk, Tape), Stereo input/output, Serial ports and the Apple desktop bus]
Gary Marsden Slide 46University of Cape Town
Giving commands to I/O devices
Processor must be able to address a device
– Memory mapping: portions of memory are allocated to a device (Base address on a PC)
• Different addresses in the space mean different things
• Could be a read, write or device status address
– Special instructions: Machine code for specific devices
• Not a good idea generally
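Memory mapping can be illustrated with a toy address decoder. All names here (`ToyDevice`, `AddressSpace`, the register offsets and the base address used in the example) are hypothetical; the sketch only shows that ordinary loads and stores reach a device when the address falls inside its mapped range, and that different offsets from the base address mean different things.

```python
class ToyDevice:
    """Hypothetical device with two one-byte registers."""

    DATA, STATUS = 0, 1  # register offsets from the base address

    def __init__(self):
        self.data = 0
        self.ready = 0

    def read(self, offset):
        if offset == self.STATUS:
            return self.ready
        self.ready = 0  # reading the data register clears "ready"
        return self.data

    def write(self, offset, value):
        if offset == self.DATA:
            self.data = value & 0xFF
            self.ready = 1


class AddressSpace:
    """RAM plus devices that claim ranges of addresses."""

    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)
        self.mapped = []  # list of (base, size, device)

    def map_device(self, base, size, device):
        self.mapped.append((base, size, device))

    def _find(self, address):
        for base, size, device in self.mapped:
            if base <= address < base + size:
                return device, address - base
        return None, address

    def load(self, address):
        device, offset = self._find(address)
        return device.read(offset) if device else self.ram[address]

    def store(self, address, value):
        device, offset = self._find(address)
        if device:
            device.write(offset, value)
        else:
            self.ram[address] = value & 0xFF


if __name__ == "__main__":
    space = AddressSpace(256)
    space.map_device(0xF0, 2, ToyDevice())  # base address 0xF0 is arbitrary
    space.store(0xF0, 65)                   # looks like a memory write
    print(space.load(0xF1), space.load(0xF0))
```

The processor needs no special instructions: the same load and store operations serve both memory and devices, which is the main attraction of the scheme.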
Gary Marsden Slide 47University of Cape Town
Communicating with the Processor
Polling
– Process of periodically checking the status bits to see if it is time for the next I/O operation
– Simplest way for device to communicate (via a shared status register)
– Mouse
– Wasteful of processor time
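A polling loop for a mouse might look like the sketch below. The `Mouse` class is a made-up software stand-in for the device's status bit and movement counters, and `poll` is an illustrative name. The loop also counts polls that found nothing to do, which is the wasted processor time mentioned above.

```python
class Mouse:
    """Hypothetical mouse model: queued (dx, dy) counter changes."""

    def __init__(self, movements):
        self._movements = list(movements)

    def status_ready(self):
        # Stands in for reading a "data available" bit in a status register.
        return bool(self._movements)

    def read_counters(self):
        return self._movements.pop(0)


def poll(device, max_polls):
    """Poll the device `max_polls` times; return the position and wasted polls."""
    x = y = 0
    wasted = 0
    for _ in range(max_polls):
        if device.status_ready():
            dx, dy = device.read_counters()
            x += dx
            y += dy
        else:
            wasted += 1  # processor time spent checking for nothing
    return (x, y), wasted


if __name__ == "__main__":
    position, wasted = poll(Mouse([(20, 0), (0, -20)]), max_polls=5)
    print(position, wasted)
```

For a slow device such as a mouse the waste is tolerable because the polling rate can be low; for a fast device the processor would spend most of its time in this loop.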
Gary Marsden Slide 48University of Cape Town
Interrupts
Notify processor when a device needs attention (IRQ lines on a PC)
Just like exceptions, except for
– Interrupt is asynchronous with program execution
• Control unit only checks I/O interrupt at the start of each instruction execution
– Need further information, such as the identity of the device that caused the interrupt and its priority
• Remember the Cause Register?
Gary Marsden Slide 49University of Cape Town
Transferring Data Between Device and Memory
We can do this with Interrupts and Polling
– Works best with low bandwidth devices and keeps the cost of controller and interface low
– Burden lies with the processor
For high bandwidth devices, we don’t want the processor worrying about every single block
Need a scheme for high bandwidth autonomous transfers
Gary Marsden Slide 50University of Cape Town
Direct Memory Access (DMA)
Mechanism for offloading the processor and having the device controller transfer data directly
Still uses interrupt mechanism, but only to communicate completion of transfer or error
Requires dedicated controller to conduct the transfer
Gary Marsden Slide 51University of Cape Town
Doing DMA
Essentially, DMA controller becomes bus master and sets up the transfer
Three steps
– Processor sets up the DMA by supplying
• Device identity
• Operation on device
• Memory address (source or destination)
• Amount to transfer
– DMA operates devices, supplies addresses and arbitrates bus
– On completion, controller notifies processor
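The three steps can be sketched in software as follows. This is an illustration under stated assumptions, not a model of any real controller: `DmaRequest` holds the four items the processor supplies, device buffers and main memory are plain bytearrays, and a callback named `notify` stands in for the completion or error interrupt.

```python
from dataclasses import dataclass


@dataclass
class DmaRequest:
    device_id: int
    operation: str       # "to_memory" or "from_memory" (names assumed here)
    memory_address: int  # source or destination in main memory
    byte_count: int      # amount to transfer


class DmaController:
    def __init__(self, memory, devices, notify):
        self.memory = memory    # bytearray standing in for main memory
        self.devices = devices  # dict: device_id -> bytearray device buffer
        self.notify = notify    # callback standing in for the interrupt

    def start(self, request):
        """Carry out the whole transfer without involving the processor."""
        buffer = self.devices.get(request.device_id)
        start = request.memory_address
        end = start + request.byte_count
        bad = (
            buffer is None
            or request.byte_count < 0
            or request.byte_count > len(buffer)
            or start < 0
            or end > len(self.memory)
            or request.operation not in ("to_memory", "from_memory")
        )
        if bad:
            self.notify(request, "error")
            return
        if request.operation == "to_memory":
            self.memory[start:end] = buffer[:request.byte_count]
        else:
            buffer[:request.byte_count] = self.memory[start:end]
        self.notify(request, "done")


if __name__ == "__main__":
    memory = bytearray(16)
    disk = bytearray(b"abcdefgh")
    dma = DmaController(memory, {1: disk}, lambda req, status: print(status))
    dma.start(DmaRequest(device_id=1, operation="to_memory",
                         memory_address=4, byte_count=3))
    print(bytes(memory))
```

The processor's involvement is limited to building the request and handling the single notification at the end, however many bytes are moved.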
Gary Marsden Slide 52University of Cape Town
DMA and the memory system
With DMA, the relationship between memory and processor is changed
– DMA bypasses address translation and hierarchy
So, should DMA use virtual or physical addresses?
– Virtual addresses: DMA must translate
– Physical addresses: Hard to cross page boundary
Gary Marsden Slide 53University of Cape Town
DMA address translation
Can provide the DMA with a small address translation table for pages - provided by OS at transfer time
Get the OS to break the transfer into chunks, each chunk relating to a single page
Regardless, the OS cannot relocate pages during transfer
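The second option, having the OS break a transfer into per-page chunks, comes down to a small loop. The function name and the 4096-byte default page size are assumptions for illustration; each returned chunk lies entirely within one page, so the OS can translate its starting address once and hand the DMA a physical address.

```python
def split_by_page(virtual_address, length, page_size=4096):
    """Split a transfer into (address, size) chunks that never cross a page."""
    chunks = []
    address, remaining = virtual_address, length
    while remaining > 0:
        room_in_page = page_size - (address % page_size)
        size = min(room_in_page, remaining)
        chunks.append((address, size))
        address += size
        remaining -= size
    return chunks


if __name__ == "__main__":
    for address, size in split_by_page(4000, 5000):
        print(f"page {address // 4096}: start {address}, {size} bytes")
```

A 5000-byte transfer starting at address 4000 touches three pages, so it becomes three DMA transfers, each of which can go to a different physical page.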
Gary Marsden Slide 54University of Cape Town
DMA Cache problems
DMA can create inconsistencies between cache and main memory
– Called the ‘stale data’ or ‘coherency’ problem
Solve by
– Route all I/O activity through cache (expensive)
– Have the whole cache flushed (easy and not too bad)
– Selectively flush cache (slightly more efficient but lots of control circuit needed)
Gary Marsden Slide 55University of Cape Town
Gary Marsden Slide 56University of Cape Town
Parallel Processors
Parallel processing machines are common
– All current G5 Macs have two processors
Parallel categorisation (Flynn 1966)
– Single Instruction Stream, Single Data Stream (SISD)
– Single Instruction, Multiple Data (SIMD - MMX)
– Multiple Instruction, Single Data (MISD - SuperScalar)
– Multiple Instructions, Multiple Data (MIMD - true parallelism)
Gary Marsden Slide 57University of Cape Town
Directions
Microprocessors get faster every year
Development costs are high
– MIPS R4000: 30 engineers for 3 years; $30 million to develop; $10 million to fabricate; 50000 hours simulation
– High costs led companies to look at re-using existing chips in multi-processor machines
– This is possible due to improvements in memory and bus technology
– Still expensive to run these things
Gary Marsden Slide 58University of Cape Town
Evolution vs Revolution
Evolutionary approaches tend to be invisible to users except for
– Lower cost and better performance
Revolutionary approaches require new languages and applications
– Looks good on paper
– Must be worth the effort
– KCM
Gary Marsden Slide 59University of Cape Town
RISC vs CISC
Reduced Instruction Set Computing vs Complex Instruction Set Computing
Falls between evolutionary and revolutionary
– Requires some small changes (68k - PPC Macintosh)
CISC: Instruction sets close to high level languages simplify compilers
– C is very close to PDP-11 assembly
– VAX was the last big CISC machine, before it was replaced by the ‘Alpha’
Gary Marsden Slide 60University of Cape Town
RISC
Emerged in 80s when programming languages, memory and compiler improvements meant there was less need for assembly programming
Became important to optimise instruction sets for compilers, not programmers
Philosophy: fixed instruction lengths, load-store instruction set, limited addressing modes, limited operations
Has been hugely successful: PPC, SPARC, Alpha, Apollo, ARM etc.
Gary Marsden Slide 61University of Cape Town
Pentium
Is both CISC and RISC
Backwards compatibility is very important for Windows users
– No real change since 80386
So, Pentiums have a RISC core surrounded by CISC translation system
– Like compilation in hardware
Gary Marsden Slide 62University of Cape Town
Pentium Instruction Set
Pentium Pro added some enhanced instructions to better exploit hardware
– Largely ignored
– Not supported by AMD
MMX instructions were added (Multi-media)
– SIMD instructions for graphics etc.
Gary Marsden Slide 63University of Cape Town
Processor Modes
Real Mode
– For DOS programs which can access 1Mb of RAM
Protected mode
– For multiple applications, but no Real mode
Virtual Mode
– Enhancement to protected mode
– Fakes “Real mode” in “Protected” mode
Gary Marsden Slide 64University of Cape Town
Pentium buses
Processor bus
– Processor to chipset
Cache bus
– “Backside” bus
Memory bus
Local I/O bus (PCI)
Standard I/O bus (ISA)
Additional AGP (Accelerated Graphics Port)
Gary Marsden Slide 65University of Cape Town
More on buses
Level 1 cache is included in the processor
Since the PII, level 2 cache is also included in the package
Pentium has separate address bus
Buses clock in multiples of the system clock
– Processor = system clock * 4
– Memory = system clock * 2
P4 internal bus runs at 400 MHz as opposed to standard 66 MHz
Gary Marsden Slide 66University of Cape Town
IA-64
Replacement for 32 bit Pentium
Design from the ground up by Intel and HP
Completely new RISC instruction set with 64 bit registers etc.
Will still support 80386 code!