
Gary Marsden, University of Cape Town

Interfacing Processor and Peripherals

Overview

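[Figure: system overview. A processor with its cache sits on a memory–I/O bus together with main memory and three I/O controllers, which serve two disks, graphics output and a network. The controllers signal the processor with interrupts.]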


Introduction

I/O often viewed as second class to processor design
– Processor research is cleaner
– System performance given in terms of processor
– Courses often ignore peripherals
– Writing device drivers is not fun

This is crazy - a computer with no I/O is pointless


Peripheral design

As with processors, characteristics of I/O are driven by technology advances
– E.g. properties of disk drives affect how they should be connected to the processor
– PCs and supercomputers now share the same architectures, so I/O can make all the difference

Different requirements from processors
– Performance
– Expandability
– Resilience


Peripheral performance

Harder to measure than for the processor
– Device characteristics
  • Latency / Throughput
– Connection between system and device
– Memory hierarchy
– Operating System

Assume 100 secs to execute a benchmark
– 90 secs CPU and 10 secs I/O
– If processors get 50% faster per year for the next 5 years, what is the impact?


Relative performance

CPU time + IO time = total time (% of IO time)

Year 0: 90 + 10 = 100 (10%)
Year 1: 60 + 10 = 70 (14%)
:
Year 5: 12 + 10 = 22 (45%) !
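The table can be reproduced with a few lines of Python. This is only a sketch: the function name and its parameters are made up for illustration, and "50% faster" is read as CPU time being divided by 1.5 each year. The slide rounds CPU time to whole seconds, so the unrounded I/O share can differ from the table by about a percentage point.

```python
def io_share_by_year(cpu_secs=90.0, io_secs=10.0, speedup=1.5, years=5):
    """Return one (year, cpu_time, io_time, io_fraction) row per year."""
    rows = []
    for year in range(years + 1):
        cpu = cpu_secs / (speedup ** year)   # CPU part shrinks every year
        total = cpu + io_secs                # I/O part stays the same
        rows.append((year, cpu, io_secs, io_secs / total))
    return rows


for year, cpu, io, fraction in io_share_by_year():
    print(f"Year {year}: {cpu:.1f} + {io:.0f} = {cpu + io:.1f} ({fraction:.1%} I/O)")
```

The I/O time never changes, yet its share of the run grows from a tenth to nearly half.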


IO bandwidth

Measured in 2 ways depending on application
– How much data can we move through the system in a given time
  • Important for supercomputers with large amounts of data for, say, weather prediction
– How many IO operations can we do in a given time
  • An ATM transaction is a small amount of data but needs to be handled rapidly

So comparison is hard. Generally
– Response time lowered by handling early
– Throughput increased by handling multiple requests together


I/O Performance Measures

Look at some examples from the world of disks: all sorts of factors and uses

Different examples of performance benchmarks for different applications:

Supercomputers
– I/O dominated by access to large files
– Batch jobs of several hours
– Large read followed by many writes (snapshots in case process fails)
– Main measure: throughput


More Measures

Transaction processing (TP)
– TP involves both a response time requirement and a level of throughput performance
– Most accesses are small, so chiefly concerned with I/O rate (# disk accesses / second) as opposed to data rate (bytes of data per second)
– Usually related to databases: graceful failure required, and reliability essential
– Benchmark: TPC-C
  • 128 pages long
  • Measures T/s
  • Includes other system elements (e.g. terminals)


File system I/O benchmarks

File systems stored on disk have different access patterns (each OS stores files differently)

Can ‘profile’ accesses to create synthetic file system benchmarks

E.g. for Unix in an engineering environment:
– 80% of accesses to files < 100k
– 90% of accesses to sequential addresses
– 67% reads
– 27% writes
– 6% read-modify-write
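A synthetic benchmark built from such a profile could draw its operations at random in the measured proportions. The sketch below assumes only the read / write / read-modify-write mix from the profile above; the names `OP_MIX` and `synthetic_ops` are invented for illustration and do not come from any real benchmark suite.

```python
import random

# Operation mix taken from the profile; the dictionary keys are illustrative names.
OP_MIX = {"read": 0.67, "write": 0.27, "read-modify-write": 0.06}


def synthetic_ops(count, seed=0):
    """Draw `count` operation names with the profiled probabilities."""
    rng = random.Random(seed)          # seeded so a run can be repeated
    names = list(OP_MIX)
    weights = [OP_MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=count)


ops = synthetic_ops(10_000)
for name in OP_MIX:
    print(name, ops.count(name) / len(ops))
```

A fuller generator would also pick file sizes and sequential versus random addresses from the same profile.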


Typical File benchmark

5 phases using 70 files, totalling 200k
– Make dir
– Copy
– Scan Dir (recursive for all attributes)
– Read all
– Make


Device Types and characteristics

Key characteristics
– Behaviour: Input / Output / Storage (read & write)
– Partner: Human / Machine
– Data rate: Peak data transfer rate

Device          Behaviour   Partner   Kb / sec
Mouse           Input       Human     0.02
Laser Printer   Output      Human     200
Hard Disk       Storage     Machine   2000 - 10000


Mouse

Communicates with
– Pulses from LED
– Increment / decrement counters

Mice have at least 1 button
– Need click and hold

Movement is smooth, slower than processor
– Polling
– No submarining
– Software configuration

[Figure: the eight moves away from the initial position of the mouse and the counter changes they produce: +20 in X; –20 in X; +20 in Y; –20 in Y; and the four diagonal combinations of ±20 in X with ±20 in Y.]


Mouse guts



Hard disk

Rotating rigid platters with magnetic surfaces

Data read/written via head on armature
– Think record player

Storage is non-volatile

Surface divided into tracks
– Several thousand concentric circles

Track divided into sectors
– 128 or so sectors per track


Diagram

[Figure: a stack of platters, with the tracks on one platter and the sectors within a track labelled.]


Access time

Three parts
1. Perform a seek to position arm over correct track
2. Wait until desired sector passes under head. Called rotational latency or delay
3. Transfer time to read information off disk
– Usually a sector at a time at 2~4 Mb / sec
– Control is handled by a disk controller, which can add its own delays.


Calculating time

Seek time:
– Measure max and divide by two
– More formally: (sum of all possible seeks) / number of possible seeks

Latency time:
– On average, half of a complete spin
– 0.5 rotations / spin speed (3600~5400 rpm)
– At 3600 rpm: 0.5 / (3600 / 60)
– 0.0083 secs
– 8.3 ms
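The latency and seek arithmetic is easy to check in code. A minimal sketch, with function names chosen here for illustration: average rotational latency is half a rotation divided by the rotation rate in rotations per second, and average seek time is the mean over whatever seek times are supplied.

```python
def average_rotational_latency_ms(rpm):
    """Half a rotation, on average, before the wanted sector arrives."""
    rotations_per_sec = rpm / 60
    return 0.5 / rotations_per_sec * 1000    # seconds -> milliseconds


def average_seek(seek_times):
    """(sum of all possible seeks) / number of possible seeks."""
    return sum(seek_times) / len(seek_times)


for rpm in (3600, 5400):
    print(rpm, "rpm:", round(average_rotational_latency_ms(rpm), 1), "ms")
```

At 3600 rpm this gives the 8.3 ms quoted above; a faster spindle shortens the wait in proportion.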


Comparison



More faking

Disk drive hides internal optimisations from external world



Networks

Currently very important

Factors
– Distance: 0.01m to 10 000 km
– Speed: 0.001 Mb/sec to 1Gb/sec
– Topology: Bus, ring, star, tree
– Shared lines: None (point to point) or shared (multidrop)


Example 1: RS-232

For very simple terminal networks
– From olden times when 80x24 text terminals connected to mainframes over dedicated lines
– 0.3 to 19.2 Kbps
– Point to point, star
– 10 to 100m


Example 2: Ethernet LAN

10 - 100 Mbps

One wire bus with no central control
– Multiple masters
– Only one sender at a time, which limits bandwidth but is ok as utilisation is low
– Messages or packets are sent in blocks of 64 bytes (0.1 ms to send) to 1518 bytes (1.5 ms)

Listen, start, collision detection, backoff


Example 3: WAN ARPANET

10 to n thousand km

ARPANET first and most famous WAN (56 Kbps)
– Point to point dedicated lines
– Host computer communicated with interface message processor (IMP)
– IMPs used the phone lines to communicate
– IMPs packetised messages into 1Kbit chunks
– Packet switched delivery (store and forward)
– Packets reassembled at receiving IMP
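Packetising and reassembly can be sketched in a few lines. This is an illustration rather than the IMP protocol itself: "1Kbit" is read as 1000 bits (125 bytes), and the sequence numbers and function names are assumptions added so that packets arriving out of order can be put back together.

```python
PACKET_BITS = 1000                  # "1Kbit" read as 1000 bits (an assumption)
PACKET_BYTES = PACKET_BITS // 8


def packetise(message: bytes):
    """Split a message into (sequence number, payload) packets."""
    offsets = range(0, len(message), PACKET_BYTES)
    return [(seq, message[off:off + PACKET_BYTES]) for seq, off in enumerate(offsets)]


def reassemble(packets):
    """Rebuild the message at the receiver, whatever order packets arrived in."""
    return b"".join(payload for _, payload in sorted(packets))


packets = packetise(b"x" * 300)
print(len(packets), [len(payload) for _, payload in packets])
print(reassemble(reversed(packets)) == b"x" * 300)
```

Store and forward delivery means each packet may take its own route, which is why the receiver has to sort before joining.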


Some thoughts

Currently 100 Mbps over copper is common and we have Gbps technology in place

This is fast!

We need systems that process data as quickly as it arrives

Hard to know where the bottlenecks lie

Network at UCT has / will have Gigabit backplane

Our international connection is only 10 Mbps


Buses: Connecting I/O devices

Interfacing subsystems in a computer system is commonly done with a bus: “a shared communication link, which uses one set of wires to connect multiple sub-systems”


Why a bus?

Main benefits:
– Versatility: new devices easily added
– Low cost: reusing a single set of wires many ways

Problems:
– Creates a bottleneck
– Tries to be all things to all subsystems

Composed of
– Control lines: signal requests and acknowledgements, and show what type of information is on the data lines
– Data lines: data, destination / source address


Controlling a bus

As the bus is shared, need a protocol to manage usage

Bus transaction consists of
– Sending the address
– Sending / receiving the data

Note that in buses, we talk about what the bus does to memory
– During a read, a bus will ‘receive’ data


Bus transaction 1

[Figure: Patterson Fig. 8.07, panels a, b and c. Each panel shows memory, processor and disks connected by control lines and data lines at one stage of a bus transaction.]


Bus transaction 2

[Figure: Patterson Fig. 8.08, panels a and b. Each panel shows memory, processor and disks connected by control lines and data lines at one stage of a bus transaction.]


Types of Bus

1. Processor-memory bus
– Short and high speed
– Matched to memory system (usually proprietary)

2. I/O buses
– Lengthy
– Connected to a wide range of devices
– Usually connected to the processor using 1 or 3

3. Backplane bus
– Processors, memory and devices on single bus
– Has to balance proc-memory with I/O-memory
– Usually requires extra logic to do this


Bus type diagram

[Figure: three bus organisations. a. A single backplane bus connecting processor, memory and I/O devices. b. A processor-memory bus with bus adapters leading to separate I/O buses. c. A processor-memory bus with a bus adapter to a backplane bus, which in turn has bus adapters to I/O buses.]


Synchronous and Asynchronous buses

Synchronous bus has a clock attached to the control lines and a fixed protocol for communicating that is relative to the pulse

Advantages
– Easy to implement (CC1 read, CC5 return value)
– Requires little logic (FSM to specify)

Disadvantages
– All devices must run at same rate
– If fast, cannot be long due to clock skew

Most proc-mem buses are clocked


Asynchronous buses

No clock, so it can accommodate a variety of devices (no clock = no skew)

Needs a handshaking protocol to coordinate different devices
– Agreed steps to progress through by sender and receiver
– Harder to implement - needs more control lines


Example handshake - device wants a word from memory


FSM control

[Figure: finite state machines for the I/O device and the memory. Each state waits on ReadReq, DataRdy or Ack (or its negation) from the other side before moving on.]

I/O device: Put address on data lines; assert ReadReq
1. Memory: Record from data lines and assert Ack
2. I/O device: Release data lines; deassert ReadReq
3, 4. Memory: Drop Ack; put memory data on data lines; assert DataRdy
5. I/O device: Read memory data from data lines; assert Ack
6. Memory: Release data lines and DataRdy
7. I/O device: Deassert Ack

Both sides then wait for a new I/O request
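The handshake can be walked through in software. The sketch below is a sequential simulation, not hardware: the three control lines and the data lines are entries in a dictionary, memory is a plain Python dict, and the function name is made up. Each comment names the step of the handshake being modelled.

```python
def handshake_read(memory, address):
    """Simulate a device reading one word from memory; return (value, trace)."""
    lines = {"ReadReq": False, "DataRdy": False, "Ack": False, "data": None}
    trace = []

    def snapshot(step):
        trace.append((step, dict(lines)))    # copy of the bus after this step

    # I/O device: put address on data lines and assert ReadReq.
    lines["data"], lines["ReadReq"] = address, True
    snapshot("request")
    # 1. Memory sees ReadReq, records the address and asserts Ack.
    latched = lines["data"]
    lines["Ack"] = True
    snapshot("1")
    # 2. Device sees Ack, releases the data lines and deasserts ReadReq.
    lines["data"], lines["ReadReq"] = None, False
    snapshot("2")
    # 3, 4. Memory sees ReadReq low, drops Ack, drives the data, asserts DataRdy.
    lines["Ack"] = False
    lines["data"], lines["DataRdy"] = memory[latched], True
    snapshot("3,4")
    # 5. Device sees DataRdy, reads the data and asserts Ack.
    value = lines["data"]
    lines["Ack"] = True
    snapshot("5")
    # 6. Memory sees Ack, releases the data lines and DataRdy.
    lines["data"], lines["DataRdy"] = None, False
    snapshot("6")
    # 7. Device sees DataRdy low and deasserts Ack.
    lines["Ack"] = False
    snapshot("7")
    return value, trace


value, trace = handshake_read({0x10: 42}, 0x10)
print(value, [step for step, _ in trace])
```

No step depends on a clock: each side only moves when it sees the other side's signal change, which is what lets fast and slow devices share the bus.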


Increasing bus bandwidth

Key factors
– Data bus width: Wider = fewer cycles for transfer
– Separate vs Multiplexed, data and address lines
  • Separating allows transfer in one bus cycle
– Block transfer: Transfer multiple blocks of data in consecutive cycles without resending addresses and control signals etc.
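A toy cycle count shows how the three factors interact. The model is an assumption made for illustration: one bus word moves per data cycle, a multiplexed bus spends one extra cycle on each address it sends, and a bus with separate address lines sends the address in the same cycle as the data. Real buses have further overheads that are ignored here.

```python
import math


def bus_cycles(total_bytes, width_bytes, multiplexed, block_words=1):
    """Cycles needed to move total_bytes under the toy model described above."""
    words = math.ceil(total_bytes / width_bytes)      # one data cycle per word
    blocks = math.ceil(words / block_words)           # one address per block
    address_cycles = blocks if multiplexed else 0     # separate lines: no extra cycle
    return words + address_cycles


print(bus_cycles(64, 4, multiplexed=True))                  # narrow, multiplexed
print(bus_cycles(64, 8, multiplexed=True))                  # wider bus
print(bus_cycles(64, 4, multiplexed=False))                 # separate address lines
print(bus_cycles(64, 4, multiplexed=True, block_words=4))   # block transfer
```

Each change cuts the cycle count for a different reason: fewer words, no address cycles, or fewer addresses.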


Obtaining bus access

Need one, or more, bus masters to prevent chaos

Processor is always a bus master as it needs to access memory
– Memory is always a slave

Simplest system has a single master (CPU)

Problems
– Every transfer needs CPU time
– As peripherals become smarter, this is a waste of time

But, multiple masters can cause problems


Bus Arbitration

Deciding which master gets to go next
– Master issues ‘bus request’ and awaits ‘granted’

Two key properties
– Bus priority (highest first)
– Bus fairness (even the lowest get a go, eventually)

Arbitration is an overhead, so good to reduce it
– Dedicated lines, grant lines, release lines etc.


Different arbitration schemes

Daisy chain: Bus grant line runs through devices from highest to lowest

Very simple, but cannot guarantee fairness

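[Figure: daisy chain arbitration. The bus arbiter's grant line passes from Device 1 (highest priority) through Device 2 to Device n (lowest priority); the devices share the request and release lines back to the arbiter.]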
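Why a daisy chain cannot guarantee fairness shows up in a very small model. In this sketch (names invented for illustration) the grant enters at the highest priority device and is passed down only by devices that are not requesting the bus.

```python
def daisy_chain_grant(requests):
    """requests[i] is True if device i wants the bus; index 0 is highest priority.

    Returns the index of the device that keeps the grant, or None.
    """
    for device, wants_bus in enumerate(requests):
        if wants_bus:
            return device        # grant is absorbed here and not passed on
    return None


# Device 0 requests in every round, device 2 requests too but never wins.
rounds = [[True, False, True]] * 5
print([daisy_chain_grant(r) for r in rounds])
```

As long as a device near the arbiter keeps requesting, devices further down the chain are starved.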


Centralised Arbitration

Centralised, parallel: All devices have separate connections to the bus arbiter
– This is how the PCI backplane bus works (found in most PCs)
– Can guarantee fairness
– Arbiter can become congested


Distributed

Distributed arbitration by self selection:

Each device contains information about relative importance

A device places its ID on the bus when it wants access

If there is a conflict, the lower priority devices back down

Requires separate lines and complex devices

Used on the Macintosh II series (NuBus)


Collision detection

Distributed arbitration by collision detection:

Basically Ethernet

Everyone tries to grab the bus at once

If there is a ‘collision’ everyone backs off a random amount of time
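A slot-by-slot simulation makes the idea concrete. This sketch uses a plain random backoff drawn from a fixed window; real Ethernet uses binary exponential backoff, which is not modelled here. The function name, the window size and the slot limit are all assumptions for illustration.

```python
import random


def share_bus(stations, max_backoff=8, seed=1, max_slots=10_000):
    """Return the order in which the stations manage to send one frame each."""
    rng = random.Random(seed)
    wait = {s: 0 for s in stations}     # slots each station must still wait
    order = []
    for _ in range(max_slots):
        if not wait:
            break                        # everyone has sent
        ready = [s for s, w in wait.items() if w == 0]
        if len(ready) == 1:
            order.append(ready[0])       # sole sender: the transfer succeeds
            del wait[ready[0]]
        elif len(ready) > 1:
            for s in ready:              # collision: each picks a random backoff
                wait[s] = rng.randint(1, max_backoff)
        for s in wait:                   # one slot of time passes
            if wait[s] > 0:
                wait[s] -= 1
    return order


print(share_bus(["A", "B", "C"]))
```

Nobody coordinates the stations; the random delays alone make it unlikely that the same stations collide again and again.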


Bus standards

To ensure machine expansion and peripheral re-use, there are various standard buses
– IBM PC-AT bus (de-facto standard)
– SCSI (needs controller)
– PCI (Started as Intel, now IEEE)
– Ethernet

Bus bandwidth depends on size of transfer and memory speed


PCI

Type Backplane

Data width 32-64

Address/data Multiplexed

Bus masters Multiple

Arbitration Central parallel

Clocking Synch. 33-66 MHz

Theoretical Peak 133-512 MB/sec

Achievable peak 80 MB/sec

Max devices 1024

Max length 50 cm

Bananas none


My Macintosh

[Figure: I/O organisation of the Macintosh. The processor connects through a PCI interface / memory controller to main memory and to the PCI bus. I/O controllers on the PCI bus serve graphics output, Ethernet, a SCSI bus (CD-ROM, disk, tape), stereo input and output, serial ports and the Apple desktop bus.]


Giving commands to I/O devices

Processor must be able to address a device
– Memory mapping: portions of memory are allocated to a device (Base address on a PC)
  • Different addresses in the space mean different things
  • Could be a read, write or device status address
– Special instructions: Machine code for specific devices
  • Not a good idea generally
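Memory mapping can be mimicked in software by routing loads and stores through an address decoder. Everything in this sketch is hypothetical: the `ToyPrinter` device, its register layout (offset 0 for data, offset 1 for status) and the base address 0xF0 are chosen only to illustrate the idea.

```python
class ToyPrinter:
    """Hypothetical device: offset 0 is a data register, offset 1 a status register."""
    DATA, STATUS = 0, 1

    def __init__(self):
        self.printed = []

    def read(self, offset):
        return 1 if offset == self.STATUS else 0    # status 1 means "ready"

    def write(self, offset, value):
        if offset == self.DATA:
            self.printed.append(value)


class AddressSpace:
    def __init__(self, ram_words):
        self.ram = [0] * ram_words
        self.mapped = []                             # (base, size, device) entries

    def map_device(self, base, size, device):
        self.mapped.append((base, size, device))

    def _lookup(self, address):
        for base, size, device in self.mapped:
            if base <= address < base + size:
                return device, address - base        # offset within the device
        return None, address

    def load(self, address):
        device, offset = self._lookup(address)
        return device.read(offset) if device else self.ram[address]

    def store(self, address, value):
        device, offset = self._lookup(address)
        if device:
            device.write(offset, value)
        else:
            self.ram[address] = value


space = AddressSpace(256)
printer = ToyPrinter()
space.map_device(0xF0, 2, printer)      # base address is an arbitrary choice
space.store(0xF0, ord("A"))             # an ordinary store drives the device
print(space.load(0xF1), printer.printed)
```

The processor needs no special instructions: the same load and store reach either RAM or a device depending only on the address.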


Communicating with the Processor

Polling
– Process of periodically checking the status bits to see if it is time for the next I/O operation
– Simplest way for device to communicate (via a shared status register)
– Mouse
– Wasteful of processor time
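A polling loop is only a few lines, which is part of its appeal. In the sketch below the `SlowDevice` class and its countdown are invented stand-ins for a device that takes a while to become ready; the loop counts how many status reads the processor performs before it gets its data.

```python
class SlowDevice:
    """Hypothetical device that reports 'not ready' for the first few status reads."""

    def __init__(self, ready_after, value):
        self._countdown = ready_after
        self._value = value

    def status(self):
        if self._countdown > 0:
            self._countdown -= 1
            return False                 # still busy
        return True                      # data is available

    def read(self):
        return self._value


def poll_until_ready(device, max_polls=1000):
    """Check the status register repeatedly; return (value, polls used)."""
    for polls in range(1, max_polls + 1):
        if device.status():
            return device.read(), polls
    raise TimeoutError("device never became ready")


print(poll_until_ready(SlowDevice(ready_after=3, value=99)))
```

Every poll that returns "not ready" is processor time spent achieving nothing, which is the waste the slide refers to.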


Interrupts

Notify processor when a device needs attention (IRQ lines on a PC)

Just like exceptions, except for
– Interrupt is asynchronous with program execution
  • Control unit only checks I/O interrupt at the start of each instruction execution
– Need further information, such as the identity of the device that caused the interrupt and its priority
  • Remember the Cause Register?


Transferring Data Between Device and Memory

We can do this with Interrupts and Polling
– Works best with low bandwidth devices and keeping cost of controller and interface low
– Burden lies with the processor

For high bandwidth devices, we don’t want the processor worrying about every single block

Need a scheme for high bandwidth autonomous transfers


Direct Memory Access (DMA)

Mechanism for offloading the processor and having the device controller transfer data directly

Still uses interrupt mechanism, but only to communicate completion of transfer or error

Requires dedicated controller to conduct the transfer


Doing DMA

Essentially, DMA controller becomes bus master and sets up the transfer

Three steps
– Processor sets up the DMA by supplying
  • Device identity
  • Operation on device
  • Memory address (source or destination)
  • Amount to transfer
– DMA operates devices, supplies addresses and arbitrates bus
– On completion, controller notifies processor
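The three steps map onto a small class. This is a software sketch under stated assumptions, not a real controller: main memory and the device are Python lists, the interrupt line is a callback, and "read" is taken to mean device to memory. All names are invented for illustration.

```python
class DMAController:
    def __init__(self, memory, interrupt):
        self.memory = memory          # shared main memory, a list of words
        self.interrupt = interrupt    # callback standing in for the interrupt line
        self.request = None

    def setup(self, device, operation, address, count):
        # Step 1: the processor supplies device, operation, address and amount.
        self.request = (device, operation, address, count)

    def run(self):
        # Step 2: the controller moves the data itself, one word at a time.
        device, operation, address, count = self.request
        for i in range(count):
            if operation == "read":               # device -> memory
                self.memory[address + i] = device[i]
            else:                                 # "write": memory -> device
                device[i] = self.memory[address + i]
        # Step 3: tell the processor that the transfer has finished.
        self.interrupt("done")


memory = [0] * 8
disk = [5, 6, 7]
events = []
dma = DMAController(memory, events.append)
dma.setup(disk, "read", address=2, count=3)
dma.run()
print(memory, events)
```

The processor is involved only in `setup` and in handling the final notification; it never touches the individual words.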


DMA and the memory system

With DMA, the relationship between memory and processor is changed
– DMA bypasses address translation and hierarchy

So, should DMA use virtual or physical addresses?
– Virtual addresses: DMA must translate
– Physical addresses: Hard to cross page boundary


DMA address translation

Can provide the DMA with a small address translation table for pages - provided by OS at transfer time

Get the OS to break the transfer into chunks, each chunk relating to a single page

Regardless, the OS cannot relocate pages during transfer
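Breaking a transfer into per-page chunks is simple arithmetic. A minimal sketch, assuming a 4096-byte page (the slides do not give a page size) and an invented function name: each chunk stops at the next page boundary, so every chunk can be translated to one physical page.

```python
def split_by_page(virtual_address, length, page_size=4096):
    """Return (start address, size) chunks, none of which crosses a page boundary."""
    chunks = []
    while length > 0:
        room = page_size - (virtual_address % page_size)   # bytes left in this page
        size = min(room, length)
        chunks.append((virtual_address, size))
        virtual_address += size
        length -= size
    return chunks


print(split_by_page(4000, 5000))
```

The OS would translate each chunk's start address once and hand the DMA controller one physical transfer per chunk.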


DMA Cache problems

DMA can create inconsistencies between cache and main memory
– Called the ‘stale data’ or ‘coherency’ problem

Solve by
– Route all I/O activity through cache (expensive)
– Have the whole cache flushed (easy and not too bad)
– Selectively flush cache (slightly more efficient but lots of control circuitry needed)


Parallel Processors

Parallel processing machines are common
– All current G5 Macs have two processors

Parallel categorisation (Flynn 1966)
– Single Instruction Stream, Single Data Stream (SISD)
– Single Instruction, Multiple Data (SIMD - MMX)
– Multiple Instruction, Single Data (MISD - SuperScalar)
– Multiple Instructions, Multiple Data (MIMD - true parallelism)


Directions

Microprocessors get faster every year

Development costs are high
– MIPS R4000: 30 engineers for 3 years; $30 million to develop; $10 million to fabricate; 50000 hours simulation
– High costs led companies to look at re-using existing chips in multi-processor machines
– This is possible due to improvements in memory and bus technology
– Still expensive to run these things


Evolution vs Revolution

Evolutionary approaches tend to be invisible to users except for
– Lower cost and better performance

Revolutionary approaches require new languages and applications
– Looks good on paper
– Must be worth the effort
– KCM


RISC vs CISC

Reduced Instruction Set computing vs Complex Instruction Set Computing

Falls between evolutionary and revolutionary
– Requires some small changes (68k - PPC Macintosh)

CISC: An instruction set close to high level languages simplifies compilers
– C is very close to PDP-11 assembly
– VAX was the last big CISC machine, before it was replaced by the ‘Alpha’


RISC

Emerged in 80s when programming languages, memory and compiler improvements meant there was less need for assembly programming

Became important to optimise instruction sets for compilers, not programmers

Philosophy: fixed instruction lengths, load-store instruction set, limited addressing modes, limited operations

Has been hugely successful: PPC, SPARC, Alpha, Apollo, ARM etc.


Pentium

Is both CISC and RISC

Backwards compatibility is very important for Windows users
– No real change since 80386

So, Pentiums have a RISC core surrounded by CISC translation system
– Like compilation in hardware


Pentium Instruction Set

Pentium Pro added some enhanced instructions to better exploit hardware
– Largely ignored
– Not supported by AMD

MMX instructions were added (Multi-media)
– SIMD instructions for graphics etc.


Processor Modes

Real Mode
– For DOS programs which can access 1Mb of RAM

Protected mode
– For multiple applications, but no Real mode

Virtual Mode
– Enhancement to protected mode
– Fakes “Real mode” in “Protected” mode


Pentium buses

Processor bus
– Processor to chipset

Cache bus
– “Backside” bus

Memory bus

Local I/O bus (PCI)

Standard I/O bus (ISA)

Additional AGP (Accelerated Graphics Port)


More on buses

Level 1 cache is included in the processor

Since the PII, level 2 cache is also included in the package

Pentium has separate address bus

Buses clock in multiples of the system clock
– Processor = system clock * 4
– Memory = system clock * 2

P4 internal bus runs at 400 MHz as opposed to standard 66 MHz


IA-64

Replacement for 32 bit Pentium

Design from the ground up by Intel and HP

Completely new RISC instruction set with 64 bit registers etc.

Will still support 80386 code!
