COMP 2003: Assembly Language and Digital Logic

Chapter 7: Computer Architecture

Notes by Neil Dickson

About This Chapter

• This chapter delves deeper into the computer to give an understanding of the issues regarding the CPU, RAM, and I/O.

• Having an understanding of the underlying architecture helps with writing efficient software.

Part 1 of 3: CPU Execution

Pipelining and Beyond

Execution Pipelining

Each instruction passes through five stages:

• Fetch Instruction

• Decode Instruction

• Load Operand Values

• Execute Operation

• Store Results

[Figure: the Fetch, Decode, Load, Execute, Store sequence drawn for four instructions]

• Old systems: 1 instruction at a time

• Less-old systems: multiple independent instructions at a time
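To get a feel for the difference, the following sketch counts clock cycles under a simplifying assumption that is not stated in the notes: each of the five stages takes exactly one cycle and nothing ever stalls. The function names are made up for illustration.

```python
STAGES = ["Fetch", "Decode", "Load", "Execute", "Store"]


def cycles_one_at_a_time(instruction_count):
    """Old systems: each instruction finishes every stage before the next starts."""
    return instruction_count * len(STAGES)


def cycles_pipelined(instruction_count):
    """Less-old systems: a new instruction enters Fetch every cycle.

    Assumes independent instructions, so there are no stalls.
    """
    if instruction_count == 0:
        return 0
    # The first instruction needs all five stages; each later one
    # finishes one cycle after its predecessor.
    return len(STAGES) + (instruction_count - 1)
```

Under these assumptions, 4 instructions take 20 cycles one at a time but only 8 cycles when pipelined, and the gap widens as the instruction count grows.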

A Hardware View of Pipelining

[Figure: Instructions 1 to 7 moving through the pipeline hardware: Instruction-Fetching Circuitry, Instruction Decoder(s), Operand-Loading Circuitry, Execution Unit(s), Results-Storing Circuitry]

Problem: What if Instruction 1 stores its result in eax (e.g. “mov eax,1”) and Instruction 2 needs to load eax (e.g. “add ebx,eax”)?




Pipeline Dependencies

[Figure: Instructions 1 to 7 in the pipeline hardware (Execution Unit(s), Results-Storing Circuitry, ...), marking where Instruction 2 has to wait]

• Suppose Instruction 1 stores its result in eax and Instruction 2 needs to load eax.

• Instruction 2 has to wait before loading its operands until that result is stored.

Problem: What about conditional jumps?
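The cost of such a dependency can be estimated with a toy model. The sketch below is not a description of any real CPU: it assumes one cycle per stage, in-order execution, and that an instruction may load a register only in a cycle after the cycle in which the producing instruction stored it. The `schedule` function and its `(destination, sources)` instruction format are invented for this example.

```python
STAGES = ["Fetch", "Decode", "Load", "Execute", "Store"]
LOAD = STAGES.index("Load")
STORE = STAGES.index("Store")


def schedule(instructions):
    """Return, per instruction, the cycle in which it completes each stage.

    instructions: list of (destination_register, [source_registers]).
    """
    times = []
    last_store = {}  # register name -> cycle in which it was last stored
    for dest, sources in instructions:
        row = []
        for stage in range(len(STAGES)):
            earliest = 1
            if row:
                # Must come after this instruction's own previous stage.
                earliest = row[-1] + 1
            if times:
                # Must come after the previous instruction used this stage.
                earliest = max(earliest, times[-1][stage] + 1)
            if stage == LOAD:
                # Stall until every source register has been stored.
                for reg in sources:
                    if reg in last_store:
                        earliest = max(earliest, last_store[reg] + 1)
            row.append(earliest)
        times.append(row)
        if dest is not None:
            last_store[dest] = row[STORE]
    return times


# "mov eax,1" followed by "add ebx,eax"
dependent = [("eax", []), ("ebx", ["ebx", "eax"])]
# "mov eax,1" followed by "add ebx,ecx"
independent = [("eax", []), ("ebx", ["ebx", "ecx"])]
```

In this model the dependent pair finishes in cycle 8, while the independent pair finishes in cycle 6, so the single dependency costs two cycles of waiting.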

Branch Prediction

• Suppose that Instruction 3 is a conditional jump (e.g. jc MyLabel)

• The “operand” to load is the flags.

• Its execution is to determine whether or not to jump (i.e. where to go next).

• Its result is stored in the instruction pointer, eip.

• What comes next is unknown until the execution stage, so the CPU makes a prediction first and checks it in the execution stage.
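The notes do not say how the prediction is made. One well-known textbook scheme is a 2-bit saturating counter per jump, sketched below purely as an illustration; real CPUs use more elaborate predictors, and the function name and starting state here are assumptions.

```python
def count_correct_predictions(outcomes, state=2):
    """Simulate a 2-bit saturating counter for one conditional jump.

    outcomes: sequence of booleans, True meaning the jump was taken.
    state: counter value 0..3; values 2 and 3 predict "taken".
    Returns how many outcomes were predicted correctly.
    """
    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction == taken:
            correct += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        if taken:
            state = min(3, state + 1)
        else:
            state = max(0, state - 1)
    return correct


# A loop-closing jump: taken 9 times, then falls through once.
loop_jump = [True] * 9 + [False]
# A jump that alternates between taken and not taken.
alternating_jump = [True, False] * 5
```

The loop-closing jump is predicted correctly 9 times out of 10, while the alternating jump is right only 5 times out of 10 with this scheme, which shows why some jumps are considered hard to predict.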




Branch Prediction and the Pipeline

[Figure: the pipeline hardware (Execution Unit(s), Results-Storing Circuitry, ...) holding Instructions 1 to 6 and Instruction 4’]

• Suppose Instruction 3 is a conditional jump.

• Instruction 2 changed the flags, so Instruction 3 has to wait until they are stored.

• Oh no! It turned out that the CPU guessed wrong, so clear the pipeline and start from the new eip.

Pipelining Pros/Cons

• Pro: Only one set of each hardware component is needed (plus some hardware to manage)

• Pro: Initial concept was simple

• Con: Programmer/compiler must try to eliminate dependencies, which can be tough, else face big performance penalties

• Con: The actual hardware can get complicated

• Note: No longer short on CPU die space, so first Pro doesn’t matter much anymore

Beyond Pipelining

• For jumps that are hard to predict, guess BOTH directions, and keep two copies of results based on the guess (e.g. 2 of each register)

• Allow many instructions in at once (e.g. multiple decoders, multiple execution units, etc.) so that there’s a higher probability of more operations that can run concurrently

• Vector instructions (multiple data together)
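As an illustration of "multiple data together", the sketch below models a 128-bit value as four independent 32-bit lanes and adds two such values lane by lane, similar in spirit to a packed 32-bit integer add on x86. It is plain Python arithmetic, not real vector hardware, and the helper names are invented for this example.

```python
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack a list of 32-bit integers into one wide integer, lane 0 lowest."""
    packed = 0
    for lane, value in enumerate(values):
        packed |= (value & LANE_MASK) << (LANE_BITS * lane)
    return packed


def unpack(packed, lanes=4):
    """Split a wide integer back into its 32-bit lanes."""
    return [(packed >> (LANE_BITS * lane)) & LANE_MASK for lane in range(lanes)]


def packed_add_32(a, b, lanes=4):
    """Add corresponding lanes; each lane wraps around on overflow
    and never carries into its neighbour."""
    result = 0
    for lane in range(lanes):
        shift = LANE_BITS * lane
        x = (a >> shift) & LANE_MASK
        y = (b >> shift) & LANE_MASK
        result |= ((x + y) & LANE_MASK) << shift
    return result
```

For example, adding the lanes [1, 2, 3, 4] and [10, 20, 30, 40] gives [11, 22, 33, 44] in what a vector unit would treat as a single operation.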

Intel Core i7 Execution Architecture

[Figure: block diagram with the following components]

• 32KB Instruction Cache

• 16-byte Prefetch Buffer

• Initial (Length) Decoder

• Branch Prediction

• Queue of ≤18 Instructions

• 4 Decoders (split instructions into parts called “MicroOps”)

• 2 Copies of Registers

• Buffer of ≤128 MicroOps

• Store to Memory

• Load from Memory

• Several 128-bit Execution Units

• 32KB Data Cache

• L2 Cache

• L3 Cache (from RAM)

What About Multiple Cores?

• What we’ve looked at so far is a single CPU core’s execution.

• A CPU core is a copy of this functionality on the CPU die, so a quad-core CPU has 4 copies of everything shown (except larger caches).

• Instead of trying to run multiple instructions from the same stream of code concurrently, as before, each core runs independently of any others (one thread on each)

Confusion About Cores

• “Cores” in GPUs and custom processors like the Cell are not independent, whereas cores in standard CPUs are, so this has led to great confusion and misunderstanding.

• The operating system decides what instruction stream (thread) to run on each CPU core, and can periodically change this (thread scheduling)

• These issues are not part of this course, but may be covered in a parallel computing or operating systems course.

Part 2 of 3: Memory

Caches and Virtual Memory

Memory Caches

• Caches hold copies of parts of RAM on the CPU to save time

• A cache miss is when one checks a cache for a piece of memory that is not there

• Larger caches have fewer misses, but are slower, so modern CPUs have multiple levels of cache:
  – Memory Buffers (ignored here), L1 Cache, L2 Cache, L3 Cache, RAM

• CPU only accesses memory through cache under normal circumstances

Reading From Cache

• want value of memory at location A
• if A is not in L1
    • if A is not in L2
        • if A is not in L3
            • L3 reads A from RAM
        • L2 reads A from L3
    • L1 reads A from L2
• read A from L1
• Note: A is now in all levels of cache
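The read procedure above can be sketched as a chain of cache levels. This is a deliberately minimal model: it ignores cache capacity, eviction, and cache lines, and it uses a plain dictionary as a stand-in for RAM. The `CacheLevel` class is hypothetical.

```python
class CacheLevel:
    """One level of cache that fetches from the next level on a miss."""

    def __init__(self, name, next_level):
        self.name = name
        # Either another CacheLevel or a dict standing in for RAM.
        self.next_level = next_level
        self.contents = {}
        self.misses = 0

    def read(self, address):
        if address not in self.contents:
            self.misses += 1
            if isinstance(self.next_level, CacheLevel):
                value = self.next_level.read(address)
            else:
                value = self.next_level[address]
            # Keep a copy so the next read of this address is a hit.
            self.contents[address] = value
        return self.contents[address]


ram = {0x1000: 42}
l3 = CacheLevel("L3", ram)
l2 = CacheLevel("L2", l3)
l1 = CacheLevel("L1", l2)

first = l1.read(0x1000)   # misses in L1, L2 and L3, then fills all three
second = l1.read(0x1000)  # hits in L1 without touching L2, L3 or RAM
```

After the first read the address is present in all three levels, matching the note at the end of the list, and the second read is served entirely from L1.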

Writing to Cache

• want to store value into memory at location A

• write A into L1

• after time delay, L1 writes A into L2

• after time delay, L2 writes A into L3

• after time delay, L3 writes A into RAM

• Note: the time delays could result in concurrency issues in multi-core CPUs, so write caching can get more complicated

Caching Concerns

• Randomly accessing memory causes many more cache misses than sequentially accessing memory or accessing relatively few locations
  – This is why quicksort is usually not so quick compared to mergesort

• Writing to a huge block of memory that won’t be read soon can cause cache misses later, since it fills up caches with the written data
  – There are special instructions to indicate not to cache certain writes, avoiding this in assembly
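The effect of access order on cache misses can be shown with a small simulation. The sketch below assumes a single cache of 32 lines of 64 bytes each with least-recently-used replacement; these sizes and the replacement policy are illustrative assumptions, not figures from the notes.

```python
import random
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed)


def count_misses(addresses, capacity_lines=32):
    """Count misses for a sequence of byte addresses in a tiny LRU cache."""
    cache = OrderedDict()  # line number -> True, ordered oldest first
    misses = 0
    for address in addresses:
        line = address // LINE_SIZE
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


# 4096 four-byte elements, visited in order and then in a shuffled order.
sequential = [index * 4 for index in range(4096)]
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

sequential_misses = count_misses(sequential)
shuffled_misses = count_misses(shuffled)
```

The sequential pass misses once per 64-byte line, 256 times in total, while the shuffled pass over the very same addresses misses far more often because most lines are evicted before they are touched again.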

Paging

• Paging, a.k.a. virtual memory mapping, is a feature of CPUs that allows the apparent rearrangement of physical memory blocks into one or more virtual memory spaces.

• 3 main reasons for this:
  – Programs can be in separate memory spaces, so they don’t interfere with each other
  – The OS can give the illusion of more memory using the hard drive
  – The OS can prevent programs from messing up the system (accidentally or intentionally)

Virtual Memory

• With a modern OS, no memory accesses by a program directly access physical memory

• Virtual addresses are mapped to physical addresses in 4KB or 2MB pages using page tables, set up by the OS.

Page Tables

[Figure: the page table for Dude.exe and the page table for Sweet.exe each map virtual page numbers (0, 1, 2, 3, 4, ...) to physical page numbers in physical memory (0 to F, ...)]
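Translating a virtual address through a page table can be sketched as follows, assuming 4KB pages. The page table contents below are hypothetical and are not taken from the figure; real page tables are multi-level structures walked by the CPU, not Python dictionaries.

```python
PAGE_SIZE = 4096  # 4KB pages


class PageFault(Exception):
    """Raised when a virtual page has no mapping in the page table."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical address.

    page_table: dict of virtual page number -> physical page number.
    """
    virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
    if virtual_page not in page_table:
        raise PageFault(f"no mapping for virtual page {virtual_page}")
    # The offset within the page is unchanged by the mapping.
    return page_table[virtual_page] * PAGE_SIZE + offset


# Hypothetical mappings for two programs.
dude_table = {0: 5, 1: 7, 2: 6}
sweet_table = {0: 3, 1: 9}
```

With these made-up tables, virtual address 0x1234 refers to physical address 0x7234 in one program and 0x9234 in the other, so the same virtual address in two programs touches different physical memory.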

Part 3 of 3: I/O and Interrupts

Just an Overview

Common I/O Devices

• Human Interface (what most people think of)
  – Keyboard, Mouse, Microphone, Speaker, Display, Webcam, etc.

• Storage
  – Hard Drive, Optical Drive, USB Key, SD Card

• Adapters
  – Network Card, Graphics Card

• Timers (very important for software)
  – PITs, LAPIC Timers, CMOS Timer

If There’s One Thing to Remember

• I/O IS SLOW!

• Bad Throughput:
  – Mechanical drives can transfer up to 127 MB/s
  – Memory bus can transfer up to 30,517 MB/s (or more for modern ones)

• Very Bad Latency:
  – 10,000 RPM drive average latency: 3,000,000 ns
  – 1333MHz uncached memory average latency: 16 ns
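The 3,000,000 ns figure matches the time for half a rotation at 10,000 RPM, so the short calculation below assumes that is where it comes from (seek time would add to it). It then compares the slide's drive and memory figures directly.

```python
NS_PER_MINUTE = 60 * 10**9


def average_rotational_latency_ns(rpm):
    """On average the platter must turn half a rotation before the data arrives."""
    ns_per_rotation = NS_PER_MINUTE // rpm
    return ns_per_rotation // 2


drive_latency_ns = average_rotational_latency_ns(10_000)
memory_latency_ns = 16

# How many uncached memory accesses fit into one average drive wait.
latency_ratio = drive_latency_ns / memory_latency_ns

# Memory bus throughput relative to a mechanical drive, both in MB/s.
throughput_ratio = 30_517 / 127
```

By these numbers, one average drive access costs as much time as 187,500 uncached memory accesses, and the memory bus moves data roughly 240 times faster than the drive.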

I/O Terminology

• I/O Ports or Memory-Mapped I/O?
  – Some devices are controlled through special “I/O ports” accessible with the “in” and “out” instructions.
  – Some devices make themselves controllable by occupying blocks of memory addresses and intercepting any reads or writes to that memory instead of using “in” and “out”. This is called memory-mapped I/O. (It is a different thing from Direct Memory Access (DMA), in which a device reads or writes RAM itself without the CPU copying the data.)

I/O Terminology

• Programmed I/O or Interrupt-Driven I/O?
  – Programmed I/O is controlling a device’s “operation” step-by-step with the CPU
  – Interrupt-Driven I/O involves the CPU setting up some “operation” to be done by a device and getting “notified” by the device when the “operation” is done
  – Most I/O in a modern system is interrupt-driven

Interrupts

• Instead of continuously checking for keyboard or mouse input, can be notified of it when it happens

• Instead of waiting idly for the hard drive to finish writing data, can do other work and be notified when it’s done

• Such a notification is called an I/O interrupt.

• (There are also exception interrupts, e.g. for when doing an integer division by zero.)

Interrupts

• When an interrupt occurs, the CPU stops what it was doing and calls a function specified by the OS to handle the interrupt.
  – This function is an interrupt handler

• The interrupt handler deals with the I/O operation (e.g. saves a typed key) and returns, resuming whatever was interrupted

• Because interrupts can occur at any time, values on the stack below esp may change at any time
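The flow of control can be modelled in a few lines. The sketch below is only an analogy: a real CPU looks up handlers in a table set up by the OS and saves state on the stack, whereas here a dictionary and ordinary function calls stand in for both. The `ToyCPU` class and the interrupt number are hypothetical.

```python
KEYBOARD_INTERRUPT = 1  # hypothetical interrupt number for this sketch


class ToyCPU:
    def __init__(self):
        self.handlers = {}  # interrupt number -> handler function
        self.work_done = 0

    def register_handler(self, number, handler):
        """What the OS does at startup: say which function handles what."""
        self.handlers[number] = handler

    def run(self, steps, pending):
        """Do `steps` units of normal work.

        pending: dict of step number -> (interrupt number, data), standing
        in for devices raising interrupts at unpredictable times.
        """
        for step in range(steps):
            if step in pending:
                number, data = pending[step]
                # Stop normal work, call the handler, then resume.
                self.handlers[number](data)
            self.work_done += 1


typed_keys = []
cpu = ToyCPU()
cpu.register_handler(KEYBOARD_INTERRUPT, typed_keys.append)
cpu.run(5, {2: (KEYBOARD_INTERRUPT, "a"), 4: (KEYBOARD_INTERRUPT, "b")})
```

The normal work never checks for key presses itself; both keys are saved by the handler and all five units of work still complete.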
