
OVERVIEW OF THE INTEL SANDY BRIDGE MICRO-ARCHITECTURE

Slides by: Pedro Tomás

ADVANCED COMPUTER ARCHITECTURES
ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

Advanced Computer Architectures, 2014

Outline

Intel Sandy Bridge

Memory Hierarchy System

Data Prefetching

Performance issues

Intel Sandy Bridge

Architecture Overview

[Figure: block diagram of a 4-core Sandy Bridge processor]

CPU chip: 4 cores, each with private L1-I, L1-D and L2 caches, connected by a coherence ring interconnect to the shared L3: Last Level Cache (LLC) and to the Graphics Core (GPU)

UnCore (with Power and Clock Control): DDR3 Memory Controller (to DDR3 RAM memory), PCIe Controller (to an additional GPU or FPGA), Quickpath Interconnect (QPI, to an additional general-purpose processor, GPP), VGA/HDMI Display Interface, and an I/O Interface to the Input/Output Controller Hub (Southbridge)

Input/Output Controller Hub (Southbridge): Universal Serial Bus (USB) Controller, SATA Controller (to a Solid State Drive (SSD) or Hard Disk Drive (HDD)), PCI Controller, PCI Express Controller, Audio Input/Output Controller, Display Controller, and Wireless/Ethernet Controller

Intel x86 and x86_64 processors

IA32 Virtual Memory System

Physical Address Space

36 address bits

Allows addressing up to 64GB of memory

Virtual Address Space

32 address bits

Allows up to 4GB of virtual memory (the OS can restrict the total amount to a smaller value)

4KB / 4MB virtual and physical pages

Page tables (and directory) with 1024 entries, each occupying 4B

Page size   Virtual Address (32 bits)
            31..22       21..12       11..0
4KB         Directory    Page Table   Page offset      2 levels
4MB         Directory    ------ Page offset ------     1 level
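As a sketch of how a 32-bit virtual address is split for the 2-level walk above with 4KB pages (the function name is illustrative, not Intel's API):

```python
def split_ia32_va(va):
    """Split a 32-bit virtual address into the 2-level walk fields
    used with 4KB pages (illustrative sketch)."""
    directory = (va >> 22) & 0x3FF   # bits 31..22: 1024-entry page directory
    table     = (va >> 12) & 0x3FF   # bits 21..12: 1024-entry page table
    offset    = va & 0xFFF           # bits 11..0: offset inside the 4KB page
    return directory, table, offset
```

For 4MB pages the page-table level disappears and bits 21..0 become the page offset.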


Intel x86_64 processors

IA32e Virtual Memory System

Physical Address Space

52 address bits

Allows addressing up to 4PB of memory

Virtual Address Space

48 address bits

Allows up to 256TB of virtual memory (the OS can restrict the total amount to a smaller value)

4KB / 2MB / 1GB virtual and physical pages

Page tables (and directory) with 512 entries, each occupying 8B

Page size   Virtual Address (64 bits; the 16 most significant bits are unused)
            47..39      38..30         29..21         20..12         11..0
4KB         Directory   PT (Level 3)   PT (Level 2)   PT (Level 1)   Page offset   4 levels
2MB         Directory   PT (Level 3)   PT (Level 2)   ------ Page offset ------    3 levels
1GB         Directory   PT (Level 3)   --------- Page offset ----------            2 levels

Note: the page tables at the higher levels always occupy 4KB (512 entries of 8B each).
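The 4-level split for 4KB pages can be sketched the same way; each level consumes 9 bits of the virtual address (512 entries per table), and the function name is again illustrative:

```python
def split_ia32e_va(va):
    """Split a 48-bit x86_64 virtual address into the 4-level walk
    fields used with 4KB pages (illustrative sketch)."""
    indices = [(va >> shift) & 0x1FF          # 9 bits per level: 512 entries
               for shift in (39, 30, 21, 12)] # Directory, PT L3, PT L2, PT L1
    offset = va & 0xFFF                       # 12-bit offset in a 4KB page
    return indices, offset
```

With 2MB pages the Level 1 index folds into a 21-bit offset; with 1GB pages, Levels 1 and 2 fold into a 30-bit offset.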


Intel Sandy Bridge

Memory subsystem

Translation look-aside buffer (TLB)

Requires supporting multiple page sizes

The instruction TLB is simpler, since instruction pages are always 4KB

The data TLB requires supporting multiple page sizes

Access to data is a bottleneck in many applications

Complex TLBs delay access to data

Many processors use a fully associative TLB (e.g., MIPS, Alpha, ARM Cortex-A8, Intel Itanium 2)

Intel Ivy Bridge uses a 4-way set-associative TLB

L1 Instruction TLB: 64 entries (4KB pages only)

L1 Data TLB: 64 entries for 4KB pages, 32 entries for 2MB pages, 32 entries for 1GB pages
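A set-associative TLB like the ones above can be sketched as follows; the class layout and the simple most-recently-used fill policy are assumptions for illustration, not the actual Sandy/Ivy Bridge design:

```python
class TLB:
    """Toy set-associative TLB for a single 4KB page size."""
    def __init__(self, entries=64, ways=4, page_bits=12):
        self.sets = entries // ways
        self.ways = ways
        self.page_bits = page_bits
        # each set holds (virtual page number, physical frame number) pairs
        self.table = [[] for _ in range(self.sets)]

    def lookup(self, vaddr):
        """Return the translated physical address on a hit, None on a miss."""
        vpn = vaddr >> self.page_bits
        for tag, pfn in self.table[vpn % self.sets]:
            if tag == vpn:
                off = vaddr & ((1 << self.page_bits) - 1)
                return (pfn << self.page_bits) | off
        return None

    def fill(self, vaddr, paddr):
        """Insert a translation, evicting the oldest entry in the set."""
        vpn, pfn = vaddr >> self.page_bits, paddr >> self.page_bits
        ways = self.table[vpn % self.sets]
        ways.insert(0, (vpn, pfn))
        del ways[self.ways:]
```

A fully associative TLB is the special case with a single set holding all entries; it avoids conflict misses at the cost of comparing every entry in parallel.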


[Figure: cache lookup with address translation in a virtually-indexed, physically-tagged cache. The virtual address splits into a page virtual address and a page offset; the page offset in turn supplies the cache INDEX and line OFFSET fields.]

Cache Memory: use only the index bits falling within the page offset to select 2^K sets, where K is the number of index bits overlapping the page virtual address, retrieving one entry (TAG, V, DATA) per way.

TLB structure (each entry holds: Page Virtual Address, V, PID, Page physical address, PTE control): check all TLB entries and verify whether any entry matches the page virtual address; on a match, retrieve the page physical address, extract the TAG for comparison, and also get the value of the remaining index bits.

Then check the TAG in all ways: on a hit, select only the bytes corresponding to the line offset, get the data and return the value to the processor; generate a cache miss if the data is not in the cache.

Intel Nehalem memory subsystem

L1 Instruction cache (3 cycles)

512 lines, 4 ways, 8KB/way, 32KB in total; lines of 64B

Address fields in the figure: 6-bit line offset, 9-bit index

Retrieves 4+4 entries; fetches 16B each time

On a miss, check the unified L2 cache

Instruction TLB: 4 ways, 32 entries for 4KB pages


[Same virtually-indexed, physically-tagged cache + TLB lookup figure as on the previous slide]

Intel Sandy Bridge memory subsystem

L1 Instruction cache (3 cycles)

64B lines; 512 lines, 8 ways, 4KB/way, 32KB in total

Address fields in the figure: 6-bit line offset, 9-bit index

Retrieves 8 entries (one per way); fetches 16B each time

On a miss, check the unified L2 cache

Instruction TLB: 4 ways, 32 entries for 4KB pages; on a miss, check the unified L2 TLB, with 512 entries
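The lookup in the figure can be sketched as a field split. Here I assume 512 lines / 8 ways = 64 sets, i.e. a 6-bit set index, so that index plus line offset fit inside the 12-bit page offset; that is exactly what lets set selection start in parallel with the TLB lookup. Names and constants are illustrative:

```python
LINE_OFFSET_BITS = 6   # 64B lines
INDEX_BITS = 6         # assuming 512 lines / 8 ways = 64 sets
PAGE_OFFSET_BITS = 12  # 4KB pages

def l1_fields(vaddr, paddr):
    """Virtually-indexed, physically-tagged split: index and line offset
    come from the untranslated page-offset bits of the virtual address;
    the tag comes from the physical address returned by the TLB."""
    offset = vaddr & ((1 << LINE_OFFSET_BITS) - 1)
    index = (vaddr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = paddr >> (LINE_OFFSET_BITS + INDEX_BITS)
    return index, offset, tag
```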


[Same virtually-indexed, physically-tagged cache + TLB lookup figure as on the previous slides]

Intel Ivy Bridge memory subsystem

L1 Data cache

64B lines; 512 lines, 8 ways, 4KB/way, 32KB in total

Address fields in the figure: 6-bit line offset, 9-bit index

Retrieves 8 entries (one per way)

On a miss, check the unified L2 cache

Data TLB: 4 ways; 64 entries for 4KB pages, 32 entries for 2MB pages, 32 entries for 1GB pages; on a miss, check the unified L2 TLB, with 512 entries


Intel Sandy Bridge

L1 Data cache

Sandy Bridge includes 2 load ports and 1 store port, allowing it to sustain 2x 128-bit loads and 1x 128-bit store each cycle

Integer load-to-use latency is 4 cycles

FP and SIMD load-to-use latency is 1-2 cycles higher

Memory accesses that are split between two cache lines, or between two pages, take longer: they require checking two different tags, or making two virtual-to-physical address translations

The load units support out-of-order access to data: if the first load misses L1, the following loads can still fetch from L1

A maximum of 10 outstanding misses is allowed (1 outstanding miss = 1 miss to a cache line)

Uses a write-back, write-allocate policy

SIMD stream-store instructions allow writing directly to the lower-level caches
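The split-access penalty above is easy to characterize: an access straddles two lines (or two pages) exactly when its first and last bytes fall in different 64B (or 4KB) blocks. A minimal check:

```python
def crosses(addr, size, granule):
    """True if an access of `size` bytes starting at `addr` spans two
    `granule`-sized blocks (granule = 64 for lines, 4096 for pages)."""
    return addr // granule != (addr + size - 1) // granule
```

For example, an 8-byte load at byte 60 of a line crosses into the next line, while the same load at byte 56 does not.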


[Figure: physically-indexed, physically-tagged cache lookup; the physical address is split into TAG, INDEX and OFFSET fields and each set entry holds (TAG, V, DATA)]

Intel Ivy Bridge memory subsystem

L2 unified instruction + data cache

Lines of 64B; 4096 lines, 8 ways, 32KB/way, 256KB in total

Address fields in the figure: 6-bit offset, 9-bit index

Load-to-use latency of 12 cycles (10 in Nehalem)

L1-to-L2 bandwidth is 32B/cycle; fetching a line takes 2 clock cycles

The cache controller allows bypassing the L1 cache, sending the data to the processor while it is being written to the L1 cache

The L1-L2 relationship is non-inclusive and non-exclusive: a line in the L1 cache may, or may not, be in the L2 cache

L2 allows up to 16 outstanding misses

Write-back, write-allocate policy
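The L2 figures are internally consistent, which a small helper (assumed naming, power-of-two sizes) can confirm by deriving set count and address-field widths from capacity, line size and associativity:

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Derive line count, set count and address-field widths for a
    set-associative cache (all sizes must be powers of two)."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(set count)
    return lines, sets, offset_bits, index_bits
```

For the 256KB, 8-way L2 with 64B lines this gives 4096 lines, 512 sets, a 6-bit offset and a 9-bit index, matching the slide.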


[Figure: physically-indexed, physically-tagged cache lookup; the physical address is split into TAG, INDEX and OFFSET fields and each set entry holds (TAG, V, DATA)]

Intel Ivy Bridge memory subsystem

L3 shared cache

Lines of 64B; 8192 lines, 16 ways, 512KB/way, 8MB in total

Address fields in the figure: 6-bit offset, 9-bit index

The L3 cache is shared between all cores and the integrated GPU

Load-to-use latency of 26-31 cycles (35-40 in Nehalem); the latency varies because of the time to transfer the data from the L3, through the ring, to the core: a core closer to the L3 has a smaller latency, whereas a core farther away has a longer latency

L3-to-L1/L2 bandwidth is 32B/cycle; L1 can bypass the L2 cache and load data directly from the L3

The L3 cache is inclusive: a line in the L1 or L2 cache is always also in the L3 cache
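Inclusion imposes an invariant that must be maintained on eviction: dropping a line from the L3 requires invalidating any copy held in the private L1/L2 caches, otherwise a private cache would hold a line absent from the L3. A toy sketch, with Python sets standing in for caches:

```python
def evict_from_l3(line, l3, private_caches):
    """Evict `line` from an inclusive L3: any copy in a core's private
    L1/L2 must be back-invalidated too, or inclusion is broken."""
    l3.discard(line)
    for cache in private_caches:
        cache.discard(line)
```

The benefit of the invariant is that a miss in the L3 guarantees the line is in no core's private caches, which simplifies coherence snooping.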


Intel Smart Memory Access

Advanced pre-fetching mechanism

Advanced pre-fetching mechanisms try to anticipate loads; pre-fetched data is placed in a buffer

When loading a value, the processor verifies whether the data is in the data cache, in the write buffer, or in the pre-fetch buffer (the store and pre-fetch buffers may be merged to increase hardware utilization)

Each core contains 1 L1 instruction pre-fetcher and 2 L1 data pre-fetchers

Each pair of cores shares 2 L2 pre-fetchers


Intel Smart Memory Access

Advanced pre-fetching mechanism

The simplest prefetching mechanism exploits spatial locality: whenever reading a value from a line, load the following/previous line into the pre-fetch buffer

The pre-fetch buffer is used to avoid cache pollution

Prefetching only occurs when the bus to the next lower level of the memory hierarchy is not being used, which avoids interfering with normal execution

Intel implements more complex mechanisms
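The simple next-line scheme above can be sketched as follows; the function and buffer are illustrative simplifications (a real prefetcher works on physical line addresses and arbitrates for the bus):

```python
LINE = 64  # cache-line size in bytes

def next_line_prefetch(addr, cache, prefetch_buf, bus_idle):
    """Toy next-line prefetcher: on a demand access, stage the following
    line in a separate prefetch buffer (avoiding cache pollution), but
    only while the bus to the next level is otherwise idle."""
    nxt = addr // LINE + 1
    if bus_idle and nxt not in cache and nxt not in prefetch_buf:
        prefetch_buf.add(nxt)
```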


Intel Smart Memory Access

Advanced pre-fetching mechanism

Loads are tracked by the instruction pointer (IP), i.e., the program counter

A Prefetch History Table (PHT) is used to hold the load history; the instruction pointer is used to check the PHT for previous memory accesses by that instruction

On a hit, the prefetcher verifies that the next loads are already in the cache; if they are not, it tries to load the correct data

Exceptions generated by the pre-fetchers (e.g., accesses to invalid pages) are ignored
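A minimal sketch of an IP-indexed stride predictor in the spirit of the PHT; the table layout and the "confirm the stride twice before predicting" rule are assumptions for illustration, not Intel's actual design:

```python
class StridePrefetcher:
    """Toy IP-indexed stride predictor (simplified PHT)."""
    def __init__(self):
        self.pht = {}  # ip -> (last address, last observed stride)

    def access(self, ip, addr):
        """Record a load by instruction `ip` at `addr`; return a
        predicted prefetch address, or None if no stable stride yet."""
        prediction = None
        if ip in self.pht:
            last_addr, last_stride = self.pht[ip]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prediction = addr + stride  # stride seen twice in a row
            self.pht[ip] = (addr, stride)
        else:
            self.pht[ip] = (addr, 0)
        return prediction
```

After two accesses with the same stride from the same instruction, the predictor starts issuing prefetches one stride ahead.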


Intel Smart Memory Access

Advanced pre-fetching mechanism

Hardware pre-fetchers are very good at guessing requests with a constant stride, i.e., when the difference (in address) between consecutive loads by the same instruction is always constant, e.g.:

LD R1,A
LD R1,A+32
LD R1,A+64
LD R1,A+96

Complex patterns require software prefetching, typically using the prefetch instruction

Performance issues…

Next on Intel Architectures…