computer structure 2012 – p6 uarch 1 ooo execution of memory operations

Computer Structure 2012 – P6 uArch 1

OOO Execution of Memory Operations


P6 Caches

· Blocking caches severely hurt OOO
– A cache miss prevents other cache requests (which could possibly be hits) from being served
– Hurts one of the main gains from OOO: hiding cache misses

· Both the L1 and L2 caches in the P6 are non-blocking
– Initiate the actions necessary to return data for a cache miss while they respond to subsequent cached data requests
– Support up to 4 outstanding misses
   Misses translate into outstanding requests on the P6 bus
   The bus can support up to 8 outstanding requests

· Squash subsequent requests for the same missed cache line
– Squashed requests are not counted in the number of outstanding requests
– Once the engine has executed beyond the 4 outstanding requests, subsequent load requests are placed in the load buffer
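The bookkeeping described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `MissTracker`, its methods, and the 32-byte line size are hypothetical and chosen for the example; only the limit of 4 outstanding misses and the squash rule come from the slide.

```python
from dataclasses import dataclass, field

LINE_SIZE = 32        # bytes per cache line (assumption for this example)
MAX_OUTSTANDING = 4   # from the slide: up to 4 outstanding misses


@dataclass
class MissTracker:
    """Hypothetical bookkeeping for the outstanding misses of a non-blocking cache."""
    outstanding: set = field(default_factory=set)    # line numbers with a fill in flight
    load_buffer: list = field(default_factory=list)  # addresses waiting for a free slot

    def on_miss(self, addr: int) -> str:
        line = addr // LINE_SIZE
        if line in self.outstanding:
            # Same line already requested: squash, do not count it again.
            return "squashed"
        if len(self.outstanding) < MAX_OUTSTANDING:
            self.outstanding.add(line)
            return "issued"
        # All 4 slots are busy: the load waits in the load buffer.
        self.load_buffer.append(addr)
        return "buffered"

    def on_fill(self, addr: int) -> None:
        # The line has returned; its slot becomes free.
        self.outstanding.discard(addr // LINE_SIZE)


tracker = MissTracker()
results = [tracker.on_miss(a) for a in (0, 8, 32, 64, 96, 128)]
print(results)
```

Addresses 0 and 8 fall in the same line, so the second request is squashed; the sixth request finds all four slots busy and is buffered.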


OOO Execution of Memory Operations

· The RS operates based on register dependencies
– The RS cannot detect memory dependencies
     movl %ebx, -4(%ebp)     # MEM[ebp-4] ← ebx
     movl -4(%ebp), %eax     # eax ← MEM[ebp-4]
– The RS dispatches memory uops when the data for address calculation is ready, and the MOB and Address Generation Unit (AGU) are free
– The AGU computes the linear address:
     Segment-Base + Base-Address + (Scale*Index) + Displacement
   and sends it to the MOB, to be stored in the Load Buffer or Store Buffer

· The MOB resolves memory dependencies and enforces memory ordering
– Some memory dependencies can be resolved statically
     store r1,a
     load r2,b         → can advance load before store
– Problem: some cannot
     store r1,[r3]
     load r2,b         → load must wait till r3 is known
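The AGU formula above can be written out as a small function. This is a sketch, not a model of the hardware: the function name and the 32-bit wrap-around are assumptions made for the example.

```python
def linear_address(segment_base: int, base: int, index: int = 0,
                   scale: int = 1, displacement: int = 0) -> int:
    """Segment-Base + Base-Address + (Scale*Index) + Displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4 or 8")
    total = segment_base + base + scale * index + displacement
    return total & 0xFFFFFFFF  # assumption: 32-bit linear addresses wrap around


# -4(%ebp) with a flat segment (base 0) and, for illustration, %ebp = 0x1000
addr = linear_address(segment_base=0, base=0x1000, displacement=-4)
print(hex(addr))  # 0xffc
```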


Load and Store Ordering

· x86 has a small register set, so it uses memory often
– Preventing Stores from passing Stores/Loads: 3%~5% perf. loss
   P6 chooses not to allow Stores to pass Stores/Loads
– Preventing Loads from passing Loads/Stores: big perf. loss
   P6 allows Loads to pass Stores, and Loads to pass Loads

· Stores are not executed OOO
– Stores are never performed speculatively
   There is no transparent way to undo them
– Stores are also never re-ordered among themselves
   The Store Buffer dispatches a store only when the store has both its address and its data, and there are no older stores awaiting dispatch
– A store commits its write to memory (DCU) at retirement


Store Implemented as 2 Uops

· A store is decoded as two independent uops
– STA (store-address): calculates the address of the store
– STD (store-data): stores the data into the Store Data buffer
   The actual write to memory is done when the store retires

· Separating STA & STD is important for memory OOO
– Allows STA to dispatch earlier, even before the data is known
– Address conflicts are resolved earlier, which opens the memory pipeline for other loads

· STA and STD can be issued to execution units in parallel
– STA is dispatched to the AGU when its sources (base+index) are ready
– STD is dispatched to the SDB when its source operand is available


Memory Order Buffer (MOB)

· Store Coloring
– Each store is allocated in order in the Store Buffer, and gets an SBID
– Each load is allocated in order in the Load Buffer, and gets an LBID + the current SBID

· A load is checked against all previous stores
– Stores with SBID ≤ the load's SBID

· A load is blocked if
– The address of a relevant STA is unresolved
– An STA is to the same address, but the data is not ready
– Resources are missing (DTLB miss, DCU miss)

· The MOB writes blocking info into the Load Buffer
– Re-dispatches the load when a wake-up signal is received

· If a load is not blocked, it is executed (bypassed)

Op      LBID   SBID
Store   -      0
Store   -      1
Load    0      1
Store   -      2
Load    1      2
Load    2      2
Load    3      2
Store   -      3
Load    4      3
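The coloring table above can be reproduced with a short sketch. The function name `color` and the tuple layout are hypothetical; the sketch also ignores ID wrap-around, which a real, finite buffer would have to handle.

```python
def color(ops):
    """Assign LBID/SBID in allocation order; ops is a sequence of "store"/"load"."""
    next_sbid = 0
    next_lbid = 0
    colored = []
    for op in ops:
        if op == "store":
            colored.append(("Store", None, next_sbid))
            next_sbid += 1
        elif op == "load":
            # A load records the SBID of the youngest store older than itself.
            current_sbid = next_sbid - 1 if next_sbid > 0 else None
            colored.append(("Load", next_lbid, current_sbid))
            next_lbid += 1
        else:
            raise ValueError(f"unknown op: {op}")
    return colored


sequence = ["store", "store", "load", "store", "load", "load", "load", "store", "load"]
table = color(sequence)
for row in table:
    print(row)
```

Each load can then be checked against exactly the stores whose SBID is less than or equal to the SBID it recorded.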


MOB (Cont.)

· If a load misses in the DCU
– The DCU marks the write-back data as invalid
– Assigns a fill buffer to the load, and issues an L2 request
– When the critical chunk is returned, wakes up and re-dispatches the load

· Store → Load Forwarding
– An older STA has the same address as the load, and its data is ready
   The load gets its data directly from the SB (no DCU access)

· Memory Disambiguation
– The MOB predicts whether a load can proceed despite unknown STAs
   Predict colliding: block the load if there is an unknown STA (as usual)
   Predict non-colliding: execute even if there are unknown STAs
– In case of a wrong prediction
   The entire pipeline is flushed when the load retires
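The blocking and forwarding rules from the two MOB slides can be combined into one check. The sketch below is a simplification under stated assumptions: `StoreEntry`, `check_load` and the result strings are hypothetical names, addresses are compared for exact equality (no partial overlap), and SBID wrap-around is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    sbid: int
    address: Optional[int] = None  # None until the STA has executed
    data: Optional[int] = None     # None until the STD has executed


def check_load(load_address, load_sbid, store_buffer, predict_non_colliding=False):
    """Decide what a load may do, given the stores older than it."""
    older = [s for s in store_buffer if s.sbid <= load_sbid]
    if not predict_non_colliding and any(s.address is None for s in older):
        return ("blocked", "unresolved STA")
    matches = [s for s in older if s.address == load_address]
    if matches:
        youngest = max(matches, key=lambda s: s.sbid)
        if youngest.data is None:
            return ("blocked", "store data not ready")
        return ("forward", youngest.data)  # data comes from the SB, no DCU access
    return ("dcu", None)                   # no conflict: read the data cache


store_buffer = [
    StoreEntry(sbid=0, address=0x60, data=10),
    StoreEntry(sbid=1, address=0x50),   # STA done, STD pending
    StoreEntry(sbid=2),                 # STA pending
]
print(check_load(0x60, 1, store_buffer))
print(check_load(0x50, 1, store_buffer))
print(check_load(0x70, 2, store_buffer))
print(check_load(0x70, 2, store_buffer, predict_non_colliding=True))
```

With `predict_non_colliding=True` the load proceeds past the unknown STA; if that store later turns out to write the same address, the slide's recovery is a pipeline flush when the load retires, which this sketch does not model.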


Pipeline: Load: Allocate

· Allocate ROB/RS, MOB entries
· Assign Store ID (SBID) to enable ordering

[Figure: load pipeline block diagram. Labels: IDQ, Alloc, Schedule, RS, ROB, AGU, LB Write, DTLB, DCU, WB, MOB, LB, Retire]


Pipeline: Bypassed Load: EXE

· RS checks when data used for address calculation is ready
· AGU calculates linear address: DS-Base + base + (Scale*Index) + Disp.
· Write load into Load Buffer
· DTLB Virtual → Physical + DCU set access
· MOB checks blocking and forwarding
· DCU read / Store Data Buffer read (Store → Load forwarding)
· Write back data / write block code

[Figure: load pipeline block diagram. Labels: IDQ, Alloc, Schedule, RS, ROB, AGU, LB Write, DTLB, DCU, WB, MOB, LB, Retire]


Pipeline: Blocked Load Re-dispatch

· MOB determines which loads are ready, and schedules one
· Load arbitrates for MEU
· DTLB Virtual → Physical + DCU set access
· MOB checks blocking/forwarding
· DCU way select / Store Data Buffer read
· Write back data / write block code

[Figure: load pipeline block diagram. Labels: IDQ, Alloc, Schedule, RS, ROB, AGU, LB Write, DTLB, DCU, WB, MOB, LB, Retire]


Pipeline: Load: Retire

· Reclaim ROB, LB entries
· Commit results to RRF

[Figure: load pipeline block diagram. Labels: IDQ, Alloc, Schedule, RS, ROB, AGU, LB Write, DTLB, DCU, WB, MOB, LB, Retire]


Pipeline: Store: Allocate

· Allocate ROB/RS
· Allocate Store Buffer entry

[Figure: store pipeline block diagram. Labels: IDQ, Alloc, Schedule, RS, ROB, AGU, SB, DTLB, Retire]


Pipeline: Store: STA EXE

· RS checks when data used for address calculation is ready
– Dispatches STA to AGU
· AGU calculates linear address
· Write linear address to Store Buffer
· DTLB Virtual → Physical
· Load Buffer Memory Disambiguation verification
· Write physical address to Store Buffer

[Figure: store pipeline block diagram. Labels: IDQ, Alloc, Schedule, RS, ROB, AGU, SB V.A., DTLB, SB P.A., SB, Retire]


Pipeline: Store: STD EXE

· RS checks when data for STD is ready
– Dispatches STD
· Write data to Store Buffer

[Figure: store pipeline block diagram. Labels: IDQ, Alloc, Schedule, RS, ROB, SB data, SB, Retire]


Pipeline: Senior Store Retirement

· When STA (and thus STD) retires
– Store Buffer entry marked as senior
· When DCU idle, MOB dispatches senior store
· Read senior entry
– Store Buffer sends data and physical address
· DCU writes data
· Reclaim SB entry

[Figure: store pipeline block diagram. Labels: IDQ, Alloc, Schedule, RS, ROB, Retire, SB, MOB, DCU]
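The store slides above (allocate, STA, STD, senior retirement) can be tied together in one small sketch. All names here (`SBEntry`, `StoreBuffer`, `drain_one`, the `dcu` dictionary standing in for the data cache) are hypothetical, and the model leaves out the DTLB, the ROB and timing altogether.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SBEntry:
    address: Optional[int] = None  # written by the STA uop
    data: Optional[int] = None     # written by the STD uop
    senior: bool = False           # set once the store has retired


class StoreBuffer:
    def __init__(self):
        self.entries = []  # oldest store first (allocated in program order)
        self.dcu = {}      # stand-in for the data cache

    def allocate(self) -> SBEntry:
        entry = SBEntry()
        self.entries.append(entry)
        return entry

    def sta(self, entry: SBEntry, address: int) -> None:
        entry.address = address

    def std(self, entry: SBEntry, data: int) -> None:
        entry.data = data

    def retire(self, entry: SBEntry) -> None:
        if entry.address is None or entry.data is None:
            raise RuntimeError("a store cannot retire before STA and STD complete")
        entry.senior = True

    def drain_one(self) -> bool:
        """Called when the DCU is idle: write the oldest store if it is senior."""
        if self.entries and self.entries[0].senior:
            entry = self.entries.pop(0)          # reclaim the SB entry
            self.dcu[entry.address] = entry.data
            return True
        return False


sb = StoreBuffer()
first = sb.allocate()
second = sb.allocate()
sb.sta(second, 0x50)   # the younger store may complete STA/STD first...
sb.std(second, 110)
sb.sta(first, 0x60)
sb.std(first, 10)
```

Even though the younger store finished its STA and STD first, nothing reaches the DCU until the oldest entry is senior, which keeps stores in order and non-speculative.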


The life of a Load…

Example instruction: R3 ← MEM(R2+50), renamed to RF0 ← MEM(R2+50)

• 1 entry in the ROB, RS and Load Buffer + rename in RAT
• Dispatch Load address calculation to AGU when source is ready – Release RS entry
• AGU updates the address in the Load Buffer. Pipeline proceeds to dTLB
• Load Buffer checks for blocking conditions and dispatches the Load to the DCU
• DCU sends the result to RS and updates the ROB with the load result
• Load will retire as any other instruction (when all previous instructions have retired) – RAT updated
• LB and ROB entries are released

ROB entry (# Valid Rdy Data DST): Ld 1 0 X R3 → Ld 1 1 data R3
Load Buffer address: Not Valid → V(R2+50)

[Figure: animated block diagram. Labels: Instruction Q, RAT (Arch Reg. → Phys. Reg.), RS, ROB, EXE (AGU, ALU1, BC, …), dTLB, Data Cache, MOB (Load Buffer, Store Buffer), Retire]


The life of a Store…

Example instruction: MEM(R2+50) ← R3, split into STA: R2+50 and STD: RF0

• 1 entry in the ROB, 2 in the RS and 1 in the Store Buffer
• Dispatch Store address calculation to AGU when source is ready – Release RS entry
• AGU updates the address in the Store Buffer and provides the address to depending loads
• Store pipeline proceeds to dTLB. Physical address will be updated in the SB
• Dispatch Store Data when data is ready: update the Store Buffer and provide data to depending loads
• The Store Buffer updates the ROB entry
• The Store will retire from the ROB as any other instruction (when all previous instructions have retired)
• After this, the Store is marked as Senior Store in the Store Buffer
• The Store Buffer will initiate a DCU write. When the write is done, the SB reclaims the entry

ROB entry (# Valid Rdy Data DST): St 1 0 X X → St 1 1 X X
Store Buffer entry (Addr., Data, Snr): Not Vld, Not Vld → V(R2+50), V(RF0); Snr = 1 after retirement

[Figure: animated block diagram. Labels: Instruction Q, RAT (Arch Reg. → Phys. Reg.), RS, ROB, EXE (AGU, ALU1, BC, …), dTLB, Data Cache, MOB (Load Buffer, Store Buffer), Retire]


Question

· In this question we consider a processor with OOOE and Speculative Execution
· The following code segment is given:

1000 load  R2,R1,30      ; R2=m[R1+30]
1004 store R2,20,R1      ; m[R2+20]=R1
1008 load  R3,R1,100     ; R3=m[R1+100]
100C store R1,40,R3      ; m[R1+40]=R3
1010 add   R1,R1,10      ; R1=R1+10
1014 blt   R1,100,1000   ; if (R1<100) PC=1000

· Assumptions
– The branch instruction is predicted taken
– At the start of execution, every memory address N holds the value N, and R1=R2=R3=10
– For simplicity, assume the addresses in the program are physical and no translation is needed
– The L1 data cache returns data within one clock cycle, but it is empty at the start of execution
– The L2 data cache returns data within 7 clock cycles, and it already contains all the requested addresses at the start of execution


Instruction Allocation

· In each cycle four instructions can be allocated (and there are always at least 4 instructions ready for allocation)
· The ROB, the MOB and the RS are large and never fill up


Instruction Execution

· There are infinitely many execution units
· An instruction can enter execution in the cycle after its allocation, provided all the data it needs is already ready. An instruction that waits for data can enter execution in the cycle right after that data becomes ready
· Executing an ALU instruction takes one clock cycle
· Executing a branch instruction takes one cycle
– If the prediction turns out to be wrong, a flush is performed in the next cycle (at time t+1)
– The instructions from the correct path are allocated 5 cycles after the flush (at time t+6)


Instruction Execution (Cont.)

· A load instruction is sent to execution when the data for its address calculation is ready
– In the first cycle the address is calculated
– In the second cycle the following condition is checked for every store instruction that precedes the load: the store's address is known, and either the load's address differs from the store's address, or the two addresses are equal and the store's data is already known
– In the third cycle, if the check succeeds, the data is received from the L1 cache (if there is a hit), or directly from the MOB by store to load forwarding
– If the check succeeds but there is an L1 cache miss and no store to load forwarding, the data is received in the tenth cycle from the L2 cache
– If the check fails, the load is blocked. When the blocking condition is removed, the load is sent to execution again and skips the first cycle (it starts at the condition check)

· A store instruction is sent to execution when the data for its address calculation is ready
– The address calculation takes one clock cycle, and at its end the address is written to the MOB
– Independently, when the data to be written to memory is ready, it is written to the MOB in the next cycle


Instruction Commit

· An instruction can commit starting from the cycle after it finishes execution, provided that the instruction before it has committed or is committing. There is no limit on the number of instructions that commit in each cycle
· A store instruction performs its write to the cache post-commit
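The commit rule can be stated as a one-line recurrence: an instruction commits no earlier than one cycle after it finishes, and no earlier than its predecessor. The sketch below assumes that reading of the rule; the function name and the completion times in the example are made up for illustration.

```python
def commit_times(done):
    """done[i] is the cycle in which instruction i finishes (program order)."""
    commits = []
    for t_done in done:
        earliest = t_done + 1                      # the cycle after execution ends
        if commits:
            earliest = max(earliest, commits[-1])  # in-order commit, same cycle allowed
        commits.append(earliest)
    return commits


# Hypothetical completion times: two slow instructions followed by two fast ones.
example = commit_times([10, 11, 3, 4])
print(example)  # [11, 12, 12, 12]
```

The two fast instructions finish early but still wait for the older ones, and because there is no commit-width limit they all commit together at cycle 12.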


Summary…

· 4 wide machine
· L1: 1 cycle   L2: 7 cycles   ALU, Branch: 1 cycle
· L1 empty / L2 always hits…
· Mispredict @ T:
– T+1: Flush pipeline
– T+6: Alloc on the good path

Load timeline (cycles):
   1       Addr. calculation
   2       Memory checks: for all previous stores, either ≠ addr., or same addr. & data Rdy
   3       L1 hit / forwarding (from MOB)
   4…10    L2 hit (7 cycles)
   Retry after block: the load restarts from the memory checks

Store timeline:
   Addr. calculation → MOB update
   Data ready → MOB update
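The load timeline can be read as simple arithmetic on the latencies given in the question: data arrives in cycle 3 on an L1 hit or a forward, and in cycle 3 + 7 = 10 when it has to come from the L2. A sketch, with the function name and the cycle numbering (counted from the address calculation, for a load that is never blocked) as assumptions:

```python
L1_LATENCY = 1  # cycles, from the question
L2_LATENCY = 7  # cycles, from the question


def load_data_cycle(l1_hit: bool, forwarded: bool) -> int:
    """Cycle, counted from the address calculation, in which the load gets its data."""
    address_calc = 1
    memory_checks = address_calc + 1
    l1_or_forward = memory_checks + L1_LATENCY  # cycle 3
    if l1_hit or forwarded:
        return l1_or_forward
    return l1_or_forward + L2_LATENCY           # cycle 10


print(load_data_cycle(l1_hit=True, forwarded=False))   # 3
print(load_data_cycle(l1_hit=False, forwarded=False))  # 10
```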


Fill this table…

Pdst  Instruction          R1  R2  R3  addr  data  T alloc  src1  src2  Imm  T src1 ready  T src2 ready  T exe  Load block code  T data ready  T commit
0     load  R2=m[R1+30]
1     store m[R2+20]=R1
2     load  R3=m[R1+100]
3     store m[R1+40]=R3
4     add   R1=R1+10
5     blt   if (R1<100)
6     load  R2=m[R1+30]
7     store m[R2+20]=R1
8     load  R3=m[R1+100]
9     store m[R1+40]=R3
10    add   R1=R1+10
11    blt   if (R1<100)

Column notes:
· R1, R2, R3: arch. reg value after commit
· addr: addr. for LD & ST
· data: data for LD & ST
· T alloc: alloc time, 4 / cycle
· src1, src2: src reg, Pi / Ri. Store: src1: addr, src2: data
· T src1 ready, T src2 ready: time src ready
· T exe: time exe
· Load block code: 0: ready, 1: addr blocking, 2: data not ready


Instructions for Filling the Table

· R1, R2, R3: the values of the architectural registers after commit. Circle the value of the architectural register that the instruction writes. If the instruction does not reach commit, leave these fields empty
· addr: the memory access address, for load and store instructions only
· data: the memory value read or written, for load and store instructions only
· T alloc: the time at which the instruction is allocated (four instructions per cycle, starting at T=1)
· src1, src2: the numbers of the registers used as sources of the instruction: Pi for a physical register, and Ri if the architectural register is read directly
   For a store: src1 is the register used for the address calculation, src2 is the register holding the data
· Imm: if the instruction has an Imm, the value of the Imm
· T src1 ready, T src2 ready: the time at which each of the instruction's source values is ready
   If the src is ready at allocation time, this time equals the allocation time
   If the instruction that computes the value of the src finishes execution at time T, the src is ready at time T


Instructions for Filling the Table (Cont.)

· T exe: the time at which the instruction is sent to execution
   If all the srcs of an instruction are ready at time T, the instruction can be sent to execution at time T+1
· Load block code (relevant only for load instructions): the blocking code of the load
   0: no blocking
   1: blocked due to an unresolved store address
   2: blocked due to waiting for store data
· If the load is blocked more than once, write down all the blocking codes
· T data ready:
– For a store: the time at which the data to be written to memory is ready
– For a load: the time at which the data is received from the cache (or directly from the MOB)
· T commit: the time at which the instruction commits



Solution, first step: register values, addresses and data (- marks a cell that is empty or not applicable)

Pdst  Instruction          R1  R2   R3   addr  data
0     load  R2=m[R1+30]    10  40   10   40    40
1     store m[R2+20]=R1    -   -    -    60    10
2     load  R3=m[R1+100]   -   -    110  110   110
3     store m[R1+40]=R3    -   -    -    50    110
4     add   R1=R1+10       20  -    -    -     -
5     blt   if (R1<100)    20  40   110  -     -
6     load  R2=m[R1+30]    -   110  -    50    110
7     store m[R2+20]=R1    -   -    -    130   20
8     load  R3=m[R1+100]   -   -    120  120   120
9     store m[R1+40]=R3    -   -    -    60    120
10    add   R1=R1+10       30  -    -    -     -
11    blt   if (R1<100)    30  110  120  -     -
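The addr and data columns above can be cross-checked with a purely functional run of the loop body, ignoring all timing. This is a sketch: the names `IdentityMemory` and `run` are made up, and the branch is modeled simply by repeating the body a fixed number of times (it is taken in both iterations shown, since R1 stays below 100).

```python
class IdentityMemory(dict):
    """Memory in which every address N initially holds the value N."""
    def __missing__(self, address):
        return address


def run(iterations):
    regs = {"R1": 10, "R2": 10, "R3": 10}
    mem = IdentityMemory()
    trace = []  # (kind, addr, data) for every load and store, in program order
    for _ in range(iterations):
        addr = regs["R1"] + 30              # load  R2=m[R1+30]
        regs["R2"] = mem[addr]
        trace.append(("load", addr, regs["R2"]))
        addr = regs["R2"] + 20              # store m[R2+20]=R1
        mem[addr] = regs["R1"]
        trace.append(("store", addr, regs["R1"]))
        addr = regs["R1"] + 100             # load  R3=m[R1+100]
        regs["R3"] = mem[addr]
        trace.append(("load", addr, regs["R3"]))
        addr = regs["R1"] + 40              # store m[R1+40]=R3
        mem[addr] = regs["R3"]
        trace.append(("store", addr, regs["R3"]))
        regs["R1"] += 10                    # add   R1=R1+10
    return regs, trace


final_regs, trace = run(2)
for entry in trace:
    print(entry)
```

Note how the load at the start of the second iteration reads address 50, which the last store of the first iteration wrote: this is the dependency that later shows up as a blocked load served by store to load forwarding.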



Solution, next step: timing of instructions 0-3 (allocated at T=1)

Pdst  Instruction          T alloc  src1  src2  Imm  T src1 ready  T src2 ready  T exe  Load block code  T data ready  T commit
0     load  R2=m[R1+30]    1        R1    -     30   1             -             2      0                10            11
1     store m[R2+20]=R1    1        P0    R1    20   10            1             11     -                11            12
2     load  R3=m[R1+100]   1        R1    -     100  1             -             2      1                20            21
3     store m[R1+40]=R3    1        R1    P2    40   1             20            2      -                21            22


Timeline for store 1 and load 2:

   T=10      Store: Load R2 (P0) is known
   T=11      Store: Addr. Calc: P0+20
   T=12      Load: Memory checks
   T=13      Load: L1 miss
   …
   T=20      Load: L2 Hit



Solution, next step: timing of instructions 4-7 (allocated at T=2)

Pdst  Instruction          T alloc  src1  src2  Imm  T src1 ready  T src2 ready  T exe  Load block code  T data ready  T commit
4     add   R1=R1+10       2        R1    -     10   2             -             3      -                -             22
5     blt   if (R1<100)    2        P4    -     100  3             -             4      -                -             22
6     load  R2=m[R1+30]    2        P4    -     30   3             -             4      1, 2             22            23
7     store m[R2+20]=R1    2        P6    P4    20   22            3             23     -                23            24



Solution, complete table (- marks a cell that is empty or not applicable)

Pdst  Instruction          R1  R2   R3   addr  data  T alloc  src1  src2  Imm  T src1 ready  T src2 ready  T exe  Load block code  T data ready  T commit
0     load  R2=m[R1+30]    10  40   10   40    40    1        R1    -     30   1             -             2      0                10            11
1     store m[R2+20]=R1    -   -    -    60    10    1        P0    R1    20   10            1             11     -                11            12
2     load  R3=m[R1+100]   -   -    110  110   110   1        R1    -     100  1             -             2      1                20            21
3     store m[R1+40]=R3    -   -    -    50    110   1        R1    P2    40   1             20            2      -                21            22
4     add   R1=R1+10       20  -    -    -     -     2        R1    -     10   2             -             3      -                -             22
5     blt   if (R1<100)    20  40   110  -     -     2        P4    -     100  3             -             4      -                -             22
6     load  R2=m[R1+30]    -   110  -    50    110   2        P4    -     30   3             -             4      1, 2             22            23
7     store m[R2+20]=R1    -   -    -    130   20    2        P6    P4    20   22            3             23     -                23            24
8     load  R3=m[R1+100]   -   -    120  120   120   3        P4    -     100  3             -             4      1                32            33
9     store m[R1+40]=R3    -   -    -    60    120   3        P4    P8    40   3             32            4      -                33            34
10    add   R1=R1+10       30  -    -    -     -     3        P4    -     10   3             -             4      -                -             34
11    blt   if (R1<100)    30  110  120  -     -     3        P10   -     100  4             -             5      -                -             34


