
Verifying Conformance to Memory Models: the Test Model-checking Approach

Ganesh Gopalakrishnan (funded by NSF)

presenting work done by

Ratan Nalumasu (PhD, 9/98, HP Cupertino)

Rajnish Ghughal, (MS, 10/98?, interviewing)

Abdel Mokkedem (postdoc; going to Compaq)

Other members of the group:

Ravi Hosabettu (PhD, 12/99? Processor Verification)

Mike Jones (PhD, 6/00? - at Intel now)

Annette Bunker (PhD, 6/01?)

10/7/98 Ganesh, Utah Verifier group -- UT Austin talk

2

FM for memory system design

• Processor speed increases at 55% a year, while memory speed increases at 7%

• With shared memory multiprocessors, the mismatch is exacerbated

• Complex protocols are employed to help improve the overall performance

• Performance improvements should not be at the expense of correctness...

• Hence, the need for a formal verification technique for memory systems that can be employed in actual design cycles


3

Goal of our Research

Develop domain-specific formal methods for shared memory systems

[Figure: two CPUs and two Memory blocks joined by an Interconnect]


4

Types of shared memory systems: 1. Symmetric Multiprocessors (SMP)

- Can scale up to 10s of processors
- Modern caches have support for such SMP protocols

[Figure: three CPUs, each with a cache ($), sharing one Memory]


5

2. Distributed Shared Memory (DSM) systems

NODE NODE NODE

MEM MEM MEM

Network

Each node may be an SMP or a single CPU


6

Shared Memory Correctness

• Low level:
– deadlock
– forward progress
– bus arbitration

• Intermediate level:
– at most one owner of cache-line

• High-level:
– verify abstraction provided to software (the desired Formal Memory Model)


7

Results of the UV group

• New Partial Order reduction algorithm (Nalumasu)
– Realized in verifier called PV
– Outperforms SPIN “10 to 1” on most examples
– Selective state-caching is available “for free”

• A DSM Protocol synthesis algorithm (Nalumasu)
– Safety of synthesis proved correct using PVS
– Derives realistic (hand-quality) DSM protocols
– Incorporates a scalable buffer-reservation scheme

• Verifying Formal Memory Models (Nalumasu, Mokkedem, Ghughal) ....this talk

10/7/98 8

The key issue in verifying memory models: must reorder reads / writes correctly

Uniprocessor:
P1: write(a,new) ; read(b)
reordered to
P1: read(b) ; write(a,new)
ok (cache / compiler / out-of-order execution)

Multiprocessor:
P1: write(a,new) ; read(b)        P2: write(b,new) ; read(a)
reordered to
P1: read(b) ; write(a,new)        P2: read(a) ; write(b,new)
Not ok under S.C.

Test model checking
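The multiprocessor case can be checked by brute force. The sketch below is illustrative and not part of the talk: it enumerates every interleaving of the two programs that respects program order, runs each against a single memory, and collects the values the two reads can return. The function name `sc_outcomes` and the values "old" / "new" are assumptions made for the example.

```python
from itertools import permutations

def sc_outcomes(programs, init):
    """Collect the read results reachable under sequential consistency.

    programs: one op list per processor; an op is ("wr", addr, value) or ("rd", addr).
    Returns a set of tuples holding the read values, ordered by (processor, index).
    """
    ops = [(p, i) for p, prog in enumerate(programs) for i in range(len(prog))]
    outcomes = set()
    for order in permutations(ops):
        # Keep only interleavings that preserve each processor's program order.
        next_index = [0] * len(programs)
        in_order = True
        for p, i in order:
            if i != next_index[p]:
                in_order = False
                break
            next_index[p] += 1
        if not in_order:
            continue
        # Execute the interleaving against one shared memory.
        mem = dict(init)
        reads = {}
        for p, i in order:
            op = programs[p][i]
            if op[0] == "wr":
                mem[op[1]] = op[2]
            else:
                reads[(p, i)] = mem[op[1]]
        outcomes.add(tuple(reads[k] for k in sorted(reads)))
    return outcomes

p1 = [("wr", "a", "new"), ("rd", "b")]
p2 = [("wr", "b", "new"), ("rd", "a")]
outcomes = sc_outcomes([p1, p2], {"a": "old", "b": "old"})
```

Under these assumptions the outcome ("old", "old"), which is what both processors would see if each read were moved ahead of its write, never appears in `outcomes`. That is the sense in which the reordering is not ok under S.C.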

10/7/98 9

Speculative Execution can permit OO processing while maintaining SC...

CPU1: wr(a,2) ; rd(b) ;
CPU2: wr(b,3) ; rd(a) ;

CPU1 cache: b : shared, a : invalid
CPU2 cache: a : shared, b : invalid

[Figure: CPU1 and CPU2 connected to MEM by a bus]

CPU1:
- Miss on “a”
- Hit on “b” (speculative)
- Bus serializes as wr(a) ; wr(b)
- Speculation successful

CPU2:
- Miss on “b”
- Hit on “a” (speculative)
- Bus serializes as wr(a) ; wr(b)
- Speculation unsuccessful

10/7/98 10

Speculative Execution can permit OO processing while maintaining SC...

CPU1: wr(a,2) ; rd(b) ;
CPU2: wr(b,3) ; rd(a) ;

CPU1 cache: b=1 : shared, a : invalid
CPU2 cache: a=0 : shared, b : invalid

[Figure: CPU1 and CPU2 connected to MEM by a bus]

- SC can explain sequence wr(a,2); rd(b,1); wr(b,3)

- SC can’t explain sequence wr(a,2); wr(b,3); rd(a,0)
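The two sequences above can be replayed against a single serial memory to see why one is explainable and the other is not. This is a minimal sketch, assuming initial values a=0 and b=1 as shown in the caches; the function name `explained_by_serial_memory` is made up for the example, and it checks one given total order only, not the search over all weaves.

```python
def explained_by_serial_memory(trace, init):
    """Replay a total order of operations against one memory.

    trace: list of (op, addr, value); for a read, value is the value observed.
    Returns True if every read returns the most recently written value.
    """
    mem = dict(init)
    for op, addr, value in trace:
        if op == "wr":
            mem[addr] = value
        elif mem[addr] != value:
            return False
    return True

init = {"a": 0, "b": 1}
ok_trace = [("wr", "a", 2), ("rd", "b", 1), ("wr", "b", 3)]
bad_trace = [("wr", "a", 2), ("wr", "b", 3), ("rd", "a", 0)]
```

Replaying `ok_trace` succeeds, while `bad_trace` fails at rd(a,0) because a already holds 2 at that point.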


11

What Test model-checking does

SC: The observed execution results can be explained by a weave of the individual instruction sequences.

[Figure: three CPUs with caches ($) and a shared Memory, related to the SC statement above by test model checking (“?”)]


12

Current Memory-system Verification Techniques

Simulation:
– Ad-hoc
– Incomplete
– Memory model bugs are non-intuitive. Difficult to tell that something’s wrong!


13

Current Formal Techniques for Memory System Verification

• Graf’s work on verifying Lazy caching in ACTL*
– Not adequately demonstrated
– Toy examples
– Technique unfit for iterative design cycles (..more..)

• Gibbons/Koren approach: verify execution traces
– Ad-hoc
– NP-complete

• McMillan proposed seed of idea similar to test model-checking
– No details


14

Graf’s work: ACTL* for (stronger than) SC

• AG(enabled(read(a,d))) => avail(a,d)
• AG(avail(a,d) AND EF(enabled(read(a,d)))) => A[NOT avail(a,d) W AG NOT avail(a,d)]
• ...
• init => AG[after(write(a,d)) => A(NOT enabled(read(a,d)) W avail(a,d))]

Such MODEL DEPENDENT SPECS do not fit in an iterative industrial framework:
- spec too tedious to write
- spec rendered obsolete by design iterations


15

Test model-checking evolved out of ARCHTEST (Collier)

• Thread-based tests that are run on the CPUs

• Architectural rules formulated as safety properties

• Has detected bugs in commercial multiprocessors

• Formally based

• Available for free (for schools...)

• Unfortunately, ARCHTEST

– isn’t that effective at design-time

– is incomplete in many ways


16

ARCHTEST overview: (i) Instruction Execution

View programs w.r.t. their memory instructions (reads and writes) :

... rd(a) wr(b,2) ...

and focus on the outcomes of such executions...


17

(ii) Shared memory modeling

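[Figure: an SMP (CPUs sharing a Memory) and a DSM system (NODE/MEM pairs on a Network) are both modeled as CPUs CPU_i and CPU_j, each with a conceptual “local store” STORE_i and STORE_j; example instruction streams R1(a) W2(b,1) and R3(c) W4(d,2)]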

Program ordering

CPU_i (with STORE_i): R1(a) ; W2(b,1) ; R5(d) ;
CPU_j (with STORE_j): R3(c) ; W4(d,2) ; W6(d) ;

(iii) Given a parallel program..., define

EXECUTIONS...

CPU_i: R1(a,T) ; W2(b,1) ; R5(d,2) ;
CPU_j: R3(c,T) ; W4(d,2) ; W6(d,3) ;

Events

STORE_i: R1(a,T) W2(b,1) R5(d,2) ; W4(d,2) W6(d,3)
STORE_j: R3(c,T) W4(d,2) W6(d,3) ; W2(b,1)

Event orderings RO, WO, PO...

- RO arcs, if RO obeyed
- WO arcs, if WO obeyed
- PO arcs, if PO obeyed

(iv) Define computational ordering, CMP, as a total ordering S per address and per CPU that includes a valid linearization of all local read events and all write events:

STORE_i events: R1(a,T) W2(b,1) R5(d,2) ; W4(d,2) W6(d,3)
STORE_j events: R3(c,T) W4(d,2) W6(d,3) ; W2(b,1)

[Figure: two candidate CMP orders over these events, labeled “One CMP order (cmp1)” and “Another (cmp2)”]

The entire event-graph for CPU_i...

CPU_i: R1(a,T) ; W2(b,1) ; R5(d,2) ;
CPU_j: R3(c,T) ; W4(d,2) ; W6(d,3) ;

[Figure: event-graph for CPU_i over the events R1(a,T) W2(b,1) R5(d,2) and W4(d,2) W6(d,3), with RO arcs, WO arcs and the CMP orders cmp1 and cmp2]

- RO arcs present in event-graph if architecture obeys Read Ordering (similarly for WO, PO, CMP)

- Acyclic event-graph G => the architecture obeys ordering rules used in G

- Given execution E on architecture A, if all members of G(E,R) are cyclic, A violates one of the rules in R

cmp2 leads to a cycle.. (but cmp1 doesn’t..)
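The cyclicity rule above can be sketched as a topological-sort check. The code is a minimal illustration rather than ARCHTEST itself: the arc lists `cmp_good` and `cmp_bad` are hypothetical stand-ins for cmp1 and cmp2 (the actual arcs are in the figure and are not recoverable here), and the names `is_acyclic` and `violates_rules` are invented for the example.

```python
from collections import defaultdict, deque

def is_acyclic(nodes, arcs):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    succ = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for u, v in arcs:
        succ[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while ready:
        u = ready.popleft()
        removed += 1
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return removed == len(nodes)

def violates_rules(candidate_graphs):
    """A rule set is violated only if every candidate event-graph is cyclic."""
    return all(not is_acyclic(nodes, arcs) for nodes, arcs in candidate_graphs)

# Events at STORE_i from the slide; arcs below are assumptions for illustration.
nodes = ["R1", "W2", "R5", "W4", "W6"]
ro_wo = [("R1", "R5"), ("W4", "W6")]               # RO and WO arcs
cmp_good = ro_wo + [("W4", "R5"), ("R5", "W6")]    # hypothetical acyclic choice
cmp_bad = ro_wo + [("W6", "R5"), ("R5", "W4")]     # hypothetical cyclic choice
```

With these arcs, one acyclic candidate is enough to clear the execution, so `violates_rules([(nodes, cmp_good), (nodes, cmp_bad)])` is false, while a set containing only cyclic candidates reports a violation.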


21

Example execution revealing (CMP,RO,WO) violation

CPU_i (with STORE_i): wr(A,1) ; wr(A,2) ; wr(A,3) ;
CPU_j (with STORE_j): rd(A,1) ; rd(A,3) ; rd(A,2) ;

[Figure: event-graph with the reads rd(A,1) rd(A,3) rd(A,2) linked by RO arcs, the writes wr(A,1) wr(A,2) wr(A,3) linked by WO arcs, and CMP arcs between them]

Any linearization consistent with this graph causes a cycle:
Eg1) w1 r1 w2 r2 w3 r3 -- not acceptable because r3 <RO r2
Eg2) w1 r1 w2 r3 w3 r2 -- not acceptable because this is not a valid linearization...
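The claim that no linearization works can be confirmed by exhaustive search on this small example. The sketch below assumes a single address, one writer and one reader; `valid_linearizations` is a name invented here. It tries every total order of the six events, keeps those that respect WO and RO, and then checks that each read returns the latest written value.

```python
from itertools import permutations

def valid_linearizations(writes, reads, init=None):
    """writes / reads: values written and values observed, each in program order."""
    events = [("w", i) for i in range(len(writes))]
    events += [("r", i) for i in range(len(reads))]
    found = []
    for order in permutations(events):
        pos = {e: k for k, e in enumerate(order)}
        # WO: writes stay in program order; RO: reads stay in program order.
        if any(pos[("w", i)] > pos[("w", i + 1)] for i in range(len(writes) - 1)):
            continue
        if any(pos[("r", i)] > pos[("r", i + 1)] for i in range(len(reads) - 1)):
            continue
        # CMP: every read must see the most recent write in this total order.
        current = init
        consistent = True
        for kind, i in order:
            if kind == "w":
                current = writes[i]
            elif reads[i] != current:
                consistent = False
                break
        if consistent:
            found.append(order)
    return found

bad = valid_linearizations([1, 2, 3], [1, 3, 2])
good = valid_linearizations([1, 2, 3], [1, 2, 3])
```

For the execution on the slide the search comes back empty, whereas reads observed as 1, 2, 3 admit the single order w1 r1 w2 r2 w3 r3.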


22

ARCHTEST test for (CMP,RO,WO)

P1: A := 1 ; A := 2 ; A := 3 ; .... A := k
P2: X1 := A ; X2 := A ; X3 := A ; .... Xk := A

Drawbacks:
- ARCHTEST runs on real machines
  * (very) late-cycle debugging!
  * non-deterministic interleavings, non-exhaustive
- What “k” to use?
- P2 never writes into A - what if buggy “write-update” coherence protocol used?

Check Invariant: For all j >= i, X(j) >= X(i)
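Reading the invariant as "for all j >= i, X(j) >= X(i)" (the values P2 observes never decrease, since P1 writes increasing values), the check itself is a one-liner. The function name is an assumption for illustration; comparing adjacent pairs is enough because >= is transitive.

```python
def archtest_invariant_holds(xs):
    """True if the observed values X1..Xk never decrease."""
    return all(later >= earlier for earlier, later in zip(xs, xs[1:]))

observed_ok = [1, 1, 2, 3]
observed_bad = [1, 3, 2]
```

On a real machine this check would run after the two threads finish, which is why it only covers the interleavings that happened to occur in that run.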


23

Test Model Checking

• Adaptation of ARCHTEST to model-checking

• (Like ARCHTEST) tests are independent of the model being verified

• Usable at design-time (VIS- or PV-based model-checking)
– Simulates the effect of k = infinity
– Considers all interleavings
– Complete tests (defined later) do examine all possible writes by both CPUs

Basic adaptation of ARCHTEST to get k=infinity and all interleavings is possible, assuming that the memory system is:

Projectible: Any execution projected onto a subset of the addresses is still an execution. Used in “limited address theorems” later...

Example: the execution
CPU_i: R1(a,T) ; W2(b,1) ; R5(d,2) ;
CPU_j: R3(c,T) ; W4(d,2) ; W6(d,3) ;
projected onto address d is
CPU_i: R5(d,2) ;
CPU_j: W4(d,2) ; W6(d,3) ;

Data independent: Replacing a data value d in an execution by f(d) results in an execution. Used to define complete tests...

Example: replacing the value 2 by 22 gives
CPU_i: R1(a,T) ; W2(b,1) ; R5(d,22) ;
CPU_j: R3(c,T) ; W4(d,22) ; W6(d,3) ;
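Both assumptions are transformations on executions, so they can be written as two small functions. This is a sketch under an assumed representation (an execution as one list of (kind, address, value) triples per CPU, with the event numbers dropped); the names `project` and `rename_data` are hypothetical.

```python
def project(execution, addrs):
    """Keep only the events that touch an address in addrs."""
    return [[ev for ev in thread if ev[1] in addrs] for thread in execution]

def rename_data(execution, f):
    """Replace every data value d by f(d)."""
    return [[(kind, addr, f(val)) for kind, addr, val in thread] for thread in execution]

execution = [
    [("R", "a", "T"), ("W", "b", 1), ("R", "d", 2)],   # CPU_i
    [("R", "c", "T"), ("W", "d", 2), ("W", "d", 3)],   # CPU_j
]
only_d = project(execution, {"d"})
renamed = rename_data(execution, lambda v: 22 if v == 2 else v)
```

Projectibility says `only_d` is again an execution of the memory system, and data independence says the same of `renamed`; the functions only build the candidates, they do not establish either property.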


25

Details

• Define a formal shared memory description language
– “data is not used for control decisions”
– “addresses are symmetric”

• Use Model checking
– “Small number of addresses” sufficient

• Have applied technique to HP Runway / PA 8000 memory system, using PV


26

Test model-checking adaptation of ARCHTEST for (CMP,RO,WO) (k=infinity; all non-det interleavings; still incomplete...)

ARCHTEST test:
P1: A := 1 ; A := 2 ; A := 3 ; .... A := k
P2: X1 := A ; X2 := A ; X3 := A ; .... Xk := A
Check Invariant: For all j >= i, X(j) >= X(i)

[Figure: test automata for P1 and P2 that replace the programs above, with transitions labeled wr(0), wr(1), rd(0), rd(1) and an Error state]
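The idea of composing test automata with a memory model and searching for the Error state can be sketched with a tiny explicit-state search. Everything concrete here is an assumption made for illustration: the memory model is a toy (a bounded buffer of pending writes delivered to the reader's copy), not the Runway model, and the search is a plain breadth-first exploration, not PV or VIS. The writer issues wr(0) any number of times and then wr(1) from then on; the reader flags an error if it reads 0 after having read 1.

```python
from collections import deque

def error_reachable(deliver_in_order, max_pending=2):
    """Search all interleavings of the test automata with a toy memory model.

    State: (writer_phase, reader_has_seen_1, reader_copy, pending_writes).
    """
    start = (0, False, 0, ())
    seen = {start}
    work = deque([start])
    while work:
        phase, seen_one, copy, pending = work.popleft()
        # Reader automaton: reading 0 after having read 1 is the Error state.
        if seen_one and copy == 0:
            return True
        succs = []
        # Writer automaton: wr(0) while in phase 0, wr(1) afterwards.
        if len(pending) < max_pending:
            if phase == 0:
                succs.append((0, seen_one, copy, pending + (0,)))
            succs.append((1, seen_one, copy, pending + (1,)))
        # Memory model: deliver a pending write (head only, or any, if buggy).
        n = len(pending)
        choices = range(min(1, n)) if deliver_in_order else range(n)
        for k in choices:
            succs.append((phase, seen_one, pending[k], pending[:k] + pending[k + 1:]))
        # Reader automaton: rd(A) records whether a 1 has been observed.
        succs.append((phase, seen_one or copy == 1, copy, pending))
        for s in succs:
            if s not in seen:
                seen.add(s)
                work.append(s)
    return False
```

With in-order delivery the Error state is unreachable; letting the buffer deliver writes out of order makes it reachable. Because the writer can loop, the search covers arbitrarily long write sequences within a finite state space, which is the "k=infinity" effect.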


27

Completeness result for (CMP,RO,WO)

– For any number of CPUs (N >= 1), we need consider only executions over TWO addresses

– The proof is by showing that if there exists an event-graph cycle involving more than two addresses, there exists one with one less


28

Reducing all (CMP,RO,WO) cycles to those that contain only two addresses

[Figure: four successive event-graphs for P1 over the events R1(P1), R2(P1), W1(P2), W2(P2), W3(P3), W4(P3): first the bare events, then with RO and WO arcs, then with CMP arcs added, and finally with an extra read R3(P1) and further RO arcs, giving a cycle annotated “Involves two addrs!”]


29

Executions to consider for verifying (CMP,RO,WO)

• How a complete test can be developed: Design a test automaton that non-deterministically examines all “relevant” executions over two addresses...


30

2-address (CMP,RO,WO) test is broken into the following cases (approx...)

Case 1:

[Figure: test automata for P1 and P2, each with an Error state. Labels in P1: wr(A,0) or rd(A,-); wr(A,1) or rd(A,-); wr(A,1); rd(A,1); rd(A,0). Labels in P2: wr(A,2) or rd(A,-); rd(A,1); rd(A,0)]

10/7/98 31

2-address (CMP,RO,WO) test, Case 2:

[Figure: test automata for P1 and P2, each with an Error state. Labels shown: Sigma(0,0), Sigma(1,0), Sigma(1,1), Sigma(2,2); wr(A,1), wr(B,1); rd(B,1), rd(A,0); and the choice wr(A,2) or wr(B,2) or rd(A,-) or rd(B,-)]


32

Write Atomicity, and S.C.

• All processors agree on the order of writes
– WO imposes the order only if the writes are from same program

[Figure: two writes, wr(A,0) and wr(B,1)]

SC is (CMP, PO, WA)


33

SC testing: need to consider N-address programs...

E.g.: The execution

P1: wr(a,1) ; rd(b,0) ;
P2: wr(b,1) ; rd(c,0) ;
P3: wr(c,1) ; rd(d,0) ;
P4: wr(d,1) ; rd(a,0) ;

violates SC when all four addresses a,b,c,d are considered..... but is SC if only 3 addresses are considered at a time.... for example:

P1: wr(a,1) ; rd(b,0) ;
P2: wr(b,1) ; rd(c,0) ;
P3: wr(c,1) ;
P4: rd(a,0) ;
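The four-processor example can be checked mechanically. The sketch below is an illustration, assuming every address starts at 0 and that "considering only 3 addresses" means deleting all events on the fourth; `is_sc` and `project` are names made up here. `is_sc` searches depth-first for a weave of the programs in which each read returns the value it was observed to return.

```python
from itertools import combinations

def is_sc(programs, init=0):
    """programs: per-processor lists of (kind, addr, value); reads carry observed values."""
    def search(pcs, mem):
        if all(pc == len(prog) for pc, prog in zip(pcs, programs)):
            return True
        for p, prog in enumerate(programs):
            if pcs[p] == len(prog):
                continue
            kind, addr, value = prog[pcs[p]]
            advanced = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
            if kind == "wr":
                if search(advanced, {**mem, addr: value}):
                    return True
            elif mem.get(addr, init) == value and search(advanced, mem):
                return True
        return False
    return search((0,) * len(programs), {})

def project(programs, addrs):
    """Delete every event whose address is not in addrs."""
    return [[ev for ev in prog if ev[1] in addrs] for prog in programs]

full = [
    [("wr", "a", 1), ("rd", "b", 0)],   # P1
    [("wr", "b", 1), ("rd", "c", 0)],   # P2
    [("wr", "c", 1), ("rd", "d", 0)],   # P3
    [("wr", "d", 1), ("rd", "a", 0)],   # P4
]
three_address_results = [is_sc(project(full, set(s))) for s in combinations("abcd", 3)]
```

Under these assumptions the full execution has no SC weave, while each of the four 3-address projections does, so a test limited to three addresses would miss the violation.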


34

SC creates barriers wrt writes

Now each event-space provides at least two writes such that these writes are connected by the write-atomicity equivalence arcs:

[Figure: two rows of three writes (w w w), joined by WA equivalence arcs]


35

Complete Tests for SC

• Theorem: A system with N processors implements SC if and only if it has no errors on all n<N address programs

• Scheme for N processors
– N barriers
– Data written before, at, and after barrier are different
– data 0, 1, 2 for P0, and data 3, 4, 5 for P1

10/7/98 36

A portion of the complete test for SC for 1-address programs

[Figure: test automata for P1 and P2. Labels in P1: Sigma(0), wr(A,1), Sigma(2), rd(A,3) or rd(A,4). Labels in P2: Sigma(3), wr(A,4), Sigma(5), rd(A,0) or rd(A,1). Error: Saw 1,4 in P1, 4,1 in P2]


37

Case Studies

• Serial memory (operational semantics of SC)
• Lazy caching
• Runway/PA system model

– Bus based design

– An aggressive split transaction protocol

– Out-of-order completion of transactions on Runway for high-performance

– In-order completion of instructions in PA for sequential consistency


38

Test Model checking of HP/Runway

Test     Spin          PV
PO-1     56K           2412
PO-2     > 5M/DNF      285K
SC-1     499K          7880
SC-2a    > 5M/DNF      5.9M
SC-2b    > 4M/DNF      574K

39

Conclusions

• Test model-checking is practically viable
• Work is in progress in adapting these ideas for weak memory models
• SC tests don’t scale well: future work:
– discover non-trivial equivalence relations to reduce execution-space even more
– Need to consider symmetries in the system
– Need to build a tool integrated into the design cycle of CPUs (performance evaluation + test model-checking must go hand in hand...)
