
Formal Modeling and Analysis of a Flash Filesystem in Alloy

Eunsuk Kang

TDS Seminar, Mar. 14, 2008

What is flash memory?

Non-volatile, high-performance storage

Applications: MP3 players, laptop drives, digital cameras, etc.

NASA Mars Exploration Rover Spirit

On-board flash memory to store scientific data

Flash anomaly on Spirit

System failure 18 days after landing (2004)

Loss of communication with Earth, stuck in “reboot” loop

Cause: Flaw in the flash filesystem

Cost: 10 days of lost scientific activity

Testing for unanticipated?

Out of free space, but still attempted to service file operations

“There was a belief among the FSW development team that the system would not exhibit the behavior that is the root cause of the anomaly…” [Reeves, 2004]

Testing is essential, but is it enough?

Answer: Formal methods?

Allows exhaustive analysis

BUT: Verifying a poorly designed piece of code in an after-the-fact, ad hoc manner is impractical

Apply formal methods early, get the design right

Grand Challenge in Verification

Long term: “Build a verifying compiler” – Tony Hoare

Short term: “Build a verified flash filesystem” – Joshi & Holzmann (Jet Propulsion Laboratory)

In this talk: “Build a verified design for a flash filesystem”

Outline

What is POSIX?

IEEE standard for filesystem operations

Adopted by UNIX, Mac OS X, etc.

Reference model for the flash filesystem

Function signatures & behaviors

e.g. write(fildes, *buf, nbyte, offset)

“The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes.”

POSIX filesystem in Alloy

Alloy: First-order relational logic + transitive closure

    sig Data {}   // data element
    sig FID {}    // file identifier

    sig File {
      contents : seq Data
    }

    sig AbsFsys {                  // abstract filesystem
      fileMap : FID -> lone File   // “lone” means one or zero
    }

Abstract read operation

    // simulation
    run {
      some fsys : AbsFsys, fid : FID, output : seq Data |
        output = readAbs[fsys, fid, 1, 3]
    } for 3

    fun readAbs [fsys: AbsFsys, fid: FID, offset, size: Int] : seq Data {
      let file = fsys.fileMap[fid] |
        (file.contents).subseq[offset, offset + size - 1]
    }
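As an executable illustration of readAbs, here is a minimal Python sketch. It assumes files are plain lists keyed by a string fid, and that an unmapped fid reads as empty; both choices are illustrative and not part of the Alloy model.

```python
from typing import Dict, List

def read_abs(file_map: Dict[str, List[int]], fid: str, offset: int, size: int) -> List[int]:
    """Mirror readAbs: the `size` elements of file `fid` starting at `offset`."""
    contents = file_map.get(fid, [])   # assumption: an unmapped fid reads as empty
    # subseq[offset, offset + size - 1] is inclusive; a Python slice excludes its end
    return contents[offset:offset + size]

file_map = {"f1": [10, 11, 12, 13, 14]}
print(read_abs(file_map, "f1", 1, 3))   # [11, 12, 13]
```

The call mirrors the simulation command readAbs[fsys, fid, 1, 3]: three elements starting at index 1.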

Abstract write operation

    // promotion
    pred writePromote[fsys, fsys' : AbsFsys, file, file' : File, fid : FID] {
      fsys'.fileMap = fsys.fileMap ++ (fid -> file')
    }

    pred writeAbs[fsys, fsys' : AbsFsys, fid : FID, buffer : seq Data, offset, size : Int] {
      let file = fsys.fileMap[fid],
          file' = fsys'.fileMap[fid],
          buffer' = buffer.subseq[0, size - 1] {
        (#buffer' = 0) => file' = file    // case 1
        (#buffer' != 0) =>                // case 2 & 3
          file'.contents = (zeros[offset] ++ file.contents) ++ shift[buffer', offset]
        writePromote[fsys, fsys', file, file', fid]
      }
    }
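The three cases of writeAbs can be traced with a small Python sketch. It assumes file contents are a list of integers and that the filler element is 0; the relational override `++` is imitated with list slicing, so this illustrates the intended result rather than translating the Alloy semantics directly.

```python
def write_abs(contents, buffer, offset, size):
    """Return the new file contents after writing buffer[:size] at `offset`."""
    buf = list(buffer[:size])              # buffer' = buffer.subseq[0, size - 1]
    if not buf:
        return list(contents)              # case 1: empty input, file unchanged
    # zeros[offset] ++ file.contents: pad with zeros only where the file is too short
    padded = list(contents) + [0] * max(0, offset - len(contents))
    # ... ++ shift[buffer', offset]: the shifted buffer overrides existing data
    return padded[:offset] + buf + padded[offset + len(buf):]

print(write_abs([1, 2, 3, 4, 5], [9, 9], 1, 2))   # case 2: [1, 9, 9, 4, 5]
print(write_abs([1, 2], [7], 4, 1))               # case 3: [1, 2, 0, 0, 7]
```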

Alloy is pure logic

No built-in syntax/semantics for state machines

Transition as an explicit constraint between two states

Abstract write operation: Case 1

Input buffer is empty; no changes to the file

Abstract write operation: Case 2

Offset is within the file

Shift buffer by offset & override existing data

Abstract write operation: Case 3

Offset is after the end of the file

Fill in the gap with zeros

Promotion

    // promotion
    pred writePromote[fsys, fsys' : AbsFsys, file, file' : File, fid : FID] {
      fsys'.fileMap = fsys.fileMap ++ (fid -> file')
    }

A style of modeling changes in system state

Ensure all other files remain unchanged
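Promotion can be pictured as a pure function over the file map. The sketch below assumes the map is a Python dict; copying it plays the role of the separate post-state fsys', and the single assignment plays the role of the override `++`.

```python
def write_promote(file_map, fid, new_file):
    """Return the post-state map: only `fid` changes, every other entry is kept."""
    updated = dict(file_map)     # the pre-state is left untouched
    updated[fid] = new_file      # fileMap ++ (fid -> file')
    return updated

before = {"a": [1, 2], "b": [3]}
after = write_promote(before, "a", [9, 9])
print(after)    # {'a': [9, 9], 'b': [3]}
print(before)   # {'a': [1, 2], 'b': [3]}
```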

What makes flash special?

Two types: NOR and NAND

Program (i.e. write) at the page level, erase at the block level

Must erase before programming

Block can be erased only a limited number of times (need wear-leveling)

Modeling memory hierarchy

sig Page { data : seq Data } { #data = PAGE_SIZE }

sig Block { pages : seq Page } { #pages = BLOCK_SIZE }

sig LUN { blocks : seq Block } { #blocks = LUN_SIZE }

sig Device { LUNs : seq LUN … } { #LUNs = DEVICE_SIZE }

    // simulation with constraints
    run {
      some Device
      DEVICE_SIZE = 1
      LUN_SIZE = 2
      BLOCK_SIZE = 2
      PAGE_SIZE = 4
    } for 4

Addressing mode

Row & column addresses:

    sig RowAddr {   // used to access a page
      lunIndex : Int,
      blockIndex : Int,
      pageIndex : Int
    }

A column address is an Int, and identifies a data element in a page

Example:

    rowAddr.lunIndex = 0
    rowAddr.blockIndex = 1
    rowAddr.pageIndex = 1
    columnAddr = 1
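To make the addressing example concrete, the following Python sketch builds a device with the sizes used in the earlier simulation (1 LUN, 2 blocks per LUN, 2 pages per block, 4 elements per page) and resolves the example address. The nested-list layout and the label strings are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RowAddr:
    lun_index: int
    block_index: int
    page_index: int

DEVICE_SIZE, LUN_SIZE, BLOCK_SIZE, PAGE_SIZE = 1, 2, 2, 4

# device[lun][block][page] is a list of PAGE_SIZE elements; each element is
# labeled with its own coordinates so a lookup is easy to check by eye
device = [[[[f"L{lun}B{blk}P{pg}C{col}" for col in range(PAGE_SIZE)]
            for pg in range(BLOCK_SIZE)]
           for blk in range(LUN_SIZE)]
          for lun in range(DEVICE_SIZE)]

def element_at(device, row_addr, column_addr):
    """The row address selects a page; the column address selects an element in it."""
    page = device[row_addr.lun_index][row_addr.block_index][row_addr.page_index]
    return page[column_addr]

print(element_at(device, RowAddr(0, 1, 1), 1))   # L0B1P1C1
```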

Page status & data structures

Each page is associated with its current status

    abstract sig PageStatus {}
    one sig Free, Allocated, Valid, Invalid extends PageStatus {}

Auxiliary data structures*

    sig Device {
      LUNs : seq LUN,
      pageStatusMap : RowAddr -> one PageStatus,
      eraseCountMap : RowAddr -> one Int,   // wear-leveling
      reserveBlock : RowAddr                // garbage collection
    } { #LUNs = DEVICE_SIZE }

(* disclaimers)

Flash API functions

    // reads data from page, starting at “colAddr”
    fun read[d : Device, colAddr : Int, rowAddr : RowAddr] : seq Data { … }

    // program data into page & set page status to “Allocated”
    pred program[d, d' : Device, colAddr : Int, rowAddr : RowAddr, data : seq Data] { … }

    // erase data in block & increase its erase count, and set status of every page in block to “Free”
    pred erase[d, d' : Device, rowAddr : RowAddr] { … }
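The bodies of the three API functions are elided in the slides. The Python sketch below is one possible reading of the comments, not the model's actual definitions: pages are addressed by a (block, page) tuple, erase takes a block index, erase counts are kept per block, and None stands for an erased cell. Each operation returns a new device, echoing the d / d' pairs.

```python
import copy

PAGE_SIZE, PAGES_PER_BLOCK, NUM_BLOCKS = 4, 2, 2

def new_device():
    addrs = [(b, p) for b in range(NUM_BLOCKS) for p in range(PAGES_PER_BLOCK)]
    return {
        "data": {a: [None] * PAGE_SIZE for a in addrs},   # None = erased cell (assumption)
        "status": {a: "Free" for a in addrs},
        "erase_count": {b: 0 for b in range(NUM_BLOCKS)},
    }

def read(d, col_addr, row_addr):
    """Data in the page at row_addr, starting at col_addr."""
    return d["data"][row_addr][col_addr:]

def program(d, col_addr, row_addr, data):
    """Program data into a free page and mark the page Allocated."""
    if d["status"][row_addr] != "Free":
        raise ValueError("page must be erased before it is programmed")
    if col_addr + len(data) > PAGE_SIZE:
        raise ValueError("data does not fit in the page")
    d2 = copy.deepcopy(d)
    d2["data"][row_addr][col_addr:col_addr + len(data)] = data
    d2["status"][row_addr] = "Allocated"
    return d2

def erase(d, block):
    """Erase every page in the block, mark them Free, bump the erase count."""
    d2 = copy.deepcopy(d)
    for p in range(PAGES_PER_BLOCK):
        d2["data"][(block, p)] = [None] * PAGE_SIZE
        d2["status"][(block, p)] = "Free"
    d2["erase_count"][block] += 1
    return d2

d0 = new_device()
d1 = program(d0, 0, (0, 1), [0, 1, 1, 0])
d2 = erase(d1, 0)
print(read(d1, 0, (0, 1)), d1["status"][(0, 1)], d2["erase_count"][0])
```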

Abstract vs. concrete filesystem

Concrete filesystem in Alloy

    sig Inode { blockList : seq VBlock }
    sig VBlock {}   // virtual block

    sig ConcFsys {
      inodeMap : FID -> lone Inode,
      blockMap : VBlock one -> one RowAddr
    }

Concrete read operation (snippet)

    pred readConc[fsys : ConcFsys, d : Device, fid : FID, offset, size : Int, buffer : seq Data] {
      …
        all i : blocksToRead.inds {
          let vblock = blocksToRead[i],
              rowAddr = fsys.blockMap[vblock],
              from = PAGE_SIZE*i,
              to = from + PAGE_SIZE - 1 |
            buffer.subseq[from, to] = read[d, 0, rowAddr]
        }
      }
      …
    }

State of a flash filesystem

State is represented by a pair (ConcFsys, Device)

Read operation animated

Initially, buffer is empty

Three calls to flash read in total

Concrete read operation: Step 1

Extract blocks to read from inode using offset & size

Concrete read operation: Step 2

Consider each index i in blocksToRead

Concrete read operation: Step 3

Retrieve the address of page for ith virtual block

Concrete read operation: Step 4

Calculate indices for current buffer slot

Concrete read operation: Step 5

Execute the flash API function, read
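The five steps can be followed in a short Python sketch. How blocksToRead is derived is elided in the slides, so the sketch assumes offset and size are multiples of PAGE_SIZE; the dict-based inode map, block map and page store are likewise illustrative.

```python
PAGE_SIZE = 4

def read_conc(inode_map, block_map, pages, fid, offset, size):
    block_list = inode_map[fid]
    # Step 1 (assumed page-aligned): pick the virtual blocks covering the range
    first = offset // PAGE_SIZE
    blocks_to_read = block_list[first:first + size // PAGE_SIZE]
    buffer = []
    for vblock in blocks_to_read:          # Step 2: visit each index in order
        row_addr = block_map[vblock]       # Step 3: page address of the virtual block
        # Step 4: slot i covers buffer[PAGE_SIZE*i : PAGE_SIZE*(i+1)], which is
        # exactly where an in-order append places the data
        buffer.extend(pages[row_addr])     # Step 5: flash read from column 0
    return buffer

inode_map = {"f": ["VBlk0", "VBlk1", "VBlk2"]}
block_map = {"VBlk0": "Page4", "VBlk1": "Page5", "VBlk2": "Page1"}
pages = {"Page1": [0, 0, 1, 1], "Page4": [1, 1, 1, 1], "Page5": [1, 0, 0, 1]}
print(read_conc(inode_map, block_map, pages, "f", 0, 12))
```

Reading all twelve elements takes three page reads, one per virtual block.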

Wear-leveling

Wear-leveling example

Client sends a write request to overwrite data in VBlk1 with 0110

Simple approach: Erase Block2 & program Page5

Non-wear-leveling approach: Step 1

Step 1: Erase Block2

Non-wear-leveling approach: Step 2

Step 2: Program 0110 into Page5 - Done.

Why wear-level?

What’s wrong with a simple approach?

1. Frequent requests on VBlk1: Block2 wears out quickly

2. H/W failure: Original data in Page5 is lost

Wear-leveling approach

Client sends a write request to overwrite data in VBlk1 with 0110

Wear-leveling approach: Search for a free page & program

Wear-leveling approach: Step 1

Step 1: Program 0110 into a free page, Page3

Wear-leveling approach: Step 2

Step 2: Invalidate Page5 & validate Page3

Wear-leveling approach: Step 3

Step 3: Update blockMap
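The three wear-leveling steps can be sketched in Python as follows. Page names follow the example (VBlk1 initially at Page5, Page3 free); the other page contents and the first-free-page selection policy are assumptions made for illustration.

```python
def overwrite_wear_leveled(status, data, block_map, vblock, new_data):
    """Out-of-place overwrite: the old page is never erased by the write itself."""
    old_page = block_map[vblock]
    free_pages = [p for p in sorted(status) if status[p] == "Free"]
    if not free_pages:
        raise RuntimeError("no free page: erase-unit reclamation is needed")
    new_page = free_pages[0]              # assumption: take the first free page
    data[new_page] = list(new_data)       # Step 1: program a free page
    status[new_page] = "Allocated"
    status[old_page] = "Invalid"          # Step 2: invalidate old, validate new
    status[new_page] = "Valid"
    block_map[vblock] = new_page          # Step 3: update blockMap
    return new_page

status = {"Page3": "Free", "Page4": "Valid", "Page5": "Valid"}
data = {"Page3": None, "Page4": [1, 1, 1, 1], "Page5": [1, 0, 0, 1]}
block_map = {"VBlk0": "Page4", "VBlk1": "Page5"}
overwrite_wear_leveled(status, data, block_map, "VBlk1", [0, 1, 1, 0])
print(block_map["VBlk1"], status["Page5"], data["Page5"])
```

The original data stays in Page5 until the block is reclaimed, which is what makes recovery possible if the write is interrupted.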

Erase unit reclamation (garbage collection)

Erase-unit reclamation example

Client sends a write request to append 0101 at the end of the inode

Problem: Flash is out of free pages (besides reserved ones)

Erase-unit reclamation: Step 1

Step 1: Pick a dirty block with the least erase count

Erase-unit reclamation: Step 2

Step 2: Relocate valid data to reserveBlock

Erase-unit reclamation: Step 3

Step 3: Invalidate/validate pages & update blockMap

Erase-unit reclamation: Step 4

Step 4: Erase Block2 & set it as the new reserveBlock

Erase-unit reclamation complete

Complete: Page0 in Block0 is now free and available for use
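A possible Python rendering of the four reclamation steps is shown below. The device layout (dicts keyed by (block, page) tuples), the per-block erase counts, and the example state are assumptions; only the step order comes from the slides.

```python
def reclaim(dev, block_map):
    pages = range(dev["pages_per_block"])
    reserve = dev["reserve"]
    # Step 1: among dirty blocks (holding an Invalid page), pick the least erased
    dirty = [b for b in dev["erase_count"]
             if b != reserve and any(dev["status"][(b, p)] == "Invalid" for p in pages)]
    if not dirty:
        raise RuntimeError("nothing to reclaim")
    victim = min(dirty, key=lambda b: dev["erase_count"][b])
    owner = {addr: vblock for vblock, addr in block_map.items()}
    target = 0
    for p in pages:
        src = (victim, p)
        if dev["status"][src] == "Valid":
            dst = (reserve, target)
            target += 1
            dev["data"][dst] = dev["data"][src]   # Step 2: relocate valid data
            dev["status"][src] = "Invalid"        # Step 3: invalidate/validate,
            dev["status"][dst] = "Valid"          #         then update blockMap
            block_map[owner[src]] = dst
    for p in pages:                               # Step 4: erase the victim block
        dev["data"][(victim, p)] = None
        dev["status"][(victim, p)] = "Free"
    dev["erase_count"][victim] += 1
    dev["reserve"] = victim                       # victim is the new reserveBlock
    return victim

dev = {
    "pages_per_block": 2, "reserve": 0, "erase_count": {0: 0, 1: 2, 2: 1},
    "status": {(0, 0): "Free", (0, 1): "Free", (1, 0): "Valid",
               (1, 1): "Invalid", (2, 0): "Invalid", (2, 1): "Valid"},
    "data": {(0, 0): None, (0, 1): None, (1, 0): "a",
             (1, 1): "x", (2, 0): "y", (2, 1): "b"},
}
block_map = {"VBlk0": (1, 0), "VBlk1": (2, 1)}
print(reclaim(dev, block_map), block_map["VBlk1"], dev["status"][(0, 1)])
```

After the call, the unused page of the old reserve block is free for the pending append.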

Concrete write operation

    pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] { … }

Transition between two pairs (fsys, d) and (fsys’, d’)

Flash API program is a single-step transition between two device states

PAGE_SIZE = 4

Write operation: Phase 1

Partition input buffer into N fragments & program them

1. Introduce an intermediate device, interDev

2. Create a sequence of states between d and interDev using seq Device

3. Constrain the sequence

    pred stateSeqConds[init, final : Device, stateSeq : seq Device, length : Int] {
      stateSeq.first = init
      stateSeq.last = final
      #stateSeq = length + 1
    }

4. Program fragments one by one

Write operation: Phase 1.1

    pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] {
      …
      some stateSeq : seq Device, interDev : Device {
        stateSeqConds[d, interDev, stateSeq, numBlocksToProgram]
        all i : stateSeq.butlast.inds {
          let from = PAGE_SIZE * i,
              to = from + PAGE_SIZE - 1,
              dataFragment = buffer.subseq[from, to],
              vblock = inode.blockList[startBlkIndex + i],
              rowAddr = fsys.blockMap[vblock],
              preState = stateSeq[i],
              postState = stateSeq[i + 1] |
            programPage[preState, postState, rowAddr, dataFragment]
        }
        …

Introduce & constrain intermediate device states

Write operation: Phase 1.2

For each sequence index i, extract a data fragment from buffer

Write operation: Phase 1.3

Retrieve the address of page for ith virtual block (could be empty)

Write operation: Phase 1.4

Retrieve the current pair of pre- and post- states

Write operation: Phase 1.5

Program data fragment into page at rowAddr
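Phase 1 chains single-step program transitions through a sequence of device states. The Python sketch below builds such a sequence explicitly and then checks the stateSeqConds conditions. A device is reduced to a dict from page address to programmed data, and the target addresses are passed in directly; both are simplifying assumptions.

```python
PAGE_SIZE = 4

def program_page(pre_state, row_addr, fragment):
    """Single-step transition: the post-state differs from the pre-state in one page."""
    post_state = dict(pre_state)
    post_state[row_addr] = tuple(fragment)
    return post_state

def phase1(d, buffer, row_addrs):
    """Program one fragment per address, recording every intermediate state."""
    state_seq = [d]
    for i, row_addr in enumerate(row_addrs):
        start = PAGE_SIZE * i
        fragment = buffer[start:start + PAGE_SIZE]
        state_seq.append(program_page(state_seq[i], row_addr, fragment))
    return state_seq

def state_seq_conds(init, final, state_seq, length):
    return state_seq[0] == init and state_seq[-1] == final and len(state_seq) == length + 1

d = {}
state_seq = phase1(d, [0, 1, 1, 0, 0, 1, 0, 1], ["Page3", "Page6"])
inter_dev = state_seq[-1]
print(state_seq_conds(d, inter_dev, state_seq, 2), inter_dev)
```

In Alloy the sequence is constrained rather than computed: the solver searches for any stateSeq satisfying the same conditions.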

Write operation: Phase 2

Invalidate obsolete pages & validate all allocated pages by updating interDev.pageStatusMap

    pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] {
      …
      some stateSeq : seq Device, interDev : Device {
        …
        updatePageStatus[interDev, d']
        updateFilesystemInfo[fsys, fsys']
      }
      …
    }

Write operation: Phase 3

Update filesystem information (blockMap & inode.blockList)

Fault Tolerance

What happens when H/W loses power in the middle of a write operation?

On recovery, the filesystem must be in a state as if:

1. the operation has never begun, or

2. the operation has successfully completed

Power loss may occur either in Phase 1 or Phase 2

Phase 1 crash

At the time of failure, one or more pages programmed & status set to Allocated.

Recovery: Invalidate every allocated page

Recovery from Phase 1 crash

After recovery, the filesystem is in the original state (but has extra invalid pages)

Phase 2 crash

At the time of failure:

1. some/all obsolete pages have been invalidated

2. all obsolete pages have been invalidated, and some allocated pages have been validated

Recovery: Complete the rest of Phase 2 & Phase 3

Recovery from Phase 2

After recovery, the inode contains the new data as expected by the caller of writeConc
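The two recovery paths can be sketched in Python. The slides do not say how the filesystem records which pages an interrupted write touched, so the `pending` argument below (virtual block mapped to its old and new page) is a hypothetical stand-in for that information.

```python
def recover_phase1(status):
    """Crash in Phase 1: roll back by invalidating every Allocated page."""
    return {addr: ("Invalid" if s == "Allocated" else s) for addr, s in status.items()}

def recover_phase2(status, block_map, pending):
    """Crash in Phase 2: roll forward by finishing Phase 2 and Phase 3.

    `pending` is hypothetical: vblock -> (old_page or None, new_page).
    """
    status, block_map = dict(status), dict(block_map)
    for vblock, (old_page, new_page) in pending.items():
        if old_page is not None:
            status[old_page] = "Invalid"    # rest of Phase 2: invalidate obsolete page
        status[new_page] = "Valid"          # rest of Phase 2: validate allocated page
        block_map[vblock] = new_page        # Phase 3: update filesystem information
    return status, block_map

print(recover_phase1({"P1": "Valid", "P2": "Allocated", "P3": "Free"}))
print(recover_phase2({"P1": "Invalid", "P2": "Allocated"},
                     {"VBlk1": "P1"}, {"VBlk1": ("P1", "P2")}))
```

Rolling back leaves extra invalid pages behind; rolling forward leaves the inode with the new data.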

Refinement: Trace inclusion

Does the concrete filesystem conform to the abstract filesystem?

Abstraction function

    pred alpha[asys : AbsFsys, csys : ConcFsys, d : Device] {
      all fid : FID |
        let file = asys.fileMap[fid],
            inode = csys.inodeMap[fid],
            vblocks = inode.blockList {
          #file.contents = #vblocks * PAGE_SIZE
          (all i : vblocks.inds |
            let vblock = vblocks[i],
                from = i * PAGE_SIZE,
                to = from + PAGE_SIZE - 1,
                absDataFrag = file.contents.subseq[from, to],
                concDataFrag = findPageData[vblock, csys, d] |
              absDataFrag = concDataFrag)
        }
    }
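The relation alpha can be read as a check that an abstract state and a concrete state agree fragment by fragment. The Python sketch below assumes dict-based states and stands in for findPageData with a direct lookup `pages[block_map[vblock]]`; that helper's real definition is not shown in the slides.

```python
PAGE_SIZE = 4

def alpha(file_map, inode_map, block_map, pages):
    """True if every file's contents match the pages behind its inode."""
    for fid in set(file_map) | set(inode_map):
        contents = file_map.get(fid, [])
        vblocks = inode_map.get(fid, [])
        if len(contents) != len(vblocks) * PAGE_SIZE:
            return False
        for i, vblock in enumerate(vblocks):
            start = i * PAGE_SIZE
            abs_frag = contents[start:start + PAGE_SIZE]
            conc_frag = pages[block_map[vblock]]   # assumed meaning of findPageData
            if abs_frag != conc_frag:
                return False
    return True

inode_map = {"f": ["VBlk0", "VBlk1"]}
block_map = {"VBlk0": "Page4", "VBlk1": "Page3"}
pages = {"Page3": [0, 1, 1, 0], "Page4": [1, 1, 1, 1]}
print(alpha({"f": [1, 1, 1, 1, 0, 1, 1, 0]}, inode_map, block_map, pages))
print(alpha({"f": [1, 1, 1, 1]}, inode_map, block_map, pages))
```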

Write refinement

    assert WriteRefinement {
      all csys, csys' : ConcFsys, asys, asys' : AbsFsys, d, d' : Device,
          fid : FID, buffer : seq Data, offset, size : Int |
        concInvariant[csys, d] and
        writeConc[csys, csys', d, d', fid, buffer, offset, size] and
        alpha[asys, csys, d] and
        alpha[asys', csys', d']
          => writeAbs[asys, asys', fid, buffer, offset, size]
    }
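The idea behind checking WriteRefinement in a bounded scope can be imitated with brute-force enumeration. The Python sketch below is a heavily simplified stand-in for the Alloy analysis: a single file, page-aligned writes, an append-only page store instead of a full device, and alpha taken as a function from concrete to abstract state. It enumerates every small case and asserts that the concrete write agrees with the abstract write.

```python
from itertools import product

P = 2   # page size, kept tiny so the search space stays small

def write_abs(contents, buf, offset):
    if not buf:
        return list(contents)
    padded = list(contents) + [0] * max(0, offset - len(contents))
    return padded[:offset] + list(buf) + padded[offset + len(buf):]

def write_conc(block_list, pages, buf, offset):
    """Out-of-place write; block_list holds page indices, pages only grows."""
    block_list, pages = list(block_list), list(pages)
    if not buf:
        return block_list, pages            # empty write must not pad the file
    first = offset // P
    while len(block_list) < first:          # gap past the end: zero-filled pages
        pages.append([0] * P)
        block_list.append(len(pages) - 1)
    for i in range(len(buf) // P):
        pages.append(list(buf[P * i:P * (i + 1)]))   # program a fresh page
        if first + i < len(block_list):
            block_list[first + i] = len(pages) - 1   # remap an existing block
        else:
            block_list.append(len(pages) - 1)        # append a new block
    return block_list, pages

def alpha(block_list, pages):
    return [x for idx in block_list for x in pages[idx]]

def check_write_refinement(max_pages=2):
    checked = 0
    for n_file, n_buf in product(range(max_pages + 1), repeat=2):
        for file_bits in product([0, 1], repeat=n_file * P):
            for buf_bits in product([0, 1], repeat=n_buf * P):
                for offset in range(0, (max_pages + 1) * P, P):
                    pages = [list(file_bits[P * i:P * (i + 1)]) for i in range(n_file)]
                    block_list = list(range(n_file))
                    bl2, pg2 = write_conc(block_list, pages, buf_bits, offset)
                    expected = write_abs(alpha(block_list, pages), buf_bits, offset)
                    assert alpha(bl2, pg2) == expected, (file_bits, buf_bits, offset)
                    checked += 1
    return checked

print(check_write_refinement())
```

Dropping the early return for an empty buffer in write_conc makes the check fail on an empty write past the end of the file, the kind of boundary case that small-scope analysis tends to expose.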

State invariant

    …
    all inode : FID.(csys.inodeMap) |
      all rowAddr : csys.blockMap[inode.blockList.elems] |
        d.pageStatusMap[rowAddr] = Valid
    …

e.g. All pages within an inode have a valid status

Analysis results

WriteRefinement:

    A scope of 5 for each domain
    6 pages, each with 4 data elements
    Incremental modeling & analysis
    Found over 20 bugs over development
    Final version returned no counterexample, approximately 8 hours to check

ReadRefinement:

    Final version returned no counterexample, approximately 45 minutes to check

Discussion & future work

Discussion

On analysis:

    Our filesystem is small, but still found bugs
    Many bugs occur in “boundary” cases, involving a small number of components
    Scientific argument for confidence?

On the Alloy language:

    Explicitly modeling state transitions – need better syntax/semantics?

Future work

On filesystem:

    Extended functionality (directories, etc.)
    Revisiting assumptions about flash H/W
    A wider variety of fault tolerance mechanisms
    Concurrency

On Alloy:

    Syntax/semantics for imperative statements
    Scalability
    Proof
