Formal Modeling and Analysis of a Flash Filesystem in Alloy
Eunsuk Kang
TDS Seminar, Mar. 14, 2008
What is flash memory?
Non-volatile, high-performance storage
Applications: MP3 players, laptop drives, digital cameras, etc.
NASA Mars Exploration Rover Spirit
On-board flash memory to store scientific data
Flash anomaly on Spirit
System failure 18 days after landing (2004)
Loss of communication with Earth, stuck in “reboot” loop
Cause: Flaw in the flash filesystem
Cost: 10 days of lost scientific activity
Testing for the unanticipated?
Out of free space, but still attempted to service file operations
“There was a belief among the FSW development team that the system would not exhibit the behavior that is the root cause of the anomaly…” [Reeves, 2004]
Testing is essential, but is it enough?
Answer: Formal methods?
Allows exhaustive analysis
BUT: Verifying a poorly designed piece of code in an after-the-fact, ad hoc manner is impractical
Apply formal methods early, get the design right
Grand Challenge in Verification
Long term
“Build a verifying compiler” – Tony Hoare
Short term
“Build a verified flash filesystem” – Joshi & Holzmann (Jet Propulsion Laboratory)
In this talk
“Build a verified design for a flash filesystem”
Outline
What is POSIX?
IEEE standard for filesystem operations
Adopted by UNIX, Mac OS X, etc.
Reference model for the flash filesystem
Function signatures & behaviors
e.g. write(fildes, *buf, nbyte, offset)
“The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes.”
POSIX filesystem in Alloy
Alloy: first-order relational logic + transitive closure

sig Data {}   // data element
sig FID {}    // file identifier

sig File {
  contents : seq Data
}

sig AbsFsys {   // abstract filesystem
  fileMap : FID -> lone File   // “lone” means one or zero
}
Abstract read operation
// simulation
run {
  some fsys : AbsFsys, fid : FID, output : seq Data |
    output = readAbs[fsys, fid, 1, 3]
} for 3

fun readAbs [fsys: AbsFsys, fid: FID, offset, size: Int] : seq Data {
  let file = fsys.fileMap[fid] |
    (file.contents).subseq[offset, offset + size - 1]
}
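For readers who prefer something executable, here is a minimal Python sketch of what readAbs computes. It is not part of the talk. It assumes the filesystem is a plain dict from file identifiers to lists, and it uses Python slicing, which silently truncates at the end of the file; the names read_abs and file_map are hypothetical.

```python
from typing import Dict, List


def read_abs(file_map: Dict[str, List[int]], fid: str, offset: int, size: int) -> List[int]:
    """Return the data elements at positions offset .. offset + size - 1 of file `fid`."""
    # fileMap is "lone": a file identifier maps to at most one file.
    # Assumption: a missing identifier behaves like an empty file.
    contents = file_map.get(fid, [])
    # Counterpart of contents.subseq[offset, offset + size - 1].
    return contents[offset:offset + size]


# Same arguments as the simulation command: offset 1, size 3.
fsys = {"f0": [10, 11, 12, 13, 14]}
print(read_abs(fsys, "f0", 1, 3))
```

The function is pure: it inspects one filesystem state and returns a value, which matches the Alloy model using fun rather than pred for reads.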
Abstract write operation
// promotion
pred writePromote[fsys, fsys' : AbsFsys, file, file' : File, fid : FID] {
  fsys'.fileMap = fsys.fileMap ++ (fid -> file')
}

pred writeAbs[fsys, fsys' : AbsFsys, fid : FID, buffer : seq Data, offset, size : Int] {
  let file = fsys.fileMap[fid],
      file' = fsys'.fileMap[fid],
      buffer' = buffer.subseq[0, size - 1] {
    (#buffer' = 0) => file' = file     // case 1
    (#buffer' != 0) =>                 // case 2 & 3
      file'.contents = (zeros[offset] ++ file.contents) ++ shift[buffer', offset]
    writePromote[fsys, fsys', file, file', fid]
  }
}
Alloy is pure logic
No built-in syntax/semantics for state machines
Transition as an explicit constraint between two states
Abstract write operation: Case 1
Input buffer is empty; no changes to the file
Abstract write operation: Case 2
Offset is within the file
Shift buffer by offset & override existing data
Abstract write operation: Case 3
Offset is after the end of the file
Fill in the gap with zeros
Promotion
// promotion
pred writePromote[fsys, fsys' : AbsFsys, file, file' : File, fid : FID] {
  fsys'.fileMap = fsys.fileMap ++ (fid -> file')
}
A style of modeling changes in system state
Ensure all other files remain unchanged
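The three write cases and the promotion step can be replayed in ordinary code. The Python sketch below is an illustration only, not the talk's model: it treats a sequence as a dict from index to element so that the Alloy override operator ++ becomes a dict update, and it assumes 0 as the filler element and a non-negative offset. The names write_contents, write_abs and ZERO are hypothetical.

```python
from typing import Dict, List

ZERO = 0  # assumed filler element for the gap in case 3


def write_contents(contents: List[int], buffer: List[int], offset: int, size: int) -> List[int]:
    """Compute the new file contents for one write (offset assumed >= 0)."""
    data = buffer[:size]                      # buffer' = buffer.subseq[0, size - 1]
    if not data:                              # case 1: nothing to write
        return list(contents)
    relation = {i: ZERO for i in range(offset)}                    # zeros[offset]
    relation.update(enumerate(contents))                           # ++ file.contents
    relation.update({offset + i: x for i, x in enumerate(data)})   # ++ shift[buffer', offset]
    # The keys form the contiguous range 0 .. len(relation) - 1.
    return [relation[i] for i in range(len(relation))]


def write_abs(file_map: Dict[str, List[int]], fid: str,
              buffer: List[int], offset: int, size: int) -> Dict[str, List[int]]:
    """Promotion: return a new map in which only `fid` is overridden."""
    new_contents = write_contents(file_map.get(fid, []), buffer, offset, size)
    return {**file_map, fid: new_contents}


before = {"a": [1, 2, 3, 4], "b": [5]}
print(write_abs(before, "a", [9, 9], 1, 2))   # case 2: offset inside the file
print(write_abs(before, "b", [7], 3, 1))      # case 3: gap filled with zeros
```

Because write_abs builds a new dict instead of mutating the old one, the pre-state and post-state coexist, which is the same two-state view that the fsys / fsys' pair expresses in the Alloy predicate.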
Outline
What makes flash special?
Two types: NOR and NAND
Program (i.e. write) at the page level, erase at the block level
Must erase before programming
Block can be erased only a limited number of times (need wear-leveling)
Modeling memory hierarchy
sig Page { data : seq Data } { #data = PAGE_SIZE }
sig Block { pages : seq Page } { #pages = BLOCK_SIZE }
sig LUN { blocks : seq Block } { #blocks = LUN_SIZE }
sig Device { LUNs : seq LUN … } { #LUNs = DEVICE_SIZE }

// simulation with constraints
run {
  some Device
  DEVICE_SIZE = 1
  LUN_SIZE = 2
  BLOCK_SIZE = 2
  PAGE_SIZE = 4
} for 4
Addressing mode
Row & column addresses:
sig RowAddr {   // used to access a page
  lunIndex : Int,
  blockIndex : Int,
  pageIndex : Int
}
A column address is an Int, and identifies a data element in a page
Example:
rowAddr.lunIndex = 0
rowAddr.blockIndex = 1
rowAddr.pageIndex = 1
columnAddr = 1
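To make the addressing scheme concrete, the following Python sketch builds a device with the sizes from the simulation command (1 LUN, 2 blocks per LUN, 2 pages per block, 4 data elements per page) and resolves the example address. It is an illustration under assumed names (make_device, page_at, element_at); the nested-list layout is a choice made here, not something the talk prescribes.

```python
from typing import List, NamedTuple


class RowAddr(NamedTuple):
    """Row address: selects one page inside the device."""
    lun_index: int
    block_index: int
    page_index: int


def make_device(device_size: int, lun_size: int, block_size: int, page_size: int) -> List:
    """Build device -> LUNs -> blocks -> pages -> data elements, all zeroed."""
    return [[[[0] * page_size for _ in range(block_size)]
             for _ in range(lun_size)]
            for _ in range(device_size)]


def page_at(device: List, row: RowAddr) -> List[int]:
    return device[row.lun_index][row.block_index][row.page_index]


def element_at(device: List, row: RowAddr, column: int) -> int:
    """A column address picks one data element inside the addressed page."""
    return page_at(device, row)[column]


device = make_device(1, 2, 2, 4)
row = RowAddr(lun_index=0, block_index=1, page_index=1)
page_at(device, row)[1] = 7          # store a value at columnAddr = 1
print(element_at(device, row, 1))
```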
Page status & data structures
Each page is associated with its current status

abstract sig PageStatus {}
one sig Free, Allocated, Valid, Invalid extends PageStatus {}

Auxiliary data structures*

sig Device {
  LUNs : seq LUN,
  pageStatusMap : RowAddr -> one PageStatus,
  eraseCountMap : RowAddr -> one Int,   // wear-leveling
  reserveBlock : RowAddr                // garbage collection
} { #LUNs = DEVICE_SIZE }

(* disclaimers)
Flash API functions

// reads data from page, starting at “colAddr”
fun read[d : Device, colAddr : Int, rowAddr : RowAddr] : seq Data { … }

// program data into page & set page status to “Allocated”
pred program[d, d' : Device, colAddr : Int, rowAddr : RowAddr, data : seq Data] { … }

// erase data in block & increase its erase count, and set status of every page in block to “Free”
pred erase[d, d' : Device, rowAddr : RowAddr] { … }
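The bodies of the three API functions are elided on the slide, so the Python sketch below is only a guess at their observable behavior, built from the comments above and the erase-before-program rule. Several simplifications are assumptions made here: the LUN level is dropped, program always writes a whole page (no column address), erase counts are kept per block, and the device is mutated in place rather than modeled as a pre-state and post-state pair.

```python
from typing import Dict, List, Tuple

Addr = Tuple[int, int]  # (block index, page index); hypothetical simplification
FREE, ALLOCATED, VALID, INVALID = "Free", "Allocated", "Valid", "Invalid"


class Flash:
    def __init__(self, blocks: int, pages_per_block: int, page_size: int) -> None:
        self.pages_per_block = pages_per_block
        self.page_size = page_size
        addrs = [(b, p) for b in range(blocks) for p in range(pages_per_block)]
        self.data: Dict[Addr, List[int]] = {a: [0] * page_size for a in addrs}
        self.status: Dict[Addr, str] = {a: FREE for a in addrs}
        self.erase_count: Dict[int, int] = {b: 0 for b in range(blocks)}

    def read(self, addr: Addr, col: int = 0) -> List[int]:
        """Read the page at `addr`, starting at column `col`."""
        return self.data[addr][col:]

    def program(self, addr: Addr, data: List[int]) -> None:
        """Program one full page; only a Free (erased) page may be programmed."""
        if self.status[addr] != FREE:
            raise ValueError("page must be erased before programming")
        if len(data) != self.page_size:
            raise ValueError("data must fill exactly one page")
        self.data[addr] = list(data)
        self.status[addr] = ALLOCATED

    def erase(self, block: int) -> None:
        """Erase every page in the block and bump the block's erase count."""
        for p in range(self.pages_per_block):
            self.data[(block, p)] = [0] * self.page_size
            self.status[(block, p)] = FREE
        self.erase_count[block] += 1


flash = Flash(blocks=2, pages_per_block=2, page_size=4)
flash.program((0, 1), [0, 1, 1, 0])
print(flash.read((0, 1), col=1), flash.status[(0, 1)])
```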
Outline
Abstract vs. concrete filesystem
Concrete filesystem in Alloy

sig Inode { blockList : seq VBlock }
sig VBlock {}   // virtual block

sig ConcFsys {
  inodeMap : FID -> lone Inode,
  blockMap : VBlock one -> one RowAddr
}

Concrete read operation (snippet)

pred readConc[fsys : ConcFsys, d : Device, fid : FID, offset, size : Int, buffer : seq Data] {
  …
  all i : blocksToRead.inds {
    let vblock = blocksToRead[i],
        rowAddr = fsys.blockMap[vblock],
        from = PAGE_SIZE*i,
        to = from + PAGE_SIZE - 1 |
      buffer.subseq[from, to] = read[d, 0, rowAddr]
  }
  …
}
State of a flash filesystem
State is represented by a pair (ConcFsys, Device)
Read operation animated
Initially, buffer is empty
Three calls to flash read in total
Concrete read operation: Step 1
Extract blocks to read from inode using offset & size
Concrete read operation: Step 2
Consider each index i in blocksToRead
Concrete read operation: Step 3
Retrieve the address of page for ith virtual block
Concrete read operation: Step 4
Calculate indices for current buffer slot
Concrete read operation: Step 5
Execute the flash API function, read
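The five steps can be traced in a short Python sketch. The slide elides how blocksToRead is derived, so the index arithmetic in step 1 is an assumption, as are the flat dict representations of the inode's block list, blockMap and the device pages. Like the snippet, the sketch fills the buffer with whole pages and does not trim it to offset and size.

```python
from typing import Dict, List

PAGE_SIZE = 4


def read_conc(block_list: List[str], block_map: Dict[str, str],
              pages: Dict[str, List[int]], offset: int, size: int) -> List[int]:
    if size <= 0:
        return []
    # Step 1 (assumed arithmetic): virtual blocks touched by [offset, offset + size - 1].
    first = offset // PAGE_SIZE
    last = (offset + size - 1) // PAGE_SIZE
    blocks_to_read = block_list[first:last + 1]
    buffer: List[int] = []
    for vblock in blocks_to_read:            # step 2: each index i
        row_addr = block_map[vblock]         # step 3: page address of the ith virtual block
        # Steps 4 and 5: slot i of the buffer receives the full page, as read[d, 0, rowAddr].
        buffer.extend(pages[row_addr])
    return buffer


pages = {"Page5": [0, 1, 1, 0], "Page2": [1, 1, 1, 1], "Page7": [0, 0, 0, 1]}
block_map = {"VBlk0": "Page5", "VBlk1": "Page2", "VBlk2": "Page7"}
inode_blocks = ["VBlk0", "VBlk1", "VBlk2"]
print(read_conc(inode_blocks, block_map, pages, 0, 12))   # three page reads
```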
Wear-leveling
Wear-leveling example
Client sends a write request to overwrite data in VBlk1 with 0110
Simple approach: Erase Block2 & program Page5
Non-wear-leveling approach: Step 1
Step 1: Erase Block2
Non-wear-leveling approach: Step 2
Step 2: Program 0110 into Page5 - Done.
Why wear-level?
What’s wrong with the simple approach?
1. Frequent requests on VBlk1: Block2 wears out quickly
2. H/W failure: Original data in Page5 is lost
Wear-leveling approach
Client sends a write request to overwrite data in VBlk1 with 0110
Wear-leveling approach: Search for a free page & program
Wear-leveling approach: Step 1
Step 1: Program 0110 into a free page, Page3
Wear-leveling approach: Step 2
Step 2: Invalidate Page5 & validate Page3
Wear-leveling approach: Step 3
Step 3: Update blockMap
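A minimal Python sketch of the three wear-leveling steps follows. It is a hedged illustration, not the talk's writeConc: pages are identified by plain integers, the free page is chosen as the lowest-numbered Free page (the slides do not state a selection policy), and the function name overwrite_wear_leveled is hypothetical.

```python
from typing import Dict, List

FREE, ALLOCATED, VALID, INVALID = "Free", "Allocated", "Valid", "Invalid"


def overwrite_wear_leveled(block_map: Dict[str, int], data: Dict[int, List[int]],
                           status: Dict[int, str], vblock: str,
                           new_data: List[int]) -> int:
    """Redirect `vblock` to a freshly programmed page instead of erasing in place."""
    old_page = block_map[vblock]
    free_page = next((p for p in sorted(status) if status[p] == FREE), None)
    if free_page is None:
        raise RuntimeError("no free page: erase-unit reclamation is needed first")
    # Step 1: program the new data into a free page.
    data[free_page] = list(new_data)
    status[free_page] = ALLOCATED
    # Step 2: invalidate the obsolete page and validate the new one.
    status[old_page] = INVALID
    status[free_page] = VALID
    # Step 3: update blockMap so the virtual block points at the new page.
    block_map[vblock] = free_page
    return free_page


status = {3: FREE, 4: VALID, 5: VALID}
data = {3: [0, 0, 0, 0], 4: [1, 1, 1, 1], 5: [1, 0, 0, 1]}
block_map = {"VBlk0": 4, "VBlk1": 5}
print(overwrite_wear_leveled(block_map, data, status, "VBlk1", [0, 1, 1, 0]), block_map)
```

Note that the old page keeps its contents until its block is erased, so a failure during the write does not destroy the original data, which addresses the second objection to the simple approach.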
Erase-unit reclamation (garbage collection)
Erase-unit reclamation example
Client sends a write request to append 0101 at the end of the inode
Problem: Flash is out of free pages (besides reserved ones)
Erase-unit reclamation: Step 1
Step 1: Pick a dirty block with the lowest erase count
Erase-unit reclamation: Step 2
Step 2: Relocate valid data to reserveBlock
Erase-unit reclamation: Step 3
Step 3: Invalidate/validate pages & update blockMap
Erase-unit reclamation: Step 4
Step 4: Erase Block2 & set it as the new reserveBlock
Erase-unit reclamation complete
Complete: Page0 in Block0 is now free and available for use
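The four reclamation steps can be sketched in Python as follows. This is an assumption-laden illustration: a block counts as dirty when it holds at least one Invalid page, erase counts are kept per block, pages are addressed as (block, page) pairs without a LUN level, and the Device dataclass and reclaim function are names invented here. The example layout is also invented and does not reproduce the figure on the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Addr = Tuple[int, int]  # (block index, page index)
FREE, VALID, INVALID = "Free", "Valid", "Invalid"


@dataclass
class Device:
    data: Dict[Addr, List[int]]
    status: Dict[Addr, str]
    erase_count: Dict[int, int]
    reserve_block: int
    pages_per_block: int


def reclaim(dev: Device, block_map: Dict[str, Addr]) -> int:
    n = dev.pages_per_block
    # Step 1: among dirty blocks (reserve excluded), pick the lowest erase count.
    dirty = [b for b in dev.erase_count
             if b != dev.reserve_block
             and any(dev.status[(b, p)] == INVALID for p in range(n))]
    if not dirty:
        raise RuntimeError("no dirty block to reclaim")
    victim = min(dirty, key=lambda b: dev.erase_count[b])
    owner = {addr: vblock for vblock, addr in block_map.items()}
    next_slot = 0
    for p in range(n):
        src = (victim, p)
        if dev.status[src] == VALID:
            # Step 2: relocate valid data into the reserve block.
            dst = (dev.reserve_block, next_slot)
            next_slot += 1
            dev.data[dst] = list(dev.data[src])
            # Step 3: invalidate/validate pages and update blockMap.
            dev.status[src] = INVALID
            dev.status[dst] = VALID
            if src in owner:
                block_map[owner[src]] = dst
    # Step 4: erase the victim and make it the new reserve block.
    for p in range(n):
        dev.data[(victim, p)] = []
        dev.status[(victim, p)] = FREE
    dev.erase_count[victim] += 1
    dev.reserve_block = victim
    return victim


def example() -> Tuple[Device, Dict[str, Addr]]:
    status = {(0, 0): VALID, (0, 1): INVALID, (1, 0): INVALID, (1, 1): VALID,
              (2, 0): VALID, (2, 1): VALID, (3, 0): FREE, (3, 1): FREE}
    data = {addr: [addr[0], addr[1]] for addr in status}
    dev = Device(data, status, {0: 3, 1: 1, 2: 0, 3: 0}, reserve_block=3, pages_per_block=2)
    return dev, {"a": (0, 0), "b": (1, 1), "c": (2, 0), "d": (2, 1)}


dev, block_map = example()
print(reclaim(dev, block_map), block_map["b"], dev.reserve_block)
```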
Concrete write operation
pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] { … }
Transition between two pairs (fsys, d) and (fsys’, d’)
Flash API program is a single-step transition between two device states
PAGE_SIZE = 4
Write operation: Phase 1
Partition input buffer into N fragments & program them
1. Introduce an intermediate device, interDev
2. Create a sequence of states between d and interDev using seq Device

pred stateSeqConds[init, final : Device, stateSeq : seq Device, length : Int] {
  stateSeq.first = init
  stateSeq.last = final
  #stateSeq = length + 1
}

3. Constrain the sequence
4. Program fragments one by one
Write operation: Phase 1.1
pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] {
  …
  some stateSeq : seq Device, interDev : Device {
    stateSeqConds[d, interDev, stateSeq, numBlocksToProgram]
    all i : stateSeq.butlast.inds {
      let from = PAGE_SIZE * i,
          to = from + PAGE_SIZE - 1,
          dataFragment = buffer.subseq[from, to],
          vblock = inode.blockList[startBlkIndex + i],
          rowAddr = fsys.blockMap[vblock],
          preState = stateSeq[i],
          postState = stateSeq[i + 1] |
        programPage[preState, postState, rowAddr, dataFragment]
    }
    …
Introduce & constrain intermediate device states
Write operation: Phase 1.2
For each sequence index i, extract a data fragment from buffer
Write operation: Phase 1.3
Retrieve the address of page for ith virtual block (could be empty)
Write operation: Phase 1.4
Retrieve the current pair of pre- and post- states
Write operation: Phase 1.5
Program data fragment into page at rowAddr
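Phase 1 chains single-step program transitions through a sequence of device states. The Python sketch below mirrors that idea under simplifying assumptions: a device state is an immutable-by-convention dict from row address to page contents, the target row addresses are passed in directly instead of being looked up through the inode and blockMap, and program_page, phase1 and state_seq_conds are hypothetical names.

```python
from typing import Dict, List, Tuple

PAGE_SIZE = 4
DeviceState = Dict[str, Tuple[int, ...]]  # row address -> page contents


def program_page(pre: DeviceState, row_addr: str, fragment: List[int]) -> DeviceState:
    """Single-step transition: the post-state differs from `pre` in one page only."""
    post = dict(pre)
    post[row_addr] = tuple(fragment)
    return post


def phase1(d: DeviceState, buffer: List[int], row_addrs: List[str]) -> List[DeviceState]:
    """Return the state sequence d = s0, s1, ..., sN = interDev."""
    state_seq = [d]
    for i, row_addr in enumerate(row_addrs):
        start = PAGE_SIZE * i
        fragment = buffer[start:start + PAGE_SIZE]       # ith data fragment
        state_seq.append(program_page(state_seq[i], row_addr, fragment))
    return state_seq


def state_seq_conds(init: DeviceState, final: DeviceState,
                    state_seq: List[DeviceState], length: int) -> bool:
    return (state_seq[0] == init and state_seq[-1] == final
            and len(state_seq) == length + 1)


d: DeviceState = {}
states = phase1(d, [0, 1, 1, 0, 1, 1, 1, 1], ["Page3", "Page4"])
inter_dev = states[-1]
print(inter_dev, state_seq_conds(d, inter_dev, states, 2))
```

In the Alloy model the sequence is constrained declaratively and the analyzer finds states that satisfy it; here the sequence is computed forward, which is only one of the behaviors the constraint admits.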
Write operation: Phase 2
Invalidate obsolete pages & validate all allocated pages by updating interDev.pageStatusMap

pred writeConc[fsys, fsys' : ConcFsys, d, d' : Device, fid : FID, buffer : seq Data, offset, size : Int] {
  …
  some stateSeq : seq Device, interDev : Device {
    …
    updatePageStatus[interDev, d']
    updateFilesystemInfo[fsys, fsys']
  }
  …
}
Write operation: Phase 3
Update filesystem information (blockMap & inode.blockList)
Fault Tolerance
What happens when H/W loses power in the middle of a write operation?
On recovery, the filesystem must be in a state as if:
1. the operation has never begun, or
2. the operation has successfully completed
Power loss may occur either in Phase 1 or Phase 2
Phase 1 crash
At the time of failure, one or more pages programmed & status set to Allocated.
Recovery: Invalidate every allocated page
Recovery from Phase 1 crash
After recovery, the filesystem is in the original state (but has extra invalid pages)
Phase 2 crash
At the time of failure:
1. some/all obsolete pages have been invalidated
2. all obsolete pages have been invalidated, and some allocated pages have been validated
Recovery: Complete the rest of Phase 2 & Phase 3
Recovery from Phase 2
After recovery, the inode contains the new data as expected by the caller of writeConc
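The two recovery rules can be written down as small Python functions. This is a sketch under assumptions the slides do not settle: pages are plain integers, and the Phase 2 recovery is handed a `pending` map from each virtual block to its newly programmed page, since the talk does not say how the recovery procedure learns which pages belong to the interrupted write. Both function names are hypothetical.

```python
from typing import Dict, Tuple

FREE, ALLOCATED, VALID, INVALID = "Free", "Allocated", "Valid", "Invalid"


def recover_phase1(status: Dict[int, str]) -> Dict[int, str]:
    """Roll back: every Allocated page becomes Invalid; blockMap is untouched."""
    return {page: (INVALID if s == ALLOCATED else s) for page, s in status.items()}


def recover_phase2(status: Dict[int, str], block_map: Dict[str, int],
                   pending: Dict[str, int]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Roll forward: finish Phase 2 (page statuses) and Phase 3 (blockMap)."""
    new_status = dict(status)
    new_map = dict(block_map)
    for vblock, new_page in pending.items():
        old_page = block_map.get(vblock)
        if old_page is not None:
            new_status[old_page] = INVALID   # no effect if already invalidated
        new_status[new_page] = VALID         # no effect if already validated
        new_map[vblock] = new_page
    return new_status, new_map


# Crash in Phase 1: Page3 was programmed, Page5 still holds the valid old data.
print(recover_phase1({3: ALLOCATED, 5: VALID}))
# Crash in Phase 2: Page5 already invalidated, Page3 not yet validated.
print(recover_phase2({3: ALLOCATED, 5: INVALID}, {"VBlk1": 5}, {"VBlk1": 3}))
```

Both functions are idempotent, so a second power loss during recovery can be handled by running the same recovery again.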
Outline
Refinement: Trace inclusion
Does the concrete filesystem conform to the abstract filesystem?
Abstraction function

pred alpha[asys : AbsFsys, csys : ConcFsys, d : Device] {
  all fid : FID |
    let file = asys.fileMap[fid],
        inode = csys.inodeMap[fid],
        vblocks = inode.blockList {
      #file.contents = #vblocks * PAGE_SIZE
      (all i : vblocks.inds |
        let vblock = vblocks[i],
            from = i * PAGE_SIZE,
            to = from + PAGE_SIZE - 1,
            absDataFrag = file.contents.subseq[from, to],
            concDataFrag = findPageData[vblock, csys, d] |
          absDataFrag = concDataFrag)
    }
}
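Read as a check rather than a constraint, alpha says that each abstract file is exactly the concatenation of the pages behind its inode's virtual blocks. The Python sketch below expresses that check under assumed representations: dicts for fileMap, inodeMap (file identifier to list of virtual blocks) and blockMap, and a direct page lookup standing in for findPageData, whose body is not shown in the talk.

```python
from typing import Dict, List

PAGE_SIZE = 4


def alpha(abs_map: Dict[str, List[int]], inode_map: Dict[str, List[str]],
          block_map: Dict[str, str], pages: Dict[str, List[int]]) -> bool:
    """True when the abstract state is the abstraction of the concrete state."""
    for fid in set(abs_map) | set(inode_map):
        contents = abs_map.get(fid, [])
        vblocks = inode_map.get(fid, [])
        # #file.contents = #vblocks * PAGE_SIZE
        if len(contents) != len(vblocks) * PAGE_SIZE:
            return False
        for i, vblock in enumerate(vblocks):
            start = i * PAGE_SIZE
            abs_frag = contents[start:start + PAGE_SIZE]
            conc_frag = list(pages[block_map[vblock]])   # stand-in for findPageData
            if abs_frag != conc_frag:
                return False
    return True


pages = {"Page5": [0, 1, 1, 0], "Page2": [1, 1, 1, 1]}
block_map = {"VBlk0": "Page5", "VBlk1": "Page2"}
inode_map = {"f0": ["VBlk0", "VBlk1"]}
print(alpha({"f0": [0, 1, 1, 0, 1, 1, 1, 1]}, inode_map, block_map, pages))
```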
Write refinement
assert WriteRefinement {
  all csys, csys' : ConcFsys, asys, asys' : AbsFsys, d, d' : Device,
      fid : FID, buffer : seq Data, offset, size : Int |
    concInvariant[csys, d] and
    writeConc[csys, csys', d, d', fid, buffer, offset, size] and
    alpha[asys, csys, d] and
    alpha[asys', csys', d']
      => writeAbs[asys, asys', fid, buffer, offset, size]
}
State invariant
…
all inode : FID.(csys.inodeMap) |
  all rowAddr : csys.blockMap[inode.blockList.elems] |
    d.pageStatusMap[rowAddr] = Valid
…
e.g. All pages within an inode have a valid status
Analysis results
WriteRefinement:
A scope of 5 for each domain
6 pages, each with 4 data elements
Incremental modeling & analysis
Found over 20 bugs during development
Final version returned no counterexample, approximately 8 hours to check
ReadRefinement:
Final version returned no counterexample, approximately 45 minutes to check
Discussion & future work
On analysis:
Our filesystem is small, but still found bugs
Many bugs occur in “boundary” cases, involving a small number of components
Scientific argument for confidence?
On the Alloy language:
Explicitly modeling state transitions – need better syntax/semantics?
Discussion
On filesystem:
Extended functionality (directories, etc.)
Revisiting assumptions about flash H/W
A wider variety of fault tolerance mechanisms
Concurrency
On Alloy:
Syntax/semantics for imperative statements
Scalability
Proof
Future work