the page cache - computer science · recap of previous lectures • page tables: translate virtual...
TRANSCRIPT
3/16/16
1
CSE506:Opera.ngSystems
ThePageCache
DonPorter
1
CSE506:Opera.ngSystems
LogicalDiagram
MemoryManagement
CPUScheduler
User
Kernel
Hardware
BinaryFormats
Consistency
SystemCalls
Interrupts Disk Net
RCU FileSystem
DeviceDrivers
Networking Sync
MemoryAllocators Threads
Today’sLecture(kernellevelmem.management)
2
CSE506:Opera.ngSystems
Recapofpreviouslectures• Pagetables:translatevirtualaddressestophysicaladdresses
• VMAreas(Linux):trackwhatshouldbemappedatinthevirtualaddressspaceofaprocess
• Hoard/Linuxslab:EfficientallocaVonofobjectsfromasuperblock/slabofpages
3
CSE506:Opera.ngSystems
Background• Lab2:TrackphysicalpageswithanarrayofPageInfostructs– Containsreferencecounts– Freelistlayeredoverthisarray
• JustlikeJOS,Linuxrepresentsphysicalmemorywithanarrayofpagestructs– Obviously,nottheexactsamecontents,butsameidea
• Pagescanbeallocatedtoprocesses,ortocachefiledatainmemory
4
CSE506:Opera.ngSystems
Today’sProblem• GivenaVMAorafile’sinode,howdoIfigureoutwhichphysicalpagesarestoringitsdata?
• Nextlecture:Wewillgotheotherway,fromaphysicalpagebacktotheVMAorfileinode
5
CSE506:Opera.ngSystems
TheaddressspaceabstracVon• UnifyingabstracVon:– Eachfileinodehasanaddressspace(0—filesize)– SodoblockdevicesthatcachedatainRAM(0---devsize)– The(anonymous)virtualmemoryofaprocesshasanaddressspace(0—4GBonx86)
• Inotherwords,allpagemappingscanbethoughtofasand(object,offset)tuple– Makesense?
6
3/16/16
2
CSE506:Opera.ngSystems
AddressSpacesfor:• VMAreas(VMAs)• Files
7
CSE506:Opera.ngSystems
StartSimple• “Anonymous”memory–nofilebackingit– E.g.,thestackforaprocess
• Notsharedbetweenprocesses– Willdiscusssharingandswappinglater
• Howdowefigureoutvirtualtophysicalmapping?– Justwalkthepagetables!
• Linuxdoesn’tdoanythingoutsideofthepagetablestotrackthismapping
8
CSE506:Opera.ngSystems
Filemappings• AVMAcanalsorepresentamemorymappedfile• Thekernelcanalsomapfilepagestoserviceread()orwrite()systemcalls
• Goal:Weonlywanttoloadafileintomemoryonce!
9
CSE506:Opera.ngSystems
LogicalView
DiskHello!
Foo.txtinode
?ProcessA
?
ProcessB
?
ProcessC
?
10
CSE506:Opera.ngSystems
VMAtoafile• Alsoeasy:VMAincludesafilepointerandanoffsetintofile– AVMAmaymaponlypartofthefile– Offsetmustbeatpagegranularity– Anonymousmapping:filepointerisnull
• Filepointerisanopenfiledescriptorintheprocessfiledescriptortable– Wewilldiscussfilehandleslater
11
CSE506:Opera.ngSystems
LogicalView
Disk
Hello! Foo.txtinode
?ProcessA
ProcessB
ProcessC
FileDescriptorTable
FDsareprocess-specific
12
3/16/16
3
CSE506:Opera.ngSystems
Trackingfilepages• Whatdatastructuretouseforafile?– Nopagetablesforfiles
• Forexample:Whatpagestoresthefirst4koffile“foo”
• Whatdatastructuretouse?– Hint:Filescanbesmall,orvery,verylarge
13
CSE506:Opera.ngSystems
TheRadixTree• Aspace-opVmizedtrie– Trie:RatherthanstoreenVrekeyineachnode,traversalofparent(s)buildsaprefix,nodejuststoressuffix• Especiallyusefulforstrings
– Prefixlessimportantforfileoffsets,butdoesboundkeystoragespace
• Moreimportant:Atreewithabranchingfactork>2– Fasterlookupforlargefiles(esp.withtricks)
• Note:Linux’suseoftheRadixtreeisconstrained
14
CSE506:Opera.ngSystems
From“UnderstandingtheLinuxKernel”
This is the Title of the Book, eMatter EditionCopyright © 2007 O’Reilly & Associates, Inc. All rights reserved.
The Page Cache | 605
not NULL, and tags is a two-component array of flags that will be discussed in the sec-tion “The Tags of the Radix Tree” later in this chapter. The root of the tree is repre-sented by a radix_tree_root data structure, having three fields: height denotes thecurrent tree’s height (number of levels excluding the leaves), gfp_mask specifies theflags used when requesting memory for a new node, and rnode points to the radix_tree_node data structure corresponding to the node at level 1 of the tree (if any).
Let us consider a simple example. If none of the indices stored in the tree is greaterthan 63, the tree height is equal to one, because the 64 potential leaves can all bestored in the node at level 1 (see Figure 15-1 (a)). If, however, a new page descriptorcorresponding to index 131 must be stored in the page cache, the tree height isincreased to two, so that the radix tree can pinpoint indices up to 4095 (seeFigure 15-1(b)).
Table 15-3 shows the highest page index and the corresponding maximum file sizefor each given height of the radix tree on a 32-bit architecture. In this case, the maxi-mum height of a radix tree is six, although it is quite unlikely that the page cache ofyour system will make use of a radix tree that huge. Because the page index is storedin a 32-bit variable, when the tree has height equal to six, the node at the highestlevel can have at most four children.
Figure 15-1. Two examples of a radix tree
(a) radix tree of height 1 (b) radix tree of height 2
radix_tree_root
height = 1
radix_tree_node
rnode
count = 2
slots[0] slots[4]
index = 0
page
...
index = 4
0 63
radix_tree_root
height = 2
radix_tree_node
rnode
count = 2
...0 63
radix_tree_nodecount = 2
slots[0] slots[4]
index = 0
...
index = 4
0 63
slots[0]
radix_tree_nodecount = 1
slots[3]
index = 131
...0 63
slots[2]
page page page page
15
CSE506:Opera.ngSystems
Abitmoredetail• Assumeanupperboundonfilesizewhenbuildingtheradixtree– Canrebuildlaterifwearewrong
• Specifically:Maxsizeis256k,branchingfactor(k)=64
• 256k/4kpages=64pages– Soweneedaradixtreeofheight1torepresentthesepages
16
CSE506:Opera.ngSystems
Treeofheight1• Roothas64slots,canbenull,orapointertoapage• LookupaddressX:– Shipofflow12bits(offsetwithinpage)– Usenext6bitsasanindexintotheseslots(2^6=64)– Ifpointernon-null,gotothechildnode(page)– Ifnull,pagedoesn’texist
17
CSE506:Opera.ngSystems
Treeofheightn• Similarstory:– Shipofflow12bits
• Ateachchildshipoff6bitsfrommiddle(starVngat6*(distancetothebosom–1)bits)tofindwhichofthe64potenValchildrentogoto– Usefixedheighttofigureoutwheretostop,whichbitstouseforoffset
• ObservaVons:– “Key”ateachnodeimplicitbasedonposiVonintree– LookupVmeconstantinheightoftree
• Inageneral-purposeradixtree,mayhavetocheckallkchildren,forhigherlookupcost
18
3/16/16
4
CSE506:Opera.ngSystems
Fixedheights• Ifthefilesizegrowsbeyondmaxheight,mustgrowthetree
• RelaVvelysimple:Addanotherroot,previoustreebecomesfirstchild
• Scalinginheight:– 1:2^((6*1)+12)=256KB– 2:2^((6*2)+12)=16MB– 3:2^((6*3)+12)=1GB– 4:2^((6*4)+12)=64GB– 5:2^((6*5)+12)=4TB
19
CSE506:Opera.ngSystems
Backtoaddressspaces• Eachaddressspaceforafilecachedinmemoryincludesaradixtree– Radixtreeissparse:pagesnotinmemoryaremissing
• Radixtreealsosupportstags:suchasdirty– Atreenodeistaggedifatleastonechildalsohasthetag
• Example:Itagafilepagedirty– Musttageachparentintheradixtreeasdirty– WhenIamfinishedwriVngpageback,Imustcheckallsiblings;ifnonedirty,cleartheparent’sdirtytag
20
CSE506:Opera.ngSystems
LogicalView
Disk
Hello!
Foo.txtinode ProcessA
ProcessB
ProcessC
AddressSpace
RadixTree
21
CSE506:Opera.ngSystems
Recap• Anonymouspage:Justusethepagetables• File-backedmapping– VMA->openfiledescriptor->inode– Inode->addressspace(radixtree)->page
22
CSE506:Opera.ngSystems
Problem2:Dirtypages• MostOSesdonotwritefileupdatestodiskimmediately– (Laterlecture)OStriestoopVmizediskarmmovement
• OSinsteadtracks“dirty”pages– Ensuresthatwritebackisn’tdelayedtoolong
• Lestdatabelostinacrash
• ApplicaVoncanforceimmediatewritebackwithsyncsystemcalls(andsomeopen/mmapopVons)
23
CSE506:Opera.ngSystems
Syncsystemcalls• sync()–Flushalldirtybufferstodisk• fsync(fd)–Flushalldirtybuffersassociatedwiththisfiletodisk(includingchangestotheinode)
• fdatasync(fd)–Flushonlydirtydatapagesforthisfiletodisk– Don’tbotherwiththeinode
24
3/16/16
5
CSE506:Opera.ngSystems
Howtoimplementsync?• Goal:keepoverheadsoffindingdirtyblockslow– Anaïvescanofallpageswouldwork,butexpensive– Lotsofcleanpages
• Idea:keeptrackofdirtydatatominimizeoverheads– Abitofextraworkonthewritepath,ofcourse
25
CSE506:Opera.ngSystems
Howtoimplementsync?• Background:Eachfilesystemhasasuperblock– Allsuperblocksinalist
• Eachsuperblockkeepsalistofdirtyinodes• Inodesandsuperblocksbothmarkeddirtyuponuse
26
CSE506:Opera.ngSystems
FSOrganizaVon
SB/
SB/floppy
SB/d1
OneSuperblockperFS
inode
Dirtylist
Dirtylistofinodes
Inodesandradixnodes/pagesmarkeddirtyseparately
27
CSE506:Opera.ngSystems
Simpletraversalforeachsinsuperblocklist:
if(s->dirty)writebacksforiininodelist: if(i->dirty)writebacki if(i->radix_root->dirty): //RecursivelytraversetreewriVng //dirtypagesandclearingdirtyflag
28
CSE506:Opera.ngSystems
Asynchronousflushing• Kernelthread(s):pdflush– Akernelthreadisataskthatonlyrunsinthekernel’saddressspace
– 2-8threads,dependingonhowbusy/idlethreadsare• Whenpdflushruns,itisgivenatargetnumberofpagestowriteback– Kernelmaintainsatotalnumberofdirtypages– AdministratorconfiguresatargetdirtyraVo(say10%)
29
CSE506:Opera.ngSystems
pdflush• Whenpdflushisscheduled,itfiguresouthowmanydirtypagesareabovethetargetraVo
• WritesbackpagesunVlitmeetsitsgoalorcan’twritemoreback– (Somepagesmaybelocked,justskipthose)
• Sametraversalassync()+acountofwrisenpages– Usuallyquitsearlier
30
3/16/16
6
CSE506:Opera.ngSystems
Howlongdirty?• Linuxhassomeinode-specificbookkeepingaboutwhenthingsweredirVed
• pdflushalsochecksforanyinodesthathavebeendirtylongerthan30seconds– Writesthesebackevenifquotawasmet
• NotthestrongestguaranteeI’veeverseen…
31
CSE506:Opera.ngSystems
Butwheretowrite?• Ok,soIseehowtofindthedirtypages• Howdoesthekernelknowwhereondisktowritethem?– Andwhichdiskforthatmaser?
• Superblocktracksdevice• Inodetracksmappingfromfileoffsettosector
32
CSE506:Opera.ngSystems
Blocksizemismatch• Mostdiskshave512byteblocks;pagesaregenerally4K– Somenew“green”diskshave4Kblocks– Perpageincache–usually8diskblocks
• Whenblocksdon’tmatch,whatdowedo?– Simpleanswer:Justwriteall8!– Butthisisexpensive–ifonlyoneblockchanged,weonlywanttowriteoneblockback
33
CSE506:Opera.ngSystems
Bufferhead• Simpleidea:foreverypagebackedbydisk,storeanextradatastructureforeachdiskblock,calledabuffer_head
• Ifapagestores8diskblocks,ithas8bufferheads• Example:write()systemcallforfirst5bytes– Lookupfirstpageinradixtree– Modifypage,markdirty– Onlymarkfirstbufferheaddirty
34
CSE506:Opera.ngSystems
From“UnderstandingtheLinuxKernel”
This is the Title of the Book, eMatter EditionCopyright © 2007 O’Reilly & Associates, Inc. All rights reserved.
Storing Blocks in the Page Cache | 615
buffer page points to the buffer head of the first block in the page;* every buffer headstores in the b_this_page field a pointer to the next buffer head in the list. Moreover,every buffer head stores the address of the buffer page’s descriptor in the b_page field.Figure 15-2 shows a buffer page containing four block buffers and the correspond-ing buffer heads.
Allocating Block Device Buffer PagesThe kernel allocates a new block device buffer page when it discovers that the pagecache does not include a page containing the buffer for a given block (see the section“Searching Blocks in the Page Cache” later in this chapter). In particular, the lookupoperation for the block might fail for the following reasons:
1. The radix tree of the block device does not include a page containing the data ofthe block: in this case a new page descriptor must be added to the radix tree.
2. The radix tree of the block device includes a page containing the data of the block,but this page is not a buffer page: in this case new buffer heads must be allocatedand linked to the page, thus transforming it into a block device buffer page.
3. The radix tree of the block device includes a buffer page containing the data ofthe block, but the page has been split in blocks of size different from the size ofthe requested block: in this case the old buffer heads must be released, and anew set of buffer heads must be allocated and linked to the page.
* Because the private field contains valid data, the PG_private flag of the page is also set; hence, if the pagecontains disk data and the PG_private flag is set, then the page is a buffer page. Notice, however, that otherkernel components not related to the block I/O subsystem use the private and PG_private fields for otherpurposes.
Figure 15-2. A buffer page including four buffers and their buffer heads
Page descriptor
b_dataprivateb_this_page
Buffer
Buffer
Buffer
Buffer
Page
Buffer head
Buffer head
Buffer head
Buffer head
b_page
35
CSE506:Opera.ngSystems
Moreonbufferheads• Onwrite-back(sync,pdflush,etc),onlywritedirtybufferheads
• Tolookupagivendiskblockforafile,mustdividebybufferheadsperpage– Ex:diskblock25ofafileisinpage3intheradixtree
• Note:memorymappedfilesmarkall8buffer_headsdirty.Why?– Canonlydetectwriteregionsviapagefaults
36
3/16/16
7
CSE506:Opera.ngSystems
Summary• Seenhowmappingsoffiles/diskstocachepagesaretracked– Andhowdirtypagesaretagged– Radixtreebasics
• Whenandhowdirtydataiswrisenbacktodisk• Howdifferencebetweendisksectorandpagesizesarehandled
37