the page cache - computer science · recap of previous lectures • page tables: translate virtual...

7
3/16/16 1 CSE 506: Opera.ng Systems The Page Cache Don Porter 1 CSE 506: Opera.ng Systems Logical Diagram Memory Management CPU Scheduler User Kernel Hardware Binary Formats Consistency System Calls Interrupts Disk Net RCU File System Device Drivers Networking Sync Memory Allocators Threads Today’s Lecture (kernel level mem. management) 2 CSE 506: Opera.ng Systems Recap of previous lectures Page tables: translate virtual addresses to physical addresses VM Areas (Linux): track what should be mapped at in the virtual address space of a process Hoard/Linux slab: Efficient allocaVon of objects from a superblock/slab of pages 3 CSE 506: Opera.ng Systems Background Lab2: Track physical pages with an array of PageInfo structs Contains reference counts Free list layered over this array Just like JOS, Linux represents physical memory with an array of page structs Obviously, not the exact same contents, but same idea Pages can be allocated to processes, or to cache file data in memory 4 CSE 506: Opera.ng Systems Today’s Problem Given a VMA or a file’s inode, how do I figure out which physical pages are storing its data? Next lecture: We will go the other way, from a physical page back to the VMA or file inode 5 CSE 506: Opera.ng Systems The address space abstracVon Unifying abstracVon: Each file inode has an address space (0—file size) So do block devices that cache data in RAM (0---dev size) The (anonymous) virtual memory of a process has an address space (0—4GB on x86) In other words, all page mappings can be thought of as and (object, offset) tuple Make sense? 6

Upload: others

Post on 19-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Page Cache - Computer Science · Recap of previous lectures • Page tables: translate virtual addresses to physical addresses • VM Areas (Linux): track what should be mapped

3/16/16

1

CSE506:Opera.ngSystems

ThePageCache

DonPorter

1

CSE506:Opera.ngSystems

LogicalDiagram

MemoryManagement

CPUScheduler

User

Kernel

Hardware

BinaryFormats

Consistency

SystemCalls

Interrupts Disk Net

RCU FileSystem

DeviceDrivers

Networking Sync

MemoryAllocators Threads

Today’sLecture(kernellevelmem.management)

2

CSE506:Opera.ngSystems

Recapofpreviouslectures•  Pagetables:translatevirtualaddressestophysicaladdresses

•  VMAreas(Linux):trackwhatshouldbemappedatinthevirtualaddressspaceofaprocess

•  Hoard/Linuxslab:EfficientallocaVonofobjectsfromasuperblock/slabofpages

3

CSE506:Opera.ngSystems

Background•  Lab2:TrackphysicalpageswithanarrayofPageInfostructs–  Containsreferencecounts–  Freelistlayeredoverthisarray

•  JustlikeJOS,Linuxrepresentsphysicalmemorywithanarrayofpagestructs–  Obviously,nottheexactsamecontents,butsameidea

•  Pagescanbeallocatedtoprocesses,ortocachefiledatainmemory

4

CSE506:Opera.ngSystems

Today’sProblem•  GivenaVMAorafile’sinode,howdoIfigureoutwhichphysicalpagesarestoringitsdata?

•  Nextlecture:Wewillgotheotherway,fromaphysicalpagebacktotheVMAorfileinode

5

CSE506:Opera.ngSystems

TheaddressspaceabstracVon•  UnifyingabstracVon:–  Eachfileinodehasanaddressspace(0—filesize)–  SodoblockdevicesthatcachedatainRAM(0---devsize)–  The(anonymous)virtualmemoryofaprocesshasanaddressspace(0—4GBonx86)

•  Inotherwords,allpagemappingscanbethoughtofasand(object,offset)tuple– Makesense?

6

Page 2: The Page Cache - Computer Science · Recap of previous lectures • Page tables: translate virtual addresses to physical addresses • VM Areas (Linux): track what should be mapped

3/16/16

2

CSE506:Opera.ngSystems

AddressSpacesfor:•  VMAreas(VMAs)•  Files

7

CSE506:Opera.ngSystems

StartSimple•  “Anonymous”memory–nofilebackingit–  E.g.,thestackforaprocess

•  Notsharedbetweenprocesses– Willdiscusssharingandswappinglater

•  Howdowefigureoutvirtualtophysicalmapping?–  Justwalkthepagetables!

•  Linuxdoesn’tdoanythingoutsideofthepagetablestotrackthismapping

8

CSE506:Opera.ngSystems

Filemappings•  AVMAcanalsorepresentamemorymappedfile•  Thekernelcanalsomapfilepagestoserviceread()orwrite()systemcalls

•  Goal:Weonlywanttoloadafileintomemoryonce!

9

CSE506:Opera.ngSystems

LogicalView

DiskHello!

Foo.txtinode

?ProcessA

?

ProcessB

?

ProcessC

?

10

CSE506:Opera.ngSystems

VMAtoafile•  Alsoeasy:VMAincludesafilepointerandanoffsetintofile–  AVMAmaymaponlypartofthefile–  Offsetmustbeatpagegranularity–  Anonymousmapping:filepointerisnull

•  Filepointerisanopenfiledescriptorintheprocessfiledescriptortable– Wewilldiscussfilehandleslater

11

CSE506:Opera.ngSystems

LogicalView

Disk

Hello! Foo.txtinode

?ProcessA

ProcessB

ProcessC

FileDescriptorTable

FDsareprocess-specific

12

Page 3: The Page Cache - Computer Science · Recap of previous lectures • Page tables: translate virtual addresses to physical addresses • VM Areas (Linux): track what should be mapped

3/16/16

3

CSE506:Opera.ngSystems

Trackingfilepages•  Whatdatastructuretouseforafile?–  Nopagetablesforfiles

•  Forexample:Whatpagestoresthefirst4koffile“foo”

•  Whatdatastructuretouse?–  Hint:Filescanbesmall,orvery,verylarge

13

CSE506:Opera.ngSystems

TheRadixTree•  Aspace-opVmizedtrie–  Trie:RatherthanstoreenVrekeyineachnode,traversalofparent(s)buildsaprefix,nodejuststoressuffix•  Especiallyusefulforstrings

–  Prefixlessimportantforfileoffsets,butdoesboundkeystoragespace

•  Moreimportant:Atreewithabranchingfactork>2–  Fasterlookupforlargefiles(esp.withtricks)

•  Note:Linux’suseoftheRadixtreeisconstrained

14

CSE506:Opera.ngSystems

From“UnderstandingtheLinuxKernel”

This is the Title of the Book, eMatter EditionCopyright © 2007 O’Reilly & Associates, Inc. All rights reserved.

The Page Cache | 605

not NULL, and tags is a two-component array of flags that will be discussed in the sec-tion “The Tags of the Radix Tree” later in this chapter. The root of the tree is repre-sented by a radix_tree_root data structure, having three fields: height denotes thecurrent tree’s height (number of levels excluding the leaves), gfp_mask specifies theflags used when requesting memory for a new node, and rnode points to the radix_tree_node data structure corresponding to the node at level 1 of the tree (if any).

Let us consider a simple example. If none of the indices stored in the tree is greaterthan 63, the tree height is equal to one, because the 64 potential leaves can all bestored in the node at level 1 (see Figure 15-1 (a)). If, however, a new page descriptorcorresponding to index 131 must be stored in the page cache, the tree height isincreased to two, so that the radix tree can pinpoint indices up to 4095 (seeFigure 15-1(b)).

Table 15-3 shows the highest page index and the corresponding maximum file sizefor each given height of the radix tree on a 32-bit architecture. In this case, the maxi-mum height of a radix tree is six, although it is quite unlikely that the page cache ofyour system will make use of a radix tree that huge. Because the page index is storedin a 32-bit variable, when the tree has height equal to six, the node at the highestlevel can have at most four children.

Figure 15-1. Two examples of a radix tree

(a) radix tree of height 1 (b) radix tree of height 2

radix_tree_root

height = 1

radix_tree_node

rnode

count = 2

slots[0] slots[4]

index = 0

page

...

index = 4

0 63

radix_tree_root

height = 2

radix_tree_node

rnode

count = 2

...0 63

radix_tree_nodecount = 2

slots[0] slots[4]

index = 0

...

index = 4

0 63

slots[0]

radix_tree_nodecount = 1

slots[3]

index = 131

...0 63

slots[2]

page page page page

15

CSE506:Opera.ngSystems

Abitmoredetail•  Assumeanupperboundonfilesizewhenbuildingtheradixtree–  Canrebuildlaterifwearewrong

•  Specifically:Maxsizeis256k,branchingfactor(k)=64

•  256k/4kpages=64pages–  Soweneedaradixtreeofheight1torepresentthesepages

16

CSE506:Opera.ngSystems

Treeofheight1•  Roothas64slots,canbenull,orapointertoapage•  LookupaddressX:–  Shipofflow12bits(offsetwithinpage)–  Usenext6bitsasanindexintotheseslots(2^6=64)–  Ifpointernon-null,gotothechildnode(page)–  Ifnull,pagedoesn’texist

17

CSE506:Opera.ngSystems

Treeofheightn•  Similarstory:–  Shipofflow12bits

•  Ateachchildshipoff6bitsfrommiddle(starVngat6*(distancetothebosom–1)bits)tofindwhichofthe64potenValchildrentogoto–  Usefixedheighttofigureoutwheretostop,whichbitstouseforoffset

•  ObservaVons:–  “Key”ateachnodeimplicitbasedonposiVonintree–  LookupVmeconstantinheightoftree

•  Inageneral-purposeradixtree,mayhavetocheckallkchildren,forhigherlookupcost

18

Page 4: The Page Cache - Computer Science · Recap of previous lectures • Page tables: translate virtual addresses to physical addresses • VM Areas (Linux): track what should be mapped

3/16/16

4

CSE506:Opera.ngSystems

Fixedheights•  Ifthefilesizegrowsbeyondmaxheight,mustgrowthetree

•  RelaVvelysimple:Addanotherroot,previoustreebecomesfirstchild

•  Scalinginheight:–  1:2^((6*1)+12)=256KB–  2:2^((6*2)+12)=16MB–  3:2^((6*3)+12)=1GB–  4:2^((6*4)+12)=64GB–  5:2^((6*5)+12)=4TB

19

CSE506:Opera.ngSystems

Backtoaddressspaces•  Eachaddressspaceforafilecachedinmemoryincludesaradixtree–  Radixtreeissparse:pagesnotinmemoryaremissing

•  Radixtreealsosupportstags:suchasdirty–  Atreenodeistaggedifatleastonechildalsohasthetag

•  Example:Itagafilepagedirty– Musttageachparentintheradixtreeasdirty– WhenIamfinishedwriVngpageback,Imustcheckallsiblings;ifnonedirty,cleartheparent’sdirtytag

20

CSE506:Opera.ngSystems

LogicalView

Disk

Hello!

Foo.txtinode ProcessA

ProcessB

ProcessC

AddressSpace

RadixTree

21

CSE506:Opera.ngSystems

Recap•  Anonymouspage:Justusethepagetables•  File-backedmapping–  VMA->openfiledescriptor->inode–  Inode->addressspace(radixtree)->page

22

CSE506:Opera.ngSystems

Problem2:Dirtypages•  MostOSesdonotwritefileupdatestodiskimmediately–  (Laterlecture)OStriestoopVmizediskarmmovement

•  OSinsteadtracks“dirty”pages–  Ensuresthatwritebackisn’tdelayedtoolong

•  Lestdatabelostinacrash

•  ApplicaVoncanforceimmediatewritebackwithsyncsystemcalls(andsomeopen/mmapopVons)

23

CSE506:Opera.ngSystems

Syncsystemcalls•  sync()–Flushalldirtybufferstodisk•  fsync(fd)–Flushalldirtybuffersassociatedwiththisfiletodisk(includingchangestotheinode)

•  fdatasync(fd)–Flushonlydirtydatapagesforthisfiletodisk–  Don’tbotherwiththeinode

24

Page 5: The Page Cache - Computer Science · Recap of previous lectures • Page tables: translate virtual addresses to physical addresses • VM Areas (Linux): track what should be mapped

3/16/16

5

CSE506:Opera.ngSystems

Howtoimplementsync?•  Goal:keepoverheadsoffindingdirtyblockslow–  Anaïvescanofallpageswouldwork,butexpensive–  Lotsofcleanpages

•  Idea:keeptrackofdirtydatatominimizeoverheads–  Abitofextraworkonthewritepath,ofcourse

25

CSE506:Opera.ngSystems

Howtoimplementsync?•  Background:Eachfilesystemhasasuperblock–  Allsuperblocksinalist

•  Eachsuperblockkeepsalistofdirtyinodes•  Inodesandsuperblocksbothmarkeddirtyuponuse

26

CSE506:Opera.ngSystems

FSOrganizaVon

SB/

SB/floppy

SB/d1

OneSuperblockperFS

inode

Dirtylist

Dirtylistofinodes

Inodesandradixnodes/pagesmarkeddirtyseparately

27

CSE506:Opera.ngSystems

Simpletraversalforeachsinsuperblocklist:

if(s->dirty)writebacksforiininodelist: if(i->dirty)writebacki if(i->radix_root->dirty): //RecursivelytraversetreewriVng //dirtypagesandclearingdirtyflag

28

CSE506:Opera.ngSystems

Asynchronousflushing•  Kernelthread(s):pdflush–  Akernelthreadisataskthatonlyrunsinthekernel’saddressspace

–  2-8threads,dependingonhowbusy/idlethreadsare•  Whenpdflushruns,itisgivenatargetnumberofpagestowriteback–  Kernelmaintainsatotalnumberofdirtypages–  AdministratorconfiguresatargetdirtyraVo(say10%)

29

CSE506:Opera.ngSystems

pdflush•  Whenpdflushisscheduled,itfiguresouthowmanydirtypagesareabovethetargetraVo

•  WritesbackpagesunVlitmeetsitsgoalorcan’twritemoreback–  (Somepagesmaybelocked,justskipthose)

•  Sametraversalassync()+acountofwrisenpages–  Usuallyquitsearlier

30

Page 6: The Page Cache - Computer Science · Recap of previous lectures • Page tables: translate virtual addresses to physical addresses • VM Areas (Linux): track what should be mapped

3/16/16

6

CSE506:Opera.ngSystems

Howlongdirty?•  Linuxhassomeinode-specificbookkeepingaboutwhenthingsweredirVed

•  pdflushalsochecksforanyinodesthathavebeendirtylongerthan30seconds– Writesthesebackevenifquotawasmet

•  NotthestrongestguaranteeI’veeverseen…

31

CSE506:Opera.ngSystems

Butwheretowrite?•  Ok,soIseehowtofindthedirtypages•  Howdoesthekernelknowwhereondisktowritethem?–  Andwhichdiskforthatmaser?

•  Superblocktracksdevice•  Inodetracksmappingfromfileoffsettosector

32

CSE506:Opera.ngSystems

Blocksizemismatch•  Mostdiskshave512byteblocks;pagesaregenerally4K–  Somenew“green”diskshave4Kblocks–  Perpageincache–usually8diskblocks

•  Whenblocksdon’tmatch,whatdowedo?–  Simpleanswer:Justwriteall8!–  Butthisisexpensive–ifonlyoneblockchanged,weonlywanttowriteoneblockback

33

CSE506:Opera.ngSystems

Bufferhead•  Simpleidea:foreverypagebackedbydisk,storeanextradatastructureforeachdiskblock,calledabuffer_head

•  Ifapagestores8diskblocks,ithas8bufferheads•  Example:write()systemcallforfirst5bytes–  Lookupfirstpageinradixtree– Modifypage,markdirty–  Onlymarkfirstbufferheaddirty

34

CSE506:Opera.ngSystems

From“UnderstandingtheLinuxKernel”

This is the Title of the Book, eMatter EditionCopyright © 2007 O’Reilly & Associates, Inc. All rights reserved.

Storing Blocks in the Page Cache | 615

buffer page points to the buffer head of the first block in the page;* every buffer headstores in the b_this_page field a pointer to the next buffer head in the list. Moreover,every buffer head stores the address of the buffer page’s descriptor in the b_page field.Figure 15-2 shows a buffer page containing four block buffers and the correspond-ing buffer heads.

Allocating Block Device Buffer PagesThe kernel allocates a new block device buffer page when it discovers that the pagecache does not include a page containing the buffer for a given block (see the section“Searching Blocks in the Page Cache” later in this chapter). In particular, the lookupoperation for the block might fail for the following reasons:

1. The radix tree of the block device does not include a page containing the data ofthe block: in this case a new page descriptor must be added to the radix tree.

2. The radix tree of the block device includes a page containing the data of the block,but this page is not a buffer page: in this case new buffer heads must be allocatedand linked to the page, thus transforming it into a block device buffer page.

3. The radix tree of the block device includes a buffer page containing the data ofthe block, but the page has been split in blocks of size different from the size ofthe requested block: in this case the old buffer heads must be released, and anew set of buffer heads must be allocated and linked to the page.

* Because the private field contains valid data, the PG_private flag of the page is also set; hence, if the pagecontains disk data and the PG_private flag is set, then the page is a buffer page. Notice, however, that otherkernel components not related to the block I/O subsystem use the private and PG_private fields for otherpurposes.

Figure 15-2. A buffer page including four buffers and their buffer heads

Page descriptor

b_dataprivateb_this_page

Buffer

Buffer

Buffer

Buffer

Page

Buffer head

Buffer head

Buffer head

Buffer head

b_page

35

CSE506:Opera.ngSystems

Moreonbufferheads•  Onwrite-back(sync,pdflush,etc),onlywritedirtybufferheads

•  Tolookupagivendiskblockforafile,mustdividebybufferheadsperpage–  Ex:diskblock25ofafileisinpage3intheradixtree

•  Note:memorymappedfilesmarkall8buffer_headsdirty.Why?–  Canonlydetectwriteregionsviapagefaults

36

Page 7: The Page Cache - Computer Science · Recap of previous lectures • Page tables: translate virtual addresses to physical addresses • VM Areas (Linux): track what should be mapped

3/16/16

7

CSE506:Opera.ngSystems

Summary•  Seenhowmappingsoffiles/diskstocachepagesaretracked–  Andhowdirtypagesaretagged–  Radixtreebasics

•  Whenandhowdirtydataiswrisenbacktodisk•  Howdifferencebetweendisksectorandpagesizesarehandled

37