The Impact of Operating System Structure

on Memory System Performance

ACM SIGOPS, 1994

J. Bradley Chen (Carnegie Mellon University), Brian N. Bershad (University of Washington)

Sanghoon Han, Byeonghun Hyeon, Gyusun Lee

Previous works

• This paper is about tracing

• Old works?
  • Memory system structure
  • Multiprocessors
  • Subcomponent of the memory system

• How about different OS structures?

The two system structures

• System call vs IPC

• Memory usage patterns will differ

• The causes of performance impact may differ depending on the system structure.

Monolithic Micro-kernel

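Want to do

• Measure how time is spent within the memory structure (with as little error as possible)

• Compare the differences between the two systems

• Check whether the 7 assertions really hold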

Target

• OS
  • Ultrix (monolithic)
  • Mach 3.0 (microkernel) with CMU’s UNIX server
  • Both are derived from 4.2BSD UNIX

• Machine
  • DECstation 5000/200
  • Because it can run both OSes

• A workload of 13 programs

Progress

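• Modify the executable file to record
  • why time is spent, and how much
  • which addresses are accessed

• Result: a program that is 2x larger and 15x slower
  • Can the results be trusted?
  • How can the error be reduced?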

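Minimize distortion

1. Memory dilation
  • The traced program uses more memory than before
  • Execution behavior can change (more pageouts)
  • So use a very large physical memory
  • Instead, the (user) TLB is implemented through simulation

2. Time dilation
  • The external environment effectively becomes 15x faster
  • Clock interrupts also arrive 15x faster (so the system clock is slowed to 1/15)
  • I/O appears faster (the idle thread runs less)
  • Time spent in the idle state is multiplied by 15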
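The time-dilation correction above amounts to simple scaling. The following is a minimal sketch, assuming the dilation factor of 15 stated on the slide; the function names and sample numbers are hypothetical and chosen only for illustration.

```python
DILATION = 15  # the instrumented program runs about 15x slower (from the slide)

def scaled_clock_rate(native_hz: float, dilation: int = DILATION) -> float:
    """Clock interrupt rate to use while tracing, so the traced program
    sees as many ticks per unit of work as it would natively."""
    return native_hz / dilation

def corrected_idle_time(measured_idle: float, dilation: int = DILATION) -> float:
    """Idle time observed under tracing is scaled up by the dilation factor,
    because devices look `dilation` times faster to the slowed-down program."""
    return measured_idle * dilation

# Hypothetical numbers: a 150 Hz clock tick and 0.2 s of observed idle time.
print(scaled_clock_rate(150.0))    # 10.0
print(corrected_idle_time(0.2))
```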

Minimize difference

3. Different page mapping strategy
  • Ultrix: deterministic, Mach: random
  • It is important
  • Used the deterministic strategy

Because,
  • It makes the experiment easy to reproduce
  • It reduces the differences between the two systems.
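The reproducibility argument can be illustrated with a toy model. This is only a sketch: the two mapping functions are hypothetical stand-ins, not the actual Ultrix or Mach policies, and the page numbers are made up.

```python
import random

def deterministic_mapping(virtual_pages):
    """Toy policy: the i-th page touched always gets frame i."""
    return {vp: i for i, vp in enumerate(virtual_pages)}

def random_mapping(virtual_pages, num_frames, rng):
    """Toy policy: each page gets a distinct frame drawn at random."""
    frames = rng.sample(range(num_frames), len(virtual_pages))
    return dict(zip(virtual_pages, frames))

pages = [0x10, 0x11, 0x42, 0x43]

# Deterministic: two runs give the same mapping, so cache behaviour repeats.
run_a = deterministic_mapping(pages)
run_b = deterministic_mapping(pages)
print(run_a == run_b)  # True

# Random: the mapping depends on the RNG state, so runs generally differ.
run_c = random_mapping(pages, 64, random.Random(1))
run_d = random_mapping(pages, 64, random.Random(2))
print(run_c, run_d)
```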

Result (Table)

Number of instructions, cache misses

What is MCPI?

• Memory Cycles Per Instruction

• $\text{MCPI} = \dfrac{\text{CPU stall cycles due to the memory system}}{\text{Instructions}}$

• Only includes non-idle instructions

• The result can be verified with this.
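As a small illustration of the definition above, MCPI can be computed from two counters. The function name and the sample counts are hypothetical.

```python
def mcpi(memory_stall_cycles: int, non_idle_instructions: int) -> float:
    """Memory cycles per instruction: stall cycles caused by the memory
    system divided by the number of non-idle instructions executed."""
    if non_idle_instructions <= 0:
        raise ValueError("need at least one non-idle instruction")
    return memory_stall_cycles / non_idle_instructions

# Hypothetical counts, for illustration only.
print(mcpi(memory_stall_cycles=500_000, non_idle_instructions=1_000_000))  # 0.5
```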

Result (MCPI)

Verification

• Assumption
  • Idle loop CPI = 1

• Then total cycles = {idle instructions} + {non-idle instructions} x {1 + MCPI}

• Example (gcc)
  • Cycles = 63684000 + 29318000 x (1 + 0.434) = 105726012
  • Run time = Cycles / Clock speed (25 MHz) = 4.22 seconds
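The gcc check above can be reproduced in a few lines. This sketch uses only the figures from the slide; the function and constant names are illustrative.

```python
def total_cycles(idle_instr: int, non_idle_instr: int, mcpi: float) -> float:
    """Idle instructions cost 1 cycle each (assumed idle-loop CPI = 1);
    each non-idle instruction costs 1 cycle plus MCPI stall cycles."""
    return idle_instr + non_idle_instr * (1 + mcpi)

CLOCK_HZ = 25_000_000  # 25 MHz, from the slide

cycles = total_cycles(63_684_000, 29_318_000, 0.434)  # gcc figures from the slide
runtime = cycles / CLOCK_HZ

print(round(cycles))        # 105726012
print(f"{runtime:.3f} s")   # 4.229 s
```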

Comparing two systems

Ultrix performed more disk I/O, because Mach 3.0 supports demand paging.

Comparing two systems

Workload  User                                System
          Ultrix      Mach        Difference  Ultrix      Mach        Difference
sed       4335.04     4347.28     12.24       1368.96     3415.72     2046.76
egrep     41545.92    41876.97    331.05      1731.08     3152.03     1420.95
yacc      30831.06    31085.1     254.04      1967.94     3453.9      1485.96
gcc       22868.04    23000.96    132.92      6449.96     12938.04    6488.08
compress  13685.76    13748.94    63.18       3210.24     6177.06     2966.82
ab        582720.4    587104.3    4383.84     287011.6    611067.7    324056.16
espresso  132677.3    132293.8    383.54      2707.7      5512.24     2804.54
lisp      1249386     1251087     1700.43     38640.81    25532.38    13108.43
eqntott   1400225     1403689     3464.01     14143.69    14178.68    34.99
fpppp     244220.4    244588.1    367.7       21236.56    18409.86    2826.7
doduc     318111.8    318844      732.23      3213.25     6507.02     3293.77
liv       22317.76    22351.32    33.56       690.24      1426.68     736.44
tomcatv   1985646     1985534     111.87      20057.03    20055.9     1.13

The number of non-idle system instructions differs greatly.
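To make the gap concrete, the sketch below recomputes the difference and the Mach/Ultrix ratio for a few rows. The values are copied from the System columns of the table; the choice of rows and the variable names are illustrative.

```python
# (Ultrix, Mach) non-idle system instruction counts, copied from the table.
system_counts = {
    "sed":      (1368.96, 3415.72),
    "gcc":      (6449.96, 12938.04),
    "compress": (3210.24, 6177.06),
    "tomcatv":  (20057.03, 20055.9),
}

for name, (ultrix, mach) in system_counts.items():
    diff = abs(mach - ultrix)   # the table reports absolute differences
    ratio = mach / ultrix
    print(f"{name:9s} diff={diff:10.2f}  Mach/Ultrix={ratio:.2f}")
```

For sed, gcc and compress the system instruction count roughly doubles or more under Mach, while for tomcatv it is nearly unchanged.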

Comparing two systems

Instructions on Mach 3.0 are more expensive

Relative system overheads

IPC is not the only reason Mach 3.0 is slower.

7 assertions

1. System and user locality

• The system's locality will be lower than the user's?

• Low locality → high cache miss rate

1. System and user locality

2. System dependency on caches

• More time will be spent on the instruction cache.

• MCPI contribution?

This is especially pronounced on Mach 3.0.

Increased system activity

3. Competition between the user and system

• Do the user and the system compete for the cache?

• What if they used separate caches?

There is no large performance difference.

4. System self-interference

• Do instructions that need to be in the cache at the same time compete with each other?

• Raising the cache associativity improves performance.
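The effect of associativity on self-interference can be seen in a toy cache model. This is a sketch under assumed parameters (a 4-line cache, 16-byte lines, LRU replacement, a made-up address trace), not the configuration measured in the paper.

```python
from collections import OrderedDict

def count_misses(addresses, num_lines, ways, line_size=16):
    """Count misses in a cache with `num_lines` lines in total, `ways` lines
    per set, and LRU replacement within each set."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // line_size
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)   # evict the least recently used block
            s[block] = True
    return misses

# Two hot code addresses whose blocks map to the same set.
a, b = 0x000, 0x040
trace = [a, b] * 8

print(count_misses(trace, num_lines=4, ways=1))  # 16: every access conflicts
print(count_misses(trace, num_lines=4, ways=2))  # 2: only the cold misses
```

With the same total capacity, the 2-way cache keeps both blocks resident, which is the kind of improvement the assertion predicts.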

5. Block operations

• Is a lot of time spent in system block memory operations?

• They will account for a large share of the total MCPI.

6. Streaming writes

• The write buffer will perform poorly for system code

• Write buffer stalls will be frequent while system instructions are executing.

7. Page mapping strategy

• The virtual page mapping strategy will have a large effect on performance.

• Switching the earlier experiments to the random strategy will change performance.

MCPI comparison (1)

Deterministic

Random

MCPI comparison (2)

Deterministic

Random

• It has a significant effect

• In most cases, random is better

Conclusion

• Tracing makes it possible to identify where system overhead arises.

• IPC overhead accounted for a smaller share than expected.

• We confirmed that 6 of the 7 assertions actually hold.

Magazines and Vmem: Extending the Slab Allocator to

Many CPUs and Arbitrary Resources

Jeff Bonwick, Jonathan Adams

USENIX Annual Technical Conference, General Track, '01

Contents

• Introduction
• Slab Allocator
• Magazine
• Vmem
• Conclusion
• Critic

Introduction

• The slab allocator has continued to evolve ('94~)
  • Per-CPU memory allocation
  • More general resource allocation
  • Available as a user-level library

Slab Allocator

• Implemented in Sun Microsystems' Solaris in '94

Slab: one or more pages of virtually contiguous memory

Object: prepared space for frequently used objects (e.g., structures)

Keep freed objects on a free list -> reduces the instructions spent on allocations and frees

Multiprocessor

Magazine

• Background: per-CPU memory allocation
  • The slab allocator needs a lock to protect the cache's slab list
  • Need multiprocessor scalability

[Figure: per-CPU cache holding magazines of M objects, layered over a slab of objects. Allocate and free operate on the CPU's magazine. When the magazine is full on a free, or empty on an allocate, it is traded in at the depot (needs lock). To prevent frequent trades, each CPU also keeps its previous magazine.]
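The trade-in behaviour described in the figure above can be sketched as follows. This is a simplified illustration, not the Solaris implementation: the class names, the fixed magazine size `M = 4`, and the `backend_alloc` callback standing in for the slab layer are all assumptions.

```python
import threading

M = 4  # magazine size (hypothetical; the real allocator tunes it dynamically)

class Depot:
    """Global store of full and empty magazines, protected by a lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.full, self.empty = [], []

    def trade_for_full(self, empty_mag):
        with self.lock:
            if not self.full:
                return None
            self.empty.append(empty_mag)
            return self.full.pop()

    def trade_for_empty(self, full_mag):
        with self.lock:
            self.full.append(full_mag)
            return self.empty.pop() if self.empty else []

class CpuCache:
    """Per-CPU layer: a loaded magazine plus the previously loaded one.
    The previous magazine is always either completely full or empty."""
    def __init__(self, depot, backend_alloc):
        self.depot, self.backend_alloc = depot, backend_alloc
        self.loaded, self.previous = [], []

    def alloc(self):
        if not self.loaded:
            if self.previous:                    # previous is full: swap
                self.loaded, self.previous = self.previous, self.loaded
            else:
                full = self.depot.trade_for_full(self.previous)
                if full is None:
                    return self.backend_alloc()  # fall through to the slab layer
                self.previous, self.loaded = self.loaded, full
        return self.loaded.pop()

    def free(self, obj):
        if len(self.loaded) == M:
            if not self.previous:                # previous is empty: swap
                self.loaded, self.previous = self.previous, self.loaded
            else:
                empty = self.depot.trade_for_empty(self.previous)
                self.previous, self.loaded = self.loaded, empty
        self.loaded.append(obj)

depot = Depot()
counter = iter(range(1000))
cpu = CpuCache(depot, backend_alloc=lambda: next(counter))

objs = [cpu.alloc() for _ in range(10)]  # caches are cold: all from the backend
for o in objs:
    cpu.free(o)                          # fills loaded, then previous, then trades
print(len(depot.full))                   # 1
```

Only the trade with the depot takes the lock; the common alloc and free paths touch per-CPU state alone, which is where the scalability comes from.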

Magazine (cont.)

• Magazine Size (M)
  • The CPU layer's miss rate can be made as low as desired by increasing M (starting from an initial value)
  • Observe the contention rate on the depot lock (increment a contention count)

• If the contention rate exceeds a fixed threshold, increase the magazine size

• Depot Size
  • If the depot's full magazine list varies between 37 and 47 over a given period, then the working set is 10 magazines (the remainder are eligible for reclaiming)
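The depot working-set rule above is a max-minus-min calculation over the sampling period. A minimal sketch, with a hypothetical function name and made-up samples that stay within the 37 to 47 range from the slide:

```python
def depot_working_set(full_list_samples):
    """Working set = swing in the depot's full-magazine count over the
    period; the minimum count was never needed and is reclaimable."""
    low, high = min(full_list_samples), max(full_list_samples)
    return {"working_set": high - low, "reclaimable": low}

# Samples varying between 37 and 47, as in the slide's example.
print(depot_working_set([41, 37, 44, 47, 39]))
# {'working_set': 10, 'reclaimable': 37}
```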

Magazine Performance

• Scalability
  • 333MHz 16-CPU Starfire

• System-Level Benchmark
  • SPECweb99
  • TPC-C
  • Kenbus

Vmem

• Background: more general resource allocation
  • Almost all versions of Unix have a resource map allocator called rmalloc()
  • Linear-time algorithm
    • Maintains a list of free segments
    • Coalesces segments to reduce fragmentation
    • Uses insertion sort to return a segment to the free segment list -> O(n)

• Objectives
  • Constant-time performance -> O(1)
  • Linear scalability
  • Low fragmentation
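The linear-time behaviour attributed to rmalloc()-style allocators above comes from scanning an address-sorted free list on every free. The sketch below is an illustration of that general approach, not the actual rmalloc() code; the function name and tuple layout are assumptions.

```python
def free_segment(free_list, start, size):
    """Return [start, start+size) to an address-sorted free list of
    (start, size) tuples, coalescing with neighbours. The scan is O(n)."""
    i = 0
    while i < len(free_list) and free_list[i][0] < start:  # linear search
        i += 1
    free_list.insert(i, (start, size))
    # Merge with the following segment if it is adjacent.
    if i + 1 < len(free_list) and start + size == free_list[i + 1][0]:
        size += free_list[i + 1][1]
        free_list[i:i + 2] = [(start, size)]
    # Merge with the preceding segment if it is adjacent.
    if i > 0 and sum(free_list[i - 1]) == start:
        prev_start, prev_size = free_list[i - 1]
        free_list[i - 1:i + 1] = [(prev_start, prev_size + size)]
    return free_list

segments = [(0, 10), (30, 10)]
free_segment(segments, 10, 20)   # fills the hole between the two segments
print(segments)                  # [(0, 40)]
```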

Vmem Structure

[Figure: vmem arenas layered over virtual memory, created with vmem_create(“heap”…), vmem_create(“kmem_va”…) and vmem_create(“kmem_default”…). Each arena keeps a segment list of boundary tags (free / allocated), a hash list, and power-of-two free lists (2^0, 2^1, 2^2, 2^3, 2^4).]

Vmem Performance

• Constant Time
  • The hash and free lists let allocated or free segments be found quickly -> O(1)

• Regardless of arena fragmentation

• System-Level Benchmark
  • LADDIS
  • Web Service
  • I/O Bandwidth
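How a hash plus power-of-two free lists avoid scanning can be sketched with a toy arena. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, a Python dict stands in for the hash list, a bitmap marks non-empty free lists, and coalescing on free is omitted.

```python
class Arena:
    """Toy arena: power-of-two free lists plus a hash of allocated segments."""
    def __init__(self, base, size):
        self.freelists = {}   # n -> [(addr, size)] with 2^n <= size < 2^(n+1)
        self.bitmap = 0       # bit n set <=> freelists[n] is non-empty
        self.allocated = {}   # addr -> size (stands in for the hash list)
        self._put_free(base, size)

    def _put_free(self, addr, size):
        n = size.bit_length() - 1
        self.freelists.setdefault(n, []).append((addr, size))
        self.bitmap |= 1 << n

    def alloc(self, size):
        n = (size - 1).bit_length()          # smallest n with 2^n >= size
        candidates = self.bitmap >> n << n   # non-empty lists at index >= n
        if not candidates:
            return None
        k = (candidates & -candidates).bit_length() - 1  # lowest set bit
        addr, seg_size = self.freelists[k].pop()         # any entry fits
        if not self.freelists[k]:
            self.bitmap &= ~(1 << k)
        if seg_size > size:                  # give the unused tail back
            self._put_free(addr + size, seg_size - size)
        self.allocated[addr] = size
        return addr

    def free(self, addr):
        size = self.allocated.pop(addr)      # hash lookup, no list scan
        self._put_free(addr, size)           # coalescing omitted in this sketch

arena = Arena(base=0x1000, size=64)
a = arena.alloc(10)
b = arena.alloc(16)
print(hex(a), hex(b))   # 0x1000 0x100a
arena.free(a)
```

Neither alloc nor free walks a list whose length depends on how fragmented the arena is, which is the property the slide summarizes as O(1).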

User-Level Memory Allocation

mtmalloc: selecting a free list was simply round-robin

mtmalloc (fixed): select a per-CPU free list by thread ID hashing, as in libumem
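The difference between the two selection policies can be shown in a few lines. This is a sketch only: `NUM_LISTS`, the function names, and the use of Python's built-in `hash` are assumptions for illustration, not the mtmalloc or libumem code.

```python
import itertools

NUM_LISTS = 8  # hypothetical number of free lists

_rr = itertools.count()

def pick_round_robin() -> int:
    """Each call moves on to the next list, regardless of the caller."""
    return next(_rr) % NUM_LISTS

def pick_by_thread_id(thread_id: int) -> int:
    """The same thread always lands on the same list."""
    return hash(thread_id) % NUM_LISTS

print([pick_round_robin() for _ in range(3)])        # [0, 1, 2]
print([pick_by_thread_id(1234) for _ in range(3)])   # [2, 2, 2]
```

With hashing, a thread keeps returning to the same free list, so objects it frees are likely to be reused by the same thread instead of bouncing between lists.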

Summary

• Magazine
  • Provides efficient object caching with very low latency and linear scaling

• Vmem
  • Guarantees constant-time performance regardless of allocation size or arena fragmentation

Critic
