Automatic Compaction of OS Kernel Code via On-Demand Code Loading
Haifeng He, Saumya Debray, Gregory Andrews
The University of Arizona
Background
General-Purpose Operating Systems: Desktop, Embedded Devices
General-purpose OS with embedded applications
• Resource constraints
• Limited amount of memory
Goal: reduce the memory footprint of OS kernel code as much as possible
About 68% of kernel code is not executed
(A Linux kernel with minimal configuration, profiled with the MiBench suite)
• Executed: 32%
• Statically proved unnecessary by prior work: 18%-24%
• Unexecuted but still cannot be discarded
  • Needed (exception handling)
  • Not needed but missed by existing analysis
Our Approach
Memory hierarchy
• Limited amount of main memory
• Greater amount of secondary storage
Kernel code
• Hot code lives in memory
• Cold code lives in secondary storage
On-demand code loading
A Big Picture
Main memory (memory-resident kernel code)
• Core code: scheduler, memory management, interrupt handling
• Hot code
• Code buffer: accommodates one cluster at a time
Secondary storage
• Remaining kernel code, partitioned by code clustering
• size(cluster) ≤ size(code buffer)
Memory Requirement for Kernel Code
Main memory holds:
• Core code: size is predetermined
• Hot code: how much hot code should stay in memory?
  • Select the most frequently executed code
  • The total size of memory-resident code ≤ size(core code) × (1 + r), where the growth bound r is specified by the user (e.g. 0%, 10%)
• Code buffer: size specified by the user
Together these give an upper bound on memory usage for kernel code
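A small worked example may make the bound concrete. The sketch below assumes that the overall upper bound is the memory-resident code limit plus the code buffer; that reading, the function name, and the 200KB core code size are illustrative assumptions, not figures from the talk.

```python
def kernel_code_memory_bound(core_size, growth_pct, buf_size):
    """Upper bound (bytes) on memory used by kernel code.

    core_size  : size of the core code in bytes
    growth_pct : growth bound as an integer percentage (0 means no hot code)
    buf_size   : code buffer size in bytes
    Assumption: bound = size(core code) x (1 + growth bound) + code buffer.
    """
    hot_budget = core_size * growth_pct // 100   # room left for hot code
    return core_size + hot_budget + buf_size


# Hypothetical 200KB of core code with a 2KB code buffer.
CORE = 200 * 1024
BUF = 2 * 1024
bound_0 = kernel_code_memory_bound(CORE, 0, BUF)
bound_10 = kernel_code_memory_bound(CORE, 10, BUF)
```

With a growth bound of 0% only the core code and the buffer are resident; raising the bound buys room for hot code at the price of a larger footprint.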
Our Approach
• Reminiscent of the old idea of overlays
• Purely software-based approach
  • Does not require an MMU or OS support for virtual memory
• Main steps
  • Apply clustering to the whole-program control flow graph
    • Group "related" code together
    • Reduce the cost of code loading
  • Transform kernel code to support overlays
    • Modify control flow edges
Code Clustering
• Objective: minimize the number of code loads
• Given:
  • An edge-weighted whole-program control flow graph
  • A list of functions marked as core code
  • A growth bound r for memory-resident code
  • Code buffer size BufSz
• Apply a greedy node-coalescing algorithm until no coalescing can be carried out without violating:
  • Size of memory-resident code ≤ size(core code) × (1 + r)
  • Size of each cluster ≤ BufSz
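The greedy coalescing step can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it enforces only the cluster size limit (BufSz), uses the original edge weights rather than recomputing weights between merged clusters, and ignores the selection of hot code. The function and variable names are hypothetical.

```python
def greedy_cluster(sizes, edges, buf_sz):
    """Greedily coalesce nodes along heavy edges.

    sizes  : {node: size in bytes}
    edges  : {(u, v): weight}, e.g. profiled execution counts
    buf_sz : maximum size of any cluster (the code buffer size)
    Returns the clusters as a list of sets of nodes.
    """
    cluster_of = {n: n for n in sizes}     # node -> id of its cluster
    members = {n: {n} for n in sizes}      # cluster id -> nodes in it
    csize = dict(sizes)                    # cluster id -> total bytes

    # Heaviest edges first; ties broken by node names for determinism.
    for (u, v), _w in sorted(edges.items(), key=lambda e: (-e[1], e[0])):
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv or csize[cu] + csize[cv] > buf_sz:
            continue                       # already merged, or too large
        for n in members[cv]:
            cluster_of[n] = cu
        members[cu] |= members.pop(cv)
        csize[cu] += csize.pop(cv)

    return sorted(members.values(), key=sorted)


# Hypothetical four-block graph with a 2KB code buffer.
sizes = {"a": 800, "b": 700, "c": 900, "d": 300}
edges = {("a", "b"): 50, ("b", "c"): 40, ("c", "d"): 10, ("a", "d"): 5}
clusters = greedy_cluster(sizes, edges, 2048)
```

Because cluster sizes only grow, an edge rejected for size can never become mergeable later, so a single pass over the sorted edges reaches the point where no further coalescing is possible under this simplified rule.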
Code Transformation
• Apply code transformation to:
  • Inter-cluster control flow edges
  • Control flow edges from memory-resident code to clusters (but not needed the other way)
  • All indirect control flow edges (targets only known at runtime)
Code Transformation
After clustering
• Cluster A: call F
• Cluster B: 0x220 F:
Rewritten code
• Cluster A: push &F; call dyn_loader
• dyn_loader (runtime library)
  1. Address lookup for &F
  2. Load B into the code buffer
  3. Translate target address &F into a relative address in the code buffer
• Cluster B (in code buffer): 0x200 maps to 0x500, so F: is at 0x520
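The lookup and translation steps of dyn_loader can be illustrated with the addresses from the diagram. This is a minimal sketch: the table layout, the cluster sizes, and the function names are assumptions made for the example, and only the start addresses (0x100, 0x200) and the code buffer address (0x500) come from the slides.

```python
CODE_BUFFER_START = 0x500

# cluster name -> (original start address, size in bytes).
# The sizes are hypothetical; the slides only show the start addresses.
CLUSTER_TABLE = {"A": (0x100, 0x60), "B": (0x200, 0x60)}


def find_cluster(addr):
    """Step 1: look up which cluster contains an original address."""
    for name, (start, size) in CLUSTER_TABLE.items():
        if start <= addr < start + size:
            return name
    raise KeyError(f"address {addr:#x} is not in any cluster")


def translate(addr):
    """Step 3: map an original address to its place in the code buffer."""
    name = find_cluster(addr)
    start, _size = CLUSTER_TABLE[name]
    return name, CODE_BUFFER_START + (addr - start)


target = translate(0x220)   # F in cluster B
```

Here `target` pairs the cluster to load with the address to jump to once that cluster occupies the buffer.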
Issue: Call Return in Code Buffer
Code (original layout)
• Cluster A: 0x100 …; push &F; 0x130 call dyn_loader; 0x140
• Cluster B: 0x200 …; 0x220 F: …; 0x250 ret
Runtime (code buffer starts at 0x500)
1. Cluster A executes in the code buffer: push &F, then call dyn_loader at 0x530, so the return address is 0x540
2. dyn_loader loads B into the code buffer: F: is now at 0x520 and its ret at 0x550
3. The return address is still 0x540, but A has been overwritten by B!
Call Return in Code Buffer: Fix
Code (original layout)
• Cluster A: 0x100 …; push &F; 0x130 call dyn_loader; 0x140
• Cluster B: 0x200 …; 0x220 F: …; 0x250 ret
Runtime (code buffer starts at 0x500)
1. Cluster A executes in the code buffer and calls dyn_loader at 0x530. The return address 0x540 is replaced by &dyn_restore_A, and dyn_restore_A records the actual return address, 0x140
2. dyn_loader loads B into the code buffer and F executes; the return address is &dyn_restore_A
3. When F returns, dyn_restore_A restores A into the code buffer and execution continues in A
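The call and return sequence can be modelled with a toy runtime that tracks which cluster occupies the code buffer. This is an illustrative sketch only: the class, its fields, and the single shared restore stack are assumptions, and the real mechanism operates on machine code with a per-cluster restore stub such as dyn_restore_A.

```python
class OverlayRuntime:
    """Toy model: one code buffer holding one cluster at a time."""

    def __init__(self):
        self.in_buffer = None        # cluster currently in the code buffer
        self.loads = 0               # number of cluster loads performed
        self.restore_records = []    # (caller cluster, actual return address)

    def load(self, cluster):
        """Bring a cluster into the buffer unless it is already there."""
        if self.in_buffer != cluster:
            self.in_buffer = cluster
            self.loads += 1

    def dyn_loader(self, caller, actual_ret_addr, callee):
        # Remember where to resume in the caller's ORIGINAL layout, because
        # a raw code buffer address is meaningless once the buffer changes.
        self.restore_records.append((caller, actual_ret_addr))
        self.load(callee)

    def dyn_restore(self):
        # Runs when the callee returns: reload the caller, then resume it.
        caller, actual_ret_addr = self.restore_records.pop()
        self.load(caller)
        return caller, actual_ret_addr


rt = OverlayRuntime()
rt.load("A")                       # A runs in the code buffer
rt.dyn_loader("A", 0x140, "B")     # A calls F, which lives in cluster B
resumed = rt.dyn_restore()         # F returns
```

Storing the original-layout address rather than the buffer address is what keeps the return valid after B has overwritten A.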
Context Switches and Interrupts
• Context switches (timeline)
  • Thread 1: executes cluster A in the code buffer
  • Context switch: remember A in Thread 1's task_struct
  • Thread 2: executes; may change the code buffer
  • Context switch: reload A into the code buffer
  • Thread 1: continues executing in the code buffer
• Interrupts
  • Currently keep interrupt handlers in main memory
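The context switch handling can be sketched as a save and reload of the active cluster. The names below are hypothetical stand-ins: `saved_cluster` plays the role of whatever is recorded in task_struct, and the slides do not describe how that is implemented.

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.saved_cluster = None    # stand-in for the task_struct record


class CodeBuffer:
    def __init__(self):
        self.cluster = None          # cluster currently in the buffer
        self.loads = 0               # number of loads performed

    def load(self, cluster):
        """Load a cluster unless there is nothing to load or it is resident."""
        if cluster is not None and cluster != self.cluster:
            self.cluster = cluster
            self.loads += 1


def context_switch(buf, prev, nxt):
    prev.saved_cluster = buf.cluster   # remember what prev was executing
    buf.load(nxt.saved_cluster)        # reload what nxt was executing, if any


buf = CodeBuffer()
t1, t2 = Task("Thread 1"), Task("Thread 2")
buf.load("A")                  # Thread 1 executes cluster A
context_switch(buf, t1, t2)    # switch to Thread 2
buf.load("C")                  # Thread 2 changes the code buffer
context_switch(buf, t2, t1)    # switch back: A is reloaded
```

Skipping the reload when the saved cluster is already resident avoids a load when the other thread never touched the buffer.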
Experimental Setup
• Start with a minimally configured kernel (Linux 2.4.31)
• Compile the kernel with optimization for code size (gcc -Os)
• Original code size: 590KB
• Implemented using the binary rewriting tool PLTO
• Benchmarks: MiBench, MediaBench, httpd
Memory Usage Reduction for Kernel Code
[Figure: memory reduction ratio (y-axis, 20% to 80%) versus memory-resident code growth bound (x-axis, 0.00 to 0.10) for MiBench, MediaBench, and httpd. Code buffer size = 2KB.]
Reduction decreases because the amount of memory-resident code increases
Estimated Cost of Code Loading
• All experiments were run in a desktop environment
• We estimated the cost of code loading as follows:
  • Choose Micron NAND flash memory as an example (2KB page, takes 130.9 μs to read a page)
  • Est. Cost = Σ_i #access(i) × (size(i) / 2KB) × 130.9 μs
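The estimate is straightforward to compute from per-cluster load counts. The sketch below follows the formula as it can be recovered from the slide; it assumes the sum runs over clusters i, with access(i) the number of times cluster i is loaded. The function name and the sample counts are hypothetical.

```python
PAGE_SIZE = 2048        # bytes per NAND flash page (2KB)
PAGE_READ_US = 130.9    # microseconds to read one page


def estimated_load_cost_us(clusters):
    """Estimated code loading cost in microseconds.

    clusters: iterable of (access_count, size_in_bytes), one per cluster.
    """
    return sum(count * (size / PAGE_SIZE) * PAGE_READ_US
               for count, size in clusters)


# Hypothetical profile: a 2KB cluster loaded 1000 times
# and a 1KB cluster loaded 500 times.
cost_us = estimated_load_cost_us([(1000, 2048), (500, 1024)])
cost_s = cost_us / 1_000_000
```

Flash is read in whole pages, so a variant that rounds size(i) / 2KB up to an integer may be closer to real hardware; that variant is a separate assumption and is not implemented here.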
Overhead of Code Loading
[Figure: code loading overhead (y-axis, 0.0 to 7.0) versus memory-resident code growth bound (x-axis, 0.00 to 0.10) for MiBench, MediaBench, and httpd, with the unmodified kernel as the baseline. Annotations: 57% memory reduction, 56% memory reduction, 55% memory reduction.]
Related Work
• Code compaction of OS kernel
  • D. Chanet et al. LCTES 05
  • H. He et al. CGO 07
• Reduce memory requirement in embedded system
  • C. Park et al. EMSOFT 04
  • H. Park et al. DATE 06
  • B. Egger et al. CASE 06, EMSOFT 06
• Binary rewriting of OS kernel
  • Flower et al. FDDO-4
Conclusions
• Embedded devices typically have a limited amount of memory
• General-purpose OS kernels contain lots of code that is not executed in an embedded context
• Reduce the memory requirement of the OS kernel by using an on-demand code overlay mechanism
• Memory requirements reduced significantly with little degradation in performance
Estimated Cost of Code Loading
[Figure: three panels (MiBench, MediaBench, httpd) plotting runtime in seconds against growth bound r (0.00 to 0.10), comparing Overlay with Original. Y-axis ranges: 0 to 140 (MiBench), 0 to 10 (MediaBench), 0 to 30 (httpd).]
A Big Picture
Main memory (memory-resident kernel code)
• Core code: scheduler, memory management, interrupt handling
• Hot code
• Code buffer: accommodates one cluster at a time; the code buffer is reused
Cold code
• Code clustering
Memory Requirement for Kernel Code
• Core code: needs to be in memory; size is predetermined
• Hot code: how much hot code should stay in memory?
  • Select the most frequently executed code
  • Keep the total size of memory-resident code ≤ size(core code) × (1 + r), where r is specified by the user (0%, 10%)
• Code buffer: size specified by the user (we chose 2KB)
Together these give an upper bound on memory usage for kernel code