
Page 1: Adaptive File Caching in Distributed Systems

Ekow J. Otoo, Frank Olken, Arie Shoshani

Page 2: Objectives

• Goals
— Develop coordinated, optimal file caching and replication of distributed datasets
— Develop a software module, called the Policy Advisory Module (PAM), as part of Storage Resource Managers (SRMs) and other grid storage middleware

• Examples of application areas
• Particle Physics Data Grid (PPDG)
• Earth System Grid (ESG)
• Grid Physics Network (GriPhyN)

Page 3: Managing File Requests at a Single Site

[Figure] Multiple Clients Using a Shared Disk for Accessing Remote MSS: file requests arrive over the network at the Storage Resource Manager, which performs queuing and scheduling, consults the Policy Advisory Module, and serves files from a shared disk backed by the local Mass Storage System and other sites.

Page 4: Two Principal Components of the Policy Advisory Module

• A disk cache replacement policy
— Evaluates which files are to be replaced when space is needed

• An admission policy for file requests
— Determines which request is to be processed next
— e.g., may prefer to admit requests for files already in cache

• Work done so far concerns:
— Disk cache replacement policies
— Development of the SRM-PAM interface
— Some models of file admission policies

Page 5: New Results Since Last Meeting

• Implementation of the Greedy Dual Size (GDS) replacement policy

• New experimental runs with new workloads:
• Six-month log of access traces from JLab
• Synthetic workload with file sizes from 500 KBytes to 2.14 GBytes

• Implementation of SRM-PAM simulation in OMNeT++

• Papers:
1. Disk cache replacement algorithm for storage resource managers on the grid – SC2002.
2. Disk file caching algorithms that account for delays in space reservation, file transfers and file processing. To be submitted to the Mass Storage Conference.
3. A discrete event simulation model of a storage resource manager (to be submitted to SIGMETRICS 2002).

Page 6: Some Known Results in Caching (1)

• Disk-to-Memory Caching
— Least Recently Used (LRU); keeps the last reference time
— Least Frequently Used (LFU); keeps reference counts
— LRU-K; keeps the last reference times up to a max of K

• Best known result (O'Neil et al. 1993)
— Small K is sufficient (K = 2, 3)
— Gains 5-10% over LRU depending on the reference pattern
— Significance of a 10% saving in time per reference
• Improved response time

• In the Grid and in Wide Area Networks, this translates to
— Reduced network traffic
— Reduced load at the source
— Savings in time to access files
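The LRU-K ranking above can be made concrete with a small sketch (ours, not code from this work): the eviction victim is the file whose K-th most recent reference time is oldest, and files with fewer than K recorded references count as infinitely old, as in O'Neil et al.

```python
def lru_k_victim(history, k):
    """Pick the LRU-K eviction victim: the item whose K-th most recent
    reference is oldest. Items with fewer than K recorded references
    count as infinitely old and are evicted first (O'Neil et al. 1993).
    history: dict mapping item -> list of reference times, oldest first."""
    def kth_backward_time(item):
        times = history[item]
        return times[-k] if len(times) >= k else float("-inf")
    return min(history, key=kth_backward_time)

# With K = 2, "a" is the victim: its 2nd-most-recent reference (t = 1)
# is oldest, even though plain LRU would evict "b" (last used at t = 4).
history = {"a": [1, 7], "b": [3, 4], "c": [5, 6]}
victim = lru_k_victim(history, k=2)
```

This illustrates why LRU-K can beat LRU: a file touched recently but only once in a long while ranks as a better eviction candidate than a file referenced steadily.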

Page 7: Some Known Results in Caching (2)

• File Caching in Tertiary Storage to Disk
— Modeling of Robotic Tapes [Johnson 1995, Sarawagi 1995, ...]
— Hazard Rate Optimal [Olken 1983]
— Object LRU [Hahn et al. 1999]

• Web Caching
— Self-Adjusted LRU [Aggarwal and Yu 1997]
— Greedy Dual [Young 1991]
— Greedy Dual Size (GDS) [Cao and Irani 1997]

Page 8: Difference Between Environments

• Caching in primary memory
— Fixed page size
— Cost (in time) is assumed constant
• Transfer time is negligible
• Latency is assumed fixed for disk
— Memory reference is instantaneous

• Caching in Grid environment
— Large files with variable sizes
— Cost of retrieval (in time) varies considerably
• From one time instant to another, even for the same file
• Files may be available from different locations in a WAN
• Transfer time may be comparable to the latency in a WAN
— Duration of file reference is significant and cannot be ignored
— Main goal is to reduce network traffic and file access times

Page 9: Our Theoretical Result on Caching Policies in the Grid Environment

• Latency delays, transfer delays and file size impact caching policies in the Grid
— Cache replacement algorithms such as LRU and LRU-K do not take these into account and are therefore inappropriate

• The replacement policy we advocate is based on a cost-beneficial function computed at time t0 as

    φi(t0) = [ ki(t0) / (t0 - t-K) ] * fi(t0) * Ci(t0) / Si

where
— t0 is the current time,
— ki(t0) is the count of references for file i, up to a max of K,
— t-K is the time of the K-th backward reference,
— Ci(t0) is the cost in time of accessing file i,
— Si is the size of file i,
— fi(t0) is the total count of references to file i over its active time T.

• The eviction candidate is the file with minimum φi(t0)
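For illustration only, the cost-beneficial ranking could be computed as below. The arrangement of terms follows our reading of the slide's (partly garbled) formula, and the function name and argument names are ours:

```python
def cost_beneficial(t0, ref_times, f_i, c_i, s_i, k):
    """Sketch of the slide's cost-beneficial ranking (reconstructed
    form; the exact term arrangement is our reading of the slide):
    recent reference rate over the last K references, weighted by the
    total reference count and the per-byte retrieval cost.

    ref_times: times of past references to file i, oldest first
    f_i: total reference count over the active lifetime T
    c_i: cost in time of accessing file i at t0
    s_i: size of file i"""
    k_i = min(len(ref_times), k)      # ki(t0): references kept, up to K
    t_back_k = ref_times[-k_i]        # time of the k-th backward reference
    rate = k_i / (t0 - t_back_k)      # recent reference rate
    return rate * f_i * c_i / s_i     # benefit per byte of keeping file i

# The eviction candidate is the cached file minimizing this value, e.g.:
# victim = min(cached_files, key=lambda i: cost_beneficial(t0, ...))
```

Intuitively, a small value means the file is referenced rarely, is cheap to re-fetch, or is large, so evicting it frees much space at little expected cost.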

Page 10: Implementations from the Theoretical Results

• Two new practical implementations developed:

— MIT-K: Maximum average Inter-arrival Time, an improved LRU-K
• MIT-K dominates LRU-K
• Does not take into account access costs and file size
• The main ranking function is

    φ'i(t0) = (t0 - t-K) / ki(t0)

the average inter-arrival time over the last K references; the file with the maximum value is the eviction candidate

— LCB-K: Least Cost Beneficial with K backward references
• Does take into account retrieval delay and file size
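The MIT-K ranking admits an equally small sketch (again our reconstruction; the slide's formula is garbled in this transcript):

```python
def mean_interarrival(t0, ref_times, k):
    """MIT-K ranking sketch (our reconstruction of the slide's
    formula): the average inter-arrival time of the last K references,
    measured back from the current time t0. Unlike LCB-K, the file
    with the MAXIMUM value is the eviction candidate.

    ref_times: times of past references to the file, oldest first"""
    k_i = min(len(ref_times), k)      # references kept, up to K
    return (t0 - ref_times[-k_i]) / k_i
```

A large average inter-arrival time means the file is referenced infrequently, so it is the least valuable to keep; this is the sense in which MIT-K refines LRU-K's backward-K-distance.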

Page 11: Some Details of the Implementation Algorithms

• Evaluation of replacement policies with no delay considerations involves:
— a reference stream r1, r2, r3, ..., rN
— a specified cache size Z, and
— two appropriate data structures:
• a Binary Search Tree (BST) that holds information on the referenced files, and
• a priority queue (PQ) that holds the cache contents (e.g., in LRU order) and allows fast selection of the eviction candidate
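One way to realize these two structures is sketched below under our own design assumptions (not the authors' code): a dict stands in for the BST of referenced files, and a lazy-deletion min-heap keyed by last reference time serves as the PQ for LRU eviction.

```python
import heapq

class LRUCacheSim:
    """Sketch of the no-delay evaluation structures: a dict records
    metadata for every referenced file; a min-heap keyed by last
    reference time gives fast selection of the LRU eviction victim.
    Stale heap entries are skipped lazily on pop."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.meta = {}     # file -> (size, last_ref_time), all referenced files
        self.cached = {}   # file -> size, files currently in cache
        self.used = 0
        self.heap = []     # (last_ref_time, file); may hold stale entries

    def reference(self, f, size, t):
        self.meta[f] = (size, t)
        if f not in self.cached:
            while self.used + size > self.cache_size:
                self.evict()
            self.cached[f] = size
            self.used += size
        heapq.heappush(self.heap, (t, f))

    def evict(self):
        while self.heap:
            t, f = heapq.heappop(self.heap)
            # Skip entries that are stale (file re-referenced or gone).
            if f in self.cached and self.meta[f][1] == t:
                self.used -= self.cached.pop(f)
                return f
```

The lazy-deletion trick avoids the O(n) cost of updating a heap entry on every re-reference, at the price of occasionally popping and discarding stale entries.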

Page 12: Implementation When Delays Are Considered

When delays are considered, each reference ri in the reference stream has 5 event times:
— time of arrival,
— time when file caching starts,
— time when caching ends,
— time when processing begins, and
— time when processing ends and the file is released.

The supporting structures become: a BST of referenced files over the active lifetime T, a PQ of unpinned files in cache (whose ordering varies with the policy), and a vector of pinned files in cache.
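A minimal sketch of how the five event times might be carried per reference (the field names are ours, for illustration only):

```python
from dataclasses import dataclass

@dataclass
class Reference:
    """One request in the reference stream, with the five event times
    this slide lists (field names are ours, not from the original)."""
    file_id: str
    t_arrival: float      # time of arrival of the request
    t_cache_start: float  # time when file caching (transfer) starts
    t_cache_end: float    # time when caching ends
    t_proc_start: float   # time when processing begins
    t_release: float      # time when processing ends, file is released

    def pinned_interval(self):
        # A file is pinned (ineligible for eviction) from the start of
        # its transfer until the client releases it.
        return (self.t_cache_start, self.t_release)
```

Files whose current interval is pinned go in the vector of pinned files; once released they move to the PQ of unpinned, evictable files.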

Page 13: Performance Metrics

• Hit Ratio = # Hits / # References

• Byte Hit Ratio = Total Size of Data Hit / Total Size of Data Referenced

• Retrieval Cost Per Reference = Total Time of Retrievals / # References
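The three metrics can be computed from a trace as follows (an illustrative helper of ours; the tuple layout is an assumption, with retrieval_time taken as 0 on a hit):

```python
def cache_metrics(trace):
    """Compute the three slide metrics from a trace of
    (size, hit, retrieval_time) tuples, one per reference."""
    n = len(trace)
    hit_ratio = sum(1 for size, hit, rt in trace if hit) / n
    byte_hit_ratio = (sum(size for size, hit, rt in trace if hit)
                      / sum(size for size, hit, rt in trace))
    cost_per_ref = sum(rt for size, hit, rt in trace) / n
    return hit_ratio, byte_hit_ratio, cost_per_ref
```

Note that a policy can score well on hit ratio while scoring poorly on retrieval cost per reference if its misses fall on the largest or slowest-to-fetch files, which is why the deck treats the third metric as decisive for the Grid.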

Page 14: Implications of Metric Measures

• Hit Ratio:
— Measures the relative savings as a count of the number of file hits

• Byte Hit Ratio:
— Measures the relative savings as the time avoided in data transfers

• Retrieval Cost Per Reference:
— Measures the relative savings as the time avoided in data transfers and in retrieving data from their sources

Page 15: Parameters of the Simulations

• Real workload from the Jefferson National Accelerator Facility (JLab)
— A six-month trace log of file accesses to tertiary storage
— Log contains batched requests
— Replacement policy used at JLab is LRU

• Synthetic workload based on JLab
— 250,000 files with large sizes, uniformly distributed between 500 KBytes and 2.147 GBytes
— Inter-arrival time is exponentially distributed with mean 90 sec
— Number of references generated is about 500,000
— Locality of reference:
• partition the references into random-size intervals
• follow the 80-20 rule within each interval (80% of references are to 20% of the files)
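A sketch of such a workload generator (the interval-size range and the per-interval hot-set rotation are our assumptions; the slide does not specify them):

```python
import random

def synthetic_workload(n_refs, n_files, mean_gap=90.0,
                       min_size=500_000, max_size=2_147_000_000, seed=1):
    """Generate (time, file_id, size) references in the spirit of the
    slide: exponential inter-arrival times with the given mean, file
    sizes uniform in [min_size, max_size], and 80-20 locality within
    random-size intervals of the reference stream."""
    rng = random.Random(seed)
    sizes = {i: rng.uniform(min_size, max_size) for i in range(n_files)}
    t, refs = 0.0, []
    while len(refs) < n_refs:
        # Each interval gets its own "hot" 20% of the files.
        interval = rng.randint(50, 500)
        hot = rng.sample(range(n_files), max(1, n_files // 5))
        for _ in range(min(interval, n_refs - len(refs))):
            t += rng.expovariate(1.0 / mean_gap)
            # 80% of references go to the hot 20% of files.
            f = rng.choice(hot) if rng.random() < 0.8 else rng.randrange(n_files)
            refs.append((t, f, sizes[f]))
    return refs
```

Rotating the hot set between intervals is what produces the shifting locality that distinguishes this workload from a simple static Zipf-like distribution.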

Page 16: Replacement Policies Compared

• RND: Random
• LFU: Least Frequently Used
• LRU: Least Recently Used
• MIT-K: Maximum Inter-arrival Time based on last K references
• LCB-K: Least Cost Beneficial based on last K references
• GDS: Greedy Dual Size

• Active lifetime of a file, T, is set at 5 days

• All results accumulated with a variance reduction technique

Page 17: Simulation Results for JNAF Workload: Comparison of Hit Ratios

• MIT-K and LRU give the best performance

• LCB-K, GDS and RND are comparable

• LFU is the worst

• Higher values represent better performance

Page 18: Simulation Results for JNAF Workload: Comparison of Byte Hit Ratios

• Higher values represent better performance

• MIT-K and LRU give slightly better performance than the rest

• All policies except LFU are comparable

• LFU is the worst

Page 19: Simulation Results for JNAF Workload: Comparison of Average Retrieval Time Per Reference

• LCB-K and GDS give the best performance

• MIT-K, LRU and RND are comparable

• LFU shows the worst performance

• Lower values represent better performance

Page 20: Simulation Results for Synthetic Workload: Comparison of Average Retrieval Time Per Reference

• Lower values represent better performance

• LCB-K gives the best performance, though not significantly better than GDS

• LFU is still the worst

• Hit ratio and Byte Hit Ratio are not good measures of caching policy effectiveness on the Grid

Page 21: Summary

• Developed a good replacement policy, LCB-K, for caching in storage resource management on the grid.

• Developed a realistic model for evaluating cache replacement policies, taking into account delays at the data sources, in transfers, and in processing.

• Applied the model in extensive simulations of different policies under synthetic and real workloads of access to the mass storage system at JNAF.

• We conclude that two worthwhile replacement policies for storage resource management on the Grid are LCB-K and GDS.

• LCB-K gives about 10% savings in retrieval cost per reference compared to the widely used LRU.

• The cumulative effect can be significant in terms of reduced network traffic and reduced load at the source.