Page 1: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Dynamic Data Access to the GT/CERCS Linux Mirror Site

Mohamed Mansour, Matthew Wolf, Karsten Schwan

Page 2: Dynamic Data Access to the GT/CERCS Linux Mirror Site

HPGC - IPDPS 2004 2

Motivation

• Testing (benchmarking) high-performance distributed streaming applications
– Scientific domain
– Enterprise applications

Page 3: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Scientific Data Stream

[Figure: diagram of services connected by data channels. Services: Molecular Dynamics; Bonds (calculate bonds and radial dist.); openGL Visualization server. Data channels: co-ordinates; co-ordinates + bonds; radial dist. data; openGL triangular data.]

Page 4: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Application Specific Workloads

• Margo Seltzer et al. [1999]: test and evaluate systems with realistic workloads
– Avoid over-designing the system
– Provide rigorous insights into system capabilities

Page 5: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Goal

• Understand user interactions with large streaming data repositories
– Analyze FTP traces of the GT/CERCS mirror site
• A tool to replay such workloads
– StreamGen workload generation tool

Page 6: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Example

[Figure: diagram of services connected by data channels. Services: Bonds (calculate bonds and radial dist.); openGL Visualization server #1; openGL Visualization server #2; StreamPerf load generator. Data channels: co-ordinates + bonds; radial dist. data; openGL triangular data (one label per visualization server).]

Page 7: Dynamic Data Access to the GT/CERCS Linux Mirror Site


Outline

• Overview and definitions

• Method of analysis

• Results

• Summary

• Q&A

Page 8: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Non-Striped Traffic

[Figure: non-striped download of file_xxxx.rpm.]

Page 9: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Striped Traffic – Download Accelerators

[Figure: striped download of file_xxxx.rpm using a download accelerator.]

Page 10: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Traffic Traces

[Figure: transfers of file_xxxx.rpm involving the GT CERCS Linux Mirror.]

Page 11: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Striping Factor

\[
\text{striping\_factor} = \frac{\text{bytes\_downloaded}}{\text{total\_bytes}}
\]

[Figure: the same ratio illustrated with segments of file_xxxx.rpm.]
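
The striping factor on this slide is the fraction of a file's bytes that a client fetches from this server. A minimal sketch of that ratio follows; the function name, the argument checks, and the byte counts are illustrative assumptions, not part of the original analysis code.

```python
def striping_factor(bytes_downloaded: int, total_bytes: int) -> float:
    """Fraction of a file's bytes fetched from this server (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if bytes_downloaded < 0:
        raise ValueError("bytes_downloaded must not be negative")
    return bytes_downloaded / total_bytes


# Hypothetical file of 100,000,000 bytes, 45,000,000 of which came from this mirror.
print(f"{striping_factor(45_000_000, 100_000_000):.0%}")  # prints 45%
```

A request that fetches the whole file from one server has a striping factor of 100%.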

Page 12: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Striping Factor – Examples

[Figure: two downloads of file_xxxx.rpm, one with striping_factor = 100% and one with striping_factor = 45%.]

Page 13: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Method of Analysis

• Reconstruct user sessions from xferlog traces
• Metadata, site heuristics and assumptions
– Limit of two concurrent connections per host
– ls-lr files with relative path information
– Idle timeout of 2 hours
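
The session reconstruction step can be sketched as follows. This is a simplified illustration, not the authors' tool: it assumes the common wu-ftpd xferlog field layout (five timestamp tokens, transfer time, remote host, file size, file name, ...), applies only the 2-hour idle timeout, and does not model the two-connection limit or the ls-lr metadata. The log lines and host names are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(hours=2)  # idle timeout stated on the slide


def parse_xferlog_line(line):
    """Return (timestamp, host, size, path), assuming the wu-ftpd xferlog layout."""
    f = line.split()
    timestamp = datetime.strptime(" ".join(f[0:5]), "%a %b %d %H:%M:%S %Y")
    return timestamp, f[6], int(f[7]), f[8]


def reconstruct_sessions(lines):
    """Group transfers per host; a gap longer than IDLE_TIMEOUT starts a new session."""
    by_host = defaultdict(list)
    for line in lines:
        if line.strip():
            ts, host, size, path = parse_xferlog_line(line)
            by_host[host].append((ts, size, path))

    sessions = []
    for host, transfers in by_host.items():
        transfers.sort()
        current = [transfers[0]]
        for prev, cur in zip(transfers, transfers[1:]):
            if cur[0] - prev[0] > IDLE_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(cur)
        sessions.append((host, current))
    return sessions


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    "Thu Feb 12 10:00:00 2004 3 host-a.example.org 1048576 /pub/redhat/a.rpm b _ o a anon@example.org ftp 0 * c",
    "Thu Feb 12 10:30:00 2004 5 host-a.example.org 2097152 /pub/redhat/b.rpm b _ o a anon@example.org ftp 0 * c",
    "Thu Feb 12 13:00:00 2004 2 host-a.example.org 524288 /pub/redhat/c.rpm b _ o a anon@example.org ftp 0 * c",
]

for host, transfers in reconstruct_sessions(SAMPLE_LOG):
    print(host, len(transfers))
```

In the sample, the first two transfers are 30 minutes apart and fall into one session; the third follows a 2.5-hour gap and opens a second session.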

Page 14: Dynamic Data Access to the GT/CERCS Linux Mirror Site

User Sessions

[Figure: Redhat 7.1 traffic histogram (bin size = 1 day). x-axis: Time (days), 0 to 700; y-axis: Sessions, 0 to 700. Series: non-striped traffic, striped traffic.]

Page 15: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Striping Factor Distribution

[Figure: x-axis: Fraction of data downloaded from GA TECH server (%), 0 to 100; y-axis: Fraction of request trains (%), 0 to 30. Series: SuSE 7.3, SuSE 8.0, SuSE 8.1.]

Page 16: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Single File Domination

[Figure: bar chart, y-axis 0% to 100%. Categories: RedHat 7.1, RedHat 7.2, RedHat 7.3, RedHat 8.0, SuSE 7.3, SuSE 8.0, SuSE 8.1, Debian Potato, Debian Woody. Series: Striped, Non-Striped.]

Page 17: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Single File Distribution (striped)

[Figure: Redhat 7.3, single file downloads, parallel download. x-axis: Downloaded data (bytes), 1 to 1E+09; y-axis: Frequency of downloads, 1 to 1000.]

Page 18: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Single File Distribution (non-striped)

[Figure: Redhat 7.3, single file downloads, no download accelerator. x-axis: Downloaded data (bytes), 1 to 1E+09; y-axis: Frequency of downloads, 1 to 10000.]

Page 19: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Results

• Strong similarity between striped and non-striped behavior
– Correlation factor between 70% and 98%
• Download accelerators are common
– Only 20-25% of users do not use them
• Striping factor uniformly distributed over the range of 10-90%
• 7-25% ‘null’ requests
• Requesting a single file is the most common pattern
– Download accelerators exhibit distinctive access patterns
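
The slides do not say how the correlation factor was computed or over which data. One common choice, assumed here purely for illustration, is the Pearson coefficient between the striped and non-striped download-size histograms taken over the same bins. The bin counts below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Hypothetical per-bin download counts over the same size bins.
striped_counts = [2, 5, 40, 120, 60, 8]
non_striped_counts = [10, 30, 300, 900, 500, 40]
print(f"correlation = {pearson(striped_counts, non_striped_counts):.2f}")
```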

Page 20: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Contributions

• Traffic traces
– Reconstructed from real traces
• StreamGen – a library to generate streaming workloads
– Derived from httperf
– Replays traffic traces, or generates statistical patterns
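
The two workload modes named above, trace replay and statistical generation, can be illustrated with a small sketch. This is not StreamGen's API: the function names, the trace layout, and the `speedup` parameter are hypothetical, and the uniform 10-90% range is taken from the Results slide.

```python
import random


def replay_schedule(trace, speedup=1.0):
    """Yield (delay_seconds, path) pairs that preserve the trace's relative timing."""
    previous = None
    for timestamp, path in sorted(trace):
        delay = 0.0 if previous is None else (timestamp - previous) / speedup
        previous = timestamp
        yield delay, path


def synthetic_striping_factors(count, seed=0):
    """Draw striping factors uniformly from the 10-90% range."""
    rng = random.Random(seed)
    return [rng.uniform(0.10, 0.90) for _ in range(count)]


# Hypothetical trace: (seconds since start, requested path).
trace = [(0.0, "/pub/a.rpm"), (4.0, "/pub/b.rpm"), (10.0, "/pub/c.rpm")]
schedule = list(replay_schedule(trace, speedup=2.0))
print(schedule)  # [(0.0, '/pub/a.rpm'), (2.0, '/pub/b.rpm'), (3.0, '/pub/c.rpm')]
```

A load generator would wait `delay_seconds` before issuing each request; a `speedup` above 1 compresses the recorded gaps to stress the server harder than the original trace did.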

Page 21: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Future Directions

• More in-depth analysis of striped behavior
– Modified FTP server to collect offset data
• Use traces as realistic traffic models

Page 22: Dynamic Data Access to the GT/CERCS Linux Mirror Site

References

• V. Oleson, K. Schwan, G. Eisenhauer, B. Plale, C. Pu, and D. Amin, “Operational information systems - an example from the airline industry”, In First Workshop on Industrial Experiences with Systems Software (WIESS)

• Matthew Wolf, Zhongtang Cai, Weiyun Huang and Karsten Schwan, “SmartPointers: personalized scientific data portals in your hand”, In Proc. of the 2002 ACM/IEEE Conference on Supercomputing, Baltimore, Maryland, 2002, pp. 1-16

• Margo Seltzer, David Krinsky, Keith Smith and Xiaolan Zhang, “The Case for Application-Specific Benchmarking”, In Proceedings of the 1999 Workshop on Hot Topics in Operating Systems, Rio Rico, AZ, 1999

• D. Mosberger and T. Jin, “httperf: A tool for measuring web server performance”, WISP, ACM, Madison, WI, June 1998, pp. 59-67

• http://www.cc.gatech.edu/~mansour

Page 23: Dynamic Data Access to the GT/CERCS Linux Mirror Site

Q&A
