Descriptive Data Analysis of File Transfer Data Sudarshan Srinivasan Victor Hazlewood Gregory D. Peterson

Post on 01-Jan-2016

Page 1: Descriptive Data Analysis of File Transfer Data

Descriptive Data Analysis of File Transfer Data

Sudarshan Srinivasan

Victor Hazlewood

Gregory D. Peterson

Page 2: Descriptive Data Analysis of File Transfer Data

Objective

· Understand the GridFTP transfer log data we have at NICS.

· Analyze the data and identify areas of potential improvement.

· Perform predictive analysis to improve efficiency.

· Apply knowledge to XSEDE service providers.

Page 3: Descriptive Data Analysis of File Transfer Data

NICS GridFTP Infrastructure

Page 4: Descriptive Data Analysis of File Transfer Data

GridFTP Logging

· GridFTP data transfer protocol, version 5.2.2.

· Two types of logging: "usage" logging and "log_transfer" logging (enabled in 5.2.2).

· Prior to 5.2.2, the endpoint IP address field was filled with 0.0.0.0.

· Thanks to the Globus folks for fixing this bug!

Page 5: Descriptive Data Analysis of File Transfer Data

Transfer Logs

· NICS uses a PostgreSQL database for storing transfer log data.

· Two new tables: n_gridftp_usage and n_gridftp_usage_detail.

· n_gridftp_usage: quick lookup of aggregate monthly GridFTP usage information.

· n_gridftp_usage_detail: Detailed records of each data transfer.

· Log data includes: starttime, endtime, nbytes, user, filename, and the source and destination endpoints.
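As a rough illustration of how the two tables might relate, the sketch below builds a detail table with the fields listed above and derives a monthly aggregate from it. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The column names username, source and destination, the timestamp format, and the sample rows are assumptions, not the actual NICS schema.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")

# Hypothetical layout for the detail table; only the field list comes from the slides.
conn.execute("""
    CREATE TABLE n_gridftp_usage_detail (
        starttime   TEXT,
        endtime     TEXT,
        nbytes      INTEGER,
        username    TEXT,
        filename    TEXT,
        source      TEXT,
        destination TEXT
    )
""")

# Made-up rows for illustration (RFC 5737 documentation addresses).
rows = [
    ("2013-04-01 13:20:41", "2013-04-01 13:20:42", 1048576, "alice", "/a", "192.0.2.1", "192.0.2.2"),
    ("2013-04-15 08:00:00", "2013-04-15 08:00:05", 2097152, "bob", "/b", "192.0.2.1", "192.0.2.3"),
    ("2013-05-02 09:30:00", "2013-05-02 09:30:01", 1048576, "alice", "/c", "192.0.2.4", "192.0.2.1"),
]
conn.executemany("INSERT INTO n_gridftp_usage_detail VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Monthly aggregate: the kind of summary a usage table could hold for quick lookup.
monthly = conn.execute("""
    SELECT substr(starttime, 1, 7) AS month,
           COUNT(*)                AS transfers,
           SUM(nbytes)             AS total_bytes
    FROM n_gridftp_usage_detail
    GROUP BY month
    ORDER BY month
""").fetchall()
```

Keeping a precomputed monthly summary next to the detail records avoids rescanning tens of millions of rows for routine usage reports.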

Page 6: Descriptive Data Analysis of File Transfer Data

Log Data Collection

· Log files from each GridFTP server are copied to a central NFS location.

· Each month we run a processing script on the log files that checks each log entry for errors.

· Following this, we run a script to load the log files into the database tables.

· We chose transfer log data for the year 2013 for this analysis.

DATE=20130401132041.657463 HOST=datamover1.nics.utk.edu PROG=globus-gridftp-server NL_EVNT=FTP_INFO START=20130401132041.534646 USER=username NBYTES=1048576 VOLUME=/ STREAMS=1 STRIPES=1 DEST=[192.249.6.164] TYPE=RETR CODE=226
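A log entry like the one above is a sequence of KEY=VALUE tokens, so it can be split into a record with a few lines of Python. This is a minimal sketch, not the NICS processing script. It assumes that DATE marks the end of the transfer, that values contain no spaces, and that START carries a full 14-digit date-time, as written in the illustrative SAMPLE string. The function names are hypothetical.

```python
from datetime import datetime

# Illustrative entry modeled on the sample log line; START is written with a
# full 14-digit date-time (an assumption) so both timestamps parse the same way.
SAMPLE = (
    "DATE=20130401132041.657463 HOST=datamover1.nics.utk.edu "
    "PROG=globus-gridftp-server NL_EVNT=FTP_INFO "
    "START=20130401132041.534646 USER=username NBYTES=1048576 "
    "VOLUME=/ STREAMS=1 DEST=[192.249.6.164] TYPE=RETR CODE=226"
)


def parse_timestamp(value):
    """Parse a YYYYMMDDhhmmss.ffffff timestamp."""
    return datetime.strptime(value, "%Y%m%d%H%M%S.%f")


def parse_entry(line):
    """Split a KEY=VALUE log line and convert the fields used in the analysis."""
    raw = dict(token.split("=", 1) for token in line.split())
    return {
        "start": parse_timestamp(raw["START"]),
        "end": parse_timestamp(raw["DATE"]),  # assumption: DATE is the completion time
        "nbytes": int(raw["NBYTES"]),
        "user": raw["USER"],
        "dest": raw["DEST"].strip("[]"),
        "type": raw["TYPE"],
        "code": int(raw["CODE"]),
    }


entry = parse_entry(SAMPLE)
elapsed = (entry["end"] - entry["start"]).total_seconds()
```

A missing key raises KeyError and a malformed number or timestamp raises ValueError, which is one simple way a monthly checking script could flag bad entries.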

Page 7: Descriptive Data Analysis of File Transfer Data

Log Data Analysis

· Two variables were identified: number of transfers and total amount of data transferred.

· Data transfer rate was computed from starttime, endtime and nbytes.

· Monthly visual comparison of data coming into and going out of NICS, across all endpoints.

· Number of transfers and amount of data transferred between NICS and other XSEDE sites, in each direction.

· Bucketing of transfer data based on transfer size (ts).

· The R statistical computing language was used to plot all histograms and graphs.
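The transfer rate mentioned in the list above follows directly from the three logged fields: bytes moved divided by elapsed time. The sketch below is one plausible way to compute it in Python. The function name, the choice of Gbps (10^9 bits per second) as the unit, and the handling of zero-length intervals are assumptions.

```python
from datetime import datetime


def transfer_rate_gbps(starttime, endtime, nbytes):
    """Return the average rate in Gbps, or None if the interval is not positive."""
    seconds = (endtime - starttime).total_seconds()
    if seconds <= 0:
        # Very small transfers can log identical start and end times.
        return None
    return nbytes * 8 / seconds / 1e9


# Made-up example: 10 GB (decimal) moved in 10 seconds.
start = datetime(2013, 4, 1, 13, 20, 41)
end = datetime(2013, 4, 1, 13, 20, 51)
rate = transfer_rate_gbps(start, end, 10 * 10**9)
```

Note that this is an average over the whole transfer, so it includes connection setup time and understates the peak rate.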

Page 8: Descriptive Data Analysis of File Transfer Data

Basic Statistics for the year 2013

Type                                 Quantity
Total transfers                      67,160,380
Average transfers per month          5,596,698
File transfers, ts > 64 GB           813 (0.001%)
File transfers, 1 MB < ts < 64 GB    19,374,549 (28.85%)
File transfers, ts < 1 MB            47,785,018 (71.15%)
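The percentages and the monthly average in this table can be reproduced from the three bucket counts. The sketch below also shows a size-bucketing function of the kind the table implies. The binary definitions of MB and GB and the treatment of transfers exactly at a boundary are assumptions, since the slides do not specify them.

```python
MB = 2**20  # assumption: binary megabyte
GB = 2**30  # assumption: binary gigabyte


def bucket(ts):
    """Assign a transfer size in bytes to one of the three buckets."""
    if ts < MB:
        return "ts < 1 MB"
    if ts <= 64 * GB:  # assumption: boundary values fall in the middle bucket
        return "1 MB < ts < 64 GB"
    return "ts > 64 GB"


# Bucket counts for 2013, taken from the table.
counts = {
    "ts > 64 GB": 813,
    "1 MB < ts < 64 GB": 19_374_549,
    "ts < 1 MB": 47_785_018,
}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
monthly_average = total / 12

for label, share in shares.items():
    print(f"{label}: {counts[label]:,} ({share:.3f}%)")
```

The three counts sum to the reported total of 67,160,380 transfers, and dividing by 12 gives the reported monthly average.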

Page 9: Descriptive Data Analysis of File Transfer Data

Number of transfers and amount transferred for the year 2013

[Figure: two panels plotted by Month. Panel 1: Number of transfers (in millions), Total = 83.54 millions. Panel 2: Total amount transferred (in TB), Total = 1235.7 millions. Legend: Mean.]

Page 10: Descriptive Data Analysis of File Transfer Data

Percentage of transfers vs Transfer size for the year 2013

Total transfers: 67160380

[Figure: histogram. x-axis: Transfer size (ts); y-axis: Percentage of transfers.]

Page 11: Descriptive Data Analysis of File Transfer Data

Transfer speed for top 500 transfers with transfer size > 1GB

[Figure: x-axis: Month; y-axis: Gbps.]
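A selection like the one plotted here can be sketched as a filter followed by a top-N pick. The slide does not say how the top 500 were ranked, so ranking by average rate is an assumption, as are the function name, the binary definition of GB, and the (nbytes, seconds) input format.

```python
import heapq

GB = 2**30  # assumption: binary gigabyte


def top_transfers(transfers, n=500, min_bytes=GB):
    """Return the n highest rates (Gbps) among transfers larger than min_bytes.

    transfers is an iterable of (nbytes, seconds) pairs.
    """
    rates = [
        nbytes * 8 / seconds / 1e9
        for nbytes, seconds in transfers
        if nbytes > min_bytes and seconds > 0
    ]
    return heapq.nlargest(n, rates)


# Made-up sample: one transfer is too small and one has zero duration.
sample = [(2 * GB, 4.0), (512 * 2**20, 1.0), (4 * GB, 2.0), (3 * GB, 0.0)]
fastest = top_transfers(sample, n=2)
```

Restricting the ranking to large files keeps short transfers, whose timing is dominated by setup overhead, from distorting the picture.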

Page 12: Descriptive Data Analysis of File Transfer Data

Monthly comparison between number of transfers coming into and going out of NICS for year 2013

[Figure: x-axis: Month; y-axis: Total number of transfers (in millions).]

Page 13: Descriptive Data Analysis of File Transfer Data

Monthly comparison between total amount of data coming into and going out of NICS for year 2013

[Figure: x-axis: Month; y-axis: Total amount of data moved (in TB).]
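The monthly in/out comparisons can be sketched as a grouping of log entries by month and direction. In FTP, RETR sends a file from the server and STOR writes a file to it, so on a NICS server RETR would correspond to data going out and STOR to data coming in. Whether the analysis derived direction this way is an assumption, and the function name and input format are hypothetical.

```python
from collections import defaultdict

# Direction as seen from the server that wrote the log entry.
DIRECTION = {"STOR": "in", "RETR": "out"}


def monthly_totals(entries):
    """Count transfers and sum bytes per month and direction.

    entries is an iterable of (month, ftp_type, nbytes) tuples, where month
    is a string such as "2013-04". Returns {month: {direction: [count, bytes]}}.
    """
    totals = defaultdict(lambda: {"in": [0, 0], "out": [0, 0]})
    for month, ftp_type, nbytes in entries:
        direction = DIRECTION.get(ftp_type)
        if direction is None:
            continue  # skip entry types that are not plain stores or retrievals
        cell = totals[month][direction]
        cell[0] += 1
        cell[1] += nbytes
    return dict(totals)


# Made-up entries for illustration.
sample = [
    ("2013-04", "RETR", 1048576),
    ("2013-04", "STOR", 2097152),
    ("2013-04", "RETR", 1048576),
    ("2013-05", "STOR", 4194304),
]
result = monthly_totals(sample)
```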

Page 14: Descriptive Data Analysis of File Transfer Data

Transfer data buckets for November 2013

[Figure: four histograms. x-axis: Transfer size (ts); y-axis: Percentage of transfers.]

· All transfers for November 2013. Total transfers: 2181157

· All transfers for November 2013, ts < 1 MB. Total transfers: 749747

· All transfers for November 2013, 1 MB < ts < 64 GB. Total transfers: 1431385

· All transfers for November 2013, ts > 64 GB. Total transfers: 25

Page 15: Descriptive Data Analysis of File Transfer Data

Intra XSEDE Sites and Abbreviation

Site Name                                                                         Abbreviation
Texas Advanced Computing Center                                                   TACC
Pittsburgh Supercomputing Center                                                  PSC
San Diego Supercomputer Center                                                    SDSC
National Institute for Computational Sciences / Georgia Institute of Technology   NICS/GaTech
Indiana University                                                                IU
Open Science Grid                                                                 OSG
National Center for Atmospheric Research                                          NCAR

Page 16: Descriptive Data Analysis of File Transfer Data

Intra XSEDE site data coming into NICS

[Figure: two panels plotted by Month, one series per site (TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR). y-axes: Number of transfers (in thousands); Total amount transferred (in TB).]

Page 17: Descriptive Data Analysis of File Transfer Data

Intra XSEDE site data going out of NICS

[Figure: two panels plotted by Month, one series per site (TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR). y-axes: Number of transfers (in thousands); Total amount transferred (in TB).]

Page 18: Descriptive Data Analysis of File Transfer Data

Intra XSEDE site data coming into and going out of NICS together

[Figure: two panels plotted by Month, one series per site (TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR). y-axes: Number of transfers (in thousands); Total amount transferred (in TB).]
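Per-site breakdowns like these require mapping each logged endpoint IP address to a site. The slides do not describe how that mapping was done. The sketch below shows one possible approach using Python's ipaddress module. The prefix table is entirely hypothetical and uses RFC 5737 documentation ranges rather than real site networks.

```python
import ipaddress
from collections import Counter

# Hypothetical prefix-to-site table (RFC 5737 documentation ranges, not real networks).
SITE_PREFIXES = {
    "TACC": ipaddress.ip_network("192.0.2.0/24"),
    "PSC": ipaddress.ip_network("198.51.100.0/24"),
    "SDSC": ipaddress.ip_network("203.0.113.0/24"),
}


def site_of(address):
    """Return the site whose prefix contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for site, network in SITE_PREFIXES.items():
        if ip in network:
            return site
    return None


# Made-up endpoint addresses; the last one matches no known site.
endpoints = ["192.0.2.10", "198.51.100.7", "192.0.2.99", "10.1.2.3"]
sites = [site_of(address) for address in endpoints]
per_site = Counter(site for site in sites if site is not None)
```

This kind of lookup only works when the endpoint address is logged correctly, which is why the 0.0.0.0 issue in earlier versions mattered for this analysis.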

Page 19: Descriptive Data Analysis of File Transfer Data

Future Work

· Currently in progress:

– Moving from a PostgreSQL database to loading the data completely in memory on a separate machine.

– Using Apache Spark for fast large-scale data processing.

– Combining SQL, streaming, and complex analytics.

– Using advanced data mining and machine learning algorithms provided by Python libraries.

· Next steps:

– Analyze by combining job data, filesystem data, and archive data.

– Visualize data flow within the XSEDE network on a geographical map.

Page 20: Descriptive Data Analysis of File Transfer Data

Thank You!

Questions?