CHEP 2003: Experience with Deploying Storage Resource Managers to Achieve Robust File Replication. Arie Shoshani, Alex Sim, Junmin Gu. Scientific Data Management Group, Lawrence Berkeley National Laboratory. http://sdm.lbl.gov/srm


Page 1: Experience with Deploying Storage Resource Managers to Achieve Robust File Replication

CHEP 2003

Arie Shoshani

Alex Sim

Junmin Gu

Scientific Data Management Group

Lawrence Berkeley National Laboratory

http://sdm.lbl.gov/srm

Page 2: Outline

• File replication problem - motivation

• What are Storage Resource Managers

• General Analysis Scenario and the use of SRMs

• SRM functionality

• SRMs use for file replication – robustness

• Advantages of using SRMs for file replication

• File monitoring tool

• Analysis of file replication

Page 3: Motivation

• Multi-File Replication – why is it a problem?

  • Tedious task – many files, repetitious

  • Lengthy task – long transfer time, can take days

  • Error prone – need to monitor scripts

  • Error recovery – need to restart file transfers

  • Stage and archive from MSS – limited concurrency, downtime, transient failures

  • Use of FTP – large windows, concurrent transfer

  • Security – both for local MSS and the network

  • Firewalls – transfer from/to MSS must be internal to the site

Page 4: What are Storage Resource Managers?

• Grid architecture needs to include reservation & scheduling of:
  • Compute resources
  • Storage resources
  • Network resources

• Storage Resource Managers (SRMs) role in the data grid architecture
  • Shared storage resource allocation & scheduling
  • Especially important for data intensive applications
  • Often files are archived on a mass storage system (MSS)
  • Wide area networks – minimize transfers
  • Large scientific collaborations (100's of nodes, 1000's of clients) – opportunities for file sharing
  • File replication and caching may be used
  • Need to support non-blocking (asynchronous) requests

Page 5: General Analysis Scenario

[Figure: architecture diagram. Components at the client's site: clients, Request Interpreter (request planning), Request Executer, Metadata catalog, Replica catalog, Network Weather Service. Components at Site 1, Site 2, ... Site N, reached over the network: Storage Resource Manager, Compute Resource Manager, Disk Cache, Compute Engine, MSS. Labeled flows: logical query; a set of logical files; execution plan and site-specific files; execution DAG; requests for data placement and remote computation; result files.]

Page 6: SRM is a Service

• SRM functionality
  • Manage space
    • Negotiate and assign space to users
    • Manage "lifetime" of spaces
  • Manage files on behalf of a user
    • Pin files in storage till they are released
    • Manage "lifetime" of files
    • Manage action when pins expire (depends on file types)
  • Manage file sharing
    • Policies on what should reside on a storage resource at any one time
    • Policies on what to evict when space is needed
  • Get files from remote locations when necessary
    • Purpose: to simplify client's task
  • Manage multi-file requests
    • A brokering function: queue file requests, pre-stage when possible
  • Provide grid access to/from mass storage systems
    • HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE (Jlab), Castor (CERN), MSS (NCAR), …
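The pinning, lifetime, and eviction items above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class and method names (`DiskCacheSketch`, `add`, `release`) are hypothetical, the least-recently-used eviction order is one possible policy chosen for illustration, and none of it is taken from the actual SRM implementation or interface.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CachedFile:
    size: int            # bytes occupied in the cache
    pin_expires: float   # time after which the pin no longer protects the file
    last_used: float     # used to choose an eviction victim


@dataclass
class DiskCacheSketch:
    """Hypothetical disk cache with pinning; not the real SRM code."""
    capacity: int
    files: dict = field(default_factory=dict)

    def used(self) -> int:
        return sum(f.size for f in self.files.values())

    def is_pinned(self, name: str, now: float) -> bool:
        return self.files[name].pin_expires > now

    def add(self, name: str, size: int, lifetime: float,
            now: Optional[float] = None) -> bool:
        """Bring a file into the cache and pin it for `lifetime` seconds.

        Returns False when space cannot be freed without evicting a pinned file.
        """
        now = time.time() if now is None else now
        # Assumed policy: evict unpinned files, least recently used first.
        candidates = sorted(
            (n for n in self.files if not self.is_pinned(n, now)),
            key=lambda n: self.files[n].last_used,
        )
        while self.used() + size > self.capacity and candidates:
            del self.files[candidates.pop(0)]
        if self.used() + size > self.capacity:
            return False
        self.files[name] = CachedFile(size, now + lifetime, now)
        return True

    def release(self, name: str, now: Optional[float] = None) -> None:
        """Drop the pin but keep the file so other clients can share it."""
        now = time.time() if now is None else now
        self.files[name].pin_expires = now
```

A released or expired file stays in the cache until space is needed, which is what makes sharing between clients possible in this model.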

Page 7: Types of SRMs

• Types of storage resource managers
  • Disk Resource Manager (DRM)
    • Manages one or more disk resources
  • Tape Resource Manager (TRM)
    • Manages access to a tertiary storage system (e.g. HPSS)
  • Hierarchical Resource Manager (HRM = TRM + DRM)
    • An SRM that stages files from tertiary storage into its disk cache

• SRMs and file transfers
  • SRMs DO NOT perform file transfer
  • SRMs DO invoke file transfer service if needed (GridFTP, FTP, HTTP, …)
  • SRMs DO monitor transfers and recover from failures
    • TRM: from/to MSS
    • DRM: from/to network
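The division of labor above (the SRM invokes a transfer service, watches the outcome, and recovers from failures) might look roughly like the following sketch. The names `transfer_with_recovery` and `TransientTransferError` are hypothetical, the transfer service is passed in as a plain callable rather than a real GridFTP client, and the exponential backoff is an assumed retry policy; the slides do not say which policy the HRMs use.

```python
import time
from typing import Callable


class TransientTransferError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def transfer_with_recovery(
    invoke_transfer: Callable[[str, str], None],
    source_url: str,
    target_url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call an external transfer service, retrying transient failures.

    Returns the number of attempts used. Re-raises the last error once
    max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # The manager does not move bytes itself; it delegates.
            invoke_transfer(source_url, target_url)
            return attempt
        except TransientTransferError:
            if attempt == max_attempts:
                raise
            # Assumed policy: wait longer after each failure.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.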

Page 8: Uniformity of Interface, Compatibility of SRMs

[Figure: layered diagram. Client user/applications sit on top of grid middleware, which talks to SRM instances through a uniform interface. The SRMs front different storage systems: Enstore, JASMine, DCache, CASTOR, and a disk cache.]

Page 9: SRMs use in STAR for Robust Multi-file replication

[Figure: replication between BNL and LBNL. An HRM-Client command-line interface, which can run anywhere, issues HRM-COPY (thousands of files). The HRM at LBNL (performs writes) gets the list of files and issues SRM-GET (one file at a time) to the HRM at BNL (performs reads). BNL stages files into its disk cache, the network transfer uses GridFTP GET (pull mode), and LBNL archives files from its disk cache. Annotations: recovers from staging failures; recovers from file transfer failures; recovers from archiving failures.]

Page 10: Detailed sequence of actions for each file being replicated

[Figure: sequence diagram. Components: HRM-Client command-line interface (runs anywhere), LBNL HRM (performs writes) with disk cache, BNL HRM (performs reads) with disk cache, web-based file monitoring tool.]

Client request: srmCopy {(sourceURL=hpss.bnl.gov/xyz/file_x, targetURL=hpss.lbnl.gov/uvw/file_y)}

Other labels: Get list of files from directory; Request files

Numbered steps (the pairing of 9, 10 and 12 with their labels is inferred from their position in the diagram text):

1. Allocate space
2. srmGet (sourceURL)
3. Allocate space
4. Stage file
5. File staged (BNL's diskURL)
6. GridFTP GET (pull mode)
7. Transfer complete
8. Release space
9. Call_back: file on disk
10. Archive file
11. Release space
12. Call_back: file on tape
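The numbered steps in this diagram can be walked through with a toy simulation for a single file. Everything here is a sketch: `HrmSketch` and `replicate_one` are hypothetical names, the sets stand in for real disk caches and tape archives, and the assignment of each step to a site (for example, step 1 at LBNL and step 3 at BNL) is an assumption read off the diagram labels rather than something the slide states outright.

```python
from dataclasses import dataclass, field


@dataclass
class HrmSketch:
    """In-memory stand-in for an HRM: a disk cache in front of a tape archive."""
    site: str
    disk: set = field(default_factory=set)
    tape: set = field(default_factory=set)


def replicate_one(reader: HrmSketch, writer: HrmSketch,
                  source_path: str, target_path: str) -> list:
    """Replicate one file and return the ordered (step, site, action) log."""
    log = []

    def step(number: int, site: str, action: str) -> None:
        log.append((number, site, action))

    if source_path not in reader.tape:
        raise FileNotFoundError(source_path)

    step(1, writer.site, "allocate space")            # room for the incoming file
    step(2, writer.site, "srmGet(sourceURL)")         # ask the reading HRM for the file
    step(3, reader.site, "allocate space")            # room for the staged copy
    reader.disk.add(source_path)
    step(4, reader.site, "stage file")                # tape -> disk at the source
    step(5, reader.site, "file staged (diskURL)")     # tell the writer where it is
    writer.disk.add(target_path)
    step(6, writer.site, "GridFTP GET (pull mode)")   # writer pulls over the network
    step(7, writer.site, "transfer complete")
    reader.disk.discard(source_path)
    step(8, reader.site, "release space")             # source cache slot is free again
    step(9, writer.site, "call_back: file on disk")
    writer.tape.add(target_path)
    step(10, writer.site, "archive file")             # disk -> tape at the target
    writer.disk.discard(target_path)
    step(11, writer.site, "release space")
    step(12, writer.site, "call_back: file on tape")
    return log
```

For example, with `bnl = HrmSketch("BNL", tape={"/xyz/file_x"})` and `lbnl = HrmSketch("LBNL")`, calling `replicate_one(bnl, lbnl, "/xyz/file_x", "/uvw/file_y")` leaves the file on LBNL's tape and both disk caches empty.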

Page 11: Web-Based File Monitoring Tool

Shows:
- Files already transferred
- Files during transfer
- Files to be transferred

Also shows for each file:
- Source URL
- Target URL
- Transfer rate
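The three file groups and the per-file fields listed for the monitoring tool suggest a simple data model. The sketch below is an assumption about how such a view could be computed; `FileStatus` and `summarize` are hypothetical names, and the real tool's data source and rate calculation are not described in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileStatus:
    source_url: str
    target_url: str
    size_bytes: int
    bytes_done: int = 0
    started_at: Optional[float] = None    # seconds; None until the transfer starts
    finished_at: Optional[float] = None   # seconds; None until the transfer ends

    def state(self) -> str:
        if self.started_at is None:
            return "to be transferred"
        if self.finished_at is not None:
            return "already transferred"
        return "during transfer"

    def rate_mb_per_s(self, now: float) -> float:
        """Average rate so far, in units of 10**6 bytes per second."""
        if self.started_at is None:
            return 0.0
        end = now if self.finished_at is None else self.finished_at
        if end <= self.started_at:
            return 0.0
        return self.bytes_done / (end - self.started_at) / 1e6


def summarize(files: list, now: float) -> dict:
    """Group files by state, listing source URL, target URL and rate for each."""
    groups = {
        "already transferred": [],
        "during transfer": [],
        "to be transferred": [],
    }
    for f in files:
        row = (f.source_url, f.target_url, round(f.rate_mb_per_s(now), 2))
        groups[f.state()].append(row)
    return groups
```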

Page 12: Tracking multi-file replication performance

[Figure: per-file event timeline. Horizontal axis: time, with ticks from 20020103123100 to 20020103123800. Vertical axis: process stage. One series per file.]

Stage labels:
- FILE_REQUEST_FAILED
- Notified_Client
- Migration_Finished
- Migration_Requested
- Transfered_to_PDSF_from_BNL
- Staging_finished_at_BNL
- Staging_started_at BNL
- Staging_requested_at_BNL
- File replication request start

Files plotted:
- set287_07_10evts_h_dst.xdf.STAR.DB
- set195_02_2evts_dst.xdf.STAR.DB
- set162_01_28evts_dst.xdf.STAR.DB
- set195_01_33evts_dst.xdf.STAR.DB
- set193_01_17evts_h_dst.xdf.STAR.DB
- set165_01_31evts_dst.xdf.STAR.DB
- set165_02_30evts_dst.xdf.STAR.DB
- set163_02_24evts_dst.xdf.STAR.DB
- set163_01_32evts_dst.xdf.STAR.DB
- set192_01_27evts_dst.xdf.STAR.DB

Helped discover hard-to-find bug

Page 13: File tracking helps to identify bottlenecks

Shows that archiving is the bottleneck
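A bottleneck like this can be found by comparing how long files spend between consecutive tracked events. The sketch below assumes the 14-digit timestamps on the tracking plots are in `YYYYMMDDhhmmss` form and that the migration events correspond to archiving at the target site; both are assumptions, and `parse_ts`, `slowest_phase`, and the shortened `STAGES` list are hypothetical helpers that reuse the event names shown on the plots.

```python
from datetime import datetime

# Subset of the tracked events, in the order a successful file is assumed
# to pass through them.
STAGES = [
    "Staging_requested_at_BNL",
    "Staging_finished_at_BNL",
    "Transfered_to_PDSF_from_BNL",
    "Migration_Finished",
]


def parse_ts(stamp: str) -> datetime:
    """Parse a 14-digit timestamp, assumed to be YYYYMMDDhhmmss."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


def slowest_phase(events: dict) -> tuple:
    """Return ((start_stage, end_stage), total_seconds) for the slowest phase.

    `events` maps a file name to {stage name: timestamp string}. Phases with
    a missing start or end event for a file are skipped for that file.
    """
    totals = {}
    for stamps in events.values():
        for start, end in zip(STAGES, STAGES[1:]):
            if start in stamps and end in stamps:
                delta = parse_ts(stamps[end]) - parse_ts(stamps[start])
                key = (start, end)
                totals[key] = totals.get(key, 0.0) + delta.total_seconds()
    if not totals:
        raise ValueError("no file has a complete phase")
    return max(totals.items(), key=lambda item: item[1])
```

Summing over all files, rather than inspecting one, avoids mistaking a single slow file for a systematic bottleneck.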

Page 14: File tracking shows recovery from transient failures

Total: 45 GB

Page 15: File tracking shows network slowdown and recovery

Total: 53 GB

Page 16: Conclusion: Key advantages of using SRMs for file replication

• All HRM communications are part of HRM functionality
  • No changes required to HRMs

• Can replicate files from multiple sites
  • In a single command to one target

• Recovers from transient failures
  • For staging and archiving from MSS
  • For network

• Uses disk buffers to keep multiple files
  • Pre-stage in case of slow network
  • Hold files in case of slow archiving

• Concurrent transfers
  • Concurrent staging, concurrent archiving from/to MSS
  • Concurrent transfers over the network
  • Concurrency limited by parameter setup

• Automatic cleanup of buffers (garbage collection)

• Can replicate files between different MSSs (Enstore, Jasmine, HPSS, Castor, …)

• On-line monitoring, summary generated
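The point that staging, network transfer, and archiving each run concurrently, with the degree of concurrency set by parameters, can be sketched with one semaphore per phase. The names `PhaseLimits` and `replicate_all` and the default limits are hypothetical; the slides do not give the actual parameter names or values.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

PHASES = ("staging", "network", "archiving")


class PhaseLimits:
    """Per-phase concurrency caps (the values here are illustrative only)."""

    def __init__(self, staging: int = 2, network: int = 4, archiving: int = 2):
        self.sem = {
            "staging": threading.Semaphore(staging),
            "network": threading.Semaphore(network),
            "archiving": threading.Semaphore(archiving),
        }


def replicate_all(files: list, actions: dict, limits: PhaseLimits,
                  workers: int = 8) -> list:
    """Run every file through all phases, respecting the per-phase caps.

    `actions` maps each phase name to a callable taking one file.
    Returns the files in their original order once all are done.
    """
    def run(f):
        for phase in PHASES:
            # At most N files are inside a given phase at any moment.
            with limits.sem[phase]:
                actions[phase](f)
        return f

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, files))
```

Separate caps per phase let a slow archive back up into the disk buffer without throttling staging at the source, which matches the buffering behavior described in the bullets above.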

Page 17: Final note

BNL–LBNL file replication for STAR has been in production for 9 months now

(nearly daily use to replicate 1000s of files per day)

More on SRMs

Thursday, at 1:30 pm

(Category 3)

HTTP://sdm.lbl.gov/srm