The GridKa mass storage system — tsm-symposium.oucs.ox.ac.uk/2007/papers/jos van wezel...
TRANSCRIPT
The GridKa mass storage system
Jos van Wezel / GridKa
[Tape|TSM] staging server
TSM Symposium, Oxford, Sep 26, 2007 / Jos van Wezel
Outline:
• Introduction
• Grid storage and storage middleware
• dCache and TSS
• TSS internals
• Conclusion and further work
FZK/GridKa
The GridKa project at Research Center Karlsruhe:
• Project start in 2002
• Construct a compute cluster for use in computing Grids
• Current capacity: ~3000 cores, 1.5 PB disk, 2 PB tape
• Commenced as central compute service to the particle physics community in Germany
• Now serves several other Virtual Organizations (VOs)
• Main focus at the moment is data processing for the LHC (Large Hadron Collider)
  – LCG: LHC Computing Grid
  – WLCG: World-wide LCG
Planning numbers

[Bar chart: planned Tape (TB), Disk (TB) and CPU (# cores) capacity per year, 2007–2010. Values shown on the chart: 1788, 1744, 2070, 3810, 4397, 4622, 4720, 5166, 7238, 8629, 11059, 13088.]
Planned transfer rates to and from tape

T0: CERN/data source; T1: GridKa/data reprocessing; T2: data analysis. All values in MB/s.

VO      T1←T1   T1→T2   T1→T1   T2→T1   T0→T1
ALICE    16.0    17.8    15.2    44.1    14.9
CMS      48.0    63.0    91.2     7.0    26.3
ATLAS   110.3    99.0    93.3    34.1    88.2
LHCb     18.1    11.4    20.4     0.2     6.3
SUM     192.4   191.2   220.1    85.4   135.6
Tape hardware
• Libraries
  – GRAU XL (~600 slots / hour)
  – IBM 3592 (~700 slots / hour)
• Drives
  – LTO2: 8 installed
  – LTO3: 24 installed; 8 older drives (1 replaced)
    • IO rate of 42 MB/s observed, mostly less (~25); could be 80?
Tape SAN
Outline: Introduction / Grid storage and storage middleware / dCache and TSS / TSS internals / Conclusion and further work
Accessing ‘Grid’ data
• Network access via uniform protocol: SRM
• SRM is software on top of a storage management system
• Connects grid storage islands through a grid Storage Service
• Network interface to storage
• You provide data and a combination of:
  – Access protocol: dcap, rfio, nfs
  – Retention policy: recovery time needed (custodial, replica, output)
  – Access latency: nearline, online, (offline) / tape, disk, shelf
• Location management takes care of duplication
• The storage system (dCache) does the rest for you
WLCG storage
• WLCG uses the Storage Resource Manager
• From SRM the following storage classes are inferred for the WLCG data management:
  – T1D0: files moved to tape directly
  – T1D1: files migrated to tape but kept on disk as long as there is space or until the ‘pin’ times out
  – T0D1: files on disk only
• Tape (or mass storage system) is considered ‘custodial’ storage, meaning data is to be kept indefinitely.
• We do not delete data on tape.
dCache
Think of it as a filesystem. It gives you:
A: interface to the grid via the SRM data storage interface
  – data placement based on SRM classes
  – disk, tape or both
B: manages disk storage
  – ‘global’ name space (within the domain)
  – load balancing
  – access control (via certificates)
C: backend to write to and read from permanent storage (tape or other Mass Storage System)
  – GridKa is ‘custodian’ for part of the raw detector data
  – Some computed data also goes to tape
Disk pool managers
• dCache
  – interfaced with TSS/TSM, HPSS, ENSTOR, OSM
• DPM: the disk pool manager
  – has no mass / archival storage support yet
• StoRM: an SRM on top of GPFS
  – efficiency of the interface to TSM is being investigated
• Xroot: in use at particle physics labs
  – HPSS, TSS/TSM
dCache components

[Diagram: a user (CLI) issues commands; file system metadata is kept in a database and exposed through a (p)NFS mount; the interface to the grid uses dcap, gridftp and xrootd doors; a pool manager selects the pool for writes and supplies the pool for reads.]
Data flow to the T1 with dCache and SRM

[Diagram: an SRM client at another grid site opens the control channel (SRM SOAP messages) to the dCache SRM server/door at GridKa, crossing the OPN/firewall boundary. GridFTP clients talk to the GridFTP server/door, and the data channel(s) run to the dCache pools (class 1, class 2, class 3 … class n). Requests are queued on the space token description; the pools hand data on to TSS and TSM.]
Summary 1
• SRM is the entry point for data exchange between grid sites.
• Disk pool managers offer an SRM interface to disk (and tape) storage.
• The disk pool manager in use at GridKa is dCache.
• dCache intelligently places files on distributed disk storage and coupled mass storage (tape).
Outline: Introduction / Grid storage and storage middleware / dCache and TSS / TSS internals / Conclusion and further work
dCache disk pools and pool nodes
Disk pools on dCache
• trigger a callout based on
  – number of files
  – total size
  – wait time
• the callout runs on recalls and on migrate requests
• synchronous to dCache activities
Calling sequence
dCache provides
• physical filename: name on the disk pool
• unique ID: pnfsid
• storage info: detailed variables
  – logical file name: name as seen by the user
  – administrator-defined ‘tag’
• the tag is set per directory and follows the parent; e.g. under /pnfs/gridka/vos/ the atlas/disk and atlas/tape directories carry the tag dc_atlas, and cms/disk and cms/tape carry dc_cms
• the callout runs a UNIX script
Output on callout: [diagram]
dCache to TSS / TSM
Outline: Introduction / Grid storage and storage middleware / dCache and TSS / TSS internals / Conclusion and further work
dCache to TSM previously
Original TSM backend:
• 1 file results in 1 store or recall
  – large overhead: session startup takes an inordinate amount of time
  – when storage agents are used, the TSM volume selection algorithm starts a cartridge juggle; efficiency nears zero
• No data classes
  – everything goes to one and the same tape
  – no policies or quota for particular data
• On recalls
  – no control over tape file order: recalls become virtually impossible
  – dCache cannot provide queues (for recalls)
Remember: tape allows only sequential access!
Requirements for dCache to tape interface
• Use the available TSM base at Forschungszentrum Karlsruhe
• Improve throughput
• Reduce the number of tape mounts
• Use different tape sets for different data classes
TSS properties
• Interfaces directly with TSM via the API
• Fan-out for all dpm/dCache to tape activities
  – multiple operations: recall, migrate, rename, delete, query
• Runs on the TSM clients, a storage agent, or on the server proper
• Plug-in replacement for the TSM backend that comes with dCache
• Sends different types of data to different tape sets
• Two-level data classes (with dCache)
• Queues requests on tape sequence order
• No persistent state is kept
• Allows storing an exact image of the logical global name space on tape
• Command line interface to set running parameters, monitor the processing, run db queries (think of it as an alternative dsmc)
TSS command and data flow
[Diagram: dCache pool requests enter the store/recall queue of the TSS queueing subsystem; dCache requests archive and class metadata from the tape system. The scheduler selects a queue and processes it, coordinating with the arbiter over an inter-scheduler communication channel. The TSM API carries the session channel, the archive metadata and the data channel; payload data flows through the storage agent to TSM.]
Major components
Queuing engine
• data management: input/output files, set data classes
• enqueue: creates queues

Scheduler
• selects a queue to process based on a trigger
• starts threads to process queue(s)

TSM DMI interface
• handles sessions
• queries the TSM DB
• sets up data transfers
• sends and receives data

Admin interface
• separate thread to return status information of queues and clients
• stopping and starting the subsystems
• changing running parameters
TSS Scheduler
• The scheduler starts request processing per queue
• More than one queue may be processed concurrently (allows ‘big’ hosts to handle 2 or more tape drives)
• A queue is determined ‘runnable’ based on:
  – time: elapsed time since the first job entry
  – size: summation of the number of bytes of all files
  – length: number of requests in the queue
• Communicates with the arbiter to prevent ‘drive collisions’ (in the next version)
TSS Queue engine
• On (recall) entry, query the TSM DB
  – if the ‘object’ exists, its id is put in the queue
• On (migrate) entry, select the management class
  – unknown classes are not migrated
• Renames, deletes etc. are forwarded directly
• The caller waits until TSS returns with the data or a non-zero error code
DMI API
Wrapper around the API:
• a library of the library
• Keeps track of open sessions/handles
• Simplifies queries and Send/Get data calls
• Utility functions: dmi_query_mc(), dmi_log(), dmi_session_info() etc.
• Callbacks separate the API lib from the rest of the code
• Example: a data Get becomes:

    if (dmi_init(p1 *, …) == 0)
        if (dmi_query(p1 *, …) == 0)
            if (dmi_get(p1 *, …) == 0)
                return(0);

• Regretfully, no API support for library and/or volume handling.
Tapeview (VOs)
Tapeview (storage agents)
Outline: Introduction / Grid storage and storage middleware / dCache and TSS / TSS internals / Conclusion and further work
Current issues
• ‘Communication lost’ errors:
  – ANS1026E (RC136) The session is rejected: There was a communications protocol error.
• Multiple TSM clients talking to a single TSM agent allocate ‘multiple’ tape drives
  – multiple clients now talk to a single STA
• No load balancing / diversion for more than 1 library
• No upstream error detection: library down, no scratch tapes left, no more drives available etc.
• The interface from dCache (Java) to TSS is a shell script, i.e. limited signal processing.
In progress
• Queue process arbitration via a volume pegboard
  – reduce concurrent drive access
  – improve recall throughput
  – synchronous updates would hamper operations
• Concurrent queue processing
  – configurable number of queues processed concurrently on a single host
• Per-queue scheduling parameters
  – different queue triggers for read and write queues
  – finer tuning of write queues
• Remote queue entry
  – clients connect to a central TSS
  – groups requests (especially needed for recall)
• Support for multiple tape libraries
Conclusions
• TSM can handle > 300 MB/s
• TSS is working as expected
• Tape speed is not at the expected rates
• Need to find out the access pattern / tape mounts
• Need to have better error recovery
• Configuration is, eeeeh…, pretty complex
• It would be better if TSM could do this
Outline: Introduction / Grid storage and storage middleware / dCache and TSS / TSS internals / Conclusion and further work
Many thanks to: Dorin Lobontu, Stephanie Boehringer, Silke Halstenberg, Doris Ressmann

You probably have some questions?
Spare slides
Data Flow – data and meta data
Storage on the grid: SRM
- hide the complexity of the local storage at a site behind a uniform interface: SRM
- connect grid storage islands through a grid Storage Service
- SRM is software on top of a storage management system
- provide dynamic space allocation/reservation and file management: space management functions
- provide dynamic information regarding storage and files: status functions
- take care of authorization and authentication (in the dCache SRM via the gPlazma cell): permission functions
- negotiate transfer protocols: data transfer functions
- and many other things …..