TRANSCRIPT
Rechenzentrum Garching der Max-Planck-Gesellschaft
Sept 29, 2009, European AFS Conference 2009, Rome
Hartmut Reuter, [email protected]
Tutorial OpenAFS + Object Storage
Hartmut Reuter, RZG, Germany
Christof Hanke, RZG, Germany
Felix Frank, DESY Zeuthen, Germany
Overview
● What is AFS/OSD ? (in case you didn't attend prior presentations)
● Update to AFS/OSD since Graz
– Policies
– Embedded Filesystems
● Tutorial
– Typical use cases for AFS/OSD
– Setting up AFS/OSD
– Backup strategies for data in OSDs and metadata, link counts
– Running a cell with AFS/OSD
● Reference: New commands and subcommands
What is object storage
[Diagram: clients, a metadata server, and OSDs connected over an IP network]
● Object storage systems are distributed filesystems which store data in object storage devices (OSD) and keep metadata in metadata servers.
● Examples of object storage systems are Lustre and Panasas.
● Access on a client to a file in object storage consists of the following steps:
– an RPC to the metadata server, which checks permissions and returns a special encrypted handle for each object the file consists of.
– an RPC to the OSDs to read or write the objects using the handle obtained from the metadata server.
● The OSD can decrypt the handle by use of a secret shared between OSD and metadata server. The advantage of this technique is that the OSD doesn't need any knowledge about users and access rights.
● Files may be striped or mirrored over multiple OSDs.
“AFS/OSD” or “OpenAFS + Object Storage“
● Is an extension to OpenAFS which allows files to be stored in object storage.
● The OSDs (object storage devices) for OpenAFS use the rx-protocol and are therefore called “rxosd”.
● The AFS fileserver stores the metadata of files which are in object storage.
● A new ubik-database stores location and features of the OSDs.
● The decision which files should go into object storage can be based on policies.
● Files in object storage can be simple, striped over multiple OSDs, and/or have copies in multiple OSDs.
● AFS/OSD allows the use of HSM systems to migrate inactive files to tape.
[Diagram: clients, AFS fileserver, OSDDB, rxosds, and an archival rxosd in front of an HSM system]
Example: reading a file in OSD
[Diagram: client, fileserver, osddb, and four OSDs; the numbered arrows correspond to the steps below]
(1) The fileserver gets a list of OSDs from the osddb. (2) The client knows from status information that the file it wants to read is in object storage. Therefore it does an RPC to the fileserver to get its location and the permission to read it and to lock the file during the asynchronous fetch operation. The fileserver returns an encrypted handle to the client. (3) With this handle the client gets the data from the OSD. (4) The client does an EndAsynchronousFetch RPC to the fileserver to release the lock.
Example: writing a file into OSD
[Diagram: client, fileserver, osddb, and four OSDs; the numbered arrows correspond to the steps below]
(1) The fileserver gets a list of OSDs from the osddb. (2) The client asks the fileserver where to store the file it has in its cache. (3) The fileserver decides, following its policies, to store it in object storage and chooses an OSD where it allocates an object. (4) The client asks the fileserver for permission and location of the object and to start an asynchronous store operation which locks the file. (5) The client stores the data in the OSD using the encrypted handle it got from the fileserver. (6) The client does an EndAsyncStore RPC to the fileserver to set the actual length of the file and to release the lock.
Objects
● Data are stored in objects inside OSDs.
– simplest case: one file == one object
[Diagram: a single file mapped to a single object in one OSD]
Multiple Objects
● Data of a file could be stored in multiple objects allowing for
– data striping (up to 8 stripes, each in a separate OSD)
– data mirroring (up to 8 copies, each in a separate OSD)
[Diagram: one file split into several objects spread over multiple OSDs]
Objects + Segments
● To a file already existing on some OSDs more data could later be appended. The appended data may be stored on different OSDs (in case there is not enough free space on the old ones).
– This leads to the concept of segments
– The appended part is stored in objects belonging to a new segment.
[Diagram: objects of the original and the appended segment spread over several OSDs]
Objects + Segments + File Copies
● The whole file could get a copy on an archival OSD (tape, HSM)
– This leads to the concept of file copies
[Diagram: all segments and objects of the file copied to an archival OSD]
The complete osd metadata
● The vnode of an AFS file points to quite a complex structure of OSD metadata:
– Objects are contained in segments
– Segments are contained in file copies
– additional metadata such as md5-checksums may be included.
● Even in the simplest case of a file stored as a single object the whole hierarchy (file copy, segment, object) exists in the metadata.
● The OSD metadata of all files belonging to a volume are stored together in a single volume special file
– This osdmetadata file has slots of constant length which the vnodes point to
– In case of complicated metadata multiple slots can be chained
● The OSD metadata are stored in network byte order to allow easy transfer during volume move or replication to other machines.
osd metadata in memory
● In memory the file is described by a tree of C-structures for the different components:
– osd_p_fileList (len, val)
– osd_p_file (archvers, archtime, spare, segmList, metaList)
– osd_p_segm (length, offset, stripes, stripesize, copies, objList)
– osd_p_obj (obj_id, part_id, osd_id, stripe)
– osd_p_meta
● These structures are serialized in net-byte-order by means of rxgen-created xdr-routines into slots of the volume special file "osdmetadata".
Update to OpenAFS/OSD
● Since the last European meeting in Graz some new things came in
– Felix Frank completed his work on osd policies
– At the Workshop at Stanford I presented OpenAFS/OSD with embedded cluster filesystems (Lustre or GPFS)
● After the introduction of git and gerrit this year, integration of the OSD code into OpenAFS could start. After discussions in openafs-devel and AFS3-std
– a new transaction model has been applied to asynchronous data access enforcing mandatory locking
– dump tags have been changed to be compliant with the AFS3 standard proposal
● A web site at RZG integrates all information about OpenAFS/OSD and gives access to the source in the subversion server at DESY.
– http://pfanne.rzg.mpg.de/trac/openAFS-OSD
Policies
● Since the last workshop Felix Frank (DESY Zeuthen) implemented the policies. Policies are rules stored in the OSD-database which have a number and a name. A policy number can be set for a whole volume or for individual directories inside the volume.
● The fileserver gets the policies from the OSD-database and applies the rules when new files in the directory or volume are created or written for the 1st time.
● Policies basically can use filenames (suffixes or regular expressions) and file size to state whether a file
– should be stored on the local disk or go into object storage
– and if into object storage whether striped or mirrored
– and if striped with which stripe size.
● Simple example (policy number 3, named "root"):
~'*.root' => location=osd, stop;
This means: all files with suffix ".root" should be stored in object storage independent of their size. (A typical problem in the HEP community, because the program root does an fsync() after writing the tiny header of the file.)
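● A minimal sketch of how such a policy could be put into use with the subcommands listed later in the command reference ('osd addpolicy', 'vos setfields', 'fs setpolicy'). The exact argument syntax of 'osd addpolicy' is an assumption, and the volume and directory names are made up:

  # create policy 3 named "root" in the OSDDB (argument order is assumed)
  > osd addpolicy 3 root "~'*.root' => location=osd, stop;"
  # attach the policy to a whole volume ...
  > vos setfields hep.data -osdpolicy 3
  # ... or only to a single directory inside a volume
  > fs setpolicy /afs/mycell/hep/run2009 3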
File creation rate
Evaluation of a policy has its price:
OSD off means normal AFS volume.
OSD no policy means files > 1 MB go into OSD
Most expensive is evaluation of regular expressions on file names.
Measurement and graphic by Felix Frank.
Embedded Filesystems
● Modern cluster file-systems such as Lustre or GPFS are much faster than AFS especially in combination with fast networks (Infiniband, 10GE)
● But they have other disadvantages:
– Because of their giant block size they are inefficient for small files
– File creation and deletion is slow
– There is no secure way to export them into a WAN or even to desktops
– They are accessible only from Linux (or AIX in the case of GPFS).
● This is exactly complementary to AFS. Therefore combine both!
– Use these fast file-systems for rxosd (or fileserver) partitions
– Export them to your trusted batch clusters and HPC environment
– Allow the AFS clients in the batch cluster to read and write files directly
– Let all other clients access data with OSD protocol or classical AFS protocol.
Embedded Filesystems 2
● OSD files:
– rxosd writes an “osdid” file into the partition
– „afsd -check-osds“ informs cache manager about visible OSDs
– If a file consists of a single object in a visible OSD it's accessed directly
● Plain AFS files:
– Fileservers write a "partid" file into the partition
– „afsd -check-fspartitions" informs the cache manager about visible vicep partitions
● The fileserver with the uuid found in the partid file gets a flag
● Volumes on such a fileserver also get a flag
– If the volume is flagged the CM uses a GetPath RPC to the fileserver
– If the file can be opened all I/O is done directly
● Locking :
– During these asynchronous I/O transactions files are locked in the fileserver
HEPIX-Tests (1)
● The "HEPIX Storage Working Group" last year developed a use case for distributed storage systems based on CERN's soft- and middleware stack for CMS
● In a 1st round in 2008 at FZK (Forschungszentrum Karlsruhe) Andrei Maslennikov compared AFS, DCACHE, LUSTRE, and XROOTD.
Source: "HEPIX storage working group - progress report 2.2008", Andrei Maslennikov, Taipei, October 20, 2008
● Lustre performs much better than AFS, DCACHE and XROOTD
HEPIX-Tests (2)
● In June this year, again at FZK but with different server hardware, only Lustre and Lustre embedded in AFS were compared.
Source: SMS from Andrei Maslennikov, June 3, 2009
● AFS with embedded Lustre comes closer to native Lustre.
[Chart: aggregate throughput for 20, 40, 60, and 80 threads; series: Lustre and AFS/Lustre]
TUTORIAL
● Typical use cases for AFS/OSD
– AFS with HSM system
– AFS with embedded Lustre or GPFS
– AFS with just disk OSDs
● Setting up AFS/OSD
● Backup strategies for data in OSDs and metadata
● Running a cell with AFS/OSD
– Archiver and wiper
– Gathering information with „vos traverse“
– Health check with „vos salvage“
– How to replace OSDs
– What to do when a RAID breaks
Typical Use Cases
There are use cases where you would like to have OpenAFS/OSD
● If you have an HSM system and want to make use of it for AFS
– to get „unlimited" disk space for AFS
– to remove inactive files from disk instead of buying more and more disks
– to have better backup of your data
● If you have a fast cluster filesystem such as Lustre or GPFS and want to integrate it into AFS
– to get nearly native cluster filesystem speed for AFS inside the cluster
– to make files in the cluster filesystem accessible
● from all platforms
● from all over the world
● If you have both (HSM and fast cluster filesystem)
– You can also add HSM features to the cluster filesystem
Other Use Cases
Also without HSM and cluster filesystems you may want to use OpenAFS/OSD
● If you have many fileserver machines, take away one partition on each of them for OSD
– to make huge volumes easier to handle (move, dump)
– to distribute files more equally over your disk resources
– to get better throughput in heavily used volumes
– to allow for RW mirroring of files (better read throughput, lower risk of losing data)
– to allow for striping of files (better read and write throughput)
[Diagram: classical OpenAFS vs. OpenAFS/OSD partition layout]
– Same amount of disk space, but split between fileserver (lower piece) and OSD (upper piece).
– Load to server machines balanced even if only a single volume used at a time.
– Huge volumes can still be moved or replicated.
Setting Up AFS/OSD
● Adding OSD support to your production cell requires the following steps:
– Get RPMs from Christof Hanke's builds or compile them yourself from source
– Install and start osddbserver on database machines
– Install rxosd on the OSD machines
– Create database entries for your OSDs in the OSDDB
– Start OSDs
– Install fileserver, volserver, salvager on your 1st AFS/OSD fileserver machine
– Create new volumes there or move existing OpenAFS volumes to it
– Set osdpolicy in the volumes which should use OSDs
RPMs for OpenAFS/OSD
● OpenAFS/OSD as we use it today is based on the openafs-1.4.11 source.
● For some Linux flavors Christof Hanke builds rpms of OpenAFS/OSD
– CentOS_5
– Fedora_10
– Mandriva_2009
– RHEL_5
– SLES_9
– SLE_10
– SLE_11
– openSUSE_11.0
– openSUSE_11.1
● See http://download.opensuse.org/repositories/home:/hauky/ for details
– The RPMS built with OSD support have 'osd' in their name.
Building OpenAFS/OSD
● For all other systems you need to build it yourself
– svn checkout http://svnsrv.desy.de/openafs-osd/trunk/openafs
– I use the following options for configure
> ./configure --enable-transarc-paths --enable-largefile-fileserver \
      --enable-bos-restricted-mode --enable-bos-new-config \
      --enable-namei-fileserver --enable-fast-restart --enable-bitmap-later \
      --enable-debug-kernel --disable-optimize-kernel --enable-debug \
      --enable-debug-lwp --disable-optimize --disable-optimize-lwp \
      --enable-object-storage --enable-vicep-access
● We have built OpenAFS/OSD only on the following platforms
– AIX (5.3, 4.2)
– SOLARIS (5.8, 5.9, 5.10)
– LINUX 2.6 (i386, amd64, ppc64)
– MacOS 10
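● A minimal build sketch, assuming a standard OpenAFS 1.4-style source tree (regen.sh and 'make dest' behave as in stock OpenAFS; trim the configure options to your needs):

  svn checkout http://svnsrv.desy.de/openafs-osd/trunk/openafs
  cd openafs
  ./regen.sh         # regenerate configure when building from a repository checkout
  ./configure --enable-transarc-paths --enable-namei-fileserver \
              --enable-object-storage --enable-vicep-access
  make && make dest  # with --enable-transarc-paths the binaries end up in the dest tree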
Creating OSDDB
● Install on your AFS database servers the binary „osddbserver“ and create the instance.
– This gives you an empty OSDDB with only a dummy entry for OSD 1 „local_disk“
– The fileserver uses the size range for OSD 1 to decide which files should be stored in OSDs. Default is (0kb-1mb). So files > 1mb go into OSDs.
> bos install mydbserver osddbserver
bos: installed file osddbserver
> bos create mydbserver osddbserver simple /usr/afs/bin/osddbserver
> osd list
id name(loc)    total space    flag prior. own. server   lun size range
 1 local_disk                       wr rd                    (0kb-1mb)
>
● Now you may define your 1st OSD in the OSDDB
> osd createosd 2 myosdz -ip 130.183.2.114 -lun 25 -minsize 1m -maxsize 100g
> osd list
id name(loc)    total space    flag prior. own. server   lun size range
 1 local_disk                       wr rd                    (0kb-1mb)
 2 myosdz          0 gb  0.0 % down    0    0      myosd   25 (1mb-100g)
>
● The status remains „down" until the OSD has sent its space information. This has to happen every 5 minutes; if not, it is marked „down" again.
Creating OSDDB 2
● To change fields use 'osd setosd <number> ....'
> osd setosd 2 -wrprior 80
> osd list
id name(loc)    total space    flag prior. own. server   lun size range
 1 local_disk                       wr rd                    (0kb-1mb)
 2 myosdz          0 gb  0.0 % down   80    0      myosd   25 (1mb-100g)
>
● If for a given size range more than one OSD is available the one with the higher 'wrprior' will be chosen.
● An OSD may have an owner and/or a location (both max 3 characters)
– The idea is to let fileservers at a remote location use only the OSDs there
– and to let OSDs some project paid for be used only by that project
– OSDs with the same location as the fileserver get a bonus of 20 on 'wrprior'
– OSDs with another or no location get a malus of -20 on 'wrprior'
– For owner it's +10 and -10
– The system remains elastic: if all other OSDs are full you still may get space remotely or on a foreign OSD
Setting Up 1st OSD
● Install and start the binary 'rxosd' on the machine where your 1st rxosd should run
> bos install myosd rxosd
bos: installed file rxosd
> bos create myosd rxosd osd /usr/afs/bin/rxosd
> osd list
id name(loc)    total space    flag prior. own. server   lun size range
 1 local_disk                       wr rd                    (0kb-1mb)
 2 myosdz       2500 gb  0.0 % up      80    0      myosd   25 (1mb-100g)
>
● If you expect to also run a fileserver on your OSD machine
– Put a file "OnlyRxosd" into the OSD's /vicep-partition to let the fileserver ignore it
● Install the fileserver, volserver, salvager binaries and create the new instance 'fs'
> bos create myfs fs "/usr/afs/bin/fileserver -L" /usr/afs/bin/volserver /usr/afs/bin/salvager
>
Policy Number 1
● To make use of the new OSD you need a volume with an osd policy set in its header
– Policy number 1 cannot be defined explicitly in the OSDDB
● It says: store files > max size for local_disk in OSDs
– If you don't have special needs you can live with only policy 1
> vos setfields myvolume -osdpolicy 1
>
● If no filename based policy is applied all files will be created in the fileserver's partition.
– Before the cache manager stores for the 1st time it does an ApplyOsdPolicy RPC to the fileserver telling it the actual length of the file
– At this time the fileserver may or may not decide to move the file to an OSD
– The problem with a too large max size for local_disk is
● With a small cache the cache manager may have to store data of the file before its length reaches this threshold.
● Later appends cannot change the file's location any more.
Backup Strategy 1
● „vos dump“ generally will only dump
– files (and directories …) which are stored on the fileserver
– the OSD metadata of the files which are stored in OSDs.
● There is an option “-osd” for “vos dump” which also integrates the data of the files in OSDs as if they were local files
– It is intended primarily to allow restoring of OSD-volumes on normal fileservers
– You shouldn't use “-osd” if you have HSM and are not sure all files are on-line.
– The dump files could become very large for large volumes.
● Therefore it's better to create copies of the objects on archival OSDs
– If you have HSM these typically are OSDs acting on HSM filesystems
– If not you may use any old cheap slow disks as archival OSDs.
● Instead of doing dumps you also may replicate the volumes to other servers
– OSD volumes mostly have only small files and directories in the fileserver
– After loss of a partition you are back very fast with "vos convertROtoRW".
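● A hedged example of such a full dump with the "-osd" option mentioned above (the other options are standard 'vos dump' options; the volume and server names are made up):

  > vos dump -id myvolume -time 0 -file myvolume.full.dump -osd
  # the resulting dump contains the OSD data inline and can be restored
  # on a fileserver without any OSD support:
  > vos restore normalserver a myvolume.restored -file myvolume.full.dump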
Backup Strategy 2
● Creating backup copies of objects in archival OSDs can easily be automated
– There are scripts to be run on the database servers as a bos instance
– The script works on the OSDDB's sync site and is stand-by on the others
– It gets all information from the OSDDB and asks the volservers for candidates
– For these candidates it issues „fs fidarchive" commands
– If you have multiple archival OSDs you may run separate scripts for each one
● guarantees best throughput by having a constant input stream per OSD
● allows you to start and stop them separately in case of down times
● To check that the backup works properly run „vos traverse" over all servers once a day
– It shows you in the section „Data without a copy" how much data are vulnerable
– In the 1st hour data don't get a copy, to avoid saving intermediate versions
● If you have a suspicion you may run "vos listobjects <osd number> <server> -single -size"
– Shows you all object IDs (Fids) of files without a copy elsewhere
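● A sketch of such a daily check, assuming a file 'fileservers.txt' with one fileserver name per line (the section name and the 'vos listobjects' options are taken from the slides above, everything else is made up):

  #!/bin/sh
  # daily-osd-check.sh - report data that still lacks an archival copy
  for server in `cat fileservers.txt`; do
      echo "=== $server ==="
      vos traverse $server | sed -n '/Data without a copy/,$p'
  done
  # if something looks suspicious, list the unprotected objects of one OSD:
  # vos listobjects <osd number> <server> -single -size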
Backup Strategy 3
● Some sites (including us) use TSM to backup AFS volumes over the /afs path
– This allows users to restore deleted files or old versions of files on their own.
– With HSM you shouldn't do this for OSD volumes because files may be wiped, causing TSM to wait a long time to bring them back on-line from tape.
● For OSD volumes we are sometimes able to restore deleted OSD files by
– Never really unlinking files in HSM archival OSDs, but renaming them adding a suffix “-unlinked-yyyy-mm-dd”. This simulates the “soft delete” known in some HSM systems (DMF). After a grace period of a month these files may be deleted.
– This alone wouldn't help for deleted files because the Fid isn't known anymore.
– Therefore we do once in a while a “vos dump -osdmetadata” which only contains directories and OSD metadata. The tar.bz2 file for all volumes is < 1 GB.
– With Ken Hornstein's „dumptool“ you can walk in the volume's tree and find the OSD metadata of the lost file.
– From the OSD object ID you can then calculate the namei-path ...
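● A hedged sketch of this metadata-only dump. The slide above calls the option "-osdmetadata" while the command reference later lists it as "-metadataonly", so check your build; the volume list and paths are made up:

  # weekly metadata-only dumps of all volumes listed in volumes.txt
  for vol in `cat volumes.txt`; do
      vos dump -id $vol -time 0 -file /backup/osdmeta/$vol.dump -metadataonly
  done
  tar cjf /backup/osdmeta-`date +%Y-%m-%d`.tar.bz2 /backup/osdmeta
  # later: walk a dump with Ken Hornstein's dumptool to find the OSD metadata
  # (and thus the object ID) of a deleted file, then calculate the namei-path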
Link Counts
● The link count of a file in a vice-partition indicates how many volumes a file belongs to
– For local AFS files typically you have a RW volume and in the same partition a RO-volume, thus a link count of 2. With a backup volume you reach 3.
– During „vos move“ a temporary volume is cloned which increments the link count.
– Removing volumes, or removing the file in the RW-volume, decrements the link count.
– If the link count goes to zero the file gets unlinked.
● For files in OSDs the link count is handled exactly the same way,
– But it can become higher because volume copies in any fileserver are counted.
– If a file is deleted while the OSD is down the link count cannot be decremented.
– Therefore it's very important to make sure the link counts are correct
● If it's too low you can lose the file while it still belongs to a volume
● If it's too high an orphaned file will later remain in the OSD
● „vos salvage“ checks for all files the size and the link count and can also be used to correct them.
– It calculates the correct link count from the number of copies known in the vldb
Archival OSDs
● For backup and HSM you need archival OSDs
● Archival OSDs may be OSDs with large cheap RAID systems
– If for capacity reasons you need more than one archival disk OSD you may give them different file size ranges to distribute the data equally
> osd list
id name(loc)    total space     flag      prior. own. server   lun size range
 1 local_disk                        wr rd                         (0kb-1mb)
 2 myosdz       2500 gb 40.0 % up           80    0      myosd    25 (1mb-100g)
 3 arch1        2000 gb 10.3 % up arch      80   80      myback    0 (0kb-64mb)
 4 arch2        2000 gb 30.5 % up arch      80   80      myback    1 (64mb-256mb)
 5 arch3        4000 gb 12.7 % up arch      80   80      myback2   0 (256mb-100gb)
>
● If you have an HSM system use it as archival OSD
– Space on tapes is cheaper than on disk and doesn't consume power
– You can get nearly unlimited space on an HSM system
The archiver
● The 'archive1' scripts (one per archival OSD) run on multiple database servers
– They are active only on the OSDDB's sync site and otherwise stand-by
● 'archive1' asks the volservers in a loop whether they have candidates to archive
– With 'vos archcand' it gets a sorted list of files needing an archival copy
– With 'fs fidarchive' it creates the archival copies.
● 'fs fidarchive' adds the archival copy to the OSD metadata of the file
● The OSD metadata of archival copies contain the md5 checksum.
● Running one script per archival OSD guarantees the maximal throughput
– Because all archival OSDs can be filled in parallel
– Because there is only one input stream into the archival OSD.
● The 'archive1' script waits when the archival OSD is filled over its high water mark
– That will give an underlying HSM system the chance to move the data onto tape.
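● A much simplified sketch of such an archiver loop; the real 'archive1' script additionally checks the OSDDB sync site and the high water mark, and the exact argument syntax of 'vos archcand' and 'fs fidarchive' is assumed here:

  #!/bin/sh
  # archive-loop.sh <archival osd> <fileserver> - simplified archiver sketch
  OSD=$1; SERVER=$2
  while true; do
      # ask the volserver for files that still need an archival copy
      for fid in `vos archcand $SERVER`; do
          # create the archival copy; it is added to the file's OSD metadata,
          # including the md5 checksum
          fs fidarchive $fid $OSD
      done
      sleep 600
  done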
Wipeable OSDs
● If you got unlimited archival space from an underlying HSM system you may give HSM functionality to your on-line OSDs
– Set the 'hsm' flag for the OSD
– Define the high water mark when wiping should start (perhaps at 85 %)
– You may also define a minimal wipe size to avoid wiping small files
> osd list
id name(loc)    total space     flag      prior. own. server   lun size range
 1 local_disk                        wr rd                         (0kb-1mb)
 2 myosdz       2500 gb 84.0 % up hsm       80    0      myosd    25 (1mb-100g)
 3 arch1        2000 gb 10.3 % up arch      80   80      myback    0 (0kb-64mb)
 4 arch2        2000 gb 30.5 % up arch      80   80      myback    1 (64mb-256mb)
 5 arch3        4000 gb 32.7 % up arch      80   80      myback2   0 (256mb-100gb)
> osd list -wipeable
id name(loc)    size     state own  usage   limit   wipe >  newest wiped
 2 myosdz       2500 gb  up         84.0 %  85.0 %  64 mb   Jul 31 2008
>
● In this example the minimal wipe size was set to 64mb, so all smaller files remain on-line
● To actually perform automatic wiping run the 'wiper' script on the database servers.
– It also updates the 'newest wiped' field in the OSDDB so you get an idea how long data stay on disk
The wiper
● The 'wiper' script runs on multiple database servers
– It is active only on the OSDDB's sync site and otherwise stand-by
● It checks for OSDs filled beyond their high water mark (typically 85%)
– With 'osd wipecandidates' it gets from the OSD a sorted list of the longest unused objects (based on atime)
– With 'fs fidwipe' it removes the on-line copies of files in the RW-volumes until enough blocks have been freed
● Because of the RO-volumes the space is actually freed only after the volumes have been released by the 'wiper' script (for the same reason you shouldn't have backup volumes!)
● It updates the OSDDB database entry of the OSD with the new "newest wiped" file date.
● After checking all wipeable OSDs it sleeps up to 10 minutes and starts again
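● A much simplified sketch of the wiping step; the real 'wiper' script also checks that it runs on the OSDDB sync site, stops when enough blocks are free, and updates the 'newest wiped' field. The argument syntax of 'osd wipecandidates' and 'fs fidwipe' is assumed:

  #!/bin/sh
  # wipe-once.sh <osd number> - free space on one over-full OSD (sketch)
  OSD=$1
  # sorted list of the longest unused objects of this OSD (based on atime)
  for fid in `osd wipecandidates $OSD`; do
      # remove the on-line copy in the RW volume; the archival copies stay
      fs fidwipe $fid
  done
  # the space only really comes back after the affected volumes are released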
Using Embedded Filesystem
● As said before: if you have clusters with fast shared filesystems use them for AFS
– Mount the cluster filesystem as /vicep<x> on all cluster machines
– Run rxosd on one machine and create the corresponding OSDDB entry
– Make the client aware of it by „afsd -check-osd“ (put it into the startup script)
– Allow direct vicep-access on the client by „fs protocol -enable vicepaccess“
– If the filesystem is Lustre you need a special hack:
● „fs protocol -enable lustrehack“
– If you need highest read performance run „fs protocol -enable fastread”
● It does only a StartAsyncFetch at the beginning
● and an EndAsyncFetch at the end instead of for each chunk.
● Not a good idea if read and write may compete on the same file
● Generally it would also be possible to export fileserver partitions
– „afsd -check-fs“ required on the client
– The many small files in volumes may not be optimal for shared filesystems
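● A sketch of the client-side steps for the OSD case above, e.g. in an rc script on the batch nodes (the Lustre mount point and filesystem names are examples; whether the afsd option is spelled -check-osd or -check-osds differs between the slides above):

  # the cluster filesystem is mounted as a vicep partition on every batch node
  mount -t lustre mgs@o2ib:/fs1 /vicepa
  # start the client, then tell the cache manager which OSDs it can reach directly
  afsd -dynroot -fakestat -check-osds
  fs protocol -enable vicepaccess
  fs protocol -enable lustrehack     # only needed when the filesystem is Lustre
  # optionally, for read-mostly workloads:
  # fs protocol -enable fastread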
osd list
> osd list
id name(loc)    total space     flag      prior. own. server    lun size range
 1 local_disk                        wr rd                          (0kb-1mb)
 4 raid6         5119 gb 69.0 % up arch      64   64     afs15.rz   0 (0kb-8mb)
 5 tape          8924 gb 24.5 % up arch      64   40     styx.rzg   0 (1mb-500gb)
 8 afs16a        4095 gb 81.1 % up hsm       79   80     afs16.rz   0 (1mb-100gb)
 9 mppfs9a      11079 gb 84.9 % up hsm       80   80 mpp mppfs9.    0 (1mb-500gb)
10 afs4a         4095 gb 77.1 % up hsm       79   80     afs4.bc.   0 (1mb-100gb)
11 w7as(hgw)     2721 gb 67.1 % up           80   80     afsw7as    0 (1mb-100gb)
12 afs1a         1869 gb 92.6 % up           70   80 tok afs1.rzg   0 (1mb-100gb)
13 hsmgpfs      31236 gb 14.5 % up arch      64   30     hsmi.rzg  12 (8mb-500gb)
14 afs6a         1228 gb 84.9 % up hsm       70   80 tok afs6.rzg   0 (8mb-100gb)
23 mppfs11gj     5580 gb 70.7 % up hsm       80   80 mpp mppfs11  191 (1mb-100gb)
24 mppfs12a      6143 gb 78.1 % up hsm       80   80 mpp mppfs12    0 (1mb-100gb)
25 mppfs13a      6143 gb 84.0 % up hsm       80   80 mpp mppfs13    0 (1mb-100gb)
32 afs8z         1023 gb  1.3 % up hsm       12   80     afs8.rzg  25 (1mb-100gb)
34 afs17gb       5585 gb 85.0 % up hsm       80   80     afs17.rz 183 (1mb-500gb)
35 afs18ga       5585 gb 85.0 % up hsm       80   80     afs18.rz 182 (1mb-500gb)
36 afs21ge       5585 gb 85.0 % up hsm       80   80     afs21.rz 186 (1mb-500gb)
37 afs22gf       5585 gb 84.9 % up hsm       80   80     afs22.rz 187 (1mb-500gb)
38 afs19gc       5585 gb 83.7 % up hsm       79   80     afs19.rz 184 (1mb-500gb)
39 afs20gd       5585 gb 82.8 % up hsm       79   80     afs20.rz 185 (1mb-500gb)
40 afs23gg       5585 gb 74.9 % up hsm       79   80     afs23.rz 188 (1mb-500gb)
41 mppfs15gi     5580 gb 84.7 % up hsm       80   80 mpp mppfs15  190 (1mb-500gb)
42 mppfs14gh     5580 gb 84.8 % up hsm       80   80 mpp mppfs14  189 (1mb-500gb)
43 mppfs10gk     5580 gb 84.1 % up hsm       80   80 mpp mppfs10  192 (1mb-500gb)
44 sxbl19z       6656 gb  3.5 % up hsm       80   80 aug sxbl19.a  25 (1mb-500gb)
>
osd list -wipeable
● With -wipeable you get only the wipeable osds, but with additional information
– Newest wiped shows you how long unused files stay on-line
> osd list -wipeable
id name(loc)    size     state own  usage   limit   wipe >  newest wiped
 8 afs16a       4095 gb  up         81.1 %  85.0 %  64 mb   Jul 31 2008
 9 mppfs9a     11079 gb  up    mpp  84.9 %  85.0 %  64 mb   Oct 21 2008
10 afs4a        4095 gb  up         77.1 %  85.0 %  64 mb   Aug 14 2008
14 afs6a        1228 gb  up    tok  84.9 %  85.0 %  64 mb   May 19 2008
23 mppfs11gj    5580 gb  up    mpp  70.7 %  85.0 %  64 mb
24 mppfs12a     6143 gb  up    mpp  78.1 %  85.0 %  64 mb
25 mppfs13a     6143 gb  up    mpp  84.0 %  85.0 %  64 mb
32 afs8z        1023 gb  up          1.3 %  80.0 %  64 mb   Jul 31 2008
34 afs17gb      5585 gb  up         85.0 %  85.0 %  64 mb   Jun 30 2009
35 afs18ga      5585 gb  up         85.0 %  85.0 %  64 mb   Apr 24 2009
36 afs21ge      5585 gb  up         85.0 %  85.0 %  64 mb   May 31 2009
37 afs22gf      5585 gb  up         84.9 %  85.0 %  64 mb   Jun 14 2009
38 afs19gc      5585 gb  up         83.7 %  85.0 %  64 mb   Jun 25 2009
39 afs20gd      5585 gb  up         82.8 %  85.0 %  64 mb   Aug 2 2009
40 afs23gg      5585 gb  up         74.9 %  85.0 %  64 mb   Jun 19 2009
41 mppfs15gi    5580 gb  up    mpp  84.7 %  85.0 %  64 mb   Feb 25 2009
42 mppfs14gh    5580 gb  up    mpp  84.8 %  85.0 %  64 mb   Feb 25 2009
43 mppfs10gk    5580 gb  up    mpp  84.1 %  85.0 %  64 mb   Jun 21 2009
44 sxbl19z      6656 gb  up    aug   3.5 %  85.0 %  64 mb
>
Where are my files?
● There are some new or extended „fs“ subcommands:
– „fs ls“ gives an output similar to „ls -l“
– „fs whereis“ shows also OSDs
– „fs [fid]vnode“ shows all vnode fields. For OSD files also the index in the volume's OSD metadata file
– „fs [fid]osd“ shows the OSD metadata
● The command „osd“ has some subcommands to show what is stored in an OSD
– „osd volumes" shows the RW-volume ids present in the OSD
– „osd objects“ shows all objects in the OSD belonging to a specified volume
– “osd examine” shows details about a single object
● The command “vos” got some new subcommands
– “vos traverse” shows file size statistic and the number of objects per OSD
– “vos listobjects” shows object-ids of all objects on a specified OSD
– “vos salvage” does a health check for a volume checking sizes and link counts
„fs ls“
● „fs ls“ gives an output similar to „ls -l“
> fs ls
f rw   hwr      550615 1998-07-01 13:04:21 arla-0.7.2.tar.gz
l rwx  hwr          11 2009-09-18 19:57:51 servers -> tmp/servers
m rwx  root       2048                     private
o rwx  hwr    63128640 1999-03-17 12:23:03 unicosmk.crayt3e.20443
d rwx  hwr        2048 1995-07-26 09:47:14 tmp
w rw   daemon 79829905 1997-01-08 09:22:50 ymp_uni80.mrafs34a.wrk
>
● The character in column 1 tells you what it is:
d == directory
f == normal AFS file in fileserver partition
l == symbolic link
m == mount point
o == object file (on-line)
w == wiped object file (off-line)
„fs whereis“
● „fs whereis“ has been extended to show also the OSDs where the data are stored:
> fs ls ymp_uni80.mrafs34a.wrk
w rw   daemon 79829905 1997-01-08 09:22:50 ymp_uni80.mrafs34a.wrk
> fs whereis ymp_uni80.mrafs34a.wrk
File ymp_uni80.mrafs34a.wrk is on host afs16.rzg.mpg.de
Osds: tape hsmgpfs
>
„fs vnode“
● „fs vnode“ shows all vnode fields
> fs vnode ymp_uni80.mrafs34a.wrk
File 536879945.290.46282
modeBits           = 0644
linkCount          = 1
author             = 2
owner              = 2
group              = 4132
Length             = 79829905 (0x0, 0x4c21b91) 76.130 MB
dataVersion        = 3
unixModifyTime     = 1997-01-08 09:22:50
serverModifyTime   = 2009-05-10 17:13:48
vn_ino_lo          = 0 (0x0)
lastUsageTime      = 2009-02-25 10:46:39
osd file on disk   = 0
osdMetadataIndex   = 75
parent             = 1
>
– Length is also shown in hex notation (vn_length_hi, length).
– This file is in object storage and therefore doesn't have an inode number
– vn_ino_hi is filled with the uniquifier in namei-fileservers. lastUsageTime is stored at this location instead.
„fs osd"
● For an OSD file you can see the OSD metadata with „fs osd":
> fs osd ymp_uni80.mrafs34a.wrk
ymp_uni80.mrafs34a.wrk has 312 bytes of osd metadata, v=3
Archive, dv=3, 2007-07-19 17:39:30, 1 fetches, last: 2009-02-25, 1 segm, flags=0x2
    segment: lng=79829905, offs=0, stripes=1, strsize=0, cop=1, 1 objects
        object: obj=536879945.290.46282.0, osd=5, stripe=0
    metadata: md5=6441333f2acdae8833898bebaf2041d2 as from 2007-07-19 17:39:30
Archive, dv=3, 2009-02-25 10:46:39, 1 segm, flags=0x2
    segment: lng=79829905, offs=0, stripes=1, strsize=0, cop=1, 1 objects
        object: obj=536879945.290.46282.0, osd=13, stripe=0
    metadata: md5=6441333f2acdae8833898bebaf2041d2 as from 2009-02-25 11:52:02
>
● This file is not on-line, only in two archival OSDs (5 and 13)
– The file had been restored once from the 1st archival copy on Feb. 25
– Both archives are from data version 3 and therefore have the same md5 sum
– flags=0x2 means it has been checked that the file was copied to tape
Running a cell with OSDs
● Once the services and scripts for archiving and wiping are started the cell should run automatically. However, it is useful
– to check the logs daily to detect anomalies and errors
– to run a 'vos traverse' over all fileservers once a day to check that all OSD files got copies in archives
– to run once in a while (weekly or monthly) a script that does a 'vos release' followed by a 'vos salvage' for all volumes as a health check (see the sketch below)
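● A minimal sketch of that health check, driven from a hand-maintained volume list (the file names and log path are made up; 'vos salvage' without options only checks, as shown later):

  #!/bin/sh
  # health-check.sh - release and check all RW volumes listed in volumes.txt
  for vol in `cat volumes.txt`; do
      vos release $vol
      vos salvage $vol > /var/log/afs/salvage.$vol 2>&1
  done
  # the word ATTENTION in the salvage output marks volumes with errors
  grep -l ATTENTION /var/log/afs/salvage.* || echo "all volumes clean"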
Analyzing Logfiles
● You find the following lines in a FileLog:
Sat Sep 19 07:55:45 2009 [72] Wrong md5 sum 06bb7f6e2b488bf2b4db6a115589def2 instead of 5789032929755570e2e0c95849d5353f
Sat Sep 19 07:55:45 2009 [72] SetReady of 536986017.82560.85368 after restore to OSD 0 from archival OSD 0 failed with 5
● To find out which file belongs to FID 536986017.82560.85368 just do
> fs fidvnode 536986017.82560.85368
File 536986017.82560.85368
modeBits           = 0664
linkCount          = 1
author             = 15122
owner              = 15122
group              = 2400
Length             = 15669199 (0x0, 0xef17cf) 14.942 MB
dataVersion        = 81
unixModifyTime     = 2008-01-30 05:12:04
serverModifyTime   = 2009-06-02 20:44:20
vn_ino_lo          = 0 (0x0)
lastUsageTime      = 2008-02-04 09:14:56
osd file on disk   = 0
osdMetadataIndex   = 2682
parent             = 1
Path = {Mountpoint}/snapshot_18880.datafile.corrupted
>
Analyzing Logfiles
● To see the osd metadata of this file do
> fs fidosd 536986017.82560.85368
536986017.82560.85368 has 284 bytes of osd metadata, v=3
Beingrestored, 1 segm, flags=0x1
    segment: lng=0, offs=0, stripes=1, strsize=0, cop=1, 1 objects
        object: obj=536986017.82560.85368.0, osd=39, stripe=0
Archive, dv=81, 2008-02-09 08:39:47, 1 segm, flags=0x2
    segment: lng=15669199, offs=0, stripes=1, strsize=0, cop=1, 1 objects
        object: obj=536986017.82560.85368.0, osd=5, stripe=0
    metadata: md5=5789032929755570e2e0c95849d5353f as from 2008-02-09 08:39:47
>
● To see the objects in the OSDs do
> osd ex 39 536986017.82560.85368.0
536986017.82560.85368.0 fid 536986017.82560.85368 tag 0 notstriped lng 15669199 lc 3 Sep 19 07:56
> osd ex 5 536986017.82560.85368.0
536986017.82560.85368.0 fid 536986017.82560.85368 tag 0 notstriped lng 15669199 lc 3 Sep 25 2007
● Here is something wrong on OSD 5 because the file is older than the md5 sum!
"vos traverse" output
File Size Range         Files       %   run %        Data       %   run %
   0 B -   4 KB      69881317   50.43   50.43    86.903 GB   0.02    0.02
   4 KB -   8 KB     11085714    8.00   58.43    60.663 GB   0.02    0.04
   8 KB -  16 KB      9663216    6.97   65.40   103.294 GB   0.03    0.07
  16 KB -  32 KB     10923970    7.88   73.29   228.964 GB   0.07    0.14
  32 KB -  64 KB      8411853    6.07   79.36   384.437 GB   0.11    0.25
  64 KB - 128 KB      6564296    4.74   84.09   571.588 GB   0.16    0.41
 128 KB - 256 KB      4397953    3.17   87.27   778.242 GB   0.22    0.63
 256 KB - 512 KB      4750125    3.43   90.69     1.604 TB   0.47    1.10
 512 KB -   1 MB      3333510    2.41   93.10     2.161 TB   0.63    1.73
   1 MB -   2 MB      2209330    1.59   94.69     3.079 TB   0.90    2.64
   2 MB -   4 MB      2169849    1.57   96.26     5.737 TB   1.68    4.31
   4 MB -   8 MB      1995968    1.44   97.70    10.545 TB   3.08    7.40
   8 MB -  16 MB      1220014    0.88   98.58    13.267 TB   3.88   11.28
  16 MB -  32 MB       644008    0.46   99.05    13.301 TB   3.89   15.17
  32 MB -  64 MB       505218    0.36   99.41    20.875 TB   6.11   21.28
  64 MB - 128 MB       325616    0.23   99.65    27.368 TB   8.01   29.28
 128 MB - 256 MB       174111    0.13   99.77    29.835 TB   8.73   38.01
 256 MB - 512 MB       153027    0.11   99.88    55.941 TB  16.36   54.37
 512 MB -   1 GB       142159    0.10   99.98    95.390 TB  27.90   82.28
   1 GB -   2 GB        16530    0.01  100.00    22.541 TB   6.59   88.87
   2 GB -   4 GB         2821    0.00  100.00     7.363 TB   2.15   91.02
   4 GB -   8 GB         1693    0.00  100.00     9.816 TB   2.87   93.90
   8 GB -  16 GB          720    0.00  100.00     6.897 TB   2.02   95.91
  16 GB -  32 GB          197    0.00  100.00     4.264 TB   1.25   97.16
  32 GB -  64 GB          130    0.00  100.00     5.724 TB   1.67   98.84
  64 GB - 128 GB           27    0.00  100.00     2.277 TB   0.67   99.50
 128 GB - 256 GB            6    0.00  100.00     1.052 TB   0.31   99.81
 256 GB - 512 GB            2    0.00  100.00   667.311 GB   0.19  100.00
Totals:             138573380 Files              341.869 TB
● 93.1 % of all files are < 1 MB and live on local disk (about 5.9 TB of data)
● 99.4 % of all files are < 64 MB and stay permanently on OSDs or local disk (about 74.5 TB)
● Only 0.6 % of all files may be wiped from disk (total data in the cell: 341 TB)
"vos traverse" output 2
Storage usage:
            1 local_disk   134158413 files      65.702 TB
arch. Osd   4 raid6          1202480 objects     3.403 TB
arch. Osd   5 tape           4281152 objects   273.501 TB
      Osd   8 afs16a           18591 objects     3.239 TB
      Osd   9 mppfs9a         423720 objects     8.875 TB
      Osd  10 afs4a            16572 objects     3.057 TB
      Osd  11 w7as             97039 objects     1.743 TB
      Osd  12 afs1a           254659 objects     1.357 TB
arch. Osd  13 hsmgpfs         1700047 objects   236.401 TB
      Osd  14 afs6a             8326 objects  1009.701 GB
      Osd  23 mppfs11gj        61121 objects     3.838 TB
      Osd  24 mppfs12a         79479 objects     4.673 TB
      Osd  25 mppfs13a         75723 objects     5.028 TB
      Osd  34 afs17gb         159327 objects     4.501 TB
      Osd  35 afs18ga         112586 objects     4.594 TB
      Osd  36 afs21ge          99621 objects     4.541 TB
      Osd  37 afs22gf          80227 objects     4.587 TB
      Osd  38 afs19gc         112268 objects     4.588 TB
      Osd  39 afs20gd          87823 objects     4.303 TB
      Osd  40 afs23gg          70848 objects     4.063 TB
      Osd  41 mppfs15gi        75436 objects     4.625 TB
      Osd  42 mppfs14gh        83013 objects     4.618 TB
      Osd  43 mppfs10gk        75339 objects     4.606 TB
      Osd  44 sxbl19z          18489 objects   174.675 GB
Total                       143352299 objects   657.023 TB
„vos traverse" also tells you where your data are:
● 66 TB on fileservers
● 77 TB on disk OSDs
● 510 TB on arch. OSDs (so much because we create two copies)
Where our data are
[Chart: 66 TB replicated, 77 TB osd+tape, 199 TB only tape]
● All data in the local partitions of the fileservers are replicated to other fileservers
● All data in disk OSDs have copies in archival OSDs (tape)
"vos traverse" output 3
Data without a copy: if !replicated:
            1 local_disk   134158411 files      65.702 TB
arch. Osd   5 tape            750001 objects    30.400 TB
      Osd   9 mppfs9a            364 objects   978.990 MB
      Osd  11 w7as                 1 objects    71.377 MB
      Osd  23 mppfs11gj          169 objects   510.487 MB
      Osd  24 mppfs12a           164 objects   444.453 MB
      Osd  25 mppfs13a           164 objects   448.978 MB
      Osd  34 afs17gb              1 objects   488.316 MB
      Osd  35 afs18ga              1 objects     3.790 MB
      Osd  36 afs21ge              1 objects    54.628 MB
      Osd  37 afs22gf              1 objects     3.790 MB
      Osd  38 afs19gc              2 objects     7.580 MB
      Osd  39 afs20gd              1 objects    14.942 MB
      Osd  40 afs23gg              3 objects    69.175 MB
      Osd  41 mppfs15gi          171 objects   446.868 MB
      Osd  42 mppfs14gh            4 objects     7.362 MB
      Osd  43 mppfs10gk          183 objects     1.751 GB
      Osd  44 sxbl19z             59 objects    83.693 MB
Total                       134909701 objects    96.108 TB
„vos traverse" also tells you which data are vulnerable:
● 66 TB on fileservers are replicated and therefore not really vulnerable
● 30 TB on OSD 5 still have to be copied to the other archival OSD 13, but they already have 2 tape copies
● 6 GB of new data haven't yet got their copies on archival OSDs.
● Really vulnerable are only the 6 GB, less than 0.01 % of the total data!
HSM for AFS
[Chart: number of files and amount of data plotted over the logarithm of the file size, from 4 KB to 256 GB]
● The diagram shows number of files and amount of data over the logarithm of the file size.
● All data right of the red line at 64 MB can be wiped from disk, staying only on tape
– That is only 0.59 % of all files, but 79.2 % of the total data volume!
Retiring an OSD
● 1st set the wrprior for that OSD to 0: „osd setosd <OSD> -wrprior 0“ to prevent creation of new objects.
– Wait at least 5 minutes to make sure all fileservers saw your change !
● Use „vos listobj <OSD number> <server>“ for all servers to get a list of all the objects
● Do for all objects
– „fs fidreplace <object> <OSD number>"
● This moves the object to another OSD (see the sketch below)
● Use „osd volumes <OSD>" to get a list of the volumes present there
● Do a „vos release" for all these volumes to ensure the RO-volumes don't have objects on the OSD.
● Run "vos traverse" over all servers to be sure no RW-volume still contains an object on the retired OSD.
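● A sketch of the per-object loop, assuming the object list printed by 'vos listobjects' can be fed back to 'fs fidreplace' one fid per line (the fileserver list file is made up):

  #!/bin/sh
  # retire-osd.sh <osd number> - move all objects away from one OSD (sketch)
  OSD=$1
  osd setosd $OSD -wrprior 0       # prevent creation of new objects on this OSD
  sleep 300                        # give all fileservers time to see the change
  for server in `cat fileservers.txt`; do
      vos listobjects $OSD $server > /tmp/objects.$server
      while read obj; do
          fs fidreplace $obj $OSD  # moves the object to another OSD
      done < /tmp/objects.$server
  done
  # afterwards: 'vos release' the affected volumes and re-check with 'vos traverse'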
Salvaging OSD volumes
● “bos salvage” checks only consistency of the OSD metadata.
● To check also the data run „vos salvage <volume>“ (without options it just checks)
> vos salvage 1108472992.115754.200635
Salvaging volume 1108472992
Object 1108472992.115754.200635.0 has wrong length on 34 (449740800 instead of 512037786)
Object 1108472992.115828.200709.0: linkcount wrong on 5 (1 instead of 3)
Object 1108472992.115830.200711.0: linkcount wrong on 5 (1 instead of 3)
1108472992: 49697 local (10.243 gb) and 8985 in OSDs (2.493 tb), 3 errors ATTENTION
● In this example 1 object is too short. The low link count of 2 other objects is not necessarily an error.
– Probably these are archival copies which have been created after the last replication. Before running „vos salvage -update" we did a new „vos release".
> vos salvage 1108472992.115754.200635 -update
Salvaging volume 1108472992
Object 1108472992.115754.200635.0 has wrong length on 34 (449740800 instead of 512037786), repaired
1108472992: 49697 local (10.243 gb) and 8985 in OSDs (2.493 tb), 1 errors ATTENTION
Damaged Partition
● Broken RAIDs happen about twice a year in our cell !
● If it's a fileserver partition:
– If you have consistently released RO-volumes to other servers
● You may run „vos convertROtoRW" on the RO-volumes
● Don't forget to create new RO-volumes after that!
– Otherwise you will have to restore dumps (takes much longer)
● If it's an OSD partition
– run „vos listobjects" for the lost OSD on all fileservers to get the object-ids
– If it's a non-archival OSD
● run „fs fidwipe" for all these object-ids (the tag-suffix will be ignored)
– will fail for newly created files without archival copy.
● run „fs fidprefetch" to bring the wiped files on-line again (see the sketch below)
– If it's an archival OSD
● run „fs fidreplace <obj-id> <osd-number> -1" to eliminate the archival copy
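● A sketch of the recovery loop for a lost non-archival OSD, using the commands named above (the fileserver list and temporary file are made up):

  #!/bin/sh
  # recover-osd.sh <lost osd number> - recreate on-line copies from the archives (sketch)
  OSD=$1
  for server in `cat fileservers.txt`; do
      vos listobjects $OSD $server
  done > /tmp/lost-objects
  while read obj; do
      fs fidwipe $obj        # drop the lost on-line copy (fails for files without archival copy)
      fs fidprefetch $obj    # asynchronously bring the file back on-line from the archive
  done < /tmp/lost-objects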
AFS hanging or slow 1
If AFS seems to be slow or if files or directories seem to be blocked it makes sense to analyze what's going on on your server and client machines.
● Ask hanging clients with „cmdebug“ or „cmdebug -long“ to find which file is requested
● Use „rxdebug <client> 7001 -nodally“ to see active RPCs
● Use „fs threads -server <fileserver>“ to see load on a fileserver
~: fs th ser afs16
3 active threads found
rpc FsCmd on 1108595910.590 from 130.183.9.5
rpc StoreData64 on 1108524165.480 from 134.107.107.11
rpc FsCmd on 0.15 from 130.183.2.114
~:
– The last line is my „fs thread" command itself (as I know from the IP address)
– The other FsCmd line is a "fs fidarchive" running on the database server
● Use „osd threads <OSD>" to find the corresponding OSD activity
~: osd th hsm
rpc create_archive on 1108595910.590.8415.0 from 130.183.30.16
rpc threads on 0.0.0.0 from 130.183.2.114
~:
AFS hanging or slow 2
I haven't saved the output and the steps of the analysis of such a hang, but here is what happened:
● There was a problem with GPFS and TSM-HSM on an archival OSD which let a write() to GPFS hang forever
● The write() belonged to an RXOSD_create_archive RPC for which a „fs fidarchive“ was waiting.
● „fs fidarchive“ has a WRITE_LOCK on the file's vnode because it must update the osdmetadata when the archiving is successful.
● This blocked many RXAFS_GetStatus and/or RXAFS_InlineBulkStatus RPCs
● Finally all threads of the server were blocked and the server didn't respond anymore
● A server restart only helped for a short time because the archiver script immediately started the next „fs fidarchive“ to that fileserver and the same thing happened again...
Server Statistics
● „fs statistic <server> [-verbose]" gives a statistic about data traffic and RPCs; with "-verbose" you get the transfer rates per 15 minute interval
● „osd statistic <OSD> [-verbose]" gives the same for OSDs
● „vos statistic <server> [-verbose]" gives the same for the volserver, but without RPC statistic.
These commands allow you to see hot spots and give an idea about the data traffic in your cell.
> fs stat afs14
Since 2009-09-17 05:08:22 (211842 seconds == 2 days, 10:50:42 hours)
Total number of bytes received 211137205106  196 gb
Total number of bytes sent      55456907335   51 gb
rpc 65538 StoreData64          8707980
rpc   132 FetchStatus          1563205
rpc   147 GiveUpCallBacks       612517
rpc   135 StoreStatus          1906959
rpc   137 CreateFile            314794
rpc 65537 FetchData64          2063199
rpc   157 ExtendLock             89790
rpc   140 Link                   11107
rpc   136 RemoveFile             52155
rpc   156 SetLock               103698
rpc   158 ReleaseLock           103694
rpc 65560 OsdPolicy               7713
rpc   138 Rename                277955
rpc 65536 InlineBulkStatus      673076
rpc 65542 GetStatistics64         8254
rpc   146 GetStatistics           8254
rpc   130 FetchData              41994
rpc   133 StoreData              41586
rpc   139 Symlink                 1697
rpc   141 MakeDir                 8973
rpc   134 StoreACL                6528
rpc   155 BulkStatus             31685
rpc 65539 GiveUpAllCallBacks        35
rpc   142 RemoveDir                394
rpc 65566 Statistic                  9
Server Statistics
● This example shows volserver traffic during the nightly „vos release".
● This server only keeps RO-copies of volumes belonging to one of the experiments.
● The vos release script is started at 18:00 by CRON.
● All intervals before 18:00 don't show activities.
> vos statistic afs11 -v
/ snip /
17:45-18:00      0 KB/s sent      0 KB/s received
18:00-18:15      0 KB/s sent   3133 KB/s received
18:15-18:30      0 KB/s sent   4756 KB/s received
18:30-18:45      0 KB/s sent   2874 KB/s received
18:45-19:00      0 KB/s sent   1511 KB/s received
19:00-19:15      0 KB/s sent   2173 KB/s received
19:15-19:30      0 KB/s sent   2176 KB/s received
19:30-19:45      0 KB/s sent    809 KB/s received
19:45-20:00      0 KB/s sent   1317 KB/s received
20:00-20:15      0 KB/s sent    371 KB/s received
20:15-20:30      0 KB/s sent   5660 KB/s received
20:30-20:45      0 KB/s sent    671 KB/s received
20:45-21:00      0 KB/s sent   4098 KB/s received
21:00-21:15      0 KB/s sent   8706 KB/s received
21:15-21:30      0 KB/s sent  10452 KB/s received
21:30-21:45      0 KB/s sent   3892 KB/s received
21:45-22:00      0 KB/s sent   5480 KB/s received
22:00-22:15      0 KB/s sent   7339 KB/s received
22:15-22:30      0 KB/s sent   8558 KB/s received
22:30-22:45      0 KB/s sent   5603 KB/s received
22:45-23:00      0 KB/s sent   6961 KB/s received
23:00-23:15      0 KB/s sent   5522 KB/s received
23:15-23:30      0 KB/s sent   2779 KB/s received
23:30-23:45      0 KB/s sent    946 KB/s received
23:45-24:00      0 KB/s sent      0 KB/s received
Since Sep 17 05:00 (716186 seconds == 8 days, 6:56:26 hours)
Total number of bytes received 348728736471  324 gb
Total number of bytes sent                0    0 bytes
~:
Command Reference 'osd' 1
● The „osd“ command has the following subcommands which talk to the OSDDB
– createosd create new osd entry in the OSDDB
– setosd change fields for existing osd entry in OSDDB
– deleteosd mark osd entry as obsolete (will not really delete it)
– addpolicy create a policy entry in the OSDDB
– deletepolicy delete a policy entry in the OSDDB
– addserver create server entry in the OSDDB
– deleteserver delete server entry in the OSDDB
– list list OSDs known in the OSDDB
– policies list policies known in the OSDDB
– servers list servers known in the OSDDB
– osd list all fields of the osd entries in the OSDDB
● Use 'osd help <subcommand>' to get syntax and parameters information.
● All modifying commands can only be used by administrators (in the UserList of the server)
Command Reference 'osd' 2
● The „osd" command has the following subcommands to analyze or manage data in an OSD
– volumes show the IDs of all RW-volumes having data in the OSD
– objects show object IDs (Fids) of all objects of a volume
– examine show details of an object
– incrlinkcount increment link count of an object
– decrlinkcount decrement link count of an object
– read read contents of an object
– write over-write contents of an object
– md5sum let OSD calculate md5sum of an object
● Use 'osd help <subcommand>' to get syntax and parameters information.
● All these commands can be used only by administrators (in UserList of the server)
Command Reference 'osd' 3
● Other subcommands of 'osd':
– fetchqueue show fetch requests on archival HSM OSDs
– statistic show RPC statistic and data flow
– threads show active RPCs
– getvariable show value of a variable
– setvariable set new value to a variable (e.g. LogLevel)
– wipecandidate get sorted list of longest unused objects
– help get help texts
● Use 'osd help <subcommand>' to get syntax and parameters information.
● Some of these commands can be used only by administrators (in UserList of the server)
Command Reference 'vos'
● New subcommands of 'vos':
– archcand get sorted list of files which need archival copies
– statistic show data flow statistic
– listobjects show objects on specified OSD
– getvariable show value of a variable
– setvariable set new value to a variable (e.g. LogLevel)
– salvage check size and linkcounts of all objects in a volume
– traverse show file statistic on server or volume
– split split a volume at a specified directory vnode
● New options for subcommand 'dump'
-osd dump OSD files as normal files (include data)
-metadataonly dump only directories and OSD metadata (for 'dumptool')
Command Reference 'fs'
● New display subcommands of 'fs':
– statistic show RPC statistic of data flow
– threads show active RPCs
– [fid]vnode show vnode fields (and relative path)
– [fid]osd show OSD metadata of a file
– ls shows directory in 'ls -l' style with info about osd files
– getvariable show value of a variable (e.g. LogLevel)
– translate translate namei-path to fid or vice versa
– listlocked shows locked vnodes
– listoffline shows volumes taken offline by volserver and salvager
– fsynctrace shows trace table of fssync interface
Command Reference 'fs' 2
● New 'fs' subcommands acting on files:
– [fid]prefetch bring wiped OSD-file back on-line (asynchronously)
– [fid]archive create archive copy of OSD file
– [fid]wipe wipe on-line copy, keep only archival copies
– [fid]replaceosd move object on specified OSD to another one
– [fid]oldversion restore older version of OSD-file
– createstripedfile preallocate OSD file
● Other new modifying subcommands:
– setpolicy set policy on directory level
– setvariable set new value to a variable (e.g. LogLevel)
Command Reference 'afsio'
● 'afsio' is used to write or read files bypassing the cache manager. 'afsio' reads from stdin (for write) and writes to stdout (for read)
– write write data into an empty or new AFS file
– append append data at the end of an existing AFS file
– read read an AFS file
– help help information
● On some platforms 'afsio' is remarkably faster than I/O through the cache manager
~: afsio help read
afsio read: read a file from AFS
Usage: afsio read -file <AFS-filename> [-cell <cellname>] [-verbose] [-help]
~:
~: afsio help append
afsio append: append to a file in AFS
Usage: afsio append [-file <AFS-filename>] [-cell <cellname>] [-verbose] [-help]
~:
~: afsio help write
afsio write: write a file into AFS
Usage: afsio write [-file <AFS-filename>] [-cell <cellname>] [-verbose] [-md5] [-help]
Where: -md5   calculate md5 checksum
~:
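● A hedged usage example (the path is made up; 'afsio' reads from stdin and writes to stdout as described above):

  # store a tar stream directly into AFS, bypassing the cache manager
  tar cf - ./results | afsio write -file /afs/mycell/data/results.tar -md5
  # read it back
  afsio read -file /afs/mycell/data/results.tar | tar tf -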
Questions or comments?
Thank you
AFS cell „ipp-garching.mpg.de“
40 fileservers with 195 TB disk space
21 non-archival OSDs with 100 TB disk space
2 archival OSDs with HSM system TSM-HSM,
will be replaced by HPSS by next year
27000 volumes
8700 users
340 TB total data
4 TB data written per day
8 TB data read per day