HathiTrust is a Solution
The Foundations of a Disaster Recovery Plan for the Shared Digital Repository
This report serves as recommendations made by Michael J. Shallcross, 2009 Digital Preservation Intern, University of Michigan School of Information
Executive Summary
This report seeks to establish the framework of a Disaster Recovery Plan for the HathiTrust Digital Library. While professional best practices and institutional needs have provided a clear mandate for HathiTrust’s Disaster Recovery Program, common parlance has often obscured two prominent features of such initiatives. First, a ‘Disaster Recovery Plan’ actually comprises a suite of documents which detail a range of issues, from crisis communications and the continuity of administrative activities to the restoration of hardware and data. Second, there is no conclusion to the planning process; it is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance.
The primary goal of the present document is to provide a foundation on which future planning efforts may build. To that end, it examines the strategies by which HathiTrust has anticipated and mitigated the risks posed by ten common scenarios which could precipitate a disaster:
o Hardware failure and data loss
o Network configuration errors
o External attacks
o Format obsolescence
o Core utility or building failure
o Software failure
o Operator error
o Physical security breach
o Media degradation
o Manmade as well as natural disasters
As this list reveals, a disaster within the digital repository refers not merely to data loss, the destruction of equipment, or damage to its environment, but to any event which has the potential to cause an extended service outage. For each scenario, the report discusses possible threats, summarizes the potential severity of related events, and then details solutions HathiTrust has enacted through direct quotations from the HathiTrust Web site and TRAC self-assessment, Service Level Agreements, and literature from service providers and vendors. Attached appendices provide relevant information and include contacts for important HathiTrust resources, an annotated guide to Disaster Recovery Planning references, and an overview of key steps in the Disaster Recovery Planning process.
The concluding section of the report provides recommendations and action items for HathiTrust as it proceeds with its Disaster Recovery Initiative. These are divided into Short (0-6 mos.), Intermediate (6-12 mos.) and Long-Term (12+ mos.) objectives and are arranged in a suggested order of accomplishment.
o Short-term goals include:
   ▪ Describing the nature and extent of HathiTrust’s insurance coverage
   ▪ Testing and validation of current tape backup procedures
   ▪ Improved physical and intellectual control over system hardware
   ▪ Establishment, distribution, and maintenance of phone trees
   ▪ Increased documentation of institutional knowledge
   ▪ Identification of Disaster Recovery measures in place at the Indianapolis site
o Intermediate-term objectives focus on:
   ▪ Creation of a Disaster Recovery Planning Committee
   ▪ Initiation of the data collection and analysis essential to the creation of recovery strategies (This section provides a high level breakdown of various tasks and includes the coordination of activities between the Ann Arbor and Indianapolis sites as well as with service providers and vendors.)
o Long-term action items deal with:
   ▪ Completion and implementation of the suite of Disaster Recovery documents
   ▪ Initiation of staff training and tests of organizational compliance
   ▪ Storage of an additional copy of backup tapes at a remote third location
   ▪ Investigation of an alternate hot site in Ann Arbor in the event a disaster renders the MACC unusable
   ▪ Consideration of a third instance of the repository
   ▪ Avoidance of vendor lock-in if a key supplier should go out of business
This report demonstrates that various risk management strategies, design elements, operating procedures, and support contracts have endowed HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a disaster. The establishment of the Indianapolis mirror site, the performance of nightly tape backups to a remote location, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. Unfortunately, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.
Acknowledgements
The author would like to thank Shannon Zachary for her encouragement and guidance; Cory Snavely and Jeremy York for their generous expenditure of time, energy, and knowledge; and Nancy McGovern and Lance Stuchell for access to their outstanding Disaster Recovery Planning resources. The following individuals have also been invaluable sources of advice, support, and information: John Wilkin, Bob Campe, Cyndi Mesa, Ann Thomas, John Weise, Larry Wentzel, Lara Unger-Syrigos, Bill Hall, Emily Campbell, Sebastien Korner, Jessica Feeman, Phil Farber, Chris Powell, Cameron Hanover, Stephen Hipkiss, Tim Prettyman, Rene Gobeyn, and Krystal Hall. Thanks also to Dr. Elizabeth Yakel, Magia Krause, and Veronica and Cora Fambrough. The work in this report was made possible by an IMLS Grant.
Table of Contents
• Executive Summary
• Acknowledgements
• Introduction
   o Goals for HathiTrust’s Disaster Recovery Program
   o The Mandate for Disaster Recovery Planning in Digital Preservation
   o Disaster Preparedness in the Design and Operation of HathiTrust
   o Essential HathiTrust Business Functions
• HathiTrust’s Disaster Recovery Strategies
   o Basic Requirements for Disaster Recovery
   o Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites
   o Disaster Recovery Strategy #2: Nightly Automated Tape Backups
• Scenario 1: Hardware Failure or Obsolescence and Data Loss
   o Review: Risks Involving Hardware Failure or Obsolescence and Data Loss
   o HathiTrust’s Solutions for Hardware Failure and Data Loss
   o Redundant Components and Single Points of Failure in the HathiTrust Infrastructure
   o Key Features of HathiTrust’s Isilon IQ Clustered Storage
   o Hardware Support and Service
   o Equipment Tracking
   o Hardware Replacement Schedule
   o Timeline for Emergency Replacement of HathiTrust Infrastructure
   o HathiTrust and Insurance Coverage at the University of Michigan
• Scenario 2: Network Configuration Errors
   o Review: Risks Involving Network Configuration Errors
   o HathiTrust’s Solutions for Network Configuration Errors
   o Extent of ITCom Support
   o ITCom Responsibilities
   o ITCom Services in Response to Outages or Degradation Impacting the Network
   o HathiTrust Responsibilities
• Scenario 3: Network Security and External Attacks
   o Review: Risks Involving Network Security and External Attacks
   o HathiTrust’s Solutions for Network Security
• Scenario 4: Format Obsolescence
   o Review: Risks Involving Format Obsolescence
   o HathiTrust’s Solutions for Format Obsolescence
   o Selection of File Formats
   o Format Migration Policies and Activities
• Scenario 5: Core Utility and/or Building Failure
   o Review: Risks Involving Core Utility or Building Failure
   o HathiTrust’s Solutions for Utility or Building Failure
   o General Maintenance and Repairs in University of Michigan Facilities
   o The Michigan Academic Computing Center (MACC)
   o Arbor Lakes Data Facility (ALDF)
• Scenario 6: Software Failure or Obsolescence
   o Review: Risks Involving Software Failure or Obsolescence
   o HathiTrust’s Solutions for Software Issues
• Scenario 7: Operator Error
   o Review: Risks Involving Operator Error
   o HathiTrust’s Solutions for Operator Error
   o Ingest
   o Archival Storage
   o Dissemination
   o Data Management
• Scenario 8: Physical Security Breach
   o Review: Risks Involving a Physical Security Breach
   o HathiTrust’s Solutions for Physical Security
   o Security at the MACC
   o Security at the ALDF
• Scenario 9: Natural or Manmade Disaster
   o Review: Risks Involving a Natural or Manmade Disaster
   o HathiTrust’s Solutions for Natural or Manmade Catastrophic Events
   o Basic Disaster Recovery Strategies
• Scenario 10: Media Failure or Obsolescence
   o Review: Risks Involving Media Failure or Obsolescence
   o HathiTrust’s Solutions for Media Failure
   o Remaining Vulnerabilities
• Conclusions and Action Items
   o Conclusions
   o Short-Term Action Items
   o Intermediate-Term Action Items
   o Long-Term Action Items
• APPENDIX A: Contact Information for Important HathiTrust Resources
• APPENDIX B: HathiTrust Outages from March 2008 through April 2009
• APPENDIX C: Washtenaw County Hazard Ranking List
• APPENDIX D: Annotated Guide to Disaster Recovery Planning References
• APPENDIX E: Overview of the Disaster Recovery Planning Process
• APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)
• APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard Service Agreement (2006)
• APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)
• APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006)
**Appendices F–I are embedded PDF files.**
2009‐08‐24 1
Introduction
In the realm of print libraries, a disaster is a fairly unambiguous event: it is a fire, a broken pipe, an infestation of pests; in short, anything which threatens the continued use and existence of texts or the environment in which they are stored. This basic definition may also be applied to the digital library, in which a disaster refers not merely to the loss of content or corruption of data, the destruction of equipment or damage to its environment, but to any event which has the potential to cause an extended service outage. This last part proves to be the greatest difference between the print and digital worlds because there are a great many threats which can leave data intact but incapacitate the primary functions of a digital library. The daily operation of an institution such as HathiTrust involves the anticipation and resolution of a variety of problems (crashed servers, software bugs, networking errors, etc.) which only rise to the level of a ‘disaster’ when they exceed the capacity of normal operating procedures and/or the maximum allowable outage periods. Disaster Recovery Planning thus prompts us to develop robust strategies to mitigate and limit the effects of common problems and at the same time forces us to think the unthinkable. Nevertheless, confronting worst-case scenarios is a vital activity; the belief that an event will never happen simply because it has never happened is an invitation to the very disaster we seek to avoid. Herein lies a conundrum, in that the creation of detailed plans for every eventuality is nearly impossible and also impractical, since the results of such an endeavor would be needlessly complex as well as expensive. At its basis, then, Disaster Recovery Planning demands an astute assessment of risk so that we may weigh the costs of preparations and solutions against the costs of a potential event.
So where to begin? When the subject of Disaster Recovery Planning arises, common parlance often obscures two prominent features of such initiatives. First, a ‘Disaster Recovery Plan’ actually comprises a suite of documents which detail a variety of related issues, from crisis communications and the continuity of administrative activities to the recovery of hardware and data and the restoration of core functions. Second, there is no conclusion to the planning process or a point at which a plan is ‘done’; there is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance. The essential first step is therefore a thorough knowledge of the organization, its goals, and its mandate for a Disaster Recovery Program so that later efforts can focus on the articulation of policies and the development of solutions. As a preliminary step in this effort, this report looks to establish a basic foundation from which future planning efforts may grow.
• Goals for HathiTrust’s Disaster Recovery Program
While a more formal statement of HathiTrust’s goals and requirements for its Disaster Recovery Program must be elucidated, the repository’s mission statement provides a good indication of its main objective in the formation of a Disaster Recovery Plan. As part of its aim to “contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge,” HathiTrust seeks “to help preserve these important human records by creating reliable and accessible electronic representations.” [1] This statement clearly joins the twin imperatives of preservation and access with an additional requirement: reliability. The development and implementation of a Disaster Recovery Plan will ensure that digital objects will retain their authenticity and integrity over the long term and that partner libraries and designated users may rely on HathiTrust services (or their timely resumption) and content in the face of catastrophic events.

[1] HathiTrust. “Mission & Goals” (2009) retrieved from http://www.hathitrust.org/mission_goals on 8 July 2009.
• The Mandate for Disaster Recovery Planning in Digital Preservation
HathiTrust’s mandate for a comprehensive and proactive Disaster Recovery Plan stems from a number of significant sources, among which we may include its mission and goals. The “Institutional Data Resource Management Policy” (2008) of the University of Michigan’s Standard Practice Guide also provides an impetus for the creation of a Disaster Recovery Program. While not necessarily inclusive of the Michigan Digitization Project materials stored in HathiTrust, this document underscores how important it is that data resources “be safeguarded [and] protected” and “contingency plans […] be developed and implemented.” [2] In its discussion of the latter point, the policy specifies that:

    Disaster Recovery/Business Continuity plans and other methods of responding to an emergency or other occurrences of damage to systems containing institutional data […] will be developed, implemented, and maintained. These contingency plans shall include, but are not limited to, data backup, Disaster Recovery, and emergency mode operations procedures. These plans will also address testing of and revision to disaster recovery/business continuity procedures and a criticality analysis. [3]

While data backup procedures and a host of risk management practices are already an integral part of HathiTrust’s operation, the repository now looks to formalize the other strategies suggested by the “Institutional Data Resource Management Policy.” Beyond the example laid out by this document, HathiTrust’s mandate for Disaster Recovery derives from the professional literature detailing best practices in the field of digital preservation. The Reference Model for an Open Archival Information System identifies Disaster Recovery as an essential component of its “Archival Storage” function and highlights the importance of such plans in achieving the goal of long-term preservation of a digital archive’s holdings. As outlined in the OAIS document, “the Disaster Recovery function provides a mechanism for duplicating the digital contents of the archive collection and storing the duplicate in a physically separate facility.” [4] HathiTrust has successfully met this requirement by performing nightly tape backups and establishing a mirror site at Indiana University in Indianapolis. The Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) is even more explicit in its requirement that repositories document their policies and procedures with “suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s).” [5] Professional best practices as well as internal needs and goals thus provide the mandate which underlies HathiTrust’s development of a formal Disaster Recovery Plan.
• Disaster Preparedness in the Design and Operation of HathiTrust
One of the primary goals of HathiTrust is to provide “transparency in all of its operations, including its work to comply with digital preservation standards and review processes.” [6] Nowhere is this commitment more clear than in its efforts to anticipate and mitigate risks which could threaten the contents and functions of the Shared Digital Repository. As a first step in addressing the disaster preparedness requirement in section C3.4 of the TRAC Criteria and Checklist, [7] this document serves two purposes. First, it provides an overview of the policies, procedures, resources and contracts that enable HathiTrust to address the challenges and threats endemic to the field of digital preservation. Material is therefore cited directly from the HathiTrust Web site (http://www.hathitrust.org), the most recent version of HathiTrust’s review of its compliance with the minimum required elements of the TRAC Criteria and Checklist, [8] and relevant literature provided by key vendors and service providers. [9] Second, this report examines HathiTrust’s current level of disaster preparedness and defines current and forthcoming efforts in its development of a dynamic and proactive Disaster Recovery Program. Per the recommendations of the TRAC Criteria and Checklist, this document records the measures and precautions already in place with regard to “specific types of disasters” that could befall HathiTrust. These events include hardware failure, data loss, network configuration errors, external attacks, core utility failure, format obsolescence, software failure, physical security breach, and manmade as well as natural disasters. While a formal, written plan detailing individual roles and responsibilities in the repository’s response to each of these scenarios is still forthcoming, the evidence gathered in this report reveals that crucial elements of a Disaster Recovery Plan are already in place within HathiTrust. [10]

[2] University of Michigan. “Institutional Data Resource Management Policy” (2008) Standard Practice Guide, retrieved from http://spg.umich.edu/ on 8 July 2009.
[3] Ibid.
[4] Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (2002) p. 4-8.
[5] OCLC and CRL. “Section C3.4” Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
[6] HathiTrust. “Accountability” (2009) retrieved from http://www.hathitrust.org/accountability on 25 June 2009.
• Essential HathiTrust Business Functions
As the development of the Disaster Recovery Plan proceeds, it is important to bear in mind that its goal is not merely the restoration of hardware and data but also the recovery and continuity of essential repository functions. The following list represents core functions that need to be addressed by HathiTrust’s Disaster Recovery Plan and as such should not be considered a comprehensive representation of the repository’s functions. By directing planning efforts toward specific functions (rather than the organization’s activities as a whole), HathiTrust may prioritize and focus its recovery responses and resources to ensure that the most essential functions go back online first. Subsequent discussions of Disaster Recovery strategies and risk management solutions in this report are presented under the assumption that the continuity of these functions is a primary objective. The prioritization of these functions remains to be determined by an appropriate authority. [11]
[7] “Repository has suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s). The repository must have a written plan with some approval process for what happens in specific types of disaster (fire, flood, system compromise, etc.) and for who has responsibility for actions. The level of detail in a disaster plan and the specific risks addressed need to be appropriate to the repository’s location and service expectations. Fire is an almost universal concern, but earthquakes may not require specific planning at all locations. The disaster plan must, however, deal with unspecified situations that would have specific consequences, such as lack of access to a building.” OCLC and CRL. Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
[8] HathiTrust Digital Library Review of Compliance with Trustworthy Repositories Audit & Certification: Criteria and Checklist Minimum Required Elements, revised May 20, 2009. Available at http://hathitrust.org/documents/trac.pdf
[9] Contact information for relevant University of Michigan departments and service providers as well as for external vendors may be found in Appendix A.
[10] A list of resources related to disaster recovery and the planning process may be found in Appendix D (Annotated Guide to Disaster Recovery Planning References).
[11] This list of essential HathiTrust business functions was developed in conjunction with Jeremy York.
o Ingest
   ▪ Ingest digital objects (SIPs) via GRIN, the Google Return Interface (or a modified ingest portal for local content)
   ▪ Validate ingested content with GROOVE, the Google Return Object-Oriented Validation Environment (or a modified version for localized ingest)
o Archival Storage
   ▪ Preserve indefinitely digital objects and metadata (AIPs) in the Shared Digital Repository (includes ensuring the integrity and authenticity of materials). This function addresses the needs of partner libraries as well as individual users.
   ▪ Record changes to and actions on items while they are in the repository
   ▪ Maintain a persistent object address for items within the repository
o Dissemination
   ▪ Provide access to digital objects for users
   ▪ Allow for text searches through a variety of fields
   ▪ Enable large scale full-text searches
   ▪ Permit the creation of public and private content collections
   ▪ Disseminate digital objects (DIPs) to users (via the page-turner access system and data API)
   ▪ Distribute datasets and HathiTrust APIs to developers
   ▪ Research and develop additional applications and resources for HathiTrust
o Administration
   ▪ Provide transparent and up-to-date information to users and the general public via http://www.hathitrust.org/
   ▪ Communicate information and coordinate activities amongst partner libraries and HathiTrust boards and committees
o Data Management
   ▪ Update and manage the Rights and GeoIP databases
   ▪ Build and maintain Collection Builder and Large Scale Search Solr indexes
   ▪ Determine appropriate user access to texts via database queries
   ▪ Sync content with the Indianapolis site and back up content to tape
HathiTrust’s Disaster Recovery Strategies

• Basic Requirements for Disaster Recovery
Roy Tennant has identified three requisite components of a digital Disaster Recovery Plan: (1) the use of an effective data protection system (i.e., RAID), (2) redundant power and environmental systems, and (3) regular backup of information to tape and, ideally, to a remote mirrored site. [12] HathiTrust has incorporated all these elements into its design and operation. Its Isilon IQ storage cluster provides a high degree of data redundancy with its N+3 parity protection; the Michigan Academic Computing Center provides fully redundant power and environmental systems for HathiTrust infrastructure; and nightly tape backups and the replication of data to a fully operational mirror site located at Indiana University in Indianapolis with the same levels of power and environmental conditioning provide multiple copies as well as geographic distribution of content.
o “HathiTrust is intended to provide persistent and high availability storage for deposited files. In order to facilitate this, the initiative’s technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage will be located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate Ann Arbor facility). Each of these storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel. Each separate storage system is also equipped with mechanisms to provide mirrored management and access functionality, and employ 100% data redundancy in an effort to prevent data loss.” [13]
Details on parity protection and the HathiTrust server environment are available below (see Scenario 1 and Scenario 5, respectively).
• Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites
HathiTrust's first line of defense in the event of a disaster is its hot mirror site in Indianapolis. While ingest of material is restricted to the Ann Arbor location, each site possesses two web servers, a MySQL database server, and an Isilon IQ storage cluster (currently composed of 21 ‘nodes,’ servers composed of Central Processing Units as well as storage). During normal operations, this arrangement allows HathiTrust to balance a high volume of web traffic across both sites such that individual user requests may be handled by either site in a transparent manner. Should the tolerances for failure be exceeded at a site (as in a disaster situation), the failover capability built into the HathiTrust architecture enables the remaining site to provide access to the designated community without noticeable service disruptions. As noted in the May 2009 HathiTrust Update, with the full operation of both locations, “We are now ensuring that users do not feel the effects of single-site outages, such as routine maintenance, by taking advantage of site redundancy.” [14] However, because ingest takes place only in Ann Arbor, the loss of key components there would inhibit the repository’s ability to acquire new content.

[12] Tennant, Roy. “Digital Libraries: Coping with Disasters.” Library Journal, 15 November 2001. Retrieved from http://www.libraryjournal.com/article/CA180529.html on 13 July 2009.
[13] HathiTrust. “Technology” retrieved from http://www.hathitrust.org/technology on 15 June 2009.
HathiTrust utilizes Isilon Systems’ SyncIQ Application Software to synchronize data at the Indianapolis site with newly ingested or updated material from the Ann Arbor site. The sync to Indianapolis runs on 24 separate subsets of the data, with a different subset starting every 2 hours, except on Sundays. In other words, subset 1 runs at midnight on Monday, subset 2 runs at 2 a.m., and so on. The maximum time for data to be replicated from Ann Arbor to Indianapolis would therefore be three days plus the run time of the sync process (which tends to take less than three hours). [15]
o “SyncIQ is an asynchronous replication application that fully leverages the unique architecture of Isilon IQ storage to efficiently copy data from a primary cluster to one located at a secondary location.” [16]
o “All nodes [… in both the source and target Isilon IQ clusters] concurrently send and receive data during replication jobs in real time, without impacting users reading and writing to the system.” [17]
o “A robust wizard-driven web-based interface is fully integrated into [… Isilon’s proprietary] OneFS management tool to control all the functionality, including scheduling, policy settings, monitoring and logging of data transferred and bandwidth utilization.” [18]
o “Only files that have changed will be replicated to the target clusters. This will optimize transfer times and minimize bandwidth used.” [19]
o “In the event the secondary system is not available due to a system or network interruption, the replication job will be able to roll back and restart at the last successful copy operation.” [20]
o “Upon a critical failure or loss of network connection, an alert will be sent to all recipients configured to receive critical alerts.” [21]
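The three-day figure given for the sync schedule (before the SyncIQ excerpts above) can be checked with a short simulation. The sketch below is illustrative only: it assumes that subsets are started in numerical order, one every two hours from midnight on Monday through Saturday night, and that nothing runs on Sunday. The function names and the sample Monday date are hypothetical and are not drawn from HathiTrust's actual configuration.

```python
from datetime import datetime, timedelta

SUBSETS = 24        # number of data subsets (from the text)
INTERVAL_HOURS = 2  # a different subset starts every two hours (from the text)


def weekly_start_times(week_start):
    """Yield (subset, start time) pairs for one week, skipping Sunday.

    Assumption: subsets cycle in order 1..24, beginning at midnight Monday.
    """
    slot = 0
    for day in range(6):  # Monday (0) through Saturday (5); no Sunday runs
        for hour in range(0, 24, INTERVAL_HOURS):
            yield slot % SUBSETS + 1, week_start + timedelta(days=day, hours=hour)
            slot += 1


def max_gap_hours(weeks=3):
    """Longest time between two consecutive starts of the same subset."""
    monday = datetime(2009, 7, 13)  # an arbitrary Monday, used only as an anchor
    last_start = {}
    longest = timedelta(0)
    for week in range(weeks):
        for subset, start in weekly_start_times(monday + timedelta(weeks=week)):
            if subset in last_start:
                longest = max(longest, start - last_start[subset])
            last_start[subset] = start
    return longest.total_seconds() / 3600


print(max_gap_hours())  # longest gap, in hours, between runs of any one subset
```

Under these assumptions each subset normally runs every 48 hours, and the Sunday pause stretches the longest gap to 72 hours, which matches the three days (plus run time) cited above.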
• Disaster Recovery Strategy #2: Nightly Automated Tape Backups
HathiTrust’s ability to recover from a disaster is also ensured by the nightly automated tape backups performed by the Tivoli Storage Manager (TSM) client application installed on the ingest servers connected to the HathiTrust storage cluster and managed by Michigan’s ITCS TSM Group. The TSM Backup Service Standard Service Level Agreement [22] outlines the obligations and responsibilities of both the service provider and HathiTrust:
[14] HathiTrust. “Update on May 2009 Activities” (2009) retrieved from http://www.hathitrust.org/updates_may2009 on 2 July 2009.
[15] Snavely, Cory (Head, UM Library IT Core Services). Personal email on 13 July 2009.
[16] “Backup and Recovery With Isilon IQ Clustered Storage,” 2007, p. 11.
[17] Ibid.
[18] Ibid.
[19] Ibid.
[20] Ibid.
[21] Ibid.
[22] Please refer to Appendix F (TSM Backup Service Standard Service Level Agreement).
o “The progressive incremental methodology used by Tivoli Storage Manager only backs up new or changed versions of files, thereby greatly reducing data redundancy, network bandwidth and storage pool consumption as compared to traditional methodologies based on periodic full backups.” [23]
o “ITCS is responsible for all of the central server hardware, tape hardware, networking hardware, and related components. ITCS is also responsible for hardware maintenance as well as software maintenance, administration, and security audits on the central (non-client) TSM servers.” (TSM Backup Service SLA, sec. 4.1)
o “ITCS provides 7x24 on-call monitoring and support, and strives to keep the servers up in production at all times. The target up-time is 99.9% of the time. The TSM hardware design is modular and should allow us to take pieces out of service without affecting customers. Whenever possible, system maintenance will be performed during standard weekend maintenance windows as defined by ITCS.” (sec. 4.2)
o “In an emergency, [email protected] (this will go to the on-call staff’s pager in real time).” (sec. 4.6)
o “ITCS is responsible for physical security. Machine access audits, OS security, and network security on the TSM server end are also the responsibility of ITCS.” (sec. 4.9)
o “The service […] includes data compression, data encryptions, and data replication.” (sec. 1.0)
o “ITCS will maintain at least two TSM sites and will mirror data between the sites to provide redundancy in the event of a disaster. Currently those sites are the Arbor Lakes Data Facility (ALDF) at 4251 Plymouth Rd. and the Michigan Academic Computing Center (MACC) located at 1000 Oakbrook Dr.” (sec. 4.10)
o “Both facilities are secure, climate controlled sites designed and built for high available production services.” [24]
o “In the event of a customer disaster with large-scale (a full server or more) data loss, ITCS will work with the customer to optimize the restore time to best of our ability. We will only be able to devote resources to the extent that other customers are not affected. Restoring large file servers (multiple Terabytes) can take several days. If customers want to minimize this amount of time to restore, we can purchase additional resources for this purpose. Contact us directly, and we’ll work out a scenario with costing information. In the event of a MAJOR campus outage affecting a large number of customers, ITCS management will work with customers to determine how to prioritize customer restores.” (sec. 4.11)
o “Disaster Recovery planning is the responsibility of the customer unit.” (sec. 5.8)
Having established the main Disaster Recovery strategies employed by HathiTrust, we may now proceed to investigate the means by which it anticipates and mitigates the most common threats facing digital repositories.
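For planning purposes, the 99.9% target up-time quoted from sec. 4.2 of the SLA can be translated into an outage budget. The following is a minimal arithmetic sketch; the function name and the choice of a 365-day year are assumptions made for illustration and do not come from the SLA itself.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(target_uptime, period_hours=HOURS_PER_YEAR):
    """Hours of outage permitted in a period by a fractional uptime target."""
    return (1.0 - target_uptime) * period_hours


# A 99.9% target leaves roughly 8.76 hours of downtime per year,
# or about 0.72 hours (some 43 minutes) in a 30-day month.
print(round(allowed_downtime_hours(0.999), 2))
print(round(allowed_downtime_hours(0.999, period_hours=30 * 24), 2))
```

A figure of this kind may be useful when weighing the backup service's availability against the maximum allowable outage periods that a Disaster Recovery Plan would define.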
[23] IBM. “IBM Tivoli Storage Manager: Features and Benefits” (2009) retrieved from http://www-01.ibm.com/software/tivoli/products/storage-mgr/features.html?S_CMP=rnav on 16 June 2009.
[24] Information Technology Central Services at the University of Michigan. “Frequently Asked Questions about the TSM Backup Service” (2009) retrieved from http://www.itcs.umich.edu/tsm/questions.php on 16 June 2009.
Scenario 1: Hardware Failure or Obsolescence and Data Loss
• Review: Risks Involving Hardware Failure or Obsolescence and Data Loss
The following table highlights the various events which pose a risk to the hardware and data of HathiTrust. These threats may stem from flaws or malfunctions in the equipment itself or result from external events that include physical security breaches and natural or manmade disasters. The arrangement of these potential risks reflects the relative severity of their respective consequences.

Severity | Event
High Impact | Loss at a single point of failure
   • An additional failure past tolerances when only one site is operational
   • Service is unavailable and cannot be restored until component is repaired/restored
Moderate Impact | Failure of a component past redundancy tolerance
   • System no longer has redundancy: additional loss or failure of components will result in loss of system. This is a particular problem if one site is already down.
   • Loss of db server (home of Rights db) or of both Web servers at a site will render that location inaccessible
   • Loss of four drives or nodes in either Isilon storage cluster will result in the loss of that instance. The cluster will be offline and unable to handle read or write requests; all traffic would have to be handled by the remaining site.
   • Loss of UM Arbor Lakes site would prevent performance of tape backups
   • Loss of UM MACC site would deprive IU site of data redundancy
   • Loss of ingest servers would prevent new content from entering repository
Low Impact | Failure of redundant system components
   • Includes redundant components within each site as well as general redundancy between the IU and UM sites
      o HT infrastructure has been designed to avoid single points of failure and to ensure data and equipment redundancy
      o Service continues in an uninterrupted and transparent manner

• HathiTrust’s Solutions for Hardware Failure and Data Loss
The threats faced by HathiTrust’s hardware (and associated applications as well as the data stored therein) comprise the failure of redundant features, failures that exceed components’ tolerance for redundancy, and losses at single points of failure. While the failure of redundant components may happen more frequently (e.g., the loss of an individual drive within the Isilon IQ cluster), such losses do not have a large impact on the repository; events which compromise single points of failure will have much greater consequences for the continuity of HathiTrust operations. At the same time, while a component may have redundancy on one level (for example, there are five servers dedicated to ingest), that component simultaneously may be considered at a higher level to be a single point of failure (i.e., because the ingest servers are housed in a single chassis, the entire unit is vulnerable to an event such as a fire). This duality highlights the need for vigilance and foresight in managing the repository’s infrastructure.
Because HathiTrust relies heavily upon hardware to fulfill its mission and deliver services to its designated community of users, the selection of equipment and development of system architecture have aimed at minimizing the dangers posed by single points of failure through the introduction of strategic redundancies. The basic means for avoiding the disastrous effects of hardware failure or data loss have been the establishment of the Indianapolis mirror site and the nightly backup of content to tape. (For more detail, please refer to the preceding section.) While these strategies account for extraordinary events, HathiTrust’s server replacement schedule allows the repository to anticipate the results of normal equipment use and depreciation. Steps to safeguard the long-term functionality of HathiTrust have therefore been complemented by a consideration of best practices for disaster preparedness.
• Redundant Components and Single Points of Failure in the HathiTrust Infrastructure
The following sections provide a general outline of HathiTrust’s redundant components and single points of failure. Given the complexity of the repository’s infrastructure, unknown or unanticipated scenarios may exist; future Disaster Recovery Planning will thus involve a periodic review of key features and vulnerabilities.
o SiteRedundancy:TheestablishmentofthemirrorsiteinIndianaprovidesHathiTrustwithafullyredundantoperation.Becausebothinstancesprovidefullaccesstocontentinadditiontootherrepositoryfunctions,userswillnotexperiencealossordegradationofserviceintheeventthatserviceislostfromonesite.KeyexceptionstoHathiTrust’ssiteredundancyarenotedbelow.
o RedundantComponentsatEachSite:ThefollowingcomponentsprovideeachsitewithatoleranceunderwhichlimitedfailureswillnotdisruptmajorHathiTrustfunctionsanduserservices.
Webservers:eachsitehastwoserverssothatifonefails,theothermaycontinuetohandletraffic.ThesealsohosttheGeoIPdatabase.
IsilonIQclusters:thecurrentconfigurationof21nodesfeaturesN+3parityprotection;thisdataredundancypermitsthesimultaneousfailureof3drivesonseparatenodesorthelossofthreeentirenodeswithoutservicedegradation.
Ingestservers:theAnnArborsitepossessesfiveserverssothatingestmaycontinue(albeitataslowerrate)intheeventofanyfailures.
LargeScaleSearch(LSS)Solrindex:currentlyhousedonthewebservers,butwillsoonbemaintainedonfivenewserversinAnnArbor.
o Single Points of Failure:25 These are components of a system which, if lost, will prevent the entire system from functioning. Even those components with wholly redundant peer devices (such as the web or ingest servers) may be considered single points of failure if they have exceeded their capacity to sustain losses (i.e., if one web server at a site has already been lost).
  - Single Points of Failure at the Component Level: Because only one of these components exists at each HathiTrust site, a loss will result in system failure.
    • MySQL database server: houses the rights database, ingest tracking database, and the Collection Builder Solr index
    • Server network switches
    • Outbound network switches
  - Single Points of Failure at the System Level: While any given component may have various degrees of internal redundancy (such as multiple power supplies or multiple drives) it might still fail as a whole and thus result in the loss of a particular instance of HathiTrust. The following are components located at each site which, while possessed of internal redundancies, are still subject to complete loss (as in the event of a fire) and may thus render a site inoperable.
    • Isilon IQ storage cluster: the entire cluster could be lost in a large-scale event. Additionally, the loss of a fourth drive or node will exceed the cluster’s failure tolerance and result in a service disruption.
    • Web servers: should one fail, the remaining server will be a single point of failure.
    • Blade server chassis: since web, ingest, and database servers are housed in one chassis, the entire unit could potentially fail.
    • LSS index: in the near future, the servers in Ann Arbor will be the sole instance of the Large Scale Search index.
    • Mirlyn database and Mirlyn2 Solr index26: these are currently key components of the UM Library infrastructure; should these be unavailable, access to and use of HathiTrust will be compromised.
25 Content in this section is courtesy of Cory Snavely (personal email from 13 July 2009).
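The distinction drawn above between redundant components and single points of failure can be illustrated with a short Python sketch. It is not part of HathiTrust's tooling: the `Component` class, the `status` function, and the three status labels are hypothetical names chosen for illustration, and the unit counts are simply those quoted in this section (the web server tolerance of one is this sketch's reading of "if one fails, the other may continue").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One class of equipment at a single site (hypothetical model)."""
    name: str
    installed: int   # units present at the site
    tolerance: int   # units that can fail without disrupting service


def status(component: Component, failed: int) -> str:
    """Classify a component given the number of units currently failed."""
    if failed > component.tolerance:
        return "site outage"
    if failed == component.tolerance:
        # No margin left: any further loss takes the site down.
        return "single point of failure"
    return "redundant"


# Counts quoted in the text above; tolerances are this sketch's reading of it.
SITE = [
    Component("web servers", installed=2, tolerance=1),
    Component("Isilon IQ nodes", installed=21, tolerance=3),
    Component("MySQL database server", installed=1, tolerance=0),
]

for component in SITE:
    print(f"{component.name}: {status(component, failed=0)}")
```

Under this model a component with no remaining margin, such as the lone database server or a web server pair that has already lost one member, is reported as a single point of failure, while a fourth Isilon node loss is reported as a site outage.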
• Key Features of HathiTrust’s Isilon IQ Clustered Storage
The Isilon IQ storage cluster stores and provides digital objects for HathiTrust’s partner libraries and members of its designated community. The cluster provides a high degree of inherent redundancy, which gives both HathiTrust sites a considerable degree of tolerance with regard to the failure of various aspects of the storage units. As one example, Isilon’s proprietary OneFS operating system permits the individual storage nodes (the individual servers that are the building blocks of the cluster) to function as ‘coherent peers’ so that any one node ‘knows’ everything contained on the other units in the cluster.
o “Isilon's OneFS operating system […] intelligently stripes data across all nodes in a cluster to create a single, shared pool of storage.”27
o “Because all files are striped across multiple nodes within a cluster, no single node stores 100% of a file; if a node fails, all other nodes in the cluster can deliver 100% of the files within that cluster.”28
o “A distributed clustered architecture by definition is highly available since each node is a coherent peer to the other. If any node or component fails, the data is still accessible through any other node, and there is no single point of failure as the file system state is maintained across the entire cluster.”29
26 Mirlyn is the name of the University of Michigan’s current Online Public Access Catalog, which is supported by the Aleph integrated library system. Mirlyn2 is a beta version of UM’s recently implemented next generation catalog, based on the VuFind platform, which will become the main library catalog on August 3, 2009.
27 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 17 June 2009.
28 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 7. “In computer data storage, data striping is the technique of segmenting logically sequential data, such as a single file, so that segments can be assigned to multiple physical devices. […] if one drive fails and the system crashes, the data can be restored by using the other drives in the array.” (http://en.wikipedia.org/wiki/Data_striping, retrieved on 16 August, 2009).
29 Isilon Systems. “Breaking the Bottleneck: Solving the Storage Challenges of Next Generation Data Centers” (2008) p. 8
HathiTrust’s Isilon IQ clusters ensure a high degree of data redundancy with their N+3 parity protection. N+3 provides triple simultaneous failure protection so that up to three drives on separate Isilon IQ nodes, or three entire nodes, can fail at the same time and all data will still be fully available.
o “Traditional RAID-5 parity protection results in data loss if multiple components fail prior to the completion of a rebuild. FlexProtect, in contrast, automatically distributes all data and error correction information across the entire Isilon cluster and with its robust error correction techniques efficiently and reliably ensures that all data remains intact and fully accessible even in the unlikely event of simultaneous component failures.”30
o “Each file is striped across multiple nodes within a cluster, with [three] parity stripes for each data block.”31
The file system may also perform a Dynamic Sector Repair (DSR) at the time of any file writing. If it encounters a bad disk sector, the file system will use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block elsewhere on the drive. The bad sector will be remapped by the drive so that it is never used again and the write operation will be completed.
The Isilon “restriper” is a meta-process/infrastructure that has four primary phases to help manage and protect data in the event that components of the cluster sustain a partial failure or malfunction. The processes run as background operations and do not require system downtime.32, 33
o FlexProtect repairs data (i.e., in the event of a drive loss) using parity.
  - “Isilon OneFS with FlexProtect can boast the industry leading Mean Time to Data Loss (MTTDL) for petabyte clusters.”34
  - “FlexProtect introduces state-of-the-art functionality, which rebuilds failed disks in a fraction of the time, harnesses free storage space across the entire cluster to further insure against data loss, and proactively monitors and preemptively migrates data off of at-risk components.”35
o AutoBalance “rebalances the data in a cluster according to business rules, in real time, non-disruptively.”36
  - “As soon as the [new or repaired] node is turned on and network cables are connected, AutoBalance immediately begins to migrate content from the existing storage nodes to the newly added node across the cluster interconnect back-end switch, re-balancing all of the content across all nodes in the cluster and maximizing utilization.”37
30 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 30 June 2009.
31 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 7
32 Isilon X-Series Specifications (product brochure)
33 Information on the Isilon restriper comes from a personal email sent by Kip Cranford of Isilon Systems, Inc. on 1 June 2009.
34 Isilon Systems. “Data Protection for Isilon Scale-Out NAS” (2009) p. 4
35 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 15 June 2009.
36 McFarland, Anne. “Isilon Accelerates Delivery of Digital Content” The Clipper Group Navigator (2003).
37 Isilon Systems. “The Clustered Storage Revolution” (2008) p. 13
o Collect cleans up orphaned nodes and data blocks to prevent fragmentation of data.
o MediaScan verifies disk sectors.
  - The function of MediaScan is to scan every block in the file system looking for bad disk sectors. If it encounters a bad sector, it will perform a Dynamic Sector Repair (DSR) and use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block somewhere else on the drive.
  - MediaScan periodically reviews data blocks and disk sectors that may not have been accessed, from a file level, in months or years and thereby helps to keep the drives as healthy as possible.
o As of the OneFS 5.0 release, all file system metadata can be checked by the IntegrityScan restriper phase. This process will allow HathiTrust to completely check file data and metadata via associated checksums.
Other instances of inherent redundancy include non-volatile RAM, a fully journaled file system, and software applications that manage client connections in the event of a node’s failure.
o “OneFS is a fully-journaled file system with large amounts of battery-backed non-volatile random access memory (NVRAM) within each node, which ensures the integrity of the file system in the event of unexpected failures during any write operation.”38
o “The Isilon SmartConnect module […ensures] that when a node failure occurs, all in-flight reads and writes are handed off to another node in the cluster to finish its operation without any user or application interruption. […] If a node is brought down for any reason, including a failure, the virtual IP addresses on the clients will seamlessly failover across all other nodes in the cluster. When the offline node is brought back online, SmartConnect automatically fails back and rebalances the NFS clients across the entire cluster to ensure maximum storage and performance utilization.”39
38 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 9
39 Isilon Systems. “Data Protection for Isilon Scale-Out NAS” (2009) p. 6
• Hardware Support and Service
HathiTrust equipment is covered by support and service agreements with its various vendors (Sun Microsystems, Dell, CDW-G, etc.). A good example of one such agreement is the “Platinum” support provided by Isilon Systems, which includes:
o Extended 24x7x365 Telephone & Online Hardware and Software Support
o 24x7 Proactive Monitoring & Alerts – Email Home (for Hardware and Software)
o Return Parts to Factory for Repair and 4-hour Replacement Parts Delivery
o SupportIQ (Enhanced Serviceability Diagnostics) and System Event Tracking
o On-site Troubleshooting
o Isilon Hardware Installation
o Software Product Documentation, Release Notes, and access to Product Technical Notes
o Remote Diagnosis (Provided User Grants Access)
o Maintenance & Patch Releases
o Minor and Major Upgrade Releases (Includes Performance Improvements, New Features, Serviceability Improvements).40
• Equipment Tracking
LIT Core Services (CS) maintains an inventory of servers on a wiki page accessible to its staff. Details include each server’s name, location, online and retire dates, upgrades, notes on storage, and its primary service. Additional information is provided related to specifications, support contracts, and key contact information. The CS server inventory is currently out of date.
• Hardware Replacement Schedule
o “HathiTrust replaces storage regularly, approximately every 3-4 years or as the usable life of storage equipment dictates” (HT TRAC C1.7)
o “HathiTrust staff upgrade hardware on a regular basis (i.e., every three or four years), and to help detect more rapid growth in demands, the web server and storage infrastructures have their own performance monitoring that indicate overload conditions.” (HT TRAC C1.10)
• Timeline for Emergency Replacement of HathiTrust Infrastructure
Should a serious event require the replacement of part (or all) of the HathiTrust technical infrastructure, the following timeline provides a general estimate of the time required to order, ship, and install new equipment. A cursory review of the time necessary for HathiTrust to recover from a major disaster at the main Ann Arbor or Indianapolis data centers suggests that a large event could idle an instance of the repository for at least a month and a half. In addition to the servers and switches mentioned above, critical components include four 30A power distribution units (PDUs) per rack and four racks per data center as of this writing.
o Submission of Purchase Orders:
  - For orders under $5,000, the M-Pathways application allows the University Library’s business manager to send purchase orders directly to vendors.
  - For orders over $5,000, Procurement Services normally takes one to two business days to approve the purchase, but the process may take up to a week if questions arise or additional purchase information is needed.
o Delivery of Equipment:
  - Products the vendor has in stock and available for immediate shipment take 1-3 days to be delivered.
  - Items that need to be configured (such as servers) usually take 1-2 weeks.
  - Isilon storage will take 3 weeks to be delivered in a worst case scenario.
o Installation:
  - 3 days FTE for Isilon IQ cluster in addition to the time required for other servers, switches, PDUs and rack units.
40 Isilon Systems. “Support Advantage Offerings” (2009) retrieved from http://www.isilon.com/support/?page=plans on 30 June 2009.
o Data Restoration: about 0.5 TB/hour (15 days, as of June 2009)41
  - While HT has about 110 TB of data in its storage, the backup tapes maintained by the TSM Group contain roughly 176 TB of information due to the data encryption used to protect the intellectual rights of the material (as of 06/2009).
  - The length of time required for a ‘bare-metal restoration’ will be influenced by tape mounts, network speed, restoring to the NFS shares, decryption, et cetera.
  - If the library/HT were to purchase an additional tape drive (at roughly $20,000), the process could be sped up, perhaps to about 1 TB/hour.
  - In the event of a large-scale disaster in which multiple campus units require extensive data restoration, the TSM Backup Service SLA states that “ITCS management will work with customers to determine how to prioritize customer restores.” (sec. 4.11) This determination will reflect the University of Michigan’s organizational priorities42:
    • Priority 1: Health and safety of faculty, staff, students, hospital patients, contractors, renters, and any other people on University premises.
    • Priority 2: Delivery of health care and hospital patient services
    • Priority 3: Continuation and maintenance of research specimens, animals, biomedical specimens, research archives.
    • Priority 4: Delivery of teaching/learning processes and services
    • Priority 5: Security and preservation of University facilities/equipment.
    • Priority 6: Maintenance of community/University partnerships.
o Fractional restores would, for the most part, run at comparable speeds unless there was a need to restore a large number of random files, in which case there would be a decrease in speed due to tape seek and mount times.
o Delays in recovery could be increased dramatically if the MACC data center or its infrastructure has sustained damage and needs repair.
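The figures in the timeline above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: it uses the quoted tape volume (176 TB) and restore rates (0.5 and 1 TB/hour), takes the worst-case duration quoted for each step, and assumes that the steps run strictly one after another in calendar days, which the report does not state.

```python
HOURS_PER_DAY = 24


def restore_days(tape_tb: float, rate_tb_per_hour: float) -> float:
    """Days of continuous restoring needed to read tape_tb terabytes from tape."""
    return tape_tb / rate_tb_per_hour / HOURS_PER_DAY


TAPE_TB = 176  # volume held on backup tape as of June 2009

current_rate = restore_days(TAPE_TB, 0.5)  # rate quoted for the existing setup
faster_rate = restore_days(TAPE_TB, 1.0)   # rate hoped for with an additional tape drive

# Worst-case figures from the timeline, in calendar days. Treating the
# steps as strictly sequential is an assumption of this sketch.
worst_case_days = {
    "purchase order approval": 7,   # "up to a week"
    "Isilon delivery": 21,          # "3 weeks ... in a worst case scenario"
    "Isilon installation": 3,       # "3 days FTE"
    "data restoration": current_rate,
}
total_days = sum(worst_case_days.values())

print(f"Restore at 0.5 TB/hour: {current_rate:.1f} days")
print(f"Restore at 1 TB/hour:   {faster_rate:.1f} days")
print(f"Worst-case total:       {total_days:.1f} days")
```

At 0.5 TB/hour the restore takes about 14.7 days, matching the 15 days quoted, and the sequential worst case comes to just under 46 days, which is consistent with the estimate of at least a month and a half.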
• HathiTrust and Insurance Coverage at the University of Michigan
The Office of Financial Operations reviews and adds financial assets greater than $5,000 to the asset management system of the University of Michigan. The Property Control Office is then responsible for tagging financial assets with unique University of Michigan identifiers and tracking them. Risk Management Services administers the University’s property insurance and will provide the reimbursement of replacement costs for items self-insured by Michigan. As of July 2009, the nature and extent of the University of Michigan’s insurance coverage for HathiTrust hardware remained under review. The main contact with Risk Management Services in this matter has been Cyndi Mesa, Head of UM Library Finance.
41 Hanover, Cameron (ITCS TSM Group Storage Engineer). Personal email on 23 June 2009.
42 University of Michigan Administrative Information Services. “Emergency Management, Business Continuity, and Disaster Recovery Planning” (2007) retrieved from http://www.mais.umich.edu/projects/drbc_methodology.html on 6 July 2009.
Scenario 2: Network Configuration Errors
• Review: Risks Involving Network Configuration Errors
The following table summarizes the risks facing HathiTrust as the result of network configuration errors. Consideration is given to network connections within UM data centers as well as at UM’s Hatcher Graduate Library (site of key administrative and development activities). The arrangement of these events reflects the relative severity of their respective consequences.
Severity / Event
High Impact:
• Loss of server network switch or outbound network switch
• Loss of access to UMnet Backbone
Moderate Impact:
• Extended loss of power at Hatcher Library could lead to loss of local servers and disruption of administrative and operational activities.
Low Impact:
• Loss of power that threatens ability to connect to Local Area Network (LAN)/Backbone
  o The library remains (for now) a priority recipient of electricity from the UM power plant
  o Campus data centers have UPSs and redundant backup power
• Failure of local/server-side connections
  o Should problems arise with connections to individual nodes, the clustered architecture of the Isilon system will allow read/write requests to be handled by alternate nodes.
  o If connections fail at one HT site, traffic can be handled by remaining site.
• HathiTrust’s Solutions for Network Configuration Errors
HathiTrust’s continued access to the Internet via the UMnet Backbone is essential for its continued provision of service. The repository receives network infrastructure maintenance through UM’s ITCS/ITCom; with its robust disaster planning in addition to the lessons learned from the Midwest blackout of 2003, ITCom guarantees continued network access in all but the most catastrophic scenarios. In the event of a widespread power outage, HathiTrust would be able to maintain access to the UMnet Backbone since data centers are equipped with redundant power supplies and the Hatcher Graduate Library is currently categorized as a priority recipient of power from the university. ITCS also has 17 generators which can be used to maintain power to network switches in the event of a blackout. The responsibilities and obligations of both parties are outlined in the Customer Network Infrastructure Maintenance Service Agreement.43
• Extent of ITCom Support
o “ITCom agrees to provide the Unit Network Infrastructure Maintenance to include data switches, routers, access points, hubs, uninterruptible power supplies (UPS’s), firewalls, and other identified and agreed upon components.” (ITCS sec. 1.0)
43 Please refer to Appendix G (ITCS/ITCom Customer Network Infrastructure Maintenance Service Agreement).
• ITCom Responsibilities
o “Provide and maintain the necessary materials and electronic components to operate the Unit Network Infrastructure.” (sec. 5.2)
o “Provide configuration and Network Infrastructure Administration support necessary to repair and maintain the Unit Network Infrastructure hardware and software covered by this agreement.” (sec. 5.3)
o “Monitor 24 hours/day and 365 days/year (24x365), supported protocols to the backbone interface of the Units network up to and including the extension to the first hub or switch.” (sec. 5.6)
o “Monitor 24 hours/day and 365 days/year (24x365), network interfaces on uninterruptible power supplies (UPS) that support the Unit network switches. Provide notification in the event that a UPS is activated, (input power is lost or degraded and system switches to battery power), deactivated, (input power is restored), or unreachable. Provide notification to the Unit Network Administrator when batteries degrade to the point of needing replacement.” (sec. 5.7)
o “Provide maintenance on the station cabling as installed by ITCom, or an approved U-M vendor which met ITCom installation specifications.” (sec. 5.8)
o “Provide Preventative Maintenance (clean & vacuum) on each Customer Unit switch covered in this agreement yearly.” (sec. 5.9)
• ITCom Services in Response to Outages or Degradation Impacting the Network
o “A response within 30 minutes of the ITCom NOC notification or the Unit’s call, to provide information to the Unit on specific steps that have been/will be taken to resolve the problem.” (sec. 7.2.1)
o “An on-site visit, if necessary, within two (2) hours of the response (i.e., the maximum on-site response time will be two and a half (2 1/2) hours). An update will be provided to the Unit Network Administrator if onsite and a best guess ETR will be provided based on available facts. ITCom will continue to provide the Unit with updates every two hours during an outage.” (sec. 7.2.1)
o “If an outage is identified within the agreement service hours ITCom will resolve the outage even if the repair time extends beyond the service agreement hours.” (sec. 7.2.1) (Repairs outside of the agreement hours result in additional labor expenses.)
o Conduct monitoring via SNMP polling at one minute intervals. (sec. 7.2.1)
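The response commitments quoted from sec. 7.2.1 translate into concrete deadlines once the time of notification is known. The following Python sketch shows one way to compute them; the function name `outage_deadlines` and the dictionary keys are hypothetical, and the choice to start the two-hour update clock at the response deadline is an assumption, since the agreement as quoted does not say when that clock starts.

```python
from datetime import datetime, timedelta

RESPONSE_WINDOW = timedelta(minutes=30)  # first response after notification
ONSITE_WINDOW = timedelta(hours=2)       # on-site visit, counted from the response
UPDATE_INTERVAL = timedelta(hours=2)     # status updates during an outage


def outage_deadlines(notified_at: datetime, updates: int = 3) -> dict:
    """Latest times permitted by the quoted terms for each ITCom action."""
    respond_by = notified_at + RESPONSE_WINDOW
    onsite_by = respond_by + ONSITE_WINDOW
    # Assumption: the two-hour update clock starts at the response deadline.
    update_times = [respond_by + UPDATE_INTERVAL * n for n in range(1, updates + 1)]
    return {"respond_by": respond_by, "onsite_by": onsite_by, "updates_by": update_times}


deadlines = outage_deadlines(datetime(2009, 6, 1, 9, 0))
print(deadlines["respond_by"], deadlines["onsite_by"])
```

For a notification at 9:00, the response is due by 9:30 and the on-site visit by 11:30, which reproduces the maximum on-site response time of two and a half hours stated in the agreement.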
• HathiTrust Responsibilities
ITCom’s responsibilities end at the first network switch; from there to its servers, HathiTrust is responsible for maintaining network connectivity and security. The repository uses Internet2 for communication and synchronization between the Ann Arbor and Indianapolis sites. Each Isilon node has dual 10 Gb InfiniBand ports for internal (i.e., intra-cluster) communication and dual 1 Gb Ethernet for external communication.
Scenario 3: Network Security and External Attacks
• Review: Risks Involving Network Security and External Attacks
The following table gives a general overview of the basic threat an external attack or network security breach poses to HathiTrust; entries are arranged by severity. The list, however, is not exhaustive and no attempt has been made to publicize potential vulnerabilities.
Severity / Events
High Impact:
• Unauthorized access to HathiTrust content leads to the infringement of copyrights.
• Loss of data or functionality for an extended period of time as a result of malicious activity.
Moderate Impact:
• HathiTrust services are temporarily unavailable as a result of malicious activity.
Low Impact:
• The delivery of HathiTrust services slows as the result of malicious activity.
• A security weakness exists within the system but remains unexploited.
• HathiTrust’s Solutions for Network Security
Malicious activity against HathiTrust could involve unauthorized access to a system or data, denial of service, or unauthorized changes to the system, software, or data. As an academic entity, the repository is seen as less of a target for such actions than commercial or governmental targets; despite this perceived lower risk, HathiTrust has not been lulled into a false sense of security. The repository takes seriously the potential for violations of its network and operating system security and therefore has instituted a program of periodic software updates in addition to the maintenance of an ITCom-supported firewall, authentication-required access, and other measures (such as throttling software to deter denial of service attacks). Because content is currently accepted from trusted sources (namely, Google and legacy digital collections from HathiTrust partners) the GROOVE process does not include a virus detection phase. As digital objects are ingested from a greater number of sources, additional security measures should be considered.
o “HathiTrust staff apply security updates to the operating system and to networking devices as soon as they become available in order to minimize system vulnerability. As with new software releases, security updates are tested in a development environment before being released to production. Software packages that present a lower security risk and that have a greater potential to affect application behavior (web servers, language interpreters, etc.) are generally installed, configured and tested manually to allow for greater control in managing updates. Software updates are not applied automatically; moreover, updates that present a potential for having an impact on system behavior are applied and tested first in the development environment. If no impacts are seen, HathiTrust staff apply these updates in production after a testing period of at least one week.” (HT TRAC C1.10)
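The update policy quoted in this section (HT TRAC C1.10) amounts to a simple release rule: an update with the potential to affect system behavior reaches production only after at least one week of testing in development with no impacts observed. A minimal Python sketch of that rule follows; the function and parameter names are hypothetical and do not describe any actual HathiTrust release tooling.

```python
from datetime import date, timedelta

MIN_TEST_PERIOD = timedelta(days=7)  # "a testing period of at least one week"


def ready_for_production(applied_in_dev: date, today: date, impacts_seen: bool) -> bool:
    """True when an update has been tested in development long enough with no impacts."""
    if impacts_seen:
        return False
    return today - applied_in_dev >= MIN_TEST_PERIOD


print(ready_for_production(date(2009, 6, 1), date(2009, 6, 5), impacts_seen=False))
print(ready_for_production(date(2009, 6, 1), date(2009, 6, 8), impacts_seen=False))
```

The first call falls inside the one-week window and the second falls on its last day, so only the second update would be eligible for release under this reading of the policy.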
Scenario 4: Format Obsolescence
• Review: Risks Involving Format Obsolescence
The following table outlines the threats posed by format obsolescence and arranges them according to their potential severity.
Severity / Events
High Impact:
• Applications and hardware are no longer able to read or display digital objects.
• Errors in translating and reading files are not understood or acknowledged by repository users.
Moderate Impact:
• Problems with the translation of file formats result in DIPs that do not faithfully reflect the original digital objects.
Low Impact:
• Formats and associated applications change but retain compatibility with older versions of the file formats.
• HathiTrust’s Solutions for Format Obsolescence
An awareness and acknowledgement of the dangers of format obsolescence have led HathiTrust to implement proactive policies and procedures to ensure long-term access to the repository’s content. The repository only accepts specific formats that meet rigorous specifications and, through the prior experience of University of Michigan personnel, has developed protocols for the successful migration of content from one format to another. In addressing the threat of format obsolescence, the preservation of the integrity and authenticity of deposited content has been an overarching concern.
• Selection of File Formats
o “HathiTrust is committed to preserving the intellectual content and in many cases the exact appearance and layout of materials digitized for deposit. HathiTrust stores and preserves metadata detailing the sequence of files for the digital object. HathiTrust has extensive specifications on file formats, preservation metadata, and quality control methods, included in the University of Michigan digitization specifications, dated May 1, 2007.”44 (HT TRAC B1.1)
o “HathiTrust currently ingests only documented acceptable preservation formats, including TIFF ITU G4 files stored at 600 dpi, JPEG or JPEG2000 files stored at several resolutions ranging from 200 dpi to 400 dpi, and XML files with an accompanying DTD (typically METS). HathiTrust supports these formats because of their broad acceptance as preservation formats and because the formats are documented, open and standards-based, giving HathiTrust an effective means to migrate its contents to successive preservation formats over time, as necessary. The Repository Administrators have undertaken such transformations in the past; moreover, HathiTrust offers end-user services that routinely transform digital objects stored in HathiTrust to “presentation” formats using many of the widely available software tools associated with HathiTrust’s preservation formats. HathiTrust gives attention to data integrity (e.g., through checksum validation) as part of format choice and migration.”45
o “Each format conforms to a well-documented and registered standard (e.g., ITU TIFF and JPEG2000) and, where possible, is also non-proprietary (e.g., XML).” (HT TRAC B4.2)
44 Specifications are available at http://www.lib.umich.edu/lit/dlps/dcs/UMichDigitizationSpecifications20070501.pdf
• Format Migration Policies and Activities
o “HathiTrust is committed to migrating the formats of materials created according to [its] specifications as technology, standards, and best practices in the digital library community change.” (HT TRAC B1.1)
o “HathiTrust staff members conduct migrations from one storage medium to another using tools that validate checksums internally. (Digital objects are stored both online and on tape, and the online storage system conducts regular scans to detect and correct data integrity problems.) A total file count is done following a large data transfer, and regularly scheduled integrity checks follow.” (HT TRAC C1.7)
o “[HathiTrust] has migrated large SGML-encoded collections to XML, and Latin-1 character encodings to UTF-8 Unicode. Our success in migrating from older formats to newer formats demonstrates our commitment to our collections and our ability to keep materials in our repository viable. All migrations are documented in change logs.” (HT TRAC B4.2)
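The two checks named in the quotation from HT TRAC C1.7, a total file count after a large transfer and checksum validation, can be sketched in Python as follows. This is a generic illustration rather than the tool HathiTrust staff use: the function names are hypothetical, and SHA-256 is an assumed algorithm, since the source does not say which checksum is employed.

```python
import hashlib
from pathlib import Path


def checksum(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so that large page images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root onto its checksum."""
    return {
        str(path.relative_to(root)): checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_transfer(source: Path, destination: Path) -> list[str]:
    """Return a list of problems; an empty list means the copy checks out."""
    src, dst = manifest(source), manifest(destination)
    problems = []
    if len(src) != len(dst):
        problems.append(f"file count differs: {len(src)} source, {len(dst)} destination")
    for name, digest in src.items():
        if name not in dst:
            problems.append(f"missing: {name}")
        elif dst[name] != digest:
            problems.append(f"checksum mismatch: {name}")
    return problems
```

Calling `verify_transfer` on the old and new storage locations after a migration would return an empty list when every file arrived intact; any entry in the list identifies a file that needs to be copied again.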
45 HathiTrust. “Preservation” (2009) retrieved from http://www.hathitrust.org/preservation on 16 June 2009.
Scenario 5: Core Utility and/or Building Failure
• Review: Risks Involving Core Utility or Building Failure
The following table summarizes the dangers a utility or building failure poses to HathiTrust and ranks events by their potential severity.
Severity / Events
High Impact:
• Extensive structural damage renders the MACC (or key elements of its infrastructure) unusable and necessitates the establishment of a hot site to recover and continue operations.
• Additional failure past tolerance in backup cooling or power infrastructure
Moderate Impact:
• Failure of backup power past redundancy tolerance (failure of 2 generators)
  o Data center coordinator may initiate load shed and shut down half of the MACC (but library racks will remain operational)
• Structural damage renders facility temporarily unsafe and/or unusable.
Low Impact:
• Loss of power
• Loss of environmental control units within redundancy
• HathiTrust’s Solutions for Utility or Building Failure
The continued delivery of HathiTrust’s services depends upon the maintenance of power, environmental control, and security in its server environment at the Michigan Academic Computing Center (MACC) and other locations that host components of the repository. In this respect, HathiTrust is heavily reliant upon the infrastructure of the MACC as well as that of the Arbor Lakes Data Facility, home to one instance of the TSM Group’s backup tape library. Both locations provide closely monitored and highly redundant environments that help ensure that HathiTrust’s infrastructure remains secure and operable. At the same time, administrative and data management functions critical to the development and maintenance of the repository take place in the University of Michigan’s Hatcher Graduate Library. The service and cooperation of Michigan’s Plant Operations Division are therefore critical for the continued access to and use of this structure in the operation of HathiTrust.
• General Maintenance and Repairs in University of Michigan Facilities
Facilities and maintenance issues on the University of Michigan campus are reported to the Plant Operations Division, the Department of Public Safety (DPS), and Occupational Safety and Environmental Health (OSEH) in addition to the impacted facility’s manager. Repair work is coordinated by the University Library facilities manager in conjunction with administrators and workers from Plant Operations.
• The Michigan Academic Computing Center (MACC)
The MACC hosts many of the key components of Michigan’s University Library system as well as the technical infrastructure of HathiTrust. The University of Michigan does not own the building in which the data center is located but instead operates the MACC in conjunction with the Michigan Information Technology Center (MITC) Foundation and other partners. The MACC Server Hosting Service Level Agreement46 lists the responsibilities of the data center as well as the repository; of particular significance are the MACC’s agreements to:
o “Provide a controlled physical environment to support servers [with] room average temperature of between 65 and 75 degrees and 35-50% relative humidity [and] monitored environmentals (temperature, humidity, smoke, water, electrical.” (sec. 4.1)
o “Provide adequate, conditioned, 60-cycle electrical service with adequate backup electrical capacity to support circuits, service, and outlets [and also to] provide Uninterruptible Power Supply (UPS) and generator backup” (sec. 4.2)
o “Provide 7x24 telephone contact for emergencies and for emergency access to facility.” (sec. 4.4)
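The environmental ranges quoted from sec. 4.1 of the hosting agreement lend themselves to a simple threshold check. The Python sketch below is illustrative only and does not represent the MACC's monitoring system: the function name is hypothetical, and the temperature unit is assumed to be Fahrenheit because the agreement as quoted gives only "degrees".

```python
TEMPERATURE_RANGE = (65.0, 75.0)  # room average, degrees (unit assumed to be Fahrenheit)
HUMIDITY_RANGE = (35.0, 50.0)     # percent relative humidity


def environment_alerts(temperature: float, humidity: float) -> list[str]:
    """Return one message per reading that falls outside the agreed range."""
    alerts = []
    low, high = TEMPERATURE_RANGE
    if not low <= temperature <= high:
        alerts.append(f"temperature {temperature} outside {low}-{high}")
    low, high = HUMIDITY_RANGE
    if not low <= humidity <= high:
        alerts.append(f"humidity {humidity}% outside {low}-{high}%")
    return alerts


print(environment_alerts(temperature=70.0, humidity=40.0))
print(environment_alerts(temperature=80.0, humidity=55.0))
```

A reading inside both ranges produces no alerts, while a reading outside either range produces one message per violated limit.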
In addition to features such as redundant electrical and environmental systems, the MACC maintains a full-time coordinator and staff who provide 24x7 responses to failures or malfunctions in the server environment. Alerts prompted by issues with the environmental systems or power are sent to the University of Michigan Network Operations Center (NOC) during non-business hours.
o Overview:
  - “The MACC's redundancy is designed to ensure the safety and security of the data housed within. It consists of:
    • A dual power path from the property line to the power distribution units
    • Diesel powered generators for electrical backup
    • Flywheels (not batteries) to provide power while the generators come on
    • State-of-the-art generators and flywheels for backup power
    • Three extra computer room air conditioners
    • Two extra dry coolers
    • Glycol loop for cooling with two parallel pathways with crossover valves at regular intervals.”47
  - “A state-of-the-art monitoring system keeps track of 1,700 different parameters and automatically notifies staff of any irregularity.”48
o Environmental Controls and Monitoring
  - “The MACC has 18 Computer Room Air Conditioning units (CRACs). At any given time, only 15 are necessary to maintain the required temperature and humidity. [Thus, the computer room has N5+1 redundancy in its cooling ability.] It also is equipped with a number of portable coolers to address specific cooling needs. The heat from the room is transferred to an under-floor glycol loop that releases the heat to the outdoors.”49
46 Please refer to Appendix H (MACC Server Hosting Service Level Agreement).
47 Michigan Academic Computing Center. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 16 June 2009.
48 --. “Michigan Academic Computing Center” (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
49 --. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 16 June 2009.
  - “The layout of the facility allows the front on the computer racks to be facing the cold aisles. These aisles have perforated floor tiles through which the cool air is pumped directly to the computers located there. Heat is discharged from the backs of the computers, which creates the hot aisles. This alternating arrangement facilitates the cooling process, as the hot air produced by the computers can be siphoned off before it mingles too much with the cooler air of the facility.”50
  - “Two separate smoke detection and fire alarm systems protect the MACC. One is for the building; the other is for the MACC itself. The two systems work together to activate alarm systems and notify the fire department and key personnel. In the event of an actual fire, the fire-suppression system pipes will not fill with water unless there is a pressure drop caused by melting of one or more of the sprinkler heads.”51
o Backup Power
  - “Three generators, each roughly the size of a rail car, provide backup power. Only two of the three are required to run the facility in the event of a power outage.”52
  - “The MACC uses environmentally responsible flywheels instead of batteries for power backup while the generators come online. The combination of generators and flywheels provides the facility with a fully redundant uninterruptible power system (UPS).”53
  - The MACC has a contract with the UM Plant Operations Division for the delivery of diesel fuel for its generators in the event of an extended blackout.54
  - In the event that a backup generator is disabled, the MACC coordinator will initiate load shed, in which one half of the MACC will be shut down so that the other half (and requisite environmental systems) may continue to operate. The HathiTrust and UM Library racks are among those which will retain power should this response prove necessary.55
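The redundancy figures quoted for the MACC (18 air conditioning units of which 15 are needed, and three generators of which two are needed) can be expressed as spare capacity. The Python sketch below only reports whether the surviving units can still carry the full load; it is a hypothetical illustration and takes no position on exactly when the coordinator would initiate load shed.

```python
def spare_units(installed: int, required: int) -> int:
    """Units that can fail before capacity drops below what is required."""
    if required > installed:
        raise ValueError("fewer units installed than required")
    return installed - required


def can_carry_full_load(installed: int, required: int, failed: int) -> bool:
    """True while the surviving units still meet the requirement."""
    return installed - failed >= required


# Counts quoted above from the MACC's published figures.
CRACS = {"installed": 18, "required": 15}
GENERATORS = {"installed": 3, "required": 2}

print("Spare CRACs:", spare_units(**CRACS))
print("Spare generators:", spare_units(**GENERATORS))
print("Full load with two generators down:", can_carry_full_load(**GENERATORS, failed=2))
```

With these counts the facility has three spare air conditioning units and one spare generator, and it can no longer carry its full load once two generators are down.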
• Arbor Lakes Data Facility (ALDF)
The ALDF houses the TSM Group’s infrastructure and one instance of the backup tape library that forms an integral part of HathiTrust’s Disaster Recovery strategy. As the home of critical components of the UMnet Backbone, the ALDF provides a safe and secure location for one set of the repository’s backup tapes. In the interest of security, this report will omit further information on the exact nature of the facility’s power and environmental systems.
50 Ibid.
51 Ibid.
52 --. “Michigan Academic Computing Center” (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
53 Ibid.
54 Gobeyn, Rene (MACC Data Center Coordinator). Personal interview on 23 June 2009.
55 Ibid.
Scenario 6: Software Failure or Obsolescence
• Review: Risks Involving Software Failure or Obsolescence
The following table details various risks inherent to software failure or obsolescence and ranks them according to their severity.
• HathiTrust’s Solutions for Software Issues
The development and use of HathiTrust’s tools and resources depend on highly functional software applications. Repository policies have therefore been crafted to ensure that these applications are thoroughly tested and regularly updated to minimize the threat of service outages as a result of software failure or obsolescence. HathiTrust furthermore employs open source applications that are well-supported and enjoy widespread use and development within the digital library community.
o “Changes in software releases of all components of the system (from ingest to access) are developed and tested in an isolated “development” environment to prepare for release to production. When ready for release, developers record the changes made and increment version numbers of system components as appropriate using a version control system. New versions of software are released using automated mechanisms (in order to prevent manual errors). Major changes and upgrades in hardware architecture are recorded in monthly reports of unit activity, and thus are traceable to that level of detail.” (HT TRAC C1.8)
o “Additionally, subsets of production data are available in the development environment to allow developers to ensure proper system behavior before releasing changes to production.” (HT TRAC C1.9)
o “In order to design, build and modify software for the designated end-user community, HathiTrust conducts an active usability program and seeks input from the Strategic Advisory Board of HathiTrust. Similarly, with regard to software development in support of the archiving needs of the Participating Libraries, HathiTrust focuses on the development of highly functional ingest and validation mechanisms. HathiTrust also seeks and responds to guidance from the Strategic Advisory Board with regard to archiving services.” (HT TRAC C2.2)
Severity | Events
High Impact | • Software bug escapes detection in development environment and results in crash of application.
Moderate Impact | • Software bug escapes detection in development environment and prevents full access to digital objects.
| • Improper version of software is introduced to system (could have a greater or lesser impact depending on results of error and repository’s ability to detect it).
Low Impact | • Software bug escapes detection in development environment and prevents full use of system capabilities (e.g., rotation of images or additional functionality).
Scenario 7: Operator Error
• Review: Risks Involving Operator Error
The following table summarizes risks to HathiTrust posed by operator error; events are ranked according to their potential severity.
• HathiTrust’s Solutions for Operator Error
In any human enterprise, occasional operator error is unavoidable; HathiTrust strives to ensure that any such events are detected and resolved in a timely fashion.56 To help avoid occurrences and mitigate their potential impact, HathiTrust has automated many procedures and also relies upon application assertions, which can notify administrators when processes are not operating correctly. Even if an error is introduced to the file system and then backed up, the TSM client saves up to seven versions of a file for up to six months so that an earlier version can be retrieved.
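The practical reach of a "seven versions for up to six months" rule can be illustrated with a small sketch. This is a simplified reading of the sentence above and not a description of the TSM Group's actual policy settings; the function names, the 183-day approximation of six months, and the sample dates are all assumptions.

```python
from datetime import datetime, timedelta

MAX_VERSIONS = 7               # "up to seven versions"
MAX_AGE = timedelta(days=183)  # "up to six months", approximated (assumption)


def retained_versions(backup_times, now):
    """Return the backup timestamps still retrievable, newest first."""
    newest_first = sorted(backup_times, reverse=True)
    recent = [t for t in newest_first if now - t <= MAX_AGE]
    return recent[:MAX_VERSIONS]


def last_good_version(backup_times, error_time, now):
    """Newest retained version written before the error, or None."""
    for t in retained_versions(backup_times, now):
        if t < error_time:
            return t
    return None


# Hypothetical file that changes daily and is backed up at 02:00 each night.
now = datetime(2009, 6, 30, 12, 0)
nightly = [datetime(2009, 6, day, 2, 0) for day in range(21, 31)]

# Error introduced two days ago: a clean version is still retained.
print(last_good_version(nightly, datetime(2009, 6, 28, 9, 0), now))
# Error introduced eight days ago: every retained version postdates it.
print(last_good_version(nightly, datetime(2009, 6, 22, 9, 0), now))
```

Under these assumptions, an error in a frequently changing file must be noticed within roughly seven backup cycles for a clean copy to remain available, which is one reason the detection mechanisms described above matter.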
• Ingest: The Google Return (Object-Oriented) Validation Environment (GROOVE) process is entirely automated to avoid the introduction of operator error to the process; steps include:
o Identification of material for ingest
o Decryption and unzipping of files
o Format verification and validation with JHOVE
o Luhn barcode and MD5 checksum validation
o Creation of HathiTrust METS documents
o Establishment of HathiTrust handles (persistent URLs)
o Extension of the pairtree file directory (as new material enters the system)
• Archival Storage: Files stored within the repository are not accessed directly or manipulated by staff so that neither the zipped image and OCR files nor the METS document may be accidentally altered or deleted.
• Dissemination: The page-turner application references the stored image and then creates a .png (for TIFFs) or .jpg (for JPEG2000s) file for display to the viewer.
• Data Management: “New versions of software are released using automated mechanisms (in order to prevent manual errors).” (HT TRAC C1.8)
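The barcode and checksum validation step named in the ingest list above can be sketched as follows. The report does not describe how GROOVE implements these checks, so this is only an illustration: the function names are hypothetical, and the barcode is assumed to be a string of digits ending in a standard Luhn check digit.

```python
import hashlib


def luhn_valid(barcode: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    if not (barcode.isascii() and barcode.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, char in enumerate(reversed(barcode)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def md5_matches(payload: bytes, expected_hex: str) -> bool:
    """Compare the MD5 digest of the payload with the supplied value."""
    return hashlib.md5(payload).hexdigest() == expected_hex.lower()


# Hypothetical inputs for illustration.
print(luhn_valid("79927398713"))
print(md5_matches(b"hello", "5d41402abc4b2a76b9719d911017c592"))
```

Both checks are cheap to run on every object, which makes them well suited to a fully automated pipeline in which no operator inspects individual files.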
56 Please refer to Appendix B (HathiTrust Outages from March 2008 through April 2009).
Severity | Events
High Impact | • Operator error results in the irreparable loss of data or damage to equipment.
| • Operator error results in loss of key repository functions (ingest, storage, dissemination, etc.) for an extended period of time.
Moderate Impact | • Operator error remains undetected and causes persistent problems in the system but has no long term consequences.
Low Impact | • Operator error is detected by normal procedures or via an activity log and can be readily corrected.
Scenario 8: Physical Security Breach
• Review: Risks Involving a Physical Security Breach
Maintaining the physical security of the HathiTrust infrastructure is yet another crucial element in the repository’s efforts to manage risks and thereby lessen the chance that a disaster-type event occurs. Risks involve the damage and destruction of equipment and could even extend to unauthorized system access. Multiple levels of security exist at both the Michigan Academic Computing Center (MACC) and the Arbor Lakes Data Facility (ALDF) to protect HathiTrust from acts of vandalism, destruction or malicious tampering. Details on the potential impacts of a physical security breach are covered in “Scenario 1: Hardware Failure” and “Scenario 3: Network Security.”
• HathiTrust’s Solutions for Physical Security
o “Each of [the HathiTrust] storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel.”57
• Security at the MACC
The MACC Server Hosting SLA states the data center staff will:
o “Provide services necessary to maintain a safe, secure, and orderly environment for all tenants of the MACC.” (sec. 4.7)
o “Provide access control via HiD card and biometric readers for those listed on the Tenant Staff Authorized for Access list.” (sec. 4.5)
The MACC Website and the Michigan Academic Computing Center Operating Agreement58 provide additional details concerning the resources and procedures that help protect HathiTrust’s equipment at the MACC. The MACC Data Center Coordinator personally oversees the enforcement of security protocols and conducts regular audits of security logs and, when necessary, reviews surveillance video footage.
o Security Systems
“State-of-the-art security devices such as iris scanners, cameras, closed circuit television and on-call staff keep the data and machines housed in the MACC safe.”59
“Access to the data center will be by two-factor authentication (access card and iris scan) or escorted, supervised access. Access to the building will be by access card.” (MACC OA, sec. 5.3.1)
“Cameras throughout the corridor, security trap, and facility will be monitored and maintained by the Data Center Coordinator.” (sec. 5.2.1)
o Security Procedures
57 HathiTrust. “Technology” (2009) retrieved from http://www.hathitrust.org/technology on 15 June 2009.
58 Please refer to Appendix I (Michigan Academic Computing Center Operating Agreement).
59 Michigan Academic Computing Center. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 17 June 2009.
“The Operations Advisory Committee will establish procedures for granting access cards to the facility to those whose jobs require hands-on access to systems. All requests for access cards will be vetted and approved by the Operations Advisory Committee at their next meeting.” (sec. 5.3.2)
“Everyone on the access list for the data center will be required to attend a training session before working in the data center and sign an access agreement stating policies they must observe while in the data center.” (sec. 5.3.8)
• Security at the ALDF
As noted in the TSM Backup Service SLA, the University of Michigan’s ITCS “is responsible for physical security” at the ALDF. (sec. 4.9) While this document will not detail specific features of the ALDF’s operation, multiple levels of security and oversight are employed.
Scenario 9: Natural or Manmade Disaster
• Review: Risks Involving a Natural or Manmade Disaster
The following table details the risks to HathiTrust posed by a natural or manmade disaster; events are ranked by order of their severity. Due to possible overlap between this scenario and Scenario 1 (Hardware Failure), readers are encouraged to consult that earlier section.
• HathiTrust’s Solutions for Natural or Manmade Catastrophic Events
The University of Michigan Ann Arbor Campus Emergency Procedures (revised January 2008) has set procedures to address building evacuations (in the event of fire), tornadoes, severe weather, flooding, chemical/biological/radioactive spills, as well as bomb threats, civil disturbances, and acts of violence or terrorism.60 In all cases, staff will follow the directions of Public Safety and not re-enter buildings or resume work “until advised to do so by DPS or OSEH or someone from on-site incident command.”
In the event of a severe natural or manmade disaster, the repair and restoration of the physical locations of HathiTrust infrastructure would need to be coordinated between the repository and the appropriate facility managers. Such activity would rely upon the disaster recovery plans in place at the MITC Building (home of the MACC) and University of Michigan (which includes the Hatcher Graduate Library and the ALDF). It must be noted that an event which causes significant damage to an important structure or to a building’s infrastructure could result in the loss of an instance of the repository for an extended period of time. In such a case, HathiTrust would need to set up an alternate hot site until structural restoration is complete (or a new facility has been found).
60 Please see Appendix C (Washtenaw County Hazard Ranking List).
Severity | Events
High Impact | • Widespread damage to a data center and/or its infrastructure that forces an instance of the repository to find a new hot site with sufficient power supply, environmental controls, and security.
| • Damage to work areas forces staff to relocate to a new center of operations.
| • Extensive loss or damage to hardware requires large-scale replacement.
| • With the extended loss of one site, HathiTrust loses redundancy (and possibly some functionality, e.g., the ability to ingest new material in Ann Arbor) and thus a central component of its disaster recovery and backup plans.
| • An act of violence or terrorism occurs at or near HathiTrust facilities.
Moderate Impact | • An event results in an extended outage at one site that exceeds the recovery time objective.
| • Hardware sustains some damage and site is able to continue operation in a reduced capacity.
| • An actual or threatened act of violence or terrorism forces the temporary evacuation or quarantine of HathiTrust facilities.
Low Impact | • Local conditions result in a temporary outage at a HathiTrust site.
• Basic Disaster Recovery Strategies
In the aftermath of a large-scale manmade or natural disaster, the repository’s immediate recovery will be enabled by its basic system architecture:
o “the initiative’s technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage are located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate facility outside of Ann Arbor).”61
The establishment of the mirror site in Indianapolis and the retention of multiple backup tapes at two locations in Ann Arbor ensure that a serious event at either location will not impede the continued functioning of the repository at the other. Consideration must be given as to how data at the Indianapolis site will be backed up and how key repository functions (such as ingest) will proceed if the Ann Arbor instance is off-line for an extended period of time. Likewise, a long-term outage at the IU location would require HathiTrust to establish a third site for data backup (i.e., a location where additional copies of backup tapes could be stored).
61 HathiTrust. “Technology” retrieved from http://www.hathitrust.org/technology on 15 June 2009.
Scenario 10: Media Failure or Obsolescence
• Review: Risks Involving Media Failure or Obsolescence
The following table summarizes risks to HathiTrust posed by the failure of the media used for its data backups. While the risks from this scenario are limited (both copies of the tape backups would have to be impacted for data to be unavailable), the issue should nonetheless be addressed with regular test restorations and/or inspections of the media.
• HathiTrust’s Solutions for Media Failure
Given the nature of HathiTrust’s storage system, this scenario is a concern only with regard to the digital magnetic tapes used by the TSM Group for backups.
o Two tape copies of all backup data are made and these are stored in separate climate-controlled conditions in tape libraries at the MACC and the ALDF.
o Content is transferred to new tape during data defragmentation (which occurs when existing tapes are 80% full).
o If a degraded or otherwise ‘bad’ section of tape is detected during a backup procedure, that tape is immediately marked as “read only.”
Data is thenceforth written to a different tape; existing data on the bad tape will be copied to properly functioning media.
If data cannot be reclaimed from bad tape, the TSM Group would contact HathiTrust so that the backup of content can be properly completed.
• Remaining Vulnerabilities
There is some reason for concern in this area because the TSM Group does not have a regular program to monitor its media for physical degradation or impairment after data defragmentation. While the tapes are reported to be highly dependable, problems such as “sticky shed” (the hydrolysis of the tape’s binder) could become an issue with older tapes. A regular program of tape validation or test restorations would provide an opportunity to check on the physical condition and data integrity of the tapes. Likewise, the creation of a schedule for the replacement of older tapes could avoid future problems with media degradation.
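A test restoration of the kind recommended above is only informative if the restored files are compared against known-good values. The sketch below shows one way such a comparison might be scripted; it is an assumption-laden illustration and not a TSM Group procedure. The function names, the manifest structure (relative file name mapped to expected digest), and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_restore(restore_root: Path, manifest: dict) -> dict:
    """Classify each manifest entry as ok, missing, or mismatch.

    manifest maps a relative file name to its expected digest
    (a hypothetical format chosen for this example).
    """
    report = {"ok": [], "missing": [], "mismatch": []}
    for relative_name, expected in sorted(manifest.items()):
        candidate = restore_root / relative_name
        if not candidate.is_file():
            report["missing"].append(relative_name)
        elif file_digest(candidate) != expected:
            report["mismatch"].append(relative_name)
        else:
            report["ok"].append(relative_name)
    return report
```

Running such an audit on a sample of restored content from each tape copy would give a periodic, documented signal of media health; any entry reported as missing or mismatched would prompt a closer inspection of the tape concerned.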
Severity | Events
High Impact | • Physical degradation (e.g., in tape binder, substrate, or magnetic content) affects both copies of older backup tapes.
Moderate Impact | • Because backup tapes are not regularly tested or audited, the physical substrate of tapes may degrade over time.
Low Impact | • Bad tape is detected during a tape backup.
Conclusions and Action Items
• Conclusions
As this report demonstrates, a variety of risk management strategies, in addition to design elements, operating procedures, and service and support contracts, endow HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a range of disasters. The establishment of the Indianapolis mirror site, the performance of nightly tape backups, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. Disasters, however, often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.
In the effort to secure HathiTrust’s long-term continuity, the present document stands merely as a preliminary step in the establishment of a legitimate Disaster Recovery Plan. The data on HathiTrust’s policies, procedures, and contracts consolidated herein should facilitate the data collection requisite to the initial phases of the planning process, but the core activities of formulating technical and administrative response strategies and delegating roles and responsibilities remain to be undertaken. The following section outlines recommendations and action items derived from research into the repository as well as from discussions with Cory Snavely and other HathiTrust staff members. Items have been separated into an approximate timeline of activity ranging from Short Term through Long Term, and the arrangement within each category represents a suggested (but by no means definitive) order of accomplishment. For a more detailed explanation of action items related explicitly to Disaster Recovery Planning, please refer to the overview of the planning process in Appendix E or consult Appendix D for a list of more comprehensive guides and resources. (NB: * = Denotes an ongoing activity.)
• Short Term Action Items (0-6 months)
a. Resolve the nature and extent of the insurance coverage for HathiTrust equipment.
b. Arrange with TSM Group administrators to periodically perform a volume audit of backup tapes to ensure data integrity.
c. Institute periodic test restores with TSM Group to ensure that the process will run smoothly in the event of a disaster.
d. Discuss the creation of a long-term replacement schedule for backup tapes with the TSM Group to avoid the possibility of media degradation.
e. Improve control over system components
   i. Update the hardware inventory to include all important system components; document models, serial numbers, UM IDs, associated software and version number, date of purchase, original cost, as well as vendor contact information and product support contracts.*
   ii. Establish a software inventory to document necessary applications in the event of hardware loss; should include purpose, acquisition date, cost, license number, and version number.*
   iii. Create a map identifying where components are in the MACC and within individual racks.*
   iv. Review and assess points of failure as well as the adequacy of redundant components.*
f. Establish phone trees
   i. Include key contacts for different types of disaster
   ii. Prioritize phone trees to target individuals who
      1. Make decisions
      2. Have vital information
      3. Can offer assistance in resolving situations
   iii. Distribute information and explain protocols to all relevant staff*
   iv. Develop a regular maintenance/update schedule (once every 4-6 months)*
g. Thoroughly document and make available (as needed) important institutional knowledge so that HathiTrust may continue to function in the event of the extended absence or loss of key staff.*
h. Identify disaster preparedness and disaster recovery measures in place at Indianapolis.
• Intermediate Term (6-12 months)
a. Form a Disaster Recovery Planning Committee to research and develop plans and to oversee their implementation.
b. Communicate and coordinate planning activities between Ann Arbor and Indianapolis.*
   i. Consider the formation of sub-committees for localized research and development of plans and an executive committee to oversee the implementation and management of plans.
c. Draft a Disaster Recovery Planning policy statement to define the mandate, responsibilities, and objectives for the plan.
d. Initiate the data collection and analysis phase of the planning process.
   i. Identify core repository functions and associated hardware and infrastructure elements.
   ii. Determine the potential impact from the loss of those functions.
   iii. Define the levels of functionality required for partial as well as full recovery. Establish what level is needed for HT to fulfill its mission and the needs of its users.
   iv. Define HathiTrust’s Recovery Time Objective (RTO: the maximum allowable outage period for services) and Recovery Point Objective (RPO: the point in time to which data stores must be returned following a disaster).
   v. Determine the availability of resources in the event of a disaster and establish the repository’s prioritization with major service providers and vendors (e.g., TSM Group, ITCom, Isilon).
e. Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.*
f. Develop recovery strategies to bring core functions back online as soon as possible within a set cost range.
   i. Establish a logical progression in the restoration of services and associated components.
   ii. Identify the resources required for these efforts.
   iii. Consider alternative solutions, including partial (vs. full) recovery.
g. Communicate planning goals and efforts to key contacts from service providers and vendors to better coordinate recovery efforts.*
h. Initiate the production of core Disaster Recovery documents (see Appendix E for more information). The following list is not exhaustive; data collection and analysis will help determine whether all of these plans or others (e.g., a web continuity plan) are needed.
   i. Business Continuity Plan: details HathiTrust’s core functions and the priorities for re-establishing each in the event of a disruption.
   ii. Continuity of Operations Plan: focuses on restoring an organization’s (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.
   iii. IT Contingency Plan: addresses explicitly the disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.
   iv. Crisis Communications Plan: establishes procedures for internal and external communications during and after an emergency.
   v. Cyber-Incident Response Plan: defines the procedures for responding to cyber attacks against the HathiTrust IT system.
   vi. Occupant Emergency Plan: defines response procedures for staff in the event of a situation that poses a potential threat to the health and safety of HathiTrust personnel or their environment. (This requirement is addressed by University of Michigan Building Emergency Action Plans.)
   vii. Disaster Recovery Plan: brings together guidance and procedures from the other plans to enable the restoration of core information systems, applications, and services. This plan defines roles and responsibilities within Disaster Response Teams.
   viii. Disaster Recovery Training Plan: establishes the situations and procedures to be covered by HathiTrust’s Disaster Recovery training.
• Long Term (12+ months)
a. Complete and implement Disaster Recovery Plans.
   i. Distribute physical copies of the plans as needed and include at least one copy in an off-site location.
   ii. Integrate elements of response strategies into system architecture to facilitate their deployment in the event of a disaster.*
b. Disaster Recovery Committee should monitor changes in best practices and technology, update plans, and oversee organizational readiness.*
   i. Initiate staff training so that individuals are familiar with Disaster Recovery procedures and communication protocols.*
   ii. Institute regular tests of disaster preparedness with simulated disasters involving different components of HathiTrust operations.*
   iii. Establish a schedule for maintenance and revisions to the Disaster Recovery documents.*
   iv. Coordinate Disaster Recovery Plan implementation, training, and review with Indianapolis.*
c. Store an additional copy of backup tapes at a third site to reduce exposure and limit the chance that a widespread event in Ann Arbor could impact both local copies.
d. Explore the possibility of establishing a third site for HathiTrust’s digital objects to reduce exposure and address concerns over the relative geographical proximity of Indianapolis and Ann Arbor.
e. Determine the feasibility of moving operations to a “hot” site in Ann Arbor should a disaster render the MACC unusable.
   i. Identify suitable sites and consider making preliminary arrangements.
   ii. Identify and price out equipment/infrastructure necessary to continue operations.
f. Plan for integration of new system components should the sudden collapse of Isilon leave HathiTrust without service/support.
g. Consider an increase to system security measures as content becomes accepted from a wider range of sources and as HathiTrust becomes a higher-profile organization.
APPENDIX A: Contact Information for Important HathiTrust Resources
Indiana University Mirror Site
• Andrew Poland (Staff, Information Technology Services)
o [email protected]
o (317) 274-0746
• Troy Dean Williams (Vice President for Information Technology, IU at Bloomington)
o [email protected]
o (812) 856-5323
University of Michigan
Michigan Academic Computing Center (MACC): Houses much of the technical infrastructure of the University Library’s digital resources.
• Rene Gobeyn (MACC Data Center Coordinator)
o [email protected]
o (734) 936-2654
• ITCom UMNOC (Network Operations Center)
o [email protected]
o (734) 647-8888
ITCS-ITCom: Responsible for maintaining network connections to the UMnet Backbone and Internet; ITCom provides maintenance and support services for hardware and software.
• Mike Brower (Senior Project Manager, UM Libraries)
o [email protected]
o (734) 936-9736
• Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
o [email protected]
o (734) 647-3214
• ITCom UMNOC (Network Operations Center)
o [email protected]
o (734) 647-8888
Tivoli Storage Manager Group: Responsible for nightly automated tape backups of storage servers.
• Andrew Inman (Service Manager)
o [email protected]
o (734) 615-6286
• Cameron Hanover (Storage Engineer)
o [email protected]
o (734) 764-7019
• General Support: [email protected]
• Emergency contact: [email protected]
o Message will go to on-call staff’s pager in real time
• [email protected]
Arbor Lakes Data Facility: Houses one instance of the TSM backup tape library.
• ITCom UMNOC (Network Operations Center)
o [email protected]
o (734) 615-4209
• Ken Pritchard (ALDF facility manager)
o [email protected]
o (734) 615-2812
Procurement Services: Approves departmental purchases over $5,000; buyers also work as intermediaries with vendors.
• Steve Worden (UM Hardware Purchasing Specialist)
o [email protected]
o (734) 645-8972
• Shelly Eauclaire (Senior Buyer, Purchasing Services)
o [email protected]
o (734) 615-8767
• Ian Pepper (UM Dell Computers Contract Administrator)
o [email protected]
o (734) 647-4981
• Jeff Rabbitt (Alternate Dell Contract Administrator)
o [email protected]
o (734) 644-9232
Property Control: Responsible for tracking and tagging the university’s assets.
• Mary Ellen Lyon (Business Operation Manager)
o [email protected]
o (734) 647-3351 (t, th)
o (734) 763-1197 (m, w, f)
Office of Financial Analysis:
• David Storey (Inventory Coordinator): Delivers UM property tags to equipment at the MACC.
o [email protected]
o (734) 647-4264
Risk Management Services: Provides insurance coverage of university assets.
• Kathleen Rychlinski (Assistant Director, Risk Management Services)
o [email protected]
o (734) 763-1587
Non-University Contact Information
Isilon Systems
• Jim Ramberg (Regional Territory Manager)
o [email protected]
o Desk: (847) 330-6399
o Cell: (630) 561-2463
Sun Microsystems
• Christine Sluman (Service Sales Rep, Education)
o [email protected]
o (303) 557-3660, ext. 60519
o (303) 949-1567 (Cell)
• Larry Zimmerman (Michigan Account Manager, Sales)
o [email protected]
o (248) 880-3756
CDW-G
• University of Michigan Account Team
o [email protected]
• Hansen Chennikkra (Account Manager)
o [email protected]
o (866) 339-3639
• Adam Sullivan (Account Manager)
o [email protected]
o (866) 339-4118
Dell Computers
• Brian Ullestad (Higher Education Account Manager)
o [email protected]
o 1-800-274-7799 ext. 7249522
APPENDIX B: HathiTrust Outages from March 2008 through April 2009 62
• April 2009: HathiTrust experienced reduced performance from 11:00pm EDT on Thursday, April 23 to 8:22am EDT on Friday, April 24 due to a database problem at one of the sites and from 5:30pm to 9:00pm EDT on Thursday, April 30 due to unintended consequences from a networking configuration change.
• March 2009: HathiTrust was unavailable on Tuesday, March 3 from 7:00-8:00am EST and on Thursday, March 5 from 7:00-7:45am EST for operating system and database software upgrades.
• February 2009: On Sunday, February 22 at 8:40am EST, a power surge resulting from electrical system maintenance caused HathiTrust database and web servers to go offline. Staff learned of the problem at approximately 6:00pm EST, and service was restored by 6:30pm EST.
• January 2009: A brief outage is scheduled in January for a storage system software upgrade.
• December 2008: On Friday, December 19 at 7:30am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:40am EST.
• November 2008: On Tuesday, November 4 at 7:30am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:45am EST.
• October 2008: No outages reported.
• September 2008: On Thursday, September 18 at approximately 9:30am EDT, HathiTrust became inaccessible due to a software problem on a storage system; the problem was related to our work with data synchronization. Support was contacted and the problem was resolved at 10:45am EDT.
• August 2008: On Tuesday, August 26 at approximately 9:00am EDT, a database server was brought down to move to Indianapolis. Prior to shutting this server down, we did not update a manual failover configuration, causing volumes to be inaccessible to some users. The problem was resolved at 11:15am EDT.
• July 2008: Service was unavailable on Thursday July 31 from 7:00-7:30am EDT for a storage system software upgrade.
• June 2008: No outages reported.
• May 2008: No outages reported.
• April 2008: No outages reported.
• March 2008: No outages reported.
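The unplanned outages in this list offer a simple way to see how a Recovery Time Objective, once defined, could be checked against actual events. The sketch below uses the start and end times reported above for the three unplanned outages (some of which the source gives only approximately); the four-hour RTO is purely hypothetical, since the report recommends defining an RTO but does not state one.

```python
from datetime import datetime, timedelta

# (label, start, end) for the unplanned outages listed above, in local time.
UNPLANNED_OUTAGES = [
    ("2009-02-22 power surge",
     datetime(2009, 2, 22, 8, 40), datetime(2009, 2, 22, 18, 30)),
    ("2008-09-18 storage software problem",
     datetime(2008, 9, 18, 9, 30), datetime(2008, 9, 18, 10, 45)),
    ("2008-08-26 failover configuration",
     datetime(2008, 8, 26, 9, 0), datetime(2008, 8, 26, 11, 15)),
]

# Assumption for illustration only; not a HathiTrust figure.
HYPOTHETICAL_RTO = timedelta(hours=4)


def exceeds_rto(start, end, rto=HYPOTHETICAL_RTO):
    """Return True if the outage lasted longer than the RTO."""
    return (end - start) > rto


for label, start, end in UNPLANNED_OUTAGES:
    status = "exceeds" if exceeds_rto(start, end) else "within"
    print(f"{label}: {end - start} ({status} hypothetical RTO)")
```

With these inputs, only the February 2009 power surge (nearly ten hours, most of it before staff learned of the problem) would exceed the hypothetical four-hour threshold, which suggests that detection time deserves as much attention as repair time.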
62 HathiTrust. “Updates” from http://www.hathitrust.org/updates retrieved on 16 June 2009.
APPENDIX C: Washtenaw County Hazard Ranking List
The following list ranks a variety of natural and manmade events within Washtenaw County, Michigan, based upon their frequency of occurrence and the extent of their potential impact (on the county’s population).
Rank | Hazard | Frequency | Population Impacted
1 | Convective weather (severe winds, lightning, tornados, hailstorms) | Once or more/yr. | 250,000
2 | Hazardous materials incidents: transportation | Once or more/yr. | 2,000
3 | Hazardous materials incidents: fixed site | Once or more/yr. | 10,000
4 | Severe winter weather hazards (ice/sleet/snow storms) | Once or more/yr. | 250,000
5 | Infrastructure failures | Once every 5 yrs. | 30,000
6 | Transportation accidents: air and land | Once or more/yr. | 100
7 | Extreme temperatures | Once every 5 yrs. | 10,000
8 | Flood hazards: riverine/urban flooding | Once every 10 yrs. | 2,000
9 | Nuclear attack | Has not occurred | 250,000
10 | Petroleum and natural gas pipeline accidents | Once every 10 yrs. | 1,000
11 | Fire hazards: wildfires | Once or more/yr. | 0
Source: Washtenaw County Hazard Mitigation Plan (available online at http://www.ewashtenaw.org/government/departments/planning_environment/planning/planning/hazard_html)
APPENDIX D: Annotated Guide to Disaster Recovery Planning References
The topic of disaster recovery planning for the print and analog resources of libraries has been widely dealt with in professional literature, but comparatively little information exists concerning the development and implementation of plans for the digital content of cultural institutions. The following bibliography details resources which provide guidance, examples, and explanations of the objectives and strategies for digital Disaster Recovery Plans. It consists primarily of material compiled by Lance Stuchell (ICPSR Intern) and Nancy McGovern (ICPSR Digital Preservation Officer) and is included here with their permission.
University of Michigan Resources
• University of Michigan Administrative Information Services (MAIS): Emergency Management, Business Continuity, and Disaster Recovery Planning.
o http://www.mais.umich.edu/projects/drbc_methodology.html
o This site broadly outlines the need for and functions of Emergency Management, Business Continuity, and Disaster Recovery Planning at UM. It also contains templates designed to help units plan, test, and audit disaster and continuity programs.
• Provost and Executive Vice President for Academic Affairs: Standard Practice Guide: Institutional Data Resource Management Policy
o http://spg.umich.edu/
o This policy defines institutional data resources as University assets and makes recommendations on identifying, preserving, and providing access to these assets. The digital resources of the library may be identified as such, based upon their use by departments across the university.
• ICPSR Disaster Planning Resources:
o Digital Preservation Officer Nancy McGovern is part of a Disaster Recovery initiative at ICPSR and over the past several years her team (including Lance Stuchell) has produced a variety of documents and templates to help other institutions work through the planning process.
o Documents are available upon request and should be posted in the near future (as of July 2009) to the ICPSR Website (http://icpsr.umich.edu/).
• Disaster Recovery Experts:
o Rene Gobeyn (MACC Data Center Coordinator)
  Managed and coordinated Disaster Recovery for U.S. military data centers
  [email protected]
o Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
  Helped develop current ITCS Disaster Recovery plans
  [email protected]
External Resources
• General Guide to Disaster Planning
o Contingency Planning Guide for Information Technology Systems: Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-34, June 2002.
  http://csrc.nist.gov/publications/nistpubs/800-34/sp800-34.pdf
  An indispensable resource which was used heavily by ICPSR in its Disaster Recovery planning. It covers everything from initial data collection and policy formation to the structure of disaster response teams and the articulation of recovery strategies.
• Examples and Tools for the Documentation Outlined by NIST Guide:
o Full Disaster Recovery Plan:
  United States Department of Agriculture Disaster Recovery and Business Resumption Plans
  http://www.ocio.usda.gov/directives/doc/DM3570-001.htm
o Business Continuity Plan (BCP):
  MAIS: Emergency Management, Business Continuity, and Disaster Recovery Planning
  http://www.mais.umich.edu/projects/drbc_templates.html
  This site provides several resources that deal with continuity planning.
o Continuity of Operations Programs (COOP):
  FEMA: Continuity of Operations (COOP) Programs
    • http://www.fema.gov/government/coop/index.shtm
    • Contains a lot of useful information on government policy, templates, and training resources to assist in the creation of a COOP.
  Ready.gov: Continuity of Operations Planning
    • http://www.ready.gov/business/plan/planning.html
    • Guidelines for composing a business COOP, including what outside actors should be involved in the planning process.
  The Florida Department of Health: Continuity of Operations Plan for Information Technology
    • http://www.naphit.org/global/library/basement_docs/FL_DisasterRecovery_template.doc
    • Lengthy (40 pages) and detailed COOP template written for an IT environment.
  Florida Atlantic University Libraries: Continuity of Operations Plan
    • http://www.staff.library.fau.edu/policies/coop-2007.pdf
    • A detailed working COOP, which includes reactions to specific disaster scenarios.
o IT Contingency Plan:
  See the USDA Disaster Recovery Plan for an example of an IT Contingency Plan.
o Cyber Incident Response Plan:
  Multi-State Information Sharing and Analysis Center Cyber Incident Response Guide
    • http://www.msisac.org/localgov/documents/FINALIncidentResponseGuide.pdf
    • The guide provides a step-by-step process for responding to incidents and developing an incident response team. It may also serve as a template for drafting a Cyber-Incident Response Policy and Plan.
o Crisis Communication Plan:
  Ready.gov: Write a Crisis Communication Plan
    • http://www.ready.gov/business/talk/crisisplan.html
    • This site provides guidelines for composing a business disaster communication plan and includes suggestions for the plan’s Web presence.
  NC State University: Crisis Communication Plan
    • http://www.ncsu.edu/emergency-information/crisisplan.php
    • This is the policy and plan for the University as a whole. While much of this policy deals with communication at a high level, useful sections detail vital contacts within the organization (including who to contact first), and how to manage external communications.
  Other thorough university policies and plans include the LSU: Crisis Communication Plan and the Missouri S&T: Crisis Communication Plan.
  Heritage Microfilm Flood Update Email
    • This email was sent in response to the June 2008 flooding that occurred in the Midwest.
    • It updates clients on the outage of NewspaperArchive.com which resulted from a flood-induced widespread power failure. It is an excellent example of an external crisis communication to users.
o Disaster Recovery Plans (DRP):
  The University of Iowa: IT Services Disaster Recovery Plan
    • http://cio.uiowa.edu/ITplanning/Plans/ITSdisasterPrep.shtml
    • This policy details the data collection and assessment which informs the UI plan and also includes emergency procedures, response strategies, and a crisis communication plan.
  University of Arkansas: Computing Services Disaster Recovery Plan
    • http://www.uark.edu/staff/drp/
    • A complete and thorough plan that outlines the initiation of emergency and recovery procedures, and addresses how the plan will be maintained.
  Adams State College (CO): Information Technology Disaster Recovery Plan
    • http://www.adams.edu/administration/computing/dr-plan100206.pdf
      • This plan has a thorough section on risk assessment.
    ▪ Digital Preservation Europe Repository Planning Checklist and Guidance
      • http://www.digitalpreservationeurope.eu/platter.pdf
      • Designed for use with the Planning Tool for Trusted Electronic Repositories (PLATTER), this document outlines considerations for a Disaster Recovery Strategic Objective Plan (SOP) and places them in context with other repository plans.
  o Occupant Emergency Plan (OEP):
    ▪ This requirement is addressed by University of Michigan Building Emergency Action Plans (EAP).
      • http://www.umich.edu/~oseh/guideep.pdf
  o Disaster Recovery Training Guides:
    ▪ dPlan.org
      • Provides useful information on training and an online form that would be useful in assigning trainers and monitoring the training process.
    ▪ CalPreservation.org: Disaster Plan Exercise
      • http://calpreservation.org/disasters/exercise.html
      • Provides roles and teaching points for a role‐play training exercise that focuses on a disaster in a library.
• Policy Planning Tools:
  o Association of Public Treasurers of the United States and Canada: Disaster Policy Certification Guidelines
    ▪ www.aptusc.org/includes/getpdf.php?f=Disaster_Policy.pdf
    ▪ This planning document and template for disaster management policies provides outlines and example language on several facets of a strong policy, including the possible loss of a building, the replacement of computer resources, and testing and training for the disaster plan. It also outlines the need to identify possible threats to assets.
• Examples of Disaster Planning Policies:
  o Arkansas Secretary of State: Disaster Planning Policy
    ▪ http://www.sos.arkansas.gov/elections/elections_pdfs/register/oct_reg/016.14.01‐020.pdf
    ▪ This policy outlines areas of responsibility between departments and units, and includes training, communication, and recovery plan updates.
  o Washington State Department of Information Services: Disaster Recovery and Business Resumption Planning Policy
    ▪ http://isb.wa.gov/policies/portfolio/500p.doc
    ▪ This document illustrates policy formation for an IT Disaster Recovery Plan. It provides guidelines for Disaster Recovery Planning as well as maintenance, testing, and training involved with the recovery plan.
  o Florida State University: Information Technology Disaster Recovery and Data Backup Policy
    ▪ http://oti.fsu.edu/oti_pdf/Information%20Technology%20Disaster%20Recovery%20and%20Data%20Backup%20Policy.pdf
    ▪ This document includes policy for data backup as well as Disaster Recovery. Part of the policy includes a definition of Best Practice Disaster Recovery Procedures, as well as an outline of the university’s own IT recovery planning and implementation procedures.
• Example of a Relevant Disaster Planning Program:
  o OCLC Digital Archive Preservation Policy and Supporting Documentation
    ▪ http://www.oclc.org/support/documentation/digitalarchive/preservationpolicy.pdf
    ▪ This document has a clear articulation of OCLC's disaster policy, along with an outline of disaster prevention and recovery procedures and a time‐frame for the restoration of services in the event of a disaster.
    ▪ The policy includes a good definition of a disaster prevention and recovery plan: “A set of responses based on sound principles and endorsed by senior management, which can be activated by trained staff with the goal of preventing or reducing the severity of the impact of disasters and incidents.”
    ▪ OCLC embeds its disaster plan within its overall preservation policy, stating: “The goal of disaster prevention is to safeguard the data (content and metadata) in the Digital Archive and to safeguard the Digital Archive’s software and systems. For disaster prevention and recovery, all data (content and metadata) is considered of equal value.”
• Designing a Disaster Planning Program:
  o Michigan State University: Step by Step Guide to Disaster Recovery Planning
    ▪ http://www.drp.msu.edu/Documentation/StepbyStepGuide.htm
    ▪ This program breaks down the disaster planning process into steps, and provides information relevant to individual units within a university setting. The MSU Disaster Recovery Planning Homepage (http://www.drp.msu.edu/) also offers a variety of resources.
  o Minnesota State Archives: Disaster Preparedness
    ▪ http://www.mnhs.org/preserve/records/docs_pdfs/disaster_000.pdf
    ▪ This document is a detailed guide to the disaster planning process. While mostly dealing with paper records, the document clearly identifies different roles and responsibilities for members of the planning and recovery team.
  o Cisco Systems: Disaster Recovery Best Practices White Paper
    ▪ http://www.cisco.com/warp/public/63/disrec.pdf
    ▪ The paper outlines Disaster Recovery using the framework of the above resources, but tailors it to an IT point of view. It has useful information on how to prepare and recover both hardware and software assets.
  o AT&T: Key Elements to an Effective Business Continuity Plan
    ▪ http://www.business.att.com/content/article/Key_to_Effective_BC_Plan.pdf
    ▪ A short paper that summarizes business continuity planning in the private sector.
• General Information
  o Federal Emergency Management Agency: Emergency Management Guide for Business & Industry
    ▪ http://www.fema.gov/business/guide/index.shtm
    ▪ A practical guide with step‐by‐step advice on creating a Disaster Recovery program. Includes information on the formation of a planning committee, organizational analysis, and details on specific hazards.
  o Special Libraries Association Information Portal: Disaster Planning and Recovery
    ▪ http://www.sla.org/content/resources/infoportals/disaster.cfm
    ▪ An exhaustive list of resources, this page includes articles on digital disaster recovery strategies as well as information on planning, examples of plans, and links to a wide range of resources in the public and private sector.
Written Resources:
• Wellheiser, Johanna and Jude Scott. An Ounce of Prevention: Integrated Disaster Planning for Archives, Libraries, and Record Centres. Lanham, MD: Scarecrow Press, 2002.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004233950&local_base=AA_PUB
• Cox, Richard J. Flowers After the Funeral: Reflections on the Post‐9/11 Digital Age. Lanham, MD: Scarecrow Press, 2003.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004341258&local_base=AA_PUB
• Matthews, Graham and John Feather, eds. Disaster Management for Libraries and Archives. Burlington, VT: Ashgate, 2003.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004354795&local_base=AA_PUB
APPENDIX E: Overview of the Disaster Recovery Planning Process
Various resources agree that there is no one way to go about initiating a Disaster Recovery program or drafting a DR plan. An organization must proceed according to its functions and resources as well as the needs of its designated community of users. The following discussion draws heavily upon the ICPSR Disaster Planning Policy Framework (written by Nancy McGovern and Lance Stuchell) and the Contingency Planning Guide for Information Technology Systems published by NIST (2002). As such, it represents a consolidation and simplification of information presented in more depth elsewhere. A list of planning resources (with link information to full texts) is available in Appendix D.
• Basic Precepts of Disaster Recovery Planning
1) Disaster Recovery Planning is a continuous activity that involves monitoring internal conditions as well as evolutions in technology and threats; responding to new developments that arise; revising plans so that they remain relevant and effective; training staff according to plans; and testing organizational readiness.
   a. There is no single document which contains “the plan”; rather, a Disaster Recovery Plan consists of a suite of documents that require a regular schedule of testing and revision to be effective.
   b. There is no point at which a Disaster Recovery Plan is “finished.”
2) Disaster Recovery Planning needs to be an organization‐wide activity.
   a. Disaster recovery must be one of the basic functions of HathiTrust.
   b. An effective plan needs full administrative support.
   c. Policies and procedures must complement and conform to disaster response plans established by the university, city, and Department of Homeland Security.
3) Disaster recovery cannot be limited to the hardware and software components or data collections of HathiTrust; planning must also account for the impact of human emergencies on the repository’s operations.
• Essential Steps in Disaster Recovery Planning
1) Establish a Disaster Recovery Planning Committee.
   a. This group will research and develop the plan and help with its implementation, as well as monitor the training, testing, and revising of plans to ensure organizational compliance and readiness.
   b. The committee should involve individuals representing the various mission‐critical units within the library (from administration to Core Services to the Digital Preservation Librarian) who will participate in the development of policy and recovery planning.
   c. It is essential that the committee involve individuals with the authority to support and enforce recommendations.
   d. The committee’s activities should initiate the formation of a Disaster Response Program.
2) Draft a Disaster Recovery Planning Policy Statement.
   a. Enables the organization (and others) to understand the scope and nature of the Disaster Recovery Plan.
   b. Establishes the organizational framework and responsibilities for the planning process.
   c. Key policy elements (as detailed in the NIST report):
      i. Roles and responsibilities within the organization with regard to planning
      ii. Mandate for Disaster Recovery as well as any statutory or regulatory requirements
      iii. Scope as it applies to the type(s) of platform(s) and organizational functions subject to Disaster Recovery Planning
      iv. Resource requirements for the Disaster Recovery program
      v. Training requirements
      vi. Exercise and testing schedules (at least one major annual test)
      vii. Plan maintenance schedule (elements should be reviewed annually)
      viii. Frequency of backups and storage of backup media
3) Conduct Data Collection and Analysis (i.e., “Business Impact Analysis”).
   a. Determine critical functions and identify specific system resources required to perform them. Minimum requirements for functionality should be established.
   b. Determine risks and vulnerabilities facing the repository’s systems and infrastructure.
   c. Identify and coordinate with internal and external points of contact to determine how they depend on or support the repository and its functions; consider how one failure might cascade into others.
      i. Identify resources that are crucial to HathiTrust (e.g., Mirlyn).
      ii. Determine the allowable outage/disruption time for these resources.
   d. Develop recovery priorities; balance the cost of inoperability against the cost of recovery.
      i. Determine HathiTrust’s position within the priorities of the university as well as with its major service providers and vendors (e.g., TSM Group, ITCom, Isilon, etc.) to better understand how that prioritization will impact recovery efforts.
      ii. Establish the most crucial functions which must be restored first.
      iii. Determine HathiTrust’s Recovery Time Objective (RTO, i.e., the maximum allowable outage period) and Recovery Point Objective (RPO, i.e., the point in time to which data files must be restored after a disaster).
      iv. Review potential resources (financial, personnel, etc.) within HathiTrust as well as those available via contracts, service providers, and product support. This step should involve the clarification of HathiTrust’s position within the priorities of the university as well as those of key service providers and vendors.
4) Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.
5) Develop recovery strategies that respond to the potential impacts and maximum allowable outage times established in the data collection phase. Efforts should focus on solutions that are cost‐effective and technically viable.
   a. Strategies should be designed to bring core functions back online as soon as possible within an established cost range.
   b. Recovery efforts must be prioritized according to the nature of core functions as well as logical order of procedures.
   c. Alternative solutions should be considered based upon cost, availability of resources, outage times, levels of functionality (partial vs. full), and ability to integrate methods with existing infrastructure.
   d. Determine the practicality of partial (vs. full) recovery in order to bring services back online in a timely and cost‐effective manner.
   e. Recovery strategies and resources should be incorporated (as possible) into the repository’s system architecture so that in the event of a disaster, the response may proceed in an efficient and straightforward manner.
6) Formalize and record collected data and recovery strategies in Disaster Recovery Documents. In the process of producing this wide range of documents, an organization is forced to consider and document policies and procedures related to a variety of key administrative and technical issues. The decision of which plans to include (and which to exclude) must be based upon a review of HathiTrust’s needs and objectives. Additional documents (a Web continuity plan, for instance) may be necessary based upon data collection and analysis.
   a. Business Continuity Plan
      i. Business continuity is the ability of a business to continue its operations with minimal disruption or downtime in the event of natural or manmade disasters.
      ii. Such planning allows an organization to ensure its survival by considering potential business interruptions and establishing appropriate, cost‐effective responses.
      iii. The Business Continuity Plan details HathiTrust’s core functions and the priorities for re‐establishing each in the event of a disruption. It should address key administrative and support functions as well as those which directly involve the repository’s designated community.
      iv. The plan should thoroughly document the nature of key functions, interdependences, the impact of their loss, and alternative means to ensure their continuation in the event of a disaster. MAIS offers a useful Business Continuity planning template at http://www.mais.umich.edu/projects/drbc_templates.html.
   b. Continuity of Operations Plan (COOP)
      i. The COOP focuses on restoring the essential functions of an organization (usually a headquarters element) at an alternate site and performing those functions for up to 30 days before returning to normal operations.
      ii. This plan may include the Business Continuity Plan and Disaster Recovery Plan as appendices.
   c. IT Contingency Plan
      i. The IT Contingency Plan addresses disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.
      ii. It should account for the following:
         1. Document hardware and software
         2. Develop an emergency contact list
         3. Back up and store all data files off‐site
         4. Proactively monitor equipment and data
         5. Install and update antivirus software on both computers and servers
         6. Develop recovery scenarios
         7. Communicate and monitor the plan
      iii. The plan allows HathiTrust to formalize and document procedures and policies already in place and details the repository’s adherence to these goals.
   d. Crisis Communications Plan
      i. Communication is a vitally important aspect of Disaster Recovery Planning and an organization’s actual response in a disaster.
      ii. The Crisis Communications Plan establishes procedures for internal and external communications during and after an emergency.
      iii. The different phases of crisis communication encompass the initial notification of an event, damage assessment, and plan activation as well as status reports (as needed) and the eventual completion of recovery efforts.
      iv. Activation of the communications plan must be the responsibility of a specific individual.
      v. The Disaster Response Team coordinates with the Crisis Communication Team to ensure that information provided about an emergency is clear, concise, and consistent.
   e. Cyber‐Incident Response Plan
      i. This plan defines the procedures for responding to cyber attacks against the HathiTrust IT system.
      ii. It provides a formal framework for identifying, mitigating, and recovering from malicious computer incidents, such as unauthorized access to a system or data, denial of service, or unauthorized changes to system hardware, software, or data.
   f. Occupant Emergency Plan
      i. The Occupant Emergency Plan defines response procedures for library staff in the event of a situation that poses a potential threat to the health and safety of personnel, the environment, or HathiTrust property.
      ii. HathiTrust may utilize the framework provided by UM Building Emergency Action Plans for this element.
   g. Disaster Recovery Plan
      i. The primary focus of the Disaster Recovery Plan is the restoration of core information systems, applications, and services.
      ii. The plan brings together guidance and procedures from the other plans (e.g., Business Continuity Plan, IT Contingency Plan, Crisis Communications Plan, etc.) pertaining to emergencies that result in interruptions of service that exceed acceptable downtimes, as defined in the BCP.
      iii. The plan should detail established recovery strategies for specific disaster situations as well as the teams involved in their execution.
      iv. Personnel should be chosen to staff disaster response teams based on their skills and knowledge. Ideally, teams would be staffed with the personnel responsible for the same or similar operation under normal conditions. Team members should also be familiar with the goals and procedures of other teams to facilitate inter‐team coordination. Each team is led by a team leader (with a suitable alternate) who directs overall team operations, acts as the team’s representative to management, and liaises with other team leaders. Disaster Response cannot be individual‐specific or overly reliant on specific people. Teams must assign each role at least one alternate in the event that core people are unavailable at the time of a disaster.
      v. NIST suggests that a capable strategy will require some or all of the following functional groups. For HathiTrust, many of these are already in place in the form of University of Michigan units and service providers.
         1. An authoritative role for overall decision‐making responsibility
         2. Senior Management Official
         3. Management Team
         4. Damage Assessment Team
         5. Operating System Administration Team
         6. Systems Software Team
         7. Server Recovery Team (e.g., client server, Web server)
         8. LAN/WAN Recovery Team
         9. Database Recovery Team
         10. Network Operations Recovery Team
         11. Application Recovery Team(s)
         12. Telecommunications Team
         13. Hardware Salvage Team
         14. Alternate Site Recovery Coordination Team
         15. Original Site Restoration/Salvage Coordination Team
         16. Test Team
         17. Administrative Support Team
         18. Transportation and Relocation Team
         19. Media Relations Team
         20. Legal Affairs Team
         21. Physical/Personnel Security Team
         22. Procurement Team (equipment and supplies)
   h. Disaster Recovery Training Plan
      i. This plan will establish the situations and procedures to be covered by HathiTrust’s Disaster Recovery training.
      ii. The contents of the plan should reflect the range of responsibilities held by administrators, department heads, and staff within HathiTrust.
      iii. The plan should accommodate Disaster Recovery Planning Committee members as well as those of the Disaster Response Team. For the latter, it should identify key roles and responsibilities in recovery efforts.
      iv. The plan should allow in‐house training to be supplemented by external opportunities.
      v. Regularly scheduled emergency drills should also be included to test the readiness of staff and the appropriateness of response procedures.
7) Implement the elements developed in the planning process. Procedures and policies related to communication, technological solutions, etc. must be incorporated into HathiTrust’s overall design and operation so that Disaster Recovery becomes a critical organizational function.
8) Institute a regular program of training and testing to be sure that staff understand and accept policies and procedures and to ensure that HathiTrust is prepared for a disaster.
9) Conduct regular review and maintenance of Disaster Recovery documents to respond to changes in personnel, organizational structure or functions, and evolutions in technology and/or threats.
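The data collection and recovery strategy steps above (steps 3 and 5) can be illustrated with a minimal Python sketch. It orders a set of functions so that dependencies are restored first and higher priorities follow, then checks the result against a Recovery Time Objective and a backup schedule against a Recovery Point Objective. Every name and figure here (`CoreFunction`, the four functions, their priorities and restoration times, the RTO and RPO values) is a hypothetical placeholder rather than a HathiTrust value, and the sketch assumes functions are restored one after another and that every dependency also appears in the inventory.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class CoreFunction:
    name: str
    priority: int              # 1 = restore first (hypothetical ranking)
    restore_time: timedelta    # estimated time to restore this function
    depends_on: list = field(default_factory=list)


def recovery_order(functions):
    """Order names so dependencies come first; among functions that are
    ready at the same time, lower priority numbers go first."""
    by_name = {f.name: f for f in functions}
    sorter = TopologicalSorter({f.name: f.depends_on for f in functions})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        # get_ready() only returns functions whose dependencies are restored.
        for name in sorted(sorter.get_ready(), key=lambda n: by_name[n].priority):
            order.append(name)
            sorter.done(name)
    return order


def rto_report(functions, order, rto):
    """Map each function to True if it is back online within the RTO.

    Assumes sequential restoration, so outage time accumulates.
    """
    by_name = {f.name: f for f in functions}
    elapsed = timedelta()
    report = {}
    for name in order:
        elapsed += by_name[name].restore_time
        report[name] = elapsed <= rto
    return report


def meets_rpo(backup_interval, rpo):
    """Worst-case data loss is the time between two backups."""
    return backup_interval <= rpo


# Hypothetical inventory; none of these figures come from HathiTrust.
FUNCTIONS = [
    CoreFunction("ingest", 4, timedelta(hours=3), ["storage"]),
    CoreFunction("catalog_search", 3, timedelta(hours=2), ["storage"]),
    CoreFunction("storage", 1, timedelta(hours=4)),
    CoreFunction("page_viewer", 2, timedelta(hours=1), ["storage"]),
]

if __name__ == "__main__":
    order = recovery_order(FUNCTIONS)
    print(order)
    print(rto_report(FUNCTIONS, order, rto=timedelta(hours=8)))
    print(meets_rpo(backup_interval=timedelta(hours=24), rpo=timedelta(hours=12)))
```

With these placeholder figures, storage is restored first and ingest last, and ingest is the only function that misses an eight‐hour RTO. A result of that kind is what would prompt the review of partial versus full recovery described in step 5.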
• Main Phases in a Disaster Response:
1) Notification/Activation: This phase covers the initial actions once a situation has been detected or is threatened. It includes damage assessment and the implementation of an appropriate response strategy.
   a. Proper diagnosis and communication (both internal and external) of a disaster is essential.
   b. The nature of individual events will determine who needs to be involved (e.g., facilities management, core services, etc.).
2) Recovery: This phase focuses on the return to a pre‐established level of functionality (plans should detail partial as well as full recoveries).
   a. Response teams implement recovery strategies and adhere to procedures and protocols outlined in Disaster Recovery Documents.
3) Reconstitution: After recovery efforts are complete, normal operations must be restored. This may involve the reconstruction of facilities and/or infrastructure as well as the testing of restored elements to ensure their full functionality.
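One way to picture the three phases is as a small state machine that refuses to skip a phase and keeps a log for later review. The Python sketch below is illustrative only: the class and variable names are hypothetical, and the "false alarm" step from Notification/Activation straight back to normal operations is an assumption that the text above does not specify.

```python
from enum import Enum


class Phase(Enum):
    NORMAL = "normal operations"
    NOTIFICATION_ACTIVATION = "notification/activation"
    RECOVERY = "recovery"
    RECONSTITUTION = "reconstitution"


# Allowed next phases. The step from NOTIFICATION_ACTIVATION back to NORMAL
# (a false alarm) is an assumption, not something the report specifies.
NEXT = {
    Phase.NORMAL: {Phase.NOTIFICATION_ACTIVATION},
    Phase.NOTIFICATION_ACTIVATION: {Phase.RECOVERY, Phase.NORMAL},
    Phase.RECOVERY: {Phase.RECONSTITUTION},
    Phase.RECONSTITUTION: {Phase.NORMAL},
}


class DisasterResponse:
    """Tracks the current phase and records every transition."""

    def __init__(self):
        self.phase = Phase.NORMAL
        self.log = []  # (from_phase, to_phase, note) tuples

    def advance(self, to_phase, note=""):
        if to_phase not in NEXT[self.phase]:
            raise ValueError(
                f"cannot move from {self.phase.value} to {to_phase.value}"
            )
        self.log.append((self.phase, to_phase, note))
        self.phase = to_phase


if __name__ == "__main__":
    response = DisasterResponse()
    response.advance(Phase.NOTIFICATION_ACTIVATION, "outage detected")
    response.advance(Phase.RECOVERY, "damage assessed, strategy chosen")
    response.advance(Phase.RECONSTITUTION, "core functions back online")
    response.advance(Phase.NORMAL, "restored elements tested")
    for entry in response.log:
        print(entry)
```

The transition log is the part most likely to be useful in practice, since it gives the regular review and maintenance cycle a record of how an actual response unfolded.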
APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)
APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard SA (2006)
APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)