TRANSCRIPT
Managing and Preserving Data and Complex Digital Objects
KATHERINE SKINNER, PHD
EXECUTIVE DIRECTOR, EDUCOPIA INSTITUTE
AUGUST 9, 2017: ETD 2017 WORKSHOP
Learning Objectives
After this workshop, you will be able to:
1. Demonstrate your familiarity with emerging methods and tools to support students’ digital object management
2. Anticipate the impact of expanding research formats and materials on lifecycle management practices for ETD programs
3. Describe what other institutions are doing to support students in managing (or learning to manage) their research outputs
4. Deliver workshops to students and/or faculty on your own campus to help them build their digital management skills
Agenda
09:00–09:20: Welcome, Introductions
09:20–09:40: Activity
09:40–10:00: Managing Data and Complex Digital Objects
10:00–10:15: Break
10:15–11:00: Data Organization, File Formats, and Storage
11:00–11:40: Connecting Student Needs to University Needs
11:45–12:00: Wrap up
Activity
Using the sticky notes provided on your table, please answer the following question:
What content type(s) does your institution’s ETD program currently accept?
◦ Write down one file format type per sticky note
◦ As you finish, please stick them in the middle of your table
◦ Tally the number and write that down on a separate sticky
◦ “I don’t know” is a perfectly reasonable answer
Activity
Using the sticky notes provided on your table, please answer the following question:
What content types does your institution’s digital preservation program currently support?
◦ Write down one file format type per sticky note
◦ As you finish, please stick them in the middle of your table
◦ Tally the number and write that down on a separate sticky
◦ “I don’t know” is a perfectly reasonable answer
Reflections
ETD+: audio-video files, digital art, GIS datasets, visualizations, software code, digital text, research data
Managing Data and Complex Digital Objects
http://metaarchive.org
ETDplus team:
Educopia Institute
MetaArchive Cooperative
NDLTD
ProQuest
Carnegie Mellon University
Colorado State University
HBCU Library Alliance
Indiana State University
Oregon State University
Penn State University
Purdue University
University of Louisville
UNC School of Library and Information Science
University of North Texas
University of Tennessee - Knoxville
Virginia Tech University
Core research question: How can institutions best ensure the longevity and availability of ETD research data and complex digital objects (e.g., software, multimedia files) that comprise an integral component of student theses and dissertations?
Starting place: Surveys
• Nine campuses
• Ten IRBs (argh!)
• Two surveys
• 795 student responses
• 35 administrators and staff
Participating Universities
Carnegie Mellon University
Colorado State University
Indiana State University
Oregon State University
Penn State University
Purdue University
University of Louisville
University of Tennessee Knoxville
Virginia Tech University
Fully 80% of 795 students report they will produce non-text files in their dissertation or thesis research
Students report that these non-PDF files like research data, video, digital art, and software code are either as important or more important than those submitted as PDFs to satisfy degree requirements.
[Chart panels: Most important content to author; Most important content to others, as guessed by author]
35%+ of respondents said that their non-text files are their most important files
…but only 13% of respondents reported plans to actually submit those materials in their ETD package.
A student’s perception of value is not the key driver for what research outputs s/he chooses to submit in a thesis or dissertation package.
NEED TO CLOSE THIS GAP.
ADMINISTRATORS’ SURVEY: Does your institution accept objects in addition to the PDF of the thesis or dissertation? [Chart: number of responses for Yes, No, Don’t know, Prefer not to answer]
STUDENT SURVEY: [Chart: Yes, No, Don’t know, Prefer not to say]
[Chart categories: Guidance documents, None, Software, Webinars, Workshops]
Three-pronged approach
1. Educate ETD program managers
2. Educate students
3. Educate faculty
Discussion
Agenda
09:00–09:20: Welcome, Introductions
09:20–09:40: Activity
09:40–10:00: Managing Data and Complex Digital Objects
10:00–10:15: Break
10:15–11:15: ETD+ Toolkit: Data Organization, File Formats, Storage
11:15–11:45: Connecting Student Needs to University Needs
11:45–12:00: Wrap up
The ETD+ Toolkit helps the academic community to train students to ensure the longevity and accessibility of their research outputs.
What is the Toolkit?
An open set of six modules and evaluation instruments that prepare students to create, store, and maintain their research outputs.
MODULE 1: COPYRIGHT
How can students gain appropriate permissions and how can students signal copyright for their own works?
MODULE 2: DATA ORGANIZATION
How can students structure, describe, store, and deposit data and research files for reuse and/or future access?
MODULE 3: FILE FORMATS
How will the formats students choose make future access to their research easier or more difficult?
MODULE 4: METADATA
How can students store information describing their files to make sure they can tell what they are in the future?
MODULE 5: STORAGE
How can students make well informed choices about where to store their research materials?
MODULE 6: VERSION CONTROL
What mechanisms can students use to make it easier to see the history of a file with multiple versions?
Each module includes:
1. Learning Objectives
2. One-page Handout
3. Guidance Brief (customizable)
4. Slideshow with presenter notes
5. Evaluation survey
Who can use the Toolkit?
Anyone may freely adopt and adapt this toolkit.
http://educopia.org/etdplustoolkit
LINK TO THE TOOLKIT
https://educopia.org/publications/etdplustoolkit
So let’s experience a couple of these!
Data Organization
Key takeaway
The decisions you make about how you organize and structure your data today will have implications for how you and others can access and make use (or sense!) of that data in the future.
Why is data hard to deal with?
• Data without data documentation (e.g., a data dictionary) is often impossible to understand.
• Without access to specific (often expensive) software, a data file may be unable to be viewed or used.
• IRB and funder requirements may impact the way you need to structure your data.
• As data usage increases, data often needs to be interoperable in order to enable sharing and reuse.
Structuring your data well helps you to…
• Reproduce results
• Reuse it in the future
• Share it with others
• Gain and retain credibility
• Comply with IRB/funder requirements
Questions to ask… repeatedly!
1. What are the data organization standards for your field?
2. What are the data export options in the software you are using?
3. What forms of the data will be needed for future access?
Providing context for your data
Document:
1. The data’s purpose
2. A list of the files in your data package
3. Data dictionary listing and describing all variables
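One way to get started on a data dictionary is to generate a blank skeleton from a file's column headers and then fill in the descriptions by hand. The Python sketch below is only an illustration: the field names in each entry (description, type_or_units, missing_value_code) and the sample data are assumptions, not part of the Toolkit.

```python
import csv
import io


def data_dictionary_skeleton(csv_text):
    """Return one blank data-dictionary entry per column header in csv_text."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        {
            "variable": name.strip(),
            "description": "",          # to be filled in by the researcher
            "type_or_units": "",        # e.g., text, integer, minutes
            "missing_value_code": "",   # how NULL values are represented
        }
        for name in header
    ]


def write_data_dictionary(entries):
    """Render the entries as CSV text that can be saved next to the data file."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(entries[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return out.getvalue()


# Hypothetical sample data; replace with the contents of your own file.
sample = "movie_title,director,running_time_minutes\nPeter Pan,Herbert Brenon,105\n"
entries = data_dictionary_skeleton(sample)
print(write_data_dictionary(entries))
```

The skeleton only lists the variables; the purpose statement and the list of files in the data package still need to be written by the person who knows the project.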
Data organization principles
• Use one variable per column.
• Make one observation per row.
• Use human-readable column names.
• Include one table per tab.
• Indicate relationships between tables using a key.
Movie Title | Director | Distributor | Running Time | Budget | Released
Peter Pan | Herbert Brenon | Paramount Pictures | 105 minutes | 40,030 | Dec 29 1924
Girl Shy | Fred C. Newmeyer and Sam Taylor | Pathe Exchange | 82 minutes | 400,000 | Apr 20 1924
Greed | Eric Von Stroheim | Metro-Goldwyn-Mayer | 140 minutes | 665,603 | Dec 4, 1924
Do
• Consider what your NULL values are and how they are represented
• Consider what contextual documentation is required
• Use standard data representations (e.g., YYYYMMDD for dates)
Do not
• Use formatting to convey information
• Place comments in cells
• Use special characters in field names
• Use blank spaces or symbols in column names
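A few of these rules can be applied mechanically. The Python sketch below shows one possible approach: it removes blanks and symbols from column names, rewrites dates as YYYYMMDD, and replaces empty cells with an explicit marker. The function names, the "NA" marker, and the list of values treated as empty are assumptions made for illustration.

```python
import re
from datetime import datetime

# Assumption: "NA" is the agreed marker for missing values in this dataset.
NULL_MARKER = "NA"
# Assumption: these spellings all mean "no value was recorded".
EMPTY_SPELLINGS = {"", "-", "n/a", "null", "none"}


def clean_column_name(name):
    """Lowercase a header and replace blanks or symbols with underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def to_yyyymmdd(text):
    """Convert dates written like 'Dec 4, 1924' or 'Dec 29 1924' to YYYYMMDD."""
    parts = text.replace(",", " ").split()
    parsed = datetime.strptime(" ".join(parts), "%b %d %Y")
    return f"{parsed.year:04d}{parsed.month:02d}{parsed.day:02d}"


def normalize_null(value):
    """Replace empty-looking cells with one explicit missing-value marker."""
    return NULL_MARKER if value.strip().lower() in EMPTY_SPELLINGS else value


print(clean_column_name("Running Time"))
print(to_yyyymmdd("Dec 4, 1924"))
print(normalize_null("  "))
```

Whatever marker and date style you choose, record them in the data dictionary so that others can interpret the file.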
Discipline-based data repositories
• Social Sciences: ICPSR http://www.icpsr.umich.edu/icpsrweb/deposit/index.jsp
• Genomics: GenBank https://www.ncbi.nlm.nih.gov/genbank/
• Earth Sciences: NASA’s Earthdata https://earthdata.nasa.gov/
• Archaeology: tDAR http://www.tdar.org/
• Oceanography: NODC http://www.nodc.noaa.gov/
• BioSciences: Dryad https://datadryad.org/
Source: Guidance Briefs: Managing Your ETD Research Files
Example hand-out
Activity
• Choose one spreadsheet you are using for a current data-gathering project.
  ◦ Using the “Data Organization Principles,” check to see if your file meets those requirements.
  ◦ Create a data dictionary for the spreadsheet that describes the meaning of each column header.
File Formats
Examples of File Formats
• Images: jpg, gif, tiff, png, ai, svg, ...
• Video: mpeg, m2tvs, flv, dv, ...
• GIS: kml, dxf, shp, tiff, ...
• CAD: dxf, dwg, pdf, …
• Data: csv, mdf, fp, spv, xlx, tsv, ...
Key Concepts
• Use software that imports and exports data in common formats.
• Ask advisors and colleagues what formats they use and why.
• Choose a format with functions that support your research needs.
• Save your content in multiple formats to spread your risk across software platforms (e.g., docx, pdf, & txt; or mp4, avi, & mpg).
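For tabular data, saving in multiple formats can be scripted. The Python sketch below writes the same small table as CSV, TSV, and JSON using only the standard library. The choice of these three formats, the function name, and the sample rows are assumptions for illustration; the slide's own examples are docx/pdf/txt and mp4/avi/mpg.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_in_multiple_formats(rows, fieldnames, folder, stem):
    """Write the same table as CSV, TSV, and JSON; return the paths written."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for suffix, delimiter in ((".csv", ","), (".tsv", "\t")):
        path = folder / (stem + suffix)
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter=delimiter)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    json_path = folder / (stem + ".json")
    json_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    paths.append(json_path)
    return paths


# Hypothetical sample rows written to a temporary folder for demonstration.
rows = [{"movie_title": "Greed", "running_time_minutes": 140}]
with tempfile.TemporaryDirectory() as folder:
    for path in save_in_multiple_formats(rows, ["movie_title", "running_time_minutes"], folder, "movies-v1"):
        print(path.name)
```

After any such export, open each copy and confirm that the content survived intact.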
Choosing a file format
• Sustainability of Digital Formats http://www.digitalpreservation.gov/formats/content/content_categories.shtml
• Recommended Formats Statement https://www.loc.gov/preservation/resources/rfs/
Saving Web resources
• Robust Links http://robustlinks.mementoweb.org/
• Archive-It https://archive-it.org/
• Wayback Machine http://waybackmachine.org/
• Screenshots
File Format Conversions
• Options include proprietary, freeware, and open source solutions.
• Formats in broad use usually have more available options for conversion.
• When you convert the file, recognize that the process may transform your content.
• Before you convert, identify what characteristics are most important to maintain in the conversion process.
PDF-specific advice
• Embed fonts.
• Embed hyperlinks.
• Stabilize hyperlinks.
• Store supplementary materials as separate files.
• Verify PDF/A compliance.
• Test EVERYTHING.
File Formats
Many ETD programs favor pdf files. If you export research outputs to pdf, make sure you:
1. Embed your fonts
2. Embed (and test!) hyperlinks
3. Stabilize your web-based resources and citations (using a tool like Robust Links, Archive-It, or Perma.cc)
4. Store supplementary materials as separate files
5. Verify the PDF/A compliance (use Acrobat Pro “Preflight” feature under “Edit”)
Before you undertake any conversion, you need to identify what characteristics of your data are important to maintain during the conversion. For example, are the colors in a document or image important? Is the pagination essential? What about references? You will want to test these after your conversion is complete to ensure that you have a conversion that will meet your needs.
Additional Resources:
● List of File Formats (Wikipedia)
● Recommended Formats Statement (Library of Congress)
● Evaluating Your File Formats (UK National Archives)
● Reformatting Guides (US National Archives)
How to select file formats:
● Use software that imports and exports data in common formats.
● Ask advisors and colleagues what formats they use.
● Choose a format with functions that support your research needs.
● Save final versions of your content in multiple formats in order to spread your risk across multiple software platforms (e.g., docx, pdf, and txt; or mp4, avi, and mpg).
If you use website-based materials as evidence or references, take precautions to ensure that if the content moves, changes, or disappears, you still have evidence of its existence. Current tools to help you ensure the longevity of these materials include Robust Links and Archive-It. You can also take screenshots of important digital content in order to preserve the look and feel of an object.
There is no perfect file format. Each will have advantages and disadvantages depending on your research uses. Select a file format, or set of file formats, that helps you complete your research now, and that you can access again in the future. This is important both for your research outputs (what you create) and your research inputs (materials you use in the research process).
Common file types include:
Images: jpg, gif, tiff, png, ai, svg, …
Video: mpeg, m2tvs, flv, dv, …
GIS: kml, dxf, shp, tiff, …
CAD: dxf, dwg, pdf, …
Data: csv, mdf, fp, spv, xlx, tsv, …
Text: txt, rtf, tvi, doc, pdf …
Consider what might happen if you can no longer use your software. Whether the software publisher goes bankrupt, the latest version refuses to read older data, or you can’t afford a personal license for it after you graduate, the end result is the same. Losing access to your software can mean losing your data, especially if it is the only software that can read your data.
Example hand-out
Activity
Look at a folder of your research materials and answer the following questions.
• What software do you need to access these materials?
• Is there a risk of losing access to that software, now or later?
• Would a colleague be able to open and use your materials if you shared them?
• Can you submit your thesis/dissertation and its related research materials using file formats supported by your software?
Storage
Why Storing Copies in Multiple Locations Matters…
Viruses
File corruption
Physical disasters
Theft
Storage device malfunctions
Accidental erasure
Malicious deletion
Overwritten files
Lost password or key
Hacking
Bit rot
Back-up
A copy of your digital content, ideally stored in a different location from the original, usually made to prevent data loss.
Common storage options
• Laptop
• Desktop
• External hard drive (spinning or SSD)
• Flash drive
• The “cloud”
Storage recommendations
• Maintain at least one local (i.e., non-cloud-based) copy of your content.
• Maintain at least three separate complete copies of your research content.
• Maintain at least one of those copies in a different geographic location.
• Maintain a history of changes in at least one location (e.g., using a “Time Capsule”).
Preservation
The series of managed activities necessary to ensure continued access to digital materials for as long as necessary
- Digital Preservation Coalition
Managed activities and preservation
• Produce and maintain an inventory of all of your content, documenting file names, sizes, locations, types, and “checksums”.
• Create and regularly check “checksums” for your most important research files.
• Employ a tool like “Fixity” to scan specified folders or directories on a regular basis and report changes to you via email. https://github.com/avpreserve/fixity
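To make the inventory idea concrete, here is a minimal Python sketch that walks a folder and records each file's name, size, location, type, and checksum. It uses SHA-256 from the standard library; the choice of algorithm, the field names, and the function names are assumptions for illustration, and this is not how Fixity itself is implemented.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root):
    """Return one record per file found under root (searched recursively)."""
    root = Path(root)
    inventory = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            inventory.append({
                "name": path.name,
                "size_bytes": path.stat().st_size,
                "location": str(path.relative_to(root)),
                "type": path.suffix.lower(),
                "sha256": sha256_of(path),
            })
    return inventory


# Demonstration on a temporary folder containing one hypothetical file.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "notes.txt").write_text("hello", encoding="utf-8")
    for record in build_inventory(folder):
        print(record)
```

Saving the records to a spreadsheet and re-running the script later lets you confirm that each checksum is unchanged.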
More managed activities…
• Systematize your folder- and file-name conventions using human-identifiable names.
• Use naming conventions to mark versions of files (e.g., MusicofSocialChange-v12.csv).
• Make sure your file names are followed by the correct file extension (e.g., .txt, .csv).
• Avoid using special characters in all file and folder names (e.g., \?:*?<>{}[]&$,;.!).
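These naming rules are easy to check automatically. The Python sketch below flags file names that contain the special characters listed above, lack a file extension, or lack a version marker. Treating "-v" followed by digits as the version convention is an assumption taken from the slide's example, and the function name is hypothetical.

```python
import os
import re

# The special characters listed on the slide; the dot before the extension is allowed.
FORBIDDEN = set("\\?:*<>{}[]&$,;.!")
# Assumption: versions are marked as "-v" plus digits at the end of the name.
VERSION_PATTERN = re.compile(r"-v\d+$")


def filename_problems(filename):
    """Return a list of naming problems found in filename (empty if none)."""
    stem, extension = os.path.splitext(filename)
    problems = []
    bad = sorted(set(stem) & FORBIDDEN)
    if bad:
        problems.append("special characters in name: " + " ".join(bad))
    if not extension:
        problems.append("missing file extension")
    if not VERSION_PATTERN.search(stem):
        problems.append("no version marker such as -v12")
    return problems


for name in ("MusicofSocialChange-v12.csv", "draft?final"):
    print(name, filename_problems(name))
```

A check like this could be run over a whole project folder before depositing files.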
Resources
• Jesus Vigo, “World Backup Day: Best Practices to Backup Your Data,” TechRepublic http://www.techrepublic.com/article/world-backup-day-best-practices-to-backup-your-data/
• For general info on archiving and backing up content, see the Personal Digital Archiving resources. http://digitalpreservation.gov/personalarchiving/
Storage
Back-up: A copy of your digital content, ideally stored in a different location from the original, usually made to prevent data loss.
Preservation: The “series of managed activities necessary to ensure continued access to digital materials for as long as necessary”. –Digital Preservation Coalition
Where and how you choose to store your research materials and writings will determine how long they survive. To mitigate against loss, make your own back-ups on a regular, formalized schedule (e.g., daily or weekly).
Threats to storage environments:
• Natural disaster
• Human error
• Human malice
• Drive failure
• Format obsolescence
• Media obsolescence
• Bit rot
• Business failure
• Software or hardware error
Advanced recommendations:
1. Produce and maintain an inventory of all of your content, documenting file names, sizes, locations, and types
2. Create and regularly check “checksums” or digital signatures for your most important research files. Checksums can be generated by several open source tools and utilities and they can be stored in your inventory.
3. Monitor your content to ensure missing, moved, and renamed files are automatically brought to your attention. A tool like “Fixity” can scan specified folders or directories on a regular basis and report changes to you via email.
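The monitoring step can be pictured as comparing two snapshots of an inventory. In the simplified Python sketch below, each snapshot is a dictionary that maps a file's location to its checksum. This is an illustration of the idea only, not a description of how Fixity works; the function name, report categories, and sample values are assumptions.

```python
def compare_inventories(old, new):
    """Compare two {location: checksum} snapshots and report what changed."""
    report = {"changed": [], "moved": [], "missing": [], "added": []}

    # Index the new snapshot by checksum so moved or renamed files can be found.
    new_by_checksum = {}
    for location, checksum in new.items():
        new_by_checksum.setdefault(checksum, []).append(location)

    moved_targets = set()
    for location, checksum in old.items():
        if location in new:
            if new[location] != checksum:
                report["changed"].append(location)
            continue
        # Same content at a location that did not exist before: moved or renamed.
        candidates = [
            candidate
            for candidate in new_by_checksum.get(checksum, [])
            if candidate not in old and candidate not in moved_targets
        ]
        if candidates:
            report["moved"].append((location, candidates[0]))
            moved_targets.add(candidates[0])
        else:
            report["missing"].append(location)

    for location in new:
        if location not in old and location not in moved_targets:
            report["added"].append(location)
    return report


# Hypothetical snapshots; the short strings stand in for real checksums.
last_week = {"a.csv": "111", "b.txt": "222", "c.txt": "333"}
today = {"a.csv": "999", "notes/b.txt": "222", "d.txt": "444"}
print(compare_inventories(last_week, today))
```

A "changed" entry for a file you did not edit is the signal to restore that file from one of your other copies.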
Resources
1. For “back-up” advice, see Jesus Vigo, Best Practices to Backup Your Data
2. For more on cloud-based backups, please see Charles Beagrie Ltd., How Cloud Storage can address the need of public archives in the UK
3. For general information, see also Personal Digital Archiving
Basic recommendations:
1. Maintain at least one local (i.e., non-cloud-based) copy of your content
2. Maintain at least three separate complete copies of your research content
3. Maintain at least one copy in a different geographic location
4. Maintain a history of changes in at least one location (e.g., using a “Time Capsule” software package to automatically back up your content without deleting older copies)
5. Document in a text file how, when, and where you store and back up your materials
6. Systematize your folder- and file-name conventions using human-identifiable information
7. Use naming conventions to mark versions of files, e.g., using consecutive numbers to track a file through all edits and revisions that take place to it (e.g., filename-v12.txt)
8. Make sure your file names are followed by the correct file extension (e.g., .txt, .csv)
9. Avoid using special characters in all file and folder names (e.g., \?:*?<>{}[]&$,;.!)
10. Document the formats you are managing and the potential sustainability issues
11. Save a copy of your research files in nonproprietary formats, so that you don’t need a software license to render and use them.
Example hand-out
Activity
• Take one project you are working on now, and develop a spreadsheet-based inventory for the associated files indicating file names, sizes, types, and storage locations. (Use http://www.cdlib.org/services/dsc/contribute/docs/submission.inventory.rtf as a guide).
• Establish a regular routine for backing up your content in at least one additional location. Make sure the routine includes a regular schedule, a way of storing content organized by the date of a backup, and a way to maintain multiple backups simultaneously.
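The backup routine in this activity could be sketched in Python as follows: each run copies a project folder into a subfolder named for the backup date and keeps only the most recent few copies. The function name, the YYYYMMDD folder naming, and the default of three retained copies are assumptions for illustration. Scheduling the script is left to your operating system, and `shutil.copytree(..., dirs_exist_ok=True)` requires Python 3.8 or later.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def make_dated_backup(source, backup_root, today=None, keep=3):
    """Copy source into backup_root/YYYYMMDD and keep only the newest `keep` copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    today = today or date.today()
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    target = backup_root / today.strftime("%Y%m%d")
    shutil.copytree(source, target, dirs_exist_ok=True)

    # Folder names sort chronologically because they are written as YYYYMMDD.
    dated = sorted(
        folder for folder in backup_root.iterdir()
        if folder.is_dir() and folder.name.isdigit() and len(folder.name) == 8
    )
    for old in dated[:-keep]:
        shutil.rmtree(old)
    return target


# Demonstration with temporary folders and hypothetical dates.
with tempfile.TemporaryDirectory() as workspace:
    project = Path(workspace) / "project"
    project.mkdir()
    (project / "data.csv").write_text("a,b\n1,2\n", encoding="utf-8")
    backups = Path(workspace) / "backups"
    for day in (9, 10, 11):
        make_dated_backup(project, backups, today=date(2017, 8, day), keep=2)
    print(sorted(folder.name for folder in backups.iterdir()))
```

In practice the backup folder should live on a different device or in a different location from the project folder.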
Connecting Student and University Needs
Student needs
• Training opportunities
• Relevant examples
• University “seal of approval”
• Faith in institutional safety
• Guidelines and rewards
University needs
• Training materials
• Professional demonstrators
• Branding power
• Contributions to repository
• Standards and expectations
Please let us know what you think!
https://educopia.org/publications/etdplustoolkit