beecher cni fall 2010 v4
DESCRIPTION
This is a talk from the Coalition for Networked Information Fall 2010 Member Meeting (CNIfall2010). I talked about our project to use Fedora as archival storage for social science research data and documentation.
TRANSCRIPT
Preserving Social Science Research Data Using Fedora
Bryan Beecher
Inter-university Consortium for Political and Social Research (ICPSR)
CNI Fall 2010 Membership Meeting
ICPSR
• World’s largest social science research data archive
  – Lots of files (millions)
  – Small files (6 TB total)
• Long track record of success
  – 50 yrs
  – Trust us
  – Enormous legacy burden
ICPSR
• Survey data are our core
  – Low volume of new content compared to natural sciences
  – We curate each item extensively (disclosure, quality, format, usability)
• Strong access orientation
  – Talk like an archive
  – Walk like an archive?
Walking the walk
• Good storage container for content and its metadata
• OAIS-compliant
• Generate SIPs and AIPs (and DIPs)
• But…
What should we do?
Where to begin?
Focus areas
• Preservation
• Going forward
• Reusable
Do not try to include
• Access
• Everything we have
A Solution
• Fedora objects
  – Container for stuff we ingest and preserve
• Fedora services
  – To generate AIPs and SIPs
• Tool to generate FOs from existing content and metadata
Ingest
• The Motivated Depositor
  – Eager to describe the research data in great detail
  – Uploads complete, machine-readable metadata
Ingest (continued)
• The Unmotivated Depositor
  – Uploads a variety of proprietary file formats for documentation and data
  – Leaves the baby on the doorstep
Ingest – Nov 2010 deposits
Ingest (continued)
• Typical deposit
  – Research data in one of the common stat packages (SAS, SPSS, etc.)
  – Technical documentation in a proprietary format (Word, PDF)
  – A proto-SIP in quasi-OAIS terms
  – Minimal level of metadata regarding how the survey was conducted
Ingest container – file level
• Vanilla Fedora Object
  – Will never know what sort of content format to expect
  – Use the RELS-EXT to connect related files
Ingest container – deposit
• Another plain Fedora Object
  – Points to all of the files stored in the file-level objects
  – Relatively little metadata stored for this level of object
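The link between a file-level object and its deposit-level object can be sketched as a RELS-EXT datastream. This is a minimal illustration, not ICPSR's actual tooling: the PIDs are hypothetical, and the choice of `isPartOf` as the predicate is an assumption based on the relationship shown later in the talk's diagrams. The RDF and Fedora relations-external namespaces are the standard Fedora 3 ones.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(file_pid, deposit_pid):
    """Return RELS-EXT RDF/XML stating that file_pid isPartOf deposit_pid."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of the statement is the file-level object itself.
    desc = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{file_pid}"},
    )
    # The object of the statement is the deposit-level object.
    ET.SubElement(
        desc,
        f"{{{REL_NS}}}isPartOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{deposit_pid}"},
    )
    return ET.tostring(root, encoding="unicode")


# Hypothetical PIDs, for illustration only.
print(build_rels_ext("icpsr:deposit-1-file-1", "icpsr:deposit-1"))
```

Keeping the relationship in RELS-EXT means the file-level object needs no knowledge of the content format it carries; only the pointer to its parent is fixed.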
Ingest container – example
Ingest and the OAIS PDI
• Reference – unique Fedora PID
• Fixity – Fedora-generated checksum
• Provenance – identity of depositor recorded in the DC Datastream
• Context – original file name captured in the content Datastream
• Access Rights – terms of deposit
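The five PDI elements above can be pictured as one small record per ingested file. The sketch below is only an illustration: the field names, the sample values, and the choice of SHA-256 are assumptions, and the checksum is computed locally as a stand-in for the one Fedora generates.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PDI:
    reference: str      # unique Fedora PID
    fixity: str         # checksum of the content (stand-in for Fedora's)
    provenance: str     # identity of the depositor
    context: str        # original file name
    access_rights: str  # terms of deposit


def build_pdi(pid, content, depositor, original_name, terms):
    """Assemble the PDI record for one deposited file."""
    digest = hashlib.sha256(content).hexdigest()
    return PDI(
        reference=pid,
        fixity=digest,
        provenance=depositor,
        context=original_name,
        access_rights=terms,
    )


# Hypothetical values, for illustration only.
pdi = build_pdi(
    "icpsr:deposit-1-file-1",
    b"example bytes",
    "depositor@example.edu",
    "survey2010.sav",
    "standard deposit agreement",
)
print(pdi.reference, pdi.fixity[:12])
```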
Generating OAIS SIPs
• Original content
  – Normalized version too, if applicable
  – What’s normalization in this context?
• Preservation Description Information (PDI)
  – As described previously
• Delivered via SDef/SDep combo
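One way to picture "delivered via SDef/SDep" is as a dissemination request against the repository. The sketch below assumes the Fedora 3.x REST pattern `/objects/{pid}/methods/{sdefPid}/{method}`; the base URL, the SDef PID `icpsr:sdef-oais`, and the method name `getSIP` are hypothetical names invented for illustration, not the ones used in the project.

```python
from urllib.parse import quote


def dissemination_url(base_url, pid, sdef_pid, method):
    """Build the URL for a dissemination defined by an SDef and bound by an SDep."""
    # Keep the colon in PIDs readable; everything else is percent-encoded as needed.
    parts = [quote(part, safe=":") for part in (pid, sdef_pid, method)]
    return "{}/objects/{}/methods/{}/{}".format(base_url.rstrip("/"), *parts)


# Hypothetical repository location, SDef PID, and method name.
print(
    dissemination_url(
        "http://localhost:8080/fedora",
        "icpsr:release-28748-file-1",
        "icpsr:sdef-oais",
        "getSIP",
    )
)
```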
Ingest – continued
• Data
  – Disclosure analysis
  – Recoding
• Documentation
  – Corrections
  – Clarifications
• Normalized formats
Ingest – finale
• Packaged into a “study”
  – Data, doc, questionnaire, user guide, etc.
  – Normalized formats for preservation
  – Convenient formats for access
Ingest – finale
[Diagram: one top-level Fedora object and three file-level objects. Every object carries objectProperties, DC, RELS-EXT, and AUDIT in addition to the content datastreams listed.]
• Top-level object (labeled only “PID”)
  – REPORT (text/plain)
• icpsr:release-28748-file-1 (RELS-EXT isPartOf: release-15868)
  – STATA-DICT (text/plain)
  – DATA (text/plain)
  – DDI (text/xml)
  – SAS-SETUPS (text/plain)
  – SPSS-SETUPS (text/plain)
  – STATA-SETUPS (text/plain)
• icpsr:release-28748-file-2 (RELS-EXT isPartOf: release-15868)
  – CODEBOOK (application/pdf)
• icpsr:release-28748-file-3 (RELS-EXT isPartOf: release-15868)
  – QUESTIONNAIRE (application/pdf)
Generating OAIS AIPs
• For each object (file)
  – Everything from the SIP plus
    • Preservation events
    • Description of the transformation used
    • Preservation commitment
  – Its post-processed version
• Delivered via SDef/SDep combo
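The AIP contents listed above amount to the SIP plus a few additions. A minimal sketch of that assembly follows; the dictionary keys and sample values are hypothetical, chosen only to mirror the bullets, and do not reflect the project's actual packaging format.

```python
import copy


def build_aip(sip, events, transformation, commitment, processed_content):
    """Return an AIP: everything from the SIP plus the preservation additions."""
    aip = copy.deepcopy(sip)  # leave the SIP itself untouched
    aip["preservation_events"] = list(events)
    aip["transformation"] = transformation
    aip["preservation_commitment"] = commitment
    aip["processed_content"] = processed_content
    return aip


# Hypothetical SIP and additions, for illustration only.
sip = {"pid": "icpsr:deposit-1-file-1", "original_content": "survey2010.sav"}
aip = build_aip(
    sip,
    events=["ingested", "normalized"],
    transformation="SPSS system file converted to plain text with setup files",
    commitment="full preservation",
    processed_content="survey2010.txt",
)
print(sorted(aip))
```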
Example AIP
[Diagram: one top-level Fedora object, three file-level objects, and one further object. Every object carries objectProperties, DC, RELS-EXT, and AUDIT in addition to the content datastreams listed.]
• Top-level object (labeled only “PID”)
  – REPORT (text/plain)
• icpsr:release-28748-file-1 (RELS-EXT isPartOf: release-15868)
  – STATA-DICT (text/plain)
  – DATA (text/plain)
  – DDI (text/xml)
  – SAS-SETUPS (text/plain)
  – SPSS-SETUPS (text/plain)
  – STATA-SETUPS (text/plain)
• icpsr:release-28748-file-2 (RELS-EXT isPartOf: release-15868)
  – CODEBOOK (application/pdf)
• icpsr:release-28748-file-3 (RELS-EXT isPartOf: release-15868)
  – QUESTIONNAIRE (application/pdf)
• Additional object (labeled only “PID”) with no content datastream shown
Questions we faced
• Datastreams or relationships?
• What about our XML?
• AIPs or DIPs?
• How to build FOXML?
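Datastreams / relationships?
[Diagram: two ways to store related content]
• Relationships: two Fedora objects, one holding CONTENT X and one holding CONTENT Y, each with its own objectProperties, DC, RELS-EXT, and AUDIT
• Datastreams: one Fedora object holding both CONTENT X and CONTENT Y alongside a single objectProperties, DC, RELS-EXT, and AUDIT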
Our XML
• DDI v2
  – Contains lots of the information one might expect to find in the DC
• Strategy
  – Duplicate it
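The "duplicate it" strategy can be sketched as copying the overlapping fields out of the DDI record into Dublin Core. The sample below is a simplified assumption: the DDI fragment omits namespaces and most of the real structure, the values are invented, and the three-field mapping is illustrative rather than the mapping the project used.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free DDI v2 fragment with invented values.
SAMPLE_DDI = """<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>Example Survey, 2010</titl>
        <IDNo>00000</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty>Example, Pat</AuthEnty>
      </rspStmt>
    </citation>
  </stdyDscr>
</codeBook>"""

# Dublin Core field -> path to the matching DDI element (relative to codeBook).
DDI_TO_DC = {
    "title": "stdyDscr/citation/titlStmt/titl",
    "identifier": "stdyDscr/citation/titlStmt/IDNo",
    "creator": "stdyDscr/citation/rspStmt/AuthEnty",
}


def ddi_to_dc(ddi_xml):
    """Copy the fields DDI and DC have in common into a DC-style dict."""
    root = ET.fromstring(ddi_xml)
    dc = {}
    for field, path in DDI_TO_DC.items():
        value = root.findtext(path)
        if value is not None:
            dc[field] = value.strip()
    return dc


print(ddi_to_dc(SAMPLE_DDI))
```

Duplicating the values keeps the DC datastream usable by generic Fedora tools while the DDI stays the richer record.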
AIPs or DIPs
• Lots of copies
• Destination
  – Archival Storage remote location
  – Repository for ingest
Building FOXML
• Source
  – Database
  – DDI XML
• Re-usable tool
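A re-usable FOXML builder might look roughly like the sketch below, which turns one metadata record into a minimal FOXML 1.1 object with an inline DC datastream. The `record` dictionary stands in for a database row or parsed DDI and its keys and values are hypothetical; the element and attribute names follow the FOXML 1.1 schema, but the output is not guaranteed to be everything a given repository requires at ingest.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
MODEL = "info:fedora/fedora-system:def/model#"


def build_foxml(record):
    """Return a minimal FOXML 1.1 document for one metadata record."""
    for prefix, uri in (("foxml", FOXML_NS), ("oai_dc", OAI_DC_NS), ("dc", DC_NS)):
        ET.register_namespace(prefix, uri)

    def fox(tag):
        return f"{{{FOXML_NS}}}{tag}"

    obj = ET.Element(fox("digitalObject"), {"VERSION": "1.1", "PID": record["pid"]})

    # Object-level properties: state and label.
    props = ET.SubElement(obj, fox("objectProperties"))
    ET.SubElement(props, fox("property"), {"NAME": MODEL + "state", "VALUE": "Active"})
    ET.SubElement(props, fox("property"), {"NAME": MODEL + "label", "VALUE": record["title"]})

    # Inline XML (control group X) Dublin Core datastream.
    ds = ET.SubElement(obj, fox("datastream"), {"ID": "DC", "STATE": "A", "CONTROL_GROUP": "X"})
    version = ET.SubElement(
        ds,
        fox("datastreamVersion"),
        {"ID": "DC.0", "MIMETYPE": "text/xml", "LABEL": "Dublin Core"},
    )
    content = ET.SubElement(version, fox("xmlContent"))
    dc = ET.SubElement(content, f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(dc, f"{{{DC_NS}}}title").text = record["title"]
    ET.SubElement(dc, f"{{{DC_NS}}}creator").text = record["depositor"]
    ET.SubElement(dc, f"{{{DC_NS}}}identifier").text = record["pid"]
    return ET.tostring(obj, encoding="unicode")


# Hypothetical record, standing in for a database row or parsed DDI.
print(build_foxml({"pid": "icpsr:deposit-1", "title": "Example Survey, 2010", "depositor": "Example, Pat"}))
```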
Special Thanks
The Team
• Peggy Overcashier
• Nathan Adams
• Nancy McGovern
• Mary Vardigan
The Funder
• National Science Foundation Award 0958382
• INTEROP EAGER program