HARVARD E-JOURNAL ARCHIVING STUDY Dale Flecker June, 2002

TRANSCRIPT

Page 1: HARVARD E-JOURNAL ARCHIVING STUDY Dale Flecker June, 2002

HARVARD E-JOURNAL ARCHIVING STUDY

Dale Flecker
June, 2002

Page 2

JOURNAL ARCHIVING IN THE PAPER ERA

• Large-scale redundancy
• Access copy and archival copy usually the same
• Not just storage, but preservation
  – includes environmental control, library binding, repair, reformatting...

• Deliberate, long-term archiving largely the role of national and research libraries

Page 3

E-JOURNAL MODEL IS DIFFERENT

• “Copies” are remote, held in publisher systems
  – Not replicated across different institutions

• Perpetual license provides limited comfort in the absence of independent copies

• Long-term preservation involves very different issues than day-to-day access

Page 4

E-JOURNAL ARCHIVING A GROWING PROBLEM

• Libraries bearing double costs
  – the e-journals users prefer
  – the paper for preservation
• Publishers cannot convert totally to digital
  – authors and editors distrust e-only journals because of concerns about persistence
  – libraries demand paper for preservation

• Libraries preserving paper version, but electronic more complete, increasingly the copy of record

Page 5

MELLON E-JOURNAL ARCHIVING PROGRAM

• 13 institutions invited to submit proposals for planning projects
• Two approaches
  – Large-scale distributed replication (LOCKSS)
  – Centralized archives serving a wider community

Page 6

CENTRAL ARCHIVES PLANNING PROJECTS

• Publisher-based
  – Harvard (Wiley, Blackwell, University of Chicago Press)
  – Penn (Oxford and Cambridge University Presses)
  – Yale (Elsevier)
• Discipline-based
  – Cornell (agriculture)
  – NYPL (performing arts)
• Dynamic e-journals
  – MIT

Page 7

FOUR BASIC ASSUMPTIONS

• Archive should be independent of publishers
  – responsibility of institutions for whom archiving is a core mission
• Archiving requires active publisher partnership
• Address long timeframes (100 years?)
• Archive design based on Open Archival Information System (OAIS) model

Page 8

CENTRAL ARCHIVE MODEL

• Archive negotiates relationship with publisher
• Publisher deposits content regularly
• Content accompanied by metadata to support discovery and preservation
• Archived content only accessible under specific conditions
• Archive assumes responsibility for long-term preservation

Page 9

SOME INTERESTING QUESTIONS

• What is archived?
• In what format?
• When is archive accessible?
• Who can access archived content?
• What does the archive “preserve”?
• Who does archiving?
• How is the archive paid for?
• How is the archive governed?

Page 10

WHAT CONTENT IS ARCHIVED?

E-journals not simply articles….

Page 11

SOME COMMON STUFF

• Journal description
• Editorial board
• Instructions to authors
• Rights and usage terms
• Copyright statement
• Ordering information
• Reprint information
• Indexes
• Career information
• News
• Events lists
• Discussion fora
• Editorials
• Errata
• Reviewers
• Conference announcements

Page 12

HARD AREAS

• Masthead, “front matter” stored as web pages, not in content management systems
• No control over the format of “associated materials” (datasets, images, tables, etc.)
• Advertising very complex
  – dynamic, frequently from third party, can involve country-specific complexities
• Links frequently separate from articles
  – regularly updated, sometimes dynamic

Page 13

OUR INCLINATION

• Exclude little except advertisements
  – based on discussions with librarians and scholars
  – different from most “local loading”
• Articles include supplementary materials
• Include an “issue object” in addition to the article components
  – masthead, news, jobs, meetings, etc.

Page 14

Format for archived articles?

Page 15

PDF?

• PDF almost universally available from publishers
  – and the only format available for some journals
• There are qualms...
  – proprietary
  – marked-up for display, not meaning
  – supports limited functionality
  – long-term “preservability” unclear
  – unlikely to remain the universal format over time

Page 16

MARKED-UP TEXT?

• SGML/XML increasingly common
  – and likely to become more so
• Greater functionality, easier migration as technology changes
• Complex
  – DTDs vary widely from publisher to publisher
  – DTDs far from stable
  – archive documentation and rendering would be complex

Page 17

“INTERCHANGE” ARTICLE DTD

• Intended for exchanging content between independent players

• Reduces complexity of interaction
  – archive needs to document, migrate, and display only one format
    • archive can choose whether to maintain articles in interchange DTD, or transform at ingest for long-term storage
  – publisher needs to deposit only one format for all archives

Page 18

“INTERCHANGE” ARTICLE DTD

• Mellon, Harvard, National Library of Medicine, 2 consultants (Inera, Mulberry) working on draft standard DTD

• Design based on current publisher practice
  – must be easy for publishers to produce
  – homogenizes many elements
  – leaves options in some difficult areas
  – eliminates elements specific to individual publisher delivery systems

Page 19

INTERCHANGE DTD ISSUES

• How low is the common denominator?
• What gets lost?
  – inevitably sacrifices some functionality and original appearance
• Transformation from publisher’s “native” DTD involves risks
• Some technically difficult areas
  – extended character sets, mathematical and chemical formulae, tables, “generated text”

Page 20

SGML/XML QUALITY CONTROL PROBLEM

• SGML/XML is an output rather than the input for many publishers today
  – may not fully reflect the output (PDF, print) that users see day-to-day… how do you know it is good?
• If SGML/XML is transformed for deposit, errors can be introduced
• Quality control of ingested content is expensive but critical for a sound archive

Page 21

ARCHIVE MORE THAN ONE FORMAT?

• Publisher-based archive must accept PDF in any case (only format available for some titles)
  – so include both SGML and PDF when available?
    • belt and suspenders
• Accept publisher’s original SGML also?
  – preserve information lost in conversion to interchange DTD
  – maintenance over time problematic

Page 22

WHEN IS ARCHIVE ACCESSIBLE?

• Most publishers instinctively prefer “dark” archives
  – does not compete with publisher’s service
• If “dark”, what “trigger events” make it accessible?
  – after a given period of time (“moving wall”)?
  – when content is not otherwise accessible (“failsafe”)?
  – only when content enters the public domain?
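The three candidate trigger events can be read as independent switches on a dark archive. The following is only a minimal sketch under that reading; the class names, field names, and the five-year wall are hypothetical illustrations, not anything specified in the study.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArchivedIssue:
    # Hypothetical record; the slides do not define these fields.
    published: date
    available_from_publisher: bool
    in_public_domain: bool


@dataclass
class TriggerPolicy:
    # Each trigger can be enabled on its own; None/False disables it.
    moving_wall_years: Optional[int] = None
    failsafe: bool = False
    public_domain: bool = False


def is_accessible(issue: ArchivedIssue, policy: TriggerPolicy, today: date) -> bool:
    """Return True if any enabled trigger event has occurred for this issue."""
    if policy.moving_wall_years is not None:
        age_years = (today - issue.published).days / 365.25
        if age_years >= policy.moving_wall_years:
            return True  # "moving wall": enough time has passed
    if policy.failsafe and not issue.available_from_publisher:
        return True  # "failsafe": content is not otherwise accessible
    if policy.public_domain and issue.in_public_domain:
        return True  # content has entered the public domain
    return False


issue = ArchivedIssue(date(1995, 1, 1), available_from_publisher=True, in_public_domain=False)
wall_result = is_accessible(issue, TriggerPolicy(moving_wall_years=5), date(2002, 6, 1))
failsafe_result = is_accessible(issue, TriggerPolicy(failsafe=True), date(2002, 6, 1))
```

With an illustrative five-year wall the 1995 issue is open in June 2002, while under a failsafe-only policy it stays dark for as long as the publisher still serves it.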

Page 23

IS “DARK” DANGEROUS?

If content is dark, how do you know it is still good?

(real users are the best auditors)
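Without real users, one partial substitute is a periodic fixity audit: recompute a checksum for each stored object and compare it with the digest recorded at ingest. The slides do not describe such a mechanism, so this is an assumption-laden sketch with hypothetical names. It also only detects changed bits, not whether the content is still renderable.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of an object's bytes."""
    return hashlib.sha256(data).hexdigest()


def audit(stored: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return IDs of objects that are missing or no longer match the
    digest recorded at ingest (hypothetical in-memory stand-in for storage)."""
    damaged = []
    for object_id, expected in manifest.items():
        data = stored.get(object_id)
        if data is None or sha256_hex(data) != expected:
            damaged.append(object_id)
    return damaged


# Record digests at ingest, then simulate silent corruption of one object.
stored = {"article-1": b"<article>one</article>", "article-2": b"<article>two</article>"}
manifest = {object_id: sha256_hex(data) for object_id, data in stored.items()}
stored["article-2"] = b"<article>tw0</article>"
damaged = audit(stored, manifest)
```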

Page 24

WHO CAN ACCESS ARCHIVE CONTENT?

• Just other subscribing institutions?
  – does the archive need to maintain complex records of license rights?
    • defining licensees a nightmare
    • tracking license changes over time another nightmare
• Individual subscribers?
  – an even greater nightmare
• Everybody?
  – dramatically easier to administer

Page 25

WHAT DOES THE ARCHIVE PRESERVE?

• Preservation is a format-by-format issue
  – and most e-journals are composed of many formats
• How much “look and feel” preserved?
• Just preserve the “core intellectual content”?
• Does archive ensure content remains “render-able” as technology changes?

Page 26

HARVARD’S DIGITAL REPOSITORY

• Repository specifies preferred (“normative”) formats, which will be kept useable

• Just maintain bits for others
  – for e-journals this is likely for many “associated materials” (datasets, models, etc.)
    • generally accepted in ANY format
    • maintaining the viability of such wildly heterogeneous materials unrealistic
  – keep unaltered for future “digital archeology”
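The two-tier commitment can be sketched as a lookup at ingest: objects in a normative format get the "kept useable" commitment, everything else is accepted but only its bits are maintained. The format list below is a hypothetical placeholder, since the slides do not say which formats the repository treats as normative.

```python
# Hypothetical placeholder set; the slides do not name the normative formats.
NORMATIVE_FORMATS = {"xml", "tiff"}


def preservation_level(file_format: str) -> str:
    """Classify an ingested object by the preservation commitment it receives."""
    if file_format.strip().lower() in NORMATIVE_FORMATS:
        return "keep useable"  # repository migrates/renders as technology changes
    return "bits only"  # stored unaltered for future "digital archeology"


levels = {fmt: preservation_level(fmt) for fmt in ["XML", "sas-dataset", "tiff"]}
```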

Page 27

WHO DOES ARCHIVING?

• “Common good” activity
  – model based on a few archives serving many subscribers
• Is this an appropriate role for individual universities?
  – research libraries have technical capability, relationships with publishers and subscribers
  – BUT how archiving would be paid for is central…

Page 28

HOW IS THE ARCHIVE PAID FOR?

• First question: who benefits?
  – publishers, libraries, authors, scholarly societies…
  – is there a way to share costs?
• Cost categories include
  – preparation of “archivable” objects
  – ingestion and quality control
  – long-term storage
  – preservation

Page 29

PROPOSED MODEL

• Publisher assumes cost of preparing objects in standard format (whenever possible)

• Deposited material accompanied by two-part fee from publisher
  – ingest fee to cover up-front costs
    • varies with publisher effort to create easily archived objects???
  – “dowry” to create maintenance endowment

• Real funding sources include subscribers, authors, societies
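One way to reason about the size of the "dowry" is the standard perpetuity formula: an endowment E earning a real annual return r funds an annual cost C indefinitely when E = C / r. The study gives no figures and does not say it sizes the dowry this way, so the numbers and function names below are purely illustrative assumptions.

```python
def dowry_for_perpetual_maintenance(annual_cost: float, real_return: float) -> float:
    """Endowment whose real return covers the annual maintenance cost
    (perpetuity formula E = C / r; an assumption, not the study's method)."""
    if real_return <= 0:
        raise ValueError("real_return must be positive")
    return annual_cost / real_return


def deposit_fee(ingest_fee: float, annual_cost: float, real_return: float) -> float:
    """Two-part fee: up-front ingest fee plus the endowment 'dowry'."""
    return ingest_fee + dowry_for_perpetual_maintenance(annual_cost, real_return)


# Illustrative numbers only: 100/year maintenance, 4% real return, 300 ingest fee.
dowry = dowry_for_perpetual_maintenance(100.0, 0.04)
total = deposit_fee(300.0, 100.0, 0.04)
```

Under these made-up inputs the dowry comes to about 2,500 and the total deposit fee to about 2,800; the result is very sensitive to the assumed return.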

Page 30

HOW IS THE ARCHIVE GOVERNED?

• Publishers hand their intellectual property to an independent party; do they have a continuing say?

• Are there other stakeholders who should also have a say?

Page 31

HARVARD’S MODEL ARCHIVE

• Accept content for all titles a publisher produces
  – archive as many journal elements as possible

• Maintain an archive serving the entire community

• Store and maintain more robust formats (e.g., XML) when possible

• Collect metadata to support administration and preservation

Page 32

HARVARD’S MODEL ARCHIVE

• Requires only a few archival copies of any given journal

• Archive assumes responsibility for preservation migration when canonical versions deposited

• Organizational and economic model difficult

Page 33

NEXT?

Over to Kevin….