Crowdsourced Manuscript Transcription

Ben Brumfield

Roots and Routes 2012

Not just crowdsourcing...

• Collaborative work

• Off-site solo work

• Private work

Not just manuscripts...

• Maps

• Textiles

• Music

• Flawed OCR

Not just transcription...

• Indexing

• Editing

• Identification

Counting seals on Arctic ice caps.

What it isn't

We'll concentrate on web-based tools for extracting text from images, not addressing:

• Oral History

• Video

• Audio Transcription

• Image Manipulation

• Transcription/Facsimile Display

Tools do exist for these tasks, however.

Break

What materials are you working with outside of modern, printed books and websites?

Origins (Approaches)

Two Approaches and One Dead End

• Indexing

• Editing

• Tagging

Indexing

• Structured Data

• Extracts from Text vs. Representing Text

• Databases for Search and Analysis

• Granular Quality Control

• Gamification
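One way to picture indexing as structured data with granular quality control is the sketch below. It assumes a multi-keying scheme in which several volunteers enter the same field and a value is accepted only once enough of them agree; the slides do not say which scheme any particular tool uses. The names `FieldEntry` and `reconcile`, the `min_agreement` threshold, and the sample menu data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldEntry:
    """One volunteer's reading of one field on one page image (hypothetical schema)."""
    image_id: str
    field_name: str
    volunteer: str
    value: str


def reconcile(entries, min_agreement=2):
    """Accept a field value once at least `min_agreement` volunteers agree.

    Returns {(image_id, field_name): value or None}; None marks a field
    that still needs review. Values are compared case-insensitively.
    """
    votes = {}
    for entry in entries:
        key = (entry.image_id, entry.field_name)
        votes.setdefault(key, Counter())[entry.value.strip().lower()] += 1

    result = {}
    for key, counter in votes.items():
        value, count = counter.most_common(1)[0]
        result[key] = value if count >= min_agreement else None
    return result


# Illustrative data only: two volunteers keying the same menu image.
entries = [
    FieldEntry("menu_001.jpg", "dish", "alice", "Consomme"),
    FieldEntry("menu_001.jpg", "dish", "bob", "consomme"),
    FieldEntry("menu_001.jpg", "price", "alice", "0.25"),
    FieldEntry("menu_001.jpg", "price", "bob", "0.35"),
]
accepted = reconcile(entries)
```

Here the dish name is accepted because both volunteers agree, while the price stays unresolved. Checking quality field by field is practical because each entry is a small extract from the text, not a representation of the whole page.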

Editing

• Books, Diaries, Letters, Articles

• Representing Text

• Traditional Editorial Workflow

• Digital or Print Editions

Tagging

• Too small

• Too imprecise

Origins (Traditions)

• OCR Correction

• Documentary Editing

• Genealogy

• Natural Science

• Astronomy

Split this into 5 slides

Online Tools

• Recent (none older than 2005)

• Influenced by origin

• Still pretty raw

• Most require tech expertise for set-up and customization

• All require making trade-offs

Lab Session 1: Breadth

NYPL What's on the Menu

Indexing

Wikisource

Editing

Selection Factors

• Source Material

• Transcript Purpose

• Organizational/Project Management Fit

• Financial and Technical Resources

Source Material

Evaluating your source material:

• Is it of interest to anyone else?

• Is it under copyright?

• Does it need restricted access?

• Is it composed of documents or records?

• Is it non-textual?

• How complex is the layout? How important is that layout?

Purpose

How will you be using the transcribed data?

• Traditional print editions

• Searchable online editions

• Do you want to use the system to analyze the text?

• How do you want to analyze the text?

• Is public engagement a goal?

• Should the transcripts be open?

Organizational/Project Management Fit

• How important is traditional editorial workflow?

• Will you rely on volunteers? How will you motivate them?

• What is the duration of the project?

• Is there a "final version"?

• Is TEI a mandate?

Financial and Technical Resources

Do you have or need:

• System administrators to install non-hosted software?

• Money to pay hosting costs?

• Programming skills to customize a tool?

• Money to pay programmers for customization?

• Support for on-going costs to keep the site running, however small?

Lab Session 2: Markup Options

FromThePage

TranscribeBentham

Technical Questions to Answer

• Where are the images now?

• How do images get into the system?

• How do transcripts get out of the system?

• How mature is the underlying technology?

• How configurable is the technology?

• How does the system work with the public face of your project?

• Where does the metadata live?

• Who will maintain this? How long?

• How many sites are using this system?

Wikisource

Pro:

• MediaWiki plus its add-on modules (e.g. print-on-demand, export).

• Wikimedia community.

• Incredibly mature.

Con:

• Wikimedia policy.

• Public editing.

• Limited mark-up.

Bentham Transcription Desk

Pro:

• MediaWiki is very mature.

• TEI Toolbar (can also be used on other systems)

• Deployed outside original project.

Con:

• Development efforts halted.
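To give a feel for what TEI mark-up in a transcript makes possible, here is a minimal sketch that derives a clean reading text from a line tagged with the TEI elements `del`, `add`, and `unclear`. It is an assumption-laden illustration, not the Transcription Desk's own code: the sample line is invented, the TEI namespace is omitted for brevity, and `reading_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample line; real TEI documents would also declare the TEI namespace.
line = (
    "<p>The <del>greatest</del> <add>utmost</add> happiness of the "
    "<unclear>greatest</unclear> number</p>"
)


def reading_text(xml_fragment):
    """Return the text of a fragment with deleted passages left out."""
    root = ET.fromstring(xml_fragment)
    parts = []

    def walk(node):
        # Skip the content of <del>, but keep whatever follows it (its tail).
        if node.tag != "del":
            if node.text:
                parts.append(node.text)
            for child in node:
                walk(child)
        if node.tail:
            parts.append(node.tail)

    walk(root)
    # Collapse the extra whitespace left behind by removed elements.
    return " ".join("".join(parts).split())


clean = reading_text(line)
```

The same encoded line could just as easily feed a diplomatic view that shows the deletion struck through, which is the kind of reuse that richer mark-up is meant to allow.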

Scripto

Pro:

• Team at CHNM has a great track record.

• Your CMS is your public face.

• MediaWiki is very mature.

• Deployed and under active development.

Con:

• Your CMS handles all metadata.

• Mark-up is extremely limited.

FromThePage

Pro:

• Designed for intensive editing and indexing.

• Semantic mark-up and analysis.

• Hosting available.

Con:

• Single developer (me).

• No TEI mark-up.
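Semantic mark-up used for indexing can be sketched as follows, assuming a wiki-style double-bracket link syntax of the form `[[Canonical Subject|text as written]]`. The exact syntax, the helper names `build_index` and `plain_text`, and the sample diary pages are assumptions made for illustration, not a description of the tool's internals.

```python
import re
from collections import defaultdict

# Matches [[Subject]] or [[Subject|verbatim text]] (assumed link syntax).
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def build_index(pages):
    """Map each canonical subject to the sorted list of pages that mention it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for match in LINK.finditer(text):
            index[match.group(1).strip()].add(page_id)
    return {subject: sorted(ids) for subject, ids in index.items()}


def plain_text(text):
    """Strip the link mark-up, keeping the words as they appear in the source."""
    return LINK.sub(lambda m: (m.group(2) or m.group(1)).strip(), text)


# Invented sample pages.
pages = {
    "page-1": "Went to town with [[Josie Carr|Josie]] to sell [[tobacco]].",
    "page-2": "[[Josie Carr|Mrs. Carr]] came to dinner.",
}
index = build_index(pages)
```

Because "Josie" and "Mrs. Carr" both link to the same canonical subject, the index groups them together while `plain_text` still returns the wording of the original page.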

Islandora TEI Editor

Caveat: I don't know much about this tool or this team.

• Based on Drupal and Fedora

• Supports TEI via friendly interface

• Many Drupal-based projects considering it.

T-PEN

Caveat: I don't know much about this tool.

• Designed for medieval manuscripts.

• Supports TEI natively.

• Line-by-line interface.

• Hosted version available.

Scribe

Pro:

• Excellent for complex layout or non-documentary transcription.

• Zooniverse team is large, well-funded, experienced.

• Configurable.

Con:

• No automated tool for loading images or viewing transcript database (yet!)

• No concept of image-as-a-text.

Pybossa

Caveat: I don't know much about this tool or this team.

• Open Knowledge Foundation's crowdsourcing task management tool.

• Designed for tabular data.

• Google Spreadsheet data entry.

• Extremely young.

TextLab

Caveat: I don't know much about this tool or this team.

• Melville Electronic Library.

• Direct addition of TEI tags to image.

Lab Session 3: Configuration

Scribe

• Old Weather

• What's the Score

• Development deployments

Find me

Ben Brumfield

[email protected]

http://manuscripttranscription.blogspot.com/

@benwbrum