From Edwardians Online to Qualidata Online: Preparing data for online access Libby Bishop, ESDS Qualidata Economic and Social Data Service, UK Data Archive Online Access to Qualitative Data: Opportunities and Challenges Friday 5 December, 2003 Royal Statistical Society


From Edwardians Online to Qualidata Online:

Preparing data for online access

Libby Bishop, ESDS Qualidata

Economic and Social Data Service, UK Data Archive

Online Access to Qualitative Data: Opportunities and Challenges

Friday 5 December, 2003

Royal Statistical Society


Towards a standard format for qualitative data resources

• data needs to be preserved in a uniform resource format

• easier for provider (maintenance, tools, interchange)

• easier for user (consistency across data sets)

• DDI provides an XML framework for survey content (variables), but there is currently no suitable standard format for the content of qualitative data

• need a comprehensive application that will enable:

• data interchange

• sophisticated on-line searching

• retrieval from encoded texts

The Edwardians Online Pilot


Edwardians Online Development work

• A six month investigative project towards developing such a framework in a specific resource creation project:

– Data

– Models using XML standards technologies

– New functionality

– Data coding methods


Basic search and retrieve functionality

• developed online querying function based on annotation of texts and themes in XML

• keyword search interview summaries database

• keyword search interview full transcripts

• search or browse by themes from list - retrieve extracts of text in particular documents coded by that theme

• jump from extract to view in full document

• filter searches on subsets of interviewees, e.g. age, gender


Family Life and Work Experience Before 1918 project

• Life history interviews - classic sociological study of Edwardian Society by Professor Paul Thompson

• One of our larger datasets: conducted in early seventies - nearly 500 interviews completed

• Of value because of scope and diversity: cross-national sample of people born in Britain before 1918

• Broadly representative of qualitative interview data

• Data exists in various formats in various locations: originally recorded on audio tapes; transcribed as typed paper documents; includes supporting source materials (essays, letters)

• Texts coded in thematic analysis of content

• Paper source has proved very popular for reuse


• interviews up to 100,000 words

• 8hrs of audio tape

• secondary source: transcription of the dialogue, with possible errors in interpretation

• no time indexes between sound & content

• loosely structured

• alternate speakers

Example of Interview Text


1. Household
2. Domestic routine
3. Meals
4. Influence and discipline
5. Recreation in the home
6. Recreation outside the home
7. Weekend activities and religion
8. Politics
9. Parents' interests
10. Children's leisure
11. Community and social class

• Texts coded into broad themes in family life and work

• Once coded, extracts were cut and pasted into a separate filing system

• Coding systems vary in complexity

• Text coded by theme to assist research:

• management of dataset

• more rigorous interpretation of text

12. School
13. Work, except domestic service
14. Life after leaving school
15. Marriage
16. Childbirth - including sexual knowledge
18. Domestic service
19. Institutions and boarding schools
20. * Occupational history

Thematic Coding


Example of Thematic Coding

• Thematic sections of variable length

• May be overlapping


Why Preserve Thematic Coding?

• Preserving codes preserves record of primary interpretation of dataset, promotes openness in research

– replication; confirmation; re-interpretation.

• Useful as retrieval aids for voluminous bodies of text?

– Original ‘cut-and-paste’ thematic segments proved important and popular finding aid for paper collections.

– User familiarity – CAQDAS information retrieval and management

• Some limitations:

– Codes vary with content and with the individual coder’s interpretation, so quality is variable

– Coding is not a complete representation of thematic content: for example, not coded for migration or health


From Edwardians Online to Qualidata Online

• Expand number of accessible datasets

• Expand online functionality

– Ability to search across multiple datasets

– Ability to filter on basic demographics (age, gender, residence, occupation)

– Ability to combine keyword search and filter

• Standardise and automate transcript processing tools and procedures
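The combined keyword search and demographic filter listed above amounts to applying both conditions to the same set of interviews. A minimal Python sketch of that idea follows. The `Interview` record, its field names and the sample values are invented for illustration and are not taken from the Qualidata system.

```python
from dataclasses import dataclass


@dataclass
class Interview:
    # Hypothetical record layout: one row per interview.
    doc_id: str
    gender: str
    birth_year: int
    text: str


def search(interviews, keyword, gender=None, born_before=None):
    """Return doc IDs whose text contains keyword and that pass every filter given."""
    hits = []
    for item in interviews:
        if keyword.lower() not in item.text.lower():
            continue  # keyword test (case-insensitive substring match)
        if gender is not None and item.gender != gender:
            continue  # demographic filter: gender
        if born_before is not None and item.birth_year >= born_before:
            continue  # demographic filter: year of birth
        hits.append(item.doc_id)
    return hits


# Invented sample records, for illustration only.
SAMPLE = [
    Interview("A1", "female", 1892, "Father was a farm worker."),
    Interview("A2", "male", 1901, "I worked on the farm from fourteen."),
    Interview("A3", "female", 1905, "We lived near the school."),
]
```

Calling `search(SAMPLE, "farm", gender="female")` narrows the keyword hits to the interviews that also satisfy the filter. Searching across multiple datasets would simply pass a longer list of records.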


Material from additional collections

• Mothers and Daughters by Mildred Blaxter (in Scots)

• 100 Families by Paul Thompson (without speaker tags)

• Key processing steps:

– Scan

– OCR

– Proof

– Format

– XML


Preparing data

• Prepare digital files in appropriate format

– OCR and manual tidy up

– Macros to prepare text for mark-up

• Assign line IDs; remove unicode

– Excel sheet to add speaker IDs (turn takers)

– Database to tag (code) lines by theme

– Scripts to transform docs to XML

• Scripts to process web retrievals

– VB Script to process retrieval request using XLink and XPointer
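The "assign line IDs; remove unicode" step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual macros: the function names, the replacement table and the choice to drop any character that is still non-ASCII after replacement are all hypothetical.

```python
import unicodedata

# Hypothetical replacement table: typographic characters that OCR and word
# processors commonly introduce, mapped to plain ASCII equivalents.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
    "\u00bd": "1/2",                # one-half fraction
}


def to_ascii(text):
    """Replace known typographic characters, then drop anything still non-ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # NFKD splits an accented letter into base letter plus combining mark,
    # so encoding with errors="ignore" keeps the base letter.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def prepare_lines(raw_text):
    """Return (line_id, text) pairs for every non-empty line, numbered from 1."""
    lines = [to_ascii(line).strip() for line in raw_text.splitlines()]
    lines = [line for line in lines if line]
    return list(enumerate(lines, start=1))
```

Characters such as "½" are handled in the table before normalisation because a blind conversion would otherwise silently alter the number.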


Getting from .tif…


To basic XML

<u id="96" who="subject">I would rather nae ken if I had cancer. I told my man that, I says "If I have cancer, don't tell me". I mean you might hae an idea yourself, but I wouldnae like to be telt. I told him that.</u>
<u id="97" who="interviewer">And how has your own health been over the years?</u>
<u id="98" who="subject">Och, up an' doon, y'ken .</u>
<u id="99" who="interviewer">Any serious illness?</u>
<u id="100" who="subject">No ... nae illnesses .. nae illness, ken, in that wey. Just once I took an afa' turn at [missing words] I couldnae get ... I wis aye sleepin' this tablets I got fae the doctor and I had to sign for this tablets .. I just couldnae keep awake.</u>
<u id="101" who="interviewer">And did he say what it was?</u>
<u id="102" who="subject">I canna mind now, it's that long ago.. But I was really bad at that time, otherwise now .. apart fae this broken arm and operations kidneys .. I had an operation for a cyst.</u>
<u id="103" who="interviewer">Uh-huh.. was it ... epileptic, was she?</u>
<u id="104" who="subject">An’ this shoulder.. I couldnae move it..</u>
<u id="105" who="interviewer">Uh-huh .. a joint? Seized up ... and how long ago was that?</u>
<u id="106" who="subject">Well, it'll be.. ten year this 8th June. It was the same day as Robert Kennedy was killed. that's how I ken. I was goin' into hospital in the mornin’ an’ I mind tellin’ the patients [missing words] "What a shame, Robert Kennedy's been shot, an' killed" [missing words]


Word document created from OCR


Issues in scanning and OCR

• Scanning done at 300 dpi, grey scale

• OCR quality varies hugely with the quality of the original; special challenges include (but are not limited to):

– Character recognition

– Stray marks on page

– Missing words

– Interviewer’s notes

– “Creative” character interpretation: section breaks, font changes, footnotes, super- and sub-scripts, and so on.

• Partially automated with macros, but much judgement (clerical and research) still required


OCR and manual tidy up

• Work required to digitise older typefaces is not to be underestimated

• Average of 12 hours clerical labour to prepare a 70 page document

• Apply macros in UltraEdit or Excel to remove page numbers, speaker line breaks, etc.

• 040

• Mrs Florrie D., Wootton. Father, farm worker. B. 1892.

• Your name is Mrs Florence is it?

• Florrie - yes.

• And you live at 13?

• Castle Road. Wooton.

• And you're a widow?

• Yes.

• And the year of your marriage was 1911?

• Yes.

• And the year of birth 28th July 1892?

• Yes.

• And that was at Whichford?


Final Word file (human and Excel readable)


Notes for transcription proofing

Main aim is to check that the speech flows and reads properly and that there are no missing sections of text. In addition:

1. Each individual stream of speech should be a continuous line followed by a carriage return.

2. Obvious spelling mistakes to be corrected, with the following exceptions:

3. Peculiar or unrecognized spellings that should be left as they are include proper names, place names and obvious cases where the original transcriber was trying to indicate the phonetics of the speech. – E.g.: “never mixed with a lot of ‘em”.

4. The spelling of proper names, such as place names and personal names, should be consistent.

5. Poor grammar is typically part of the interview content, so leave it uncorrected.

6. Page numbers.


Transcription editing guidelines

• Basic editing (spelling, punctuation)

• Research editing

– Interviewer’s annotations

– Text supplemental to transcript itself

• Editing to conform with Excel and XML

– MUST have tabs between speaker tags and utterance, BUT no other tabs…

– Handling special characters (“10 ½” or “ten and a half” or “10.5”?)

– Replace double spacing with paragraph formatting to create extra space at end of paragraph

See handout: Qualidata Transcription Editing Guidelines
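The rule of exactly one tab per line, between the speaker tag and the utterance, lends itself to a mechanical check before a file is passed to Excel. The Python sketch below is hypothetical (the function name and the messages are invented) and assumes one utterance per line of plain text.

```python
def check_tabs(lines):
    """Return (line_number, message) for each line that breaks the one-tab rule."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # blank lines are ignored in this sketch
        tabs = line.count("\t")
        if tabs == 0:
            problems.append((number, "no tab between speaker tag and utterance"))
        elif tabs > 1:
            problems.append((number, "extra tab inside line"))
        else:
            speaker, utterance = line.split("\t")
            if not speaker.strip() or not utterance.strip():
                problems.append((number, "empty speaker tag or utterance"))
    return problems
```

An empty result means every non-blank line would split cleanly into two spreadsheet columns.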


Transformation of transcripts to XML

• Export key fields to Excel sheet

– Doc ID, Line ID

• Text is marked up at the utterance level

• Excel macros create marked up transcripts from tab delimited file

– SN30.xml

– SN31.xml
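The step from a tab-delimited file to utterance-level mark-up was done with Excel macros. The same transformation can be sketched in Python for illustration. The column order (line ID, speaker, text) and the root element name `transcript` are assumptions. The `<u id="…" who="…">` form follows the basic XML sample shown earlier in the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(tsv_text, doc_id):
    """Build a transcript element holding one <u> per row of tab-delimited text."""
    # Root element name is hypothetical; the slides do not show one.
    root = ET.Element("transcript", {"id": doc_id})
    # QUOTE_NONE keeps quotation marks in the speech as literal characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        line_id, who, text = row
        u = ET.SubElement(root, "u", {"id": line_id, "who": who})
        u.text = text  # ElementTree escapes &, < and > when serialising
    return root
```

`ET.tostring(rows_to_xml(tsv_text, "SN30"), encoding="unicode")` then gives the marked-up transcript as a string, ready to be written to a file such as SN30.xml.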


Using Excel macros to create XML transcript


New tags for searching on demographic variables


Handling unique features

• None of the Edwardians or 100 Families transcripts have speaker tags

• Need some way of indicating who is speaking when search results are returned

• Turn takers (usually transcribed)

– Logic test to assign interviewer/subject based on end of line character
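The slides do not spell out the logic test. One plausible reading, offered here only as an assumption, is that a line ending in a question mark is attributed to the interviewer and any other line to the subject. The Python sketch below applies that rule and flags any place where the same speaker appears twice in a row, since those are the lines most likely to need a manual turn-taking check.

```python
def assign_speakers(lines):
    """Label each line by its final character and flag breaks in turn-taking.

    Assumed rule (not confirmed by the source): a line ending in "?" is the
    interviewer; anything else is the subject.
    """
    labelled = []
    previous = None
    for text in lines:
        stripped = text.rstrip()
        who = "interviewer" if stripped.endswith("?") else "subject"
        needs_check = who == previous  # same speaker twice running
        labelled.append((who, needs_check, stripped))
        previous = who
    return labelled
```

A rule this simple will mislabel rhetorical questions from the subject and statements from the interviewer, which is why the flag is returned alongside the label rather than acted on automatically.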


Check turn-taking with no speaker tags


Screenshots for quali online


Thematic coding: stand-off architecture in XML

• Challenges for developing an XML application included the multiple hierarchies in the transcript texts and overlapping fields or elements:

dialogue structure v thematic content

• Conventional mark-up of these structures in a single document violates nesting rules of XML

• Solution - ‘stand-off annotation’ approach whereby data and coding are stored in different documents (annotation linked by XLink and XPointer)

• Proven utility as method for annotating multi-coded dialogue corpora:

– allows for multiple coding schemes

– accommodates overlapping elements

– easily extendable
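The nesting problem and the stand-off alternative can both be shown in a short Python sketch. The inline sample and the theme names are invented for illustration, and the annotation tuples stand in for separate annotation documents. The first part shows that two overlapping themes marked up inline are rejected by an XML parser. The second keeps the themes as ranges of utterance IDs held apart from the text, where overlap is no obstacle.

```python
import xml.etree.ElementTree as ET

# Inline mark-up of two overlapping themes: <household> closes while <work>
# is still open, which breaks XML's nesting rule.
INLINE = (
    "<doc><household><u id='1'/><work><u id='2'/></household>"
    "<u id='3'/></work></doc>"
)


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Stand-off form: (theme, first utterance id, last utterance id), stored
# separately from the transcript itself.
ANNOTATIONS = [("household", 1, 2), ("work", 2, 3)]


def covered(theme, annotations):
    """Return the set of utterance ids coded with the given theme."""
    ids = set()
    for name, first, last in annotations:
        if name == theme:
            ids.update(range(first, last + 1))
    return ids


def coded_with_both(theme_a, theme_b, annotations):
    """Return utterance ids that carry both themes (a Boolean AND)."""
    return sorted(covered(theme_a, annotations) & covered(theme_b, annotations))
```

Adding a further coding scheme means adding more annotation records. The transcript is never edited.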


Base-line text unit: utterances (<U>)

Theme: politics

Theme: household

Theme: work

<U> attributes:

• id

• speaker …

• start time (audio file)

• end time (audio file)

Example of ‘Stand-off’ XML Architecture


Applying thematic coding

• Annotator tool in MS Access using append query (appends lines to table)

• For each transcript, save as section table

• Add to ‘total section’ table (cum. Line ID)

• Export table as tab delimited file

• Perl scripts create files with pointers to relevant parts of text for EACH transcript

– householdSN30.xml

– childbirthSN30.xml

– householdSN31.xml

– childbirthSN31.xml
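The project used Perl for this last step. The following Python sketch shows the same idea for illustration only. It assumes an export with four tab-separated columns (theme, document ID, first line ID, last line ID) and copies the file naming, the `FLWE` ID prefix and the pointer notation from the theme file shown elsewhere in these slides. The namespace URI used is the one defined by the W3C XLink recommendation.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

XLINK = "http://www.w3.org/1999/xlink"  # W3C XLink namespace
ET.register_namespace("xlink", XLINK)


def build_theme_files(tsv_text, prefix="FLWE"):
    """Return {filename: xml_string}, one stand-off file per (theme, document)."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        theme, doc_id, start, end = row  # assumed column order
        groups[(theme, doc_id)].append((int(start), int(end)))

    files = {}
    for (theme, doc_id), spans in groups.items():
        root = ET.Element(f"{theme}_set", {"id": f"{prefix}{doc_id}"})
        for n, (start, end) in enumerate(sorted(spans), start=1):
            ET.SubElement(root, theme, {
                "id": f"{prefix}{doc_id}_{n}",
                f"{{{XLINK}}}type": "simple",
                # Pointer notation copied from the slides, not standard XPointer.
                f"{{{XLINK}}}href": f"{doc_id}.xml#xpointer({start} to {end})",
            })
        files[f"{theme.lower()}SN{doc_id}.xml"] = ET.tostring(root, encoding="unicode")
    return files
```

Each value in the returned dictionary can be written to disk under its key, giving one pointer file per theme per transcript.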


Theme x-query files

File = Childbirth30.xml

<Childbirth_set id="FLWE30" xmlns:xlink="http://www.w3.org/1999/xlink/">
  <Childbirth id="FLWE30_1" xlink:type="simple" xlink:href="30.xml#xpointer(53 to 54)" />
  <Childbirth id="FLWE30_2" xlink:type="simple" xlink:href="30.xml#xpointer(257 to 264)" />
</Childbirth_set>

Theme ID = Childbirth

Document ID = 30
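Retrieval works by following each pointer back into the transcript. The slides mention a VB Script for this. The Python sketch below is a hypothetical equivalent, assuming that the two numbers in a pointer are the IDs of the first and last utterances of a coded section and that utterances are `<u>` elements with numeric `id` attributes. The function names are invented.

```python
import re
import xml.etree.ElementTree as ET

# Matches the pointer notation used in the theme files, e.g.
# "30.xml#xpointer(53 to 54)".
HREF = re.compile(r"^(?P<file>[^#]+)#xpointer\((?P<first>\d+) to (?P<last>\d+)\)$")


def get_href(element):
    """Return the href attribute of an element, whatever namespace it is in."""
    for key, value in element.attrib.items():
        if key == "href" or key.endswith("}href"):
            return value
    return None


def parse_href(href):
    """Split a pointer into (file name, first id, last id)."""
    match = HREF.match(href)
    if match is None:
        raise ValueError(f"unrecognised pointer: {href!r}")
    return match["file"], int(match["first"]), int(match["last"])


def extract(transcript_root, first, last):
    """Return the <u> elements whose numeric id lies between first and last."""
    return [
        u for u in transcript_root.iter("u")
        if first <= int(u.get("id")) <= last
    ]
```

A retrieval request for a theme would loop over the elements of the theme file, call `parse_href` on each pointer, load the named transcript and pass the two IDs to `extract`.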


Files required for web query

Web Directories:

\Documents (for indexing only)

– SN30.asp (but could be xml files)

\Transcripts (for web xml retrieval)

– SN30.xml

\Themes

– childbirthSN30.xml (X-query: pointers to the relevant parts of xml docs)


Phase II functionality and beyond

Will be adding:

• Boolean searching to view overlapping themes

• Key word in theme search

• Add pointers in the text to other materials: notes, researchers’ annotations, audio, pictures, geo-references

Will investigate/would like to develop:

• neat tool sets for publishing and querying data

• Enable simultaneous manipulation and display of quantitative data, e.g. via the NESSTAR system

• Document and thematic coding on-line

• New code retrieval on the fly

• Linked thesaurus tools