From Edwardians Online to Qualidata Online:

Preparing data for online access

Libby Bishop, ESDS Qualidata

Economic and Social Data Service, UK Data Archive

Online Access to Qualitative Data: Opportunities and Challenges

Friday 5 December, 2003, Royal Statistical Society

Towards a standard format for qualitative data resources

• data needs to be preserved in a uniform resource format

• easier for provider (maintenance, tools, interchange)

• easier for user (consistency across data sets)

• DDI provides an XML framework for survey content (variables) but currently no suitable standard format for the content of qualitative data

• need a comprehensive application that will enable:

• data interchange

• sophisticated on-line searching

• retrieval from encoded texts

The Edwardians Online Pilot

Edwardians Online Development work

• A six month investigative project towards developing such a framework in a specific resource creation project:

• Data

• Models using XML standards technologies

• New functionality

• Data coding methods

Basic search and retrieve functionality

• developed online querying function based on annotation of texts and themes in XML

• keyword search interview summaries database

• keyword search interview full transcripts

• search or browse by themes from list - retrieve extracts of text in particular documents coded by that theme

• jump from extract to view in full document

• filter searches on subsets of interviewees, e.g., age, gender

Family Life and Work Experience Before 1918 project

• Life history interviews - classic sociological study of Edwardian Society by Professor Paul Thompson

• One of our larger datasets: conducted in early seventies - nearly 500 interviews completed

• Of value because of scope and diversity: cross-national sample of people born in Britain before 1918

• Broadly representative of qualitative interview data

• Data exists in various formats in various locations: originally recorded on audio tapes; transcribed as typed paper documents; includes supporting source materials (essays, letters)

• Texts coded in thematic analysis of content

• Paper source has proved very popular for reuse

• interviews up to 100,000 words

• 8hrs of audio tape

• secondary source: transcription of the dialogue- errors in interpretation

• no time indexes between sound & content

• loosely structured

• alternate speakers

Example of Interview Text

1. Household
2. Domestic routine
3. Meals
4. Influence and discipline
5. Recreation in the home
6. Recreation outside the home
7. Weekend activities and religion
8. Politics
9. Parents' interests
10. Children's leisure
11. Community and social class

• Texts coded into broad themes in family life and work

• Coded then extracts cut-and-pasted to separate filing system

• Coding systems vary in complexity

• Text coded by theme to assist research:

• management of dataset

• more rigorous interpretation of text

12. School
13. Work, except domestic service
14. Life after leaving school
15. Marriage
16. Childbirth - including sexual knowledge
18. Domestic service
19. Institutions and boarding schools
20. * Occupational history

Thematic Coding

Example of Thematic Coding

• Thematic sections of variable length

• May be overlapping

Why Preserve Thematic Coding?

• Preserving codes preserves record of primary interpretation of dataset, promotes openness in research

– replication; confirmation; re-interpretation.

• Useful as retrieval aids for voluminous bodies of text?

– Original ‘cut-and-paste’ thematic segments proved important and popular finding aid for paper collections.

– User familiarity – CAQDAS information retrieval and management

• Some limitations:

– Codes vary with content and individual coder’s interpretation, so quality is variable

– Coding is not a complete representation of thematic content: for example, not coded for migration or health

From Edwardians Online to Qualidata Online

• Expand number of accessible datasets

• Expanded online functionality

– Ability to search across multiple datasets

– Ability to filter on basic demographics (age, gender, residence, occupation)

– Ability to combine keyword search and filter

• Standardise and automate transcript processing tools and procedures
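The combined keyword search and demographic filter listed above can be sketched roughly as follows. This is a minimal illustration only: the `Interview` record, its field names, and the sample entries are hypothetical and not taken from the Qualidata Online implementation.

```python
from dataclasses import dataclass


@dataclass
class Interview:
    # Hypothetical record layout: one entry per interview transcript.
    doc_id: str
    gender: str
    birth_year: int
    text: str


def search(interviews, keyword, gender=None, born_before=None):
    """Return doc IDs whose text contains the keyword and that pass the filters."""
    keyword = keyword.lower()
    hits = []
    for iv in interviews:
        # Demographic filters are applied first; None means "do not filter".
        if gender is not None and iv.gender != gender:
            continue
        if born_before is not None and iv.birth_year >= born_before:
            continue
        # Simple case-insensitive substring match stands in for keyword search.
        if keyword in iv.text.lower():
            hits.append(iv.doc_id)
    return hits


# Made-up sample data, for illustration only.
SAMPLE = [
    Interview("SN30", "female", 1892, "Father was a farm worker."),
    Interview("SN31", "male", 1901, "I worked on the farm."),
    Interview("SN32", "female", 1899, "We went to chapel."),
]
```

Calling `search(SAMPLE, "farm", gender="female")` narrows the keyword hits to female interviewees; searching across multiple datasets would amount to passing a combined list.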

Material from additional collections

• Mothers and Daughters by Mildred Blaxter (in Scots)

• 100 Families by Paul Thompson (without speaker tags)

• Key processing steps:

– Scan

– OCR

– Proof

– Format

– XML

Preparing data

• Prepare digital files in appropriate format

– OCR and manual tidy up

– Macros to prepare text for mark-up

• Assign line IDs; remove unicode

– Excel sheet to add speaker IDs (turn takers)

– Database to tag (code) lines by theme

– Scripts to transform docs to XML

• Scripts to process web retrievals

– VB Script to process retrieval request using x-link and x-pointer

Getting from .tif…

To basic XML

<u id="96" who="subject">I would rather nae ken if I had cancer. I told my man that, I says "If I have cancer, don't tell me". I mean you might hae an idea yourself, but I wouldnae like to be telt. I told him that.</u>
<u id="97" who="interviewer">And how has your own health been over the years?</u>
<u id="98" who="subject">Och, up an' doon, y'ken .</u>
<u id="99" who="interviewer">Any serious illness?</u>
<u id="100" who="subject">No ... nae illnesses .. nae illness, ken, in that wey. Just once I took an afa' turn at [missing words] I couldnae get ... I wis aye sleepin' this tablets I got fae the doctor and I had to sign for this tablets .. I just couldnae keep awake.</u>
<u id="101" who="interviewer">And did he say what it was?</u>
<u id="102" who="subject">I canna mind now, it's that long ago.. But I was really bad at that time, otherwise now .. apart fae this broken arm and operations kidneys .. I had an operation for a cyst.</u>
<u id="103" who="interviewer">Uh-huh.. was it ... epileptic, was she?</u>
<u id="104" who="subject">An’ this shoulder.. I couldnae move it..</u>
<u id="105" who="interviewer">Uh-huh .. a joint? Seized up ... and how long ago was that?</u>
<u id="106" who="subject">Well, it'll be.. ten year this 8th June. It was the same day as Robert Kennedy was killed. that's how I ken. I was goin' into hospital in the mornin’ an’ I mind tellin’ the patients [missing words] "What a shame, Robert Kennedy's been shot, an' killed" [missing words]</u>

Word document created from OCR

Issues in scanning and OCR

• Scanning done at 300 dpi, grey scale

• OCR varies hugely with quality of original, special challenges include (but are not limited to):

– Character recognition

– Stray marks on page

– Missing words

– Interviewer’s notes

– “Creative” character interpretation: section breaks, font changes, footnotes, super- and sub-scripts, and so on.

• Partially automated with macros, but much judgement (clerical and research) still required

OCR and manual tidy up

• Work required to digitise older type face not to be underestimated

• Average of 12 hours clerical labour to prepare a 70 page document

• Apply macros in Ultraedit or Excel to remove page nos, speaker line breaks etc.

• 040
• Mrs Florrie D., Wootton. Father, farm worker. B. 1892.
• Your name is Mrs Florence is it?
• Florrie - yes.
• And you live at 13?
• Castle Road. Wooton.
• And you're a widow?
• Yes.
• And the year of your marriage was 1911?
• Yes.
• And the year of birth 28th July 1892?
• Yes.
• And that was at Whichford?

Final Word file (human and Excel readable)
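The macro-based tidy-up described above (removing page numbers and stray spacing) might look something like the sketch below. It is written in Python rather than as UltraEdit or Excel macros, and the page-number patterns (`- 12 -` or `Page 12` alone on a line) are assumptions, not the project's actual rules. Bare numbers are deliberately left alone, because a line such as `040` may be an interview number rather than a page number, which is the kind of judgement the slides say is still required.

```python
import re

# Assumed page-number layouts: "- 12 -" or "Page 12" alone on a line.
PAGE_NUMBER = re.compile(r"(?:-\s*\d+\s*-|page\s+\d+)", flags=re.IGNORECASE)


def tidy_ocr_lines(raw_lines):
    """Drop blank and page-number-only lines; collapse runs of whitespace."""
    cleaned = []
    for line in raw_lines:
        text = re.sub(r"\s+", " ", line).strip()
        if not text:
            continue  # blank line left over from OCR layout
        if PAGE_NUMBER.fullmatch(text):
            continue  # page number standing alone on a line
        cleaned.append(text)
    return cleaned
```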

Notes for transcription proofing

Main aim is to check that the speech flows and reads properly and that there are no missing sections of text. In addition:

1. Each individual stream of speech should be a continuous line followed by a carriage return.

2. Obvious spelling mistakes to be corrected, with the following exceptions:

3. Peculiar or unrecognized spellings that should be left as they are include proper names, place names and obvious cases where the original transcriber was trying to indicate the phonetics of the speech. – E.g.: “never mixed with a lot of ‘em”.

4. The spelling of proper names, such as place names and person names, should be consistent.

5. Poor grammar is typically part of the interview content, so ignore it.

6. Page numbers.

Transcription editing guidelines

• Basic editing (spelling, punctuation)

• Research editing

– Interviewer’s annotations

– Text supplemental to transcript itself

• Editing to conform with Excel and XML

– MUST have tabs between speaker tags and utterance, BUT no other tabs…

– Handling special characters (“10 ½” or “ten and a half” or “10.5”?)

– Replace double spacing with paragraph formatting to create extra space at end of paragraph

See handout: Qualidata Transcription Editing Guidelines
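The "one tab between speaker tag and utterance, no other tabs" rule lends itself to an automated check before files go to Excel. The sketch below is one possible way to do that; the function name and the `INT`/`SUB` speaker tags in the usage note are hypothetical.

```python
def check_tabs(lines):
    """Return (line_number, problem) pairs for lines that break the one-tab rule."""
    problems = []
    for number, line in enumerate(lines, start=1):
        tab_count = line.count("\t")
        # Exactly one tab is expected: between the speaker tag and the utterance.
        if tab_count != 1:
            problems.append((number, f"expected 1 tab, found {tab_count}"))
    return problems
```

For instance, `check_tabs(["INT\tAny serious illness?", "SUB\tNo\tnae illnesses"])` would flag only the second line.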

Transformation of transcripts to XML

• Export key fields to Excel sheet

– Doc ID, Line ID

• Text is marked up at the utterance level

• Excel macros create marked up transcripts from tab delimited file

– SN30.xml

– SN31.xml
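The slides describe Excel macros for this step. As a rough equivalent, the sketch below turns a tab-delimited export into utterance-level mark-up in Python. The column order (line ID, speaker, text) and the `transcript` root element are assumptions; only the `<u id="…" who="…">` form comes from the example shown earlier.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_transcript(tsv_text, doc_id):
    """Build an XML transcript string from tab-delimited rows: line ID, speaker, text."""
    # "transcript" is a hypothetical root element name.
    root = ET.Element("transcript", {"id": doc_id})
    # QUOTE_NONE keeps quotation marks in the speech as literal characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip empty lines in the export
        line_id, who, text = row
        utterance = ET.SubElement(root, "u", {"id": line_id, "who": who})
        utterance.text = text
    return ET.tostring(root, encoding="unicode")
```

Because each row must unpack into exactly three fields, a stray tab inside an utterance raises an error here, which is why the editing guidelines insist on no other tabs.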

Using Excel macros to create XML transcript

New tags for searching on demographic variables

Handling unique features

• None of the Edwardians or 100 Families transcripts have speaker tags

• Need some way of indicating who is speaking when search results are returned

• Turn takers (usually transcribed)

– Logic test to assign interviewer/subject based on end of line character
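The slides do not say which end-of-line character the logic test uses. The sketch below assumes, purely for illustration, that a line ending in a question mark is the interviewer and anything else is the subject, and then flags places where the same speaker appears twice in a row so that a person can check them.

```python
def assign_speakers(lines):
    """Tag each line as interviewer or subject from its final character (assumed rule)."""
    tagged = []
    for line in lines:
        text = line.strip()
        # Assumption: questions come from the interviewer.
        who = "interviewer" if text.endswith("?") else "subject"
        tagged.append((who, text))
    return tagged


def flag_turn_problems(tagged):
    """Return indexes where a speaker follows themselves, i.e. turn-taking looks broken."""
    return [i for i in range(1, len(tagged)) if tagged[i][0] == tagged[i - 1][0]]
```

A rule like this will misassign interviewer statements and subject questions, so the flagged indexes are a prompt for manual checking rather than a final answer.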

Check turn-taking with no speaker tags

Screenshots for quali online


Thematic coding: stand-off Architecture in XML

• Challenges for developing an XML application included the multiple hierarchies in the transcript texts and overlapping fields or elements:

dialogue structure v thematic content

• Conventional mark-up of these structures in a single document violates nesting rules of XML

• Solution - ‘stand-off annotation’ approach whereby data and coding stored in different documents (annotation linked by Xlink and Xpointers)

• Proven utility as method for annotating multi-coded dialogue corpora:

– allows for multiple coding schemes

– accommodates overlapping elements

– easily extendable

Base-line text unit: utterances (<U>)

Theme: politics

Theme: household

Theme: work

<U> attributes:

• id

• speaker …

• start time (audio file)

• end time (audio file)

Example of ‘Stand-off’ XML Architecture
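To see why stand-off annotation sidesteps the nesting problem, the sketch below keeps theme codes as ranges of utterance IDs held apart from the text, so two themes can overlap freely. The ranges are made up for illustration; only the theme names (politics, household, work) come from the slide.

```python
# Hypothetical theme ranges (inclusive utterance IDs), stored apart from the text.
THEME_RANGES = {
    "politics": [(1, 4)],
    "household": [(3, 6)],
    "work": [(6, 8)],
}


def themes_for(utterance_id, theme_ranges=THEME_RANGES):
    """List every theme whose ranges cover the given utterance."""
    return sorted(
        theme
        for theme, ranges in theme_ranges.items()
        if any(start <= utterance_id <= end for start, end in ranges)
    )


def overlap(theme_a, theme_b, theme_ranges=THEME_RANGES):
    """Return utterance IDs coded with both themes."""
    def ids(theme):
        return {n for start, end in theme_ranges[theme] for n in range(start, end + 1)}
    return sorted(ids(theme_a) & ids(theme_b))
```

Utterances 3 and 4 carry both politics and household here, something inline tags could not express without breaking XML nesting. Adding a further coding scheme means adding another dictionary, not re-editing the transcript.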

Applying thematic coding

• Annotator tool in MS Access using append query (appends lines to table)

• For each transcript, save as section table

• Add to ‘total section’ table (cum. Line ID)

• Export table as tab delimited file

• Perl scripts create files with pointers to relevant parts of text for EACH transcript

– householdSN30.xml

– childbirthSN30.xml

– householdSN31.xml

– childbirthSN31.xml
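The project used Perl scripts for this step. The Python sketch below shows one way the same idea could work: coded line IDs for a theme are grouped into consecutive runs, and each run becomes a pointer element. The function names and the `FLWE` prefix default are illustrative assumptions, and `XLINK` is set to the W3C XLink namespace URI.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"  # W3C XLink namespace URI


def to_ranges(line_ids):
    """Group line IDs into inclusive (start, end) runs of consecutive numbers."""
    ranges = []
    for n in sorted(set(line_ids)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n  # extend the current run
        else:
            ranges.append([n, n])  # start a new run
    return [tuple(r) for r in ranges]


def build_theme_file(theme, doc_no, line_ids, prefix="FLWE"):
    """Return a theme pointer file as an XML string, one element per coded run."""
    ET.register_namespace("xlink", XLINK)
    root = ET.Element(f"{theme}_set", {"id": f"{prefix}{doc_no}"})
    for i, (start, end) in enumerate(to_ranges(line_ids), start=1):
        ET.SubElement(root, theme, {
            "id": f"{prefix}{doc_no}_{i}",
            f"{{{XLINK}}}type": "simple",
            # Pointer string follows the pattern used in the project's theme files.
            f"{{{XLINK}}}href": f"{doc_no}.xml#xpointer({start} to {end})",
        })
    return ET.tostring(root, encoding="unicode")
```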

Theme x-query files

File=Childbirth30.xml

<Childbirth_set id="FLWE30" xmlns:xlink="http://www.w3.org/1999/xlink/">

  <Childbirth id="FLWE30_1" xlink:type="simple" xlink:href="30.xml#xpointer(53 to 54)" />

  <Childbirth id="FLWE30_2" xlink:type="simple" xlink:href="30.xml#xpointer(257 to 264)" />

  </Childbirth_set>

Theme ID = Childbirth

Document ID = 30
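On the retrieval side, the project used VB Script. The sketch below shows in Python how a pointer such as `30.xml#xpointer(53 to 54)` could be resolved into the utterances it refers to. The pointer is treated as a plain string pattern and parsed with a regular expression, not with an XPointer processor, and the `transcript` root element in the usage note is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches the pointer form shown in the theme file, e.g. "30.xml#xpointer(53 to 54)".
POINTER = re.compile(r"^(?P<doc>[^#]+)#xpointer\((?P<start>\d+) to (?P<end>\d+)\)$")


def parse_href(href):
    """Split a pointer into (document name, first line ID, last line ID)."""
    match = POINTER.match(href)
    if match is None:
        raise ValueError(f"unrecognised pointer: {href!r}")
    return match["doc"], int(match["start"]), int(match["end"])


def extract(transcript_xml, start, end):
    """Return (speaker, text) for every <u> whose id lies in the inclusive range."""
    root = ET.fromstring(transcript_xml)
    return [
        (u.get("who"), u.text)
        for u in root.iter("u")
        if start <= int(u.get("id")) <= end
    ]
```

A retrieval request would call `parse_href` on each pointer in the theme file, load the named transcript, and pass its contents to `extract`.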

Files required for web query

Web Directories:

\Documents (for indexing only)

– SN30.asp (but could be xml files)

\Transcripts (for web xml retrieval)

– SN30.xml

\Themes

– childbirthSN30.xml (X-query: pointers to the relevant parts of xml docs)

Phase II functionality and beyond

Will be adding:

• Boolean searching to view overlapping themes

• Key word in theme search

• Add in-text pointers to other materials: notes, researchers' annotations, audio, pictures, geo-references.

Will investigate/would like to develop:

• neat tool sets for publishing and querying data

• Enable simultaneous manipulation and display of quantitative data, e.g. via the NESSTAR system

• Document and thematic coding on-line

• New code retrieval on the fly

• Linked thesaurus tools
