large-scale computational research in arts & humanities

Large-scale computational research in Arts and Humanities, using mostly unwritten (audio/visual) media

John Coleman
Faculty of Linguistics, Philology and Phonetics
University of Oxford

The future of research? 19/10/10

• What am I talking about? (Show and tell)

• How is it affecting the A/H research landscape?

• Implications (cost, strategy etc)

Once upon a time ...

There she weaves by night and day

A magic web with colours gay

1967: writing

Important/jj as/cs was/bedz Mr./np O'Donnell's/np$ essay/nn ,/, his/pp$ thesis/nn is/bez so/ql restricting/jj as/cs to/to deny/vb Faulkner/np the/at stature/nn which/wdt he/pps obviously/rb has/hvz ./. He/pps and/cc also/rb Mr./np Cowley/np and/cc Mr./np Warren/np have/hv fallen/vbn to/in the/at temptation/nn which/wdt besets/vbz many/ap of/in us/ppo to/to read/vb into/in our/pp$ authors/nns --/-- Nathaniel/np Hawthorne/np ,/, for/in example/nn ,/, and/cc Herman/np Melville/np --/-- protests/vbz against/in modernism/nn ,/, material/jj progress/nn ,/, and/cc science/nn which/wdt are/ber genuine/jj protests/nns of/in our/pp$ own/jj but/cc may/md not/* have/hv been/ben theirs/pp$$ ./. Faulkner's/np$ total/nn works/nns today/nr ,/, and/cc in/in fact/nn those/dts of/in his/pp$ works/nns which/wdt existed/vbd in/in 1946/cd when/wrb Mr./np Cowley/np made/vbd his/pp$ comment/nn ,/, or/cc in/in 1939/cd ,/, when/wrb Mr./np O'Donnell/np wrote/vbd his/pp$ essay/nn ,/, reveal/vb no/at such/jj simple/jj attitude/nn toward/in the/at South/nr-tl ./. If/cs he/pps is/bez a/at traditionalist/nn ,/, he/pps is/bez an/at eclectic/jj traditionalist/nn ./. If/cs he/pps condemns/vbz the/at recent/jj or/cc the/at present/nn ,/, he/pps condemns/vbz the/at past/nn with/in no/ql less/ap force/nn ./. If/cs he/pps sees/vbz the/at heroic/jj in/in a/at Sartoris/np or/cc a/at Sutpen/np ,/, he/pps sees/vbz also/rb --/-- and/cc he/pps shows/vbz --/-- the/at blind/jj and/cc the/at mean/jj ,/, and/cc he/pps sees/vbz the/at Compson/np family/nn disintegrating/vbg from/in within/rb ./. He/pps is/bez not/* one/cd to/to remain/vb more/ql comfortably/rb and/cc
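The tagged text above pairs each token with a part-of-speech tag after a slash. A minimal sketch of reading that format, assuming whitespace-separated `word/tag` tokens; the function name `parse_tagged` is illustrative and not part of any corpus toolkit:

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so punctuation tokens such as './.' and '--/--' survive.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # Assumption: a token without a slash is kept as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


sample = "He/pps condemns/vbz the/at past/nn with/in no/ql less/ap force/nn ./."
pairs = parse_tagged(sample)

# Words whose tag starts with 'vb' (the verb tags used in the excerpt).
verbs = [word for word, tag in pairs if tag.startswith("vb")]
print(pairs[:3])
print(verbs)
```

Once the text is in this form, counting tags or searching for tag sequences is straightforward, which is what made tagged written corpora searchable long before audio was.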

XML TEI: still writing

<inscript id="halu0001"><sourceDesc><physObj type="ashlar" engrave="engraved" color=""><desc>Negev. Elusa (Haluza). 100-299 CE. Limestone ashlar dressed as a tabula ansata. </desc><letterHgt min="1.7" max="3.5">1.7-3.0 cm Aramaic, 2.5-3.5 cm Greek</letterHgt><dateRange calendar="Gregorian" from="100" to="299">100 CE to 299 CE<note>Based on the Greek and Palmyrene script.</note></dateRange><discovery><place region="Negev" city="Elusa (Haluza)" site="Foundation of an abandoned Beduin structure" locus=""><note>The inscription consists of two lines of Greek followed by one of
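Because the record is structured markup, its metadata can be read by program. A minimal sketch using Python's standard `xml.etree.ElementTree`; the sample is abridged from the fragment above and closed by hand so that it parses (the real record continues beyond what is shown), and the function name `summarise` and the output keys are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

# Abridged from the fragment above; closing tags added by hand for illustration only.
SAMPLE = """<inscript id="halu0001"><sourceDesc>
<physObj type="ashlar" engrave="engraved" color="">
<desc>Negev. Elusa (Haluza). 100-299 CE. Limestone ashlar dressed as a tabula ansata. </desc>
<letterHgt min="1.7" max="3.5">1.7-3.0 cm Aramaic, 2.5-3.5 cm Greek</letterHgt>
<dateRange calendar="Gregorian" from="100" to="299">100 CE to 299 CE<note>Based on the Greek and Palmyrene script.</note></dateRange>
</physObj></sourceDesc></inscript>"""


def summarise(xml_text):
    """Pull a few searchable fields out of one inscription record."""
    root = ET.fromstring(xml_text)
    date = root.find(".//dateRange")
    height = root.find(".//letterHgt")
    return {
        "id": root.get("id"),
        "object_type": root.find(".//physObj").get("type"),
        "from_year": int(date.get("from")),
        "to_year": int(date.get("to")),
        "letter_height_cm": (float(height.get("min")), float(height.get("max"))),
    }


print(summarise(SAMPLE))
```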

VRE for the Study of Documents and Manuscripts: writing

A trial of Kathryn Sutherland’s Jane Austen manuscript project; supported by CCH, King's College London but here ported to our VRE-SDM demonstrator

From writing to video

YouTube surpasses Yahoo as world’s #2 search engine

Researchers of the future

• Just as comfortable creating multimodal online content (video, games, websites etc.) as writing essays

• Online video is interesting; TV is boring (passive)

Здесь Будет ]ружен Памятник ]божденный т[

Здесь Будет ]ружен Памятник ]божденный т[руд]

Here will be erected a monument to liberated labour

Vocal tract movements in speech

Resonance tuning in soprano singing and vocal tract shaping

Erik Bresch, Speech Production and Articulation kNowledge Group, University of Southern California

Mining a Year of Speech: a “Digging into Data” project

http://www.phon.ox.ac.uk/mining/

John Coleman, Greg Kochanski, Ladan Ravary, Sergio Grau
Oxford University Phonetics Laboratory

Lou Burnard

Jonnie Robinson
The British Library

with support from

Mark Liberman, Jiahong Yuan, Chris Cieri
Phonetics Laboratory and Linguistic Data Consortium, UPenn

with support from NSF

The “Digging into Data” challenge

“The creation of vast quantities of Internet accessible digital data and the development of techniques for large-scale data analysis and visualization have led to remarkable new discoveries in genetics, astronomy and other fields ...

With books, newspapers, journals, films, artworks, and sound recordings being digitized on a massive scale, it is possible to apply data analysis techniques to large collections of diverse cultural heritage resources as well as scientific data.”

The “Year of Speech”

• A grove of corpora, held at various sites with a common indexing scheme and search tools

• US English material: 2,240 hrs of telephone conversations

• 1,255 hrs of broadcast news

• As-yet unpublished talk show conversations (1000 hrs), Supreme Court oral arguments (5000 hrs), political speeches and debates

• British English: Spoken part of the British National Corpus, 10 million words of transcribed speech

• Recently digitized by collaboration with British Library

C-SPAN

• US cable TV channel covering Senate/House proceedings, committees, current affairs discussion shows

• 20-year archive of publicly open video

• Large parts of the proceedings officially transcribed and published

Digging for audio: kinds of questions someone might ask

1. When did X say Y?

For example, "find the video clip where George Bush said 'read my lips'."

2. How do arguments work?

For example, how do different people handle interruptions?

3. How frequent are linguistic features such as phrase-final rising intonation ("uptalk") across different age groups, genders, social classes, and places?

4. Who says “ask” and who says “aks”?

British National Corpus

• Collected in early 1990s by consortium of dictionary makers (Collins, Longman, OUP) and academics (Oxford, Lancaster, Oslo-Bergen)

• 100m word text (XML) corpus, of which 10m is transcribed speech

• c. 4.2 m words is demographically-sampled recordings of unplanned conversations

• British Market Research Bureau loaned Sony Walkmans to recruits

• c. 5 m words is “context-governed” speech (educational, business, public speeches/meetings, 'leisure' – sports, clubs, broadcast, phone-ins etc)

• Transcribed by audio typists and structured in XML database with rich metadata annotations

A few speech samples from the BNC

• A domestic drama

• Political commentary/current affairs

• Are dogs people too?

Practicalities

• In order to be of much practical use, such very large corpora must be indexed at word and phoneme level

• All included speech corpora must therefore have associated text transcriptions

• We use the Penn Phonetics Laboratory Forced Aligner to associate each word and segment with the corresponding start and end points in the sound files
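A word-level index of this kind can be sketched as follows. The slides do not give the aligner's output format, so the input here is assumed to be a list of (word, start, end) tuples per recording; the recording name, the times and the function names are invented for illustration:

```python
from collections import defaultdict


def build_index(alignments):
    """Map each word to the places it occurs.

    alignments: {recording_id: [(word, start_s, end_s), ...]} (assumed format).
    """
    index = defaultdict(list)
    for rec_id, words in alignments.items():
        for pos, (word, start, end) in enumerate(words):
            index[word.lower()].append((rec_id, pos, start, end))
    return index


def find_phrase(alignments, index, phrase):
    """Return (recording_id, start_s, end_s) for each occurrence of the phrase."""
    target = phrase.lower().split()
    if not target:
        return []
    hits = []
    # Start from occurrences of the first word, then check the words that follow.
    for rec_id, pos, start, _ in index.get(target[0], []):
        window = alignments[rec_id][pos:pos + len(target)]
        if [w.lower() for w, _, _ in window] == target:
            hits.append((rec_id, start, window[-1][2]))
    return hits


# Invented example data: one clip with made-up time stamps.
alignments = {
    "clip_001": [("read", 12.40, 12.61), ("my", 12.61, 12.75), ("lips", 12.75, 13.20)],
}
index = build_index(alignments)
print(find_phrase(alignments, index, "Read my lips"))
```

The returned start and end times are what a player would need to jump straight to the relevant stretch of audio, which is the "When did X say Y?" kind of question.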

'Speech in the wild'

Rethinking language

• Dogs

• Parrot talk (to/about, not by)

• Talk to inanimate objects

• We can look forward to ...

Listen they were all going [belch] that ain't a burp he said

Like I'd be talking like this and suddenly it'll go [mimics microphone noises]

He simply went [sound effect] through his nose

A future of research?

• Survey of audio-visual tools and resources in the Humanities (AHRC ICT Strategy project)

http://www.phon.ox.ac.uk/ict

Key findings

– Growing but relatively poorly-supported use of audio & video in many subjects (Music, Modern Languages, Modern History, Archaeology, Classics, Art, Linguistics)

– Annotation, search and browse tools are essential

– Digital data storage and processing power required vastly outstrips text and photos, and is commensurate with e-Science grid computing

How big is “big science”?

• Human genome: 3 GB
• DASS audio sampler: 350 GB
• Hubble space telescope: 0.5 TB/year
• 'Year of Speech': 1 - 2 TB
• Sloan digital sky survey: 16 TB
• Beazley Archive & partners: 20 TB
• Ruskin School of Art student projects: 30 TB
• 10m Google Books: ~150 TB
• Survivors of the Shoah Visual History Foundation: 180 TB
• Large Hadron Collider: 15 PB/year = 100 x Google Books

Photographic collections, film libraries, museum catalogues etc are pretty large nowadays

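Humanities collections in the list above: DASS audio sampler, 'Year of Speech', Beazley Archive & partners, Ruskin School of Art student projects, Google Books, Survivors of the Shoah Visual History Foundation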

Why does big matter?

• What kind of questions you can study depends on the material you've got. (Obviously.)

• Humanities deals with rare and unique works and interpretations, not repeatable events.

• To study rare events/things and connections, it can be important to just have a lot of data – as much as possible – in order to have enough examples.

Rare(ish) events in English

• I’[n] trying: 160 instances in BNC
• See[n] to: 310
• Alar[ŋ] clock: 18
• Swimmi[m] pool: 44
• Getti[m] paid: 19
• Weddi[m] present: 15
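One way to gather candidates for such rare pronunciations is to count the ordinary spellings of these word pairs in the transcripts; whether the nasal actually assimilated then has to be checked in the aligned audio. A minimal sketch, assuming plain-text transcripts in standard orthography (the spellings, the function name and the example sentence are assumptions, not taken from the BNC):

```python
import re
from collections import Counter

# Standard spellings of the word pairs listed above (assumed transcript orthography).
PHRASES = [
    "I'm trying", "seem to", "alarm clock",
    "swimming pool", "getting paid", "wedding present",
]


def count_candidates(transcript, phrases=PHRASES):
    """Count how often each candidate word pair occurs in a transcript."""
    # Normalise case and curly apostrophes so that I’m and I'm both match.
    text = transcript.lower().replace("’", "'")
    counts = Counter()
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        counts[phrase] = len(re.findall(pattern, text))
    return counts


example = "I’m trying to find the swimming pool. He said I'm trying too."
print(count_candidates(example))
```

With only a handful of hits per million words, counts like these are why corpus size matters.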

Challenges: technology

• Amount of material

• Storage
– CD quality: 635 MB/hour
– Uncompressed .wav files: 115 MB/hour
– 16 acoustic analysis parameters: 1.44 MB/hour
– 2.8 GB/day
– 85 GB/month
– 1.02 TB/year

• Computing
– distance measures, etc.
– alignment of labels
– searching and browsing
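The storage figures can be reproduced with a little arithmetic. This is a sketch under stated assumptions: the slide does not give the audio formats, so CD quality is taken as 44.1 kHz, 16-bit, stereo and the .wav files as 16 kHz, 16-bit, mono (the settings that yield 635 and 115 MB/hour); units are decimal (1 MB = 10^6 bytes); and the daily figure is assumed to be the .wav audio plus the acoustic parameters for 24 hours of speech.

```python
def mb_per_hour(sample_rate_hz, bytes_per_sample, channels):
    """Uncompressed PCM audio size in MB for one hour of recording."""
    return sample_rate_hz * bytes_per_sample * channels * 3600 / 1e6


cd = mb_per_hour(44_100, 2, 2)   # assumed CD format: 44.1 kHz, 16-bit, stereo
wav = mb_per_hour(16_000, 2, 1)  # assumed .wav format: 16 kHz, 16-bit, mono
params = 1.44                    # MB/hour for 16 acoustic parameters, as stated on the slide

per_day_gb = (wav + params) * 24 / 1000   # a full day of continuous speech
per_month_gb = per_day_gb * 365 / 12      # an average month
per_year_tb = per_day_gb * 365 / 1000     # a "year of speech"

print(f"CD quality: {cd:.0f} MB/hour, .wav: {wav:.0f} MB/hour")
print(f"{per_day_gb:.1f} GB/day, {per_month_gb:.0f} GB/month, {per_year_tb:.2f} TB/year")
```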

Challenges: technology

• Storing 1.02 TB/year: not really a problem in 21st century

• 1 TB (1000 GB) hard drive costs c. £65

• Computing (distance measures, alignments, labels etc): multiprocessor cluster

Collaboration, not collection

Search interface 1 (e.g. Oxford)

Search interface 2 (e.g. BL)

Search interface 3 (e.g. Penn)

Search interface 4 (e.g. Lancaster?)

BNC-XML database - retrieve time stamps

Spoken BNC recordings - BL sound server(s)

LDC database - retrieve time stamps

Spoken LDC recordings - various locations

Database of time stamps produced using consistent indexing standards

Your recordings - whatever location
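The arrangement in the diagram, where any search interface consults several time-stamp databases and is pointed to audio held elsewhere, might be sketched as below. Every class, function, URL and time value here is hypothetical; the slides describe the architecture only in outline.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    corpus: str
    recording_url: str  # where the audio itself is held
    start_s: float
    end_s: float


class TimestampDatabase:
    """Stand-in for one site's time-stamp database (e.g. BNC-XML or LDC)."""

    def __init__(self, corpus, rows):
        self.corpus = corpus
        self.rows = rows  # list of (word, recording_url, start_s, end_s)

    def lookup(self, word):
        return [
            Hit(self.corpus, url, start, end)
            for w, url, start, end in self.rows
            if w == word.lower()
        ]


def federated_search(word, databases):
    """Ask every database for the word and pool the resulting time stamps."""
    hits = []
    for db in databases:
        hits.extend(db.lookup(word))
    return hits


# Invented rows and placeholder URLs, for illustration only.
bnc = TimestampDatabase("BNC", [("lips", "https://example.org/bnc/tape1.wav", 3.2, 3.6)])
ldc = TimestampDatabase("LDC", [("lips", "https://example.org/ldc/call7.wav", 41.0, 41.5)])

for hit in federated_search("lips", [bnc, ldc]):
    print(hit)
```

The point of the design is that only time stamps are pooled; the recordings stay where they are, so a new collection can join by publishing an index in the shared format.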

Challenges: dispersal/aggregation

• Dispersed resources; grid computing

• Need for international standards (for authorisation etc.)

• Humanities research may require new support structures (cf. 'big science' comparisons)

• 'Federated library' or 'national research laboratory' models?

Challenges: dispersal/technology

• Finding stuff

• Doing something with it

• Transformation, new interpretations

Challenges: human

• Human aspect more important than hardware

• Who is qualified to carry out such work?

• Employment prospects

• What training provision is required?

• Should training in computer programming become normal for arts/humanities students?

Possible impacts

• Will open up Year of Speech data and tools to linguistics, phonetics, speech communication, oral history, education

• Automatic and reliable indexing of spoken on-line materials would be a “killer app”

• Caveat: it is practically impossible to predict the impact of developments in the market (cf. Microsoft, Google, YouTube) or of inventions that come to market (transistors, lasers, holograms).

• So it’s even harder to reliably predict impacts of cutting-edge research

Thank you for your time and attention

http://www.phon.ox.ac.uk/mining/

http://bvreh.humanities.ox.ac.uk/

http://www.phon.ox.ac.uk/ict

Spoken Babylonian

Martin Worthington: Babylonian and Assyrian Poetry and Literature: An Archive of Recordings

http://people.pwf.cam.ac.uk/mjw65/BAPLAR/Archive

The Righteous Sufferer (Ludlul bēl nēmeqi), part of Tablet II, read by Margaret Jaques Cavigneaux

Babylonian Karaoke

1 šattamma ana balāṭ adanna īteq
1 One whole year to the next! The appointed time passed.

2 asaḫḫurma lemun lemunma
3 zapurtī ūtaṣṣapa išartī ul uttu

2 As I turned around, it was more and more terrible;
3 My ill luck was on the increase, I could find no good fortune.

4 ila alsīma ul iddina pānīšu
5 usalli ištarī ul ušaqqâ rēšīša

Visualisation using 3-D/4-D models

Screen renderings of the Odeion at Pompeii, 3D visualisation and research by Martin Blazeby, King's Visualisation Lab
