
Language, Corpora and Context: A ‘Thrilling’ Case Study

Dawn Knight The University of Nottingham

Corpus development at Nottingham

§  Mono-modal corpora
    §  CANCODE
    §  CANBEC
    §  The Health Communication Corpus

§  Multi-modal corpora
    §  The Nottingham Multi-Modal Corpus
    §  The Nottingham Learner Corpora

§  Heterogeneous corpora
    §  The Nottingham eLanguage Corpus (NeLC)
    §  Feasibility corpora

Challenges

§  Being able to model the dynamic spatial and temporal context of social activity and to use the knowledge gained for predictive support of user applications is one of the deepest longstanding challenges within ubiquitous computing.

§  Within the field of (Corpus) Linguistics a parallel challenge is to account systematically for how our language varies from one context to another according to dynamic changes in the environment, in channels of communication and the social context of human interaction.

NeLC: SMS component

§  5 logs so far (3,692 SMSs / circa 74,000 words), currently being anonymised:
    §  F1: 1,870 SMSs, 37,772 words over 50 days
    §  F2: 1,505 SMSs, 32,137 words over 55 days
    §  F3: 100 SMSs, 1,879 words over 20 days
    §  F4: 52 SMSs, 732 words over 2 days
    §  M1: 165 SMSs, 1,674 words over 50 days

§  Logs detail the following:
    §  Content, date, time, sender and receiver details (age, gender, occupation, relationship), location and activity in location
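
The fields listed above could be held in a simple record type. The following is a minimal sketch only: the class names, field names and sample values are hypothetical and are not taken from the NeLC logging tools.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Person:
    """Sender or receiver details, following the fields named on the slide."""
    age: int
    gender: str
    occupation: str
    relationship: str  # assumed: relationship to the log keeper, e.g. "friend"


@dataclass(frozen=True)
class SmsRecord:
    """One logged SMS with its contextual metadata (hypothetical layout)."""
    content: str
    sent_at: datetime   # date and time combined into one value
    sender: Person
    receiver: Person
    location: str
    activity: str       # activity in that location

    def word_count(self) -> int:
        # Whitespace tokenisation; a real corpus tool would likely do more.
        return len(self.content.split())


# Invented example values, for illustration only.
keeper = Person(age=25, gender="F", occupation="student", relationship="self")
friend = Person(age=26, gender="M", occupation="teacher", relationship="friend")
sms = SmsRecord("see you at 8", datetime(2010, 1, 1, 9, 30),
                keeper, friend, "home", "cooking")
print(sms.word_count())
```

Keeping the metadata alongside the text in one record is what allows messages to be selected later by time, place or participant.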

NeLC: SNSs

§  2589 updates from 167 contributors in 1 user log (circa 31,300 words over 50 days)

§  Content extracted from the updates of 43 ‘celebrity’ users over a 50-day period: 16,857 updates in total (more than 250,000 words)

§  Both forms of logs include:
    §  Time and date submitted, contributor information (gender, age, occupation), network/ location and content (i.e. the text)

NeLC: Other Text-based data

§  Blogs and feedback forums taken from newspaper and broadcasting websites; data has been extracted but not yet processed.

§  Possible plans for collecting email data, in collaboration with extended ‘a day in the life of your language’ data collection.

§  Texts require anonymising and (possibly) standardising? – also ‘separating’?

A Day in the Life of your Language

§  Associated metadata also recorded

§  Data requires transcription

Representing data: requirements

§  The ability to search data and metadata in a principled and specific way (encoded and/or transcribed text-based data), within and/or across the three global domains of data; devices/ data type(s), time and/or location and participants/ given contributions (i.e. the production of ‘sub-corpora’)

§  Tools that allow for the frequency profiling of events/ elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing the results)

§  New methods for drilling into the data, through mining specific relationships within and between domain(s). This may be comparable to current social networking software, mind maps or more topologically based methods

§  Graphing tools for mapping the incidence of words or events, for example, over time and for comparing sub-corpora and domain specific characteristics
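
The first, second and fourth requirements above (sub-corpus selection by metadata, frequency profiling, and incidence over time) can be sketched in a few lines. This is an illustration under stated assumptions, not the DRS implementation: records are assumed to be plain dictionaries, and the keys `text`, `date`, `device` and `participant`, the function names and the sample data are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sub_corpus(records, **criteria):
    """Keep the records whose metadata match every key=value criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


def frequency_profile(records):
    """Raw token counts over the text of the given records."""
    counts = Counter()
    for r in records:
        counts.update(tokenize(r["text"]))
    return counts


def incidence_by_date(records, word):
    """Occurrences of `word` per date, ready to be passed to a plotting tool."""
    per_date = Counter()
    for r in records:
        per_date[r["date"]] += tokenize(r["text"]).count(word.lower())
    return dict(sorted(per_date.items()))


# Invented sample records, for illustration only.
records = [
    {"text": "See you later", "date": "2010-01-01", "device": "sms", "participant": "F1"},
    {"text": "Later then, see you", "date": "2010-01-02", "device": "sms", "participant": "F1"},
    {"text": "Off to work", "date": "2010-01-02", "device": "sns", "participant": "M1"},
]

sms_only = sub_corpus(records, device="sms")
print(frequency_profile(sms_only).most_common(3))
print(incidence_by_date(records, "later"))
```

Filtering first and counting second keeps the two steps independent, so the same profiling function can be applied to any sub-corpus, whether it is defined by device, time, location or participant.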

Future developments in DRS: mock-up 1

Feasibility Corpora: Thrill

§  A 55,000 word corpus of fairground discourse, comprising synchronised records of audio, video and sensory (i.e. heart rate) data.
    §  55 participants (mainly recorded in pairs)
    §  19 women, 26 men
    §  Ages range from teens to late 50s
    §  Over 11 hours of video


§  Data has been transcribed and divided into 4 key phases:
    §  Pre-ride phase (i.e. walking around the theme park)
    §  The elevation of the ride
    §  Start of the ride
    §  Ride terminus

§  Aims:
    §  To examine whether any patterns emerge in the specific language used within/ across these phases.
    §  To outline and test an appropriate approach to the analysis of heterogeneous data sets for linguistic enquiry.

Phase 1: Frequency based word clouds

[Four word cloud slides; images not recovered]
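
A word cloud is driven by word frequencies, so the figures behind a per-phase cloud can be sketched as below. This is a rough illustration, not the procedure used for the Thrill corpus: the short phase labels, the function name and the sample utterances are hypothetical, and no stop-word filtering is applied.

```python
from collections import Counter

# Shorthand labels (assumed) for the four phases named on the earlier slide.
PHASES = ("pre-ride", "elevation", "start", "terminus")


def cloud_weights(utterances, top_n=5):
    """Map each phase to its top_n (word, relative frequency) pairs.

    `utterances` is an iterable of (phase, text) pairs.
    """
    weights = {}
    for phase in PHASES:
        tokens = [word
                  for p, text in utterances if p == phase
                  for word in text.lower().split()]
        total = len(tokens)
        counts = Counter(tokens)
        # Relative frequency makes phases of different lengths comparable.
        weights[phase] = [(word, count / total)
                          for word, count in counts.most_common(top_n)]
    return weights


# Invented utterances, for illustration only.
sample = [
    ("pre-ride", "which one shall we go on"),
    ("elevation", "oh no oh no"),
    ("start", "here we go"),
    ("terminus", "that was good"),
]
for phase, pairs in cloud_weights(sample, top_n=2).items():
    print(phase, pairs)
```

Using relative rather than raw frequency matters here because the phases differ in duration, so raw counts alone would favour the longer phases.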

Next Steps

§  More detailed analyses/ comparisons of the data.

§  Extend the analyses, exploring more varied datasets from NeLC and ‘A day in the life of your language’.

§  Collect more data (using a wider cross section of different participants), using the mobile toolkit.

Acknowledgements

Research team

The Digital Records for e-Social Science Project is funded by the ESRC.
