

Page 1: Language, Corpora and Context: A Thrilling Case Studypszaxc/DReSS/AIS10.pdf · NeLC: SMS component ! 5 logs so far (3692 SMSs / circa 74,000 words)- currently being anonymised: !

Language, Corpora and Context: A ‘Thrilling’ Case Study

Dawn Knight The University of Nottingham


Corpus development at Nottingham

§  Mono-modal corpora
    §  CANCODE
    §  CANBEC
    §  The Health Communication Corpus

§  Multi-modal corpora
    §  The Nottingham Multi-Modal Corpus
    §  The Nottingham Learner Corpora

§  Heterogeneous corpora
    §  The Nottingham eLanguage Corpus (NeLC)
    §  Feasibility corpora


Challenges

§  Being able to model the dynamic spatial and temporal context of social activity and to use the knowledge gained for predictive support of user applications is one of the deepest longstanding challenges within ubiquitous computing.

§  Within the field of (Corpus) Linguistics a parallel challenge is to account systematically for how our language varies from one context to another according to dynamic changes in the environment, in channels of communication and the social context of human interaction.


NeLC: SMS component

§  5 logs so far (3692 SMSs / circa 74,000 words), currently being anonymised:
    §  F1: 1870 SMSs, 37,772 words over 50 days
    §  F2: 1505 SMSs, 32,137 words over 55 days
    §  F3: 100 SMSs, 1879 words over 20 days
    §  F4: 52 SMSs, 732 words over 2 days
    §  M1: 165 SMSs, 1674 words over 50 days

§  Logs detail the following:
    §  Content, date, time, sender and receiver details (age, gender, occupation, relationship), location and activity in location
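As a rough illustration of how such logs might be represented, the sketch below pairs a per-message record with a per-log summary. The class and field names (`SmsRecord`, `SmsLog`, `words_per_message`) are hypothetical and not taken from the NeLC tooling; only the counts come from the figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmsRecord:
    """One logged SMS; field names are illustrative assumptions."""
    content: str
    date: str
    time: str
    sender: dict      # e.g. age, gender, occupation, relationship
    receiver: dict    # same keys as sender
    location: str
    activity: str     # activity in location


@dataclass(frozen=True)
class SmsLog:
    """Summary figures for one contributor's log, as listed on the slide."""
    contributor: str
    messages: int
    words: int
    days: int

    @property
    def words_per_message(self) -> float:
        return self.words / self.messages


# Figures copied from the slide; nothing else is assumed about the logs.
LOGS = [
    SmsLog("F1", 1870, 37772, 50),
    SmsLog("F2", 1505, 32137, 55),
    SmsLog("F3", 100, 1879, 20),
    SmsLog("F4", 52, 732, 2),
    SmsLog("M1", 165, 1674, 50),
]

total_messages = sum(log.messages for log in LOGS)
total_words = sum(log.words for log in LOGS)

if __name__ == "__main__":
    for log in LOGS:
        print(f"{log.contributor}: {log.words_per_message:.1f} words per SMS")
    print(total_messages, total_words)
```

Summing the five logs reproduces the stated 3692 SMSs and gives 74,194 words, consistent with "circa 74,000".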


NeLC: SNSs

§  2589 updates from 167 contributors in 1 user log (circa 31,300 words over 50 days)

§  Content extracted from the updates of 43 ‘celebrity’ contributors over a 50-day period: 16,857 updates in total (more than 250,000 words)

§  Both forms of logs include:
    §  Time and date submitted, contributor information (gender, age, occupation), network/location and content (i.e. the text)


NeLC: Other Text-based data

§  Blogs and feedback forums taken from newspaper and broadcasting websites: data has been extracted but not yet processed.

§  Possible plans for collecting email data, in collaboration with extended ‘a day in the life of your language’ data collection.

§  Texts require anonymising and (possibly) standardising? – also ‘separating’?


A Day in the Life of your Language

§  Associated metadata also recorded

§  Data requires transcription


Representing data- requirements

§  The ability to search data and metadata in a principled and specific way (encoded and/or transcribed text-based data), within and/or across the three global domains of data: devices/data type(s), time and/or location, and participants/given contributions (i.e. the production of ‘sub-corpora’)

§  Tools that allow for the frequency profiling of events/ elements within and across domains (providing raw counts, basic statistical analysis tools, and methods of graphing such)

§  New methods for drilling into the data, through mining specific relationships within and between domain(s). This may be comparable to current social networking software, mind maps or more topologically based methods

§  Graphing tools for mapping the incidence of words or events, for example, over time and for comparing sub-corpora and domain specific characteristics
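The first, second and fourth requirements above can be sketched in a few lines of Python. This is a minimal illustration only, not the DRS implementation: the `Item` record, its field names, the helper functions and the three sample items are all invented for the example.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Item:
    """One text item with metadata from the three domains (hypothetical fields)."""
    text: str
    data_type: str     # device / data type, e.g. "sms" or "sns"
    day: date          # time
    location: str      # location
    contributor: str   # participant


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sub_corpus(items, **criteria):
    """Select items whose metadata match every given field=value pair."""
    return [it for it in items
            if all(getattr(it, k) == v for k, v in criteria.items())]


def frequency_profile(items):
    """Raw word counts over a set of items."""
    counts = Counter()
    for it in items:
        counts.update(tokenize(it.text))
    return counts


def incidence_over_time(items, word):
    """Occurrences of one word per day, ready to hand to a plotting tool."""
    per_day = Counter()
    for it in items:
        per_day[it.day] += tokenize(it.text).count(word)
    return dict(sorted(per_day.items()))


# Invented toy data, for illustration only.
SAMPLE = [
    Item("see you at the park later", "sms", date(2010, 1, 1), "home", "A"),
    Item("at the park now", "sms", date(2010, 1, 2), "park", "A"),
    Item("great day at the park", "sns", date(2010, 1, 2), "park", "B"),
]

sms_only = sub_corpus(SAMPLE, data_type="sms")
sms_profile = frequency_profile(sms_only)
park_by_day = incidence_over_time(SAMPLE, "park")
```

Passing several criteria at once, for example `sub_corpus(SAMPLE, data_type="sms", location="park")`, selects within more than one domain in a single query.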


Future developments in DRS- mock-up 1


Future developments in DRS- mock-up 1


Feasibility Corpora: Thrill

§  A 55,000 word corpus of fairground discourse, comprised of synchronised records of audio, video and sensory (i.e. heart rate) data.
    §  55 participants (mainly recorded in pairs)
    §  19 women, 26 men
    §  Ages range from teens to late 50s
    §  Over 11 hours video


Feasibility Corpora: Thrill

§  Data has been transcribed and divided into 4 key phases:
    §  Pre-ride phase (i.e. walking around the theme park)
    §  The elevation of the ride
    §  Start of the ride
    §  Ride terminus

§  Aims:
    §  To examine whether any patterns emerge in specific language used within/across these phases.
    §  To outline and test an appropriate approach to the analysis of heterogeneous data sets for linguistic enquiry.


Feasibility Corpora: Thrill


[Figure: frequency-based word clouds for Phase 1, Phase 2, Phase 3 and Phase 4]
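Frequency-based word clouds of this kind rest on per-phase word counts. The sketch below shows one way such counts could be produced and normalised so that phases of different lengths are comparable. It is an assumption about the general technique, not the procedure used for the Thrill corpus; the phase labels, function names and toy utterances are invented for illustration.

```python
import re
from collections import Counter

# Labels for the four phases; the exact strings are an assumption.
PHASES = ("pre-ride", "elevation", "start of ride", "ride terminus")


def tokens(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def phase_frequencies(utterances):
    """Count words per phase from (phase, text) pairs."""
    counts = {phase: Counter() for phase in PHASES}
    for phase, text in utterances:
        counts[phase].update(tokens(text))
    return counts


def per_thousand(counter):
    """Normalise raw counts to occurrences per 1,000 words."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {word: 1000 * n / total for word, n in counter.items()}


# Invented toy utterances, not drawn from the Thrill corpus.
TOY = [
    ("pre-ride", "shall we go on this one"),
    ("elevation", "oh no oh no here we go"),
    ("ride terminus", "that was brilliant"),
]

freqs = phase_frequencies(TOY)
elevation_rates = per_thousand(freqs["elevation"])
```

The normalised rates for each phase can then be passed to whichever word-cloud or graphing tool is in use, with font size scaled by rate rather than by raw count.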


Next Steps

§  More detailed analyses/ comparisons of the data.

§  Extend the analyses, exploring more varied datasets from NeLC and ‘A day in the life of your language’.

§  Collect more data (using a wider cross section of different participants), using the mobile toolkit.


Acknowledgements

Research team

The Digital Records for e-Social Science Project is funded by the ESRC.