Page 1: Linguistic Research And  The CLARIN Infrastructure Jan Odijk

Linguistic Research And

The CLARIN Infrastructure

Jan Odijk

Digital Humanities Lecture, Utrecht 23 Oct 2012

Overview

• Introduction
• Basic Facts & Research Questions
• Do the Research
  – Consult Grammars
  – Select relevant data from multiple sources
  – Apply tools to enrich data
  – Analyze the data
• Conclusions

Introduction

• Suppose you’re a linguistic researcher in 1980 (no internet, no computers, …)
  – And libraries did not exist…
• I am a linguistic researcher in 2012
  – But no infrastructure for data and tools exists!
  – Though there are many data and tools
• CLARIN has as its main goal to remedy this

Basic Facts

• Heel, erg, and zeer are synonyms (‘very’)
• Zeer and erg can modify verbs, adjectival predicates and prepositional predicates
• Heel can only modify adjectival predicates
  – A: Hij is daar zeer/erg/heel blij mee (‘He is very happy with that’)
  – P: Hij is daar zeer/erg/*heel mee in zijn nopjes (‘He is very pleased with that’)
  – V: Dat verbaast ons zeer/erg/*heel (‘That surprises us very much’)

Basic Facts

• English very is like heel in these respects:
  – P: *He is very in love
  – A: He is very amorous
  – V: It surprised us very *(much)

Basic Facts

• Difference:
  – Not due to semantics
  – Purely syntactic
  – As far as we know: does not follow from a general rule
  – So it must be ‘learned’ by a child acquiring Dutch as first language

Research Question (1)

• How does a child acquiring Dutch as a first language get to ‘know’ that zeer and erg can modify verbs, prepositional and adjectival predicates?

Hypotheses (1)

• Hypothesis 1a
  – Once a word is encountered for the first time, a critical phase (‘training phase’) starts in which the word properties will be determined based on input; after this phase the word properties are fixed.
  – A sufficient number of actual examples occurring in this period sets the word properties (positive evidence)

Hypotheses (1)

• Hypothesis 2a
  – Once a word is encountered for the first time, its grammatical properties are initially set by Semantic Bootstrapping: D (semcat) -> syncat
  – A sufficient number of actual examples occurring in this period will add to the word properties (positive evidence)
  – A sufficient amount of input that contradicts the semantically bootstrapped properties overrules them

Research Question (2)

• How can a child acquiring Dutch as first language get to ‘know’ that heel cannot modify prepositional predicates and verbs?
  – Children are never taught that it is not possible
  – They are also never or seldom corrected for language errors, and if they are, they seem to ignore it (negative evidence plays no role)

Hypotheses (2)

• Hypothesis 1b
  – Absence of relevant constructions in the training phase of a word leads to absence of the property (indirect negative evidence)
• Hypothesis 2b
  – Absence of relevant constructions in the training phase of a word does not lead to absence of the property for semantically bootstrapped properties
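To make the contrast between the two families of hypotheses concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the window length, the threshold, the context labels, and the idea that semantic bootstrapping would initially license all three contexts for a degree modifier. It is not a model proposed in the lecture.

```python
from collections import Counter

# Illustrative assumptions only: labels, window and threshold are not from the lecture.
CONTEXTS = ("A", "P", "V")  # adjectival, prepositional, verbal predicates


def learn_1(inputs, window=20, threshold=2):
    """Hypotheses 1a/1b: only contexts attested often enough in the
    training phase are licensed; unattested contexts stay excluded."""
    counts = Counter(inputs[:window])
    return {c for c in CONTEXTS if counts[c] >= threshold}


def learn_2(inputs, bootstrapped, window=20, threshold=2):
    """Hypotheses 2a/2b: start from semantically bootstrapped contexts and
    add attested ones; absence of input does not remove a bootstrapped context.
    The overruling by contradictory input mentioned in 2a is not modelled."""
    return set(bootstrapped) | learn_1(inputs, window, threshold)


# Toy input for a word like heel: it is only ever heard with adjectival predicates.
heel_input = ["A"] * 10
print(sorted(learn_1(heel_input)))                         # ['A']
print(sorted(learn_2(heel_input, bootstrapped=CONTEXTS)))  # ['A', 'P', 'V']
```

Under these assumptions the first learner ends up with the adult pattern for heel, while the second keeps the bootstrapped contexts unless something overrules them. Which of the two (if either) matches real acquisition is exactly what the corpus queries later in the lecture are meant to help decide.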

Related Questions

• Do children ever make errors against this?
• Is a ‘training phase’ for word properties real?
• How ‘long’ is this training phase?
• What is a ‘sufficiently large’ number of actual examples?
• Does semantic bootstrapping play a role, and if so, which one?
• Are these words acquired in different language acquisition stages?

Related Questions

• Can this be related to the different modification potential?

• Is there a relation with the fact that zeer appears to be rather formal, while heel and erg are not?

Related Questions

• Adverb-adjective agreement (substandard):
  – heel/hele dikke boeken ‘very thick books’
  – erg/erge dikke boeken
  – zeer/*zere dikke boeken
  – Is this somehow related?
• What about other, closely related, words?

Consult Grammars

• Currently
  – Consult paper and electronic grammars
    • ANS and e-ANS, e.g. section 15-3-1-1
• In the near future
  – Consult Taalportaal with (I hope/expect)
    • All examples formally marked as such
    • All examples parsed/tagged, using ISOCAT DCs, and searchable
    • Links to (possibly complex) queries to illustrate with real data from treebanks and other annotated data

Find Data

• Which data and tools (language resources, LRs) exist that might contribute to answering these questions?
• Currently:
  – You have to search for them in multiple places
  – Many relevant data are not publicly visible (you will encounter them through personal contacts only)
  – Or you have to create them yourself

Find Data

• There is no place/site where you can query:
  – Give me a list of all LRs for the Dutch language
  – What is the size of all Dutch text corpora (in #tokens)?
  – Give me a list of all Dutch data that contain children 2-7 years old as speaker
  – Give me a list of all Dutch data containing any of the words heel, zeer, erg
• Not even in most individual data centres (TST-Centrale, ELRA, LDC, ...)
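If uniform metadata records existed for all resources, the first three queries above would reduce to simple filters. The sketch below assumes a made-up record type and a made-up catalogue. None of the field names, resource names or sizes come from an actual metadata scheme or data centre.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceRecord:
    # Hypothetical fields, chosen only to express the queries on the slide.
    name: str
    language: str                      # ISO 639-3 code, e.g. "nld"
    kind: str                          # e.g. "text corpus", "speech corpus", "lexicon"
    tokens: int                        # 0 if unknown or not applicable
    min_speaker_age: Optional[float] = None
    max_speaker_age: Optional[float] = None


def has_speakers_aged(rec, low, high):
    """True if the recorded speaker age range overlaps [low, high]."""
    if rec.min_speaker_age is None or rec.max_speaker_age is None:
        return False
    return rec.min_speaker_age <= high and rec.max_speaker_age >= low


# Invented catalogue: placeholder names and sizes.
catalogue = [
    ResourceRecord("toy-written-corpus", "nld", "text corpus", 1_000_000),
    ResourceRecord("toy-child-speech", "nld", "speech corpus", 50_000, 1.5, 6.0),
    ResourceRecord("toy-english-corpus", "eng", "text corpus", 2_000_000),
    ResourceRecord("toy-lexicon", "nld", "lexicon", 0),
]

dutch = [r for r in catalogue if r.language == "nld"]
dutch_text_tokens = sum(r.tokens for r in dutch if r.kind == "text corpus")
child_data = [r.name for r in dutch if has_speakers_aged(r, 2, 7)]

print([r.name for r in dutch])
print(dutch_text_tokens)
print(child_data)
```

The fourth query (resources containing heel, zeer or erg) needs access to the content as well as the metadata, which is why it is harder to support.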

Find Data

• CLARIN
  – Provides a flexible framework, incl. tools, for making descriptions of LRs (‘metadata’)
    • CMDI
  – Supports (assistance, execution, funding) the creation of metadata for LRs
  – Supports making these metadata (and the actual data) visible and accessible via CLARIN portals

Find Data

• CLARIN
  – Provides facilities for semantic interoperability
    • ISOCAT, Relation Registry (coming soon)
  – Provides browsing, searching and querying facilities for the metadata
    • Initial prototype: Virtual Language Observatory
  – Will enable you to collect the data that are relevant to you in a virtual collection
  – This will save the researcher a lot of time
  – It will enlarge the empirical basis for the research

Closely Related Words

• Find words that are closely related
  – Adverbs that function as an intensifier (‘booster’)
  – Are (near-)synonyms, hyponyms, or co-hyponyms
  – Also (near-)antonyms are relevant
• In order to determine their properties and potential further generalizations

Closely Related Words

• Using e.g.
  – Synonym information in traditional dictionaries
  – Dutch EuroWordnet (currently via ELRA M0016)
  – Or Cornetto (via the Dutch HLT-Agency)
• Currently searchable only via
  – a plug-in in an old version (3.5) of Firefox, or
  – in programs via a Python module
• A CLARIN-NL project aims to improve this

Closely Related Words

Found via synonym dictionaries:
• abnormaal afschuwelijk akelig bijster bijzonder bovenmatig buitengemeen buitensporig danig donders eminent enorm exceptioneel extra extraordinair extreem fabelachtig fenomenaal geweldig gigantisch intens kolossaal merkwaardig mirakels onbeschrijfelijk ongelofelijk ongehoord ongekend ongemeen onmenselijk onmetelijk ontzettend onwijs speciaal uitermate uiterst uitzonderlijk verdraaid verduiveld verrekte verschrikkelijk vet zeldzaam …

Closely Related Words

• zeer:adverb:3 / heel:adverb:5 (from Cornetto)
• zeer:3/d_r-343077, allemachtig:2/d_r-9922, beestachtig:2/d_r-23835, bijzonder:4/c_546765, bliksems:2/d_r-32612, bloedig:2/d_r-32881, bovenmate:1/d_r-36728, buitengewoon:2/d_r-39235, buitenmate:1/d_r-39294, buitensporig:2/d_r-401837, crimineel:4/d_a-53026, deerlijk:2/d_r-57321, deksels:2/d_r-57728, donders:2/d_r-62605, drommels:2/d_r-65820, eindeloos:3/c_546740, enorm:2/d_r-74285, erbarmelijk:2/d_r-74877, fantastisch:6/d_r-79264, formidabel:2/d_r-82704, geweldig:4/d_r-92392, goddeloos:2/d_r-94633, godsjammerlijk:2/d_r-94798, grenzeloos:2/d_r-96846, grotelijks:1/d_r-98244, heel:5/d_r-106880, ijselijk:2/d_r-118854, ijzig:4/c_546756, intens:2/d_r-123517, krankzinnig:3/d_r-142403, machtig:4/d_r-165866, mirakels:1/d_r-173095, monsterachtig:2/d_r-175264, moorddadig:4/d_r-175475, oneindig:2/d_r-193740, onnoemelijk:2/d_r-194761, ontiegelijk:2/d_r-415154, ontstellend:2/d_r-415165, ontzaglijk:2/d_r-415176, ontzettend:3/d_r-196906, onuitsprekelijk:2/d_r-415180, onvoorstelbaar:2/d_r-415191, onwezenlijk:2/d_r-197464, onwijs:4/d_r-197468, overweldigend:2/d_r-205004, peilloos:2/d_r-213144, reusachtig:3/d_r-239357, reuze:2/d_r-239379, schrikkelijk:2/d_r-256144, sterk:7/d_r-272639, uiterst:4/d_r-300933, verdomd:2/d_r-308293, verdraaid:4/c_546761, verduiveld:2/d_r-308522, verduveld:2/d_r-308569, verrekt:3/d_r-418644, verrot:3/d_r-418648, verschrikkelijk:3/d_r-312634, vervloekt:2/d_r-314372, vreselijk:5/d_r-323099, waanzinnig:2/d_r-329061, zeldzaam:2/d_r-419882, zwaar:10/d_r-347153
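A list in this shape is easy to turn into structured records for further querying. The sketch below assumes that each entry reads `lemma:sense-number/identifier`. That reading is inferred from the entries shown here, not taken from Cornetto documentation, and the names `SenseRef` and `parse_entries` are made up for the example.

```python
import re
from typing import NamedTuple


class SenseRef(NamedTuple):
    lemma: str
    sense: int      # assumed to be a sense number
    unit_id: str    # assumed to be a lexical-unit identifier


ENTRY = re.compile(r"^(?P<lemma>[^:/\s]+):(?P<sense>\d+)/(?P<unit_id>\S+)$")


def parse_entries(text):
    """Parse a comma-separated list of 'lemma:sense/id' entries."""
    refs = []
    for raw in text.split(","):
        item = raw.strip()
        if not item:
            continue
        m = ENTRY.match(item)
        if m is None:
            raise ValueError(f"unexpected entry: {item!r}")
        refs.append(SenseRef(m["lemma"], int(m["sense"]), m["unit_id"]))
    return refs


# A few entries copied from the list above.
sample = "zeer:3/d_r-343077, allemachtig:2/d_r-9922, bijzonder:4/c_546765, heel:5/d_r-106880"
refs = parse_entries(sample)
print([r.lemma for r in refs])
```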

Basic Facts: Correct?

• Check the basic facts
• Check against occurrences in corpora
  – Problem: each of the 3 words is ambiguous!
    • Erg (4x) = noun (de) ‘erg’; noun (het) ‘evil’; adj+adv ‘unpleasant’; adv ‘very’
    • Zeer (3x) = noun ‘pain’; adj ‘painful’; adv ‘very’
    • Heel (3x) = adj ‘whole’; verb form ‘heal’; adv ‘very’
  – A PoS-tagged corpus will help somewhat
    • But most corpora do not distinguish adj from adv by category! (searching for PoS bigrams will help slightly)
  – A fully parsed corpus would be ideal

Basic Facts: Correct?

– LASSY Small: 1M-word manually verified parsed corpus
– Interface to LASSY Small
  • Requires knowledge of XPATH/XQUERY
– Very Simple Interface to LASSY Small
  • Limited options but simple commands
– Example-based interface GrETEL (CLARIN-Flanders)
  • Greedy Extraction of Trees for Empirical Linguistics
  • Generates an XPATH/XQUERY expression on the basis of an example sentence plus markings of what is relevant in it
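To give an impression of what querying a parsed corpus with XPath involves, here is a small sketch using Python's standard library. The XML fragment is hand-made and heavily simplified. Its attribute names (`rel`, `cat`, `pt`, `lemma`, `word`) follow the general style of Alpino dependency structures, but should be treated as assumptions rather than the exact LASSY schema. With a full XPath engine the same question could be asked as `//node[node[@rel="mod" and @lemma="heel"]]/node[@rel="hd"]`. The code uses the limited XPath subset that `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET

# Hand-made, simplified tree for "Hij is heel blij"; not taken from LASSY.
SAMPLE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pt="vnw" lemma="hij" word="Hij"/>
      <node rel="hd" pt="ww" lemma="zijn" word="is"/>
      <node cat="ap" rel="predc">
        <node rel="mod" pt="adj" lemma="heel" word="heel"/>
        <node rel="hd" pt="adj" lemma="blij" word="blij"/>
      </node>
    </node>
  </node>
</alpino_ds>
"""


def modified_heads(root, lemma):
    """Return (head lemma, head pt) for every phrase in which `lemma` is a modifier."""
    results = []
    for phrase in root.iter("node"):
        mods = phrase.findall(f"./node[@rel='mod'][@lemma='{lemma}']")
        heads = phrase.findall("./node[@rel='hd']")
        if mods and heads:
            results.append((heads[0].get("lemma"), heads[0].get("pt")))
    return results


root = ET.fromstring(SAMPLE)
print(modified_heads(root, "heel"))  # [('blij', 'adj')]
```

Counting the second element of each pair over a whole treebank would give the kind of per-category statistics that the simple interface reports.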

Basic Facts: Correct?

– Queries: erg::mod:; zeer::mod:; heel::mod:
– Extract from Statistics (category of the modified head; WW = verb, BW = adverb):

        erg   zeer   heel
  ADJ   143    268    263
  WW     35     49      9
  BW      1      1      7

– Query: heel::mod:ww
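The raw counts are easier to compare as proportions per word. The short sketch below only re-expresses the numbers from the table. The variable and function names are made up for the example.

```python
# Counts copied from the statistics extract (category of the modified head).
counts = {
    "erg":  {"ADJ": 143, "WW": 35, "BW": 1},
    "zeer": {"ADJ": 268, "WW": 49, "BW": 1},
    "heel": {"ADJ": 263, "WW": 9,  "BW": 7},
}


def shares(row):
    """Turn absolute counts into proportions of the row total."""
    total = sum(row.values())
    return {cat: n / total for cat, n in row.items()}


for word, row in counts.items():
    pct = {cat: f"{100 * s:.1f}%" for cat, s in shares(row).items()}
    print(word, sum(row.values()), pct)
```

Verbs (WW) make up roughly 20% of the modified heads for erg and 15% for zeer, against about 3% for heel. Those few heel cases are the ones that need a closer look.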

Basic Facts: Correct?

• Analysis (of the 9 hits for heel::mod:ww)
  – 8 examples are forms that are ambiguous between adjectival and verbal participle
    • All are examples of adjectival participles, but LASSY represents all participles as verbal
  – In 1 example heel modifies the adj open from the expression open staan voor, but it is wrongly analyzed as modifying the verb staan
• CLARIN will offer facilities to make annotations to such corpora
• The same queries could be done
  – for the other related words
  – on the LASSY Large corpus (2.4 billion words, automatically parsed)
  – in the CGN corpus (but it uses a different interface)
• But this will require facilities for ‘batch jobs’ or more complicated queries (maybe via web services)

Acquisition Corpora: Search

• E.g. data in the CHILDES system (part of TalkBank)
  – 7 corpora for Dutch
  – But with their own data formats (CHAT) and tools (CLAN)
• However, also mirrored at MPI and accessible via (ANNEX/)TROVA (again another interface)

Acquisition Corpora: Search

• Give records for utterances containing erg with
  – Corpus (e.g. Van Kampen Corpus)
  – File (e.g. laura74.cha)
  – Line (e.g. 139)
  – Part Role (e.g. Child)
  – Child Gdr (e.g. female)
  – Age (e.g. 5;6.12)
  – UTT (e.g. “ja , die s erg moeilijk .”)
• Maybe also some preceding/following context
• Map attribute names and values to ISOCAT

Acquisition Corpora: Search

• Corpus: Van Kampen
• File: sarah21.cha
• Line: 630
• Speaker: Child
• Child Gender: Female
• Age: 2;7.16
• UTT: “prinses e(r)g groot !”
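Such a record maps naturally onto a small data structure. The sketch below stores the example record from the slide and splits the CHAT-style age notation (years;months.days) into numbers so that it can be filtered on. The class and function names are made up for the example, and the 30-day month in `age_in_months` is a rough approximation.

```python
import re
from dataclasses import dataclass

AGE = re.compile(r"^(\d+);(\d+)(?:\.(\d+))?$")


def parse_chat_age(age):
    """Split an age written as 'years;months.days' into three integers."""
    m = AGE.match(age)
    if m is None:
        raise ValueError(f"unrecognised age: {age!r}")
    return int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)


def age_in_months(age):
    years, months, days = parse_chat_age(age)
    return years * 12 + months + days / 30  # 30-day month: approximation


@dataclass
class UtteranceRecord:
    corpus: str
    file: str
    line: int
    speaker_role: str
    child_gender: str
    age: str
    utterance: str


# The example record shown on the slide.
rec = UtteranceRecord("Van Kampen", "sarah21.cha", 630, "Child", "Female",
                      "2;7.16", "prinses e(r)g groot !")
print(parse_chat_age(rec.age))  # (2, 7, 16)
```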

Acquisition Corpora: Search

• For each child, give a list of pairs session + age of the child
• For each child and each session, give #occurrences of zeer, heel, erg
• etc., etc.
• Such queries (some example attempts)
  – Mixed metadata/content search
  – Over multiple resources
  – Specific output formats
• are not so easy with the current interfaces!!
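Once utterances are available as records, the per-child, per-session counts are a few lines of code. The sketch below uses the two example utterances quoted earlier. The child labels are inferred from the file names and are an assumption. Stripping parentheses relies on the CHAT convention that parentheses mark material the speaker left out, so that `e(r)g` is counted as erg.

```python
import re
from collections import Counter, defaultdict

TARGETS = ("zeer", "heel", "erg")


def normalise(token):
    """Lowercase a token and drop CHAT-style parentheses: 'e(r)g' -> 'erg'."""
    return re.sub(r"[()]", "", token.lower())


def count_targets(records):
    """Map (child, session file) to a Counter of target-word occurrences."""
    table = defaultdict(Counter)
    for child, session, utterance in records:
        tokens = [normalise(t) for t in utterance.split()]
        for word in TARGETS:
            table[(child, session)][word] += tokens.count(word)
    return table


# Utterances from the slides; child labels are assumed from the file names.
records = [
    ("laura", "laura74.cha", "ja , die s erg moeilijk ."),
    ("sarah", "sarah21.cha", "prinses e(r)g groot !"),
]

table = count_targets(records)
for key, counter in sorted(table.items()):
    print(key, {w: counter[w] for w in TARGETS})
```

Without the normalisation step the second utterance would be missed, which is one small example of why content search over such corpora needs format-aware tools.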

Acquisition Corpora: Search

• Heel is found 153 times in the Van Kampen corpus
• Erg is found 77 times in the Van Kampen corpus
  – But many are an irrelevant use of erg
• PoS-tagging the corpus might be useful
  – Search for PoS bigrams (e.g. erg/adj */adj)
  – Add lemmas
• Or even full parsing, at least of the adult speech
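A PoS-bigram search of the form erg/adj */adj can be sketched as follows. The two sentences are hand-tagged toy examples, and the lowercase tags are an assumption modelled on the pattern on the slide rather than the output of a real tagger.

```python
def find_bigrams(tagged, word, word_tag, next_tag):
    """Return positions i where tagged[i] is `word` with `word_tag`
    and tagged[i + 1] carries `next_tag` (the '*' in 'erg/adj */adj')."""
    hits = []
    for i in range(len(tagged) - 1):
        (w1, t1), (_, t2) = tagged[i], tagged[i + 1]
        if w1.lower() == word and t1 == word_tag and t2 == next_tag:
            hits.append(i)
    return hits


# Hand-tagged toy sentences (assumed tags).
relevant = [("die", "vnw"), ("is", "ww"), ("erg", "adj"), ("moeilijk", "adj"), (".", "let")]
irrelevant = [("dat", "vnw"), ("is", "ww"), ("niet", "bw"), ("erg", "adj"), (".", "let")]

print(find_bigrams(relevant, "erg", "adj", "adj"))    # [2]
print(find_bigrams(irrelevant, "erg", "adj", "adj"))  # []
```

The second sentence (‘that is not bad’) contains erg as a predicate rather than as an intensifier, so the bigram filter discards it. Cases where an intensifying erg is followed by something other than an adjective would be missed, which is why full parsing remains the better option.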

Acquisition Corpora: Parse

• CLARIN-NL
  – Web services are being developed
    • For PoS-tagging text
    • For full parsing of text
    • (and many more)
  – To be usable by humanities researchers
  – in a user-friendly way in workflow systems
• Usefulness depends on
  – Size of the data (effort to select manually)
  – Quality of the web services

Store the found data

• The found and newly created data
  – should be stored in a supported format
  – with automatically generated metadata
  – with automatically generated provenance data
  – using data categories mapped to or from ISOCAT
  – for which PIDs are provided
  – stored on a server of a CLARIN centre
  – so that they
    • can become proper resources on their own
    • are visible, accessible and interpretable as part of enriched publications

Search in CGN / SONAR

• To assess level of formality
  – Give absolute and relative frequencies of heel/hele/erg/erge/zeer as adj by text genre and by speaker/participant education level
  – In CGN (spoken corpus)
  – In SONAR (written corpus)
  – Idem, but for the word + the following PoS tag
  – Idem, but in the fully parsed part of CGN and in LASSY + the PoS tag of the modifiee head
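The arithmetic behind "absolute and relative frequencies by genre" can be sketched as below. All numbers and genre names are invented solely to show the computation. They are not counts from CGN or SONAR and say nothing about the actual distribution of these words.

```python
def per_million(count, corpus_tokens):
    """Relative frequency, normalised to occurrences per million tokens."""
    return 1_000_000 * count / corpus_tokens


def frequency_table(genres, words):
    """For each genre: absolute count, per-million rate, and share among `words`."""
    rows = {}
    for genre, data in genres.items():
        total = sum(data[w] for w in words)
        rows[genre] = {
            w: {
                "abs": data[w],
                "per_million": per_million(data[w], data["tokens"]),
                "share": data[w] / total if total else 0.0,
            }
            for w in words
        }
    return rows


# Invented toy counts: placeholders, not corpus results.
genres = {
    "toy-genre-1": {"tokens": 200_000, "zeer": 60, "heel": 20, "erg": 20},
    "toy-genre-2": {"tokens": 400_000, "zeer": 20, "heel": 240, "erg": 140},
}

rows = frequency_table(genres, ("zeer", "heel", "erg"))
for genre, row in rows.items():
    print(genre, {w: round(v["per_million"]) for w, v in row.items()})
```

Normalising per million tokens makes genres of different sizes comparable. The share column shows how the three intensifiers divide the work within one genre.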

Interpret the data

• Interpret the data in light of the hypotheses being investigated
• Apply analytical / statistical tools to the data
  – CLARIN should support formats of frequently used statistical packages such as SPSS, R, etc.
• The research will surely lead to new questions, and so to new queries
• Reach conclusions and publish an open access enriched publication

Broaden the scope

• Do the same for worden/raken (‘become’/‘get’)
• NP, PP and AP can be predicate complements
• Worden and raken take predicate complements
• They are (almost) synonymous
• worden: takes only NP or AP
• raken: takes only AP or PP

Broaden the scope

• AP: Zij werd / raakte zwanger (‘She became pregnant’)
• PP: Zij *werd / raakte in verwachting (‘She became pregnant’, lit. ‘got in expectation’)
• NP: Zij werd / *raakte burgemeester (‘She became mayor’)

And repeat the process

Exercise

Conclusions

• There is no adequate infrastructure for linguistic research
• There are bits and pieces, but
  – Finding LRs is not easy
  – LRs have their own formats, data categories, user / search interfaces
  – Limited formal and no semantic interoperability
  – Search in combined LRs is very difficult if not impossible
• So the full research potential is not exploited
• CLARIN(-NL) attempts to remedy this

CLARIN-NL

Thanks for your attention!

http://www.clarin.nl/

No Entry!

Basic Facts: Correct?

• De omgang met de buren gebeurt op een heel ontspannen manier en de vrouw van de dominee heeft zelfs al Wolderse vlaai leren bakken .
  – heel:ADJ:mod:WW:ontspannen
• De verschijnselen zijn heel verschillend .
  – heel:ADJ:mod:WW:verschillend
• ,, Op het voorterrein ging het nog heel overtuigend .
  – heel:ADJ:mod:WW:overtuigend
• Ze hebben heel gericht en planmatig volkscafés bezocht om daar hun gif te spuien .
  – heel:ADJ:mod:WW:gericht
• Ze is zelfs met een ' meester ' getrouwd : Marc Dassesse _ mevrouw Spiritus-Dassesse zet heel geëmanicipeerd haar meisjesnaam voorop _ is nu een gerenomeerd fiscaal adviseur en hoogleraar aan de ULB .
  – heel:ADJ:mod:WW:geëmanicipeerd
• Gelukkig krijg ik nog heel geregeld te horen : ' Gerard jongen , dat doe je gewoon foùt ' .
  – heel:ADJ:mod:WW:geregeld
• Dat is een heel verrassend resultaat en het stemt tot optimisme .
  – heel:ADJ:mod:WW:verrassend
• De biermarkt is heel versnipperd en wordt overspoeld door nieuwe productlanceringen .
  – heel:ADJ:mod:WW:versnipperd
• Toch staan we hier heel open voor voorstellen .
  – heel:ADJ:mod:WW:staan

Metadata search: CGN + CHILDES, Dutch && 2 < age < 7

Regexp content search: heel|zeer|erg|erge|hele

Resultset export to file

CGN regexp: ^heel$|^erg$|…

CGN regexp on WORDS tier + POS
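The step from the plain alternation to the anchored pattern matters because an unanchored regular expression also matches inside longer words. The sketch below illustrates this with Python's `re` module on a handful of Dutch tokens. The anchored pattern uses the five words from the content-search example, since the pattern on the CGN slide is truncated. How the actual search tool applies its patterns is not known here, so this only shows the general effect.

```python
import re

# Word list taken from the content-search example above.
UNANCHORED = re.compile(r"heel|zeer|erg|erge|hele")
ANCHORED = re.compile(r"^(?:heel|zeer|erg|erge|hele)$")

tokens = ["heel", "geheel", "erg", "ergens", "hele", "zeer", "helemaal"]

loose = [t for t in tokens if UNANCHORED.search(t)]
strict = [t for t in tokens if ANCHORED.match(t)]

print(loose)   # every token matches, including geheel, ergens, helemaal
print(strict)  # ['heel', 'erg', 'hele', 'zeer']
```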

Exercise

• ‘Worden takes APs not PPs as predc’
• Use the LASSY-Small Very Simple Interface
• Give me all sentences in which the word “worden” takes a predicative (predc) PP complement:
  – rel='predc' and hlemma='worden' and postag='vz'
• Do you find examples with this query?
• How do you interpret this?