Foundations of Language Science and Technology
- Corpus Linguistics -
Silvia Hansen-Schirra
Outline
Why corpora, why interpreted corpora
Many types of annotation
- linguistic annotation
- non-linguistic annotation
New developments
Why corpora?
Cognition: models of human language processing
Engineering: language technology applications
Linguistics: linguistic theory
Empirical linguistics
corpus data
experimental psycholinguistic data
introspective data
DB of relevant data
research
Engineering motivation
● information extraction
● question-answering
● statistical machine translation
● parser training and evaluation
=> increased need for deeply annotated corpora
Cognitive motivation
● experience-oriented frequency-based models
● models of gradient grammaticality
● metrics of complexity
Resource description metadata
language: Spanish, English, German
sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
text sort(s): newspaper articles, wire news, political speech, control commands
subject domain: stock rates, flight reservations
type of producers: professional journalist, student, radiologist
mode of production: spoken, written, signed, morsed
medium of production: pencil, PC with MS Word, dictaphone
conditions of production: spontaneous, carefully composed, produced under time pressure
transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
medium of transmission: telephone, WWW, CB radio
storage encoding: raw ASCII code, HTML, AIFF
medium of storage: DAT tape, CD ROM, hard disk
mode of presentation: spoken, written, signed
medium of presentation: newspaper, radio, book, TV show, theater performance
type of intended recipients: newspaper reader, booking agent, theater audience
number of intended recipients: point-to-point, multicast, broadcast
synchronicity of discourse: synchronous dialogue, asynchronous
direction: one-way, two-way
Linguistic annotation
● part-of-speech tags
● word sense information
● morphosyntactic features of words
● constituent structures for phrases or sentences
● coreference markers
● dependency structures
● predicate-argument structures
● reference identifications for term phrases
● information structures within sentences
● intonation contours
● speech acts
● discourse relations - discourse structures
Other annotations
● judgements of native speakers on the acceptability or appropriateness of the utterance
● information on speaker(s)
● information on hearer(s) or intended audience
● information on the utterance situation (time, place, circumstances)
● information on the published source
● typographic information
● layout and document structure
● textual transcriptions of spoken utterances
● transcription of pauses
● error tagging
Raw vs. linguistically interpreted corpora
search term: word=form
...play a significant part in determining growth and form.
...each molecule can form four hydrogen bonds...
vs.
search term: word=form & pos=N
...play a significant part in determining growth and form.
search term: word=form & pos=V
...each molecule can form four hydrogen bonds...
search term: is *ed
Alpha interferon is produced by white blood cells...
search term: were *ed
In the late 1970s interferons were hailed as "wonder drugs"...
vs.
search term: pos=VB {0,1} pos=VVN
Gamma is not induced by viruses at all...
So interferons could be described as the antibiotics of the virus...
Only two of these have yet been identified...
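The contrast between the string queries and the tag query can be sketched in a few lines of Python. This is only an illustration: the three toy sentences are shortened from the examples above, the tags are assumed CLAWS-style labels (VBZ/VBN for forms of "be", VVN for past participles), and the function names are hypothetical, not part of any corpus tool.

```python
import re

# Toy corpus of (word, tag) pairs. The tags are assumed CLAWS-style labels,
# chosen to match the slide's query "pos=VB {0,1} pos=VVN".
SENTENCES = [
    [("Alpha", "NP0"), ("interferon", "NN1"), ("is", "VBZ"), ("produced", "VVN")],
    [("Gamma", "NP0"), ("is", "VBZ"), ("not", "XX0"), ("induced", "VVN")],
    [("two", "CRD"), ("have", "VHB"), ("yet", "AV0"), ("been", "VBN"), ("identified", "VVN")],
]


def raw_hits(sentences):
    """String search on the raw text, like the queries 'is *ed' / 'were *ed'."""
    pattern = re.compile(r"\b(?:is|were) \w+ed\b")
    return [i for i, sent in enumerate(sentences)
            if pattern.search(" ".join(word for word, _ in sent))]


def tagged_hits(sentences):
    """Tag search: a VB* token, at most one intervening token, then VVN."""
    hits = []
    for i, sent in enumerate(sentences):
        tags = [tag for _, tag in sent]
        for j, tag in enumerate(tags):
            # VVN must be the next token or the one after it.
            if tag.startswith("VB") and "VVN" in tags[j + 1:j + 3]:
                hits.append(i)
                break
    return hits


print(raw_hits(SENTENCES))     # indices of sentences found by the string query
print(tagged_hits(SENTENCES))  # indices of sentences found by the tag query
```

On this toy data the string query finds only the first sentence, while the tag query also finds the passives with an intervening "not" and with "been", which is the point the slide makes about interpreted corpora.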
Syntactically annotated corpora: treebanks
• German treebank project: TiGer Treebank
• English reference treebank: Penn Treebank
• Treebank + semantic information: Prague Dependency Bank
TiGer Treebank
Example sentence: Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen .

word         POS      morphology       lemma
Im           APPRART  Dat              in
nächsten     ADJA     Sup.Dat.Sg.Neut  nahe
Jahr         NN       Dat.Pl.Neut      Jahr
will         VMFIN    3.Sg.Pres.Ind    wollen
die          ART      Nom.Sg.Fem       die
Regierung    NN       Nom.Sg.Fem       Regierung
ihre         PPOSAT   Acc.Pl.Masc      ihr
Reformpläne  NN       Acc.Pl.Masc      Plan
umsetzen     VVINF    Inf              umsetzen
.            $.

node labels in the tree: S, VP, NP, NP, PP
edge labels in the tree: HD SB OC; HD OA MO; AC NK NK NK NK NK NK

Annotation layers shown in the figure:
● annotation on word level: part-of-speech, morphology, lemmata
● node labels: phrase categories
● edge labels: syntactic functions
● crossing branches for discontinuous constituency types
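What "crossing branches" means for the example sentence can be sketched as a check on token positions. The constituent yields below are an assumption read off the edge labels (MO, OA and HD attached to VP); the dictionary keys and the function name are hypothetical.

```python
# Tokens of the example sentence, indexed from 0.
TOKENS = ["Im", "nächsten", "Jahr", "will", "die", "Regierung",
          "ihre", "Reformpläne", "umsetzen", "."]

# Assumed yields (sets of token positions) for each phrase node.
# The VP is assumed to contain the PP (MO), the object NP (OA) and the verb (HD).
CONSTITUENTS = {
    "PP": {0, 1, 2},               # Im nächsten Jahr
    "NP-SB": {4, 5},               # die Regierung
    "NP-OA": {6, 7},               # ihre Reformpläne
    "VP": {0, 1, 2, 6, 7, 8},      # PP + NP-OA + umsetzen
}


def is_discontinuous(positions):
    """A constituent is discontinuous if its token positions contain a gap."""
    ordered = sorted(positions)
    return ordered[-1] - ordered[0] + 1 != len(ordered)


discontinuous = [label for label, span in CONSTITUENTS.items()
                 if is_discontinuous(span)]
print(discontinuous)
```

Under these assumptions only the VP has a gap (the finite verb and the subject sit inside it), so its branches cross those of S when the tree is drawn over the surface word order.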
Penn Treebank
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP (NP (CD 61) (NNS years) ) (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

Annotation layers shown in the tree:
● annotation on word level: part-of-speech
● phrase categories
● syntactic functions
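The three layers in the bracketed notation can be separated with a small reader. The following is a minimal sketch, not the official Penn Treebank tooling: `parse` and `layers` are hypothetical helper names, and splitting labels at the first hyphen is a simplifying assumption that is good enough for this example but not for the full tagset.

```python
import re

PTB = ("( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) "
       "(ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) "
       "(VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) "
       "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) "
       "(NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))")


def parse(text):
    """Read a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def walk(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from walk(child)


def layers(tree):
    """Split the annotation into POS tags, phrase categories and function tags."""
    pos_tags, phrases, functions = [], [], []
    for label, children in walk(tree):
        if not label:
            continue                  # the unlabeled outer bracket
        if len(children) == 1 and isinstance(children[0], str):
            pos_tags.append((children[0], label))
        else:
            category, _, function = label.partition("-")
            phrases.append(category)
            if function:
                functions.append(label)
    return pos_tags, phrases, functions


pos_tags, phrases, functions = layers(parse(PTB))
print(pos_tags[:3])
print(phrases)
print(functions)
```

Pre-terminal nodes give the word-level part-of-speech layer, the part of a label before the hyphen gives the phrase category, and the suffix (SBJ, CLR, TMP) carries the syntactic function.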
Prague Dependency Bank
Labels in the example figure, given as word (gloss): labels
chce (wants): Sb
Kdo (who): Sb, ACT.T
investovat (to-invest): Obj, ACT.VOL.T
ste (hundred): Obj, RESTR.F
korun (crowns): Atr, PAT.F
do (to): AuxP
automobilu (car): Adv, DIR.F

Annotation layers shown in the figure:
● annotation on word level: lemmata, morphology
● syntactic functions
● dependency structure
● semantic information on constituent roles, theme/rheme, etc.
New developments
● historical dimension (e.g., Corpus of the History of German Language)
● multilayer stand-off linguistic markup
● multimodal markup/interpretation
● new types of treebanks:
  ● CS treebanks with dependency links (NEGRA, TIGER)
  ● machine-annotated corpora for statistical training (e.g., Redwoods Treebank)
  ● Dependency (Tree)Banks (Prague, PARC)
  ● Grammatical Relation (Tree)Banks (Briscoe & Carroll)
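The idea behind multilayer stand-off markup, mentioned in the list above, can be sketched as follows: the primary text stays untouched, and each annotation layer is stored separately as spans that point into it by character offset. The class, layer names and offsets here are illustrative assumptions, not the format of any particular corpus.

```python
from dataclasses import dataclass

# Primary data: never modified by annotation.
TEXT = "Pierre Vinken will join the board"


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset into TEXT
    end: int     # exclusive character offset into TEXT
    label: str


# Each layer is kept separately and refers to TEXT only by offsets,
# so layers can be added, replaced or overlap without touching each other.
LAYERS = {
    "pos": [Span(0, 6, "NNP"), Span(7, 13, "NNP"), Span(14, 18, "MD")],
    "syntax": [Span(0, 13, "NP-SBJ"), Span(24, 33, "NP")],
}


def text_of(span):
    """Recover the stretch of primary text a span points to."""
    return TEXT[span.start:span.end]


def labels_at(offset):
    """Collect, per layer, the labels of all spans covering a character offset."""
    return {layer: [s.label for s in spans if s.start <= offset < s.end]
            for layer, spans in LAYERS.items()}


print(text_of(LAYERS["syntax"][1]))
print(labels_at(8))
```

Because the layers only share offsets, a new layer (for example coreference or prosody) could be added as another dictionary entry without changing the text or the existing layers.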