Foundations of Language Science and Technology

- Corpus Linguistics -

Silvia Hansen-Schirra


Outline

Why corpora, why interpreted corpora

Many types of annotation
- linguistic annotation
- non-linguistic annotation

New developments

Why corpora?

Cognition: models of human language processing

Engineering: language technology applications

Linguistics: linguistic theory

Empirical linguistics

● corpus data
● experimental psycholinguistic data
● introspective data

DB of relevant data

research

Engineering motivation

● information extraction
● question-answering
● statistical machine translation
● parser training and evaluation

=> increased need for deeply annotated corpora

Cognitive motivation

● experience-oriented frequency-based models
● models of gradient grammaticality
● metrics of complexity

Resource description metadata

language: Spanish, English, German

sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech

text sort(s): newspaper articles, wire news, political speech, control commands

subject domain: stock rates, flight reservations,

type of producers: professional journalist, student, radiologist

mode of production: spoken, written, signed, morsed

medium of production: pencil, PC with MS Word, dictaphone

conditions of production: spontaneous, carefully composed, produced under time pressure

transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode

medium of transmission: telephone, WWW, CB radio

storage encoding: raw ASCII code, HTML, AIFF

medium of storage: DAT tape, CD ROM, hard disk

mode of presentation: spoken, written, signed

medium of presentation: newspaper, radio, book, tv show, theater performance,

type of intended recipients: newspaper reader, booking agent, theater audience

number of intended recipients: point-to-point, multicast, broadcast

synchronicity of discourse: synchronous dialogue, asynchronous

direction: one-way, two-way
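
A minimal sketch of how a few of the metadata fields above could be stored and used to select resources. The class name `ResourceMetadata`, the helper `select`, the choice of five fields, and the three sample records are all hypothetical; they only combine values that appear in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceMetadata:
    """Hypothetical record holding a few of the metadata fields listed above."""
    language: str
    register: str
    text_sort: str
    mode_of_production: str
    storage_encoding: str


def select(resources, **criteria):
    """Return the resources whose fields equal all of the given criteria."""
    return [
        r for r in resources
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Illustrative records only; they do not describe real corpora.
corpora = [
    ResourceMetadata("German", "professional jargon", "newspaper articles", "written", "HTML"),
    ResourceMetadata("English", "vernacular", "political speech", "spoken", "AIFF"),
    ResourceMetadata("Spanish", "regional dialect", "wire news", "written", "raw ASCII code"),
]

written = select(corpora, mode_of_production="written")
print([r.language for r in written])
```

Keeping the metadata apart from the text itself lets the same query run over any collection of resources that are described in the same way.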

Linguistic annotation

● part-of-speech tags
● word sense information
● morphosyntactic features of words
● constituent structures for phrases or sentences
● coreference markers
● dependency structures
● predicate-argument structures
● reference identifications for term phrases
● information structures within sentences
● intonation contours
● speech acts
● discourse relations - discourse structures

Other annotations

● judgements of native speakers on the acceptability or appropriateness of the utterance
● information on speaker(s)
● information on hearer(s) or intended audience
● information on the utterance situation (time, place, circumstances)
● information on the published source
● typographic information
● layout and document structure
● textual transcriptions of spoken utterances
● transcription of pauses
● error tagging

Raw vs. linguistically interpreted corpora

search term: word=form
...play a significant part in determining growth and form.
...each molecule can form four hydrogen bonds...

vs.

search term: word=form & pos=N
...play a significant part in determining growth and form.

search term: word=form & pos=V
...each molecule can form four hydrogen bonds...
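
The contrast between the two kinds of search can be sketched in a few lines. The tags `N` and `V` follow the queries above; the other tags (`CONJ`, `DET`, `NUM`), the hand-tagged fragments, and the function name `search` are assumptions made for illustration and do not reproduce any particular query tool.

```python
# Each token is a (word, pos) pair; both fragments are tagged by hand.
tagged = [
    [("growth", "N"), ("and", "CONJ"), ("form", "N")],
    [("each", "DET"), ("molecule", "N"), ("can", "V"), ("form", "V"),
     ("four", "NUM"), ("hydrogen", "N"), ("bonds", "N")],
]


def search(sentences, word, pos=None):
    """Return indices of sentences containing `word`, optionally restricted to `pos`."""
    hits = []
    for i, sentence in enumerate(sentences):
        if any(w == word and (pos is None or p == pos) for w, p in sentence):
            hits.append(i)
    return hits


print(search(tagged, "form"))        # raw search: both fragments
print(search(tagged, "form", "N"))   # noun reading only
print(search(tagged, "form", "V"))   # verb reading only
```

Without the part-of-speech layer the two readings of "form" cannot be told apart; with it, each query returns exactly one fragment.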

Raw vs. linguistically interpreted corpora

search term: is *ed
Alpha interferon is produced by white blood cells...

search term: were *ed
In the late 1970s interferons were hailed as "wonder drugs"...

vs.

search term: pos=VB {0,1} pos=VVN
Gamma is not induced by viruses at all...
So interferons could be described as the antibiotics of the virus...
Only two of these have yet been identified...
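
A rough sketch of why the tag-based query finds passives that the string-based queries miss. It assumes that `pos=VB` matches any tag beginning with `VB` and that `{0,1}` allows at most one intervening token; the tags on the fragments are assigned by hand in a CLAWS-like style, and the last fragment is added only as a non-matching control.

```python
import re

raw = [
    "Alpha interferon is produced by white blood cells",
    "Gamma is not induced by viruses at all",
    "Only two of these have yet been identified",
]

# String search: literal "is" or "were" directly followed by a word ending in "ed".
raw_pattern = re.compile(r"\b(?:is|were) \w+ed\b")
raw_hits = [s for s in raw if raw_pattern.search(s)]

# Hand-tagged fragments (tags are assumptions, not corpus output).
tagged = [
    [("interferon", "NN1"), ("is", "VBZ"), ("produced", "VVN")],
    [("Gamma", "NN1"), ("is", "VBZ"), ("not", "XX0"), ("induced", "VVN")],
    [("have", "VHB"), ("yet", "AV0"), ("been", "VBN"), ("identified", "VVN")],
    [("molecule", "NN1"), ("can", "VM0"), ("form", "VVI"), ("bonds", "NN2")],
]


def matches_passive(sentence):
    """True if a VB* tag is followed by VVN with at most one token in between."""
    tags = [tag for _, tag in sentence]
    for i, tag in enumerate(tags):
        if tag.startswith("VB"):
            for j in (i + 1, i + 2):  # gap of 0 or 1 tokens
                if j < len(tags) and tags[j] == "VVN":
                    return True
    return False


tag_hits = [i for i, sentence in enumerate(tagged) if matches_passive(sentence)]
print(len(raw_hits), tag_hits)
```

The string pattern only finds the first sentence, while the tag pattern also covers the negated form and the participle "been".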

Syntactically annotated corpora: treebanks

• German treebank project: TiGer Treebank
• English reference treebank: Penn Treebank
• Treebank + semantic information: Prague Dependency Bank

TiGer Treebank

Example: Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen.

word         POS      morphology        lemma
Im           APPRART  Dat               in
nächsten     ADJA     Sup.Dat.Sg.Neut   nahe
Jahr         NN       Dat.Pl.Neut       Jahr
will         VMFIN    3.Sg.Pres.Ind     wollen
die          ART      Nom.Sg.Fem        die
Regierung    NN       Nom.Sg.Fem        Regierung
ihre         PPOSAT   Acc.Pl.Masc       ihr
Reformpläne  NN       Acc.Pl.Masc       Plan
umsetzen     VVINF    Inf               umsetzen
.            $.

node labels: S, VP, NP, NP, PP
edge labels: HD, SB, OC; HD, OA, MO; AC, NK, NK, NK, NK, NK, NK

Annotation layers of the TiGer Treebank:

● annotation on word level: part-of-speech, morphology, lemmata
● node labels: phrase categories
● edge labels: syntactic functions
● crossing branches for discontinuous constituency types
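
The layers of the TiGer example can be sketched as a small labelled graph. The node and edge labels are the ones that appear in the figure, but their attachment (which edge belongs to which node) is reconstructed by hand and should be read as an assumption; the node ids `pp`, `np1`, `np2`, `vp`, `s` and the function names are hypothetical.

```python
tokens = ["Im", "nächsten", "Jahr", "will", "die", "Regierung",
          "ihre", "Reformpläne", "umsetzen"]

# node id -> (phrase category, [(edge label, child)]).
# A child is either a token index (int) or another node id (str).
nodes = {
    "pp":  ("PP", [("AC", 0), ("NK", 1), ("NK", 2)]),
    "np1": ("NP", [("NK", 4), ("NK", 5)]),
    "np2": ("NP", [("NK", 6), ("NK", 7)]),
    "vp":  ("VP", [("MO", "pp"), ("OA", "np2"), ("HD", 8)]),
    "s":   ("S",  [("HD", 3), ("SB", "np1"), ("OC", "vp")]),
}


def yield_of(node):
    """Sorted token indices dominated by a node id or a token index."""
    if isinstance(node, int):
        return [node]
    _, children = nodes[node]
    return sorted(i for _, child in children for i in yield_of(child))


def is_discontinuous(node):
    """True if the tokens under the node do not form one contiguous span."""
    span = yield_of(node)
    return span != list(range(span[0], span[-1] + 1))


for name, (category, _) in nodes.items():
    words = [tokens[i] for i in yield_of(name)]
    print(category, words, is_discontinuous(name))
```

Under this reconstruction the VP covers "Im nächsten Jahr" and "ihre Reformpläne umsetzen" but not the material in between, which is the kind of discontinuous constituent that crossing branches are meant to encode.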

Penn Treebank

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

Annotation layers of the Penn Treebank:

● annotation on word level: part-of-speech
● phrase categories
● syntactic functions
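
The three layers can be read off the bracketing mechanically. The following is a minimal sketch in plain Python, not the reader shipped with any treebank toolkit; the function names `parse` and `walk` are hypothetical, and the split of labels such as `NP-SBJ` at the first hyphen is a simplifying assumption that is sufficient for this sentence.

```python
BRACKETING = (
    "( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) "
    "(ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) "
    "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) "
    "(NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"
)


def parse(text):
    """Parse a bracketing into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                                   # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])       # a word
                pos += 1
        pos += 1                                   # consume ")"
        return (label, children)

    return node()


def walk(tree):
    """Yield every (label, children) node in preorder."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from walk(child)


pos_tags, categories, functions = [], [], []
for label, children in walk(parse(BRACKETING)):
    if not label:
        continue                                   # unlabelled outer bracket
    if all(isinstance(c, str) for c in children):
        pos_tags.append((children[0], label))      # word level: part-of-speech
    else:
        category, _, function = label.partition("-")
        categories.append(category)                # phrase category
        if function:
            functions.append((category, function)) # syntactic function tag

print(pos_tags[:3])
print(functions)
```

In this format the function tags are suffixes on the phrase labels, so no separate edge layer is needed to recover them.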

Prague Dependency Bank

Example: Kdo chce investovat sto korun do automobilu

word         gloss      labels
Kdo          who        Sb, ACT.T
chce         wants      Sb
investovat   to-invest  Obj, ACT.VOL.T
sto          hundred    Obj, RESTR.F
korun        crowns     Atr, PAT.F
do           to         AuxP
automobilu   car        Adv, DIR.F

Annotation layers of the Prague Dependency Bank:

● annotation on word level: lemmata, morphology
● syntactic functions
● dependency structure
● semantic information on constituent roles, theme/rheme, etc.
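
A dependency structure of this kind is often stored as one head reference per word. The sketch below uses the words and glosses of the Czech example; the head assignments are not recoverable from the transcript and are an assumption made for illustration, as are the names `deps`, `dependents` and `path_to_root`.

```python
# (form, gloss, index of the head word, or None for the root).
# The head indices are assumed, not taken from the treebank.
deps = [
    ("Kdo", "who", 1),
    ("chce", "wants", None),
    ("investovat", "to-invest", 1),
    ("sto", "hundred", 2),
    ("korun", "crowns", 3),
    ("do", "to", 2),
    ("automobilu", "car", 5),
]


def dependents(head):
    """Forms of all words directly governed by the word at index `head`."""
    return [form for form, _, h in deps if h == head]


def path_to_root(index):
    """Forms on the path from a word up to the root of the tree."""
    path = []
    while index is not None:
        path.append(deps[index][0])
        index = deps[index][2]
    return path


print(dependents(1))
print(path_to_root(6))
```

Unlike the phrase-structure treebanks above, this representation has no phrase nodes: every node in the tree is a word.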


New developments

● historical dimension (e.g., Corpus of the History of German Language)

● multilayer stand-off linguistic markup

● multimodal markup/interpretation

● new types of treebanks:
  ● CS treebanks with dependency links (NEGRA, TIGER)
  ● machine-annotated corpora for statistical training (e.g., Redwoods Treebank)
  ● Dependency (Tree)Banks (Prague, PARC)
  ● Grammatical Relation (Tree)Banks (Briscoe & Carroll)
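
The idea behind multilayer stand-off markup, listed above, can be sketched as follows: the text is left untouched and each annotation layer is kept separately, pointing into the text by character offsets. The layer names, the offsets, and the helper `covered` are hypothetical and do not follow any particular stand-off standard.

```python
text = "Pierre Vinken will join the board"

# Each layer is a list of (start, end, label) with end exclusive.
layers = {
    "pos": [(0, 6, "NNP"), (7, 13, "NNP"), (14, 18, "MD"),
            (19, 23, "VB"), (24, 27, "DT"), (28, 33, "NN")],
    "phrase": [(0, 13, "NP"), (24, 33, "NP"), (14, 33, "VP")],
}


def covered(layer, start, end):
    """Annotations of one layer that lie inside the span [start, end)."""
    return [(text[s:e], label)
            for s, e, label in layers[layer]
            if start <= s and e <= end]


# Combine two layers: list the part-of-speech tags inside every phrase.
for s, e, label in layers["phrase"]:
    print(label, repr(text[s:e]), covered("pos", s, e))
```

Because the layers only share offsets, a new layer can be added, or an existing one replaced, without touching the text or the other layers.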