nlp. introduction to nlp background –from the early ‘90s –developed at the university of...

21
NLP

Upload: katherine-atkinson

Post on 18-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

NLP

Page 2: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Introduction to NLP

The Penn Treebank

Page 3: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Description

• Background– From the early ‘90s– Developed at the University of Pennsylvania– (Marcus, Santorini, and Marcinkiewicz 1993)

• Size– 40,000 training sentences– 2400 test sentences

• Genre– Mostly Wall Street Journal news stories and some spoken conversations

• Importance– Helped launch modern automatic parsing methods

Page 4: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

External links

• Treebank-3– http://catalog.ldc.upenn.edu/LDC99T42

• Original version– http://catalog.ldc.upenn.edu/LDC95T7

• Tokenization guidelines– http://www.cis.upenn.edu/~treebank/tokenization.html

• The American National Corpus– http://www.anc.org

Page 5: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Tag Description Example CC coordinating conjunction and CD cardinal number 1, third DT determiner the EX existential there there is FW foreign word d‘oeuvre IN preposition/subordinating conjunction in, of, like JJ adjective green JJR adjective, comparative greener JJS adjective, superlative greenest LS list marker 1) MD modal could, will NN noun, singular or mass table NNS noun plural tables NNP proper noun, singular John NNPS proper noun, plural Vikings PDT predeterminer both the boys POS possessive ending friend's

Penn Treebank tagset (1/2)

Page 6: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Tag Description Example PRP personal pronoun I, he, it PRP$ possessive pronoun my, his RB adverb however, usually, naturally, here, good RBR adverb, comparative betterRBS adverb, superlative best RP particle give up TO to to go, to him UH interjection uhhuhhuhh VB verb, base form take VBD verb, past tense took VBG verb, gerund/present participle taking VBN verb, past participle taken VBP verb, sing. present, non-3d take VBZ verb, 3rd person sing. present takes WDT wh-determiner which WP wh-pronoun who, what WP$ possessive wh-pronoun whose WRB wh-abverb where, when

Penn Treebank tagset (2/2)

Page 7: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Example sentence

• WSJ/12/WSJ_1273.MRG, sentence 11 • Because the CD had an effective yield of 13.4 % when

it was issued in 1984 , and interest rates in general had declined sharply since then , part of the price Dr. Blumenfeld paid was a premium -- an additional amount on top of the CD 's base value plus accrued interest that represented the CD 's increased market value .

Page 8: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

(S (SBAR-PRP (IN Because) (S (S (NP-SBJ (DT the) (NNP CD)) (VP (VBD had) (NP (NP (DT an) (JJ effective) (NN yield)) (PP (IN of) (NP (CD 13.4) (NN %)))) (SBAR-TMP (WHADVP-4 (WRB when)) (S (NP-SBJ-1 (PRP it)) (VP (VBD was) (VP (VBN issued) (NP (-NONE- *-1)) (PP-TMP (IN in) (NP (CD 1984))) (ADVP-TMP (-NONE- *T*-4)))))))) ...

Parsed sentence

Page 9: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

(S (SBAR-PRP (IN Because) (S (S (NP-SBJ (DT the) (NNP CD)) (VP (VBD had) (NP (NP (DT an) (JJ effective) (NN yield)) (PP (IN of) (NP (CD 13.4) (NN %)))) (SBAR-TMP (WHADVP-4 (WRB when)) (S (NP-SBJ-1 (PRP it)) (VP (VBD was) (VP (VBN issued) (NP (-NONE- *-1)) (PP-TMP (IN in) (NP (CD 1984))) (ADVP-TMP (-NONE- *T*-4)))))))) (, ,) (CC and) (S (NP-SBJ (NP (NN interest) (NNS rates)) (PP (IN in) (ADJP (JJ general)))) (VP (VBD had) (VP (VBN declined) (ADVP-MNR (RB sharply)) (PP-TMP (IN since) (NP (RB then))))))))

(VP (VBD was) (NP-PRD (NP (DT a) (NN premium)) (: --) (NP (NP (NP (DT an) (JJ additional) (NN amount)) (PP-LOC (IN on) (NP (NP (NN top)) (PP (IN of) (NP (NP (DT the) (NNP CD) (POS 's)) (NN base) (NN value)))))) (CC plus) (NP (VBN accrued) (NN interest))) (SBAR (WHNP-2 (WDT that)) (S (NP-SBJ (-NONE- *T*-2)) (VP (VBD represented) (NP (NP (DT the) (NNP CD) (POS 's)) (VBN increased) (NN market) (NN value))))))) (. .))

(, ,) (NP-SBJ (NP (NN part)) (PP (IN of) (NP (NP (DT the) (NN price)) (SBAR (WHNP-3 (-NONE- 0)) (S (NP-SBJ (NNP Dr.) (NNP Blumenfeld)) (VP (VBD paid) (NP (-NONE- *T*-3))))))))

Page 10: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

(S (SBAR-PRP (IN Because) (S (S (NP-SBJ (DT the) (NNP CD)) (VP (VBD had) (NP (NP (DT an) (JJ effective) (NN yield)) (PP (IN of) (NP (CD 13.4) (NN %)))) (SBAR-TMP (WHADVP-4 (WRB when)) (S (NP-SBJ-1 (PRP it)) (VP (VBD was) (VP (VBN issued) (NP (-NONE- *-1)) (PP-TMP (IN in) (NP (CD 1984))) (ADVP-TMP (-NONE- *T*-4)))))))) (, ,) (CC and) (S (NP-SBJ (NP (NN interest) (NNS rates)) (PP (IN in) (ADJP (JJ general)))) (VP (VBD had) (VP (VBN declined) (ADVP-MNR (RB sharply)) (PP-TMP (IN since) (NP (RB then))))))))

(VP (VBD was) (NP-PRD (NP (DT a) (NN premium)) (: --) (NP (NP (NP (DT an) (JJ additional) (NN amount)) (PP-LOC (IN on) (NP (NP (NN top)) (PP (IN of) (NP (NP (DT the) (NNP CD) (POS 's)) (NN base) (NN value)))))) (CC plus) (NP (VBN accrued) (NN interest))) (SBAR (WHNP-2 (WDT that)) (S (NP-SBJ (-NONE- *T*-2)) (VP (VBD represented) (NP (NP (DT the) (NNP CD) (POS 's)) (VBN increased) (NN market) (NN value))))))) (. .))

(, ,) (NP-SBJ (NP (NN part)) (PP (IN of) (NP (NP (DT the) (NN price)) (SBAR (WHNP-3 (-NONE- 0)) (S (NP-SBJ (NNP Dr.) (NNP Blumenfeld)) (VP (VBD paid) (NP (-NONE- *T*-3))))))))

Page 11: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Peculiarities

• Complementizers– e.g., “that”

• Gaps– *NONE*

• SBAR– SBAR COMP S– E.g., “that *NONE* represented the CD’ market value”

Page 12: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

tgrep

A < B A immediately dominates BA << B A dominates BA <- B B is the last child of AA <<, B B is a leftmost descendant of AA <<` B B is a rightmost descendant of AA . B A immediately precedes BA .. B A precedes BA $ B A and B are sistersA $. B A and B are sisters and A immediately precedes BA $.. B A and B are sisters and A precedes B

Page 13: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

The use of treebanks

• Disadvantages– A lot more work to annotate 40K+ sentences than to write a grammar.

• Advantages– Statistics about different constituents and phenomena– Training systems– Evaluating systems– Multilingual extensions

Page 14: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Introduction to NLP

Parsing evaluation

Page 15: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Evaluation methodology (1/2)• Classification tasks

– Document retrieval– Part of speech tagging– Parsing

• Data split– Training– Dev-test– Test

Page 16: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Evaluation methodology (2/2)• Baselines

– Dumb baseline– Intelligent baseline– Human performance (ceiling)

• New method• Evaluation methods

– Accuracy– Precision and Recall

• Multiple references– Interjudge agreement

Page 17: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Kappa

• Agreement vs. expected agreement– P(A) is the level of agreement of the judges– P(E) is the expected probability of agreement by chance

• When k > .7 – agreement is considered high• Question

– Judge agreement on a binary classification task is 60%, is this high?

)(1

)()(

EP

EPAP

Page 18: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Answer

• Data– P(A) = .6– P(E) = .5

• Kappa– k = .1/.5 = .2– not high

)(1

)()(

EP

EPAP

Page 19: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Parsing evaluation

• Precision and recall– get the proper constituents

• Labeled precision and recall– also get the correct non-terminal labels

• F1– harmonic mean of precision and recall

• Crossing brackets– (A (B C)) vs ((A B) C)

• PTB corpus– training 02-21, development 22, test 23

Page 20: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

Evaluation example

GOLD = (S (NP (DT The) (JJ Japanese) (JJ industrial) (NNS companies)) (VP (MD should) (VP (VB know) (ADVP (JJR better)))) (. .)

Bracketing Recall = 80.00Bracketing Precision = 66.67Bracketing FMeasure = 72.73Complete match = 0.00No crossing = 100.00Tagging accuracy = 87.50

CHAR = (S (NP (DT The) (JJ Japanese) (JJ industrial) (NNS companies)) (VP (MD should) (VP (VB know)) ((ADVP (RBR better)))) (. .))

Page 21: NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size

NLP