9/13/1999 JHU CS 600.465/Jan Hajic: Introduction to Natural Language Processing, AI-Lab, 2003.09.03

Page 1:

Introduction to Natural Language Processing

AI-Lab

2003.09.03

Page 2:

Intro to NLP

• Instructor: Jan Hajič (Visiting Assistant Professor)
  – CS Dept., JHU; office: NEB 324A
  – Hours: Mon 10-11, Tue 3-4
  – Preferred contact: by e-mail
• Teaching Assistant: Gideon Mann
  – CS Dept., JHU; office: NEB 332
  – Hours: TBA
• Room: NEB 36, MTW 2-3 (50 mins.)

Page 3:

Textbooks you need

• Manning, C. D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press. 1999. ISBN 0-262-13360-1. [required; on order]
• Allen, J.: Natural Language Understanding. The Benjamin/Cummings Publishing Co. 1994. ISBN 0-8053-0334-0. [required; available]
• Wall, L. et al.: Programming Perl. 2nd ed. O'Reilly. 1996. ISBN 1-56592-149-6. [recommended; available (main store)/on order]

Page 4:

Other reading

• Charniak, E.:
  – Statistical Language Learning. The MIT Press. 1996. ISBN 0-262-53141-0.
• Cover, T. M., Thomas, J. A.:
  – Elements of Information Theory. Wiley. 1991. ISBN 0-471-06259-6.
• Jelinek, F.:
  – Statistical Methods for Speech Recognition. The MIT Press. 1998. ISBN 0-262-10066-5.
• Proceedings of major conferences:
  – ACL (Assoc. of Computational Linguistics)
  – EACL (European Chapter of ACL)
  – ANLP (Applied NLP)
  – COLING (Intl. Committee on Computational Linguistics)

Page 5:

Course requirements

• Grade components (requirements & weights):
  – Class participation: 7%
  – Midterm: 20%
  – Final Exam: 25%
  – Homeworks (4): 48%
• Exams:
  – approx. 15 questions:
    • mostly explanatory answers (1/4 page or so)
    • only a few multiple-choice questions

Page 6:

Homeworks

• Homeworks:
  1. Entropy, Language Modeling
  2. Word Classes
  3. Classification (POS Tagging, ...)
  4. Parsing (syntactic)
• Organization:
  – a few paper-and-pencil exercises, lots of programming
  – strict deadlines (2 pm on the due date); only one homework may be at most 5 days late; turning-in mechanism: TBA
  – absolutely no plagiarism

Page 7:

Course segments

• Intro & Probability & Information Theory (3)
  – the very basics: definitions, formulas, examples
• Language Modeling (3)
  – n-gram models, parameter estimation
  – smoothing (EM algorithm)
• A Bit of Linguistics (3)
  – phonology, morphology, syntax, semantics, discourse
• Words and the Lexicon (3)
  – word classes, mutual information, a bit of lexicography

Page 8:

Course segments (cont.)

• Hidden Markov Models (3)
  – background, algorithms, parameter estimation
• Tagging: Methods, Algorithms, Evaluation (8)
  – tagsets, morphology, lemmatization
  – HMM tagging, transformation-based, feature-based
• NL Grammars and Parsing: Data, Algorithms (9)
  – grammars and automata, deterministic parsing
  – statistical parsing: algorithms, parameterization, evaluation
• Applications (MT, ASR, IR, Q&A, ...) (4)

Page 9:

NLP: The Main Issues

• Why is NLP difficult?
  – many “words”, many “phenomena” --> many “rules”
    • OED: 400k words; Finnish lexicon (of forms): ~2 · 10^7
    • sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
  – irregularity (exceptions, exceptions to the exceptions, ...)
    • potato -> potatoes (tomato, hero, ...); photo -> photos; and even both: mango -> mangos or -> mangoes
    • Adjective/Noun order: new book, electrical engineering, general regulations, flower garden, garden flower, ...; but: Governor General
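The irregular plurals above (potato, photo, mango) resist a single rule: they can only be captured as a default rule plus exception lists. A minimal sketch, with invented word lists chosen purely for illustration:

```python
# Sketch of rules-plus-exceptions for English -o plurals, mirroring the
# potato/photo/mango irregularities mentioned above. The word lists are
# illustrative assumptions, not an exhaustive lexicon.

EXCEPTIONS = {"potato": "potatoes", "tomato": "tomatoes", "hero": "heroes"}
EITHER = {"mango": ("mangos", "mangoes")}  # both forms are attested

def pluralize(noun):
    """Naive pluralizer: exception lists first, then the default +s rule."""
    if noun in EXCEPTIONS:
        return EXCEPTIONS[noun]
    if noun in EITHER:
        return EITHER[noun]   # returns the tuple of attested variants
    return noun + "s"         # default: photo -> photos

print(pluralize("potato"))  # potatoes
print(pluralize("photo"))   # photos
```

Every exception list like this must be maintained by hand, which is exactly the "many rules, many exceptions" problem the slide points out.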

Page 10:

Difficulties in NLP (cont.)

• ambiguity:
  – books: NOUN or VERB?
    • you need many books vs. she books her flights online
  – No left turn weekdays 4-6 pm / except transit vehicles (Charles Street at Cold Spring)
    • when may transit vehicles turn: always? never?
  – Thank you for not smoking, drinking, eating or playing radios without earphones. (MTA bus)
    • Thank you for not eating without earphones??
    • or even: Thank you for not drinking without earphones!?
  – My neighbor's hat was taken by the wind. He tried to catch it.
    • ... catch the wind, or ... catch the hat?

Page 11:

(Categorical) Rules or Statistics?

• Preferences:
  – clear cases: context clues: she books --> books is a verb
    • rule: if an ambiguous word (verb/non-verb) is preceded by a matching personal pronoun -> the word is a verb
  – less clear cases: pronoun reference
    • she/he/it refers to the most recent noun or pronoun (?) (but maybe we can specify exceptions)
  – selectional:
    • catching hat >> catching wind (but why not?)
  – semantic:
    • never thank for drinking in a bus! (but what about the earphones?)
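The categorical pronoun-precedes-verb rule above is simple enough to sketch directly. The pronoun and ambiguous-word lists below are illustrative assumptions, not part of the course materials:

```python
# Sketch of the categorical rule: if an ambiguous verb/non-verb word is
# preceded by a personal pronoun, tag it as a verb; otherwise fall back
# to the noun reading. Word lists are invented for illustration.

PERSONAL_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}
AMBIGUOUS = {"books", "flies", "runs"}  # words that can be noun or verb

def tag(tokens):
    """Return (word, tag) pairs using the pronoun-precedes-verb heuristic."""
    tagged = []
    for i, word in enumerate(tokens):
        if word.lower() in AMBIGUOUS:
            prev = tokens[i - 1].lower() if i > 0 else None
            tagged.append((word, "VERB" if prev in PERSONAL_PRONOUNS else "NOUN"))
        else:
            tagged.append((word, "OTHER"))
    return tagged

print(tag("she books her flights online".split()))
# ('books', 'VERB'): preceded by the pronoun 'she'
print(tag("you need many books".split()))
# ('books', 'NOUN'): preceded by 'many', not a pronoun
```

This handles the clear cases on the slide; the less clear cases (pronoun reference, selectional and semantic preferences) are exactly where such categorical rules run out and statistics take over.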

Page 12:

Solutions

• Don't guess if you know:
  – morphology (inflections)
  – lexicons (lists of words)
  – unambiguous names
  – perhaps some (really) fixed phrases
  – syntactic rules?
• Use statistics (based on real-world data) for preferences (only?)
• No doubt: but this is the big question!

Page 13:

Statistical NLP

• Imagine:
  – Each sentence W = { w1, w2, ..., wn } gets a probability P(W|X) in a context X (think of it in the intuitive sense for now)
  – For every possible context X, sort all the imaginable sentences W according to P(W|X)
  – Ideal situation: the best sentence is the most probable one in context X (NB: the same holds for interpretation)

[Figure: sentences sorted by P(W); “ungrammatical” sentences at the low end]
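The sorting idea can be illustrated with a toy model: the bigram probabilities below are invented for illustration and stand in for the real P(W|X); candidate sentences are then ranked by log P(W):

```python
# Toy sketch of ranking candidate sentences W by probability. An
# assumed bigram table P(w_i | w_{i-1}) plays the role of P(W|X);
# "<s>" marks the sentence start.
import math

BIGRAM = {
    ("<s>", "she"): 0.2, ("she", "books"): 0.1, ("books", "flights"): 0.05,
    ("<s>", "books"): 0.01, ("books", "she"): 0.001, ("she", "flights"): 0.001,
}
FLOOR = 1e-6  # tiny probability for unseen bigrams

def log_prob(sentence):
    """log P(W) = sum_i log P(w_i | w_{i-1}) under the toy bigram model."""
    words = ["<s>"] + sentence.split()
    return sum(math.log(BIGRAM.get((p, w), FLOOR)) for p, w in zip(words, words[1:]))

candidates = ["she books flights", "books she flights"]
best = max(candidates, key=log_prob)  # the most probable sentence wins
print(best)  # she books flights
```

Sorting every imaginable sentence this way is the "ideal situation" on the slide; in practice we only ever search for the argmax, as the next slide notes.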

Page 14:

Real World Situation

• Unable to specify the set of grammatical sentences today using fixed “categorical” rules (maybe never; cf. arguments in MS)
• Use a statistical “model” based on REAL WORLD DATA and care about the best sentence only (disregarding the “grammaticality” issue)

[Figure: P(W) over sentences ranked from Wbest to Wworst; pick the best sentence]