
Page 1

I256: Applied Natural Language Processing

Marti Hearst
Oct 9, 2006

Page 2

Today

Finish Conditional Probabilities and Bayesian Learning
Intro to Classification; Identification of Language and Author

Page 3 (slide adapted from Dan Jurafsky)

Conditional Probability

A way to reason about the outcome of an experiment based on partial information

In a word guessing game, the first letter of the word is a “t”. What is the likelihood that the second letter is an “h”?
How likely is it that a person has a disease, given that a medical test was negative?
A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?

Page 4 (slides adapted from Mary Ellen Califf)

Conditional Probability

Conditional probability specifies the probability of an event given that the values of some other random variables are known.

P(Sneeze | Cold) = 0.8
P(Cold | Sneeze) = 0.6

The probability of a sneeze given a cold is 80%.
The probability of a cold given a sneeze is 60%.

Page 5 (slide adapted from Dan Jurafsky)

More precisely

Given an experiment, a corresponding sample space S, and the probability law.
Suppose we know that the outcome is within some given event B.

The first letter was ‘t’

We want to quantify the likelihood that the outcome also belongs to some other given event A.

The second letter will be ‘h’

We need a new probability law that gives us the conditional probability of A given B.
P(A|B): “the probability of A given B”

Page 6 (slides adapted from Mary Ellen Califf)

Joint Probability Distribution

The joint probability distribution for a set of random variables X1…Xn gives the probability of every combination of values

P(X1,...,Xn)

          Sneeze   ¬Sneeze
  Cold     0.08     0.01
  ¬Cold    0.01     0.90

The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution. All conditional probabilities can therefore also be calculated, e.g.:

P(Cold | ¬Sneeze)
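For example, here is a minimal Python sketch (using the joint table above) that computes P(Cold | ¬Sneeze) by summing the relevant entries and dividing:

```python
# Joint distribution over (Cold, Sneeze) from the table above.
joint = {
    ("cold", "sneeze"): 0.08,
    ("cold", "no_sneeze"): 0.01,
    ("no_cold", "sneeze"): 0.01,
    ("no_cold", "no_sneeze"): 0.90,
}

# Marginal P(¬Sneeze): sum over every entry with Sneeze = no_sneeze.
p_no_sneeze = sum(p for (cold, sneeze), p in joint.items() if sneeze == "no_sneeze")

# Conditional: P(Cold | ¬Sneeze) = P(Cold, ¬Sneeze) / P(¬Sneeze).
p_cold_given_no_sneeze = joint[("cold", "no_sneeze")] / p_no_sneeze

print(round(p_cold_given_no_sneeze, 3))  # 0.01 / 0.91 ≈ 0.011
```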

Page 7 (slide adapted from Dan Jurafsky)

An intuition

• Let’s say A is “it’s raining”.
• Let’s say P(A) in dry California is .01.
• Let’s say B is “it was sunny ten minutes ago”.
• P(A|B) means:

• “What is the probability of it raining now if it was sunny 10 minutes ago?”

• P(A|B) is probably way less than P(A).
• Perhaps P(A|B) is .0001.

• Intuition: The knowledge about B should change our estimate of the probability of A.

Page 8 (slide adapted from Dan Jurafsky)

Conditional Probability

Let A and B be events.
P(A,B) and P(A ∩ B) both mean “the probability that BOTH A and B occur”.

P(B|A) = the probability of event B occurring, given that event A occurs.
Definition: P(A|B) = P(A ∩ B) / P(B)

P(A, B) = P(A|B) * P(B) (simple arithmetic)
P(A, B) = P(B, A)

P(A|B) = P(A, B) / P(B)

Page 9

Bayes Theorem

We start with conditional probability definition:

So say we know how to compute P(A|B). What if we want to figure out P(B|A)? We can re-arrange the formula using Bayes Theorem:

P(A|B) = P(A, B) / P(B)

P(B|A) = P(A|B) P(B) / P(A)

Page 10 (slide adapted from Dan Jurafsky)

Deriving Bayes Rule

P(A|B) = P(A ∩ B) / P(B)
P(B|A) = P(A ∩ B) / P(A)

P(A|B) P(B) = P(A ∩ B)
P(B|A) P(A) = P(A ∩ B)

P(A|B) P(B) = P(B|A) P(A)

P(A|B) = P(B|A) P(A) / P(B)
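To make the rearranged rule concrete, here is a small Python sketch that answers the earlier slide's question about a disease given a negative medical test; the prior and test accuracies are made-up numbers, used only to illustrate the formula:

```python
# Hypothetical numbers (not from the slides), purely to illustrate Bayes' rule.
p_disease = 0.01            # prior P(disease)
p_neg_given_disease = 0.05  # P(negative test | disease), i.e. false-negative rate
p_neg_given_healthy = 0.95  # P(negative test | no disease)

# Total probability of a negative test.
p_neg = (p_neg_given_disease * p_disease
         + p_neg_given_healthy * (1 - p_disease))

# Bayes' rule: P(disease | negative) = P(negative | disease) P(disease) / P(negative)
p_disease_given_neg = p_neg_given_disease * p_disease / p_neg

print(round(p_disease_given_neg, 5))  # ≈ 0.00053
```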

Page 11 (slides adapted from Mary Ellen Califf)

How to compute probabilities?

We don’t have the probabilities for most NLP problems.
We can try to estimate them from data.

(That’s the learning part.)

Usually we can’t actually estimate the probability that something belongs to a given class given the information about it.
BUT we can estimate the probability that something in a given class has particular values.

Page 12 (slides adapted from Mary Ellen Califf)

Simple Bayesian Reasoning

If we assume there are n possible disjoint tags, t1 … tn

P(ti | w) = P(w | ti) P(ti) / P(w)

Want to know the probability of the tag given the word.

P(w | ti) = the number of times we see this word with this tag, divided by how often we see the tag:

P(w | ti) = count(word w with tag ti) / count(tag ti in corpus)

P(ti) = count(tag ti in corpus) / count(all tags in corpus)

P(w) = count(word w in corpus) / count(all words in corpus)
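A minimal sketch of these count-based estimates over a toy tagged corpus (the corpus, words, and tag names below are illustrative):

```python
from collections import Counter

# Toy tagged corpus: (word, tag) pairs. Illustrative data only.
corpus = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
          ("the", "DT"), ("run", "NN"), ("ended", "VBD")]

word_tag_counts = Counter(corpus)                  # count(word w with tag t)
tag_counts = Counter(tag for _, tag in corpus)     # count(tag t)
word_counts = Counter(word for word, _ in corpus)  # count(word w)
total = len(corpus)                                # count of all tokens

def p_tag_given_word(tag, word):
    """P(t | w) = P(w | t) P(t) / P(w), estimated from counts."""
    p_word_given_tag = word_tag_counts[(word, tag)] / tag_counts[tag]
    p_tag = tag_counts[tag] / total
    p_word = word_counts[word] / total
    return p_word_given_tag * p_tag / p_word

print(p_tag_given_word("NN", "run"))  # 1.0 in this toy corpus
```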

Page 13

Some notation

∏i P(fi | Sentence)

This means that you multiply all the feature probabilities together:

P(f1| S) * P(f2 | S) * … * P(fn | S)

There is a similar notation, Σ, for summation.

Page 14

Naïve Bayes Classifier

The simpler version of Bayes was:

P(B|A) = P(A|B) P(B) / P(A)
P(Sentence | feature) = P(feature | Sentence) P(Sentence) / P(feature)

Using Naïve Bayes, we expand the number of features by defining a joint probability distribution:

P(Sentence, f1, f2, … fn) = P(Sentence) ∏i P(fi | Sentence)

We learn P(Sentence) and P(fi| Sentence) in training

Test: we need to state P(Sentence | f1, f2, … fn)

P(Sentence| f1, f2, … fn) =

P(Sentence, f1, f2, … fn) / P(f1, f2, … fn)
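A minimal Naïve Bayes sketch of this train/test setup, assuming binary features and already-estimated probabilities (the class names, feature names, and numbers below are illustrative, not from the slides):

```python
import math

# Illustrative, already-estimated parameters for a two-class problem
# (e.g. sentence is / is not a summary sentence), with binary features f1..f3.
priors = {"in_summary": 0.2, "not_in_summary": 0.8}   # P(class), learned in training
likelihoods = {                                       # P(f_i = 1 | class), learned in training
    "in_summary":     {"f1": 0.7, "f2": 0.6, "f3": 0.4},
    "not_in_summary": {"f1": 0.2, "f2": 0.3, "f3": 0.4},
}

def posterior(features):
    """P(class | f1..fn) = P(class) * prod_i P(f_i | class) / P(f1..fn)."""
    joint = {}
    for cls, prior in priors.items():
        log_p = math.log(prior)
        for name, value in features.items():
            p = likelihoods[cls][name]
            log_p += math.log(p if value else 1.0 - p)  # independence assumption
        joint[cls] = math.exp(log_p)                    # P(class, f1..fn)
    evidence = sum(joint.values())                      # P(f1..fn)
    return {cls: j / evidence for cls, j in joint.items()}

print(posterior({"f1": 1, "f2": 1, "f3": 0}))  # in_summary ≈ 0.64, not_in_summary ≈ 0.36
```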

Page 15 (slides adapted from Mary Ellen Califf)

Bayes Independence Example

If there are many kinds of evidence, we need to combine them.
By assuming independence, we ignore the possible interactions:

Imagine there are diagnoses ALLERGY, COLD, and WELL.
Symptoms: SNEEZE, COUGH, and FEVER.

Prob             Well    Cold    Allergy
P(d)             0.9     0.05    0.05
P(sneeze | d)    0.1     0.9     0.9
P(cough | d)     0.1     0.8     0.7
P(fever | d)     0.01    0.7     0.4

Page 16 (slides adapted from Mary Ellen Califf)

Bayes Independence Example (continued)

If the symptoms are sneeze & cough & no fever:

P(well | s, c, ¬f) = P(e | well) P(well) / P(e)
  = P(s | well) * P(c | well) * (1 - P(f | well)) * P(well) / P(e)
  = (0.1)(0.1)(0.99)(0.9) / P(e) = 0.0089 / P(e)

P(cold | e) = (0.05)(0.9)(0.8)(0.3) / P(e) = 0.01 / P(e)
P(allergy | e) = (0.05)(0.9)(0.7)(0.6) / P(e) = 0.019 / P(e)

P(e) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well | e) = 0.23
P(cold | e) = 0.26
P(allergy | e) = 0.50

Diagnosis: allergy
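The same diagnosis computed as a short Python sketch, using the table from the previous slide:

```python
# Probability table from the previous slide.
priors   = {"well": 0.9,  "cold": 0.05, "allergy": 0.05}
p_sneeze = {"well": 0.1,  "cold": 0.9,  "allergy": 0.9}
p_cough  = {"well": 0.1,  "cold": 0.8,  "allergy": 0.7}
p_fever  = {"well": 0.01, "cold": 0.7,  "allergy": 0.4}

# Evidence e: sneeze and cough observed, fever absent.
unnormalized = {d: p_sneeze[d] * p_cough[d] * (1 - p_fever[d]) * priors[d]
                for d in priors}

p_e = sum(unnormalized.values())  # P(e), the normalizing constant
posteriors = {d: v / p_e for d, v in unnormalized.items()}

# ≈ well 0.23, cold 0.28, allergy 0.49 (the slide's 0.26 and 0.50 come from
# rounding the intermediate products before normalizing).
print(max(posteriors, key=posteriors.get))  # allergy
```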


Page 17

Kupiec et al. Feature Representation

Fixed-phrase feature: certain phrases indicate a summary, e.g. “in summary”.

Paragraph feature: paragraph initial/final sentences are more likely to be important.

Thematic word feature: repetition is an indicator of importance.

Uppercase word feature: uppercase often indicates named entities. (Taylor)

Sentence length cut-off: a summary sentence should be > 5 words.

Page 18

Details: Bayesian Classifier

Assuming statistical independence:

P(s ∈ S | F1, F2, …, Fk) = P(F1, F2, …, Fk | s ∈ S) · P(s ∈ S) / P(F1, F2, …, Fk)

                         = ∏ j=1..k P(Fj | s ∈ S) · P(s ∈ S) / ∏ j=1..k P(Fj)

P(s ∈ S | F1, …, Fk): probability that sentence s is included in summary S, given that sentence’s feature-value pairs

P(Fj | s ∈ S): probability of a feature-value pair occurring in a source sentence which is also in the summary

P(s ∈ S): compression rate

P(Fj): probability of a feature-value pair occurring in a source sentence
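As a rough illustration of how the formula above could be used to score sentences (the feature probabilities here are made up for the sketch, not Kupiec et al.'s trained values):

```python
# Made-up, illustrative probabilities; in Kupiec et al. these are estimated from
# a training corpus of documents paired with human-written summaries.
p_in_summary = 0.2  # P(s in S): the compression rate
p_feature_given_summary = {"fixed_phrase": 0.30, "paragraph_initial": 0.50, "long_enough": 0.90}
p_feature = {"fixed_phrase": 0.10, "paragraph_initial": 0.25, "long_enough": 0.80}

def summary_score(features_present):
    """P(s in S | F1..Fk) under the independence assumption in the formula above."""
    score = p_in_summary
    for f in features_present:
        score *= p_feature_given_summary[f] / p_feature[f]
    return score

# Sentences are ranked by this score and the top-scoring ones go into the summary.
print(summary_score({"fixed_phrase", "long_enough"}))  # 0.2 * (0.3/0.1) * (0.9/0.8) = 0.675
```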

Page 19

Language Identification

Page 20

Language identification

Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza.

Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen.

Universal Declaration of Human Rights, UN, in 363 languages
http://www.unhchr.ch/udhr/navigate/alpha.htm

Page 21

Language identification

égaux eguali iguales

edistämään

Ü  ¿

How do we determine, for a stretch of text, which language it is from?

Page 22

Language Identification

Turns out to be really simple.
Just a few character bigrams can do it (Sibun & Reynar 96).

Used Kullback-Leibler distance (relative entropy).
Compare the probability distribution of the test set to those for the languages trained on.
The smallest distance determines the language.
Using special character sets helps a bit, but barely.
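A minimal sketch of this approach over character-bigram distributions with a smoothed KL divergence (the tiny training snippets and the smoothing constant are illustrative stand-ins, not Sibun & Reynar's setup):

```python
import math
from collections import Counter

def char_bigrams(text):
    """Character-bigram counts for a piece of text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def kl_distance(test_text, train_text, alpha=0.5):
    """KL(test || train) over character bigrams, with add-alpha smoothing
    so unseen bigrams do not cause division by zero."""
    test, train = char_bigrams(test_text), char_bigrams(train_text)
    vocab = set(test) | set(train)
    test_total = sum(test.values()) + alpha * len(vocab)
    train_total = sum(train.values()) + alpha * len(vocab)
    distance = 0.0
    for bg in vocab:
        p = (test[bg] + alpha) / test_total    # test-set distribution
        q = (train[bg] + alpha) / train_total  # trained language distribution
        distance += p * math.log(p / q)
    return distance

# Tiny illustrative training samples (drawn from the UDHR excerpts above);
# a real system would train on much more text per language.
training = {
    "italian": "tutti gli esseri umani nascono liberi ed eguali in dignita e diritti",
    "german": "alle menschen sind frei und gleich an wuerde und rechten geboren",
}

test = "essi sono dotati di ragione e di coscienza"
print(min(training, key=lambda lang: kl_distance(test, training[lang])))  # expected: italian
```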

Page 23

Language Identification (Sibun & Reynar 96)

Page 24

Confusion Matrix

A table that shows, for each class, which items your algorithm got right and which it got wrong.

[Matrix axes: algorithm’s guess vs. gold standard]
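A small sketch of building such a table from parallel lists of gold-standard labels and the algorithm's guesses (the labels below are illustrative):

```python
from collections import defaultdict

def confusion_matrix(gold_labels, guesses):
    """matrix[gold][guess] = how many items with that gold label got that guess."""
    matrix = defaultdict(lambda: defaultdict(int))
    for gold, guess in zip(gold_labels, guesses):
        matrix[gold][guess] += 1
    return matrix

# Illustrative language-identification outputs.
gold_labels = ["italian", "german", "italian", "french", "german"]
guesses     = ["italian", "german", "french",  "french", "german"]

m = confusion_matrix(gold_labels, guesses)
print(m["italian"]["french"])  # 1: one Italian text was misidentified as French
print(m["german"]["german"])   # 2: both German texts were identified correctly
```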

Page 25

Page 26

Author Identification (Stylometry)

Page 27

Author Identification

Also called Stylometry in the humanities

An example of a Classification Problem

Classifiers: decide which of N buckets to put an item in.
(Some classifiers allow for multiple buckets.)

Page 28

The Disputed Federalist Papers

In 1787-1788, Jay, Madison, and Hamilton wrote a series of anonymous essays to convince the voters of New York to ratify the new U.S. Constitution.
Scholars agree that:

5 were authored by Jay
51 were authored by Hamilton
14 were authored by Madison
3 were written jointly by Hamilton and Madison

12 remain in dispute … Hamilton or Madison?

Page 29

[Figure: first page]

Author identification

Federalist papers: in 1963, Mosteller and Wallace solved the problem.

They identified function words as good candidates for authorship analysis.

Using statistical inference, they concluded the author was Madison.

Since then, other statistical techniques have supported this conclusion.

Page 30

Function vs. Content Words

High rates for “by” favor M; low rates favor H.
High rates for “from” favor M; low rates say little.
High rates for “to” favor H; low rates favor M.

Page 31

Function vs. Content Words

No consistent pattern for “war”

Page 32

Federalist Papers Problem

Fung, “The Disputed Federalist Papers: SVM Feature Selection via Concave Minimization,” ACM TAPIA ’03

Page 33

Discussion

“Can Pseudonymity Really Guarantee Privacy?”
Rao and Rohatgi, 2000

Page 34

Next Time

Guest lecture by Elizabeth Charnock and Steve Roberts of Cataphora