
1

I256: Applied Natural Language Processing

Marti Hearst, Oct 4, 2006

2

Today

Discourse-based summarization
Multi-document summarization
Intro to probability

This will get us towards understanding the Naïve Bayes method used in the Kupiec et al. paper. It will take a couple of class sessions.

3

From lecture notes by Nachum Dershowitz & Dan Cohen

Rhetorical Structure Theory

One of many theories, but an important and popular one (Mann & Thompson '88)

Descriptive theory of text organization
Relations between pairs of text spans

nucleus & satellite
nucleus & nucleus

Relations such as: Background, Elaboration, Concession ("Even though")

4

From lecture notes by Nachum Dershowitz & Dan Cohen

Rhetorical Structure Theory

5

Slide adapted from Drago Radev

[Diagram: an RST analysis of a short passage about Martian weather. Nuclei are drawn solid, satellites dotted. Relations shown include Background, Justification, Elaboration, Example, Concession, Antithesis, Contrast, Evidence, and Cause. The numbered text segments:]

(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
(2) Mars experiences frigid weather conditions
(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles
(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
(5) but any liquid water formed in this way would evaporate almost instantly
(6) because of the low atmospheric pressure
(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
(8) most Martian weather involves blowing dust and carbon monoxide.
(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.

6

RST for Summarization (Marcu 99)

Do they work in theory?
Got 13 people to summarize several texts
Compared this to the sentence fragments that would be produced by a perfect RST structure

– Give more weight to nodes higher up that are not parenthetical

– This did quite well

Then compared this to what happens when the algorithm finds the relations

– Was better than MS Word
– MS Word not much better than random(!)
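To make the node-weighting idea above concrete, here is a rough sketch (my own illustration with assumed data structures, not Marcu's actual algorithm or code) of scoring elementary units by how close to the root they are promoted as nuclei, so that units attached high in the RST tree score best:

```python
# A rough sketch (assumed representation; not Marcu's implementation): score each
# elementary unit by the highest tree level at which it is promoted as a nucleus.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Span:
    text: str

@dataclass
class Relation:
    name: str
    nucleus: Union["Relation", Span]
    satellite: Optional[Union["Relation", Span]] = None

def salience(node, depth=0, scores=None):
    """Smaller depth = promoted closer to the root = more salient."""
    if scores is None:
        scores = {}
    if isinstance(node, Span):
        scores[node.text] = min(scores.get(node.text, depth), depth)
        return scores
    salience(node.nucleus, depth, scores)            # a nucleus keeps its parent's level
    if node.satellite is not None:
        salience(node.satellite, depth + 1, scores)  # a satellite is demoted one level
    return scores

tree = Relation("Background",
                nucleus=Relation("Elaboration",
                                 nucleus=Span("Mars experiences frigid weather conditions"),
                                 satellite=Span("Surface temperatures average about -60 C")),
                satellite=Span("With its distant orbit and slim atmospheric blanket"))

ranked = sorted(salience(tree).items(), key=lambda kv: kv[1])
print([text for text, _ in ranked])  # most salient unit first
```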

7

Multi-Document Summarization

Image from http://blogs.salon.com/0001424/2005/07/27.html

8

Multi-Document Summarization

What’s different from single-doc?
What properties do we want it to have?

9

Types of MD Summaries

Single event/person tracked over a long time period
Elizabeth Taylor’s bout with pneumonia
Give extra weight to character/event

Multiple events of a similar nature
Marathon runners and races
More broad brush, ignore dates

An issue with related events
Gun control
Identify key concepts and select sentences accordingly

10

Multi-Document Summarization

Group similar documents together, select representative ones
Get good coverage of the main points
Minimize redundancy among selected passages
Overall cohesiveness and coherence
Include appropriate context
Show changes over time
Identify inconsistencies among stories

11

MDS by Goldstein et al.

Algorithm:
Segment documents into passages
Use a query to identify relevant passages
Apply the MMR-MD metric to select from these

– Define desired length in advance; this will be used for computing redundancy and similarity measures

Assemble selected passages using a cohesion measure

12

MDS by Goldstein et al. 2000

MMR-MD metric
Compute a balance between two measures

SIM1 computes for each passage:
– Similarity between query and passage
– Coverage score (is it in one or more clusters; how big are the clusters it is in?)
– Content measure (not really defined; appears similar to features used in Kupiec et al.)
– Relative timestamp among the passages

SIM2 computes:
– Similarity between this passage and those already selected for inclusion
– Similarity between this passage and the clusters that already selected passages are in
– Whether or not another passage from this document is already selected, taking document length into account
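To make the selection loop concrete, here is a minimal MMR-style sketch (my own simplification with made-up scores; not the exact MMR-MD formulation from the paper). Each step picks the passage that best trades off relevance against redundancy with what has already been selected, controlled by a weight lam:

```python
# Minimal MMR-style greedy selection (simplified; relevance stands in for SIM1,
# sim stands in for SIM2, and lam balances the two).
def mmr_select(passages, relevance, sim, k, lam=0.5):
    selected = []
    candidates = list(passages)
    while candidates and len(selected) < k:
        def score(p):
            redundancy = max((sim(p, q) for q in selected), default=0.0)
            return lam * relevance[p] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage with made-up numbers: p2 is highly relevant but nearly identical to p1.
passages = ["p1", "p2", "p3"]
relevance = {"p1": 0.9, "p2": 0.85, "p3": 0.4}
sim = lambda a, b: 0.95 if {a, b} == {"p1", "p2"} else 0.1
print(mmr_select(passages, relevance, sim, k=2))  # ['p1', 'p3']
```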

13

MDS by Goldstein et al. 2000

Using first measure only:

14

MDS by Goldstein et al. 2000

Setting gamma = 0.3, time-line ordering

15

Learning Sentence Extraction Rules (Kupiec et al. 95)

Results
About 87% (498) of the 568 summary sentences could be directly matched to sentences in the source (79% direct matches, 3% direct joins, 5% incomplete joins)
Location was the best single feature: 163/498 = 33%
Paragraph + fixed-phrase + sentence-length cut-off gave the best sentence recall performance: 217/498 = 44%
At a compression rate of 25% (20 sentences), performance peaked at 84% sentence recall

J. Kupiec, J. Pedersen, and F. Chen. "A Trainable Document Summarizer", Proceedings of the 18th ACM-SIGIR Conference, pages 68--73, 1995.

16

Kupiec et al. Features

Fixed-phrase feature
Certain phrases indicate summary, e.g. “in summary”

Paragraph feature
Paragraph initial/final more likely to be important.

Thematic word feature
Repetition is an indicator of importance

Uppercase word feature
Uppercase often indicates named entities. (Taylor)

Sentence length cut-off
Summary sentence should be > 5 words.
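As an illustration only (assumed cue-phrase list, thresholds, and helper arguments; the paper's exact feature definitions differ in detail), the five features could be computed per sentence roughly like this:

```python
import re

# Rough sketch of Kupiec-style sentence features (assumed phrase list and thresholds).
CUE_PHRASES = ("in summary", "in conclusion", "this paper")

def sentence_features(sentence, position_in_paragraph, thematic_words):
    words = re.findall(r"[A-Za-z']+", sentence)
    return {
        "fixed_phrase": any(p in sentence.lower() for p in CUE_PHRASES),
        "paragraph": position_in_paragraph in ("initial", "final"),
        "thematic": sum(w.lower() in thematic_words for w in words) >= 2,
        "uppercase": any(w[0].isupper() for w in words[1:]),  # skip sentence-initial word
        "length_ok": len(words) > 5,
    }

print(sentence_features("In summary, Taylor recovered quickly.",
                        "final", thematic_words={"recovered"}))
```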

17

From lecture notes by Nachum Dershowitz & Dan Cohen

Bayesian Classifier (Kupiec et al. 95)

Uses a Bayesian classifier:

P(s ∈ S | F1, F2, ..., Fk) = P(F1, F2, ..., Fk | s ∈ S) · P(s ∈ S) / P(F1, F2, ..., Fk)

Assuming statistical independence:

P(s ∈ S | F1, F2, ..., Fk) = [ ∏ j=1..k P(Fj | s ∈ S) ] · P(s ∈ S) / [ ∏ j=1..k P(Fj) ]

18

Intro to Probability Theory

Including Bayes Theorem and Naïve Bayes

19

Slide adapted from Dan Jurafsky

Definition of Probability

Probability theory encodes our knowledge or belief about the collective likelihood of the outcome of an event.

We use probability theory to try to predict which outcome will occur for a given event.

20

Slide adapted from Dan Jurafsky

Sample Spaces

We think about the “sample space” as the set of all possible outcomes:

For tossing a coin, the possible outcomes are Heads or Tails.

For competing in the Olympics, the set of outcomes for a given contest is {gold, silver, bronze, no_award}.

For computing part-of-speech, the set of outcomes is {JJ, DT, NN, RB, … etc.}

We use probability theory to try to predict which outcome will occur for a given event.

21

Slides adapted from Mary Ellen Califf

Axioms of Probability Theory

Probabilities are real numbers between 0 and 1 representing the a priori likelihood that a proposition is true.

Necessarily true propositions have probability 1, necessarily false have probability 0.

P(true) = 1 P(false) = 0

22

Slide adapted from Dan Jurafsky

Probability Axioms

A probability law must satisfy certain properties:

Nonnegativity
P(A) >= 0, for every event A

Additivity
If A and B are two disjoint events, then the probability of their union satisfies: P(A ∪ B) = P(A) + P(B)

Normalization
The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.

P(Cold) = 0.1 (the probability that it will be cold)
P(¬Cold) = 0.9 (the probability that it will not be cold)

23

Slide adapted from Dan Jurafsky

An example

An experiment involving a single coin toss
There are two possible outcomes, Heads and Tails
Sample space S is {H,T}
If coin is fair, should assign equal probabilities to the 2 outcomes
Since they have to sum to 1

P({H}) = 0.5
P({T}) = 0.5
P({H,T}) = P({H}) + P({T}) = 1.0

24

Slide adapted from Dan Jurafsky

Another example

Experiment involving 3 coin tosses
Outcome is a 3-long string of H or T

S ={HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}

Assume each outcome is equally likely (equiprobable)

“Uniform distribution”

What is probability of the event A that exactly 2 heads occur?

A = {HHT, HTH, THH}
P(A) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8
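A quick way to sanity-check counts like this is to enumerate the sample space directly; a small sketch (illustration only):

```python
from itertools import product

# Enumerate the 8 equally likely outcomes of 3 fair coin tosses and count
# those with exactly 2 heads.
outcomes = ["".join(t) for t in product("HT", repeat=3)]
event = [o for o in outcomes if o.count("H") == 2]
print(event, len(event) / len(outcomes))  # ['HHT', 'HTH', 'THH'] 0.375
```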

25

Slide adapted from Dan Jurafsky

Probability definitions

In summary:

P(E) = (number of outcomes corresponding to event E) / (total number of outcomes)

Probability of drawing a spade from 52 well-shuffled playing cards:

13/52 = ¼ = 0.25

26

Slide adapted from Dan Jurafsky

How about non-uniform probabilities? An example:

A biased coin, twice as likely to come up tails as heads, is tossed twice

What is the probability that at least one head occurs?
Sample space = {hh, ht, th, tt} (h = heads, t = tails)
Sample points/probabilities for the event (each head has probability 1/3; each tail has probability 2/3):
hh 1/3 x 1/3 = 1/9
ht 1/3 x 2/3 = 2/9
th 2/3 x 1/3 = 2/9
tt 2/3 x 2/3 = 4/9

Answer: 1/9 + 2/9 + 2/9 = 5/9 ≈ 0.56 (the sum of the weights of the outcomes with at least one head)

By contrast, probability of event for the unbiased coin = 0.75
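The same answer also falls out of a direct enumeration with per-outcome weights (a small sketch, illustration only):

```python
from itertools import product

# Two tosses of a coin biased 2:1 toward tails: P(H) = 1/3, P(T) = 2/3.
p = {"H": 1/3, "T": 2/3}
prob_at_least_one_head = sum(
    p[a] * p[b] for a, b in product("HT", repeat=2) if "H" in (a, b)
)
print(prob_at_least_one_head)  # 5/9 ≈ 0.556
```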

27

Slide adapted from Dan Jurafsky

Moving toward language

What’s the probability of drawing a 2 from a deck of 52 cards?

What’s the probability of a random word (from a random dictionary page) being a verb?

P(drawing a two) = 4/52 = 1/13 ≈ .077

P(drawing a verb) = (# of ways to get a verb) / (all words)

28

Slide adapted from Dan Jurafsky

Probability and part of speech tags

• What’s the probability of a random word (from a random dictionary page) being a verb?

• How to compute each of these?
• All words = just count all the words in the dictionary
• # of ways to get a verb: number of words which are verbs!
• If a dictionary has 50,000 entries, and 10,000 are verbs…

P(V) is 10000/50000 = 1/5 = .20

P(drawing a verb) = (# of ways to get a verb) / (all words)

29

Slide adapted from Dan Jurafsky

Independence

Most things in life depend on one another, that is, they are dependent:

If I drive to SF, this may affect your attempt to go there.
It may also affect the environment, the economy of SF, etc.

If 2 events are independent, this means that the occurrence of one event makes it neither more nor less probable that the other occurs.

If I flip my coin, it won’t have an effect on the outcome of your coin flip.

P(A,B) = P(A) · P(B) if and only if A and B are independent.
P(heads, tails) = P(heads) · P(tails) = .5 · .5 = .25
Note: P(A|B) = P(A) iff A, B independent; P(B|A) = P(B) iff A, B independent

30

Slide adapted from Dan Jurafsky

Conditional Probability

A way to reason about the outcome of an experiment based on partial information

How likely is it that a person has a disease given that a medical test was negative?
In a word guessing game the first letter of the word is a “t”. What is the likelihood that the second letter is an “h”?
A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?

31

Slides adapted from Mary Ellen Califf

Conditional Probability

Conditional probability specifies the probability given that the values of some other random variables are known.

P(Sneeze | Cold) = 0.8
P(Cold | Sneeze) = 0.6

The probability of a sneeze given a cold is 80%.
The probability of a cold given a sneeze is 60%.

32

Slide adapted from Dan Jurafsky

More precisely

Given an experiment, a corresponding sample space S, and the probability law
Suppose we know that the outcome is within some given event B

The first letter was ‘t’

We want to quantify the likelihood that the outcome also belongs to some other given event A.

The second letter will be ‘h’

We need a new probability law that gives us the conditional probability of A given B
P(A|B): “the probability of A given B”

33

Slides adapted from Mary Ellen Califf

Joint Probability Distribution

The joint probability distribution for a set of random variables X1…Xn gives the probability of every combination of values

P(X1,...,Xn)

            Sneeze   ¬Sneeze
Cold        0.08     0.01
¬Cold       0.01     0.9

The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution. All conditional probabilities can therefore also be calculated:

P(Cold | ¬Sneeze)

BUT it’s often very hard to obtain all the probabilities for a joint distribution
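For instance, P(Cold | ¬Sneeze) can be read off the joint table by summing the relevant cells; a small sketch using the numbers above:

```python
# Joint distribution from the table above.
joint = {
    ("cold", "sneeze"): 0.08, ("cold", "no_sneeze"): 0.01,
    ("no_cold", "sneeze"): 0.01, ("no_cold", "no_sneeze"): 0.90,
}

# P(Cold | ¬Sneeze) = P(Cold, ¬Sneeze) / P(¬Sneeze)
p_no_sneeze = joint[("cold", "no_sneeze")] + joint[("no_cold", "no_sneeze")]
print(round(joint[("cold", "no_sneeze")] / p_no_sneeze, 3))  # 0.011
```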

34

Slide adapted from Dan Jurafsky

Conditional Probability Example

• Let’s say A is “it’s raining”.
• Let’s say P(A) in dry California is .01
• Let’s say B is “it was sunny ten minutes ago”
• P(A|B) means

• “what is the probability of it raining now if it was sunny 10 minutes ago”

• P(A|B) is probably way less than P(A)
• Perhaps P(A|B) is .0001

• Intuition: The knowledge about B should change our estimate of the probability of A.

35

Slide adapted from Dan Jurafsky

Conditional Probability

Let A and B be events
P(A,B) and P(A ∩ B) both mean “the probability that BOTH A and B occur”

P(B|A) = the probability of event B occurring given that event A occurs
Definition: P(A|B) = P(A ∩ B) / P(B)

P(A, B) = P(A|B) * P(B) (simple arithmetic)

P(A, B) = P(B, A) (AND is symmetric)

36

Bayes Theorem

We start with conditional probability definition:

So say we know how to compute P(A|B). What if we want to figure out P(B|A)? We can re-arrange the formula using Bayes Theorem:

P(A|B) = P(A,B) / P(B)

P(B|A) = P(A|B) P(B) / P(A)

37

Slide adapted from Dan Jurafsky

Deriving Bayes Rule

P(A|B) = P(A ∩ B) / P(B)        P(B|A) = P(A ∩ B) / P(A)

P(A|B) P(B) = P(A ∩ B)          P(B|A) P(A) = P(A ∩ B)

P(A|B) P(B) = P(B|A) P(A)

P(A|B) = P(B|A) P(A) / P(B)

38

Slides adapted from Mary Ellen Califf

How to compute probabilities?

We don’t have the probabilities for most NLP problems
We can try to estimate them from data

(that’s the learning part)

Usually we can’t actually estimate the probability that something belongs to a given class given the information about it
BUT we can estimate the probability that something in a given class has particular values.

39

Slides adapted from Mary Ellen Califf

Simple Bayesian Reasoning

If we assume there are n possible disjoint tags, t1 … tn

P(ti | w) = P(w | ti) P(ti) / P(w)

Want to know the probability of the tag given the word.
Can count how often we see the word with the given tag in the training set, and how often the tag itself occurs.

P(w| ti ) = number of times we see this tag with this word divided by how often we see the tag

This is the learning part.

Since P(w) is always the same, we can ignore it.
So now only need to compute P(w | ti) P(ti)
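As a toy illustration (made-up counts, not from the lecture), choosing the tag that maximizes P(w | ti) P(ti) from a tagged corpus looks roughly like this:

```python
from collections import Counter

# Toy tagged corpus (made-up): (word, tag) pairs.
corpus = [("book", "NN"), ("book", "VB"), ("book", "NN"),
          ("a", "DT"), ("the", "DT"), ("read", "VB")]

tag_counts = Counter(tag for _, tag in corpus)
pair_counts = Counter(corpus)
total = len(corpus)

def best_tag(word):
    # argmax over tags of P(word | tag) * P(tag); P(word) is constant and ignored.
    def score(tag):
        p_w_given_t = pair_counts[(word, tag)] / tag_counts[tag]
        p_t = tag_counts[tag] / total
        return p_w_given_t * p_t
    return max(tag_counts, key=score)

print(best_tag("book"))  # NN: seen twice as a noun, once as a verb
```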

40

Slides adapted from Mary Ellen Califf

Bayes Independence Example

Imagine there are diagnoses ALLERGY, COLD, and WELL and symptoms SNEEZE, COUGH, and FEVER

Prob            Well   Cold   Allergy
P(d)            0.9    0.05   0.05
P(sneeze | d)   0.1    0.9    0.9
P(cough | d)    0.1    0.8    0.7
P(fever | d)    0.01   0.7    0.4

41

Slides adapted from Mary Ellen Califf

Bayes Independence Example

By assuming independence, we ignore the possible interactions.
If symptoms are sneeze & cough & no fever:

P(well | e) = (0.9)(0.1)(0.1)(0.99) / P(e) = 0.0089 / P(e)
P(cold | e) = (0.05)(0.9)(0.8)(0.3) / P(e) = 0.01 / P(e)
P(allergy | e) = (0.05)(0.9)(0.7)(0.6) / P(e) = 0.019 / P(e)

P(e) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well | e) = 0.23
P(cold | e) = 0.26
P(allergy | e) = 0.50

Diagnosis: allergy
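The same arithmetic as a small sketch (numbers copied from the table above; the slide's 0.26/0.50 come from rounding 0.0108 to 0.01 before normalizing):

```python
# Naive Bayes diagnosis for evidence e = sneeze & cough & no fever.
# P(no fever | d) = 1 - P(fever | d).
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
p_sneeze = {"well": 0.1, "cold": 0.9, "allergy": 0.9}
p_cough = {"well": 0.1, "cold": 0.8, "allergy": 0.7}
p_fever = {"well": 0.01, "cold": 0.7, "allergy": 0.4}

unnorm = {d: priors[d] * p_sneeze[d] * p_cough[d] * (1 - p_fever[d]) for d in priors}
p_e = sum(unnorm.values())
posterior = {d: round(unnorm[d] / p_e, 2) for d in unnorm}
print(max(posterior, key=posterior.get), posterior)
# allergy {'well': 0.23, 'cold': 0.28, 'allergy': 0.49}
```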


42

Some notation

∏ P(fi | Sentence)

This means that you multiply the feature probabilities together:

P(f1 | S) * P(f2 | S) * … * P(fn | S)

There is a similar notation, Σ, for summation.
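In code, the product notation is just a fold over the per-feature probabilities (made-up numbers, illustration only):

```python
import math

# ∏ P(fi | Sentence) with made-up per-feature probabilities.
feature_probs = [0.3, 0.8, 0.5]
print(math.prod(feature_probs))   # 0.12; sum(feature_probs) is the Σ analogue
```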

43

Kupiec et al. Feature Representation

Fixed-phrase feature
Certain phrases indicate summary, e.g. “in summary”

Paragraph feature
Paragraph initial/final more likely to be important.

Thematic word feature
Repetition is an indicator of importance

Uppercase word feature
Uppercase often indicates named entities. (Taylor)

Sentence length cut-off
Summary sentence should be > 5 words.

44

Training

Hand-label sentences in training set (good/bad summary sentences)
Train classifier to distinguish good/bad summary sentences
Model used: Naïve Bayes
Can rank sentences according to score and show top n to user.

45

Naïve Bayes Classifier

The simpler version of Bayes was:

P(B|A) = P(A|B) P(B) / P(A)
P(Sentence | feature) = P(feature | S) P(S) / P(feature)

Using Naïve Bayes, we expand the number of features by defining a joint probability distribution:

P(Sentence, f1, f2, … fn) = P(Sentence) ∏ P(fi | Sentence) / ∏ P(fi)

We learn P(Sentence) and P(fi| Sentence) in training

Test: we need to state P(Sentence | f1, f2, … fn)

P(Sentence | f1, f2, … fn) = P(Sentence, f1, f2, … fn) / P(f1, f2, … fn)

46

Details: Bayesian Classifier

Assuming statistical independence:

P(s ∈ S | F1, F2, ..., Fk) = P(F1, F2, ..., Fk | s ∈ S) · P(s ∈ S) / P(F1, F2, ..., Fk)
                           = [ ∏ j=1..k P(Fj | s ∈ S) ] · P(s ∈ S) / [ ∏ j=1..k P(Fj) ]

where:
P(s ∈ S | F1, ..., Fk) is the probability that sentence s is included in summary S, given that sentence's feature-value pairs
P(Fj | s ∈ S) is the probability of a feature-value pair occurring in a source sentence which is also in the summary
P(s ∈ S) is the compression rate
P(Fj) is the probability of a feature-value pair occurring in a source sentence

47

From lecture notes by Nachum Dershowitz & Dan Cohen

Bayesian Classifier (Kupiec et al. 95)

Each probability is calculated empirically from a corpus

See how often each feature is seen with a sentence selected for a summary, vs how often that feature is seen in any sentence.

Higher-probability sentences are chosen to be in the summary
Performance:

For 25% summaries, 84% precision


48

How to compute this?

For training, for each feature f:
For each sentence s:

– Is the sentence in the target summary S? Increment T
– Increment F no matter which sentence it is in.

P(f|S) = T/N
P(f) = F/N

For testing, for each document:
For each sentence

– Multiply the probabilities of all of the features occurring in the sentence times the probability of any sentence being in the summary (a constant). Divide by the probability of the features occurring in any sentence.
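Putting the procedure together, here is a rough end-to-end sketch (my own illustration with assumed data structures and no smoothing; not the paper's code). Training counts how often each feature co-occurs with summary sentences; testing scores a sentence with the Naïve Bayes formula above, using the compression rate as the constant P(s ∈ S):

```python
from math import prod

def train(sentences, in_summary, feature_names):
    """sentences: list of dicts mapping feature name -> bool;
    in_summary: parallel list of bools (is the sentence in the target summary?)."""
    n = len(sentences)
    n_summary = sum(in_summary)
    p_f_given_s = {f: sum(s[f] for s, y in zip(sentences, in_summary) if y) / n_summary
                   for f in feature_names}
    p_f = {f: sum(s[f] for s in sentences) / n for f in feature_names}
    compression = n_summary / n          # stands in for the constant P(s in S)
    return p_f_given_s, p_f, compression

def score(sentence, p_f_given_s, p_f, compression):
    """Naive Bayes score for P(s in S | features); an unnormalized estimate,
    so it can exceed 1 because of the independence approximation."""
    feats = [f for f, present in sentence.items() if present]
    return compression * prod(p_f_given_s[f] for f in feats) / prod(p_f[f] for f in feats)

# Toy usage with two features and three training sentences:
feats = ["fixed_phrase", "length_ok"]
train_sents = [{"fixed_phrase": True, "length_ok": True},
               {"fixed_phrase": False, "length_ok": True},
               {"fixed_phrase": False, "length_ok": False}]
labels = [True, False, False]
model = train(train_sents, labels, feats)
print(score({"fixed_phrase": True, "length_ok": True}, *model))  # 1.5 for this toy data
```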