1
i256: Applied Natural Language Processing
Marti Hearst, Oct 4, 2006
2
Today
Discourse-based summarization
Multi-document summarization
Intro to probability

This will get us toward understanding the Naïve Bayes method used in the Kupiec et al. paper. It will take a couple of class sessions.
3From lecture notes by Nachum Dershowitz & Dan Cohen
Rhetorical Structure Theory
One of many theories, but an important and popular one (Mann & Thompson ’88)
Descriptive theory of text organization
Relations between pairs of text spans:
– nucleus & satellite
– nucleus & nucleus
Relations such as:
– Background
– Elaboration
– Concession (“Even though”)
4From lecture notes by Nachum Dershowitz & Dan Cohen
Rhetorical Structure Theory
5Slide adapted from Drago Radev
[RST tree for the text below. Nuclei are drawn solid, satellites dotted. Relations in the tree include Background/Justification, Elaboration, Example, Contrast, Evidence, Cause, Concession, and Antithesis.]

The text, segmented into numbered spans:
(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
(2) Mars experiences frigid weather conditions.
(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles.
(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
(5) but any liquid water formed in this way would evaporate almost instantly
(6) because of the low atmospheric pressure.
(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
(8) most Martian weather involves blowing dust and carbon monoxide.
(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.
6
RST for Summarization (Marcu 99)
Do they work in theory?
– Got 13 people to summarize several texts
– Compared this to the sentence fragments that would be produced by a perfect RST structure
– Give more weight to nodes higher up that are not parenthetical
– This did quite well
Then compared to what happens when the algorithm finds the relations automatically:
– Was better than MS Word
– MS Word not much better than random(!)
8
Multi-Document Summarization
What’s different from single-document summarization?
What properties do we want it to have?
9
Types of MD Summaries
Single event/person tracked over a long time period
– e.g., Elizabeth Taylor’s bout with pneumonia
– Give extra weight to the character/event
Multiple events of a similar nature
– e.g., marathon runners and races
– More broad brush; ignore dates
An issue with related events
– e.g., gun control
– Identify key concepts and select sentences accordingly
10
Multi-Document Summarization
Group similar documents together; select representative ones
Get good coverage of the main points
Minimize redundancy among selected passages
Overall cohesiveness and coherence
Include appropriate context
Show changes over time
Identify inconsistencies among stories
11
MDS by Goldstein et al.
Algorithm:
– Segment documents into passages
– Use a query to identify relevant passages
– Apply the MMR-MD metric to select from these (define desired length in advance; this is used for computing redundancy and similarity measures)
– Assemble selected passages using a cohesion measure
12
MDS by Goldstein et al. 2000
MMR-MD metric: compute a balance between two measures.
SIM1 computes, for each passage:
– Similarity between query and passage
– Coverage score (how many clusters is it in; how big are the clusters it is in?)
– Content measure (not really defined; appears similar to features used in Kupiec et al.)
– Relative timestamp among the passages
SIM2 computes:
– Similarity between this passage and those already selected for inclusion
– Similarity between this passage and the clusters that already-selected passages are in
– Whether or not another passage from this document is already selected, taking document length into account
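The selection step can be sketched as a greedy MMR loop. This is a minimal, hypothetical illustration, not Goldstein et al.'s full MMR-MD metric: plain cosine similarity over word counts stands in for both SIM1 and SIM2, and the coverage, content, timestamp, and cluster terms are omitted; the `lam` parameter balances relevance against redundancy.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(passages, query, k, lam=0.7):
    """Greedy MMR-style selection: reward query relevance (stand-in for
    SIM1), penalize similarity to already-selected passages (SIM2)."""
    bags = [Counter(p.lower().split()) for p in passages]
    qbag = Counter(query.lower().split())
    selected = []
    while len(selected) < min(k, len(passages)):
        def score(i):
            sim1 = cosine(bags[i], qbag)
            sim2 = max((cosine(bags[i], bags[j]) for j in selected),
                       default=0.0)
            return lam * sim1 - (1 - lam) * sim2
        best = max((i for i in range(len(passages)) if i not in selected),
                   key=score)
        selected.append(best)
    return selected

passages = ["congress debates gun control legislation",
            "gun control laws debated in congress",
            "the weather was sunny today"]
picked = mmr_select(passages, "gun control congress", k=2)
print(picked)  # [0, 1]
```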
15
[first page of the Kupiec et al. paper]
Learning Sentence Extraction Rules (Kupiec et al. 95)
Results:
– About 87% (498) of all 568 summary sentences could be directly matched to sentences in the source (79% direct matches, 3% direct joins, 5% incomplete joins)
– Location was the best single feature: 163/498 = 33%
– Paragraph + fixed-phrase + sentence-length cut-off gave the best sentence recall performance: 217/498 = 44%
– At a compression rate of 25% (20 sentences), performance peaked at 84% sentence recall
J. Kupiec, J. Pedersen, and F. Chen. "A Trainable Document Summarizer", Proceedings of the 18th ACM-SIGIR Conference, pages 68--73, 1995.
16
Kupiec et al. Features
Fixed-phrase feature
– Certain phrases indicate a summary, e.g. “in summary”
Paragraph feature
– Paragraph-initial/final sentences are more likely to be important
Thematic word feature
– Repetition is an indicator of importance
Uppercase word feature
– Uppercase often indicates named entities (e.g., Taylor)
Sentence length cut-off
– A summary sentence should be > 5 words
17From lecture notes by Nachum Dershowitz & Dan Cohen
Bayesian Classifier (Kupiec et al. 95)

Uses a Bayesian classifier:

P(s ∈ S | F_1, …, F_k) = P(F_1, …, F_k | s ∈ S) · P(s ∈ S) / P(F_1, …, F_k)

Assuming statistical independence of the features:

P(s ∈ S | F_1, …, F_k) = [∏_{j=1}^{k} P(F_j | s ∈ S)] · P(s ∈ S) / ∏_{j=1}^{k} P(F_j)
19Slide adapted from Dan Jurafsky's
Definition of Probability
Probability theory encodes our knowledge or belief about the collective likelihood of the outcome of an event.
We use probability theory to try to predict which outcome will occur for a given event.
20Slide adapted from Dan Jurafsky's
Sample Spaces
We think about the “sample space” as being the set of all possible outcomes:
– For tossing a coin, the possible outcomes are Heads or Tails
– For competing in the Olympics, the set of outcomes for a given contest is {gold, silver, bronze, no_award}
– For computing part-of-speech, the set of outcomes is {JJ, DT, NN, RB, … etc.}
We use probability theory to try to predict which outcome will occur for a given event.
21Slides adapted from Mary Ellen Califf
Axioms of Probability Theory
Probabilities are real numbers between 0 and 1 representing the a priori likelihood that a proposition is true.
Necessarily true propositions have probability 1, necessarily false have probability 0.
P(true) = 1 P(false) = 0
22Slide adapted from Dan Jurafsky's
Probability Axioms
A probability law must satisfy certain properties:
Nonnegativity
– P(A) >= 0, for every event A
Additivity
– If A and B are two disjoint events, then the probability of their union satisfies: P(A ∪ B) = P(A) + P(B)
Normalization
– The probability of the entire sample space S is equal to 1, i.e., P(S) = 1

P(Cold) = 0.1 (the probability that it will be cold)
P(¬Cold) = 0.9 (the probability that it will not be cold)
23Slide adapted from Dan Jurafsky's
An example
An experiment involving a single coin toss
There are two possible outcomes, Heads and Tails
Sample space S is {H, T}
If the coin is fair, we should assign equal probabilities to the 2 outcomes, since they have to sum to 1:
P({H}) = 0.5
P({T}) = 0.5
P({H,T}) = P({H}) + P({T}) = 1.0
24Slide adapted from Dan Jurafsky's
Another example
Experiment involving 3 coin tosses
Outcome is a 3-long string of H or T
S ={HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
Assume each outcome is equally likely (equiprobable)
“Uniform distribution”
What is probability of the event A that exactly 2 heads occur?
A = {HHT, HTH, THH}
P(A) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8
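The enumeration above is easy to check in code, as a small sketch using Python's standard library:

```python
from itertools import product

# Enumerate the 8 equiprobable outcomes of 3 fair coin tosses
outcomes = [''.join(toss) for toss in product('HT', repeat=3)]
event_a = [o for o in outcomes if o.count('H') == 2]  # exactly 2 heads

p_a = len(event_a) / len(outcomes)
print(event_a)  # ['HHT', 'HTH', 'THH']
print(p_a)      # 0.375, i.e. 3/8
```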
25Slide adapted from Dan Jurafsky's
Probability definitions
In summary:
P(E) = (number of outcomes corresponding to event E) / (total number of outcomes)
Probability of drawing a spade from 52 well-shuffled playing cards:
13/52 = ¼ = 0.25
26Slide adapted from Dan Jurafsky's
How about non-uniform probabilities? An example:
A biased coin, twice as likely to come up tails as heads, is tossed twice
What is the probability that at least one head occurs?
Sample space = {hh, ht, th, tt} (h = heads, t = tails)
Sample points/probabilities for the event (each head has probability 1/3; each tail has probability 2/3):
hh: 1/3 x 1/3 = 1/9
ht: 1/3 x 2/3 = 2/9
th: 2/3 x 1/3 = 2/9
tt: 2/3 x 2/3 = 4/9
Answer: 1/9 + 2/9 + 2/9 = 5/9 ≈ 0.56 (sum over the outcomes containing at least one head)
By contrast, probability of event for the unbiased coin = 0.75
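A quick check of the biased-coin arithmetic, using exact fractions:

```python
from fractions import Fraction
from itertools import product

p = {'h': Fraction(1, 3), 't': Fraction(2, 3)}  # biased: tails twice as likely

# Probability of at least one head in two independent tosses
total = sum(p[a] * p[b] for a, b in product('ht', repeat=2) if 'h' in (a, b))
print(total)  # 5/9
```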
27Slide adapted from Dan Jurafsky's
Moving toward language
What’s the probability of drawing a 2 from a deck of 52 cards?
What’s the probability of a random word (from a random dictionary page) being a verb?
P(drawing a two) = 4/52 = 1/13 ≈ .077

P(drawing a verb) = (# of ways to get a verb) / (all words)
28Slide adapted from Dan Jurafsky's
Probability and part of speech tags
What’s the probability of a random word (from a random dictionary page) being a verb?
P(drawing a verb) = (# of ways to get a verb) / (all words)
How to compute each of these?
– All words: just count all the words in the dictionary
– # of ways to get a verb: number of words which are verbs!
If a dictionary has 50,000 entries, and 10,000 are verbs:
P(V) = 10,000/50,000 = 1/5 = .20
29Slide adapted from Dan Jurafsky's
Independence
Most things in life depend on one another; that is, they are dependent:
– If I drive to SF, this may affect your attempt to go there
– It may also affect the environment, the economy of SF, etc.
If 2 events are independent, the occurrence of one event makes it neither more nor less probable that the other occurs:
– If I flip my coin, it won’t have an effect on the outcome of your coin flip
P(A,B) = P(A) · P(B) if and only if A and B are independent
P(heads, tails) = P(heads) · P(tails) = .5 · .5 = .25
Note: P(A|B) = P(A) iff A, B independent; P(B|A) = P(B) iff A, B independent
30Slide adapted from Dan Jurafsky's
Conditional Probability
A way to reason about the outcome of an experiment based on partial information:
– How likely is it that a person has a disease given that a medical test was negative?
– In a word-guessing game the first letter of the word is a “t”. What is the likelihood that the second letter is an “h”?
– A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
31Slides adapted from Mary Ellen Califf
Conditional Probability
Conditional probability specifies the probability given that the values of some other random variables are known.
P(Sneeze | Cold) = 0.8 P(Cold | Sneeze) = 0.6
The probability of a sneeze given a cold is 80%.The probability of a cold given a sneeze is 60%.
32Slide adapted from Dan Jurafsky's
More precisely
Given an experiment, a corresponding sample space S, and the probability law,
suppose we know that the outcome is within some given event B:
– The first letter was ‘t’
We want to quantify the likelihood that the outcome also belongs to some other given event A:
– The second letter will be ‘h’
We need a new probability law that gives us the conditional probability of A given B:
P(A|B), “the probability of A given B”
33Slides adapted from Mary Ellen Califf
Joint Probability Distribution
The joint probability distribution for a set of random variables X1…Xn gives the probability of every combination of values
P(X1,...,Xn)
         Sneeze   ¬Sneeze
Cold      0.08     0.01
¬Cold     0.01     0.90

The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution, so all conditional probabilities can also be calculated, e.g. P(Cold | ¬Sneeze).
BUT it’s often very hard to obtain all the probabilities for a joint distribution
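Computing a conditional probability from the joint table above, as a sketch:

```python
# Joint distribution over (Cold, Sneeze) from the table above
joint = {
    ('cold', 'sneeze'): 0.08,
    ('cold', 'no_sneeze'): 0.01,
    ('no_cold', 'sneeze'): 0.01,
    ('no_cold', 'no_sneeze'): 0.90,
}

# Marginal: sum the appropriate subset of joint values
p_no_sneeze = sum(v for (c, s), v in joint.items() if s == 'no_sneeze')

# Conditional: P(Cold | ¬Sneeze) = P(Cold, ¬Sneeze) / P(¬Sneeze)
p_cold_given_no_sneeze = joint[('cold', 'no_sneeze')] / p_no_sneeze
print(round(p_cold_given_no_sneeze, 3))  # 0.011
```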
34Slide adapted from Dan Jurafsky's
Conditional Probability Example
Let’s say A is “it’s raining”
Let’s say P(A) in dry California is .01
Let’s say B is “it was sunny ten minutes ago”
P(A|B) means “what is the probability of it raining now if it was sunny 10 minutes ago”
P(A|B) is probably way less than P(A); perhaps P(A|B) is .0001
Intuition: the knowledge about B should change our estimate of the probability of A
35Slide adapted from Dan Jurafsky's
Conditional Probability
Let A and B be events
P(A,B) and P(A ∩ B) both mean “the probability that BOTH A and B occur”
P(B|A) = the probability of event B occurring given that event A occurs
Definition: P(A|B) = P(A ∩ B) / P(B)
P(A,B) = P(A|B) · P(B) (simple arithmetic)
P(A,B) = P(B,A) (AND is symmetric)
36
Bayes Theorem
We start with the conditional probability definition:

P(A|B) = P(A,B) / P(B)

So say we know how to compute P(A|B). What if we want to figure out P(B|A)? We can rearrange the formula using Bayes Theorem:

P(B|A) = P(A|B) · P(B) / P(A)
37Slide adapted from Dan Jurafsky's
Deriving Bayes Rule
P(A |B) P(B | A)P(A)
P(B)
P(A |B) P(B | A)P(A)
P(B)
P(A |B) P(A B)P(B)
P(A |B) P(A B)P(B)
P(B | A) P(A B)P(A)
P(B | A) P(A B)P(A)
P(B | A)P(A) P(A B)
P(B | A)P(A) P(A B)
P(A |B)P(B) P(A B)
P(A |B)P(B) P(A B)
P(A |B)P(B) P(B | A)P(A)
P(A |B)P(B) P(B | A)P(A)
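The derivation can also be checked numerically; the joint and marginal values below are arbitrary toy numbers, exact fractions avoid rounding noise:

```python
from fractions import Fraction

# A toy joint distribution over two events (hypothetical numbers)
p_ab = Fraction(8, 100)   # P(A ∩ B)
p_a  = Fraction(9, 100)   # P(A)
p_b  = Fraction(9, 100)   # P(B)

# The two conditional probabilities, straight from the definition
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

# Bayes rule: P(A|B) = P(B|A) P(A) / P(B)
assert p_a_given_b == p_b_given_a * p_a / p_b
print(p_a_given_b)  # 8/9
```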
38Slides adapted from Mary Ellen Califf
How to compute probabilities?

We don’t have the probabilities for most NLP problems
We can try to estimate them from data (that’s the learning part)
Usually we can’t directly estimate the probability that something belongs to a given class given the information about it,
BUT we can estimate the probability that something in a given class has particular values.
39Slides adapted from Mary Ellen Califf
Simple Bayesian Reasoning
If we assume there are n possible disjoint tags, t1 … tn
P(t_i | w) = P(w | t_i) · P(t_i) / P(w)

We want to know the probability of the tag given the word.
We can count how often we see the word with the given tag in the training set, and how often the tag itself occurs:
– P(w | t_i) = number of times we see this tag with this word, divided by how often we see the tag
This is the learning part.
Since P(w) is always the same, we can ignore it. So now we only need to compute P(w | t_i) · P(t_i).
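A minimal sketch of this count-based estimation; the tiny tagged corpus and the words in it are made up for illustration, and a real tagger would count over a large annotated corpus:

```python
from collections import Counter

# Toy tagged corpus (hypothetical word/tag pairs)
corpus = [('the', 'DT'), ('dog', 'NN'), ('runs', 'VB'), ('the', 'DT'),
          ('run', 'VB'), ('runs', 'VB'), ('runs', 'NN'), ('cat', 'NN')]

tag_counts  = Counter(tag for _, tag in corpus)   # how often each tag occurs
pair_counts = Counter(corpus)                     # how often word+tag co-occur
n = len(corpus)

def best_tag(word):
    """argmax over tags of P(word | t) * P(t); P(word) is constant, so ignored."""
    return max(tag_counts,
               key=lambda t: (pair_counts[(word, t)] / tag_counts[t])  # P(w|t)
                             * (tag_counts[t] / n))                    # P(t)

print(best_tag('runs'))  # VB
```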
40Slides adapted from Mary Ellen Califf
Bayes Independence Example
Imagine there are diagnoses ALLERGY, COLD, and WELL and symptoms SNEEZE, COUGH, and FEVER
Prob           Well   Cold   Allergy
P(d)           0.9    0.05   0.05
P(sneeze|d)    0.1    0.9    0.9
P(cough|d)     0.1    0.8    0.7
P(fever|d)     0.01   0.7    0.4
41Slides adapted from Mary Ellen Califf
Bayes Independence Example

By assuming independence, we ignore the possible interactions.
If the symptoms are: sneeze & cough & no fever:
P(well | e)    = (0.9)(0.1)(0.1)(0.99)/P(e) = 0.0089/P(e)
P(cold | e)    = (0.05)(0.9)(0.8)(0.3)/P(e) = 0.01/P(e)
P(allergy | e) = (0.05)(0.9)(0.7)(0.6)/P(e) = 0.019/P(e)
P(e) = .0089 + .01 + .019 = .0379
P(well | e) = .23   P(cold | e) = .26   P(allergy | e) = .50
Diagnosis: allergy
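The same computation, reproduced in code. The slide rounds intermediate products, so the exact posteriors differ slightly (e.g. about .49 rather than .50 for allergy), but the diagnosis is the same:

```python
# Numbers from the table above; evidence e = sneeze & cough & no fever
priors   = {'well': 0.90, 'cold': 0.05, 'allergy': 0.05}
p_sneeze = {'well': 0.10, 'cold': 0.90, 'allergy': 0.90}
p_cough  = {'well': 0.10, 'cold': 0.80, 'allergy': 0.70}
p_fever  = {'well': 0.01, 'cold': 0.70, 'allergy': 0.40}

# Unnormalized posteriors under the naive independence assumption;
# "no fever" contributes the complement (1 - P(fever | d))
scores = {d: priors[d] * p_sneeze[d] * p_cough[d] * (1 - p_fever[d])
          for d in priors}
p_e = sum(scores.values())                    # normalizing constant P(e)
posterior = {d: s / p_e for d, s in scores.items()}

print(max(posterior, key=posterior.get))  # allergy
```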
42
Some notation
∏_i P(f_i | Sentence)

This means that you multiply all the feature probabilities together:
P(f_1|S) · P(f_2|S) · … · P(f_n|S)

There is a similar notation, Σ, for summation.
43
Kupiec et al. Feature Representation
Fixed-phrase featureCertain phrases indicate summary, e.g. “in summary”
Paragraph featureParagraph initial/final more likely to be important.
Thematic word featureRepetition is an indicator of importance
Uppercase word featureUppercase often indicates named entities. (Taylor)
Sentence length cut-offSummary sentence should be > 5 words.
44
Training
Hand-label sentences in training set (good/bad summary sentences)
Train classifier to distinguish good/bad summary sentences
Model used: Naïve Bayes
Can rank sentences according to score and show top n to the user
45
Naïve Bayes Classifier

The simpler version of Bayes was:
P(B|A) = P(A|B) P(B) / P(A)
P(Sentence | feature) = P(feature | Sentence) P(Sentence) / P(feature)

Using Naïve Bayes, we handle many features by assuming they are independent given the class, which defines the joint probability distribution:
P(Sentence, f_1, f_2, …, f_n) = P(Sentence) ∏_i P(f_i | Sentence)

We learn P(Sentence) and P(f_i | Sentence) in training.
At test time we need to state P(Sentence | f_1, f_2, …, f_n):
P(Sentence | f_1, f_2, …, f_n) = P(Sentence, f_1, f_2, …, f_n) / P(f_1, f_2, …, f_n)
= P(Sentence) ∏_i P(f_i | Sentence) / ∏_i P(f_i)
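A sketch of scoring sentences with this model. All feature names and probability values here are invented for illustration; following the slide's product form, only the features that fire in a sentence contribute, and log probabilities are used to avoid underflow:

```python
import math

# Hypothetical learned statistics: P(f is true | sentence in summary)
# and P(f is true) over all source sentences
p_f_given_s = {'fixed_phrase': 0.30, 'para_initial': 0.50, 'long_enough': 0.95}
p_f         = {'fixed_phrase': 0.05, 'para_initial': 0.20, 'long_enough': 0.80}
p_s = 0.25  # prior that any sentence is in the summary (the compression rate)

def summary_score(features):
    """log[ P(S) * prod_i P(f_i|S) / prod_i P(f_i) ] over features present."""
    score = math.log(p_s)
    for f in features:
        score += math.log(p_f_given_s[f]) - math.log(p_f[f])
    return score

# Toy sentences with the features that fire for each
sentences = {
    'In summary, the method works well.': ['fixed_phrase', 'long_enough'],
    'It rained.': [],
}
ranked = sorted(sentences, key=lambda s: summary_score(sentences[s]),
                reverse=True)
print(ranked[0])  # In summary, the method works well.
```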
46
Details: Bayesian Classifier

P(s ∈ S | F_1, …, F_k) = P(F_1, …, F_k | s ∈ S) · P(s ∈ S) / P(F_1, …, F_k)

Assuming statistical independence:

P(s ∈ S | F_1, …, F_k) = [∏_{j=1}^{k} P(F_j | s ∈ S)] · P(s ∈ S) / ∏_{j=1}^{k} P(F_j)

where:
– P(s ∈ S | F_1, …, F_k) is the probability that sentence s is included in summary S, given that sentence’s feature-value pairs
– P(F_j | s ∈ S) is the probability of a feature-value pair occurring in a source sentence which is also in the summary
– P(s ∈ S) is the compression rate
– P(F_j) is the probability of a feature-value pair occurring in a source sentence
47From lecture notes by Nachum Dershowitz & Dan Cohen
Bayesian Classifier (Kupiec et al. 95)

Each probability is calculated empirically from a corpus:
– See how often each feature is seen with a sentence selected for a summary, vs. how often that feature is seen in any sentence
Higher-probability sentences are chosen to be in the summary
Performance: for 25% summaries, 84% precision
48
How to compute this?
For training, for each feature f:
For each sentence s:
– Is the sentence in the target summary S? If so, increment T
– Increment F no matter which sentence it is in
P(f|S) = T/N    P(f) = F/N
For testing, for each document:
For each sentence:
– Multiply the probabilities of all of the features occurring in the sentence times the probability of any sentence being in the summary (a constant). Divide by the probability of the features occurring in any sentence.
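The counting procedure can be sketched as follows. Note one detail: T is normalized here by the number of summary sentences, so that P(f|S) is a proper conditional probability, while F is normalized by all sentences. The example data is made up:

```python
def estimate(labeled_sentences, has_feature):
    """labeled_sentences: (sentence, in_summary) pairs.
    Returns empirical P(f | S) and P(f)."""
    summary = [s for s, in_sum in labeled_sentences if in_sum]
    t = sum(1 for s in summary if has_feature(s))                # T
    f = sum(1 for s, _ in labeled_sentences if has_feature(s))   # F
    return t / len(summary), f / len(labeled_sentences)

# Toy training data: (sentence, is it in the target summary?)
data = [("in summary, it works", True), ("the cat sat", False),
        ("in summary, we are done", True), ("more details follow", False)]

p_f_given_s, p_f = estimate(data, lambda s: "in summary" in s)
print(p_f_given_s, p_f)  # 1.0 0.5
```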