A Statistical Model for Domain-Independent Text Segmentation
Masao Utiyama and Hitoshi Isahara
Presentation by Matthew Waymost
Introduction
• The algorithm finds the maximum-probability segmentation of a text using a statistical method.
• No training required.
• Domain-independent.
Statistical Model
• Find the probability of a segmentation S given a text W.
• Use Bayes' rule to find the maximum-probability segmentation.
$$\hat{S} = \arg\max_S \Pr(W \mid S)\,\Pr(S)$$
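The step from Bayes' rule to the objective can be spelled out as follows. This is a short derivation sketch: Pr(W) is the same for every candidate segmentation S, so it drops out of the maximization.

```latex
\begin{align*}
\hat{S} &= \arg\max_S \Pr(S \mid W) \\
        &= \arg\max_S \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W)} \\
        &= \arg\max_S \Pr(W \mid S)\,\Pr(S)
\end{align*}
```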
Definition of Pr(W|S)
• Assume statistical independence of topics and of words within the scope of a topic.
• Assume different topics have different word distributions.
• Pr(W|S) can be broken down into a double product of probabilities across segments and across the words within each segment.
• Uses the Laplace estimator to predict word frequencies.
$$\Pr(w^i_j \mid S_i) = \frac{f_i(w^i_j) + 1}{n_i + k}$$
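A minimal sketch of how the Laplace estimate and the double product might be computed. It assumes the usual reading of the symbols, which the slides do not spell out: f_i(w) is the count of word w in segment i, n_i is the number of words in segment i, and k is the number of distinct words in the whole text. The function names `segment_log_prob` and `text_log_prob` are illustrative, and logs are used so the product becomes a sum.

```python
import math
from collections import Counter


def segment_log_prob(segment, k):
    """Sum over the segment's words of log((f_i(w) + 1) / (n_i + k))."""
    counts = Counter(segment)   # f_i(w): occurrences of w in this segment
    n_i = len(segment)          # number of word tokens in this segment
    return sum(math.log((counts[w] + 1) / (n_i + k)) for w in segment)


def text_log_prob(segments):
    """log Pr(W|S): the double product turned into a double sum."""
    k = len({w for seg in segments for w in seg})  # distinct words in the text
    return sum(segment_log_prob(seg, k) for seg in segments)


words = "a a a a b b b b".split()
one_segment = text_log_prob([words])
two_segments = text_log_prob([words[:4], words[4:]])
```

Splitting this toy text at the topic change gives a higher log Pr(W|S) than keeping it whole, because each segment's word distribution is more concentrated.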
Definition of Pr(S)
• Varies depending on prior information.
• In general, assume no prior information.
• Prevents the algorithm from generating too many segments; counteracts Pr(W|S).
$$\Pr(S) = 2^{-m \log_2 n} = n^{-m}$$
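To see why this prior discourages over-segmentation, it helps to look at its negative log. The sketch below assumes n is the number of words in the text and m is the number of segments, which the slides do not state explicitly; the figure of 1000 words is only an illustrative value.

```latex
\begin{align*}
-\log \Pr(S) &= m \log n \\
\text{e.g. } n = 1000:\quad -\ln \Pr(S) &= m \ln 1000 \approx 6.91\,m
\end{align*}
```

Every additional segment therefore adds a fixed penalty of log n, which a split must earn back through a higher Pr(W|S).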
Algorithm
• Convert the probability function into a cost function by taking the negative log.
• Given a text W, define g_i to be the gap between words w_i and w_{i+1}.
• Create a directed graph whose nodes are the gaps between words; each edge represents the segment of words lying between the two gaps it connects.
• Calculate every edge weight with the cost function, then find the minimum-cost path from the first node to the last node.
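The steps above can be sketched as a shortest-path search over a directed acyclic graph. This is an illustrative sketch rather than the authors' implementation: it assumes the edge cost is the negative log of the Laplace-smoothed segment probability plus a per-segment prior cost of log n, uses natural logs (which scales all costs equally and leaves the best path unchanged), and treats g_0 and g_n as the positions before the first word and after the last. The name `min_cost_segmentation` is hypothetical.

```python
import math
from collections import Counter


def min_cost_segmentation(words):
    """Split `words` into the segments on the minimum-cost path g_0 -> g_n."""
    n = len(words)
    if n == 0:
        return []
    k = len(set(words))          # number of distinct words in the text
    prior_cost = math.log(n)     # -log of the per-segment prior factor 1/n

    # best[j] is the cheapest known cost of a path from g_0 to g_j;
    # back[j] is the gap that precedes g_j on that path.
    best = [math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0

    for i in range(n):                    # edge starts at gap g_i
        counts = Counter()                # word counts for words[i:j]
        for j in range(i + 1, n + 1):     # edge ends at gap g_j
            counts[words[j - 1]] += 1
            length = j - i
            # -log of the smoothed probability of every word in the segment
            data_cost = sum(c * math.log((length + k) / (c + 1))
                            for c in counts.values())
            total = best[i] + data_cost + prior_cost
            if total < best[j]:
                best[j], back[j] = total, i

    # Walk back from g_n to g_0; each edge on the path is one segment.
    bounds = [n]
    while bounds[-1] > 0:
        bounds.append(back[bounds[-1]])
    bounds.reverse()
    return [words[a:b] for a, b in zip(bounds, bounds[1:])]


text = ["a"] * 6 + ["b"] * 6
segments = min_cost_segmentation(text)
```

Because every edge points from an earlier gap to a later one, processing the gaps in order is enough; no general-purpose shortest-path routine is needed. On the toy text the path splits once, at the point where the vocabulary changes.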
Algorithm
• The calculated path represents the minimum-cost segmentation: each edge on the path corresponds to one segment.
Algorithm – Features
• Determines the number of segments automatically; the user can instead fix it by specifying the number of edges in the shortest path.
• Can restrict where segment boundaries occur by using only the subset of edges whose two end nodes both meet user-specified conditions.
• Algorithm is insensitive to text length.
  – Good for summarization
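The first two features can be sketched as small changes to the path search. The code below is an assumption-laden illustration, not the authors' code: `constrained_segmentation`, `num_segments`, and `allowed_gaps` are hypothetical names, the edge cost is assumed to be the smoothed negative log-probability plus log n, and the search tracks the number of edges used so that a fixed segment count can be requested.

```python
import math
from collections import Counter


def segment_cost(segment, k, n):
    """Assumed edge cost: -log Pr(segment) under Laplace smoothing, plus log n."""
    counts = Counter(segment)
    length = len(segment)
    data = sum(c * math.log((length + k) / (c + 1)) for c in counts.values())
    return data + math.log(n)


def constrained_segmentation(words, num_segments=None, allowed_gaps=None):
    n = len(words)
    if n == 0:
        return []
    k = len(set(words))
    # Keep only permitted interior gaps; g_0 and g_n are always nodes.
    interior = range(1, n) if allowed_gaps is None else allowed_gaps
    gaps = sorted({0, n} | {g for g in interior if 0 < g < n})
    last = len(gaps) - 1
    max_edges = last if num_segments is None else num_segments
    if not 1 <= max_edges <= last:
        raise ValueError("num_segments is not reachable with the allowed gaps")

    # best[m][b]: minimum cost of reaching gaps[b] from gaps[0] with exactly m edges
    best = [[math.inf] * len(gaps) for _ in range(max_edges + 1)]
    back = [[None] * len(gaps) for _ in range(max_edges + 1)]
    best[0][0] = 0.0
    for m in range(1, max_edges + 1):
        for b in range(1, len(gaps)):
            for a in range(b):
                if best[m - 1][a] == math.inf:
                    continue
                cost = best[m - 1][a] + segment_cost(words[gaps[a]:gaps[b]], k, n)
                if cost < best[m][b]:
                    best[m][b], back[m][b] = cost, a

    # Either use the requested edge count or pick the cheapest one.
    if num_segments is None:
        m = min(range(1, max_edges + 1), key=lambda e: best[e][last])
    else:
        m = num_segments
    path = [last]
    while m > 0:
        path.append(back[m][path[-1]])
        m -= 1
    path.reverse()
    return [words[gaps[a]:gaps[b]] for a, b in zip(path, path[1:])]


text = ["a"] * 6 + ["b"] * 6
free = constrained_segmentation(text)
three = constrained_segmentation(text, num_segments=3)
restricted = constrained_segmentation(text, allowed_gaps={4})
```

Here `allowed_gaps` stands in for the user-specified conditions, for example sentence or paragraph ends. Tracking the edge count makes the search slower than the unconstrained version, which is acceptable for a sketch but would matter on long texts.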
Algorithm – Evaluation
• Compared algorithm against C99 (Choi 2000).
• Used an artificial test corpus extracted from the Brown corpus.
• Used a probabilistic error metric to evaluate performance.
• The Utiyama algorithm's results were significantly better than those of the Choi algorithm at the 1% level.
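The slides do not name the error metric. Assuming it is the P_k metric of Beeferman et al., which is commonly used alongside C99, a minimal sketch could look like the following. The function name `pk`, the representation of a segmentation as a list of segment lengths, and the default window of half the mean reference segment length are assumptions made for illustration.

```python
def pk(reference, hypothesis, window=None):
    """Fraction of word pairs `window` apart on which the two segmentations
    disagree about whether the pair lies in the same segment.

    `reference` and `hypothesis` are lists of segment lengths in words.
    """
    def labels(lengths):
        # Expand segment lengths into one segment index per word.
        return [seg for seg, length in enumerate(lengths) for _ in range(length)]

    ref, hyp = labels(reference), labels(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of words")
    if window is None:
        # Assumed default: half the mean reference segment length.
        window = max(1, round(len(ref) / len(reference) / 2))
    pairs = len(ref) - window
    if pairs <= 0:
        raise ValueError("text is too short for the chosen window")
    errors = sum(
        (ref[i] == ref[i + window]) != (hyp[i] == hyp[i + window])
        for i in range(pairs)
    )
    return errors / pairs


perfect = pk([6, 6], [6, 6])
missed = pk([6, 6], [12])
```

A score of 0 means the hypothesis agrees with the reference on every sampled pair; lower is better.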