A Statistical Model for Domain-Independent Text Segmentation

Masao Utiyama and Hitoshi Isahara

Presentation by Matthew Waymost

Introduction

• The algorithm finds the maximum-probability segmentation using a statistical method.

• No training required.

• Domain-independent.

Other Methods

• Lexical Cohesion

• Statistical

– Hidden Markov model (Yamron et al., 1998)

Statistical Model

• Find the probability of a segmentation S given a text W.

• Use Bayes' rule to find the maximum-probability segmentation.

$$\hat{S} = \arg\max_{S} \Pr(W \mid S)\,\Pr(S)$$
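
Written out, the Bayes-rule step looks as follows. Pr(W) does not depend on S, so it can be dropped from the maximization:

```latex
\hat{S} = \arg\max_{S} \Pr(S \mid W)
        = \arg\max_{S} \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W)}
        = \arg\max_{S} \Pr(W \mid S)\,\Pr(S)
```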

Definition of Pr(W|S)

• Assume statistical independence of topics and of words within the scope of a topic.

• Assume different topics have different word distributions.

• Can be broken down into a double product of probabilities across words and segments.

• Uses the Laplace estimator for word frequency prediction.

$$\Pr(w^i_j \mid S_i) = \frac{f_i(w^i_j) + 1}{n_i + k}$$
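
A minimal sketch of the Laplace-smoothed likelihood, working in log space to avoid underflow. The symbol meanings are an assumption read off the formula: f_i(w) is the count of word w in segment i, n_i is the number of words in segment i, and k is the number of distinct words in the whole text. The function names are hypothetical.

```python
import math
from collections import Counter

def segment_log_prob(segment, k):
    """Log of the product over words of (f_i(w) + 1) / (n_i + k) for one segment."""
    counts = Counter(segment)   # f_i: word counts inside this segment
    n_i = len(segment)          # n_i: number of words in this segment
    return sum(math.log((counts[w] + 1) / (n_i + k)) for w in segment)

def text_log_prob(segments):
    """Log Pr(W|S): the double product becomes a double sum of logs."""
    # k: number of distinct words in the whole text (assumed meaning of k)
    k = len({w for seg in segments for w in seg})
    return sum(segment_log_prob(seg, k) for seg in segments)
```

For example, `text_log_prob([["a", "a"], ["b"]])` evaluates log((3/4)^2 * (2/3)).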

Definition of Pr(S)

• Varies depending on prior information.

• In general, assume no prior information.

• Prevents the algorithm from generating too many segments; counteracts Pr(W|S).

$$\Pr(S) = n^{-m} = 2^{-m \log n}$$
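
A small sketch of how the prior penalizes extra segments, assuming Pr(S) = n^{-m} with n the number of words in the text, m the number of segments, and the logarithm taken base 2. Under that assumption every additional segment costs log2(n) bits. The function name is hypothetical.

```python
import math

def prior_cost_bits(num_words, num_segments):
    """-log2 Pr(S) under the assumed prior Pr(S) = n ** -m."""
    return num_segments * math.log2(num_words)

# In a 1024-word text each segment costs 10 bits, so three segments cost 30.
cost = prior_cost_bits(1024, 3)
```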

Algorithm

• Convert the probability function into a cost function by taking the negative log.

• Given a text W, define g_i to be the gap between words w_i and w_{i+1}.

• Create a directed graph where the nodes are the gaps between words and the edges cover a segment between the gaps the edge connects.

• Calculate all edge weights by using the cost function and find the minimum-cost path from the first to last node.
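
Taking the negative log of Pr(W|S) Pr(S) turns the products into sums, so the total cost splits into one term per segment. A sketch of that cost, assuming the Laplace-smoothed word probability and a prior of Pr(S) = n^{-m} (n words in the text, m segments, n_i words in segment i, k distinct words in the text, f_i(w) the count of w in segment i):

```latex
C(S) = -\log \Pr(W \mid S)\,\Pr(S)
     = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i} \log \frac{n_i + k}{f_i(w^i_j) + 1} + \log n \right)
```

Because each summand depends on one segment only, it can serve as the weight of the edge that spans that segment.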

Algorithm

• The resulting path represents the minimum-cost segmentation: each edge on the path corresponds to one segment.
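
The graph search can be sketched with a simple dynamic program. This is an illustrative sketch, not the authors' implementation: it assumes nodes 0..n (node 0 before the first word and node n after the last are added as assumptions), an edge (i, j) covering words i..j-1, and an edge weight equal to the negative log of the Laplace-smoothed segment probability plus log n for the prior. All function names are hypothetical.

```python
import math
from collections import Counter

def segment_cost(words, n, k):
    """Weight of one edge: -log Pr(words | segment) plus log n for the prior."""
    counts = Counter(words)
    n_i = len(words)
    likelihood = sum(math.log((n_i + k) / (counts[w] + 1)) for w in words)
    return likelihood + math.log(n)

def segment_text(words):
    """Return the minimum-cost segmentation of words as a list of segments."""
    n = len(words)
    k = len(set(words))  # number of distinct words in the text
    # best[j]: minimum cost of any path from node 0 to node j
    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)  # back[j]: predecessor of node j on the best path
    for j in range(1, n + 1):
        for i in range(j):  # edges only point forward, so this order is valid
            cost = best[i] + segment_cost(words[i:j], n, k)
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Walk the back-pointers from the last node to recover the edges.
    bounds, j = [], n
    while j > 0:
        bounds.append((back[j], j))
        j = back[j]
    return [words[i:j] for i, j in reversed(bounds)]
```

This naive version recomputes word counts for every edge and is only meant to show the structure of the search; a practical version would update the counts incrementally.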

Algorithm – Features

• Determines the number of segments automatically, but the user can also fix it by specifying the number of edges in the shortest path.

• Can restrict where segment boundaries occur by using only the subset of edges whose two end nodes both meet user-specified conditions.

• Algorithm is insensitive to text length.

– Good for summarization
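
Both constraints (a fixed number of edges, and a restricted set of candidate boundary nodes) fit into one dynamic program. The sketch below is a hypothetical illustration under those assumptions: `edge_cost(i, j)` is any caller-supplied edge weight, and `allowed` is the set of nodes that satisfy the user's conditions.

```python
import math

def constrained_min_cost_path(n, edge_cost, num_edges, allowed=None):
    """Cheapest path from node 0 to node n that uses exactly num_edges edges.

    allowed is an optional set of nodes that may act as boundaries;
    nodes 0 and n are always usable. Returns None if no such path exists.
    """
    nodes = set(range(n + 1)) if allowed is None else set(allowed) | {0, n}
    order = sorted(nodes)
    # best[e][j]: minimum cost of reaching node j with exactly e edges
    best = [dict.fromkeys(order, math.inf) for _ in range(num_edges + 1)]
    back = [dict() for _ in range(num_edges + 1)]
    best[0][0] = 0.0
    for e in range(1, num_edges + 1):
        for j in order:
            for i in order:
                if i >= j or best[e - 1][i] == math.inf:
                    continue
                cost = best[e - 1][i] + edge_cost(i, j)
                if cost < best[e][j]:
                    best[e][j] = cost
                    back[e][j] = i
    if best[num_edges][n] == math.inf:
        return None
    # Follow the back-pointers, one edge at a time, from node n to node 0.
    path, j = [n], n
    for e in range(num_edges, 0, -1):
        j = back[e][j]
        path.append(j)
    return path[::-1]
```

With a toy weight such as `lambda i, j: (j - i) ** 2`, asking for three edges over six words yields the evenly spaced boundaries 0, 2, 4, 6.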

Algorithm – Evaluation

• Compared the algorithm against C99 (Choi, 2000).

• Used an artificial test corpus extracted from the Brown corpus.

• Probabilistic error metric used to evaluate performance.

• The Utiyama algorithm's results were significantly better than those of the Choi algorithm at the 1% significance level.
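
The slides do not name the error metric. Assuming it is the P_k measure of Beeferman et al., which is commonly used for this task, a minimal sketch could look like the following: it slides a window of width k over the text and counts how often the reference and the hypothesis disagree on whether the two window ends lie in the same segment. The function name and the input format (lists of segment lengths) are hypothetical.

```python
def pk(reference, hypothesis, k=None):
    """P_k error between two segmentations given as lists of segment lengths."""
    def labels(lengths):
        # Expand segment lengths into one segment id per word.
        return [seg for seg, length in enumerate(lengths) for _ in range(length)]

    ref, hyp = labels(reference), labels(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of words")
    n = len(ref)
    if k is None:
        # Common convention (an assumption here): half the mean reference segment length.
        k = max(1, round(n / (2 * len(reference))))
    windows = n - k
    if windows <= 0:
        raise ValueError("text is too short for the chosen window size")
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(windows)
    )
    return errors / windows
```

A lower value is better; identical segmentations score 0.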

Algorithm – Evaluation

• Assessment of algorithm using real texts is needed.

• Advantages over HMM

– No training required (implies domain-independence).

– Can incorporate probabilistic information into model.

• Might be expandable to detect word descriptions in text.
