Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization
Ziheng Lin and Min-Yen Kan
Department of Computer Science, National University of Singapore, Singapore
TextGraphs 2 at HLT/NAACL 2007


Page 1: Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization
Ziheng Lin and Min-Yen Kan, Department of Computer Science, National University of Singapore, Singapore

Page 2: Summarization

• Traditionally, heuristics for extractive summarization:

  – Cue/stigma phrases
  – Sentence position (relative to document, section, paragraph)
  – Sentence length
  – TF×IDF, TF scores
  – Similarity (with title, context, query)

• With the advent of machine learning, heuristic weights for different features are tuned by supervised learning

• In the last few years, graphical representations of text have shed new light on the summarization problem

Page 3: Prestige as sentence selection

• One motivation for using graphical methods was to model the problem as finding the prestige of nodes in a social network

• PageRank used a random walk to smooth the effect of non-local context
• HITS and SALSA model hubs and authorities
• In summarization, these led to TextRank and LexRank
• Contrast with previous graphical approaches (Salton et al. 1994)

• Did we leave anything out of our representation for summarization? Yes, the notion of an evolving network

Page 4: Social networks change!

• Natural evolving networks (Dorogovtsev and Mendes, 2001):

– Citation networks: New papers can cite old ones, but the old network is static

– The Web: new pages are added, with an existing page linking each new one into the web graph; old pages may update their links

Page 5: Evolutionary models for summarization

• Writers and readers often follow conventional rhetorical styles; articles are not written or read in an arbitrary way

• Consider the evolution of texts using a very simplistic model:

  – Writers write from the first sentence onwards in a text
  – Readers read from the first sentence onwards of a text

• A simple model: sentences get added incrementally to the graph

Page 6: Timestamped Graph Construction (Approach)

– These assumptions suggest iteratively adding sentences to the graph in chronological order
– At each iteration, consider which edges to add to the graph

– For a single document: simple and straightforward: add the 1st sentence, then the 2nd, and so forth, until the last sentence is added

– For multi-document input: treat it as multiple single documents evolving in parallel; i.e., add the 1st sentences of all documents, then all 2nd sentences, and so forth

• This doesn't really model chronological ordering between articles; we fix that later with skewing (a minimal construction sketch follows below)
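The following is a minimal, hedged sketch of this round-robin construction (not the authors' actual implementation). It assumes forward edges, one edge per new node (e = 1), bag-of-words cosine similarity, no skew, and no inter-document penalty; each newly added sentence simply links to its most similar predecessor.

```python
import math
from collections import Counter
from itertools import zip_longest

def cosine(a, b):
    # Bag-of-words cosine similarity between two sentences (strings).
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(va[w] * vb[w] for w in va)
    den = math.sqrt(sum(x * x for x in va.values())) * math.sqrt(sum(x * x for x in vb.values()))
    return num / den if den else 0.0

def build_tsg(docs):
    """docs: list of documents, each a list of sentences (strings).
    Returns (nodes, edges); nodes are (doc_id, sent_id, text) tuples,
    edges are (new_node, chosen_node, weight) tuples."""
    nodes, edges = [], []
    # Timestep t introduces the t-th sentence of every document (round-robin).
    for t, row in enumerate(zip_longest(*docs)):
        new_nodes = [(d, t, s) for d, s in enumerate(row) if s is not None]
        for u in new_nodes:
            if nodes:
                # e = 1: connect u to the single most similar node already in the graph.
                v = max(nodes, key=lambda n: cosine(u[2], n[2]))
                edges.append((u, v, cosine(u[2], v[2])))
            nodes.append(u)
    return nodes, edges
```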

Page 7: Timestamped Graph Construction (Model)

• Documents as columns

  – d_i = document i

• Sentences as rows
  – s_j = the j-th sentence of a document

Page 8: Timestamped Graph Construction

• A multi-document example

[Figure: example grid with doc1, doc2, doc3 as columns and sent1, sent2, sent3 as rows]

Page 9: An example TSG: DUC 2007 D0703A-A

Page 10: Timestamped Graph Construction (formalism)

• These are just one instance of TSGs; let's generalize and formalize them
• Def: A timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies a resulting algorithm that takes as input the set of texts M and outputs a graph G
  – d, e, u, f: properties of edges
  – σ, t, i, s: properties of nodes
  – τ: input text transformation function
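As a concrete aid, here is a hedged sketch of how the 9-tuple could be encoded in code; the field names follow the slide's symbols, but the types and the use of plain strings for σ and τ are illustrative simplifications, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TSGSpec:
    d: str                  # edge direction: 'f' (forward), 'b' (backward), or 'u' (undirected)
    e: int                  # number of edges to instantiate per timestep
    u: int                  # 1 = weighted edges, 0 = unweighted
    f: float                # inter-document penalty factor
    sigma: str              # name of the vertex selection / similarity strategy
    t: str                  # text unit type, e.g. 'sentence'
    i: int                  # node increment factor (nodes added per timestep)
    s: int                  # skew degree
    tau: Optional[str]      # input text transformation (segmentation), or None
```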

Page 11: Edge properties (d, e, u, f)

• Edge Direction (d)

– Forward, backward, or undirected

• Edge Number (e)
  – Number of edges to instantiate per timestep

• Edge Weight (u)
  – Weighted or unweighted edges

• Inter-document factor (f)
  – Penalty factor for links between documents in multi-document sets

Page 12: Node properties (σ, t, i, s)

• Vertex selection function σ(u, G)

  – One strategy: among the nodes not yet connected to u in G, choose the one with the highest similarity to u (a minimal sketch follows after this list)
  – Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005)

• Text unit type (t)
  – Most extractive algorithms use sentences as elementary units

• Node increment factor (i)
  – How many nodes get added at each timestep

• Skew degree (s)
  – Models how nodes in multi-document graphs are added
  – Skew degree = how many iterations to wait before adding the 1st sentence of the next document
  – Let's illustrate this …
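Here is a hedged sketch of the "highest similarity among unconnected nodes" strategy for σ(u, G) described above; it reuses the cosine() helper and the node/edge tuples from the construction sketch on Page 6, and the sim parameter is an illustrative hook, not part of the original formalism.

```python
def sigma_max_similarity(u, nodes, edges, sim=None):
    """Return the node most similar to u among nodes not yet connected to u,
    or None if every other node is already connected."""
    sim = sim or (lambda a, b: cosine(a[2], b[2]))   # default: cosine over sentence text
    connected = {b for a, b, _ in edges if a == u} | {a for a, b, _ in edges if b == u}
    candidates = [v for v in nodes if v != u and v not in connected]
    return max(candidates, key=lambda v: sim(u, v)) if candidates else None
```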

Page 13: Skew Degree Examples

• time(d1) < time(d2) < time(d3) < time(d4)
[Figure: three example graphs over documents d1-d4, labelled "Skewed by 1", "Skewed by 2", and "Freely skewed"]

• Freely skewed = only add a new document when it would be linked to by some existing node under the vertex selection function σ
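A small hedged helper to make the skew parameter concrete: under the reading above, a skew degree of s makes each later document wait s extra iterations before its first sentence enters the graph (documents ordered chronologically). This closed form is my interpretation of the slide, not a formula from the talk.

```python
def start_timestep(doc_index: int, skew: int) -> int:
    """Timestep at which document doc_index (0-based, in chronological order)
    contributes its first sentence; skew = 0 reproduces the unskewed, parallel case."""
    return doc_index * skew
```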

Page 14: Input text transformation function (τ)

• Document Segmentation Function (τ)

  – Problem observed in some clusters: some documents in a multi-document cluster are very long
  – It takes many timesteps to introduce all of their sentences, causing too many edges to be drawn
  – τ segments long documents into several sub-documents (see the sketch below)

• This solution is somewhat of a hack – we hope to investigate it further in current and future work

[Figure: document d5 split into sub-documents d5a and d5b]
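A hedged sketch of one possible τ: split any document longer than a threshold into consecutive sub-documents, as in the d5 → d5a, d5b illustration. The max_len threshold of 20 sentences is an arbitrary illustrative value, not one from the talk.

```python
def tau_segment(docs, max_len=20):
    """docs: list of documents (each a list of sentences).
    Returns the same list with over-long documents split into sub-documents."""
    out = []
    for doc in docs:
        if len(doc) <= max_len:
            out.append(doc)
        else:
            out.extend(doc[i:i + max_len] for i in range(0, len(doc), max_len))
    return out
```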

Page 15: Timestamped Graph Construction (Representations)

• We can model a number of different algorithms using this 9-tuple formalism (d, e, u, f, σ, t, i, s, τ)
  – The given toy example: (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
  – LexRank graphs: (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null)
    N = total number of sentences in the cluster; Lmax = the maximum document length
    i.e., all sentences are added to the graph in one timestep, each connected to all others, and cosine scores are assigned as edge weights
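For concreteness, the two instantiations above can be expressed with the TSGSpec sketch from Page 10; the numeric values for N and L_max are placeholders for the cluster-level quantities defined on this slide.

```python
N, L_max = 250, 30   # illustrative only: sentences in the cluster, longest document length

toy_example  = TSGSpec(d='f', e=1, u=0, f=1, sigma='max-cosine-based', t='sentence', i=1,     s=0, tau=None)
lexrank_like = TSGSpec(d='u', e=N, u=1, f=1, sigma='cosine-based',     t='sentence', i=L_max, s=0, tau=None)
```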

Page 16: TSG-based summarization

• Methodology
• Evaluation
• Analysis

Page 17: System Overview

• Sentence splitting

  – Detect and mark sentence boundaries
  – Annotate each sentence with the doc ID and the sentence number
  – E.g., XIE19980304.0061: 4 March 1998 from Xinhua News; XIE19980304.0061-14: the 14th sentence of this document (see the sketch below)

• Graph construction
  – Construct the TSG in this phase
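A tiny hedged sketch of the sentence-ID convention just described ("<doc_id>-<sentence_number>"); the helper name is mine, not from the system.

```python
def parse_sentence_id(sid: str):
    """Split an ID like 'XIE19980304.0061-14' into (doc_id, sentence_number)."""
    doc_id, _, num = sid.rpartition('-')
    return doc_id, int(num)

assert parse_sentence_id("XIE19980304.0061-14") == ("XIE19980304.0061", 14)
```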

Page 18: System Overview (continued)

• Sentence Ranking

– Apply topic-sensitive random walk on the graph to redistribute the weights of the nodes

• Sentence extraction
  – Extract the top-ranked sentences
  – Two different modified MMR re-rankers are used, depending on whether it is the main or the update task

Page 19: Evaluation

• Dataset: DUC 2005, 2006 and 2007
• Evaluation tool: ROUGE, n-gram based automatic evaluation
• Each dataset contains 50 or 45 clusters; each cluster contains a query and 25 documents

• Evaluate some parameters:
  – Do different e values affect the summarization process?
  – How do topic-sensitivity and edge weighting perform when running PageRank?
  – How does skewing the graph affect the information flow in the graph?

Page 20: Evaluation on number of edges (e)

• Tried different e values
• Optimal performance: e = 2
• At e = 1, the graph is too loosely connected and not suitable for PageRank → very low performance
• At e = N, the system reduces to LexRank
[Figure: illustrations of the graph for e = 2 and e = N]

Page 21: Evaluation (other edge parameters)

• PageRank: generic vs topic-sensitive
• Edge weight (u): unweighted vs weighted

• Optimal performance: topic-sensitive PageRank and weighted edges

Topic-sensitive   Weighted edges   ROUGE-1   ROUGE-2
No                No               0.39358   0.07690
Yes               No               0.39443   0.07838
No                Yes              0.39823   0.08072
Yes               Yes              0.39845   0.08282

Page 22: Evaluation on skew degree (s)

• Different skew degrees: s = 0, 1 and 2
• Optimal performance: s = 1
• s = 2 introduces a delay interval that is too large

• Need to try freely skewed graphs

Skew degree   ROUGE-1   ROUGE-2
0             0.36982   0.07580
1             0.37268   0.07682
2             0.36998   0.07489

Page 23: Holistic Evaluation in DUC

• We participated in DUC 2007 with an extractive TSG-based system

• Main task: 12th for ROUGE-2, 10th for ROUGE-SU4 among 32 systems

• Update task: 3rd for ROUGE-2, 4th for ROUGE-SU4 among 24 systems
• Used a modified version of maximal marginal relevance to penalize links in previously read articles

– Extension of inter-document factor (f)

• The TSG formalism is better tailored to deal with update / incremental text tasks
• A new method that may be competitive with current approaches

– Other top scoring systems may do sentence compression, not just extraction

Page 24: Conclusion

• Proposed a timestamped graph model for text understanding and summarization

  – Adds sentences one at a time
• Parameterized model with nine variables

  – Canonicalizes the representation for several graph-based summarization algorithms

Future Work
• Freely skewed model
• Empirical and theoretical properties of TSGs (e.g., in-degree distribution)

Page 25: Backup Slides

(25 minute talk total; 26 Apr 2007, 11:50-12:15)

Page 26: Differences for main and update task processing

Main task:

1. Construct a TSG for input cluster

2. Run topic-sensitive PageRank on the TSG

3. Apply first modified version of MMR to extract sentences

Update task:

• Cluster A:
  – Construct a TSG for cluster A
  – Run topic-sensitive PageRank on the TSG
  – Apply the second modified version of MMR to extract sentences

• Cluster B:
  – Construct a TSG for clusters A and B
  – Run topic-sensitive PageRank on the TSG; only retain sentences from B
  – Apply the second modified version of MMR to extract sentences

• Cluster C:
  – Construct a TSG for clusters A, B and C
  – Run topic-sensitive PageRank on the TSG; only retain sentences from C
  – Apply the second modified version of MMR to extract sentences
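A hedged sketch of this update-task flow, with the TSG construction, ranking, and MMR steps passed in as callables; the function names and the summary length k are placeholders, not the system's actual interfaces.

```python
def summarize_update_task(clusters, query, build_tsg, rank, mmr_update, k=10):
    """clusters: list of document clusters in reading order (A, B, C, ...);
    each cluster is a list of documents, each document a list of sentences."""
    summaries, previously_read = [], []
    for idx, cluster in enumerate(clusters):
        docs = [doc for c in clusters[:idx + 1] for doc in c]   # TSG over A, then A+B, then A+B+C
        graph = build_tsg(docs)
        scores = rank(graph, query)                             # topic-sensitive PageRank
        current = [s for doc in cluster for s in doc]           # only retain sentences from the newest cluster
        summaries.append(mmr_update(current, scores, previously_read, k))
        previously_read.extend(summaries[-1])
    return summaries
```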

Page 27: Sentence Ranking

• Once a timestamped graph is built, we want to compute a prestige score for each node
• PageRank: use an iterative method that allows the weights of the nodes to redistribute until stability is reached
• Similarities as edges → weighted edges; query → topic-sensitive

[Equation: node score = topic-sensitive (Q) portion + standard random walk term]
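The equation itself did not survive the transcript; the LaTeX below is a hedged reconstruction of a standard query-sensitive, edge-weighted PageRank with the two labelled terms, where λ is the damping/mixing weight, Q the query, and w the edge weights. The exact formulation in the talk may differ.

```latex
\[
  PR(u) \;=\; \underbrace{\lambda \,\frac{\mathrm{sim}(u, Q)}{\sum_{v \in G} \mathrm{sim}(v, Q)}}_{\text{topic-sensitive } (Q) \text{ portion}}
  \;+\; \underbrace{(1-\lambda) \sum_{v \in \mathrm{adj}(u)}
        \frac{w(v, u)}{\sum_{z \in \mathrm{adj}(v)} w(v, z)}\, PR(v)}_{\text{standard random walk term}}
\]
```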

Page 28: Sentence Extraction – Main task

• Original MMR: integrates a penalty based on the maximal similarity between the candidate document and any one selected document

• Ye et al. (2005) introduced a modified MMR: it integrates a penalty based on the total similarity between the candidate sentence and all selected sentences

• Score(s) = PageRank score of s; S = the set of selected sentences
• This re-ranker is used in the main task

[Equation: the penalty term sums similarity over all previously selected sentences]
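A hedged reconstruction of the modified MMR scoring described above, with λ as an assumed trade-off weight between the PageRank score and the redundancy penalty; the slide's exact formula may differ.

```latex
\[
  \mathrm{MMR}(s) \;=\; \lambda\, \mathrm{Score}(s)
  \;-\; (1-\lambda) \sum_{s_i \in S} \mathrm{sim}(s, s_i)
\]
```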

Page 29: Sentence Extraction – Update task

• The update task assumes readers have already read the previous cluster(s)

  – This implies we should not select sentences whose information is redundant with the previous cluster(s)

• Propose a modified MMR for the update task:
  – Consider the total similarity of the candidate sentence with all selected sentences and with sentences in previously-read cluster(s)

• P contains some top-ranked sentences in previous cluster(s)

[Equation: as the main-task MMR, with the penalty extended to cover overlap with sentences in P (previous clusters)]
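A hedged reconstruction of the update-task variant, extending the penalty sum over both the selected set S and the previously-read set P; again, λ is an assumed weight and the exact formula may differ.

```latex
\[
  \mathrm{MMR}_{\mathrm{update}}(s) \;=\; \lambda\, \mathrm{Score}(s)
  \;-\; (1-\lambda) \sum_{s_i \in S \,\cup\, P} \mathrm{sim}(s, s_i)
\]
```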

Page 30: References

• Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, (22).

• Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.

• S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics on 6th March 2001.

• Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).

• Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1999.

• Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concept links. In Proceedings of DUC 2005.