
Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution

Rui Yan, Yan Zhang

Peking University

Evolutionary Timeline Summarization

• Motivation: Given the massive collection of time-stamped web documents related to a general news query, ETS aims to return the evolution trajectory along the timeline, consisting of individual but correlated summaries of each date.

• ETS
  – Optimization problem via iterative substitution
  – Balance coherence/diversity measurement and local/global summary quality
  – Four key requirements: relevance, coverage, coherence, diversity

Outline

1. Problem Formulation
2. Related Work
3. Optimization Framework
4. Experiments and Evaluation

Related Work

• Multi-document Summarization (MDS)
  – extractive/abstractive
  – extractive summarization method
    • centroid-based/graph-based ranking method
    • Problem: misses the temporal dimension
  – timeline construction method
    • Clusters of noun phrases and named entities; usefulness and novelty; interest and burstiness
    • Problem: evolutionary characteristics are not considered

• ETS improvement
  – Generate component summaries which have influence on “neighbors”.

Understanding News

• Topic detection and tracking (TDT)
  – Lexical similarity, temporal proximity, query relevance, clustering techniques, etc.

• News correlation
  – Named entities, data or place information, domain knowledge

• ETS
  – Does not seek to cluster “topics” as in TDT, but to utilize the evolutionary correlations of news (coherence/diversity) for summarization

Problem Formulation

• Input: Given a general query $Q = \{q_1, q_2, \dots, q_{|Q|}\}$ from users, where $q_i$ is a query word, we obtain a sentence collection $C$ from query-related documents. We cluster the sentences into $\{C_1, C_2, \dots, C_{|T|}\}$ by their associated publish dates $T = \{t_1, t_2, \dots, t_{|T|}\}$, where $t_i$ is the timestamp of sub-collection $C_i$.

• Output: An evolutionary timeline consisting of a series of individual but correlated summary items, i.e. $I = \{I_1, I_2, \dots, I_{|T|}\}$, where $I_i$ on date $t_i$ is a subset of $C_i$ ($I_i \subseteq C_i$).
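The input step above (splitting the sentence collection into per-date sub-collections) might be sketched as follows. This is only an illustration: the function name `group_by_date` and the representation of a sentence as a `(publish_date, text)` pair are assumptions, not part of the original slides.

```python
from collections import defaultdict
from datetime import date


def group_by_date(sentences):
    """Split a sentence collection C into sub-collections C_1..C_|T|.

    `sentences` is an iterable of (publish_date, text) pairs (an assumed
    representation). Returns the sorted timestamps T and, aligned with
    them, the list of sub-collections.
    """
    buckets = defaultdict(list)
    for publish_date, text in sentences:
        buckets[publish_date].append(text)
    timeline = sorted(buckets)                       # T = {t_1, ..., t_|T|}
    subcollections = [buckets[t] for t in timeline]  # C_i for each t_i
    return timeline, subcollections


# Hypothetical toy input, for illustration only.
toy_sentences = [
    (date(2010, 5, 2), "sentence b"),
    (date(2010, 5, 1), "sentence a"),
    (date(2010, 5, 2), "sentence c"),
]
toy_timeline, toy_subcollections = group_by_date(toy_sentences)
```

Each summary item $I_i$ would then be chosen as a subset of the corresponding entry of `toy_subcollections`.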

4 theoretical measures

• An effective summary should properly consider the following four key requirements:
  – Relevance: be related to the query
  – Coverage: keep alignment with the source collection
  – Coherence: consistency among component summaries
  – Diversity: few redundant sentences

• Related formula:

• Relevance

• Coverage

• Coherence

• Diversity

Objective Function

Utility: Given the source collection, the utility of an individual summary item Ii is evaluated based on the weighted combination of these requirements.
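As a rough illustration of a "weighted combination" of the four requirements, the sketch below uses a plain linear combination. The linear form, the weight values, and the names `utility` and `REQUIREMENTS` are assumptions made for the example; the slides do not specify how the individual scores are computed or weighted.

```python
REQUIREMENTS = ("relevance", "coverage", "coherence", "diversity")


def utility(scores, weights):
    """Linear weighted combination of the four requirement scores.

    `scores` and `weights` are dicts keyed by the names in REQUIREMENTS.
    The linear form is an assumption made for illustration.
    """
    return sum(weights[name] * scores[name] for name in REQUIREMENTS)


# Hypothetical scores for one summary item I_i, with equal weights.
toy_scores = {"relevance": 0.8, "coverage": 0.6, "coherence": 0.4, "diversity": 0.2}
toy_weights = {name: 0.25 for name in REQUIREMENTS}
toy_utility = utility(toy_scores, toy_weights)
```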

The ETS task is to predict the optimized sentence subset $I_i^*$ from the space of all combinations for all dates. The objective function is as follows:

Sentence Selection for Summaries

• $I_i^{(n-1)}$: sentences generated in the (n-1)-th iteration, with $I_i^{(n-1)} \subseteq C_i$
• $S_i^{(n)}$: top-ranked sentences in the n-th iteration, with $S_i^{(n)} \subseteq C_i$
• an intersection set: $Z_i^{(n)} = I_i^{(n-1)} \cap S_i^{(n)}$
• a substitutable sentence set: $X_i^{(n)} = I_i^{(n-1)} - Z_i^{(n)}$
• a candidate sentence set: $Y_i^{(n)} = S_i^{(n)} - Z_i^{(n)}$
• During every iteration, our goal is to find a substitutive pair $\langle x_i, y_i \rangle$ for $I_i$: $\langle x_i, y_i \rangle : X_i \times Y_i \to \mathbb{R}$
• The performance of such a substitution can be measured by the utility gain function.
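The three sets used in each iteration can be sketched with ordinary set operations. The sketch assumes that the intersection set is the overlap of the previous summary and the top-ranked sentences, and that the substitutable and candidate sets are what remains of each after removing that overlap. The function name `substitution_sets` and the sentence identifiers are hypothetical.

```python
def substitution_sets(prev_summary, top_ranked):
    """Return (Z, X, Y) for one date in one iteration.

    Z: sentences in both the previous summary and the top-ranked set.
    X: substitutable sentences (in the previous summary only).
    Y: candidate sentences (in the top-ranked set only).
    """
    z = prev_summary & top_ranked
    x = prev_summary - z
    y = top_ranked - z
    return z, x, y


# Hypothetical sentence identifiers, for illustration only.
toy_prev = {"s1", "s2", "s3"}
toy_top = {"s2", "s3", "s4"}
toy_z, toy_x, toy_y = substitution_sets(toy_prev, toy_top)
```

Under this reading, only `s1` may be swapped out and only `s4` may be swapped in, so a substitutive pair is drawn from the product of `toy_x` and `toy_y`.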

Balanced Optimization

• The objective function changes into maximization of the utility gain by substituting $\langle x_i, y_i \rangle$ during each iteration, formally,

$$\langle x_i, y_i \rangle = \arg\max_{x_i \in X_i,\; y_i \in Y_i} \Delta u_{x_i, y_i}$$

• To make a tradeoff between the global optimization and local optimization, the utility for $I_i$ can be rewritten as follows:
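A minimal sketch of one substitution step, assuming the utility gain of a pair is the utility of the summary after the swap minus the utility before it. The `utility` callback is a stand-in supplied by the caller; the function name `best_substitution` and the toy values are hypothetical and not taken from the slides.

```python
from itertools import product


def best_substitution(summary, x_set, y_set, utility):
    """Find the pair (x, y) that maximizes the utility gain of swapping
    sentence x out of `summary` and sentence y in.

    Returns (pair, gain), or (None, 0.0) when no swap improves utility.
    """
    base = utility(summary)
    best_pair, best_gain = None, 0.0
    # Sorting only makes tie-breaking deterministic.
    for x, y in product(sorted(x_set), sorted(y_set)):
        candidate = (summary - {x}) | {y}
        gain = utility(candidate) - base
        if gain > best_gain:
            best_pair, best_gain = (x, y), gain
    return best_pair, best_gain


# Toy utility: the sum of hypothetical per-sentence values.
toy_value = {"a": 1, "b": 2, "c": 5, "d": 3}


def toy_sum_utility(sentences):
    return sum(toy_value[s] for s in sentences)


toy_pair, toy_gain = best_substitution({"a", "b"}, {"a", "b"}, {"c", "d"}, toy_sum_utility)
```

Repeating this step until no pair yields a positive gain gives a simple iterative substitution loop; the balanced local/global variant described on the following slides adds constraints that this sketch does not model.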

Interpolative Optimization

Can the algorithm run into extreme situations? A significant rise in local utility that offsets a large loss in global utility would still count as an acceptable selection, and vice versa.

Local Optimization

Global Optimization

• A new balanced maximization framework enforcing both local and global optimization is proposed.

THM : all possible <x,y> pairs

A straightforward interpretation is that we find the maximum overall utility in the j-th state space on date ti, while at the same time the global utility and the local utility satisfy the four constraints.

ML :

MG:

A[a][b][c] = max{$M_{j,a}$}, where a records the column being processed, b records how many $M^G_{j,i} < 0$ occur before column a on the path, and c records the sum of $M^G_{j,i}$ before column a on the path.

P[a][b][c]: record the path information

Experiment: Dataset

• 10251 news articles from 10 selected sources.
• 6 topics belonging to different categories

Experimental System Setup

• Preprocessing: discarding non-event texts and filtering out events that are not relevant to any query word.

• Compression Rate: the compression rate on $t_i$ is set as $|C_i| / |C|$.

• Off-line Systems vs On-line System: off-line systems are optimized based on neighboring summaries on dates both before and after the current date, while the on-line system considers only the neighboring summaries generated previously.
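The compression-rate setting can be illustrated with a short sketch. It assumes the rate for date $t_i$ is the share of all sentences that fall on that date, i.e. $|C_i| / |C|$; how the rate is then turned into a summary length is not stated in the slides, so that step is left out. The function name `compression_rates` is hypothetical.

```python
def compression_rates(subcollections):
    """Return |C_i| / |C| for each sub-collection C_i.

    `subcollections` is a list of sentence lists, one per date.
    """
    total = sum(len(c) for c in subcollections)
    if total == 0:
        raise ValueError("the sentence collection is empty")
    return [len(c) / total for c in subcollections]


# Hypothetical sub-collections for two dates.
toy_rates = compression_rates([["s1", "s2", "s3"], ["s4"]])
```

Under this assumption, dates with more reported sentences receive a proportionally larger share of the timeline.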

Algorithms for Comparison

• Random: selects sentences randomly.

• Centroid: extracts sentences according to the parameters (centroid value, positional value, first-sentence overlap).

• GMDS: a graph-based method that constructs a connectivity graph among sentences and applies a graph-based ranking algorithm to rank sentences.

• Chieu: a similar timeline system that uses interest and burstiness ranking.

• ETS: ETS1 for the off-line system and ETS2 for the on-line system.

Overall Performance

Strategy Selection

Constraints Selection

From Figure 4, we notice Constraint 1 and Constraint 2 are useful. Both Constraint 3 and Constraint 4 are beneficial in iteration count performance because they reduce the available search space and facilitate early pruning for state paths in Algorithm 2.

Conclusion

Advantage:

• The objective function fully accounts for the four properties. In particular, coherence is taken into account, which indicates that neighboring information is essential in an evolutionary timeline trajectory.

Disadvantage:

• The time span of each sub-collection is not flexible. Burstiness could be used to decide the deadline of each sub-collection.

Research of Yan Zhang

Yan's general research areas are in databases, massive information processing and Web technologies, with particular emphasis on Web information processing systems. Specifically, his research work includes the following. (Please look at his publications page to read some of his recent papers.)

• Search Precisely and Accurately (Present)
  – We propose a new technique, "search wikily", which can help users to understand search results more logically and holistically. Furthermore, we try to search semantically with the help of semantic networks, such as WordNet and Wikipedia.
  – This work is currently supported by the National Key Technology R&D Pillar Program in the 11th Five-year Plan of China (Research No.: 2009BAH47B05).

• Web-based Event Detecting, Tracking and Analyzing (EDTA) (Present)
  – The goal of this project is to help people better understand what happened and what is happening in the real world.
  – This work is currently supported by NSFC (Grant No. 61073081) and the Guangdong-MOE Cooperation Funding Scheme (Project No. 2009B090300028).

• Large-scale and Distributed Searching (Present)
  – Search results suffer from the rapid growth in the number of web pages. So far there are several approaches, such as clustering, vertical search and user behavior analysis, to name a few. Yan and his group are interested in improving the fundamental algorithms.

Research

• Rui Yan has a broad interest in real-world problems related to text information, social networks, web applications, scientific literature, and multimedia. Rui's research focuses on Information Retrieval, Natural Language Processing/Computational Linguistics, Knowledge Management and Artificial Intelligence. More specifically, he is now conducting research into summarization, social network mining and event detection.
