Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution
Rui Yan, Yan Zhang
Peking University
Evolutionary Timeline Summarization
• Motivation: Given the massive collection of time-stamped web documents related to a general news query, ETS aims to return the evolution trajectory along the timeline, consisting of individual but correlated summaries of each date.
• ETS
– An optimization problem solved via iterative substitution
– Balances coherence/diversity measurement and local/global summary quality
– Four key requirements: relevance, coverage, coherence, diversity
Outline
1 Problem Formulation
2 Related Work
3 Optimization Framework
4 Experiments and Evaluation
Related Work
• Multi-document Summarization (MDS)
– Extractive/abstractive
– Extractive summarization methods
  • Centroid-based/graph-based ranking methods
  • Problem: they miss the temporal dimension
– Timeline construction methods
  • Clusters of noun phrases and named entities; usefulness and novelty; interest and burstiness
  • Problem: evolutionary characteristics are not considered
• ETS improvement
– Generate component summaries which have influence on "neighbors".
Understanding News
• Topic detection and tracking (TDT)
– Lexical similarity, temporal proximity, query relevance, clustering techniques, etc.
• News correlation
– Named entities, date or place information, domain knowledge
• ETS
– Does not seek to cluster "topics" as in TDT, but utilizes evolutionary correlations of news coherence/diversity for summarization
Problem Formulation
• Input: Given a general query $Q = \{q_1, q_2, \ldots, q_{|Q|}\}$ from users, where $q_i$ is a query word, we obtain a sentence collection $C$ from query-related documents. We cluster the sentences into $\{C_1, C_2, \ldots, C_{|T|}\}$ by their associated publish dates $T = \{t_1, t_2, \ldots, t_{|T|}\}$, where $t_i$ is the timestamp of sub-collection $C_i$.
• Output: An evolutionary timeline which consists of a series of individual but correlated summary items, i.e. $I = \{I_1, I_2, \ldots, I_{|T|}\}$, where $I_i$ on date $t_i$ is a subset of $C_i$ ($I_i \subseteq C_i$).
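The input step above can be sketched as a grouping of time-stamped sentences into date-ordered sub-collections. This is a minimal illustration only: the function name `group_by_date` and the placeholder sentences are hypothetical, and how sentences are extracted from the retrieved documents is not covered here.

```python
from collections import defaultdict
from datetime import date


def group_by_date(sentences):
    """Partition (publish_date, text) pairs into date-ordered sub-collections.

    Returns (T, C): the sorted timestamps t_1..t_|T| and the matching
    sub-collections C_1..C_|T|. The name and signature are hypothetical.
    """
    buckets = defaultdict(list)
    for publish_date, text in sentences:
        buckets[publish_date].append(text)
    timestamps = sorted(buckets)
    return timestamps, [buckets[t] for t in timestamps]


# Placeholder sentences, purely for illustration.
sentences = [
    (date(2010, 4, 22), "sentence b"),
    (date(2010, 4, 20), "sentence a"),
    (date(2010, 4, 22), "sentence c"),
]
T, C = group_by_date(sentences)
```

Each summary item is then chosen as a subset of the sub-collection that shares its timestamp.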
4 theoretical measures
• An effective summary should properly consider the following four key requirements:
– Relevance: be related to the query
– Coverage: keep alignment with the source collection
– Coherence: consistency among component summaries
– Diversity: few redundant sentences
• Related formula:
• Relevance
• Coverage
• Coherence
• Diversity
Objective Function
Utility: Given the source collection, the utility of an individual summary item Ii is evaluated based on the weighted combination of these requirements.
The ETS task is to predict the optimized sentence subset $I_i^*$ from the space of all combinations for all dates. The objective function is as follows:
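As a rough illustration of a utility built as a weighted combination of the four requirements, the sketch below uses bag-of-words cosine similarity as a stand-in for every measure. These scoring functions, the equal weights, and all names (`bow`, `cosine`, `utility`) are assumptions made for the example and are not the formulas used by the authors.

```python
import math
from collections import Counter
from itertools import combinations


def bow(texts):
    """Bag-of-words term counts over a list of strings."""
    return Counter(w for t in texts for w in t.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def utility(summary, query, collection, neighbors,
            weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted combination of four stand-in measures (all hypothetical)."""
    s = bow(summary)
    relevance = cosine(s, bow(query))            # related to the query
    coverage = cosine(s, bow(collection))        # aligned with the source
    coherence = (sum(cosine(s, bow(n)) for n in neighbors) / len(neighbors)
                 if neighbors else 0.0)          # consistent with neighbors
    pairs = list(combinations(summary, 2))
    redundancy = (sum(cosine(bow([a]), bow([b])) for a, b in pairs) / len(pairs)
                  if pairs else 0.0)
    diversity = 1.0 - redundancy                 # few redundant sentences
    w_rel, w_cov, w_coh, w_div = weights
    return (w_rel * relevance + w_cov * coverage
            + w_coh * coherence + w_div * diversity)


on_topic = utility(["oil spill"], ["oil"], ["oil spill", "rig"], [])
off_topic = utility(["weather report"], ["oil"], ["oil spill", "rig"], [])
```

With these stand-ins, a summary that shares terms with the query and the source collection scores higher than one that shares none.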
Sentence Selection for Summaries
• $I_i^{(n-1)}$: sentences generated in the (n-1)-th iteration
• $S_i^{(n)}$: top-ranked sentences in the n-th iteration
• An intersection set: $Z_i^{(n)} = I_i^{(n-1)} \cap S_i^{(n)}$
• A substitutable sentence set: $X_i^{(n)} = I_i^{(n-1)} - Z_i^{(n)}$
• A candidate sentence set: $Y_i^{(n)} = S_i^{(n)} - Z_i^{(n)}$
• During every iteration, our goal is to find a substitutive pair $\langle x_i, y_i \rangle$ for $I_i$: $\langle x_i, y_i \rangle : X_i \times Y_i \rightarrow \mathbb{R}$
• The performance of such a substitution can be measured by the utility gain function:
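The three sets used in each iteration can be sketched with ordinary set operations. The sketch assumes that the intersection set is the overlap of the previous summary with the top-ranked sentences, and that the substitutable and candidate sets are what remains of each after removing that overlap; the function name is hypothetical.

```python
def substitution_sets(prev_summary, top_ranked):
    """Split sentences into the intersection, substitutable and candidate sets.

    prev_summary: sentences kept from iteration n-1 (I).
    top_ranked:   top-ranked sentences in iteration n (S).
    The set operators below are an assumed reading of the set names.
    """
    I_prev, S = set(prev_summary), set(top_ranked)
    Z = I_prev & S   # in both: nothing to change
    X = I_prev - Z   # in the old summary only: may be swapped out
    Y = S - Z        # newly top-ranked only: may be swapped in
    return Z, X, Y


Z, X, Y = substitution_sets(["s1", "s2", "s3"], ["s2", "s3", "s4"])
```

A substitutive pair takes one sentence from X and one from Y, so the summary size stays fixed across iterations.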
Balanced Optimization
• The objective function changes into maximizing the utility gain obtained by substituting $\langle x_i, y_i \rangle$ during each iteration, formally,
$$\langle x_i, y_i \rangle = \arg\max_{x_i \in X_i,\, y_i \in Y_i} \Delta u_{x_i, y_i}$$
• To make a tradeoff between the global optimization and local optimization, the utility for $I_i$ can be rewritten as follows:
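The per-iteration search for the best substitutive pair can be sketched as an exhaustive argmax over all swaps, repeated until no swap improves the utility. This is a simplified sketch, not the authors' algorithm: the top-ranked set is held fixed across iterations, the utility is passed in as an arbitrary function, and the names `best_substitution` and `iterate` and the toy scores are hypothetical.

```python
from itertools import product


def best_substitution(summary, X, Y, utility):
    """Return (gain, x, y) for the swap with the largest utility gain, or None."""
    base = utility(summary)
    best = None
    for x, y in product(sorted(X), sorted(Y)):
        swapped = (summary - {x}) | {y}   # replace x with y
        gain = utility(swapped) - base
        if best is None or gain > best[0]:
            best = (gain, x, y)
    return best


def iterate(summary, top_ranked, utility, max_iters=10):
    """Apply the best swap repeatedly until no swap has positive gain."""
    summary = set(summary)
    S = set(top_ranked)                   # assumption: fixed across iterations
    for _ in range(max_iters):
        Z = summary & S
        best = best_substitution(summary, summary - Z, S - Z, utility)
        if best is None or best[0] <= 0:
            break
        _, x, y = best
        summary = (summary - {x}) | {y}
    return summary


# Toy utility: the sum of made-up per-sentence scores.
scores = {"s1": 0.1, "s2": 0.5, "s3": 0.4, "s4": 0.9}
toy_utility = lambda s: sum(scores[t] for t in s)
result = iterate({"s1", "s2"}, ["s2", "s3", "s4"], toy_utility)
```

In the toy run, the low-scoring sentence is swapped for the best candidate, after which no substitutable sentence remains and the loop stops.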
Interpolative Optimization
Can the algorithm run into extreme situations? With interpolation, a significant rise in local utility that offsets a large loss in global utility still counts as an acceptable selection, and vice versa.
Local Optimization
Global Optimization
• A new balanced maximization framework enforcing both local and global optimization is proposed.
THM : all possible <x,y> pairs
A straightforward reading is that we find a maximized overall utility at the j-th status space on date $t_i$, while at the same time the global utility and local utility satisfy the four constraints.
ML :
MG:
A[a][b][c] = max{$M_{j,a}$}, where a records the column being processed, b records how many $M^G_{j,i} < 0$ occur before column a on the path, and c records the sum of $M^G_{j,i}$ before column a on the path.
P[a][b][c]: record the path information
Experiment: Dataset
• 10251 news articles from 10 selected sources.
• 6 topics belonging to different categories.
Experimental System Setup
• Preprocessing: discarding non-event texts and filtering out events not relevant to any query word.
• Compression Rate: the compression rate on $t_i$ is set as $|C_i| / |C|$.
• Off-line Systems vs On-line System: off-line systems are optimized based on neighboring summaries on dates both before and after the current date, while the on-line system considers only neighboring summaries generated previously.
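The difference between the off-line and on-line settings comes down to which neighboring dates a summary may look at. A minimal sketch, assuming a fixed neighborhood `window` (a hypothetical parameter, as is the function name):

```python
def neighbor_indices(i, n_dates, window=1, online=False):
    """Indices of the neighboring dates whose summaries influence date i.

    Off-line: dates within `window` positions before and after i.
    On-line:  only dates within `window` positions before i.
    """
    start = max(0, i - window)
    end = i if online else min(n_dates, i + window + 1)
    return [j for j in range(start, end) if j != i]


offline = neighbor_indices(2, 5)               # looks both ways
online = neighbor_indices(2, 5, online=True)   # looks back only
```

In the on-line case the first date has no neighbors at all, so its summary has to be built without any coherence signal from other dates.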
Algorithms for Comparison
• Random: select sentences randomly.
• Centroid: extract sentences according to the parameters (centroid value, positional value, first-sentence overlap).
• GMDS: graph-based method which constructs a connectivity graph among sentences and applies a graph-based ranking algorithm to rank sentences.
• Chieu: a similar timeline system, utilizing interest and burstiness ranking.
• ETS: ETS1 for the off-line system and ETS2 for the on-line system.
Overall Performance
Strategy Selection
Constraints Selection
From Figure 4, we notice Constraint 1 and Constraint 2 are useful. Both Constraint 3 and Constraint 4 are beneficial in iteration count performance because they reduce the available search space and facilitate early pruning for state paths in Algorithm 2.
Conclusion
Advantage:
• The objective function fully measures all four properties. In particular, coherence is taken into account, which indicates that neighboring information is essential in an evolutionary timeline trajectory.
Disadvantage:
• The time span of each sub-collection is not flexible. Burstiness may be applied to decide the deadline of each sub-collection.
Research of Yan Zhang
Yan's general research areas are in databases, massive information processing and Web technologies, with particular emphasis on Web information processing systems. Specifically, his research work includes the following.
• Search Precisely and Accurately (Present)
– We propose a new technique, "search wikily", which can help users to understand search results more logically and holistically. Furthermore, we try to search semantically with the help of semantic networks, such as WordNet and Wikipedia.
– This work is currently supported by the National Key Technology R&D Pillar Program in the 11th Five-year Plan of China (Research No.: 2009BAH47B05).
• Web-based Event Detecting, Tracking and Analyzing (EDTA) (Present)
– The goal of this project is to help people better understand what happened and what is happening in the real world.
– This work is currently supported by NSFC (Grant No. 61073081) and the Guangdong-MOE Cooperation Funding Scheme (Project No. 2009B090300028).
• Large-scale and Distributed Searching (Present)
– Search results suffer from the rapid increase in the number of web pages. So far there are some approaches, such as clustering, vertical searching and user behavior analysis, to name a few. Yan and his group are interested in improving the fundamental algorithms.
Research of Rui Yan
Rui Yan has a broad interest in real-world problems related to text information, social networks, web applications, scientific literature, and multimedia. Rui's research focuses on Information Retrieval, Natural Language Processing/Computational Linguistics, Knowledge Management and Artificial Intelligence. More specifically, he is now conducting research into summarization, social network mining and event detection.