A System for Query-Specific Document Summarization
Ramakrishna Varadarajan,
Vagelis Hristidis.
FLORIDA INTERNATIONAL UNIVERSITY,
School of Computing and Information Sciences,
Miami.
Slide 1
Roadmap
•Need for query-specific summaries
•Our approach– Building a document graph
– Definition of summary
– Rank Summaries
• Efficient computation of summaries
• Evaluation of summarization process– Quality
– Performance
•Related Work
• Conclusions Florida International University (FIU)
Slide 2
Roadmap
•Need for query-specific summaries
•Our approach– Building a document graph
– Definition of summary
– Rank Summaries
• Efficient computation of summaries
• Evaluation of summarization process– Quality
– Performance
•Related Work
• Conclusions Florida International University (FIU)
Slide 3
Need for Query-Specific Summaries
• Locating relevant information is hard.
• Summaries are helpful because:– Provide a Quick preview of the document.
– Allow users to quickly decide relevance.
– Save user’s browsing time.
• Success of Web search engines – Query specific snippetsare important.
• Two categories of summaries:– Query-Independent – Most of prior works.
– Query-Specific – Applicable to web search engines.Florida International University (FIU)
Slide 4
Motivation
Query-Specific Summaries
Florida International University (FIU)
Slide 5
Motivation
Drawbacks
• Association between query keywords is unclear.
• Naïve approach for summarization.
• Ignores semantic relations between keywords in the document.
Summarization research till date
•Mostly Query-Independent.
• Not applicable for web search.
Florida International University (FIU)
Slide 6
Roadmap
•Need for query-specific summaries
•Our approach– Building a document graph
– Definition of summary
– Rank Summaries
• Efficient computation of summaries
• Evaluation of summarization process– Quality
– Performance
•Related Work
• Conclusions Florida International University (FIU)
Slide 7
Our Approach
• Document � graph
• We call it Document Graph.
Three Steps
Step 1: Preprocess
• Build a document graph, G.
Step 2: Summary Generation
• Given a query Q and a document graph G,
Summaries � Spanning Trees that cover all keywords
Step 3: Rank spanning trees.Florida International University (FIU)
Slide 8
Building Document Graphs
• Parse the document.
• Split it into text fragments (using delimiters or tags).
• Text Fragments represented as Nodes
• Add an edge between 2 nodes, if semantically related.
• Edges : Semantic Links
• Edge weights: Degree of association
Florida International University (FIU)
Slide 9
Example
v0
v3
v1
v2
0.026
0.015
0.015
v7 v13
v14
0.027
0.017 0.043v10
0.023
v4v8
v5
0.015
0.058 0.091
0.0150.053
0.027
0.028
0.015
v6
0.044
0.06
0.015
0.015
0.015
0.032
0.060
v90.015 0.015
v12
v110.015
0.015
0.015
0.015
v15v16 0.015
0.037
Sample DocumentDocument Graph
• Parsing delimiter – NewLine.
• Text Fragments – Paragraphs.
• 17 text fragments (v0…v16).
• 17 nodes in Document Graph.Florida International University (FIU)
Slide 10
Input parameters for Document Graph construction
–Parsing Delimiters
• For Plain Text – Newline or Period
• For HTML – Tags (<p>,<br>,<ul><ol>,<table>…etc.)
–Threshold for Edge weights
• Tradeoff of Quality and Performance.
• Edges with weights lesser, are not added.
–Maximum Fragment Size
• Limit on Node Size
Florida International University (FIU)
Slide 11
Computing edges of Document Graphs
• For every pair of nodes,
� Common Words are used (stops words – ignored)
� Thesaurus and stemmer used (rely on Oracle Intermedia Text services)
� If EScore(e) ≥ threshold, an edge is added.
• Special Case
� Adjacent Text Fragments.
� Share Close Proximity.
� Weight = Max (EScore(e) ,threshold).
Florida International University (FIU)
Slide 12
Edge Scoring
• EScore
A tf*idf adaptation.
–Query Independent.
–Edge e(u,v)
w – common word,
t (v) – text fragment corresponding to node v.
Size (v) –number of words in text fragment t(v).
( )
))(())((
))())),(()),(((
)( ))()((
vtsizeutsize
widfwvttfwutft
eEScore vtutw
+
⋅+=
∑∈ I
Florida International University (FIU)
Slide 13
Example (cont’d)
v0
v3
v1
v2
0.026
0.015
0.015
v7 v13
v14
0.027
0.017 0.043v10
0.023
v4v8
v5
0.015
0.058 0.091
0.0150.053
0.027
0.028
0.015
v6
0.044
0.06
0.015
0.015
0.015
0.032
0.060
v90.015 0.015
v12
v110.015
0.015
0.015
0.015
v15v16 0.015
0.037
Sample DocumentDocument Graph
Common words:
• BrainGate,
• CyberkineticsReasons for high weight
• Rare Words (idf is large).
Florida International University (FIU)
Slide 14
Computing Query-Specific Summaries
• Given a Query, Q and a Document Graph, G:
Summary �Minimal Total Spanning Tree.
Minimal Total Spanning Tree
• Total – Every keyword in at least one node (AND semantics)
• Minimal – To avoid redundancy (Eliminating useless leaves)
Summarization Problem
Given – Document Graph G and a Query Q
Find – Top (best) Minimal Total Spanning Tree (Summary)Florida International University (FIU)
Slide 15
Example
v0
v3
v1
v2
0.026
0.015
0.015
v7 v13
v14
0.027
0.017 0.043v10
0.023
v4v8
v5
0.015
0.058 0.091
0.0150.053
0.027
0.028
0.015
v6
0.044
0.06
0.015
0.015
0.015
0.032
0.060
v90.015 0.015
v12
v110.015
0.015
0.015
0.015
v15v16 0.015
0.037
Sample Document Document Graph
v100.017v0
0.046 0.008Top Summary for “Brain Chip Research"
Score = 67.74
Brain chip offers hope for paralyzed. Donoghue’s initial research published in the science journal Nature in 2002 consisted of attaching an implant to a monkey’s brain that enabled it to play a simple pinball computer game remotely.
Florida International University (FIU)
Slide 16
Summary Scoring Function
Requirements
Properties of Good Summaries :
• Highly relevant nodes (fragments) improve Score.
• Loose semantic Links degrade Score.
• Large spanning trees get a degraded Score.
• Based on Query-dependent & Query-Independent factors.
Summary Scoring
• This function satisfies these requirements.
• Best Summary has minimum score
∑∑∈
∈
+=Teedge
Tvnode
vNScoreb
eEScorea
)(
1)(
1Score(T)
a andb are calibrating parameters.
(a=1 & b=0.5)Florida International University (FIU)
Slide 17
Summary Node Scoring
• Node Scoring
–Widely used Okapi weighting.
–Query Dependent.
–NScore (v) =
N – Number of Documents in the collection.
tf – Term Frequency .
df – Document Frequency.
avdl – Average Document Length.
qtfk
qtfk
tfavdl
dlbbk
tfk
df
dfN
dQt++
++−
++
+−∑∈ 3
3
1
1
,
)1(.
))1((
)1(.
5.0
5.0ln
Florida International University (FIU)
Slide 18
Roadmap
•Need for query-specific summaries
•Our approach– Building a document graph
– Definition of summary
– Rank Summaries
•Efficient computation of summaries
• Evaluation of summarization process– Quality
– Performance
•Related Work
• Conclusions Florida International University (FIU)
Slide 19
ALGORITHMS
• Adaptations of BANKS [ICDE02] Algorithms
• Input : Document Graph G and Query Q
• Output : Minimal Total Spanning trees (Summaries)
• Enumeration Algorithm.
• Expanding Search Algorithm.
Pre-computation:
–A Full text Index.
–All Pairs shortest paths for each document graph
(edge weight of edge e= 1/Escore(e)).
Florida International University (FIU)
Slide 20
Roadmap
•Need for query-specific summaries
•Our approach– Building a document graph
– Definition of summary
– Rank Summaries
• Efficient computation of summaries
•Evaluation of summarization process– Quality
– Performance
•Related Work
• Conclusions Florida International University (FIU)
Slide 21
User Surveys
• To evaluate the Quality of Summaries
• Subjects : 15 Students from FIU (all levels & various majors).
• Users evaluate summaries based on their Quality.
• Rating: 1 (least descriptive) to 5 (most descriptive)
• Surveys
–Comparison with Google & MSN Desktop.
–Comparison with DUC 2005 datasets.
Florida International University (FIU)
Slide 22
Comparison with Google & MSN Desktop Engines
3.674.001.003.001.672.005
4.004.673.001.672.671.674
4.004.933.000.672.673.003
3.334.333.002.003.332.002
3.674.873.672.333.672.331
D2D1D2D1D2D1
Our ApproachMSN DesktopGoogle Desktop
Queries
Computer network security projectDeleted computer software5
Large research grantsWorm affected agencies4
Software projectsRecovering worm deleted files3
Algorithms development researchAnti-virus protection2
IT Research awardsMicrosoft worm protection1
Document D2Document D1Queries
Slide 23
Performance Experiments
01
23
45
67
2 3 4 5Number of keywords in query
Pro
cess
ing tim
e(m
sec)
Mult i-result EnumerationMult i-result Expanding
0
1
2
3
4
5
6
2 3 4 5Number of keywords in the query
Pro
cess
ing T
ime (m
sec)
Top-1 Enumeration
Top-1 Expanding
Average times to calculate node weights
17.3311.509.375.31Time (msec)
5432Number of keywords
1.81.41.31.1Top-1 Expanding Search Algorithm.
2.782.11.81.4Top-1 Enumeration Algorithm.
5432Number of keywords
Average ranks of Top-1 Algorithms
News articles from science section of cnn.com
Florida International University (FIU)
Slide 24
Roadmap
•Need for query-specific summaries
•Our approach– Building a document graph
– Definition of summary
– Rank Summaries
• Efficient computation of summaries
• Evaluation of summarization process– Quality
– Performance
•Related Work
• Conclusions Florida International University (FIU)
Slide 25
Related Work
Document Summarization
• Mostly Query-Independent
• Summarizing Web Pages– Berger et.al [SIGIR 2000] synthesizes summaries.
– Paris et.al [CIKM 2000] uses anchor text (ignores content).
• Splitting Web pages in to blocks– Song et.al [WWW2004] Block importance models (learning algorithms)
– Cai et.al [SIGIR 2004] Block level link analysis
• Document modeled as Graphs– Lexrank : Sentence Centrality using link analysis.
– TextRank: “representative” sentences using link analysis.
Keyword Search in Data Graphs
– BANKS [ICDE 2002]: group-steiner tree problem
– DISCOVER, DBXplorer.
– XRANK[2003]: search in XML documents.Florida International University (FIU)
Slide 26
Conclusions
• Method for Query-Specific Summarization.
• Exploiting inherent structure of documents for the purpose of Summarization.
• Enhanced User Satisfaction – User Surveys.
A Prototype of the System available at:
http://dbir.cs.fiu.edu/summarization
Florida International University (FIU)
Slide 27
Thank You !!!
Questions ???
Florida International University (FIU)
Slide 28
Enumeration Algorithm
Florida International University (FIU)
Slide 29
Expanding Search Algorithm
w1
v4
v2
w3
v30.01
0.02
0.02
Florida International University (FIU)
Slide 30
Comparison with DUC peers
3.672.33FT921-743.673.67FT943-16238
4.172.83FT922-133534.174.00FT943-16477
4.332.00FT921-9373.002.83FT931-3563
4.002.00FT922-1903.332.50FT944-8297
2.504.00FT921-77864.662.33FT941-3237
Our Approach
DUC PeerDoc. ID
Our approach
DUCPeerDoc. ID
Query 2 (Women in Parliaments)DUC Topic ID: d321f
Query 1 (International Organized Crime)
DUC Topic ID: d301i
Florida International University (FIU)
Slide 31
DEMO
Florida International University (FIU)
Slide 32
DEMO
Florida International University (FIU)