Arunav Mishra
PIKM 2014
The 7th Workshop for Ph.D. Students at CIKM 2014
Shanghai, 3 November 2014
Linking Today's Wikipedia and
News from the Past
Today’s WIKIPEDIA & Yesterday’s NEWS
Monday, November 09, 2015 PIKM 2014: Linking Today's Wikipedia and News from the Past 2
Mr. Busy (Journalist)
Google News {Eurozone, crisis, Germany, role}
7580 results
What is the role of Germany in the Eurozone crisis?
Wikipedia {Eurozone, crisis, Germany, role}
I don’t understand..!!!
Ctrl + F ..!!
Link Wiki-Excerpts and News Articles
Wiki-excerpts
Wikipedia Article News Articles
…On 16 June 2012 the European Central Bank together with other European leaders hammered out plans for the ECB to become a bank regulator and to form a deposit insurance program to augment national programs. Other economic reforms promoting European growth and employment were also proposed.…
European Leaders to Present Plan to Quell
the Crisis Quickly
Published: June 16, 2012
Overview
1. Tasks and Challenges ✓
2. Datasets, Benchmark, and Evaluation
3. Simplified Wiki2News
4. Approaches
5. Experiments
6. Summary
Linking Wikipedia To News
Wiki2News: “How did it happen?”
For a given Wiki-excerpt, retrieve a ranked list of past news articles providing details
Input: A Wiki-excerpt
Output: Ranked list of news articles
Challenges
1. Leveraging spatio-temporal expressions in Wiki-excerpts
2. Leveraging references to entities in Wiki-excerpts
3. Bridging the vocabulary gap between news articles and Wiki-excerpts
Linking News To Wikipedia
News2Wiki: “How is it remembered?”
For a given set of past news articles, retrieve a ranked list of Wiki-excerpts that summarize the event
Input: Set of news articles
Output: Ranked list of Wiki-excerpts
Challenges
1. Reducing the verbosity of the input news articles
2. Estimating the event span from specific news articles
3. Addressing changes in language usage, sentence construction, and spelling in historic news articles
Datasets
• English Wikipedia dump and revision history
  • Latest dump, released on 8 October 2014, is 10.52 GB compressed
• The New York Times Annotated Corpus
  • 2 million documents published between 1987 and 2007
• ClueWeb 09/12 corpora
  • 2 billion web pages collected in 2009/2012
• Existing entity annotations (Freebase and YAGO2)
... three-year lending adventure (LTRO)".[243]
Reorganization of the European banking system
On 16 June 2012 the European Central Bank together
with other European leaders hammered out plans for
the ECB to become a bank regulator and to form a
deposit insurance program to augment national
programs. Other economic reforms promoting European
growth and employment were also proposed.[244]
Outright Monetary Transactions (OMTs)…
Benchmark
Wiki-excerpt : Existing citations as text boundaries
• Refer to a single event or a story within a larger event
• Validated by citations
European Leaders to Present Plan to Quell the Crisis Quickly
By JACK EWING
Published: June 16, 2012
[Figure: the Wiki-excerpt, with its citations marked as "Boundaries", connected by a "Link" to the news article]
Statistics
• 5,536 excerpts from 4,882 articles linked to the New York Times
• Excerpt types based on length:

Type                     Number of terms   Number of excerpts
Factual excerpts (Short) [1,20]            1,120
Paraphrased excerpts     [21,50]           2,547
Wordy excerpts (Long)    [51,400]          1,869

• 74.2% of the excerpts have temporal expressions
  • Day: on January 19 2000
  • Month: in August 1994
  • Year: in 1995
  • Time Range: since February 1988 until March 1988
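The length-based excerpt types can be sketched as a small helper. This is only an illustration: the function name `excerpt_type` and the whitespace tokenization are assumptions, since the slides do not say how terms were counted.

```python
def excerpt_type(excerpt: str) -> str:
    """Assign an excerpt to a length class using the term ranges from the slide.

    Assumption: terms are counted by splitting on whitespace.
    """
    n_terms = len(excerpt.split())
    if 1 <= n_terms <= 20:
        return "factual"
    if 21 <= n_terms <= 50:
        return "paraphrased"
    if 51 <= n_terms <= 400:
        return "wordy"
    return "out of range"


example = ("Aretha Franklin becomes the first woman inducted into "
           "the Rock and Roll Hall of Fame.")
print(excerpt_type(example))
```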
Evaluation
• Leverage Wikipedia references to news articles contained in the New York Times corpus or ClueWeb 09/12
Wiki2News:
Given a Wiki-excerpt, the targets are the New York Times articles that it refers to
News2Wiki:
Given New York Times articles, the targets are the Wiki-excerpts that refer to them
• Evaluation metrics: Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Precision, Recall
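A minimal sketch of the listed metrics for a single query follows. The function names are illustrative, and the NDCG variant (linear gain, log2 discount) is an assumption, since the slides do not state which formulation was used. Mean Reciprocal Rank is the average of `reciprocal_rank` over all queries.

```python
import math
from typing import Dict, List, Set


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k documents that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranking[:k]) / len(relevant)


def ndcg_at_k(ranking: List[str], gains: Dict[str, int], k: int) -> float:
    """NDCG with linear gain and log2 discount (an assumed variant)."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 1)
              for i, doc in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```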
Linking Year Page Events to News
''I could hear the words my father had spoken to me when I was a child: 'One day
you will sing for kings and queens,' ''Aretha Franklin recalls about an early 1980's
London gala attended by the Prince of Wales and the Queen Mother. She has also
sung in the White House for Presidents Jimmy Carter and Bill Clinton, although she
turned down a ball in Monte Carlo for Princes Albert and Rainier and a performance
for Queen Beatrix of the Netherlands. The reason? Her fear of flying. Franklin --
who is 57 years old, has won 15 Grammy Awards and in 1987 became the
first woman inducted into the Rock and Roll Hall of Fame (though her fear
of flying also kept her from that ceremony) -- has written a self-congratulatory yet
entertaining autobiography. The fourth of five children born to a famous Baptist
preacher and a nurse's aide, she began singing gospel in her father's Detroit church.
At the age of 14, she gave birth to the first of her four children. At 16, after the birth
of her second child, she dropped out of high school. She writes openly of her two
divorces and such failed romances as her on-and-off relationship with the
Temptations' Dennis Edwards and with a man she refers to as Mr. Mystique:
''Because he is a public figure, I prefer to protect his privacy.'' However, she is open
about her feuds with other singers, like Gladys Knight and Cissy Houston. Despite
the emotional upheavals and financial problems, Franklin sang the aria ''Nessun
dorma'' at the 1998 Grammy telecast when Luciano Pavarotti took ill, and has
recorded ''A Rose Is Still a Rose,'' written by Lauryn Hill, which seems the perfect
ending for this bittersweet yet sassy chronicle. Tammy Sill Nesmith
Books in Brief: Nonfiction
By Tammy Sill Nesmith
Published: October 31, 1999
Aretha: From These Roots. By Aretha Franklin and David Ritz. Villard, $25.
1987
This article is about the year 1987. For the number, see 1987 (number). For other uses, see 1987 (disambiguation).
Events
January
• January 1 – Frobisher Bay, Northwest Territories, changes its name to Iqaluit.
• January 2 – Chadian–Libyan conflict – Battle of Fada: The Chadian army destroys a Libyan armored brigade.
• January 3 – Aretha Franklin becomes the first woman inducted into the Rock and Roll Hall of Fame.
• January 4 – 1987 Maryland train collision: An Amtrak train en route from Washington, D.C. to Boston, Massachusetts collides with Conrail engines at Chase, Maryland, killing 16.
Retrieval Task
Given a query event q with a textual description and a temporal expression
Temporal expression:
• Discrete notion of time where the smallest unit equals 1 day
• A temporal expression is an interval in the time domain T
• Example: 1987 = [1987.01.01, 1987.12.31]
Goal: retrieve a ranked list of news articles providing details
Example
Query event: January 3, 1987: Aretha Franklin becomes the first woman inducted into the Rock and Roll Hall of Fame.
Retrieved document: Books in Brief: Nonfiction. Published: October 31, 1999. Aretha: From These Roots.
…The reason? Her fear of flying. Franklin -- who is 57 years old, has won 15 Grammy Awards and in 1987 became the first woman inducted into the Rock and Roll Hall of Fame (though her fear of flying also kept her from that ceremony) …
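The interval view of temporal expressions, with one day as the smallest unit, can be illustrated with a short sketch. The helper names `to_interval` and `length_in_days` are hypothetical and not part of the presented system.

```python
import calendar
from datetime import date
from typing import Optional, Tuple

Interval = Tuple[date, date]


def to_interval(year: int, month: Optional[int] = None,
                day: Optional[int] = None) -> Interval:
    """Map a year, month, or day expression to a closed interval of days."""
    if month is None:
        return date(year, 1, 1), date(year, 12, 31)
    if day is None:
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    single = date(year, month, day)
    return single, single


def length_in_days(interval: Interval) -> int:
    """Number of days covered by a closed interval."""
    begin, end = interval
    return (end - begin).days + 1


# The slide's example: 1987 = [1987.01.01, 1987.12.31]
print(to_interval(1987), length_in_days(to_interval(1987)))
```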
Text-Only (LM)
• Considers only the query text and the document text
• Query-likelihood approach with Dirichlet smoothing
• Language models rank documents according to the query likelihood, smoothed with a background model
Example
Query event: January 3, 1987: Aretha Franklin becomes the first woman inducted into the Rock and Roll Hall of Fame.
Document: Books in Brief: Nonfiction. Published: October 31, 1999. Aretha: From These Roots.
…The reason? Her fear of flying. Franklin -- who is 57 years old, has won 15 Grammy Awards and in 1987 became the first woman inducted into the Rock and Roll Hall of Fame (though her fear of flying also kept her from that ceremony) …
r=0.015
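The query-likelihood approach with Dirichlet smoothing can be sketched as follows. This is the textbook formulation, not necessarily the exact one on the slide (the slide's formula did not survive extraction). The function name, the value of `mu`, and the handling of terms unseen in the collection are assumptions.

```python
import math
from collections import Counter
from typing import List


def dirichlet_log_likelihood(query_terms: List[str], doc_terms: List[str],
                             collection_terms: List[str],
                             mu: float = 1000.0) -> float:
    """Log query likelihood under a Dirichlet-smoothed document model.

    P(w | d) = (tf(w, d) + mu * P(w | C)) / (|d| + mu)
    """
    doc_tf = Counter(doc_terms)
    coll_tf = Counter(collection_terms)
    coll_len = len(collection_terms)
    score = 0.0
    for term in query_terms:
        p_background = coll_tf[term] / coll_len if coll_len else 0.0
        if p_background == 0.0:
            continue  # assumption: terms unseen in the collection are skipped
        score += math.log((doc_tf[term] + mu * p_background)
                          / (len(doc_terms) + mu))
    return score
```

Documents are ranked by descending score; working in log space avoids underflow for long queries.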
Publication Dates (LM+P)
• Generates the query event date independently from the publication date of the document
• The score combines two factors: query likelihood and publication date
• The second factor is estimated with a sigmoid function
• Favors documents that are published close to the query event date
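One way a sigmoid-shaped publication-date factor could look is sketched below. The slide's actual formula is not preserved, so the functional form here is purely an assumption chosen to favor publication dates close to the event date. The default `rate` reuses the value r=0.015 that appears in the transcript, but whether that value plays this role is also an assumption.

```python
import math
from datetime import date


def publication_date_factor(event_date: date, pub_date: date,
                            rate: float = 0.015) -> float:
    """Hypothetical sigmoid-shaped decay in the distance (in days)
    between the event date and the publication date.

    Returns 1.0 when both dates coincide and approaches 0 as they drift apart.
    """
    distance = abs((pub_date - event_date).days)
    exponent = min(rate * distance, 700.0)  # guard against overflow in exp
    return 2.0 / (1.0 + math.exp(exponent))


event = date(1987, 1, 3)
print(publication_date_factor(event, date(1987, 1, 10)))
print(publication_date_factor(event, date(1999, 10, 31)))
```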
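Temporal Expressions (LM+T)
• Generates the query event date independently from the temporal expressions in the document
• The score combines two factors: query likelihood and temporal expressions
• The second factor is estimated from the document intervals, using an indicator function and the length of each interval
• Favors documents with many temporal intervals at finer temporal granularity containing the query event date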
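A sketch of how an indicator function, interval lengths, and the set of document intervals might combine into a temporal-expression factor. The exact formula on the slide is lost, so the averaging over intervals and the `1 / length` weighting are assumptions that merely reproduce the stated preference for many fine-grained intervals containing the event date.

```python
from datetime import date
from typing import List, Tuple

Interval = Tuple[date, date]


def temporal_expression_factor(event_date: date,
                               doc_intervals: List[Interval]) -> float:
    """Average over document intervals of indicator(contains event) / length.

    Shorter (finer-grained) intervals that contain the event date
    contribute more than long ones.
    """
    if not doc_intervals:
        return 0.0
    total = 0.0
    for begin, end in doc_intervals:
        if begin <= event_date <= end:  # indicator function
            total += 1.0 / ((end - begin).days + 1)  # length of interval
    return total / len(doc_intervals)
```

In practice a factor of zero would wipe out the whole product, so some smoothing would be needed; that is left out of this sketch.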
Publication Dates + Temporal Expressions (LM+PT)
• Combines the publication-date, temporal-expression, and text-only approaches
• Retrieves documents that are published close to the query event date and also mention it in their content
• The score combines three factors: query likelihood, publication date, and temporal expressions
Two-stage Cascade Model (CM)
Stage 1: Text retrieval for the query q returns the top-K ranked documents D1 … DK (e.g. top-30)
Stage 2: Temporal expressions are extracted from the top-k (k<K) documents D1 … Dk (e.g. top-10); this temporal feedback yields the query temporal model, which the reranker uses to rerank the K documents and output the top-10
[Figure: pipeline diagram of the two stages]
Query Temporal Model:
• Assumption: feedback expressions are salient time intervals, treated as sets of time points
• Estimate the query temporal model from the top-k documents
• The generative probability of a time point combines two factors for each feedback document: its normalized score and its document temporal model
• The first factor is the normalized query likelihood of the document
• The second factor is estimated from the document temporal model
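The estimation of a query temporal model from feedback documents can be sketched as a score-weighted mixture. The summation over documents is an assumption in the style of relevance models, since the slide's formula is not preserved. Time points are represented as plain strings for brevity, and the scores are assumed to be likelihoods rather than log-likelihoods.

```python
from typing import Dict, List


def query_temporal_model(feedback_scores: List[float],
                         feedback_models: List[Dict[str, float]]
                         ) -> Dict[str, float]:
    """Mix document temporal models, weighted by normalized retrieval scores."""
    total = sum(feedback_scores)
    model: Dict[str, float] = {}
    if total <= 0.0:
        return model
    for score, doc_model in zip(feedback_scores, feedback_models):
        weight = score / total  # normalized query likelihood of the document
        for time_point, prob in doc_model.items():
            model[time_point] = model.get(time_point, 0.0) + weight * prob
    return model


scores = [3.0, 1.0]
models = [{"1987-01-03": 1.0},
          {"1987-01-03": 0.5, "1999-10-31": 0.5}]
print(query_temporal_model(scores, models))
```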
Document Temporal Model:
• We estimate the document temporal model from the temporal expressions in the document
• We use Jelinek-Mercer smoothing, which interpolates the document model with a background temporal model using an interpolation parameter
• The smoothing has two effects:
  • Addresses the zero-probability issue
  • Has an Inverse Document Frequency-like effect for frequent temporal expressions
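Jelinek-Mercer smoothing of a document temporal model can be sketched as a linear interpolation with the background model. The value of `lam`, the convention that `lam` weights the background, and the string representation of time points are assumptions.

```python
from collections import Counter
from typing import Dict, List


def document_temporal_model(doc_time_points: List[str],
                            background: Dict[str, float],
                            lam: float = 0.5) -> Dict[str, float]:
    """P(t | d) = (1 - lam) * P_ml(t | d) + lam * P(t | background)."""
    counts = Counter(doc_time_points)
    n = len(doc_time_points)
    model = {}
    for time_point, p_background in background.items():
        p_ml = counts[time_point] / n if n else 0.0
        model[time_point] = (1.0 - lam) * p_ml + lam * p_background
    return model


background = {"1987-01-03": 0.5, "1999-10-31": 0.5}
print(document_temporal_model(["1987-01-03", "1987-01-03"], background))
```

Time points absent from the document still receive non-zero probability from the background, which is the zero-probability fix named above.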
Document Scoring:
• We re-rank documents according to KL-divergence
• The score preserves textual relevance and adds two individual components:
  1. Query and document temporal models (captures salient time points)
  2. Event happening date and publication date (captures recency)
• As final output, we present the top-10 documents of the re-ranked list
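Re-ranking by KL-divergence between the query and document temporal models can be sketched as below. This covers only the temporal-model component; how the presented system combines it with textual relevance and recency is not recoverable from the slides, so that part is left out. The document models are assumed to be smoothed, so every time point in the query model has non-zero probability.

```python
import math
from typing import Dict, List


def kl_divergence(query_model: Dict[str, float],
                  doc_model: Dict[str, float]) -> float:
    """KL(query || document) over the time points of the query model."""
    return sum(p * math.log(p / doc_model[t])
               for t, p in query_model.items() if p > 0.0)


def rerank(query_model: Dict[str, float],
           doc_models: Dict[str, Dict[str, float]],
           top_n: int = 10) -> List[str]:
    """Order documents by ascending divergence and keep the top_n."""
    ranked = sorted(doc_models,
                    key=lambda doc_id: kl_divergence(query_model,
                                                     doc_models[doc_id]))
    return ranked[:top_n]
```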
Datasets, Benchmark, and Evaluation
Dataset:
• The New York Times Annotated Corpus
• 2 million articles published between years 1987 and 2007
Benchmark:
• Random sample of 50 Wikipedia events as queries
• Events that happened between 1987 and 2007
Evaluation:
• We pooled top-10 from all the methods – 1297 unique query-document pairs
• Crowdflower assessors judge an article as
(2) Highly Relevant: Main topic is the query event
(1) Somewhat Relevant: Mentions the query event
(0) Irrelevant: Completely unrelated to the query event
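Pooling the top-10 results of all methods into a set of unique query-document pairs for assessment can be sketched as follows. The nested-dictionary layout of `runs` and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def pool_judgments(runs: Dict[str, Dict[str, List[str]]],
                   depth: int = 10) -> Set[Tuple[str, str]]:
    """Union of the top-`depth` documents per query across all methods.

    runs maps method name -> query id -> ranked list of document ids.
    """
    pool: Set[Tuple[str, str]] = set()
    for method_run in runs.values():
        for query_id, ranking in method_run.items():
            for doc_id in ranking[:depth]:
                pool.add((query_id, doc_id))
    return pool


runs = {
    "LM": {"q1": ["d1", "d2"]},
    "CM": {"q1": ["d2", "d3"]},
}
print(len(pool_judgments(runs)))
```

Because documents retrieved by several methods are judged only once, the pool is usually much smaller than methods × queries × depth.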
Results

          LM     LM+P   LM+T   LM+TP  CM
MAP       0.35   0.43   0.40   0.42   0.45
P@5       0.55   0.63   0.61   0.61   0.66
P@10      0.48   0.57   0.54   0.55   0.58
NDCG@5    0.53   0.59   0.58   0.60   0.60
NDCG@10   0.54   0.62   0.60   0.62   0.63

Insight: CM matches or outperforms every baseline on all metrics (it ties with LM+TP on NDCG@5)
[Figure: bar chart of the values above, grouped by metric (MAP, P@5, P@10, NDCG@5, NDCG@10), one bar per method (LM, LM+P, LM+T, LM+TP, CM)]
Summary and Future Directions
Summary:
• Considered a novel linking task
• Designed simple approaches to address the linking task
• Found that our two-stage cascade approach outperforms simpler baselines
Future directions:
• Queries contain at least one named entity, and many mention a geographic location
• Consider events not covered by the collection
• Consider events that span a large time period
• Extension to queries that
  • do not contain temporal expressions
  • contain multiple temporal expressions
Thank You
Questions and Feedback
Discussion
• Alternate datasets that can be used?
• Similar scenarios where such a linking system could help?
• Applications of such linked data?
• Can more types of data sources be added?
February 27, 1991: President Bush declares war over Iraq and orders cease-fire.
        CM    LM+P  LM+T  LM+TP  LM
P@10    0.7   0.7   0.6   1      0.2
Publication dates win

January 3, 1987: Aretha Franklin becomes the first woman inducted into the Rock and Roll Hall of Fame.
        CM    LM+P  LM+T  LM+TP  LM
P@10    0.7   0.2   0.7   0.1    0.4
Temporal expressions win

March 19, 2002: US war in Afghanistan: Operation Anaconda ends after killing 500 Taliban and Al-Qaeda fighters, with 11 allied troop fatalities.
        CM    LM+P  LM+T  LM+TP  LM
P@10    0.8   0.6   0.5   0.6    0.5
Cascade approach wins
Related Work
TREC Knowledge Base Acceleration Task (KBA)
Goal: Detect citation-worthy articles to a central entity from high-speed time-stamped stream
TREC KBA:
1. Defined on latest time-stamped news stream
2. Entity-centric
3. Final decision rests with human curators
Wiki2News:
1. Defined on news archives
2. Can be generalized to any Wikipedia article
3. Our task does not involve human intervention

TREC Temporal Summarization
Goal: Detect sub-events with low latency in an online, sequential setting on high-speed time-stamped stream
TREC Temporal Summarization:
1. To generate a summary
2. Defined on latest time-stamped news stream
3. Track new developments in an event
News2Wiki:
1. Detect portions of Wikipedia articles
2. Defined on news archives
3. Context of sub-events in Wikipedia
Topic Tracking
• Detect stories that discuss a target topic (characterized by a small number of known stories)
We detect Wiki-excerpts summarizing stories in news articles
Entity Disambiguation or Linking
• Disambiguate ambiguous entity mentions by linking them to knowledge base entities
We link entire news articles to Wiki-excerpts rather than linking at the word or phrase level
INEX Link-the-Wiki and Link-Te-Ara
• Discover incoming and outgoing links in an unlinked data collection
We link encyclopedia to news
Linking Textually Sparse and Rich Content
• Linking news articles to an ongoing stream of closed captions
• Linking news articles to sparsely annotated videos in multimedia archives
A short Wiki-excerpt suffers from ambiguity and anaphoric mentions of entities