Wikipedia as a Time Machine
STEWART WHITING AND JOEMON M. JOSE
UNIVERSITY OF GLASGOW, SCOTLAND, UK
OMAR ALONSO
MICROSOFT BING, MOUNTAIN VIEW, CA, USA
Temporal Web Analytics Workshop 2014
Outline: Introduction, Wiki Characteristics, Data, Time Signals, Final Remarks
Understanding Wikipedia
Huge collaborative encyclopaedic effort
Anyone can create and edit content
Moderator-curated
Reflects time-based news, culture and phenomena
6th most visited website on the internet [Alexa]
Wikipedia English started in 2001
Now contains 4.5M+ articles
~20.4 revisions per article
Vast amounts of open data
Rich structure (article hierarchy, linking, taxonomies, semantics)
Wikipedia as a Time Machine
Wikipedia offers a great deal of time information:
Text content: people write about the past/present/future
Explicit/implicit structure
Meta-data signals: pulse of real-time activity
Side-effects of temporal user interest - without needing a query log!
Insight into:
Story
Temporal sequencing
Entity relationships
Impact
This Talk
How can we discover, understand and track past, present and future temporal topics using Wikipedia?
And, how can this knowledge be exploited in time-aware information retrieval tasks?
IR & Wikipedia
Wikipedia text and structure used extensively in many non-temporal IR tasks
Semantic Similarity/Relatedness Measures [GabrilovichEtAl2007 – Wiki. Explicit Semantic Analysis] [StrubeEtAl2006 – WikiRelate!]
External Collection Query Expansion [XuEtAl2009]
Query Intent Modelling [HuEtAl2009]
Cross-Lingual IR [PotthastEtAl2008]
Entity Tasks – Recognition, Disambiguation etc. [Many!]
Time-aware IR & Wikipedia
Using Wikipedia temporal signals in time-aware IR tasks
Event/Topic Detection & Tracking
Detection/tracking: [CiglanNorvag2010, OsborneEtAl2012, SteinerEtAl2013]
Summarisation: [GeorgescuEtAl2013, WhitingEtAl2012]
Evaluation (ground-truth): [McMinnEtAl2013]
Event Visualisation [WattenbergEtAl2007]
Temporal Semantics - Entity/Fact Extraction [WangEtAl2010, BalogNorvag2012]
Temporal Query Intent Modelling
Ambiguous intents: [ZhouEtAl2013]
Multi-faceted intents: [WhitingEtAl2013]
There are many opportunities…
Wikipedia Characteristics
How quickly does Wikipedia reflect the world?
What topic coverage does it offer?
Is Wikipedia content high-quality?
Can it be trusted?
Freshness/Timeliness
Latency
Mainstream events: very small (<30 mins? <2 hours? Depends who you ask…)
Knowledge Base Acceleration (KBA) filtering task at TREC: improve event coverage/speed
Pope Benedict XVI’s Resignation
EN and FR articles updated at 10:58 and 11:00
Reuters broke news at 10:59, following Vatican announcement at 10:57:47
Whitney Houston’s Death
Reported on Twitter at 00:15 UTC by niece of hotel worker who found her
Spread through Twitter, confirmed by AP via Twitter at 00:57 UTC
Houston’s article updated to ‘has died’ at 01:01 UTC
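The latencies implied by the two timelines above can be worked out directly. The sketch below is only illustrative arithmetic: it assumes all times in each example fall on the same day and in the same time zone, and it treats minute-level timestamps as `:00` seconds. The helper name `latency_seconds` is hypothetical.

```python
from datetime import datetime


def latency_seconds(event_time: str, update_time: str) -> int:
    """Seconds between an event report and an article update.

    Both arguments are same-day clock times, "HH:MM" or "HH:MM:SS".
    """
    def parse(clock: str) -> datetime:
        fmt = "%H:%M:%S" if clock.count(":") == 2 else "%H:%M"
        return datetime.strptime(clock, fmt)

    return int((parse(update_time) - parse(event_time)).total_seconds())


# Times as quoted on the slide (assumption: minute stamps mean :00 seconds).
pope_latency = latency_seconds("10:57:47", "10:58")     # announcement -> first article update
houston_from_tweet = latency_seconds("00:15", "01:01")  # first tweet -> article update
houston_from_ap = latency_seconds("00:57", "01:01")     # AP confirmation -> article update

print(pope_latency, houston_from_tweet // 60, houston_from_ap // 60)
# prints: 13 46 4  (seconds, minutes, minutes)
```

Under these assumptions the first article update trails the Vatican announcement by seconds, while the Houston article trails the first tweet by 46 minutes but the AP confirmation by only 4.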
Topic Coverage
Not all topics covered representatively
Events may only appear as a sentence or sub-section of main article (e.g. a celebrity in a scandal)
Separate article(s) created for major events, e.g. 39th G8 Summit, 2013 North India Floods
See also: Response to…, Criticisms of… etc.
Meta-data signals quantify impact
An Analysis of Topical Coverage of Wikipedia [Halavais and Lackaff, 2008]
Content Quality
Ideally, facts are verified by third parties through citations
Plenty of editorial guidelines
“Wikipedia is not a newspaper”
Bots make lots of changes
Talk pages contain temporal discourse
Sometimes prominent articles are locked: far fewer edits (but pre-verified)
Period  Digest
1  {{death}} (Refers to the article ‘infobox’ with birth and death dates.)
2  Houston died on February 11, 2012. Publicist Kristen Foster said Saturday that the singer had died, but the cause of her death was unknown. She died in [[Ottawa]], [[Canada]].
3  [Similar to previous.]
4  On February 11, 2012, publicist Kristen Foster revealed Houston had died aged 48. A cause of death was not immediately given. She died in her Beverly Hills home.
5  [Similar to previous.]
6  [Similar to previous.]
7  On February 11, 2012, publicist Kristen Foster revealed Houston had died from unspecified causes at the age of 48, with unconfirmed reports suggesting her death occurred in her room at the [[Beverly Hilton Hotel]].
8  Houston released her new album, ”[[I Look to You]]”, on August 2009. The album’s first two singles are "I Look to You" and "Million Dollar Bill". The album entered the [[Billboard 200]] at No. 1...
9  Local police said there were "no obvious signs of criminal intent." Two days prior to her death, witnesses reported seeing Houston behave erratically. They were rumored that she died of drug overdose.
Data Sources
Several openly available Wikipedia data sources
Page APIs: easy random access to revisions etc. (slow!)
Article Creation/Change IRC Channels: all updates, no full-text
Article Creation/Change RSS/Atom Feeds: not all updates, but includes full-text content
XML Article Dumps (monthly): all article/page revisions (EN is 7TB decompressed!), or current article revision only
Need a cluster to derive more useful datasets
Page View Dumps (hourly): measure of article popularity, since end 2007
See stats.grok.se for an easier interface
[Figure: May 2013 daily article changes RSS feed volume (in log scale) for Wikipedia EN, FR, IT, DE and ES]
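As a rough illustration of two of these sources, the sketch below builds a revision-history request for the MediaWiki Action API and parses one line of an hourly page view file. It makes no network call. The function names are hypothetical, the page view line is assumed to hold four space-separated fields (project, title, view count, bytes), and the sample values are made up.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def revisions_url(title: str, limit: int = 50) -> str:
    """Build a URL asking the MediaWiki Action API for recent revisions of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|size|comment",  # metadata only, no full text
        "rvlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def parse_pagecount_line(line: str):
    """Split one page view line into (project, title, views, bytes).

    Assumption: fields are separated by single spaces and titles are URL-encoded,
    so they contain no spaces themselves.
    """
    project, title, views, size = line.rstrip("\n").split(" ")
    return project, title, int(views), int(size)


print(revisions_url("Arab Spring", limit=10))
print(parse_pagecount_line("en Arab_Spring 1234 5678901"))  # illustrative values only
```

Random access through the API is convenient for a handful of articles; for whole-collection statistics the dumps are the more realistic route.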
Current Events Portal
Manually curated list of recent/ongoing mainstream events
Ad-hoc taxonomy, e.g. finance, sports, deaths, politics etc.
Used as a ground-truth for automated TDT evaluation
May 2013: Avg. 15 (±6) articles per day
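One simple way to use the portal as ground truth is to compare the set of articles a detector flags on a given day with the set the portal lists for that day. The sketch below is an assumption about how such an evaluation might look, not the protocol of any cited work; the function name and the example titles are illustrative.

```python
def precision_recall_f1(detected: set, ground_truth: set):
    """Set-based precision, recall and F1 of detected articles against portal articles."""
    true_positives = len(detected & ground_truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical day: the portal lists two events, the detector flags two articles.
portal_articles = {"39th G8 summit", "2013 North India floods"}
detected_articles = {"39th G8 summit", "Arab Spring"}

print(precision_recall_f1(detected_articles, portal_articles))
# prints: (0.5, 0.5, 0.5)
```

Because the portal only covers mainstream events, a low precision here may partly reflect events the portal missed rather than detector errors.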
Temporal Expressions
Using temporal tagger (e.g. HeidelTime)
Extracted dates in article content
YEAR, MONTH-YEAR and DAY-MONTH-YEAR
[Figure: Year mentions in Wikipedia English from 1900 to 2020]
Visualises past and future time coverage
9/11 (2001) is a large spike
1st/2nd World Wars also prominent
‘Recentism’ - biased coverage of recent information
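A year-mention histogram of this kind can be approximated without a full temporal tagger. The sketch below is a deliberately crude stand-in for a tagger such as HeidelTime: it only counts standalone four-digit numbers in the 1900 to 2020 range, so it misses MONTH-YEAR and DAY-MONTH-YEAR expressions and will count non-date numbers that happen to fall in range. The function name is hypothetical.

```python
import re
from collections import Counter

FOUR_DIGITS = re.compile(r"\b(\d{4})\b")


def year_mentions(text: str, first: int = 1900, last: int = 2020) -> Counter:
    """Count standalone four-digit numbers in [first, last] as year mentions."""
    years = (int(token) for token in FOUR_DIGITS.findall(text))
    return Counter(year for year in years if first <= year <= last)


sample = (
    "Houston died on February 11, 2012. Her album was released in August 2009. "
    "The 2012 ceremony paid tribute; the Battle of Hastings (1066) is out of range."
)
print(sorted(year_mentions(sample).items()))
# prints: [(2009, 1), (2012, 2)]
```

Summing these counters over all articles gives the per-year totals for the plot.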
Page Edit Stream
[Figure: ‘Arab Spring’ daily article edit frequency and length (in characters) from 27th January 2011 to 23rd March 2012]
Derived from historic revision dumps, RSS or IRC feeds
Changed text can be mined for summaries, inc. references
Look for links, sections, images in markup
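Daily edit frequency and article length can be derived from a list of revisions once they have been pulled out of a dump or feed. The sketch below assumes revisions arrive in chronological order as `(timestamp, text)` pairs with timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form used by MediaWiki; the function name and the sample revisions are illustrative.

```python
from datetime import date, datetime


def daily_edit_stats(revisions):
    """Map each day to (number of edits, article length in characters after the last edit).

    `revisions` is an iterable of (timestamp, text) pairs in chronological order.
    """
    stats = {}
    for timestamp, text in revisions:
        day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date()
        edits_so_far, _ = stats.get(day, (0, 0))
        stats[day] = (edits_so_far + 1, len(text))
    return stats


# Made-up revisions for illustration.
sample_revisions = [
    ("2011-01-27T10:00:00Z", "abc"),
    ("2011-01-27T12:30:00Z", "abcdef"),
    ("2011-01-28T09:15:00Z", "abcdefgh"),
]
for day, (edits, length) in sorted(daily_edit_stats(sample_revisions).items()):
    print(day, edits, length)
```

Days with no edits are simply absent from the result, so a plotting step would need to fill them with zero edits and carry the previous length forward.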
Temporal Article Structure
Changes in article (sub-)sections
Finer-grained interest over time: people edit what is changing
Evolving section hierarchy: a temporal directed acyclic graph
[Figure: Cumulative ‘Arab Spring’ article section edit frequency since 27th January 2011]
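Section-level edit counts require knowing which (sub-)section each revision touched. A minimal sketch, assuming standard `== Heading ==` wiki markup and unique heading paths within an article, is to split two consecutive revisions into sections and compare the bodies. All names here are hypothetical, and real articles would need extra care with templates and comments.

```python
import re

# Matches wiki headings such as "== Death ==" or "=== Reactions ===".
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(wikitext: str) -> dict:
    """Map a section path (tuple of nested headings) to that section's own text."""
    sections = {}
    path = []                # stack of (level, title) for the current nesting
    current = ("(lead)",)    # text before the first heading
    last_end = 0
    for match in HEADING.finditer(wikitext):
        sections[current] = wikitext[last_end:match.start()].strip()
        level = len(match.group(1))
        while path and path[-1][0] >= level:
            path.pop()       # close sibling or deeper sections
        path.append((level, match.group(2)))
        current = tuple(title for _, title in path)
        last_end = match.end()
    sections[current] = wikitext[last_end:].strip()
    return sections


def changed_sections(old_text: str, new_text: str) -> list:
    """Section paths whose text differs (or which were added/removed) between revisions."""
    old, new = split_sections(old_text), split_sections(new_text)
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


old = "Intro.\n== Overview ==\nText A.\n== Death ==\nUnknown cause.\n=== Reactions ===\nNone yet.\n"
new = old.replace("None yet.", "Tributes followed.")
print(changed_sections(old, new))
# prints: [('Death', 'Reactions')]
```

Counting these paths over an article's revision history gives cumulative per-section edit frequencies, with the nesting of paths providing the hierarchy.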
Temporal Link Graph
[Figure: Cumulative ‘Arab Spring’ article in- and out-link degree since 27th January 2011]
Links created using [[article/redirect|name]] Wiki markup
Need to be careful with namespaces, languages, link naming and redirects
Can also include external ‘citation’ links
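Out-links for one revision can be pulled from the markup with a regular expression, as long as the caveats above are handled. The sketch below is a simplified assumption of how that might be done: the prefix list is a small illustrative subset of namespaces and language codes, title normalisation only covers underscores and first-letter case, and the `redirects` mapping is a hypothetical input (for example built from a redirect-page dataset).

```python
import re

# [[target]] or [[target|label]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

# Illustrative, incomplete set of non-article namespaces and language prefixes.
SKIPPED_PREFIXES = {"file", "image", "category", "template", "wikipedia", "help",
                    "fr", "it", "de", "es"}


def extract_links(wikitext: str, redirects: dict = None) -> list:
    """Return (target article, link label) pairs for article links in the markup."""
    redirects = redirects or {}
    links = []
    for match in WIKILINK.finditer(wikitext):
        # Drop any "#Section" anchor, normalise underscores and a leading colon.
        target = match.group(1).split("#", 1)[0].replace("_", " ").strip().lstrip(":")
        if not target:
            continue  # same-page section link such as [[#History]]
        prefix, colon, _ = target.partition(":")
        if colon and prefix.strip().lower() in SKIPPED_PREFIXES:
            continue  # namespace or inter-language link
        target = target[0].upper() + target[1:]
        label = (match.group(2) or target).strip()
        links.append((redirects.get(target, target), label))
    return links


text = ("She died at the [[Beverly Hilton Hotel]]. See [[Whitney Houston|Houston]], "
        "[[File:Photo.jpg|thumb]], [[fr:Whitney Houston]] and [[I Look to You#Singles|the album]].")
print(extract_links(text))
```

Running this over every revision and recording when each target first appears or disappears gives the edges of a temporal link graph.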
Page View Stream
Page views are very sensitive
Little correlation between page edit and viewing activity: more edits than interest at first
Correlations between articles are interesting [CiglanNorvag2010]
[Figure: ‘Arab Spring’ article daily edit frequency and page views from 27th January 2011 to 23rd March 2012]
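The relationship between edit and view activity (or between the view streams of two articles) can be quantified with a correlation coefficient over aligned daily series. The sketch below uses the Pearson coefficient as one possible choice; the slides do not say which measure was used, and the sample series are made up.

```python
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


# Hypothetical aligned daily counts for one article.
daily_edits = [120, 80, 40, 30, 25, 20, 15]
daily_views = [500, 2000, 9000, 7000, 6500, 3000, 2500]
print(round(pearson(daily_edits, daily_views), 3))
```

Shifting one series by a few days before correlating would be a natural extension for checking whether edits lead or lag views.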
Final Remarks
I have various distilled datasets with me (and can arrange download + C# MapReduce code):
ArticleEditTimestamps
SampleEventSummarisation
DisambiguationPages
TemporalLinkGraphWithSections
RedirectPages
TemporalSectionChanges
TimeExpressions
120 GB total, or a selection
Wikipedia temporal datasets cover a wide range of events, culture and phenomena
Temporal meta-data and content signals openly available
Informative power: hugely valuable for time-aware IR research
Probably won’t beat Twitter for speed, but Wiki has structure and quality control
Many open research questions and opportunities for time-aware IR!
Some Research Questions
1. How fast does Wikipedia respond to events of different types in different countries?
2. How can Wikipedia data supplement query log, Twitter and news feed streams to improve time-aware IR?
3. What do temporal correlations between linked article page views mean, and is this reflected in the text content?
4. Can event similarity be measured on temporal and topical dimensions?
5. Can this temporal knowledge be used to predict interest in topics that become associated in similar ways? (E.g. actors selected by famous shows, or directors etc.)