icwe2013 - discovering links between political debates and media
Post on 11-May-2015
DISCOVERING LINKS BETWEEN POLITICAL DEBATES AND MEDIA
Damir Juric, Delft University of Technology
Laura Hollink, VU University Amsterdam
Geert-Jan Houben, Delft University of Technology
ICWE2013
The PoliMedia project: linking politics to media
PoliMedia research questions
• How is a person, subject or process covered & visualised by the media?
• How do debates and arguments develop over a longer period of time?
• Analysing the changing ideas, arguments and presentation in different media
Issues with current approach
• Go to different archives, look up original data!
Goal: explicit links to different media types in one system
Data Sets – Debates
Handelingen der Staten-Generaal, or Dutch Hansard, from 1945-1995
Some provenance:
1. Transcripts are made of the complete debates of the Dutch parliament.
2. Published online by the government on http://www.statengeneraaldigitaal.nl/ (1818 - 1995) and http://officielebekendmakingen.nl/ (from 1995)
3. The PoliticalMashup project has translated the government PDF and TXT files into XML, including URIs as identifiers; see http://politicalmashup.nl/
4. We build on that.
Structure of the debate data
Debate Metadata
Topic 1
Topic 2
Speaker 1 / Content
Speaker 2 / Content
Speaker 3 / Content
Speaker 1 / Content
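Aan de orde is de behandeling van: - de brief van de minister van Economische Zaken inzake Borssele (16226, nr. 26). De beraadslaging wordt geopend. (English: "On the agenda is the consideration of: the letter from the Minister of Economic Affairs concerning Borssele (16226, no. 26). The deliberation is opened.")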
NEs={Economische Zaken, Borssele}
NEs={Borssele, Partij van de Arbeid, D66}
Metadata
Speaker 1
Speaker 2
Speaker 3
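Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met Denemarken, Engeland, Ierland en Noorwegen wordt een van de doelstellingen van ons buitenlands beleid verwezenlijkt. (English: "Mr. Chairman! With the treaties enlarging the EEC to include Denmark, England, Ireland and Norway, one of the objectives of our foreign policy is being realised.")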
• who, when, what
• identifiers for subparts of the debate
• chronological order of speakers
Data sets – Media
• Newspaper articles
• at the National Library of the Netherlands
• Many newspapers 1950-1995
• Text + images of newspaper layout
All data and links expressed as RDF
• We have created a semantic model to capture the datasets and the links between them.
• Reusing other vocabularies
• Simple Event Model (SEM)
• Dublin Core
• FOAF
• ISOCAT
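To make the idea of expressing a speech, its speaker and a link to an article as RDF more concrete, here is a minimal sketch that writes N-Triples using only the Python standard library. The SEM, FOAF and Dublin Core terms used (sem:Event, sem:hasActor, foaf:Person, foaf:name, dcterms:date) exist in those vocabularies, but how the PoliMedia model actually combines them is not shown on the slides. The `http://example.org/polimedia/` namespace, the `coveredIn` property and all identifiers are hypothetical.

```python
# Vocabulary namespaces (real) and a placeholder namespace (hypothetical).
SEM = "http://semanticweb.cs.vu.nl/2009/11/sem/"
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/polimedia/"  # assumption: not the project's real namespace


def uri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value):
    """Render a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def describe_link(speech_id, speaker_name, date, article_id):
    """Return N-Triples lines for one speech, its speaker and one linked article."""
    speech = uri(EX + "speech/" + speech_id)
    speaker = uri(EX + "person/" + speaker_name.replace(" ", "_"))
    article = uri(EX + "article/" + article_id)
    triples = [
        (speech, uri(RDF_TYPE), uri(SEM + "Event")),
        (speech, uri(SEM + "hasActor"), speaker),
        (speech, uri(DCTERMS + "date"), literal(date)),
        (speaker, uri(RDF_TYPE), uri(FOAF + "Person")),
        (speaker, uri(FOAF + "name"), literal(speaker_name)),
        # Hypothetical link property between a speech and a newspaper article.
        (speech, uri(EX + "coveredIn"), article),
    ]
    return [" ".join(t) + " ." for t in triples]


if __name__ == "__main__":
    for line in describe_link("s1", "Example Speaker", "1975-01-01", "a1"):
        print(line)
```

Treating a speech as an event with an actor and a date mirrors the "who, when, what" structure of the debate data; a real model would likely also type the link and record its strength.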
PoliMedia linking method
• Debate speeches and newspaper articles are different types of documents, so default document similarity metrics are insufficient
• Speeches contain many named entities and digressions.
• Newspapers are formal and concise; words are used sparingly.
• The challenge: how to create a representation of the speeches that contains enough information to be used as a query to retrieve the right media articles from the archive?
PoliMedia linking method
• Our PoliMedia linking method consists of four steps:
1. topics: enriching the existing debate metadata with topics
2. preselection of articles: candidate articles are preselected by when they were published and who spoke in the debate (timeframe and speakers)
3. automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from speech text) by comparing vectors of topics and named entities
4. link creation: links are created between a speech and an article if the similarity score is above a threshold t
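Steps 3 and 4 above can be sketched as follows. The slides do not say which similarity measure is used, so cosine similarity over bag-of-terms vectors is an assumption here, as are the function names, the toy terms and the threshold value.

```python
import math
from collections import Counter


def to_vector(terms):
    """Bag-of-terms vector: lower-cased term -> count."""
    return Counter(t.lower() for t in terms)


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_articles(query_terms, articles, t):
    """Rank candidate articles against the query and keep those scoring above t.

    query_terms: topic words and named entities taken from the speech.
    articles: mapping of article id -> list of terms from the article.
    Returns (article_id, score) pairs, best first.
    """
    query = to_vector(query_terms)
    scored = [(aid, cosine(query, to_vector(terms))) for aid, terms in articles.items()]
    kept = [(aid, score) for aid, score in scored if score > t]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data: the query mixes named entities and a (hypothetical) topic word.
    query = ["Borssele", "Economische Zaken", "kernenergie"]
    candidates = {
        "a1": ["Borssele", "kernenergie", "kamer"],
        "a2": ["voetbal", "Ajax"],
    }
    print(link_articles(query, candidates, t=0.3))
```

The threshold t trades precision against recall: a higher value yields fewer but more reliable links.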
Topics
The MALLET topic model package
• Unsupervised analysis of text
• “a Topic consists of a cluster of words that frequently occur together”
• [see http://mallet.cs.umass.edu/topics.php]
• Input: Text, Number of iterations, Number of topics/clusters
• Output: Words that cluster around one topic.
• Example:
• Text: a speech in a debate from 1975
• number of iterations: 2000
• number of topics: 1
Kombrink, rente, inkomstenbelasting, bronheffing, vereenvoudiging, tarief, contourennota, Nederland, word, tussen, wetgeving, sociale, moeten, fraude, fraudebestrijding, vraag, misbruik, ten, gebruik, kamer
misbruik fraudebestrijding, ismo-rapport, Contourennota, Kombrink, EEG, Netherlandse, OESO-verband, Nederland, Contou, Engwirda, Couprie, Midden-Oosten, Euro-kapitaalmarkt, Tariefnota, Staatssecretaris, Regering, Financiën, Zwitserland, Brussel, Grave
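The example settings above (2000 iterations, one topic) map onto MALLET's command-line tool. The sketch below only builds the argument list for MALLET's `train-topics` command; it does not run it. The file names and the `bin/mallet` path are hypothetical, and the input file is assumed to have been produced beforehand with MALLET's `import-file` or `import-dir` command. How the PoliMedia pipeline actually invokes MALLET is not shown on the slides.

```python
def mallet_train_topics_command(mallet_input, keys_output,
                                num_topics=1, num_iterations=2000,
                                mallet_bin="bin/mallet"):
    """Build the argument list for a MALLET topic-model training run.

    mallet_input: path to a .mallet file created by MALLET's import step.
    keys_output: path where MALLET should write the top words per topic.
    """
    return [
        mallet_bin, "train-topics",
        "--input", mallet_input,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", keys_output,
    ]


if __name__ == "__main__":
    # Hypothetical file names for one speech from 1975.
    cmd = mallet_train_topics_command("speech_1975.mallet", "speech_1975_keys.txt")
    print(" ".join(cmd))
    # To execute it, pass the list to subprocess.run(cmd, check=True).
```

Returning a list rather than a shell string keeps the call safe to pass to `subprocess.run` without shell quoting.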
TopicSet Speech
NE Speech
TopicSet Topic
NE Topic
Automatic query creation
[Figure: the debate structure (Debate Metadata; Topic 1, Topic 2; Speaker 1 to 3 / Content) together with an "Actor" element and a "Query Debate" element, connected by arrows labelled "came from".]
Polimedia pipeline
[Figure: pipeline diagram. Components shown: PoliticalMashup (XML); RDF semantic model and RDF files; stopword removal; topic modeling; NERs Speech, TopicSet Speech, NERs Topic and TopicSet Topic (contextual vectors); automatic query creation (Query NE, Query content, expanded query creation); SRU query (actor, date range) to the KB (preselect data); article metadata; similarity calculation; ranking; filtering.]
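The preselection step in the pipeline (an SRU query by actor and date range) could look roughly like the sketch below. `operation`, `version`, `query` and `maximumRecords` are standard SRU request parameters. The endpoint URL, the CQL index names and date syntax, the 7-day window and the example date are assumptions; the archive's own documentation defines the real ones.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def preselection_url(base_url, speaker, debate_date, window_days=7, max_records=50):
    """Build an SRU searchRetrieve URL for articles that mention the speaker
    and were published within window_days after the debate.

    The CQL below is illustrative: index names and date syntax vary per archive.
    """
    start = debate_date
    end = debate_date + timedelta(days=window_days)
    cql = '"{0}" and date within "{1} {2}"'.format(
        speaker, start.isoformat(), end.isoformat())
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql,
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder endpoint and date; Kombrink is a speaker name from the example speech.
    print(preselection_url("https://example.org/sru", "Kombrink", date(1975, 3, 1)))
```

Restricting candidates by speaker and publication window before computing similarity keeps the ranking step small and removes articles that cannot be about the debate.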
Evaluation
• We tried three different approaches:
• Experiment 1: NEs in speech
• Experiment 2: NEs + topics in speech
• Experiment 3: NEs + topics in speech and debate
• Two independent evaluators: reading the speeches and articles linked to them and manually assessing their relatedness
• Randomly selected 20 debates from our dataset of 10,924 debates (different subjects: from fraud in the social system to the European elections)
• Each experiment: random 50 speeches
• In total: 150 speech-article pairs, namely 3 sets of 50 each
Evaluation
Results:
• best approach: named entities (speech + debate descriptions) and topics (speech + debate)
(2: relevant, 1: partially relevant, 0: unrelated)
Evaluation
• Relative recall:
• different evaluation: an annotator reads a speech, manually creates a suitable query for it, and assesses the relevance of the articles returned for that query
Precision: 75%, recall: 62%
Experiment 3 on 5 speeches/115 articles gave a recall of 3804 links
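For reference, precision and relative recall can be computed as in this minimal sketch. It assumes relative recall means the share of articles judged relevant for the annotator's manual queries that the automatic method also returned; the slides do not spell out the exact definition. The article ids are toy data, not the study's results.

```python
def precision(retrieved, relevant):
    """Fraction of automatically linked articles that were judged relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def relative_recall(retrieved, pooled_relevant):
    """Fraction of the relevant articles found via manual queries
    that the automatic method also retrieved."""
    retrieved, pooled_relevant = set(retrieved), set(pooled_relevant)
    if not pooled_relevant:
        return 0.0
    return len(retrieved & pooled_relevant) / len(pooled_relevant)


if __name__ == "__main__":
    automatic_links = ["a1", "a2", "a3", "a4"]   # links created by the method
    judged_relevant = ["a1", "a2", "a3"]         # of those, judged relevant
    manual_relevant = ["a1", "a2", "a5", "a6"]   # relevant hits of manual queries
    print(precision(automatic_links, judged_relevant))
    print(relative_recall(automatic_links, manual_relevant))
```

Relative recall is used because the full set of relevant articles in the archive is unknown; it only measures recall against what the manual queries surfaced.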
Conclusion
• Creation of links between two very different datasets: a dataset of political debates and a media archive
• Linking method takes advantage of:
• Debate content and metadata
• Named Entities and Topics from the debates
• semantic partOf structure of the debates
• In experiments we have shown the added value of topics and debate structure
• Produced links
• different in nature than those produced by e.g. ontology alignment tools
• Now: coarsely typed links
• Future: nature and strength of the link