Searching the news

Using a rich ontology with time-bound roles to search through annotated newspaper archives

Wouter van Atteveldt1, Nel Ruigrok2, Stefan Schlobach1, and Frank van Harmelen1

1 Department of Artificial Intelligence, Free University Amsterdam

De Boelelaan 1071, 1071 HV Amsterdam, {wva,schlobac,Frank.van.Harmelen}@few.vu.nl

2 The Netherlands News Monitor, University of Amsterdam

Kloveniersburgwal 48, 1012 CX [email protected]

Abstract. A frequent motivation for annotating documents using ontologies is to allow more efficient search. For collections of newspaper articles, it is often difficult to find specific articles based on keywords or topics alone. This paper describes a system that uses a formalisation of the content of newspaper articles to answer complex queries. The data for this system is created using Relational Content Analysis, a method used in Communication Sciences in which documents are annotated using a rich annotation scheme based on an ontology that includes political roles with temporal validity. Using custom inferencing over the temporal relations and query translation, our system can be used to search for and browse through newspaper articles and to perform systematic analyses by evaluating queries against all articles in the corpus. This makes the system useful both for the (Social) Scientist and for interested laypersons.

1 Introduction

A number of services exist that offer keyword-based search of news content, such as Google News

(http://news.google.com) and LexisNexis (http://www.lexisnexis.com). Such services have severe

limitations, however. The first difficulty is the semantic gap between keywords and meaning that is

always present in keyword-based search. Another limitation is that it is impossible to look for a

relation, such as positive or negative, between two concepts without specifying all of its possible

lexical representations. For example, a keyword-based query ‘Blair support EU’ will not return

documents containing ‘Prime Minister praises new Commission.’ A third limitation is that search is

generally bounded by articles or sentences as the only possible unit of search, meaning that one

cannot search for ‘Two articles within one week in which a politician’s stance on a topic has changed.’

Queries such as the ones above require metadata not just about the topic and publishing details

of an article but also about the content of that article. Generating such metadata automatically is a

formidable challenge, but there are large corpora of articles manually annotated by Communication

and Political Scientists that can be used. This paper is based on an analysis of the news coverage

of the 2006 Dutch parliamentary election campaign, which was annotated manually using a rich

annotation schema, summarised as answering: “Who says what about whom/what according to

whom?” The concepts in this annotation are drawn from a detailed ontology of (political) actors

and issues, including time-dependent political function and party membership information.

The system presented here utilizes these annotations for semantically informed search through

the newspaper corpus, inspecting the results quantitatively and visually, and retrieving the original

articles the results are derived from. Based on a formalised version of the annotation of the

newspaper content, the system allows for very sophisticated queries. Moreover, since these annotations

are based on a rich background ontology, it is possible to ask general queries as well as very detailed

ones, bridging the gap between abstract concepts and concrete representation. Finally, we perform

automatic reasoning over the validity of the temporal political functions, making it possible to use

a political function such as a minister in the query, which yields answers from statements about

the various ministers during their respective time in office.

The primary users of this system are Social Scientists investigating communication processes

and effects. In the preparatory phase of such an investigation, a researcher can use the system to

form an understanding of the corpus and to formalise the concepts he or she is interested in. In the

analysis phase, the researcher can use the system to query the corpus using the formalised concepts

and export the results for statistical analysis. In the post-analysis phase, the researcher can use the

system to check the results, retrieve interesting articles for qualitative sense-making, and obtain

quotes and examples for writing about the results.

It should be noted, however, that this system can also be very interesting for users outside

academia. Often, the annotated material is of high relevance to society, such as election campaigns

and high-profile issues such as the war in Iraq or Middle East policy. This material can be very

interesting to politicians, civil society groups, NGOs, and citizens. The system described in this

paper can help them search through the corpus for general trends or specific patterns.

The contribution of this paper is twofold. Firstly, using the annotations and the ontology

supplied by the Social Scientists, we are able to create and test a search system which showcases

the benefits of formalised metadata, such as querying at various levels of abstraction, searching

for potentially highly complex patterns, and reasoning about temporal roles. Secondly, the system

presented here makes the annotations created by the Social Scientists more accessible, making it

easier for the scientist to perform both quantitative and qualitative analysis, and providing other

users with a richer way of searching for news.

1.1 Knowledge Representation and Content Analysis

In this paper we present a system for searching specific patterns in an annotated newspaper archive.

In our vision, there is a potential for synergy between Knowledge Representation, especially

Semantic Web techniques, and Content Analysis. Studying the complex relationship between politics,

the media, and the public requires very large data sets. In order to find general statistical patterns,

these data sets often need to span multiple countries and events. Since this data is expensive to

obtain, and often needs detailed knowledge of the subject language and society, it is important that

researchers are able to combine and share data sets. Additionally, since there are many competing

theories and methodologies for analysing news patterns and effects, these data sets should lend

themselves to multiple analyses rather than being specific to one study.

We believe that using Semantic Web techniques can help alleviate these problems. We propose

annotating as close to the text as possible, and using a formal ontology to aggregate the detailed

objects found in the text to the theoretical concepts needed for the analysis. This minimises the

amount of interpretation done by the annotators, and thus the potential for unreliable coding.

Additionally, this makes it possible to combine the analyses of the different countries or time

periods in a single scheme, where it is possible to have a concept such as ‘opposition politician’

map to different objects depending on time and place. Moreover, since it is possible to combine

different aggregation schemes in one ontology, this allows for the same data to be used for different

analyses. Finally, it is possible to express interesting variables, such as political frames like ‘strategic

framing’ or ‘internal conflict’, as formal patterns, for example as a SeRQL query or OWL definition.

This makes the process of aggregating and analysing data more transparent, and makes it easier

for researchers to duplicate and expand upon studies from other groups.

This is not a one-way street. Content Analysts have been annotating texts for the last decades,

and many methods have been developed to perform systematic annotation and evaluate annotation

quality (Holsti, 1969; Krippendorff, 2004). As manual annotation is a necessary part of the Semantic

Web vision, either for creating the metadata directly or for bootstrapping and evaluating Machine

Learning tools, these techniques can play an important role. Moreover, the rich annotated corpora

such as the one used in this paper can be very useful data for building and testing systems to show

the usefulness of the Semantic Web techniques.


Previously, we have shown how to use hybrid logic to define concepts within annotated media

material (Van Atteveldt and Schlobach, 2005). This can be directly generalised to using OWL

to define such concepts, and also shows the limits of this approach because of the lack of variable

binding in OWL. In (Van Atteveldt et al., 2007) we provide an in-depth discussion of the possibilities

and challenges in using RDF to formalise media data. Finally, in (Van Atteveldt et al., 2006), we

propose a vocabulary and formalisation standard for converting content analysis data to RDF.

Structure of this paper. The following section will describe the corpus used in this paper and the

Relational Content Analysis that was used for the annotation. Section three will describe the way

the data were coded formally and give an overview of the ontology used for this encoding. The

fourth section will provide more information about the implemented system, and the fifth section

will discuss its usage and usefulness.

2 Domain

In this section we describe the annotated data corpus on which this study is based, a collection of

annotated newspapers and TV news items about the 2006 Dutch election campaign. After a brief

description of the campaign we will describe the method used for annotating the data and some

aspects of the corpus this resulted in.

2.1 Data Collection

The use case presented in this study is based on a content analysis of political news from August

14th (first party manifesto) until November 22nd (Election Day) in six daily newspapers and two

television news programs. Each article in the newspapers as well as each item in the television news

programs referring to a party, a politician or an issue was included in the dataset, resulting in

5,707 newspaper articles and 134 news items from the television news programs.

For the content analysis of the articles, annotators used the NET method (Network analysis of

Evaluative Texts; Van Cuilenburg et al., 1986). This method is a Relational Content Analysis method

and an elaboration and generalisation of the evaluative assertion analysis of Osgood et al. (1956). It is

based on the idea that the explicit or manifest content of a text can be depicted as a network

consisting of relations between actors and issues. Specifically, each sentence of a news item is

coded as a 〈subject, predicate, object〉 triple. The subject and object are predefined within an

extensive ontology (see next section). The predicate describes the connection between the subject

and the object and consists of a type and a quality. The type indicates the kind of relationship

between subject and object, such as ‘causal’, ‘action’, and ‘affinitive.’

The Dutch elections in the Media

Dutch parliamentary elections took place on 22 November 2006, with the Dutch voters causing a landslide in the political arena of The Hague. Instead of the expected titanic struggle between the leaders of the Christian Democrats (CDA) and the Social Democrats (PvdA), voters deserted established parties, especially the PvdA and the conservative Liberal Party (VVD), in favour of more outspoken parties on both the left and right wing of the political spectrum. As shown by Kleinnijenhuis et al. (2007), these short-term dynamics of voter preferences are highly influenced by preceding news coverage.

After the summer, the news coverage in the Netherlands was filled with the upcoming elections and the campaign preceding the vote. The much-anticipated duel between the incumbent Prime Minister Jan Peter Balkenende and the leader of the PvdA, Wouter Bos, failed to occur. On the contrary, in September the incumbent government presented the 2007 budget full of good news (“After the bitter comes the sweet”, De Telegraaf, 20 September 2006) and the CDA was more often presented as the successful party responsible for economic growth, while the PvdA went into a downward spiral with Bos presented as a ‘flip-flop.’ The other incumbent party, the VVD, could not profit from this success as it was weighed down by internal fights. After the internal party election between two candidates representing the liberal and conservative wings of the party, the party could not manage to form a unified front. In the shadow of this battle and the problems of the PvdA, the smaller parties seized the opportunity to present themselves as more outspoken alternatives: the right-wing Party for Freedom (PVV) presented itself as an alternative to the right-wing liberals, while the Socialist Party (SP) portrayed itself as a stronger and more outspoken alternative to the PvdA.

Fig. 1: A short overview of the Dutch 2006 elections

Quality is a number that

indicates the strength and direction of the relationship between subject and object and ranges from -1

to 1. Moreover, some sentences in a newspaper are quoted or paraphrased. In such cases a source

argument can be added to the coded sentence. As an example, consider the newspaper excerpt

visible in the screenshot in Figure 2. The headline is coded as a negative relation between

the political blocks Left and Right. The first sentence of the lead specifies this as a fight between

the incumbent prime minister Balkenende (CDA) and the challenger Wouter Bos (PvdA). In the

next sentence, Balkenende states that Bos is scaring people, which is coded as Bos acting against

the Dutch citizens with Balkenende as source. The final sentence expresses two relations: according

to Bos, investing more money would be good for the Health Care, and Bos wants to invest money

in Health Care, here coded as an affinity (issue position) relation between Bos and Health Care

Investments.
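As an illustration only, the coding of the excerpt described above can be written down as plain records. The sketch below is in Python; the class name NetStatement, its field names, and the relation types and quality values chosen for each sentence are assumptions made for this example and are not prescribed by the NET method or by iNet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetStatement:
    """One coded sentence: a <subject, predicate, object> triple plus a source."""
    subject: str
    obj: str
    relation_type: str            # kind of relation, e.g. 'action' or 'affinitive'
    quality: float                # strength and direction, from -1 to 1
    source: Optional[str] = None  # set when the sentence is quoted or paraphrased

    def __post_init__(self):
        # The NET method bounds quality to the interval [-1, 1].
        if not -1.0 <= self.quality <= 1.0:
            raise ValueError("quality must lie between -1 and 1")


# Hypothetical coding of the excerpt; types and numeric values are illustrative.
statements = [
    NetStatement("Left", "Right", "action", -1.0),
    NetStatement("Balkenende", "Bos", "action", -1.0),
    NetStatement("Bos", "Dutch citizens", "action", -1.0, source="Balkenende"),
    NetStatement("Health Care Investments", "Health Care", "causal", 1.0, source="Bos"),
    NetStatement("Bos", "Health Care Investments", "affinitive", 1.0),
]

# Statements that are attributed to a quoted or paraphrased source.
quoted = [s for s in statements if s.source is not None]
```

Keeping the source as an optional fourth argument mirrors the "according to whom" part of the annotation scheme, so quoted claims can be separated from claims made by the newspaper itself.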

The annotators used the annotation program iNet (http://www.content-analysis.org/inet), which

was created specifically for this research. As shown in Figure 2, it allows annotators to efficiently

select subjects and predicates from the ontology using auto-complete and stores the results

immediately in an RDF repository as described in the next section.

Fig. 2: Example annotation in iNet

2.2 Corpus and Ontology

The corpus consists of newspaper articles and transcriptions of television news items. For the

newspaper articles, all articles published in five Dutch national daily newspapers that contained

one of the major issues or political actors were selected. For the television items, all news items in

two evening news programs that were deemed relevant by the annotators were included. Using the

method described above, the head and lead of each news article and the full television news items

were coded. This resulted in 26,186 statements from newspaper articles and 3,967 statements from

the television news programs.

As described, the subjects and objects in these statements are drawn from a fixed vocabulary.

Originally, this consisted of 1490 concepts (310 politicians, 224 other actors, and 956 issues) in

an informal mixed is-a, has-a and part-of hierarchy. Using this hierarchy presented a number of

difficulties. Because each concept could only have one parent, it was not possible to represent

the fact that a politician has a party membership as well as a political function. Since in the

hierarchy politicians were placed under their function, this made it difficult to do an analysis of


party relations regardless of function. A related problem is that different analyses require different

ways of organising issues, which is very difficult with a simple hierarchy. Finally, the fact that there

is no distinction between has-a, is-a, and part-of makes it difficult to do any inference other than

simple aggregation.

To alleviate these problems, this hierarchy was formalised into an ontology, separating classes

and instances and using type, subclass, and part-of relations where appropriate. The remainder of

this section will describe the two most interesting aspects of this ontology: the issue hierarchy and

the political actors.

The issue hierarchy. The issue hierarchy contains a large number of issues that occur in the news,

ranging from general issues such as ‘Security’ to highly concrete issues such as ‘Biometric Passports.’

In principle, this hierarchy works as an is-a hierarchy, that is, a ‘biometric passport’ is-a security

issue. However, since the more abstract issues also have to be available for annotation, and it is

undesirable to use classes as annotation objects, we decided to formalise the whole issue hierarchy

as instances in a thesaurus-like structure. As argued by van Assem et al. (2006), this is often a

good choice to formalise and combine different hierarchies, providing the needed ‘broader-term’

reasoning without committing to the semantics of subclasses and instances. Thus, we encoded the

hierarchy using a transitive and reflexive k06:subIssueOf predicate which is a subproperty of the

skos:broader term defined for SKOS (Miles and Brickley, 2005).

During the formalisation, we also merged two different categorisations of the issues. For the

election studies for which the vocabulary was initially developed, the categorisation was by political alignment:

leftist issues such as Social Security, rightist issues such as fiscal conservatism, pro-environment etc.

Afterwards, the same vocabulary was adopted for the research of the Netherlands News Monitor,

which categorised the issues by governmental department: defense, social affairs, health, etc. As

a side benefit, by merging these vocabularies in one subissue ‘multitree’, data can be shared more

easily between these groups, and it made the categorisation of difficult issues clearer. An example

of the latter is a peacekeeping mission, which belongs to the Department of Defense but is seen as

a neo-leftist (ethically progressive) issue.
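The ‘broader-term’ reasoning over such a multitree can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the system: the dictionary SUB_ISSUE_OF and the function names are hypothetical, and the handful of issues is a toy fragment based on the examples mentioned in the text.

```python
from typing import Dict, Set

# Hypothetical fragment of the issue multitree: each issue maps to its direct
# broader issues. An issue may have more than one parent, one per categorisation.
SUB_ISSUE_OF: Dict[str, Set[str]] = {
    "Biometric Passports": {"Security"},
    "Peacekeeping Mission": {"Defense", "Neo-leftist"},
    "Security": set(),
    "Defense": set(),
    "Neo-leftist": set(),
}


def broader_issues(issue: str) -> Set[str]:
    """Return the issue itself plus every issue reachable via sub-issue links."""
    seen = {issue}  # reflexive: every issue is a sub-issue of itself
    stack = [issue]
    while stack:
        current = stack.pop()
        for parent in SUB_ISSUE_OF.get(current, set()):
            if parent not in seen:  # transitive: follow parents of parents
                seen.add(parent)
                stack.append(parent)
    return seen


def is_sub_issue_of(issue: str, ancestor: str) -> bool:
    """True if `ancestor` is the issue itself or one of its broader issues."""
    return ancestor in broader_issues(issue)
```

Because the peacekeeping mission has two parents, a query by department and a query by political direction both find it, without the coder having to choose between the two readings.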

Keeping these issues as close to the text as possible and using the formal hierarchy for

categorising is vital for reliable coding: if the coders have to decide whether peacekeeping is pro-defense

or neo-leftist, results will diverge and afterwards it is impossible to trace why coders made certain

decisions. As far as possible, coders should simply pick the issue closest to the occurrence in the

text without interpreting the position of that issue. In total, we have 956 issues categorised into a

double hierarchy with five layers, 15 root directions, and 20 root departmental topics.


Political Actors. The other area of interest is the political actors. For the political actors, the

naïve approach is to classify them as Representatives, Administrators, Senators, etc. Unfortunately,

politicians generally fulfill such functions only for a short period, and especially for longitudinal

studies it cannot be assumed that this hierarchy is static. Therefore, all politicians are simply

classified as ‘Person’, with the real semantics in the roles rather than the class membership: for each

politician we include their party membership and political functions using specialised memberOf

properties. Also, for the parliamentary fractions it is recorded whether they are members of the

coalition or opposition block. These memberOf properties are all subproperties of the transitive

and reflexive k06:partOf relation to make it easy to query patterns such as a person being a member

of a part of the coalition block. In total, we have 310 political actors distributed over 16 parties

and 48 functions.

3 Representing Newspaper Content

As described above, the NET method provides a rich annotation by representing texts as networks

of statements between nodes, where the nodes are drawn from an ontology of actors and issues.

In order to effectively search through an archive of material annotated with this method, we have

formalised the NET data representation and the ontology in RDF(S). This section will briefly

describe RDF(S) and discuss some of the issues encountered in this formalisation.

Note that although this discussion focuses on the NET method, it is for a large part applicable

to other Relational Content Analysis methods as well. Van Atteveldt et al. (2006) survey the

representational requirements of a number of (Relational) Content Analysis methods, and the

formalisation presented here can be easily adopted for the other methods as well. That study also

introduces the amcat namespace (standing for Amsterdam Content Analysis Toolkit;

http://www.content-analysis.org/vocabulary/amcat#) that will be used for general content analysis

vocabulary, in addition to the net namespace (http://www.content-analysis.org/vocabulary/net#), which

contains NET-specific vocabulary, and the k06 namespace

(http://www.content-analysis.org/vocabulary/ontologies/k06#), which contains the data from the 2006

election campaign.

3.1 RDF: The Resource Description Framework

A central element of the Semantic Web is the Resource Description Framework (RDF). This is

a standard representation specified by the World Wide Web Consortium (W3C) for describing

documents and other resources on the Internet, creating an interconnected Semantic Web (Antoniou

and Van Harmelen, 2004). Using a graph as its data model and using XML syntax to describe

and exchange information, RDF allows data to be mixed, exported, and shared across different

applications. Using distinguished vocabulary within RDF, RDF Schema (RDFS) allows for the

definition of ontological relations such as class membership and subclass and subproperty hierarchies

(Brickley and Guha, 2004). Since the ‘data’ and ‘metadata’ are combined in one RDF graph, it is

easy to construct queries over the combined network of media data and background knowledge.

3.2 Representing media data in RDF

[Figure 3: RDF graph in three panes labelled Metadata, Media Data, and Ontology. Node and edge labels include Annotation, Article1, k06:Bos, ont:HealthCare, k06:PvdA, k06:role-Bos-PvdA, net:subject, net:object, net:quality, dc:subject, dc:publisher, dc:date, dc:creator, amcat:roleSubject, amcat:roleFrom, and amcat:subIssueOf.]

Fig. 3: The data model used

Figure 3 exemplifies the overall formalisation of the NET method. In the center is the node

representing the annotation. To the left, this node is connected via the Dublin Core subject relation

to the sentence or article that it stems from. The metadata of the article, such as headline, date,

and publisher, is specified on the article using standard Dublin Core Vocabulary. Additionally, the

coder of the annotation is stored in the Dublin Core creator property.

In the middle pane, the quality, predicate, and arrow type of the annotation are specified

using properties. Most importantly, the subject and object of the annotation are given as links to

instances in the ontology, in this case PvdA-leader Bos and the issue Health Care. This issue is

connected via the subissue hierarchy to the topic ‘Health’ and the political direction ‘pro-welfare’.

On the bottom right hand side, it can be seen that Bos plays the role of PvdA-member since 1981.

This role is of type politician, which should be interpreted as meaning a person playing this role is

a politician. The role instance is also connected with the PvdA node, which is of type party. Both

party and politician are subclasses of actor.
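To make the data model concrete, the following Python sketch writes a small part of it as a set of triples and runs one query that crosses from media data into the ontology. It is a sketch under stated assumptions: plain strings stand in for URIs, the identifiers ex:annotation1 and ex:article1 are hypothetical, the direction of the dc:subject edge is a guess from the description above, and the helper functions are not part of any RDF library.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

TRIPLES: Set[Triple] = {
    # Metadata (ex: identifiers are hypothetical; dc:subject direction is assumed)
    ("ex:annotation1", "dc:subject", "ex:article1"),
    ("ex:annotation1", "dc:creator", "Hester"),
    ("ex:article1", "dc:publisher", "Telegraph"),
    ("ex:article1", "dc:date", "2006-11-22"),
    # Media data: the annotation is a node with edges to ontology instances
    ("ex:annotation1", "net:subject", "k06:Bos"),
    ("ex:annotation1", "net:object", "ont:HealthCare"),
    ("ex:annotation1", "net:quality", "+1"),
    # Ontology: issue hierarchy and a temporal role instance
    ("ont:HealthCare", "amcat:subIssueOf", "k06:Health"),
    ("k06:role-Bos-PvdA", "amcat:roleSubject", "k06:Bos"),
    ("k06:role-Bos-PvdA", "k06:partyMemberOf", "k06:PvdA"),
    ("k06:role-Bos-PvdA", "amcat:roleFrom", "1981-01-01"),
}


def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield all triples matching the pattern; None acts as a wildcard."""
    for t in TRIPLES:
        if (s is None or t[0] == s) and (p is None or t[1] == p) \
                and (o is None or t[2] == o):
            yield t


def annotations_about_topic(topic: str) -> Set[str]:
    """Annotations whose object is a direct sub-issue of the given topic."""
    issues = {s for s, _, _ in match(p="amcat:subIssueOf", o=topic)}
    return {s for s, _, o in match(p="net:object") if o in issues}
```

Since data and background knowledge live in one graph, the query needs no join between separate stores: it simply follows edges from the topic to the issue to the annotation.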


Two aspects of this design that merit additional discussion are representing the NET statements

as their own resource, and the way in which temporal roles are represented. These two aspects will

be discussed in the remainder of this section.

3.3 Statements about Statements

Although it seems natural to represent the media data network extracted using the NET method

in a graph-based language, it is important to keep in mind that statements and graphs are not

first-class citizens in RDF: contrary to the subject, predicate, and object they consist of, statements

do not have URIs, and it is not possible to make statements about statements. In other words: RDF

is a language for describing resources using triples, not a language for describing triples. We are

not the first to signal this difficulty: MacGregor and Ko (2003) cite the need for enriching triples

to describe event data, and a number of authors want to use RDF for describing RDF documents,

for example for reasoning about provenance and trust (Carroll et al., 2005).

Within the RDFS specification, it is possible to use RDF Reification for making statements

about statements, but this does not have clear formal semantics and is not supported by most

tools. Another solution that does not require additions to the language is using the N-ary design

pattern described by Noy and Rector (2005). Unfortunately, there is no standard vocabulary for

indicating that something is an N-ary triple and what the status of the different places is, making

the resulting graph difficult to interpret by third party applications. Additional proposals exist in

the literature to deal with this problem. Two proposed solutions are adding a fourth place to a

statement, creating a ‘quad’ rather than a triple (MacGregor and Ko, 2003; Dumbill, 2003), or

creating Named Graphs, assigning a URI to a set of statements (Carroll et al., 2005).

As argued in Van Atteveldt et al. (2007), enriching triples is not as simple as adding a URI to

all statements: it is important to distinguish between transparent enrichment, where the original

statement is available, and opaque enrichment, where the original statement is not available to

agents that do not interpret the enrichment. It might seem that transparent enrichment is always

preferable, but quoted statements or temporally limited relations should not be visible outside their

scope. Moreover, some enrichments add a fourth place to the statement, some give more details

about the predicate (such as the quantification used in the NET method), and some enrichments

add information about the whole triple (such as the Dublin Core metadata about an article).

Current proposals either choose one of these cases, or do not provide for mechanisms to indicate

what the intended semantics of the new information are. Since the NET method uses both opaque


and transparent enrichments and has extra arguments, metadata, and predicate quantification, this

is a serious obstacle.

Given these considerations, it was decided to follow Noy and Rector (2005) and let a NET

annotation be represented by a node rather than an arrow, with net:subject, net:object, and

net:predicate edges to the corresponding objects. This means that the network semantics of the NET

network are lost to third parties, but it also means that we can add arbitrary information about

these annotations, such as quality, quoted source, and a link to the newspaper article the annotation

is part of. Note that this is equivalent to using the N-ary design pattern with a URI representing the

statement rather than a blank one. Also, this is structurally equivalent to using RDF reification:

our encoding can be turned into reification by having net:subject be a subproperty of rdf:subject

and likewise for the other relations.
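The structural equivalence with reification can be illustrated with the RDFS subproperty entailment: if p is a subproperty of q, every triple using p also holds for q. The Python sketch below applies that rule once to a toy graph. The annotation identifier ex:annotation1 and the predicate value ex:affinity are hypothetical, and the function is an illustration rather than a full RDFS reasoner (it does not follow chains of subproperties).

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

graph: Set[Triple] = {
    # The annotation as a node with NET-specific edges (values are illustrative)
    ("ex:annotation1", "net:subject", "k06:Bos"),
    ("ex:annotation1", "net:predicate", "ex:affinity"),
    ("ex:annotation1", "net:object", "ont:HealthCare"),
    ("ex:annotation1", "net:quality", "+1"),
    # Assumed schema statements linking NET edges to the reification vocabulary
    ("net:subject", "rdfs:subPropertyOf", "rdf:subject"),
    ("net:predicate", "rdfs:subPropertyOf", "rdf:predicate"),
    ("net:object", "rdfs:subPropertyOf", "rdf:object"),
}


def apply_subproperty_rule(triples: Set[Triple]) -> Set[Triple]:
    """One pass of: (p subPropertyOf q) and (s p o) entail (s q o)."""
    super_of = {}
    for s, p, o in triples:
        if p == "rdfs:subPropertyOf":
            super_of.setdefault(s, set()).add(o)
    inferred = set(triples)  # keep the original triples
    for s, p, o in triples:
        for q in super_of.get(p, ()):
            inferred.add((s, q, o))
    return inferred


reified = apply_subproperty_rule(graph)
```

After the rule is applied, the annotation node carries rdf:subject, rdf:predicate, and rdf:object edges alongside the original NET edges, while properties without a reification counterpart, such as net:quality, are left untouched.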

3.4 Temporal Roles

There are a number of political roles and functions, such as being president, member of parliament,

or member of a party, that are only fulfilled by politicians for a certain time period. The functions

are social roles in the sense that they are anti-rigid, meaning that the existence of the politician

does not depend on his playing a role, and dynamic (cf. Masolo et al., 2004). As surveyed by

Steimann, treating such roles in frame-based models such as RDF is not trivial and has received

considerable attention in the literature (Steimann, 2000; Sowa, 1988, 2000; Guarino, 1992). The

simplest approach, treating a role as a predicate, does not allow for specifying temporal bounds or

other information for the reasons discussed above. Another possibility is making each occurrence

of a role being played a subproperty of the role property. As argued by Steimann (2000), this leads

to a number of complications and does not really solve the problem.

The approach chosen here is creating an adjunct instance representing one occurrence of the

role (Wong et al., 1997). An example of this is the node k06:role-Bos-PvdA in Figure 3. This

instance is linked to the role player (k06:Bos), and the statements that define the role (such as

the k06:partyMemberOf relation with k06:PvdA) are made with the role instance as subject. In our

system, we specify the From and To dates as properties of the role instance. This allows normal

reasoning to occur on the role instance, leaving the task of determining whether a specific person is

a member of a role for a certain annotation to the inference system (see Section 4.1). This is different

from the approach taken by Mika and Gangemi (2004), who use reification for representing roles,

and Gutierrez et al. (2005), who propose an extension to the RDF model that can be expressed

using a form of reification for reasoning with temporal validity. The reasons for taking a different


approach are that we want to work within existing tools, and want to keep querying as simple as

possible, which is enabled by treating the adjunct instances as a ‘normal’ object playing a role

rather than as a reification pointing to an object and a role.

4 Visualising and Searching Newspaper Content

The previous sections introduced the Relational Content Analysis methodology of modeling a text as

statements between actors and issues, and described a way of formalising these statements and the

background knowledge about the actors and issues. As argued by Sicilia (2006), it is important to

keep in mind that metadata is created to perform a specific function. The purpose the annotations

were originally created for was scientific analysis of newspaper content, and an important use of

the system lies in this function. The other purpose for formalising this knowledge is allowing the

user to search and browse through the collection more efficiently. Since the RDF graph is fairly

complex due to the requirements of making statements about the NET statements and of temporal

social roles, it is not sensible to expose the users directly to this representation for either of these

use cases. This section will describe how user queries are translated into SeRQL queries, and how

the results are visualised and presented to the user.

Since our implementation is based on the Sesame RDF storage and inference system (Broekstra,

2005), we translate user queries into SeRQL rather than the upcoming W3C recommendation

SPARQL (Prud’hommeaux and Seaborne, 2006). However, since SeRQL and SPARQL are rather

similar and we only use fairly straightforward constructs, the results presented here apply equally

to SPARQL.

4.1 Reasoning with roles

As described above, NET statements are represented by annotation resources, which have

net:subject and net:object relations to an actor or issue. Actors are often involved in social roles, such

as being president, which connect them to groups or functions for a specific period. As shown in

Figure 4, this is encoded by making an adjunct instance that represents the specific role, which is

connected to the role player, and about which the To and/or From dates are specified. This adjunct

instance is then treated as the role player playing that role, so the temporal background knowledge

is expressed about the adjunct instance. For example, in Figure 4, it is encoded that Bos has played

the role ‘being a PvdA member’ since 1981; the membership of the PvdA is expressed as a direct

characteristic of the role instance.


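[Figure: RDF graph. The role instance k06:role-Bos-PvdA has amcat:roleSubject k06:Bos, k06:memberOf k06:PvdA, and amcat:roleFrom 1981-01-01. The Annotation has dc:date 2006-11-01 and net:subject k06:Bos; a dotted (inferred) net:subject arc links the Annotation to k06:role-Bos-PvdA.]

Fig. 4: Inference of a Role played by a Politician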

Querying this data model directly would lead to very complex queries, where the date of an

annotation has to be compared to dates of roles, with optional paths for missing To or From dates,

for each subject and object in the query. To make the querying easier, we perform a reasoning step

in the RDFS repository, using the rule shown below, which can be read as: If an annotation has

a ?rel relation to an object, and the object is the subject of a certain role-instance, and the role-

instance is valid at that moment in time, then conclude that the annotation has the ?rel relation

to the role-instance.

IF   ?annotation dc:date ?ADate
 AND ?annotation ?rel ?object
 AND ?roleInstance amcat:roleSubject ?object
[AND ?roleInstance amcat:roleFrom ?FDate]
[AND ?roleInstance amcat:roleTo ?TDate]
 AND (?FDate is null OR ?FDate <= ?ADate)
 AND (?TDate is null OR ?TDate >= ?ADate)
THEN ?annotation ?rel ?roleInstance

For example, in Figure 4 the reasoner would conclude the dotted net:subject arc from the An-

notation to the role-Bos-PvdA instance, since there is no To date specified for the role, and the

date of the annotated article is after the From date. This means that we can simply query for an

annotation with a subject who is a member of the PvdA, without knowing that the original subject

Bos has not always been a PvdA member.
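The rule can be illustrated with a minimal Python sketch over an in-memory set of triples. This is not the Sesame inference layer itself: the function name `infer_role_links`, the tuple representation, and the identifier `ex:annotation1` are assumptions made for the example, and the rule is applied only once rather than as part of a full closure computation.

```python
from datetime import date


def infer_role_links(triples):
    """Apply the temporal role rule once and return the newly inferred triples.

    `triples` is a set of (subject, predicate, object) tuples in which dates
    are datetime.date objects. Hypothetical helper, for illustration only.
    """
    ann_date = {s: o for s, p, o in triples if p == "dc:date"}
    role_from = {s: o for s, p, o in triples if p == "amcat:roleFrom"}
    role_to = {s: o for s, p, o in triples if p == "amcat:roleTo"}

    # Map each role player to the role instances it is the roleSubject of.
    roles_of = {}
    for s, p, o in triples:
        if p == "amcat:roleSubject":
            roles_of.setdefault(o, []).append(s)

    inferred = set()
    for ann, rel, obj in triples:
        if ann not in ann_date or rel == "dc:date":
            continue
        a_date = ann_date[ann]
        for role in roles_of.get(obj, []):
            f_date, t_date = role_from.get(role), role_to.get(role)
            # A missing From or To date means the role is open-ended on that side.
            if (f_date is None or f_date <= a_date) and (t_date is None or t_date >= a_date):
                inferred.add((ann, rel, role))
    return inferred - set(triples)


# The situation of Figure 4; "ex:annotation1" is a made-up identifier.
triples = {
    ("k06:role-Bos-PvdA", "amcat:roleSubject", "k06:Bos"),
    ("k06:role-Bos-PvdA", "k06:memberOf", "k06:PvdA"),
    ("k06:role-Bos-PvdA", "amcat:roleFrom", date(1981, 1, 1)),
    ("ex:annotation1", "dc:date", date(2006, 11, 1)),
    ("ex:annotation1", "net:subject", "k06:Bos"),
}
new_triples = infer_role_links(triples)
```

For this input the only new triple links the annotation to the role instance through net:subject, which corresponds to the dotted arc in Figure 4.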

This rule was implemented as a custom inference layer on top of the normal RDFS engine in our

Sesame repository. Sesame computes the closure by exhaustively applying all rules using forward

chaining, deriving all relations between annotations and role instances that were valid on the

annotation date. Since these inferred triples are available to the RDFS inferencer, normal reasoning such as

subclass inference can be used to draw more conclusions from the adjunct instance. For example,

if the role instance is of type Representative, and Representative is a subclass of Politician, we can also

conclude that the subject of our annotation is a politician.9

9 Note that we cannot assert that Bos himself is a politician, since politicians can become societal actors and the other way around.
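The interplay between the inferred role links and ordinary RDFS subclass reasoning can be sketched as a naive forward-chaining loop. The sketch below is an assumption-laden illustration, not Sesame's algorithm: it implements only the two RDFS subclass rules, assumes that Representative is declared an rdfs:subClassOf Politician, and uses made-up `ex:` identifiers.

```python
def rdfs_subclass_closure(triples):
    """Repeat the two RDFS subclass rules until no new triple appears."""
    closure = set(triples)
    while True:
        sub = {(s, o) for s, p, o in closure if p == "rdfs:subClassOf"}
        typed = {(s, o) for s, p, o in closure if p == "rdf:type"}
        new = set()
        for c, d in sub:
            # subClassOf is transitive.
            new |= {(c, "rdfs:subClassOf", e) for d2, e in sub if d2 == d}
            # Instances of a subclass are instances of the superclass.
            new |= {(x, "rdf:type", d) for x, cls in typed if cls == c}
        if new <= closure:
            return closure
        closure |= new


# Hypothetical data: the annotation is already linked to the role instance.
triples = {
    ("ex:annotation1", "net:subject", "ex:role-Bos-Representative"),
    ("ex:role-Bos-Representative", "amcat:roleSubject", "k06:Bos"),
    ("ex:role-Bos-Representative", "rdf:type", "ex:Representative"),
    ("ex:Representative", "rdfs:subClassOf", "ex:Politician"),
}
closure = rdfs_subclass_closure(triples)

# Subjects of the annotation that are typed as politicians after inference.
politician_subjects = {
    o for s, p, o in closure
    if s == "ex:annotation1" and p == "net:subject"
    and (o, "rdf:type", "ex:Politician") in closure
}
```

Only the role instance is typed as a politician; nothing is concluded about k06:Bos himself, in line with footnote 9.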


4.2 Querying the news

We want to allow users to specify patterns in terms of the relational NET network, i.e. the network of

statements between actors or issues. Since the RDF data is formalised at a lower level of abstraction,

it is necessary to translate the user queries to SeRQL queries performed on the underlying data set.

These actors and issues can be specified either as one specific instance, or in terms of the background

knowledge. For example, a user might want to query for “A PvdA-member of parliament disagreeing

with a Defense plan proposed by a Minister.” Such queries must be translated to SeRQL queries

that are evaluated on the RDF repository.

In our system, we assume that a user specifies two types of information: patterns of relations,

and constraints on the nodes in these relations. Canonically, this consists of triples of the form

(Node, +/-, Node) and (Variable, Relation, Value), respectively. For example, the question above

can be expressed by the user as the query given below:10

?Mem - ?Iss

?Min + ?Iss

?Mem ont:memberOf ont:PvdA

?Min rdf:type ont:minister

?Iss ont:subIssueOf ont:Defense

The first line specifies that we are looking for an annotation where a concept ?Mem is negative about

another concept ?Iss. The second line specifies that ?Min has to be positive about the same issue.

The third line constrains ?Mem to be a member of the PvdA party, while the fourth line constrains

?Min to be of type Minister. The final line constrains the issue to be a subissue of Defense.
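A minimal sketch of how such a query could be split into its two kinds of triples is given below. The function name `parse_query`, the regular expression, and the error handling are assumptions for illustration; the system's actual parser is not described in this paper. The `=` sign is accepted because it appears in queries later in this paper.

```python
import re

# A relation pattern is "<node> <sign> <node>", where the sign is +, - or =.
RELATION = re.compile(r"^(\S+)\s+([+\-=])\s+(\S+)$")


def parse_query(text):
    """Split a user query into relation patterns and node constraints."""
    relations, constraints = [], []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        match = RELATION.match(line)
        if match:
            relations.append(match.groups())
        else:
            parts = line.split()
            if len(parts) != 3:
                raise ValueError(f"cannot parse line: {line!r}")
            constraints.append(tuple(parts))
    return relations, constraints


query = """
?Mem - ?Iss
?Min + ?Iss
?Mem ont:memberOf ont:PvdA
?Min rdf:type ont:minister
?Iss ont:subIssueOf ont:Defense
"""
relations, constraints = parse_query(query)
```

The first two lines end up in `relations` and the last three in `constraints`, which mirrors the (Node, +/-, Node) and (Variable, Relation, Value) forms described above.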

These queries are translated into SeRQL as follows: a relation triple is translated into

“{Ann1} net:subject {Su}; net:predicate {Pr}; net:object {Obj}”. If an element of a triple is a URI, it is literally

inserted instead of the placeholder; for variables the variable name is inserted. For node constraints,

the triple is translated into “{Var}Rel{Val}”, where Var is the variable name, and Rel and Val are

the specified relation and value. Since the constraints and relations use the same name for the

shared variables, these translated statements can simply be conjoined to form the FROM part of

the SeRQL. Some very useful but less trivial relations, such as ‘maximal part-of’, needed custom

clauses in the WHERE part of the query. Depending on whether the user wants these patterns

to occur in a single sentence, article, week, or other unit of analysis, an additional link is made

from all annotations in the query to the article identifier, date, or week number. Finally, because

10 This notation can be easily enhanced, for example by allowing chaining of triples (e.g. ?Mem - ?Iss; + ?Iss) or direct specification of properties on the node (e.g. {.ont:memberOf.Pvda} - ?Iss). The translation of such constructions to the triple form is trivial and does not affect the discussion in this paper. We did include an option to specify the label of a concept rather than the URI, where the concept with the best matching label is picked automatically.


we want to return the natural persons rather than the role instances, a link is made using the

(reflexive) roleSubject predicate, “{Var1}amcat:roleSubject{V1}”, and a condition is included in

the WHERE clause to specify that V1 may not be a role instance: “WHERE NOT EXISTS V1

rdf:type amcat:roleInstance”.

A full translation of the user query given above is:

SELECT DISTINCT ArticleID, Ann1, Ann2, Mem, Min, Iss FROM
  {Ann1} net:subject {VAR_Mem};
         net:quality {QUA_1};
         net:object {VAR_Iss};
         dc:subject {} dc:identifier {ArticleID},
  {Ann2} net:subject {VAR_Min};
         net:quality {QUA_2};
         net:object {VAR_Iss};
         dc:subject {} dc:identifier {ArticleID},
  {VAR_Min} amcat:roleSubject {Min};
            rdf:type {ont:minister},
  {VAR_Iss} amcat:roleSubject {Iss};
            amcat:partOf {ont:defensie},
  {VAR_Mem} amcat:roleSubject {Mem};
            ont:memberOfParty {ont:pvda}
WHERE (QUA_1 < "0"^^xsd:float)
  AND (QUA_2 > "0"^^xsd:float)
  AND (NOT EXISTS (
        SELECT * FROM {Iss} rdf:type {amcat:roleInstance}))
  AND (NOT EXISTS (
        SELECT * FROM {Min} rdf:type {amcat:roleInstance}))
  AND (NOT EXISTS (
        SELECT * FROM {Mem} rdf:type {amcat:roleInstance}))
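The translation steps can be sketched as a small Python function that assembles the query string. This is a simplified illustration under several assumptions: the function name `to_serql` is hypothetical, constraint values are taken to be URIs, the article is the unit of analysis, and the constraint predicates are inserted exactly as the user wrote them, so the output does not reproduce the predicate names seen in the full translation above.

```python
def to_serql(relations, constraints):
    """Translate parsed patterns into one SeRQL query string (simplified sketch)."""

    def node(term):
        # Variables become VAR_<name>; URIs are inserted literally.
        return "VAR_" + term[1:] if term.startswith("?") else term

    # Collect variable names in order of first appearance.
    variables = []
    for subj, _, obj in relations:
        for term in (subj, obj):
            if term.startswith("?") and term[1:] not in variables:
                variables.append(term[1:])

    paths, where = [], []
    for i, (subj, sign, obj) in enumerate(relations, start=1):
        paths.append(
            f"{{Ann{i}}} net:subject {{{node(subj)}}};\n"
            f"    net:quality {{QUA_{i}}};\n"
            f"    net:object {{{node(obj)}}};\n"
            f"    dc:subject {{}} dc:identifier {{ArticleID}}"
        )
        if sign in ("+", "-"):
            op = "<" if sign == "-" else ">"
            where.append(f'(QUA_{i} {op} "0"^^xsd:float)')

    for name in variables:
        lines = [f"{{VAR_{name}}} amcat:roleSubject {{{name}}}"]
        lines += [f"    {rel} {{{val}}}" for var, rel, val in constraints if var == "?" + name]
        paths.append(";\n".join(lines))
        # Return the role player itself, never a role instance.
        where.append(
            f"(NOT EXISTS (\n    SELECT * FROM {{{name}}} rdf:type {{amcat:roleInstance}}))"
        )

    select = ["ArticleID"] + [f"Ann{i}" for i in range(1, len(relations) + 1)] + variables
    query = "SELECT DISTINCT " + ", ".join(select) + " FROM\n" + ",\n".join(paths)
    if where:
        query += "\nWHERE " + "\nAND ".join(where)
    return query


relations = [("?Mem", "-", "?Iss"), ("?Min", "+", "?Iss")]
constraints = [
    ("?Mem", "ont:memberOf", "ont:PvdA"),
    ("?Min", "rdf:type", "ont:minister"),
    ("?Iss", "ont:subIssueOf", "ont:Defense"),
]
serql = to_serql(relations, constraints)
```

Because relation patterns and node constraints share variable names, the generated path expressions can simply be joined with commas, as described in the text.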

4.3 Visualising the network

The result of this query can be used in different ways. This section will discuss these scenarios, and

show how they are supported by the system.

Searching and Browsing The simplest scenario is perhaps that of the non-academic user. The

user enters a query, such as “An article in which prime minister Balkenende is positive about Social

Security”, which translates into [Balkenende + ?I | ?I partof SocialSecurity], relying on the label

lookup to resolve the two concepts. The output of this query is shown in Figure 5(a). Each article

is listed with metadata and the number of hits, and two buttons on the left allow the user to inspect

the statements that were identified as hits, and visualise the article. This second button shows the

text of the article and the matched statements, as well as a visualisation of the full network of

the article, with the matched statements highlighted, as shown in Figure 5(b). This visualisation

offers the option to display the raw statements or aggregate by party line or political function, as

discussed above.


(a) List of matched articles

(b) Visualisation of the first article

Fig. 5: Searching for Balkenende on Social Security

Analyzing The Social Scientist gives rise to a more complex scenario. He or she has a specific

pattern in mind, and needs to know how often each possible instantiation of the pattern occurred

for each unit of analysis. For example, a researcher might be interested in the relation between

Parliament and Government. This translates to the query shown in the query window in the top part of

Figure 6(a). This query looks for a relation between an MP, who is a member of the Parliament

and of a Party, and a Minister, who is part of something that is-a Ministry.

The results shown in Figure 6(a) give a very broad overview of a specific aspect of the campaign.

Often, the researcher also wants to know which party and ministry were involved in the relation.

For this, the user can enter a number of variables into the ‘select’ box, which causes the system to

output results for each combination of values in the selected variables, as shown in Figure 6(b).
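The aggregation step amounts to counting query hits per unit of analysis and, optionally, per combination of selected variables. The sketch below assumes that query results arrive as a list of dictionaries; the function name `count_hits`, the column names, and the rows themselves are invented for the example and are not results from the corpus.

```python
from collections import Counter


def count_hits(rows, unit, select=()):
    """Count hits per unit of analysis and per combination of selected variables."""
    counts = Counter()
    for row in rows:
        key = (row[unit],) + tuple(row[var] for var in select)
        counts[key] += 1
    return counts


# Invented rows, only to show the shape of the computation.
rows = [
    {"Month": "2006-08", "Party": "PartyA", "Ministry": "MinistryX"},
    {"Month": "2006-08", "Party": "PartyB", "Ministry": "MinistryX"},
    {"Month": "2006-09", "Party": "PartyA", "Ministry": "MinistryY"},
]
per_month = count_hits(rows, "Month")
per_instantiation = count_hits(rows, "Month", select=("Party", "Ministry"))
```

With an empty `select` the result corresponds to the aggregate view; adding variables splits each unit into one row per instantiation.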

Zooming in Additionally, the researcher wants to be able to check and give meaning to results

by going back to the articles. This is supported by the buttons next to each row, which allow the

researcher to ask for the list of articles or statements that are the raw data of the aggregate result.

Clicking on the ‘show articles’ button next to the August 2006 row produces the list in Figure 6(c).

This list gives the source and headline of all articles matching the pattern in August, and provides


(a) Aggregate results

(b) Frequency per instantiation

(c) List of articles

(d) Visualising the instantiations [graph with party nodes VVD, PvdA, and SP and ministry nodes Min Economy, Min Immigration, Min Health, Min Education, and Min Finance]

Fig. 6: Analyzing the relationship between Government and Parliament


a final option to show the actual article as described for the first scenario, allowing the researcher

to check the pattern and/or collect quotes and headlines for use in the report.

Visualising Finally, since the underlying data is a network, it is often helpful to look at a graph

visualisation of the results. The researcher can enter a triple of variables into the ‘visualise’ box,

and the system then creates a graph consisting of the connections between these variables. Figure

6(d) shows a visualisation of the Party and Ministry variables, showcasing the fact that this makes

it possible to visualise at an arbitrary level of aggregation.11
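Producing such a graph from aggregated results can be sketched as generating GraphViz dot text. The function name `to_dot`, the edge tuple layout, and the label format are assumptions; the dot code actually generated by the system is not shown in this paper, and the example edges are invented.

```python
def to_dot(edges):
    """Render (source, target, average_quality, count) tuples as GraphViz dot text."""
    lines = ["digraph results {"]
    for source, target, quality, count in edges:
        # The label shows the signed average quality and the number of statements.
        label = f"{quality:+.1f} ({count})"
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Invented edges between hypothetical aggregate nodes.
edges = [("PartyA", "MinistryX", -1.0, 1), ("PartyB", "MinistryY", 0.2, 5)]
dot_text = to_dot(edges)
```

Since the nodes are whatever values the chosen variables take, the same function serves any level of aggregation, from individual actors to parties and ministries.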

5 Using the system: Parties in the news

This section will show how the system described above can be used for systematic analysis in a

real-world setting. Earlier election studies reveal a number of aspects of newspaper coverage that

can be shown to have a short-term effect on voter behaviour (Kleinnijenhuis et al., 2007, 2003,

1998). In the remainder of this section, we will perform a selection of the analyses from these

studies, showing that the system can give useful results for actual queries. In particular, we will

consider news on issue positions, conflict between parties, conflict within parties, and news about

a minister who was forced to step down weeks before the elections.

5.1 Issue ownership

Research about the role of media during political (election) campaigns shows that different forms of

news coverage have an important influence on voters’ decisions (Kleinnijenhuis et al., 2007). First

of all, parties try to generate as much media attention as possible, preferably in association with

issues they ‘own’. According to the issue-ownership theory, parties that gain the most attention in

the media for their own issues will get the most votes. An issue is ‘owned’ by a certain party when

the general public perceives that party to be the best party to handle it. Most issues are linked to

parties due to historical cleavages like class and religion. For example, an issue like social security

reflects the interests of the lower classes and is owned by social-democratic parties. Looking at the

issues parties are associated with, we assume that parties “owning” a certain issue will be more

often related to that issue in the news than other parties.

11 Due to the low quality of graph screenshots, the image included here is actually a PostScript rendering of the GraphViz dot code (http://www.graphviz.org) generated by the system.


In order to investigate the amount of attention given to the different issue positions of the

parties, we used the following query, using the maxpartof predicate to retrieve the issue highest in

the hierarchy as described in section 4.2:

?X = ?I

?X partof ?Party

?Party isa Party

?I maxpartof ?Issue

?Issue isa Issue

The results of this query are shown in Table 1, together with the overall attention for that

party (obtained using a simple query [?X = ?Y, ?X partof ?P, ?P isa Party]). The bold cells in

this table represent the issues that the parties are connected with according to a survey held at

the beginning of the campaign period.

Table 1: Parties and Issues

                            Issue statements
Party   Overall attention   Social Security   Values   Financial   Immigration
SP             5.2                5.9           6.8       3.7          0.6
PvdA          21.8               32.5          14.1      19.0         16.6
CDA           39.9               33.1          50.5      44.0         32.1
VVD           31.1               28.0          23.3      32.7         42.8
PVV            2.0                0.4           5.3       0.6          7.9

The table shows the validity of the issue-ownership theory during this election. All parties have

higher scores on the issues they are related to in the popular mind than their overall attention.

Especially the CDA could profit from the issue-ownership. Apart from ‘their’ issue about norms

and values they profit from the news about the financial situation. As stated above, they were able

to claim the success of a growing economy, supplying numerous new jobs and prosperity, while the

traditional ‘issue owner’, the VVD, was busy fighting an internal battle between the leader and

number two. On the other traditional conservative issue, immigration, the new right wing party

PVV managed to compete with the VVD, receiving four times their average attention for their

position on this issue. Interesting in this respect is that the Socialist Party (SP) did not manage to

generate a lot of attention for their position on Social Security, even though respondents indicated

that they associated them with that issue together with the traditional issue owner, the PvdA.
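The comparison underlying this reading of Table 1 is a ratio between a party's share of the statements on an issue and its share of overall attention. The sketch below recomputes that ratio from the figures in Table 1; the function name `attention_ratio` is an assumption, and the ratio is an illustration of the comparison rather than a measure defined in the cited studies.

```python
# Figures copied from Table 1 (percentages).
overall = {"SP": 5.2, "PvdA": 21.8, "CDA": 39.9, "VVD": 31.1, "PVV": 2.0}
issue_share = {
    "SP": {"Social Security": 5.9, "Values": 6.8, "Financial": 3.7, "Immigration": 0.6},
    "PvdA": {"Social Security": 32.5, "Values": 14.1, "Financial": 19.0, "Immigration": 16.6},
    "CDA": {"Social Security": 33.1, "Values": 50.5, "Financial": 44.0, "Immigration": 32.1},
    "VVD": {"Social Security": 28.0, "Values": 23.3, "Financial": 32.7, "Immigration": 42.8},
    "PVV": {"Social Security": 0.4, "Values": 5.3, "Financial": 0.6, "Immigration": 7.9},
}


def attention_ratio(party, issue):
    """Share of issue statements divided by the share of overall attention."""
    return issue_share[party][issue] / overall[party]


pvv_immigration = attention_ratio("PVV", "Immigration")
sp_social_security = attention_ratio("SP", "Social Security")
```

A ratio above 1 means the party is more visible on that issue than in the news overall. For the PVV on immigration the ratio is 7.9 / 2.0, close to four, while for the SP on Social Security it is only slightly above one.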

5.2 Parties and conflict

Besides news about issues, conflict news is an important factor in the success or failure of parties

during elections. Conflict can work in two ways, as an opportunity and as a threat. Whereas criticism


of opponents provides the party and its members an opportunity to create a distinct profile for

themselves, internal criticism can be fatal for a party. Another threat to a party is a controversial

member receiving too much criticism, not only from the direct opponents but also from societal

actors. First we consider the conflict between the main parties during the election campaign. The

following query shows us the conflict among the different parties as well as the internal atmosphere.

!X - !Y

!X partof ?A

!Y partof ?B

?A isa party

?B isa party

[Figure: network graph with the parties PvdA, VVD, CDA, PVV, and SP as nodes; each arrow carries the average direction and the number of statements.]

Fig. 7: Conflict between parties

Using the visualisation as described in the previous section, the results of this query are presented

in Figure 7, with the arrows showing the direction of the statements from one party to

another or towards itself, the thickness of the arrows representing the number of statements of

praise and criticism, and the color representing the average direction of these statements, ranging

from -1 (maximum disapproval, shown as dark red on the screen) to +1 (maximum approval, shown

as dark green on the screen). From the figure we see the heavy battle between the PvdA and the

CDA, which criticise each other most often (PvdA-CDA 270 times and CDA-PvdA 233 times) and both

very harshly (-0.7) in the news coverage. Apart from these main opponents, parties that are each

other’s rivals for votes on the far ends of the political spectrum seem to leave each other in peace.

On the left, the SP and the PvdA don’t fight too much in the news, and also on the right wing, the


VVD and the PVV do not criticise each other too harshly. Most striking in this figure, however, is

the enormous amount of criticism within the VVD. We will turn to this aspect in more detail in

the next section.
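The visual encoding described for Figure 7 can be sketched as two small mapping functions, one from average direction to color and one from statement count to arrow thickness. The exact colors, the white midpoint, and the width range are assumptions chosen for the example; the paper states only that the scale runs from dark red at -1 to dark green at +1 and that thickness reflects the number of statements.

```python
def direction_to_hex(value):
    """Map an average direction in [-1, 1] to a hex color from dark red to dark green."""
    v = max(-1.0, min(1.0, value))
    dark_red, white, dark_green = (139, 0, 0), (255, 255, 255), (0, 100, 0)
    end = dark_green if v > 0 else dark_red
    weight = abs(v)
    # Linear interpolation between white (neutral) and the end color.
    rgb = tuple(round(w + (e - w) * weight) for w, e in zip(white, end))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def arrow_width(count, max_count, min_width=1.0, max_width=5.0):
    """Scale a statement count linearly to an arrow width."""
    return min_width + (max_width - min_width) * count / max_count


# The PvdA-to-CDA edge reported in the text: 270 statements at -0.7.
color = direction_to_hex(-0.7)
width = arrow_width(270, max_count=270)
```

Both values could be passed to GraphViz as the `color` and `penwidth` edge attributes.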

5.3 Internal conflict

As earlier research showed, internal conflict can be fatal for political parties. A party with closed

ranks is regarded as more stable and better able to manage the country than a party in which party members

are fighting for power. In this section we will look at the internal praise and conflict in the parties.

We used the following query:

!X - !Y

!X partof ?A

!Y partof ?A

?A isa party

[Figure: bar chart of the number of internal Criticism and Praise statements (axis from 0 to 200) for CDA, PVV, PvdA, SP, and VVD.]

Fig. 8: Internal criticism and praise for the major parties

The results of this query, imported into Excel to create a graph, are shown in Figure 8. The

extensive news coverage about the internal conflict within the VVD is clearly seen. Although the

party members tried to praise each other as often as possible, this kind of coverage lags far behind

the coverage focusing on the internal fights. The same picture, in somewhat milder form, is seen

for the PvdA, where internal criticism of the leadership exceeds the internal praise in the news.

The opposite tendency is seen for the SP and the right-wing PVV.

5.4 Same place, different guy

If we investigate the source of the criticism of the CDA, we come across the interesting case of

Minister of Justice Donner: After a critical report about a fire in a prison facility at Amsterdam

Airport Schiphol, he and a colleague stepped down as minister, and he was replaced by the veteran


Hirsch Ballin. We would like to know how ‘the minister of justice’ was evaluated before and after

this event.

[Figure: line chart of the evaluation of the Minister of Justice per week (200633 to 200645; vertical axis from -8 to 4), with separate series for Donner and Hirsch Ballin.]

Fig. 9: Criticism of consecutive Ministers of Justice: Donner and Hirsch Ballin (time is expressed as year and week number)

Due to the dynamic nature of roles in our system, one query sufficed for obtaining the results

we wanted to generate. In Figure 9 we see the average value of the criticism/praise for the Minister

of Justice, ranging from -1 (most negative) to +1 (most positive). The most interesting result

is that until week 39 Donner is returned by the query, while after that the query returns Hirsch

Ballin. This shows that the roles in the system work as expected. Substantively, we see that in the weeks

before resigning, the Minister was highly criticised. Only after he announced his resignation did the news

coverage become positive. With the new minister, the news coverage stayed positive and the controversy

was defused.
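The weekly aggregation behind this kind of plot can be sketched as follows. The sketch assumes ISO week numbering for the year-and-week keys (the paper does not state which convention the system uses), the function names are hypothetical, and the statement values are invented for the example rather than taken from the corpus.

```python
from collections import defaultdict
from datetime import date


def week_key(day):
    """Return year and week number as one integer, e.g. 200633 (ISO weeks assumed)."""
    iso_year, iso_week, _ = day.isocalendar()
    return iso_year * 100 + iso_week


def weekly_average(statements):
    """Average quality per (week, role player) from (date, role_player, quality) tuples."""
    totals = defaultdict(lambda: [0.0, 0])
    for day, player, quality in statements:
        entry = totals[(week_key(day), player)]
        entry[0] += quality
        entry[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


# Invented statements about whoever plays the Minister of Justice role on that date.
statements = [
    (date(2006, 8, 14), "Donner", -1.0),
    (date(2006, 8, 16), "Donner", -0.5),
    (date(2006, 10, 9), "Hirsch Ballin", 0.5),
]
averages = weekly_average(statements)
```

Because the role player is resolved per annotation date by the role inference, a single query over the role yields one series per consecutive office holder.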

For political communication research such as the example above, the system presented in this paper has

proven to be extremely useful. Not only does a simple query often suffice to obtain the results over

a large corpus, the queries are easy to adapt in order to sharpen the research question. Moreover,

the direct connection with the articles the results stem from is very useful. This way the researcher

can check if the query really searches for the pattern the researcher is interested in, and it also

helps to sharpen the research question and find quotes from the article to be used in writing the

report.

6 Conclusion

Searching through and analyzing news archives is difficult with keyword-based methods. In this

paper, we presented a system that utilizes detailed annotations of the content of articles acquired

by Relational Content Analysis, a manual annotation technique developed in the Social Sciences.

This technique represents an article as a network of statements between actors and issues, drawing

on a fixed vocabulary of political and societal actors and various issues.


To improve the potential for sharing and combining data, we formalised this vocabulary as

an RDFS ontology. For the issues, we created a SKOS-like thesaurus with multiple categorisation

trees. For the (political) actors, we created a number of political roles with temporal validity to

represent the fact that politicians often switch functions.

The system presented here performs custom RDF inferencing to connect time-stamped newspaper

articles with instances representing a politician playing a certain role that was valid at that

time stamp. To shield users from this complexity, they can query the resulting graph using a simple

query language, which is translated into SeRQL and executed against the repository. The results

of a query can be listed in a table on various levels of aggregation, allowing a user to zoom in

on interesting results and refine his or her query. The system can also visualise the results as a

network, either on its own or as highlighted edges in the total network representing the article.

By performing a number of analyses that have proven useful for predicting voter behaviour in

earlier studies from Communication Science, we make it plausible that this system is useful for

actual analyses on real world data. The system simplifies the analysis of annotated content by

allowing easy definition and evaluation of complex patterns and giving the possibility of zooming

in on the texts that are the raw data of an aggregate result. This allows a researcher to define,

measure, visualise, check, and refine his or her definitions within one tool.

Limitations and Future Work In our perspective, the main limitation of this work lies in the

complexity of the representation. Even with our simplified query language, formulating queries is

not trivial and the user needs a good understanding of the ontology to define sensible queries.

This can be alleviated by adding shortcuts to the query language and by improving on the query

interface, for example using a ‘menu’ of standard queries and a way to interact with the ontology

to find concepts and relations.

Another limitation is the need for creating and maintaining a complex shared ontology by

multiple non-expert users. Although tool support exists for ontology editing and versioning, at

the moment there are no tools that support versioning and modularisation, are flexible enough to

handle the temporal roles needed in this domain, and are simple enough for a user without a background in knowledge

management or computer science. Given the fact that the ontology needs to be adapted as new

events unfold, this effectively means that a Knowledge Management expert has to be part of a

team using this system.

Acknowledgements The authors would like to thank Mark van Assem and Laura Hollink for

insightful comments on the first version of this article and during discussions on this topic.

Bibliography

Antoniou, G. and Van Harmelen, F. (2004). A Semantic Web Primer. MIT Press, Cambridge, Ma.

Brickley, D. and Guha, R. (2004). RDF vocabulary description language 1.0: RDF Schema. W3C Recommendation (http://www.w3.org/TR/rdf-schema/).

Broekstra, J. (2005). Storage, Querying and Inferencing for Semantic Web Languages. PhD thesis, Free University Amsterdam.

Carroll, J., Bizer, C., Hayes, P., and Stickler, P. (2005). Named graphs, provenance and trust. In Proceedings of the Fourteenth International World Wide Web Conference (WWW2005), Chiba, Japan, volume 14, pages 613–622.

Dumbill, E. (2003). Tracking provenance of RDF data. Technical report, ISO/IEC.

Guarino, N. (1992). Concepts, attributes and arbitrary relations: Some linguistic and ontological criteria for structuring knowledge bases. Data and Knowledge Engineering, 8:249–261.

Gutierrez, C., Hurtado, C., and Vaisman, A. (2005). Temporal RDF. In ESWC 2005, number 3532 in LNCS, pages 93–107, Berlin. Springer.

Holsti, O. (1969). Content Analysis for the Social Sciences and Humanities. Addison-Wesley, Reading MA.

Kleinnijenhuis, J., Oegema, D., de Ridder, J., and Ruigrok, P. (1998). Paarse Polarisatie: De slag om de kiezer in de media. Samson, Alphen a/d Rijn.

Kleinnijenhuis, J., Oegema, D., de Ridder, J., van Hoof, A., and Vliegenthart, R. (2003). De puinhopen in het nieuws, volume 22 of Communicatie Dossier. Kluwer, Alphen aan de Rijn (Netherlands).

Kleinnijenhuis, J., Scholten, O., Van Atteveldt, W., van Hoof, A., Krouwel, A., Oegema, D., de Ridder, J. A., Ruigrok, N., and Takens, J. (2007). Nederland vijfstromenland: De rol van media en stemwijzers bij de verkiezingen van 2006. Bert Bakker, Amsterdam.

Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (second edition). Sage Publications.

MacGregor, R. and Ko, I.-Y. (2003). Representing contextualized data using semantic web tools. In Practical and Scalable Semantic Web Systems (workshop at second ISWC).

Masolo, C., Vieu, L., Bottazzi, E., Catenacci, C., Ferrario, R., Gangemi, A., and Guarino, N. (2004). Social roles and their descriptions. In Dubois, D., Welty, C., and Williams, M., editors, Proceedings of the Ninth International Conference on the Principles of Knowledge Representation and Reasoning (KR2004), pages 267–277, Whistler, Canada.

Mika, P. and Gangemi, A. (2004). Descriptions of Social Relations. In Proceedings of the 1st Workshop on Friend of a Friend, Social Networking and the (Semantic) Web.

Miles, A. and Brickley, D. (2005). SKOS core vocabulary specification. W3C Public Working Draft, World Wide Web Consortium, November 2005, http://www.w3.org/TR/swbp-skos-core-spec/.

Noy, N. and Rector, A. (2005). Defining n-ary relations on the semantic web. Working Draft for the W3C Semantic Web best practices group.

Osgood, C., Saporta, S., and Nunnally, J. (1956). Evaluative assertion analysis. Litera, 3:47–102.

Prud’hommeaux, E. and Seaborne, A. (2006). SPARQL query language for RDF. Working Draft for the W3C RDF Data Access Working Group.

Sicilia, M. (2006). Metadata, semantics, and ontology: providing meaning to information resources. International Journal of Metadata, Semantics and Ontologies, 1(1):83–87.

Sowa, J. (1988). Using a lexicon of canonical graphs in a semantic interpreter. In Evens, M., editor, Relational models of the lexicon. Cambridge University Press, Cambridge UK.

Sowa, J. (2000). Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole, Pacific Grove, CA.

Steimann, F. (2000). On the representation of roles in object-oriented and conceptual modelling. Data and Knowledge Engineering, 35:83–106.

van Assem, M., Malaise, V., Miles, A., and Schreiber, G. (2006). A method to convert thesauri to SKOS. In Sure, Y. and Domingue, J., editors, Proceedings of the ESWC’06, number 4011 in Lecture Notes in Computer Science, pages 95–109.

Van Atteveldt, W., Kleinnijenhuis, J., and Carley, K. (2006). Rcadf: Towards a relational content analysis standard. Presented at the International Communication Association (ICA), Dresden.

Van Atteveldt, W. and Schlobach, S. (2005). A modal view on polder politics. In Proceedings of Methods for Modalities (M4M) 2005 (Berlin, 1-2 December).

Van Atteveldt, W., Schlobach, S., and van Harmelen, F. (2007). Media, politics and the semantic web: An experience report in advanced RDF usage. In Franconi, E., Kifer, M., and May, W., editors, ESWC 2007, number 4519 in LNCS, pages 205–219, Berlin. Springer.

Van Cuilenburg, J. J., Kleinnijenhuis, J., and De Ridder, J. A. (1986). Towards a graph theory of journalistic texts. European Journal of Communication, 1:65–96.

Wong, R., Chau, H., and Lochovsky, F. (1997). A data model and semantics of objects with dynamic roles. In Proceedings of IEEE Data Engineering Conference, pages 402–411.
