towards the automatic identification of the nature of citations

13
http://creativecommons.org/licenses/by-sa/3.0 Towards the automatic identification of the nature of citations Angelo Di Iorio – [email protected] Andrea Giovanni Nuzzolese – [email protected] Silvio Peroni – [email protected]

Upload: silvio-peroni

Post on 10-May-2015

437 views

Category:

Technology


3 download

DESCRIPTION

The reasons why an author cites other publications are varied: an author can cite previous works to gain assistance of some sort in the form of background information, ideas, methods, or to review, critique or refute previous works. The problem is that the best possible way to retrieve the nature of citations is very time consuming: one should read article by article to assign a particular characterisation to each citation. In this work we propose an algorithm, called CiTalO, to infer automatically the function of citations by means of Semantic Web technologies and NLP techniques. We also present some preliminary experiments and discuss some strengths and limitations of this approach.

TRANSCRIPT

Page 1: Towards the automatic identification of the nature of citations

http://creativecommons.org/licenses/by-sa/3.0

Towards the automatic identification of the nature of citations

Angelo Di Iorio – [email protected] Giovanni Nuzzolese – [email protected]

Silvio Peroni – [email protected]

Page 2: Towards the automatic identification of the nature of citations

Outline

• Interpretation of bibliographic citations

• CiTalO architecture

• Preliminary empirical evaluation

• Conclusions

Page 3: Towards the automatic identification of the nature of citations

Citations as tools

• Bibliographic citations can be seen as tools for:✦ linking research: making pointers to related works, to source of experimental data, to

methods used, etc.✦ disseminating research: conference proceedings, journals, Web platforms (e.g. blogs,

wikis), Semantic Publishing platforms and projects (e.g. OpenCitation, OpenBibliography, Lucero)

✦ exploring research: new ways of browsing article through networks of citations (e.g. CiteWiz, Citation Sensitive In-browser Summariser)

✦ evaluating research: measuring the importance of journals (e.g. impact factor) or the scientific productivity of authors (e.g. h-index)

• This work is based on the assumption that all these activities can be radically improved by exploiting the actual function of citations, i.e. author’s reason for citing a given paper

✦ Could a paper that is cited many times negatively be given a high score? ✦ Could a paper containing several self-citations be given the same score of a paper with

heterogeneous citations?

Page 4: Towards the automatic identification of the nature of citations

Intended vs. interpretedmeaning of citation functions

It extends the research outlined in earlier work [3].

Intended meaning

Interpreted meaning

The author of the text

A reader of the text Another reader of the text

“Text as a medium through which an author transmits knowledge”

“The reader decodes this meaning and literally ‘make sense’ of it”

“The meaning is not contained within the words

themselves but in the minds of the participants.”

Quotations from From Proteins to Fairytales: Directions in Semantic Publishing by Anita De Waard, DOI: 10.1109/MIS.2010.49

I want to convey a particular meaning through

this text

I recognise a particular meaning by

reading that textI recognise

a particular meaning by reading that text

These three “meanings” are not the same

Page 5: Towards the automatic identification of the nature of citations

http://wit.istc.cnr.it:8080/tools/citalo

Our setting

It extends the research outlined in earlier work [3].Interpreted meaning

A reader of the text Another reader of the text

I recognise a particular meaning by

reading that textI recognise

a particular meaning by reading that text

I recognise a particular meaning by reading that text

CiTalO (English translation: cite it) is a tool that recognises the function of citations by exploiting Semantic Web technologies (CiTO, FRED, SPARQL) and NLP techniques (IMS, AlchemyAPI)

Page 6: Towards the automatic identification of the nature of citations

pipeline

It extends the research outlined in earlier work X.

Ontology learning

Citation type extraction

Word-sense disambiguation

Alignment to CiTO

Sentiment analysis

output:cito:extends

Input: a sentence containing a reference to a bibliographic entity indicated by an “X”

Derive a logical (i.e. an OWL ontology) representation of

the sentence through FRED

Extract candidate types for the citation

by looking for patterns in FRED

output via SPARQL

Gather the sense of the candidate types through IMS with

respect to OntoWordNet

Capture the sentiment polarity emerging from the text through AlchemyAPI

Assign CiTO types to the citation

through SPARQL CONSTRUCT

Page 7: Towards the automatic identification of the nature of citations

Ontology learning andcitation type extraction

FRED outputIt is returned as an OWL ontology in RDF/XML format

SELECT ?type WHERE {?subj ?prop fred:X; a ?typeTmp. ?typeTmp rdfs:subClassOf+ ?type}

SELECT ?type WHERE {?subj ?prop fred:X; a ?type}

SELECT ?type WHERE {?subj a dul:Event;boxer:patient ?patient. ?patient a ?type}

SELECT ?type WHERE {?subj a dul:Event,?type. FILTER(?type != dul:Event)}

Candidates extractionLooking for patterns through SPARQL

Page 8: Towards the automatic identification of the nature of citations

Word-sense disambiguation and alignment to CiTO

• We use IMS (a word-sense disambiguator) to disambiguate the candidate types found, with respect to OntoWordNet (OWN)

✦ EarlierWork and Work: own:synset-work-noun-1✦ Extend: own:synset-prolong-verb-1✦ Outline: own:synset-delineate-verb-3✦ Research: own:synset-research-noun-1

It is possible to extend the set of retrieved

synsets by adding the proximal ones automaticallyW

ordn

et s

ynse

ts

retr

ieve

d by

IMS

CiTO2Wordnet is an ontology we developed to maps all the CiTO properties defining

citations with related Wordnet synsets

CiTO is an ontology that defines

functions of citations as object properties

• Finally we perform SPARQL CONSTRUCT to align the synsets to those defined in the CiTO2Wordnet ontology, which imports CiTO

✦ Properties in CiTO having opposite polarity (that is defined in the CiTOFunctions ontology) to what is returned by the sentiment analysis module are not considered

✦ If no alignment is performed, cito:citesForInformation is returned as default

cito:extends skos:closeMatchown:synset-prolong-verb-1 .

cito:extends is selected!

Page 9: Towards the automatic identification of the nature of citations

A preliminary empirical evaluation

• Comparing the results of CiTalO with a human classification of citations

• The test bed we used for our experiments includes some scientific papers (written in English) encoded in DocBook, containing citations of different types

• We automatically extracted citation sentences, through an XSLT document, from all the papers published in the 7th volume of Balisage Proceedings

✦ 18 scientific papers written by different authors✦ 377 citations (20.94 citations per paper)

• We filtered all the citation sentences from the selected articles✦ The annotation of citation functions is an hard problem to address✦ We kept only 106 citations, which were accompanied by verbs and/or other grammatical structures

carrying explicitly a particular citation function, as suggested by Teufel et al. [18]

• We marked (according to CiTO properties) these 106 citations, obtaining at least one representative citation for each of the 18 paper used (5.89 citations per paper)

• We used 21 CiTO properties out of 38 to annotate all these citations

Page 10: Towards the automatic identification of the nature of citations

CiTalO setting

• We run CiTalO on these 106 citation sentences and compared results with our annotations

• We also tested eight different configurations of CiTalO, corresponding to all possible combinations of three options:

✦ activating or deactivating the sentiment-analysis module;✦ applying or not the proximal synsets to the IMS output;✦ using an extended version of CiTO2Wordnet that includes all the synsets

Wordnet retrieves for a particular string (including those synsets having the gloss radically different to the CiTO property in consideration)

CiTO2Wordnet does not link cito:extends to that synset

CiTO2Wordnet – extendeddo link the synset with cito:extends

Consider the synset own:synset-credit-verb-3, having gloss “accounting: enter as credit”

and the CiTO definition for cito:credits“the citing entity acknowledges contributions made by the cited entity”

Page 11: Towards the automatic identification of the nature of citations

Results

Kinds and number of citations marked by usSimilarly to Teufel et al. [19] the most neutral CiTO property, citesForInformation, was the most prevalent function in our dataset too, as the second most used property was usedMethodIn

Precision and recall of CiTalO according to different configurations

No configuration that emerges as the absolutely best one from these data

Worst configurations were those that took into account all the proximal synsets

Page 12: Towards the automatic identification of the nature of citations

Conclusions

• The implementation of CiTalO is still at an early stage

• Current experiments are not enough to fully validate this approach

• However, the goal of this work was to build such a modular architecture, to perform some exploratory experiments and to identify issues and possible developments of our approach

• Future works:✦ Propose extensions to CiTO to cover more scenarios (e.g. cito:speculatesOn

and cito:citesAsPotentialSolution)✦ Decrease the noise given by proximal synsets✦ Identification of anaphoras (e.g. nouns, pronouns) speaking about the citations✦ Perform exhaustive tests with a larger set of documents and users

Page 13: Towards the automatic identification of the nature of citations

Thanks for your attention

Please come to see CiTalO in action during the ESWC Demo Session on Tuesday