
SETIT 2009

5th International Conference: Sciences of Electronic,

Technologies of Information and Telecommunications March 22-26, 2009 – TUNISIA


Semantic Search of Medical Resources on the Base of

Meta-information and Semantic Annotation

Olfa DRIDI and Mohamed BEN AHMED

RIADI Laboratory, National School of Computer Sciences, Tunis, Tunisia

[email protected]

[email protected]

Abstract: Semantic search has been one of the motivations of the Semantic Web since it was envisioned. In my thesis, I

research the development of a new retrieval model for the exploitation of knowledge represented in ontologies to

improve search over large document repositories. In my proposed approach I consider an adaptation of the classic

model of Information Retrieval (IR), including a meta-information process and a semantic annotation of documents. In this approach, semantic search relies on meta-information and semantic annotation to index documents semantically. The method has been tested on medical corpora, showing promising results with respect to keyword-based search, and providing ground for further analysis and research.

Key words: annotation, medical corpus, meta-information, ontology, semantic search.

INTRODUCTION

Semantic search has been one of the major

envisioned benefits of the Semantic Web since its

emergence in the late 1990s. One way to view a

semantic search engine is as a tool that gets formal

ontology-based queries (e.g. in RDQL, RQL,

SPARQL, etc.) from a client, executes them against a

knowledge base, and returns tuples of ontology values

that satisfy the query [SMM 83]. These techniques

typically use boolean search models, based on an ideal

view of the information space as consisting of non-

ambiguous, non-redundant, formal pieces of

ontological knowledge. A knowledge item is either a

correct or an incorrect answer to a given information

request, thus search results are assumed to be always

100% precise, and there is no notion of approximate

answer to an information need. While this conception

of semantic search brings key advantages already, our

work aims at taking a step beyond. In the view

proposed in this thesis for Information Retrieval in the

Semantic Web, a search engine returns documents,

rather than (or in addition to) exact values, in response

to user queries. Furthermore, as a fundamental

requirement for scaling up to massive information

sources, the engine should rank the documents,

according to concept-based relevance criteria.

A purely boolean ontology-based retrieval model

makes sense when the whole information corpus can

be fully represented as an ontology-driven knowledge

base. But there are well-known limits to the extent to

which knowledge can be formalized this way. First, converting the huge amount of information currently available into formal ontological knowledge at an affordable cost remains an unsolved problem in general. Second, documents hold a value of their

own, and are not equivalent to the sum of their pieces.

Third, wherever ontology values carry free text,

boolean semantic search systems do a full-text search

within the string values. If the values hold long pieces

of text, a form of keyword-based search is taking

place in practice beneath the ontology-based query

model, whereby the “perfect match” assumption starts

to become arguable. Without clear ranking criteria, the search system may become useless when the search space is too large.

The goal of my thesis is the development of an

ontology-based information retrieval model meant for

the exploitation of full-fledged domain ontologies and

knowledge bases, to support semantic search in

document repositories [VFC 05]. In contrast to

boolean semantic search systems, in the proposed

perspective full documents, rather than specific

ontology values from a KB, are returned in response

to user information needs. To cope with large-scale

information sources, an adaptation of the classic

vector-space model [SMM 83] is proposed, suitable

for an ontology-based representation, upon which a

ranking algorithm is defined.


1. Ontology and information retrieval

Every domain has phenomena that people identify as conceptual or physical objects, connections, and situations. Through various language mechanisms, such phenomena are associated with certain descriptors (for example, names and noun phrases). To solve an information retrieval task successfully, it is necessary to present the user's knowledge about the domain of interest in a form suitable for computer processing. The specifications of a high-level domain are formed by integrating the domain structures of low-level domains. It is important to achieve interoperability of domain knowledge representations, and the ontological approach is an appropriate tool for this task. An ontology is an agreement about the common use of concepts; it contains means of representing subject knowledge and agreements on methods of reasoning. It can be considered a description of a view of the world in some specific sphere of interest. An ontology consists of: 1) a set of terms; 2) a set of rules for their use that limit their meanings in the context of a concrete domain [CBA 04].

An ontology is a knowledge base of a special kind that holds semantic information about some domain. It is a set of definitions, in some formal language, of a fragment of declarative knowledge intended for shared and repeated use by various users and applications.

Ontological commitments are agreements aimed at the coordinated and consistent use of a common vocabulary. The agents (human beings or software agents) that jointly use the vocabulary do not need a common knowledge base: one agent can know something that the others do not, and an agent that handles the ontology is not required to answer every question that can be formulated with the common vocabulary.

Every domain with a definite subject of research has its own terminology, a distinctive vocabulary used to discuss the typical objects and processes of that domain. A library, for example, involves a vocabulary relating to books, references, bibliographies, magazines, etc. Thus, a domain is partly revealed by its vocabulary, the set of words used in it. Clearly, however, the specificity of a domain is not shown by its vocabulary alone. It is also necessary (i) to provide strict definitions of the grammar that governs how vocabulary terms combine into statements, and (ii) to make clear the logical connections between such statements. Only when this additional information is available is it possible to understand both the nature of the domain objects and the important relations established between them. An ontology is a structured representation of this information [GMM 03].

The formal model of a domain ontology O is an ordered triple O = <X, R, F>, where X is a finite set of subject-domain concepts represented in ontology O; R is a finite set of relations between the concepts of the given subject domain; and F is a finite set of interpretation functions defined on the concepts and relations of ontology O [FVC 06].
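As a minimal sketch of the triple O = <X, R, F>, the following Python fragment stores concepts, relations, and interpretation functions in plain containers. The class name `Ontology`, its field and method names, and the sample concepts are illustrative assumptions; they are not taken from the implemented system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple

# A relation instance is (relation name, source concept, target concept).
Relation = Tuple[str, str, str]


@dataclass
class Ontology:
    """Hypothetical container for the triple O = <X, R, F>."""
    concepts: Set[str] = field(default_factory=set)        # X
    relations: Set[Relation] = field(default_factory=set)  # R
    interpretations: Dict[str, Callable[[str], Set[str]]] = field(
        default_factory=dict)                              # F

    def add_relation(self, name: str, source: str, target: str) -> None:
        # Both endpoints must already be concepts of the ontology.
        if source not in self.concepts or target not in self.concepts:
            raise ValueError("relation endpoints must belong to X")
        self.relations.add((name, source, target))

    def related(self, name: str, source: str) -> Set[str]:
        # Concepts reachable from `source` through one `name` edge.
        return {t for (n, s, t) in self.relations if n == name and s == source}


# Illustrative content only; the concept and relation names are assumptions.
onto = Ontology(concepts={"Disease", "BreastCancer", "Mastectomy"})
onto.add_relation("is_a", "BreastCancer", "Disease")
onto.add_relation("treated_by", "BreastCancer", "Mastectomy")
# One interpretation function: map a concept to its direct super-concepts.
onto.interpretations["parents"] = lambda c: onto.related("is_a", c)
```

Keeping F as a dictionary of named functions is only one possible reading of "interpretation functions"; a real ontology language such as OWL expresses the same information declaratively.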

2. Related Work

The view of the semantic retrieval problem in this

thesis is very close to the proposals in KIM [KPT 04]

[PKO 04]. While KIM focuses on automatic population

and annotation of documents, my work focuses on the

ranking algorithms for semantic search. Along with

TAP, KIM is one of the most complete proposals

reported to date, to my knowledge, for building high-

quality KBs, and automatically annotating document

collections at a large scale. The proposed work

complements KIM and TAP with a ranking algorithm

specifically designed for an ontology-based retrieval

model, using a semantic indexing scheme based on

annotation weighting techniques.

Semantic Portals [MSS 03] typically provide

simple search functionalities that may be better

characterized as semantic data retrieval, rather than

semantic information retrieval. Searches return

ontology instances rather than documents, and no

ranking method is provided. In some systems, links to

documents that reference the instances are added in

the user interface, next to each returned instance in the

query answer [CVM 05], but neither the instances, nor

the documents, are ranked.

The ranking problem has been taken up again in

[SSS 03], and more recently [RSD 04]. Whereas both

of these works are concerned with ranking query

answers (i.e. ontology instances), my work is

concerned with ranking the documents annotated with

these answers. Since the respective techniques are applied in consecutive phases of the retrieval process, it would be interesting to experiment with integrating the query result relevance function proposed by Stojanovic et al. into the document relevance measures under definition in the thesis.

Finally, the thesis shares with Mayfield and Finin

[MFT 03] the idea that semantic search should be a

complement of keyword-based search as long as not

enough ontologies and metadata are available. Also, I

believe that inferencing is a useful tool to fill

knowledge gaps and missing information (e.g.

transitivity of the located-in relationship over

geographical locations).
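To illustrate the kind of inference meant here, the sketch below derives indirect located-in facts from direct ones by computing a transitive closure. The function name `transitive_closure` and the sample facts are assumptions made for the example; the paper does not describe which reasoner is used.

```python
from typing import Dict, Set


def transitive_closure(direct: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For every place, return all places it is located in, directly or not."""
    closure: Dict[str, Set[str]] = {}
    for start in direct:
        seen: Set[str] = set()
        stack = list(direct[start])
        while stack:
            place = stack.pop()
            if place not in seen:  # the `seen` set also guards against cycles
                seen.add(place)
                stack.extend(direct.get(place, ()))
        closure[start] = seen
    return closure


# Direct facts only: Tunis is located in Tunisia, Tunisia in Africa.
located_in = {"Tunis": {"Tunisia"}, "Tunisia": {"Africa"}}
inferred = transitive_closure(located_in)
```

With these facts, a query about documents concerning Africa could also reach documents annotated only with Tunis, which is the knowledge gap that inference is meant to fill.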

3. Proposed Approach

In the proposed view of semantic information

retrieval, I assume a knowledge base has been built

and associated with the information sources (the

document base), by using one or several domain

ontologies that describe concepts appearing in the

document text. The implemented system can work

with any arbitrary domain ontology with essentially no

restrictions, except for some minimal requirements,

which basically consist of conforming to a set of root

ontology classes: Concept should be the root of all

domain classes that can be used (directly or after

subclassing) to create instances that describe specific


entities referred to in the documents.

Document is used to create instances that act as

proxies of documents from the information source to

be searched upon. Taxonomy is the root for class

hierarchies that are merely used as classification

schemes, and are never instantiated. The concepts and

instances in the Knowledge Base (KB) are linked to

the documents by means of explicit, non-embedded

annotations to the documents.

While I do not address here the problem of

knowledge extraction from text [CVM 05] [KPT 04]

[PKO 04], I use a tool to aid in the semi-automatic

annotation of documents.

The information retrieval system uses the annotations to index resources semantically.

In the classic vector-space model, keywords

appearing in a document are assigned weights

reflecting that some words are better at discriminating

between documents than others. Similarly, in this

system, annotations are assigned a weight that reflects

how important the instance is considered to be for the

document meaning. Weights are computed automatically by an adaptation of the TF-IDF algorithm [SMM 83], based on the frequency of the concepts of the instances in each document. We therefore propose CF-IDF, a concept-based counterpart of TF-IDF, to compute the weight of each concept.
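The paper does not spell out the CF-IDF formula. The sketch below assumes it mirrors standard TF-IDF, with concept occurrences in place of keyword occurrences: the weight of concept c in document d is its frequency in d, normalised by the most frequent concept in d, multiplied by log(N / n_c), where N is the number of documents and n_c the number of documents annotated with c. The function name `cf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cf_idf(doc_concepts: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Weight each concept per document as normalised frequency times IDF."""
    n_docs = len(doc_concepts)
    # Number of documents annotated at least once with each concept.
    doc_freq = Counter(
        c for concepts in doc_concepts.values() for c in set(concepts))
    weights: Dict[str, Dict[str, float]] = {}
    for doc_id, concepts in doc_concepts.items():
        counts = Counter(concepts)
        max_count = max(counts.values(), default=1)
        weights[doc_id] = {
            c: (n / max_count) * math.log(n_docs / doc_freq[c])
            for c, n in counts.items()
        }
    return weights


# Toy annotated corpus (assumed data): concept occurrences per document.
corpus = {
    "d1": ["BreastCancer", "BreastCancer", "Mastectomy"],
    "d2": ["BreastCancer", "Chemotherapy"],
}
weights = cf_idf(corpus)
```

In this toy corpus the concept BreastCancer occurs in every document, so its weight is zero: as with keywords, a concept that appears everywhere does not help discriminate between documents.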

Figure 1: General schema of the proposed approach

The approach proposes ontology-based information retrieval. It can be seen as an evolution of classic keyword-based retrieval techniques, where the keyword-based index is replaced by a semantic knowledge base. The overall retrieval process is illustrated in Figure 1. The system takes as input a natural-language query, which can be represented by a set of concepts.

The query is executed against the semantically annotated corpus, which returns a list of pertinent documents that satisfy the query and the user's profile. Finally, the documents annotated with these concepts are retrieved and ranked.
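The retrieval and ranking step can be sketched as a cosine comparison between a query concept vector and the concept vectors of the documents, as in the classic vector-space model. This is a simplified illustration under assumptions: the weights are made up, the function names `cosine` and `rank` are hypothetical, and the user profile mentioned above is not modelled.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # concept name -> weight


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query: Vector, docs: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Return documents sharing a concept with the query, best match first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0.0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Assumed concept weights for three documents and a one-concept query.
docs = {
    "d1": {"BreastCancer": 0.9, "Mastectomy": 0.4},
    "d2": {"BreastCancer": 0.2, "Chemotherapy": 0.8},
    "d3": {"Diabetes": 1.0},
}
query = {"BreastCancer": 1.0}
ranking = rank(query, docs)
```

Document d3 shares no concept with the query and is dropped; d1 precedes d2 because the query concept dominates its vector.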

3.1. Corpus construction

We started by constructing our resource collection from the web. We downloaded documents from these web sites:

3.2. Meta-information generation

We define meta-information as information about

information.

Meta-information generation is the act of creating or producing such information. Generating good-quality meta-information efficiently is essential for organizing and making accessible the growing number of rich resources available on the web and in corpora.

3.3. Semantic annotation

'Annotation', in contemporary English, according to WordNet, has two meanings:

• note, annotation, notation: a comment (usually added to a text);

• annotation, annotating: the act of adding notes.

In linguistics (and particularly in computational

linguistics) an annotation is considered a formal note

added to a specific part of the text. There are a number

of alternative approaches regarding the organization,

structuring, and preservation of annotations. For

instance, all the markup languages (HTML, SGML,

XML, etc.) can be considered schemata for embedded

or in-line annotation.

Semantic annotation is information about which entities (or, more generally, semantic features) appear in a text and where they appear. Formally, semantic annotations represent a specific sort of metadata, which provides references to concepts in ontologies.
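A semantic annotation in this sense can be held as a small stand-off record: a character span plus a reference to an ontology concept. The sketch below is an assumption-laden illustration; the class `SemanticAnnotation`, the helper `annotate`, and the `http://example.org/...` identifier are hypothetical and do not reflect GATE's internal model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SemanticAnnotation:
    start: int    # offset of the first annotated character
    end: int      # offset one past the last annotated character
    concept: str  # identifier of the ontology concept (hypothetical URI)


def annotate(text: str, surface: str, concept: str) -> List[SemanticAnnotation]:
    """Link every case-insensitive occurrence of `surface` to `concept`."""
    found: List[SemanticAnnotation] = []
    lowered, needle = text.lower(), surface.lower()
    pos = lowered.find(needle)
    while pos != -1:
        found.append(SemanticAnnotation(pos, pos + len(needle), concept))
        pos = lowered.find(needle, pos + len(needle))
    return found


text = "The breast cancer ontology covers breast cancer concepts."
annotations = annotate(
    text, "breast cancer", "http://example.org/onto#BreastCancer")
```

Because the record lives outside the text, the document itself stays unchanged, in contrast to the embedded markup schemes mentioned above.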

4. Development

4.1. Our semantic space

To realize our approach, we propose two ontologies. The first, called the meta-information ontology, describes information related to our domain. The second, called the breast cancer ontology, describes concepts of the medical domain and their relationships.

We used Protégé-2000 as the tool to construct our ontologies.

4.1.1. Breast cancer ontology

Our aim in developing the breast cancer ontology is not to provide a perfect, generic ontology that encompasses all of breast cancer. Instead, we have based it on relevant literature (articles, courses, ...).

The ontology contains too many concepts to be shown in a small figure.

4.1.2. Meta-information ontology

In this ontology, we present the information that we consider pertinent for describing the profile of users. This information is shown in Figure 2: title, author, laboratory, school, description of resource, keywords, speciality, target audience, format of resource, size of resource, subject, date, language, ...

Figure 2: meta-information ontology

4.2. GATE

We have used GATE, which is a large-scale

infrastructure for natural language processing

applications. Linguistic data associated with language

resources such as documents and corpora is encoded

in the form of annotations. GATE supports a variety of

formats including XML, RTF, HTML, SGML, email

and plain text. In all cases, when a document is

created/opened in GATE, the format is analysed and

converted into a single unified model.

Provided with GATE is a set of reusable

processing resources for common NLP tasks. (None of

them are definitive, and the user can replace and/or

extend them as necessary.) These are packaged

together to form ANNIE, A Nearly-New IE system,

but can also be used individually or coupled together

with new modules in order to create new applications.

For example, many other NLP tasks might require a

sentence splitter and POS tagger, but would not

necessarily require resources more specific to IE tasks

such as a named entity transducer. The system is in

use for a variety of IE and other tasks, sometimes in

combination with other sets of application-specific

modules.

ANNIE consists of the following main processing

resources: tokeniser, sentence splitter, POS tagger,

gazetteer, finite state transducer (based on GATE’s

built-in regular expressions over annotations

language), orthomatcher and coreference resolver.

The resources communicate via GATE’s

annotation API, which is a directed graph of arcs

bearing arbitrary feature/value data, and nodes rooting

this data into document content.

We used these processing resources to extract document information related to the concepts in our ontologies.

These processing resources are shown in Figure 3.

The tokeniser splits text into simple tokens, such

as numbers, punctuation, symbols, and words of

different types (e.g. with an initial capital, all upper

case, etc.). The aim is to limit the work of the

tokeniser to maximise efficiency, and enable greater

flexibility by placing the burden of analysis on the

grammars. This means that the tokeniser does not need

to be modified for different applications or text types.
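The idea of a deliberately simple tokeniser can be sketched with a single regular expression that emits typed tokens and leaves all further analysis to later stages. This is not GATE's tokeniser: the pattern, the function name `tokenise`, and the token kind labels are assumptions chosen for illustration.

```python
import re
from typing import List, Tuple

# Runs of digits, runs of letters, or one punctuation/symbol character.
TOKEN_PATTERN = re.compile(r"\d+|[^\W\d_]+|[^\w\s]")


def tokenise(text: str) -> List[Tuple[str, str]]:
    """Split text into (token, kind) pairs using surface form only."""
    tokens: List[Tuple[str, str]] = []
    for match in TOKEN_PATTERN.finditer(text):
        s = match.group()
        if s.isdigit():
            kind = "number"
        elif s.isalpha():
            if s.isupper():
                kind = "word:allCaps"
            elif s[0].isupper():
                kind = "word:upperInitial"
            else:
                kind = "word:lowercase"
        else:
            kind = "punctuation"
        tokens.append((s, kind))
    return tokens


tokens = tokenise("GATE has 7 modules.")
```

Nothing in this function depends on the domain or text type, which is the property the paragraph above attributes to the tokeniser.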

The sentence splitter is a cascade of finite state

transducers which segments the text into sentences.

This module is required for the tagger. Both the

splitter and tagger are domain and application-

independent.

The tagger is a modified version of the Brill

tagger, which produces a part-of-speech tag as an

annotation on each word or symbol. Neither the splitter nor the tagger is a mandatory part of the NE

system, but the annotations they produce can be used

by the grammar, in order to increase its power and

coverage.

[Labels extracted from Figure 2 (meta-information ontology), translated from French. Concepts: person, author, reader, student, student researcher, specialist physician, general practitioner, layperson, laboratory, university, country, document, text document, multimedia document, article, report, book, course, literature, journal, conference, subject, key concept, title, language, format, type, duration, publication date, training date, training level, training type, initial training, continuing training, specialized training, target audience. Relations: assigned to, belongs to, located in, written with, written by, published in, concerns, works on, has, intended for.]


Figure 3: The processing steps of ANNIE

The gazetteer consists of lists such as cities,

organisations, days of the week, etc. It not only

consists of entities, but also of names of useful

indicators, such as typical company designators (e.g.

‘Ltd.’), titles, etc. The gazetteer lists are compiled into

finite state machines, which can match text tokens.
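A gazetteer lookup can be sketched as longest-match search of token sequences against named lists. GATE compiles its lists into finite state machines; the sketch below substitutes a plain dictionary keyed by token tuples, so it shows the behaviour rather than the actual implementation. The function names and list contents are assumptions.

```python
from typing import Dict, List, Tuple

Gazetteer = Dict[Tuple[str, ...], str]


def build_gazetteer(lists: Dict[str, List[str]]) -> Gazetteer:
    """Map each lower-cased token sequence to the name of its list."""
    return {
        tuple(entry.lower().split()): name
        for name, entries in lists.items()
        for entry in entries
    }


def lookup(tokens: List[str], gazetteer: Gazetteer) -> List[Tuple[int, int, str]]:
    """Return (start, end, list name) for the longest match at each position."""
    longest = max((len(key) for key in gazetteer), default=0)
    lowered = [t.lower() for t in tokens]
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(lowered):
        for size in range(min(longest, len(lowered) - i), 0, -1):
            key = tuple(lowered[i:i + size])
            if key in gazetteer:
                matches.append((i, i + size, gazetteer[key]))
                i += size  # continue after the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return matches


gaz = build_gazetteer({"city": ["Tunis", "New York"], "day": ["Monday"]})
found = lookup("The meeting in New York is on Monday".split(), gaz)
```

Each match is a token span tagged with its list name, which later grammar rules can use as evidence for an entity type.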

The semantic tagger consists of handcrafted rules

written in the JAPE (Java Annotations Pattern Engine)

language, which describe patterns to match and

annotations to be created as a result. JAPE is a version

of CPSL (Common Pattern Specification Language),

which provides finite state transduction over

annotations based on regular expressions. A JAPE

grammar consists of a set of phases, each of which

consists of a set of pattern/action rules, and which run

sequentially. Patterns can be specified by describing a

specific text string, or annotations previously created

by modules such as the tokeniser, gazetteer, or

document format analysis.

The orthomatcher is another optional module for

the IE system. Its primary objective is to perform co-

reference, or entity tracking, by recognising relations

between entities. It also has a secondary role in

improving named entity recognition by assigning

annotations to previously unclassified names, based on

relations with existing entities.

The coreferencer finds identity relations between

entities in the text.

4.3. Semantic annotation

We annotated the documents using GATE: we open our breast cancer ontology in GATE and annotate the documents with concepts from the ontologies.

In a nutshell, Semantic Annotation is about

assigning to the entities in the text links to their

semantic descriptions (ontologies). This sort of

metadata provides both class and instance information

about the entities.

Figure 4 shows a step of annotating a document in GATE using our breast cancer ontology.

4.4. Semantic indexing and search

The generated meta-information and the semantic annotations enable semantic indexing and search. This type of indexing enables new (semantically enhanced) access methods. The user can specify queries that consist of constraints on the types of entities, the relations between the entities, and the entity attributes. For example, one could specify the named entities (NEs) that must be referred to in the documents of interest, with name restrictions (e.g. a Person whose name ends with ‘Alabama’). An example of a query consisting of pattern restrictions over entities could be: “give me all documents referring to breast cancer”. To answer the query, our system applies the semantic restrictions over the entities in the instance base. The resulting set of documents also contains, for example, documents that mention terms the ontology relates to “breast cancer”, such as mastectomy.

Figure 4: Semantic annotation with GATE

5. Conclusion and Future Work

This paper presented the notion of semantic search,

a new model allowing ontology-based annotation,

indexing, and retrieval.

The evaluation work done so far does not yet provide enough empirical justification of the feasibility of the approach, technology, and resources being used.

The challenges for the general approach can be summarized as follows:

• Develop (or adapt) an evaluation metric that properly measures the performance of a semantic annotation system;

• Evaluate the semantic IR against a traditional IR engine, so as to formally measure the positive effect of semantic indexing (cutting out irrelevant results through the semantic restrictions, and retrieving more correct results, e.g. when an entity is mentioned under another alias but is still indexed by its unique identifier).
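One possible starting point for such a comparison is set-based precision and recall computed for each engine against the same relevance judgements. The sketch below is a generic illustration: the function name `precision_recall` is hypothetical, and the document identifiers are placeholder data, not experimental results.

```python
from typing import Iterable, Set, Tuple


def precision_recall(retrieved: Iterable[str],
                     relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a result list against judged documents."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Placeholder judgements and result lists, for illustration only.
relevant = {"d1", "d2", "d3"}
run_a = ["d1", "d4", "d5"]
run_b = ["d1", "d2", "d6"]
scores_a = precision_recall(run_a, relevant)
scores_b = precision_recall(run_b, relevant)
```

Cutting out irrelevant results shows up as higher precision, and retrieving documents through aliases shows up as higher recall, so the two effects named above can be measured separately.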


REFERENCES

[CBA 04] Contreras, J., Benjamins, V. R., et al: A

Semantic Portal for the International Affairs

Sector. 14th International Conference on

Knowledge Engineering and Knowledge

Management (EKAW 2004). LNCS Vol. 3257

(2004) 203-215

[CVM 05] Castells, P., Fernández, M., Vallet, D.,

Mylonas, P., Avrithis, Y.: Self-Tuning

Personalized Information Retrieval in an

Ontology-Based Framework. 1st IFIP

International Workshop on Web Semantics

(SWWS 2005). LNCS Vol. 3532 (2005) 455-470

[FVC 06] M. Fernández, D. Vallet, P. Castells.

Probabilistic Score Normalization for Rank

Aggregation. 28th European Conference on

Information Retrieval (ECIR 2006). London,

April 2006. Springer Verlag Lecture Notes in

Computer Science, Vol. 3936, pp. 553-556.

[GMM 03] Guha, R. V., McCool, R., and Miller, E.:

Semantic search. 12th International World

Wide Web Conference (WWW 2003).

Budapest, Hungary (2003) 700-709

[KPT 04] Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic Annotation, Indexing, and Retrieval. Journal of Web Semantics 2:1 (2004) 49-79

[MFT 03] Mayfield, J., Finin, T.: Information retrieval

on the Semantic Web: Integrating inference

and retrieval. Workshop on the Semantic Web

at the 26th International ACM SIGIR

Conference on Research and Development in

Information Retrieval (SIGIR 2003). Toronto,

Canada (2003)

[MSS 03] Maedche, A., Staab, S., Stojanovic, N., Studer,

R., Sure, Y.: SEmantic portAL: The SEAL

Approach. In: Fensel, D., Hendler, J. A.,

Lieberman, H., Wahlster, W. (eds.): Spinning

the Semantic Web. MIT Press, Cambridge

London (2003) 317-359

[PKO 04] Popov, B., Kiryakov, A., Ognyanoff, D.,

Manov, D., Kirilov, A.: KIM – A Semantic

Platform for Information Extraction and Retrieval. Journal of Natural Language Engineering 10:3-4 (2004) 375-392

[RSD 04] Rocha, C., Schwabe, D., de Aragão, M. P.: A

Hybrid Approach for Searching in the

Semantic Web. International World Wide Web

Conference (WWW 2004), New York (2004)

374-383

[SMM 83] Salton, G., McGill, M. Introduction to Modern

Information Retrieval. McGraw-Hill, New

York (1983)

[SSS 03] Stojanovic, N., Studer, R., Stojanovic, L.: An

Approach for the Ranking of Query Results in

the Semantic Web. 2nd International Semantic

Web Conference (ISWC 2003). LNCS Vol.

2870 (2003) 500-516

[VFC 05] Vallet, D., Fernández, M., Castells, P.: An

Ontology-Based Information Retrieval Model.

2nd European Semantic Web Conference

(ESWC 2005). LNCS Vol. 3532 (2005) 455-470
