fioboda - semantic annotation framework for web … · 2016-11-17 · semantic annotation method,...

http://www.iaeme.com/IJCET/index.asp

International Journal of Computer Engineering & Technology (IJCET)
Volume 7, Issue 5, Sep–Oct 2016, pp. 65–76, Article ID: IJCET_07_05_008
Available online at http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=7&IType=6
Journal Impact Factor (2016): 9.3590 (Calculated by GISI) www.jifactor.com
ISSN Print: 0976-6367 and ISSN Online: 0976–6375
© IAEME Publication

FIOBODA - SEMANTIC ANNOTATION FRAMEWORK FOR WEB EXTRACTED DATA

C. Gnana Chithra
Equity Research Consultant, Angeeras Securities, Chennai, Tamilnadu, India

Dr. E. Ramaraj
Professor, Department of Computer Science and Engineering, Alagappa University, Karaikudi, Tamilnadu, India

ABSTRACT

Semantic annotation of web pages is the state-of-the-art technology for achieving the unified objective of a Semantic Web universe, which enables document content to be shared and reused beyond the boundaries of individual applications. The web is a treasury of knowledge, and efficient tools should be designed to explore its structured and unstructured data. Annotating millions of web pages manually is an impossible task. For high information retrieval rates, automatic annotation of documents is mandatory. Metadata is added to web pages to make them intelligent for processing in content-based intelligent applications. This paper analyses the problems with current semantic annotation systems and proposes a new ontology-based automatic annotation framework. Ontology-based semantic annotation is one of the best methods for extracting data from the knowledge base.

The contributions of this paper are the integration of the modified Manning's sentence boundary detection algorithm, the noun phrase collocation algorithm and classification using machine learning techniques in the Information Extraction module, together with a new data model and ontology for the structured ontology engineering model. The Annotation module annotates the output of the Information Extraction module with the aid of ontologies and dictionaries and stores the resulting annotated data as RDF triples in the annotation database. Reasoning over the annotated data is performed through the RDF repository interface. FIOBODA stands for Financial Instruments Ontology Based Open Document Annotation. Web pages extracted from the financial securities domain are mapped to the finance ontology to extract the subject, predicate and object. An SVM classifier is used to separate correct from incorrect annotations. The correct annotation data is stored in the annotation database and the RDF repository for later use. The proposed framework to an extent solves the knowledge bottleneck problem through its reusability and interoperability features.

Key words: Dublin Core, FIOBODA, Financial Securities Ontology, Metadata, Semantic Annotation Framework.




Cite this Article: C. Gnana Chithra and Dr. E. Ramaraj, Fioboda - Semantic Annotation Framework For Web Extracted Data. International Journal of Computer Engineering and Technology, 7(5), 2016, pp. 65–76.
http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=7&IType=6

1. INTRODUCTION

Many researchers are working in the area of the semantic web to develop techniques and tools for searching, mining, accessing and reasoning over semantic data. Human annotation of web content is highly accurate but does not scale. When a large volume of web data needs to be annotated, manual annotation falls short in both quality and speed. The alternative is to transform the human-readable web into machine-readable data by adding metadata to each document, which makes it an intelligent document. In the recent past, researchers have proposed many methodologies and frameworks for semi-automatic and automatic annotation, with or without ontologies and other lexicons. The semantic web includes technologies such as metadata, ontologies, and inference and logic modules for reasoning.

The Merriam dictionary [1] defines annotation as “to add a short explanation or opinion to a text or drawing”. When a web document is enriched with metadata for machine processing, the process is called semantic annotation. Although billions of documents are present on the web and their number keeps growing, search engines such as Google, Yahoo and Bing do not support semantic analysis to a large extent. Annotation types can be classified based on their functions, the features used and the prevailing technologies. Using metadata together with the content would enable rich semantic applications for the web.

The different kinds of annotation are textual annotation, image annotation, PDF annotation, multimedia annotation and web annotation. Extensive research has been carried out in the field of Information Extraction, including named entity recognition and relation extraction. When Dublin Core metadata elements such as Creator, Title, Subject, Description, Format and Date are incorporated into a web page, the spider or crawler builds a content index of the website for each page. When the user makes a semantic search in a semantic search engine, the underlying information in the semantically marked-up web page helps in ranking the page using the content index, and the resulting pages are available for further processing. Semantic search is more efficient than the normal word-to-word search made by other search engine algorithms. The crawler indexes only the text content of the website, whereas images, audio and video are ignored.
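
As an illustration of the indexing step described above, the following minimal sketch collects Dublin Core elements from a page and stores them in a per-page content index. It assumes the common convention of embedding Dublin Core in HTML as `<meta name="DC.title" content="...">`; the class name, function name, sample page and URL are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> elements from one page."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name.startswith("dc.") and attr.get("content"):
            # "DC.title" is stored under the key "title"
            self.elements.setdefault(name[3:], []).append(attr["content"])


def build_content_index(pages):
    """Map each page URL to the Dublin Core elements found in its HTML."""
    index = {}
    for url, html in pages.items():
        parser = DublinCoreParser()
        parser.feed(html)
        index[url] = parser.elements
    return index


# Hypothetical sample page, used only to exercise the sketch.
SAMPLE_PAGE = """
<html><head>
<meta name="DC.title" content="Equity markets overview">
<meta name="DC.creator" content="Example Author">
<meta name="DC.subject" content="equity">
<meta name="DC.subject" content="financial instruments">
</head><body><p>Equity is a financial instrument.</p></body></html>
"""

index = build_content_index({"http://example.org/equity": SAMPLE_PAGE})
print(index["http://example.org/equity"])
```

A semantic search engine could then rank pages by matching query terms against the indexed elements rather than against the raw text alone.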

At present, semi-automatic semantic annotation systems are more widely used. This is because automatic semantic annotation is still limited in scalability and accuracy when generating and representing annotation models.

2. RELATED STUDIES

Open annotations on the web can be classified into two types. The first is the creation of semi-automatically annotated documents using ontologies [2]. The second is automatic annotation [3], which is where the focus of research has currently moved. [4] designed a new strategy incorporating information extraction and machine learning techniques for annotating documents. Baumgartner et al. [5] designed wrappers to extract data from the web using supervised learning techniques. Kiryakov et al. [6] designed “KIM for knowledge and information management infrastructure for automatic semantic annotation”. Dill et al. [7] created a tool for semantic tagging of texts in large corpora. The concept of Open Annotation developed by the Open Annotation Collaboration [8] has been adopted by the W3C Open Annotation Community Group.

3. DEFINITION OF SEMANTIC ANNOTATION

Handschuh [9] defines semantic annotation as “An annotation attaches some data to some other data: it establishes, within some context, a (typed) relation between the annotated data and the annotating data.” Kiryakov et al. [6] define semantic annotation as a schema whose more specific generated metadata enables the discovery of new information access methods and the extension of existing ones. Haase [10] explains that semantic metadata can be defined as linking related terms with each other.

4. FORMAL ANNOTATION

An annotation can be expressed as a tuple containing four elements, SAM = {C, S, O, P}, where SAM stands for semantic annotation method, “C” is the context in which the annotation is made, “S” is the subject of the annotation or the data to be annotated, “P” is the predicate of the annotation, that is, the relationship between the annotated data and the annotating data, and “O” is the object of the annotation. In formal annotations all the elements of “SAM” are expressed as Uniform Resource Identifiers (URIs). In the ontological representation of semantic annotation, the predicate and the object are ontological terms, and the object conforms to the ontological standards.

5. OPEN ANNOTATION

Open annotation [8] is a strategy for modeling web-based documents for annotation. The documents are linked to the World Wide Web in line with the principles of structured and unstructured data. The annotated documents are shared across different clients and servers and by the tools and applications of the semantic web. The URN is published and stored in the annotation servers with no particular protocol associated with it.

6. HUMAN ANNOTATION

Subject experts in the area of financial securities were requested to annotate the web pages. The annotators annotated the instances with the targets. The experts arrived at different results, which semantically enriched the web pages to a large extent, and more than one identifier was assigned to the same web page. Gold standard data was obtained from the results of the annotators.

7. FIOBODA FRAMEWORK

The proposed automatic semantic annotation framework is depicted in Fig. 1. In this framework the crawler collects data from the web and stores the selected pages as web documents. The web documents are the input to the Information Extraction module. After low-level information processing the data is passed to the annotation module, where the extracted entities and relationships are compared with the ontological concepts and each entity is annotated with its root concept.

Figure 1 FIOBODA - Automatic annotation Framework Diagram


Apart from ontologies, other lexicons such as WordNet, Wikipedia and Google are used as knowledge bases during the annotation process. The resulting annotations are verified for correctness. A correct annotation is added to the annotation database; otherwise it is rejected. The query parser sends queries to the inference engine, which applies reasoning techniques to obtain results from the knowledge base as well as from the annotation database.
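
The accept-or-reject flow described above can be sketched as follows. This is only an illustration of the control flow, not the authors' implementation: the `extract`, `annotate` and `verify` functions are hypothetical stubs, and `KNOWN_CONCEPTS` is a toy stand-in for the ontology and lexicons.

```python
# Hypothetical toy knowledge base: entity label -> ontology concept.
KNOWN_CONCEPTS = {"equity": "FinancialInstrument"}


def extract(document):
    """Stub for the Information Extraction module: lowercase word tokens."""
    return [word.strip(".,").lower() for word in document.split()]


def annotate(entities):
    """Stub for the annotation module: pair each entity with a concept, if any."""
    return [(entity, KNOWN_CONCEPTS.get(entity)) for entity in entities]


def verify(annotation):
    """Stub for the verification step: accept only annotations with a concept."""
    return annotation[1] is not None


def run_pipeline(documents):
    """Route verified annotations to the database and the rest to a reject list."""
    annotation_db, rejected = [], []
    for document in documents:
        for candidate in annotate(extract(document)):
            (annotation_db if verify(candidate) else rejected).append(candidate)
    return annotation_db, rejected


accepted, rejected = run_pipeline(["Equity markets."])
print(accepted, rejected)
```

In the framework itself the verification step is performed by a trained classifier (Section 11) rather than by a simple lookup.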

Human annotation requires a large set of data as the training set. Supervised algorithms also require very large data sets for testing and training. Compared to supervised learning, semi-supervised learning requires less data. Automatic semantic annotation also requires data initially for learning, but very little compared to semi-automatic technologies.

8. INFORMATION EXTRACTION MODULE

The input to this module is the set of extracted web pages. HTML scraping is performed to remove the HTML tags as well as to filter out audio, video and images, and the HTML document is converted to plain text. The text is parsed with a robust lightweight parser. The modified sentence boundary detection and classification algorithm [11], designed by us for this research on semantic annotation, is used in this phase and is given in Fig. 2.

Sentence boundaries are detected and classified correctly even in the presence of abbreviations, including those of geographical locations and university degrees, and of URLs.
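
The HTML scraping step can be illustrated with a minimal sketch built on Python's standard `html.parser` module. The class and function names are hypothetical, and the choice of which elements to skip is an assumption; the paper does not specify its scraper.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; drop tags, scripts, styles and media elements."""

    SKIPPED = {"script", "style", "audio", "video"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Text inside a skipped element (e.g. script source) is ignored.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Convert an HTML document to plain text."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


print(html_to_text("<p>Equity is a <b>financial</b> instrument.</p><img src='a.png'>"))
```

Images need no special handling here because an `<img>` tag carries no text content; it disappears together with the other tags.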

Figure 2 Modified Manning’s sentence detection algorithm

MODIFIED MANNING’S HEURISTIC ALGORITHM

• Place putative sentence boundaries after all occurrences of . ? ! (and maybe ; : -)
• Move the boundary after following quotation marks, if any.
• Disqualify a period boundary in the following circumstances:
    • If it is preceded by a known abbreviation of a sort that does not normally occur word finally, but is commonly followed by a capitalized proper name, such as Prof. or vs.
    • The period character ‘.’ in the initials of a person's name should not start a separate sentence.
    • The period character in the name of an educational degree should not split the text into sentences.
    • Look up the ontology to recognize the educational qualification.
    • If the abbreviation contains numbers, check it against the ontology.
    • Abbreviations other than educational degrees and geographical data are checked against the WordNet ontology and an ontology containing honorary titles, family titles and professional titles.
    • A URL should not be split, although it contains periods.
    • A sentence should not be split after an ellipsis in English.
• Disqualify a boundary with a ? or ! if:
    • It is followed by a lowercase letter (or a known name).
• When the parentheses or brackets of a sentence are unbalanced, do not split the sentence. Balance the parentheses or brackets by inserting or replacing the mark.
• Regard other putative sentence boundaries as sentence boundaries.
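
The heuristic above can be approximated with the following sketch. It is a simplified illustration, not the algorithm of [11]: the abbreviation sets are small hypothetical stand-ins for the ontology and WordNet lookups, and the function names are invented for this example. Splitting on whitespace means that periods inside a URL or inside "Ph.D." are never boundary candidates; only the final character of a token is tested.

```python
import re

# Hypothetical stand-ins for the ontology and WordNet lookups named in Fig. 2.
TITLE_ABBREVIATIONS = {"prof.", "dr.", "mr.", "mrs.", "vs."}
DEGREE_ABBREVIATIONS = {"ph.d.", "m.sc.", "b.e.", "m.b.a."}
INITIAL_PATTERN = re.compile(r"^[A-Z]\.$")


def is_boundary(token, next_token):
    """Decide whether a sentence boundary follows `token`."""
    # Move the boundary after closing quotation marks or brackets.
    stripped = token.rstrip("\"'”’)")
    if not stripped or stripped[-1] not in ".?!":
        return False
    if stripped.endswith("..."):
        return False  # no split after an ellipsis
    if stripped[-1] in "?!":
        # Disqualify ? or ! when a lowercase letter follows.
        return not (next_token and next_token[0].islower())
    lowered = stripped.lower()
    if lowered in TITLE_ABBREVIATIONS or lowered in DEGREE_ABBREVIATIONS:
        return False
    if INITIAL_PATTERN.match(stripped):
        return False  # initial of a person's name, e.g. "E."
    return True


def split_sentences(text):
    """Split text into sentences, never splitting inside open parentheses."""
    tokens = text.split()
    sentences, current, depth = [], [], 0
    for i, token in enumerate(tokens):
        current.append(token)
        depth += token.count("(") - token.count(")")
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if depth <= 0 and is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current, depth = [], 0
    if current:
        sentences.append(" ".join(current))
    return sentences


SAMPLE = ("Prof. E. Ramaraj holds a Ph.D. from a university. "
          "Visit http://www.iaeme.com/IJCET/index.asp today! "
          "Equity is raised by owners.")
for sentence in split_sentences(SAMPLE):
    print(sentence)
```

The sketch trades recall for simplicity: a sentence that genuinely ends in a single capital letter followed by a period would be missed, which is the kind of case the ontology lookups in the full algorithm are meant to resolve.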


The correctly classified sentences are parsed using sentence segmentation techniques. Each sentence is then tokenized into smaller units called tokens. The stop words in the list are removed, morphological analysis is performed to find the root of each word, and Porter’s stemming algorithm stems, or cuts, the words down to their roots. Finally, Part-of-Speech tagging is applied to the tokens to identify the named entities and associate them with POS tags.
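
A minimal sketch of the tokenization and stop-word steps is given below. Note the substitution: `crude_stem` is a deliberately simple suffix stripper used only as a placeholder and is not Porter's algorithm, and POS tagging is omitted because it needs a trained tagger. The stop-word list and all names are hypothetical.

```python
import re

# Hypothetical, very small stop-word list.
STOP_WORDS = {"the", "is", "a", "an", "of", "by", "and", "to", "in", "are"}
SUFFIXES = ("ies", "ing", "ed", "es", "s")


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence.lower())


def crude_stem(word):
    """Strip one common suffix. A placeholder, NOT Porter's algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + ("y" if suffix == "ies" else "")
    return word


def preprocess(sentence):
    """Tokenize, drop stop words and stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(sentence) if t not in STOP_WORDS]


print(preprocess("Equities are owned by investors."))
```

In practice the placeholder would be replaced by a real Porter stemmer and followed by a POS tagger from an NLP toolkit.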

Using the Collocation Extraction and Filtering Noun Phrase algorithm [12] (which is also part of this research), phrases are extracted from the corpus and, where they conform to the rules of the noun phrase filters, are classified as noun phrase collocations. These noun phrases are passed on to the annotation phase.

Figure 3 Collocation Phrase Algorithm

9. ONTOLOGY DESIGN AND MANAGEMENT

An ontology is a model made up of concepts, attributes and relations. It defines the relationships between the elements in a machine-readable way and describes the things that exist in the real universe. A taxonomy can be defined as a hierarchical representation of things. Ontologies and taxonomies are business models that allow concepts to be defined at different levels of granularity. An ontology adds information to the taxonomy, which helps it define the concepts in a machine-readable manner.

The first statement in the ontology is owl:Thing, which means that every class in the ontology is a subclass of the root class owl:Thing and that the ontology is built around the things in the real universe.

Algorithm for Collocation Phrase Extraction (Figure 3)

Input: List of phrases or n-grams extracted after pre-processing the web document.
Step 1: Take a phrase p1 from the list of phrases P = {p1, p2, p3, ..., pn} in the collection.
Step 2: Compare the phrase p1 with the WordNet super thesaurus. If the phrase exists, add it to the potential collocation candidate (PCC) set and go to step 7; otherwise go to step 3.
Step 3: Compare the phrase p1 with the Wikipedia Pronoun ontology. The basic requirement is that p1 should be in all capital letters. If the search shows that the phrase exists as the first element in the main body, add it to PCC. If it is a normal noun phrase it need be capitalized. If the phrase exists, add it to PCC and go to step 7; otherwise go to step 4.
Step 4: Perform a Google search on p1. The search engine result page (SERP) outputs ranked results; if p1 is above the threshold, add it to PCC and go to step 7; otherwise go to step 5.
Step 5: Search for p1 in the BNC dictionary. If the phrase is available, add it to PCC and go to step 7; otherwise go to step 6.
Step 6: Search the Geographic Gazetteer for a proper noun phrase. If it matches, add it to PCC.
Step 7: If the phrase cannot be classified as PCC through steps 2 to 6, mark the phrase as REJECTED CANDIDATE and add it to the rejected list.
Step 8: Move on to the next phrase p2. Go to step 2 and proceed until the entire set is exhausted.
Step 9: Finally, PCC contains the collocation phrases.
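
The cascade of lookups in Steps 2 to 7 can be sketched as below. All resources are replaced by small hypothetical in-memory sets, the SERP ranking is modeled as a score with an assumed threshold, and Step 3 is interpreted as an upper-cased lookup; none of these choices come from [12].

```python
# Hypothetical in-memory stand-ins for the external resources in Steps 2 to 6.
WORDNET_PHRASES = {"stock market", "financial instrument"}
WIKIPEDIA_TITLES = {"EQUITY SHARE"}
SERP_SCORES = {"listed options": 0.82, "blue moon equity": 0.10}
BNC_PHRASES = {"contractual terms"}
GAZETTEER = {"Chennai", "Karaikudi"}
SERP_THRESHOLD = 0.5  # assumed value, for illustration only


def classify_phrase(phrase):
    """Return the name of the first resource that accepts the phrase, or None."""
    checks = [
        ("wordnet", lambda p: p.lower() in WORDNET_PHRASES),
        ("wikipedia", lambda p: p.upper() in WIKIPEDIA_TITLES),
        ("serp", lambda p: SERP_SCORES.get(p.lower(), 0.0) > SERP_THRESHOLD),
        ("bnc", lambda p: p.lower() in BNC_PHRASES),
        ("gazetteer", lambda p: p in GAZETTEER),
    ]
    for name, accepts in checks:
        if accepts(phrase):
            return name
    return None


def extract_collocations(phrases):
    """Split phrases into potential collocation candidates and rejected ones."""
    pcc, rejected = [], []
    for phrase in phrases:
        (pcc if classify_phrase(phrase) else rejected).append(phrase)
    return pcc, rejected


PHRASES = ["stock market", "Equity share", "listed options", "blue moon equity",
           "contractual terms", "Chennai", "raised quickly"]
pcc, rejected = extract_collocations(PHRASES)
print(pcc)
print(rejected)
```

Ordering the checks from cheapest to most expensive, as the algorithm does, means a network search is only needed for phrases that the local lexicons do not already cover.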

An ontology is a business model which explains the relationships between entities, and additional logical information about the entities, in a way which is machine readable. The ontology, like a taxonomy, contains definitions of things in the real world. The foundation of an ontology is therefore its taxonomy, that is, the hierarchical class structure of those real-world things. The classes in the ontology should have a formal explicit description, attributes or properties for each class, and constraints or restrictions on those properties.

The financial securities domain is analyzed and discussed in this paper. Financial instruments are also called financial securities. The different securities are Equities, Debts, Swaps, Spots, Futures, Listed options etc. The classes of the financial instruments are given in Figure 4.

Figure 4 Top level Classes of Financial Instruments Ontology
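
The three ingredients named above (class description, properties and restrictions on them) can be illustrated with a small sketch. Only the class names come from the text; the `OntologyClass` type, the `identifier` and `owner` properties and their type restrictions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    name: str
    description: str
    parent: Optional["OntologyClass"] = None
    properties: dict = field(default_factory=dict)  # property name -> allowed type

    def ancestors(self):
        """Names of all superclasses, nearest first."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def all_properties(self):
        """Own properties plus those inherited from superclasses."""
        inherited = self.parent.all_properties() if self.parent else {}
        return {**inherited, **self.properties}

    def violations(self, values):
        """Property names that are undeclared or break the type restriction."""
        allowed = self.all_properties()
        return [key for key, value in values.items()
                if key not in allowed or not isinstance(value, allowed[key])]


thing = OntologyClass("owl:Thing", "Root of the class hierarchy")
instrument = OntologyClass("FinancialInstrument", "Any financial security",
                           parent=thing, properties={"identifier": str})
TOP_LEVEL = {
    name: OntologyClass(name, f"{name} class", parent=instrument)
    for name in ("Equities", "Debts", "Swaps", "Spots", "Futures", "ListedOptions")
}
TOP_LEVEL["Equities"].properties["owner"] = str  # hypothetical property

print(TOP_LEVEL["Equities"].ancestors())
```

A production ontology would be authored in OWL with a dedicated editor or library; the sketch only shows how the hierarchy and the property restrictions relate to each other.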

Here the financial instrument is a thing in the universe. The financial instruments are classified as per the CFI (Classification of Financial Instruments) standard taxonomy. Equity capital is money that the company raises from investors as per the contractual terms, and investors in turn gain money by trading those shares in the stock market. The class Equity and its sub-classes [13] are given in Fig. 5.

Figure 5 Equity Classes in Financial Instruments ontology

The concept Equity has the relationships “is raised by”, “is owned by”, “has rights defined” and “is a” between the entities. The following facts illustrate the relationship between the subject and the object.

E.g. Owner has rights of equity.
Equity is raised by owners.
Equities are owned by investors.
Equity is a financial instrument.
Equity securities has rights defined in Contractual terms.

Figure 6 Example word with subject, predicate and object classification

Equity (s) | is raised by (p) | owners (o)

Here the word “Equity” refers to the subject, “owners” is the object, and “is raised by” is the relationship between the entities.
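
The example facts can be written as subject-predicate-object triples, and the four-element SAM tuple of Section 4 adds the context to such a triple. The sketch below is illustrative only: the namespace `http://example.org/fioboda#`, the local names and the context URL are hypothetical, chosen because Section 4 requires every element to be a URI.

```python
from typing import NamedTuple

BASE = "http://example.org/fioboda#"  # hypothetical namespace


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class SAM(NamedTuple):
    """Semantic annotation tuple {C, S, O, P}, every element a URI."""
    context: str
    subject: str
    obj: str
    predicate: str

    def as_triple(self):
        return Triple(self.subject, self.predicate, self.obj)


def uri(local_name):
    return BASE + local_name


FACTS = [
    Triple(uri("Equity"), uri("isRaisedBy"), uri("Owner")),
    Triple(uri("Equity"), uri("isOwnedBy"), uri("Investor")),
    Triple(uri("Equity"), uri("isA"), uri("FinancialInstrument")),
    Triple(uri("EquitySecurity"), uri("hasRightsDefinedIn"), uri("ContractualTerms")),
]


def objects_of(subject, predicate, facts=FACTS):
    """All objects linked to `subject` through `predicate`."""
    return [t.obj for t in facts if t.subject == subject and t.predicate == predicate]


annotation = SAM(context="http://example.org/pages/equity.html",
                 subject=uri("Equity"), obj=uri("Owner"),
                 predicate=uri("isRaisedBy"))
print(annotation.as_triple())
print(objects_of(uri("Equity"), uri("isA")))
```

Keeping the context outside the triple makes it possible to store the same fact once while recording every page on which it was annotated.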

Page 8: FIOBODA - SEMANTIC ANNOTATION FRAMEWORK FOR WEB … · 2016-11-17 · semantic annotation method, “C” stands for the context of the annotation in which the annotat ... documents

http://www.iaeme.com/IJC

The excerpt from the financial securities ontology representation [14] is given in Fig. 7.

Figure 7 Slic

Ontologies are well defined and it represents up

grave concern in the semantic annotation systems.1. When a concept in the ontology is removed then the

server to the web page.

2. When the classification of ontology is modified the annotated documents in the server should reflect the new

changes. The identifiers associated with a web page also needs to b

3. The ontology needs to be on par with the latest updated info carriers such as

entities and their relationships.

10. ANNOTATION MODULE

The extracted noun phrases from the web document, which are instances are matched t

find the higher level of concept. The conceptual representation of the word is matched with the instance.

The values of attributes of that particular concept are

annotated.

It is not mandatory that all the attributes of concepts need to be filled. The more the

attributes the concepts is clearly marked for the instance. The index range of all the instances is stored in a

file.

When there is overlapping of the concepts then there

with the relation are the possible candidates

C. Gnana Chithra and Dr. E. Ramaraj

CET/index.asp 72

The excerpt from the financial securities ontology representation [14] is given in Fig. 7.

Slice of financial instruments ontological representation

Ontologies are well defined and represent up-to-date information. Maintenance of ontologies is a grave concern in semantic annotation systems. When a concept in the ontology is removed, there is a conflict between the annotated documents in the [...]. When the classification of the ontology is modified, the annotated documents on the server should reflect the new changes. The identifiers associated with a web page also need to be updated. The ontology needs to be on par with the latest updated information carriers, such as Wiki, to identify the latest entities and their relationships.

ANNOTATION MODULE

The noun phrases extracted from the web document, which are instances, are matched to the ontology to find the higher-level concept. The conceptual representation of the word is matched with the instance. The values of the attributes of that particular concept are filled with the values in the document to be [...] that all the attributes of concepts need to be filled. The more the number of filled [...] clearly marked for the instance. The index range of all the instances is stored in a [...]. When concepts overlap, there exists a relation between them. The concepts with the relation are the possible candidates for annotation. The context of the higher-level concept, from the word to the sentence, is analyzed to find the lower-level concepts. It is assumed that there exists a spatial proximity between the concepts. The instances in the extracted web data are annotated with higher-level concepts.
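The matching step described above can be sketched roughly as follows. This is only an illustration, not the FIOBODA implementation: the toy `ONTOLOGY` dictionary, its concept and instance labels, and the `annotate` function are hypothetical names introduced here, and exact string matching stands in for whatever matching the framework actually uses.

```python
# Hypothetical toy ontology: concept -> parent concept and known instance labels.
ONTOLOGY = {
    "Equity": {"parent": "FinancialInstrument",
               "instances": {"bonus share", "preference share"}},
    "Debt": {"parent": "FinancialInstrument",
             "instances": {"bond", "debenture"}},
}


def annotate(text, noun_phrases):
    """Match extracted noun phrases (instances) to ontology concepts.

    Returns a list of (phrase, start, end, concept) tuples, where
    start/end give the index range of the instance in the text.
    """
    annotations = []
    lowered = text.lower()
    for phrase in noun_phrases:
        key = phrase.lower()
        start = lowered.find(key)
        if start == -1:
            continue  # phrase does not occur in this document
        for concept, entry in ONTOLOGY.items():
            if key in entry["instances"]:
                annotations.append((phrase, start, start + len(key), concept))
    return annotations


text = "The board approved a bonus share issue and redeemed one bond."
result = annotate(text, ["bonus share", "bond", "board"])
```

Here "board" is dropped because it matches no instance in the toy ontology, while the other two phrases are returned with their index ranges and higher-level concepts.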

Annotations are represented in the system in RDF/XML format. A Uniform Resource Identifier (URI) may take the form of a Uniform Resource Name (URN), which is used for internal reference to the document; otherwise it may take the form of a Uniform Resource Locator (URL) for external reference on the web. The identifier of each annotation is checked to see whether it is a URL. If it is, it is already a valid URI and need not be converted, and if the annotation exists in the web document, it can be stored on the server later for indexing. When the annotated document is not a web document, a corresponding URN is generated and later published and stored on the local server. The web document is integrated with the annotation data and stored for automatically annotating the documents.
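A minimal sketch of this identifier handling and RDF/XML serialization, using only the Python standard library, might look like the following. The function names, the use of `dc:description` for the matched instance and `dc:subject` for the concept, and the choice of a UUID-based URN are assumptions made for illustration; the paper does not specify them.

```python
import uuid
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def resource_identifier(location):
    """Keep a web URL as it is; otherwise mint a URN for a local document."""
    if location and urlparse(location).scheme in ("http", "https"):
        return location  # a URL is already a URI
    return uuid.uuid4().urn  # e.g. 'urn:uuid:...' (assumed URN scheme)


def annotation_rdf(location, instance, concept):
    """Serialize one annotation as an RDF/XML string."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": resource_identifier(location)},
    )
    ET.SubElement(desc, f"{{{DC_NS}}}description").text = instance
    ET.SubElement(desc, f"{{{DC_NS}}}subject").text = concept
    return ET.tostring(root, encoding="unicode")


rdf_xml = annotation_rdf("http://example.com/a.html", "bonus share", "Equity")
```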

Figure 8 Graphical representation of Class Equity in the ontology.

Figure 8 is adapted from [14]. It represents the classes that exist under owl:Thing.

11. CLASSIFICATION OF ANNOTATION BY MACHINE LEARNING

The resulting annotated pages are classified into correct annotations and wrong annotations using the SVM classifier model. Features were studied for the classification; the correctly classified annotations were stored in the annotation database and the incorrect annotations in the rejected list. The correctly classified annotations serve as training data for future classifications.

11.1. SVM Classifier

SVM is a machine learning algorithm for binary classification. The concept behind the SVM classifier is that the input vectors are mapped non-linearly into a high-dimensional feature space. In that space, a linear decision surface separates the training data with the maximum margin between the two classes. Given the feature set and the training data, each test item is assigned to the class to which it corresponds. The features are mapped to the feature space for performing the optimization. If the training examples cannot be separated, the regularization parameter can be used to balance a larger margin against a larger training error.
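To make the role of the regularization parameter concrete, the following is a small sketch of a soft-margin linear SVM trained by subgradient descent on the hinge loss. It is not the classifier used in the paper (whose kernel, features and parameters are not given); the two-feature vectors, the parameter `c`, and the function names are hypothetical.

```python
def train_linear_svm(samples, labels, c=1.0, epochs=200, lr=0.01):
    """Soft-margin linear SVM trained by subgradient descent.

    Minimises 0.5 * ||w||^2 + c * sum(max(0, 1 - y * (w . x + b))).
    A larger c penalises training errors more; a smaller c favours
    a wider margin.
    """
    n = len(samples)
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            # Regulariser gradient, spread evenly over the n samples.
            grad_w = [wj / n for wj in w]
            grad_b = 0.0
            if y * score < 1:  # inside the margin or misclassified
                grad_w = [g - c * y * xj for g, xj in zip(grad_w, x)]
                grad_b = -c * y
            w = [wj - lr * g for wj, g in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Return +1 (correct annotation) or -1 (wrong annotation)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, centred two-feature vectors for annotated pages.
train_x = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3)]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_x, train_y, c=1.0)
```

With `w` and `b` learned, `classify(w, b, x)` assigns a new feature vector to one of the two classes.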

12. EVALUATION

Evaluating the FIOBODA framework is a difficult procedure. Hence the performance metrics proposed by Yang [15] are used to evaluate FIOBODA. First, a confusion matrix (or error matrix), as described by Kohavi [16], is constructed, which permits the visualization of the performance. This error matrix has two dimensions: the actual and the predicted classification.

The confusion matrix is given in Table 1.

Table 1 Confusion Matrix

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Where

TP represents the number of correct predictions for positive instances (True Positives)

FP represents the number of incorrect predictions for negative instances (False Positives)

FN represents the number of incorrect predictions for positive instances (False Negatives)

TN represents the number of correct predictions for negative instances (True Negatives)
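The measures reported in the evaluation tables (precision, recall, F-score, fallout, accuracy and error) follow from these four counts by their standard definitions. The sketch below shows the arithmetic; the function name and the example counts are illustrative and are not taken from the paper's datasets.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Compute standard measures from confusion-matrix counts."""

    def ratio(num, den):
        # Guard against empty classes.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)      # correct among predicted positives
    recall = ratio(tp, tp + fn)         # found among actual positives
    f_score = ratio(2 * precision * recall, precision + recall)
    fallout = ratio(fp, fp + tn)        # false alarms among actual negatives
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "fallout": fallout,
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
    }


# Illustrative counts only.
m = evaluation_metrics(tp=90, fp=10, fn=5, tn=95)
```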

The metrics preferred by Yang [15] are used to evaluate the FIOBODA framework. Three different datasets, Dataset 1, Dataset 2 and Dataset 3, were extracted from large corpora covering three domains: two from the stock markets and one from corporate websites. The top-ranked named entities with their precision and recall values are given in Table 2.

Table 2 Top Ranked Entities with Precision and Recall

Named Entity        Precision    Recall
equity              98.34%       99.12%
preference share    98.12%       99.00%
dividend            99.00%       98.23%
bonus share         97.32%       98.67%
investment          70.23%       65.34%

The entity “dividend” has a high precision of 99% and a recall of 98.23%. The entity “investment”, however, records low precision and recall rates (70.23% and 65.34%) due to its lack of specificity.

Table 2 Evaluating the proposed annotation framework with different datasets

Domain       Precision    Recall    F-score    Fallout    Accuracy    Error
Dataset-1    97.54%       96.95%    96.97%     0.11%      95.5%       4.5%
Dataset-2    98%          81.25%    88.84%     0.11%      96%         6%
Dataset-3    95.55%       98.47%    96.89%     0.31%      94.66%      5.34%

After pre-processing with the Modified Manning’s sentence boundary detection algorithm and the noun phrase collocation detection algorithm is applied to the datasets, the resultant extracted entities are of high quality. Dataset 1 contains 20000 instances to be annotated, Dataset 2 contains 25000 entities and Dataset 3 contains 15000 entities for annotation and classification.

Dataset 1 and Dataset 3 show a high recall rate, as in Table 2. Since the fallout for Dataset 1 and Dataset 2 is very low and the error is correspondingly low, the FIOBODA framework proves to be successful. Though Dataset 3 has high precision and recall, its fallout, the proportion of irrelevant data, is 0.31%. The results in Table 2 are represented graphically in Fig. 9. The accuracy levels are also above 94%, which is within the acceptable range for the newly designed FIOBODA framework.

Figure 9 Graphical representation of performance measures on the datasets using the FIOBODA framework

Table 3 Evaluation of SVM Classifier on Datasets

DATASET      PRECISION    RECALL
Dataset 1    98.1%        98.76%
Dataset 2    97.67%       98.54%
Dataset 3    98.34%       99.23%

The mean precision of the SVM classifier on the datasets is 98.03% and the mean recall is 98.84%. The SVM classifier, with its parameters, performs the optimization, and the training set is linearly separable.

13. CONCLUSION

This semantic annotation framework annotates the document with Dublin Core metadata elements and higher-level concepts. Because web page content changes frequently, there is no tight coupling between the annotation in the web page and the ontology. The correctly classified annotated documents, which are stored for future use, are the potential candidates for machine learning. The semantic relationship between the concepts needs to be drilled down further, and the association between the ontology and the document has to be made still tighter.


REFERENCES

[1] http://www.merriam-webster.com/dictionary/annotation

[2] Handschuh, S., Staab, S., Ciravegna, F.: S-CREAM Semi-automatic CREAtion of Metadata. The 13th Int. Conf. on Knowledge Engineering and Management (EKAW2002), ed. Gomez-Perez, A., Springer Verlag (2002)

[3] Chakrabarti, S.: Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufman Publishers (2003)

[4] Ciravegna, F., Chapman, S., Dingli, A., Wilks, Y.: Learning to Harvest Information for the Semantic Web. ESWS 2004, LNCS 3053. Springer-Verlag Berlin Heidelberg (2004) 312–326

[5] Baumgartner, R., Flesca, S., Gottlob, G.: Visual Web Information Extraction with Lixto. In: Proc. of the 27th Int. Conference on Very Large Data Bases. (2001) 119–128

[6] Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic Annotation, Indexing and Retrieval. In: Proceedings of the Second International Semantic Web Conference (ISWC'2003), Florida, USA (2003) 484–499

[7] Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., McCurley, K. S., Rajagopalan, S., Tomkins, A., Tomlin, J. A., Zien, J. Y.: A Case for Automated Large-Scale Semantic Annotation. Journal of Web Semantics, 1(1) (2003) 115–132

[8] Sanderson, R., Van De Sompel, H. (2011a). Open Annotation. Beta Data Model Guide. http://www.openannotation.org/spec

[9] Oren, Renaud Delbru, Knud Möller, Max Völkel, Siegfried Handschuh: Annotation and Navigation in Semantic Wikis. Proceedings of the Workshop on Semantic Wikis (SemWiki), in conjunction with the 3rd European Semantic Web Conference, 2006.

[10] Haase, K. (2004). Context for semantic metadata. Proceedings of the 12th ACM International Conference on Multimedia, New York, USA, 204, ACM Press.

[11] Gnana Chithra, C., Ramaraj, E.: Heuristic sentence boundary detection and classification. Paper selected for presentation in the First International Conference on Recent Innovations in Engineering and Technology 2016, and to be published in International Journal of Emerging Technologies-IJET (online ISSN: 2249-3255).

[12] Gnana Chithra, C., Ramaraj, E.: A Novel automatic approach for Extraction and classification of Noun Phrase collocates. In Editorial for International Journal of Computational Intelligence Research (IJCIR).

[13] CFI: Classification of Financial Instruments. http://www.anna-web.org

[14] Mike Bennett (2007). Financial securities and ontologies: An exploration. www.hypercube.co.uk/docs/ontologyexploration.doc

[15] Yang, Y.: An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1999, 1(1-2), 67–88

[16] Kohavi, R., and Provost, F. 1998. On Applied Research in Machine Learning. In Editorial for the Special Issue on Applications of Machine Learning and Knowledge Discovery Process, Columbia University, New York, volume 30.

[17] Houda El Bouhissi, Mimoun Malki and Djamila Berramdane, Applying Semantic Web Services. International Journal of Computer Engineering and Technology (IJCET), 4(2), 2013, pp. 108–113.

[18] Mangai P., Enhanced Web Image Re-Ranking Using Semantic Signatures. International Journal of Computer Engineering and Technology (IJCET), 7(2), 2016, pp. 24–29.