Semantic Annotation and Retrieval in E-Recruitment

By

Malik Nabeel Ahmed Awan

2011-NUST-DirPhD-IT-44

Supervisor

Dr. Sharifullah Khan (TI)

Department of Computing

A thesis submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Information Technology

In

School of Electrical Engineering and Computer Science,

National University of Sciences and Technology (NUST),

Islamabad, Pakistan.

(August 2019)

Abstract

E-recruitment processes prioritize matching between job descriptions and user queries to identify relevant candidates. Existing e-recruitment systems face challenges in extracting information from job descriptions due to the unstructured nature of the content and differences in text nomenclature for defining the same content. In particular, these systems are unable to effectively extract contextual entities, such as job requirements and job responsibilities, from job descriptions. They also fail to produce the desired search results effectively because of semantic differences between job descriptions and users’ English natural language queries. This thesis proposes a framework to address these challenges in existing e-recruitment systems.

The proposed Semantic Extraction, Enrichment and Transformation (SExEnT) framework extracts entities from job descriptions using a domain-specific dictionary. The extraction process first performs linguistic analysis and then extracts entities and compound words. After extracting entities and compound words, it builds the job context using a job description domain ontology. The ontology provides an underlying schema that defines how concepts are related to each other. Besides building contextual relationships among entities, the framework also enriches the entities using Linked Open Data (LOD), which improves the search capability for finding suitable jobs. In the proposed framework, the Web Ontology Language (OWL) is used to represent information in a machine-understandable form. The framework appropriately matches users’ queries and job descriptions.

The evaluation data set was collected from various job portals, such as Indeed, Personforce and DBWorld. A total of 860 jobs were collected that belong to multiple categories, such as technology, medical, management and others. The data set was vetted and verified by HR experts. The evaluation was performed using precision, recall, F-1 measure, accuracy and error rate. The proposed framework achieved an overall F-1 measure of 87.83% and an accuracy of 94% for entity extraction. The application has a precision of 99.9% in representing and retrieving job descriptions from its knowledge base. The job description ontology has an overall concept coverage of 96%. The evaluation results show that the proposed framework performs well in extracting, modelling, enriching and retrieving job descriptions against queries. At present, the proposed framework can neither automatically generate pattern/action rules, nor provide a complex ranked retrieval of job descriptions against a user profile, nor automatically extend the dictionary to increase extraction precision. In future, the framework can be extended to resolve these limitations.

Table of Contents

1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Research Challenges
1.4 Research Objectives
1.5 Overview of Proposed Solution
1.6 Thesis Organization

2 Literature Review
2.1 Information Extraction in E-Recruitment
2.1.1 Information Extraction
2.1.2 Data Extraction in E-Recruitment
2.2 Ontology Design
2.2.1 Ontology Design Frameworks
2.2.2 Domain Ontology
2.2.3 E-Recruitment Ontology
2.3 Information Enrichment
2.4 Natural Language Queries
2.5 Critical Analysis

3 SExEnT Framework
3.1 Introduction
3.2 Information Extraction and Enrichment
3.3 Job Description Domain Ontology
3.4 Job Query Transformation
3.5 Summary

4 SAJ Framework
4.1 Introduction
4.2 Segmentation
4.3 Dictionary
4.4 Linguistic Analysis and Extraction
4.4.1 Linguistic Analysis
4.4.2 Entities Extraction
4.5 Context Builder
4.6 Enrichment
4.6.1 Knowledge Base
4.7 Evaluation
4.7.1 Data-set Acquisition
4.7.2 Evaluation Metrics
4.7.3 Evaluation Results
4.7.4 Comparative Analysis
4.8 Summary

5 Job Description Ontology
5.1 Ontology Design Methodology
5.2 Ontology Expressiveness
5.3 Job Description Ontology
5.3.1 Identify Purpose
5.3.2 Build Ontology
5.3.3 Ontology Development and Documentation
5.4 Evaluation
5.4.1 Domain Coverage
5.4.2 Application based Evaluation
5.5 Summary

6 Sem-QA Framework
6.1 The Sem-QA Framework
6.2 Semantic Linguistic Analysis
6.3 Query Template Matching
6.4 Query Generation
6.5 Working Example
6.6 Evaluation and Results
6.6.1 Experimental Setup
6.6.2 Data Set Specification
6.6.3 Evaluation Results
6.6.4 System Performance for Semantic Association of Atomic FC
6.7 Summary

7 Conclusion
7.1 Research Description
7.2 Application Areas
7.3 Research Contributions
7.4 Limitations and Future Work

List of Figures

1.1 A sample job description with marked segments
1.2 A semantic difference between a requirement and a user query
3.1 Proposed SExEnT framework
4.1 High level block diagram for the proposed framework SAJ
4.2 Extracting context-aware requirement entity from a job description
4.3 Sample of educational requirement in a job description
4.4 Graph structure showing entities and connections in SAJ
4.5 Entities enrichment process using LOD in SAJ
4.6 N3 notation of a job description in knowledge-base
4.7 Evaluation of extraction comparison of accuracy vs error for SAJ
4.8 SAJ, Alchemy API and OpenCalais extraction comparison for job titles
4.9 SAJ and OpenCalais extraction comparison for requirements
4.10 Comparison of precision, recall and f1-measure with ground truth
5.1 Uschold and King’s enterprise methodology
5.2 Job description ontology
5.3 Job description basic entities as N3 notation
5.4 Job description requirements entity as N3 notation
5.5 Job description responsibilities as N3 notation
5.6 Job description education as N3 notation
5.7 Job description profile as N3 notation
5.8 A sample N3 representation of the job description ontology
5.9 A sample SPARQL query to retrieve job title labels after execution
6.1 A set of sample queries from Mooney data set
6.2 A sample query processing representation of Mooney data set
6.3 Time comparison between various Filter Constraints queries

List of Tables

2.1 Gap analysis for information extraction
2.2 Gap analysis for ontology design
2.3 Gap analysis for question answering
4.1 Sample rules for segmentation and extraction with description
4.2 Compound words identification and extraction rules
4.3 Sample text showing nomenclature variation in job description
4.4 Basic entities along with examples from a job description in SAJ
4.5 Sample rule for boundary detection for requirement using JAPE in SAJ
4.6 Sample rule for job requirement using JAPE in SAJ
4.7 Sample rule for detecting responsibilities from a job description in SAJ
4.8 A sample rule for education extraction from a job description in SAJ
4.9 Statistics of jobs collected from various e-recruitment systems
4.10 Statistics of job description in various job categories collected randomly
4.11 Results of entities extraction from job description in SAJ
5.1 DL basic expressive labels along with details
5.2 DL extension expressivity labels along with details
5.3 Important concepts in job description ontology
5.4 Job requirements properties and description
5.5 Job Responsibilities properties and description
5.6 Job position properties and descriptions
5.7 Domain coverage of job description ontology
5.8 Job description evaluation queries categorization
5.9 Job Description user retrieval summary
6.1 Examples of entities detected from natural language job queries
6.2 Query count for Mooney and Personforce data set
6.3 Query Categorization based on number of Filter Constraints
6.4 Comparative analysis of Mooney and Personforce data-set

Chapter 1

Introduction

This chapter provides an overview of the problem domain, identifies the critical research gaps and objectives, and outlines the dissertation.

1.1 Motivation

With recent advancements in technology, human reliance on the internet has significantly increased. Information is now mostly available and shared via the internet using sources such as websites, social media and web portals. This advancement in internet technology has also had an impact on recruiting potential employees for an organization. Automatic ways of exploring, joining and sharing information in Web 3.0 have improved the usability of web resources. This evolution of the web (Bogh, 2012) has had a direct impact on application design and content sharing over the web. Many years ago, during the Web 1.0 era, recruitment was just plain vanilla lateral (Bhagia, 2015). Organizations posted jobs via various channels, such as newspapers or digital print media, and then waited for job seekers to send their resumes via email or postal mail. Organizations mainly created in-house banks of resumes. The advancement from Web 1.0 to Web 2.0 also brought a shift in recruitment systems.

Figure 1.1: A sample job description with marked segments

Web 2.0 was the era of social media. The old-style static recruitment processes evolved into passive recruitment, enabled by platforms such as LinkedIn¹. Organizational recruiters now have a larger pool of information in which to search for the candidates they require. It is no longer only job seekers who look for jobs; organizational recruitment agents also look for knowledgeable candidates using Web 2.0. However, was it enough? Web 3.0 gave a new dimension to recruitment (Matthew Jeffery, 2011). It enabled systems to understand complex job descriptions and process search queries more effectively. Understanding job descriptions and search queries was only possible by understanding the context hidden in the contents (McConell, 2014).

¹ https://www.linkedin.com/


The job description in Fig 1.1 outlines the location, job title, requirements and

responsibilities. Features in a job description, such as location, job title, skills and

expertise level, are described as entities. These entities are contextually associated

with each other to yield contextual entities, such as the job requirements. The

employer’s primary emphasis in candidate filtering is on job requirements because

the requirements define a baseline for the selection of a potential candidate.

1.2 Problem Statement

Existing e-recruitment systems, such as Indeed², Monster³, Personforce⁴, Angel.co⁵, LinkedIn⁶, Career Builder⁷, Glassdoor⁸ and SimplyHired⁹, store information as raw text and apply keyword matching or faceted search to provide better search results to both organizations and users. The recruitment process starts with advertising a job description. The job description outlines the requirements for selecting a potential user/candidate. A typical job description may comprise location, job title, requirements and responsibilities. Some features in a job description, such as location, job title, skills and expertise level, are described as entities. These entities are contextually associated with each other to yield contextual entities, such as the job requirements.

² https://www.indeed.com
³ https://www.monster.com/
⁴ https://www.personforce.com/
⁵ https://angel.co/
⁶ https://www.linkedin.com
⁷ https://www.careerbuilder.com/
⁸ https://www.glassdoor.com/
⁹ https://www.simplyhired.com/

The contents of the job description and candidate profile thus hold crucial importance. However, the information provided in a job description or user profile poses challenges for extraction: the content is unstructured, there is no standard format for defining content, and there are text nomenclature differences for defining the same content. Recruitment processes prioritize matching/relevance between job descriptions and candidate queries to filter out irrelevant candidates or to find the best-fitting job for a candidate. Performing this matching process manually is time-consuming and challenging¹⁰.

¹⁰ https://ckscience.co.uk/is-your-recruitment-process-costing-you-time-money-and-good-candidates

Figure 1.2: A semantic difference between a requirement and a user query

In e-recruitment systems, this process is carried out automatically. However, it is not merely a matter of matching text, because there can be semantic heterogeneities in the texts. For example, Fig 1.2 illustrates the text of a job description and a user query. They do not match lexically; however, they match semantically, because Android development is a type of mobile application development. The matching process is complex and needs to understand the context (i.e. domain-specific information) of the text to resolve semantic heterogeneity in matchmaking.
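The difference between lexical and semantic matching can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `BROADER` dictionary is a hypothetical, hand-written stand-in for the "is a type of" links that a domain ontology or LOD source would supply, and the function names are assumptions made for illustration.

```python
# Hypothetical "is a type of" links; a real system would read these from a
# domain ontology or Linked Open Data rather than from a hand-written dict.
BROADER = {
    "android development": "mobile application development",
    "ios development": "mobile application development",
    "mobile application development": "software development",
}


def normalize(term):
    """Lower-case the term and collapse repeated whitespace."""
    return " ".join(term.lower().split())


def ancestors(concept):
    """Return the concept together with every broader concept above it."""
    seen = set()
    current = normalize(concept)
    while current and current not in seen:
        seen.add(current)
        current = BROADER.get(current)
    return seen


def lexical_match(requirement, query):
    """Plain text comparison, as in keyword matching."""
    return normalize(requirement) == normalize(query)


def semantic_match(requirement, query):
    """True when one concept equals or specializes the other."""
    req, qry = normalize(requirement), normalize(query)
    return req in ancestors(qry) or qry in ancestors(req)


print(lexical_match("Mobile application development", "Android Development"))
print(semantic_match("Mobile application development", "Android Development"))
```

With these assumed links, the requirement and the query differ as strings but are related through the hierarchy, while two sibling concepts such as Android and iOS development are not treated as a match.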

Existing e-recruitment systems (Owoseni et al., 2017; Valle et al., 2007) do not extract domain-specific information, such as ‘mobile application’ in the above example, from the job requirement text to match it with the candidate query of Android Development. These domain-specific entities are contextually associated with each other. Existing systems extract entities from text independently of each other, without considering the context associated with the entities. For example, in the job requirement ‘Strong fundamentals (OO, algorithm, data structure)’, existing systems extract strong as the expertise level and OO, algorithm, data structure as skills, without capturing that OO, algorithm and data structure each require strong expertise. These entities combine to generate a contextual entity, i.e., a job requirement. Another drawback of existing systems is the limited availability of information (Silvello et al., 2017; Candela et al., 2017) in the knowledge-base for enrichment, which relies either on in-house data or on an external source that is static with respect to data growth. On the other hand, Linked Open Data (LOD) (lod, 2017) does not suffer from data staleness, and its data can expand over time. Multiple sources actively contribute to it, such as Wikipedia¹¹, Getty¹² and GeoNames¹³. Existing approaches (Shin et al., 2015; Gregory et al., 2011) do not properly implement LOD principles. Besides extracting, enriching and building information context, retrieval of information is another important aspect. System users are not domain experts who can query machine-understandable data in its required format, i.e., the SPARQL query language. A translation mechanism is required to translate users’ Natural Language Queries (NLQ) into the machine-understandable format, i.e., the SPARQL query language.

¹¹ https://www.wikipedia.org/
¹² http://www.getty.edu/research/tools/vocabularies/lod/
¹³ https://lod-cloud.net/dataset/geonames-semantic-web
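The idea of a contextual entity in the ‘Strong fundamentals’ example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EXPERTISE_LEVELS` vocabulary and the regular expression are hypothetical, whereas the thesis relies on a domain dictionary and extraction rules that are described in later chapters.

```python
import re

# Hypothetical expertise-level vocabulary (assumption for this sketch).
EXPERTISE_LEVELS = {"strong", "basic", "good", "expert"}

# A leading level word, optional filler text, then a parenthesized skill list.
# The closing parenthesis is optional because job texts often omit it.
REQUIREMENT = re.compile(r"^\s*(?P<level>\w+)\s+[^()]*\((?P<skills>[^()]*)\)?")


def contextual_requirement(text):
    """Attach the expertise level to every skill listed in the requirement."""
    match = REQUIREMENT.match(text)
    if not match:
        return []
    level = match.group("level").lower()
    if level not in EXPERTISE_LEVELS:
        return []
    skills = [s.strip() for s in match.group("skills").split(",") if s.strip()]
    return [{"skill": skill, "expertise": level} for skill in skills]


for entity in contextual_requirement("Strong fundamentals (OO, algorithm, data structure)"):
    print(entity)
```

Each skill is returned together with its expertise level, instead of the level and the skills being emitted as unrelated entities.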


The main problems summarized from the above discussion are:

1. Information loss in the extraction of domain-specific e-recruitment entities from job descriptions due to the lack of their context and of inter- and intra-document linkages

2. Information loss due to the absence of a comprehensive domain ontology in e-recruitment for building relationships among entities extracted from job descriptions

3. Usage of external static sources for data enrichment (expansion), resulting in data staleness

4. Deficiency in translating natural language queries into a machine-understandable format due to the difficulty of identifying entities and their context

1.3 Research Challenges

The extraction of contextually associated entities for filtering job candidates poses major challenges:

1. Unstructured content: The text written in a job description is unstructured, such as ‘Previous working experience as a Java Developer for minimum 2 years’. This example includes structured information, such as skill: Java, job title: Java Developer and experience: 2 years, but it is expressed in an unstructured way.

2. Nomenclature differences: The text represents the same information in different ways, such as ‘Requires 2+ years of java experience’ and ‘Java experience of 2+ years required’. The job requirement in both cases is 2+ years of experience, but the representation is different.

3. Semantic heterogeneities: The text represents the same information using semantically related terms, such as Android Developer and Mobile Developer. In this example, an Android Developer is a type of Mobile Developer, but the two are written with different terms.

4. Contextual entities: Extraction of contextually associated entities is itself a major challenge. For example, ‘Expertise in Spring boot’ is a contextual entity, but its extraction requires an in-depth understanding of the content.
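The nomenclature differences in challenge 2 can be made concrete with a small Python sketch that maps both phrasings onto the same structured requirement. The two regular expressions are hypothetical patterns written for these two sentences only; they are not the extraction rules used in the thesis.

```python
import re

# Two hypothetical surface patterns for the same underlying requirement.
PATTERNS = [
    # e.g. "Requires 2+ years of java experience"
    re.compile(
        r"(?P<years>\d+)\+?\s+years?\s+of\s+(?P<skill>[\w#+.]+)\s+experience",
        re.IGNORECASE,
    ),
    # e.g. "Java experience of 2+ years required"
    re.compile(
        r"(?P<skill>[\w#+.]+)\s+experience\s+of\s+(?P<years>\d+)\+?\s+years?",
        re.IGNORECASE,
    ),
]


def extract_experience(text):
    """Return a normalized {skill, min_years} record, or None if no rule fires."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return {
                "skill": match.group("skill").lower(),
                "min_years": int(match.group("years")),
            }
    return None


print(extract_experience("Requires 2+ years of java experience"))
print(extract_experience("Java experience of 2+ years required"))
```

Both sentences yield the same record, which shows why every additional phrasing needs its own rule in a purely pattern-based approach.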

1.4 Research Objectives

The main research objectives are:

1. To design an extraction and transformation methodology for the identification of entities and compound words from job descriptions in order to build information context

2. To design a comprehensive web ontology for the representation of e-recruitment job descriptions in order to resolve semantic heterogeneities

3. To design an enrichment methodology for enriching entities and compound words from Linked Open Data in order to address data staleness in e-recruitment

4. To design a methodology for translating natural language queries into a machine-understandable format, i.e., SPARQL queries


1.5 Overview of Proposed Solution

The proposed solution extracts, enriches, retrieves and transforms the natural language content (Gupta, 2016) of e-recruitment job descriptions and NLQ into a machine-understandable format. The transformation improves searching in e-recruitment systems by considering the context when matching for relevance. It transforms the raw text into a machine-understandable format (Graupner et al., 2017) using the job description domain ontology (Ahmed et al., 2016). The extracted entities are also enriched using Linked Open Data, which is a non-stale data source. The enrichment of job description entities increases the accuracy and precision of the retrieval process. For example, Java micro-service has connected concepts, such as Eureka, Ribbon and Feign; connecting them via LOD enhances the exploration capabilities of the knowledge-base. The system also allows a user to input complex queries in plain English. This gives users a way to express their requirements more efficiently and effectively. The system transforms the plain English query into SPARQL, which is a machine-understandable format, and retrieves accurate and precise matching results.
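As a rough illustration of translating a plain English query into SPARQL, the following Python sketch fills a fixed query template with entities found by dictionary lookup. Everything specific in it is an assumption: the `jd:` namespace, the class and property names, and the two small dictionaries are hypothetical and are not taken from the job description ontology of the thesis.

```python
import re

# Hypothetical namespace, class and property names (not taken from the thesis).
PREFIX = "PREFIX jd: <http://example.org/job-description#>"

# Hypothetical dictionaries used to recognize entities in the query.
SKILLS = {"java", "python", "android"}
LOCATIONS = {"islamabad", "lahore", "london"}


def detect_entities(query):
    """Look up each lower-cased token in the small dictionaries above."""
    tokens = re.findall(r"[a-z+#]+", query.lower())
    return {
        "skills": [t for t in tokens if t in SKILLS],
        "locations": [t for t in tokens if t in LOCATIONS],
    }


def to_sparql(query):
    """Fill a fixed SELECT template with one triple pattern per entity."""
    entities = detect_entities(query)
    patterns = ["?job a jd:JobDescription ."]
    patterns += [f'?job jd:requiresSkill "{s}" .' for s in entities["skills"]]
    patterns += [f'?job jd:location "{loc}" .' for loc in entities["locations"]]
    body = "\n".join("  " + p for p in patterns)
    return f"{PREFIX}\nSELECT ?job WHERE {{\n{body}\n}}"


print(to_sparql("Find Java jobs in Islamabad"))
```

The sketch only shows the shape of the translation step; handling synonyms, compound words and filter constraints would need the ontology and linguistic analysis discussed in the later chapters.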

Each component of the framework has been evaluated separately, i.e., extraction and enrichment, the job description domain ontology and the NLQ transformation process. The evaluation metrics are precision, recall, F1-measure and accuracy. The data-set consists of 860 jobs captured from various job portals, such as Indeed, Personforce and DBWorld. The data-set was evaluated by domain experts to build an evaluation ground truth. Overall, the framework achieved comparatively good results. The extraction and enrichment component achieved its highest F1-measure, 95%, for job title, whereas education has the highest accuracy, 99.9%, due to fewer variations in nomenclature. In the application-driven evaluation, the job description ontology model was able to retrieve all relevant jobs. The transformation process was evaluated on two different data-sets, i.e., the Mooney data-set¹⁴ containing 620 job queries and the Personforce data-set¹⁵ containing 500 job queries. The transformation process achieved an F1-measure of 99.9%.
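For reference, the evaluation metrics named above follow their standard definitions, which can be written as a short Python sketch. The function name and the example counts are illustrative assumptions and do not reproduce any figure reported in the thesis.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Standard metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
    }


# Made-up counts, used only to show the calculation.
print(evaluation_metrics(tp=8, fp=2, fn=2, tn=8))
```

Precision penalizes wrongly extracted entities, recall penalizes missed ones, and the F1-measure is their harmonic mean; the error rate is simply the complement of accuracy.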

1.6 Thesis Organization

The thesis is organized as follows:

1. Chapter 2 discusses the existing work in the domain. A research gap analysis is also carried out in this chapter in order to identify the research objectives.

2. Chapter 3 acts as a bridge for the overall discussion of the SExEnT framework. The chapter briefly discusses all the components of the framework and builds a logical bridge among them. The subsequent chapters then discuss each component in detail, along with its evaluation.

3. Chapter 4 provides in-depth details and an evaluation of the e-recruitment job description extraction, context building and enrichment framework. The chapter discusses in detail the algorithms used for extraction, context building and enrichment of job description text.

4. Chapter 5 provides in-depth details and a discussion of the evaluation of the job description ontology. The chapter discusses in detail the design methodology and rationale behind the design of the job description ontology. It also provides a comparison against other similar schemas.

5. Chapter 6 provides in-depth details on how the transformation takes place and evaluates the job description query transformation process. It discusses why the transformation is critical and then presents it in depth.

6. Chapter 7 concludes the discussion. It summarizes the overall content of the thesis, providing critical information from each chapter. It also provides future directions for the work.

¹⁴ https://www.ifi.uzh.ch/en/ddis/research/talking/OWL-Test-Data.html
¹⁵ http://personforce.com/

Chapter 2

Literature Review

The primary objective of the current chapter is to review and discuss various techniques, approaches, solutions and methodologies that contribute to the development of an e-recruitment solution. The discussion starts with a review of existing extraction methodologies and techniques. The analysis of extraction techniques is followed by a discussion on enrichment, domain ontologies and retrieval of information.

The organization of the current chapter is as follows: Section 2.1 discusses various extraction techniques and methodologies, both in general and specific to e-recruitment. Section 2.2 discusses ontology designs in various domains along with e-recruitment-specific systems. Section 2.3 discusses enrichment in existing systems, in general and specific to e-recruitment. Section 2.4 discusses content retrieval, in general and specific to e-recruitment. Section 2.5 concludes the discussion in the chapter.

2.1 Information Extraction in E-Recruitment

This section first discusses existing work in the domain of information extraction and then reviews the extraction techniques proposed specifically for e-recruitment. The discussion covers conventional extraction techniques as well as ontology-based extraction techniques, for a better understanding of the domain.

2.1.1 Information Extraction

Information extraction is a technique for the identification of entities, mentions and relations in unstructured text (Karkaletsis et al., 2011; Jayram et al., 2006). The extraction techniques include (a) wrapper-based techniques, (b) rule and pattern-based techniques, (c) machine learning based techniques, and (d) ontology-based techniques.

2.1.1.1 Wrapper based techniques

The wrapper-based technique is a fundamental technique designed to extract data from web pages using the Document Object Model (DOM). WebQL (Arocena and Mendelzon, 1999) uses a wrapper-based technique to extract meaningful information from web pages. Manually written wrappers proved inefficient because a change in the DOM structure of a web page requires a complete rewrite of the wrapper. Mingsheng et al. (Mingsheng et al., 2012) proposed a technique to reduce this manual effort. The technique first learns the DOM structure and then extracts meaningful information. All of these techniques focus on web pages; they are not able to extract content from other document formats, such as MS Word and PDF (Flesca et al., 2011). The rule and pattern-based (Jayram et al., 2006), machine learning (Bijalwan et al., 2014) and ontology-based (Vicient et al., 2011) techniques were then used to cater to other formats.
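The fragility of a hand-written wrapper can be illustrated with a minimal Python sketch based on the standard library's `html.parser`. The tag name, the `job-title` class and the two sample pages are hypothetical; the sketch does not reproduce WebQL or any of the cited systems.

```python
from html.parser import HTMLParser


class JobTitleWrapper(HTMLParser):
    """Hand-written wrapper: collects text inside <h1 class="job-title">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # The wrapper is tied to one specific tag and class name.
        if tag == "h1" and dict(attrs).get("class") == "job-title":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    wrapper = JobTitleWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.titles


# Hypothetical pages: the second one shows the same job after a redesign.
OLD_PAGE = '<div><h1 class="job-title">Java Developer</h1></div>'
NEW_PAGE = '<div><span class="title">Java Developer</span></div>'

print(extract_titles(OLD_PAGE))
print(extract_titles(NEW_PAGE))
```

The wrapper finds the title on the first page and nothing on the redesigned one, even though the visible content is identical, which is why a change in page structure forces the wrapper to be rewritten.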

2.1.1.2 Rule and pattern-based techniques

Rule and pattern-based techniques identify hidden features in the text by utilizing predefined rules or known patterns (Jayram et al., 2006). For example, extraction of a person’s phone number may require the occurrence of phrases such as ‘at’, ‘can be reached at’ and similar pattern phrases. The absence of these phrases results in the non-extraction of the person’s phone number. Rule-based techniques have been applied in multiple domains, such as aspect extraction from product reviews by exploiting practical knowledge and sentence dependency trees (Poria et al., 2014), relation extraction using background knowledge (Rocktaschel et al., 2015), extraction of patients’ clinical data from medical texts (Mykowiecka et al., 2009) and numerous others. These methods may process unstructured text multiple times to obtain meaningful information. Besides this, rule-based techniques have been applied to the extraction of compound entities in the bio-medical domain (Ramakrishnan et al., 2008) using the BioInfer and GENIA corpora. The drawback of compound word extraction using the technique of Cartic Ramakrishnan et al. (Ramakrishnan et al., 2008) is that any concept missing from the BioInfer and GENIA corpora will not be recognized as a compound word. These methods cannot identify any new rule or pattern that is not already defined. Machine learning based techniques can deal with such problems.
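The phone number example can be sketched as a single cue-phrase rule in Python. The regular expression below is a hypothetical rule written for illustration, not one taken from the cited work; it shows how extraction succeeds only when a predefined cue phrase precedes the number.

```python
import re

# Hypothetical rule: a phone number is extracted only after a cue phrase.
PHONE_RULE = re.compile(
    r"\b(?:can be reached at|at)\s+(?P<phone>\+?\d[\d\s-]{6,}\d)",
    re.IGNORECASE,
)


def extract_phone(text):
    """Return the phone number that follows a cue phrase, or None."""
    match = PHONE_RULE.search(text)
    return match.group("phone") if match else None


print(extract_phone("John can be reached at 555-123-4567."))
print(extract_phone("His number is 555-123-4567."))
```

The second sentence contains the same number but no cue phrase, so the rule does not fire. Covering it would require a new rule to be written by hand, which is the limitation noted above.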

2.1.1.3 Machine learning based techniques

Machine learning based techniques help in extracting existing and new information from unstructured texts. They require large data-sets for training and evaluation purposes. These techniques mostly focus on text categorization and classification problems. Existing techniques, such as Term Model Graph (Wang et al., 2005), kNN (Bijalwan et al., 2014), Naive Bayes (Tang et al., 2016) and Support Vector Machines (Guenther et al., 2016), can be applied for classification and categorization. These techniques can also be applied to scenarios such as sentiment analysis (Gautam and Yadav, 2014) on Twitter or micro-blog data (Bontcheva et al., 2013), or even to extracting data from bio-medical domain texts or e-recruitment data. However, they fail to link information together with a context and require large amounts of training data. The lack of context and insufficient training data may result in information loss (Gutierrez et al., 2016). Ontology-based techniques can fill this gap (Gutierrez et al., 2016).

2.1.1.4 Ontology based techniques

The ontology based technique cover this gap in information extraction, and mainly

use domain-specific knowledge for extracting meaningful information from unstruc-

tured text (Kiryakov et al., 2004; Maree et al., 2018). Some well known existing

systems that use domain ontology are KIM (Popov et al., 2003) and TextPresso

(Muller et al., 2004). These systems only use information present in domain ontol-

ogy to facilitate entity extraction. TextPresso mainly focuses on entity extraction

in the bio-medical domain. It uses Gene Ontology (GO) during extraction that

comprises approximately 80% of the lexicon. Any information that is not modeled in

the ontology is lost during extraction. This limitation was addressed by the technique

proposed by Vicient et al. (Vicient et al., 2011). According to their technique, the

newly extracted knowledge is merged with existing domain knowledge resulting in

enhanced domain knowledge. Subsequently, ontology-based information extraction has

also been supported with fuzzy ontology (Ali et al., 2015) and applied to extracting

travelers' reviews about hotels, building and designing business intelligence systems

for gathering company intelligence and country/region information (Saggion et al.,

2007), building and designing systems to extract information from clinical docu-

ments such as admission reports, radiology findings and discharge letters (Geibel

et al., 2015), and a framework for retrieval of images from web data (Vijayarajan

et al., 2016).

The technique proposed by Vicient et al. (Vicient et al., 2011) has the limitation

of using only the English lexical database WordNet 1 as an external source for

enhancing information. The enhancement is therefore limited to WordNet, which suffers

from a data staleness issue. This issue has been addressed in other studies

(Al-Yahya et al., 2014; Vicient et al., 2013; Nabeel Ahmed, 2008),

which update the domain ontology independently of WordNet, thus increasing

the extraction precision. They use a pattern-based approach and the ontology

for extracting new concepts that are not modeled in the domain ontology, thus

enriching the ontology.

2.1.2 Data Extraction in E-Recruitment

E-recruitment has gained much popularity over time. It is currently

one of the most widely used ways to recruit talent for organizations. The rapid

growth of the Internet has paved the way for many online Human Resource (HR) job

systems, such as Indeed 2, Monster 3, Personforce and hundreds of others.

PROSPECT (Singh et al., 2010) is a domain dependent research initiative to

extract data from e-recruitment content. The main aim of the PROSPECT system

is to screen job candidates. It builds facets on resumes and then screens candidates

1 https://wordnet.princeton.edu/

2 https://www.indeed.com

3 https://www.monster.com/


on the basis of these facets. The job posting created by PROSPECT includes role,

job category, skills, skill experience, location, number of positions, and total

experience. The PROSPECT system does not define or follow any ontology model to

represent this information; rather, a job is posted using existing job posting

channels. Resumes are collected through the same channel on which the job

is advertised and are then processed for screening.

SCREENER (Sen et al., 2012) is also a domain dependent research initiative

to facilitate the e-recruitment process by extracting information only from

resumes. It identifies text segments that are likely to contain a specific set

of information that includes skills, education, experience, and other related

information. The extracted information is then indexed using Lucene 4 for searching

and ranking all applicants for a given job opening. The authors (Sen et al., 2012)

claim that this automated process makes the screening task more straightforward

and more efficient.

JobOlize (Buttinger et al., 2008) and WoLMIS (Boselli et al., 2018) extract

structured information from unstructured job documents. JobOlize utilizes a hybrid

approach combining existing Natural Language Processing (NLP) techniques

with a new form of context-driven extraction techniques for extracting layout,

structure and content information of a job description. The aim of the WoLMIS system is to

collect and classify multilingual Web job descriptions with respect to a standard

taxonomy of occupations.

The existing information extraction approaches are unable to extract domain-specific

e-recruitment entities from job descriptions due to the unavailability of their

context and of inter- and intra-document linkages.

4 http://lucene.apache.org/


2.2 Ontology Design

The current section discusses the work done in the area of ontology design and

development. Ontology design and development is generally carried out by a team

of people such as domain experts, ontology engineers, and pedagogues. The main

motivation behind ontology design and development, as mentioned by (T.R.Grubber,

1995), is to share a common understanding of knowledge and information structure

among people or applications. It also enables reuse of domain knowledge, which is

a significant driver of the current increase in ontology research, design

and development. Ontology design frameworks define steps and guide domain

experts, ontology engineers, and pedagogues for building better models.

The current section reviews work in ontology design and development from three

aspects: (1) existing design frameworks, (2) ontologies from various

domains, and (3) ontology design for e-recruitment.

2.2.1 Ontology Design Frameworks

Various ontology design frameworks exist, including (1) Cyc, (2) TOVE, (3)

Uschold and King's, and (4) Methontology. A review of these four well-known

frameworks is presented here.

Cyc (Elkan and Greiner, 2006) is an ontology as well as an ontology engineering

methodology. The main intent behind this methodology is to make common

sense knowledge accessible and processable for computers. Key steps to develop

an ontology based on the Cyc methodology are:

1. Manual identification of common sense knowledge

2. Computer-assisted extraction of common sense knowledge


3. Computer managed extraction of common sense knowledge

TOVE (TOronto Virtual Enterprise) ontology (Gruninger and Fox, 1995) pro-

posed another ontology engineering methodology. This methodology makes use of

story-driven cases which provide informal semantics for objects and relations.

Uschold and King's Enterprise methodology (Uschold and King, 1995; Uschold

and Gruninger, 1996) provides guidelines for building ontologies based on an

enterprise modeling approach. This approach was developed as part of the Enterprise

project by AIAI at the University of Edinburgh in collaboration with IBM, Lloyd's

Register, Logica UK Limited and Unilever (Uschold and King, 1995; Uschold and

Gruninger, 1996).

The methodologies above provide guidelines for ontology design, whereas Methontology

(Gomez-Perez, 1996; Gomez-Perez, 1999) also addresses ontology maintenance.

This framework facilitates construction of an ontology in a systematic way

and is compatible with software development processes and knowledge engineering

methodologies such as RUP (Rational Unified Process) (Shahid et al., 2009). The

life-cycle of this methodology is based on evolving prototypes.

Various domain ontologies have been developed based on these frameworks;

they are discussed next.

2.2.2 Domain Ontology

Ontology development has influenced various domains. Education is one such

domain, with multiple ontologies that include a course ontology (Ameen et al.,

2012), a university ontology (Malik et al., 2010a), and others. These ontologies

comprehensively model their respective domains. Cultural heritage is another domain

that has been described using an ontology (Pattuelli, 2011). The

paper describes the core concepts of cultural heritage and their interconnections.

Domain experts evaluated the coverage of the ontology and its usage in cultural heritage

applications.

Another significant work related to ontology construction is in botany, i.e.,

plants. The Plant Ontology (Li et al., 2016) provides comprehensive details

related to plants. It defines concepts and relationships for classification,

construction, growth details, multiple names, flora details and possible usage of

plants, mainly in the medical domain.

2.2.3 E-Recruitment Ontology

Multiple initiatives and research works have been carried out for defining a schema

for human resource management. One such project is the SEEMP Project (Gomez-Perez

et al., 2007) by the European Union (EU). Under this project, existing human

resource management standards are reused to build a common language called the

Reference Ontology. The Reference Ontology includes a compensation ontology, economic

activity ontology, occupation ontology, education ontology, skill ontology, job

offer ontology, and some other ontologies. These modular ontologies are combined

to form a comprehensive human resource ontology. One major problem with this

model is that it is mainly influenced by HR-XML, so it inherits any shortcoming

of HR-XML. The job offer model is not comprehensive enough to handle the job posting

domain as a whole. Requirements are kept as raw plain text (Gomez-Perez et al., 2007);

entities such as skills, experience, and expertise level are not identified.


2.3 Information Enrichment

This section discusses existing work carried out in the domain of information en-

richment, i.e., a process of enhancing, refining or improving existing data. The

enrichment process increases the data value (Weichselbraun et al., 2014). Work

has been carried out in various domains, such as cultural heritage, scientific

publications, question answering, and others to enhance, refine and improve existing

data by adding knowledge from external sources. The process is

gaining much popularity with time (Silvello et al., 2017). Significant work has been

carried out on scientific experimental evaluation data (Silvello et al., 2017). Global

large scale campaigns produce a large quantity of scientific and experimental data.

This data is a fundamental pillar of the scientific and technological advancement of

information retrieval (Silvello et al., 2017). The system proposed by Silvello et al.

semantically annotates and interlinks the data. The data is shared using the Linked

Open Data (LOD) cloud. The interconnections of data in LOD provide a means of data

enrichment, i.e., the depth and breadth of the LOD graph increase. Another work was carried out

by Candela et al. (Candela et al., 2017) to enrich data from the cultural heritage domain.

Cultural heritage institutions are now progressing towards sharing their knowledge

via LOD, since sharing data using LOD increases its value. Their system connects

the Biblioteca Virtual Miguel de Cervantes records (200,00) to other data sources

on the web. Currently, the focus is on enriching location and date information

with additional knowledge. The system uses the GeoNames API to link

the data.

The existing work in enrichment mostly uses static dictionaries to enrich

information. Cultural heritage data currently uses LOD to link artefact information

with information available in LOD. No work has been carried out to enrich e-recruitment

entities/concepts from LOD, which would increase their value manyfold.


2.4 Natural Language Queries

The remarkable advancement of web applications is attracting more and more

users. Along with trained and technical web users, the number of novice users is also

increasing at a high pace. Making web data explorable for everyone in a non-technical

way therefore becomes essential. According to (Copestake and Jones, 1990),

the most flexible and convenient method of communication with software is Natural

Language (NL). Although NL based systems are the most intuitive way of user

communication, they are much more challenging to implement than was

expected in the past (Copestake and Jones, 1990). The central problem with

NL based Question Answering (QA) systems is the identification of user intent

by disambiguation of the concepts and their mutual relationships in a particular

domain.

This section discusses the work done on Natural Language Queries or Interfaces

(NLI). The basic purpose of an NLI is to make querying data easy for users. NLIs generally

fall under two categories: open domain NLI and restricted domain NLI.

Open domain NLI answers questions using general ontologies and information

available on the web (Strzalkowski and Harabagiu, 2006). Since it targets general

web resources, the reliability of its answers may be doubtful, as the information

could be outdated, conflicting or wrong. In terms of information reliability, restricted

(closed) domain QA systems win over open domain ones, as their information resources are

comparatively smaller and more specific (Frank et al., 2007). Some of the recent

works on NLIs include QACID (Oscar et al., 2009), PANTO (Wang et al., 2007),

ORAKEL (Philipp et al., 2008) and AquaLog (Lopez et al., 2005).

QACID (Oscar et al., 2009) is designed for the cinema domain. It stores question

patterns in a query formulation database and statically binds a SPARQL query

to each pattern, while an entailment engine finds a match for an

input query within the query formulation database.

AquaLog (Lopez et al., 2005) and PANTO (Wang et al., 2007) both initially

generate query triples and then convert them into onto-triples. Both

techniques use ontologies as their knowledge base. AquaLog

supports scalability, learning through user interaction and portability better

than PANTO, but supports only 23 question categories. On the other hand,

PANTO supports more questions, but it lacks the other features of AquaLog. PRECISE

(Popescu et al., 2003) maps NL questions to SQL queries and has shown

100% precision for semantically tractable questions. Answering only semantically

tractable questions leaves the intractable questions unanswered.

ORAKEL (Philipp et al., 2008) is another ontological QA system. It supports

factoid questions, but it requires a lexicon engineer for mapping query lexicon to

ontological relations. Some NLI based QA systems including (Bernstein and Kauf-

mann, 2006), (Funk et al., 2007), and (Popescu et al., 2003) work with controlled

language, to overcome the ambiguity and vagueness of natural language questions.

Since a controlled natural language is a subset of the representative language

understandable to the system (Fuchs et al., 2006), an end user is required

to learn it and be trained enough to express all types of questions using its

supported constructs. QA systems based on machine learning models include (Zelle and

Mooney, 1996) and (Thompson et al., 1999).

All approaches discussed so far exhibit one or more of the following weaknesses:

(1) they require a considerable amount of domain-specific training data to

achieve high accuracy, (2) they need better interpretation of semantics in the context

of the question domain, (3) they do not provide complete or near-complete coverage of

questions, (4) they are not capable of handling a large amount of data, (5) they

involve end-user effort, and (6) they are not flexible enough to adapt to unrecognized question

patterns and new information while maintaining system accuracy.

2.5 Critical Analysis

Table 2.1, Table 2.2 and Table 2.3 show the gap analysis for the three broad domains

that together encapsulate our work.

Table 2.1: Gap analysis for information extraction

| Feature / Systems | Prospect (2010) | Screener (2012) | JobOlize (2008) | KIM (2003-to-date) | TextPresso (2004) | OpenCalais (2017-to-date) | AlchemyAPI (2005-to-date) |
|---|---|---|---|---|---|---|---|
| Contextual Entities | - | - | - | - | - | Partial | - |
| Basic Entities | YES | YES | YES | YES | YES | Partial | Partial |
| Context Building | - | - | - | - | - | - | - |
| Inter and Intra document linkages | - | - | - | YES | YES | - | - |
| Enrichment | - | - | - | - | - | YES | YES |
| RDF Representation | - | - | - | YES | YES | - | - |

Table 2.2: Gap analysis for ontology design

| Feature / Systems | HR-XML (1999-to-date) | Prospect (2010) | Schema.org (2011-to-date) | Indeed (2004-to-date) |
|---|---|---|---|---|
| Hierarchical Relationship | Partial | - | - | - |
| Associative Relationship | - | - | - | - |
| Expressiveness | Low | Low | Medium | Low |


Table 2.3: Gap analysis for question answering

| Feature / Systems | QACID (2009) | PANTO (2007) | ORAKEL (2008) | AQUALOG (2005) |
|---|---|---|---|---|
| Domain Specific Training | YES | YES | YES | YES |
| Complete Coverage of Question | YES | - | - | - |
| End-user Efforts | - | - | YES | - |
| Adapt to new questions | - | YES | - | - |
| Handle Large Data | - | YES | - | YES |

Summarizing the contributions of other researchers, the following research gaps exist based

on Table 2.1, Table 2.2 and Table 2.3:

1. Information loss in the extraction of domain-specific e-recruitment entities

from job descriptions due to the unavailability of their context and of inter- and

intra-document linkages

2. Information loss due to the absence of a comprehensive domain ontology

in e-recruitment for building relationships among entities extracted from job

descriptions

3. Usage of static sources for data enrichment (expansion and hurdles in matching)

resulting in data staleness

4. Deficiency in translating natural language queries into a form executable on

machine-understandable data due to inappropriate identification of entities and their context

Chapter 3

SExEnT Framework

This chapter serves as a bridge that introduces the proposed Semantic Extraction,

Enrichment and Transformation (SExEnT) framework for retrieval of

information in the e-recruitment domain. It presents the logical divisions of

the proposed framework together for clear understanding.

3.1 Introduction

The proposed framework aims to provide a comprehensive solution to issues

related to information extraction, contextual representation of the extracted

information, enrichment of the extracted information, domain representation in a

machine-understandable format, and the use of natural language queries (NLQ) for

retrieving the information.

The proposed framework has three logical divisions: (1) extraction

and enrichment, (2) job description ontology, and (3) jobs query transformation.

Fig 3.1 shows a pictorial representation of the framework.


Figure 3.1: Proposed SExEnT framework

The purpose of information extraction and enrichment is to extract, enrich

and build context from the plain text of job descriptions. The extraction and

context building make use of the job description ontology (Ahmed et al., 2016).

The job description domain ontology provides relationships among the entities. Defining

the relations among the entities results in building context. The

knowledge-base stores the contextual e-recruitment information. Users can query

the knowledge-base using a Natural Language Query (NLQ). The user query is transformed

into a machine-understandable format, i.e., SPARQL 1, to fetch the desired

results.

Section 3.2 presents an overview of extraction and enrichment, Section

3.3 presents an overview of the job description ontology and Section 3.4 presents

an overview of the job search.

1 https://www.w3.org/TR/rdf-sparql-query/


3.2 Information Extraction and Enrichment

The extraction and enrichment process starts with the job description submitted

as raw plain English text. In the first step, the text is classified into segments,

such as job title, job requirements, job responsibilities, and others. A rule-based

dictionary contains rules for segmentation as well as entity extraction. Once the

text has been segmented using the rules from the dictionary, the next step is entity

extraction. The entity extraction process is preceded by sentence splitting and

POS tagging using the Penn TreeBank tag set 2. After the text is segmented and sentences

and POS tags are marked, entities are extracted. Entity

extraction is an essential and critical process.

During entity extraction, generic information, such as places, organizations, money,

email addresses and others, is extracted. Besides these generic entities, domain-specific

entities are also extracted, such as job title, expertise level, career level,

skills, job requirements, job responsibilities and others. The dictionary helps in

the extraction of domain-specific entities. The dictionary contains pattern/action

rules developed using features such as POS tags, word dictionaries, and simple

pattern/action rules. The rules also incorporate priorities for determining the

execution order. The execution order affects the input/output of subsequent

pattern/action rules. After entity identification, the next step is to connect

entities for building context. The hierarchical and associative relationships among

the entities of a job description define its context. The job description ontology plays a vital

role in building context among the extracted entities. The details of extraction,

enrichment and context building are presented in Chapter 4.

2 http://www.anc.org/oanc/penn.html


3.3 Job Description Domain Ontology

The job description ontology is represented using Web Ontology Language (OWL)

as domain knowledge. The job description ontology provides semantics and structure

for representing a job description in a machine-understandable format. The core

schema classes are Job Description, Job Title, Requirements, Responsibilities, Expertise

Level, Education, Career Level, and Job Type. Some of the core properties

are job description, requirements, job type, education, job title, expertise level, and

job position. Logically, the job description ontology has two parts, i.e., job description

and job position. One job description can be a part of multiple job positions.

This logical segregation is incorporated to increase the reusability of a single job

description. The details of the job description ontology are discussed in Chapter 5.
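The split between job description and job position can be pictured with a small data model. The following Python sketch is illustrative only: the class names, the field names and the assignment of fields to each part are assumptions made for this example and are not the actual OWL identifiers of the ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobDescription:
    """Reusable part: what the job is about (hypothetical field names)."""
    job_title: str
    requirements: List[str] = field(default_factory=list)
    responsibilities: List[str] = field(default_factory=list)
    expertise_level: str = ""
    education: str = ""


@dataclass
class JobPosition:
    """A concrete opening that points to a shared job description."""
    description: JobDescription
    job_type: str
    career_level: str


# One job description reused by two job positions.
java_dev = JobDescription(
    job_title="Java Software Engineer",
    requirements=["2+ years of experience in Java"],
)
positions = [
    JobPosition(java_dev, job_type="Full-time", career_level="Mid-Level"),
    JobPosition(java_dev, job_type="Contract", career_level="Mid-Level"),
]
```

Both positions reference the same description object, which mirrors the reuse that the logical segregation is meant to enable.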

3.4 Job Query Transformation

Users post natural language queries for the retrieval of relevant jobs. Natural language

queries are not machine-understandable as they are raw plain English text. The

queries are transformed into SPARQL for execution on the machine-understandable

knowledge-base. The proposed job search solution is designed

to answer user questions posed in natural language against an RDF store of job

descriptions. It enables users to explore the RDF annotated data without knowing

SPARQL or the underlying job ontology. Sem-QAS (Semantic Question Answering

Solution) is designed using a hybrid of a pattern storage approach and a dynamic

SPARQL triple generation approach. It makes use of the ontologies and linguistic

analysis of the input query. Its most differentiating features are (1) the use of atomic

filtering constraints to generate SPARQL query triple patterns without depending

on a back-end question database and (2) the semantic association of generated triple

patterns according to the user's intent for dynamic generation of complex SPARQL

queries.

The transformation process for an NLQ starts with the identification of named

entities, such as skills, location, experience, expertise level, and others. The

identification of named entities uses the same dictionary as discussed in

Section 3.2. After the identification and extraction of named entities, the next step

is to match the entities with the respective query template(s). The value of each identified

named entity replaces the corresponding ontological concept in the template. The details of job

query transformation and retrieval are discussed in Chapter 6.
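The template-filling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Sem-QAS implementation: the `jd:` namespace, the property names `jd:requiresSkill` and `jd:location`, the tiny entity dictionary and the templates are all hypothetical.

```python
import re

# Hypothetical namespace; the real ontology IRIs are not given in this chapter.
PREFIX = "PREFIX jd: <http://example.org/jobdescription#>"

# Tiny stand-in for the dictionary: surface form -> (concept, canonical value).
ENTITY_DICTIONARY = {
    "java": ("Skill", "Java"),
    "python": ("Skill", "Python"),
    "islamabad": ("Location", "Islamabad"),
}

# One triple template per ontological concept; {value} is the slot to fill.
TRIPLE_TEMPLATES = {
    "Skill": '?job jd:requiresSkill "{value}" .',
    "Location": '?job jd:location "{value}" .',
}


def identify_entities(query):
    """Return (concept, value) pairs for dictionary terms found in the query."""
    tokens = re.findall(r"[a-z+#]+", query.lower())
    return [ENTITY_DICTIONARY[t] for t in tokens if t in ENTITY_DICTIONARY]


def to_sparql(query):
    """Fill the matching templates and join them into one SELECT query."""
    patterns = [
        TRIPLE_TEMPLATES[concept].format(value=value)
        for concept, value in identify_entities(query)
    ]
    body = "\n  ".join(["?job a jd:JobDescription ."] + patterns)
    return f"{PREFIX}\nSELECT ?job WHERE {{\n  {body}\n}}"


print(to_sparql("Find Java jobs in Islamabad"))
```

Each identified entity contributes one triple pattern, so a query that mentions both a skill and a location yields a SPARQL query with two constraints on the same `?job` variable.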

3.5 Summary

This chapter briefly discussed the logical divisions of the proposed framework.

An overview of each component was presented; the details follow

in subsequent chapters.

Chapter 4

SAJ Framework

The purpose of this chapter is to discuss in detail the proposed extraction, enrichment

and context building framework for job descriptions, named SAJ. First,

each component of the SAJ framework is discussed in detail, followed by its evaluation.

4.1 Introduction

SAJ extracts, enriches and builds the context of information that exists in job

descriptions in e-recruitment by exploiting Linked Open Data, the job description domain

ontology, and a domain-specific dictionary. SAJ enriches the extracted information

to minimize information loss in the extraction process. SAJ is an overall

framework that encapsulates various processes to achieve extraction,

enrichment and context building of data from job descriptions in e-recruitment.

Fig 4.1 shows a pictorial representation of the extraction, enrichment and context

building approach. First, the raw plain English text is segmented into predefined

categories using a self-generated dictionary. Linguistic analysis and the dictionary

help in the identification of entities in the text. The extracted entities are processed


Figure 4.1: High level block diagram for the proposed framework SAJ

in parallel by the context builder and the enrichment component. The knowledge-base stores the

context-aware and enriched entities. Extracting annotations from unstructured

text is non-trivial and challenging work (Malik et al., 2010b). In contrast to

existing e-recruitment systems (Buttinger et al., 2008; Roman et al., 2015), the SAJ technique

not merely extracts entities from job description text but also enriches

them. The following sections discuss the SAJ framework in detail.

4.2 Segmentation

Segmentation is a process to categorize text in a job description. The primary

objective of segmentation is to ensure that the extracted entities are correct and

belong to the correct text segment. The starting and ending index locations of each

text segment are marked.

Segments such as job title, requirements, responsibilities, career level, and

others are identified in the job description text. Currently, a dictionary-based

approach (Sen et al., 2012) is adopted for text categorization. The

dictionary contains an extensive list of possible rules and heading values that can

occur in a job description. The rules in the dictionary ensure that each split is correct

along with its category. Fig 1.1 shows the categories, such as location, job title,

requirements, and responsibilities of a job description, that are identified by the

segmentation process. The next section discusses the dictionary in detail.
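As a rough illustration of heading-driven segmentation with start and end indexes, consider the Python sketch below. It is a simplification under stated assumptions: the heading dictionary is a hypothetical four-entry stand-in for the real dictionary, the first line is taken as the job title, and a heading is any known heading value on a line with fewer than four tokens.

```python
# Hypothetical heading dictionary: heading text -> segment category.
HEADINGS = {
    "requirements": "Requirements",
    "job requirements": "Requirements",
    "responsibilities": "Responsibilities",
    "location": "Location",
}


def segment(text):
    """Split a job description into (category, start, end) character spans."""
    segments = []
    offset = 0
    current = None  # (category, start) of the segment being built
    for i, line in enumerate(text.splitlines(keepends=True)):
        key = line.strip().rstrip(":").lower()
        if i == 0:
            category = "Job Title"        # first-line heuristic
        elif key in HEADINGS and len(line.split()) < 4:
            category = HEADINGS[key]      # short heading line opens a segment
        else:
            category = None               # ordinary content line
        if category is not None:
            if current is not None:
                segments.append((current[0], current[1], offset))
            current = (category, offset)
        offset += len(line)
    if current is not None:
        segments.append((current[0], current[1], offset))
    return segments


text = (
    "Java Software Engineer\n"
    "Requirements:\n"
    "2+ years of experience in Java\n"
    "Responsibilities:\n"
    "Develop web services\n"
)
for category, start, end in segment(text):
    print(category, start, end)
```

Every segment keeps the character offsets of its span, so entities extracted later can be checked against the segment they fall in.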

4.3 Dictionary

The purpose of the dictionary is to assist in text segmentation and entities extrac-

tion. The dictionary is a combination of rules designed for identifying segments

and entities in a job description. The rules are written using the grammatical

syntax of Java Annotation Pattern Engine (JAPE). Table 4.1 shows sample rules

for segmentation and entities extraction along with detail in natural language for

comprehension.

Table 4.1: Sample rules for segmentation and extraction with description

Segmentation rules (rule: description)

text.sentence.index == 1: Job title as first line of text

text.sentence.token < 4: Heading line has no other text

Extraction rule

Rule: expDurationForSkill
({Token.kind==number}
{Token.string=="+"}{SpaceToken}
({Token.string=="years"}
|{Token.string=="yrs"})):exp -->
:exp.ExpDuration
= {rule = "expDurationForSkill"}

Description: the rule detects the experience for a skill, e.g., 2+ years of experience is required in Java

The rules in the dictionary are manually designed using JAPE grammar. Domain

experts have validated these rules. The dictionary comprises two types of

files: (1) the JAPE rule files and (2) the gazetteer lists. Each rule in a JAPE file

has two parts, i.e., the left-hand side (L.H.S) and the right-hand side (R.H.S). The L.H.S

contains the annotation patterns to be matched. An annotation pattern comprises regular

expressions and operators (e.g. *, ?, +). The R.H.S is the rule outcome, that is, one or

more annotations to be created based on the L.H.S. The rules in the dictionary are

not applied all at once but in order of priority, as specified by the priority parameter of

each rule. This process reduces the chance of false positives and also provides a way to

verify any error that arises during requirement boundary identification. The

file extension of rule files is .jape. The gazetteers are specialized lists of concepts that

are input to the rules for inferring or calculating output values. These values can

be input for another rule or a concept classification, such as job requirement, job

responsibilities, and expertise level.
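The effect of priority-ordered rules can be illustrated outside JAPE with plain regular expressions. The Python sketch below is an analogue, not the dictionary itself: the regex only approximates the expDurationForSkill rule of Table 4.1, the `plainNumber` rule and both priority values are hypothetical, and the convention that a higher priority runs first is an assumption of this example.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    name: str
    priority: int            # higher runs first (an assumption of this sketch)
    pattern: "re.Pattern"
    annotation: str


RULES = [
    # Regex analogue of expDurationForSkill: matches "2+ years" or "2+ yrs".
    Rule("expDurationForSkill", 20,
         re.compile(r"\b\d+\+\s(?:years|yrs)\b"), "ExpDuration"),
    # Hypothetical lower-priority rule that would also match bare numbers.
    Rule("plainNumber", 10, re.compile(r"\b\d+\b"), "Number"),
]


def annotate(text: str, rules: List[Rule] = RULES):
    """Apply rules in priority order; spans already annotated are skipped."""
    annotations = []
    taken = []  # (start, end) spans claimed by higher-priority rules
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        for m in rule.pattern.finditer(text):
            if any(m.start() < end and start < m.end() for start, end in taken):
                continue  # overlaps a span that a stronger rule has claimed
            taken.append(m.span())
            annotations.append((rule.annotation, m.group(), rule.name))
    return annotations


print(annotate("2+ years of experience is required in Java"))
```

Because the experience rule runs first, the number inside "2+ years" is not annotated a second time as a bare number, which is the kind of false positive that rule ordering is meant to prevent.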

4.4 Linguistic Analysis and Extraction

The linguistic analysis and entity extraction component has two purposes:

(1) analysis of text for identification of sentences, Part of Speech (POS) tags and

compound words, and (2) identification of entities.

4.4.1 Linguistic Analysis

Linguistic analysis is a process of understanding the text. It covers sentence splitting,

tokenization, lemmatization and cleaning, and Part of Speech (POS) tagging. Sentence

extraction is carried out on the job description text after segmentation.

After splitting the text into sentences, each word/token is marked with a particular POS

tag such as noun, adjective, verb, and others. POS tagging uses the Penn TreeBank 1

1 https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html


tag set for identification of POS tags, such as JJ for an adjective, NN for a singular

noun and others. Identification of POS tags for the words/tokens in sentences lays

the ground for identification of compound words (Nabeel Ahmed, 2008), such as

software development.

Table 4.2: Compound words identification and extraction rules

| Rules | Description |
|---|---|
| ∀a, a ∈ N | Every compound word has a noun |
| ∀a₁, a₂ ∈ N, if succeed(a₁, a₂) → join(a₁, a₂) | A noun succeeded by a noun, terms are joined |
| ∀a, b, where a ∈ N ∧ b ∈ A, if succeed(a, b) → term(a) ∧ drop(a) | A noun succeeded by an adjective, the noun term is saved and adjective is compared with next token |

Compound word extraction uses a set of rules. The rules are designed using

knowledge of word construction from English dictionaries and in-depth analysis

of scientific and technical English texts. English literature experts validated the

rules. A few rules are shown in Table 4.2 with explanations, and the detailed rules are

available in (Awan, 2009).
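The noun-succeeded-by-noun rule of Table 4.2 can be sketched as follows. This Python example covers only that single rule and assumes the input is already tokenized and tagged with Penn TreeBank tags; the function name and the sample sentence are illustrative and do not come from the SAJ implementation.

```python
def extract_compounds(tagged):
    """Join runs of consecutive nouns into compound words.

    `tagged` is a list of (token, POS) pairs using Penn TreeBank tags.
    """
    compounds, run = [], []
    for token, pos in tagged:
        if pos.startswith("NN"):       # NN, NNS, NNP, NNPS are noun tags
            run.append(token)          # noun succeeded by noun: keep joining
        else:
            if len(run) > 1:           # a single noun is not a compound
                compounds.append(" ".join(run))
            run = []
    if len(run) > 1:                   # flush a run that ends the sentence
        compounds.append(" ".join(run))
    return compounds


tagged = [("Experience", "NN"), ("in", "IN"), ("software", "NN"),
          ("development", "NN"), ("is", "VBZ"), ("required", "VBN")]
print(extract_compounds(tagged))
```

For the sample sentence, the two adjacent nouns are joined into the compound word "software development", while the isolated noun "Experience" is left alone.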

Besides the identification of compound words, POS tags also help in identifying

cardinals, such as '1' or 'three'. The identification of cardinals has an impact on

text search results, e.g., Java with two years of experience. Here, two is a cardinal

that represents the level of experience for the skill Java. Identification of a

cardinal helps to match the experience mentioned in the job description with

a user query/profile. After the text is marked with POS tags, compound words and

cardinals, entity extraction is next in the pipeline.
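To see why cardinals matter for matching, the sketch below normalizes a cardinal written as a digit or as a word to the same integer, so that "two years" in a query and "2+ years" in a job description compare equal. The word table and both function names are assumptions made for this illustration; the thesis identifies cardinals through POS tags rather than through a lookup table.

```python
import re

# Small word-to-number table; a fuller version would cover larger numbers.
WORD_NUMBERS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def cardinal_value(token):
    """Return the integer value of a cardinal token, or None if it is not one."""
    token = token.lower()
    if token.isdigit():
        return int(token)
    return WORD_NUMBERS.get(token)


def years_of_experience(text):
    """Find the first cardinal that directly precedes 'year', 'years' or 'yrs'."""
    tokens = re.findall(r"[A-Za-z]+|\d+", text)
    for current, following in zip(tokens, tokens[1:]):
        value = cardinal_value(current)
        if value is not None and following.lower() in ("year", "years", "yrs"):
            return value
    return None


query = "Java with two years of experience"
job = "2+ years of experience required in Java"
print(years_of_experience(query) == years_of_experience(job))
```

Once both sides are reduced to the same numeric value, the experience in a user query can be compared with the experience stated in a job description regardless of how the number was written.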

4.4.2 Entities Extraction

Entity extraction is a process of extracting important information from unstructured

text, such as places, organizations, names, and money. Fig 4.2 shows domain-specific

entities for e-recruitment from a job description, such as expertise level and

skills.

Figure 4.2: Extracting context-aware requirement entity from a job description

The extraction of entities from a job description is non-trivial and challenging

work. It faces challenges such as similar information being

described with varying nomenclature and the existence of contextual entities. Table 4.3

shows how a requirement for the skill Java with 2 or more years of experience is

expressed with varying nomenclature.

Table 4.3: Sample text showing nomenclature variation in job description

1. 2+ years of experience required in Java.

2. Must have worked at least 2 years in java development.

3. Experience of 2+ years is required in java development.

This type of variation in the text makes it challenging to extract information with minimal information loss. The other problem, as discussed, is that of contextual entities. Entities extracted from a job description are of two types: (1) entities that are directly identified from the text, such as job title, skills, organization, and location, and (2) entities that are determined based on the occurrence of other entities, such as job requirements, job responsibilities, and job and skill experience. Fig 4.2 shows domain-specific named entities for e-recruitment in a job description. In the figure, Skill and Expertise Level are directly identified entities, whereas Requirement is a contextual entity identified from the skill and expertise level. The steps taken to address these problems are described next.
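The nomenclature variation in Table 4.3 can be illustrated with a short Python sketch. It is not part of SAJ, which relies on JAPE rules and a domain dictionary; the regular expression, the toy skill list `SKILLS`, and the function `normalize_experience` are assumptions made purely for illustration. The sketch maps the three sample sentences to the same (skill, minimum years) pair.

```python
import re

# Hypothetical pattern for a minimum-experience phrase such as
# "2+ years" or "at least 2 years"; not one of SAJ's JAPE rules.
EXPERIENCE = re.compile(r"(?:at\s+least\s+)?(\d+)\s*\+?\s*years?", re.IGNORECASE)

# Toy stand-in for the skill list of the domain dictionary.
SKILLS = {"java"}


def normalize_experience(sentence):
    """Return (skill, minimum_years), or None if either part is missing."""
    match = EXPERIENCE.search(sentence)
    if match is None:
        return None
    words = re.findall(r"[a-z]+", sentence.lower())
    skill = next((word for word in words if word in SKILLS), None)
    if skill is None:
        return None
    return skill, int(match.group(1))


samples = [
    "2+ years of experience required in Java.",
    "Must have worked at least 2 years in java development.",
    "Experience of 2+ years is required in java development.",
]
for sentence in samples:
    print(normalize_experience(sentence))
```

All three sentences normalize to `('java', 2)`, which is the kind of canonical form that keeps information loss low when the wording differs.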


4.4.2.1 Basic Information Extraction

Basic entities are those that are readily available for extraction. For a job description, the basic entities are job title, location, career level, and organization. Table 4.4 shows each of these entities with examples.

Table 4.4: Basic entities along with examples from a job description in SAJ

Entity         Example
Job Title      Java Software Engineer
Location       St. Louis, MO
Career Level   Mid-Level
Organization   Google Inc.

A job description must have a job title; the other basic entities are optional. Basic information is extracted via a hybrid approach that uses heuristics and rules from the dictionary. When multiple job titles are detected, the first-line heuristic mentioned in Table 4.1 is applied. The position of the job title plays a vital role. Another aspect that needs careful handling is the use of special characters in a job title, which can cause the boundary of the job title entity to be detected incorrectly.

4.4.2.2 Requirements Extraction

Requirements are contextual entities that define the essential skills or capabilities that an employer seeks in a potential candidate. For example, the Job Requirements segment in Fig 1.1 shows the requirements in a job description. The extraction of requirements is vital because of their significance for both employers and candidates. Requirements are not just basic entities; rather, they are basic entities in a specific context. For example, the skill Java has various expertise levels, such as novice and proficiency. Here, the skill and its expertise level make a single requirement, as they occur in a specific context, as shown in Fig 4.2.


This process has two main steps: (1) identification of the requirement boundary, and (2) identification of entities. The requirement boundary is the start and end of a requirement. Table 4.5 shows a sample rule that marks the boundary of a requirement. The rule uses POS tags in combination with the words in a sentence to mark the boundary. After the requirement boundary is identified, the next step is to identify the actual requirement. An essential aspect of a rule is its priority, which is set using the Priority property that rules inherit from JAPE.

Table 4.5: Sample rule for boundary detection for requirement using JAPE in SAJ

Rule Description

Rule:requirementboundarymarker

Priority: 100

{Lookup.majorType==Req_BeginKeywords}

({SpaceToken})[0,2]

({Token.category==IN}| {Token.category==TO}|

{Token.category==VBG}| {Token.category==VB}|

{Token.category==VBZ}|{Token.category==DT})?

((({SpaceToken})[0,3] ({Token.kind==word,

!Lookup.majorType==Req_NotAfterKeywords}

|{Token.kind==symbol}| {Token.kind==number}|

{Token.kind==punctuation,

!Token.string=="."}) )+)

: req --> :req.Requirement =

{rule = "requirementboundarymarker"}

This rule detects the boundary of the requirement. It detects the token categories as POS. The tokens are either verbs (VBZ, VBG, VB), determiners or prepositions. Besides POS, the placement of requirement keywords in the sentence is verified.

The primary purpose of setting a priority is to define the execution order of

rules. The result obtained from one rule is input to the next rule. A higher value

of priority defines a higher order of execution of the rule. The rule in Table 4.5

has priority set to 100, meaning it is the first rule that will be executed and will

detect a start and end boundary for a requirement.


Table 4.6: Sample rule for job requirement using JAPE in SAJ

Rule Description

Phase: requirementSubParts

Input: RequirementsBeg Token

RequirementsNot RequirementsMid

RequirementsEnd Skill Split

ToolsAndTechnology OperatingSystem

Database Course TechnicalLanguage Protocol

ExpertiseLevel MandatoryConditionTrue

MandatoryConditionFalse ExpDuration

Options: control = appelt

Rule:requirementSubPartsStart

Priority: 50

{RequirementsBeg} (((({Token})* | {Skill}

| {ToolsAndTechnology} | {OperatingSystem}

| {Database} | {Course} | {TechnicalLanguage}|

{Protocol} |{ExpertiseLevel}|

{MandatoryConditionTrue} |

{MandatoryConditionFalse}|

{ExpDuration} |{ExpertiseLevel}))+{Split} )

:req --> :req.Requirements =

{rule ="requirementSubPartsStart"}

This rule applies various lists in the dictionary, such as ToolsAndTechnology, OperatingSystem, Database, Course, TechnicalLanguage and others, to detect the entities. Besides extracting entities, the rule also detects the experience duration and expertise level. The rule is dynamic, i.e., the placement of these entities in a sentence does not affect the rule.

The contextual entity Requirement is extracted from a job description using pattern/action rules defined in the dictionary. The rules identify the entities in unstructured text that constitute the requirements of a job description. Consider the requirement, Proficiency in Object Oriented Programming in Java and Groovy+Grails for Web-based application. The rule in Table 4.5 marks the boundary of the requirement, Proficiency in Object Oriented Programming in Java, as it is a high-priority rule with value 100. Table 4.6 shows a pattern/action rule with priority 50. This rule extracts skills, such as Java, and is executed after the priority-100 rule in Table 4.5.

The dictionary defines the domain knowledge for requirement identification. The rule in Table 4.6 uses various lists, such as skill, database, course, and technical language, for the extraction of requirements. A sentence that does not satisfy the rule is discarded.
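The two-step flow described above (boundary marking first, entity identification second, ordered by priority) can be sketched in Python. This is a simplified illustration of the ordering as the text describes it, not a reimplementation of the JAPE rules in Tables 4.5 and 4.6 or of GATE's execution engine. The keyword lists, the use of a conjunction as a stand-in for the not-after keywords, and all function names are assumptions.

```python
import re

# Toy stand-ins for dictionary lists; the actual SAJ lists are not shown here.
BEGIN_KEYWORDS = ("proficiency in", "experience in", "must have")
SKILLS = ("object oriented programming", "java", "groovy", "grails")
EXPERTISE_LEVELS = ("proficiency", "novice", "expert")


def mark_boundary(state):
    """Priority-100 step: mark the span from a begin keyword up to the first
    full stop or conjunction (a crude stand-in for not-after keywords)."""
    text = state["text"]
    lowered = text.lower()
    starts = [lowered.find(k) for k in BEGIN_KEYWORDS if k in lowered]
    if not starts:
        return state  # the sentence does not satisfy the rule
    start = min(starts)
    stop = re.search(r"\.|\b(?:and|or)\b", lowered[start:])
    end = start + stop.start() if stop else len(text)
    state["requirement"] = text[start:end].strip()
    return state


def contains(term, span):
    """True if term occurs in span as a whole word or phrase."""
    return re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", span) is not None


def extract_entities(state):
    """Priority-50 step: look up dictionary entries inside the marked span."""
    span = state.get("requirement", "").lower()
    state["skills"] = [s for s in SKILLS if contains(s, span)]
    state["expertise"] = [l for l in EXPERTISE_LEVELS if contains(l, span)]
    return state


# Rules are listed out of order on purpose; priority decides the order.
RULES = [(50, extract_entities), (100, mark_boundary)]


def run_rules(text):
    state = {"text": text}
    for _priority, rule in sorted(RULES, key=lambda item: item[0], reverse=True):
        state = rule(state)  # the output of one rule is the input of the next
    return state


result = run_rules(
    "Proficiency in Object Oriented Programming in Java and "
    "Groovy+Grails for Web-based application."
)
print(result["requirement"], result["skills"], result["expertise"])
```

With these toy lists, the marked requirement is "Proficiency in Object Oriented Programming in Java", from which the skills object oriented programming and java and the expertise level proficiency are picked up. A sentence with no begin keyword yields no requirement span.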

4.4.2.3 Responsibilities Extraction

Responsibilities are the duties that an employee performs while working in an organization, as shown in Fig 1.1. They form a non-mandatory text segment of a job description and are sometimes described along with the job requirements. Responsibilities and requirements are sometimes defined using similar entities, e.g., must have a knowledge of AWS cloud and he will manage AWS cloud. In this example, the first statement is a job requirement, whereas the second statement is a job responsibility. It is therefore difficult to draw a clear line between job responsibilities and job requirements. After detailed analysis and experimentation on a real-world data-set, SAJ was successful in segregating job requirements from job responsibilities.


Table 4.7: Sample rule for detecting responsibilities from a job description in SAJ

Rule Description

Phase: Responsibility

Input: Lookup Token SpaceToken

Options: control = appelt

debug=true

Rule:keywordResponsibility

Priority: 10

{Lookup.majorType==Responsibilty_BeginKeywords}

({SpaceToken})[0,2] ({Token.category==IN}|

{Token.category==TO}|{Token.category==VBG}|

{Token.category==VB}|{Token.category==DT} |

{Token.category==NN}|{Token.category==NNS})?

((({SpaceToken})[0,3]({Token.kind==word,

!Lookup.majorType==Res_NotAfterKeywords} |

{Token.kind==symbol}|{Token.kind==number}|

{Token.kind==punctuation,

!Token.string=="."}))*)

:req --> :req.Responsibility =

{rule = "keywordResponsibility"}

This rule detects the boundary of the responsibility. It detects the token categories as POS. The tokens are either verbs (VBZ, VBG, VB), determiners or prepositions. Besides POS, the placement of requirement keywords in a sentence is verified.

Table 4.7 shows a rule for detecting the boundary of responsibilities in a job description. The sample rule extracts responsibilities using domain background knowledge from the dictionary and the morphological sentence structure. The rule has a low priority of 10. It uses the Responsibilty_BeginKeywords list in the dictionary along with POS tags and token kinds to identify the responsibility boundaries.
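The distinction between a requirement and a responsibility that mention the same entity can be sketched as a begin-keyword lookup. The following Python snippet is only an illustration of the idea; the two keyword tuples are hypothetical examples and are far smaller than any real dictionary list, and `classify_segment` is not a function of SAJ.

```python
# Hypothetical begin-keyword lists; SAJ keeps such lists in its dictionary.
REQUIREMENT_KEYWORDS = ("must have", "knowledge of", "experience in")
RESPONSIBILITY_KEYWORDS = ("will manage", "responsible for", "will be")


def classify_segment(sentence):
    """Label a sentence by the kind of begin keyword it contains."""
    lowered = sentence.lower()
    if any(keyword in lowered for keyword in RESPONSIBILITY_KEYWORDS):
        return "responsibility"
    if any(keyword in lowered for keyword in REQUIREMENT_KEYWORDS):
        return "requirement"
    return "unclassified"


for sentence in ("must have a knowledge of AWS cloud", "he will manage AWS cloud"):
    print(sentence, "->", classify_segment(sentence))
```

Both sentences mention AWS cloud, yet the first is labelled a requirement and the second a responsibility, because the decision rests on the surrounding keywords rather than on the shared entity.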


Figure 4.3: Sample of educational requirement in a job description

4.4.2.4 Education Extraction

Education defines mandatory or minimal qualification required for a job.

Table 4.8: A sample rule for education extraction from a job description in SAJ

Rule Description

Rule:degreeextractioninfull

Priority: 40

({Degree}

{Token.string=="in"}({Token.category==NNP,

!Lookup.majorType==date})+{Token.string=="and"}

({Token.category==NNP,!Degree})+ ):Degree

--> :Degree.FullDegree

={rule="degreeextractioninfull"}

This rule extracts the educational requirement categorized as Degree. It uses the POS tag (NNP) and degree dictionaries. The rule also verifies the existence of the token "in".

Education has four categories: degree, diploma, training, and certification. These categories are useful when matching a job with a profile. Fig 4.3 shows an educational requirement for a job, i.e., BS/MS in Computer Science. Table 4.8 shows a rule that extracts the educational requirement as Degree. The sample rule identifies educational entities that contain the token in. In addition to token identification, the Degree list is used with the POS tag to detect the educational requirement correctly.
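The idea behind the rule in Table 4.8 (a degree from a dictionary list, the token in, and a run of proper nouns) can be approximated with a regular expression. The sketch below is an assumption-laden simplification: capitalized words stand in for the NNP tag, the `DEGREES` list is a toy example, and `extract_education` is a hypothetical helper rather than part of SAJ.

```python
import re

# Toy degree list standing in for the Degree dictionary.
DEGREES = ["BS", "MS", "PhD"]
DEGREE = "|".join(re.escape(d) for d in DEGREES)

# Degree(s), the token "in", then capitalized words (a rough NNP substitute),
# optionally joined by "and".
EDUCATION = re.compile(
    rf"\b(?P<degree>(?:{DEGREE})(?:/(?:{DEGREE}))*)\s+in\s+"
    r"(?P<field>[A-Z][a-z]+(?:\s+(?:and\s+)?[A-Z][a-z]+)*)"
)


def extract_education(text):
    """Return the degrees and the field of study, or None if nothing matches."""
    match = EDUCATION.search(text)
    if match is None:
        return None
    return {"degrees": match.group("degree").split("/"), "field": match.group("field")}


print(extract_education("BS/MS in Computer Science is required."))
```

For the example in Fig 4.3, the sketch returns the degrees BS and MS with the field Computer Science. A POS tagger would be more reliable than capitalization for fields written in lower case.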

4.5 Context Builder

The extracted entities are forwarded to the context builder and enrichment module

in parallel.

Figure 4.4: Graph structure showing entities and connections in SAJ

The context builder creates relationships (both hierarchical and associative) among extracted entities using a job description ontology, as shown in Fig 5.2. The job description ontology is designed using the job posting schema from schema.org 2 and job description domain studies from the various existing job portals discussed above. HR domain experts evaluated the ontology schema concepts and relationships to validate the domain coverage of the job description ontology. The details of the Job Description Ontology are available in (Ahmed et al., 2016).

The job description ontology provides a schema for structuring and building the context of extracted entities, as shown in Fig 4.4. The core schema classes are Job Description, Job Title, Requirements, Education, Career Level, and Job Type. Some of the core properties are job description, requirements, job type, education, and job title. The ontology defines not only hierarchical relationships but also associative relationships, such as skos:altLabel, owl:sameAs and others.

2 https://schema.org/JobPosting

Fig 4.4 represents the requirements of a job description in an ontological model

along with all its semantics. A relationship exists between a skill and an expertise

level in the requirement. The relationships are not automatically extracted from

the job description text; instead, the job description ontology already defines these

relationships. The context builder uses entity types, such as skill, job requirement,

expertise level, career level and others for identification of relationships.

For example, S1 is an instance of the Skills class, which is an intermediary node that connects the Skill instance Object Oriented Programming and the Expertise Level instance Proficiency. The intermediary node S1 is then connected to R1, an instance of the Requirement class, which is in turn connected to the Job Description instance JD1.
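The S1, R1 and JD1 example can be written out as subject-predicate-object triples. The Python sketch below only illustrates the shape of the resulting graph; the predicate names (`skill`, `expertiseLevel`, `hasSkills`, `requirements`) are assumptions and may differ from the property names of the actual job description ontology.

```python
def build_requirement_context(job_id, requirement_id, node_id, skill, expertise_level):
    """Return triples linking a skill and its expertise level, through an
    intermediary node, to a requirement and then to a job description."""
    return [
        (node_id, "rdf:type", "Skills"),
        (node_id, "skill", skill),
        (node_id, "expertiseLevel", expertise_level),
        (requirement_id, "rdf:type", "Requirement"),
        (requirement_id, "hasSkills", node_id),
        (job_id, "rdf:type", "JobDescription"),
        (job_id, "requirements", requirement_id),
    ]


triples = build_requirement_context(
    "JD1", "R1", "S1", "Object Oriented Programming", "Proficiency"
)
for triple in triples:
    print(triple)
```

The intermediary node keeps the skill and its expertise level together, so the same skill instance can appear in another requirement with a different level without the two being confused.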

4.6 Enrichment

Enrichment is the process of adding knowledge to existing entities. Enriching job description entities increases the search space and improves job-profile matching. The enrichment process receives its input from the entity extraction process as a list of entities. At present, enrichment only processes skill entities. Skills are enriched to cater for the variation in skill nomenclature. The primary aim is to capture all alternate forms of a skill, e.g., Object-Oriented Programming as OOP, so that SAJ can identify Object Oriented Programming and OOP as the same skill. The process achieves this by using Linked Open Data, as shown in Fig 4.5. The main reason for using Linked Open Data for enrichment is to have up-to-date information about the terms being enriched. Because the open source community keeps the LOD data updated, the enrichment process does not suffer from the traditional data staleness problem.

Fig 4.4 shows a pictorial representation of inter-document concept enrichment. The concept object oriented programming is linked to two job descriptions with different job titles. A query that searches for all jobs that have object oriented programming as a requirement will therefore get precise results.

Figure 4.5: Entities enrichment process using LOD in SAJ

Fig 4.5 shows a pictorial representation of the enrichment process using Linked Open Data.

The enrichment process receives new labels from LOD based on the properties rdfs:label and rdfs:altLabel, with the condition lang=en. The rdfs:label, rdfs:altLabel and lang=en filter are standard Web Ontology Language 3 properties. The similarity is computed between the new labels fetched from LOD and the entities extracted from a job description. If fewer than five entities are returned from LOD, then all returned entities are stored; if the number exceeds five, then the similarity is calculated using the cosine similarity (Thada and Jaglan, 2013). The DISCO API (Kolb, 2008) facilitates the calculation of cosine similarity. In addition to cosine similarity, a distributional similarity is also calculated using the DISCO API, which allows the semantic similarity between arbitrary words and phrases to be calculated. The DISCO API calculates the similarities using the Wikipedia 4 SIM type data set. The data-set used for computing similarities was published in April 2013 5. The entities selected after computing similarity are stored in the knowledge base using skos:altLabel.
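The label selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a plain bag-of-words cosine similarity replaces the DISCO API, the acceptance threshold of 0.5 is hypothetical (the text does not state one), and the case of exactly five labels, which the text leaves open, is treated here like the larger case.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two phrases using word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[term] * vb[term] for term in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def select_labels(entity, labels, limit=5, threshold=0.5):
    """Keep every label when fewer than `limit` are returned; otherwise keep
    only the labels that are similar enough to the extracted entity."""
    if len(labels) < limit:
        return list(labels)
    return [label for label in labels if cosine(entity, label) >= threshold]


few = ["OOP", "object orientation"]
many = [
    "object oriented programming",
    "OOP",
    "object oriented design",
    "programming",
    "functional programming",
    "cooking",
]
print(select_labels("Object Oriented Programming", few))
print(select_labels("Object Oriented Programming", many))
```

In the second call the word-count cosine discards OOP, because the abbreviation shares no tokens with the full phrase. This is the kind of case where a distributional similarity, as provided by DISCO, is needed instead of surface overlap.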

4.6.1 Knowledge Base

The knowledge base is responsible for storing the data. It receives data from the context builder and the enrichment process. After integrating the data, the knowledge base stores the information as a graph structure using the job description ontology. Fig 4.6 shows the N3 notation of a job description in the knowledge base.

3 https://www.w3.org/OWL/
4 https://www.wikipedia.org/
5 https://www.linguatools.de/disco/discodownload en.html


Figure 4.6: N3 notation of a job description in knowledge-base

Fig 4.6 represents a single job description. The snapshot in Fig 4.4 visualizes two job descriptions, JD1 and JD2. Both job descriptions have the same skill requirement, Object Oriented Programming, but with different expertise levels. The graph structure of the knowledge base connects the same instance Object Oriented Programming to all Skill instances with different expertise levels. This structure makes the knowledge base more useful when exploring a query in the graph, such as find all jobs which have a requirement of object-oriented programming.
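The benefit of sharing one skill instance across job descriptions can be shown with a small in-memory graph. The Python sketch below uses a list of triples in place of the actual knowledge base; the predicate names, the third job description JD3, and the expertise levels of JD2 and JD3 are hypothetical and serve only to make the query non-trivial.

```python
# Toy graph: JD1 and JD2 share the skill instance, JD3 does not.
TRIPLES = [
    ("JD1", "requirements", "R1"),
    ("R1", "hasSkills", "S1"),
    ("S1", "skill", "Object Oriented Programming"),
    ("S1", "expertiseLevel", "Proficiency"),
    ("JD2", "requirements", "R2"),
    ("R2", "hasSkills", "S2"),
    ("S2", "skill", "Object Oriented Programming"),
    ("S2", "expertiseLevel", "Novice"),
    ("JD3", "requirements", "R3"),
    ("R3", "hasSkills", "S3"),
    ("S3", "skill", "Databases"),
    ("S3", "expertiseLevel", "Expert"),
]


def subjects(predicate, obj):
    """All subjects that point to obj through predicate."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def jobs_requiring(skill):
    """Walk skill -> intermediary node -> requirement -> job description."""
    jobs = set()
    for node in subjects("skill", skill):
        for requirement in subjects("hasSkills", node):
            jobs |= subjects("requirements", requirement)
    return sorted(jobs)


print(jobs_requiring("Object Oriented Programming"))
```

The query starts from the single shared skill instance and follows the links backwards, so both JD1 and JD2 are found regardless of their expertise levels.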

4.7 Evaluation

The evaluation rationale of the proposed system originates from its primary objectives: (1) to design an extraction and transformation methodology that identifies entities and compound words in e-recruitment content and builds the information context, and (2) to design an enrichment methodology that enriches entities and compound words from Linked Open Data to cater for data staleness in e-recruitment. The information extracted by SAJ should have minimal information loss, provide a larger search space, and adhere to the Linked Open Data principles. The current evaluation addresses all of these aspects.


Table 4.9: Statistics of jobs collected from various e-recruitment systems

Source              Descriptions
Personforce.com 6   101
DBWorld 7           139
Indeed.com 8        620
Total               860

4.7.1 Data-set Acquisition

At present, no standard gold data-set exists for job descriptions. The data-set is self-collected from various e-recruitment systems and a community mailing list. A total of 860 job descriptions have been collected. Table 4.9 shows the sources along with the number of job descriptions collected from each source. A self-developed automatic crawler collected data from Indeed and DBWorld. Indeed provides a REST API to fetch data. Personforce provided data as an industrial partner.

The collected job descriptions belong to multiple categories, as shown in Table 4.10. These categories range from information technology to management to health care. The job descriptions were collected at random and then placed in these predefined categories. The random selection was carried out to ensure that the data-set is not biased and contains jobs from multiple domains and disciplines.

The collected data-set was evaluated by Human Resource (HR) Experts who

had more than five years of experience working in the area of human recruitment

and staffing. The primary entities selected after discussion with HR experts for

evaluation were a job title, job responsibilities, job requirements, job category

and education, such as degree, diploma, training or certification. These selected

entities have a pivotal and vital role in the job description(s). The results of the

entity extraction from job descriptions are compared with the manually verified

data from HR experts.


Table 4.10: Statistics of job description in various job categories collected randomly

Job Category                          Count
Engineering and Technical Services    55
Business Operations                   20
Computer and Information Technology   125
Internet                              73
Project Management                    85
Health-care and Safety                9
Arts, Design and Entertainment        26
Sales and Marketing                   38
Office Support and Administrative     203
Architecture and Engineering          10
Construction and Production           9
Customer Care                         21
Management and Executive              22
Financial Services                    9
Government and Policy                 6
Post-doctoral                         45
Research and Teaching                 66
Others                                38
Total                                 860

4.7.2 Evaluation Metrics

The evaluation has been carried out using the standard metrics of recall, precision and F1-measure, as shown in equations 4.1, 4.2 and 4.3. An error analysis (Powers, 2011) is also provided to put the evaluation on more solid ground.

\[ \text{Recall} = \frac{|\text{relevant jobs} \cap \text{retrieved jobs}|}{|\text{relevant jobs}|} \tag{4.1} \]

\[ \text{Precision} = \frac{|\text{relevant jobs} \cap \text{retrieved jobs}|}{|\text{retrieved jobs}|} \tag{4.2} \]

\[ \text{F1-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.3} \]


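Besides recall, precision and F1-measure, the overall system accuracy and error rate are also calculated. The error rate measures the inaccurate extractions, i.e., one minus the accuracy. Here tp, tn, fp and fn denote the numbers of true positives, true negatives, false positives and false negatives.

\[ \text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{4.4} \]

\[ \text{ErrorRate} = 1 - \text{Accuracy} \tag{4.5} \]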
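Equations 4.1 to 4.5 translate directly into code. The following Python sketch computes the metrics from sets of relevant and retrieved items and from confusion-matrix counts; the function names and the sample values are illustrative assumptions and are not taken from the SAJ evaluation.

```python
def precision_recall_f1(relevant, retrieved):
    """Equations 4.1 to 4.3, computed on sets of item identifiers."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(tp, tn, fp, fn):
    """Equation 4.4."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def error_rate(tp, tn, fp, fn):
    """Equation 4.5."""
    return 1.0 - accuracy(tp, tn, fp, fn)


# Made-up example: four relevant items, three retrieved, two in common.
relevant = {"a", "b", "c", "d"}
retrieved = {"a", "b", "e"}
print(precision_recall_f1(relevant, retrieved))
print(accuracy(45, 49, 3, 3), error_rate(45, 49, 3, 3))
```

In the made-up example, two of the three retrieved items are relevant (precision 2/3) and two of the four relevant items are retrieved (recall 1/2), which gives an F1-measure of 4/7.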

4.7.3 Evaluation Results

Table 4.11 shows the results of the entity extraction process. The table shows the recall, precision and F1-measure values of the various entity types. These values are computed by comparison against the gold standard, i.e., the data-set manually verified by HR experts. Education has the highest recall, i.e., 99.9%, whereas job title has the highest precision of 99.9%. Overall, job title had the highest F1-measure value of 95.60%. This table shows only the evaluation results of the proposed system against the gold standard. The next section discusses the comparison with other systems.

Table 4.11: Results of entities extraction from job description in SAJ

S.No.   Entity Type        Precision   Recall   F-Measure
1       Requirements       90.5        87.90    88.76
2       Responsibilities   76.14       75.00    75.76
3       Education          38          99.9     55.05
4       Job Title          99.9        90.67    95.00
5       Job Category       79.24       97.67    87.50

Besides the comparison on the basis of the standard parameters of precision, recall and F1-measure, an accuracy vs error comparison is also performed to give a clear idea of how well SAJ performs. From the graph in Fig 4.7, it is evident that education has a low error rate, close to zero. The 99.9% accuracy is due only to the low variation in the education entity. Overall, the system has an accuracy of 94% and an error rate of 6%.

Figure 4.7: Evaluation of extraction comparison of accuracy vs error for SAJ

4.7.4 Comparative Analysis

This sub-section presents the result comparison among SAJ, OpenCalais 9 and

Alchemy API 10. Both OpenCalais and Alchemy API are industry leaders for

information extraction.

All systems were able to extract job titles, as shown in Fig 4.8. The comparison

parameters are precision, recall, and f-measure.

9 http://www.opencalais.com/about
10 http://www.alchemyapi.com/about-us


Figure 4.8: SAJ, Alchemy API and OpenCalais extraction comparison for job titles

From the graph, it is evident that SAJ performs well compared with OpenCalais and Alchemy API. SAJ achieved an overall precision of 98.1%, compared with 39% for OpenCalais and 34.32% for Alchemy API. The other entity that OpenCalais was able to extract was requirements; Alchemy API was unable to extract requirements. The graph in Fig 4.9 shows a comparative analysis of the requirement entity between SAJ and OpenCalais. From the graph, it is evident that SAJ has a much higher precision, 90.5%, than OpenCalais, 42.78%. OpenCalais has a recall of 76.1%, whereas SAJ has a recall of 87.09%. OpenCalais and Alchemy API were not able to extract education, responsibilities and job category.


Figure 4.9: SAJ and OpenCalais extraction comparison for requirements

Therefore, no comparison is presented for these named entities: OpenCalais and Alchemy API were not able to extract them, as they are domain-specific entities. Fig 4.10 compares the evaluation metrics, i.e., precision, recall and F1-measure, against the built ground truth for the extraction of various entity types.

Figure 4.10: Comparison of precision, recall and f1-measure with ground truth


4.8 Summary

In this research, the SAJ extracts context-aware information from job descriptions

by exploiting Linked Open Data, job description domain ontology, and domain-

specific dictionaries. SAJ enriches and builds context among extracted entities to

minimize the information loss in the extraction process. SAJ encapsulates various

processes together to achieve context-aware information extraction and enrichment

from the job description in e-recruitment. SAJ segments the text into predefined

categories using a self-generated dictionary. Natural Language Processing (NLP)

and dictionary help in identification of entities. The extracted entities are enriched

using Linked Open Data, and job context is built using a job description domain

ontology. The knowledge-base stores the enriched and context-aware information

built using Linked Open Data principles. The data-set comprises 860 jobs and has been verified by HR experts. The initial assessment is carried out by comparing the manually verified data with the entities extracted by the system. The SAJ framework achieved an overall F1-measure of 87.83% in entity extraction.

In comparison with other techniques, such as OpenCalais and Alchemy API, SAJ performed better than both systems. OpenCalais was able to extract job titles and job requirements, while Alchemy API was only able to extract job titles. Neither OpenCalais nor Alchemy API was able to extract education, responsibilities and job category, as these are domain-specific entities. SAJ can facilitate the searching, retrieval, scoring and ranking of candidates.

Chapter 5

Job Description Ontology

The purpose of this chapter is to discuss in detail the proposed job description

ontology. The job description ontology describes the underlying semantics and

structure of a job. The job description ontology provides a comprehensive schema

for defining the relationships among concepts to logically connect them.

Section 5.1 defines the ontology design methodology, Section 5.2 discusses the expressiveness of the ontology, Section 5.3 discusses the job description ontology in detail, and Section 5.4 discusses the evaluation.

5.1 Ontology Design Methodology

The job description ontology has been designed by following the Uschold and King Enterprise Methodology (Uschold and King, 1995), which describes in detail how to build an ontology effectively and efficiently. The designed ontology is intended for ontology developers and engineers. The Uschold and King Enterprise Methodology comprises the following steps.


1. Define purpose: The purpose defines the scope and granularity of the ontology. Aspects to cover here are vocabulary definition, meta-level specification, and ontology re-use.

2. Build ontology: During this step, the ontology developer/engineer identifies the key concepts and their relationships, defines the ontology in a formal language, and, if required, integrates existing ontologies.

3. Document ontology: This means formally documenting the ontology in some language, e.g., RDF/RDF(S) or OWL. Formally defining the ontology facilitates ontology sharing among the community.

4. Evaluate ontology: This is the process of measuring the performance of an ontology. During the evaluation, all requirement specifications are carefully examined with respect to the ability of the ontology to answer questions for the purpose for which it was built.

Figure 5.1: Uschold and King enterprise methodology

Fig 5.1 presents a pictorial representation of the Uschold and King Enterprise Methodology. Section 5.3 discusses each step of the methodology in detail with respect to the job description ontology.


5.2 Ontology Expressiveness

Expressiveness 1 is a way to define a concept more effectively. Ontology specification languages mainly focus on abstracting away from data structures and implementations. They mostly focus on the semantic level of information rather than the logical/physical level of information.

Table 5.1: DL basic expressive labels along with details

Label   Description
AL      Attributive language. This is the base language which allows:
        1. Atomic negation (negation of concept names that do not appear on the left-hand side of axioms)
        2. Concept intersection
        3. Universal restrictions
        4. Limited existential quantification
FL      Frame based description language allows:
        • Concept intersection
        • Universal restrictions
        • Limited existential quantification
        • Role restriction
EL      Existential language allows:
        • Concept intersection
        • Existential restrictions (of full existential quantification)

An ontology's expressiveness is defined using Description Logic (DL) 2, which is more expressive than propositional logic 3 but less expressive than First-Order Logic 4 (FOL). DL uses a different formalism and naming convention than FOL.

Table 5.1 shows the basic DL expressiveness labels and Table 5.2 shows the DL extension labels along with their details. Each label encodes the expressiveness it adds.

1 https://www.dictionary.com/browse/expressiveness
2 https://en.wikipedia.org/wiki/Description_logic
3 https://en.wikipedia.org/wiki/Propositional_calculus
4 https://en.wikipedia.org/wiki/First-order_logic

Table 5.2: DL extension expressivity labels along with details

Label   Description
F       Functional properties.
E       Full existential qualification.
U       Concept union.
C       Complex concept negation.
H       Role hierarchy.
R       Complex role inclusion.
O       Nominals.
I       Inverse properties.
N       Cardinality restrictions.
Q       Qualified cardinality restrictions.
(D)     Use of datatype properties, data values or data types.

Based on the DL expressiveness labels, the job description ontology has an expressiveness of ALCHOF(D).
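The label ALCHOF(D) can be read off Tables 5.1 and 5.2 one symbol at a time. The Python sketch below performs this naive lookup; it is an illustration only, the function name `decode_expressiveness` is hypothetical, and the sketch does not handle DL naming conventions beyond the labels listed in the two tables.

```python
# Labels copied from Tables 5.1 and 5.2.
BASE_LANGUAGES = {
    "AL": "Attributive language",
    "FL": "Frame based description language",
    "EL": "Existential language",
}
EXTENSIONS = {
    "F": "Functional properties",
    "E": "Full existential qualification",
    "U": "Concept union",
    "C": "Complex concept negation",
    "H": "Role hierarchy",
    "R": "Complex role inclusion",
    "O": "Nominals",
    "I": "Inverse properties",
    "N": "Cardinality restrictions",
    "Q": "Qualified cardinality restrictions",
    "(D)": "Use of datatype properties, data values or data types",
}


def decode_expressiveness(label):
    """Split a DL label into its base language and extension labels."""
    base = next((b for b in BASE_LANGUAGES if label.startswith(b)), None)
    if base is None:
        raise ValueError(f"unknown base language in {label!r}")
    parts = [(base, BASE_LANGUAGES[base])]
    rest = label[len(base):]
    while rest:
        key = "(D)" if rest.startswith("(D)") else rest[0]
        if key not in EXTENSIONS:
            raise ValueError(f"unknown extension {key!r} in {label!r}")
        parts.append((key, EXTENSIONS[key]))
        rest = rest[len(key):]
    return parts


for key, meaning in decode_expressiveness("ALCHOF(D)"):
    print(key, "-", meaning)
```

Read this way, ALCHOF(D) is the attributive language extended with complex concept negation, role hierarchy, nominals, functional properties, and datatype properties.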

5.3 Job Description Ontology

5.3.1 Identify Purpose

The job description ontology aims to provide granular details of the concepts and their relationships. It also provides concept sub-classifications for a better understanding at a granular level. In addition, its homogeneous and comprehensive schema resolves the semantic heterogeneity that exists in describing job descriptions and provides a common ground for sharing them. The job description ontology serves multiple purposes:

1. Defines the semantic structure for job description


2. Provides common grounds for knowledge sharing

3. Provides concept hierarchies using generalization/specialization

4. Defines the relationship among concepts for context building

Existing e-recruitment platforms, such as Indeed 5, Monster 6, Personforce 7, Angel.co 8, LinkedIn 9, Career Builder 10, Glassdoor 11, SimplyHired and many others, use heterogeneous schemas for describing a job description. No interoperability of concepts exists among these schemas. The job description ontology provides a homogeneous and comprehensive schema for representing the concepts and relationships, and for sharing job descriptions among various e-recruitment platforms.

5.3.2 Build Ontology

The job description ontology is motivated by the schema.org model of a job position. The job position schema outlines the elements that must exist when defining a job position. In the schema.org model, the elements are not linked together; instead, they are presented in a flat structure. In addition, essential elements, such as education, requirements, and responsibilities, are just plain text instead of being defined by explicit concepts, such as requirements being defined by skills, experience and expertise levels. The improved and comprehensive model of the job description ontology is the result of an in-depth study of the HR recruitment process domain and a review of ontology design methodologies.

5 https://www.indeed.com
6 https://www.monster.com/
7 https://www.personforce.com/
8 https://angel.co/
9 https://www.linkedin.com
10 https://www.careerbuilder.com/
11 https://www.glassdoor.com/


Figure 5.2: Job description ontology

The proposed job description ontology, shown in Fig 5.2, has two logical divisions: (1) job description, and (2) job position. Job Description and Job Position are the two main concepts of the ontology. A job description defines aspects related to the core job, such as basic information (e.g., job title), requirements, responsibilities, and education. A job position comprises the organization and the job opening, with elements such as post date, last-apply date, organization name, and available positions. The primary advantage of designing the ontology in this way is that the same job description can be posted multiple times with significant variations in the values of the properties of the job position. One job description can thus be posted in multiple job positions, which increases its reusability.
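The separation between a job description and a job position can be mirrored with two data classes. The Python sketch below is a loose analogy to the ontology, not a representation of it: the class and field names are assumptions chosen to resemble the concepts named above, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Requirement:
    skill: str
    expertise_level: str
    mandatory: bool = True


@dataclass
class JobDescription:
    """Core job: reusable across several postings."""
    title: str
    requirements: List[Requirement] = field(default_factory=list)
    responsibilities: List[str] = field(default_factory=list)
    education: List[str] = field(default_factory=list)


@dataclass
class JobPosition:
    """One posting of a job description by an organization."""
    description: JobDescription
    hiring_organization: str
    date_posted: str
    positions: int = 1


software_engineer = JobDescription(
    title="Software Engineer",
    requirements=[Requirement("Java", "Expert")],
    education=["BS"],
)
# The same description is posted twice with different position properties.
first = JobPosition(software_engineer, "Example Org A", "2019-01-15", positions=2)
second = JobPosition(software_engineer, "Example Org B", "2019-03-01")
print(first.description is second.description)
```

Both postings refer to the same description object, so a change to the core job is made in one place while the posting details vary freely.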

5.3.2.1 Basic Information

Basic ontology concepts are derived from basic entities, as discussed in Section 4.4.2.1.

These concepts include job title, career level, type, experience, salary and others,

as shown in Table 5.3.

Table 5.3: Important concepts in job description ontology

Concept                  Values
title                    Software Engineer, Web Developer.
occupationalCategory     Internet, Health-care.
careerLevel              Manager, Entry Level.
type                     Full Time, Permanent.
experienceRequirements   1 Year, 2 Years.
salary                   It can be monthly or hourly.

Figure 5.3: Job description basic entities as N3 notation


Fig 5.3 shows an N3 representation of the basic entities in the job description ontology.

5.3.2.2 Requirements

Requirements are contextual entities, as discussed in the previous chapter. Table 5.4 shows the details of the requirements present in the job description ontology. Requirements define the baseline criteria against which the employer judges a candidate for acceptance into the organization.

Table 5.4: Job requirements properties and description

Concept         Values
skill           Java, HTML 5.
expertiseLevel  Expert, Novice.
mandatory       True or False.

Fig 5.4 shows an N3 representation of the requirements contextual entity in the job description ontology.

Figure 5.4: Job description requirements entity as N3 notation


5.3.2.3 Responsibilities

Responsibilities are the duties that employees perform on a job, such as:

1. responsible for managing daily sales

2. make daily reports and dispatch them to the headquarters

Fig 5.5 shows an N3 representation of the responsibilities contextual entity in the job description ontology.

Figure 5.5: Job description responsibilities as N3 notation

5.3.2.4 Education

Education defines the minimum qualification required by a job. Table 5.5 shows the properties associated with the education concept in the job description ontology.

Table 5.5: Job education properties and description

Concept                  Values
educationTitle           BS, MS.
educationType            Degree, Certificate.
postEducationExperiance  1 year, 2 years.

Fig 5.6 shows an N3 representation of the education entity in the job description ontology.


Figure 5.6: Job description education as N3 notation

A job position defines the actual job advertised for hiring candidates. The information associated with a job position is shown in Table 5.6.

Table 5.6: Job position properties and descriptions

Concepts            Values
hiringOrganization  Google Inc.
datePosted          Morning, Evening.
jobLocation         9 - 5, 10 - 7.
positions           1, 2.
workingShift        morning, night.
workTiming          9 - 6, 17 - 4.


Fig 5.7 shows an N3 representation of the job position in the job description ontology.

Figure 5.7: Job description profile as N3 notation

5.3.3 Ontology Development and Documentation

The job description ontology is developed using the Protégé tool. Fig 5.8 shows a sample N3 representation of the job description ontology. The ontology is documented using annotation properties in Protégé that include owl:versionInfo, rdfs:comment, rdfs:label, rdfs:seeAlso, owl:priorVersion and other properties.


Figure 5.8: A sample N3 representation of the job description ontology

5.4 Evaluation

The job description ontology is evaluated using two approaches: (1) domain coverage and (2) application-driven evaluation. In the e-recruitment domain, there is no gold standard for evaluating a job description ontology, nor is there an existing application in which it could be reused.


5.4.1 Domain Coverage

Domain coverage evaluation compares the concepts of the job description ontology against concepts from existing models in the same domain. The comparison was made against schema.org, Indeed, HR-XML and PROSPECT, and was based on how well each model defines the domain. Table 5.7 shows the comparison of the job description ontology with the other existing domain models. The job description ontology has a coverage of 96%, whereas schema.org has 75%, Indeed has 38%, PROSPECT has 21% and HR-XML has 46%, measured over the concepts identified for evaluation in consultation with HR experts. The comparison shows that the job description ontology covers more of these concepts than any other model.
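The reported coverage figures are consistent with a simple ratio of covered concepts to evaluated concepts. The following minimal sketch assumes the 24 concepts listed as rows of Table 5.7; the per-model counts of covered concepts are inferred here from the reported percentages and are not stated in the text, and the name `coverage_percent` is purely illustrative.

```python
import math

# Number of concepts used for the comparison (rows of Table 5.7).
TOTAL_CONCEPTS = 24

# Covered-concept counts per model. These are ASSUMED values, inferred from
# the percentages reported in the text; they are not stated in the source.
COVERED = {
    "Job Description ontology": 23,
    "schema.org": 18,
    "Indeed": 9,
    "PROSPECT": 5,
    "HR-XML": 11,
}


def coverage_percent(covered: int, total: int = TOTAL_CONCEPTS) -> int:
    """Return coverage as a percentage, rounded half up to a whole number."""
    if total <= 0:
        raise ValueError("total must be positive")
    return math.floor(100 * covered / total + 0.5)


if __name__ == "__main__":
    for model, covered in COVERED.items():
        print(f"{model}: {coverage_percent(covered)}%")
```

Under these assumed counts, the computed percentages agree with the values quoted above.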

Besides comparing the ontology concepts with those of existing models, the ontology was also presented to 6 HR experts from various national and international organizations for feedback. They gave positive feedback in terms of ontology comprehensiveness and domain representation.

5.4.2 Application based Evaluation

An application is built on top of the job description ontology for the application based evaluation. The evaluation analyzes how well the ontology captures real-world scenarios. The application was developed in Java, with GraphDB as the graph store.

The evaluation is performed on a data-set of 101 job descriptions collected from 15 different domains, such as information technology, management, finance and others, to ensure that the ontology captures all domains. The job descriptions were stored in the knowledge base using the application. Queries were then executed on the knowledge base and the results were analyzed. Table 5.8 shows the queries used in the evaluation.


Table 5.7: Domain coverage of job description ontology

Column headings: Concepts, Schema.org, Job Description, Indeed.com, PROSPECT, HR-XML (an x marks a concept covered by a model)

Concept                     Marks (x) across the five models
baseSalary                  4
datePosted                  4
educationRequirements       4
degreeRequirements          1
certificationRequirements   1
trainingRequirements        1
diplomaRequirements         1
employmentType              3
experienceRequirements      5
hiringOrganization          4
incentiveCompensation       2
jobBenefits                 2
industry                    2
jobLocation                 4
occupationalCategory        4
qualifications              2
responsibilities            2
salaryCurrency              2
skills                      4
specialCommitments          1
title                       5
validThrough                3
workHours                   3
skillsExpertiseLevel        2

Table 5.8: Job description evaluation queries categorization

Job Titles        Requirements             Career Level
Product Manager   Word with 3+ years       Manager
Full Time Writer  Microsoft with 3+ years  Director
CTO               MySQL                    Entry Level
Account Manager   AJAX                     Executive
Program Manager   Java                     Management

For each search query, the jobs were manually categorized and counted. The same queries were then applied to the job descriptions graph store. The queries are written in SPARQL, as shown in Fig 5.9. The primary purpose is to evaluate system precision: the application compares the number of job descriptions it retrieves with the number of manually retrieved job descriptions. Table 5.9 shows the aggregated number of results for each query category.

Figure 5.9: A sample SPARQL query to retrieve job title labels after execution

The results are promising: as shown in Table 5.9, the number of job descriptions retrieved by the designed application matches the number retrieved manually in every category.

Table 5.9: Job Description user retrieval summary

Category      Manual  System Retrieved
Job Titles    25      25
Requirements  33      33
Career Level  45      45
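The count-level comparison summarized in Table 5.9 can be sketched as follows. This is only an illustration of the comparison, using the aggregated counts from the table; the function name `count_agreement` is hypothetical. Matching counts are a necessary but not a sufficient condition for perfect precision, so a stricter check would compare the identifiers of the retrieved job descriptions rather than their counts.

```python
# Aggregated result counts per query category, as reported in Table 5.9.
MANUAL = {"Job Titles": 25, "Requirements": 33, "Career Level": 45}
SYSTEM = {"Job Titles": 25, "Requirements": 33, "Career Level": 45}


def count_agreement(manual: dict, system: dict) -> dict:
    """Return the ratio of system-retrieved to manually retrieved counts per category."""
    return {category: system[category] / manual[category] for category in manual}


if __name__ == "__main__":
    for category, ratio in count_agreement(MANUAL, SYSTEM).items():
        print(f"{category}: {ratio:.2f}")
```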


5.5 Summary

The current chapter presented the proposed job description ontology for the e-recruitment domain. The ontology design is inspired by the job position schema from schema.org 12. The ontology separates two key concepts: job description and job position. The job description ontology was evaluated for domain coverage. Alongside domain coverage, an application-based evaluation was also carried out. For the domain coverage method, the base criteria were a set of concepts identified by a team of HR experts. The application based evaluation was carried out by designing a small in-house application for storing and retrieving job descriptions using the ontology model.

12 https://schema.org/JobPosting

Chapter 6

Sem-QA Framework

This chapter presents Sem-QA, a semantic query translation framework. Sem-QA translates natural language queries for searching machine-understandable data. Natural language queries make it easy for end users to define their search requirements in the format in which they can best describe them. The solution handles all the underlying transformation and search complexities.

6.1 The Sem-QA Framework

Sem-QA (Semantic Query Translation Framework) is a comprehensive solution for the transformation of natural language queries (NLQ) into a machine-understandable format. Its most differentiating features are (i) the use of atomic filtering constraints to generate SPARQL query triple patterns without depending on a back-end question database and (ii) the semantic association of the generated triple patterns according to user intent, for dynamic generation of complex SPARQL queries.

The research contributions of the presented technique are as follows:

• flexibility to handle grammatically incorrect and incomplete queries

• use of atomic question patterns to dynamically generate complex question patterns

• semantic processing of broadening and narrowing terms, for instance: at least, at most and 3+

• production of structured output per user demands

Sem-QA transforms NLQ into SPARQL queries that are executed over machine-understandable data. It has three modules: (i) Linguistic Analysis, (ii) Query Template Matching and (iii) SPARQL Query Generation. The proposed technique is extensively tested using two different data sets: (i) the Mooney job data set (Mooney, 2016) and (ii) queries posted on the Personforce (Personforce, 2016) job portal by different job seekers. The evaluation results show that the proposed methodology successfully translates user queries into valid SPARQL with high accuracy.

The rest of the chapter is structured as follows: Section 6.2 discusses linguis-

tic analysis on user queries, Section 6.3 discusses template matching, Section 6.4

discusses the SPARQL query generation, Section 6.5 discusses a working example,

and Section 6.6 shows the evaluation of the system.

6.2 Semantic Linguistic Analysis

Linguistic analysis is a process of understanding the text. It covers sentence identification, word/token identification, lemmatization and cleaning, and part-of-speech (POS) tagging. A detailed discussion of linguistic analysis has already been provided in Section 4.4.1. The only significant difference between the analysis in Section 4.4.1 and the current analysis is the type of text analyzed. The NLQ posted by users are not very long, for example mobile development jobs or jobs in new york. Besides being short, they may also have one or more of the following weaknesses: grammatical and spelling mistakes, incomplete questions, use of jargon, and bare keywords typed in place of a question. Fig 6.1 shows sample queries.

Figure 6.1: A set of sample queries from Mooney data set

To serve a larger group of users, the Linguistic Analysis module of Sem-QA handles most of the errors mentioned above. It uses a dictionary for mapping question words (entities) to similar ontology concepts. NL questions are short, have no significant pattern for entity detection and offer little contextual information; therefore, the designed dictionary is used to identify entities that cannot be identified using pattern analysis. The dictionary for NLQ consists of nearly 80 rule files used for the detection of different entities, including Career Level, Expertise Level, Organization, Job Title, Skills, Experience, Job Type, Job Category, Person Name and many more. Table 6.1 shows examples of entities extracted from job queries.

Table 6.1: Examples of entities detected from natural language job queries

Sample Query                             Detected Entities
Are their any jobs for odbc specialist?  Skill: odbc; Expertise Level: specialist
Java jobs in houston                     Skill: Java; Place: Houston
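The dictionary lookup illustrated in Table 6.1 can be sketched in a few lines. This is a simplified illustration, not the Sem-QA implementation: the real dictionary is described as nearly 80 rule files, whereas the four entries in `DICTIONARY` and the function `detect_entities` below are hypothetical stand-ins that map a lower-cased query token to an ontology concept and a canonical value.

```python
import re

# Hypothetical miniature dictionary: surface term -> (ontology concept, value).
# The entries are illustrative only; they mirror the examples in Table 6.1.
DICTIONARY = {
    "java": ("Skill", "Java"),
    "odbc": ("Skill", "odbc"),
    "specialist": ("Expertise Level", "specialist"),
    "houston": ("Place", "Houston"),
}


def detect_entities(query: str) -> list[tuple[str, str]]:
    """Return (concept, value) pairs for every dictionary term found in the query."""
    # Tokenize on letters, digits, '+' and '#' so that terms such as c++ survive.
    tokens = re.findall(r"[a-z0-9+#]+", query.lower())
    return [DICTIONARY[token] for token in tokens if token in DICTIONARY]


if __name__ == "__main__":
    print(detect_entities("Java jobs in houston"))
    print(detect_entities("Are their any jobs for odbc specialist?"))
```

Because the lookup works per token, it tolerates ungrammatical or keyword-only queries, which is the property the text attributes to the dictionary-based approach.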


6.3 Query Template Matching

This module is designed using the pattern matching approach discussed in (Oscar et al., 2009). Under this approach, each question is composed of Filter Constraints (FC) and Desired Information (DI). An FC specifies a user priority for a job search, while the DI specifies the search intent. A question may have no FC, for instance, All the jobs please, while another query may have multiple FC, such as Are there any jobs at dell that require no experience and pay 50000. Correct answering therefore demands: (i) identification of all user-specified FC, (ii) correct association of FC while generating a formal query, (iii) identification of negation of an FC and (iv) special processing of broadening and narrowing terms, for instance, at least, at most and 3+. The Mooney (Mooney, 2016) and Personforce (Personforce, 2016) data-sets have been analyzed to cater for the maximum possible number of FC. All atomic FC from the data set queries have been identified and converted into generalized expressions, called FC Expressions (FCExp). An FCExp is a well-defined generalized expression that specifies an atomic filter constraint in terms of ontology concepts. A user question may consist of any number of FC. The template matcher maps each FC in the input text to an FCExp. Instead of generating and storing all possible combinations of FCExp, complex FCExp are dynamically generated using the existing atomic FCExp.
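Requirement (iv), the special processing of broadening and narrowing terms, can be illustrated with a small sketch. The mapping of at least and N+ to `>=` and of at most to `<=` is an assumption about the intended semantics, and the variable name `?experience` and the function `scope_filter` are hypothetical; the text does not specify how Sem-QA encodes these terms.

```python
import re

# ASSUMED mapping from broadening/narrowing terms to SPARQL comparison operators.
SCOPE_PATTERNS = [
    (re.compile(r"\bat least (\d+)\b"), ">="),
    (re.compile(r"\bat most (\d+)\b"), "<="),
    (re.compile(r"\b(\d+)\+"), ">="),
]


def scope_filter(query: str, variable: str = "?experience"):
    """Return a SPARQL FILTER for the first scope modifier found, or None."""
    text = query.lower()
    for pattern, operator in SCOPE_PATTERNS:
        match = pattern.search(text)
        if match:
            return f"FILTER ({variable} {operator} {int(match.group(1))})"
    return None


if __name__ == "__main__":
    print(scope_filter("Word with 3+ years"))
    print(scope_filter("jobs that need at most 5 years"))
    print(scope_filter("All the jobs please"))
```

A query without any scope modifier yields no filter, which corresponds to a question with a null FC.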

6.4 Query Generation

This module is designed using the pattern matching approach discussed in Section 6.3 and the dynamic triple generation approach of (Lopez et al., 2005) and (Wang et al., 2007). It maps the previously identified FCExp to FC Templates (FCTemp). Each FCExp is bound to an FCTemp. An FCTemp is a string template specifying a SPARQL group graph pattern, in which ontology concepts stand for the subject and object parts of the triple patterns. The template takes as an argument a hash map of key-value pairs. The keys are ontological concepts, such as skill, job title, city and country, and the values are those identified during the semantic linguistic analysis phase. In the invoked string template, the ontological concepts (keys) are replaced with their actual values to generate question-specific triple patterns. In this manner, an FCTemp is invoked for every identified FCExp to generate the SPARQL query triple patterns. These triple patterns are then joined using the SPARQL operator that is semantically equivalent to the user-specified connecting words.
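The template invocation described above can be sketched with Python string templates. The `jdo:` property names follow the patterns visible in Fig 6.2, but the exact template text, the `FC_TEMPLATES` table and the function `build_where` are assumptions made for illustration, and only the dot operator is shown as the joining operator.

```python
from string import Template

# Illustrative FC templates keyed by ontology concept. The jdo: property names
# follow Fig 6.2; the exact template text used by Sem-QA is an assumption.
FC_TEMPLATES = {
    "Skill": Template(
        '?jID jdo:hasSkill ?skill . FILTER regex(str(?skill), "$Skill", "i")'
    ),
    "ExpertiseLevel": Template(
        "?jID jdo:hasExpertiseLevel ?expertise . "
        'FILTER regex(str(?expertise), "$ExpertiseLevel", "i")'
    ),
}


def build_where(entities: dict) -> str:
    """Invoke one template per detected concept and join the patterns with '.'."""
    patterns = [
        FC_TEMPLATES[concept].substitute(entities)
        for concept in entities
        if concept in FC_TEMPLATES
    ]
    return "WHERE { " + " . ".join(patterns) + " }"


if __name__ == "__main__":
    # Hash map of ontology concepts (keys) to values found by linguistic analysis.
    print(build_where({"Skill": "java", "ExpertiseLevel": "expert"}))
```

Values are substituted verbatim in this sketch; a value containing regular-expression metacharacters, such as c++, would need escaping before being placed inside `regex(...)`.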

6.5 Working Example

The working of the proposed technique is demonstrated with an example query from the Mooney data set. The input query q1 is: List the companies that desire 'c++' experience?. Initially, q1 undergoes Semantic Linguistic Analysis: it is annotated, checked for negations and special words, and then its FC and DI are determined. In the example query, the FC are [Skill] = [c++] and [ExpertiseLevel] = [not null], while the DI is [Organization]. In the Query Template Matching phase, a matching FCExp is searched for each of the two atomic FC found in the input query, as shown in Fig 6.2. The matched FCExp are used to invoke FCTemp. The query under discussion invokes three FCTemp to form the WHERE clause of the SPARQL query. Along with invoking FCTemp, the SPARQL Query Generator also looks for the appropriate association operators. In q1, the user is looking for an [Organization] that satisfies two atomic FC: [Skill] = [c++] and [ExpertiseLevel] = [not null]. Therefore, the generated triple patterns are associated using the SPARQL dot operator. After generating the WHERE clause, the generator adds the SELECT clause to the final SPARQL query. The SELECT clause is generated using the DI: [Organization]. The generated FCExp and FCTemp are shown in Fig 6.2.

[Figure 6.2 diagram, text content]
User Query: q1 = List the companies that desire 'c++' experience?
Annotated NLQ: DI = [Organization: ?], FC = [Skill: c++, ExpertiseLevel: not mentioned]
Matched Template: Organization: ?, Skill: c++, ExpertiseLevel: not null
SPARQL Query Generation:
    SELECT ?organization
    ?jID jdo:jobTitle ?title .
    ?jID jdo:publishedBy ?organization .
    ?jID jdo:hasSkill ?skill .
    FILTER regex(str(?skill), "<entity_list.Skill>", "i" )
    ?jID jdo:hasExpertiseLevel ?expertise .
    FILTER (bound(?expertise))
Semantic Job Store: [job] [organization] [JobTitle] [WorkLocation] [ExpertiseLevel]
Structured Results: IBM, Apple, Microsoft

Figure 6.2: A sample query processing representation of Mooney data set


6.6 Evaluation and Results

6.6.1 Experimental Setup

The SPARQL queries generated by Sem-QA are compared with manually generated SPARQL queries written by domain experts. A translated query is considered correct if it is equivalent to the manually generated one. Tests have been conducted to evaluate the correctness and efficacy of Sem-QA translation. The data sets and evaluation results are discussed in Subsections 6.6.2 and 6.6.3.

6.6.2 Data Set Specification

Sem-QA evaluation is performed on two data-sets. The first is a standard benchmark data set (Mooney, 2016) and the second consists of real-world user queries posted on the Personforce (Personforce, 2016) job portal. Fig 6.1 shows sample queries. Table 6.2 shows the number of queries in each data set. Both data-sets contain sample job queries in plain English.

Table 6.2: Query count for Mooney and Personforce data set

Data-set     Total Queries
Mooney       620
Personforce  500

Most NLI based QA systems have not discussed the processing of question words such as at least, negation, at most, each, outside and inside, known as scope specifiers or scope modifiers. We performed a statistical analysis on the Mooney job data set. It shows that questions with some scope modifier constitute 16.29% of the total job-related queries. Although they are not the major part of job-related questions, scope modifiers occur often enough in NL questions that they cannot be ignored. According to the Mooney geographical data set statistics discussed in (Cimiano and Minock, 2010), one-quarter of the questions are missed by NLI based QA systems because of missing or insufficient processing of scope modifiers. Therefore, along with the other 83.71% of questions, Sem-QAS pays particular attention to the processing of the 16.29% of questions that involve scope modifiers.

Another less studied aspect is the correct association of multiple FCs, although the precision of the result depends on it. In the question Show me jobs using lisp that require a bscs and desire a msee, the basic filtering criterion is bscs, while msee needs to be added as an OPTIONAL condition. If the second filtering constraint msee is associated with the first constraint bscs using AND, the query will miss all jobs that require only bscs and do not mention msee as a requirement. An incorrect association of filtering constraints may therefore cost the user opportunities, which is unacceptable. To show the correct association of FCs in Sem-QAS, we have translated questions involving multiple FCs that are associated using different operators.
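The difference between the two associations can be seen on a toy example. The sketch below does not run SPARQL; it mimics the effect of a conjunctive (AND) association and of an OPTIONAL association over two hypothetical job records, with invented identifiers and field names.

```python
# Hypothetical job records: j1 desires an msee, j2 does not mention one.
JOBS = [
    {"id": "j1", "skill": "lisp", "required": "bscs", "desired": "msee"},
    {"id": "j2", "skill": "lisp", "required": "bscs", "desired": None},
]


def match_and(jobs: list[dict]) -> list[str]:
    """Treat 'desire a msee' as mandatory: all three constraints must hold."""
    return [
        job["id"]
        for job in jobs
        if job["skill"] == "lisp"
        and job["required"] == "bscs"
        and job["desired"] == "msee"
    ]


def match_optional(jobs: list[dict]) -> list[tuple]:
    """Treat 'desire a msee' as OPTIONAL: keep the job, report msee when present."""
    return [
        (job["id"], job["desired"])
        for job in jobs
        if job["skill"] == "lisp" and job["required"] == "bscs"
    ]


if __name__ == "__main__":
    print(match_and(JOBS))
    print(match_optional(JOBS))
```

The conjunctive version drops j2, the job that requires only a bscs, whereas the optional version keeps it with the desired degree left unbound, which is the behavior the text argues for.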

Table 6.3: Query Categorization based on number of Filter Constraints

Category  Category Description
QC 1      Questions without any FC
QC 2      Questions with single FC
QC 3      Questions with 2 different FCs, both of them are mandatory
QC 4      Questions with 3 different FCs, 2 are mandatory, and the 3rd FC is OPTIONAL
QC 5      Questions with 3 different FCs, 1 is mandatory, and the other 2 are associated with 1st one using OR

Table 6.3 shows categorization of evaluation queries based on number of Fil-

ter Constraints. Experimental results for the queries mentioned in Table 6.3 are

discussed in Section 6.6.4.


6.6.3 Evaluation Results

Correctness is measured in terms of system recall and precision. Recall is the ratio of correctly answered questions to the total number of questions in the data-set, while precision is the number of correctly answered questions divided by the total number of questions answered by the system (Damljanovic et al., 2010).

Table 6.4: Comparative analysis of Mooney and Personforce data-set

Data set                  Total Queries  Sem-QAS Translated Queries  Precision  Recall  F1-measure
Mooney Job Data Set       620            619                         100        99.84   1
Personforce Job Data Set  500            500                         100        100     1

The F1-measure results for the two data sets, shown in Table 6.4, support the correctness of the Sem-QAS technique.
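The values in Table 6.4 follow from the definitions of precision and recall given above. The sketch below recomputes them under one assumption that is implied by the reported 100% precision but not stated explicitly: every query translated by the system was translated correctly. The function name `prf` is illustrative.

```python
def prf(total: int, answered: int, correct: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions between 0 and 1."""
    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Counts from Table 6.4; 'correct' equals 'answered' by ASSUMPTION.
    for name, total, answered in [("Mooney", 620, 619), ("Personforce", 500, 500)]:
        p, r, f1 = prf(total, answered, correct=answered)
        print(f"{name}: precision={p:.2%} recall={r:.2%} F1={f1:.4f}")
```

For the Mooney data set this gives a recall of 619/620, about 99.84%, and an F1-measure of roughly 0.999, which rounds to the value of 1 reported in the table.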

6.6.4 System Performance for Semantic Association of Atomic FC

Another important feature of an NL based QA system is translation efficiency; it must perform considerably well for complex NL questions. The number of FCs is a measure of input query complexity, and an NLI based QA system must maintain efficacy while processing complex queries. To show this aspect of Sem-QAS, the five query categories discussed earlier are translated. Each category involves queries of varying complexity with different association operators. The values shown in Fig 6.3 represent the mean processing time of the different queries belonging to the same category. The results indicate that translation time remains almost linear irrespective of the increasing complexity of the input query.


[Figure 6.3 plot: x-axis Queries (QC_1, QC_2, QC_3, QC_4, QC_5); y-axis Time (Millisecond), ticks from 85 to 95]

Figure 6.3: Time comparison between various Filter Constraints queries

6.7 Summary

The current chapter discussed the solution for the transformation of NLQ into SPARQL queries. The solution uses query pattern templates and dynamic invocation of string templates for answering natural language questions from an underlying RDF store. Its distinguishing features are the special processing of scope modifiers and the use of atomic filtering constraints to generate complex queries. The query matching and dynamic query generation techniques are evaluated using two data sets. The results show a high recall and precision of the proposed technique.

Chapter 7

Conclusion

This chapter summarizes the extraction, enrichment and transformation framework called SExEnT. SExEnT extracts knowledge from the unstructured text of job descriptions, enriches entities and compound words, builds context, transforms them into a machine-understandable format and stores them in a knowledge base. Besides this, SExEnT also transforms natural language English queries into a machine-understandable format so that jobs can be retrieved.

The chapter is organized as follows: Section 7.1 summarizes the research, Section 7.2 lists application areas, Section 7.3 discusses research contributions and Section 7.4 discusses limitations and future work.

7.1 Research Description

In this research, an extraction and transformation methodology for the identification of entities and compound words from a job description is proposed for building information context. The proposed framework extracts context-aware information from a job description by exploiting Linked Open Data. The extracted information is represented using a comprehensive proposed web ontology for job descriptions in e-recruitment in order to resolve semantic heterogeneity. Alongside resolving semantic heterogeneities, an enrichment methodology is proposed for enriching entities and compound words from Linked Open Data to cater for data staleness in e-recruitment. The enrichment process enriches the extracted entities and builds context between them to minimize the information loss in the extraction process. Additionally, the framework transforms job description Natural Language Queries (NLQ) from plain English text to a machine-understandable format. It combines various processes to achieve context-aware information extraction and enrichment from job descriptions in e-recruitment. The framework segments the text into predefined categories using a self-generated dictionary. The entities are extracted using Natural Language Processing (NLP) and the dictionary. The extracted entities are enriched using Linked Open Data, and job context is built using the job description domain ontology. The knowledge base stores the enriched and context-aware information using Linked Open Data principles 1. The user searches the context-aware information stored in the knowledge base using NLQ. The transformation process encapsulates various processes together to achieve machine-understandable and context-aware information.

The evaluation has been performed on a data-set of 860 jobs, verified by HR experts. Initially, a comparison of manually verified data and system-extracted entities was carried out. The SAJ framework achieved an overall F1-measure of 87.83%. In comparison with other techniques, OpenCalais and Alchemy API, SAJ performed the best. OpenCalais was able to extract job titles and job requirements, while Alchemy API was only able to extract job titles, as it is a domain dependent entity. SAJ can facilitate searching and retrieval, scoring and ranking of job candidates. The evaluation of the transformation from NLQ to SPARQL has been performed on two data-sets, i.e., the Mooney data-set and the job portal data-set. The results are promising and show a high recall and precision of the proposed technique.

1 https://www.ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/

7.2 Application Areas

The proposed SExEnT framework can also be applied to the following domains other than e-recruitment:

1. Legal domain, such as court orders, legal proceedings.

2. Health-care data, such as textual discharge summaries.

3. Scientific documents, such as research articles, reports.

7.3 Research Contributions

The current research has the following contributions:

1. A framework, named SAJ, for extraction, transformation and context building of information from job descriptions. The core focus is to extract contextual entities, such as job requirements and job responsibilities, using a dictionary comprising JAPE rules and seed tokens.

2. A framework for enriching skill entities using Linked Open Data, a continuously growing data source managed by an open source community.

3. A job description ontology that defines concepts and the relationships among those concepts. The ontology provides hierarchical and associative relationships among the concepts.

4. A framework, named Sem-QA, for the transformation of natural language plain English queries into a machine-understandable format, i.e., SPARQL. The transformation helps a layman in searching machine-understandable data content.

7.4 Limitations and Future Work

The current work is unable to:

1. Automatically generate pattern/action extraction rules for unstructured text. The rules would be learned from the text based on some predefined features, such as word/token, POS tags, named entities and others.

2. Extend search query matching to profile matching for job recommendation based on profiles. A user profile would be used as the search query, which is more complex in nature than existing job search queries. The results produced would be ranked with respect to the user profile.

3. Generate a profile score based on matching with a job description. After matching a user profile with the job descriptions, ranked results would be retrieved, thus facilitating the decision process.

4. Automatically learn the extraction dictionary to enhance entity extraction. Automatic dictionary learning would enhance extraction by adding new rules learned from the text.

Bibliography

(2017). Introduction to the principles of linked open data. Accessed: 2018-12-01.

Ahmed, N., Khan, S., and Latif, K. (2016). Job description ontology. In Fron-

tiers of Information Technology (FIT), 2016 International Conference on, pages

217–222. IEEE.

Al-Yahya, M., Aldhubayi, L., and Al-Malak, S. (2014). A pattern-based approach

to semantic relation extraction using a seed ontology. In Semantic Computing

(ICSC), 2014 IEEE International Conference on, pages 96–99. IEEE.

Ali, F., Kim, E. K., and Kim, Y. (2015). Type-2 fuzzy ontology-based opinion min-

ing and information extraction: A proposal to automate the hotel reservation

system. Appl. Intell., 42(3):481–500.

Ameen, A., Khan, K. U. R., and Rani, B. P. (2012). Creation of ontology in edu-

cation domain. In 2012 IEEE Fourth International Conference on Technology

for Education, T4E 2012, Hyderabad, India, July 18-20, 2012, pages 237–238.

Arocena, G. O. and Mendelzon, A. O. (1999). WebOQL: restructuring documents,

databases, and webs. Theory and Practice of Object Systems, 5(3):127–141.

Awan, M. N. A. (2009). Extraction and generation of semantic annotations from

digital documents. Master’s thesis, NUST School of Electrical Engineering &

Computer Science.

Bernstein, A. and Kaufmann, E. (2006). Gino–a guided input natural language

ontology editor. In The Semantic Web-ISWC, pages 144–157. Springer.

Bhagia, L. (2015). The evolution of web technologies.

Bijalwan, V., Kumar, V., Kumari, P., and Pascual, J. (2014). Knn based machine

learning approach for text and document mining. International Journal of

Database Theory and Application, 7(1):61–70.

Bogh, C. (2012). The evolution of web technologies.

Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M. A., Maynard, D., and

Aswani, N. (2013). Twitie: An open-source information extraction pipeline for

microblog text. In RANLP, pages 83–90.

Boselli, R., Cesarini, M., Marrara, S., Mercorio, F., Mezzanzanica, M., Pasi, G.,

and Viviani, M. (2018). Wolmis: a labor market intelligence system for classi-

fying web job vacancies. Journal of Intelligent Information Systems, 51(3):477–

502.

Buttinger, C., Pröll, B., Palkoska, J., Retschitzegger, W., Schauer, M., and Immler,

R. (2008). Jobolize-headhunting by information extraction in the era of web 2.0.

In Proceedings of the 7th International Workshop on Web-Oriented Software

Technologies, IWWOST.

Candela, G., Escobar, P., and Marco-Such, M. (2017). Semantic enrichment on

cultural heritage collections: A case study using geographic information. In

Proceedings of the 2nd International Conference on Digital Access to Textual

Cultural Heritage, pages 169–174. ACM.

Cimiano, P. and Minock, M. (2010). Natural language interfaces: What is the

problem? a data-driven quantitative analysis. In Natural Language Processing

and Information Systems, volume 5723, pages 192–206. Springer Berlin Heidel-

berg.

Copestake, A. and Jones, K. S. (1990). Natural language interfaces to databases.

Number 187. Cambridge Univ Press.

Damljanovic, D., Agatonovic, M., and Cunningham, H. (2010). Natural lan-

guage interfaces to ontologies: Combining syntactic analysis and ontology-

based lookup through the user interaction. In The Semantic Web: Research

and Applications, pages 106–120. Springer.

Elkan, C. and Greiner, R. (2006). Building large knowledge-based systems: rep-

resentation and inference in the cyc project. Artificial Intelligence, (1):41–52.

Flesca, S., Masciari, E., and Tagarelli, A. (2011). A fuzzy logic approach to wrap-

ping pdf documents. Knowledge and Data Engineering, IEEE Transactions on,

23(12):1826–1841.

Frank, A., Krieger, H.-U., Xu, F., Uszkoreit, H., Crysmann, B., Jorg, B., and

Schafer, U. (2007). Question answering from structured knowledge sources.

Journal of Applied Logic, 5(1):20–48.

Fuchs, N. E., Kaljurand, K., and Schneider, G. (2006). Attempto controlled english

meets the challenges of knowledge representation, reasoning, interoperability

and user interfaces. In FLAIRS Conference, volume 12, pages 664–669.

Funk, A., Tablan, V., Bontcheva, K., Cunningham, H., Davis, B., and Handschuh,

S. (2007). Clone: Controlled language for ontology editing. In The Semantic

Web, pages 142–155. Springer.

Gautam, G. and Yadav, D. (2014). Sentiment analysis of twitter data using ma-

chine learning approaches and semantic analysis. In Contemporary computing

(IC3), 2014 seventh international conference on, pages 437–442. IEEE.

Geibel, P., Trautwein, M., Erdur, H., Zimmermann, L., Jegzentis, K., Bengner, M.,

Nolte, C. H., and Tolxdorff, T. (2015). Ontology-based information extraction:

Identifying eligible patients for clinical trials in neurology. J. Data Semantics,

4(2):133–147.

Gomez-Perez, A. (1996). Towards a framework to verify knowledge sharing tech-

nology. Expert Systems with Applications, 11(4):519–529.

Gomez-Perez, A. (1999). Ontological engineering: A state of the art. Expert

Update: Knowledge Based Systems and Applied Artificial Intelligence, 2(3):33–

43.

Gomez-Perez, A., Ramırez, J., and Villazon-Terrazas, B. (2007). An ontology for

modelling human resources management based on standards. In International

Conference on Knowledge-Based and Intelligent Information and Engineering

Systems, pages 534–541. Springer.

Graupner, S., Nezhad, H. R. M., and Basu, S. (2017). Generating machine-

understandable representations of content. US Patent 9,633,332.

Gregory, M. L., McGrath, L., Bell, E. B., O’Hara, K., and Domico, K. (2011). Do-

main independent knowledge base population from structured and unstructured

data sources. In Proceedings of the Twenty-Fourth International Florida Arti-

ficial Intelligence Research Society Conference, May 18-20, 2011, Palm Beach,

Florida, USA.

Gruninger, M. and Fox, M. (1995). Methodology for the design and evaluation of

ontologies. International Joint Conference on Artificial Intelligence (IJCAI95),

Workshop on Basic Ontological Issues in Knowledge Sharing.

Guenther, N., Schonlau, M., et al. (2016). Support vector machines. Stata Journal,

16(4):917–937.

Gupta, Y. (2016). Literature review on e-recruitment: A step towards paperless

hr. International Journal, 4(1).

Gutierrez, F., Dou, D., Fickas, S., Wimalasuriya, D., and Zong, H. (2016). A

hybrid ontology-based information extraction system. Journal of Information

Science, 42(6):798–820.

Jayram, T. S., Krishnamurthy, R., and Raghavan, S. (2006). Avatar Information

Extraction System. IEEE Data Engineering Bulletin, 29(1):40–48.

Karkaletsis, V., Fragkou, P., Petasis, G., and Iosif, E. (2011). Ontology Based

Information Extraction from Text. Knowledge-Driven Multimedia Information

Extraction and Ontology Evolution, 6050:89–109.

Kiryakov, A., Popov, B., Terziev, I., Manov, D., and Ognyanoff, D. (2004). Se-

mantic Annotation, Indexing, and Retrieval. Web Semantics: Science, Services

and Agents on the World Wide Web, 2(1):49–79.

Kolb, P. (2008). Disco: A multilingual database of distributionally similar words.

Proceedings of KONVENS-2008, Berlin, 156.

Li, X., Zhang, Y., Wang, J., and Pu, Q. (2016). A preliminary study of plant

domain ontology. In 2016 IEEE 14th Intl Conf on Dependable, Autonomic and

Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing,

2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and

Technology Congress, DASC/PiCom/DataCom/CyberSciTech 2016, Auckland,

New Zealand, August 8-12, 2016, pages 109–112.

Lopez, V., Pasin, M., and Motta, E. (2005). Aqualog: An ontology-portable ques-

tion answering system for the semantic web. In The Semantic Web: Research

and Applications, pages 546–562. Springer.

Malik, S. K., Prakash, N., and Rizvi, S. (2010a). Developing an university ontology

in education domain using protege for semantic web. International Journal of

Science and Technology, 2(9):4673–4681.

Malik, S. K., Prakash, N., and Rizvi, S. (2010b). Semantic annotation frame-

work for intelligent information retrieval using kim architecture. International

Journal of Web & Semantic Technology (IJWest), 1(4):12–26.

Maree, M., Kmail, A. B., and Belkhatir, M. (2018). Analysis and shortcom-

ings of e-recruitment systems: Towards a semantics-based approach addressing

knowledge incompleteness and limited domain coverage. Journal of Information

Science, page 0165551518811449.

Matthew Jeffery, A. M. (2011). A vision for the future of recruitment: Recruitment

3.0.

McConell, I. (2014). Web 3.0 and what it means for the future of recruitment.

Mingsheng, H., Zhijuan, J., and Xiangyu, Z. (2012). An approach for text extrac-

tion from web news page. In Robotics and Applications (ISRA), 2012 IEEE

Symposium on, pages 562–565. IEEE.

Mooney, R. (2016). Owl test data.

https://www.ifi.uzh.ch/en/ddis/research/talking/OWL-Test-Data.html.

Muller, H.-M., Kenny, E. E., and Sternberg, P. W. (2004). Textpresso: an ontology-

based information retrieval and extraction system for biological literature. PLoS

biology, 2(11).

Mykowiecka, A., Marciniak, M., and Kupsc, A. (2009). Rule-based information

extraction from patients’ clinical data. Journal of Biomedical Informatics,

42(5):923–936.

Nabeel Ahmed, Sharifullah Khan, K. L. A. M. (2008). Extracting semantic an-

notation and their correlation with document. In 4th International Conference

on Emerging Technologies., pages 32–37.

Oscar, F., Ruben, I., Sergio, F., and Jose, Luis, V. (2009). Addressing ontology-

based question answering with collections of user queries. Information Process-

ing & Management, 45(2):175–188.

Owoseni, A. T., Olabode, O., and Ojokoh, B. (2017). Enhanced e-recruitment

using semantic retrieval of modeled serialized documents.

Pattuelli, M. C. (2011). Modeling a domain ontology for cultural heritage re-

sources: A user-centered approach. J. Am. Soc. Inf. Sci. Technol., 62(2):314–

342.

Personforce (2016). User job queries. http://www.personforce.com/.

Philipp, C., Peter, H., Jorg, H., Matthias, M., and Rudi, S. (2008). Towards

portable natural language interfaces to knowledge bases: the case of the orakel

system. Data and Knowledge Engineering, 65(2):325–354.

Popescu, A.-M., Etzioni, O., and Kautz, H. (2003). Towards a theory of natu-

ral language interfaces to databases. In Proceedings of the 8th international

conference on Intelligent user interfaces, pages 149–157. ACM.

Popov, B., Kiryakov, A., Kirilov, A., and Manov, D. (2003). KIM: A Semantic

Annotation Platform. In International Semantic Web Conference, pages 834–

848.

Poria, S., Cambria, E., Ku, L., Gui, C., and Gelbukh, A. F. (2014). A rule-

based approach to aspect extraction from product reviews. In Proceedings of

the Second Workshop on Natural Language Processing for Social Media, So-

cialNLP@COLING 2014, Dublin, Ireland, August 24, 2014, pages 28–37.

Powers, D. M. (2011). Evaluation: from precision, recall and f-measure to roc,

informedness, markedness and correlation.

Ramakrishnan, C., Mendes, P. N., Wang, S., and Sheth, A. P. (2008). Unsuper-

vised discovery of compound entities for relationship extraction. In Gangemi,

A. and Euzenat, J., editors, Knowledge Engineering: Practice and Patterns,

pages 146–155, Berlin, Heidelberg. Springer Berlin Heidelberg.

Rocktaschel, T., Singh, S., and Riedel, S. (2015). Injecting logical background

knowledge into embeddings for relation extraction. In NAACL HLT 2015,

The 2015 Conference of the North American Chapter of the Association for

Computational Linguistics: Human Language Technologies, Denver, Colorado,

USA, May 31 - June 5, 2015, pages 1119–1129.

Roman, D., Kopecky, J., Vitvar, T., Domingue, J., and Fensel, D. (2015). Wsmo-

lite and hrests: Lightweight semantic annotations for web services and restful

apis. Web Semantics: Science, Services and Agents on the World Wide Web,

31:39–58.

Saggion, H., Funk, A., Maynard, D., and Bontcheva, K. (2007). Ontology-based

information extraction for business intelligence. In The Semantic Web, 6th

International Semantic Web Conference, 2nd Asian Semantic Web Conference,

ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007., pages 843–

856.

Sen, A., Das, A., Ghosh, K., and Ghosh, S. (2012). Screener: a system for extract-

ing education related information from resumes using text based information

extraction system. In International Conference on Computer and Software

Modeling, volume 54, pages 31–35.

Shahid, N., Khan, O. A., Anwar, S. K., and Pirzada, U. T. (2009). Rational

unified process. Online Notes on RUP.

http://ovais.khan.tripod.com/papers/Rational_Unified_Process.pdf.

Shin, J., Wu, S., Wang, F., De Sa, C., Zhang, C., and Re, C. (2015). Incre-

mental knowledge base construction using deepdive. Proceedings of the VLDB

Endowment, 8(11):1310–1321.

Silvello, G., Bordea, G., Ferro, N., Buitelaar, P., and Bogers, T. (2017). Seman-

tic representation and enrichment of information retrieval experimental data.

International Journal on Digital Libraries, 18(2):145–172.

Singh, A., Rose, C., Visweswariah, K., Chenthamarakshan, V., and Kambhatla, N.

(2010). Prospect: a system for screening candidates for recruitment. In Proceed-

ings of the 19th ACM international conference on Information and knowledge

management, pages 659–668. ACM.

Strzalkowski, T. and Harabagiu, S. M. (2006). Advances in open domain question

answering. Springer Heidelberg.

Tang, B., Kay, S., and He, H. (2016). Toward optimal feature selection in naive

bayes for text categorization. IEEE Transactions on Knowledge and Data En-

gineering, 28(9):2508–2521.

Thada, V. and Jaglan, V. (2013). Comparison of jaccard, dice, cosine similarity

coefficient to find best fitness value for web retrieved documents using genetic

algorithm. International Journal of Innovations in Engineering and Technology,

2(4):202–205.

Thompson, C. A., Califf, M. E., and Mooney, R. J. (1999). Active learning for

natural language parsing and information extraction. In Machine Learning

Conference, pages 406–414. Citeseer.

Gruber, T. R. (1995). Toward principles for the design of ontologies used for

knowledge sharing. International Journal of Human-Computer Studies, 43(4-5):907–

928.

Uschold, M. and Gruninger, M. (1996). Ontologies: Principles, methods and

applications. The knowledge engineering review, 11(02):93–136.

Uschold, M. and King, M. (1995). Towards a methodology for building ontologies.

In Workshop on basic ontological issues in knowledge sharing, volume 74.

Valle, E. D., Cerizza, D., Celino, I., Estublier, J., Vega, G., Kerrigan, M., Ramírez,

J., Villazon-Terrazas, B., Guarrera, P., Zhao, G., and Monteleone, G. (2007).

SEEMP: an semantic interoperability infrastructure for e-government services

in the employment sector. In The Semantic Web: Research and Applications,

4th European Semantic Web Conference, ESWC 2007, Innsbruck, Austria,

June 3-7, 2007, Proceedings, pages 220–234.

Vicient, C., Sanchez, D., and Moreno, A. (2011). Ontology-based feature extrac-

tion. In Proceedings of the 2011 IEEE/WIC/ACM International Conferences

on Web Intelligence and Intelligent Agent Technology - Volume 03, WI-IAT

’11, pages 189–192, Washington, DC, USA. IEEE Computer Society.

Vicient, C., Sanchez, D., and Moreno, A. (2013). An automatic approach for

ontology-based feature extraction from heterogeneous textual resources.

Engineering Applications of Artificial Intelligence, 26(3):1092–1106.

Vijayarajan, V., Dinakaran, M., Tejaswin, P., and Lohani, M. (2016). A generic

framework for ontology-based information retrieval and image retrieval in web

data. Human-centric Computing and Information Sciences, 6(1):18.

Wang, C., Xiong, M., Zhou, Q., and Yu, Y. (2007). Panto: A portable natural

language interface to ontologies. In The Semantic Web: Research and Appli-

cations, pages 473–487. Springer.

Wang, W., Do, D. B., and Lin, X. (2005). Term graph model for text classification.

In ADMA, pages 19–30. Springer.

Weichselbraun, A., Gindl, S., and Scharl, A. (2014). Enriching semantic knowledge

bases for opinion mining in big data applications. Knowledge-based systems,

69:78–85.

Zelle, J. M. and Mooney, R. J. (1996). Learning to parse database queries using

inductive logic programming. In Proceedings of the National Conference on

Artificial Intelligence, pages 1050–1055.
