
ARTICLE IN PRESS


Information Processing and Management 000 (2016) 1–13

Contents lists available at ScienceDirect

Information Processing and Management

journal homepage: www.elsevier.com/locate/ipm

Paper recommendation based on the knowledge gap between a researcher’s background knowledge and research target

Weidong Zhao∗, Ran Wu, Haitao Liu

Shanghai Key Laboratory of Data Science, School of Software, Fudan University, 200433 Shanghai, China

Article info

Article history:

Received 27 November 2015

Revised 16 March 2016

Accepted 13 April 2016

Available online xxx

Keywords:

Paper recommendation

Knowledge gap

Concept map

Shortest concept paths

Abstract

The rapid growth in the number of published documents makes it a challenge for researchers to find high-value papers. To cope with this information explosion, some work on personalized paper recommendation has been proposed. However, the knowledge gap between a researcher’s background knowledge and research target is seldom considered. In this paper, we propose a new method that recommends helpful papers to support researchers by bridging the knowledge gap. First, domain knowledge is extracted as a concept map, which provides a basis for comparing the user’s background knowledge and target knowledge. Then, the knowledge gap is defined on the concept map. To bridge the knowledge gap, the shortest concept paths are searched to find suitable knowledge paths, which can help researchers acquire target knowledge in accordance with their cognition patterns. Finally, experiments are performed to demonstrate the effectiveness of the recommendation method.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

With the development of information technology, great progress has been made in electronic literature. Meanwhile, the increasing number of research papers makes it a challenge for researchers to discover helpful knowledge resources. This is commonly called the information overload problem (Drachsler, Hummel, & Koper, 2008; Salehi & Kamalabadi, 2013). To deal with this issue, information techniques (e.g. information retrieval and information filtering) are adopted by e-libraries and scientific databases to assist knowledge workers. Research paper recommendation is a technique that suggests helpful papers to researchers by exploring their interests and preferences (Basu, Hirsh, Cohen, & Nevill-Manning, 2001). By actively providing interesting materials, literature recommendation can save researchers much time and effort (Pan & Li, 2010).

Content-based filtering and collaborative filtering are the most widely used recommendation techniques in many contexts, including e-commerce, travel, film and music websites. To recommend papers, content-based filtering builds a user profile from the reading history and suggests new papers that match the profile well. In previous work, the user profile is generally built from the importance of key words; however, this is insufficient to model the user’s preferences. To refine the preference semantics, several methods have been proposed, such as the label-enriched approach (Guan et al., 2010) and the ontology-expansion approach (Zhang, Ni, Zhao, Liu, & Yang, 2014). In the collaborative filtering approach, like-minded research groups are identified first and recommendations are generated based on their similar interests. A key issue in collaborative filtering is how to measure user similarity. To analyze the relationship between researchers, data from various sources is collected and analyzed, including e-mail logs, co-authorship, references, project cooperation and social media (Davoodi, Afsharchi, & Kianmehr, 2012; Durand, Belacel, & Laplante, 2013).

∗ Corresponding author.

E-mail addresses: [email protected] (W. Zhao), [email protected] (R. Wu), [email protected] (H. Liu).

http://dx.doi.org/10.1016/j.ipm.2016.04.004

0306-4573/© 2016 Elsevier Ltd. All rights reserved.

Unlike common information resources (e.g. news and tweets), academic literature is an important knowledge resource for researchers. Scientific workers gain their knowledge by reading published literature. Before navigating an academic database, a researcher usually has one or several focused knowledge goals, which are commonly documented as research proposals, requirement specifications or planning statements. The disparity between the knowledge goal and background knowledge is called the “knowledge gap”. To bridge the gap, researchers go through literature databases and look for the papers they need. By learning and digesting these papers, they transform the embedded knowledge into their own. However, the knowledge gap is seldom considered in previous research. Since a user’s historical preferences do not necessarily reveal his/her current knowledge requirements, previous methods can hardly narrow a new knowledge gap. As a result, a large collection of written works still challenges researchers even when they have only a few well-documented knowledge goals. Therefore, continued effort is required to support knowledge workers by bridging their knowledge gap.

In this paper, we present an approach called knowledge gap based recommendation (KGR for short) to bridge the knowledge gap between a researcher’s background knowledge and research target. The method represents domain knowledge in the form of a concept map. First, a set of central concepts is extracted from the domain corpus. Then, links between these concepts are built according to their associations. A researcher’s reading records are analyzed to model his/her background knowledge, and the target knowledge is extracted from the research proposal. In the domain concept map, the knowledge gap is defined as the shortest paths that connect these two kinds of knowledge. Finally, the concepts on those paths are used to discover well-matched papers, which can help bridge the user’s knowledge gap.

The paper is structured as follows. Section 2 summarizes recent work related to paper recommendation. In Section 3, a methodology to build the domain concept map is introduced. Section 4 describes how to define the knowledge gap and proposes the recommendation algorithm based on the gap. Section 5 gives a brief case study and Section 6 evaluates the proposed method. Section 7 draws conclusions.

2. Related work

Many researchers realize that user preference cannot be the only guidance when recommending resources. Therefore, they consider other factors, such as domain knowledge, user background, learning targets and cognitive patterns. The context of literature reading is not well-formed, making it difficult to determine the user’s learning targets and cognitive patterns. As a result, previous research focuses only on standardized teaching situations such as e-learning. Zhang et al. (2014) organized disciplines and curriculum information as a knowledge tree, from which association rules between resources and courses were analyzed to recommend teaching resources. Tang and McCalla (2004) proposed a paper recommendation method concentrating on teaching characteristics, including the user’s knowledge level and knowledge goals. Based on these characteristics, a set of ordered papers is recommended. In contrast, this paper is devoted to bridging the knowledge gap between the user’s background knowledge and research targets.

In some studies, domain knowledge was modeled in the form of a domain taxonomy, ontology or concept network. For example, Liang et al. established a semantic network according to visited documents. The semantic network is composed of several connected semantic trees, and connections in each semantic tree reflect the inheritance relationship between concepts. Spreading activation was then used to semantically expand the user’s interest (Liang, Yang, Chen, & Ku, 2008). Spreading activation can achieve beneficial knowledge expansion, enriching the original knowledge model with closely related nodes. Xu et al. (2012) combined concept networks with social networks, and recommended experts to users by jointly exploiting semantic relations between concepts, social cooperation between experts, and professional relationships between concepts and experts. Cantador and Castells established a semantic network composed of domain concepts, in which user interests took the form of concepts and interest levels. The algorithm expanded user interests by spreading activation, and then clustered concepts in the semantic network according to user preferences. Users with similar concept clusters are considered to have similar interests (Cantador & Castells, 2006).

Some approaches extracted and utilized the user’s background to build a personalized user profile, which reflected the unique background and requirements of each user. Chen et al. built an adaptive ontology for each user according to the user’s reading behaviors. User-concept and user-user patterns were extracted from the ontology, and resources were recommended to the user according to similar patterns in the pattern library (Gemmis, Lops, Semeraro, & Musto, 2015). Hawalah and Fasli set up a personalized interest ontology based on user interests and views. Spreading activation was applied to expand user interests, aiming to find relevant concepts the user might be interested in (Hawalah & Fasli, 2011). Some recommendation methods planned learning paths for the user on the basis of dependencies between resources, so as to help the user achieve a certain goal. Yu et al. created three ontologies representing the learner’s background knowledge, learning resources and domain knowledge, respectively. Similar learning resources were found by computing the similarity between users’ backgrounds and learning resources; these resources were then organized as a learning path in accordance with their prior links (Hawalah & Fasli, 2011). Durand et al. asserted that the learner’s knowledge base needs to satisfy the prerequisites of a resource before he/she studies it. Only in this way can the learner gain knowledge from it. On that basis, the strategy (Durand et al., 2013) established a directed graph according to the competencies required, and recommended a sequence of learning objects to the learner in a well-defined order. Ordered learning paths help users reach the goal in terms of their abilities, so that users can follow learning paths from the initial set of competencies to the target one.







This paper establishes a domain knowledge model, the concept map, and determines the user’s knowledge gap using domain background knowledge. To determine the user’s knowledge gap, background knowledge is extracted from his/her reading records and target knowledge is distilled from research proposals. Generally, a researcher prepares a proposal that indicates the goal of further research.

3. Domain concept map

Several methods are available to build a structured and formalized abstraction of domain knowledge. The general approach is to extract high-frequency key words as feature factors and analyze the relationships between them. For example, hierarchical relations, which reflect the generalization/specialization relation between concepts, are usually used to build models such as taxonomies and hierarchical structures (Knijff, Frasincar, & Hogenboom, 2013; Sanderson & Croft, 2003; Tsui, Wang, Cheung, & Lau, 2010). However, word frequency alone cannot precisely indicate the semantics of a document.

In this paper, we take the semantics of a document into account. The method takes a domain corpus as input and outputs the concept map, which represents the core knowledge structure of the domain. The concept map is a graphical representation that visualizes both knowledge and thinking (Basu et al., 2001; Novak, 1998). From the concept map, we can obtain the logical relationships between knowledge nodes (Lehmann, 1992).

The concept map is represented as an undirected weighted graph $G = \{V, E, CW\}$. Each node $v \in V$ is a knowledge factor. Each edge $e \in E \subseteq V \times V$ represents the relation between two factors. $CW \in [0, 1]$ is the weight of an edge, which reflects the strength of the relation: the stronger the relation, the greater the weight.

The basic assumption of traditional feature extraction methods, such as term frequency-inverse document frequency (TF-IDF), is that the most significant words are those of higher frequency. However, the semantic information of a document is given little consideration, and the synonym/polysemy problem cannot be solved either. LDA (latent Dirichlet allocation), a popular probabilistic topic model, assumes that all documents in a corpus share some implicit themes or topics (Blei, Ng, & Jordan, 2003). Every document is modeled as a mixture of implicit topics with certain probabilities. This indicates that several words may correspond to one topic in a paper, and that a concept map consists of several topics, which together imply the main content of the paper. We therefore use LDA to extract these topics instead of key words. In addition, the number of topics is often smaller than the number of key words obtained by TF-IDF.

Since each topic is represented as some related words, the synonym/polysemy issue can be well settled. In this paper,

we employ LDA to extract implicit topics in a domain and construct the concept map with these topics. Each knowledge

node in the concept map represents a topic in the domain, and edges reflect the correlation between topics.

3.1. Model training

In LDA, each document is modeled as a probability distribution over topics, and each topic as a probability distribution over words. By learning the document-topic distribution θ and the topic-word distribution ϕ, we can infer the topic distribution of each paper. After tokenization, stop-word removal and stemming, the domain corpus D is represented as a set of words $K = \{k_1, k_2, \ldots, k_{KN}\}$, where KN is the total number of key words. Assuming there are TN topics in the domain, denoted as $T = \{t_1, t_2, \ldots, t_{TN}\}$, θ and ϕ are computed as in Eqs. (1) and (2).

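$$\theta_{d,t} = \frac{n_d^{(t)} + \alpha_t}{\sum_{t=1}^{TN} \left( n_d^{(t)} + \alpha_t \right)} \tag{1}$$

$$\phi_{t,k} = \frac{n_t^{(k)} + \beta_k}{\sum_{k=1}^{KN} \left( n_t^{(k)} + \beta_k \right)} \tag{2}$$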

where $n_d^{(t)}$ is the effective number of times the topic t is included in the paper d, and $n_t^{(k)}$ is the effective number of occurrences of the word k that touch on the topic t. $\alpha_t$ is the prior probability of t, and $\beta_k$ is the prior probability of the word k.
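As a minimal sketch of Eqs. (1) and (2), assume the count matrices have already been produced by some LDA sampler (the paper does not say which one). Both distributions are then rows of smoothed, normalized counts. The function names, the toy counts, the prior values and the threshold 0.2 are illustrative assumptions, not values from the paper.

```python
def smoothed_rows(counts, prior):
    """Normalize each row of smoothed counts: (n + prior) / sum(n + prior)."""
    rows = []
    for row in counts:
        smoothed = [n + p for n, p in zip(row, prior)]
        total = sum(smoothed)
        rows.append([s / total for s in smoothed])
    return rows


def topics_of(theta_row, threshold):
    """Definition 1: indices of topics whose probability reaches the threshold."""
    return {t for t, prob in enumerate(theta_row) if prob >= threshold}


# Hypothetical counts: 2 papers x 3 topics, and 3 topics x 4 words.
doc_topic_counts = [[8, 2, 0], [1, 1, 8]]
topic_word_counts = [[5, 0, 1, 0], [0, 4, 0, 2], [1, 1, 3, 3]]

theta = smoothed_rows(doc_topic_counts, prior=[0.5, 0.5, 0.5])      # Eq. (1)
phi = smoothed_rows(topic_word_counts, prior=[0.1, 0.1, 0.1, 0.1])  # Eq. (2)

# Topics each paper "contains" under an assumed threshold of 0.2.
contained = [topics_of(row, threshold=0.2) for row in theta]
```

With these toy counts, the first paper contains topics 0 and 1 and the second contains only topic 2, which is the membership relation that the later co-occurrence analysis relies on.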

A research paper may contain multiple topics with different probabilities. We say that a paper contains a topic only

when its corresponding probability is large enough. The membership relation is defined as follows.

Definition 1. The document $d_i$ contains the topic $t_j$ only when $\theta_{d_i,t_j} \ge \phi_\alpha$, where $\theta_{d_i,t_j}$ is the probability defined in Eq. (1) and $\phi_\alpha$ is a threshold that controls the number of topics per document. A reasonable $\phi_\alpha$ can be determined by experiments. In the following steps, we consider only the topics that satisfy Definition 1.

To improve the choice of topics, we borrow the idea of TF-IDF (term frequency–inverse document frequency). For a document d that is not in D, we use ϕ from Eq. (2) to compute the importance of t in d as Eq. (3).

$$w(d, t) = \frac{\sum_{k \in d} \phi_{t,k}}{\sum_{t \in T} \sum_{k \in d} \phi_{t,k}} \times \log_2 \frac{|D|}{n(d, t)} \tag{3}$$

where |D| denotes the number of documents in D and n(d, t) is the number of times t appears in d.
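A small sketch of Eq. (3) is given below. Because the text defines n(d, t) only briefly, the count is passed in by the caller as `n_dt` rather than derived here; that parameter, the dictionary layout of `phi` and the toy numbers are all assumptions made for illustration.

```python
import math


def topic_weight(doc_words, topic, phi, num_docs, n_dt):
    """Eq. (3): share of topic-word mass for `topic`, scaled by log2(|D| / n(d, t)).

    phi maps each topic to a {word: probability} dictionary.
    """
    def mass(t):
        return sum(phi[t].get(k, 0.0) for k in doc_words)

    total = sum(mass(t) for t in phi)
    if total == 0 or n_dt <= 0:
        return 0.0  # assumption: no evidence for any topic means zero weight
    return mass(topic) / total * math.log2(num_docs / n_dt)


# Hypothetical topic-word probabilities for two topics.
phi = {
    "t1": {"graph": 0.5, "path": 0.3},
    "t2": {"topic": 0.6, "graph": 0.1},
}
w_t1 = topic_weight(["graph", "path", "topic"], "t1", phi, num_docs=8, n_dt=2)
```

Here topic t1 receives 0.8 of the 1.5 total mass, and the logarithmic factor is log2(8/2) = 2, so the weight is 16/15.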








Algorithm 1. Constructing the concept map.

Input: D, research paper set
Output: G, domain concept map
1   Edge E, Vertex V, Weight CW;
2   Corpus ← Preprocess(D);
3   Topic T ← Semantic_process_with_LDA(Corpus);
4   for each topic in T do
5       V.add(topic);
6   for each v1, v2 in V do
7       weight(v1, v2) ← CW(v1, v2);
8       if CW(v1, v2) > ω then
9           e ← (v1, v2);
10          E.add(e);
11  Return G ← {V, E, CW};

3.2. Construction of the concept map

Based on LDA, we construct the concept map to model domain knowledge. Each node in the concept map represents a topic, and an edge is defined as a correlation, which implies a link in a certain knowledge dimension. Bose et al. argued that concepts are associated in certain dimensions when they appear in the same document (Bose, Beemanapalli, Srivastava, & Sahar, 2006). If two different topics appear in the same paper, they may have a certain relevance. Therefore, the more frequently two topics appear together in the same paper, the stronger their correlation is considered to be.

Definition 2. If a paper d contains the topics $t_i$ and $t_j$ simultaneously, we say that $t_i$ and $t_j$ co-occur once. The correlation weight CW is defined according to the frequency of their co-occurrence as Eq. (4).

$$CW(t_i, t_j) = \frac{|d_{t_i} \cap d_{t_j}|}{|d_{t_i} \cup d_{t_j}|} \tag{4}$$

where $d_{t_i}$ and $d_{t_j}$ are the sets of papers that contain $t_i$ and $t_j$, respectively; $d_{t_i} \cap d_{t_j}$ is the set of papers that contain both $t_i$ and $t_j$; $d_{t_i} \cup d_{t_j}$ is the set of papers that contain $t_i$ or $t_j$; and $|d_{t_i} \cap d_{t_j}|$ and $|d_{t_i} \cup d_{t_j}|$ are the corresponding set sizes.

The major steps to construct the concept map are as follows:

(1) For each topic $t_i$ in $T = \{t_1, t_2, \ldots, t_{TN}\}$, create a knowledge node in the concept map;

(2) For each pair of topics $t_i$ and $t_j$, compute the correlation weight as Eq. (4);

(3) If the weight is greater than the threshold ω, create an edge between them.

Algorithm 1 shows the process of constructing the concept map. The algorithm takes the paper set D as input and outputs the domain concept map G. After tokenization, stop-word removal, stemming and other preprocessing, the domain corpus is generated (line 2); then LDA is performed on the corpus to extract the topic set T (line 3). To construct the concept map G, we first create a node in G for each topic (lines 4 and 5); then Eq. (4) is used to measure the correlation weight between each pair of topics, and an edge e is created and added to E if the correlation weight is greater than the threshold ω (lines 6 to 10).
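The graph-building part of Algorithm 1 (lines 4 to 11) can be sketched as follows, assuming the topic membership of each paper has already been decided by Definition 1. The function name `build_concept_map`, the edge dictionary layout and the toy data are assumptions for illustration; the preprocessing and LDA steps are not reproduced here.

```python
from itertools import combinations


def build_concept_map(topics, doc_topics, omega):
    """Return (nodes, edges), where edges maps a topic pair to its weight CW.

    doc_topics maps a paper id to the set of topics that paper contains.
    """
    papers_with = {t: set() for t in topics}
    for doc, contained in doc_topics.items():
        for t in contained:
            papers_with[t].add(doc)

    edges = {}
    for ti, tj in combinations(topics, 2):
        union = papers_with[ti] | papers_with[tj]
        if not union:
            continue  # neither topic occurs in any paper
        cw = len(papers_with[ti] & papers_with[tj]) / len(union)  # Eq. (4)
        if cw > omega:
            edges[(ti, tj)] = cw
    return list(topics), edges


# Hypothetical membership of four papers in three topics.
doc_topics = {"d1": {"a", "b"}, "d2": {"a", "b", "c"}, "d3": {"c"}, "d4": {"a"}}
nodes, edges = build_concept_map(["a", "b", "c"], doc_topics, omega=0.3)
```

In this toy corpus CW(a, b) = 2/3 and CW(b, c) = 1/3 pass the assumed threshold ω = 0.3, while CW(a, c) = 1/4 does not, so the map keeps two of the three possible edges.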

4. Knowledge gap analysis

In a formal learning context (e.g. e-learning), learning is a highly ordered process, and there are specific logical links among learning materials (e.g. courses). Generally, the learning goal can be clearly defined, and a learning path to achieve this goal can be explored by analyzing the associations between courses. For example, Sowa & John introduced a scenario-based method to define user requirements (Sowa & John, 2006), and Santos & Boticario proposed an approach that determined user requirements with group intelligence (Santos & Boticario, 2015). Zhang et al. (2014) used pattern mining to predict the user’s goals. However, unlike formal learning, reading materials and attending conferences are informal learning processes for researchers. It is hard to give a clear definition of a researcher’s learning goals and knowledge needs. In this paper, we try to analyze the researcher’s background knowledge, research goals and knowledge gap by means of the domain concept map.

According to Ausubel’s meaningful learning theory, knowledge is obtained by observing and cognizing new things through existing concepts (Ivie, 1998). Learning is a process of setting up a concept network and constantly adding new concepts to it (Chen, Chu, Chen, & Chao, 2013). A learner needs to associate new knowledge with his/her own background knowledge while learning; only in this way can he/she achieve meaningful results. When we explore solutions to a problem, we need to activate some interrelated background knowledge (Tsui et al., 2010). For example, Ferrari and Gnesi built a concept map to model the user’s background knowledge and simulated the process of understanding natural language by finding the least-cost paths on the map (Ferrari & Gnesi, 2012). Knowledge relevance is an important factor in guaranteeing the effectiveness of study. To improve learning efficiency, it is necessary to take the association of learning materials into account and recommend suitable learning paths (Durand et al., 2013; Hawalah & Fasli, 2011).

In this paper, we take knowledge relevance into account and try to narrow the user’s knowledge gap from the learning perspective. According to Ausubel’s theory, meaningful learning is achieved only when newly gained knowledge is connected with the learner’s background knowledge. To achieve research goals, a user needs to continuously add new concepts to his/her knowledge base and build connections between new knowledge and background knowledge. Once all the research-related knowledge has become part of background knowledge, the user has the knowledge necessary to complete his/her research. Until then, there is a knowledge gap between background knowledge and research goals. Without the “gap knowledge”, the user cannot gain a comprehensive understanding of the research goals, which makes it difficult to conduct the research successfully.

In this section, we discuss in detail how to bridge the “knowledge gap” by providing relevant knowledge sources. First, the user’s reading history and research proposal are analyzed to model background knowledge and the knowledge goal, respectively. To enrich the knowledge goal, we use spreading activation to extend the extracted knowledge requirements. Second, we analyze and define the knowledge gap graphically with the concept map. To bridge the knowledge gap, we treat the shortest paths connecting background knowledge and target knowledge in the concept map as the best solution. By following these knowledge paths, researchers can acquire the required knowledge step by step. Finally, research papers that cover the knowledge paths are recommended to bridge the knowledge gap.

4.1. Knowledge model

The knowledge model of a user consists of his/her background knowledge and research goals. In this paper, we assume

that the user’s reading records are available to model his/her background knowledge, and a research proposal is provided to

extract the research target knowledge.

(1) Background knowledge

The basic assumption is that a user gains background knowledge by digesting literature. By continuously referring to related papers, the user improves his/her knowledge of certain topics. To model the user’s knowledge of each topic, his/her reading history is analyzed with LDA.

Definition 3. Let the user u’s reading history be a set of documents $UD = \{ud_1, ud_2, \ldots, ud_m\}$, where m is the total number of documents u has read. Define u’s expertise on the topic t as Eq. (5).

$$\mathrm{Expertise}(u, t) = \sum_{i=1}^{m} f_i \times w(d_i, t) \tag{5}$$

where $d_i$ denotes the i-th document $ud_i$ in UD, $w(d_i, t)$ is calculated according to Eq. (3), and $f_i$ is the number of times u has read $d_i$. That is, expertise accumulates through the user’s continuous reading. With the threshold $\partial_t$, we extract the topic set $UT = \{t \mid \mathrm{Expertise}(u, t) > \partial_t, t \in T\}$ as the background knowledge of u, denoted as $UT = \{ut_1, ut_2, \ldots, ut_n\}$.
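Definition 3 can be illustrated with a short sketch, assuming the per-document topic weights w(d, t) from Eq. (3) are already available. The dictionary layouts, the function names and the threshold value 0.8 are hypothetical choices for the example.

```python
def expertise(reading_history, topic_weights, topic):
    """Eq. (5): sum over read documents of read count f_i times w(d_i, t)."""
    return sum(
        count * topic_weights[doc].get(topic, 0.0)
        for doc, count in reading_history.items()
    )


def background_topics(reading_history, topic_weights, topics, threshold):
    """Topic set UT: topics whose expertise exceeds the threshold."""
    return {
        t for t in topics
        if expertise(reading_history, topic_weights, t) > threshold
    }


# Hypothetical data: the user read ud1 twice and ud2 once.
reading_history = {"ud1": 2, "ud2": 1}
topic_weights = {
    "ud1": {"lda": 0.5, "graph": 0.1},
    "ud2": {"graph": 0.4},
}
UT = background_topics(reading_history, topic_weights, ["lda", "graph"], threshold=0.8)
```

Here the expertise on "lda" is 2 × 0.5 = 1.0 and on "graph" is 2 × 0.1 + 0.4 = 0.6, so only "lda" enters the background knowledge under the assumed threshold.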

(2) Research target knowledge

Research proposals are formal descriptions of a researcher’s research targets. They document the background, direction and outline of his/her further research. We first discuss how to use LDA to extract the topics in a proposal.

Definition 4. The importance of the topic t in the research proposal rp for the user u is defined as Eq. (6), that is, the weight of t in rp.

$$\mathrm{Importance}(u, t) = w(rp, t) \tag{6}$$

where w(rp, t) is calculated according to Eq. (3). To choose the more important target knowledge, we use the threshold � t to filter out trivial topics.

Definition 5. Let the topic set GT = { t | Importance(u, t) > � t , t ∈ T } be target knowledge, and let the weights of all topics in target knowledge be denoted as $TW = \{tw_1, tw_2, \ldots, tw_q\}$, where $tw_i = \mathrm{Importance}(u, t_i)$.

(3) Spreading activation

Note that the description in a single proposal may be general, so more analysis and investigation are required to refine the knowledge goal. In this paper, we use spreading activation to extend research goal knowledge to other closely related content. Research goal knowledge is extended along links in the domain concept map. A link in the domain concept map indicates the relevance between two knowledge nodes in some dimension. This multi-dimensional knowledge expansion can improve the comprehensiveness of research goal knowledge.

Spreading activation is a technique widely used by cognitive scientists seeking to understand the learning process that takes place to form learning networks. Researchers in computer science have emulated this process in a variety of ways and applied spreading activation over semantic networks to solve several important problems in intelligent recommendation (Blanco-Fernández, López-Nores, Gil-Solla, Ramos-Cabrer, & Pazos-Arias, 2011; Gao, Yan, & Liu, 2008; Gemmis et al., 2015). The spreading activation model can extend content-based filtering, and it can reach the nodes that are highly associated with the initial nodes through network links (Liang et al., 2008).








Fig. 1. A concept map G.

As an iterative process, spreading activation starts from the original nodes in the network and activates other nodes that link to them directly or indirectly. The activated nodes then activate other connected nodes in the same manner. This process does not stop until certain conditions are satisfied. The activation value, the threshold and the maximum spreading distance are the essential parameters of spreading activation. The activation value reflects the working condition of an activated node. The spreading distance is the number of links traversed. The threshold and the maximum spreading distance determine when the activation process stops: if the total activation value of a node is lower than the threshold, or the spreading distance of this node has reached the maximum spreading distance, the activation process stops.

The spreading activation process is executed as follows.

(1) Initialization. The topics in GT are chosen as the initial activation nodes, with the weights TW as their initial activation values. The activation values of all other nodes are set to 0. The threshold λ and the maximum spreading distance sd are initialized.

(2) Spreading activation. Starting from the initial activation nodes, nodes are activated along the connections in the domain concept map. When a node is activated via a connection by another node, its activation value is the product of the activating node’s activation value and the connection weight CW. For example, if a node whose activation value is 4 activates a node through a connection with CW = 0.3, then the activation value of the newly activated node is 1.2. If a node is activated by several nodes, its activation value is the weighted sum of the activation values of all activating nodes.

(3) Termination condition. Every time an activation is completed, the termination condition is checked. If the activation value of a node is below the threshold or the spreading distance reaches the maximum spreading distance, spreading activation stops. This yields the extended topic set EGT.

(4) Node selection. For each topic t in EGT, delete the topic if Importance(u, t) < � t . After topic selection, we obtain the topic set $CRT = \{crt_1, crt_2, \ldots, crt_p\}$, which represents the target knowledge of the user.
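Steps (1) to (3) above can be sketched as a level-by-level traversal. This is one possible reading of the procedure, under two assumptions that the text does not state: a node is activated at most once (which also prevents activation from flowing back along an edge), and a node joins the result only if the sum it receives reaches the threshold λ. The function name and the toy graph are hypothetical, and the node selection of step (4) is left out.

```python
def spread_activation(neighbors, initial, threshold, max_distance):
    """Return {node: activation value} for the extended topic set EGT.

    neighbors maps a node to {adjacent node: CW}; initial maps the topics
    in GT to their initial activation values TW.
    """
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_distance):
        incoming = {}
        for node, value in frontier.items():
            for adjacent, cw in neighbors.get(node, {}).items():
                if adjacent in activation:
                    continue  # assumption: each node is activated only once
                # Weighted sum over all activating nodes (step 2).
                incoming[adjacent] = incoming.get(adjacent, 0.0) + value * cw
        frontier = {n: v for n, v in incoming.items() if v >= threshold}
        if not frontier:
            break  # nothing passed the threshold, so spreading stops
        activation.update(frontier)
    return activation


# Hypothetical undirected concept map, stored as symmetric adjacency.
neighbors = {
    "a": {"b": 0.3, "c": 0.1},
    "b": {"a": 0.3, "d": 0.5},
    "c": {"a": 0.1},
    "d": {"b": 0.5},
}
egt = spread_activation(neighbors, {"a": 4.0}, threshold=0.5, max_distance=2)
```

In this example node b receives 4 × 0.3 = 1.2, as in the worked example in step (2), node c receives only 0.4 and is dropped, and node d is reached at distance 2 with value 1.2 × 0.5 = 0.6.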

4.2. Knowledge gap analysis

In this section, we discuss how to define the knowledge gap (KG) and how to bridge it. Generally, a researcher’s target knowledge is not part of his/her background knowledge. Since target knowledge does not coincide completely with the user’s background knowledge, there is a disparity between these two kinds of knowledge. New knowledge is required so that concept paths can be built to connect background knowledge and research goals. From the perspective of the concept map, there is a “gap” between the background knowledge nodes and the target knowledge nodes. In the concept map, any concept path that links background knowledge and target knowledge is a solution for bridging the knowledge gap.

Let background knowledge be $UT = \{ut_1, ut_2, \ldots, ut_n\}$ ($ut \in T$) and research target knowledge be $TT = \{tt_1, tt_2, \ldots, tt_k\}$ ($tt \in T$), as in Fig. 1. Note that there may exist various concept paths between a user’s background knowledge and research goals, and target knowledge can be reached through any of them. For example, suppose a user has $ut_2$ as background knowledge and $tt_3$ as research target knowledge, and there exist several concept paths 〈$ut_2, t_3, tt_3$〉, 〈$ut_2, t_1, t_3, tt_3$〉 and 〈$ut_2, t_1, t_2, t_4, tt_3$〉 in G ($ut, t, tt \in T$). Different concept paths contain different concepts, so the number of concepts that the user needs to learn also differs. To improve learning efficiency, the shortest paths are selected to determine the knowledge gap in this paper.

The shortest paths contain fewer related concepts. This may result in recommending papers that are a greater challenge for the user to understand, so recommending along the shortest paths might lead to a steep learning curve. In that case, more learning materials are needed to cover the motivation of the user.

The concept sets UT and CRT are extracted from the literature that the user has read and from the research plan to represent background knowledge and target knowledge, respectively. As the research proceeds, the user’s background knowledge and target knowledge will share overlapping concept nodes, so the real research target knowledge is TT = CRT − (UT ∩ CRT).

Starting from research target knowledge, the shortest path to UT is searched with the Dijkstra algorithm for each concept in TT. The distance between concept nodes is defined as the reciprocal of the correlation weight $CW(t_a, t_b)$, that is, $DL(t_a, t_b) = 1 / CW(t_a, t_b)$. This means that the more closely two concepts are correlated, the shorter the distance between them is. The k shortest paths found in this way, one for each concept in TT, represent the knowledge gap of the user.
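The search described above can be sketched with a standard Dijkstra implementation over the distance DL = 1/CW, started at one target concept and stopped at the first background concept reached. The function name, the adjacency layout and the edge weights are illustrative assumptions; only the node names follow the example paths in the text.

```python
import heapq


def shortest_path_to_background(neighbors, target, background):
    """Dijkstra from `target` to the nearest node in `background`.

    neighbors maps a node to {adjacent node: CW}; edge length is 1 / CW.
    Returns the path ordered from background to target, or None.
    """
    dist = {target: 0.0}
    prev = {}
    heap = [(0.0, target)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node in background:
            path = [node]
            while path[-1] != target:
                path.append(prev[path[-1]])
            return path
        for adjacent, cw in neighbors.get(node, {}).items():
            candidate = d + 1.0 / cw
            if candidate < dist.get(adjacent, float("inf")):
                dist[adjacent] = candidate
                prev[adjacent] = node
                heapq.heappush(heap, (candidate, adjacent))
    return None


# Hypothetical weights: the direct link ut2-t3 is weak, the detour via t1 is strong.
neighbors = {
    "ut2": {"t3": 0.2, "t1": 0.8},
    "t1": {"ut2": 0.8, "t3": 0.8},
    "t3": {"ut2": 0.2, "t1": 0.8, "tt3": 0.5},
    "tt3": {"t3": 0.5},
}
gap_path = shortest_path_to_background(neighbors, "tt3", {"ut2"})
```

With these weights the route through t1 costs 2 + 1.25 + 1.25 = 4.5, which beats the two-hop route at 2 + 5 = 7, so a path with more concepts can still be the shortest one when its links are strongly correlated.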








Definition 6. The priority weight GW measures the importance of the topic t in a path, as in Eq. (7).

$$GW(t, path) = \frac{Dis(t, path)}{\sum_{t' \in path} Dis(t', path)} \tag{7}$$

where Dis(t, path) is computed from the position of t in the path. The closer a topic node is to background knowledge in a knowledge path, the more it contributes to learning new knowledge. In contrast, the topics that are closer to target knowledge are more important for bridging the knowledge gap. For example, consider a knowledge path p = 〈$ut_2, t_1, t_3, tt_3$〉 with $tt_3$ being the target knowledge of the user. The positions of the nodes on the path are (1, 2, 3, 4) and the priority weights are (1/10, 2/10, 3/10, 4/10), respectively. As we can see, the nodes that are closer to the target knowledge have higher weights.
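The worked example can be reproduced with a few lines, assuming that Dis(t, path) is simply the 1-based position of t on a path ordered from background to target, as the example suggests. The function name is hypothetical, and exact fractions are used so the weights match the text digit for digit.

```python
from fractions import Fraction


def priority_weights(path):
    """Eq. (7) with Dis taken as the 1-based position on the path.

    Assumes the topics on the path are distinct.
    """
    total = sum(range(1, len(path) + 1))
    return {t: Fraction(pos, total) for pos, t in enumerate(path, start=1)}


gw = priority_weights(["ut2", "t1", "t3", "tt3"])
```

For the four-node path the positions sum to 10, so the weights are 1/10, 2/10, 3/10 and 4/10, and they always sum to 1.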

4.3. Paper recommendation based on the knowledge gap

Since a concept may have different meanings in different situations, a combination of concepts reflects knowledge more accurately than a single concept. Thus we take a path of concepts as the user’s knowledge gap. According to the user’s knowledge gap, papers are recommended to narrow the gap by supplying the knowledge the user lacks. After determining the knowledge gap, the concept paths in KG are utilized to find the papers that can bridge it. We define the matching score MS to measure the matching degree between a paper d and a concept path cpath as Eq. (8).

$$MS(d, cpath) = \sum_{t \in cpath} w(d, t) \times GW(t, cpath) \tag{8}$$

where w(d, t) is the weight with which d contains t, and GW(t, cpath) is the priority weight of t in cpath. The matching score increases if the topic distribution of a paper covers the knowledge gap, especially target knowledge.

In order to gain target knowledge, all the knowledge paths in KG should be matched. The benefit of d for u is defined as

Eq. (9) .

$$Benefit(d, u, KG) = \sum_{p \in KG} MS(d, p) \times \sum_{tp \in p} Importance(u, tp) \tag{9}$$

where tp is target knowledge in the knowledge path p, and Importance(u, tp) is the importance of the topic tp for u. During the recommendation process, Eq. (9) is used to calculate each document’s utility value, and the papers with the highest utility are recommended to the user.
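Eqs. (8) and (9) can be combined into a small ranking sketch. Everything concrete here is an assumption made for illustration: documents are represented as dictionaries of topic weights w(d, t), the importance dictionary holds values only for target concepts, Dis(t, path) is taken as the 1-based position, and the document names, paths and numbers are invented.

```python
def priority_weights(path):
    """Eq. (7), assuming Dis(t, path) is the 1-based position of t."""
    total = sum(range(1, len(path) + 1))
    return {t: pos / total for pos, t in enumerate(path, start=1)}


def matching_score(doc_weights, path):
    """Eq. (8): sum of w(d, t) * GW(t, cpath) over the concepts on the path."""
    gw = priority_weights(path)
    return sum(doc_weights.get(t, 0.0) * gw[t] for t in path)


def benefit(doc_weights, gap_paths, importance):
    """Eq. (9): MS(d, p) times the summed importance of target concepts on p."""
    total = 0.0
    for path in gap_paths:
        target_importance = sum(importance.get(t, 0.0) for t in path)
        total += matching_score(doc_weights, path) * target_importance
    return total


# Hypothetical knowledge gap, importance values and candidate documents.
gap_paths = [["user", "context"], ["user", "location"]]
importance = {"context": 0.6, "location": 0.4}
docs = {
    "d1": {"context": 0.5, "user": 0.2},
    "d2": {"location": 0.9},
}
ranked = sorted(docs, key=lambda d: benefit(docs[d], gap_paths, importance), reverse=True)
print(ranked)
```

A document that covers the target end of a path gains more than one that covers only the background end, because the priority weight grows toward the target.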

Algorithm 2 shows the process of knowledge recommendation based on gap analysis. The algorithm takes as input the domain knowledge graph G, the research proposal P, the referenced documents B that the user has read and the domain document set D. The algorithm is composed of three steps: background knowledge analysis, target knowledge extraction and gap knowledge recommendation. At the stage of background knowledge analysis, the papers the user has read are preprocessed first. Then, the topics of high expertise are chosen as user background knowledge. During target knowledge extraction, target knowledge and the user’s demand for each topic are calculated, and then target knowledge nodes are spread in the concept map G. After removing nodes with lower importance, the topic set TT represents target knowledge. The knowledge recommendation phase finds the shortest paths from target knowledge to background knowledge using the Dijkstra algorithm. Then the papers

Algorithm 2
Knowledge gap based recommendation (KGR).
Input: G, concept map
       P, research proposal
       B, referenced document set
       D, candidate domain document set
Output: literature list
Background Knowledge UT, Target Knowledge TT, Knowledge Gap KG, Topic set T;
preprocess(B);
for topic in T do
    UT(topic) ← Expertise(u, topic);
remove(UT, ϕ_t); //remove topics with low expertise
preprocess(P);
for topic in T do
    TT(topic) ← importance(u, topic);
TT ← SpreadingActivation(G, GT);
remove(TT, ϕ_i); //remove less important knowledge topics
for ts in TT do
    path(ts) ← Min{Dijkstra(G, tb, ts) | tb ∈ UT}; //explore concept learning path
    KG.add(path(ts));
for document in D do
    Match(document) ← Benefit(document, u, KG);
Return the papers with the highest utility








Fig. 2. The process of paper recommendation based on the knowledge gap.

that best fill the knowledge gap are chosen, and the papers with the highest matching score are recommended to the user.

The process of paper recommendation based on the knowledge gap between the two kinds of knowledge is illustrated in Fig. 2. The numbers in it denote the order of the steps. There are two main sub-processes: finding the knowledge gap by comparing background knowledge and target knowledge, and recommending suitable papers to users by searching for the shortest concept paths. The first sub-process includes steps ①, ② and ③; the second consists of steps ④ and ⑤.

5. Case study

In this section, we take the domain of recommender systems as an example and give a brief case study. The data set is composed of 500 academic papers chosen by domain experts, including 300 papers published in the proceedings of the ACM Conference on Recommender Systems (2007–2015) and 200 papers retrieved through Google Scholar (using keywords like “recommender system”, “personalized recommendation”, etc.). According to related work, we assume a symmetric Dirichlet distribution and set α and β as 50/TN and 0.01, respectively (Wei & Croft, 2006). Experimental analyses also show that the prior probabilities α and β in Eqs. (1) and (2) conform to the symmetric Dirichlet distribution and they are estimated as 50/TN and 0.01, respectively. Since the number of topics has an important effect on LDA, we use the perplexity metric to evaluate the quality of LDA. In order to determine the right number of topics TN, we compute the perplexity value for varying TN. We find that the precision varies only a little with different TN. When TN = 75, LDA achieves the highest precision. The related parameters α, β, ϕ_α, ω are set as 50/75, 0.01, 0.36 and 0.54, respectively.

The concept map consists of 75 vertices and 2029 edges. Next, each user’s background knowledge and target knowledge are analyzed to determine the knowledge gap. Because the user’s research program will change over time, reading records from a limited period (one year) are selected to ensure that the research targets of the user do not change greatly. Taking the user A as an example, A has read some papers during the last year, and A’s background knowledge is modeled as UT = {recommendation, user, collaborative filtering, content-based, preference, item, tag, rating, feature, data mining, similar, information overload}. The concept set after spreading activation is CRT = {social, user, network, social media, link, relation, information, recommendation, online, item, rating, collaborative, context, location, feedback, preference, community, clustering, spreading activation}. Target knowledge the user really needs to gain is TT = CRT − (UT ∩ CRT) = {social, network, social media, link, relation, online, context, location, feedback, preference, community}. The paths linking TT and UT are

p_1 = {social, social media, network}, p_2 = {network, how many}, p_3 = {social media, network, how many}, p_4 = {link, how}, p_5 = {base, closeness, how}, p_6 = {online, social media, how}, p_7 = {context, user}, p_8 = {location, user}, p_9 = {feedback, user}, p_10 = {preference, how}, p_11 = {community, network, recommendation}. The knowledge gap of A is KG = {p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9, p_10, p_11}. Then each concept path is used to find matching papers in the literature database.

We can see that A aims to do research in the field of social/trust network recommendation and feedback. The knowledge paths obtained by the method contain extra concepts such as online, social media, link and location. Compared with background knowledge alone, the knowledge paths can help the user learn important research directions from the knowledge gap.

6. Experiment results

In this section, experiments are conducted to compare KGR with other recommendation methods. 200 postgraduate students taking the “recommender systems” course at the Software School of a university are recruited as the subjects. The data set is first preprocessed. The title, abstract, keywords and body of the papers are chosen as experiment data. The Gibbs sampling algorithm is utilized to sample topic-word probabilities on the basis of the topics given by domain experts.

The postgraduate students freely search for and read papers related to “recommend” and “personalized recommendation”, and they record and submit the papers they have read. After each subject has read the same number of papers, they are given the same research proposal about recommendation and asked to read it. Based on their reading records and the research proposal, three methods, namely keyword matching recommendation (KMR), semantic expansion based recommendation (SER) and knowledge gap based recommendation (KGR), each recommend 5, 10, 15, 20 and 25 papers to each subject. The subjects are required to read all recommended papers. After that, each subject marks every recommended paper according to the question “whether recommended papers can help reach research target in the proposal” (true if yes, false otherwise). Fig. 3 shows the results of the three methods with different numbers of recommended papers.
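The precision reported for a list of N recommended papers can be computed from the true/false marks as sketched below. The text does not spell out the formula, so treating precision as the fraction of recommended papers marked true is an assumption, and the marks shown are invented.

```python
def precision_at_n(marks, n):
    """Fraction of the first n recommended papers that the subject marked as true."""
    top = marks[:n]
    return sum(top) / len(top) if top else 0.0


# Hypothetical marks of one subject for a ranked list of 10 recommended papers.
marks = [True, True, False, True, False, False, True, False, False, False]
for n in (5, 10):
    print(n, precision_at_n(marks, n))
```

Averaging this value over all subjects would give one point per method and per list length in a comparison like Fig. 3.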

It can be seen that the precision of all three methods goes down as the number of recommended papers increases. However, KGR holds the highest precision with a more stable trend, which indicates that it outperforms the other methods. The precision of SER is a little better than that of KMR when the number of recommended papers is small. But as the number increases, this advantage gradually narrows until SER is surpassed. The reason is that SER produces concepts closely related to background knowledge when recommending fewer papers, and thus gains higher precision. Yet, as the number increases, it includes some less related concepts, which lowers precision. KMR is therefore more stable than SER.

We also find that the three methods recommend a few different related papers, although their recommended results intersect. In this way, the recommended papers consist of those from KGR plus a few papers from the other two methods, but the total number is limited to 5–10.

The ROC (receiver operating characteristic) curve, which can be used to evaluate recommendation methods, plots the true positive rate (TPR) against the false positive rate (FPR). The diagonal from (0, 0) to (1, 1) represents random prediction; the points above it indicate better classification results than random prediction, and the points below it indicate worse results.
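How the points of such a curve arise can be sketched as follows. The construction used for Fig. 4 is not described in the text, so this is the standard threshold sweep over a ranked list under the assumption of distinct scores and at least one positive and one negative label; the scores and labels are invented.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points from sweeping a threshold down the ranked list."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, relevant in ranked:
        if relevant:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical recommendation scores and true/false marks for four papers.
points = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(points)
```

A method whose points lie above the diagonal ranks helpful papers ahead of unhelpful ones more often than random ordering would.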

Fig. 3. Comparison of recommendation precision of the different methods.








Fig. 4. Comparison of ROC for three methods.

Fig. 4 shows the ROC curves of the three methods when the number of recommended papers is 15, as the curve trend is roughly similar when the number of recommended papers varies from 5 to 25. When the number is too small, the effects of the three methods are difficult to distinguish. Similarly, when the number of recommended papers is too large, the three methods do not work as effectively as expected. When the number of recommended papers is 15, as can be seen from Fig. 3, KGR produces higher precision than KMR and SER, which means that KGR makes the most accurate predictions of the three methods.

SER performs better when the percentage of truly related papers is small, but the opposite appears as the percentage of related papers increases. The reason is that semantic expansion may bring about more noise. Fig. 4 confirms this result. KGR has the highest TPR at the same FPR, which indicates that papers recommended using KGR receive higher evaluations from the subjects than those recommended by the other methods.

All the subjects are asked to freely search for and read papers related to “recommend” and “personalization”. This work is divided into five stages according to the number of papers that the subjects have read. After each stage ends, each of the three recommendation methods suggests 10 papers to each subject. Then, all subjects read all the recommended papers and give each of them a satisfaction score in the range of 1 to 5 according to evaluation questions like “Does the paper provide the knowledge you do not have but helpful to your research?”, “Does the paper help deepen/expand understanding of your further research?”, “Does the paper help yield new insights of existing knowledge, introduce new features, usage and meanings?”. The answers reflect the subject’s interest in the papers, and a higher score indicates that the paper provides more valuable knowledge for research.

The average scores, which reflect the user satisfaction factor (SF), are calculated after all the subjects have rated the recommended papers. Herein, SF stands for the average degree of satisfaction of all the subjects. The more satisfied the subjects are, the more interesting the recommended papers are to them. Fig. 5 displays how SF changes as the number of recommended papers they read increases.
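As a small illustration, SF for one method at one stage could be computed as below. The exact averaging scheme is not given in the text; pooling all scores of all subjects is an assumption, and the subject identifiers and scores are invented.

```python
from statistics import mean


def satisfaction_factor(scores_by_subject):
    """Average of the 1-5 satisfaction scores over all subjects and rated papers."""
    all_scores = [s for scores in scores_by_subject.values() for s in scores]
    return mean(all_scores)


# Hypothetical scores given by two subjects to two recommended papers each.
sf = satisfaction_factor({"s1": [5, 4], "s2": [3, 4]})
print(sf)
```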

It is clearly observed that the satisfaction factor of all three methods declines as the number of papers increases. This means that it becomes more difficult to find valuable papers as more and more papers are recommended. SER and KMR show significant declines, whereas the SF of KGR remains relatively stable. The reason may be that a few related papers cover the topics included in more papers, and semantic expansion may bring about some noisy information. On the other hand, LDA can extract more suitable topics compared with simple keyword matching or semantic expansion methods. Furthermore, we use the spreading activation approach in the concept map to find more comprehensive concepts to describe the research goal behind the user’s research, because activation may originate from alternate concepts, which plays an important role in meaningful semantic processing. Compared with traditional semantic nets, the concept map provides a more solid basis for iteratively propagating to other concepts.

We also collected some papers about other topics from real-world datasets such as CiteULike (www.citeulike.org) and applied the expanded graph-based method to literature recommendation; similar results were found.

Fig. 5. Comparison of user satisfaction for different methods.

7. Conclusions

This paper proposes a graph-based method to explore knowledge paths, which can help recommend papers for researchers. To do so, we construct the concept map based on researchers’ background knowledge and their research proposals; then the shortest paths are extracted from the concept map to help identify helpful papers for researchers. Especially, we propose a method to recommend papers for researchers based on the knowledge gap. Unlike previous methods, user preference is no longer the only factor for recommendation. In KGR, the user’s background knowledge and research targets are used to determine the knowledge gap. KGR aims to recommend papers that can help bridge the knowledge gap. The method first builds the domain concept map based on the domain corpus. Then the user’s knowledge gap is thoroughly analyzed. The contributions of this paper are as follows: topics in a domain and their correlations are organized in the form of the concept map, which can reflect domain background knowledge; the research target and background knowledge of the user are used to determine the user’s knowledge gap; recommendation aims to fill the knowledge gap and help achieve the research goal.

The shortest path solution helps to bridge the knowledge gap faster, but it is not necessarily the most suitable way to identify the crucial topics/papers for removing the "knowledge gap" between background knowledge and research proposals. In this case, we can further identify the set of nodes that are critical for the connection between background knowledge and research proposals, for example, the nodes whose removal completely disconnects background knowledge from the research proposal.
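One simple way to look for such critical nodes is sketched below. It is only an illustration of the idea and not part of the proposed method: it tests single intermediate nodes by brute force (larger node sets are not considered), the graph is an unweighted adjacency list, and the function names and toy graph are hypothetical.

```python
from collections import deque


def connected(graph, sources, targets, removed):
    """Breadth-first search from the sources to any target, skipping one removed node."""
    seen = {s for s in sources if s != removed}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in targets:
            return True
        for nbr in graph.get(node, ()):
            if nbr != removed and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return False


def critical_nodes(graph, sources, targets):
    """Intermediate nodes whose removal disconnects the sources from the targets."""
    candidates = set(graph) - set(sources) - set(targets)
    return sorted(n for n in candidates if not connected(graph, sources, targets, n))


# Hypothetical concept map: "user" is background, "location" is target knowledge.
graph = {
    "user": ["context", "item"],
    "item": ["user", "context"],
    "context": ["user", "item", "location"],
    "location": ["context"],
}
print(critical_nodes(graph, {"user"}, {"location"}))
```

In the toy graph every route from "user" to "location" passes through "context", so it is the only critical node, while "item" can be bypassed.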

Since the method utilizes the research proposal and reading history to define the user’s knowledge gap, future research can focus on how to use more data from more fields to find the knowledge the user really lacks. Furthermore, we can use the knowledge graph to extend the concept map for more reasonable knowledge representation and more effective knowledge gap bridging.

Given the constantly growing number of research papers, we can divide the paper repository into many smaller ones using open source distributed computing technologies like Hadoop, or in-memory computing like Spark, to improve the performance of the proposed method. In this way, we can process the smaller repositories in parallel and combine the results by means of the MapReduce framework.

Acknowledgments

This paper was funded by the Shanghai Pujiang Program (14PJC017) and the National Natural Science Foundation of China under Grant no. 71071038.

References

Basu, C., Hirsh, H., Cohen, W. W., & Nevill-Manning, C. (2001). Technical paper recommendation: A study in combining multiple information sources. Journal of Artificial Intelligence Research, 14, 231–252.
Blanco-Fernández, Y., López-Nores, M., Gil-Solla, A., Ramos-Cabrer, M., & Pazos-Arias, J. J. (2011). Exploring synergies between content-based filtering and spreading activation techniques in knowledge-based recommender systems. Information Sciences, 181, 4823–4846.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bose, A., Beemanapalli, K., Srivastava, J., & Sahar, S. (2006). Incorporating concept hierarchies into usage mining based recommendations. In Proceedings of the 8th Knowledge discovery on the web international conference on Advances in web mining and web usage analysis (pp. 110–126). Springer.
Cantador, I., & Castells, P. (2006). Multilayered semantic social network modeling by ontology-based user profiles clustering: Application to collaborative filtering. Lecture Notes in Computer Science, 4248, 334–349.
Chen, Y. J., Chu, H. C., Chen, Y. M., & Chao, C. Y. (2013). Adapting domain ontology for personalized knowledge search and recommendation. Information & Management, 50, 285–303.
Davoodi, E., Afsharchi, M., & Kianmehr, K. (2012). A social network-based approach to expert recommendation system. Hybrid Artificial Intelligent Systems, 7208, 91–102.
Drachsler, H., Hummel, H. G. K., & Koper, R. (2008). Identifying the goal, user model and conditions of recommender systems for formal and informal learning. Journal of Digital Information, 2, 2009.


Durand, G., Belacel, N., & Laplante, F. (2013). Graph theory based model for learning path recommendation. Information Sciences, 251, 10–21.
Ferrari, A., & Gnesi, S. (2012). Using collective intelligence to detect pragmatic ambiguities. In Proceedings of 20th IEEE international conference on requirements engineering (pp. 191–200). IEEE.
Gao, Q., Yan, J., & Liu, M. (2008). A semantic approach to recommendation system based on user ontology and spreading activation model. In Proceedings of IFIP international conference on network and parallel computing (pp. 488–492). IEEE.
Gemmis, M. D., Lops, P., Semeraro, G., & Musto, C. (2015). An investigation on the serendipity problem in recommender systems. Information Processing & Management, 51, 695–717.
Guan, Z., Wang, C., Bu, J., Chen, C., Yang, K., Cai, D., et al. (2010). Document recommendation in social tagging services. In Proceedings of the 19th international conference on World wide web (pp. 391–400). ACM.
Hawalah, A., & Fasli, M. (2011). Using user personalized ontological profile to infer semantic knowledge for personalized recommendation. In Lecture notes in business information processing (pp. 282–295). Berlin, Heidelberg: Springer.
Ivie, S. D. (1998). Ausubel’s learning theory: An approach to teaching higher order thinking skills. The High School Journal, 82, 35–42.
Knijff, J. D., Frasincar, F., & Hogenboom, F. (2013). Domain taxonomy learning from text: The subsumption method versus hierarchical clustering. Data & Knowledge Engineering, 83, 54–69.
Lehmann, F. (1992). Semantic networks in artificial intelligence. Oxford: Pergamon Press.
Liang, T. P., Yang, Y. F., Chen, D. N., & Ku, Y. C. (2008). A semantic-expansion approach to personalized knowledge recommendation. Decision Support Systems, 45, 401–412.
Novak, J. D. (1998). Learning, creating, and using knowledge: Concept maps as facilitative tools in schools and corporations/J.D. Novak. Concept Mapping, 56, 392.
Pan, C., & Li, W. (2010). Research paper recommendation with topic analysis. In Proceedings of 2010 international conference on computer design and applications (ICCDA) (pp. 264–268). IEEE.
Salehi, M., & Kamalabadi, I. N. (2013). Hybrid recommendation approach for learning material based on sequential pattern of the accessed material and the learner’s preference tree. Knowledge-Based Systems, 48, 57–69.
Sanderson, M., & Croft, B. (2003). Deriving concept hierarchies from text. In Proceedings of international ACM SIGIR conference on research & development in information retrieval (pp. 206–213).
Santos, O. C., & Boticario, J. G. (2015). Practical guidelines for designing and evaluating educationally oriented recommendations. Computers & Education, 81, 354–374.
Sowa, J. F. (2006). Semantic networks. New York: John Wiley & Sons.
Tang, T., & McCalla, G. (2004). Beyond learners’ interest: Personalized paper recommendation based on their pedagogical features for an e-learning system. PRICAI Trends in Artificial Intelligence, 3157, 301–310.
Tsui, E., Wang, W. M., Cheung, C. F., & Lau, A. S. M. (2010). A concept–relationship acquisition and inference approach for hierarchical taxonomy construction from tags. Information Processing & Management, 46, 44–57.
Wei, X., & Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 06) (pp. 178–185). ACM.
Xu, Y., Guo, X., Hao, J., Ma, J., Lau, R. Y. K., & Xu, W. (2012). Combining social network and semantic concept analysis for personalized academic researcher recommendation. Decision Support Systems, 54, 564–573.
Zhang, H., Ni, W., Zhao, M., Liu, Y., & Yang, Y. (2014). A hybrid recommendation approach for network teaching resources based on knowledge-tree. In Proceedings of the 33rd Chinese control conference (pp. 3450–3455). IEEE.


Weidong Zhao has been an associate professor at the Software School, Fudan University, since 2003. He received his Ph.D. from Southeast University, China, in 2001. He was a visiting scholar at the Stern School of Business, New York University, between 2011 and 2012. His current research interests include intelligent data analysis and decision support systems. Dr. Zhao has published more than 60 articles in international conferences and journals such as PAKDD, Electronics and Electrical Engineering, Knowledge and Information Systems, etc.

Ran Wu is currently a master’s student at the Software School, Fudan University. Her research interests include business intelligence and recommender systems.

Haitao Liu is currently a master’s student at the Software School, Fudan University. His research interests include data mining and business intelligence. He has published several papers in journals including Computer Integrated Manufacturing Systems, Knowledge and Information Systems, etc.




