

[IEEE Sixth International Conference on Advanced Language Processing and Web Information Technology (ALPIT 2007) - Luoyang, Henan, China, 2007.08.22-2007.08.24]

Role Centralized Modeling for Expert Search in Enterprise Corporation

Jing Yao, Jun Xu, Cheng Jin, Junyu Niu

Department of Computer Science and Engineering, Fudan University 200433 Handan Road

Shanghai, P.R.C [email protected]

Abstract

Automatic expert finding systems, which aim to identify experts from a large document repository, have attracted considerable interest in recent years. To better describe the expert-document and document-topic relationships, we introduce the concept of "Role" into our model and expand its use in expert search. We illustrate how this concept leads to an effective approach for modeling the probability that a candidate is an expert on a given topic, based on the documents relevant to that topic. We evaluate the effectiveness of our model with a series of experiments on the TREC 2006 Enterprise Track expert search task. The results show that our role centralized model performs considerably better than general approaches overall, and it has significant potential for further development.

General Terms: Algorithms, Performance, Experimentation

Key Words: Expert finding, language model, role, R-EDT

1. Introduction

It is crucial to find the right person with expertise and appropriate skills in a specified field in large organizations such as enterprises. For instance, an employee may be working on a project that involves knowledge of particular equipment he is unfamiliar with, or he may be new to the job and unsure of whom to turn to for help. A simple way to solve this problem is to establish databases housing each individual's expertise for expert search [1]. However, both the process and its management are costly, and the data cannot stay up to date because of frequent staff turnover. Consequently, attention has shifted to automatic expert finding systems: Information Retrieval (IR) systems that identify experts from large pools of textual information.

Consider also a fast-growing IT company with a few hundred software development teams. The CEO of the company wants to start a new program to help new hires find out whom they can turn to with technical questions or questions about IT products. The company usually uses discussion groups to distinguish the technical areas, but this is inconvenient: the right group is not easy to find, and quick responses are rare.

In 2005, the Text REtrieval Conference (TREC) introduced an expert search task into the Enterprise Track and retained the task in 2006 and 2007, which demonstrates the popularity of expert finding systems. It provided a platform for finding experts from a document collection consisting of a crawl of the World Wide Web Consortium's web site, a canonical list of people from the W3C collection as expert candidates, and a set of query topics. For a given topic, the task of the system is to return a ranked list of possible experts with supporting documents.

This is a real-world example that we will analyze in detail in later sections. Given a corpus of articles and discussions, the key question is how to discover the relationships among experts, documents, and query topics. There are two main categories of solutions for processing the corpus: either we look for the specific topic in every expert's related documents, or we look into every document that contains information on the topic and check whether any expert's name appears [2]. Some solutions combine the two approaches [3]. Clearly, the critical issue is how we model the relationships among experts, documents, and topics to produce the final ranking. This paper focuses on how to detect both the expert-document and document-topic relationships and proposes a role centralized model representing the probability that a candidate is an expert on a topic, given the documents relevant to that topic.

This paper is organized as follows. In section 2, we analyze the problems of the classic approaches in detail and summarize their key flaws. We also review the typical solutions proposed to eliminate those defects, examine how they differ from one another, and discuss what could be done to improve them.

In section 3, we present our role centralized expert search model, which is based on the relationships among experts, documents, and topics.

In section 4, experimental results are shown to demonstrate the effectiveness of our model. The experiments used the tasks and the corpus of the TREC 2006 Enterprise Track.

0-7695-2930-5/07 $25.00 © 2007 IEEE. DOI 10.1109/ALPIT.2007.26

Finally, we draw conclusions and discuss our future work in section 5.

2. Related Work

The participants in the TREC Enterprise Track adopted several information retrieval (IR) technologies for the expert search task. Most participants use the two solutions mentioned in section 1. Balog et al. [2] proposed models based on the two solutions and compared them separately and extensively. Their first model, based on experts, represents an individual's knowledge with the associated documents; their second model, based on the topic, ranks documents according to the topic and then determines how likely a candidate is to be an expert by considering the associated document set. Both models have advantages and disadvantages: a person can be an expert in several fields, and Model 1 better captures the expert's knowledge, while Model 2 uses the most up-to-date information to identify expertise. In Balog's experiments, Model 2 performed significantly better than Model 1 on the W3C corpus.

The two approaches have a similar process but different emphases. From a pure engineering perspective, their pipelines are exactly the same. Both have three phases:

1. Getting the primary relationship
2. Getting the secondary relationship
3. Modeling, scoring, and merging the two relationships

The difference is also obvious: the expert based solution's primary relationship is between expert and document, while the other's is between document and topic.

However, these two solutions share problems in three areas:

1. The expert-document relationship. For instance, duplicated expert names and conflicts among aliases and nicknames introduce noise when establishing the relationship.
2. The topic-document relationship. Text indexing, noise in documents, and query engine limitations reduce accuracy in this area.
3. How to model and merge the two relationships. The definition of an expert is vague given only the topic, which makes it difficult to establish models.

The common solution for associating documents with candidates is to calculate the frequency with which candidates appear in the documents. A candidate can appear in a document in three ways: exact full name match, last name match, or email match [5]. Language models are used to estimate the strengths of the associations. The underlying assumption is that the more often a candidate appears in a document, the stronger the connection between the candidate and the document. The weakness of this method is obvious: if a candidate writes the document, he may appear in it only once, so the method cannot appropriately describe the candidate-document relationship. Our model introduces the role into the relationship description to solve this problem: we define several roles that the candidate may play in the documents.

In most expert finding systems, the relationship between documents and topics relies on a document retrieval engine such as Lucene or Lemur. However, this approach does not take full advantage of the positions where the topic terms appear in the document. Our model improves the description of the correlation between documents and topics.

There are several ways people have tried to resolve the problems above:

1. Social networks. The main idea is to build, from a given corpus, a topology that captures the contacts among experts. This idea stems from the observation that people who group together most likely share the same knowledge and, on average, the same level of expertise [16]. However, whether a correct network can be built is uncertain given the limits of the corpus (most likely emails are required). It is also unclear that people who communicate with each other frequently really are at the same technical level; the role they play is not guaranteed to be the same.

2. Ontologies for expert identification. This approach is used mainly to resolve duplicates and conflicts in expert names [5]. It is a great improvement, but it is a big topic in itself. The key point is how to handle initials, nicknames, and first names while avoiding conflicts. Current ontology strategies usually collect the contact's personal information, but initials and first names are then easily duplicated, and nicknames receive very little consideration.

3. Query expansion. This solution is intended to improve the accuracy of the topic-document relationship. The topic description contains details that people have tried to use to characterize the topic more precisely. However, the results show that even self-adaptive query expansion does not work as expected, because it introduces a great deal of noise.

4. Merged/combined/smoothed modeling. A modeling component is essential, mainly because the credibility of different parts of the corpus varies greatly. Assigning different weights to different corpus categories and adapting the strategy accordingly can improve the system [4]. The one, but huge, problem is that the thresholds are very hard to decide in general; they are difficult to test and tune even when the corpus has a fixed size, and even when using the same model, different corpora can give very different results with the same weights and the same categorization strategy.

5. More importantly, we still cannot obtain clear information on what makes an expert an expert on a topic. What is used most often is counting the number of times the expert's name appears in the documents, which is very problematic.

3. Role Centralized Model

Given the items for improvement listed above, we introduce a new kind of metadata into the original system: the Role. In this section, we discuss how the Role affects experts, documents, topics, and the relationships between them, and we show how the problems mentioned above might be solved. In brief, we call our model R-EDT, which stands for Role, Expert, Document, and Topic.

3.1 Modeling

Let us review the task of expert search: identifying experts among a candidate list from a document repository for a given query topic. The problem can thus be stated as estimating how probable it is that the candidate is an expert in the specified field. We consider the expert as a mixture of documents relevant to the query topic, and estimate the probability as follows:

p(c|q) ∝ Σ_{d ∈ D} p(c|d) · p(d|q)

where D is the document repository, p(c|d) denotes the relationship between the candidate and the document, and p(d|q) denotes the relevance of the document to the given topic.
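The ranking formula above can be sketched in a few lines; this is a minimal illustration only, in which the dictionaries `p_cd` and `p_dq` are hypothetical placeholders for the association and relevance estimates developed later in the paper.

```python
def expert_score(candidate, documents, p_cd, p_dq):
    """Score a candidate for a query: p(c|q) ∝ sum over d of p(c|d) * p(d|q).

    p_cd[(c, d)] : candidate-document association strength (hypothetical input).
    p_dq[d]      : relevance of document d to the query (hypothetical input).
    """
    return sum(p_cd.get((candidate, d), 0.0) * p_dq.get(d, 0.0)
               for d in documents)


# Tiny usage example with made-up numbers:
p_cd = {("alice", "d1"): 0.5, ("alice", "d2"): 0.2}
p_dq = {"d1": 0.8, "d2": 0.1}
score = expert_score("alice", ["d1", "d2"], p_cd, p_dq)  # 0.5*0.8 + 0.2*0.1
```

Ranking the candidates by this score (highest first) yields the expert list returned for the query.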

Most web documents fall into two main kinds: web pages and email discussions. In order to better describe the R-EDT relationships, we divide the corpus into sub-collections. As technical reports always contain more of the information users need, they contribute more to expert finding, so we extract the technical reports from the web pages as a separate collection. The total probability is then the weighted sum of the probabilities over the sub-collections:

p_Corpus(c|q) = ω_tr · p_tr(c|q) + ω_web · p_web(c|q) + ω_email · p_email(c|q)

The weight ω depends on the contribution of each document category to the system and can be set adaptively from experience, subject to

ω_tr + ω_web + ω_email = 1

From the above we can see that we are using a merged model for scoring, but we will not discuss how to overcome the problems of merged modeling in this paper.
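The weighted combination over sub-collections can be expressed directly; the sketch below checks the constraint that the weights sum to one. The key names ("tr", "web", "email") mirror the sub-collections above but are otherwise illustrative.

```python
def combined_score(scores, weights):
    """Merge per-sub-collection scores: p_Corpus(c|q) = sum of w_k * p_k(c|q).

    scores  : p_k(c|q) per sub-collection, e.g. {"tr": ..., "web": ..., "email": ...}
    weights : omega_k per sub-collection; must sum to 1.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * scores.get(k, 0.0) for k in weights)


# Example with the experiment's Classic2-style weights (illustrative values):
weights = {"tr": 0.5, "web": 0.35, "email": 0.15}
scores = {"tr": 0.4, "web": 0.2, "email": 0.1}
total = combined_score(scores, weights)  # 0.5*0.4 + 0.35*0.2 + 0.15*0.1
```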

3.2 Role Definition

In expert search, a role mainly refers to the actions and activities of an expert in a document; in real-world terms, we analyze the actual behavior of a given expert in the document. We define a role as a vector (weight, actions, ...). How do we define actions? An action is a list of text strings with weights, combining the following:

• prefix weight • prefix scope • prefix specified words • prefix terminators • suffix weight • suffix scope • suffix specified words • suffix terminators

When we find a given expert's name in the text, we identify his action in two directions: in the text before the name and in the text after it, scanning either to one of the terminators or up to the number of characters given by the scope value. In this context window, we look for specific words, such as "author" or "ask", that clearly identify the expert's action, and then decide whether the expert is acting in one of those roles.

A simple example of an action: "..., I asked Professor Wang some questions about CSS3 yesterday." With prefix terminator [','] and prefix word ["ask"], this is the simplest "advisor" action. Many actions can be defined for "advisor", and once all its actions are defined, we can choose a weight expressing how strongly being an advisor indicates being an expert, to be used in our modeling process.

Note that this is flexible. When defining different roles, it is critical to consider: should the expert's name be the subject or the object of the sentence? Which words should we use? Are there exceptions?
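The prefix-side matching described above can be sketched as follows. This is a simplified illustration of the mechanism, not the paper's full role dictionary: it scans backwards from the name to the nearest terminator (or at most `scope` characters) and checks for an action word.

```python
def detect_action(text, name, prefix_words, prefix_terminators, scope=40):
    """Return True if a prefix action word appears in the context before `name`.

    The prefix window runs backwards from the name to the nearest terminator
    character, bounded by `scope` characters. Word and terminator lists are
    illustrative; a real role definition would also scan the suffix side.
    """
    idx = text.find(name)
    if idx < 0:
        return False
    window = text[max(0, idx - scope):idx]
    # Cut the window at the last terminator so the context stays local.
    for t in prefix_terminators:
        cut = window.rfind(t)
        if cut >= 0:
            window = window[cut + 1:]
    return any(w in window.lower() for w in prefix_words)


# The paper's "advisor" example:
sentence = "Last week, I asked Professor Wang some questions about CSS3."
is_advisor = detect_action(sentence, "Professor Wang", ["ask"], [","])
```

A full implementation would attach the role's weight to each matched action and feed it into the scoring of section 3.4.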

3.3 Using the role to verify expert identity

The role based model is also used in the ontology that makes expert identities unique, to solve the conflicts and duplicates among expert identities. We compare the roles of the expert in the specific document with the roles across that expert's documents overall, and then evaluate the roles of the duplicated or conflicting identity to choose among the duplicates. For nicknames, we can likewise count role information, decide which candidate the nickname is closest to, and assign it to the correct owner.

Furthermore, most candidates have English names of the form "first name / last name", so they appear in the documents in different forms. In news pages their full names are reported, while in most mail discussions they are called only by their first names, and sometimes by an alias. First names can be duplicated, leading to inconsistency. To resolve this problem, we use an ontology based expert identity model that contains first name, full name, mail address, and alias for each unit. We give different weights μ_m(·) to four kinds of occurrence to reduce the noise: full name match; last name plus abbreviated first name match; last name match only; and email match only.

3.4 Using the role to improve the expert-document relationship

We need to take two things into consideration for the expert-document relationship: the form in which the candidate appears, and the role that the candidate plays in the document, which is mainly decided by the context.

Clearly, the document type is the basis for identifying the candidate's role. We define several roles over each document sub-collection and assign each role a different value v_r(·).

For a technical report, the author or editor is the key person and is assigned the highest value. Candidates who appear in the acknowledgements are relatively less important, but they also contributed to the completion of the report. If a candidate's name is merely mentioned in the content, he may have some relevance to the field the report covers. However, we do not consider names that appear in the references, which may introduce more noise than accuracy: for instance, if the technical report is about IPv6 but the references are about IPv4, we cannot infer that experts on IPv4 are experts on IPv6.

For other web pages, the occurrences can simply be divided into content and non-content. A candidate mentioned in the content part is more likely to be an expert in the specific field than one mentioned in the non-content part.

For email discussions, the candidates involved can be divided into three kinds: question raiser, responder, and discusser. We put the original mail, replies, and forwards into a single thread and judge a candidate's role by analyzing the context of each occurrence in the whole thread. We calculate the frequency with which the candidate appears in the thread in each role, and assign the candidate the role with the maximum frequency.
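The maximum-frequency role assignment for a thread is straightforward to sketch; the role labels below are illustrative names for the three kinds described above.

```python
from collections import Counter

def thread_role(role_observations):
    """Assign a candidate the role he plays most often in an email thread.

    role_observations: a list of role labels observed for the candidate
    across the thread, e.g. "raiser", "responder", "discusser"
    (illustrative names for the three kinds in the text).
    """
    if not role_observations:
        return None
    # Counter.most_common(1) returns the (role, count) pair with max frequency.
    return Counter(role_observations).most_common(1)[0][0]


# A candidate who answered twice and asked once is judged a responder:
role = thread_role(["responder", "raiser", "responder"])
```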

Thus, each candidate's relevance to a document can be scored as:

s(c_i, d) = Σ_{a_j} μ_m(c_i) · v_r(c_i, d)

where a_j stands for each occurrence of the candidate in the document.

Several candidates may appear in the same document, so the relationship between an expert and a document can be described with a language model as follows:

p(c_i|d) = s(c_i, d) / Σ_{c_i ∈ C} s(c_i, d)

Finally we get the equation:

p(c_i|d) = Σ_{a_j} μ_m(c_i) · v_r(c_i, d) / Σ_{c_i ∈ C} Σ_{a_j} μ_m(c_i) · v_r(c_i, d)
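The scoring and normalization described in this subsection can be sketched as below. Each appearance of a candidate is represented as a hypothetical (match weight, role value) pair standing in for μ_m(·) and v_r(·).

```python
def candidate_doc_prob(appearances, all_appearances):
    """p(c_i|d) = s(c_i, d) / sum over all candidates c of s(c, d),
    with s(c, d) = sum over appearances of match_weight * role_value.

    appearances     : list of (mu_m, v_r) pairs for one candidate in document d.
    all_appearances : dict mapping every candidate in d to its pair list.
    Values are illustrative stand-ins for the paper's mu_m(.) and v_r(.).
    """
    def s(apps):
        return sum(mu * v for mu, v in apps)

    total = sum(s(a) for a in all_appearances.values())
    return s(appearances) / total if total else 0.0


# "a" appears once as author (full-name match) and once in passing:
apps = {"a": [(1.0, 4), (0.5, 1)], "b": [(1.0, 1)]}
p_a = candidate_doc_prob(apps["a"], apps)  # 4.5 / 5.5
```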

3.5 Negative relationship

It is always hard to prove from textual information that someone is right, but it can be judged that someone is wrong by analyzing the context. The same holds in role based expert search: if there is a clue somewhere that someone is not an expert, the weight of the corresponding role can be negative, making the result more accurate.

3.6 Document-topic relationship

The document-topic relationship is principally decided by the document retrieval engine. But it is undoubtedly useful for an expert finding system to consider the positions where the topic terms appear in the document. Terms appearing in headers indicate higher relevance than terms in the content; if they appear only in references and links, the relevance is so low that it can be ignored. So we get:

p(d|q) = Π_{t_i ∈ q} α_i · s(d|t_i)

where t_i stands for the terms of the query topic, s(d|t_i) is given by the search engine, and α_i is a weight set from experience.
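The product over query terms can be sketched as below; `term_scores` (the engine's s(d|t_i)) and `position_weights` (the α_i per term) are hypothetical inputs for illustration.

```python
from functools import reduce

def doc_topic_prob(term_scores, position_weights):
    """p(d|q) = product over query terms t of alpha_t * s(d|t).

    term_scores      : engine relevance s(d|t) per query term (hypothetical).
    position_weights : alpha per term, reflecting where the term appears
                       (header > content > reference/link), also hypothetical.
    """
    return reduce(lambda acc, t: acc * position_weights[t] * term_scores[t],
                  term_scores, 1.0)


# Two-term query; "css" appears in a header, "style" only in the content:
p = doc_topic_prob({"css": 0.5, "style": 0.4}, {"css": 1.0, "style": 0.8})
```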

Another interesting direction is to extend our definition of role: to define roles on the words of a topic's description as candidate query terms, and to invent roles describing how query terms affect the topic, which could greatly reduce the noise introduced by query expansion.

Overall, the introduction of the Role makes up for some of the defects of traditional solutions that simply focus on the frequency of name appearances. It helps make the definition of "being an expert" clear:

An expert is someone who plays essential roles, as an expert, in topic related documents.

4. Experiments


To evaluate the effectiveness of our role centralized model, we use the expert search task and the corpus of the TREC 2006 Enterprise Track. The data set is a crawl of the publicly available web of the World Wide Web Consortium performed in June 2004, containing 330,037 documents of 6 different types [6]. In our experiments, we treat them as just two types, email lists and web pages, and we extract technical reports from the web pages as a subset by analyzing the URLs. In addition, the track provided a list of 1092 candidate experts and 55 query topics.

We ran 4 experiments using different strategies for expert search, listed as follows:

1. Origin: a bare-hands experiment, using the "topic centralized" approach with nothing else. It uses Lemur to index the corpus and query the topics, then counts expert name occurrences. Experts are scored by how many times their names appear in the retrieved documents.

2. Classic1: this approach uses a categorized document-expert relationship to try to improve the result. We categorize the documents into three categories: technical article, mail discussion, and others, with weights 0.45:0.25:0.3 for technical/mail/other. However, this turns out to fail: in Figure 1, the P-R curve shows the precision drops off very quickly.

3. Classic2: we changed the weights to 0.5:0.35:0.15, which improves the result a little.

4. Role Centralized: we use only the simplest roles, without any other refinements.

Prefix      Weight    Suffix       Weight
ask         1         ask          -0.5
expert      10        expert       10
consult     5         don't know   -1
author      4         author       4
?           -0.5      editor       4
editor      4         help         -2
help        3

Table 1. Role set used in the experiment

Figure 1 and Table 2 show the experimental results of the 4 strategies for expert search. Although the results are still not good enough, we can see that, compared to the classic approaches, the new solution performs better overall. Obviously, the prefix and suffix dictionaries for the role definition are incomplete, which reduces the effectiveness of the new approach. By analyzing the corpus more carefully, we can expand the action word dictionary, and the performance will certainly improve. There is substantial potential to dig deeper in the future.

[Figure 1. P-R curves of the 4 experiments (precision vs. recall; curves: Role, Origin, improve-unsuccessful, improve-successful)]

Name         R-P     MAP     P@5     P@10    P@20
Simple Role  0.353   0.253   0.485   0.401   0.305
Origin       0.293   0.183   0.334   0.290   0.198
Classic1     0.251   0.235   0.328   0.289   0.245
Classic2     0.293   0.191   0.344   0.323   0.282

Table 2. Results of the different strategies for expert search

5. Conclusion and Future Work

After an introduction to expert search, we mainly covered three phases. First, the related work, including the classic approaches with their problems analyzed and the imperfect solutions to those problems, from which we derived a list of common problems that current systems still have. We then defined a brand-new concept, the Role, and integrated it into our own expert search model R-EDT (Role-Expert-Document-Topic). We also discussed how it can improve these systems and offers a chance to solve the problems they currently leave unsolved.

Looking ahead, the model needs to be tested more thoroughly and applied more widely, which is challenging because of the large matrix of test perspectives and plans we have; we will follow up in TREC 2007. The role concept can also be inherited and extended to the document-topic area, which holds great potential for us. Another task is to work out how to define robust and effective roles, for which we need guidelines and more tests.

6. References


[1] T. H. Davenport and L. Prusak. Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press, Boston, MA, 1998.
[2] K. Balog, L. Azzopardi, and M. de Rijke. "Formal models for expert finding in enterprise corpora". In: SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference, 2006.
[3] D. Petkova and W. B. Croft. "Hierarchical language models for expert finding in enterprise corpora". In: Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '06), pp. 599-608, 2006.
[4] C. Macdonald and I. Ounis. "Voting for candidates: adapting data fusion techniques for an expert search task". In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006.
[5] J. Niu and C. Lin. "WIM at TREC Enterprise Track". In: Proceedings of the 15th Text REtrieval Conference (TREC 2006), 2006.
[6] N. Craswell, A. P. de Vries, and I. Soboroff. "Overview of the TREC-2005 Enterprise Track". TREC, www.trec.nist.gov.
[7] I. Becerra-Fernandez. "Searching for experts on the Web: a review of contemporary expertise locator systems". ACM Transactions on Internet Technology, Vol. 6, No. 4, November 2006, pp. 333-355.
[8] K. Balog and M. de Rijke. "Finding experts and their details in e-mail corpora". In: Proceedings of the 15th International Conference on World Wide Web, 2006.
[9] W3C. The W3C test collection, 2006. URL: http://research.microsoft.com/users/nickcr/w3c-summary.html.
[10] M. Maybury, R. D'Amore, and D. House. "Expert finding for collaborative virtual environments". Communications of the ACM, Vol. 44, No. 12, pp. 55-56, December 2001.
[11] D. Mattox, M. T. Maybury, and D. Morey. "Enterprise expert and knowledge discovery". In: Proceedings of HCI International '99 (the 8th International Conference on Human-Computer Interaction), Volume 2, August 22-26, 1999.
[12] D. Yimam-Seid and A. Kobsa. "Expert-finding systems for organizations: problem and domain analysis and the DEMOIR approach". Journal of Organizational Computing and Electronic Commerce, Vol. 13, No. 1, pp. 1-24, 2003.
[13] X. Niu, G. McCalla, and J. Vassileva. "Purpose-based expert finding in a portfolio management system". Computational Intelligence Journal, Vol. 20, No. 4, pp. 548-561.
[14] B. Fields, S. Keith, and A. Blandford. "Designing for expert information finding strategies". Technical Report IDC-TR-2004-001, January 2004.
[15] T. Reichling, K. Schubert, and V. Wulf. "Matching human actors based on their texts: design and evaluation of an instance of the ExpertFinding framework". In: Proceedings of the 2005 International ACM SIGGROUP Conference on Supporting Group Work, pp. 61-70, 2005.
[16] J. Zhang and M. S. Ackerman. "Searching for expertise in social networks: a simulation of potential strategies". In: Proceedings of GROUP '05, 2005.
[17] D. Hawking. "Challenges in enterprise search". In: Proceedings of the Fifteenth Australasian Database Conference, 2004.
[18] R. D'Amore. "Expertise community detection". In: SIGIR '04, July 25-29, Sheffield, South Yorkshire, UK, 2004.
[19] J. M. Ponte and W. B. Croft. "A language modeling approach to information retrieval". In: SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275-281, New York, NY, USA, 1998.
[20] P. Ogilvie and J. Callan. "Combining document representations for known-item search". In: Proceedings of ACM SIGIR 2003, pp. 143-150, Toronto, Canada, 2003.
