
Exploring Question Selection Bias to Identify Experts and Potential Experts in Community Question Answering

ADITYA PAL, F. MAXWELL HARPER, JOSEPH A. KONSTAN, University of Minnesota

Community Question Answering (CQA) services enable their users to exchange knowledge in the form of questions and answers. These communities thrive as a result of a small number of highly active users, typically called experts, who provide a large number of high quality, useful answers. Expert identification techniques enable community managers to take measures to retain the experts in the community. There is further value in identifying experts during the first few weeks of their participation, as it would allow measures to nurture and retain them. In this paper, we address two problems: (a) How to identify current experts in CQA? and (b) How to identify users who have the potential of becoming experts in the future (potential experts)? In particular, we propose a probabilistic model that captures the selection preferences of users based on the questions they choose for answering. The probabilistic model allows us to run machine learning methods for identifying experts and potential experts. Our results over several popular CQA datasets indicate that experts differ considerably from ordinary users in their selection preferences, enabling us to predict experts with higher accuracy than several baseline models. We show that selection preferences can be combined with baseline measures to improve the predictive performance even further.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - retrieval models; selection process; H.1.2 [Information Systems]: User/Machine Systems - human factors; human information processing

General Terms: Experimentation, Human Factors, Algorithms

Additional Key Words and Phrases: Expert Identification, Question Selection Process, Community Question Answering

ACM Reference Format: Pal, A., Harper, F. M., Konstan, J. A. Exploring Question Selection Bias to Identify Experts and Potential Experts in CQA. ACM Trans. Inf. Syst. V, N, Article A (January YYYY), 27 pages. DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

This work extends our prior published work [Pal and Konstan 2010]. This work is supported by National Science Foundation grants IIS 08-08692 and 08-12148. Authors' addresses: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA.

1. INTRODUCTION
Community Question Answering (CQA) services provide a platform for online users to exchange information in the form of questions and answers. Despite the existence of several other information exchange venues - such as bulletin boards, blogs, microblogs, emails, newsgroups, and wikis - CQA has seen immense popularity in recent years. This popularity can be estimated from the fact that within one year of its launch, Yahoo! Answers attracted about 60 million unique visitors and 160 million answers [press release 2006]. Over the past few years several specialized CQA sites have emerged which are tailored towards specific domains, such as sites that field technology questions (e.g. Stackoverflow.com [Stackoverflow 2008]) and sites geared towards taxes and tax preparation (e.g. TurboTax Live Community (TTLC) [TurboTax 2007]). These sites have become remarkably useful, and studies have found that the quality of answers presented there may exceed the quality available from professional librarians and information specialists [Harper et al. 2008].

At the core of these communities lie "answer people", who are the main drivers of answer production in the community [Viegas 2004], [Fisher et al. 2006], [Welser et al. 2007]. We call these answer people experts and define them more formally as users who provide a large number of high quality answers. Expert users are found to provide far more answers than ordinary users, and they seldom ask questions [Zhang et al. 2007]. Because of this level of participation, a considerable amount of research has focused on discovering the motivation that drives experts. Researchers consider altruism to be the primary motivation for experts to volunteer and contribute value in the community [Olson 1971], [Ostrom 1991]; other reasons, such as increased social status [Gilbert 1990], [Weber 2004], expected reciprocity, and learning incentives [Kollock 1998], [Subramanyam and Xia 2008], have also been proposed.

Expert identification in online services is a widely studied problem, and several researchers have proposed models to find experts in online communities. The most notable efforts in CQA are [Zhang et al. 2007] and [Bouguessa et al. 2008]. [Zhang et al. 2007] proposed a graph-based ranking model called ExpertiseRank, a slight variant of PageRank [Page et al. 1999], to estimate the expertise of users in CQA. They also proposed a metric, Z-score, based on the number of answers provided by a user and the number of questions asked by that user. They used a hand-labeled dataset to compare the predictive power of the selected metrics. Their results showed that a simple metric like Z-score outperforms complex graph-based measures such as PageRank and ExpertiseRank. [Bouguessa et al. 2008] proposed a model to identify experts in Yahoo! Answers. They considered the number of best answers as an indicator of user expertise and used clustering algorithms to find experts.

Another problem that is closely related to expert identification is the problem of identifying potential experts. We define potential experts as those users who have the potential of becoming experts in the future. Potential expert identification has not received much attention from the research community. We speculate two reasons for this: (i) the measures that typically reflect the expertise of a person, such as the number of answers or the number of best answers, might not be strong enough during a person's early participation and hence might not work well for early identification, and (ii) there is a lack of studies emphasizing the benefit of early identification of experts. In our prior work [Pal et al. 2011], we focused on a domain-specific CQA and proposed a model to identify potential experts. Our results show promise towards the identification of potential experts in CQA communities.

In this paper, we address the two problems identified above: (a) identification of current experts in the community, and (b) identification of potential experts, i.e., users who are likely to become experts in the future. In particular, we propose a probabilistic model to encode the question selection preferences (QSP) of CQA users. We encode QSP based on the existing value (EV) of prior answers to a question. We then use machine learning algorithms to distinguish experts and potential experts from ordinary users based on their QSP. Our results indicate that QSP can be extremely effective in the identification of experts and potential experts, thus confirming the findings of prior research work [Yang et al. 2008], [Yang and Wei 2009], [Nam et al. 2009]. The prior research shows that expertise can be correlated with doing uncommon tasks in the community, such as answering hard questions or contributing more actively on questions where rewards are high (typically by being the first answerer). We complement the prior work by proposing a formal model for the identification of experts and potential experts in CQA.


1.1. Our Contributions
The main contributions of this paper are as follows:

(1) We propose an approach to model the question selection preference of CQA users in terms of the existing value (EV) of prior answers to a question. We propose a general model to estimate the EV of a question and a probabilistic model to compute the selection preferences based on EV.

(2) We show the effectiveness of our model for expert identification over several baseline models. We run our experiments over two popular and complete datasets, showing that our model performs equivalently to, and in some cases better than, the best baseline model.

(3) We show the effectiveness of our model in identifying potential experts early in their participation with the community. We do this by considering the first few weeks of participation data for users and show that our models perform equally well or better than the best baseline models over the two CQA datasets.

(4) We show that model performance can be improved by a probabilistic formulation of a baseline measure, over the baseline model derived from the same measure. In particular, we show this improvement for the number of votes on users' answers.

(5) We perform a temporal analysis of users' selection preferences and analyze how their preferences vary over time. We hypothesize that community dynamics can explain the changes that users' selectivity undergoes over time.

The rest of the paper is organized as follows: Section 2 presents our literature survey. Section 3 describes our model and Section 4 describes the datasets used. Section 5 describes the models and the features used for the machine learning classification, and Section 6 presents our results. Section 7 presents our analysis of the selection preferences of users. Sections 8 and 9 present the discussion, conclusion, and future work.

2. RELATED WORK
The approaches towards finding experts can be broadly divided into two categories: graph-based approaches and feature-based approaches. Graph-based approaches model the underlying domain in the form of an expertise graph where the nodes represent the domain entities (experts and non-experts) and the edges between the nodes represent some notion of expertise (e.g. influence, prominence, authoritativeness, etc.). Algorithms are then run to identify notable graph nodes based on graph properties such as connectedness, centrality, and vitality, or by computing measures such as PageRank and HITS. Feature-based approaches extract features for each entity based on its correspondence in the underlying domain; e.g., in CQA the potential features can be the number of answers, the number of questions, part-of-speech analysis, graph features, etc. These features can be combined in non-linear ways to compute additional features, e.g., the Z-score we saw earlier. Machine learning algorithms are then used over the extracted features to identify experts - typically this step involves clustering algorithms, ranking algorithms, and/or thresholds based on domain knowledge.

The most notable graph-based approaches are the PageRank algorithm [Page et al. 1999] and the HITS algorithm [Kleinberg 1998]. Both of these approaches model the Internet as a graph where a node represents a web page, and a directed link between node_a and node_b indicates that the web page corresponding to node_a has a hyperlink that points to the web page corresponding to node_b. This representation of the Internet is also called the web graph. [Page et al. 1999] proposed the PageRank algorithm, based on Markov chains, to compute the probability of a random surfer landing on a given web page. A web page with higher probability is then considered more important than a web page with lower probability. This forms the basis of ranking web pages and is used successfully in the Google search engine [Engine 1999]. The HITS algorithm [Kleinberg 1998] computes two values for a web page: its authority value, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages. Authority and hub values are defined in terms of one another in a mutual recursion. Several variants of these two main algorithms have been proposed, such as SALSA [Lempel and Moran 2001], EntityRank [Zhang et al. 2007], TwitterRank [Weng et al. 2010], and AuthorRank [Liu et al. 2005]. For a comprehensive survey of graph-based ranking algorithms, see [Berkhin 2005], [Borodin et al. 2005].

Expert identification has been the subject of extensive research in several online domains, ranging from Usenet newsgroups to email networks to the blogosphere to the more recent microblog arena. [Fisher et al. 2006] applied social network analysis to the task of characterizing authors in Usenet newsgroups by computing patterns of replies for each author, finding that second-degree ego-centric networks give a clear distinction between different types of authors and newsgroups. They revealed the presence of "answer people", i.e., users with high out-degree and low in-degree who reply to many but are rarely replied to, and who provide most of the answers to the questions in the community.

In email networks, user expertise has recently been studied by [Campbell et al. 2003] and [Dom et al. 2003]. [Campbell et al. 2003] used a graph-based algorithm (HITS) and a feature-based algorithm (thresholding using keyword frequencies) in an email network and found that the HITS algorithm performs much better than the feature-based one. They concluded that the HITS algorithm takes communication patterns and email content into account, while the feature-based algorithm took only email content into account. [Dom et al. 2003] represented an email network as a digraph and compared several graph algorithms for their effectiveness in identifying experts. For their network structure, the PageRank algorithm outperformed the rest of the algorithms (including HITS). The inferior results of HITS were attributed to its assumption of the existence of hubs and authorities, which might not represent the true scenario in email networks, primarily because most users might have email correspondence with very few experts, unlike the web graph, where a page can link to several authoritative pages.

Academic journals have been the subject of expertise analysis to rank journals on the basis of their impact and to evaluate the impact of individual authors publishing in those journals. This is primarily done by co-authorship analysis. One early example of this is the Erdos number, which indicates the smallest co-authorship distance between any individual mathematician and the Hungarian mathematician Erdos [Castro and Grossman 1999]. One indication of the prominence of a person is a small Erdos number; Paul Erdos himself had Erdos number 0. [Liu et al. 2005] modeled the co-authorship in academic papers in the form of a weighted graph where graph nodes represent the authors and an edge between two nodes exists if the corresponding authors have co-authored a paper together. The edge weight is estimated based on the frequency of co-authoring between the two nodes, normalized per author. They proposed an alternate version of the PageRank algorithm called AuthorRank which computes PageRank for a weighted graph. Their results show that AuthorRank performs better than PageRank and several other graph centrality measures such as degree, betweenness, and closeness.

In blogs and microblogs, several efforts have attempted to surface top authorities. [Java et al. 2006] employed models proposed by [Kempe et al. 2003] to model the spread of influence in the blogosphere in order to select an influential set of bloggers that would maximize the spread of information in the blogosphere. [Java et al. 2007] proposed methods to find prominent blog feeds using the folder and feed names used to organize the feeds and through the subscriber counts of the feeds. [Weng et al. 2010] modeled Twitter in the form of a weighted directed topical graph. They consider the topical tweets posted by a user to estimate the topical distribution of the user and construct a separate graph for each topic. The weight between two users estimates how much correlation there is between the two users in the context of the given topic. A variant of PageRank called TwitterRank is run over these graphs to estimate the topical importance of each user. [Pal and Counts 2011] proposed a feature-based real-time algorithm for finding topical authorities in microblogs. They proposed several features of microblog users that capture their topical interests (topical signal, signal strength, etc.) and several graph-based features that capture the topical prominence of the person (such as mention impact and network impact). Based on these features they ran clustering algorithms to find clusters of experts; these users are then ranked using a ranking algorithm.

2.1. Expert Finding Approaches in CQA
Feature-based and graph-based approaches for finding experts in CQA have been applied by several researchers. [Bouguessa et al. 2008] proposed a feature-based model to identify experts in CQA. They considered the number of best answers given by a user as an indicator of the authority of that person. In their approach they modeled the authority scores of users as a mixture of gamma distributions, used the Bayesian Information Criterion (BIC) to estimate the appropriate number of mixture components, and computed the parameters of each mixture component using the Expectation Maximization (EM) algorithm. The suggested benefit of their approach over graph algorithms is that it automatically surfaces the number of experts in the community, whereas for graph algorithms the number of experts is assumed to be known beforehand (which might not be desirable).

[Zhang et al. 2007] proposed ExpertiseRank, a slight variant of PageRank, which considers not only how many people one helped but also whom he/she helped. They also proposed a feature-based measure called Z-score, computed from the number of answers (a) and the number of questions (q) as $Z_{score} = \frac{a-q}{\sqrt{a+q}}$. A user with a higher Z-score is more likely to be an expert than a user with a lower Z-score. This implies that experts answer a lot of questions and ask very few questions (often zero). Their results indicate that for the Java developer forum, simple measures like Z-score perform better than more complex graph-based algorithms like PageRank, ExpertiseRank, and HITS. The authors concluded that the network's structural characteristics - i.e. how users are interconnected - matter for the effectiveness of expertise ranking algorithms.
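As a quick illustration with hypothetical counts: a user with a = 47 answers and q = 3 questions has $Z_{score} = (47-3)/\sqrt{47+3} \approx 6.2$, whereas a user with a = q = 25 has $Z_{score} = 0$, even though both have 50 contributions in total.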

[Jurczyk and Agichtein 2007] performed link analysis on a dataset from Yahoo! Answers using a slight adaptation of the HITS algorithm. Their results indicate that the HITS algorithm outperformed simple graph measures (such as indegree-based algorithms). [Guo et al. 2008] proposed a technique for recommending questions to users based on their interests. They proposed a generative model which could tap a user's participation in CQA effectively by discovering their topic inclination and expertise and then recommending appropriate questions.

In this work we explore the selection preferences of users for the task of expert identification. The selection preferences of experts have also been explored by previous researchers in other contexts. [Yang et al. 2008] study Taskcn.com (www.taskcn.com), a knowledge sharing online community which provides monetary incentives to answerers. They observe that a task's prestige impacts the kind of participation it attracts, and their results show that popular tasks attract fewer expert participants whereas peripheral tasks attract more expert participation. [Yang and Wei 2009] studied another online question answering community, Baidu Knows (www.baidu.com). They observed that the group of users who do both asking and answering choose less challenging questions to answer as compared to users who only answer questions. [Nam et al. 2009] study Naver's (www.naver.com) question answering community and show that users are motivated in part through a point-based system. Summarizing the related work on the selection biases of experts, prior work has explored selection preferences mostly when the value of a question has been explicitly defined in terms of points or monetary value. In this paper, we propose a simple definition of existing value as the number of answers received on a question and use this definition of existing value to show that experts can be identified in the community. This idea has been explored to some extent in our prior work [Pal and Konstan 2010], and we extend our prior work in the following ways.

(1) We lift several modeling assumptions that we used to capture the concept of "existing value". We present several different functional paradigms to estimate this value and choose two of them for experimentation.

(2) We consider three different probabilistic models for estimating users' selection preferences. The estimations made by each model differ from one another from a quantitative as well as a qualitative standpoint.

(3) We consider several different classification models for learning user preferences in order to classify experts.

(4) We consider several baseline and aggregate models and explore their classification performance.

(5) We run extensive experiments on the effectiveness of selection preferences for predicting experts and on how they can be used to boost the performance of the baseline models.

(6) We run several new experiments to show the effectiveness of our generalizations for identifying potential experts.

3. QUESTION SELECTION PROCESS
We begin with the question: in a community question answering site, what questions do users choose to answer? The main motivation for asking this question is the hypothesis that since experts provide a large number of helpful answers, they must be more selective while picking questions in order to maximize the amount of help provided per answer. We encode this selectivity of users in the form of a probabilistic model which is then used for their classification.

Encoding users' selection preferences is challenging because in any popular Q&A system there are at least hundreds of thousands of questions in various states: {unsolved, wrongly solved, partially solved, solved}. It would be naive to assume that every user would be interested in answering unsolved questions, because such an assumption would fail to explain the existence of tens of thousands of unanswered questions alongside a high average number of answers per question (69,675 (7.7%) unanswered questions yet 2.83 answers per question for Stackoverflow.com, or 140,052 (22%) unanswered questions yet 1.4 answers per question for the TurboTax Live community).

Furthermore, popular Q&A interfaces such as Yahoo! Answers (www.answers.yahoo.com), Stackoverflow.com, TurboTax Live Community, etc., typically present a list of questions, with a brief snippet per question, on their front page. This makes it difficult to estimate whether it is the accessibility of the question due to promotion by the interface, the question's popularity, the user's topical expertise, some other question characteristic, or a random click that leads to the question being selected for a detailed view and ultimately being answered.

Table I. Attributes that reflect the value of existing answers.
Number of answers - The number of answers is a direct measure of the value of existing answers on a question. The higher the number of answers, the higher the value.
Votes received - Votes and ratings given by users indicate an answer's actual value as perceived by community members.
Answer status - A special status such as best answer or helpful answer is awarded to certain answers, indicating their relative value in comparison to other answers on the same question.
Author reputation - The reputation of an author can affect the value of his or her answers.
Content quality - Characteristics often used in information retrieval - e.g., length, number of citations, number of hyperlinks, and context similarity - can be used to effectively estimate the value of existing answers.

We propose the existing value of a question as a measure to capture the question selection process employed by users. We use existing value to build a distribution which captures the pattern in the kind of questions a user prefers answering.

3.1. Existing Value of a Question
We formally define the existing value (EV) of a question as follows:

Definition 3.1 ($EV_q$). The existing value of a question q is the overall value of the answers provided on that question. For an individual answer, the value is an indicator of how beneficial the answer is to the question asker and other community members in comparison with the prior answers on the same question.

The concept of EV can be used to explain the expertise of a person. Experts are users who provide the most valuable answers, and for them to be effective they should pick questions with low EV so as to maximize the likelihood of providing a valuable answer. For a solved or nearly solved question, the incremental value that a new answer could add is low. This leads to the following hypothesis:

HYPOTHESIS 1. Users who aim to provide valuable answers would choose questions with low EV in order to improve the probability of their answer turning out to be valuable and beneficial to the community.

Next, we present models to estimate EV in a CQA site.

3.2. Models to Estimate EV

Table I lists several factors that can be used to estimate the EV of a question in a question answering site. We expect EV to have a positive correlation with the number of answers on a question. However, the number of answers alone is not sufficient to estimate the EV, because conversational questions receive a lot of answers without contributing much value, whereas informational questions receive few answers yet can be highly informational, as shown by [Harper et al. 2009]. The status of an answer and the votes received on it can further refine this estimate of value.

The first three measures in Table I, namely the number of answers, the votes received, and the answer status, are generally directly observable by CQA users. The other two factors are generally unobservable and can require complex computation. Hence we concentrate on the first three measures to estimate the EV of a question.

Let $EV_q$ denote the existing value of a question q and $V_a$ denote the value of an answer a; then $EV_q = \sum_{a \in A} V_a$, where A is the set of answers on question q. Further, the value of an answer is a function of the votes received on it and its status. In a general setting, we define $EV_q$ as follows:

$$EV_q = G\left(\sum_{a \in A} V_a\right) = G\left(\sum_{a \in A} F(votes_a, status_a)\right) \qquad (1)$$

In this equation, $votes_a$ indicates the number of votes received by answer a, and $status_a = 1$ if a is selected as a best or helpful answer and $status_a = 0$ otherwise. The functions F and G can be arbitrary functions, and their choice is critical to the estimation of EV and the performance of the classification models. In our previous work, we explored the following definitions for the F and G functions:

$$F(x, y) = 1 + x + y \qquad (2)$$

This F function indicates that the value of an answer is simply the sum of its votes and status. The constant 1 is added to this value to ensure that $EV_q$ depends on the number of answers to q. This definition of the F function ensures that the overall value of the answers in A is an integer.

$$G(x) = H_b(x) = \begin{cases} 0 & \text{if } x < 1 \\ b & \text{if } x \geq b \\ \lfloor x \rfloor & \text{otherwise} \end{cases} \qquad (3)$$

where $\lfloor x \rfloor$ denotes the largest integer smaller than or equal to x. We could also write the $H_b$ function as $H_b(x) = [1 + \text{sign}(x)] \cdot [b + x + \text{sign}(x - b) \cdot (b - x)]$. The $H_b$ function restricts the output to one of the b + 1 elements in the set {0, 1, ..., b}. The $H_b$ function reduces the range of values EV can take, which helps while modeling the question selection process (by restricting its dimensionality). In practice the range reduction is required because otherwise the EV for some questions would exceed one thousand, whereas for the majority of questions (99%) it would be less than 20-30. In our previous work [Pal and Konstan 2010] we chose b = 5; in this paper we lift this assumption and study the model performance over different values of b.
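As a small worked example with hypothetical values: with b = 5, equation 3 gives $H_5(0) = 0$, $H_5(3) = 3$, and $H_5(12) = 5$; any aggregate answer value of 5 or more is clamped to 5.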

We also modified the $H_b$ function slightly to create a distinction between no value and negative value. This is important because we want to distinguish a high quality answer with no votes on it from a low quality answer with negative votes on it. The new $H_b$ function is defined as follows:

$$H_b(x) = \begin{cases} -1 & \text{if } x < 0 \\ b & \text{if } x \geq b \\ \lfloor x \rfloor & \text{otherwise} \end{cases} \qquad (4)$$

The difference here is that the $H_b$ function is restricted to b + 2 values in the set {-1, 0, ..., b}, and it creates a clear distinction between no value and negative value. We also consider an alternate formulation for the F function, in addition to the one defined in equation 2, as defined below:

$$F(x, y) = \mathbf{1}\{x + y > 0\} \qquad (5)$$


where $\mathbf{1}\{condition\}$ is an indicator random variable which yields 1 if the condition is true and 0 otherwise. Thus the F function outputs 1 for an answer with positive votes or with a special status, and 0 otherwise. This formulation suggests that EV does not depend on the number of answers on a question but only on the number of valuable answers on it.

Based on the above two definitions of the function F, there are two possible estimates of EV, as defined below:

Definition 3.2 (D1). $EV_q = H_b\left(\sum_{a \in A} F(votes_a, status_a)\right)$, with F as defined in equation 2. This formulation considers the value of an answer to be the sum total of its votes, its status, and a constant 1.

Definition 3.3 (D2). $EV_q = H_b\left(\sum_{a \in A} F(votes_a, status_a)\right)$, with F as defined in equation 5. This formulation implies that the existing value of a question equals the number of valuable answers provided on it.

To summarize, we model the existing value of a question based on the number of answers, the votes provided on those answers, and any special status on those answers. This formulation is not only simple but also easy to compute, as it uses directly observable factors in CQA. One assumption that is implicit in this discussion is that the EV of a question is the same for all users. Although computing EV from the perspective of different users might yield different values, this simplifying assumption keeps the formulation simple and avoids circular dependencies (e.g., on the expertise of the user).
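The following sketch is our own illustration (not code from the paper) of how $EV_q$ could be computed under D1 and D2 for a single question; the (votes, status) pairs and the helper names are hypothetical, and the clamping follows equation 4 (or equation 3 when negative values are disallowed).

# A minimal sketch (our illustration, not the authors' code) of computing the
# existing value EV_q of a question under D1 and D2. Each prior answer is a
# hypothetical (votes, status) pair, where status is 1 for a best/helpful
# answer and 0 otherwise.
import math

def clamp_ev(x, b, allow_negative=True):
    """H_b of equation 4; with allow_negative=False it reduces to equation 3."""
    if allow_negative and x < 0:
        return -1
    if not allow_negative and x < 1:
        return 0
    if x >= b:
        return b
    return math.floor(x)

def ev_d1(answers, b):
    """D1: the value of an answer is 1 + votes + status (equation 2)."""
    return clamp_ev(sum(1 + votes + status for votes, status in answers), b)

def ev_d2(answers, b):
    """D2: the value of an answer is 1{votes + status > 0} (equation 5)."""
    return clamp_ev(sum(int(votes + status > 0) for votes, status in answers), b)

# Hypothetical question with three prior answers.
answers = [(2, 1), (0, 0), (-1, 0)]      # (votes, status) per answer
print(ev_d1(answers, b=15))              # 4 + 1 + 0 = 5
print(ev_d2(answers, b=15))              # only the first answer is "valuable" -> 1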

3.3. Probabilistic Model for Question Selection Process based on EV

We model the question selection process of a user in the form of a probability distribution over EV. This probability distribution indicates the probability of the user choosing a question with a specific EV. As stated in Hypothesis 1, this probability distribution gives us a clue as to whether the user intends to provide a valuable answer, and also about her expertise, without even looking at the actual answers posted by the user.

In order to build this probability distribution, we consider the set of answers provided by a user u. Let this set be denoted by $A = \{a_1, a_2, \ldots, a_n\}$. Next, we examine the set of corresponding questions $Q = \{q_1, q_2, \ldots, q_n\}$ over which the answers in A were provided.

Note that we assume that if u answers a question q more than once, then the new answer is posted on a duplicate instance of the question q. This is done for notational simplicity, so that the elements in the sets A and Q have a one-to-one mapping. We denote the existing value of a question $q_i$ just before the corresponding answer $a_i$ is posted as $ev^{a_i}_{q_i}$. We define a discrete probability distribution using one answer instance, as follows:

$$P(EV = ev^{a_i}_{q_i} \mid u, a_i, q_i) = \begin{cases} 1 & \text{if } EV = ev^{a_i}_{q_i} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

This is a discrete probability distribution which is defined at the b + 2 points where EV is defined. The above distribution says that the probability mass is concentrated at the observed EV value and is zero at all other points. Next, we take all the answers in A and use Bayes' rule and marginalization to compute the selection preference distribution independent of the question, as follows:


$$\begin{aligned} P(EV \mid u) &= \sum_i P(EV = ev^{a_i}_{q_i} \mid u, a_i, q_i) \cdot P(a_i, q_i) \\ &= \sum_i P(EV = ev^{a_i}_{q_i} \mid u, a_i, q_i) \cdot P(a_i) \cdot P(a_i \mid q_i) \\ &= \frac{1}{n} \sum_i P(EV = ev^{a_i}_{q_i} \mid u, a_i, q_i) \end{aligned} \qquad (7)$$

where $P(a_i \mid q_i) = 1$ and $P(a_i) = 1/n$. The prior $P(a_i)$ is taken to be a uniform distribution over the n answers provided by the user u. The distribution defined in equation 7 provides insight into the user's selection process: if $P(EV \mid u)$ is high for a low EV value (say -1 or 0), then the user prefers answering questions with little existing value on them. This is a generic formulation of the question selection process using the concept of EV. We propose three practical applications of this formulation for running our experiments, as follows:

Definition 3.4 (M1). The simplest practical application of the above model is to consider all the answers provided by the user to compute her selection preference. The selection preference computed this way is completely independent of the actual answers provided by the user. Additionally, we consider the D1 formulation for EV (see definition 3.2). This ensures that M1 is almost the same as the model explored in our previous work (with a slight improvement).

Definition 3.5 (M2). There can be instances where a user does not apply her selection process as stringently as she does at other times, e.g., while posting a response to a follow-up reply to her earlier answer. Other scenarios can be posting comments, clarifications, acknowledgements, etc. In order to account for this, we consider only those instances where her answers turned out to be valuable to the CQA community. This is formulated as follows:

$$P(EV \mid u) = \frac{\sum_i \mathbf{1}\{v_{a_i} > 0\} \cdot P(EV = ev^{a_i}_{q_i} \mid u, a_i, q_i)}{\sum_i \mathbf{1}\{v_{a_i} > 0\}} \qquad (8)$$

where $\mathbf{1}\{x\}$ is an indicator random variable which is 1 if x is true and 0 otherwise, and $v_{a_i}$ represents the value of answer $a_i$ (the sum of votes and status on $a_i$, excluding the constant 1). To compute EV we use the formulation defined by D1.

Definition 3.6 (M3). This model considers all the answers provided by the user (same as M1), but for the computation of EV it uses the D2 definition (see definition 3.3). This model can be represented in short as M3 = M1 + D2.

We also considered an M2 + D2 model, but it performed similarly to M2 and hence, to conserve space, we do not include it.

In summary, we considered three models M1, M2, and M3 that capture the question selection process of users in CQA. The selection process is captured in terms of a probability distribution over the EV of a question, which determines the likelihood of a user answering a question with a given EV. This probability distribution is a discrete distribution defined at b + 2 points (see equation 4).
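As a concrete illustration with hypothetical numbers: if a user has provided n = 4 answers and the existing values of the corresponding questions at answer time were 0, 0, 1 and 3, then equation 7 gives P(EV = 0 | u) = 0.5, P(EV = 1 | u) = 0.25, P(EV = 3 | u) = 0.25, and 0 at the remaining points of {-1, 0, ..., b}. A minimal sketch (our illustration, not the authors' code) of turning a user's answer history into the M1, M2 and M3 feature vectors could look as follows; the Answer record and its field names are assumptions:

# A minimal sketch (our illustration, not the authors' code) of building the
# selection-preference feature vectors used by M1, M2 and M3. Each answer
# record is assumed to carry the EV of its question under D1 and D2 just
# before the answer was posted, and the answer's own value v_a
# (votes + status, excluding the constant 1).
from collections import namedtuple

Answer = namedtuple("Answer", ["ev_d1", "ev_d2", "v_a"])

def preference_distribution(ev_values, b):
    """Equation 7: the average of point masses over the b + 2 bins {-1, 0, ..., b}."""
    counts = {v: 0 for v in range(-1, b + 1)}
    for ev in ev_values:
        counts[ev] += 1
    n = max(len(ev_values), 1)
    return [counts[v] / n for v in range(-1, b + 1)]

def m1_features(user_answers, b):
    # M1: all answers, EV computed with D1.
    return preference_distribution([a.ev_d1 for a in user_answers], b)

def m2_features(user_answers, b):
    # M2 (equation 8): only answers that turned out to be valuable (v_a > 0), EV with D1.
    return preference_distribution([a.ev_d1 for a in user_answers if a.v_a > 0], b)

def m3_features(user_answers, b):
    # M3: all answers, EV computed with D2.
    return preference_distribution([a.ev_d2 for a in user_answers], b)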

4. DATASET DESCRIPTION
We used data from two popular Q&A services: TurboTax Live Community and Stackoverflow.com. These CQA services allow users to post questions and answers belonging to several categories within a specialized domain. We gathered a complete dataset for these two services from the day of their launch till the data collection date.


Fig. 1. Plot of the number of answers (log scale) of the top answerers amongst the two types of users (experts and ordinary users) for the TurboTax dataset.

Table II. Interaction characteristics of the U10 users for the TurboTax dataset.
number of U10 users: 1,367 (1% of all answerers)
number of questions asked by U10 users: 4,790 (1% of all questions)
number of answers provided by U10 users: 226,539 (33% of all answers)
number of best answers provided by U10 users: 23,662 (46% of all best answers)
number of superusers (S10 users): 83 (6% of U10 users)
number of questions asked by S10 users: 1,963 (41% of those asked by U10 users)
number of answers provided by S10 users: 177,426 (78% of those provided by U10 users)
number of best answers provided by S10 users: 20,731 (88% of those provided by U10 users)

4.1. TurboTax Live Community
TurboTax Live Community [TurboTax 2007] is a Q&A service dedicated to tax-related questions and answers. Intuit launched the online Q&A community in 2007, and it has been the most popular site in the USA for users to ask tax-related questions. The dataset we used spans the years 2007, 2008 and 2009. It consists of 633,112 questions provided by 525,143 unique users and 688,390 answers provided by 130,770 unique users. Intuit has employees that manually identify experts from the large pool of their Q&A users. They evaluate a candidate on several factors, such as the correctness and completeness of answers, politeness in responses, and the language and choice of words used. They also check a user's professional background for experience in the tax domain. Based on these factors they elevate a user to a superuser (which is an expert in our case). Once a user is marked as a superuser, her status is visible to all other users in the community. At the time of our data collection, they had labeled 83 superusers out of 130,770 answer-providing users. Figure 1 shows the number of answers provided by the 83 superusers and the 83 ordinary users with the highest number of answers. The plot shows that a system that sorts users based on the number of answers would result in poor precision and recall for the TurboTax system.

For our experiments, we selected users who have provided 10 or more answers in the community. We call these users U10 users. There are 1,367 U10 users in the TurboTax dataset (83 superusers and 1,284 ordinary users). These 1,367 users, who form 1% of all the 130,770 answerers, have provided 33% of all the answers in the community. Amongst them, superusers provided 78% of the answers. Table II presents some of the interaction characteristics of the U10 users.


Table III. Interaction characteristics of the U10 users for the Stackoverflow.com dataset.
number of U10 users: 29,869 (19% of all answerers)
number of questions asked by U10 users: 412,200 (46% of all questions)
number of answers provided by U10 users: 2,063,184 (87% of all answers)
number of best answers provided by U10 users: 521,506 (92% of all best answers)
number of experts (E10 users): 2,986 (10% of U10 users)
number of questions asked by E10 users: 82,886 (20% of those asked by U10 users)
number of answers provided by E10 users: 1,118,895 (54% of those provided by U10 users)
number of best answers provided by E10 users: 310,851 (60% of those provided by U10 users)

4.2. Stackoverflow.com
Stackoverflow.com is one of the most popular online sites for software development questions. It allows questions ranging from algorithms to software tools to specific programming problems. Stackoverflow.com discourages questions that are subjective or argumentative (see http://stackoverflow.com/faq). We downloaded the complete dataset from its launch in August 2008 to September 2010 (http://blog.stackoverflow.com/2010/09/creative-commons-data-dump-sept-10/). The dataset consists of 904,632 questions asked by 165,590 unique users and 2,367,891 answers posted by 156,640 unique users. Unlike TurboTax, Stackoverflow.com does not promote a user to an expert or superuser level. It only allows other community members to assign badges (e.g. "Useful Question", "Helpful Answer") to a user's contributions. Though these badges can be useful motivating factors, they cannot be directly used to estimate the expertise of a person.

We devised an expert labeling scheme based on the number of answers posted by the users. This labeling is motivated by the Z-score model proposed by [Zhang et al. 2007], which is reported to perform better than several graph-based methods such as PageRank, HITS and their weighted variants. Recall that the Z-score for a user equals $\frac{a-q}{\sqrt{a+q}}$, where a is the number of answers provided by the user and q is the number of questions asked by that user. The Z-score of top experts is directly proportional to $\sqrt{a}$, as $q \approx 0$ for the top experts. As a result, the number-of-answers criterion and the Z-score criterion lead to exactly the same labeling for the Stackoverflow.com dataset. The labeling is described as follows.

First we selected users who provided 10 or more answers as the U10 users. This led to a selection of 29,855 users. Then we sorted these users based on the number of answers provided by them in decreasing order and selected the top 10% (2,986) as experts. As per this labeling, a model based on the number of answers would result in 100% classification accuracy (for both precision and recall). There are several other features that can be considered besides the number of answers for labeling users, such as the number of best answers and the number of questions. However, the number of answers is the simplest of them (the selection of a best answer is a complex process and varies from site to site and user to user). Further, prior work shows that the number of answers corresponds well with the expertise of users [Zhang et al. 2007]. Additionally, this labeling criterion does not influence our model, because the selection preferences of users as modeled by us are probability distributions over EV and do not contain any factors corresponding to the number of answers provided by the users. Table III presents the interaction characteristics of Stackoverflow users within their community.
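A minimal sketch (hypothetical data structures, not the authors' code) of this labeling scheme: keep users with at least 10 answers (U10) and mark the top 10% by answer count as experts; the z_score helper illustrates why ranking by answer count and by Z-score coincide for the top answerers.

# Hypothetical labeling sketch: num_answers maps user id -> number of answers provided.
import math

def label_experts(num_answers, min_answers=10, expert_fraction=0.10):
    u10 = [uid for uid, a in num_answers.items() if a >= min_answers]
    u10.sort(key=lambda uid: num_answers[uid], reverse=True)
    n_experts = int(len(u10) * expert_fraction)
    return set(u10[:n_experts]), set(u10[n_experts:])   # (experts, ordinary U10 users)

def z_score(a, q):
    """Z-score of [Zhang et al. 2007]; for top answerers q is near 0, so it grows
    like sqrt(a) and ranking by Z-score matches ranking by answer count."""
    return (a - q) / math.sqrt(a + q)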

4.3. Stackoverflow.com - Manual Labeling
We also created a small hand-coded dataset of 100 Stackoverflow.com users. The coding was done by looking at the profiles of the 100 users and their latest 10 answers, and comparing their answers with the other answers on the same questions. We asked two expert coders to label the 100 users as either expert or non-expert. We used a third coder to break ties between the two expert coders. The inter-rater agreement between the expert coders is 0.72 (Fleiss' kappa with 95% CI, p ~ 0), which indicates that the high agreement between the raters is not accidental. Out of the 100 users, 22 were labeled as experts and the rest as non-experts.

4.4. Basic Comparison Between the TurboTax and Stackoverflow.com Datasets
Based on the interaction characteristics of U10 users in these two communities (see Tables II and III), we can draw certain distinctions between the two communities. The Stackoverflow community has a much larger population of U10 users than the TurboTax community, even though both communities have a comparable number of answerers. This fact suggests that users participate more vigorously in Stackoverflow.com.

Another striking thing about the U10 users of the TurboTax community is that they have asked only 1% of all the questions on the site but have provided a whopping 33% of all the answers. This indicates that the top contributors of this community behave as "answer people" in the true sense. On the other hand, in Stackoverflow we see a very different trend. Here the U10 users, who constitute only 19% of the community, have asked a significantly high 46% of the questions and have provided 86% of the answers, suggesting that these users are driving the community from both ends.

Additionally, we see that the experts in the TurboTax dataset form a very small population (6%) amongst U10 users, yet their answer contribution takes up a large percentage (78%) of the contributions of U10 users. This is not the case for the Stackoverflow.com dataset, where experts contribute more than ordinary users but the distinction is not as clear as in the case of TurboTax. The contrast between these two communities is probably due to the fact that Stackoverflow is a community of peers while TurboTax is more like a support system for mostly novice users.

4.5. Basic Characteristics
Figure 2 shows the distribution plots of the characteristics of all the users in the two communities. The plots indicate that the user features - the number of answers, the number of best answers, and the number of questions - follow a power-law distribution. These distributions match the findings presented in [Zhang et al. 2007]. The power-law distribution indicates highly uneven participation amongst users. For example, the "Num Answers" plot shows that a small percentage of users are responsible for answering a large percentage of questions, or conversely a large percentage of users are responsible for a small percentage of answers.

Users in both communities exhibit a deviation from the power-law distribution, which is prominent towards the tail of the distribution. The deviation is slightly larger for Stackoverflow than for TurboTax, though both exhibit strong power-law characteristics.

In summary, the two communities are extremely popular and each caters to a specific interest group online. The users in these communities show power-law characteristics similar to other online communities. The TurboTax dataset gives us a hand-coded labeling of experts. For the Stackoverflow dataset we use two labeling schemes: (i) a synthetic labeling based on the number of answers, and (ii) a hand-coded labeling of a small set of users.


Fig. 2. Log-Log distribution of basic user characteristics in the two communities. The panels plot cumulative probability against the number of answers (TurboTax α = 2.37, Stackoverflow α = 2.54), the number of best answers (TurboTax α = 1.65, Stackoverflow α = 2.53), and the number of questions (TurboTax α = 3.5, Stackoverflow α = 3.3).

5. MODELS FOR EXPERT IDENTIFICATION
We created several models based on different user features, such as the number of answers, the number of best answers, the number of questions, and question selection preferences. Some of the models were based on factors that were used by prior research work on expert identification in the Q&A domain. We do not consider graph-based features such as PageRank and HITS, as they are extremely expensive to compute and have been reported to perform worse than feature-based methods [Zhang et al. 2007], [Pal and Counts 2011] for expert identification. Additionally, our work aims at analyzing the effectiveness of selection preferences alongside the baseline measures towards the identification of experts and potential experts. The selected models differ only in the underlying features used; the same classification algorithm was used for all the models (described later) for a fair comparison between them. The model features are described as follows:

B0: This model uses the number of answers (a), the number of questions (q), and the Z-score ($\frac{a-q}{\sqrt{a+q}}$). The Z-score has been used by [Zhang et al. 2007], and the authors showed that it performs better than PageRank-based measures for finding experts.

B1: This model is based on the number of best answers given by each user. [Bouguessa et al. 2008] used best answers to find experts and their number in Yahoo! Answers.

B2: This model is based on the number of questions asked by a user. It tries to capture the intuition that experts, like true "answer people", do not ask questions.

B3: This model considers the number of votes received on all the answers provided by a user. Votes are a direct measure of how the community perceives the value of a user's contributions.

Apart from the above four baseline models, we consider the following models, which are based on a user's selection preferences:

M1: This model extracts a user's question answering preferences in terms of EV. In this model, the EV of a given question equals the sum of the number of answers, the votes on those answers, and the status indicators of those answers (see Definition 3.4).

M2: This model is based on a user's question selection preferences as per Definition 3.5. In this model, we compute the selection preferences of a user based only on the valuable answers provided by her. Thus, we model the user's selection preferences when she intended to help others.

M3: This model is based on a user's question selection preferences as per Definition 3.6. This model considers that the existing value of a question equals the number of valuable answers provided on it.

M∗: This model combines the features of the M1, M2, and M3 models. This results in a much higher dimensional feature space; we considered factor analysis [Gorsuch 1983] to reduce the dimensionality, but it did not improve the model performance. Since we use 10-fold cross-validation, over-fitting is practically ruled out; the only concern is that running the 10 folds of cross-validation is more time consuming, but this is not an issue for us because we run in an off-line setting.

B∗+M∗: This model combines the features of the four baseline models with the features of the M∗ model.

B−∗+M∗: This model combines the features of the three baseline models B1, B2, and B3 with the features of the M∗ model. The B0 model is left out on purpose because, for Stackoverflow, the number of answers (the B0 model) is used to generate the labeled dataset and hence we do not wish to use it for classification.
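To make the features concrete, here is a minimal sketch (our own illustration, not the authors' code) of the B0 features and of an M-style selection-preference distribution. The handling of negative and large EV values follows our reading of Equation 4 and is an assumption:

    # Hypothetical sketch of two of the feature sets described above; names and the
    # exact bucketing of negative / large EV values are our assumptions.
    import math
    import numpy as np

    def b0_features(num_answers, num_questions):
        """B0: number of answers, number of questions, and the Z-score."""
        a, q = num_answers, num_questions
        z = (a - q) / math.sqrt(a + q) if (a + q) > 0 else 0.0
        return [a, q, z]

    def selection_preference(ev_values, b=15):
        """M-style feature: (b + 2)-bucket distribution over the EV of answered questions."""
        counts = np.zeros(b + 2)
        for ev in ev_values:
            if ev < 0:
                counts[0] += 1            # one bucket for negative EV
            elif ev >= b:
                counts[b + 1] += 1        # one bucket for EV >= b
            else:
                counts[int(ev) + 1] += 1  # buckets for EV = 0, 1, ..., b-1
        return counts / counts.sum() if counts.sum() > 0 else counts

    # Example: a heavy answerer who mostly picks unanswered (EV = 0) questions
    print(b0_features(120, 3))                                   # -> [120, 3, ~10.55]
    print(selection_preference([0, 0, 0, 1, 0, 2, 0, -1], b=5))  # 7-bucket distribution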

We considered several machine learning algorithms, such as Support Vector Machines [Cortes and Vapnik 1995], Decision Trees [Quinlan 1993], and Gaussian Discriminant Analysis [Friedman 1989]. The classification algorithm that consistently performed better than all the others is the Bagging meta-classifier [Breiman 1996] over decision trees with reduced-error pruning. The bagging classifier reduces the variance of the underlying decision tree classifier, thereby improving its generalization accuracy. We used Matlab [MATLAB 2010] and Weka [Hall et al. 2009] to run the machine learning algorithms. Due to the consistent performance of the Bagging classifier, and for presentation considerations, we do not present results from the other classification algorithms.
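The original experiments were run in Weka and Matlab. As an approximate scikit-learn analogue (scikit-learn has no reduced-error-pruning tree, so a standard decision tree stands in), the classifier could be set up as in this sketch:

    # Approximate stand-in for Weka's Bagging over REPTree; our analogue, not the
    # authors' exact configuration.
    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    clf = BaggingClassifier(
        estimator=DecisionTreeClassifier(),   # base_estimator= in older scikit-learn
        n_estimators=10,
        random_state=0,
    )

    # Tiny synthetic fit just to show usage
    X, y = np.random.rand(40, 3), np.random.randint(0, 2, 40)
    clf.fit(X, y)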


Table IV. Model performance over TurboTax dataset. (b = 15, 10-fold crossval)

            B0     B1     B2     B3     M1     M2     M3     M∗     B∗+M∗
precision   0.67   0.70   0.73   0.75   0.74   0.69   0.85   0.76   0.82
recall      0.52   0.66   0.23   0.57   0.48   0.45   0.40   0.52   0.65
f-measure   0.58   0.68   0.35   0.65   0.58   0.54   0.54   0.62   0.72

Table V. Model performance over Stackoverflow dataset. (b = 15, 10-fold crossval). Note that we do not show the B0 model here as it is used for expert labeling; a similar precaution is taken by considering B−∗+M∗ here.

            B1     B2     B3     M1     M2     M3     M∗     B−∗+M∗
precision   0.88   0.43   0.78   0.94   0.89   0.99   0.94   0.92
recall      0.74   0.01   0.73   0.82   0.80   0.81   0.84   0.90
f-measure   0.80   0.02   0.75   0.88   0.84   0.89   0.89   0.91

Table VI. Model performance over Stackoverflow dataset (with manual labeling). (b = 15, 10-fold crossval)

            B0     B1     B2     B3     M1     M2     M3     M∗     B∗+M∗
precision   0.84   0.87   0.66   0.75   0.84   0.84   0.87   0.87   0.87
recall      0.73   0.77   0.45   0.63   0.73   0.73   0.77   0.77   0.82
f-measure   0.78   0.82   0.53   0.68   0.78   0.78   0.82   0.82   0.84


6. RESULTS

We used 10-fold cross-validation to run the Bagging algorithm over the various models presented in the previous section. Cross-validation [Stone 1974] avoids over-fitting of the classification models on the training data and helps in measuring the true generalization error. The only downside of cross-validation is that it can be quite time consuming.

We chose b = 15 for most of our experiments as this choice leads to satisfactory performance on the selected datasets. A choice of b < 15 degraded the model performance, and b > 15 did not improve the performance significantly. Moreover, large values of b can make the model computationally prohibitive and cause learning issues due to sparsity in a high dimensional space. Section 6.5 presents a detailed analysis of how our model performs for different values of b.

We computed precision (p), recall (r), and f-measure (2pr/(p + r)) for the expert class. We do not report aggregate accuracy values over experts and non-experts (ordinary users), as this would hide the true performance of the models in detecting experts.
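A hedged sketch of this evaluation loop follows; the feature matrix X, the binary expert labels y, and the classifier configuration are placeholders, not the paper's data:

    # Illustrative 10-fold evaluation reporting expert-class precision/recall/f-measure.
    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_fscore_support

    rng = np.random.default_rng(0)
    X = rng.random((200, 17))                      # e.g. 17-dimensional M-model features
    y = rng.integers(0, 2, 200)                    # 1 = expert, 0 = ordinary user

    clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=0)
    pred = cross_val_predict(clf, X, y, cv=10)
    p, r, f, _ = precision_recall_fscore_support(y, pred, labels=[1], average=None)
    print(p[0], r[0], f[0])                        # precision, recall, f-measure for experts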

6.1. Model Performance in Identification of Experts

Tables IV, V, and VI show the predictive performance of the Bagging classification algorithm in the identification of experts. The M models have higher precision on the two datasets, suggesting their practical usability in building models with higher predictive power. The aggregate M model (M∗) performs better than the baseline models for Stackoverflow on all three performance measures. For the TurboTax dataset, it has slightly higher precision than the best baseline model (B1) but marginally lower recall. We suspect that this is due to the very small number of experts in the TurboTax community. Nevertheless, the model that combines all the measures performs best, balancing precision and recall to obtain a higher f-measure in all cases. The improvement over the best individual model is 6% for TurboTax and 2% for Stackoverflow, which is statistically significant using a paired one-sided t-test (90% CI, p = 0.055).
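For the significance claim above, a paired one-sided t-test over per-fold scores can be set up as in the sketch below; the numbers are made up and only the test setup is illustrated:

    # Illustrative paired one-sided t-test over per-fold f-measures (synthetic values).
    from scipy import stats

    combined = [0.72, 0.70, 0.74, 0.71, 0.73, 0.72, 0.70, 0.75, 0.71, 0.73]
    best_single = [0.68, 0.67, 0.70, 0.66, 0.69, 0.68, 0.66, 0.71, 0.67, 0.69]
    t, p = stats.ttest_rel(combined, best_single, alternative='greater')  # scipy >= 1.6
    print(t, p)   # reject "no improvement" at the 90% level if p < 0.10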


[Figure 3 (panels described): precision, recall, and f-measure when features are computed from the first week, first month, first 3 months, and all of each user's data; TurboTax panels compare B0, B1, B3, M∗, and B∗+M∗, while Stackoverflow panels compare B1, B3, M∗, and B−∗+M∗.]

Fig. 3. Model performance in identification of potential experts (b = 15, 10-fold cross-validation).

Though this improvement is small, it can be critical in situations where experts are needed for volunteer mechanisms or for a task.

6.2. Model Performance in Identification of Potential Experts

Identification of potential experts requires us to identify experts based on their early participation. We consider the first n weeks of participation data per user to compute their features. For a given user, the start of her participation is measured from the timestamp of her first answer. Figure 3 shows the performance of the models over the two datasets (for the sake of clarity and presentation, we skip the individual M models). We observe that the precision of the M∗ model is significantly better than the baseline models for TurboTax and is comparable to B1 for Stackoverflow. The high precision of our model indicates that some potential experts behave significantly differently from other new joiners during the early phase of their lifecycle.
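A minimal sketch of the "first n weeks" restriction, assuming each answer is stored as a (timestamp, question_id) pair (our representation, not the authors'):

    # Keep only the answers given within n weeks of the user's first answer.
    from datetime import datetime, timedelta

    def first_n_weeks(answers, n):
        ordered = sorted(answers, key=lambda a: a[0])
        if not ordered:
            return []
        cutoff = ordered[0][0] + timedelta(weeks=n)      # n weeks after the first answer
        return [a for a in ordered if a[0] < cutoff]

    answers = [(datetime(2009, 1, 1), "q1"), (datetime(2009, 1, 20), "q2"), (datetime(2009, 3, 1), "q3")]
    print(first_n_weeks(answers, n=4))                   # keeps the first two answers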


Table VII. Performance of models in identification of experts early on in TurboTax. (b = 15)

            B0      B1     B2    B3    M∗     B∗+M∗
precision   1       1      0     0     1      1
recall      0.125   0.25   0     0     0.25   0.25
f-measure   0.22    0.4    0     0     0.4    0.4

Table VIII. Performance of models in identification of experts early on in Stackoverflow. (b = 15)

            B0     B1     B2    B3     M∗     B∗+M∗
precision   1      0.96   0     0.71   1      1
recall      0.08   0.07   0     0.11   0.1    0.11
f-measure   0.15   0.13   0     0.19   0.18   0.20

Overall, we see that the aggregate model provides considerable improvement in recall and f-measure, leading to the conclusion that selection preferences can be used along with baseline measures for more accurate identification of potential experts.

Considering the performance of the models over 1 month of data, we observe that M∗ provides a 20% improvement in precision over the best baseline model (B3) for TurboTax and is roughly equivalent to the best baseline (B1) for Stackoverflow. The high model precision in both cases suggests that the selection preferences of a sizable proportion of potential experts (25-35%) are distinguishable from those of ordinary users and other potential experts. Furthermore, as the time window increases, the model performance improves, primarily due to the availability of more data for feature computation and a better estimation of users' expertise. Three months of data leads to satisfactory performance for the aggregate model (0.7-0.8 precision, 0.45-0.6 recall, 0.6-0.7 f-measure), suggesting that a substantial proportion of potential experts could be identified by running the aggregate model over the first 12 weeks of data and picking the predicted users as potential experts.

6.3. Identifying Experts Early On

In a practical setting, we would like to identify potential experts who are in their early phase of participation in the community and yet are similar to the current experts. This helps in situations where the aim is to identify new users who match the level of experts, so that these users can be retained before their interest levels wane.

We perform this experiment by dividing the users into two sets: a training set (80% of the users) and a test set (20% of the users). For the users in the training set, all their data is used to compute their features. For the users in the test set, we use only their first month of data to compute their features. The training set is used to build a classification model, and the test set is used to test its predictive power (no cross-validation is performed in this case). Tables VII and VIII show the model performance in detecting experts early on. We note that the M∗ model performs as well as or better than the best baseline model, indicating its overall reliability. The combined aggregate model slightly improves the predictive performance.
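The asymmetric feature computation of this experiment could look like the following sketch; the data is synthetic and the two feature matrices stand in for full-history versus first-month features:

    # Training users' features come from all of their data, test users' features
    # only from their first month of activity; all values below are synthetic.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    all_feats = rng.random((100, 17))       # features from each user's complete history
    month_feats = rng.random((100, 17))     # features from the same user's first month
    labels = rng.integers(0, 2, 100)        # 1 = expert

    idx_train, idx_test = train_test_split(np.arange(100), test_size=0.2, random_state=0)
    X_train, y_train = all_feats[idx_train], labels[idx_train]   # full-history features
    X_test, y_test = month_feats[idx_test], labels[idx_test]     # first-month features only

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(clf.predict(X_test)[:10])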

We observe that all the models struggle with low recall, indicating that not all potential experts reach their full potential in their first month and thus cannot be compared with the current experts using only the initial data. However, some potential experts do measure up to the bar set by the current experts early on. The precision of 1 for M∗ indicates that the model does not go wrong in the early identification of these potential experts. It also performs better than all the baseline models. We surmise that the reason is that our model, unlike the baseline models, does not depend on the quantity of contributions and hence is slightly better when the quantity of contributions is limited.


[Figure 4 (panels described): precision, recall, and f-measure of the M1, M2, M3, and M∗ models as the answer threshold varies from 10 to 90, for TurboTax and Stackoverflow.]

Fig. 4. Model performance over different choices of answer threshold. (b = 15, 10-fold crossval)

6.4. Model Performance over Answer Threshold

Recall that we selected users who provided 10 or more answers to run our experiments. In this result, we examine how model performance varies as the answer threshold of 10 is varied. In order to make a fair comparison, we do not change the labeling of experts and ordinary users for Stackoverflow; however, when we compute recall, we only consider the users above the threshold (because the discarded users are not available to the learning model). This experiment compares the effectiveness of the M models over different answer thresholds (see Figure 4). We observe that as the answer threshold increases, the model performance improves. The reason for this improvement is that at high answer thresholds, noisy users are filtered out. The noise stems from the fact that for lower thresholds, the selection preferences of some users are computed from a small number of contributions, which can lead to an inaccurate estimation of their selection preferences.


[Figure 5 (panels described): precision, recall, and f-measure of the M1, M2, M3, and M∗ models for b = 3, 5, 7, 9, 11, and 15, for TurboTax and Stackoverflow.]

Fig. 5. Model performance over different choices of b (using 10-fold cross-validation).

This result suggests that, for practical purposes, we can increase the answer threshold to find more experts. On the downside, as we increase the answer threshold, some real experts could also get discarded along with ordinary users. To counter this, we compute the ratio of the improvement in model performance to the percentage of experts discarded and conclude that an answer threshold of 50-60 is practically reasonable, though it could vary per site and domain.

6.5. Model Performance over b

Recall that the selection preferences are modeled as a discrete probability distribution defined at b + 2 points (see Equation 4). So far we have used b = 15 in our experiments. In this result, we aim to see how variations in b affect the model performance. Figure 5 shows the performance of the M models for different choices of the b parameter. We see that as b increases, the f-measure also increases, though this increase gradually becomes smaller. The improvement in model performance can be attributed to the fact that, for larger values of b, our estimation of users' selection preferences is more accurate; this implies that, apart from experts being selective towards questions with low EV, experts can also be identified by looking more accurately at the preferences of ordinary users towards questions with high EV.


Table IX. Performance comparison between point estimates and probability distributions of user characteristics.

                      TurboTax                      Stackoverflow
            B3     B3^d   M      M^d       B3     B3^d   M      M^d
precision   0.75   0.80   0.45   0.76      0.78   0.94   0.62   0.94
recall      0.57   0.56   0.31   0.52      0.73   0.85   0.53   0.84
f-measure   0.65   0.66   0.37   0.58      0.75   0.89   0.57   0.89


On the other hand, we fear that for very large values of b the learning task is performed in a very high dimensional space, so the learning model would overfit the training data and not generalize (leading to poor accuracy under cross-validation). Moreover, the model would be computationally expensive for large values of b.

6.6. Performance Improvements using Probability Distributions

In this section, we show that our probabilistic formulation can be used along with baseline measures to improve the performance of the baseline models derived from the same measures. We consider all the votes received on a given user's answers. Model B3 takes the sum of these votes. We instead build a probability distribution over these votes, just as we proposed for EV. The resulting probability distribution indicates the probability of the user receiving a specific number of votes on her answers. We use the bucketing function Hb (see Equation 4) with b = 15. The feature thus generated is a 17-dimensional probability distribution over the votes received on each user's answers. We call this model B3^d. In addition, we compute the average EV of the questions answered by the user. This is a point estimate of a user's selection preferences, and we call it model M.

Table IX shows the performance of the four models, namely B3, B3^d, M, and M^d = M∗. The distribution-based models (denoted with d) improve over the models based on point estimates of the features. Point estimates such as the sum or the average miss details and patterns in the features that the probability distribution can surface, thereby leading to better predictive capability.
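The contrast between the point-estimate feature (B3) and its distribution counterpart can be sketched as follows; the bucket edges mirror our earlier reading of Hb and are an assumption:

    # Point-estimate feature (sum of votes) versus distribution feature over votes.
    import numpy as np

    votes = [0, 1, 0, 3, 12, 0, 2, 1]           # votes on one user's answers (made up)
    b3 = sum(votes)                              # point estimate used by model B3

    hist = np.zeros(17)                          # b + 2 buckets with b = 15
    for v in votes:
        hist[min(max(v + 1, 0), 16)] += 1        # bucket 0 for v < 0, bucket 16 for v >= 15
    b3_d = hist / hist.sum()                     # probability distribution used by B3^d

    print(b3, b3_d)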

Note that not all features can be converted to probability distributions (pdfs). We demonstrated that votes can be converted to a pdf, but the number of best answers or the number of answers cannot.

7. ANALYZING SELECTION PREFERENCES

The previous section shows that experts have selection preferences that are distinguishable from those of ordinary users. In this section, we perform a deeper analysis of the selection preferences of the two kinds of users for the two datasets.

7.1. Experts vs Non-Experts

We compute the average selection preference for the two classes of users, as shown by the histograms in Figure 6. We observe that experts prefer questions with EV = 0 more than ordinary users do. Interestingly, ordinary users have a higher probability of answering questions with negative EV, suggesting that ordinary users restore the value of badly answered questions much more than experts do. This is contrary to what we intuitively expected.

We perform a two-sample t-test to check whether the mean probability for the two classes of users is statistically the same. The two hypotheses are H0 (null hypothesis): the two groups of users have the same preference for questions with a given EV, and Ha (alternative hypothesis): the two groups of users have different preferences for questions with a given EV.


[Figure 6 (panels described): histograms of the average selection probability of experts and ordinary users over EV values from -1 to 5, for models M1, M2, and M3 on TurboTax and Stackoverflow.]

Fig. 6. Aggregate selection preferences of experts and ordinary users in the two datasets. For the sake of visualization, we use b = 5 to compute the preferences per model.

We reject H0 in all cases with p < 0.0001, except for EV = −1. We conclude that, on average, experts differ significantly from ordinary users in their selection preferences.

7.2. Individual Comparison in Selection Preferences

So far, we have seen that experts differ significantly from ordinary users in the aggregate comparison of selection preferences. Here we perform a comparison at the individual level. We selected 10 experts and 10 ordinary users randomly from the pool of U10 users. For these 20 people, we computed their selection preferences using the M1 model. We then use the symmetric Kullback-Leibler divergence, KLs(x, x′) = (1/2) Σi [ xi · log(xi / x′i) + x′i · log(x′i / xi) ] [Kullback and Leibler 1951], to compute the similarity in selection preferences between each user pair.
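A small sketch of this similarity computation follows; the epsilon smoothing to avoid log(0) is our addition:

    # Symmetric KL divergence between two selection-preference distributions.
    import numpy as np

    def sym_kl(x, y, eps=1e-12):
        x = np.asarray(x, dtype=float) + eps
        y = np.asarray(y, dtype=float) + eps
        x, y = x / x.sum(), y / y.sum()
        return 0.5 * np.sum(x * np.log(x / y) + y * np.log(y / x))

    # Example: two users' b = 5 selection preferences (7 buckets each)
    expert = [0.02, 0.80, 0.10, 0.05, 0.02, 0.01, 0.00]
    ordinary = [0.20, 0.40, 0.15, 0.10, 0.08, 0.04, 0.03]
    print(1 - sym_kl(expert, ordinary))   # the similarity score plotted in Figure 7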


[Figure 7 (panels described): 20 x 20 similarity matrices (10 ordinary users followed by 10 experts on each axis) for TurboTax and Stackoverflow.]

Fig. 7. Plot of 1 - KL-divergence between users and experts. White indicates similarity and higher intensity towards black indicates dissimilarity. The ordinary users occupy positions 1-10 and the experts positions 10-20 on both axes. Average scores for TurboTax: e-e = 0.94, ou-ou = 0.63, e-ou = 0.71. Average scores for Stackoverflow: e-e = 0.91, ou-ou = 0.66, e-ou = 0.70.

Figure 7 plots the 1 − KLs score between each user pair. A higher score means that the two users are similar in their preferences, and a low score means they are dissimilar. The experts (positions 10-20 on both axes) have very high pairwise scores, as indicated by the large white region. This result shows that the selection preferences of randomly selected individual experts differ from those of randomly selected ordinary users. These differences exist at the individual level and, as the results show, they can be used to distinguish and identify experts.

Interestingly, ordinary users also differ from one another, suggesting that they assume different roles: some prefer answering questions with negative EV, others prefer answering questions early on, some others prefer answering questions when they reach their resolution stage, and so on.

7.3. Selection Preference Variations Over Time

So far, we have seen that experts differ from ordinary users in their question selection preferences. Another important question is: are experts as selective as they appear from the day they join the community, or is there a transition from being less selective to being more selective? To answer this, we divide the answers provided by each person into 5 buckets. For each user, the answers are sorted by time and then divided into five equal buckets, so that all answers in bucket i are given by the user after she has given all answers in bucket j (for i > j). This bucketing strategy ensures that all buckets have the same number of answers.
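A minimal sketch of this temporal bucketing, assuming answers are (timestamp, EV) pairs (our representation):

    # Split a user's time-sorted answers into five equal-sized buckets; bucket i
    # strictly follows bucket i-1 in time.
    import numpy as np

    def time_buckets(answers, k=5):
        ordered = sorted(answers, key=lambda a: a[0])    # sort by timestamp
        return np.array_split(ordered, k)

    # Example with (timestamp, EV) pairs; average preferences per bucket can then be
    # compared across buckets, as in Figures 8 and 9.
    buckets = time_buckets([(3, 0), (1, 2), (2, 0), (5, 1), (4, 0), (7, 0), (6, 3)])
    print([len(b) for b in buckets])                     # sizes: [2, 2, 1, 1, 1]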

Figures 8 and 9 show the selection preferences over time, for the five buckets, for the two datasets. The probability values are averaged over experts and over ordinary users. We observe that in the TurboTax community the selection preferences of ordinary users and experts do not vary much, indicating that they assume their roles very early in their association with the community. This is also evident from the fact that the model performance improves only marginally (0.5-0.6 f-measure) when the initial user data is used (see Figure 3). On the other hand, in the Stackoverflow community the selection preferences show gradual changes. We notice the same trend for both types of users: they both shift their preferences towards low EV questions. On average, the experts are still distinguishable from ordinary users even during their early participation.


[Figure 8 (panels described): average P(EV = 0 | u), P(EV = 1 | u), and P(EV = 15 | u) for experts and ordinary users of TurboTax across the five time buckets.]

Fig. 8. Variations over time in the selection preferences of users of the TurboTax community. The whiskers indicate the standard deviation of the mean value.

[Figure 9 (panels described): average P(EV = 0 | u), P(EV = 1 | u), and P(EV = 15 | u) for experts and ordinary users of Stackoverflow across the five time buckets.]

Fig. 9. Variations over time in the selection preferences of users of the Stackoverflow community. The whiskers indicate the standard deviation of the mean value.

Users of the two communities differ in how their selection preferences develop over time. For TurboTax, user preferences remain roughly the same, whereas for Stackoverflow user preferences change. Our intuition for this change is that the community objective and the underlying knowledge a person requires differ between the two sites. The U10 communities of the two datasets differ from each other (see Section 4.4), and this contributes to the differences in how a user refines her selection.

8. DISCUSSION

In this paper, we present a measure to capture users' question selection preferences in Community Question Answering. Users with different roles in the Q&A community differ in their question selection heuristics. Experts tend to answer questions with low EV disproportionately more often than ordinary users. The selection preferences of experts make them distinguishable from ordinary users, enabling us to use classification models for identifying them. The accuracy of the classification models indicates that selection preferences can serve as an effective measure beyond the intuitive and popular baseline measures. Our experiments show that selection preferences can be used in conjunction with baseline measures to improve the model performance even further.

We also argue that user selection preferences can be effective in identifying potential experts. When a person joins the community, not much can be said about her expertise by looking at her initial few contributions. We show that even for new joiners, a selection preference based model has higher precision in identifying users with potential expertise. Figure 3 demonstrates that for the TurboTax dataset (which contains an industry gold-standard labeling), the selection preference based models achieve a precision of 80% by considering only a month of user data. This precision is more than 20% above the best baseline model.


In another practical setting, identifying new users who match the caliber of the community experts can be desirable. We see that selection preference based models can find such new users with a precision as high as 100% (Tables VII and VIII). They consistently perform better than most of the baseline methods on all the selected datasets. Though the recall of most of the models is quite low, the results suggest that around 10-40% of potential experts match the levels of the existing community experts.

The underlying communities differ in their dynamics and in how users participate in them. At one end is TurboTax, which encourages true "answer people" behavior from users, and at the other end is Stackoverflow, which encourages a healthier participation from users (asking questions as well as answering them). These community dynamics, along with their sizes, allow users in Stackoverflow to be less preferential initially and to gradually become more preferential towards low EV questions. Our results show that in TurboTax experts are "born" experts and their preferences do not show any significant shift over time. We hypothesize that the interesting patterns presented in Figures 8 and 9 can be explained by the community's expectations of users, users' personal preferences, and the underlying community dynamics.

User selection preferences present an interesting dimension along which to measure user behavior in CQA. Interestingly, selection preferences are computed without taking into consideration the actual contributions made by the users, and yet they turn out to be as effective as, or even more effective than, baseline measures that explicitly measure those contributions. This suggests that expertise is a deep-rooted phenomenon and that users' expertise can also be established by looking at intrinsic measures such as their selection behavior.

9. CONCLUSION AND FUTURE WORK

In this paper, we present a formal model to capture the user selection process. The proposed model aims to understand how users select questions for answering. Our understanding is based on an estimate of the existing value (EV) of the prior answers on the questions selected by the users. We showed the effectiveness of using users' selection preferences in the identification of community experts and potential experts. Identification of potential experts is important, as it can enable community managers to devise measures to nurture and retain them in the community. Our results showed that modeling selection preferences based on EV is effective, matching the intuition that experts aim to be effective in providing answers and hence select questions with low EV. The selection preferences of experts set them apart from other users and thus enable classification models to detect them with high accuracy. We used several basic characteristics of users in CQA and showed that selection preferences can improve the performance of the baseline models for expert identification in Community Question Answering. However, we do not consider graph-based features of users in our models, as the aim of this study is to introduce a new concept and show its effectiveness in building predictive models for identifying experts. We chose two popular and complete Q&A datasets for testing our formalism. These two systems are quite dissimilar in how users participate in the underlying communities, and yet we showed that our model is helpful in expert identification in these diverse datasets.

Our model works well for informational/technical CQA due to experts' tendency to answer low EV questions in these domains. One future research direction is to use our model in CQA communities that are more conversational in nature and to study how expert behavior in question selection changes in these settings. Furthermore, there are scenarios worth exploring, such as when the initial answers were good and hence an expert refrained from answering, but the later answers were trolling, asked new questions, or transformed the original question, prompting an expert answer.


We wish to model these scenarios in our estimation of EV. We would also like to explore several other interesting dimensions that capture users' behavior in CQA and to use these dimensions along with our proposed methodology for the task of discovering experts and other interesting users.

REFERENCES

BERKHIN, P. 2005. A survey on PageRank computing. Internet Mathematics 2, 73–120.
BORODIN, A., ROBERTS, G. O., ROSENTHAL, J. S., AND TSAPARAS, P. 2005. Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology 5, 231–297.
BOUGUESSA, M., DUMOULIN, B., AND WANG, S. 2008. Identifying authoritative actors in question-answering forums: the case of Yahoo! Answers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '08. ACM, New York, NY, USA, 866–874.
BREIMAN, L. 1996. Bagging predictors. Mach. Learn. 24, 123–140.
CAMPBELL, C. S., MAGLIO, P. P., COZZI, A., AND DOM, B. 2003. Expertise identification using email communications. In Proceedings of the Twelfth International Conference on Information and Knowledge Management. CIKM '03. ACM, New York, NY, USA, 528–531.
CASTRO, R. D. AND GROSSMAN, J. W. 1999. Famous trails to Paul Erdos. Mathematical Intelligencer 21, 51–63.
CORTES, C. AND VAPNIK, V. 1995. Support-vector networks. Mach. Learn. 20, 273–297.
DOM, B., EIRON, I., COZZI, A., AND ZHANG, Y. 2003. Graph-based ranking algorithms for e-mail expertise analysis. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. DMKD '03. ACM, New York, NY, USA, 42–48.
ENGINE, G. S. 1999. http://www.google.com.
FISHER, D., SMITH, M., AND WELSER, H. T. 2006. You are who you talk to: Detecting roles in usenet newsgroups. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences - Volume 03. IEEE Computer Society, Washington, DC, USA, 59.2.
FRIEDMAN, J. H. 1989. Regularized discriminant analysis. Journal of the American Statistical Association 84, 405, 165–175.
GILBERT, P. 1990. Changes: Rank, status and mood. In On the Move: The Psychology of Change and Transition, S. Fischer and C. L. Cooper, Eds. Wiley, New York, NY, USA, 33–52.
GORSUCH, R. L. 1983. Factor Analysis. Second ed. Lawrence Erlbaum Associates, Hillsdale, NJ.
GUO, J., XU, S., BAO, S., AND YU, Y. 2008. Tapping on the potential of Q&A community by recommending answer providers. In Proceedings of the 17th ACM Conference on Information and Knowledge Management. CIKM '08. ACM, New York, NY, USA, 921–930.
HALL, M., FRANK, E., HOLMES, G., PFAHRINGER, B., REUTEMANN, P., AND WITTEN, I. H. 2009. The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11, 10–18.
HARPER, F. M., MOY, D., AND KONSTAN, J. A. 2009. Facts or friends?: distinguishing informational and conversational questions in social Q&A sites. In Proceedings of the 27th International Conference on Human Factors in Computing Systems. CHI '09. ACM, New York, NY, USA, 759–768.
HARPER, F. M., RABAN, D., RAFAELI, S., AND KONSTAN, J. A. 2008. Predictors of answer quality in online Q&A sites. In Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems. CHI '08. ACM, New York, NY, USA, 865–874.
JAVA, A., KOLARI, P., FININ, T., JOSHI, A., AND OATES, T. 2006. Modeling the spread of influence on the blogosphere. Technical report, University of Maryland, Baltimore County.
JAVA, A., KOLARI, P., FININ, T., JOSHI, A., AND OATES, T. 2007. Feeds that matter: A study of Bloglines subscriptions. In Proceedings of the First International Conference on Weblogs and Social Media, ICWSM. The AAAI Press.
JURCZYK, P. AND AGICHTEIN, E. 2007. Discovering authorities in question answer communities by using link analysis. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. CIKM '07. ACM, New York, NY, USA, 919–922.
KEMPE, D., KLEINBERG, J., AND TARDOS, E. 2003. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '03. ACM, New York, NY, USA, 137–146.
KLEINBERG, J. M. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '98. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 668–677.
KOLLOCK, P. 1998. The economics of online cooperation: Gifts and public goods in cyberspace. In Communities in Cyberspace, M. Smith and P. Kollock, Eds. Routledge, London.
KULLBACK, S. AND LEIBLER, R. A. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1, 79–86.
LEMPEL, R. AND MORAN, S. 2001. SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems (TOIS) 19, 131–160.
LIU, X., BOLLEN, J., NELSON, M. L., AND VAN DE SOMPEL, H. 2005. Co-authorship networks in the digital library research community. Information Processing and Management: an International Journal - Special Issue: Infometrics 41, 1462–1480.
MATLAB. 2010. Version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts.
NAM, K. K., ACKERMAN, M. S., AND ADAMIC, L. A. 2009. Questions in, knowledge in?: a study of Naver's question answering community. In Proceedings of the 27th International Conference on Human Factors in Computing Systems, CHI. ACM, 779–788.
OLSON, M. 1971. The logic of collective action: Public goods and the theory of groups. Cambridge University Press.
OSTROM, E. 1991. Governing the commons: The evolution of institutions for collective action. Cambridge University Press.
PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab. November.
PAL, A. AND COUNTS, S. 2011. Finding topical authorities in microblogs. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. WSDM '11. ACM, New York, NY, USA.
PAL, A., FARZAN, R., KONSTAN, J. A., AND KRAUT, R. E. 2011. Early detection of potential experts in question answering communities. In User Modeling, Adaption and Personalization - 19th International Conference, UMAP. Lecture Notes in Computer Science Series, vol. 6787. Springer, 231–242.
PAL, A. AND KONSTAN, J. A. 2010. Expert identification in community question answering: exploring question selection bias. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. CIKM '10. ACM, New York, NY, USA, 1505–1508.
PRESS RELEASE, Y. A. 2006. http://yhoo.client.shareholder.com/press/releasedetail.cfm?releaseid=222275.
QUINLAN, J. R. 1993. C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
STACKOVERFLOW. 2008. http://stackoverflow.com/.
STONE, M. 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological) 36, 2, 111–147.
SUBRAMANYAM, R. AND XIA, M. 2008. Free/libre open source software development in developing and developed countries: A conceptual framework with an exploratory study. Decision Support Systems 46, 173–186.
TURBOTAX. 2007. https://ttlc.intuit.com/app/full page.
VIEGAS, F. B. 2004. Newsgroup crowds and authorlines: Visualizing the activity of individuals in conversational cyberspaces. In Proceedings of the 37th Hawaii International Conference on System Sciences. IEEE Computer Society, Washington, DC, USA.
WEBER, S. 2004. The success of open source. Harvard University Press, Cambridge, MA, USA.
WELSER, H. T., GLEAVE, E., FISHER, D., AND SMITH, M. 2007. Visualizing the signatures of social roles in online discussion groups. 8.
WENG, J., LIM, E.-P., JIANG, J., AND HE, Q. 2010. TwitterRank: finding topic-sensitive influential twitterers. In Proceedings of the Third ACM International Conference on Web Search and Data Mining. WSDM '10. ACM, New York, NY, USA, 261–270.
YANG, J., ADAMIC, L. A., AND ACKERMAN, M. S. 2008. Competing to share expertise: The Taskcn knowledge sharing community. In Proceedings of the Second International Conference on Weblogs and Social Media, ICWSM. The AAAI Press.
YANG, J. AND WEI, X. 2009. Seeking and offering expertise across categories: A sustainable mechanism works for Baidu Knows. In Proceedings of the Third International Conference on Weblogs and Social Media, ICWSM. The AAAI Press.
ZHANG, J., ACKERMAN, M. S., AND ADAMIC, L. 2007. Expertise networks in online communities: structure and algorithms. In Proceedings of the 16th International Conference on World Wide Web. WWW '07. ACM, New York, NY, USA, 221–230.
