User Latent Preference Model for Better Downside Management in Recommender Systems

Jian Wang
LinkedIn Corp
Mountain View, CA 94043 USA
[email protected]

David Hardtke
LinkedIn Corp
Mountain View, CA 94043 USA
[email protected]

ABSTRACT

Downside management is an important topic in the field of recommender systems. User satisfaction increases when good items are recommended, but it drops significantly when bad recommendations are pushed to users. For example, a parent would be disappointed if violent movies were recommended to their kids and might stop using the recommendation system entirely. A vegetarian would find steakhouse recommendations useless. A CEO of a mid-sized company would feel offended by receiving intern-level job recommendations. Under circumstances where there is a penalty for a bad recommendation, a bad recommendation is worse than no recommendation at all. While most existing work focuses on upside management (recommending the best items to users), this paper emphasizes achieving better downside management (reducing the recommendation of irrelevant or offensive items to users). The approach we propose is general and can be applied to any scenario or domain where downside management is key to the system.

To tackle the problem, we design a user latent preference model to predict the user's preference in a specific dimension, say, the dietary restrictions of the user, the acceptable level of adult content in a movie, or the geographical preference of a job seeker. We propose to use multinomial regression as the core model and extend it with a hierarchical Bayesian framework to address the problem of data sparsity. After the user's latent preference is predicted, we leverage it to filter out downside items. We validate the soundness of our approach by evaluating it with an anonymized job application dataset from LinkedIn. The effectiveness of the latent preference model was demonstrated in both offline experiments and online A/B tests. The user latent preference model significantly improves VPI (views per impression) and API (applications per impression), which in turn yields higher user satisfaction.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015, May 18–22, 2015, Florence, Italy. ACM 978-1-4503-3469-3/15/05. http://dx.doi.org/10.1145/2736277.2741126

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Design, Experimentation

Keywords
Recommender System; Downside Management; User Latent Preference Model

1. INTRODUCTION

Recommender systems are a popular research topic in the information retrieval community. A host of academic and industrial incarnations of recommender systems exist in domains such as movies (Netflix), music (Pandora), and e-commerce product recommendations (eBay, Amazon). Most existing work focuses on upside management by finding the best items to recommend to a user. Better upside management can be achieved by leveraging state-of-the-art models such as collaborative filtering, matrix factorization, deep neural networks, and so on. While it is essential to discover good items, it is equally important to avoid recommending irrelevant or offensive items to a user. User satisfaction is affected by both upside items and downside items. For example, a CEO of a mid-sized company would be interested in other executive positions but would feel offended by receiving intern-level job recommendations. An item can be irrelevant or offensive for various reasons in different domains: a wrong dining preference, a wrong job location, a wrong movie category, and so on. In some cases, a single bad recommendation could cause a user to abandon the recommendation system completely.

One simple strategy for downside management is not to show an item when the predicted probability from the recommender system is too low: one just needs to find an appropriate threshold. Such a strategy is based on the assumption that an "ideal" recommender system assigns an appropriate probability to each item. In particular, the probability should reflect the user's willingness to accept the item along every dimension. Yet this is not the case in the real world. Most existing hybrid recommender systems (such as logistic regression, gradient boosted trees, etc.) consider all features together to predict a single probability for an item. The item might be recommended to a user on account of its high probability if the content match is high even though


the contextual match is low. Users might not complain about a slight content mismatch, yet they are less tolerant of being pushed recommendations in the wrong location, the wrong category, or with other contextual mismatches. It would thus be hard to find the right threshold on a single probability from an existing recommender system when a downside item fails along one specific dimension.

Downside management has been studied in some recent work [28, 29, 1, 8, 19, 25]. Researchers have studied context-aware recommender systems [1, 8, 19, 25] that find items matching the user's context. Contextual information can be leveraged in a pre-filtering stage, a post-filtering stage, or in the contextual modeling itself. The concept of "context" differs from the "latent preference" in our model: a "context" is an explicit condition (for example, the user's current location) that can be observed for each user, while a "latent preference" is an implicit yet explainable factor that the model infers for each user (for example, the location of an item that the user would prefer). We study how the user's implicit preferences or interests affect recommendations. This latent preference could be the dietary restrictions of the user, the acceptable level of adult content in a movie, the geographical preference of a job seeker, and so on.

In this work, we propose a general framework that predicts a user's latent preference and leverages it to filter out downside items. The core of the framework is a multinomial regression that models a user's latent preference in a specific dimension. The preference can be binary (for example, work at a nearby job, or take a job that requires relocation) or multi-class (for example, work at the intern, senior, staff, or executive level). We assume that users in different groups tend to have different priors over preference choices. For example, job seekers around 22 years old are more likely to relocate than older job seekers. To accommodate such differences, we propose to divide users into segments and learn a multinomial regression for each segment. Segments can be based on the user's age, gender, seniority, region, and so on. We further extend the model with a hierarchical Bayesian framework to address the problem of data sparsity in small segments. With this framework, different segments learn from each other by sharing the prior of the regression model. Detailed parameter inference steps are derived with the variational Bayesian algorithm. Items that would hurt user satisfaction are filtered out at recommendation time.

While the framework we propose generalizes to any downside dimension in any domain, in this paper we apply it to the location dimension in the job domain. In other words, the model aims to avoid recommending jobs in locations where the job seeker would not want to work. In the experiments, we first evaluate the performance of the user latent preference model itself with traditional classification IR metrics. We then demonstrate the effectiveness of leveraging the user latent preference in recommender systems. The model was evaluated with both an offline anonymized job application dataset and real-world job seekers on LinkedIn. Experiments show that the user latent preference model significantly improves the VPI (views per impression) and API (applications per impression) metrics. At the same time, the absolute numbers of applications and views remain stable in spite of fewer impressions. This matches our intuition that the user latent preference model achieves better downside management, which in turn results in higher user satisfaction. The major contributions of this paper include the following:

• Propose and study the problem of downside management in the field of recommender systems.

• Design the User Latent Preference Model to tackle the problem and leverage it in recommender systems. Propose multinomial regression as the core model and extend it with a hierarchical Bayesian framework. Derive detailed parameter inference steps based on the variational Bayesian algorithm.

• Evaluate the model with both an offline framework and online A/B testing on LinkedIn. Demonstrate its effect in achieving better downside management in recommender systems.

2. RELATED WORK

A major task of a recommender system is to present recommendations to the user. This is usually done by first predicting a user's rating for each item and then ranking all items in descending order. There are two major recommendation approaches: content-based filtering and collaborative filtering. Content-based filtering [17, 22] assumes that the descriptive features of an item indicate a user's preferences. Thus, a recommender system makes decisions for a user based on the descriptive features of other items the user likes or dislikes; usually, the system recommends items similar to those the user liked before. Collaborative filtering [12, 24, 13, 16, 26, 21, 10], on the other hand, assumes that users with similar tastes on some items also have similar preferences on other items. The main idea is thus to use the behavior history of other like-minded users to provide the current user with good recommendations. Research on collaborative filtering algorithms reached a peak with the $1 million Netflix movie recommendation competition [4]. Factorization-based collaborative filtering approaches [7, 14, 27, 23], such as regularized Singular Value Decomposition, performed well in this competition, possibly better than Netflix's own well-tuned Pearson correlation coefficient algorithm. A common characteristic of these models is the introduction of user and/or item latent factors to address the data sparsity issue.

Context-aware recommender systems [1, 8, 19, 25, 2, 11, 15] have recently become a hot research topic. Most work focuses on using user contextual information or latent tastes to improve upside management. For example, Weston et al. [30] proposed an advanced matrix factorization model to incorporate users' latent tastes. Hariri et al. [11] proposed a unified probabilistic model to integrate the user profile, item representations, and contextual information; they demonstrated the effectiveness of their model in the domain of article and music recommendation, where tags were considered as contextual information. Zhang et al. [31] proposed the Explicit Factor Model (EFM) to generate explainable recommendations while keeping a high prediction accuracy. Nguyen et al. [18] studied Gaussian Process Factorization Machines (GPFM) for context-aware recommendations using Gaussian processes. Different from this work, we leverage user latent preference information to achieve better downside management.


Downside management has been studied in recent recommender systems work. In [29], Wang et al. studied the influence of recommending jobs to users at the right time. They proposed a hierarchical proportional hazards model that pushes a recommendation to a user only if the probability of the user applying to the job at the current time is higher than a threshold. In [27], Wang and Zhang proposed the concept of utility in the field of recommender systems and explored optimizing the user's utility in the e-commerce domain. In [28], Wang and Zhang further studied the effect of finding the right time to push recommendations in the e-commerce domain; they proposed an opportunity model to discover the best opportunity to push recommendations so as to optimize user satisfaction. In this paper, we propose a more general framework for downside management and show the effectiveness of the model in finding the right location to recommend jobs. The framework is general and can be applied on top of any existing state-of-the-art recommender system.

3. USER LATENT PREFERENCE MODEL

3.1 Problem Definition

In this paper, we aim to answer the following questions: 1) How do we discover the user's latent preference(s)? 2) How do we leverage them for better downside management? The following notations are used in the paper.

• u = 1, 2, ..., U : the index of the user.

• j = 1, 2, ..., J : the index of the item.

• m = 1, 2, ..., M: the index of the user's segment. A segment could be determined by age, gender, region, and so on.

• D = {D1, ..., Dm, ..., DM}: The observed data of allsegments from all users.

• Dm = {ym,i, xm,i}: the set of observed data associated with segment m. Each segment m has Nm data observations from all users in that segment. Each observation i = 1, ..., Nm in segment m is associated with two parts: the preference label ym,i and the feature vector xm,i.

• ym,i: the preference label of the ith observation in segment m. It can be a binary choice (staying at the current location or not) or a multi-class choice (work in the IT industry, finance industry, education industry, etc.). The index of the candidate choice is k = 1, 2, ..., K.

• xm,i: the d-dimensional vector of features associated with the ith observation in segment m. Features include both static demographic features and dynamic behavior features.

The goal of the model is to predict the probability thatuser u has the latent preference ym,i = k, given that theuser belongs to segment m and his feature vector is xm,i.

3.2 Review of Multinomial Regression

Before describing the hierarchical framework that we propose, we first briefly review the basics of multinomial regression.

Multinomial regression is widely used to solve multi-class classification problems with decent performance, e.g., predicting the probability that a user has the latent preference ym,i = k given the user feature vector xm,i. We consider a user's latent preference as a process of choosing exactly one candidate at a time. The probability can be estimated from all available features as follows:

p(y_{m,i} = k \mid \theta) = \frac{\exp\{\theta_k^T x_{m,i}\}}{\sum_{k'} \exp\{\theta_{k'}^T x_{m,i}\}}

where θ = {θ1, ..., θK} are the model parameters to be learned from the training data. In basic multinomial regression, all users share the same parameters θ.

Assuming that the prior distribution of each model parameter is a Gaussian centered at zero, the optimal parameters θ can be learned from the training data using maximum a posteriori (MAP) estimation. The multinomial regression model reduces to logistic regression when the number of classes is 2 (positive and negative).
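As a minimal, illustrative sketch of the softmax computation above (the function name and toy numbers are ours, not part of the paper's system):

```python
import numpy as np

def preference_probs(theta, x):
    """p(y = k | theta, x) for each class k via the softmax form above.

    theta: (K, d) parameter matrix, one row per candidate choice.
    x:     (d,) feature vector for one observation.
    """
    scores = theta @ x
    scores -= scores.max()          # subtract max to stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

# Toy example with K = 3 classes and d = 2 features (numbers are made up):
theta = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])
x = np.array([0.5, 1.0])
p = preference_probs(theta, x)
```

The probabilities are positive and sum to one by construction, so the model always commits the full probability mass across the K candidate choices.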

3.3 Model Extension with a Bayesian Framework

In this paper we adopt multinomial regression as the core of the user latent preference model and use it to estimate p(ym,i = k). Users in different segments may have different priors over latent preferences. For example, young people tend to relocate more than older people, and people in big cities tend to relocate less than those in small cities. We therefore propose to learn one regression model per user segment. Without loss of generality, we do not specify how to segment users in this paper; segmentation can be determined by domain experts or offline experiments.

In the real world, a small number of user segments often contain most of the users while the remaining segments each have few users. To address the resulting data sparsity, we follow common practice and extend the multinomial regression with a hierarchical Bayesian framework, as illustrated in Figure 1. This framework helps segments with few observations by borrowing information from other segments through a common prior on the parameters of the multinomial regression.

For each segment m, θm is sampled from a Gaussian distribution: θm ∼ N(µθ, Σθ). We denote φ = (µθ, Σθ). For the ith observation in segment m with observed features xm,i, its latent preference ym,i is sampled from the multinomial regression model,

p(y_{m,i} = k \mid \theta_m) = \frac{\exp\{\theta_{m,k}^T x_{m,i}\}}{\sum_{k'} \exp\{\theta_{m,k'}^T x_{m,i}\}}.

Consider that the data D consists of a series of observations from all segments. The latent preferences in the observations are generated using a set of hidden variables θ = {θ1, θ2, ..., θM}. Note that the core model does not need to be multinomial regression; other suitable models, such as the proportional hazards model [29], could be used here as well. θm contains all parameters of the core model. The data likelihood can be written as a function of φ = (µθ, Σθ). We use ym to represent {ym,1, ..., ym,i, ..., ym,Nm}, i.e., the preference observations in segment m. Assuming that the data is independent and identically distributed, we can write the

1211

Page 4: User Latent Preference Model for Better Downside Management in Recommender Systems · 2015-05-15 · User Latent Preference Model for Better Downside Management in Recommender Systems


Figure 1: Illustration of the dependencies of variables in the hierarchical Bayesian regression model, showing the ith observation of segment m. ym,i is the label, which depends on the regression model θm of segment m as well as the observed features xm,i of this user. Each segment m has its own multinomial regression parameters θm. The models of the segments share information through the prior φ = (µθ, Σθ).

data likelihood as

p(D \mid \phi) = \prod_{m=1}^{M} p(y_m \mid \phi) = \prod_{m=1}^{M} \int p(\theta_m, y_m \mid \phi)\, d\theta_m.   (1)
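The generative process behind this likelihood can be sketched in code; the toy sizes and the flattened layout of θm are our own assumptions for illustration, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, M = 2, 3, 4                 # classes, feature dim, segments (toy sizes)
mu = np.zeros(K * d)              # shared prior mean over a flattened theta_m
Sigma = np.eye(K * d)             # shared prior covariance, phi = (mu, Sigma)

def sample_segment_labels(n_obs):
    """One segment: draw theta_m ~ N(mu, Sigma), then y ~ softmax(theta_m x)."""
    theta_m = rng.multivariate_normal(mu, Sigma).reshape(K, d)
    X = rng.normal(size=(n_obs, d))
    logits = X @ theta_m.T
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    y = np.array([rng.choice(K, p=p) for p in probs])
    return X, y

X, y = sample_segment_labels(10)
```

Each segment draws its own θm from the shared prior, which is exactly what lets sparse segments borrow statistical strength from the rest.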

3.4 Parameter Inference with the Variational Bayesian Method

There is no closed-form solution for the estimation of the model parameters. We follow the variational Bayesian method [3] for constrained (approximate) optimization to derive an iterative process that finds an approximate solution. Maximizing the likelihood in Equation 1 is equivalent to maximizing the log likelihood L(φ),

L(\phi) = \ln p(D \mid \phi) = \sum_{m=1}^{M} \ln p(y_m \mid \phi) = \sum_{m=1}^{M} \ln \int p(\theta_m, y_m \mid \phi)\, d\theta_m.

We can simplify the problem by introducing an auxiliary distribution q(θm) for each hidden variable θm [3]. In the variational approach, we constrain q(θm) to a particular tractable form for computational efficiency; in particular, we assume that q(θm) = N(µθm, Σθm).

L(\phi) = \sum_{m=1}^{M} \ln p(y_m \mid \phi)
        = \sum_{m=1}^{M} \ln \int q(\theta_m) \frac{p(\theta_m, y_m \mid \phi)}{q(\theta_m)}\, d\theta_m
        \ge \sum_{m=1}^{M} \int q(\theta_m) \ln \frac{p(\theta_m, y_m \mid \phi)}{q(\theta_m)}\, d\theta_m
        \equiv F(q(\theta_1), ..., q(\theta_M), \phi).   (2)

Note that F(q(θ1), ..., q(θM), φ) is a lower bound of L(φ). Parameter inference iterates between the E-step and the M-step until convergence. In particular, the E-step maximizes F(q(θ1), ..., q(θM), φ) with respect to q(θm) given the current parameter setting φ. The M-step maximizes F(q(θ1), ..., q(θM), φ) with respect to φ given the q(θm) learned in the E-step. The steps are:

E-step:

q(\theta_m) \leftarrow \arg\max_{q(\theta_m)} F(q(\theta_1), ..., q(\theta_M), \phi), \quad \forall m \in \{1, ..., M\}   (3)

M-step:

\phi \leftarrow \arg\max_{\phi} F(q(\theta_1), ..., q(\theta_M), \phi)   (4)
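The alternating scheme of Equations 3 and 4 can be sketched as a generic loop. The `e_step` and `m_step` callables here are placeholders for the segment-wise updates derived in Sections 3.4.1 and 3.4.2; all names are illustrative and not from the paper:

```python
def variational_em(data_by_segment, init_phi, e_step, m_step, n_iters=50):
    """Skeleton of the variational E/M iteration.

    e_step(D_m, phi) -> q_m : approximate posterior N(mu_m, Sigma_m) for segment m
    m_step(qs)       -> phi : updated shared prior (mu_theta, Sigma_theta)
    """
    phi = init_phi
    qs = []
    for _ in range(n_iters):
        # E-step: update each segment's posterior given the current prior.
        qs = [e_step(D_m, phi) for D_m in data_by_segment]
        # M-step: update the shared prior given all segment posteriors.
        phi = m_step(qs)
    return phi, qs
```

In practice one would also track the lower bound F and stop once it changes by less than a tolerance, rather than running a fixed number of iterations.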

3.4.1 E-Step

In the E-step, maximizing F(q(θ1), ..., q(θM), φ) is equivalent to minimizing the following quantity to find each distribution q(θm):

q(\theta_m) \leftarrow \arg\min_{q(\theta_m)} \int q(\theta_m) \ln \frac{q(\theta_m)}{p(\theta_m, y_m \mid \phi)}\, d\theta_m
 = \arg\min_{q(\theta_m)} \int q(\theta_m) \ln \frac{q(\theta_m)}{p(\theta_m \mid \phi)}\, d\theta_m - \int q(\theta_m) \ln p(y_m \mid \theta_m)\, d\theta_m
 = \arg\min_{q(\theta_m)} KL[q(\theta_m) \,\|\, p(\theta_m \mid \phi)] - \int q(\theta_m) \ln p(y_m \mid \theta_m)\, d\theta_m.   (5)

The first part of Equation 5 is the KL-divergence between the posterior distribution q(θm) and the prior distribution p(θm|φ). The KL-divergence between the two Gaussian distributions is

KL[q(\theta_m) \,\|\, p(\theta_m \mid \phi)] = \frac{1}{2}\Big[ \mathrm{tr}(\Sigma_\theta^{-1}\Sigma_{\theta_m}) + (\mu_\theta - \mu_{\theta_m})^T \Sigma_\theta^{-1} (\mu_\theta - \mu_{\theta_m}) - \ln\frac{\det \Sigma_{\theta_m}}{\det \Sigma_\theta} - d \Big].

The second part of Equation 5 maximizes the likelihood of ym = {ym,1, ..., ym,i, ..., ym,Nm} under the current θm:

\sum_{i=1}^{N_m} \int q(\theta_m) \ln p(y_{m,i} = k \mid \theta_m)\, d\theta_m   (6)
 = \sum_{i=1}^{N_m} \Big[ E(\theta_{m,k})^T x_{m,i} - E\big(\ln \textstyle\sum_{k'} \exp(\theta_{m,k'}^T x_{m,i})\big) \Big].

The expectation E(θm,k) in the above equation is µθm,k. There is no closed-form solution for the expectation δ = E(ln Σ_{k'} exp(θ_{m,k'}^T x_{m,i})). The moment generating function of the multivariate normal distribution gives

E[\exp(\theta_{m,k'}^T x_{m,i})] = \exp\big(\mu_{\theta_{m,k'}}^T x_{m,i} + \tfrac{1}{2} x_{m,i}^T \Sigma_{\theta_{m,k'}} x_{m,i}\big).   (7)
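The closed form of Equation 7 can be sanity-checked numerically against a Monte Carlo estimate; the toy mean, covariance, and feature vector below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy Gaussian posterior q(theta) = N(mu, Sigma) for one class:
mu = np.array([0.3, -0.2])
Sigma = np.array([[0.5, 0.1], [0.1, 0.4]])
x = np.array([1.0, 2.0])

# Closed form from Equation 7:
closed = np.exp(mu @ x + 0.5 * x @ Sigma @ x)

# Monte Carlo estimate of E[exp(theta^T x)] over draws of theta:
thetas = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.exp(thetas @ x).mean()
```

The two estimates agree up to Monte Carlo error, which is the identity the derivation relies on.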

1212

Page 5: User Latent Preference Model for Better Downside Management in Recommender Systems · 2015-05-15 · User Latent Preference Model for Better Downside Management in Recommender Systems

Using this function, we obtain an upper bound on δ via a Taylor expansion, as in the following equation:

E\big(\ln \textstyle\sum_{k'} \exp(\theta_{m,k'}^T x_{m,i})\big) \le \gamma \textstyle\sum_{k'} \exp\big(\mu_{\theta_{m,k'}}^T x_{m,i} + \tfrac{1}{2} x_{m,i}^T \Sigma_{\theta_{m,k'}} x_{m,i}\big) - \ln(\gamma) - 1,   (8)

for every γ > 0. Such approximations are widely used in the literature [5] for other inference problems.
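Equation 8 is an instance of the tangent bound of the logarithm, ln z ≤ γz − ln γ − 1 for γ > 0, applied to z = Σ_{k'} exp(·) and pushed through the expectation. A quick numeric check of the bound (the value of z is arbitrary):

```python
import math

def log_bound(z, gamma):
    """Tangent upper bound: ln z <= gamma * z - ln(gamma) - 1, gamma > 0."""
    return gamma * z - math.log(gamma) - 1.0

z = 7.5  # stands in for sum_k' exp(theta^T x), any positive value works
for gamma in (0.01, 0.1, 1.0, 10.0):
    assert math.log(z) <= log_bound(z, gamma) + 1e-12

# The bound is tight exactly at gamma = 1/z:
tight = log_bound(z, 1.0 / z)
```

Choosing γ = 1/z makes the bound exact, which is why γ is typically optimized alongside the variational parameters.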

We can combine the above derivations with Equation 5 and then find q(θm) with the conjugate gradient descent method.

3.4.2 M-Step

In the M-step, the goal is to maximize F(q(θ1), ..., q(θM), φ) in Equation 2 with respect to φ given all q(θm). This is equivalent to maximizing the following quantity:

\phi \leftarrow \arg\max_{\phi} \sum_{m=1}^{M} \int q(\theta_m) \ln p(\theta_m, y_m \mid \phi)\, d\theta_m.   (9)

We derive the closed-form solution for µθ and Σθ:

(\mu_\theta, \Sigma_\theta) \leftarrow \arg\max_{\mu_\theta, \Sigma_\theta} \sum_{m=1}^{M} \int q(\theta_m) \ln p(\theta_m \mid \mu_\theta, \Sigma_\theta)\, d\theta_m
 = \arg\max_{\mu_\theta, \Sigma_\theta} \sum_{m=1}^{M} \Big[ \ln \frac{1}{\sqrt{(2\pi)^k |\Sigma_\theta|}} - \frac{1}{2} E\big[(\theta_m - \mu_\theta)^T \Sigma_\theta^{-1} (\theta_m - \mu_\theta)\big] \Big].   (10)

Setting the first derivative to zero yields the closed form:

\mu_\theta = \frac{\sum_{m=1}^{M} \mu_{\theta_m}}{M}

\Sigma_\theta = \frac{\sum_{m=1}^{M} \big[\Sigma_{\theta_m} + (\mu_{\theta_m} - \mu_\theta)(\mu_{\theta_m} - \mu_\theta)^T\big]}{M}.
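The closed-form M-step updates above are a few lines of array code; this is a hedged sketch with toy per-segment posteriors (the function name and numbers are ours):

```python
import numpy as np

def m_step_update(mus, Sigmas):
    """Closed-form update of the shared prior (mu_theta, Sigma_theta)
    from per-segment posteriors q(theta_m) = N(mu_m, Sigma_m)."""
    mus = np.asarray(mus)                      # shape (M, p)
    M = len(mus)
    mu = mus.mean(axis=0)                      # average of posterior means
    diffs = mus - mu
    # Average of posterior covariances plus spread of the means around mu:
    Sigma = (sum(Sigmas) + diffs.T @ diffs) / M
    return mu, Sigma

# Toy posteriors for M = 2 segments in p = 2 dimensions:
mus = [np.array([1.0, 0.0]), np.array([3.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
mu, Sigma = m_step_update(mus, Sigmas)
```

Note that the prior covariance collects both the within-segment uncertainty (Σθm) and the between-segment spread of the means, mirroring the formula.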

4. USAGE OF THE USER LATENT PREFERENCE MODEL

Let p(i, j) denote the joint probability of user i accepting item j. It considers both the match from an existing recommender system and the prediction of the user's latent preference. p(i, j) can be estimated by the following equation:

p(i, j) = p_{rec}(i, j) \sum_{k} \big[ p(y_{m,i} = k)\, I_{j \in S(i,k)} \big].   (11)

p_rec is the matching probability between user i and item j, which can be predicted by any state-of-the-art recommender system. p(ym,i = k) is the probability that user i has a specific latent preference k; it is predicted by the user latent preference model, as shown in Equation 12. S(i, k) contains all items that match preference k for user i. For example, if user i is predicted to seek jobs in his local area, S(i, k = local) contains all local jobs. Note that the candidate preferences k = 1, 2, ..., K are mutually exclusive, so item j can belong to only one set S(i, k). I∗ is an indicator function that equals 1 if ∗ is true.
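Because the sets S(i, k) partition the items, the sum in Equation 11 simply picks out the preference class the item belongs to. A minimal sketch (class labels and numbers are illustrative, not from the paper):

```python
def joint_prob(p_rec, pref_probs, item_class):
    """Equation 11 for one (user, item) pair.

    p_rec:      matching probability from the base recommender system.
    pref_probs: dict {preference class k: p(y_{m,i} = k)} for this user.
    item_class: the single class k with j in S(i, k) (classes are exclusive),
                so the sum over k reduces to one term.
    """
    return p_rec * pref_probs[item_class]

# A user predicted to be 80% local, scoring a local job with base score 0.6:
p = joint_prob(0.6, {"local": 0.8, "non_local": 0.2}, "local")
```

A strong base score is thus discounted whenever the item falls in a preference class the user is unlikely to hold, which is exactly the downside-filtering effect.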

Table 1: The utility set of the recommender system. There are four types of utilities, depending on whether the system recommends the item to the user and whether the user accepts the item.

          | show: Y | show: N
accept: Y | uTP     | uFN
accept: N | uFP     | uTN

p(y_{m,i} = k) = E\Big[\frac{\exp\{\theta_k^T x_{m,i}\}}{\sum_{k'} \exp\{\theta_{k'}^T x_{m,i}\}}\Big]   (12)

E[\exp\{\theta_k^T x_{m,i}\}] = \exp\big\{\mu_{\theta_k}^T x_{m,i} + \tfrac{1}{2} x_{m,i}^T \Sigma_{\theta_k} x_{m,i}\big\}.

Recommender systems aim to send recommendations that increase user satisfaction. While a good recommendation (an upside item) increases user satisfaction, a bad recommendation (a downside item) decreases it. Consider the utility of each type of recommendation as shown in Table 1: uTP is the utility of a good recommendation that the user accepts; uFP is the utility of a bad recommendation that the user does not accept; uFN is the utility of missing a recommendation that the user would have accepted; and uTN is the utility of avoiding a bad recommendation. The utilities (uTP, uFP, uFN, and uTN) can be set by domain experts in real-world applications. User i's satisfaction with a recommendation set, utility_i, is calculated by Equation 13.

utility_i = \sum \big( u_{TP}\, I_{show, accept} + u_{FP}\, I_{show, \overline{accept}} + u_{FN}\, I_{\overline{show}, accept} + u_{TN}\, I_{\overline{show}, \overline{accept}} \big).   (13)

To achieve higher user satisfaction/utility, a system with a filtering component should send a recommendation only if the expected utility of showing the item exceeds that of not showing it. In other words, the system should send the recommendation only if the joint probability of the user accepting the item, p(i, j), is higher than a threshold α. The filtering threshold α is determined automatically by the utility set, which is common practice as in [28, 29, 9]:

\alpha = \frac{u_{FP} - u_{TN}}{u_{FP} - u_{TN} + u_{FN} - u_{TP}}.   (14)
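Equation 14 follows from equating the expected utility of showing, p·uTP + (1−p)·uFP, with that of withholding, p·uFN + (1−p)·uTN. A small check with illustrative utilities (the numbers are our choice, not the paper's):

```python
def utility_threshold(u_tp, u_fp, u_fn, u_tn):
    """Equation 14: show item j to user i only when p(i, j) > alpha."""
    return (u_fp - u_tn) / (u_fp - u_tn + u_fn - u_tp)

# Illustrative utilities: a good show is worth +1, a bad show costs -1,
# a missed good item costs -0.2, and a correctly withheld item is neutral.
alpha = utility_threshold(u_tp=1.0, u_fp=-1.0, u_fn=-0.2, u_tn=0.0)

# At p = alpha the two expected utilities coincide, confirming the threshold:
show_utility = alpha * 1.0 + (1 - alpha) * (-1.0)
skip_utility = alpha * (-0.2) + (1 - alpha) * 0.0
```

Raising the penalty uFP for a bad recommendation pushes α up, i.e., the system becomes more conservative, which matches the downside-management intuition.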

5. EVALUATION OF THE USER LATENT PREFERENCE MODEL

As described before, the user latent preference model can be applied to a single dimension in any domain. Without loss of generality, we evaluate it on the location dimension in the job domain. We observe that some users have a strong preference for local jobs and would complain about receiving non-local job recommendations, and vice versa. Local jobs are defined as jobs within 50 miles of the user's current location (determined by his postal code). We treat local users as positive labels and non-local users as negative labels.

We first list all research questions, followed by the experimental setup, data analysis, and performance analysis. Our major research questions for evaluating the user latent preference model are:


• How accurate are the predictions of the user latent preference model? What percentage of local users can the model cover?

• How does performance compare across different user segments? Does the hierarchical Bayesian regression model have better predictive power than the basic regression model?

5.1 Evaluation Metrics

As mentioned before, we label a user as positive if the user intends to stay at the current location, and negative otherwise. This naturally reduces to a binary classification task. Traditional classification metrics are used to evaluate the predictive power of the user latent preference model: precision, recall, and F1. In the following equations, TP, FP, and FN denote true positives, false positives, and false negatives, respectively.

\mathrm{precision} = \frac{TP}{TP + FP}

\mathrm{recall} = \frac{TP}{TP + FN}

F1 = \frac{2}{\frac{1}{\mathrm{precision}} + \frac{1}{\mathrm{recall}}}
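The three metrics above are a direct computation from the confusion counts; a short sketch with made-up counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (harmonic mean) from Section 5.1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)
    return precision, recall, f1

# Illustrative counts: 80 local users correctly predicted local,
# 20 non-local users mislabeled local, 40 local users missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```

Because F1 is the harmonic mean, it is dragged toward whichever of precision and recall is worse, so a model cannot score well by inflating only one of the two.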

5.2 Models to Compare

In our experiments, we compare the basic regression model with hierarchical regression models using different segmentations. As shown in Table 2, M-one is the basic multinomial regression model with no segments. The other models segment users based on different features.

All models use the same set of features for the core regression model. Features include both static user information (age, gender, region, current job, current industry, etc.) and dynamic behavior information (# of local job applications in the last month, # of non-local job applications in the last month, last profile modification time, etc.). Extensive feature engineering could be explored to improve model performance, which is beyond the scope of this paper.
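The hierarchical idea can be illustrated with a simplified sketch. The paper's actual model is a multinomial regression estimated with variational Bayesian inference; the stand-in below replaces that with per-segment logistic regressions whose weights are shrunk toward shared global weights via an L2 penalty, which captures the same "segments learn from each other" intuition. All data, segment names, and hyperparameters here are invented:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, prior_w=None, tau=0.0, lr=0.1, epochs=200):
    """Gradient-descent logistic regression. When prior_w is given,
    weights are shrunk toward it with strength tau -- a crude stand-in
    for the paper's hierarchical Bayesian prior over segment models."""
    d = len(xs[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j in range(d):
                grad[j] += err * x[j]
        for j in range(d):
            if prior_w is not None:
                grad[j] += tau * (w[j] - prior_w[j])  # shrink toward global weights
            w[j] -= lr * grad[j] / len(xs)
    return w

# Toy data: feature = [bias, scaled activity signal]; label = "local user".
random.seed(0)
data = [([1.0, x], 1 if x > 0.5 else 0) for x in [random.random() for _ in range(200)]]
segments = {"young": data[:40], "mid": data[40:120], "senior": data[120:]}

# Global model on all users, then one shrunk model per (possibly sparse) segment.
global_w = fit_logistic([x for x, _ in data], [y for _, y in data])
segment_w = {
    s: fit_logistic([x for x, _ in d], [y for _, y in d], prior_w=global_w, tau=1.0)
    for s, d in segments.items()
}
```

Small segments (here "young", with only 40 users) are pulled most strongly toward the global solution, while large segments can deviate where their data supports it.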

5.3 Dataset

As mentioned earlier in the paper, we apply our model in the context of a job recommender system and use a real-world dataset from LinkedIn to evaluate our models. In order to build the user latent preference model, we randomly sampled from the job application data of United States users over three months, from January 2014 to March 2014. The sample contains around 530k users that applied to a reasonable number of jobs in that period¹. If all jobs that a user applied to are local jobs, we label the user as a local user, and otherwise as a non-local user. In the dataset, 57.22% of users are labeled as local users (positive data) while 42.78% are labeled as users that are willing to relocate (negative data). 10-fold cross validation is performed in the experiments. A significance level of 0.05 with the paired two-tailed t-test is used to compare two models.
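The labeling rule described above is simple enough to state in code (a sketch; distances here are assumed precomputed, e.g. from postal-code centroids):

```python
# Hedged sketch of the labeling rule: a user is "local" (positive) only if
# every job they applied to is within 50 miles of their current location.

LOCAL_RADIUS_MILES = 50

def label_user(application_distances_miles):
    """1 = local user (all applications within 50 miles), 0 = non-local."""
    return int(all(d <= LOCAL_RADIUS_MILES for d in application_distances_miles))

assert label_user([3.0, 12.5, 49.9]) == 1   # all applications are local
assert label_user([3.0, 480.0]) == 0        # one non-local application flips the label
```

Note the asymmetry: a single non-local application is enough to label a user as willing to relocate.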

We further analyze the data by dividing users into different segments (age, gender, region). In Figure 2, the bars plot the percentage of users in each segment. It shows that the data sparsity issue exists (in both the age and region dimensions). In addition, the line plots the percentage of local users, who only applied to local jobs. It indicates that users in different segments have different priors on relocating. This justifies our motivation for building a hierarchical Bayesian regression model that can 1) let different segments learn from each other to address the problem of data sparsity, and 2) let each segment learn its own regression model based on the users in the segment.

¹The exact number is omitted due to business concerns.

Some interesting observations include: 1) Users around 22 years old are more likely to relocate. This follows the intuition that college graduates often change location when applying for their first jobs. 2) The probability of staying in the same local area does not increase for older applicants. 3) Females are slightly more likely to stay in the same local area. 4) Applicants from big cities/regions tend to stay in the same local area. The top 5 regions whose users tend to be local are the San Francisco Bay Area, Greater New York City Area, Houston, Texas Area, Greater Denver Area, and Greater Seattle Area. 5) Applicants from small towns tend to relocate. The top regions whose users tend to relocate are usually associated with a university or college, including the University of South Carolina, University of Wisconsin-Madison, Syracuse University, Michigan State University, and so on.

The findings from the LinkedIn job application data are consistent with results from demographic surveys [6]. 37% of American adults have never left their hometown, and 57% of American adults have never lived outside their home state. More educated individuals are more likely to have relocated during their lifetimes (77% of college graduates have changed communities at least once). People are most likely to move when they are young, and people from the Midwest are less likely to move than those from the Western United States. People who have moved before are much more likely to relocate than those who have never moved. People with higher incomes are more likely to move for job opportunities, while lower-income individuals cite different reasons for moving. The demographic data motivates our basic hypothesis that some people will not move under any circumstances, and these users should not be shown non-local job recommendations.

5.4 Performance Analysis

We evaluate the classification performance of the different user latent preference models by treating the task as a classification task. The model predicts a user as a local user if the probability is higher than the classification threshold. We show the performance for classification thresholds from 0.1 to 0.9. The left plot in Figure 3 shows the performance of the basic regression model M-one. It has decent performance in all metrics, including precision, recall, and F1. High precision ensures that we only filter out non-local jobs for local users. High recall ensures that we can discover most local users and achieve better downside management for them. The F1 metric balances the trade-off between precision and recall. As demonstrated in the right plot of Figure 3, all hierarchical Bayesian regression models improve precision significantly. As shown in Table 2, there are different ways of segmenting users. Among all models, M-age has the highest lift (5.82% at classification threshold 0.3) with 7 user segments. It is essential to choose the right user segments in the model. Models with too few segments



Table 2: Different segmentation methods of the user latent preference model.

Model          | User segment     | # of segments | Sample segments
M-one          | none             | 1             | N/A
M-age          | age              | 7             | <20 years old, 20-25 years old, 25-30 years old, ...
M-gender       | gender           | 2             | female, male
M-region       | region           | 282           | San Francisco Bay Area, Greater New York Area, ...
M-ageGender    | (age, gender)    | 14            | (<20 years old, female), ...
M-ageRegion    | (age, region)    | 1974          | (<20 years old, San Francisco Bay Area), ...
M-regionGender | (region, gender) | 564           | (San Francisco Bay Area, female), ...

Figure 2: Data analysis of different user segments in the job domain. The bars plot the percentage of users in each segment (left axis). The line plots the percentage of local users in each segment (right axis). The three plots correspond to age, gender, and region segments respectively. Major observations: 1) the data sparsity issue exists in some user segments; 2) users in different segments have different priors on relocating.

(M-gender) or too many segments (M-region, M-ageRegion, M-regionGender, M-ageGender) do not work as well as the M-age model.

6. EVALUATION OF THE RECOMMENDER SYSTEM

In this section, we incorporate the user latent preference model into the recommender system and evaluate its contribution to downside management. We first list all research questions that we intend to answer, followed by the experimental setup and performance analysis. Major research questions include:

• Does adding a filtering component to the basic recommender system achieve higher user satisfaction?

• Could the user latent preference help to improve the downside management of recommender systems? Does it achieve even higher user satisfaction?

• What about an extreme filter that removes all non-local jobs for all users? Does it work as well? What is the contribution of using the prediction from the user latent preference model?

6.1 Models to Compare

In our offline and online experiments, we compare the following models. All models are trained with millions of job applications that were randomly sampled over months and tested with 17K users in the testing period. The threshold of the filtering component is determined by the utility set, as shown in Equation 14.

Baseline is the baseline recommender system that selects a set of items for the user without considering the user latent preference. It could be any state-of-the-art recommendation model; we choose logistic regression here as an example, since it is common practice with decent performance and low time complexity [20, 29, 28]. The baseline recommender system does not have a filtering component. It always sends a set of top K items to each user (K is set to 5). When we build the baseline logistic regression model, we leverage basic features including similarity-related features between the user profile (including the user's working experience, education information, etc.) and the job information (including the job's title, description, etc.), user behavior features, and so on. Features include location-related features as well, such as the transition probability from the user's location to the job's location.



Baseline.F applies a filter on the recommendation set from Baseline based on the recommendation probability p_rec(i, j), without considering the user's latent preference.

Baseline+Latent.F applies a filter on the recommendation set based on the joint probability p(i, j), which considers the user's latent relocation preference. As a result, the model recommends local jobs to users that prefer to stay in the local area and non-local jobs to other users.

Baseline+AllLocal.F applies a filter on the recommendation set based on the joint probability p(i, j), but assumes that all users are local users. In other words, it estimates p(y_m,i = local) = 1 for all users. As a result, the model recommends local jobs to all users.
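The three filtering variants can be contrasted in a short sketch. The exact joint probability and the utility-derived threshold (Equation 14) are not reproduced here; the simple product p_rec × p_match below is an assumption for illustration, and all job IDs and scores are invented:

```python
# Illustrative sketch of the three filters from Section 6.1.
# p_rec: base recommendation probability from the Baseline model.
# p_local: predicted probability that the user prefers local jobs
#          (output of the latent preference model).

def filter_recs(recs, threshold, p_local=None, all_local=False):
    """recs: list of (job_id, p_rec, is_local). Returns job_ids kept."""
    kept = []
    for job_id, p_rec, is_local in recs:
        if all_local:                  # Baseline+AllLocal.F: assume p_local = 1
            p_match = 1.0 if is_local else 0.0
        elif p_local is not None:      # Baseline+Latent.F: joint probability
            p_match = p_local if is_local else 1.0 - p_local
        else:                          # Baseline.F: ignore the latent preference
            p_match = 1.0
        if p_rec * p_match >= threshold:
            kept.append(job_id)
    return kept

recs = [("j1", 0.6, True), ("j2", 0.5, False), ("j3", 0.3, True)]
baseline_f = filter_recs(recs, threshold=0.25)                    # keeps j1, j2, j3
latent_f = filter_recs(recs, threshold=0.25, p_local=0.9)         # drops non-local j2
all_local_f = filter_recs(recs, threshold=0.25, all_local=True)   # keeps only local jobs
```

For a user the model is confident is local (p_local = 0.9), the latent filter suppresses the non-local job even though its base score passes the threshold.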

6.2 Offline Testing Setup

As we discussed in Section 4, user satisfaction is affected by both good recommendations and bad recommendations. The corresponding evaluation metric is the average utility, which reflects user satisfaction. Assume that there are U users to send recommendations to in the testing period. The average utility can be calculated as follows:

utility = (Σ_{i=1}^{U} utility_i) / U

Unlike traditional metrics such as the number of applications/views, the utility metric considers both the positive effect of good recommendations and the negative effect of bad recommendations. The higher the utility, the better the model.
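The averaging above can be sketched directly (an illustration, not the authors' code; the per-user outcomes are invented, and the utility values mirror utility set 1 from Table 3):

```python
# Hedged sketch of the offline utility metric: each recommendation outcome
# is scored with the utility set, summed per user, then averaged over users.

def user_utility(outcomes, u_tp, u_fp, u_fn, u_tn=0.0):
    """outcomes: list of 'TP' | 'FP' | 'FN' | 'TN' for one user."""
    table = {"TP": u_tp, "FP": u_fp, "FN": u_fn, "TN": u_tn}
    return sum(table[o] for o in outcomes)

def average_utility(per_user_outcomes, **utility_set):
    return sum(user_utility(o, **utility_set) for o in per_user_outcomes) / len(per_user_outcomes)

# Two toy users: one with 2 applied recs + 1 ignored rec, one missed good job.
avg = average_utility(
    [["TP", "TP", "FP"], ["FN"]],
    u_tp=2, u_fp=-1, u_fn=-1,
)
# user 1: 2 + 2 - 1 = 3; user 2: -1; average = (3 - 1) / 2 = 1.0
```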

Table 3: Average utility of recommendations in the offline scenario. There are 17K users in the test dataset, who applied to 2.12 jobs on average. Rec coverage is the percentage of users that are recommended at least one item. Rec count is the average number of unique recommendations pushed to each user.

Utility set 1 (uTP = 2, uFN = -1, uFP = -1, uTN = 0; threshold = 0.25)

Model        | Baseline | Baseline.F | Baseline+Latent.F | Baseline+AllLocal.F
Utility      | 0.590    | 1.300      | 1.895             | 1.235
Rec coverage | 100%     | 98.4%      | 95.0%             | 74.8%
Rec count    | 5        | 4.282      | 3.488             | 2.852

Utility set 2 (uTP = 2, uFN = -1, uFP = -2, uTN = 0; threshold = 0.4)

Model        | Baseline | Baseline.F | Baseline+Latent.F | Baseline+AllLocal.F
Utility      | -2.479   | -0.206     | 1.039             | 0.500
Rec coverage | 100%     | 96.6%      | 88.9%             | 74.2%
Rec count    | 5        | 3.882      | 2.969             | 2.831

Utility set 3 (uTP = 2, uFN = -2, uFP = -1, uTN = 0; threshold = 0.2)

Model        | Baseline | Baseline.F | Baseline+Latent.F | Baseline+AllLocal.F
Utility      | 0.393    | 0.951      | 1.460             | 0.486
Rec coverage | 100%     | 99.0%      | 96.3%             | 75.6%
Rec count    | 5        | 4.460      | 3.729             | 2.854

6.3 Online A/B Testing Setup

In the online A/B testing, we evaluate the performance of the recommender systems with real-world users on LinkedIn. We randomly select 5% of United States users² that visit LinkedIn for each model and present the corresponding recommendations to each user group. The difference in performance between two user buckets is reported in the performance analysis. A significance level of 0.05 with the paired two-tailed t-test is used to compare two models. We let each model run for one week to burn in the novelty effect (active job seekers tend to apply to any new jobs they see due to a change in the recommendation model) and start the comparison after that.

²By the end of 2014, there were more than 111 million LinkedIn members in the United States.
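The statistical comparison can be sketched as a paired t statistic on per-period metric differences. This is a simplified stand-in: the daily VPI numbers are invented, and the critical value 2.0 is a rough placeholder for the exact two-tailed t quantile at α = 0.05:

```python
import math

# Simplified sketch of the model-comparison procedure: a paired two-tailed
# t-test on per-day metric differences between two experiment buckets.

def paired_t_statistic(metric_a, metric_b):
    """t statistic for paired samples (e.g., daily VPI of two buckets)."""
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical daily VPI for one week in each bucket.
daily_vpi_a = [0.112, 0.118, 0.115, 0.120, 0.117, 0.119, 0.116]
daily_vpi_b = [0.101, 0.104, 0.103, 0.106, 0.102, 0.105, 0.104]
t = paired_t_statistic(daily_vpi_a, daily_vpi_b)
significant = abs(t) > 2.0  # rough threshold; use the exact t quantile in practice
```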

In the LinkedIn product, summaries of job vacancy postings are shown to the user. We call this an impression (each job vacancy summary shown to the user is considered to be impressed). Job seekers can open a new page to view the entire job description, which we call a view action. In addition, users apply to a job if they wish to be considered for the position, which we call an apply action. We report metrics that measure the model's performance in both upside management and downside management: the number of impressions, the number of views and applications, API (# of applications / # of impressions), and VPI (# of views / # of impressions).
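The two ratio metrics follow directly from event counts (a sketch; the event names and the toy log are illustrative):

```python
# Sketch of the two funnel metrics from impression/view/apply event counts.

def funnel_metrics(events):
    """events: list of 'impression' | 'view' | 'apply'. Returns (VPI, API)."""
    impressions = events.count("impression")
    vpi = events.count("view") / impressions   # views per impression
    api = events.count("apply") / impressions  # applications per impression
    return vpi, api

log = ["impression"] * 100 + ["view"] * 12 + ["apply"] * 3
vpi, api = funnel_metrics(log)   # VPI = 0.12, API = 0.03
```

Because both metrics are normalized by impressions, a model that shows fewer but better-targeted recommendations can raise VPI and API even while total views stay flat, which is exactly the pattern reported in the online results.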

6.4 Performance Analysis

6.4.1 Offline Experiments

We first evaluate all models in the offline scenario. The testing data includes 17K users who applied to 2.12 jobs on average. Three different sets of utilities are tested, as shown in Table 3. Each set reflects a different scenario (depending on a user's tolerance of bad recommendations) and could be tuned by domain experts in real-world applications.

In the first utility set, the utility uTP for a good recommendation that the user applies to is set to 2. When the system shows a job and the user does not apply to it, the utility uFP is -1. The penalty uFN for missing a good recommendation is set to -1. The resulting filtering threshold is 0.25. All models that apply a filter perform better than the Baseline model without filtering. Among the three models that have a filtering component, both Baseline+Latent.F and Baseline+AllLocal.F perform better than the Baseline.F model by filtering out many downside items (i.e., jobs that are not in the user's preferred area). This follows our intuition that the probability p(i, j) with latent preference is a stronger signal than the pure recommendation probability p_rec(i, j). In addition, the Baseline+Latent.F model still keeps potentially good jobs that a non-local user wants to apply to, which leads to a higher utility than the Baseline+AllLocal.F model. Note that the average number of recommendations of Baseline+Latent.F is significantly higher than that of the Baseline+AllLocal.F model.

In the second utility set, the utility uFP is set to -2, indicating that the user is less tolerant of downside items. Thus the threshold for recommending an item is higher, at 0.4. Not surprisingly, the Rec coverage of Baseline+Latent.F decreases compared with the first utility set. It still performs the best among all four models.

In the third utility set, uFN is set to -2. It is the penalty for not recommending a good job that the user would have applied to. The Baseline+Latent.F model still achieves the highest utility among all models. Baseline+AllLocal.F has a lower utility since it misses some potentially good recommendations for non-local users.



Figure 3: Performance of the user latent preference model in the classification task, across classification thresholds. The left plot shows the performance of the baseline multinomial regression (M-one) in precision, recall, and F1. The right plot shows the precision lift of the hierarchical Bayesian regression models (gender, region, ageGender, ageRegion, regionGender, age).

Table 4: Performance comparison in online A/B testing. Each number shows the lift of the model. Numbers in bold are significantly better than the value of the corresponding baseline model. The absolute numbers are omitted due to business concerns.

Model                                     | # of impressions | # of applications | # of views | API    | VPI
Baseline+Latent.F vs. Baseline            | -11.1%           | +0.5%             | -0.6%      | +12.9% | +11.7%
Baseline+AllLocal.F vs. Baseline          | -18.4%           | -10.4%            | -6.1%      | +9.7%  | +15%
Baseline+Latent.F vs. Baseline+AllLocal.F | +8.9%            | +12.1%            | +5.8%      | +2.9%  | -2.9%

6.4.2 Online Experiments

In the online A/B testing, the performance follows our intuition, as shown in Table 4. The Baseline+Latent.F model has significantly better downside management than the Baseline model. In detail, the Baseline+Latent.F model shows fewer impressions to users. The corresponding API and VPI metrics increase significantly while the numbers of applications and views remain stable. This shows that downside items that are not interesting to users are removed from the recommendation set. The performance is consistent with that in the offline experiments.

If we filter out non-local recommendations for all users, we obtain the model Baseline+AllLocal.F, whose performance is also shown in Table 4. As expected, it shows far fewer impressions. Although VPI and API increase significantly, the absolute numbers of applications and views drop. Such a model might filter out some good non-local recommendations for users who are willing to relocate. For example, recent college graduates often prefer to relocate to other cities for their first job. In this case, Baseline+AllLocal.F still recommends only local jobs to these users, whereas the user latent preference model Baseline+Latent.F would treat these users as non-local users and recommend jobs in non-local areas.

For a real-world recommender system, one goal is to improve upside management, which is reflected by the # of applications and # of views metrics. Good upside management helps users discover items that they like, which saves users' time and effort. While good upside management keeps users engaged with the system, it is equally important for a recommender system to have good downside management, which aims to avoid showing obviously terrible items to users. For example, if a user would like to stay in the San Francisco Bay Area, a good recommender system shouldn't recommend a job in New York City even if the job is a perfect match. Good downside management is reflected in high API and VPI. It makes recommendations more explainable and transparent to users, which in turn helps to increase user satisfaction.

7. CONCLUSION AND FUTURE WORK

In this paper, we propose the user latent preference model to predict the probability of a user having a specific preference. It is a general framework that could be applied to any existing recommender system in any domain (including job search, e-commerce, movie recommendations, etc.). The model uses multinomial regression to predict the user's preference among all candidates. We build a regression model for each user segment. In order to tackle the data sparsity issue, we further extend the model with a hierarchical Bayesian framework. The variational Bayesian inference algorithm makes the parameter estimation more computationally efficient. In the recommendation stage, we leverage the prediction of the user's latent preference to estimate the joint probability of a user accepting an item. We show the effectiveness of the user latent preference model in the context of recommending the right job at the right location. Both offline experiments and online A/B testing demonstrate that our proposed model has better downside management, which leads to higher user satisfaction.

This is just the first step toward tackling downside management in recommender systems. In this paper, we propose to use multinomial regression as the core model in the user latent preference model. Other distributions or models could be explored for other dimensions. In addition, the current model predicts a user's latent preference in one dimension, such as time, location, and so on. It would be interesting to explore how to integrate a user's latent preferences in multiple dimensions to provide better recommendations. Last but not least, the model could be evaluated in other domains to show its effectiveness in downside management.

8. REFERENCES

[1] G. Adomavicius and A. Tuzhilin. Context-aware recommender systems. In Recommender Systems Handbook, pages 217-253. Springer, 2011.

[2] L. Baltrunas and X. Amatriain. Towards time-dependent recommendation based on implicit feedback. In Context-aware Recommender Systems Workshop at RecSys '09. ACM, 2009.

[3] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University of London, 2003.

[4] J. Bennett and S. Lanning. The Netflix Prize. 2007.

[5] D. M. Blei and J. D. Lafferty. A correlated topic model of science. The Annals of Applied Statistics, pages 17-35, 2007.

[6] D. Cohn and R. Morin. Who Moves? Who Stays Put? Where's Home? Technical report, Dec 2008.

[7] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 39-46. ACM, 2010.

[8] L. Do and H. W. Lauw. Modeling contextual agreement in preferences. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14, pages 315-326. International World Wide Web Conferences Steering Committee, 2014.

[9] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, IJCAI '01, pages 973-978. Morgan Kaufmann Publishers Inc., 2001.

[10] N. Golbandi, Y. Koren, and R. Lempel. Adaptive bootstrapping of recommender systems using decision trees. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 595-604. ACM, 2011.

[11] N. Hariri, B. Mobasher, and R. Burke. Query-driven context aware recommendation. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 9-16, New York, NY, USA, 2013. ACM.

[12] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference, pages 230-237. ACM, 1999.

[13] R. Jin, L. Si, C. Zhai, and J. Callan. Collaborative filtering with decoupled models for preferences and ratings. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM '03, pages 309-316. ACM, 2003.

[14] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 426-434. ACM, 2008.

[15] X. Liu and K. Aberer. SoCo: A social network aided context-aware recommender system. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 781-802. International World Wide Web Conferences Steering Committee, 2013.

[16] B. Marlin and R. S. Zemel. The multiple multiplicative factor model for collaborative filtering. In Proceedings of the 21st International Conference on Machine Learning, ICML '04, page 73. ACM, 2004.

[17] R. J. Mooney and L. Roy. Content-based book recommending using learning for text categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, DL '00, pages 195-204. ACM, 2000.

[18] T. V. Nguyen, A. Karatzoglou, and L. Baltrunas. Gaussian process factorization machines for context-aware recommendations. In Proceedings of the 37th International ACM SIGIR Conference, SIGIR '14, pages 63-72. ACM, 2014.

[19] U. Panniello, A. Tuzhilin, M. Gorgoglione, C. Palmisano, and A. Pedone. Experimental comparison of pre- vs. post-filtering approaches in context-aware recommender systems. In Proceedings of the Third ACM Conference on Recommender Systems, RecSys '09, pages 265-268. ACM, 2009.

[20] D. Parra, A. Karatzoglou, X. Amatriain, and I. Yavuz. Implicit feedback recommendation via implicit-to-explicit ordinal logistic regression mapping. In Proceedings of CARS 2011, 2011.

[21] D. Parra-Santander and P. Brusilovsky. Improving collaborative filtering in social tagging systems for the recommendation of scientific articles. Web Intelligence and Intelligent Agent Technology, 1:136-142, 2010.

[22] M. Pazzani, D. Billsus, S. Michalski, and J. Wnek. Learning and revising user profiles: The identification of interesting web sites. In Machine Learning, pages 313-331, 1997.

[23] E. Shmueli, A. Kagian, Y. Koren, and R. Lempel. Care to comment? Recommendations for commenting on news stories. In Proceedings of the 21st International Conference on World Wide Web, WWW '12, pages 429-438. ACM, 2012.

[24] L. Si and R. Jin. Flexible mixture model for collaborative filtering. In Proceedings of ICML, pages 704-711. AAAI Press, 2003.

[25] K. Verbert, N. Manouselis, X. Ochoa, M. Wolpers, H. Drachsler, I. Bosnic, and E. Duval. Context-aware recommender systems for learning: a survey and future challenges. IEEE Transactions on Learning Technologies, 5(4):318-335, 2012.

[26] J. Wang, B. Sarwar, and N. Sundaresan. Utilizing related products for post-purchase recommendation in e-commerce. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 329-332. ACM, 2011.

[27] J. Wang and Y. Zhang. Utilizing marginal net utility for recommendation in e-commerce. In Proceedings of the 34th International ACM SIGIR Conference, SIGIR '11, pages 1003-1012. ACM, 2011.

[28] J. Wang and Y. Zhang. Opportunity model for e-commerce recommendation: right product, right time. In Proceedings of the 36th International ACM SIGIR Conference, SIGIR '13. ACM, 2013.

[29] J. Wang, Y. Zhang, C. Posse, and A. Bhasin. Is it time for a career switch? In Proceedings of the 22nd International Conference on World Wide Web, WWW '13. ACM, 2013.

[30] J. Weston, R. J. Weiss, and H. Yee. Nonlinear latent factorization by embedding multiple user interests. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 65-68. ACM, 2013.

[31] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference, SIGIR '14, pages 83-92. ACM, 2014.
