
The Pareto Frontier of Utility Models as a Framework for Evaluating Push Notification Systems

Gaurav Baruah and Jimmy Lin
David R. Cheriton School of Computer Science

University of Waterloo, Ontario, Canada
{gbaruah,jimmylin}@uwaterloo.ca

ABSTRACT
We propose a utility-based framework for the evaluation of push notification systems that monitor document streams for users’ topics of interest. Our starting point is that users derive either positive utility (i.e., “gain”) or negative utility (i.e., “pain”) from consuming system updates. By separately keeping track of these quantities, we can measure system effectiveness in a gain vs. pain tradeoff space. The Pareto Frontier of evaluated systems represents the state of the art: for each system on the frontier, no other system can offer more gain without more pain. Our framework has several advantages: it unifies three previous TREC evaluations, subsumes existing metrics, and provides more insightful analyses. Furthermore, our approach can easily accommodate more refined user models and is extensible to different information-seeking modalities.

1 INTRODUCTION
There is growing interest in push notification systems that prospectively monitor continuous document streams to identify relevant, novel, and timely content, delivered to users as push notifications. The Temporal Summarization [2, 8], Microblog [11], and Real-Time Summarization [12] Tracks at recent Text Retrieval Conferences (TRECs) all explore this basic idea.

We propose a novel framework for evaluating push notification systems that focuses on the Pareto Frontier of a utility model in which users derive positive utility (i.e., gain) from consuming relevant content and negative utility (which we refer to as “pain”) from consuming non-relevant content. A system occupies a point in this pain vs. gain tradeoff space, and instead of trying to collapse both quantities into a single-point metric (as previous evaluations attempt to do), we examine the Pareto Frontier of systems in this utility space. Systems that lie on the frontier are all “optimal” in the sense that for a particular level of gain, no other system offers a user experience with less pain.

Our framework makes the following contributions to the evaluation of push notification systems:
• It provides a unified evaluation methodology that encompasses multiple recent TREC evaluations.


• It subsumes previous metrics used in those evaluations.
• It provides more insightful comparisons of system effectiveness that capture different operating points.
• It is extensible to different information-seeking modalities and more refined user models.

This paper details our evaluation framework, applies it to previous TREC evaluations, and elaborates on each of the points above.

2 BACKGROUND AND RELATED WORK
Work on prospective information needs against document streams dates back at least a few decades. Early examples include the TREC Filtering Tracks from 1995 [10] to 2002 [15] and research under the umbrella of topic detection and tracking (TDT) [1]. Over the past few years, we have seen renewed interest in this area, due to the growing popularity of social media and the tighter response cycles demanded by various end users (e.g., journalists, politicians, traders) who need to act on high-velocity information.

The TREC Temporal Summarization (TS) Tracks [3, 8], which took place from 2013 to 2015, imagined a stream of newswire articles that a system processed in real time to incrementally select relevant, non-redundant sentences pertinent to events of interest, such as the 2012 Pakistan garment factory fires or the 2012 Buenos Aires train crash. The putative user of such a system might be a disaster relief coordinator or a journalist, and although somewhat underspecified in the evaluation itself, system output (i.e., “updates”) might be proactively delivered to the user via push notifications. Alternatively, based on the model of Baruah et al. [6], users may periodically poll the system for updates.

The Temporal Summarization Tracks evaluated system output in terms of two metrics: expected gain, $(1/|S|) \sum_{s \in S} G(s)$, and comprehensiveness, $(1/|Z|) \sum_{s \in S} G(s)$, where $S$ is the set of updates returned by the system, $Z$ is the set of relevant updates, and $G$ computes the gain for an update. Typically, $G$ factors in redundancy (i.e., systems are not rewarded for returning two updates that “say the same thing”), verbosity (i.e., shorter updates are preferred over longer updates), and timeliness (i.e., incorporating some type of latency penalty). The exact details varied across track iterations, but the overall form remained constant. Expected gain is analogous to precision and comprehensiveness is analogous to recall, so these two metrics are quite familiar to any IR researcher.
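To make the computation concrete, here is a minimal sketch in Python under our reading of these definitions; the function names and the gain() callback (standing in for whatever redundancy-, verbosity-, and latency-aware G a given track iteration used) are ours, not from the official track tooling:

```python
def expected_gain(system_updates, gain):
    """Precision-like metric: (1/|S|) * sum of G(s) over returned updates."""
    if not system_updates:
        return 0.0
    return sum(gain(s) for s in system_updates) / len(system_updates)

def comprehensiveness(system_updates, relevant_updates, gain):
    """Recall-like metric: (1/|Z|) * sum of G(s), normalized by the
    number of relevant updates |Z| rather than the number returned."""
    if not relevant_updates:
        return 0.0
    return sum(gain(s) for s in system_updates) / len(relevant_updates)
```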

At around the same time as the TS evaluations, the TREC Microblog (MB) Tracks were exploring information needs against social media streams (specifically Twitter). For the evaluation in 2015, systems were given “interest profiles” and the task was to monitor the live Twitter stream to identify relevant and novel tweets in a timely fashion, which were putatively delivered to users as push notifications.


For example, a user might be interested in anecdotes of people injured while playing Pokemon Go and wishes to be notified whenever such reports appear. The convergence of the TS and MB Tracks led to their merger in TREC 2016 to form the Real-Time Summarization (RTS) Track, following the basic setup of content filtering on tweets. The novel aspect of the evaluation in 2016 was live user participation: system updates were actually delivered as push notifications to the mobile devices of users, who provided judgments in situ [12, 16].

The MB and RTS evaluations used metrics borrowed heavily from the TS evaluations, although the exact formulations differed. Nevertheless, both evaluations measured expected gain as well as cumulative gain (what TS called comprehensiveness). Once again, these are basically variants of precision and recall. The MB and RTS evaluations additionally re-introduced a linear utility metric that dates back to the TREC Filtering Tracks of the 1990s [10, 15]: gain minus pain (GMP), which is a linear interpolation between positive and negative utility based on an arbitrary weight. Our utility-based framework is derived from this formulation, but with a critical difference: instead of interpolating gain and pain to produce a single-point metric, we separately present the two quantities in an effectiveness tradeoff space.
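For reference, the GMP interpolation takes roughly the form $\mathrm{GMP} = \alpha \cdot G - (1 - \alpha) \cdot P$, where $G$ is the total gain, $P$ the total pain, and $\alpha \in [0, 1]$ the interpolation weight; the exact weighting scheme varied across evaluations, so this should be read as the generic shape of the metric rather than any one track’s definition.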

3 EVALUATION MODEL
We assume a task model in which users specify prospective information needs against a stream of discrete text units (sentences, tweets, etc.). The system monitors this stream, and whenever it identifies a relevant piece of text, the user is proactively notified; in operational terms, this might be a push notification to the user’s mobile device, a notification pop-up on the user’s desktop, or some other mechanism of grabbing the user’s attention. Generically, we call these updates. This task setup broadly encompasses the TREC evaluations described in Section 2.

To evaluate such systems, we assume a simple user model: that the user will read each update with some probability p, which we call the persistence parameter. A high persistence value models an “eager” user, and a low persistence value models a user who is likely to ignore the updates. In addition, we assume that unread updates are accumulated in a “holding area” somewhere on the user’s device, e.g., a mobile app, a desktop notification tray, etc. Whenever the user examines an update, we assume that the user will also read each subsequent item in the holding area with probability p. The decision is made independently with respect to each item and does not take into account the item’s relevance (similar to RBP [14]).
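The following sketch simulates this user model; names such as simulate_user and is_relevant are ours, and the assumption that skipped holding-area items remain queued for later reads is one possible reading of the model:

```python
import random

def simulate_user(updates, p, is_relevant, seed=None):
    """Return (gain, pain) accumulated by a simulated user with persistence p."""
    rng = random.Random(seed)
    gain = pain = 0.0
    holding_area = []

    def consume(item):
        nonlocal gain, pain
        if is_relevant(item):
            gain += 1.0   # one unit of positive utility per relevant update
        else:
            pain += 1.0   # one unit of negative utility per non-relevant update

    for update in updates:              # updates arrive in delivery order
        if rng.random() < p:            # the user notices and reads this update
            consume(update)
            remaining = []              # each queued item is read independently
            for item in holding_area:
                if rng.random() < p:
                    consume(item)
                else:
                    remaining.append(item)
            holding_area = remaining
        else:
            holding_area.append(update)
    return gain, pain

# Example: ten updates, alternating relevant and non-relevant, eager user.
g, pn = simulate_user(range(10), p=0.9, is_relevant=lambda i: i % 2 == 0, seed=42)
```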

Given this user model, we can construct a utility model for evaluating systems. Whenever the user reads an update, she derives some utility: we refer to positive utility as “gain” and negative utility as “pain”. For a particular topic, over a specified evaluation period, a particular system or algorithm delivers to the user a certain amount of pain and gain. We average across multiple topics to arrive at a system’s overall effectiveness. To enable meaningful averaging, gain is normalized to the maximum gain based on the evaluation ground truth (e.g., all nuggets in the evaluation pools).¹

¹Note that this maximum gain may or may not be achievable under a particular user model. For example, a low-persistence user may never have the opportunity to accumulate much gain, even if every update contains relevant information. This design represents one of many possible choices, each with advantages and disadvantages.

Thus, each system can be represented as a point in the gain vs. pain tradeoff space. Below, we explain precisely how gain and pain are operationalized in the three evaluations we examined.

Temporal Summarization. We applied our evaluation framework to data from the TS Track at TREC 2014 (TS14). All the TS evaluations employed a nugget-based evaluation methodology [3], and thus expected gain (i.e., the precision-like metric) and comprehensiveness (i.e., the recall-like metric) were both measured in terms of nuggets. For a given run, TS14 evaluated updates using normalized Expected Gain (nEG) and Comprehensiveness (C).

In our analysis, positive utility (i.e., gain) is accumulated when the user encounters a nugget in an update, and negative utility (i.e., pain) is accumulated when the user encounters a non-relevant update. Reading behavior is dictated by the user persistence model described above. We do not penalize gain for latency or verbosity [3, 6] in order to keep our implementation straightforward and intuitive, and pain in this case is simply the number of non-relevant updates consumed. As will be described later, our basic framework can be enriched with more refined user models.

There is one additional detail to note: a few systems in TS14 returned an enormous number of updates (on the order of 10⁴ updates per topic). It is inconceivable that so many updates would actually be delivered to a user, and therefore we preprocessed these runs prior to analysis. Fortunately, systems returned a confidence score with each update, and therefore in our experiments only updates above a normalized confidence score of 0.7 are pushed to the simulated user. We set this threshold to balance the volume of system output with the number of unjudged updates; because of the pooled evaluation procedure, most of the updates from the high-volume systems were unjudged. This preprocessing was necessary to bring the TS14 runs closer in line with a “reasonable” push notification scenario, although changing the threshold does not impact our findings (since it has the effect of throwing away mostly unjudged updates and affects only the high-volume runs). Updates with a normalized score below 0.7 are assumed to still be available in the “holding area”, but in practice, since the user consumes these updates in score order, they are rarely read under our user model.

Microblog and Real-Time Summarization. We analyzed data from the TREC 2015 Microblog Track (MB15) and the TREC 2016 Real-Time Summarization Track (RTS16). For both, positive utility accumulates when the user reads a relevant tweet (defined in the same way as in the official evaluation), and negative utility accumulates when the user reads a non-relevant tweet (as in the official evaluation, just the count of non-relevant tweets).

Since in both MB15 and RTS16 each system was limited to pushing ten tweets per topic per day, we assumed that all system updates were delivered to users. Unread notifications are assumed to be stored in a time-ordered queue on the user’s mobile app, to be consumed or ignored based on our persistence model.

4 RESULTS AND ANALYSIS
Evaluation results for TS14, MB15, and RTS16 are shown in Figure 1, based on the pain and gain definitions described in the previous section, modeled with a persistence of p = 0.5, i.e., half the push notifications are ignored. Each point represents an individual system in the evaluation.


Figure 1: Application of our evaluation framework to the TS14, MB15, and RTS16 evaluations, with persistence p = 0.5. The Pareto Frontier represents the best achievable gain for a given level of pain.

The Pareto Frontier (outlined in each figure) contains the set of systems such that, for each system, it is not possible to achieve greater gain without incurring greater pain. Each system along the frontier can claim to be “optimal”, in the sense that for the particular level of gain achieved, the system has the minimum pain. Put differently, all systems in the interior are dominated by systems along the frontier: for a particular level of gain, there is an alternate system (on the frontier) that provides at least that level of gain with smaller pain, and therefore a rational user would always prefer the system on the frontier. Systems on the frontier, however, are not comparable, since they represent different gain vs. pain tradeoffs; deciding between them would require knowing actual user preferences (more below). Naturally, the frontier is not continuous, and hence we interpolate between systems.
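Computing the frontier itself is straightforward; the sketch below (our naming, assuming one (pain, gain) point per system) keeps every system that strictly improves on the best gain among all systems with no more pain:

```python
def pareto_frontier(systems):
    """Return the (pain, gain) points not dominated by any other system.

    `systems` is a list of (pain, gain) pairs, one per evaluated system.
    A point dominates another if it has no more pain and no less gain.
    """
    # Sort by ascending pain, breaking ties by descending gain.
    ordered = sorted(systems, key=lambda pg: (pg[0], -pg[1]))
    frontier = []
    best_gain = float("-inf")
    for pain, gain in ordered:
        if gain > best_gain:     # strictly more gain than anything cheaper
            frontier.append((pain, gain))
            best_gain = gain
    return frontier

# Example: the system at (2, 0.3) is dominated by the one at (1, 0.4).
print(pareto_frontier([(1, 0.4), (2, 0.3), (3, 0.7), (5, 0.7)]))
# -> [(1, 0.4), (3, 0.7)]
```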

In what follows, we elaborate on the four key advantages of our evaluation framework that were outlined in the introduction:

Unified evaluation framework. We provide a unified view across three different TREC evaluations that share underlying similarities, but which up until now have employed different evaluation methodologies. Across these evaluations, we see the broad contours of system design and the tradeoffs in building push notification systems: in all cases, the Pareto Frontier rises steeply in the beginning and then flattens out. This suggests that there are a number of “obviously good” updates that are fairly easy to identify, but the marginal cost of discovering additional useful updates increases, such that the highest levels of recall cannot be achieved (with present systems) without an inordinate amount of pain. This is most apparent in the TS14 evaluation, where the frontier levels off markedly at higher gain.

Subsumption of previous metrics. In our framework, each system is represented by a point in the gain vs. pain tradeoff space. This position encodes information comparable to all existing metrics for TS, MB, and RTS. Expected gain relates to the slope of the line connecting each point to the origin. Recall metrics (e.g., comprehensiveness) can be visualized as the position of each system on the y axis. Linear utility corresponds to a weighted linear combination of the x and y positions. Furthermore, it is easy to construct iso-metric lines (i.e., lines along which systems have the same metric value): radial lines of different slopes emanating from the origin, parallel horizontal lines, and parallel diagonal lines for the three metrics above, respectively. Thus, our evaluation framework not only unifies but also subsumes previous evaluations; no information captured by previous metrics is lost.
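As an illustration of this subsumption (our naming; α is the linear-utility weight), all three single-point metrics can be read off a system’s (pain, gain) position:

```python
def derived_metrics(pain, gain, alpha=0.5):
    """Recover the earlier single-point metrics from a (pain, gain) position.

    - the precision-like value follows the slope of the line to the origin;
    - the recall-like value is simply the y position (normalized gain);
    - linear utility (GMP) is a weighted combination of the two axes.
    """
    slope = gain / pain if pain > 0 else float("inf")
    return {
        "precision_like_slope": slope,
        "recall_like": gain,
        "linear_utility": alpha * gain - (1 - alpha) * pain,
    }
```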

More insightful analyses. Beyond unifying previous evaluations and subsuming existing metrics, our framework addresses specific issues with previous metrics and supports more insightful analyses. In previous evaluations, precision and linear utility are hard to interpret because they don’t factor in update volume: for example, a system that returns 100 “good” and 100 “bad” updates receives the same score as a system that returns 1000 “good” and 1000 “bad” updates. In both cases, the precision is 0.5 and the linear utility is zero (assuming equal weight on gain and pain). However, it is obvious that these two systems provide completely different user experiences. If we rank systems by either metric, we hide important differences in how systems actually appear to real users.

In fact, any single-point metric necessarily encodes a particular user model representing a specific operating point, which is not ideal because we currently have little evidence of what users actually want in a push notification system (e.g., guidance from user studies). Specifically, the Pareto Frontier recognizes that there are different operating points for push notification systems in the space of pain vs. gain. Some operating points are more plausible than others (for example, in TS14 it is unlikely that users would be willing to endure over 5000 units of pain), but ultimately we need guidance from real-world users. Until then, we argue that it is dangerous to optimize for any single-point metric.

With guidance from our user model, we can explore different usage scenarios in our evaluation framework to conduct more insightful analyses. For example, the current user model contains a persistence parameter, which characterizes the extent to which a user will pay attention to or ignore a notification: we can vary the persistence parameter to model different types of users and note how the Pareto Frontier changes in each case. Figure 2 shows some of these results: we plot the Pareto Frontiers for three different values of persistence, p = {0.1, 0.5, 0.9}. In all three evaluations, we focus only on low-pain systems (note the range of the x axis compared to Figure 1). From the plots, we see that although the Pareto Frontier changes, its overall shape and the systems that lie on the frontier remain similar. Put differently, “good” systems seem to be invariant with respect to different user models, which suggests that our evaluation framework is robust with respect to different parameter settings.
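Under the assumptions of the earlier sketches, a sensitivity analysis like Figure 2 amounts to recomputing the frontier once per persistence value; the runs dictionary and relevance function below are toy stand-ins, and simulate_user() and pareto_frontier() are the functions sketched above:

```python
# Toy stand-in data: two hypothetical systems and a toy relevance function.
runs = {"sysA": list(range(20)), "sysB": list(range(40))}
is_relevant = lambda i: i % 3 == 0

for p in (0.1, 0.5, 0.9):
    points = []
    for updates in runs.values():
        gain, pain = simulate_user(updates, p, is_relevant, seed=0)
        points.append((pain, gain))
    print(f"p={p}: frontier = {pareto_frontier(points)}")
```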

Figure 2: Pareto Frontiers for the TS14, MB15, and RTS16 evaluations, modeling users with different persistence, p = 0.1, 0.5, 0.9.

Extensibility. Beyond merely exploring parameters in the existing model, our framework can be extended with richer user and utility models based on advances in our understanding of user behavior, operationalized in user simulations [4, 5]. Concrete examples include accurate models of user reading speed [7] in the calculation of pain and gain, time-based calibration of effectiveness measures to more accurately capture utility [17], incorporation of negative higher-order relevance [9], and modeling of users’ propensity to browse deep down a ranked list of items [13]. These improvements could all be applied to replace our simple persistence model.

In fact, our framework can even accommodate completely different modalities of information seeking with respect to prospective information needs. In this work we assume that system updates are delivered as push notifications, potentially interrupting the user. Alternatively, Baruah et al. [6] considered the scenario where updates accumulate without explicit notification, and a simulated user periodically returns to examine system output (unprompted by the system itself). Pull-based and push-based consumption of updates are merely the ends of a spectrum, and real-world users are likely to employ a combination of both strategies. It is certainly possible to capture these different modalities in our framework using a more refined user model, which would translate into different gain and pain measurements. However, our notion of a gain vs. pain tradeoff space remains relevant, and the Pareto Frontier can still serve as an important construct for understanding the state of the art.

5 FUTURE WORK AND CONCLUSION
Our evaluation framework immediately suggests a gap in our understanding of push notification systems: What is the pain tolerance of users in operational systems, and how does that tolerance vary with the volume of updates? Based on the latest evaluation results from the TREC 2016 Real-Time Summarization Track, the best systems can achieve a gain vs. pain ratio of roughly one for low-volume pushes [12], which corresponds to a precision of 0.5. In other words, roughly half of the content pushed is not relevant. Furthermore, beyond a certain volume, current systems are not able to maintain even a one-to-one gain vs. pain ratio. Is this “good enough”? It seems unlikely to us, but empirically, what is the threshold for a system to be usable in practice?

An important related question concerns update volume [16]: What is the right number of notifications to push within some time interval? One imagines that even for a fast-moving topic, users will get annoyed if they are constantly bombarded by notifications, even if all the updates are relevant. On the other hand, a user might also express annoyance at a system for failing to deliver an important

update. All of the questions above point to the need for user studies to enrich our understanding of the push notification problem. Our framework can help to contextualize these studies.

Extensions to the user and utility models mentioned in the previous section represent another promising area of future work, as does modeling hybrid push/pull interactions. We believe that our framework can provide a unified view to evaluate systems that address prospective information needs. Specifically, there is great potential in mixed-initiative interfaces that are both reactive and proactive as appropriate.

REFERENCES
[1] James Allan. 2002. Topic Detection and Tracking: Event-Based Information Organization. Kluwer Academic Publishers, Dordrecht, The Netherlands.
[2] Javed Aslam, Matthew Ekstrand-Abueg, Virgil Pavlu, Richard McCreadie, Fernando Diaz, and Tetsuya Sakai. 2014. TREC 2014 Temporal Summarization Track Overview. In TREC.
[3] Javed Aslam, Matthew Ekstrand-Abueg, Virgil Pavlu, Fernando Diaz, and Tetsuya Sakai. 2013. TREC 2013 Temporal Summarization. In TREC.
[4] Leif Azzopardi. 2016. Simulation of Interaction: A Tutorial on Modelling and Simulating User Interaction and Search Behaviour. In SIGIR. 1227–1230.
[5] Leif Azzopardi, Kalervo Järvelin, Jaap Kamps, and Mark D. Smucker. 2011. Report on the SIGIR 2010 Workshop on the Simulation of Interaction. SIGIR Forum 44, 2 (2011), 35–47.
[6] Gaurav Baruah, Mark D. Smucker, and Charles L. A. Clarke. 2015. Evaluating Streams of Evolving News Events. In SIGIR. 675–684.
[7] Charles L. A. Clarke and Mark D. Smucker. 2014. Time Well Spent. In IIiX ’14. 205–214.
[8] Qi Guo, Fernando Diaz, and Elad Yom-Tov. 2013. Updating Users about Time Critical Events. In ECIR. 483–494.
[9] Heikki Keskustalo, Kalervo Järvelin, Ari Pirkola, and Jaana Kekäläinen. 2008. Intuition-Supporting Visualization of User’s Performance Based on Explicit Negative Higher-Order Relevance. In SIGIR. 675–682.
[10] David D. Lewis. 1995. The TREC-4 Filtering Track. In TREC. 165–180.
[11] Jimmy Lin, Miles Efron, Yulu Wang, and Garrick Sherman. 2015. Overview of the TREC-2015 Microblog Track. In TREC.
[12] Jimmy Lin, Adam Roegiest, Luchen Tan, Richard McCreadie, Ellen Voorhees, and Fernando Diaz. 2016. Overview of the TREC 2016 Real-Time Summarization Track. In TREC.
[13] David Maxwell, Leif Azzopardi, Kalervo Järvelin, and Heikki Keskustalo. 2015. Searching and Stopping: An Analysis of Stopping Rules and Strategies. In CIKM. 313–322.
[14] Alistair Moffat and Justin Zobel. 2008. Rank-Biased Precision for Measurement of Retrieval Effectiveness. ACM TOIS 27, 1, Article 2 (Dec. 2008), 27 pages.
[15] Stephen Robertson and Ian Soboroff. 2002. The TREC 2002 Filtering Track Report. In TREC.
[16] Adam Roegiest, Luchen Tan, and Jimmy Lin. 2017. Online In-Situ Interleaved Evaluation of Real-Time Push Notification Systems. In SIGIR.
[17] Mark D. Smucker and Charles L. A. Clarke. 2012. Time-based Calibration of Effectiveness Measures. In SIGIR. 95–104.

Acknowledgments. This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, with additional contributions from the U.S. National Science Foundation under CNS-1405688.
