
Post on 15-Jan-2016


Page 1: Arnoldo Frigessi frigessi@medisin.uio.no Posterior uncertainty for rank data aggregation and a priori plans for BigInsight BigInsight

Arnoldo Frigessi, frigessi@medisin.uio.no

Posterior uncertainty for rank data aggregation

and a priori plans for BigInsight

BigInsight


3 5 1 2 4

RANKS


Ranked data is everywhere.

Rankings arise when …

• users express preferences about products and services,
• voters cast ballots in elections,
• research projects are evaluated based on their merits,
• genes are ordered based on their expression levels under various experimental conditions.

A ranking represents a statement about the relative quality or relevance of the items being ranked.

Assessors rank items. The data can be designed or observed; the assessors can be a panel, volunteers, users, …

1 3 4 2


Tasks

Aggregate, merge, summarise multiple rankings to discover shared patterns and structure.

Assessors:

1 3 4 2
2 1 3 4
1 2 4 3
4 3 1 2
3 1 4 2
3 1 4 2
1 3 4 2

Consensus ranking:

? ? ? ?


Tasks

Predict individual ratings, when only partial ratings are made. (not all items rated)

1 3 4 2

2 1 3 4

1 2 4 3

4 3 1 2

? 1 ? 2

? 1 ? 2

1 ? ? 2

Some rankings are missing in some samples


Tasks

Predict individual ratings, when only partial ratings are made. (not all items rated)

1 3 4 2

2 1 3 4

1 2 4 3

4 3 1 2

1 2

1 2

1 2

UNCERTAINTY


Tasks

Partition assessors in classes and predict class membership of new assessors.

1 3 4 2

2 1 3 4

1 2 4 3

4 3 1 2

3 1 4 2

3 1 4 2

1 3 4 2

1 3 4 2

4 3 1 2

3 1 4 2

3 1 4 2

2 1 3 4

1 2 4 3

1 3 4 2

1 2

Population subtypes

Classification of new samples


MovieLens is a movie recommendation website. You tell us what movies you love and hate. We use that information to generate personalized recommendations for other movies.

Prob (member Nils likes movie A better than movie B?)

Prob (for member Nils, movie A will be among his top 5 preferences?)

MovieLens uses collaborative filtering to generate recommendations. It matches users with similar opinions about movies. Each user has a 'neighbourhood' of other like-minded users. Ratings from these neighbours are used to create personalized recommendations for the target user.

Hundreds of thousands of users. Started in 1997. University of Minnesota.


META ANALYSIS OF GENE EXPRESSIONS ACROSS LABS

Gene expression is a measure of the activity of a gene in a sample

~ 20000 genes measured in a few hundred patients (prostate cancer)

Repeated in various cohorts and labs with different technologies. Absolute measures are hard to compare. Ranks easier.

Each lab produces a ranked list of genes, hard to analyse together

Genes = items to be ranked

Labs = assessors

Merge the studies to produce a consensus list

Prob (P57 is among top 10?)


• Mallows model
• Bayesian inference
• MCMC algorithm
• Applications
• What we shall do next


NP-hard; more complicated for α


The Kendall distance measures the minimum number of pairwise adjacent switches which convert R into ρ.

For distance measures other than Kendall's, computing the normalizing constant of the Mallows model is NP-complete.
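The Kendall distance can be sketched in a few lines by counting the item pairs that the two rankings order differently, which equals the minimum number of adjacent switches. The brute-force normalizer below only illustrates why the constant is costly: it sums over all n! rankings. The function names and the scaling of α by 1/n are assumptions made for this sketch, not taken from the slides.

```python
from itertools import combinations, permutations
from math import exp


def kendall_distance(r, rho):
    """Count item pairs that rank vectors r and rho order differently."""
    assert len(r) == len(rho)
    return sum(
        1
        for i, j in combinations(range(len(r)), 2)
        if (r[i] - r[j]) * (rho[i] - rho[j]) < 0
    )


def brute_force_normalizer(n, alpha):
    """Sum exp(-alpha/n * d(R, rho)) over all n! rankings R (small n only).

    rho is fixed to (1, 2, ..., n); the alpha/n scaling is an assumption.
    """
    identity = tuple(range(1, n + 1))
    return sum(
        exp(-alpha / n * kendall_distance(perm, identity))
        for perm in permutations(identity)
    )


# The ranking 1 3 4 2 from the earlier slides against the identity ranking.
print(kendall_distance([1, 3, 4, 2], [1, 2, 3, 4]))  # 2
```

The enumeration is only feasible for a handful of items, which is the motivation for closed forms or approximations of the constant.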


Bayes!


Sampling from the posterior by Markov Chain Monte Carlo


• Choose an item u at random in {1,2,…,n}. Its current rank is ρ_u.

• Choose a new rank r for item u, uniformly in {ρ_u − L, …, ρ_u + L}. Now two items have rank r and one item (u) has no rank.

• Shift by one all the items with ranks between r and ρ_u.

for ρ
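The three steps of this proposal can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the clipping of the candidate ranks to 1..n, and the exclusion of the current rank are assumptions added so that the proposal is always a valid, changed ranking.

```python
import random


def leap_and_shift(rho, L, rng=random):
    """Propose a new rank vector close to rho (rho[i] = rank of item i, 1..n)."""
    n = len(rho)
    u = rng.randrange(n)          # step 1: pick an item at random
    old = rho[u]
    # Step 2: candidate ranks within L of the current rank.
    # Assumptions: clipped to 1..n and excluding the current rank.
    candidates = [
        r for r in range(max(1, old - L), min(n, old + L) + 1) if r != old
    ]
    new = rng.choice(candidates)
    proposal = list(rho)
    # Step 3: shift by one every item whose rank lies between old and new.
    for i in range(n):
        if i == u:
            continue
        if old < rho[i] <= new:
            proposal[i] = rho[i] - 1
        elif new <= rho[i] < old:
            proposal[i] = rho[i] + 1
    proposal[u] = new
    return proposal


rng = random.Random(0)
print(leap_and_shift([1, 2, 3, 4, 5], L=2, rng=rng))
```

A small L keeps the proposal near the current ranking; in a Metropolis-Hastings step the proposal would then be accepted or rejected, which is not shown here.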


ρ=(1,2,3,…, n)


= Eq{ }


(Pseudolikelihood)


Need a theorem….


n=26 students


Figure: posterior marginal distribution for the rank of each potato. Axis: true rank, from heaviest potato to lightest potato.

Represents uncertainty: the trace is the posterior expectation of the number of correctly ranked potatoes.

Central potatoes are the ones ranked with the highest uncertainty.
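The statement about the trace can be checked on a toy example. The sketch below builds the matrix of posterior marginals from a list of sampled rank vectors and sums the entries at the true ranks. The samples are hand-made and hypothetical, and the function names are assumptions for illustration.

```python
def rank_marginals(samples):
    """marginals[i][r - 1] = fraction of samples in which item i has rank r."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for rho in samples:
        for item, rank in enumerate(rho):
            counts[item][rank - 1] += 1
    return [[c / len(samples) for c in row] for row in counts]


def expected_correct(marginals, true_ranks):
    """Trace of the marginal matrix once items are ordered by true rank."""
    return sum(marginals[i][true_ranks[i] - 1] for i in range(len(true_ranks)))


# Hypothetical posterior samples for three items whose true ranks are 1, 2, 3.
samples = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [1, 2, 3]]
marginals = rank_marginals(samples)
print(expected_correct(marginals, [1, 2, 3]))  # 2.0
```

Each diagonal entry is the posterior probability that one item sits at its true rank, so their sum is the expected number of correctly ranked items.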


by looking by touching

Less uncertainty


More importance sampling samples

Longer MCMC

Convergence of MCMC with imprecise


Only a subset of the items have been ranked.

Ranks can be missing at random, or the assessors may only have ranked, say, the top-5 items.

This can be handled easily in the Bayesian framework by applying data augmentation techniques: the missing ranks are estimated consistently with the partial observations.
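One ingredient of data augmentation is completing a partial ranking in a way that agrees with what the assessor reported. The sketch below covers only that step and only the top-k style case: observed ranks are kept fixed and the unused ranks are assigned to the unranked items in random order. The function name and the dictionary input format are assumptions for this illustration.

```python
import random


def augment_partial_ranking(partial, n, rng=random):
    """Complete a partial ranking of items 0..n-1.

    partial maps item -> observed rank. Observed ranks are kept fixed
    (an assumption that fits the top-k case); the remaining ranks are
    given to the unranked items uniformly at random.
    """
    observed = set(partial.values())
    free_ranks = [r for r in range(1, n + 1) if r not in observed]
    rng.shuffle(free_ranks)
    missing_items = [i for i in range(n) if i not in partial]
    full = [0] * n
    for item, rank in partial.items():
        full[item] = rank
    for item, rank in zip(missing_items, free_ranks):
        full[item] = rank
    return full


# The partial ranking "? 1 ? 2" from the earlier slide: items 1 and 3 are ranked.
print(augment_partial_ranking({1: 1, 3: 2}, n=4, rng=random.Random(0)))
```

Inside an MCMC sampler such a completed ranking would be proposed and then accepted or rejected, so that the missing ranks are integrated over rather than fixed once.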


Cases vs. Controls


(89 genes in total)

N=5 assessors, n=89 items


1. Find the gene with highest posterior probability of having rank 1.

2. Among the remaining genes, find the gene with highest posterior probability of having rank 1 or 2.

3. Etc. cumulative probability

• The probability of being among the top-10 for each gene.

VERY UNCERTAIN! WEAK CONSENSUS! N=5 too small
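The numbered procedure and the top-10 probability can both be computed from posterior samples of the consensus ranking. The sketch below is a minimal version under assumptions: the samples are hand-made and hypothetical, the function names are invented for illustration, and ties are broken in favour of the lowest item index.

```python
def top_k_probability(samples, item, k):
    """Fraction of posterior samples in which `item` has rank <= k."""
    return sum(rho[item] <= k for rho in samples) / len(samples)


def cumulative_probability_consensus(samples):
    """Order items by the stepwise rule: at step k, pick the remaining item
    with the highest posterior probability of having rank <= k."""
    n = len(samples[0])
    remaining = set(range(n))
    consensus = []
    for k in range(1, n + 1):
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(
            sorted(remaining),
            key=lambda item: top_k_probability(samples, item, k),
        )
        consensus.append(best)
        remaining.remove(best)
    return consensus


# Hypothetical posterior samples; samples[s][i] is the rank of item i.
samples = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [1, 2, 3]]
print(cumulative_probability_consensus(samples))  # [0, 1, 2]
print(top_k_probability(samples, item=1, k=2))    # 0.75
```

When the selected probabilities stay far below 1, as on this slide, the consensus list exists but is weakly supported.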


• “stationary distribution”, level of consensus
• No precise interpretation.


Assessors are not one homogeneous group, but C groups

We use a mixture of Mallows models to cluster a sample of N assessors according to how they rank the n items.

We estimate a latent ranking of the items for each cluster of assessors.

Latent variables assign each assessor to one of the C clusters.

Prior: Dirichlet distribution on the probabilities that an assessor is in each class
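For one assessor, the conditional probability of belonging to each cluster can be sketched as below, given the cluster consensus rankings and the cluster weights. This is an illustration under assumptions: a Kendall distance, a single α shared by all clusters (so the normalizing constant cancels), an α/n scaling, and hypothetical function names.

```python
from itertools import combinations
from math import exp


def kendall_distance(r, rho):
    """Count item pairs that rank vectors r and rho order differently."""
    return sum(
        1
        for i, j in combinations(range(len(r)), 2)
        if (r[i] - r[j]) * (rho[i] - rho[j]) < 0
    )


def cluster_probabilities(r, centres, weights, alpha):
    """P(assessor in cluster c) for c = 0..C-1, given the ranking r.

    Assumes one alpha common to all clusters, so the Mallows normalizing
    constant is the same in every term and drops out.
    """
    n = len(r)
    scores = [
        w * exp(-alpha / n * kendall_distance(r, rho_c))
        for w, rho_c in zip(weights, centres)
    ]
    total = sum(scores)
    return [s / total for s in scores]


# Two hypothetical clusters with opposite consensus rankings.
centres = [[1, 2, 3, 4], [4, 3, 2, 1]]
print(cluster_probabilities([1, 2, 4, 3], centres, [0.5, 0.5], alpha=2.0))
```

In a Gibbs step the cluster label of the assessor would be drawn from these probabilities.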


Label switching is not explicitly handled inside the MCMC to ensure full convergence of the chain (Jasra et al., 2005; Celeux et al., 2000). MCMC iterations are reordered after convergence is achieved, using the re-ordering approaches in Papastamoulis (2015).


N = 5000 people (assessors) were interviewed, each giving his/her complete ranking of n = 10 sushi variants (items):

ebi (shrimp), anago (sea eel), maguro (tuna), ika (squid), uni (sea urchin), sake (salmon roe), tamago (egg), toro (fatty tuna), tekka-maki (tuna roll), kappa-maki (cucumber roll).


within-cluster distance of each rank to the cluster centroid

Elbow rule


DIC


Figure: posterior probabilities for being assigned to each cluster (axes: assessors, clusters)

• most assessors have posterior probabilities concentrated on some preferred value of c, indicating a reasonably stable behaviour in the cluster assignments.


MCMC: we need to propose augmented ranks which obey the partial ordering constraints given by the assessor.

Assume coherent pair comparisons
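Coherence and the ordering constraints can be made concrete with a small sketch. It computes the transitive closure of an assessor's pair comparisons, flags a set of comparisons as incoherent if the closure contains a cycle, and checks whether a full ranking obeys the constraints. The function names and the pair format (preferred item first) are assumptions for illustration.

```python
def transitive_closure(pairs):
    """pairs: iterable of (a, b), meaning a is preferred to b."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def is_coherent(pairs):
    """Coherent means no cycle: no item ends up preferred to itself."""
    return all(a != b for a, b in transitive_closure(pairs))


def obeys_constraints(ranks, pairs):
    """ranks: dict item -> rank (1 = most preferred)."""
    return all(ranks[a] < ranks[b] for a, b in pairs)


pairs = [("A", "B"), ("B", "C")]
print(sorted(transitive_closure(pairs)))                       # adds ("A", "C")
print(obeys_constraints({"A": 1, "B": 2, "C": 3}, pairs))      # True
```

An augmented ranking proposed in the MCMC would be restricted to rankings for which this check holds.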


• perfect stochastic orderings between most of the teams
• … but not all


P (Team A < Team B | all data )


N=5891 assessors (users), n=200 movies Mean number of movies rated per user = 30.2

Ratings transformed to pair comparisons (as in Lu & Boutilier 2011)
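One simple way to turn a user's ratings into pair comparisons is sketched below: for every two movies the user rated differently, the higher-rated one is taken as preferred. This is an assumed reading for illustration, not necessarily the exact transformation of the cited paper; in particular, treating equal ratings as giving no comparison is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def ratings_to_pairs(ratings):
    """ratings: dict movie -> numeric rating for one user.

    Returns a set of (preferred, other) pairs. Movies with equal ratings
    yield no comparison (an assumption of this sketch).
    """
    pairs = set()
    for (a, rating_a), (b, rating_b) in combinations(ratings.items(), 2):
        if rating_a > rating_b:
            pairs.add((a, b))
        elif rating_b > rating_a:
            pairs.add((b, a))
    return pairs


print(sorted(ratings_to_pairs({"A": 5, "B": 3, "C": 3})))
# [('A', 'B'), ('A', 'C')]
```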

14 classes of users (age and gender) – for simplicity fixed, in real application would be estimated

Normalising constant approximated as in Mukherjee 2013, as importance sampling inefficient with n=200

is the posterior predicted probability of the full ranking for assessor j, consistent with given preferences and relative to the class j belongs to.

Personalised recommendation.


To test method, we discarded one rated movie per user

Use all other data to estimate

Read off posterior probability of the given (but hidden) preference

Median such probability over all assessors = 0.812

If we decide to predict the preference between two movies by taking the preference with posterior predictive probability >0.5, then we make an error in 12.7% of cases.
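The decision rule in this test can be sketched from posterior samples of one assessor's full (augmented) ranking. The samples below are hand-made and hypothetical, and the function names are assumptions for illustration.

```python
def preference_probability(samples, a, b):
    """Fraction of posterior samples that rank item a ahead of item b."""
    return sum(rho[a] < rho[b] for rho in samples) / len(samples)


def predict_preference(samples, a, b):
    """Return (preferred, other) using the > 0.5 rule on the probability."""
    if preference_probability(samples, a, b) > 0.5:
        return (a, b)
    return (b, a)


# Hypothetical samples of one user's ranking; samples[s][i] is the rank of item i.
samples = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [1, 2, 3]]
print(preference_probability(samples, 0, 1))  # 0.75
print(predict_preference(samples, 0, 1))      # (0, 1)
```

Comparing such predictions with the hidden preference over all users gives an error rate of the kind reported on this slide.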


BIG INSIGHT

2015-2023


Sometimes, it is not enough to crunch data!


VentureBeat

Mike Loukides & VB


MODEL-BASED STATISTICS

Model-based methods exploit knowledge and structure in the new data, to understand, discover, predict, control.


BIG INSIGHT


Personalised solutions

Forecasting the transient


Personalised solutions

Telenor: mobile phones
Gjensidige: policy-holders
DNB: customers
Folkehelsa: infected and susceptible individuals
OUS: patients
NAV: people on sick leave
Skatteetaten: tax-payers
ABB & DNV GL: sensors on a ship
DNV GL & OUS: sensors in healthcare.

Aims:

• personalised marketing,
• personalised products,
• personalised prices,
• personalised risk assessments,
• personalised fraud assessment,
• personalised screening,
• personalised therapy,
• individualised sensor monitoring,
• individualised maintenance schemes.


Forecasting the transient

High-frequency data allow us to measure processes in time while they are not in a stable situation, not in equilibrium.

OUS: patient receiving treatment
ABB & DNV GL: sensor on a ship at sea
Telenor, Gjensidige, DNB: customer
NAV: worker who lost the job
HYDRO: electricity prices

Aims:

• Predict the dynamics
• Optimal intervention
• Causal understanding of the factors which affect the process.


6 Innovation Objectives


1. Changepoint detection
2. Changepoint prediction

• Multivariate time series with dependence
1. some structure of the dependence is known
2. some is not, and must be estimated


Structure = network
• flow
• feed-back
• vicinity in many ways (spatial, function, type, …)
• conditional independence


1. Changepoint detection
2. Changepoint prediction

The changepoint process must be understood and modelled.

A. From a (longish) history of changepoints, estimate the rhythm of their arrivals, and use it for prediction.

B. Understand/assume/estimate the causes of changepoints, and use these as early warnings.


• variability or extremes?
• slow or fast changes?
• surprise prediction! (absence of alternative hypothesis)
• few sensors fail or many? (sparsity)
• real time prediction: hours, days, minutes? (what intervention?)
• adaptive resolution of data!
• master (real or in-silico) sensors?

 


Annals of Statistics 2013


N sensors, each giving observations y_{n,t}, n = 1, 2, …, N, t = 1, 2, …

At a certain time K, there are changes in the distributions of observations from a subset M of the sensors. This changepoint K, the subset M and its size #M are unknown.

Goal: to detect K as soon as possible after it occurs (minimizing Expected Detection Delay EDD) while keeping the frequency of false alarms as low as possible.

N is large, #M is relatively small
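To make the setting concrete, here is a deliberately simple detector: a one-sided CUSUM statistic per sensor, summed over sensors, with an alarm when the sum crosses a threshold. It is a generic textbook-style illustration, not the procedure of the papers cited on these slides, and it does not exploit sparsity. The Gaussian mean-shift assumption, the parameters `delta` and `threshold`, and the function name are all assumptions for this sketch.

```python
def detect_change(observations, delta, threshold):
    """Return the first time index at which the summed CUSUM statistic
    exceeds `threshold`, or None if no alarm is raised.

    observations: list over time t of lists, one value per sensor.
    delta: assumed size of the post-change mean shift (hypothetical).
    """
    n_sensors = len(observations[0])
    cusum = [0.0] * n_sensors
    for t, y_t in enumerate(observations):
        for n, y in enumerate(y_t):
            # One-sided CUSUM recursion for a shift from mean 0 to mean delta
            # with unit variance; it resets to 0 when evidence is negative.
            cusum[n] = max(0.0, cusum[n] + delta * (y - delta / 2))
        if sum(cusum) > threshold:
            return t
    return None


# Noise-free toy data: 5 sensors, change at t = 10 in sensors 0 and 1 only.
data = [[0.0] * 5 for _ in range(10)]
data += [[2.0, 2.0, 0.0, 0.0, 0.0] for _ in range(10)]
print(detect_change(data, delta=1.0, threshold=10.0))  # 13
```

Raising the threshold lowers the false alarm frequency but lengthens the detection delay, which is the trade-off stated in the goal above.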


JRSS-B 2015


www.BigInsight.no