arXiv:1204.3251v2 [cs.LG] 28 Jun 2012



Plug-in martingales for testing exchangeability on-line

Valentina Fedorova · Alex Gammerman (alex@cs.rhul.ac.uk) · Ilia Nouretdinov (ilia@cs.rhul.ac.uk) · Vladimir Vovk (v.vovk@rhul.ac.uk)

Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK

Abstract

A standard assumption in machine learning is the exchangeability of data, which is equivalent to assuming that the examples are generated from the same probability distribution independently. This paper is devoted to testing the assumption of exchangeability on-line: the examples arrive one by one, and after receiving each example we would like to have a valid measure of the degree to which the assumption of exchangeability has been falsified. Such measures are provided by exchangeability martingales. We extend known techniques for constructing exchangeability martingales and show that our new method is competitive with the martingales introduced before. Finally we investigate the performance of our testing method on two benchmark datasets, USPS and Statlog Satellite data; for the former, the known techniques give satisfactory results, but for the latter our new more flexible method becomes necessary.

1. Introduction

Many machine learning algorithms have been developed to deal with real-life high-dimensional data. In order to state and prove properties of such algorithms it is standard to assume that the data satisfy the exchangeability assumption (although some algorithms make different assumptions or, in the case of prediction with expert advice, do not make any statistical assumptions at all). These properties can be violated if the assumption is not satisfied, which makes it important to test the data for satisfying it.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Note that the popular assumption that the data is i.i.d. (independent and identically distributed) has the same meaning for testing as the exchangeability assumption. A joint distribution of an infinite sequence of examples is exchangeable if it is invariant with respect to any permutation of examples. Hence if the data is i.i.d., its distribution is exchangeable. On the other hand, by de Finetti’s theorem (see, e.g., Schervish, 1995, p. 28) any exchangeable distribution on the data (a potentially infinite sequence of examples) is a mixture of distributions under which the data is i.i.d. Therefore, testing for exchangeability is equivalent to testing for being i.i.d.

Traditional statistical approaches to testing are inappropriate for high-dimensional data (see, e.g., Vapnik, 1998, pp. 6–7). To address this challenge a previous study (Vovk et al., 2003) suggested a way of on-line testing by employing the theory of conformal prediction and calculating exchangeability martingales. Basically, testing proceeds in two steps. The first step is implemented by a conformal predictor that outputs a sequence of p-values. The sequence is generated in the on-line mode: examples are presented one by one and for each new example a p-value is calculated from this and all the previous examples. For the second step the authors introduced exchangeability martingales that are functions of the p-values and track the deviation from the assumption. Once the martingale grows up to a large value (20 and 100 are convenient rules of thumb) the exchangeability assumption can be rejected for the data.

This paper proposes a new way of constructing martingales in the second step of testing. To construct an exchangeability martingale based on the sequence of p-values we need a betting function, which determines the contribution of a p-value to the value of the martingale. In contrast to the previous studies, which use a fixed betting function, the new martingale tunes its betting function to the sequence to detect any deviation from the assumption. We show that this martingale, which we call a plug-in martingale, is competitive with all the martingales covered by the previous studies; namely, asymptotically the former grows faster than the latter.

1.1. Related work

The first procedure for testing exchangeability on-line is described in Vovk et al. (2003). The core testing mechanism is an exchangeability martingale. Exchangeability martingales are constructed using a sequence of p-values. The algorithm for generating p-values assigns small p-values to unusual examples. This suggests the idea of designing martingales that would have a large value if too many small p-values were generated, and leads to the corresponding power martingales. Other martingales (simple mixture and sleepy jumper) implement more complicated strategies, but follow the same idea of scoring on small p-values.

Ho (2005) applies power martingales to the problem of change detection in time-varying data streams. The author shows that small p-values inflate the martingale values and suggests using the martingale difference as another test for the problem.

1.2. This paper

To the best of our knowledge, no study has aimed to find any other ways of translating p-values into a martingale value. In this paper we propose a new, more flexible method of constructing exchangeability martingales for a given sequence of p-values.

The rest of the paper is organised as follows. Section 2 gives the definition of exchangeability martingales. Section 3 presents the construction of plug-in exchangeability martingales, explains the rationale behind them, and compares them to the power martingales, which have been used previously. Section 4 shows experimental results of testing two real-life datasets for exchangeability; for one of these datasets power martingales work satisfactorily and for the other one the greater flexibility of plug-in martingales becomes essential. Section 5 summarises the paper.

2. Exchangeability martingales

This section outlines necessary definitions and results of the previous studies.

2.1. Exchangeability

Consider a sequence of random variables (Z1, Z2, . . .) that all take values in the same example space. The joint probability distribution P(Z1, . . . , ZN) of a finite number of these random variables is exchangeable if it is invariant under any permutation of the random variables. The joint distribution of an infinite sequence of random variables (Z1, Z2, . . .) is exchangeable if the marginal distribution P(Z1, . . . , ZN) is exchangeable for every N.

2.2. Martingales for testing

As in Vovk et al. (2003), the main tool for testing exchangeability on-line is a martingale. The value of the martingale reflects the strength of evidence against the exchangeability assumption. An exchangeability martingale is a sequence of non-negative random variables S0, S1, . . . that keep the conditional expectation:

Sn ≥ 0,  Sn = E(Sn+1 | S1, . . . , Sn),

where E refers to the expected value with respect to any exchangeable distribution on examples. We also assume S0 = 1. Note that we will obtain an equivalent definition if we replace “any exchangeable distribution on examples” by “any distribution under which the examples are i.i.d.” (remember the discussion of de Finetti’s theorem in Section 1).

To understand the idea behind martingale testing we can imagine a game where a player starts from the capital of 1, places bets on the outcomes of a sequence of events, and never risks bankruptcy. Then a martingale corresponds to a strategy of the player, and its value reflects the acquired capital. According to Ville’s inequality (see Ville, 1939, p. 100),

P{∃n : Sn ≥ C} ≤ 1/C,  for all C ≥ 1,

it is unlikely for any Sn to have a large value. For the problem of testing exchangeability, if the final value of a martingale is large then the exchangeability assumption for the data can be rejected with the corresponding probability.

2.3. On-line calculation of p-values

Let (z1, z2, . . .) denote a sequence of examples. Each example zi is a vector consisting of a set of attributes xi and a label yi: zi = (xi, yi). In this paper we use conformal predictors to generate a sequence of p-values that corresponds to the given examples. The general idea of conformal prediction is to test how well a new example fits the previously observed examples. For this purpose a “nonconformity measure” is defined. This is a function that estimates the strangeness of one example with respect to the others:

αi = A(zi, {z1, . . . , zn}),


Algorithm 1 Generating p-values on-line

Input: (z1, z2, . . .) data for testing
Output: (p1, p2, . . .) sequence of p-values
for i = 1, 2, . . . do
  observe a new example zi
  for j = 1 to i do
    αj = A(zj, {z1, . . . , zi})
  end for
  pi = (#{j : αj > αi} + θi · #{j : αj = αi}) / i
end for

where in general {. . .} stands for a multiset (the same element may be repeated more than once) rather than a set. Typically, each example is assigned a “nonconformity score” αi based on some prediction method. In this paper we deal with the classification problem, and the 1-Nearest Neighbor (1-NN) algorithm is used as the underlying method to compute the nonconformity scores. The algorithm is simple but works well enough in many cases (see, e.g., Hastie et al., 2001, pp. 422–427). A natural way to define the nonconformity score of an example is by comparing its distance to the examples with the same label to its distance to the examples with a different label:

αi = ( min_{j≠i: yj=yi} d(xi, xj) ) / ( min_{j≠i: yj≠yi} d(xi, xj) ),   (1)

where d(xi, xj) is the Euclidean distance. According to the chosen nonconformity measure, αi is high if the example is close to another example with a different label and far from any examples with the same label.

Using the calculated nonconformity scores of all ob-served examples, the p-value pn that corresponds toan example zn is calculated as

pn =#{i : αi > αn}+ θn#{i : αi = αn}

n,

where θn is a random number from [0, 1] and the symbol# means the cardinality of a set. Algorithm 1 sum-marises the process of on-line calculation of p-values(it is clear that it can also be applied to a finite dataset(z1, . . . , zn) producing a finite sequence (p1, . . . , pn) ofp-values).
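As an illustration, here is a minimal Python sketch of Algorithm 1 with the 1-NN nonconformity measure (1); the synthetic two-class dataset and the guard cases for empty minima are our own choices, not part of the paper:

```python
import numpy as np

def nn_alpha(i, xs, ys):
    """1-NN nonconformity score (1): distance to the nearest example with
    the same label divided by the distance to the nearest example with a
    different label.  Empty minima are handled by convention (our choice)."""
    d = np.linalg.norm(xs - xs[i], axis=1)
    same = [j for j in range(len(xs)) if j != i and ys[j] == ys[i]]
    diff = [j for j in range(len(xs)) if j != i and ys[j] != ys[i]]
    num = d[same].min() if same else np.inf
    den = d[diff].min() if diff else np.inf
    return 0.0 if den == np.inf else (np.inf if num == np.inf else num / den)

def online_p_values(xs, ys, rng):
    """Algorithm 1: after each new example, recompute all scores and rank
    the newest score among them, with random tie-breaking theta."""
    ps = []
    for i in range(1, len(xs) + 1):
        alphas = np.array([nn_alpha(j, xs[:i], ys[:i]) for j in range(i)])
        gt = np.sum(alphas > alphas[i - 1])
        eq = np.sum(alphas == alphas[i - 1])
        ps.append((gt + rng.uniform() * eq) / i)   # smoothed p-value
    return ps

rng = np.random.default_rng(1)
n = 100
ys = rng.integers(0, 2, size=n)                    # two classes
xs = rng.normal(size=(n, 2)) + 3.0 * ys[:, None]   # exchangeable (i.i.d.) data
ps = online_p_values(xs, ys, rng)
assert all(0.0 <= p <= 1.0 for p in ps)
# under exchangeability the p-values should look uniform on [0, 1]
assert 0.35 < float(np.mean(ps)) < 0.65
```

On exchangeable data, such as the i.i.d. sample here, the produced p-values should look uniformly distributed on [0, 1].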

The following is a standard result in the theory of conformal prediction (see, e.g., Vovk et al., 2003, Theorem 1).

Theorem 1. If examples (z1, z2, . . .) (resp. (z1, z2, . . . , zn)) satisfy the exchangeability assumption, Algorithm 1 produces p-values (p1, p2, . . .) (resp. (p1, p2, . . . , pn)) that are independent and uniformly distributed in [0, 1].

The property that the examples generated by an exchangeable distribution provide uniformly and independently distributed p-values allows us to test exchangeability by calculating martingales as functions of the p-values.

3. Martingales based on p-values

This section focuses on the second part of testing: given the sequence of p-values, a martingale is calculated as a function of the p-values.

For each i ∈ {1, 2, . . .}, let fi : [0, 1]^i → [0, ∞). Let (p1, p2, . . .) be the sequence of p-values generated by Algorithm 1. We consider martingales Sn of the form

Sn = ∏_{i=1}^{n} fi(pi),  n = 1, 2, . . . ,   (2)

where we denote fi(p) = fi(p1, . . . , pi−1, p) and call the function fi(p) a betting function.

To be sure that (2) is indeed a martingale we need the following constraint on the betting functions fi:

∫_0^1 fi(p) dp = 1,  i = 1, 2, . . .

Then we can check:

E(Sn+1 | S0, . . . , Sn) = ∫_0^1 ∏_{i=1}^{n} fi(pi) · fn+1(p) dp = ∏_{i=1}^{n} fi(pi) · ∫_0^1 fn+1(p) dp = ∏_{i=1}^{n} fi(pi) = Sn.

Using representation (2) we can update the martingale on-line: having calculated a p-value pi for a new example in Algorithm 1, the current martingale value becomes Si = Si−1 · fi(pi). To define the martingales completely we need to describe the betting functions fi.

3.1. Previous results: power and simple mixture martingales

Previous studies (Vovk et al., 2003) have proposed to use a fixed betting function

fi(p) = ε p^(ε−1) for all i,

where ε ∈ [0, 1]. Several martingales were constructed using this function. The power martingale for a given ε, denoted M^ε_n, is defined as

M^ε_n = ∏_{i=1}^{n} ε p_i^(ε−1).


Figure 1. The betting functions ε p^(ε−1), plotted over p ∈ [0, 1] for values of ε ranging from 0.11 to 0.89, that are used to construct the power and simple mixture martingales.

The simple mixture martingale, denoted Mn, is the mixture of the power martingales over different ε ∈ [0, 1]:

Mn = ∫_0^1 M^ε_n dε.

Such a martingale will grow only if there are many small p-values in the sequence. This follows from the shape of the betting functions: see Figure 1. If the generated p-values concentrate in any other part of the unit interval, we cannot expect the martingale to grow. So it might be difficult to reject the assumption of exchangeability for such sequences.
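For concreteness, here is a small log-space sketch of the power martingales and their simple mixture (the midpoint-rule ε-grid approximation of the integral is our own choice, not from the paper):

```python
import numpy as np

def simple_mixture(ps, n_grid=1000):
    """Simple mixture martingale: the integral over eps in (0, 1) of the
    power martingale M^eps_n = prod_i eps * p_i^(eps - 1), approximated
    here by the midpoint rule on an eps-grid, computed in log space."""
    ps = np.asarray(ps, dtype=float)
    eps = (np.arange(n_grid) + 0.5) / n_grid
    log_m = len(ps) * np.log(eps) + (eps - 1.0) * np.log(ps).sum()
    return float(np.mean(np.exp(log_m)))

# many small p-values: some power martingale grows, so the mixture explodes
assert simple_mixture([0.01] * 50) > 1e6
# p-values bounded away from 0 (all 0.5): no power martingale can grow
assert simple_mixture([0.5] * 50) < 1.0
```

The two assertions illustrate the point above: the mixture grows enormously on a run of small p-values but stays below its starting capital when the p-values sit in the middle of the unit interval.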

3.2. New plug-in approach

3.2.1. Plug-in martingale

Let us use an estimated probability density function as the betting function fi(p). At each step the probability density function is estimated using the accumulated p-values:

ρi(p) = ρ(p1, . . . , pi−1, p),   (3)

where ρ(p1, . . . , pi−1, p) is the estimate of the probability density function obtained from the p-values p1, . . . , pi−1 output by Algorithm 1.

Substituting these betting functions into (2) we get a new martingale that we call a plug-in martingale. The martingale avoids betting if the p-values are distributed uniformly, but if there is any peak it will be used for betting.

Estimating a probability density function. In our experiments we have used the statistical environment and language R. The density function in its stats package implements kernel density estimation with different parameters. But since p-values always lie in the unit interval, the standard methods of kernel density estimation lead to poor results for the points that are near the boundary. To get better results for the boundary points the sequence of p-values is reflected to the left from zero and to the right from one. Then the kernel density estimate is calculated using the extended sample ∪_{i=1}^{n} {−pi, pi, 2 − pi}. The estimated density function is set to zero outside the unit interval and then normalised to integrate to one. For the results presented in this paper the parameters used are the Gaussian kernel and Silverman’s “rule of thumb” for bandwidth selection. Other settings have been tried as well, but the results are comparable and lead to the same conclusions.
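The reflection-and-renormalise recipe above can be sketched in Python (this is our own illustration with a hand-rolled Gaussian kernel estimate and Silverman bandwidth, not the authors’ R code; grid size and thresholds are arbitrary choices):

```python
import numpy as np

def betting_function(p_hist, grid):
    """Density estimate on [0, 1] from past p-values: reflect the sample
    about 0 and 1, smooth with a Gaussian kernel using Silverman's
    rule-of-thumb bandwidth, then renormalise over the unit interval."""
    p_hist = np.asarray(p_hist, dtype=float)
    sample = np.concatenate([-p_hist, p_hist, 2.0 - p_hist])  # reflection
    n = len(sample)
    iqr = np.subtract(*np.percentile(sample, [75, 25]))
    bw = 0.9 * min(sample.std(), iqr / 1.34) * n ** (-0.2)    # Silverman
    z = (grid[:, None] - sample[None, :]) / bw
    dens = np.exp(-0.5 * z * z).sum(axis=1) / (n * bw * np.sqrt(2.0 * np.pi))
    dens /= np.trapz(dens, grid)      # zero outside [0, 1], integral one
    return dens

rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 201)
dens = betting_function(rng.uniform(size=300), grid)
assert abs(np.trapz(dens, grid) - 1.0) < 1e-6  # integrates to one on [0, 1]
assert dens.min() > 0.6 and dens.max() < 1.4   # roughly flat for uniform data
```

Without the reflection step, a standard kernel estimate would dip towards 1/2 of the true density near 0 and 1, exactly the boundary bias the paragraph above describes.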

The values Sn of the plug-in martingale can be updated recursively. Suppose computing the nonconformity scores (α1, . . . , αn) from (z1, . . . , zn) takes time g(n) and evaluating (3) takes time h(n). Then updating Sn−1 to Sn takes time O(g(n) + n + h(n)): indeed, it is easy to see that calculating the rank of αn in the multiset {α1, . . . , αn} takes time Θ(n).

The performance of the plug-in martingale on real-life datasets will be presented in Section 4. The rest of the current section proves that the plug-in martingale provides asymptotically a better growth rate than any martingale with a fixed betting function. To prove this asymptotic property of the plug-in martingale we need the following assumptions.

3.2.2. Assumptions

Consider an infinite sequence of p-values (p1, p2, . . .). (This is simply a deterministic sequence.) For its finite prefix (p1, . . . , pn) define the corresponding empirical probability measure Pn: for a Borel set A in R,

Pn(A) = #{i = 1, . . . , n : pi ∈ A} / n.

We say that the sequence (p1, p2, . . .) is stable if there exists a probability measure P on R such that:

1. Pn → P weakly as n → ∞;

2. there exists a positive continuous density function ρ(p) for P: for any Borel set A in R, P(A) = ∫_A ρ(p) dp.

Intuitively, stability means that asymptotically the sequence of p-values can be described well by a probability distribution.

Page 5: Plug-in martingales for testing exchangeability on-line · Alex Gammerman alex@cs.rhul.ac.uk Ilia Nouretdinov ilia@cs.rhul.ac.uk Vladimir Vovk v.vovk@rhul.ac.uk Computer Learning

Plug-in martingales

Consider a sequence (f1(p), f2(p), . . .) of betting functions. (This is simply a deterministic sequence of functions fi : [0, 1] → [0, ∞), although we are particularly interested in the functions fi(p) = ρi(p), as defined in (3).) We say that this sequence is consistent for (p1, p2, . . .) if

log(fn(p)) → log(ρ(p)) uniformly in p as n → ∞.

Intuitively, consistency is an assumption about the algorithm that we use to estimate the function ρ(p); in the limit we want a good approximation.

3.2.3. Growth rate of plug-in martingale

The following result says that, under our assumptions, the logarithmic growth rate of the plug-in martingale is better than that of any martingale with a fixed betting function (remember that by a betting function we mean any function mapping [0, 1] to [0, ∞)).

Theorem 2. If a sequence (p1, p2, . . .) ∈ [0, 1]^∞ is stable and a sequence of betting functions (f1(p), f2(p), . . .) is consistent for it then, for any positive continuous betting function f,

lim inf_{n→∞} ( (1/n) ∑_{i=1}^{n} log(fi(pi)) − (1/n) ∑_{i=1}^{n} log(f(pi)) ) ≥ 0.

First we explain the meaning of Theorem 2 and then prove it. According to representation (2), after n steps the martingale grows to

∏_{i=1}^{n} fi(pi).   (4)

Note that if fi(p) = 0 for some p-value p ∈ [0, 1] then the martingale can become zero and will never change after that. Therefore, it is reasonable to consider positive fi(p). Then we can rewrite product (4) as a sum of logarithms, which gives us the logarithmic growth of the martingale:

∑_{i=1}^{n} log(fi(pi)).

We assume that the sequence of p-values is stable and the sequence of estimated probability density functions used to construct the plug-in martingale is consistent. Then the limit inequality from Theorem 2 states that the logarithmic growth rate of the plug-in martingale is asymptotically at least as high as that of any martingale with a fixed betting function (such as those suggested in the previous studies).

To prove Theorem 2 we will use the following lemma.

Lemma 1. For any probability density functions ρ and f on [0, 1] (so that ∫_0^1 ρ(p) dp = 1 and ∫_0^1 f(p) dp = 1),

∫_0^1 log(ρ(p)) ρ(p) dp ≥ ∫_0^1 log(f(p)) ρ(p) dp.

Proof of Lemma 1. It is well known (Kullback, 1959, p. 14) that the Kullback–Leibler divergence is always non-negative:

∫_0^1 log(ρ(p)/f(p)) ρ(p) dp ≥ 0.

This is equivalent to the inequality asserted by Lemma 1.
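Lemma 1 is exactly the non-negativity of the Kullback–Leibler divergence KL(ρ ‖ f). A quick numerical sanity check with a toy pair of densities on [0, 1] (ρ(p) = 2p and uniform f are our own choices for illustration):

```python
import numpy as np

# rho(p) = 2p and the uniform density f(p) = 1 both integrate to one on
# [0, 1]; Lemma 1 says the rho-average of log(rho) dominates the
# rho-average of log(f), i.e. KL(rho || f) >= 0.
p = np.linspace(1e-6, 1.0, 100001)
rho = 2.0 * p
f = np.ones_like(p)

lhs = np.trapz(np.log(rho) * rho, p)
rhs = np.trapz(np.log(f) * rho, p)
assert abs(np.trapz(rho, p) - 1.0) < 1e-3  # rho is (numerically) a density
assert lhs >= rhs                          # Lemma 1 / Gibbs' inequality
```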

Proof of Theorem 2. Suppose that, contrary to the statement of Theorem 2, there exists δ > 0 such that

lim inf_{n→∞} ( (1/n) ∑_{i=1}^{n} log(fi(pi)) − (1/n) ∑_{i=1}^{n} log(f(pi)) ) < −δ.   (5)

Then choose an ε satisfying 0 < ε < δ/4.

Substituting the definition of ρ(p) into Lemma 1 we obtain

∫_0^1 log(ρ(p)) dP ≥ ∫_0^1 log(f(p)) dP.   (6)

From the stability of (p1, p2, . . .) it follows that there exists a number N1 = N1(ε) such that, for all n > N1,

| ∫_0^1 log(f(p)) dPn − ∫_0^1 log(f(p)) dP | < ε

and

| ∫_0^1 log(ρ(p)) dPn − ∫_0^1 log(ρ(p)) dP | < ε.

Then inequality (6) implies that, for all n > N1,

∫_0^1 log(ρ(p)) dPn ≥ ∫_0^1 log(f(p)) dPn − 2ε.

By the definition of the probability measure Pn, the last inequality is the same thing as

(1/n) ∑_{i=1}^{n} log(ρ(pi)) ≥ (1/n) ∑_{i=1}^{n} log(f(pi)) − 2ε.   (7)

By the consistency of (f1(p), f2(p), . . .) there exists a number N2 = N2(ε) such that, for all i > N2 and all p ∈ [0, 1],

| log(fi(p)) − log(ρ(p)) | < ε.   (8)


Let us define the number

M = max_{i,p} | log(fi(p)) − log(ρ(p)) |.   (9)

From (8) and (9) we have

| log(fi(p)) − log(ρ(p)) | ≤ M for i ≤ N2, and ≤ ε for i > N2.   (10)

Denote N3 = max(N1, N2). Then, using (10) and (7), we obtain, for all n > N3,

(1/n) ∑_{i=1}^{n} log(fi(pi)) ≥ (1/n) ∑_{i=1}^{n} log(f(pi)) − 3ε − M N3 / n.

Denoting N4 = max(N3, M N3 / ε), we can rewrite the last inequality as

(1/n) ∑_{i=1}^{n} log(fi(pi)) ≥ (1/n) ∑_{i=1}^{n} log(f(pi)) − 4ε,

for all n > N4. Finally, recalling that ε < δ/4, we have, for all n > N4,

(1/n) ∑_{i=1}^{n} log(fi(pi)) − (1/n) ∑_{i=1}^{n} log(f(pi)) ≥ −δ.

This contradicts (5) and therefore completes the proof of Theorem 2.

4. Empirical results

In this section we investigate the performance of our plug-in martingale and compare it with that of the simple mixture martingale. Two real-life datasets have been tested for exchangeability: the USPS dataset and the Statlog Satellite dataset.

4.1. USPS dataset

Data The US Postal Service (USPS) dataset consists of 7291 training examples and 2007 test examples of handwritten digits, from 0 to 9. The data were collected from real-life zip codes. Each example is described by 256 attributes, representing the pixels of a 16 × 16 gray-scale image of a digit, and its label. It is well known that the examples in this dataset are not perfectly exchangeable (Vovk et al., 2003), and any reasonable test should reject exchangeability there. In our experiments we merge the training and test sets and perform testing on the full dataset of 9298 examples.

Figure 2 shows the typical performance of the martingales when the exchangeability assumption is satisfied for sure: all examples have been randomly shuffled before the testing.

Figure 2. The growth of the martingales (log10 M against the index of example; curves: simple mixture and plug-in) for the USPS dataset randomly shuffled before on-line testing. The exchangeability assumption is satisfied: the final martingale values are about 0.01.

Figure 4 shows the performance of the martingales when the examples arrive in the original order: first the 7291 examples of the training set and then the 2007 examples of the test set. The p-values are generated on-line by Algorithm 1 and the two martingales are calculated from the same sequence of p-values. The final value for the simple mixture martingale is 2.0 × 10^10, and the final value for the plug-in martingale is 3.9 × 10^8.

Figure 6 shows the betting functions that correspond to the plug-in martingale and the “best” power martingale. For the plug-in martingale, the function is the estimated probability density function calculated using the whole sequence of p-values. The betting function for the family of power martingales corresponds to the parameter ε* that provides the largest final value among all power martingales. It gives a clue why we could not see advantages of the new approach for this dataset: both martingales grew up to approximately the same level. There is not much difference between the best betting functions for the old and new methods, and the new method suffers because of its greater flexibility.

4.2. Statlog Satellite dataset

Data The Statlog Satellite dataset (Frank & Asuncion, 2010) consists of 6435 satellite images (divided into 4435 training examples and 2000 test examples). The examples are 3 × 3 pixel sub-areas of a satellite picture, where each pixel is described by four spectral values in different spectral bands. Each example is represented by 36 attributes and a label indicating the classification of the central pixel. Labels are numbers from 1 to 7, excluding 6. The testing results are described below.

Figure 3. The growth of the martingales (log10 M against the index of example; curves: simple mixture and plug-in) for the Statlog Satellite dataset randomly shuffled before on-line testing. The exchangeability assumption is satisfied: the final martingale values are about 0.01.

Figure 3 shows the performance of the martingales for randomly shuffled examples of the dataset. As expected, the martingales do not reject the exchangeability assumption there.

Figure 5 presents the performance of the martingales when the examples arrive in the original order. The final value for the simple mixture martingale is 5.6 × 10^2 and the final value for the plug-in martingale is 1.8 × 10^17. Again, the corresponding betting functions for the plug-in martingale and the “best” power martingale are presented in Figure 7. For this dataset the generated p-values have a tricky distribution. The family of power betting functions ε p^(ε−1) cannot provide a good approximation. The power martingales lose on p-values close to the second peak of the p-value distribution. But the plug-in martingale is more flexible and ends up with a much higher final value.

It can be argued that both methods, old and new, work for the Statlog Satellite dataset in the sense of rejecting the exchangeability assumption at any of the commonly used thresholds (such as 20 or 100). However, the situation would have been different had the dataset consisted of only the first 1000 examples: the final value of the simple mixture martingale would have been 0.013, whereas the final value of the plug-in martingale would have been 3.74 × 10^15.

Figure 4. The growth of the martingales (log10 M against the index of example; curves: simple mixture and plug-in) for the USPS dataset. For the examples in the original order the exchangeability assumption is rejected: the final martingale values are greater than 3.8 × 10^8.

5. Discussion and conclusions

In this paper we have introduced a new way of constructing martingales for testing exchangeability on-line. We have shown that, for stable sequences of p-values, the new, more adaptive martingale asymptotically grows at least as fast as any martingale with a fixed betting function. Experiments on testing two real-life datasets have been presented. Using the same sequence of p-values, the plug-in martingale extracts approximately the same amount of information about the data-generating distribution as the previously introduced power martingales, or more.

Remark. The previous studies were based on the natural idea that lack of exchangeability leads to new examples looking strange as compared to the old ones and therefore to small p-values (for example, if the data-generating mechanism changes its regime and starts producing a different kind of examples). There is, however, a situation where lack of exchangeability makes the p-values cluster around 1: we observe examples that are ideal shapes of several kinds distorted by random noise, and the amount of noise decreases with time. Predicting the kind of a new example using the nonconformity measure (1) will then tend to produce large p-values.

Our goal has been to find an exchangeability martingale that does not need any assumptions about the p-values generated by the method of conformal prediction. Our proposed martingale adapts to the unknown distribution of the p-values by estimating a good betting function from the past data. This is an example of the plug-in approach. It is generally believed that the Bayesian approach is more efficient than the plug-in approach (see, e.g., Bernardo & Smith, 2000, p. 483). In our present context, the Bayesian approach would involve choosing a prior distribution on the betting functions and integrating the exchangeability martingales corresponding to these betting functions over the prior distribution. It is not clear yet whether this can be done efficiently and, if yes, whether this can improve the performance of exchangeability martingales.

Figure 5. The growth of the martingales (log10 M against the index of example; curves: simple mixture and plug-in) for the Statlog Satellite dataset. For the examples in the original order the exchangeability assumption is rejected: the final value of the simple mixture martingale is 5.6 × 10^2, and the final value of the plug-in martingale is 1.8 × 10^17.

Acknowledgments

We are indebted to Royal Holloway, University of London, for continued support and funding. This work has also been supported by the EraSysBio+ grant SHIPREC from the European Union, BBSRC and BMBF, and by the VLA grant on machine learning algorithms.

We thank all reviewers for their valuable suggestions for improving the paper.

References

Bernardo, Jose M. and Smith, Adrian F. M. Bayesian Theory. Wiley, Chichester, 2000.

Frank, A. and Asuncion, A. UCI Machine Learning Repository, 2010. URL http://archive.ics.uci.edu/ml.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2001.

Figure 6. The betting functions (the estimated pdf and a betting function from the power family) for testing the USPS dataset for examples in the original order.

Figure 7. The betting functions (the estimated pdf and a betting function from the power family) for testing the Statlog Satellite dataset for examples in the original order.

Ho, S.-S. A martingale framework for concept change detection in time-varying data streams. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pp. 321–327, 2005.

Kullback, S. Information Theory and Statistics. Wiley, New York, 1959.

Schervish, M. J. Theory of Statistics. Springer, New York, 1995.

Vapnik, V. N. Statistical Learning Theory. Wiley, New York, 1998.

Ville, J. Étude critique de la notion de collectif. Gauthier-Villars, Paris, 1939.

Vovk, V., Nouretdinov, I., and Gammerman, A. Testing exchangeability on-line. In Proceedings of the 20th International Conference on Machine Learning (ICML 2003), pp. 768–775, 2003.