On Medians of (Randomized) Pairwise Means

Pierre Laforgue 1  Stephan Clémençon 1  Patrice Bertail 2

Abstract

Tournament procedures, recently introduced in Lugosi & Mendelson (2016), offer an appealing alternative, from a theoretical perspective at least, to the principle of Empirical Risk Minimization in machine learning. Statistical learning by Median-of-Means (MoM) basically consists in segmenting the training data into blocks of equal size and comparing the statistical performance of every pair of candidate decision rules on each data block: the rule with the highest performance on the majority of the blocks is declared the winner. In the context of nonparametric regression, functions having won all their duels have been shown to outperform empirical risk minimizers w.r.t. the mean squared error under minimal assumptions, while exhibiting robustness properties. It is the purpose of this paper to extend this approach to other learning problems, in particular those for which the performance criterion takes the form of an expectation over pairs of observations rather than over a single observation, as is the case in pairwise ranking, clustering or metric learning. Precisely, it is proved here that the bounds achieved by MoM are essentially preserved when the blocks are built by means of independent sampling without replacement schemes instead of a simple segmentation. These results are next extended to situations where the risk is related to a pairwise loss function and its empirical counterpart is of the form of a U-statistic. Beyond theoretical results guaranteeing the performance of the learning/estimation methods proposed, numerical experiments provide empirical evidence of their relevance in practice.

1 LTCI, Télécom Paris, Institut Polytechnique de Paris  2 Modal'X, UPL, Université Paris-Nanterre. Correspondence to: Pierre Laforgue <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

In Lugosi & Mendelson (2016), the concept of tournament procedure for statistical learning has been introduced and analyzed in the context of nonparametric regression, one of the flagship problems of machine learning. The task is to predict a real-valued random variable (r.v.) Y based on the observation of a random vector X with marginal distribution µ(dx), taking its values in R^d with d ≥ 1, say by means of a regression function f : R^d → R with minimum expected quadratic risk R(f) = E[(Y − f(X))²]. Statistical learning usually relies on a training dataset S_n = {(X_1, Y_1), ..., (X_n, Y_n)} formed of independent copies of the generic pair (X, Y). Following the Empirical Risk Minimization (ERM) paradigm, one is encouraged to build predictive rules by minimizing an empirical version of the risk

\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2

over a class F of regression function candidates of controlled complexity (e.g. of finite VC dimension), while being rich enough to contain a reasonable approximant of the optimal regression function f*(x) = E[Y | X = x]: for any f ∈ F, the excess risk R(f) − R(f*) is then equal to ‖f − f*‖²_{L2(µ)} = E[(f(X) − f*(X))²]. A completely different learning strategy, recently proposed in Lugosi & Mendelson (2016), consists in implementing a tournament procedure based on the Median-of-Means (MoM) method (see Nemirovsky & Yudin (1983)).

Precisely, the full dataset is first divided into 3 subsamples of equal size. For every pair of candidate functions (f_1, f_2) ∈ F², the first step consists in computing the MoM estimator of the quantity ‖f_1 − f_2‖_{L1(µ)} := E[|f_1(X) − f_2(X)|] based on the first subsample: the latter being segmented into K ≥ 1 subsets of (approximately) equal size, ‖f_1 − f_2‖_{L1(µ)} is estimated by the median of the collection of estimators formed by its empirical versions computed from each of the K sub-datasets. When the MoM estimate is large enough, the match between f_1 and f_2 is allowed. The rationale behind this approach is as follows: if one of the candidates, say f_2, is equal to f*, and the quantity ‖f_1 − f*‖_{L1(µ)} (which is less than ‖f_1 − f*‖_{L2(µ)} = \sqrt{R(f_1) - R(f^*)}) is large, then so is its (robust) MoM estimate (much less sensitive to atypical values than sample averages) with high probability. Therefore, f* is compared to distant candidates only, against which it should hopefully win its matches. The second step consists in computing the MoM estimator of R(f_1) − R(f_2), based on the second subsample, for every pair of sufficiently distant candidates f_1 and f_2. If a candidate wins all its matches, it is kept for the third round. As said before, f* should be part of this final pool, denoted by H. Finally, matches involving all pairs of candidates in H are played, using a third MoM estimate on the third part of the data. A champion again winning all its matches is either f* or has a small enough excess risk anyway.

It is the purpose of the present article to extend the MoM-based statistical learning methodology. Firstly, we investigate the impact of randomization in the MoM technique: by randomization, it is meant that data subsets are built through sampling schemes, say simple random sampling without replacement (SRSWoR in abbreviated form) for simplicity, rather than partitioning. Though introducing more variability into the procedure, we provide theoretical and empirical evidence that the attractive properties of the original MoM method are essentially preserved by this more flexible variant (in particular, the number of blocks involved in this alternative procedure is arbitrary). Secondly, we consider the application of the tournament approach to other statistical learning problems, namely those involving pairwise loss functions, like popular formulations of ranking, clustering or metric learning. In this setup, natural low-variance statistical versions of the risk take the form of U-statistics (of degree two), i.e. averages over all pairs of observations, see e.g. Clemencon et al. (2008). In this situation, we propose to estimate the risk by the median of U-statistics computed from blocks obtained through data partitioning or sampling. Results showing the accuracy of this strategy, referred to here as Medians of (Randomized) Pairwise Means, are established, and the application of this estimation technique to pairwise learning is next investigated from a theoretical perspective, with generalization bounds obtained. The relevance of this approach is also supported by convincing illustrative numerical experiments.

The rest of the paper is organized as follows. Section 2 briefly recalls the main ideas underlying the MoM procedure, its applications to robust machine learning, as well as basic concepts pertaining to the theory of U-statistics/processes. In Section 3, the variants of the MoM approach we propose are described at length and theoretical results establishing their statistical performance are stated. Illustrative numerical experiments are displayed in Section 4, while proofs are deferred to the Appendix section. Some technical details and additional experimental results are postponed to the Supplementary Material (SM).

2. Background - Preliminaries

As a first go, we briefly describe the main ideas underlying the tournament procedure for robust machine learning, and next recall basic notions of the theory of U-statistics, as well as crucial results related to their efficient approximation. Here and throughout, the indicator function of any event E is denoted by I{E}, the variance of any square integrable r.v. Z by Var(Z), and the cardinality of any finite set A by #A. If (a_1, ..., a_n) ∈ R^n, the median (sometimes abbreviated med) of a_1, ..., a_n is defined as a_σ((n+1)/2) when n is odd and a_σ(n/2) otherwise, σ denoting a permutation of {1, ..., n} such that a_σ(1) ≤ ... ≤ a_σ(n). The floor and ceiling functions are denoted by u ↦ ⌊u⌋ and u ↦ ⌈u⌉.

2.1. Medians of Means based Statistical Learning

First introduced independently by Nemirovsky & Yudin (1983), Jerrum et al. (1986), and Alon et al. (1999), the Median-of-Means (MoM) is a mean estimator dedicated to real random variables. It is now receiving a great deal of attention in the statistical learning literature, following in the footsteps of the results established in Audibert & Catoni (2011) and Catoni (2012), where mean estimators are studied through the angle of their deviation probabilities, rather than their traditional mean square errors, for robustness purposes. Indeed, Devroye et al. (2016) showed that the MoM provides an optimal δ-dependent subgaussian mean estimator, under the sole assumption that a second order moment exists. The MoM estimator has later been extended to random vectors, through different generalizations of the median (Minsker et al., 2015; Hsu & Sabato, 2016; Lugosi & Mendelson, 2017). In Bubeck et al. (2013), it is used to design robust bandit strategies, while Lerasle & Oliveira (2011) and Brownlees et al. (2015) advocate minimizing a MoM, respectively a Catoni, estimate of the risk, rather than performing ERM, to tackle different learning tasks. More recently, Lugosi & Mendelson (2016) introduced a tournament strategy based on the MoM approach.

The MoM estimator. Let S_n = {Z_1, ..., Z_n} be a sample composed of n ≥ 1 independent realizations of a square integrable real-valued r.v. Z, with expectation θ and finite variance Var(Z) = σ². Dividing S_n into K disjoint blocks, each with the same cardinality B = ⌊n/K⌋, let θ_k denote the empirical mean based on the data lying in block k, for k ≤ K. The MoM estimator θ_MoM of θ is then given by

\theta_{\mathrm{MoM}} = \mathrm{median}(\theta_1, \ldots, \theta_K).

It offers an appealing alternative to the sample mean θ_n = (1/n) Σ_{i=1}^n Z_i, being much more robust, i.e. less sensitive to the presence of atypical values in the sample. Exponential concentration inequalities for the MoM estimator can be established in heavy-tail situations, under the sole assumption that the Z_i's are square integrable. For any δ ∈ [e^{1−n/2}, 1[, choosing K = ⌈log(1/δ)⌉ and B = ⌊n/K⌋, we have (see e.g. Devroye et al. (2016), Lugosi & Mendelson (2016)):

\mathbb{P}\left\{ \left| \theta_{\mathrm{MoM}} - \theta \right| > 2\sqrt{2}e\,\sigma \sqrt{\frac{1+\log(1/\delta)}{n}} \right\} \le \delta. \qquad (1)
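For concreteness, the computation can be sketched in a few lines of Python. This is an illustration of ours, not the authors' code: the block number K = ⌈log(1/δ)⌉ matches the choice leading to (1), and the few leftover points after segmentation are simply dropped.

```python
import numpy as np

def median_of_means(z, delta=0.01):
    """Minimal Median-of-Means sketch: segment the sample into K contiguous
    blocks of size B = floor(n/K) and return the median of the block means."""
    z = np.asarray(z, dtype=float)
    n = z.size
    K = min(n, max(1, int(np.ceil(np.log(1.0 / delta)))))
    B = n // K
    block_means = z[:K * B].reshape(K, B).mean(axis=1)
    return float(np.median(block_means))
```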


The tournament procedure. Placing ourselves in the distribution-free regression framework recalled in the Introduction, it has been shown that, under appropriate complexity conditions and in a possibly non sub-Gaussian setup, the tournament procedure outputs a candidate f with an optimal accuracy/confidence tradeoff, thus outperforming ERM in heavy-tail situations. Namely, there exist c, c_0, and r > 0 such that, with probability at least 1 − exp(−c_0 n min{1, r²}), it holds simultaneously (see Theorem 2.10 in Lugosi & Mendelson (2016)):

\|f - f^*\|_{L_2} \le cr, \quad \text{and} \quad R(f) - R(f^*) \le (cr)^2.

2.2. Pairwise Means and U-Statistics

Rather than the mean of an integrable r.v., suppose now that the quantity of interest is of the form θ(h) = E[h(X_1, X_2)], where X_1 and X_2 are i.i.d. random vectors, taking their values in some measurable space X with distribution F(dx), and h : X × X → R is a measurable mapping, square integrable w.r.t. F ⊗ F. For simplicity, we assume that h(x_1, x_2) is symmetric (i.e. h(x_1, x_2) = h(x_2, x_1) for all (x_1, x_2) ∈ X²). A natural estimator of the parameter θ(h) based on an i.i.d. sample S_n = {X_1, ..., X_n} drawn from F is the average over all pairs

U_n(h) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} h(X_i, X_j). \qquad (2)

The quantity (2) is known as the U-statistic of degree two¹, with kernel h, based on the sample S_n. One may refer to Lee (1990) for an account of the theory of U-statistics. As may be shown by a Lehmann-Scheffé argument, it is the unbiased estimator of θ(h) with minimum variance. Setting h_1(X_1) = E[h(X_1, X_2) | X_1] − θ(h), h_2(X_1, X_2) = h(X_1, X_2) − θ(h) − h_1(X_1) − h_1(X_2), σ_1²(h) = Var(h_1(X_1)) and σ_2²(h) = Var(h_2(X_1, X_2)), and using the orthogonal decomposition (usually referred to as the second Hoeffding decomposition, see Hoeffding (1948))

U_n(h) - \theta(h) = \frac{2}{n}\sum_{i=1}^{n} h_1(X_i) + \frac{2}{n(n-1)}\sum_{1 \le i < j \le n} h_2(X_i, X_j),

one may easily see that

\mathrm{Var}(U_n(h)) = \frac{4\sigma_1^2(h)}{n} + \frac{2\sigma_2^2(h)}{n(n-1)}. \qquad (3)

Of course, an estimator of the parameter θ(h) taking the form of an i.i.d. average can be obtained by splitting the dataset into two halves and computing

M_n(h) = \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} h(X_i, X_{i+\lfloor n/2 \rfloor}).

One can check that its variance, Var(M_n(h)) = σ²(h)/⌊n/2⌋ with σ²(h) = Var(h(X_1, X_2)) = 2σ_1²(h) + σ_2²(h), is however significantly larger than (3). Regarding the difficulty of the analysis of the fluctuations of (2) (possibly uniformly over a class of kernels), the reduced variance property has a price: the variables summed up being far from independent, linearization tricks (i.e. the Hajek/Hoeffding projection) are required to establish statistical guarantees for the minimization of U-statistics. Refer to Clemencon et al. (2008) for further details.

¹ Let 1 ≤ d ≤ n and H : X^d → R be measurable and square integrable with respect to F^{⊗d}. The statistic ((n−d)!/n!) Σ_{(i_1, ..., i_d)} H(X_{i_1}, ..., X_{i_d}), where the sum is taken over all d-tuples of distinct indices in {1, ..., n}, is a U-statistic of degree d.
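The contrast between the two estimators is easy to reproduce numerically. The short Python sketch below (our own illustration, with hypothetical function names) computes both the complete U-statistic (2), in quadratic time, and the split-sample average M_n(h), using the variance kernel h(x, y) = (x − y)²/2 for which θ(h) = Var(X).

```python
import numpy as np
from itertools import combinations

def u_stat_degree2(x, h):
    """Complete U-statistic of degree two: average of h over all n(n-1)/2 pairs."""
    return float(np.mean([h(a, b) for a, b in combinations(x, 2)]))

def split_sample_mean(x, h):
    """Split-sample i.i.d. average M_n(h) built from floor(n/2) disjoint pairs."""
    m = len(x) // 2
    return float(np.mean([h(x[i], x[i + m]) for i in range(m)]))

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
h_var = lambda a, b: 0.5 * (a - b) ** 2   # variance kernel, theta(h) = Var(X) = 1
print(u_stat_degree2(x, h_var), split_sample_mean(x, h_var))
```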

Examples. In machine learning, various empirical performance criteria are of the form of a U-statistic.

• In clustering, the goal is to find a partition P of the feature space X so that pairs of observations independently drawn from a certain distribution F on X and lying within a same cell of P are more similar, w.r.t. a certain metric D : X² → R_+, than pairs lying in different cells. Based on an i.i.d. training sample X_1, ..., X_n, this leads to minimizing, over a class of partition candidates (see Clemencon (2014)), the U-statistic referred to as the empirical clustering risk:

W_n(\mathcal{P}) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} D(X_i, X_j)\cdot \Phi_{\mathcal{P}}(X_i, X_j),

where Φ_P(x, x′) = Σ_{C∈P} I{(x, x′) ∈ C²}.

• In pairwise ranking, the objective is to learn, from independent labeled data (X_1, Y_1), ..., (X_n, Y_n) drawn as a generic random pair (X, Y), where the real-valued random label Y is assigned to an object described by a r.v. X taking its values in a measurable space X, a ranking rule r : X² → {−1, 0, +1} that permits to predict, among two objects (X, Y) and (X′, Y′) chosen at random, which one is preferred: (X, Y) is preferred to (X′, Y′) when Y > Y′ and, in this case, one would ideally have r(X, X′) = +1, the rule r being supposed anti-symmetric (i.e. r(x, x′) = −r(x′, x) for all (x, x′) ∈ X²). This can be formulated as the problem of minimizing the U-statistic known as the empirical ranking risk (see Clemencon et al. (2005)), for a given loss function ℓ : R → R_+:

L_n(r) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell\left(-r(X_i, X_j)\cdot (Y_i - Y_j)\right).

Other examples of U-statistics are naturally involved in the formulation of metric/similarity-learning tasks, see Bellet et al. (2013) or Vogel et al. (2018). We also point out that the notion of U-statistic is much more general than that considered above: U-statistics of degree higher than two (i.e. associated to kernels with more than two arguments) and based on more than one sample can be defined, see e.g. Chapter 14 in Van der Vaart (2000) for further details. The methods proposed and the results proved in this paper can be straightforwardly extended to this more general framework.

3. Theoretical Results

Mainly motivated by pairwise learning problems such as those mentioned in subsection 2.2, it is the goal of this section to introduce and study several extensions of the MoM approach for robust statistical learning.

3.1. Medians of Randomized Means

As a first go, we place ourselves in the setup of Section 2.1 and use the notation introduced therein. But instead of dividing the dataset into disjoint blocks, an arbitrary number K of blocks, of arbitrary size B ≤ n, are now formed by sampling without replacement (SWoR), independently from S_n. Each randomized data block B_k, k ≤ K, is fully characterized by a random vector ε_k = (ε_{k,1}, ..., ε_{k,n}), such that ε_{k,i} is equal to 1 if the i-th observation has been selected in the k-th block, and to 0 otherwise. The ε_k's are i.i.d. random vectors, uniformly distributed on the set Λ_{n,B} = {ε ∈ {0,1}^n : Σ_{i=1}^n ε_i = B} of cardinality \binom{n}{B}. Equipped with this notation, the empirical mean computed from the k-th randomized block, for k ≤ K, can be written as θ_k = (1/B) Σ_{i=1}^n ε_{k,i} Z_i. The Median-of-Randomized-Means (MoRM) estimator θ_MoRM is then given by

\theta_{\mathrm{MoRM}} = \mathrm{median}(\theta_1, \ldots, \theta_K). \qquad (4)
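A minimal sketch of the MoRM estimate (4), assuming a NumPy implementation with our own naming, is given below; in contrast with MoM, the number of blocks K and their size B are free parameters here.

```python
import numpy as np

def median_of_randomized_means(z, n_blocks, block_size, seed=None):
    """MoRM sketch: K blocks of size B drawn by simple random sampling without
    replacement, independently of one another; return the median of the block means."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    means = [z[rng.choice(z.size, size=block_size, replace=False)].mean()
             for _ in range(n_blocks)]
    return float(np.median(means))
```

Proposition 1 below suggests, for a target confidence δ and a tuning parameter τ, the choices K = ⌈log(2/δ)/(2(1/2 − τ)²)⌉ and B = ⌊8τ²n/(9 log(2/δ))⌋.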

We point out that the number K and the size B ≤ n of the randomized blocks are arbitrary in the MoRM procedure, in contrast with the usual MoM approach, where B = ⌊n/K⌋. However, choices for B and K very similar to those leading to (1) lead to an analogous exponential bound, as revealed by the proof of Proposition 1. Because the randomized blocks are not independent, the argument used to establish (1) cannot be applied in a straightforward manner to investigate the accuracy of (4). Nevertheless, as can be seen by examining the proof of the result stated below, a concentration inequality can still be derived, using the conditional independence of the draws given S_n and a closed analytical form for the conditional probabilities P{|θ_k − θ| > ε | S_n}, seen as U-statistics of degree B. Refer to the Appendix for details.

Proposition 1. Suppose that Z_1, ..., Z_n are independent copies of a square integrable r.v. Z with mean θ and variance σ². Then, for any τ ∈ ]0, 1/2[ and any δ ∈ [2e^{−8τ²n/9}, 1[, choosing K = ⌈log(2/δ)/(2(1/2 − τ)²)⌉ and B = ⌊8τ²n/(9 log(2/δ))⌋, we have:

\mathbb{P}\left\{ \left| \theta_{\mathrm{MoRM}} - \theta \right| > \frac{3\sqrt{3}\,\sigma}{2\tau^{3/2}} \sqrt{\frac{\log(2/\delta)}{n}} \right\} \le \delta. \qquad (5)

The bound stated above presents three main differences with (1). Recall first that the number K of randomized blocks is completely arbitrary in the MoRM procedure and may even exceed n. Consequently, it is always possible to build ⌈log(2/δ)/(2(1/2 − τ)²)⌉ blocks, and there is no restriction on the range of admissible confidence levels δ due to K. Second, the size B of the blocks can also be chosen completely arbitrarily in {1, ..., n}, and independently from K. Proposition 1 exhibits their respective dependence on δ and n. Still, B needs to be greater than 1, which results in the restriction on the admissible δ's specified above. Observe finally that B never exceeds n: for all τ ∈ ]0, 1/2[, 8τ²/(9 log(2/δ)) does not exceed 1 as long as δ is lower than 2 exp(−2/9) ≈ 1.6, which is always true. Third, the proposed bound involves an additional parameter τ, that can be chosen arbitrarily in ]0, 1/2[. As revealed by examination of the proof, the choice of this extra parameter reflects a trade-off between the order of magnitude of K and that of B: the larger τ, the larger K, the larger the admissible confidence range, the lower B, and the lower the constant in (5) as well. Since one can pick K arbitrarily large, τ can be chosen as close to 1/2 as desired. This way, one asymptotically achieves a 3√6 constant factor, which is the same as that obtained in Hsu & Sabato (2016) for a comparable confidence range. However, the price of such an improvement is the construction of a higher number of blocks in practice (for a number of blocks comparable to MoM's, the constant in (5) becomes 27√2).

Remark 1. (ALTERNATIVE SAMPLING SCHEMES) We point out that procedures other than the SWoR scheme above (e.g. Poisson/Bernoulli/Monte-Carlo sampling) can be considered to build blocks and estimates of the parameter θ. However, as discussed in the SM, the theoretical analysis of such variants is much more challenging, due to possible replications of the same original observation in a block.

Remark 2. (EXTENSION TO RANDOM VECTORS) Among approaches extending MoMs to random vectors, that of Minsker et al. (2015) could be readily adapted to MoRM. Indeed, once Lemma 2.1 therein has been applied, the sum of indicators can be bounded exactly as in the proof of Proposition 1. Computationally, MoRM only differs from MoM in the sampling, adding no difficulty, while multivariate medians can be computed efficiently (Hopkins, 2018).

Remark 3. (RANDOMIZATION MOTIVATION) Theoretically, randomization being a natural alternative to data segmentation, it appeared interesting to study its impact on MoMs. On the practical side, when performing a MoM Gradient Descent (GD), it is often necessary to shuffle the blocks at each step (see e.g. Remark 5 in Lecue et al. (2018)). While this shuffling may seem artificial and "ad hoc" in a MoM GD, it is already built in and controlled with MoRM. Finally, extending MoU to incomplete U-statistics as in subsection 3.4 first requires a study of MoM randomization.


3.2. Medians of (Randomized) U-statistics

We now consider the situation described in subsection 2.2, where the parameter of interest is θ(h) = E[h(X_1, X_2)], and investigate the performance of two possible approaches for extending the MoM methodology.

Medians of U-statistics. The most straightforward way of extending the MoM approach is undoubtedly to form complete U-statistics based on K subsamples corresponding to sets of indexes I_1, ..., I_K of size B = ⌊n/K⌋ built by segmenting the full sample, as originally proposed: for k ∈ {1, ..., K},

U_k(h) = \frac{2}{B(B-1)} \sum_{(i,j)\in I_k^2,\ i<j} h(X_i, X_j).

The median of U-statistics estimator (MoU in abbreviated form) of the parameter θ(h) is then defined as

\theta_{\mathrm{MoU}}(h) = \mathrm{median}(U_1(h), \ldots, U_K(h)).
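The MoU estimate can be sketched as follows (our own illustrative Python, assuming a block size of at least two so that each block contains at least one pair); Proposition 2 below suggests the block number K = ⌈(9/2) log(1/δ)⌉.

```python
import numpy as np
from itertools import combinations

def median_of_u_stats(x, h, n_blocks):
    """MoU sketch: partition the sample into K blocks of size floor(n/K) and
    return the median of the K complete within-block U-statistics of degree two."""
    B = len(x) // n_blocks
    u = [np.mean([h(a, b) for a, b in combinations(x[k * B:(k + 1) * B], 2)])
         for k in range(n_blocks)]
    return float(np.median(u))
```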

The following result provides a bound analogous to (1), revealing its accuracy.

Proposition 2. Let δ ∈ [e^{1−2n/9}, 1[. Choosing K = ⌈(9/2) log(1/δ)⌉, we have with probability at least 1 − δ:

\left| \theta_{\mathrm{MoU}}(h) - \theta(h) \right| \le \sqrt{ \frac{C_1 \log\frac{1}{\delta}}{n} + \frac{C_2 \log^2\!\left(\frac{1}{\delta}\right)}{n\left(2n - 9\log\frac{1}{\delta}\right)} },

with C_1 = 108σ_1²(h) and C_2 = 486σ_2²(h).

We point out that another robust estimator θ_MoM(h) of θ(h) could also have been obtained by applying the classic Mo(R)M methodology recalled in subsection 2.1 to the set of ⌊n/2⌋ i.i.d. observations {h(X_i, X_{i+⌊n/2⌋}) : 1 ≤ i ≤ ⌊n/2⌋}, see the discussion in subsection 2.2. In this context, we deduce from Eq. (1) with K = ⌈log(1/δ)⌉ and B = ⌊⌊n/2⌋/K⌋ that, for any δ ∈ [e^{1−⌊n/2⌋/2}, 1[,

\left| \theta_{\mathrm{MoM}}(h) - \theta(h) \right| \le 2\sqrt{2}e\,\sigma(h)\sqrt{\frac{1+\log(1/\delta)}{\lfloor n/2 \rfloor}}

with probability at least 1 − δ. The Mo(R)M strategies on independent pairs thus lead to constants respectively equal to 4eσ(h) and 6√3 σ(h). On the other hand, MoU reaches a 6√3 σ_1(h) constant factor on its dominant term. Recalling that σ²(h) = 2σ_1²(h) + σ_2²(h), beyond the √2 constant factor, MoU provides an improvement that is all the more significant as σ_2²(h) is large. Another difference between the bounds is the restriction on δ, which is looser in the MoU case. This is due to the fact that the MoU estimator can possibly involve any pair of observations among the n(n−1)/2 possible ones, in contrast to θ_MoM(h), which relies on the ⌊n/2⌋ pairs fixed once and for all at the beginning. MoU admittedly exhibits a more complex, two-rate formula, but the second term being negligible, the performance is not affected, as shall be confirmed empirically.

As suggested in subsection 3.1, the data blocks used to compute the collection of K U-statistics could be formed by means of a SRSWoR scheme. Confidence bounds for such a median of randomized U-statistics estimator, comparable to those achieved by the MoU estimator, are stated below.

Medians of Randomized U-statistics. The alternative we propose consists in building an arbitrary number K of data blocks B_1, ..., B_K of size B ≤ n by means of a SRSWoR scheme, and, for each data block B_k, forming all possible pairs of observations in order to compute

U_k(h) = \frac{2}{B(B-1)} \sum_{i<j} \varepsilon_{k,i}\,\varepsilon_{k,j}\; h(X_i, X_j),

where ε_k denotes the random vector characterizing the k-th randomized block, just like in subsection 3.1. Observe that, for all k ∈ {1, ..., K}, we have E[U_k(h) | S_n] = U_n(h). The Median of Randomized U-statistics estimator of θ(h) is then defined as

\theta_{\mathrm{MoRU}}(h) = \mathrm{median}(U_1(h), \ldots, U_K(h)). \qquad (6)
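A sketch of (6), combining the SRSWoR block sampling of subsection 3.1 with the within-block U-statistics above, may look as follows (again an illustration of ours, not the authors' implementation).

```python
import numpy as np
from itertools import combinations

def median_of_randomized_u_stats(x, h, n_blocks, block_size, seed=None):
    """MoRU sketch: draw K blocks of size B by SRSWoR, compute the complete
    degree-two U-statistic on each block, and return their median."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    u = []
    for _ in range(n_blocks):
        block = x[rng.choice(len(x), size=block_size, replace=False)]
        u.append(np.mean([h(a, b) for a, b in combinations(block, 2)]))
    return float(np.median(u))
```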

The following proposition establishes the accuracy of the estimator (6), while emphasizing the advantages of the greater flexibility it offers when choosing B and K.

Proposition 3. For any τ ∈ ]0, 1/2[ and any δ ∈ [2e^{−8τ²n/9}, 1[, choosing K = ⌈log(2/δ)/(2(1/2 − τ)²)⌉ and B = ⌊8τ²n/(9 log(2/δ))⌋, it holds with probability at least 1 − δ:

\left| \theta_{\mathrm{MoRU}}(h) - \theta(h) \right| \le \sqrt{ \frac{C_1(\tau)\log\frac{2}{\delta}}{n} + \frac{C_2(\tau)\log^2\!\left(\frac{2}{\delta}\right)}{n\left(8\tau^2 n - 9\log\frac{2}{\delta}\right)} },

with C_1(τ) = 27σ_1²(h)/(2τ³) and C_2(τ) = 243σ_2²(h)/(4τ³).

Remark 4. Observe that Proposition 2's constants (and bound) can be recovered asymptotically by letting τ → 1/2.

Remark 5. Propositions 2 and 3 remain valid for (multi-sample) U-statistics of arbitrary degree. Refer to the SM for the general statements, and for discussions about related approaches (Joly & Lugosi, 2016; Minsker & Wei, 2018).

3.3. MoU-based Pairwise Learning

We now describe a version of the tournament method tailored to pairwise learning. Let X be a measurable space, F ⊂ R^{X×X} a class of decision rules, and ℓ : F × X² → R_+ a given loss function. The goal pursued here is to learn, from 2n i.i.d. variables X_1, ..., X_{2n} distributed as a generic r.v. X valued in X, a minimizer of the risk R(f) = E[ℓ(f, (X, X′))], where X′ denotes an independent copy of X. In order to benefit from the standard tournament setting, we introduce the following notation: for every f ∈ F, let H_f(X, X′) = √(ℓ(f, (X, X′))) denote the kernel that maps every pair (X, X′) to its (square root) loss through f. Let H_F = {H_f : f ∈ F}. It is easy to see that, for all f ∈ F, R(f) = ‖H_f‖²_{L2(µ)}, and that if f* and H*_f denote respectively the R and L_2 minimizers over F and H_F, then H*_f = H_{f*}. First, the dataset is split into 2 subsamples S and S′, each of size n. Then, a distance oracle is used to allow matches to take place. Namely, for any (f, g) ∈ F², let Φ_S(f, g) be a MoU estimate of ‖H_f − H_g‖_{L1} built on S. If B_1, ..., B_K is a partition of S, it reads:

\Phi_S(f, g) = \mathrm{med}\left( U_1(|H_f - H_g|), \ldots, U_K(|H_f - H_g|) \right).

If Φ_S(f, g) is greater than βr, for β and r to be specified later, a match between f and g is allowed. As shall be seen in the SM proofs, a MoRU estimate of ‖H_f − H_g‖_{L1} could also have been used instead of a MoU. The idea underlying the pairwise tournament is the same as that of the standard one: with high probability, Φ_S(f, g) is a good estimate of ‖H_f − H_g‖_{L2}, so that only distant candidates are allowed to confront. And if H_{f*} is one of these two candidates, it should hopefully win its match against a distant challenger. The nature of these matches is now specified. For any (f, g) ∈ F², let Ψ_{S′}(f, g) denote a MoU estimate of E[H_f² − H_g²] built on S′. With B′_1, ..., B′_{K′} a partition of S′, Ψ_{S′}(f, g) reads

\Psi_{S'}(f, g) = \mathrm{med}\left( U_1(H_f^2 - H_g^2), \ldots, U_{K'}(H_f^2 - H_g^2) \right).

f is declared winner of the match if Ψ_{S′}(f, g) ≤ 0, i.e. if Σ_{i<j} ℓ(f, (X_i, X_j)) is lower than Σ_{i<j} ℓ(g, (X_i, X_j)) on more than half of the blocks. A candidate that has not lost a single match among those it has been allowed to participate in enjoys good generalization properties under mild assumptions, as revealed by the following theorem.
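The logic of a single match can be summarized by the following sketch (ours only; the loss signature loss(f, x, x') and the threshold argument beta_r are hypothetical names, and the actual tournament also involves the preselection and final rounds described in the Introduction).

```python
import numpy as np
from itertools import combinations

def block_medians(sample, kernel, n_blocks):
    """Median over a partition into K blocks of the complete degree-two
    U-statistics of `kernel` (the MoU estimate used by both oracles)."""
    B = len(sample) // n_blocks
    u = [np.mean([kernel(a, b)
                  for a, b in combinations(sample[k * B:(k + 1) * B], 2)])
         for k in range(n_blocks)]
    return float(np.median(u))

def play_match(f, g, loss, S, S_prime, n_blocks, beta_r):
    """One match of the pairwise tournament: Phi (on S) estimates the L1
    distance between H_f and H_g and gates the match; Psi (on S') compares
    the losses block by block, f winning when Psi <= 0."""
    Hf = lambda a, b: np.sqrt(loss(f, a, b))
    Hg = lambda a, b: np.sqrt(loss(g, a, b))
    phi = block_medians(S, lambda a, b: abs(Hf(a, b) - Hg(a, b)), n_blocks)
    if phi <= beta_r:
        return None                     # candidates too close: no match is played
    psi = block_medians(S_prime, lambda a, b: Hf(a, b)**2 - Hg(a, b)**2, n_blocks)
    return f if psi <= 0 else g
```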

Theorem 1. Let F be a class of prediction functions, and ℓ a loss function such that H_F is locally compact. Assume that there exist q > 2 and L > 1 such that, for all H_f ∈ span(H_F), ‖H_f‖_{Lq} ≤ L‖H_f‖_{L2}. Let r* (properly defined in the SM due to space limitations) depend only on f*, L, q, and the geometry of H_F around H_{f*}. Set r ≥ 2r*. Then there exist c_0, c > 0, and a procedure based on X_1, ..., X_{2n}, L, and r that selects a function f ∈ F such that, with probability at least 1 − exp(−c_0 n min{1, r²}),

R(f) - R(f^*) \le cr.

Proof. The proof is analogous to that of Theorem 2.11 in Lugosi & Mendelson (2016), and is sketched in the SM.

Remark 6. In pairwise learning, one seeks to minimize ℓ(f, X, X′) = (√(ℓ(f, X, X′)) − 0)² = (H_f(X, X′) − 0)². We almost recover the setting of Lugosi & Mendelson (2016): quadratic loss, with Y = 0, for the decision function H_f.

This is why any loss function ℓ can be considered, once the technicalities induced by U-statistics are tackled. The control obtained on ‖H_f − H*_f‖_{L2} then translates into a control on the excess risk of f (see the SM for further details).

Remark 7. As discussed at length in Lugosi & Mendelson (2016), computing the tournament winner is a nontrivial problem. However, one could alternatively consider performing a tournament on an ε-coverage of F, while controlling the approximation error of this coverage.

3.4. Discussion - Further Extensions

The computation of the U-statistic (2) is expensive in the sense that it involves the summation of O(n²) terms. The concept of incomplete U-statistic, see Blom (1976), precisely permits to address this computational issue and to achieve a trade-off between scalability and variance reduction. In one of its simplest forms, it consists in selecting a subsample of size M ≥ 1 by sampling with replacement in the set of all pairs of observations that can be formed from the original sample. Setting Λ = {(i, j) : 1 ≤ i < j ≤ n}, and denoting by {(i_1, j_1), ..., (i_M, j_M)} ⊂ Λ the subsample drawn by Monte-Carlo, the incomplete version of the U-statistic (2) is

U_M(h) = \frac{1}{M}\sum_{m=1}^{M} h(X_{i_m}, X_{j_m}).

U_M(h) is directly an unbiased estimator of θ(h), with variance

\mathrm{Var}\left( U_M(h) \right) = \left(1 - \frac{1}{M}\right)\mathrm{Var}(U_n(h)) + \frac{\sigma^2(h)}{M}.

The difference between its variance and that of (2) vanishes as M increases. In contrast, when M ≤ #Λ = n(n−1)/2, the variance of a complete U-statistic based on a subsample of size ⌊√M⌋, and thus on O(M) pairs just like U_M(h), is of order O(1/√M). Minimization of incomplete U-statistics has been investigated in Clemencon et al. (2016) from the perspective of scalable statistical learning. Hence, rather than sampling observations first and then forming pairs within data blocks in order to compute a collection of complete U-statistics, of which the median is subsequently taken, one could sample pairs of observations directly, compute estimates of drastically reduced variance, and output a Median of Incomplete U-statistics. However, one faces significant difficulties when trying to analyze such a variant theoretically, as explained in the SM.
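For illustration, the incomplete U-statistic itself is straightforward to compute; the sketch below (ours, with a hypothetical function name) draws M index pairs with replacement, uniformly over the pairs with distinct indices, which is equivalent to sampling in Λ since h is symmetric.

```python
import numpy as np

def incomplete_u_stat(x, h, n_pairs, seed=None):
    """Incomplete U-statistic sketch: average h over M pairs (i, j), i != j,
    drawn with replacement from the set of all pairs of the sample."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n - 1, size=n_pairs)
    j = np.where(j >= i, j + 1, j)       # ensures j != i, uniform over pairs
    return float(np.mean([h(x[a], x[b]) for a, b in zip(i, j)]))
```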

4. Numerical Experiments

Here we display numerical results supporting the relevance of the MoM variants analyzed in the paper. Additional experiments are presented in the SM for completeness.

Table 1. Quadratic Risks for Mean Estimation, δ = 0.001

                        NORMAL (0, 1)         STUDENT (3)           LOG-NORMAL (0, 1)     PARETO (3)
MoM                     0.00149 ± 0.00218     0.00410 ± 0.00584     0.00697 ± 0.00948     1.02036 ± 0.06115
MoRM (τ = 1/6, SWoR)    0.01366 ± 0.01888     0.02947 ± 0.04452     0.06210 ± 0.07876     1.12256 ± 0.14970
MoRM (τ = 3/10, SWoR)   0.00255 ± 0.00361     0.00602 ± 0.00868     0.01241 ± 0.01610     1.05458 ± 0.07041
MoRM (τ = 9/20, SWoR)   0.00105 ± 0.00148     0.00264 ± 0.00372     0.00497 ± 0.00668     1.02802 ± 0.04903

Table 2. Quadratic Risks for Variance Estimation, δ = 0.001

                        NORMAL (0, 1)         STUDENT (3)           LOG-NORMAL (0, 1)     PARETO (3)
MoU (1/2; 1/2)          0.00409 ± 0.00579     1.72618 ± 28.3563     2.61283 ± 23.5001     1.35748 ± 36.7998
MoU (partition)         0.00324 ± 0.00448     0.38242 ± 0.31934     1.62258 ± 1.41839     0.09300 ± 0.05650
MoRU (SWoR)             0.00504 ± 0.00705     0.51202 ± 3.88291     2.01399 ± 4.85311     0.09703 ± 0.07116

MoRM experiments. Considering inference of the expectation of four specified distributions (Gaussian, Student, Log-normal and Pareto), based on a sample of size n = 1000, seven estimators are compared (see Table 1): standard MoM, and

six MoRM estimators, related to different sampling schemes (SRSWoR, Monte-Carlo) or different values of the tuning parameter τ. Results are obtained through 5000 replications of the estimation procedures. Beyond the quadratic risk, the accuracy of the estimators is assessed by means of deviation probabilities (see SM), i.e. empirical quantiles for a geometrical grid of confidence levels δ. As highlighted above, τ = 1/6 leads to (approximately) the same number of blocks as in the MoM procedure. However, MoRM usually selects blocks of cardinality lower than n/K, so that the MoRM estimator with τ = 1/6 uses fewer examples than MoM. Proposition 1 exhibits a higher constant for MoRM in that case, and this is confirmed empirically here. The choice τ = 3/10 guarantees that the number of MoRM blocks multiplied by their cardinality is equal to n. This way, MoRM uses as many samples as MoM. Nevertheless, the increased variability leads to a slightly lower performance in this case. Finally, τ = 9/20 is chosen to be closer to 1/2, as suggested by (5). In this setting, the two constant factors are (almost) equal, and MoRM even shows a systematic empirical improvement over MoM. Note that the quantile curves should be decreasing. However, the estimators being δ-dependent, different experiments are run for each value of δ, and the rare small increases are due to this random effect.

Mo(R)U experiments. In these experiments assessing empirically the performance of the Mo(R)U methods, the parameter of interest is the variance (i.e. h(x, y) = (x − y)²/2) of the four laws used above. Again, the estimators are assessed through their quadratic risk and empirical quantiles. A metric learning application is also proposed in the SM.
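A self-contained sketch of the partition-based variant of this experiment is given below, under our own naming and with illustrative choices (K = 9, NumPy's default parameterizations of the four laws); the exact protocol, the 5000-replication risk computation and the other estimators are described in the paper and the SM.

```python
import numpy as np
from itertools import combinations

def mou_variance(x, n_blocks):
    """Median of block-wise degree-two U-statistics with the variance kernel
    h(x, y) = (x - y)^2 / 2 (partition-based MoU, as in Table 2)."""
    B = len(x) // n_blocks
    u = [np.mean([0.5 * (a - b) ** 2
                  for a, b in combinations(x[k * B:(k + 1) * B], 2)])
         for k in range(n_blocks)]
    return float(np.median(u))

rng = np.random.default_rng(0)
n, K = 1000, 9                           # illustrative choices only
samples = {"Normal(0,1)": rng.standard_normal(n),
           "Student(3)": rng.standard_t(3, n),
           "Log-normal(0,1)": rng.lognormal(0.0, 1.0, n),
           "Pareto(3)": rng.pareto(3, n)}
for name, x in samples.items():
    print(name, mou_variance(x, K))
```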

5. Conclusion

In this paper, various extensions of the Median-of-Means methodology, on which tournament-based statistical learning techniques recently proposed in the literature to handle heavy-tailed situations rely, have been investigated at length. First, confidence bounds are established showing that accuracy can be fully preserved when data blocks are built through SRSWoR schemes rather than simple segmentation, giving more flexibility to the approach regarding the number of blocks and their size. Second, the application of the methodology to the estimation of pairwise expectations (i.e. Medians of U-statistics) is studied in a valid theoretical framework, paving the way for the design of robust pairwise statistical learning techniques in clustering, ranking or similarity-learning tasks.

Technical Proofs

Proof of Proposition 1

Let ε > 0. Just like in the classic argument used to prove (1), observe that

\left\{ \left| \theta_{\mathrm{MoRM}} - \theta \right| > \varepsilon \right\} \subset \left\{ \sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\varepsilon_k}) \ge K/2 \right\},

where I_ε(B_{ε_k}) = I{|θ_k − θ| > ε} for k = 1, ..., K. In order to benefit from the conditional independence of the blocks given the original sample S_n, we first condition upon S_n and consider the variability induced by the ε_k's only:

\mathbb{P}\left\{ \left| \theta_{\mathrm{MoRM}} - \theta \right| > \varepsilon \;\middle|\; S_n \right\} \le \mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\varepsilon_k}) \ge \frac{1}{2} \;\middle|\; S_n \right\}.

Now, the average (1/K) Σ_{k=1}^K I_ε(B_{ε_k}) can be viewed as an approximation of the U-statistic of degree B (refer to Lee (1990)), its conditional expectation given S_n being

U_n^\varepsilon = \frac{1}{\binom{n}{B}} \sum_{\varepsilon \in \Lambda_{n,B}} I_\varepsilon(\mathcal{B}_{\varepsilon}).


Denoting by p_ε = E[U_n^ε] = P{|θ_1 − θ| > ε} the expectation of the I_ε(B_{ε_k})'s, we have, for all τ ∈ ]0, 1/2[:

\mathbb{P}\left\{ \left| \theta_{\mathrm{MoRM}} - \theta \right| > \varepsilon \right\} \le \mathbb{P}_{S_n}\left\{ U_n^\varepsilon - p_\varepsilon \ge \tau - p_\varepsilon \right\} + \mathbb{E}_{S_n}\left[ \mathbb{P}_\varepsilon\left\{ \frac{1}{K}\sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\varepsilon_k}) - U_n^\varepsilon \ge \frac{1}{2} - \tau \;\middle|\; S_n \right\} \right]. \qquad (7)

By virtue of Hoeffding's inequality for i.i.d. averages (see Hoeffding (1963)) conditioned upon S_n, we have, for all t > 0:

\mathbb{P}_\varepsilon\left\{ \frac{1}{K}\sum_{k=1}^{K} I_\varepsilon(\mathcal{B}_{\varepsilon_k}) - U_n^\varepsilon \ge t \;\middle|\; S_n \right\} \le \exp\left( -2Kt^2 \right). \qquad (8)

In addition, the version of Hoeffding's inequality for U-statistics (cf. Hoeffding (1963), see also Theorem A in Chapter 5 of Serfling (1980)) yields, for all t > 0:

\mathbb{P}_{S_n}\left\{ U_n^\varepsilon - p_\varepsilon \ge t \right\} \le \exp\left( -2nt^2/B \right). \qquad (9)

One may also show (see SM A.2.1) that p_ε ≤ σ²/(Bε²). Combining this remark with equations (7), (8) and (9), the deviation probability of θ_MoRM can be bounded by

\exp\left( -2\frac{n}{B}\left( \tau - \frac{\sigma^2}{B\varepsilon^2} \right)^2 \right) + \exp\left( -2K\left( \frac{1}{2} - \tau \right)^2 \right).

Choosing K = ⌈log(2/δ)/(2(1/2 − τ)²)⌉ and B = ⌊8τ²n/(9 log(2/δ))⌋ leads to the desired result.

Proof of Proposition 2

The data blocks are built here by partitioning the original dataset into K ≤ n subsamples of size B = ⌊n/K⌋. Set I_{ε,k} = I{|U_k(h) − θ(h)| > ε} for k ∈ {1, ..., K}. Again, observe that P{|θ_MoU(h) − θ(h)| > ε} is lower than

\mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^{K} I_{\varepsilon,k} - q_\varepsilon \ge \frac{1}{2} - q_\varepsilon \right\},

where q_ε = E[I_{ε,1}] = P{|U_1(h) − θ(h)| > ε}. By virtue of Chebyshev's inequality and equation (3):

q_\varepsilon \le \frac{\mathrm{Var}(U_1(h))}{\varepsilon^2} = \frac{1}{\varepsilon^2}\left( \frac{4\sigma_1^2(h)}{B} + \frac{2\sigma_2^2(h)}{B(B-1)} \right).

Using Hoeffding's inequality, the deviation probability can thus be bounded by

\exp\left( -2K\left( \frac{1}{2} - \frac{1}{\varepsilon^2}\left( \frac{4\sigma_1^2(h)}{B} + \frac{2\sigma_2^2(h)}{B(B-1)} \right) \right)^2 \right).

Choosing K = log(1/δ)/(2(1/2 − λ)²), λ ∈ ]0, 1/2[, gives:

\varepsilon = \sqrt{ C_1(\lambda)\frac{\log\frac{1}{\delta}}{n} + C_2(\lambda)\frac{\log^2\!\left(\frac{1}{\delta}\right)}{n\left[ 2\left(\frac{1}{2} - \lambda\right)^2 n - \log\frac{1}{\delta} \right]} },

with C_1(λ) = 2σ_1²(h)/(λ(1/2 − λ)²) and C_2(λ) = σ_2²(h)/(λ(1/2 − λ)²). The optimal constant for the first and leading term is attained for λ = 1/6, which corresponds to K = (9/2) log(1/δ) and gives:

\varepsilon = \sqrt{ C_1\frac{\log\frac{1}{\delta}}{n} + C_2\frac{\log^2\!\left(\frac{1}{\delta}\right)}{n\left(2n - 9\log\frac{1}{\delta}\right)} },

with C_1 = 108σ_1²(h) and C_2 = 486σ_2²(h). Finally, taking ⌈K⌉ instead of K does not change the result.

Proof of Proposition 3

Here we consider the situation where the estimator is the median of K randomized U-statistics, computed from data blocks built by means of independent SRSWoR schemes. We set I_ε(ε_k) = I{|U_k(h) − θ(h)| > ε}. For all τ ∈ ]0, 1/2[, we have:

\mathbb{P}\left\{ \left| \theta_{\mathrm{MoRU}}(h) - \theta(h) \right| > \varepsilon \right\} \le \mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^{K} I_\varepsilon(\varepsilon_k) \ge \frac{1}{2} \right\}
\le \mathbb{E}\left[ \mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^{K} I_\varepsilon(\varepsilon_k) - W_n^\varepsilon \ge \frac{1}{2} - \tau \;\middle|\; S_n \right\} \right] + \mathbb{P}\left\{ W_n^\varepsilon - q_\varepsilon \ge \tau - q_\varepsilon \right\}, \qquad (10)

where we set

W_n^\varepsilon = \mathbb{E}\left[ I_\varepsilon(\varepsilon_1) \mid S_n \right] = \frac{1}{\binom{n}{B}} \sum_{\varepsilon \in \Lambda_{n,B}} I_\varepsilon(\varepsilon), \qquad q_\varepsilon = \mathbb{E}[I_\varepsilon(\varepsilon_1)] = \mathbb{P}\{ |U_1(h) - \theta(h)| > \varepsilon \}.

The conditional expectation W_n^ε, with mean q_ε, is a U-statistic of degree B, so that Theorem A in Chapter 5 of Serfling (1980) yields:

\mathbb{P}\left\{ W_n^\varepsilon - q_\varepsilon \ge \tau - q_\varepsilon \right\} \le \exp\left( -2\frac{n}{B}\left( \tau - q_\varepsilon \right)^2 \right). \qquad (11)

One may also show (see SM A.2.2) that

q_\varepsilon \le \frac{1}{\varepsilon^2}\left( \frac{4\sigma_1^2(h)}{B} + \frac{2\sigma_2^2(h)}{B(B-1)} \right). \qquad (12)

Combining (10), the standard Hoeffding inequality conditioned on the data, as well as (11) together with (12), gives that the deviation probability of θ_MoRU(h) is upper bounded by

\exp\left( -2K\left( \frac{1}{2} - \tau \right)^2 \right) + \exp\left( -2\frac{n}{B}\left( \tau - \frac{1}{\varepsilon^2}\left( \frac{4\sigma_1^2(h)}{B} + \frac{2\sigma_2^2(h)}{B(B-1)} \right) \right)^2 \right). \qquad (13)

Choosing K = ⌈log(2/δ)/(2(1/2 − τ)²)⌉ and B = ⌊8τ²n/(9 log(2/δ))⌋ leads to the desired bound.


References

Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.

Audibert, J.-Y. and Catoni, O. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.

Bellet, A., Habrard, A., and Sebban, M. A Survey on Metric Learning for Feature Vectors and Structured Data. ArXiv e-prints, June 2013.

Blom, G. Some properties of incomplete U-statistics. Biometrika, 63(3):573–580, 1976.

Brownlees, C., Joly, E., and Lugosi, G. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

Callaert, H. and Janssen, P. The Berry-Esseen theorem for U-statistics. The Annals of Statistics, 6(2):417–421, 1978.

Catoni, O. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pp. 1148–1185, 2012.

Clemencon, S. A statistical view of clustering performance through the theory of U-processes. Journal of Multivariate Analysis, 124:42–56, 2014.

Clemencon, S., Lugosi, G., and Vayatis, N. Ranking and scoring using empirical risk minimization. In Proceedings of COLT, 2005.

Clemencon, S., Lugosi, G., and Vayatis, N. Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, 36(2):844–874, 2008.

Clemencon, S., Colin, I., and Bellet, A. Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics. Journal of Machine Learning Research, 17:1–36, 2016.

Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. Sub-gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725, 2016.

Hoeffding, W. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19:293–325, 1948.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Hopkins, S. B. Sub-gaussian mean estimation in polynomial time. arXiv preprint arXiv:1809.07425, 2018.

Hsu, D. and Sabato, S. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.

Jerrum, M., Valiant, L., and Vazirani, V. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.

Joly, E. and Lugosi, G. Robust estimation of U-statistics. Stochastic Processes and their Applications, 126(12):3760–3773, 2016.

Lecue, G. and Lerasle, M. Robust machine learning by median-of-means: theory and practice. arXiv preprint arXiv:1711.10306, 2017.

Lecue, G., Lerasle, M., and Mathieu, T. Robust classification via MOM minimization. arXiv preprint arXiv:1808.03106, 2018.

Lee, A. J. U-statistics: Theory and Practice. Marcel Dekker, Inc., New York, 1990.

Lerasle, M. and Oliveira, R. I. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914, 2011.

Lugosi, G. and Mendelson, S. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.

Lugosi, G. and Mendelson, S. Sub-gaussian estimators of the mean of a random vector. arXiv preprint arXiv:1702.00482, 2017.

McDiarmid, C. On the method of bounded differences, pp. 148–188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989.

Mendelson, S. On aggregation for heavy-tailed classes. Probability Theory and Related Fields, 168(3-4):641–674, 2017.

Minsker, S. and Wei, X. Robust modifications of U-statistics and applications to covariance estimation problems. arXiv preprint arXiv:1801.05565, 2018.

Minsker, S. Geometric Median and Robust Estimation in Banach Spaces. Bernoulli, 21(4):2308–2335, 2015.

Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, New York, 1983.

Pena, V. H. and Gine, E. Decoupling: From Dependence to Independence. Springer, 1999.

Serfling, R. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, 1980.

Van der Vaart, A. Asymptotic Statistics. Cambridge University Press, 2000.

Vogel, R., Clemencon, S., and Bellet, A. A Probabilistic Theory of Supervised Similarity Learning: Pairwise Bipartite Ranking and Pointwise ROC Curve Optimization. In International Conference on Machine Learning, 2018.


A. Technical Details

A.1. Remark on Proposition 1's Bound

We point out that K and B are chosen so that (8) and (9) are both bounded by δ/2. Given τ, setting B = ⌊2(τ − λ)²n/log(2/δ)⌋ for λ ∈ ]0, τ[ yields a minimal constant for λ = τ/3. Interestingly, B involves a floor function, even if it is not constrained by K. An interpretation is that building large blocks increases the risk of selecting extreme values, and thus of deteriorating the performance.

A.2. Variance Computations

A.2.1. MoRM

By virtue of Chebyshev's inequality, one gets:

p_\varepsilon \le \frac{\mathbb{E}\left[ (\theta_1 - \theta)^2 \right]}{\varepsilon^2} = \frac{\mathbb{E}_{S_n}\left[ \mathbb{E}\left[ (\theta_1 - \theta)^2 \mid S_n \right] \right]}{\varepsilon^2}.

Observing that E[θ_1 | S_n] = θ_n and that

\mathbb{E}\left[ (\theta_1 - \theta)^2 \mid S_n \right] = \mathrm{Var}\left( \theta_1 \mid S_n \right) + (\theta_n - \theta)^2 = (\theta_n - \theta)^2 + \frac{1}{B}\cdot\frac{n-B}{n}\,\sigma_n^2,

where σ_n² = (1/(n−1)) Σ_{i=1}^n (Z_i − θ_n)², we deduce that

p_\varepsilon \le \left( \frac{1}{n} + \frac{n-B}{nB} \right)\frac{\sigma^2}{\varepsilon^2} = \frac{\sigma^2}{B\varepsilon^2}.

A.2.2. MoRU

Observe first that

\mathrm{Var}(U_1(h)) = \mathbb{E}\left[ \mathrm{Var}(U_1(h) \mid S_n) \right] + \mathrm{Var}\left( \mathbb{E}\left[ U_1(h) \mid S_n \right] \right). \qquad (14)

Recall that E[U_1(h) | S_n] = U_n(h), so that

\mathrm{Var}\left( \mathbb{E}\left[ U_1(h) \mid S_n \right] \right) = \frac{4\sigma_1^2(h)}{n} + \frac{2\sigma_2^2(h)}{n(n-1)}. \qquad (15)

In addition, we have, for B ≥ 4,

\mathrm{Var}(U_1(h) \mid S_n) = \frac{4}{B^2(B-1)^2}\Bigg[ \sum_{i<j} h^2(X_i, X_j)\,\mathrm{Var}(\varepsilon_{1,i}\varepsilon_{1,j}) + \sum_{\substack{i<j,\ k<l \\ (i,j)\ne(k,l)}} \mathrm{Cov}(\varepsilon_{1,i}\varepsilon_{1,j},\, \varepsilon_{1,k}\varepsilon_{1,l})\, h(X_i, X_j)\, h(X_k, X_l) \Bigg].

For i ≠ j, one may check that

\mathrm{Var}(\varepsilon_{1,i}\varepsilon_{1,j}) = \frac{B(B-1)(n-B)(n+B-1)}{n^2(n-1)^2}.

And, for any k ≠ l, we have

\mathrm{Cov}(\varepsilon_{1,i}\varepsilon_{1,j},\, \varepsilon_{1,k}\varepsilon_{1,l}) = -\frac{B(B-1)}{n(n-1)}\cdot\frac{(n-B)(4nB - 6n - 6B + 6)}{n(n-1)(n-2)(n-3)}

when {i, j} ∩ {k, l} = ∅, as well as

\mathrm{Cov}(\varepsilon_{1,i}\varepsilon_{1,j},\, \varepsilon_{1,i}\varepsilon_{1,k}) = \frac{B(B-1)}{n(n-1)}\cdot\frac{(n-B)(nB - 2n - 2B + 2)}{n(n-1)(n-2)}

when k ≠ j and k ≠ i. Hence, observing that E[h(X_1, X_2)h(X_1, X_3)] = σ_1²(h) + θ²(h), we obtain:

\mathbb{E}\left[ \mathrm{Var}(U_1(h) \mid S_n) \right] = \frac{2(n-B)(n+B-1)}{n(n-1)B(B-1)}\left( \sigma^2(h) + \theta^2(h) \right) - \frac{(n-B)(4nB - 6n - 6B + 6)}{n(n-1)B(B-1)}\,\theta^2(h) + \frac{4(n-B)(nB - 2n - 2B + 2)}{n(n-1)B(B-1)}\left( \sigma_1^2(h) + \theta^2(h) \right). \qquad (16)

Combining (14), (15) and (16), we get:

\mathrm{Var}(U_1(h)) = \frac{4\sigma_1^2(h)}{n} + \frac{2\sigma_2^2(h)}{n(n-1)} + \frac{2(n-B)(n+B-1)}{n(n-1)B(B-1)}\left( 2\sigma_1^2(h) + \sigma_2^2(h) + \theta^2(h) \right) - \frac{(n-B)(4nB - 6n - 6B + 6)}{n(n-1)B(B-1)}\,\theta^2(h) + \frac{4(n-B)(nB - 2n - 2B + 2)}{n(n-1)B(B-1)}\left( \sigma_1^2(h) + \theta^2(h) \right),

which simplifies to

\mathrm{Var}(U_1(h)) = \frac{4\sigma_1^2(h)}{B} + \frac{2\sigma_2^2(h)}{B(B-1)}.

Chebyshev's inequality permits to conclude.

A.3. Remark on the Term log(2/δ) in the Rate Bounds

In all results related to randomized versions (namely Propositions 1 and 3), the term log(2/δ) appears instead of log(1/δ). We point out that this limitation can easily be overcome by means of a more careful analysis in (5) and (13). Indeed, K and B have been chosen so that both exponential terms are equal to δ/2, but one could of course consider splitting the two terms into (1 − κ)δ and κδ for any κ ∈ ]0, 1[. This way, choosing

K = \left\lceil \log\!\left( \frac{1}{(1-\kappa)\delta} \right) \Big/ \left( 2(1/2 - \tau)^2 \right) \right\rceil \quad \text{and} \quad B = \left\lfloor 8\tau^2 n \Big/ \left( 9\log\!\left( \frac{1}{\kappa\delta} \right) \right) \right\rfloor

leads to log(1/(κδ)) instead.

A.4. Extension to Generalized U-statistics

As noticed in Remark 5, Propositions 2 and 3 have been established for U-statistics of degree 2, but remain valid for generalized ones. For clarity, we recall the definition of generalized U-statistics. An excellent account of the properties and asymptotic theory of U-statistics can be found in Lee (1990).

Definition 1. Let T ≥ 1 and (d_1, ..., d_T) ∈ N*^T. Let X_{{1,...,n_t}} = (X_1^{(t)}, ..., X_{n_t}^{(t)}), 1 ≤ t ≤ T, be T independent samples of sizes n_t ≥ d_t, composed of i.i.d. random variables taking their values in some measurable spaces X_t with distributions F_t(dx) respectively. Let H : X_1^{d_1} × ... × X_T^{d_T} → R be a measurable function, square integrable with respect to the probability distribution µ = F_1^{⊗d_1} ⊗ ... ⊗ F_T^{⊗d_T}. Assume in addition (without loss of generality) that H(x^{(1)}, ..., x^{(T)}) is symmetric within each block of arguments x^{(t)} valued in X_t^{d_t}, 1 ≤ t ≤ T. The generalized (or T-sample) U-statistic of degrees (d_1, ..., d_T) with kernel H is then defined as

U_{\mathbf{n}}(H) = \frac{1}{\prod_{t=1}^{T}\binom{n_t}{d_t}} \sum_{I_1} \ldots \sum_{I_T} H\!\left( \mathbf{X}_{I_1}^{(1)}, \ldots, \mathbf{X}_{I_T}^{(T)} \right),

where the symbol Σ_{I_t} refers to summation over all \binom{n_t}{d_t} subsets X_{I_t}^{(t)} = (X_{i_1}^{(t)}, ..., X_{i_{d_t}}^{(t)}) related to a set I_t of d_t indexes 1 ≤ i_1 < ... < i_{d_t} ≤ n_t, and n = (n_1, ..., n_T).

Within this framework, we aim at estimating θ(h) = E[H(X_1^{(1)}, ..., X_{d_1}^{(1)}, ..., X_1^{(T)}, ..., X_{d_T}^{(T)})], and an analog of the standard MoM estimator for generalized U-statistics can be defined as follows.

Definition 2. With the notation introduced in Definition 1, let 1 ≤ K ≤ min_t n_t/(d_t + 1). Partition each sample X^{(t)} into K blocks B_1^{(t)}, ..., B_K^{(t)} of sizes ⌊n_t/K⌋. Compute θ_k, the complete U-statistic based on B_k^{(1)}, ..., B_k^{(T)}, for 1 ≤ k ≤ K. The Median-of-Generalized-U-statistics is then given by θ_MoGU = median(θ_1, ..., θ_K).


Please note that this estimator is very different from that considered in Minsker & Wei (2018). Robust versions of U-statistics have already been considered in the literature (for the purpose of covariance estimation, rather than the design of statistical learning methods), from a completely different angle however. The U-statistic is viewed as an M-estimator minimizing a criterion involving the quadratic loss, and the proposed estimator is the M-estimator solving the same criterion except that a different loss function is used. This loss function is designed to induce robustness, while being close enough to the square loss for guarantees to be derived.

The definition above is closer to that given in Joly & Lugosi (2016). Indeed, in the particular setting of a 1-sample U-statistic of degree d_1, Definition 2 coincides with the diagonal blocks estimate mentioned on page 5 of Joly & Lugosi (2016). However, as noticed therein, this estimator only considers a small fraction of the possible d_t-tuples, namely those whose items all lie in the same block B_k^{(t)}. In order to overcome this limitation, an alternative strategy involving decoupled U-statistics is adopted in Joly & Lugosi (2016). The approach pursued here is rather to consider randomized U-statistics, which is another way to introduce variability into the tuples considered to build the estimator.

Proposition 4. Using the notation of Definitions 1 and 2, let n_min = min_t n_t, and let n, d be such that n/d = min_t n_t/d_t. Let δ ∈ [e^{1−2n/(9d)}, 1[. Choosing K = ⌈(9/2) log(1/δ)⌉, we have with probability at least 1 − δ:

\left| \theta_{\mathrm{MoGU}} - \theta(h) \right| \le \sqrt{ \frac{C_1 \log\frac{1}{\delta}}{n_{\min}} + \frac{C_2 \log^2\!\left(\frac{1}{\delta}\right)}{n_{\min}\left(2 n_{\min} - 9\log\frac{1}{\delta}\right)} },

with C_1 = 108σ_1²(h) and C_2 = 486σ_2²(h).

Proposition 5. Now let θ_MoRGU (Median-of-Randomized-Generalized-U-statistics) be the estimator of θ(h) such that the blocks B_k^{(t)} are no longer given by partitions of the samples X^{(t)}, but rather drawn by SWoR. For any τ ∈ ]0, 1/2[ and any δ ∈ [2e^{−8τ²n/(9d)}, 1[, choosing K = ⌈log(2/δ)/(2(1/2 − τ)²)⌉ and B_t = ⌊8τ²n_t/(9 log(2/δ))⌋, it holds with probability larger than 1 − δ:

\left| \theta_{\mathrm{MoRGU}} - \theta(h) \right| \le \sqrt{ \frac{C_1(\tau)\log\frac{2}{\delta}}{n_{\min}} + \frac{C_2(\tau)\log^2\!\left(\frac{2}{\delta}\right)}{n_{\min}\left(8\tau^2 n_{\min} - 9\log\frac{2}{\delta}\right)} },

with C_1(τ) = 27σ_1²(h)/(2τ³) and C_2(τ) = 243σ_2²(h)/(4τ³).

Proofs. The proofs are analogous to those of Propositions 2 and 3, except that concentration results for generalized U-statistics are used (see e.g. Hoeffding (1963)).

A.5. Proof of Theorem 1 (sketch of)

The proof follows the path of Theorem 2.11’s proof in Lugosi & Mendelson (2016), with adjustments every time U -statisticsare involved instead of standard means. The first one deals with the constants involved in the propositions.

Definition 3. Let $\lambda_Q(\kappa, \eta, h)$ and $\lambda_M(\kappa, \eta, h)$ be defined as in Lugosi & Mendelson (2016) (see Definitions 2.2 and 2.3 therein).

Definition 4. A difference however occurs on $r_E(\kappa, h)$ and $r_M(\kappa, h)$. Indeed, let
$$r_E(\kappa, h) = \inf\left\{ r : \mathbb{E}\sup_{u \in F_{h,r}} \left| \sqrt{\frac{2}{B(B-1)}} \sum_{i<j} \sigma_{i,j}\, u(X_i, X_j) \right| \le \kappa \sqrt{\frac{B(B-1)}{2}}\; r \right\},$$
and
$$r_M(\kappa, h) = \inf\left\{ r : \mathbb{E}\sup_{u \in F_{h,r}} \left| \sqrt{\frac{2}{B(B-1)}} \sum_{i<j} \sigma_{i,j}\, u(X_i, X_j)\, h(X_i, X_j) \right| \le \kappa \sqrt{\frac{B(B-1)}{2}}\; r^2 \right\}.$$

While the difference in $r_E(\kappa, h)$ is only due to the double summation, the change in $r_M(\kappa, h)$ also comprises the removal of $Y$, as the general framework for pairwise learning does not involve any label. As a consequence, any $Y$ in the older definitions is replaced by $0$. Moreover, one can consider a zero noise $W$ such that $\|W\|_{L_2} = 0 \le 1 = \sigma$. This way, all $\sigma$'s encountered in the older definitions and propositions can be replaced by $1$.


Lemma 1. For every $q > 2$ and $L \ge 1$, there are constants $B$ and $\kappa_0$ that depend only on $q$ and $L$ for which the following holds. If $\|h\|_{L_q} \le L\|h\|_{L_2}$ and $X_1, \ldots, X_B$ are independent copies of $X$, then
$$\mathbb{P}\left\{ \frac{2}{B(B-1)} \sum_{1 \le i < j \le B} |h(X_i, X_j)| \ge \kappa_0 \|h\|_{L_2} \right\} \ge 0.9.$$

Proof. The proof is analogous to that of Lemma 3.4 in Mendelson (2017), except that a version of the Berry-Esseen theorem for $U$-statistics (Callaert & Janssen, 1978) is used instead of the standard one.

Lemma 2. For every $q > 2$ and $L \ge 1$, there is a constant $\kappa_1$ that depends only on $q$ and $L$ for which the following holds. If $X_1, \ldots, X_B$ are independent copies of $X$, then
$$\mathbb{P}\left\{ \frac{2}{B(B-1)} \sum_{1 \le i < j \le B} |h(X_i, X_j)| \le \kappa_1 \|h\|_{L_2} \right\} \ge 0.9.$$

Proof. As $\left\{ \frac{2}{B(B-1)} \sum_{i<j} |h(X_i, X_j)| \ge \kappa_1 \|h\|_{L_2} \right\} \subset \left\{ \exists\, i < j,\ |h(X_i, X_j)| \ge \kappa_1 \|h\|_{L_2} \right\}$, Chebyshev's inequality gives
$$\mathbb{P}\left\{ \frac{2}{B(B-1)} \sum_{1 \le i < j \le B} |h(X_i, X_j)| \ge \kappa_1 \|h\|_{L_2} \right\} \le \frac{B(B-1)}{2}\, \mathbb{P}\left\{ |h(X_i, X_j)| \ge \kappa_1 \|h\|_{L_2} \right\} \le \frac{B(B-1)}{2\kappa_1^2}.$$
Since $B$ only depends on $q$ and $L$ (see the proof of Lemma 1), so does $\kappa_1$.

Proposition 6. There are constants $\kappa, \eta, B, c > 0$ and $0 < \alpha < 1 < \beta$ depending only on $q$ and $L$ for which the following holds. For a fixed $f^* \in F$, let $r^* = \max\{\lambda_Q(\kappa, \eta, H_{f^*}), r_E(\kappa, H_{f^*})\}$. For any $r \ge 2r^*$, with probability at least $1 - 2\exp(-cn)$, $\forall\, H_f \in H_F$:

• If $\Phi_S(f, f^*) \ge \beta r$, then $\beta^{-1}\Phi_S(f, f^*) \le \|H_f - H_{f^*}\|_{L_2} \le \alpha^{-1}\Phi_S(f, f^*)$.

• If $\Phi_S(f, f^*) \le \beta r$, then $\|H_f - H_{f^*}\|_{L_2} \le (\beta/\alpha)\, r$.

Proof. Using Lemma 1 and Lemma 2 with $h = H_f - H_{f^*}$, together with the union bound, it holds that, for every block $B_k$, one has with probability at least $0.8$
$$\kappa_0 \|H_f - H_{f^*}\|_{L_2} \le U_k(|H_f - H_{f^*}|) \le \kappa_1 \|H_f - H_{f^*}\|_{L_2}. \qquad (17)$$

Denoting by $I_k$ the indicator of this event, and by $\bar I_k = 1 - I_k$ its complement, we have $\mathbb{E}[\bar I_k] \le 0.2$. Moreover,

$$\mathbb{P}\left\{ \sum_{k=1}^K I_k \ge 0.7K \right\} = 1 - \mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^K \bar I_k \ge 0.3 \right\}.$$

When a MoU estimate is used, the $I_k$ are independent, since they are built on disjoint blocks, and the concentration of Binomial random variables allows to conclude. But interestingly, when a MoRU estimate is used, it is straightforward to see that the last term is exactly the same quantity as the one involved in (10). The same method can thus be used, since an upper bound on $\mathbb{E}[\bar I_k]$ is already available. Precisely, choosing $\tau = 0.25 < 0.3$ and recalling that $B = \lfloor n/K \rfloor$, it holds

$$\mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^K \bar I_k \ge 0.3 \right\} \le 2\exp\left(-2\,(0.05)^2 K\right).$$

So the number of blocks which satisfy (17) is larger than $0.7K$ with probability at least $1 - 2\exp(-c_1 K)$ for some positive constant $c_1$. The rest of the proof is similar to that of Proposition 3.2 in Lugosi & Mendelson (2016).


Proposition 7. Under the assumptions of Theorem 1, and using its notation, with probability at least $1 - 2\exp(-c_0 n \min\{1, r^2\})$, $\forall f \in F$, if $\Phi_S(f, f^*) \ge \beta r$ then $f^*$ defeats $f$. In particular $f^* \in H$, and $\forall f \in H$, $\Phi_S(f, f^*) = \|H_f - H_{f^*}\|_{L_2} \le \beta r$.

Proof. This proof carefully follows that of Proposition 3.5 in Lugosi & Mendelson (2016) (see Section 5.1 therein), so that only the changes induced by pairwise objectives are detailed here. As discussed in Definition 4, every $Y$ can be replaced by $0$, and every $\sigma$ by $1$. Attention must also be paid to the fact that, in the context of means, $m$, the cardinality of each block, is also equal to $\binom{m}{1}$, the number of possible 1-combinations. In our notation, it thus may sometimes be identified with $B$, the cardinality of the blocks, and sometimes with $B(B-1)/2$, the number of pairs per block.

Proof of pairwise Lemma 5.1. First, one may rewrite
$$Q_{f,g} = \frac{2}{B(B-1)} \sum_{i<j} \left( H_f(X_i, X_j) - H_g(X_i, X_j) \right)^2,$$
$$M_{f,g} = \frac{4}{B(B-1)} \sum_{i<j} \left( H_f(X_i, X_j) - H_g(X_i, X_j) \right) \cdot H_g(X_i, X_j),$$
and
$$R_k(u, t) = \left| \left\{ (i, j) \in B_k^2 : i < j,\ |u(X_i, X_j)| \ge t \right\} \right| = \sum_{i<j \,\in\, B_k^2} \mathbb{1}\{ |u(X_i, X_j)| \ge t \}.$$

Since the pairs are not all independent, even if the $X_i$'s are, one cannot directly use the proposed method. Instead, the Hoeffding inequality for $U$-statistics gives that the probability of each $R_k(H_f - H_{f^*}, \kappa_0 r)$ being greater than $B(B-1)\rho_0/4$ is at least $1 - \exp(-B\rho_0^2/4)$. For $\tau$ small enough, we still have that this probability is greater than $1 - \tau/12$. Aggregating the Bernoulli variables may be done in two ways. If we deal with a MoU estimate, the independence between blocks leads to the same conclusion. If a MoRU estimate is used instead, the remark made in Proposition 6's proof is again valid, and one can conclude.

The next difficulty arises with the bounded differences inequality for $\Psi$. If a MoU estimate is used, changing one sample $X_i'$ only affects one block, and generates a $1/K$ difference at most, exactly as with MoM, so that the bound holds in the same way. On the contrary, if a MoRU estimate is used, nothing prevents the replaced sample from contaminating several, possibly all $K$, blocks. The analysis of the MoRU behavior in that case is a bit trickier, and we restrict ourselves to MoU estimates for the matches.

The end of the proof uses a standard symmetrization argument. This kind of argument still applies to $U$-statistics (see e.g. p. 150 of Pena & Gine (1999)), and the proof is thus completed in the pairwise setting.

Proof of pairwise Lemma 5.2.
$$\begin{aligned}
\mathbb{P}\left\{ \left| \frac{2}{B(B-1)} \sum_{i<j} U_{i,j} - \mathbb{E}U \right| \ge t \right\}
&\le \frac{2}{B(B-1)\,t}\, \mathbb{E}\left| \sum_{i<j} \left( U_{i,j} - \mathbb{E}U \right) \right| \\
&\le \frac{2}{B(B-1)\,t}\, \sqrt{ \mathbb{E}\left| \sum_{i<j} \left( U_{i,j} - \mathbb{E}U \right) \right|^2 } \\
&\le \frac{2}{B(B-1)\,t}\, \sqrt{ \sum_{i<j,\ k<l} \left( \mathbb{E}[U_{i,j} U_{k,l}] - (\mathbb{E}U)^2 \right) } \\
&\le \frac{2}{B(B-1)\,t}\, \sqrt{ \frac{B(B-1)}{2}\left( \mathbb{E}[U^2] - (\mathbb{E}U)^2 \right) + B(B-1)(B-2)\,\sigma_1^2 } \\
&\le \frac{\sqrt{2(B-2)}}{\sqrt{B(B-1)}\, t}\, \|U\|_{L_2},
\end{aligned}$$
so that
$$\mathbb{P}\left\{ \left| \frac{2}{B(B-1)} \sum_{i<j} U_{i,j} - \mathbb{E}U \right| \ge t \right\} \le \frac{\sqrt{2}}{\sqrt{B}\, t}\, \|U\|_{L_2}.$$


After that, every case needing a pairwise investigation has already been treated earlier in this section: Binomial concentration, the bounded differences inequality, and symmetrization arguments. This proves Proposition 7.

Theorem 1’s proof. Proposition 6 and Proposition 7 gives that if any f ∈ F wins all its matches, then with probability atleast 1− 2 exp(−c0nmin{1, r2}) ‖Hf −Hf∗‖L2

≤ cr. Hence it also holds with the same probability:

R(f)−R(f∗) = ‖Hf‖L2− ‖Hf∗‖L2

≤ ‖Hf −Hf∗‖L2≤ cr.

Although this extension to the pairwise learning framework deals with any loss $\ell$, it is important to notice that the extension of Theorem 2.11 in Lugosi & Mendelson (2016) is applied to $H_f$, and not to $f$ directly. $H_f$ is penalized via the quadratic loss (as in Lugosi & Mendelson (2016)), so that the theorem applies, up to technicalities induced by the $U$-statistics. Doing so, one achieves a control on $\|H_f - H_f^*\|_{L_2}$ (as in Lugosi & Mendelson (2016)), which is equal to $\|H_f - H_{f^*}\|_{L_2}$ thanks to the remark stated in the first paragraph of Subsection 3.3. This quantity happens to be greater than the excess risk of $f$, hence the conclusion. Formally, the tournament procedure outputs a function $H_f$, and one has to recover the corresponding $f$. Knowing the dependence between $f$ and $H_f$, and with the ability to evaluate $H_f$, which is known, on any pair, this last step should not be too difficult.

Regarding the extension to pairwise learning, one should keep in mind that the general framework does not involve any target $Y$. Instead, one directly seeks to minimize
$$\ell(f, X, X') = \left( \sqrt{\ell(f, X, X')} \right)^2 = \left( \sqrt{\ell(f, X, X')} - 0 \right)^2 = \left( H_f(X, X') - 0 \right)^2.$$
We recover the setting of Theorem 2.11 in Lugosi & Mendelson (2016): quadratic loss, with $Y = 0$, for the decision function $H_f$. The only novelty to address here is the fact that $H_f$ depends on two random variables $X$ and $X'$. And this is precisely what has been done in this subsection.


B. More Numerical Results

B.1. MoRM Estimation Results

Table 3. Quadratic Risks for the Mean Estimation, δ = 0.001

                    NORMAL (0, 1)        STUDENT (3)          LOG-NORMAL (0, 1)    PARETO (3)
MoM                 0.00149 ± 0.00218    0.00410 ± 0.00584    0.00697 ± 0.00948    1.02036 ± 0.06115
MoRM 1/6, SWoR      0.01366 ± 0.01888    0.02947 ± 0.04452    0.06210 ± 0.07876    1.12256 ± 0.14970
MoRM 1/6, MC        0.01370 ± 0.01906    0.02917 ± 0.04355    0.06167 ± 0.07143    1.13058 ± 0.14880
MoRM 3/10, SWoR     0.00255 ± 0.00361    0.00602 ± 0.00868    0.01241 ± 0.01610    1.05458 ± 0.07041
MoRM 3/10, MC       0.00264 ± 0.00372    0.00622 ± 0.00895    0.01283 ± 0.01650    1.05625 ± 0.07298
MoRM 9/20, SWoR     0.00105 ± 0.00148    0.00264 ± 0.00372    0.00497 ± 0.00668    1.02802 ± 0.04903
MoRM 9/20, MC       0.00105 ± 0.00146    0.00265 ± 0.00374    0.00499 ± 0.00673    1.02985 ± 0.04880

[Figure 1. Empirical Quantiles for the Different Mean Estimators on 4 Laws: four panels (Normal(0, 1), Student(3), Log-normal(0, 1), Pareto(3)) plotting the empirical quantile q_{1−δ}(|θ̂ − θ|) against δ ∈ [10^{-6}, 10^{-1}] for MoM and the MoRM variants (τ ∈ {1/6, 3/10, 9/20}, SWoR and MC sampling).]

The empirical quantiles confirm the quadratic risk results: the τ parameter is crucial, making MoRM either the worst or the best estimate depending on its value. The sampling scheme does not affect the performance much, even though the MC scenario is much more complex to analyze theoretically.
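For reference, the type of Monte-Carlo evaluation behind Table 3 and Figure 1 can be sketched as follows; the sample size, the number of repetitions and the plain MoM estimator shown here are placeholders rather than the exact experimental configuration.

```python
import numpy as np

def evaluate(estimator, sampler, true_theta, n=1000, n_rep=2000, delta=1e-3):
    """Monte-Carlo estimate of the quadratic risk of an estimator and of the
    empirical (1 - delta)-quantile of its absolute error."""
    errors = np.array([estimator(sampler(n)) - true_theta for _ in range(n_rep)])
    return float(np.mean(errors ** 2)), float(np.quantile(np.abs(errors), 1 - delta))

def mom(x, K=30):
    """Plain Median-of-Means with K blocks (the baseline row of Table 3)."""
    blocks = np.array_split(np.random.permutation(len(x)), K)
    return float(np.median([x[idx].mean() for idx in blocks]))

# Student(3) data: the target is the mean, equal to 0.
risk, quant = evaluate(mom, lambda n: np.random.standard_t(df=3, size=n), 0.0)
print(risk, quant)
```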


B.2. MoRU Estimation Results

Table 4. Quadratic Risks for the Variance Estimation, δ = 0.001

                    NORMAL (0, 1)        STUDENT (3)          LOG-NORMAL (0, 1)    PARETO (3)
MoU 1/2; 1/2        0.00409 ± 0.00579    1.72618 ± 28.3563    2.61283 ± 23.5001    1.35748 ± 36.7998
MoU partition       0.00324 ± 0.00448    0.38242 ± 0.31934    1.62258 ± 1.41839    0.09300 ± 0.05650
MoRU SWoR           0.00504 ± 0.00705    0.51202 ± 3.88291    2.01399 ± 4.85311    0.09703 ± 0.07116
MoIU 1/6, SWoR      0.00206 ± 0.00285    1.78161 ± 34.7216    2.50529 ± 21.8989    1.37800 ± 40.1308
MoIU 1/6, MC        0.00205 ± 0.00281    1.65481 ± 26.2157    2.61701 ± 24.7918    1.50578 ± 42.9135
MoIU 3/10, SWoR     0.00216 ± 0.00301    1.13887 ± 16.9511    2.07136 ± 14.8312    0.85041 ± 21.9916
MoIU 3/10, MC       0.00211 ± 0.00288    1.22402 ± 17.4715    2.16590 ± 15.2378    0.89035 ± 22.2866

[Figure 2. Empirical Quantiles for the Different Variance Estimators on 4 Laws: four panels (Normal(0, 1), Student(3), Log-normal(0, 1), Pareto(3)) plotting the empirical quantile q_{1−δ}(|θ̂ − θ|) against δ ∈ [10^{-6}, 10^{-1}] for MoCU (partition and SWoR blocks) and MoIU 3/10 (SWoR and MC sampling).]

The partitioning MoU seems to outperform every other estimate. One explanation is that an extreme value can corrupt only one block under this scheme, whereas the randomized versions can suffer from it in several blocks.


B.3. MoRU Learning Results: a Metric Learning Application

[Figure 3. Gradient Descent Convergences for Normal and Contaminated Datasets: two panels plotting the cost against the gradient epoch (0 to 1000) for standard mini-batches and MoCU mini-batches, on the normal dataset and on the (artificially) contaminated dataset.]

In this experiment, we aim to learn, from points that are known to be close or not, a distance that fits them, over the set of all Mahalanobis distances, i.e. $d(x, y) = \sqrt{(x - y)^\top M (x - y)}$ for some positive definite matrix $M$. This is done through mini-batch gradient descent over the parameter $M$. On the normal iris dataset (top), we see that standard mini-batches perform well, while MoRU mini-batches induce a much slower convergence. But, in the spirit of Lecue & Lerasle (2017), the experiment is also run on an (artificially) corrupted dataset (bottom). This highlights how well MoM-like estimators can behave in the presence of outliers. Indeed, although the convergence with the MoRU mini-batches remains slower than with the standard ones in the normal regime, they avoid the peaks presumably caused by the presence of one (or more) outliers in the mini-batch. This makes them a very interesting alternative in the context of highly corrupted data. Experiments run on the same dataset for clustering purposes show the same behavior.


C. Alternative Sampling Schemes - Possible Extensions

As pointed out in Remark 1 and in the discussion at the end of Section 3, the approaches investigated in the present paper could be implemented with sampling procedures different from the SRSWoR scheme. It is the purpose of this section to review possible alternatives and discuss the technical difficulties inherent in the study of their performance.

C.1. MoRM using a Sampling with Replacement Scheme

Rather than forming $K \ge 1$ data blocks of size $B \le n$ from the original sample $S_n$ by means of a SRSWoR scheme, one could use a Monte-Carlo procedure and independently draw $K$ times an arbitrary number $B$ of observations with replacement. In this case, each data block $B_k$ is characterized by a random vector $e_k = (e_{k,1}, \ldots, e_{k,B})$ independent from $S_n$, where, for each draw $b \in \{1, \ldots, B\}$, $e_{k,b} = (e_{k,b}(1), \ldots, e_{k,b}(n))$ is a multinomial random vector in $\{0, 1\}^n$ indicating the index of the observation randomly selected: for any $i \in \{1, \ldots, n\}$, $e_{k,b}(i) = 1$ if $i$ has been chosen, $e_{k,b}(i) = 0$ otherwise, and $\mathbb{P}\{e_{k,b}(i) = 1\} = 1/n$ (notice also that $\sum_{i=1}^n e_{k,b}(i) = 1$ with probability one). In this case, the empirical mean based on block $B_k$ can be written as
$$\theta_k = \frac{1}{B} \sum_{b=1}^B \langle e_{k,b}, Z_n \rangle,$$

where $Z_n = (Z_1, \ldots, Z_n)$ and $\langle \cdot, \cdot \rangle$ is the usual Euclidean scalar product on $\mathbb{R}^n$. Conditioned upon $S_n$, the $\theta_k$'s are i.i.d. and we have $\mathbb{E}[\theta_1 \mid S_n] = \bar\theta_n$, the full-sample empirical mean, as well as
$$\mathrm{Var}(\theta_1 \mid S_n) = \frac{1}{B}\left( \frac{1}{n}\sum_{j=1}^n Z_j^2 - \bar\theta_n^2 \right).$$

The corresponding variant of the MoM estimator is
$$\theta_{\mathrm{MoRM}} = \mathrm{median}(\theta_1, \ldots, \theta_K).$$
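As a rough illustration (not the exact experimental code), the Monte-Carlo variant just described, and its SWoR counterpart, could be implemented as follows; K, B and the data distribution are arbitrary.

```python
import numpy as np

def morm(z, K, B, scheme="mc", seed=None):
    """Median-of-Randomized-Means: draw K blocks of size B from the sample z,
    either with replacement ('mc', the scheme of this subsection) or without
    replacement ('swor'), and return the median of the K block means."""
    rng = np.random.default_rng(seed)
    means = [z[rng.choice(len(z), size=B, replace=(scheme == "mc"))].mean()
             for _ in range(K)]
    return float(np.median(means))

z = np.random.lognormal(size=2000)      # heavy-tailed data, true mean exp(1/2)
print(morm(z, K=45, B=300, scheme="mc"), morm(z, K=45, B=300, scheme="swor"))
```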

Observe that this estimation procedure offers greater flexibility, insofar as both $K$ and $B$ can be chosen arbitrarily. In addition, the variance of the block estimators $\theta_k$ is
$$\mathrm{Var}(\theta_1) = \mathrm{Var}\left( \mathbb{E}[\theta_1 \mid S_n] \right) + \mathbb{E}\left[ \mathrm{Var}(\theta_1 \mid S_n) \right] = \left( \frac{1}{n} + \frac{1}{B}\left( 1 - \frac{1}{n} \right) \right)\sigma^2.$$

It is comparable to the variance of a SWoR block mean with the same block size $B \le n$, although always larger: the difference between the two is exactly $(1 - 1/B)\,\sigma^2/n \ge 0$. However, investigating the accuracy of $\theta_{\mathrm{MoRM}}$ is challenging, since it is far from straightforward to study the concentration properties of the random quantity
$$U_n^\varepsilon = \mathbb{P}\left\{ |\theta_1 - \theta| > \varepsilon \mid S_n \right\}.$$

Even if Chebyshev's inequality yields
$$\mathbb{E}[U_n^\varepsilon] = p_\varepsilon \le \left( \frac{1}{n} + \frac{1}{B}\left( 1 - \frac{1}{n} \right) \right)\frac{\sigma^2}{\varepsilon^2},$$

the r.v. $U_n^\varepsilon$ is a complex functional of the original data $S_n$, due to the possible multiple occurrences of a given observation in a single sample obtained through sampling with replacement. In particular, in contrast to its SWoR counterpart, it is not a $U$-statistic of degree $B$. Viewing it as a function of the $n$ i.i.d. random variables $Z_1, \ldots, Z_n$, and observing that changing the value of any of them can change its value by at most $1 - (1 - 1/n)^B \le B/n$, the bounded difference inequality (see McDiarmid (1989)) only gives: $\forall t > 0$,
$$\mathbb{P}\left\{ U_n^\varepsilon - p_\varepsilon \ge t \right\} \le \exp\left( -2\,\frac{n}{B^2}\,t^2 \right), \qquad (18)$$

while the analogous probability for the SWoR counterpart satisfies a bound of the form $\exp(-2(n/B)t^2)$. Hence, the bound (18) is not sharp enough to establish guarantees for the estimator $\theta_{\mathrm{MoRM}}$ similar to those stated in Proposition 1.


C.2. Medians-of-Incomplete U-statistics

The concept of incomplete $U$-statistic (see Blom (1976)) addresses the computational issue raised by the calculation of the $U$-statistic (2), which involves the summation of $O(n^2)$ terms, so as to achieve a trade-off between scalability and variance reduction. In one of its simplest forms, it consists in selecting a subsample of size $M$ by sampling with replacement (i.e. a Monte-Carlo scheme) in the set $\Lambda = \{(i, j) : 1 \le i < j \le n\}$ of all pairs of observations that can be formed from the original sample. Denoting by $\{(i_1, j_1), \ldots, (i_M, j_M)\} \subset \Lambda$ the subsample thus drawn, the incomplete version of the $U$-statistic (2) is:
$$U_M(h) = \frac{1}{M} \sum_{m=1}^M h(X_{i_m}, X_{j_m}).$$

As can be easily shown, (C.2) is an unbiased estimator of $\theta$. More precisely, its conditional expectation given $S_n$ is equal to $U_n(h)$ and its variance is
$$\mathrm{Var}\left( U_M(h) \right) = \left( 1 - \frac{1}{M} \right)\mathrm{Var}\left( U_n(h) \right) + \frac{\sigma^2(h)}{M}.$$

Repeating the sampling procedure independently $K$ times in order to compute incomplete $U$-statistics $U_{M,1}(h), \ldots, U_{M,K}(h)$ (conditionally independent given the original dataset $S_n$), one may consider the Median-of-Incomplete-$U$-statistics:
$$\theta_{\mathrm{MoIU}}(h) = \mathrm{median}\left( U_{M,1}(h), \ldots, U_{M,K}(h) \right).$$
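A possible implementation of this estimator for a degree-two, symmetric kernel could look as follows; M, K and the kernel are illustrative, and the pair sampling follows the Monte-Carlo scheme described above.

```python
import numpy as np

def incomplete_u(x, h, M, rng):
    """Incomplete U-statistic: average of h over M pairs of distinct indices
    drawn with replacement from the set of all pairs (symmetric kernel assumed)."""
    n = len(x)
    i = rng.integers(0, n, size=M)
    j = rng.integers(0, n - 1, size=M)
    j = np.where(j >= i, j + 1, j)          # guarantees j != i for every draw
    return float(np.mean(h(x[i], x[j])))

def moiu(x, h, M, K, seed=0):
    """Median of K conditionally independent incomplete U-statistics."""
    rng = np.random.default_rng(seed)
    return float(np.median([incomplete_u(x, h, M, rng) for _ in range(K)]))

# Illustration: robust variance estimation, kernel h(x, y) = (x - y)^2 / 2.
x = np.random.standard_t(df=3, size=5000)
print(moiu(x, lambda a, b: 0.5 * (a - b) ** 2, M=2000, K=30))
```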

Although the variances of the $U_{M,k}(h)$'s are smaller than those of the complete $U$-statistics built on the blocks (whether by partition or by SWoR) for the same number of pairs involved in the computation of each estimator, the argument underlying Proposition 3's proof cannot be adapted to the present situation, because the concentration properties of the random quantity
$$W_n^\varepsilon = \mathbb{P}\left\{ \left| U_M(h) - \theta(h) \right| > \varepsilon \mid S_n \right\},$$
which has mean
$$q_\varepsilon = \mathbb{P}\left\{ \left| U_M(h) - \theta(h) \right| > \varepsilon \right\} \le \frac{\mathrm{Var}\left( U_M(h) \right)}{\varepsilon^2} = O\!\left( \frac{1}{M\varepsilon^2} \right),$$

are very difficult to study, due to the complexity of the data functional (C.2). As in SM C.1, a straightforward application of the bounded difference inequality gives: $\forall t > 0$,

$$\mathbb{P}\left\{ W_n^\varepsilon - q_\varepsilon \ge t \right\} \le \exp\left( -2\,\frac{n}{M^2}\,t^2 \right).$$

Obviously, this bound is not sharp enough to yield bounds of the same order as those in Propositions 2 and 3.

Alternative sampling schemes. Other procedures than the Monte-Carlo scheme above can of course be considered to build a subset of all pairs of observations and compute an incomplete $U$-statistic. Refer to Lee (1990) or to subsection 3.4 in Clemencon et al. (2016) for further details. When selecting a subsample of size $M \le n(n-1)/2$ by simple random sampling without replacement (SRSWoR) in the set of all pairs of observations that can be formed from the original sample, the version of (C.2) obtained has conditional expectation and variance:
$$\mathbb{E}\left[ U_M(h) \mid X_1, \ldots, X_n \right] = U_n(h), \qquad \mathrm{Var}\left( U_M(h) \mid X_1, \ldots, X_n \right) = \frac{\binom{n}{2} - M}{\binom{n}{2}} \times \frac{V_n^2(h)}{M},$$
where
$$V_n^2(h) = \frac{1}{\binom{n}{2} - 1} \sum_{i<j} \left( h(X_i, X_j) - U_n(h) \right)^2.$$

Hence, its variance can be expressed as
$$\mathrm{Var}\left( U_M(h) \right) = \mathrm{Var}\left( U_n(h) \right) + \frac{\binom{n}{2} - M}{\binom{n}{2} - 1} \times \frac{1}{M}\left( \sigma^2(h) + \mathrm{Var}\left( U_n(h) \right) \right) = O(1/M).$$


In contrast with the situation investigated in subsection 3.1, the conditional expectation $W_n^\varepsilon$ cannot be viewed as a $U$-statistic of degree $M$, insofar as the $n(n-1)/2$ variables $\{h(X_i, X_j) : i < j\}$ are not i.i.d. r.v.'s. Here as well, the bounded difference inequality is not sufficient to get bounds of the same order as those obtained in Propositions 2 and 3, the jumps
$$1 - \binom{(n-1)(n-2)/2}{M} \bigg/ \binom{n(n-1)/2}{M}$$
being at least of order $M/n$, as Stirling's formula shows.