THEORY AND PRACTICE OF ACTIVE LEARNING RON BEGLEITER Technion - Computer Science Department - Ph.D. Thesis PHD-2013-04 - 2013


THEORY AND PRACTICE OF ACTIVE LEARNING

RON BEGLEITER


THEORY AND PRACTICE OF ACTIVE LEARNING

RESEARCH THESIS

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

RON BEGLEITER

SUBMITTED TO THE SENATE OF THE TECHNION — ISRAEL INSTITUTE OF TECHNOLOGY

ADAR 5773        HAIFA        MARCH 2013


THIS RESEARCH THESIS WAS DONE UNDER THE SUPERVISION OF ASSIST. PROF. NIR AILON AND ASSOC. PROF. RAN EL-YANIV IN THE DEPARTMENT OF COMPUTER SCIENCE

ACKNOWLEDGMENTS

First and foremost, I am grateful to my doctoral advisors Prof. Nir Ailon and Prof. Ran El-Yaniv.

They both taught me how to do research, how to ask the right questions and how to answer them, and how to do both mathematics and empirical assessment. Their wisdom, creativity, integrity, humility, and generosity will continue to inspire me. I was indeed fortunate to have them as my advisors.

Nir was always two steps ahead of me, and it was he who paved the way to the successful completion of this work. I am deeply grateful to him for sharing with me his vast knowledge, intuition, and ingenious observations, as well as for his endless support and encouragement. For all these reasons, Nir is a genuine role model and the ideal advisor — I couldn’t ask for better!

Ran trained and guided me from the very first stages of my research, and to him I owe the very DNA of my work. In fact, without Ran, my work could never have come into being. I thank him deeply for committing precious time and energy to my project and to me, for always keeping my sights trained on the big picture, and for setting the highest possible standards and demonstrating, again and again, that we could attain them. I was privileged to learn from such a true master.

I am in debt to my co-authors, Dr. Esther Ezra and Dr. Dmitry Pechyony. Esther, who also shared her deep knowledge and expertise with me, provided me with the ideas and the tools that turned out to be critical for the development of the main theme of this thesis. I thank her greatly for her contribution.

I started collaborating with Dmitry in the early stages of my doctoral studies, when he and I were both students. I thank him for inspiring me with his wisdom, for being a friend, and for teaching me how fun research collaboration can be.

I am thankful to Prof. Masashi Sugiyama for hosting me at the Tokyo Institute of Technology in the summer of 2008. I thank friends and colleagues who assisted along the way: Amit, Meir, Dany Dorr, Marko Jankovic, Gilad Kutiel, Simcha Lechtman, Masashi Sugiyama, Ryota Tomioka, Reut Tsarfaty, Liwei Wang, Yair Wiener, and David Yanay. I thank Shahar Mendelson for encouraging me when I needed it the most. I am thankful to Itai Brickner and Stas Krichevsky from Kontera Technologies for sharing the enjoyment and fun of doing applied machine learning — I have learned a lot from both of them.

I am thankful to PASCAL – European Network of Excellence, Masashi Sugiyama, the organizers of MLSS’06, and the organizers of COLT’12 for their financial support.

I deeply thank my grandfather, Ernst; my parents, David and Hanna; my brother, Shy; and my sister, Tal, for their support, love, and care. This thesis is dedicated to them.

List of publications. (I have been a major contributor in all of the following publications.)

1. N. Ailon, R. Begleiter, and E. Ezra. Active Learning Using Smooth Relative Regret Approximations with Applications. Submitted for publication to Journal of Machine Learning Research, 2012. (Impact factor: 4.97; rank in Artificial Intelligence: 4/114.)

2. N. Ailon, R. Begleiter, and E. Ezra. Active Learning Using Smooth Relative Regret Approximations with Applications. In Proceedings of the 25th Annual Conference on Learning Theory. JMLR Workshop and Conference Proceedings, volume 23, 2012. (Received COLT’s best student paper award.)

3. R. Begleiter, R. El-Yaniv, and D. Pechyony. Repairing Self-Confident Active-Transductive Learners Using Systematic Exploration. Pattern Recognition Letters, 29(9):1234–1251, 2008. (Impact factor: 2.32; rank in Artificial Intelligence: 40/114.)

Remarks:

∗ Additional publication: N. Ailon, R. Begleiter, and E. Ezra. Selective Sampling with Almost Optimal Guarantees for Learning to Rank from Pairwise Preferences. NIPS Workshop on Choice Models and Preference Learning (CMPL2011), 2011.


∗ For journal scores, I used: SCImago. (2007). SJR – SCImago Journal & Country Rank. Retrieved December 2012, from http://www.scimagojr.com

THE GENEROUS FINANCIAL HELP OF THE TECHNION IS GRATEFULLY ACKNOWLEDGED


Contents

Abstract 1

1 Introduction 5
  1.1 Problem Definition 7
  1.2 Proposed Solution 7
  1.3 Major Contributions 7
  1.4 Thesis Outline 9

2 Brief Introduction to Active Learning 11
  2.1 Settings and Notations 11
  2.2 Theory of Active Learning 13
    2.2.1 Complexity Terms 13
    2.2.2 Alternative Settings and Guarantees 16
  2.3 Algorithmic Ideas 17
    2.3.1 Sampling by Uncertainty 17
    2.3.2 Utility Maximization (Error Minimization) 19
    2.3.3 Online Weighting 20
    2.3.4 Other Ideas 21

3 The Method of Smooth Relative Regret Approximations 23
  3.1 Smooth Relative Regret Approximations (SRRA) 23
  3.2 Constant Uniform Disagreement Coefficient Implies Efficient SRRAs 27
    3.2.1 The Construction 28
  3.3 Convex Relaxations 32
  3.4 Discussion 33

4 Active Preference-based Ranking Using SRRAs 35
  4.1 Problem Definition 36
  4.2 The Necessity of Active Learning 38
  4.3 Disagreement Coefficient Arguments Are Not Sufficient for Effective Active Learning 39
  4.4 Better SRRA for LRPP 40
  4.5 LRPP over Linearly Induced Permutations in Constant-Dimensional Feature Space 47
  4.6 Heuristics for Optimizing the SRRAs 49
  4.7 Empirical Proof of Concept 50
    4.7.1 Synthetic Experiments 51
    4.7.2 Real Data Experiments 54
    4.7.3 The Combinatorial Experiment 55
    4.7.4 A Note on Related Algorithms and Datasets 58
    4.7.5 A Note on Misconceptions and Beliefs 59
  4.8 Discussion 60

5 SRRA for Clustering with Side Information 63
  5.1 Problem Definition 64
  5.2 Passive Learning Is Not Useful 66
  5.3 Disagreement Coefficient Arguments Are Not Sufficient for Effective Active Learning 66
  5.4 Better SRRA for Semi-Supervised k-Clustering 67
    5.4.1 Setup 68
    5.4.2 Simple Case 69
    5.4.3 More Involved Case 70
    5.4.4 Conclusion: f is an ε-SRRA 75
  5.5 Hierarchical Correlation Clustering 76
    5.5.1 Definitions and Notations 77
    5.5.2 SRRA for Learning Shallow Ultrametrics 80
  5.6 Discussion 82

6 Active Exploration for Graph-based Learning: Clustering with Side Information is a Big Plus 83
  6.0.1 Motivating Example, and a Preview 85
  6.1 Problem Definition 85
  6.2 A Review of the Graph-Based Transductive Algorithms We Use 87
  6.3 An Exploration-Exploitation Routine and Its Implementation 88
    6.3.1 Implementation of QA and SW 88
    6.3.2 On Some Known and Some New Querying Components 90
  6.4 Empirical Evaluation 91
    6.4.1 Datasets and Experimental Setting 91
    6.4.2 The Efficiency of +EXPLORE 93
    6.4.3 The Advantage of Adaptive Exploration 94

A Standard Concentration to the Mean Bounds 97

Hebrew Abstract i


List of Figures

2.1 Pool-based active learning protocol. 12
2.2 Stream-based active learning protocol. 13
3.1 Illustration of Disagreement-SRRA Elements 29
4.1 Maximal distance by a single pair inversion 39
4.2 Illustration: Sampling Ru,i 41
4.3 Example: Arrangement and Duality View of X and H 48
4.4 Synthetic model-based preferences. We permuted the rows and columns on the basis of the w-induced order, resulting in a Y. A perfect noiseless ranking, which, contrary to the upper-right and lower-left triangles depicted here, would have appeared as perfect solid-dark and solid-light triangles, respectively. 52
4.5 Results for synthetic datasets averaged over 10 runs, accompanied by standard error of the mean. 53
4.6 Example for an Amazon Turk HIT that provides a label for the COST dataset. 55
4.7 Three crowdsourcing preferences. A dark-colored Wu,v depicts that the row alternative u is preferred over the column alternative v. A fully sortable, noiseless preference would have been represented here as solid dark upper triangles, representing anti-symmetric matrices. 56
4.8 Comparing SRRA and random samplers. The first row corresponds to the COMBI-SVM + CLIMB solver; the second row to CLIMB; and the last to SVM. The first column corresponds to COST, the second to WINE, and the last to DATE. Each result is an average of 10 runs along with the standard error of the mean. 57
4.9 Example of a recipe text. 58
4.10 Ranking by preference-majority versus local-improvement. The dark color illustrates that the row alternative is preferred over that of the corresponding column. The REF ranking corresponds to row indices, where the top and bottom rows are most and least preferred, respectively. 59
5.1 Semi-Supervised Clustering as a Graph Problem 65
5.2 An example of an ultrametric tree with ℓ = k = 3. Let vi denote the vertex indexed with i ∈ [27]. Observe that τ(v25, v27) = 1, τ(v22, v1) = 3 is the maximal distance, and V(Rτ,2) = {C1, C2, C3} defines the tree’s first level. 78
6.1 Motivating example. 85
6.2 +EXPLORE routine. 89
6.3 Comparing the three methods: RAND, SELF-CONF, and +EXPLORE. Each point in each axis comprises two mean error results of two methods (in the x-axis and the y-axis) over one of the datasets for a training size m = 50. 93
6.4 The effect of dynamic exploration: Comparing the learning (error) curves of SELF-CONF with SELF-CONF+EXPLORE. Queries by exploration (using QA) are indicated by dark dots. 94


List of Tables

4.1 The number of cycles in the different crowdsource datasets. We present the number and size of strongly connected components (scc’s). 60
6.1 The error (%) of the “best” representatives of the +EXPLORE, Q, and P methods. The lowest error in each row (over all columns) appears in bold font. 92


Abstract

The active-learning model is concerned with ways to expedite the learning phase through interaction with the labeling process. In a pool-based active-learning setting, the learning algorithm receives a set of unlabeled examples, as well as access to an oracle that contains the full labeling information on that particular set. The learner’s goal is to produce a nearly optimal hypothesis, while requiring the minimum possible number of interactions with the oracle.

In this thesis, we present a novel smoothness condition over empirical risk error estimators and show its usefulness for active pool-based learning. Instead of estimating the risk directly, we target regrets relative to pivot hypotheses. We show that such smooth relative regret estimators yield an active-learning algorithm that converges to a competitive solution at a fast rate.

We show three specific constructions of such smooth estimators. The first is obtained when the only assumptions made are bounds on the disagreement coefficient and the VC dimension. This leads to an active-learning algorithm that almost matches the best-known algorithms that use the same assumptions. On the other hand, we present two problems of vast interest, for which a direct analysis of the relative regret function produces state-of-the-art learning strategies. The two problems we study are concerned with learning relations over a ground set, where one problem deals with order relations and the other with equivalence relations (with a bounded number of equivalence classes).

Our smoothness condition, we argue, calls for sampling methods that are carefully biased in a way that incorporates exploration of the entire hypothesis space, along with exploitation of a current candidate solution. Following this idea, we present a heuristic that enhances any active-learning algorithm with systematic explorations. We show that this heuristic significantly improves leading active-learning heuristics within a graph-based transductive setting.


Abbreviations and Notations

ε        Error approximation parameter
π, σ     Permutations over a finite set V
θ        The universal disagreement coefficient
ALG      Algorithm ALG, e.g., SVM
H        Hypothesis class, the set of feasible hypotheses
d        Either the VC dimension or the geometric dimension (i.e., of R^d); context dependent
D        Distribution over instance-label pairs, (X, Y) ∼ D
D_X      Marginal distribution over the instance space X
dist     Distance function over hypotheses, dist : H × H → R
d_SF     Spearman’s Footrule distance
E[·]     Expectation
er_D     Error rate, er_D(h) = E_{X∼D}[h(X) ≠ Y(X)]
ν        The noise rate, ν = inf_{h∈H} er_D(h)
ERM      Empirical Risk Minimization
h        Hypothesis in H
h*       The optimal hypothesis, s.t. er_D(h*) = ν
LRPP     Learning to Rank from Pairwise Preferences
P[·]     Probability
PTAS     Polynomial Time Approximation Scheme
R        The set of real numbers
reg_h    Relative regret function with respect to hypothesis h
SRRA     Smooth Relative Regret Approximation
V        A finite set of n elements
v, u     Elements in V
u        An embedding of u ∈ V in R^d (for some fixed d)
VC       Vapnik–Chervonenkis (as in VC dimension)
X        Instance space
Y        Label space; here it is mostly binary: Y = {0, 1}
Y(X)     Labeling function (deterministic), Y : X → Y


Chapter 1

Introduction

Efficient utilization of unlabeled examples during a learning process can be very advantageous when using limited labeling to construct accurate classifiers. The abundance of information led by “big data” phenomena like cloud computing, personalization, and the social web has motivated the research community to make considerable efforts in this direction. The three most prominent approaches for achieving this goal are semi-supervised, transductive, and active learning. Semi-supervised and transductive learning are close in nature to standard PAC learning, and mainly deal with ways to benefit from the side information of unlabeled examples. Here we are concerned with active learning, a model that studies ways to accelerate learning by acquiring labels in interactive ways.

Unlike standard supervised learning, in active learning the learner chooses which instances to learn from. In the streaming setting, the learner may reject labels of instances that arrive in a stream; in the pool setting, the learner may collect a pool of instances and then choose a subset from which to request labels. Although it is a relatively young field compared to traditional (passive) learning, there is by now a significant body of literature on the subject (see, e.g., Cohn et al., 1994; Freund et al., 1997; Dasgupta, 2005; Castro et al., 2005; Kaariainen, 2006; Balcan et al., 2006; Sugiyama, 2006; Hanneke, 2007a,b; Balcan et al., 2007; Dasgupta et al., 2007; Bach, 2007; Castro and Nowak, 2008; Balcan et al., 2008; Dasgupta and Hsu, 2008; Cavallanti et al., 2008; Hanneke, 2009; Beygelzimer et al., 2009, 2010; Koltchinskii, 2010; Cesa-Bianchi et al., 2010; Yang et al., 2010; Hanneke and Yang, 2010; El-Yaniv and Wiener, 2010; Hanneke, 2011; Orabona and Cesa-Bianchi, 2011; Cavallanti et al., 2011; Yang et al., 2011; Wang, 2011; Minsker, 2012). A more comprehensive overview of active learning, along with a proper contextual background, will be provided in Chapter 2.
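The pool-based protocol can be sketched as a generic query loop. The following is an illustrative sketch only (uncertainty sampling over a toy one-dimensional threshold class), not any specific algorithm from this thesis; all names are hypothetical:

```python
def fit_threshold(labeled):
    # Place the threshold halfway between the largest known-0 point
    # and the smallest known-1 point.
    zeros = [x for x, y in labeled.items() if y == 0]
    ones = [x for x, y in labeled.items() if y == 1]
    lo = max(zeros, default=0.0)
    hi = min(ones, default=1.0)
    return (lo + hi) / 2

def pool_based_active_learning(pool, oracle, budget, fit, uncertainty):
    """Generic pool-based loop: repeatedly query the label of the
    instance the current model is least certain about."""
    labeled = {pool[0]: oracle(pool[0])}          # seed query
    model = fit(labeled)
    while len(labeled) < budget:
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        x = max(unlabeled, key=lambda u: uncertainty(model, u))
        labeled[x] = oracle(x)                    # one oracle interaction
        model = fit(labeled)
    return model, labeled

# Toy run: learn a threshold on [0, 1) using only 12 label queries.
pool = [i / 100 for i in range(100)]
oracle = lambda x: int(x >= 0.37)                 # hidden labeling function
uncertainty = lambda t, x: -abs(x - t)            # nearest to boundary = most uncertain
model, labeled = pool_based_active_learning(pool, oracle, 12, fit_threshold, uncertainty)
```

With 12 queries the learned threshold lands close to the hidden boundary 0.37, while passive learning would need labels spread over the whole pool.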

Active learning can be viewed as a search game in which the learner searches for a “good” hypothesis by means of queries that restrict the search space. This approach is especially appealing when the search space contains the underlying labeling function, that is, when it contains a perfect hypothesis that never errs; this setting is known as the realizable case. When this is the case, the search

5

Technion - Computer Science Department - Ph.D. Thesis PHD-2013-04 - 2013

Page 20: THEORY AND PRACTICE OF ACTIVE LEARNING · encouragement. For all these reasons, Nir is a genuine role model and the ideal advisor | I couldn’t ask for better! ... Dr. Esther Ezra,

space is essentially the set of hypotheses that are congruent with all revealed labels. This subset of hypotheses is called the version space. Each newly exposed label breaks the version space into hypotheses that either agree or disagree with the label. The merit of a query is thus tied to the potential proportion of the corresponding split. The revealed labels may potentially shrink the search space by constant factors (e.g., as in a generalized binary search). This means that the optimal hypothesis may potentially be revealed by a mere logarithmic number of queries, which is an exponential improvement over passive learning rates.
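The generalized-binary-search intuition is easiest to see in a toy realizable case: one-dimensional thresholds over a sorted pool, where each query halves the version space, so the perfect hypothesis is found with roughly log2(n) labels instead of n. This is an illustrative sketch, not a construction from this thesis:

```python
import math

def binary_search_threshold(points, oracle):
    """Realizable 1-D thresholds: each query bisects the version space,
    i.e., the interval of thresholds consistent with all labels seen."""
    lo, hi = 0, len(points)            # version space = threshold indices in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(points[mid]) == 1:   # boundary is at mid or to its left
            hi = mid
        else:                          # boundary is strictly to the right of mid
            lo = mid + 1
    return lo, queries

points = list(range(1000))             # sorted pool of n = 1000 instances
oracle = lambda x: int(x >= 613)       # hidden perfect threshold hypothesis
boundary, queries = binary_search_threshold(points, oracle)
```

Here `queries` stays within about log2(1000) ≈ 10 labels, whereas a passive learner would need to see labels across the entire pool to pin down the boundary.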

However, it turns out that exploiting the obtained labeled examples is generally not enough. Dasgupta (2005) shows in his breakthrough theoretical work a simple example wherein the probability that a query will split the search space is minuscule; thus the only way to make progress is by trying to locate such “split” examples through exploration.

When the setting is non-realizable, this exploration–exploitation dilemma becomes acute. The search cannot be led by the set of hypotheses that are unanimous on the sampled labels, because every candidate solution errs. The labeled examples may be exploited for the sake of shrinking the search space only when there is some “guarantee” that we will not lose potential solutions. In cases where the “confidence” in the labeled examples is low, the learner should enhance it by using new examples to make it more robust to the problem’s inherent error noise.

From the theoretical point of view of active learning, the (few) existing active-learning complexity terms quantify how difficult it is to apply significant search-space reductions. We argue that in difficult cases such as these, the search “direction” is static, and thus choices are only concerned with better exploitation of the search state. For example, Dasgupta’s (2005) splitting index captures how common it is for labels to break the search space into significant portions. Thus, the splitting index measures how easy it is to sample a good set of labeled examples that shrink the version space sufficiently. This expresses how well we can exploit the current set of labels. Similarly, Hanneke’s (2007b) disagreement coefficient measures the “volume” of the set of instances on which the r-ball of hypotheses around the optimal hypothesis is not unanimous. Again, this set consists of all query candidates for reducing the search space.
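For orientation, Hanneke's disagreement coefficient is usually stated as follows (notation adapted to this thesis's abbreviations list; the formal treatment appears in Chapter 2):

```latex
\mathrm{DIS}(V) = \{\, x \in \mathcal{X} : \exists\, h, h' \in V,\ h(x) \neq h'(x) \,\},
\qquad
B(h^{*}, r) = \{\, h \in \mathcal{H} : \mathrm{dist}(h, h^{*}) \le r \,\},
```
```latex
\theta \;=\; \sup_{r > 0} \frac{\Pr_{X \sim D_X}\!\left[ X \in \mathrm{DIS}\!\left( B(h^{*}, r) \right) \right]}{r}.
```

In words: θ bounds how fast the probability mass of the disagreement region can grow relative to the radius of the hypothesis ball around h*.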

Among the above two complexity terms, the disagreement coefficient of Hanneke (2007b) has become a central data-independent invariant in proving active learning rates. As a result, the analysis of the vast majority of the theoretically justified active learning algorithms is mainly concerned with efficient exploitation. When this is not enough (i.e., when exploration is needed), the analysis is often accompanied by certain structural or Bayesian assumptions about the noise (especially the model of Mammen and Tsybakov, 1999; Tsybakov, 2004). Under “nicely” behaved noise the problem is analogous to the realizable case, in which, under sufficient conditions, exploitation can be the main concern. Indeed, this type of assumption provides excellent labeling rate guarantees that usually


outperform passive learning (e.g., Balcan et al., 2007; Castro and Nowak, 2008; Hanneke, 2009; Koltchinskii, 2010; Yang et al., 2010; Wang, 2011; Yang et al., 2011; Minsker, 2012).

On the other hand, a few empirical works consider the role of exploration explicitly (e.g., Lindenbaum et al., 1999; Baram et al., 2004; Osugi et al., 2005). All of them indicate that sensible active learning methods should perform systematic explorations. This serves as real motivation for examining this phenomenon from a theoretical point of view.

1.1 Problem Definition

Our work is concerned with pool-based active learning only. We deal mostly with distribution-free settings; thus we do not consider any structural or Bayesian noise assumptions. Our analysis was carried out under the worst-case analysis framework. We focused on the non-realizable setting, known as the agnostic active learning setting. The hypothesis space in this setting does not necessarily contain the ground-truth labeling function; as a result, the analogy of active learning as a search through the version space is no longer valid. The active learner has to trade off between exploitation of the labeled sample, which is a proxy to the labeling process, and exploration of new sample biases.

1.2 Proposed Solution

The essence of this thesis lies in exploration–exploitation tradeoffs in active learning. We define a smoothness condition on actively learned empirical risk minimizers that quantifies such tradeoffs (implicitly); the condition is accompanied by a well-justified active learning meta-algorithm.

Recall that empirical risk minimization (ERM) is a learning paradigm based on the idea that it is possible to approximate the expected loss of hypotheses using their empirical mean. Following the ERM philosophy, our condition deals with differences between empirical means and expected losses. We condition the deviation of such differences defined between any hypothesis and some fixed pivotal one. Our condition ensures that such deviations vary smoothly with the hypothesis-space distance between their operand and the pivotal hypothesis. Thus, we require that the density of the sample that defines the empirical estimator be correlated with these distances. In other words, the learner should balance between exploitation of hypotheses in close proximity to the pivot and exploration of far-away hypotheses. Our solution can be intuitively viewed as a holistic approach to active learning’s explore–exploit Catch-22.
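In symbols, the condition can be previewed roughly as follows (a paraphrase of the SRRA definition from the publications listed earlier; the precise statement appears in Chapter 3). With the regret relative to a pivot h' defined as

```latex
\mathrm{reg}_{h'}(h) \;=\; \mathrm{er}_D(h) - \mathrm{er}_D(h'),
```

an estimator f is called an ε-SRRA with respect to h' when

```latex
\bigl|\, f(h) - \mathrm{reg}_{h'}(h) \,\bigr| \;\le\; \varepsilon \cdot \mathrm{dist}(h, h')
\qquad \text{for all } h \in \mathcal{H}.
```

The estimation error is thus allowed to grow only in proportion to the distance from the pivot, which is exactly what forces denser sampling near the pivot (exploitation) and sparser sampling far from it (exploration).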


1.3 Major Contributions

We establish a novel approach to pool-based active learning and outline its benefits and drawbacks. The key ingredients of our approach are a novel smoothness condition on ERM estimators of relative regrets and a corresponding active learning algorithm that uses such estimators. Our algorithm is guaranteed to output a hypothesis that errs no more than (1 + ε) times the optimal error rate, and to output such a hypothesis at a “fast” rate (exponential in ε). The algorithm depends on specific smooth estimator constructions and on solvers for the specific ERM problems. The main contributions are listed below:

A novel approach to active learning. Our smoothness condition, accompanied by a well-justified meta-algorithm, brings a novel approach to active learning. It contrasts with the “shrinking version spaces” algorithmic approach that currently dominates the active learning field. For two important problems in which standard active learning methods fail to provide meaningful guarantees, we use our approach to define active learning algorithms with non-trivial guarantees.

General purpose construction matches state-of-the-art sample complexity. For the generic pool-based setting, we define a construction of smooth estimators that uses knowledge only of the disagreement coefficient θ and the VC dimension d. We prove that our construction yields (w.h.p.) a query complexity in

O(θ d log²(1/ν) log θ),

where ν denotes the problem’s inherent agnostic noise. This query-complexity bound is O(log(1/ν)) times the best known bounds that use disagreement coefficient and VC dimension bounds only (e.g., Dasgupta et al., 2007; Beygelzimer et al., 2009).

Extending the state of the art of learning to rank from pairwise preferences. We define a specific construction of a smooth estimator for the problem of actively learning to rank from pairwise preferences (also known as the query-efficient version of minimum feedback arc-set in tournaments). The resulting algorithm improves on the previously best results (Ailon, 2012) in two ways. Most importantly, it is more general and can be applied over any set of hypotheses (whereas the former solution is restricted to a specific set); secondly, it slightly improves the sample complexity and is guaranteed (w.h.p.) to query no more than O(ε⁻³ n log⁵ n) labels, where n denotes the number of alternatives to rank. Additionally, we demonstrate that our specific construction beats both uniform sampling and any known active learning method that is based on disagreement coefficient and VC dimension arguments only (e.g., our general-purpose construction).


Outlining the state of the art of active correlation clustering. We define a query-efficient variant of correlation clustering (semi-supervised clustering), in which the number of clusters is fixed and known. We present a corresponding construction that achieves (w.h.p.) a query complexity of

O(n max{ε⁻² k³, ε⁻³ k²} log² n),

where n is the number of items to cluster and k is the number of clusters. We demonstrate that this result is not trivial in the sense that it beats both uniform sampling and any known active learning method that is based on disagreement-coefficient and VC-dimension arguments only.

1.4 Thesis Outline

Chapter 2 provides definitions and a brief contextual background on active learning. In Chapter 3, we lay out the key ingredients of our method, which we term Smooth Relative Regret Approximations (SRRA). Our approach relies on specific constructions of SRRA estimators, and on a way to solve the corresponding ERM optimizations. We thus also present a general-purpose construction of smooth estimators, analyze the corresponding query complexity of the resulting algorithm, and finally discuss optimizations under convex relaxations. Chapters 4 and 5 deal with query-efficient variants of learning to rank from pairwise preferences and correlation k-clustering, respectively. In each of these chapters, after presenting the problem, we discuss the pitfalls of applying uniform sampling and disagreement-based approaches. We then present a corresponding SRRA construction and show that it yields state-of-the-art guarantees. In Chapter 4 we additionally provide an empirical proof of concept of our approach using a novel benchmark consisting of several synthetic and, most importantly, three real-world datasets. In Chapter 5, we also discuss a hierarchical variant of the problem in which we try to actively learn ultrametrics. In our concluding Chapter 6 we present a powerful heuristic strategy for adding systematic exploration abilities to any active learner. Focusing on transductive graph-based active learners, we empirically demonstrate the usefulness of our approach and show that it achieves state-of-the-art empirical results on standard and common benchmarks.


Chapter 2

Brief Introduction to Active Learning

This chapter provides contextual background on active learning and introduces the main concepts and notation we will use in the following chapters. We start by presenting the two main active learning settings, analogous to batch and online supervised learning. Then we sketch the main theoretical concepts and complexity terms of active learning. Last, we outline the most important active learning algorithmic ideas.

2.1 Settings and Notations

We begin with the core learning elements that are common to both the supervised (passive) and active learning settings. Let 𝒳 be an instance space and 𝒴 be a label space. Let a distribution over 𝒳 × 𝒴 be denoted by D, with corresponding marginals D_X and D_Y. Here we focus on a classification scenario in which the label space is discrete and finite (the case in which 𝒴 is continuous is known as regression). We will also assume that, unless explicitly stated, the label space is binary, 𝒴 = {0, 1}, and the labeling is a deterministic function over instances, that is, Y : 𝒳 → {0, 1}. These assumptions do not restrict the results and can be generalized.

Let H be a set of functions mapping 𝒳 to 𝒴. We call H a hypothesis class, and a function h ∈ H a hypothesis. The labeling function Y(·) is unknown, and the hypothesis class represents the set of "feasible" estimates of Y. The discrepancy between an estimate h(x) and the true label Y(x) for x ∈ 𝒳 is quantified by a loss function ℓ : 𝒴 × 𝒴 → R+. The performance of a hypothesis h is measured by its expected (in D) loss, which is known as the risk function

risk_{D,ℓ}(h) = E_{(X,Y)∼D}[ℓ(h(X), Y)].

Here we will mostly use the zero-one loss function ℓ(a, b) = 1_{a≠b} (where 1_{(·)} is the


indicator function). In this case, the risk simply becomes the hypothesis error rate

er_D(h) = P_{X∼D_X}[h(X) ≠ Y(X)].

We define the noise rate of the class H to be ν = inf_{h∈H} er_D(h). We will assume here that there exists an optimal hypothesis h* ∈ H that achieves the noise rate (er_D(h*) = ν). The case in which h* ≡ Y, that is, ν = 0, is called the realizable setting. The other case, where ν > 0, is known as the agnostic setting. Here we will consider the agnostic setting only.

We say that a hypothesis h is ε-competitive if |er_D(h) − ν| < εν; in other words, its risk does not exceed (1 + ε) times the hypothesis class noise rate. We measure the performance of an active learning algorithm by its convergence rate: the minimal number of labels it consumes to produce (w.h.p.) an ε-competitive hypothesis h ∈ H. This is known as the sample or query complexity.
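To make the ε-competitive notion concrete, here is a small numeric sketch; the pool, labeling function, and threshold class below are illustrative stand-ins of our own, not objects from the thesis:

```python
# Illustrative setup: a finite pool standing in for the marginal D_X, a
# hidden labeling function Y, and a small threshold hypothesis class H.
pool = list(range(100))
Y = lambda x: int(x >= 37)                       # unknown target labeling
H = [lambda x, t=t: int(x >= t) for t in range(0, 101, 5)]

def error_rate(h):
    """Empirical zero-one error of h on the pool (stands in for er_D(h))."""
    return sum(h(x) != Y(x) for x in pool) / len(pool)

nu = min(error_rate(h) for h in H)               # noise rate of the class H

def is_eps_competitive(h, eps):
    """h is eps-competitive if er_D(h) - nu < eps * nu."""
    return error_rate(h) - nu < eps * nu
```

Since no hypothesis in this class matches the target threshold exactly, the noise rate ν is positive, and only hypotheses close to the best threshold are ε-competitive for small ε.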

There are two definitions of the sample complexity. The first is called the self-verifying sample complexity, which counts the minimal number of labels required both to produce a competitive hypothesis and to prove (w.h.p.) that it is indeed competitive. This is the commonly used definition, and we will use it as well. The alternative definition counts only the labeling overhead needed to produce a competitive hypothesis (without verifying it). Indeed, there are cases in which the difference between the two definitions is significant. We discuss this further in Section 2.2 below.

In this thesis, we focus only on what is known as the pool-based active learning setting, which is one of the two commonly analyzed active learning settings. Both settings are presented below.

Require: active learning algorithm ALG, a pool X ⊆ 𝒳, labeling oracle Y_X, and query budget T
  S_0 ← ∅ (the labeled sample)
  for i = 0, 1, ..., T − 1 do
    ALG uses (S_0, S_1, ..., S_i) and X to choose an instance x ∈ X (or a batch)
    S_{i+1} ← S_i ∪ {(x, Y_X(x))}
  end for
  return hypothesis h = ALG((S_0, ..., S_T), X)

Figure 2.1: Pool-based active learning protocol.

Pool-Based Active Learning. Pool-based active learning is analogous to batch supervised learning, wherein the learner obtains access to a labeling oracle that possesses the true labeling information of some subset of 𝒳. This subset is called the pool. A protocol of pool-based active learning is presented in Figure 2.1. The protocol is iterative: in each round the algorithm queries some label, based on its current set of labeled instances.
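The protocol of Figure 2.1 can be sketched in a few lines of code. The random query rule and the toy threshold learner below are placeholders of our own (not any algorithm from the thesis); a real active learner would choose x based on the labeled sample collected so far:

```python
import random

random.seed(1)
pool = list(range(50))
oracle = lambda x: int(x >= 20)              # hypothetical labeling oracle Y_X

def pool_based_protocol(pool, oracle, T, choose):
    """Run the iterative protocol: T rounds of choose -> query -> record."""
    S = []                                    # the labeled sample (S_0 = empty)
    for _ in range(T):
        x = choose(S, pool)                   # ALG picks an instance from the pool
        S.append((x, oracle(x)))              # query its label and record it
    # "ALG" finally returns a threshold classifier fit to S by ERM.
    best_t = min(range(max(pool) + 2),
                 key=lambda t: sum((x >= t) != y for x, y in S))
    return (lambda x: int(x >= best_t)), S

# Placeholder query rule: uniform over still-unlabeled pool points.
choose_random = lambda S, pool: random.choice(
    [x for x in pool if x not in {a for a, _ in S}])

h, S = pool_based_protocol(pool, oracle, T=15, choose=choose_random)
```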


Stream-Based Active Learning. Stream-based active learning is analogous to online learning. Here, the active learner receives a stream of instances x_1, x_2, ... ∈ 𝒳 one at a time (each drawn i.i.d., paired with its label (x_i, y_i), from D). At each step it decides whether to request or ignore the instance's label. The online nature of the algorithm usually imposes restricted resources (e.g., O(1) space complexity); thus, label requests are mostly based on a current solution h_i. In comparison, in the pool-based setting the learner can maintain the sampled examples and the sequence of solutions. See Figure 2.2 for the corresponding protocol. (Note that a few stream-based algorithms maintain additional resources.)

Require: active learning algorithm ALG, and number of rounds T
  h_0 ← initial hypothesis
  for i = 1, ..., T do
    an instance-label pair (x_i, y_i) is drawn from D
    ALG receives x_i and decides whether to ask for y_i (based on h_{i−1})
    ALG emits hypothesis h_i
  end for
  return hypothesis h_T

Figure 2.2: Stream-based active learning protocol.
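A minimal runnable sketch of the stream-based protocol of Figure 2.2 follows. The near-boundary query rule and midpoint update are illustrative placeholders of ours, not any specific published algorithm:

```python
import random

random.seed(2)
# Draw (x_i, y_i) ~ D: uniform x with a deterministic threshold label.
draw = lambda: (lambda x: (x, int(x >= 0.5)))(random.random())

def stream_protocol(T, margin=0.15):
    t, lo, hi = 0.5, 0.0, 1.0         # h_0: threshold t, plus bookkeeping
    queries = 0
    for _ in range(T):
        x, y = draw()                  # instance arrives; label still hidden
        if abs(x - t) < margin:        # ask for y_i only near the boundary
            queries += 1
            if y == 0:
                lo = max(lo, x)        # largest x seen with label 0
            else:
                hi = min(hi, x)        # smallest x seen with label 1
            t = (lo + hi) / 2          # h_i: midpoint hypothesis
    return t, queries

t, queries = stream_protocol(200)
```

Note that, as the text describes, the decision to query depends only on the current hypothesis and constant extra state, never on the full history.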

2.2 Theory of Active Learning

We discuss below the main theoretical active-learning results. We start by listing known active-learning complexity terms, and then we present alternative settings and guarantees for the agnostic setting.

2.2.1 Complexity Terms

Complexity terms capture the essence of the problem at hand; accordingly, their usefulness is measured by their generality, simplicity, and expressiveness. In other words, they should rely upon assumptions that are as minimal as possible (generality), be easy to describe and compute (simplicity), and be useful for relating the performance of algorithms to the nature of the problem instance at hand (expressiveness). Active learning is a relatively young research field, and until recently it lacked a theoretical foundation. This was changed by the breakthrough work of Dasgupta (2005), who defined the first active-learning-specific complexity term, the splitting index. Since then only two more terms have been introduced: the teaching dimension (Hanneke, 2007a; Goldman and Kearns, 1995),1 and the

1Hanneke was the first to demonstrate the usefulness of the teaching dimension measure for analyzing sample complexities in active learning.


disagreement coefficient (Hanneke, 2007b). The disagreement coefficient is currently the most popular term for proving sample bounds of active learners. We also use it for analyzing our general-purpose construction (Section 3.2.1).

Splitting Index. The idea of splitting indices, proposed by Dasgupta (2005), facilitated the first general theory of convergence rates for active learning. In the realizable case, we can view the active learning process as a (binary) search, where the algorithm searches for a competitive hypothesis by "flipping" instance labels. Using this analogy, the informativeness of a point x ∈ 𝒳 is determined by the ratio in which it splits the "search set." That is, for a finite set Q ⊆ {{h_1, h_2} : h_1, h_2 ∈ H} of hypothesis pairs, define Q_x^y = {{h_1, h_2} ∈ Q : h_1(x) = h_2(x) = y}. An instance x is said to ρ-split Q if

max_{y∈{0,1}} |Q_x^y| ≤ (1 − ρ)|Q|.

Definition 2.1 (Splitting Index) We say A ⊆ H is (ρ, ∆, τ)-splittable if, for every finite Q ⊆ {{h_1, h_2} ⊆ A : P[h_1(X) ≠ h_2(X)] > ∆},

P(X ρ-splits Q) ≥ τ.

The splitting index intuitively indicates how common informative instances are. Perhaps the most elegant property of the splitting index is that it links the number of labels required with the number of unlabeled instances (the size of the pool); this property is unique to the splitting index and is not shared by any other active learning complexity term. The next theorem gives a general performance result that uses this term:
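A small numeric check makes the ρ-splitting notion concrete; the threshold class and the ten-point domain below are ours, chosen only for illustration:

```python
# Threshold hypotheses over a small finite domain {0, ..., 9}.
H = [lambda x, t=t: int(x >= t) for t in range(11)]

def rho_splits(x, Q, rho):
    """Does instance x rho-split the finite set Q of hypothesis pairs?
    Q_x^y keeps the pairs whose two hypotheses agree on label y at x."""
    Qx = {y: [p for p in Q if p[0](x) == p[1](x) == y] for y in (0, 1)}
    return max(len(Qx[0]), len(Qx[1])) <= (1 - rho) * len(Q)

# Q: all unordered pairs of distinct hypotheses from H (55 pairs).
Q = [(H[i], H[j]) for i in range(len(H)) for j in range(i + 1, len(H))]

# x = 5 splits H into a 6-vs-5 label partition, so it 0.5-splits Q,
# whereas x = 0 leaves 45 of the 55 pairs in agreement and does not.
```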

Theorem 2.1 (Dasgupta (2005)) For any H with VC dimension d, there is an algorithm such that if the set {h : P[h(X) ≠ h*(X)] ≤ 4∆} is (ρ, ∆, τ)-splittable for all ∆ ≥ ε/2, then with probability at least 1 − δ the algorithm draws Õ((1/ε) + (d/(ρτ))) instances, uses Õ(d/ρ) labels, and returns a hypothesis with error at most ε. (Õ(·) is the analog of big-O that suppresses poly-logarithmic terms.)

Teaching Dimension. The teaching dimension is a label complexity term defined by Goldman and Kearns (1995), designating the minimum number of labeled examples that must be presented to any consistent passive learning algorithm in order to uniquely identify any hypothesis in the hypothesis class. Related extensions (Hegedus, 1995; Hellerstein et al., 1996) were used to tightly characterize the number of membership queries sufficient for exact learning.2

2Learning with membership queries is a setting that comes from the COLT community and is related to active learning. The main difference between the two is that in membership queries, 𝒳 ≡ R^d; note that in many natural problems this cannot be assumed to be true. For example, it is not reasonable that every point in R^d represents a bag-of-words of some document. In the exact learning setting the goal is to identify an errorless hypothesis in the class.


Hanneke (2007a) adjusted the (extended) teaching dimension to the PAC setting, in which the learner is required to provide an approximately correct hypothesis rather than the exact errorless one. He defined a "binary search style" active-learning algorithm and provided a corresponding sample complexity analysis using his PAC-extension teaching dimension term. Note that, in this sense, there seems to be some relationship between this term and Dasgupta's splitting index; however, as noted by Hanneke (2007a), the connection is not clear.

The (extended) teaching dimension did not gain popularity, and in general cases it is not clear how to utilize it to analyze and construct active learning algorithms. For this reason, we do not discuss it further.

Disagreement Coefficient. In parallel with the presentation of the teaching dimension variant, Hanneke (2007b) introduced a complexity term that became the standard active learning complexity term. Hanneke's disagreement coefficient intuitively captures how difficult it is to estimate a hypothesis within H beyond some threshold, in terms of the probabilistic volume of instance labels that are needed.

To define the disagreement coefficient we need the following definitions. Define the distance dist(h_1, h_2) between two hypotheses h_1, h_2 ∈ H as P_{X∼D_X}[h_1(X) ≠ h_2(X)]; observe that dist(·, ·) is a pseudo-metric over pairs of hypotheses. For a hypothesis h ∈ H and a number r ≥ 0, the ball B(h, r) around h of radius r is defined as {h′ ∈ H : dist(h, h′) ≤ r}. For a set V ⊆ H of hypotheses, let DIS(V) denote

DIS(V) = {x ∈ 𝒳 : ∃h_1, h_2 ∈ V such that h_1(x) ≠ h_2(x)}.

Definition 2.2 (Uniform Disagreement Coefficient) The disagreement coefficient of h with respect to H under D_X is defined as

θ_h = sup_{r>0} P_{D_X}[DIS(B(h, r))] / r,    (2.1)

where P_{D_X}[W] for W ⊆ 𝒳 denotes the probability mass of W with respect to the distribution D_X. Define the uniform disagreement coefficient θ as sup_{h∈H} θ_h, namely

θ = sup_{h∈H} sup_{r>0} P_{D_X}[DIS(B(h, r))] / r.    (2.2)

Remark 2.1 A slight though useful variation of the definitions of θ_h and θ can be obtained by replacing sup_{r>0} with sup_{r≥ν} in (2.1) and (2.2).

The disagreement coefficient is useful in bounding the query complexity of many algorithms (see, e.g., Dasgupta et al., 2007; Balcan et al., 2008; Beygelzimer et al., 2009, 2010; Koltchinskii, 2010; Hanneke, 2011; Wang, 2011; El-Yaniv and Wiener, 2012), probably due to its simplicity and its applicability beyond the realizable case (where the splitting index fails). We will therefore compare our ideas and results with those obtained using this complexity term.
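As an illustration, for one-dimensional thresholds under a uniform marginal the disagreement coefficient is known to be 2: the ball B(h, r) contains thresholds within r of h, so DIS(B(h, r)) is an interval of mass about 2r. The following back-of-the-envelope grid estimate (all discretizations and names are ours) recovers that value:

```python
# Grid approximation of the uniform marginal on [0, 1).
n = 1000
grid = [i / n for i in range(n)]
# Threshold hypothesis class, discretized to steps of 0.01.
H = [lambda x, t=t: int(x >= t) for t in [i / 100 for i in range(101)]]

def dist(h1, h2):
    """Empirical pseudo-metric P[h1(X) != h2(X)] on the grid."""
    return sum(h1(x) != h2(x) for x in grid) / n

def theta_hat(h, radii):
    """Estimate sup_r P[DIS(B(h, r))] / r over the given radii."""
    best = 0.0
    for r in radii:
        ball = [g for g in H if dist(h, g) <= r]          # B(h, r)
        # Mass of points on which some pair in the ball disagrees.
        dis_mass = sum(len({g(x) for g in ball}) > 1 for x in grid) / n
        best = max(best, dis_mass / r)
    return best

estimate = theta_hat(H[50], [0.05, 0.1, 0.2])   # h is the threshold at 0.5
```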


2.2.2 Alternative Settings and Guarantees

We focus here on the agnostic setting, notably the lower bound and the current state-of-the-art upper bound on the sample complexity.

Agnostic Setting: Lower Bound. Kaariainen (2006) proves an information-theoretic limit on what we can hope to achieve with agnostic active learning. He was able to prove that for virtually any non-trivial marginal D_X, noise rate ν, number of labels n, and active-learning algorithm, there exists a distribution D with marginal D_X and noise rate ν such that, with probability at least δ, the hypothesis h output after n label queries satisfies

er_D(h) − ν ≥ c √(ν² log(1/δ) / n).

This result was improved by Beygelzimer et al. (2009) to a lower bound of c √(ν² d / n). Note that in passive learning, VC convergence rates are proportional to √(νd log(1/δ) / n) for a hypothesis class with VC dimension d. This negative result indicates that, in a worst-case sense, active learning does not provide a significant advantage over passive learning. Nevertheless, this is not much of a restriction, and it can be removed by making, for example, some assumptions on the distribution D or on the noise structure, or even by using a modified definition of the sample complexity (see the discussion of the non-verifiable sample complexity below). Here, all the specific problems that we target in Chapters 4-6 are concerned with a (transductive) distribution-free setting, for which this lower bound is irrelevant.

Agnostic Setting: State-of-the-Art Guarantees. We are interested in the best guarantees for methods that use disagreement-coefficient and VC-dimension arguments only. Thus, we are concerned with methods that use minimal assumptions (better bounds require more assumptions). The current state-of-the-art sample complexity guarantee corresponds to the algorithms of Dasgupta et al. (2007) and Beygelzimer et al. (2009). Note that both of these algorithms are described under the stream-based active-learning setting. The guarantee can be summarized as follows. Assume that the (uniform) disagreement coefficient θ and the VC dimension d are bounded. The number of labels the algorithm of Dasgupta et al. (2007) needs in order to achieve an error of O(ν) is (w.h.p.)

O(θ d log(1/ν)).

For further details, see the discussion following Theorem 2 in Dasgupta et al. (2007).

Non-Verifiable Sample Complexity. Balcan et al. (2008) provide an alternative definition of the sample complexity, based on the subtle observation that


sample-complexity guarantees bifurcate into the following two cases. The first is the standard definition, in which the emitted hypothesis' error bound can be verified upon the given sample (called self-verifying); in the other case, the error rate cannot be verified from the sample, and the latter is called the non-verifiable sample complexity. Interestingly, Balcan et al. (2008) show that under the non-verifiable sample complexity setting, active learning gains a significant advantage over passive learning in terms of sample complexity rates. The key aspect of this remarkable advantage is that this setting can explicitly encapsulate knowledge of the minority class's probability mass. Recall our discussion in Chapter 1 about exploration-exploitation tradeoffs in active learning. The probability of the minority class is the exact quantity that influences the exploration overhead (i.e., when it is minuscule, the algorithm can only succeed by performing uniform exploration). The success of the non-verifiable setting can thus be intuitively attributed to prior knowledge of the exploration overhead.

Structural Noise Assumptions. One way to overcome the negative result of Kaariainen (2006) is by considering assumptions about the structure of the noise (e.g., Balcan et al., 2007; Castro and Nowak, 2008; Hanneke, 2009; Koltchinskii, 2010; Yang et al., 2010; Wang, 2011; Yang et al., 2011; Minsker, 2012). The most popular assumptions are the noise conditions of Mammen and Tsybakov (1999). These noise conditions informally capture the density of the noise around the boundary; thus, they are also referred to as margin conditions. Assuming that the (label) noise is inversely related to the distance of the instance from the decision boundary, the conditions describe how "quickly" the label mass P[Y(X) = 1] changes around the decision boundary. This intuitively parameterizes how rapidly the diameter of the hypothesis class shrinks as we eliminate sub-optimal hypotheses. It relates in a natural way to the idea behind the definition of the disagreement coefficient and therefore provides a way to describe "distribution structures" that admit fast sample rates.

2.3 Algorithmic Ideas

This section provides an overview of the main active-learning algorithmic ideas. These ideas not only yield theoretically justified methods but also motivate heuristics that show good empirical performance on real-world data. For an even more comprehensive review see, for example, Settles (2012).

2.3.1 Sampling by Uncertainty

The core idea behind the sampling-by-uncertainty query strategy is the assumption that the chance of a label-estimation mistake is inversely proportional to the estimator's confidence. Assuming this is indeed true, querying the least confident


instances has two intuitive motivations. First, there is no point in "spending" queries on labels that the current model "almost surely" knows. Secondly, revealing the least confident labels may have a large impact on the model.

In the next sections we discuss four methods that apply the natural idea of sampling by uncertainty. The first two are the classical CAL and Query-by-Committee (QBC) methods, both of which measure confidence with respect to a set of hypotheses; each queried label potentially shrinks the search space (usually taken to be the version space). The last two methods maintain an intermediate estimator and query instances near the decision boundary; each acquired label refines the estimator (and, as a result, the boundary). They are effective whenever long distances from the decision boundary are proportional to the estimator's confidence.
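The least-confidence rule at the heart of these methods fits in a few lines; the soft scores below are hypothetical model outputs in [0, 1], not produced by any method in this chapter:

```python
def least_confident(scores):
    """Index of the pool instance whose soft score is closest to 0.5,
    i.e., the instance on which the model is least confident."""
    return min(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))

scores = [0.95, 0.10, 0.52, 0.80, 0.33]      # hypothetical P(y=1|x) values
query_index = least_confident(scores)         # index 2: score 0.52
```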

CAL and Its Variants. CAL is a classical active-learning algorithm introduced by Cohn et al. (1994). The algorithm, originally designed for the realizable setting, maintains two sets: the uncertainty region, defined as the set of points on which two version space hypotheses disagree, and the version space itself. The original algorithm samples the next query from this region according to the marginal distribution D_X.

The basic idea of Cohn et al. (1994) began gaining popularity with Balcan et al. (2006, 2009a), who extended the idea to the agnostic setting, where the notion of the version space is no longer valid and thus cannot be used to reduce the region of uncertainty. Balcan et al. (2006), on the other hand, define a noisy variant of the version space and assume that they can bound the error of the hypotheses (from above and below). Accordingly, their algorithm uses these bounds to control the region of uncertainty.

Another example of a CAL-inspired algorithm is provided by Balcan et al. (2007), whose algorithm deals with linear separators (half-spaces) that are uniformly distributed in the high-dimensional unit ball. In their elegant geometrical interpretation of the region of uncertainty, the region is controlled by the margin of an intermediate solution.

Note that the algorithm of Balcan et al. (2006) motivated the definition of the disagreement coefficient (Hanneke, 2007b) and, as a result, was the first agnostic active-learning algorithm to gain a sample complexity analysis.

Query-by-Committee (QBC). The QBC algorithm of Seung et al. (1992) also considers the realizable case and tracks the version space. The original algorithm works in the streaming active-learning setting and assumes a Bayesian prior on the hypothesis class. For each decision along the stream, the algorithm draws two hypotheses from the version space (according to the Bayesian prior). If the sampled hypotheses disagree on the instance's classification, the algorithm queries


the label; otherwise, the algorithm ignores it. In the more general case, the committee consists of an arbitrary (or parameterized) number of hypotheses (e.g., Dagan and Engelson, 1995; McCallum and Nigam, 1998; Melville and Mooney, 2004). The intuition behind such a notion of confidence is related to version space shrinkage rates: the empirical split ratio of the committee correlates with the split ratio of the version space and accounts for how much it can be reduced.
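The two-member QBC rule described above can be sketched as follows, with a fixed finite committee of thresholds standing in for draws from the version space under a prior (committee, data, and seed are all illustrative choices of ours):

```python
import random

random.seed(3)
committee = [lambda x, t=t: int(x >= t) for t in (0.3, 0.4, 0.5, 0.6, 0.7)]

def qbc_wants_label(x, committee):
    """Query iff two randomly drawn committee members disagree on x."""
    h1, h2 = random.sample(committee, 2)
    return h1(x) != h2(x)

stream = [random.random() for _ in range(100)]
queried = [x for x in stream if qbc_wants_label(x, committee)]
# Points outside [0.3, 0.7) are never queried: every committee member
# agrees on them, mirroring how QBC ignores uninformative instances.
```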

The QBC algorithm was analyzed by Freund et al. (1997). The uniform sampling of surfaces in high dimensions, which is at the core of QBC, makes the algorithm intractable. Several follow-up studies addressed this problem (Bachrach et al., 2002; Gilad-Bachrach et al., 2003, 2005).

Most Uncertain Point. The motivation behind the most-uncertain querying method, suggested in parallel by Campbell et al. (2000), Schohn and Cohn (2000), and Tong and Koller (2001), comes from the geometrical relationship between confidence and distance from the margin in Support Vector Machines (SVMs). The algorithmic idea is very simple: query the point closest to the margin. Indeed, the algorithm is also widely known as SIMPLE. The original algorithm is designed for the pool-based setting and works as follows. At each stage, the algorithm maintains a soft-valued function over the pool's instances; this function is a surrogate for the confidence of the best hypothesis with respect to the current labeled set (e.g., an ERM solution with respect to the training examples). The algorithm then queries the example that has the minimal confidence value.

The motivation of Schohn and Cohn (2000) and Tong and Koller (2001) emerged from geometric considerations of the hyperplane generated by the underlying SVM. Schohn and Cohn (2000) look at the geometric interpretation of classification confidence, which in SVM is proportional to the example's distance from the hyperplane. Tong and Koller (2001) argue that the most uncertain point should split the version space into proportional parts, so that such a query reduces the search space by a constant factor.

While the former motivations are specific to the SVM algorithm, Campbell et al. (2000) provided a more general form of this query strategy. They expressed the most-uncertain-point querying strategy via an optimization problem (whose minimizer defines the solution of the SVM algorithm). The form of the optimization problem is general and can be used with most baseline (passive) algorithms, not only with SVM.

Coarse-to-Fine. The idea here is to start with a coarse sample that "covers" the region of interest and then gradually refine the sampling toward uncertainty regions. The uncertainty regions are defined by the Bayesian decision boundary; thus, this process can be interpreted as a search for the decision boundary. Castro et al. (2005) presented a decision-tree algorithm that implements this idea in two


dimensions. Later, Castro and Nowak (2008) presented a different coarse-to-fine algorithm and matched it with a corresponding sample-complexity guarantee.

2.3.2 Utility Maximization (Error Minimization)

The utility-maximization approach uses a utility function that defines the potential "gain" of acquiring a label for some example. The gain is usually related to error reduction. The unlabeled example with the maximal utility value is then chosen to be queried.

MacKay (1992) analyzed active learning in a Bayesian setting and suggested that the change in entropy (information gain) can be used as a utility function. Lindenbaum et al. (1999, 2004) suggested methods that evaluate a generic utility over the possible values of the next label. The former features an elegant game-tree interpretation, in which the utility value is calculated by looking ahead over the tree structure. The utility function implementations defined in Lindenbaum et al. (2004), on the other hand, are all based on the soft classification values of an underlying passive learner.

Other implementations of the utility-maximization approach choose to query the point whose addition to the training set maximizes the expected accuracy of a corresponding underlying passive learner. Roy and McCallum (2001) suggested this idea using the Naive-Bayes classifier, and later Baram et al. (2004) suggested an implementation with an SVM classifier. Zhu et al. (2003b) implemented this idea in the transductive setting, using their transductive learning algorithm (Zhu et al., 2003a) as the passive learner.

2.3.3 Online Weighting

Importance Weighting. The importance-weighting active-learning framework was recently described by Beygelzimer et al. (2009) and has been developed further through several follow-up works (e.g., Beygelzimer et al., 2010; Karampatziakis and Langford, 2011). This framework maintains a distribution over instances that expresses the usefulness of each instance. The original algorithms work in the streaming active-learning setting: upon receiving an example, they flip a biased coin whose bias reflects the example's importance, and in case of success (e.g., "heads") they request the example's label. The acquired labeled sample defines an unbiased weighted empirical estimator of the true error, where the weights are inversely proportional to the importance probability. The ERM solution defines an intermediate hypothesis, which is used to define the importance weights in the next learning round.
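The unbiasedness mechanism can be sketched as follows; the importance function, data, and hypothesis are toy stand-ins of our own, not the framework's actual weighting scheme:

```python
import random

random.seed(4)
data = [(x, int(x >= 0.5)) for x in (random.random() for _ in range(2000))]
h = lambda x: int(x >= 0.4)                  # hypothesis being evaluated

def importance_prob(x):
    """Hypothetical importance: query more often near the boundary."""
    return max(0.1, 1.0 - 2.0 * abs(x - 0.45))

def iw_error(h, data):
    """Inverse-probability-weighted empirical error: a queried example
    enters with weight 1/p(x), which keeps the estimator unbiased."""
    total = 0.0
    for x, y in data:
        p = importance_prob(x)
        if random.random() < p:              # flip the biased coin
            total += (h(x) != y) / p         # inverse-probability weight
    return total / len(data)

true_error = sum(h(x) != y for x, y in data) / len(data)
estimate = iw_error(h, data)
```

Despite labeling only a fraction of the data, the weighted estimate stays close to the error computed from all labels, which is the point of the inverse-probability weights.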

The specific smooth-estimator construction that we define in this thesis also defines importance weights. We work in the pool-based setting, and the weights are set with respect to our novel smooth-relative-regrets condition.


Perceptron-based Approaches. Many of the streaming active-learning algorithms are variations on the Perceptron algorithm. Essentially, these methods are based on augmenting Perceptron-type algorithms with a margin-based filtering rule.

The algorithm of Cesa-Bianchi et al. (2004) determines whether to query a label by flipping a biased coin, where the bias of the coin depends on the margin of the example with respect to the current hypothesis. A standard Perceptron update is performed whenever a queried label reveals that the algorithm's prediction was incorrect.

The algorithm of Dasgupta et al. (2005) retains the standard Perceptron logic for deciding whether to perform a hypothesis update but modifies the standard update rule. The update rule used by Dasgupta et al. (2005) applies a dynamic learning rate that is proportional to the margin of the example with respect to the current hypothesis. Note that this is exactly opposite to Cesa-Bianchi et al. (2004), who modified the standard update indication (using the margin) and kept the standard update rule.

Another interesting implementation of the online weighting approach was proposed by Herbster et al. (2005). The main concern of their paper was applying Perceptron methods over graphs; however, they also defined an active-learning method similar to the approach that queries the most uncertain point. The underlying passive learner is a Perceptron variant that belongs to the "passive-aggressive" family of algorithms (Herbster, 2001; Crammer et al., 2006).

2.3.4 Other Ideas

Clustering-based Methods. A natural approach to active learning is based on a clustering assumption, namely, that similar instances share the same label. When this assumption holds, the labeling overhead can be reduced to the number of clusters. Below we describe the elegant algorithm of Har-Peled et al. (2007), which applies the clustering-based method.

Har-Peled et al. used a margin-based active-learning algorithm that applies the SVM optimization solver. To start with, the algorithm's goal is satisfied with an approximately optimal solution. Hence, instead of constructing an exact margin-maximization solution, it utilizes an approximate one (according to a notion of cluster approximation).

The algorithm replaces instance singletons with their corresponding (geometrical) core-sets, a method that was suggested in the context of passive learning by Badoiu and Clarkson (2003) and Tsang et al. (2005).³ In the context of SVM, the core-set is a subset of the training set that induces an ε-maximal margin solution. Recall that SVM's solution has the large-margin characteristic; thus, in

³ Though the results of Tsang et al. (2005) have been criticized by Loosli and Canu (2007), the validity of the core-set approach remains intact.


this sense, the core-set induces an ε-maximal solution. Recall that one of the motivations for applying active learning is to reduce the labeling cost. Thus, in this sense, the core-sets approach naturally suggests trading the labeling cost for an almost "good" solution. Har-Peled et al. (2007) noted the usefulness of active learning with core-sets and provided a corresponding algorithm in the spirit of the approach that queries the most uncertain point.

Ensemble Methods. Ensemble methods combine the decisions of a set of learners into a single decision. In practice, any single learner is likely to fail on a few datasets. The use of a set of learners can provide a robust decision, where members of the ensemble can cover up for the deficiencies of other members.

Baram et al. (2004) provided motivation for this approach by showing that the most-uncertain and utility-maximization approaches tend to fail on XOR-like problems. This deficiency relates to the classical exploration–exploitation problem, where these methods tend to exploit points near their current decision boundary rather than explore new regions. The solution of Baram et al. (2004) is based on the multi-armed bandit algorithmic framework of Auer et al. (1995) for combining an ensemble of active learners, and provides a general framework for combining a set of active-learning schemes.

Another work tackling the same problem is that of Osugi et al. (2005), who used an ensemble of only two learners: one exploring and one exploiting. The switch between exploration and exploitation is achieved by flipping a biased coin, where the bias reflects how much the hypothesis changed as a result of the last query. This change is measured by the distance between the new hypothesis and the one that resulted from the previous learning round.

An even simpler approach was set forth by Guo and Greiner (2007), who combined two active-learning methods using a round-robin-like scheme. Interestingly, this very simple idea achieves excellent empirical results.


Chapter 3

The Method of Smooth Relative Regret Approximations

This chapter discusses a novel approach to pool-based active learning that is based on a novel smoothness condition on empirical estimators of relative regrets. We focus entirely on agnostic active learning within the pool-based setting and use the notation of Section 2.1. Thus, ν = inf_{h∈H} er_D(h) > 0 (and there exists h∗ ∈ H such that er_D(h∗) = ν).

For the sake of simplicity, we use the following assumptions. The label space is binary, Y = {0, 1}, and each label Y is a deterministic function of X, so that if X ∼ D_X, then (X, Y(X)) is distributed according to D. Extensions to multi-label spaces and soft (probabilistic) labels are straightforward. Additionally, we assume that the pool consists of X in its entirety. Again, this is not a real restriction, for two reasons: First, it matches a fairly common assumption that the pool is "large enough," at least in the sense that it does not restrict the learner (i.e., it "properly" represents D_X); secondly, and somewhat more importantly, in the rest of the dissertation we discuss only the transductive setting, in which X is finite and given in advance.

3.1 Smooth Relative Regret Approximations (SRRA)

The underlying idea of the SRRA method is to utilize intermediate solutions (hypotheses) as focal points for designing smooth error estimators. The merit of these estimators is presented below in Corollary 3.1. However, in this section we will not describe how to induce query mechanisms from the smoothness condition, mainly because, as we show later, there is solid indication that specific estimator definitions give the best results for specific problem domains.

Start by fixing a pivotal hypothesis h ∈ H. This pivot serves as our focal error-point via a function reg_h : H → R that we call the relative regret function


with respect to h, defined as

reg_h(h′) = er_D(h′) − er_D(h) .

Note that, for h = h∗, this is simply the usual regret, or (in-class) excess-risk¹ function.

Definition 3.1 (Smooth Relative Regret Approximation) Let f : H → R be any function, and let 0 < ε < 1/5 and 0 ≤ µ ≤ 1. We say that f is an (ε, µ)-smooth relative regret approximation ((ε, µ)-SRRA) with respect to h if, for all h′ ∈ H,

|f(h′) − reg_h(h′)| ≤ ε · (dist(h, h′) + µ) .

If µ = 0, we simply call f an ε-smooth relative regret approximation with respect to h.

Think of the function f as an empirical version of reg_h. In other words, f is completely defined by the labeled sample at hand, meaning that the (pool-based) active learner has full control over the design of f. The SRRA definition intuitively relates to the explore–exploit principle of active learning (discussed in Chapter 1). The SRRA condition hints that the sample should "cover" the spectrum of disagreement distances, while being "denser" in closer circles around the pivot h. Think of the former as exploring the hypothesis class and the latter as exploiting the intermediate solution h. This notion will become clearer as we present specific SRRA estimators for several problem domains.
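On a finite hypothesis set with all quantities known, the condition of Definition 3.1 reduces to a one-line check. The following illustrative helper (all names are ours) makes the definition operational:

```python
def is_srra(f, reg, dist_to_pivot, eps, mu):
    """Check the (eps, mu)-SRRA condition of Definition 3.1 on a finite set.

    f, reg        : dicts mapping hypothesis id -> estimate f(h') and true reg_h(h')
    dist_to_pivot : dict mapping hypothesis id -> dist(h, h')
    Returns True iff |f(h') - reg_h(h')| <= eps * (dist(h, h') + mu) for all h'.
    """
    return all(abs(f[h] - reg[h]) <= eps * (dist_to_pivot[h] + mu) for h in reg)
```

The allowed estimation error grows with the disagreement distance from the pivot, which is exactly the "denser near the pivot, sparser far away" intuition described above.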

Remark 3.1 We compare the definition of SRRA with other known smoothness conditions.

Lipschitz Continuity. The Lipschitz continuity condition is defined as follows: given two metric spaces (A, d_A) and (B, d_B), a function f : A → B is called Lipschitz continuous if there exists a real-valued constant K ≥ 0 such that, for all a1, a2 ∈ A,

d_B(f(a1), f(a2)) ≤ K · d_A(a1, a2) .

It is immediately noticeable that the left-hand sides of the (ε, µ)-SRRA definition and the above smoothness condition differ, and that the SRRA has an extra εµ term.

¹ Assuming that the risk is defined with respect to the zero-one loss function.


Massart's Condition. Perhaps the most similar to our Definition 3.1 is the smoothness condition due to Massart (2000, Page 288, Assumption A2). Massart was the first to consider analyzing learning processes directly with excess differences h − h∗. Massart (2000) defines an abstraction of the contribution of h to this discrepancy, which he calls the contrast function and denotes by γ(h, X) (where X ∼ D_X is a random variable). He defines the condition

Var[γ(h′, X) − γ(h, X)] ≤ dist²(h′, h) ,

and dist²(h, h∗) ≤ O(E[γ(h, X) − γ(h∗, X)]), for all h, h′ ∈ H. The condition is matched with a (structural) regularization term and a corresponding empirical estimator. The estimator is accompanied by a matching error upper bound.

We compare Massart's condition with our SRRA condition: Assuming f(h′) = f_{h′} − f_h and taking γ(h, X) = f_h − 1_{h(X)≠Y(X)}, we get

Var[f_{h′} − 1_{h′(X)≠Y(X)} − f_h + 1_{h(X)≠Y(X)}] ≤ dist²(h, h′) .

The left-hand side can be further developed into

E[(f(h′) − 1_{h′(X)≠Y(X)} + 1_{h(X)≠Y(X)})²] − (E[f(h′) − 1_{h′(X)≠Y(X)} + 1_{h(X)≠Y(X)}])² ,

which is exactly Var[1_{h(X)≠Y(X)} − 1_{h′(X)≠Y(X)}] (note that expectations are taken with respect to X and f(h′) is a constant). In other words, the condition bounds the variance of hypothesis distances.² Additionally, note that the second part of Massart's condition yields a very restrictive "structural" assumption on H: dist²(h, h∗) ≤ O(f(h) − reg_{h∗}(h)). Our SRRA condition is concerned directly with empirical processes that estimate relative regrets; also, it does not imply any structural assumptions. We conclude that our SRRA condition is considerably different from Massart's condition.

When used sequentially, the SRRA condition turns out to be useful. The following theorem and corollary show that a sequence of (ε, µ)-SRRA estimators defines an ε-competitive hypothesis. This results in a simple iterative meta-algorithm. These results constitute the main ingredients of our SRRA method.

Theorem 3.1 Let h ∈ H and let f be an (ε, µ)-SRRA with respect to h. Let h_1 be a minimizer of f(·) in H (h_1 = argmin_{h′∈H} f(h′)). Then,

er_D(h_1) = (1 + O(ε)) ν + O(ε · er_D(h)) + O(εµ) .

² Note that the term (E[f(h′) − 1_{h′(X)≠Y(X)} + 1_{h(X)≠Y(X)}])² equals (f(h′) − reg_h(h′))²; however, here we can only devise a lower bound on this term.


Proof Applying the definition of (ε, µ)-SRRA, we have:

er_D(h_1) ≤ er_D(h) + f(h_1) + ε · dist(h, h_1) + εµ
         ≤ er_D(h) + f(h∗) + ε · dist(h, h_1) + εµ
         ≤ er_D(h) + ν − er_D(h) + ε · dist(h, h∗) + ε · dist(h, h_1) + 2εµ
         ≤ ν + ε (2 dist(h, h∗) + dist(h_1, h∗)) + 2εµ . (3.1)

The first inequality is from the definition of (ε, µ)-SRRA; the second is from the fact that h_1 minimizes f(·) by construction; the third is again from the definition of (ε, µ)-SRRA, together with the definitions of h∗ and reg_h; and the fourth is from the triangle inequality. The proof is completed by plugging dist(h, h∗) ≤ er_D(h) + ν and dist(h_1, h∗) ≤ er_D(h_1) + ν into Equation (3.1), subtracting ε · er_D(h_1) from both sides, and dividing by (1 − ε).

A simple inductive use of Theorem 3.1 proves the following corollary, which bounds the excess risk of an ERM-based active-learning algorithm (see Algorithm 1 for corresponding pseudocode). The algorithm's query complexity depends on the specific constructions of (ε, µ)-SRRA estimators.

Corollary 3.1 Let h_0, h_1, h_2, . . . be a sequence of hypotheses in H such that, for all i ≥ 1, h_i = argmin_{h′∈H} f_{i−1}(h′), where f_{i−1} is an (ε, µ)-SRRA with respect to h_{i−1}. Then, for all i ≥ 0,

er_D(h_i) = (1 + O(ε)) ν + O(ε^i) er_D(h_0) + O(εµ) .

Proof Applying Theorem 3.1 with h_i and h_{i−1}, we have

er_D(h_i) = (1 + O(ε)) ν + O(ε · er_D(h_{i−1})) + O(εµ) .

Solving this recursion, one gets

er_D(h_i) = Σ_{j=1}^{i} ε^{j−1} (1 + O(ε)) ν + O(ε^i) · er_D(h_0) + O(Σ_{j=1}^{i} ε^j) µ ,

where we refer to the first summand as term (i) and to the last as term (ii). The result follows easily by bounding geometric sums. Recall that, due to the definition of (ε, µ)-SRRA, we have ε ∈ (0, 1/5). Treating term (i),

Σ_{j=1}^{i} ε^{j−1} (1 + O(ε)) ν = ((1 − ε^i)/(1 − ε)) (1 + O(ε)) ν = ((1 + O(ε))/(1 − ε)) ν = (1 + O(ε)) ν .


We get the last equality from (1 + ε)/(1 − ε) = 1 + 2ε/(1 − ε) = 1 + O(ε). Similarly, for term (ii) we have

O(Σ_{j=1}^{i} ε^{j−1}) εµ = O(1/(1 − ε)) εµ = O(εµ) .

Corollary 3.1 is constructive and leads to the following simple iterative meta-algorithm. In each iteration, the algorithm revises its current solution to the minimizer of an (ε, µ)-smooth relative regret approximation with respect to this current solution. If (h_0, h_1, . . . , h_T) is the series of intermediate solutions, then there is a corresponding series of (ε, µ)-SRRAs (f^(0), f^(1), . . . , f^(T−1)). In order to apply this algorithm, we should give it access to such a resource, as well as provide a suitable solver for minimizing f^(i)(h′) over H.

Algorithm 1 An Active-Learning Algorithm from SRRAs

Require: an initial solution h_0 ∈ H, estimation parameters ε ∈ (0, 1/5) and µ ≥ 0, and a number of iterations T
for i = 0, 1, . . . , T − 1 do
    h_{i+1} ← argmin_{h′∈H} f^(i)(h′), where f^(i) is an (ε, µ)-smooth relative regret approximation with respect to h_i
end for
return h_T
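In code, the meta-algorithm is a short loop around the two user-supplied components, the SRRA builder and the ERM-like solver. The skeleton below is an illustration under our own naming assumptions:

```python
def srra_active_learning(h0, build_srra, minimize, T):
    """Skeleton of the Algorithm 1 meta-algorithm.

    h0         : initial hypothesis
    build_srra : callable h -> f, an (eps, mu)-SRRA with respect to h
                 (this is where label queries are spent)
    minimize   : callable f -> argmin over H of f, the ERM-like solver
    T          : number of iterations
    """
    h = h0
    for _ in range(T):
        f = build_srra(h)   # query labels to build a smooth relative-regret estimator
        h = minimize(f)     # revise the solution to the estimator's minimizer
    return h
```

Corollary 3.1 says that each pass through this loop shrinks the O(ε^i) er_D(h_0) term, so a logarithmic number of iterations suffices to make it negligible.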

Below we will show problems of interest in which (ε, µ)-SRRAs with respect to a given hypothesis h can be obtained using label queries at a few randomly (and adaptively) selected points from the pool X, if the uniform disagreement coefficient θ is small. This will constitute another proof of the usefulness of the disagreement coefficient in the design and analysis of active-learning algorithms. In the next two chapters, we present two problems for which a direct construction of an SRRA yields a significantly better query complexity than that which is guaranteed by using the disagreement coefficient alone.

3.2 Constant Uniform Disagreement Coefficient Implies Efficient SRRAs

We present our first design of smooth relative regret approximation functions. The design relies on the ability to split X into disjoint (disagreement) sets according to fine neighborhoods around the pivot hypothesis. The idea comes from the definition of the disagreement coefficient (Definition 2.2). We show that


a bounded uniform disagreement coefficient implies the existence of query-efficient (ε, µ)-SRRAs. Plugging these SRRAs into Algorithm 1 yields an active-learning algorithm with guarantees that match the state of the art (when using similar assumptions). This constitutes yet another proof of the usefulness of the disagreement coefficient in the design of active-learning algorithms.

3.2.1 The Construction

Assume that the uniform disagreement coefficient θ corresponding to H is finite and ν > 0. Consider the set of all dichotomies H∗ induced by H over X:

H∗ = (⋃_{h′∈H} {X ∈ X : h′(X) = 0}) ∪ (⋃_{h′∈H} {X ∈ X : h′(X) = 1}) .

In other words, H∗ is the collection of all subsets S ⊆ X whose elements X ∈ S are mapped to the same value (0 or 1) by h′, for some h′ ∈ H.

In computational geometry, the hyper-graph (X, H∗) is called a range space, and r ∈ H∗ is called a range. The VC dimension of (X, H∗) is the maximum cardinality of a subset A ⊆ X for which {A ∩ r : r ∈ H∗} contains all subsets of A (e.g., Har-Peled, 2011, Chapter 5.1, page 61).

Assume (X, H∗) has VC dimension d and fix h ∈ H. We define a split of X into disjoint sets according to their impact on disagreement distances over H. Let L = ⌈log µ^{−1}⌉. Define X_0 = DIS(B(h, µ)), and for i = 1, . . . , L define

X_i = DIS(B(h, µ2^i)) \ DIS(B(h, µ2^{i−1})) .

Observe the illustration of these sets in Figure 3.1, where the top of the figure corresponds to the hypothesis class and the bottom to the instance space. At the top, the hypotheses in H are arranged in ball neighborhoods {B(h, µ2^{i−1}) : i = 1, . . . , L} around the pivot h. The bottom of the figure depicts the disjoint sets X_i. Each X_i corresponds to the hypothesis "disc" B(h, µ2^i) \ B(h, µ2^{i−1}).

Let η_i = P_{D_X}[X_i] be the measure of X_i, and let δ be a failure-probability hyper-parameter. For each i ≥ 0, draw a sample X_{i,1}, . . . , X_{i,m} of m = O(ε^{−2} θ (d log θ + log(δ^{−1} log(1/µ)))) examples in X_i, each of which is drawn independently from the distribution D_X|X_i (with repetitions). (By D_X|X_i, we mean the distribution D_X conditioned on X_i.) We now define an estimator function f : H → R of reg_h as follows: for any h′ ∈ H and i = 0, 1, . . . , L, let

f_i(h′) = η_i m^{−1} Σ_{j=1}^{m} (1_{Y(X_{i,j})≠h′(X_{i,j})} − 1_{Y(X_{i,j})≠h(X_{i,j})}) .

Our estimator is now defined as f(h′) = Σ_{i=0}^{L} f_i(h′).


Figure 3.1: Illustration of the disagreement SRRA construction. The top of the figure corresponds to H and the bottom to X. The hypotheses in H are arranged in ball neighborhoods around the pivot h. The instances X are split into disjoint sets that match the discs B(h, µ2^i) \ B(h, µ2^{i−1}). We sample a fixed number m of instances from each subset X_i and use them to define an importance-weighting estimator f for reg_h(·). The weight of each sampled instance is (inversely) proportional to its sample probability. We show that the resulting unbiased estimator of reg_h is a query-efficient (ε, µ)-SRRA and defines an active-learning algorithm via Algorithm 1.

Theorem 3.2 Let f, h, h′, and m be as above. With probability at least 1 − δ, f is an (ε, µ)-SRRA with respect to h.

Proof A main tool to be exploited in the proof is the relative ε-approximation, set forth by Haussler (1992) and Li et al. (2000). It is defined as follows: let h : X → R+ be some function, and let µ_h = E_{X∼D_X}[h(X)]. Let X_1, . . . , X_m denote i.i.d. draws from D_X, and let µ̂_h = (1/m) · Σ_{i=1}^{m} h(X_i) denote the empirical average. Let κ > 0 be an adjustable parameter. We will use the following measure of distance between µ_h and its estimator µ̂_h to determine how far the latter diverges from the true expectation:

d_κ(µ_h, µ̂_h) = |µ_h − µ̂_h| / (µ_h + µ̂_h + κ) .


This measure corresponds to a relative error when approximating µ_h by µ̂_h. Indeed, let ε > 0 be our approximation ratio and suppose d_κ(µ_h, µ̂_h) < ε. This easily yields

|µ_h − µ̂_h| < (2ε/(1 − ε)) · µ_h + (ε/(1 − ε)) · κ . (3.2)

In other words, it is implied that |µ_h − µ̂_h| < O(ε)(µ_h + κ).

Let us fix a parameter 0 < δ < 1, and assume that H is a set of {0, 1}-valued functions on X of VC dimension d. Li et al. (2000) show that if one samples

m = c(ε^{−2} κ^{−1} (d log κ^{−1} + log δ^{−1}))

examples as above, then (3.2) holds uniformly for all h ∈ H with probability at least 1 − δ.
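The implication from d_κ(µ_h, µ̂_h) < ε to inequality (3.2) is elementary algebra (bound µ̂_h by µ_h + |µ_h − µ̂_h| and rearrange); the following snippet verifies it numerically on random instances:

```python
import random

def d_kappa(mu, mu_hat, kappa):
    """Relative distance measure of Li et al. (2000)."""
    return abs(mu - mu_hat) / (mu + mu_hat + kappa)

# Check: d_kappa(mu, mu_hat) < eps  implies
#        |mu - mu_hat| < 2*eps/(1-eps) * mu + eps/(1-eps) * kappa
rng = random.Random(1)
eps, kappa = 0.2, 0.05
for _ in range(10_000):
    mu, mu_hat = rng.random(), rng.random()
    if d_kappa(mu, mu_hat, kappa) < eps:
        bound = 2 * eps / (1 - eps) * mu + eps / (1 - eps) * kappa
        assert abs(mu - mu_hat) < bound
```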

We now apply this definition of relative ε-approximations, and the corresponding results, within our context. For any h′, we define the following four sets of instances:

R^{++}_{h′} = {X ∈ X : h′(X) = Y(X) = 1, and h(X) = 0}
R^{+−}_{h′} = {X ∈ X : h′(X) = 1, and h(X) = Y(X) = 0}
R^{−+}_{h′} = {X ∈ X : h′(X) = 0, and h(X) = Y(X) = 1}
R^{−−}_{h′} = {X ∈ X : h′(X) = Y(X) = 0, and h(X) = 1} .

Observe that the set {X ∈ X : h(X) ≠ h′(X)} is equal to the union of R^{++}_{h′}, R^{+−}_{h′}, R^{−+}_{h′}, and R^{−−}_{h′}. For each i = 0, . . . , L and b ∈ {++, +−, −+, −−}, let R^{b}_{h′,i} = R^{b}_{h′} ∩ X_i, and let R^{b}_i = {R^{b}_{h′,i} : h′ ∈ H}. It is easy to verify that the VC dimension of the range spaces (X_i, R^{b}_i) is at most d: each set in R^{b}_i is an intersection of a set in H∗ with some fixed set.

For any R ⊆ X_i, let ρ_i(R) = P_{X∼D_X|X_i}[X ∈ R] and ρ̂_i(R) = m^{−1} Σ_{j=1}^{m} 1_{X_{i,j}∈R}. Note that ρ̂_i(R) is an unbiased estimator of ρ_i(R).

From Equation (3.2), the assumptions on θ and ν, and the choice of m, we have: with probability at least 1 − δ/L, for all R ∈ R^{++}_i ∪ R^{+−}_i ∪ R^{−+}_i ∪ R^{−−}_i,

|ρ_i(R) − ρ̂_i(R)| = O(ε) · (ρ_i(R) + θ^{−1}) ; (3.3)

and by the probability union bound we obtain that this holds uniformly for all i = 0, . . . , L with probability at least 1 − δ.

Now fix h′ ∈ H, let r = dist(h, h′), and let i_r = ⌈log(r/µ)⌉. By the definition of X_i, h(X) = h′(X) for all X ∈ X_i whenever i > i_r. We can therefore decompose reg_h(h′) as:


reg_h(h′) = er_D(h′) − er_D(h)
= Σ_{i=0}^{L} η_i · (P_{X∼D_X|X_i}[Y(X) ≠ h′(X)] − P_{X∼D_X|X_i}[Y(X) ≠ h(X)])
= Σ_{i=0}^{i_r} η_i · (P_{X∼D_X|X_i}[Y(X) ≠ h′(X)] − P_{X∼D_X|X_i}[Y(X) ≠ h(X)])
= Σ_{i=0}^{i_r} η_i · (− ρ_i(R^{++}_{h′}) + ρ_i(R^{+−}_{h′}) + ρ_i(R^{−+}_{h′}) − ρ_i(R^{−−}_{h′})) .

On the other hand, we similarly have that

f(h′) = Σ_{i=0}^{i_r} η_i · (− ρ̂_i(R^{++}_{h′}) + ρ̂_i(R^{+−}_{h′}) + ρ̂_i(R^{−+}_{h′}) − ρ̂_i(R^{−−}_{h′})) .

Combining these, we conclude by using (3.3) that

|reg_h(h′) − f(h′)| ≤ O(ε) · (Σ_{i=0}^{i_r} η_i · (ρ_i(R^{++}_{h′}) + ρ_i(R^{+−}_{h′}) + ρ_i(R^{−+}_{h′}) + ρ_i(R^{−−}_{h′}) + 4θ^{−1})) . (3.4)

But now notice that Σ_{i=0}^{i_r} η_i · (ρ_i(R^{++}_{h′}) + ρ_i(R^{+−}_{h′}) + ρ_i(R^{−+}_{h′}) + ρ_i(R^{−−}_{h′})) equals r, since it corresponds to those elements X ∈ X on which h and h′ disagree. Also note that Σ_{i=0}^{i_r} η_i is at most 2 max{P_{D_X}[DIS(B(h, r))], P_{D_X}[DIS(B(h, µ))]}. By the definition of θ, this implies that the RHS of (3.4) is bounded by ε(r + µ), as required by the definition of (ε, µ)-SRRA.³

Corollary 3.2 An (ε, µ)-SRRA with respect to h can be constructed, with probability at least 1 − δ, using at most

m (1 + ⌈log(1/µ)⌉) = O(θ ε^{−2} (log(1/µ)) (d log θ + log(δ^{−1} log(1/µ)))) (3.5)

label queries.

Combining Corollaries 3.1 and 3.2 (Algorithm 1), we obtain an active-learning algorithm in the ERM setting with query complexity depending on the uniform disagreement coefficient and the VC dimension. Assume δ is a constant; if we are interested in excess risk on the order of at least that of the optimal error ν,

³ The O-notation disappeared because we assume that the constants are properly chosen in the definition of the sample size m.


then we may take ε to be, say, 1/5, and achieve the sought bound by constructing (1/5, ν)-SRRAs using O(θd(log(1/ν))(log θ)) label queries, once for each of the O(log(1/ν)) iterations of Algorithm 1. If we seek a solution with error (1 + ε)ν, we would need to construct (ε, ν)-SRRAs using O(θdε^{−2}(log(1/ν))(log θ)) label queries, one for each of the O(log(1/ν)) iterations of the algorithm. The total label query complexity is O(θd(log²(1/ν))(log θ)), which is O(log(1/ν)) times the best known bounds using disagreement-coefficient and VC-dimension bounds only (e.g., Dasgupta et al., 2007; Beygelzimer et al., 2009).

A few more notes of comparison are in order. First, note that known arguments that bound query complexity using the disagreement coefficient use in their analysis the disagreement coefficient θ_{h∗} with respect to the optimal hypothesis h∗, and not the uniform coefficient θ. Also note that both in previously known results that bound the query complexity using the disagreement coefficient and VC-dimension bounds, and in our result as well, the slight improvement described in Remark 2.1 applies. In other words, all arguments remain valid if we replace the supremums in (2.1) and (2.2) with sup_{r≥ν}.

3.3 Convex Relaxations

So far we have focused on theoretical ERM aspects only. By so doing, we have made no assumptions about the computability of the step h_i = argmin_{h′∈H} f_{h_{i−1}}(h′) in Corollary 3.1 (Step 2 in Algorithm 1). Although ERM results are interesting in their own right, we take an additional step and consider convex relaxations.

Instead of optimizing er_D(h) over the set H, assume that we are interested in optimizing over h̄ ∈ H̄, where H̄ is a convex set of functions from X to R. Also assume there is a mapping φ : H̄ → H that is used as a "rounding" procedure. When optimizing in H̄, one conveniently works with a convex relaxation ēr_D : H̄ → R+ as a surrogate for the discrete loss er_D, defined as follows:

ēr_D(h̄) = E_{(X,Y)∼D}[L(h̄(X), Y)] , (3.6)

where L : R × {0, 1} → R+ is some function, convex in its first argument, satisfying

1_{(φ(h̄))(X)≠Y} ≤ c · L(h̄(X), Y)

for all h̄ ∈ H̄ and X ∈ X, where c > 0 is some constant. In other words, L upper bounds the discrete loss (up to a factor of c).

Example 3.1 For example, consider the well-known setting of SVMRank with the hinge-loss relaxation (Herbrich et al., 2000; Joachims, 2002). Here, X ⊆ R^d × R^d and we would like to learn an order over X. We take H̄ as the set of


all vectors w ∈ R^d, and the rounding method φ : H̄ → H converts w to an order over X. For all w ∈ H̄ and x = (u, v) ∈ X, w(x) = ⟨w, u − v⟩, and L(a, b) = max{1 − a(2b − 1), 0}. Using this choice, optimizing over (3.6) becomes the famous SVMRank optimization:

Minimize F(w, ξ) = Σ_{u,v} ξ_{u,v} (3.7)
s.t. ∀u, v with Y(u, v) = 1 : (u − v) · w ≥ 1 − ξ_{u,v} ,
∀u, v : ξ_{u,v} ≥ 0 ,
‖w‖ ≤ c .

(Note: here c is a regularization parameter.)
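Optimization (3.7) amounts to minimizing the pairwise hinge loss under a norm constraint. The following tiny projected-subgradient sketch makes this concrete; the step size, iteration count, and the projection radius c are illustrative choices, not the solver used in the thesis:

```python
def svmrank_hinge_loss(w, pairs):
    """Pairwise hinge loss: sum of max{1 - <w, u - v>, 0} over preferred pairs."""
    loss = 0.0
    for u, v in pairs:                      # pair (u, v) means u is preferred over v
        margin = sum(wi * (ui - vi) for wi, ui, vi in zip(w, u, v))
        loss += max(1.0 - margin, 0.0)
    return loss

def svmrank_subgradient(pairs, dim, c=10.0, step=0.1, iters=200):
    """Projected subgradient descent for the SVMRank objective (3.7)."""
    w = [0.0] * dim
    for _ in range(iters):
        g = [0.0] * dim
        for u, v in pairs:
            d = [ui - vi for ui, vi in zip(u, v)]
            if sum(wi * di for wi, di in zip(w, d)) < 1.0:   # active hinge term
                g = [gi - di for gi, di in zip(g, d)]
        w = [wi - step * gi for wi, gi in zip(w, g)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > c:                                          # project onto ||w|| <= c
            w = [wi * c / norm for wi in w]
    return w
```

Here each slack variable ξ_{u,v} of (3.7) corresponds to one hinge term max{1 − (u − v)·w, 0}, which is why the constrained program and the unconstrained hinge minimization coincide at the optimum.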

We now have a natural extension of the relative regret: reg_h̄(h̄′) = ēr_D(h̄′) − ēr_D(h̄). By our assumptions on convexity, reg_h̄ : H̄ → R+ can be efficiently optimized. We now say that f : H̄ → R+ is an (ε, µ)-SRRA with respect to h̄ ∈ H̄ if, for all h̄′ ∈ H̄,

|reg_h̄(h̄′) − f(h̄′)| ≤ ε (dist(φ(h̄), φ(h̄′)) + µ) .

If µ = 0, then we simply say that f is an ε-SRRA. The following is an analogue of Corollary 3.1:

Theorem 3.3 Let h̄_0, h̄_1, h̄_2, . . . be a sequence of hypotheses in H̄ such that, for all i ≥ 1, h̄_i = argmin_{h̄′∈H̄} f_{i−1}(h̄′), where f_{i−1} is an (ε, µ)-SRRA with respect to h̄_{i−1}. Then, for all i ≥ 1,

ēr_D(h̄_i) = (1 + O(ε)) ν̄ + O(ε^i) ēr_D(h̄_0) + O(εµ) ,

where ν̄ = inf_{h̄∈H̄} ēr_D(h̄) and the O-notations may hide constants that depend on c.

The proof is very similar to that of Corollary 3.1, and we omit the details.

3.4 Discussion

We presented here the key ingredients of our method of smooth relative regret approximations. The core of our method is the smoothness condition of Definition 3.1. The effectiveness, in terms of query complexity, of such smooth regret estimators is evident in Corollary 3.1 and Algorithm 1. This is a meta-algorithm that requires two components to be instantiated: an SRRA implementation and an ERM-like solver. We discussed implementations of these two components in Sections 3.2.1 and 3.3.

We think of the SRRA smoothness condition as a holistic explore–exploit condition for active learning. Furthermore, it seems the need for explore–exploit


tradeoffs exists when the corresponding discrepancy between the estimated error and the truth fluctuates. Our SRRA condition assures that this discrepancy is smooth with respect to distances from some pivotal intermediate solution. In many cases we can correlate such hypothesis distances with corresponding instance subsets. The condition intuitively guides us to spread queries all over the query space (exploration), with a density inversely proportional to the distance from some fixed intermediate solution. This intuition is well demonstrated in the way we construct the disagreement-based SRRA presented in Section 3.2.1, where we spread the queries so as to combine exploitation near current solutions with exploration that enables identification of better "far" candidates.

Our method is concise and simple, yet rigorous, and provides guarantees that meet the state of the art; nevertheless, we have not yet shown that it extends the state of the art. Moreover, the SRRA construction provided above is infeasible, because it requires calculating disagreement regions with respect to arbitrary hypotheses. Thus, the construction presented in this chapter points to the potential of our method. We will prove the advantages of our method in the next two chapters, where we present specific feasible SRRA constructions that provide the best-known guarantees.


Chapter 4

Active Preference-based Ranking Using SRRAs

"Learning to rank" takes various forms in the theory and practice of learning and combinatorial optimization; in all its forms, the goal is to order a set V of n elements based on constraints. Preference-based constraints arise when the labels come from natural-source judgments, such as human ratings or rankings.

Perhaps the most studied form of preference-based ranking considers absolute-value constraints. In this setting, each element in V is matched with a discrete numeric scale that defines ordinal-type preferences; higher-value elements are preferred over lower-value elements. The goal is to learn how to order V so as to respect the induced pairwise preferences. For example, review systems, such as reviews of hotels, books, and restaurants, use the "star quality" scale {1, 2, 3, 4, 5}, where, if u has a label of 5 ("very good") and v has a label of 1 ("very bad"), then any ordering that places v ahead of u is penalized. Note that even if the labels are noisy, the induced pairwise preferences here are always transitive; hence no combinatorial problem arises. Asking humans to judge "complex" phenomena according to a numeric scale is extremely problematic. Indeed, such judgments are plagued with calibration issues, errors, and inconsistencies (see, e.g., Stewart et al., 2005); this is intuitively analogous to embedding a high-dimensional space on the line.

Our work deals with a completely different setting, one in which the basic unit of information consists of preferences over pairs u, v ∈ V. Ranking items according to their comparison in pairs dates back to the classical work of Thurstone (1927). Such preferences are appealing, firstly, because they can encapsulate "complex" judgments, at least in the sense that labels are direct signals rather than surrogate signals, as with ordinal-scale labels. Secondly, relative judgments can be non-transitive. Thus, here the problem becomes combinatorially interesting.

We focus on this combinatorial aspect of the problem, studying learning to rank from pairwise preferences (LRPP), a close relative of minimum feedback in

35

Technion - Computer Science Department - Ph.D. Thesis PHD-2013-04 - 2013

Page 50: THEORY AND PRACTICE OF ACTIVE LEARNING · encouragement. For all these reasons, Nir is a genuine role model and the ideal advisor | I couldn’t ask for better! ... Dr. Esther Ezra,

arc-set tournaments (MFAST)1 from the world of combinatorial optimization. InMFAST, the goal is to find a full linear order of V , given all

(n2

)pairwise com-

parisons of elements (i.e., all possible labels) for free. It turns out that MFAST isNP-hard (Alon, 2006), though Kenyon-Mathieu and Schudy (2007) show a PTASfor it. Namely, they show a degree-n polynomial time algorithm computing asolution with loss at most (1 + ε) times the optimal (the degree of the polyno-mial may depend on ε). Several important recent works address the challenge ofapproximating the minimum feedback arc-set problem (Ailon et al., 2008; Braver-man and Mossel, 2008; Coppersmith et al., 2010).

In terms of practicality, one of the obstacles of LRPP is its apparent quadratic sample complexity. From a learning-theoretic point of view, the need to acquire a quadratic number of pairwise preferences is unacceptable even for moderately large sets V that arise in applications. On the other hand, uniform sub-sampling of pairs works poorly (see e.g., Section 4.2). Thus, devising a good active-learning approach is crucial to the problem’s practicability.

Here we consider a query-efficient variant of LRPP in which each preference comes with a cost, the goal being to produce a competitive solution while reducing the preference-query overhead. Other very recent works consider similar settings (Jamieson and Nowak, 2011; Ailon, 2012). Jamieson and Nowak (2011) consider a common scenario in which the alternatives can be characterized in terms of d real-valued features and the ranking obeys the structure of the Euclidean distances between such embeddings. They present an active-learning algorithm that requires, using average-case analysis, as few as O(d log n) labels in the noiseless case, and O(d log² n) labels under a certain parametric noise model. Our work uses worst-case analysis and assumes an adversarial noise model. In Section 4.4 we analyze the pure combinatorial problem (not assuming any feature embeddings). In Section 4.5 we tackle the problem with linearly induced permutations over feature-space embeddings.

Ailon (2012) considers the same setting as ours. Our main result for query complexity, stated in Corollary 4.1, is a slight improvement on Ailon’s, and it provides another significant improvement. Ailon (2012) uses a querying method that is based on a divide-and-conquer strategy. The weakness of such a strategy can be demonstrated by considering an example in which we want to search over a restricted set of permutations (e.g., the setting of Section 4.5): while dividing and conquering, Ailon’s algorithm is doomed to search a Cartesian product of two permutation spaces (left and right), and there is no guarantee that there even exists a permutation in the restricted space that respects this division. In our querying algorithm, this limitation is lifted.

¹A maximization version exists as well.


4.1 Problem Definition

Let V be a set of n elements (alternatives). The instance space X is taken to be the set of all distinct pairs of elements in V, namely X = V × V \ {(u, u) : u ∈ V}. The distribution D_X is uniform on X. The label function Y : X → {0, 1} encodes a preference function satisfying Y((u, v)) = 1 − Y((v, u)) for all u, v ∈ V. We conventionally think of Y((u, v)) = 1 as a stipulation that u is preferred over v. For convenience, we will drop the double parentheses in what follows.

Remark 4.1 We make the above “anti-symmetric” assumption on Y only for the convenience of exposition. We could provide an alternative definition without making any assumption on Y, where we index the elements of V arbitrarily and take X to be the set of unordered pairs of elements in V. Then, for the pair {v_i, v_j} with i < j, the value Y(v_i, v_j) = 1 stipulates that v_i is preferred over v_j, and it is zero otherwise.

The class of solution functions H we consider is the set of all h : X → {0, 1} that are skew-symmetric, h(u, v) = 1 − h(v, u), and transitive, h(u, z) ≤ h(u, v) + h(v, z), for all distinct u, v, z ∈ V. This is equivalent to the space of permutations over V. Hereafter in this chapter, we will replace h, h′, . . . with π, σ, . . .; we also use the notation u ≺_π v as a predicate equivalent to π(u, v) = 1; and with a slight abuse of notation we will let π(u) denote the rank of u in the permutation induced by π.
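The equivalence between such pairwise functions and permutations can be checked mechanically. The following small sketch (ours, not from the thesis) uses element-to-rank dictionaries as an assumed representation, builds h from a rank function, and verifies the skew-symmetry and transitivity constraints:

```python
from itertools import permutations

def h_from_ranks(rank):
    """h(u, v) = 1 iff u precedes (is preferred over) v in the permutation."""
    return lambda u, v: 1 if rank[u] < rank[v] else 0

rank = {'a': 1, 'b': 2, 'c': 3}   # the permutation a, b, c
h = h_from_ranks(rank)
for u, v in permutations(rank, 2):
    assert h(u, v) == 1 - h(v, u)          # skew-symmetry
for u, v, z in permutations(rank, 3):
    assert h(u, z) <= h(u, v) + h(v, z)    # transitivity constraint
```

Any total order passes both checks; conversely, a skew-symmetric, transitive h induces a unique ranking.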

Endowing X with the uniform measure and denoting the number of all ordered pairs by N = n(n − 1), we have

er_D(π) = N^{−1} Σ_{(u,v)∈X} 1_{π(u,v)≠Y(u,v)} . (4.1)

The distance dist(π, σ) turns out to be (up to normalization) the well-known Kendall-τ distance:

dist(π, σ) = N^{−1} Σ_{u≠v} 1_{π(u,v)≠σ(u,v)} . (4.2)

Another well-known distance over permutations that we will consider is Spearman’s footrule distance:

d_SF(π, σ) = Σ_{u∈V} |π(u) − σ(u)| , (4.3)

which is easier to manipulate than the Kendall-τ distance. A classical result by Diaconis and Graham (1977) connects these two distances and makes it possible to interchange one with the other:

N dist(π, σ) ≤ d_SF(π, σ) ≤ 2N dist(π, σ) . (4.4)
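For concreteness, both distances can be computed directly. The following sketch (ours, purely illustrative) uses element-to-rank dictionaries as an assumed representation:

```python
from itertools import combinations

def kendall_dist(pi, sigma):
    """Normalized Kendall-tau distance of Eq. (4.2): the fraction of the
    N = n(n-1) ordered pairs on which pi and sigma disagree."""
    n = len(pi)
    disagree = sum(2 for u, v in combinations(pi, 2)
                   if (pi[u] < pi[v]) != (sigma[u] < sigma[v]))
    return disagree / (n * (n - 1))

def footrule_dist(pi, sigma):
    """Spearman footrule distance of Eq. (4.3): total rank displacement."""
    return sum(abs(pi[u] - sigma[u]) for u in pi)

pi    = {'a': 1, 'b': 2, 'c': 3}
sigma = {'a': 3, 'b': 2, 'c': 1}   # full reversal of pi
```

On the full reversal above, kendall_dist(pi, sigma) is 1.0 (every pair is inverted) while footrule_dist(pi, sigma) is 4.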


4.2 The Necessity of Active Learning

A natural question in this context is why the optimal hypothesis cannot be approximated by sampling preference pairs uniformly at random. In other words, is it sufficient for our setting to apply a passive learning method? Do we really need to apply the more sophisticated active-learning machinery?

Let us start to tackle this concern from the perspective of VC learning theory (see e.g., Vapnik, 1995). VC theory tells us that if the VC dimension of H is d and we sample m > n pairs uniformly at random, denoted by S_m, then with a probability of at least 1 − δ, the (unbiased) empirical estimation m^{−1} Σ_{(u,v)∈S_m} 1_{π(u,v)≠Y(u,v)} deviates from its expectation er_D(π) by no more than

O(√((d log m + log(1/δ)) / m)) .

However, it is known that the VC dimension of H is n − 1 (e.g., Radinsky and Ailon, 2011). The upper bound holds simply because for any set of n pairs there always exists a labeling Y(·) that defines preference cycles; in other words, the set of permutations H cannot shatter n pairs. On the other hand, a maximal set of pairs (u, v) ∈ X in which each alternative is incident to exactly one pair is shattered by H.

The VC bound becomes O(√((n log m + log(1/δ)) / m)). If we want to achieve an additive error of ε with a probability of at least 1 − exp(−n), then we will have to sample O(ε^{−2} n log n) pairs uniformly at random with repetitions and optimize the empirical estimation over H.

Recall that Spearman’s footrule and Kendall-τ distances are of the same order (Equation 4.4). An additive error of ε means that each alternative u moves on average εn positions away from its (optimal) rank in π*. In practice, we usually want to achieve a constant average misplacement. Thus, we require an additive error of ε/n, which gives rise to a quadratic sample complexity; this basically means that we have to query the whole instance space X.

Let us now consider a different argument that motivates the use of active learning. Assume that the problem is realizable, that is, the noise rate of the hypothesis class is zero. The only way we can achieve a zero-error permutation using the empirical risk minimization (ERM) strategy is when the sample contains all n − 1 consecutive pairs of the optimal permutation π*. A standard application of concentration bounds around the mean tells us that the probability of such an event is exponentially small when sampling pairs uniformly at random (with or even without replacement). For example, take {X_i}_{i=1}^m to be i.i.d. random variables indicating whether a sampled pair is consecutive in π* or not. Clearly, each X_i is a Bernoulli random variable with mean p = 1/N. Applying a Chernoff bound, we get that P[Σ_i X_i > (1 + t)mN^{−1}] ≤ exp(−ct²) (for some global constant c). Observe that when m = o(N), we must take t = Ω(n), and the probability that we actually hit the set of consecutive pairs is exponentially low. We note that Ailon (2012) used arguments similar to ours.
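This exponentially small coverage probability is easy to observe empirically. The following Monte Carlo sketch (ours, purely illustrative) estimates the chance that m uniform pair samples contain all n − 1 consecutive pairs of the optimal permutation, taken here to be the identity:

```python
import random

def covers_consecutive(n, m, rng):
    """One trial: draw m unordered pairs uniformly (with replacement) and
    check whether every consecutive pair {i, i+1} of the identity
    permutation was drawn."""
    needed = {frozenset((i, i + 1)) for i in range(n - 1)}
    for _ in range(m):
        needed.discard(frozenset(rng.sample(range(n), 2)))
        if not needed:
            return True
    return False

def coverage_rate(n, m, trials=100, seed=0):
    """Fraction of trials in which all consecutive pairs were covered."""
    rng = random.Random(seed)
    return sum(covers_consecutive(n, m, rng) for _ in range(trials)) / trials
```

For n = 30 there are n(n − 1)/2 = 435 unordered pairs; with m = 150 samples the estimated coverage rate is essentially 0, and covering all 29 consecutive pairs reliably requires on the order of N log n samples (a coupon-collector effect).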

4.3 Disagreement Coefficient Arguments Are Not Sufficient for Effective Active Learning

The former section motivated a preference for active over passive learning for LRPP. In Chapter 3 we presented a disagreement-based SRRA construction that assumes only a finite disagreement coefficient and VC dimension, and guarantees a sample complexity that matches the state of the art (when using these assumptions only). We will show below that one cannot achieve a useful active-learning algorithm by using disagreement coefficient arguments only.

The uniform disagreement coefficient of H is, by Definition 2.2,

θ = sup_{π∈H} sup_{r>0} N^{−1}|DIS(B(π, r))| / r .

It is easy to show that θ is Ω(n) (as has been shown by Ailon, 2012). Notice that if we start from some permutation π and swap the positions of any two elements u, v ∈ V, then we obtain a permutation whose distance from π is at most O(1/n), as depicted in Figure 4.1. Hence, the disagreement region of the ball of radius O(1/n) around π is the entire space X. Plugging in the corresponding values, we get θ ≥ (N^{−1} · N) / n^{−1} = Ω(n).

Figure 4.1: Maximal distance dist(π, σ) caused by a single pair inversion. The figure depicts two permutations π (upper) and σ (lower), where σ is derived from π by inverting the order of v_1 and v_n. As a result, π and σ disagree over 2(n − 1) − 1 unordered pairs; hence dist(π, σ) = O(n)/N = O(1/n).

Recall that the VC dimension of H is n − 1. Using Corollary 3.2, we conclude that we would need Ω(n²) preference labels to obtain an (ε, µ)-SRRA for any meaningful pair (ε, µ). This is uninformative because the cardinality of X is Θ(n²). A similar bound is obtained with any known active-learning bound that uses only disagreement coefficient and VC dimension bounds.
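The Figure 4.1 calculation is easy to verify numerically. The sketch below (ours) swaps the first and last elements of the identity permutation and checks that exactly 2(n − 1) − 1 unordered pairs are inverted, so the distance is only O(1/n) even though the disagreement region of such a ball already spans all of X:

```python
n = 100
rank_pi = {v: v for v in range(n)}        # identity permutation (rank by index)
rank_sigma = rank_pi.copy()
rank_sigma[0], rank_sigma[n - 1] = rank_sigma[n - 1], rank_sigma[0]  # swap v_1, v_n

inversions = sum(1 for a in range(n) for b in range(a + 1, n)
                 if (rank_pi[a] < rank_pi[b]) != (rank_sigma[a] < rank_sigma[b]))
assert inversions == 2 * (n - 1) - 1       # as in Figure 4.1
dist = 2 * inversions / (n * (n - 1))      # Eq. (4.2), over ordered pairs
assert dist < 4 / n                        # O(1/n)
```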

A slight improvement on this negative result can be obtained using the refined definition of disagreement coefficients of Remark 2.1: namely, by replacing sup_{r>0} with sup_{r≥ν} in the definition of θ. Observe that the uniform disagreement coefficient, as well as the disagreement coefficient at the optimal solution π*, becomes² θ = θ_{π*} = O(1/ν) if ν ≥ 1/n, improving the query complexity bound to O(nν^{−1}). As ν tends to 1/n from above, this becomes a quadratic (in n) query complexity in the limit.

²Due to symmetry, the uniform disagreement coefficient here equals θ_π for any π ∈ H.

4.4 Better SRRA for LRPP

The former section demonstrated that the general disagreement-based SRRA construction of Chapter 3 provides uninformative guarantees when applied to LRPP (when ν is “low”). Thus, it is necessary to seek a better SRRA construction scheme.

Before starting, let us remove the dependency on µ in the definition of (ε, µ)-SRRA to indicate that the flavor of this setting is different from that of the general one. Following from our discussion of ν in the former section, taking µ = 1/n should be sufficient; thus, we will simply use the term ε-SRRA here.

Consider the following idea for creating an ε-SRRA for LRPP with respect to some fixed π ∈ H. Each u ∈ V will define a disjoint partition of V, such that each subset consists of elements that induce a similar order of “point-wise” error contribution when inverting their π-induced relative order with u. Following the guidance of the ε-SRRA definition, we will sample from the partition subsets so that our sampling becomes denser as this potential point-wise error becomes low, as depicted in Figure 4.2. Below we describe the construction in detail.

Let p be a sampling parameter as defined in (4.5). For all u ∈ V and for all i = 0, 1, . . . , ⌈log n⌉, let I_{u,i} denote the set of elements v such that (2^i − 1)p < |π(u) − π(v)| < 2^{i+1}p (recall that π(u) is the position of u in π). From this set, choose a random sequence of p elements R_{u,i} = (v_{u,i,1}, v_{u,i,2}, . . . , v_{u,i,p}), each chosen uniformly and independently from I_{u,i}. We define the sample-size parameter to be

p = O(ε^{−3} log³ n) . (4.5)

This choice is derived from the machinery we will use to prove sample-size guarantees.

Remark 4.2 A variant of this sampling scheme is as follows: for each pair (u, v), add it to the query set with probability proportional to min{1, p/|π(u) − π(v)|}. A similar scheme can be found in Ailon et al. (2007), Halevy and Kushilevitz (2007), and Ailon (2012), but the strong properties proven here were not known.

For distinct u, v ∈ V and a permutation σ ∈ H, let cost_{u,v}(σ) denote the contribution of the pair u, v to er_D(σ), namely, cost_{u,v}(σ) = N^{−1} 1_{σ(u,v)≠Y(u,v)}. Let reg_{u,v|σ} denote the contribution of u, v ∈ X to reg_π(σ), that is,

reg_{u,v|σ} = 2(cost_{u,v}(σ) − cost_{u,v}(π)) . (4.6)


Figure 4.2: Depicting the core element of our SRRA construction for LRPP. Element u defines a partition {I_{u,i}} over V. Each partition subset is depicted as a dark “disc.” Observe that all the alternatives in a disc share a “similar” magnitude of point-wise (footrule) error if an alternative switches places with u. From each disc we sample p elements i.i.d. We do so for every u ∈ V and define a corresponding unbiased estimator for reg_π.

Notice that the notation discards the dependency on π because it is assumed to be fixed. The factor 2 is used because cost_{u,v} ≡ cost_{v,u}.

Our estimator f(σ) of reg_π(σ) = er_D(σ) − er_D(π) is defined as

f(σ) = (1/2) Σ_{u∈V} Σ_{i=0}^{⌈log n⌉} (|I_{u,i}|/p) Σ_{t=1}^{p} reg_{u,v_{u,i,t}|σ} . (4.7)

Clearly, f(σ) is an unbiased estimator of reg_π(σ) for any σ; our goal is to prove that f is an ε-SRRA.

Theorem 4.1 With a probability of at least 1 − n^{−3}, the function f is an ε-SRRA with respect to π.

Proof The main idea is to decompose the difference |f(σ) − reg_π(σ)| vis-à-vis corresponding pieces of dist(σ, π). The first half of the proof is devoted to the definition of such distance “pieces.” Then, using counting and standard deviation-bound arguments, we show that the decomposition is, with high probability, an ε-SRRA.

Let us start with a few definitions. Recall that for any π ∈ H and u ∈ V, π(u) denotes the position of u in the unique permutation that π defines. For example, π(u) = 1 if u beats all other alternatives: π(u, v) = 1 for all v ≠ u. Similarly, π(u) = n if u is beaten by all other alternatives. For any permutation σ ∈ H, we define the corresponding profile of σ as the vector:³

prof(σ) = (σ(u_1) − π(u_1), σ(u_2) − π(u_2), . . . , σ(u_n) − π(u_n)) .

Note that ‖prof(σ)‖₁ is d_SF(σ, π), the Spearman footrule distance between σ and π. For a subset V′ of V, we let prof(σ)[V′] denote the restriction of the vector prof(σ) to V′, in other words, the vector obtained by zeroing in prof(σ) all coordinates v ∉ V′.
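In code, the profile and its footrule connection are immediate (a sketch of ours with rank dictionaries; the helper names are not from the thesis):

```python
def profile(sigma, pi):
    """prof(sigma): per-element displacement sigma(u) - pi(u)."""
    return {u: sigma[u] - pi[u] for u in pi}

def restrict(prof, subset):
    """prof(sigma)[V']: zero out every coordinate outside V'."""
    return {u: (x if u in subset else 0) for u, x in prof.items()}

pi    = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
sigma = {'a': 2, 'b': 4, 'c': 3, 'd': 1}
prof = profile(sigma, pi)                       # {'a': 1, 'b': 2, 'c': 0, 'd': -3}
assert sum(abs(x) for x in prof.values()) == 6  # ||prof(sigma)||_1 = d_SF(sigma, pi)
assert restrict(prof, {'a', 'b'}) == {'a': 1, 'b': 2, 'c': 0, 'd': 0}
```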

Now fix σ ∈ H and two distinct u, v ∈ V. Assume {u, v} is an inversion in σ with respect to π and that |π(u) − π(v)| = b for some integer b. Then, either |π(u) − σ(u)| ≥ b/2 or |π(v) − σ(v)| ≥ b/2. We will “charge” the inversion to argmax_{z∈{u,v}} |π(z) − σ(z)|.⁴ For any u ∈ V, let charge_σ(u) denote the set of elements v ∈ V such that (u, v) is an inversion in σ with respect to π which is charged to u based on the rule above. The function reg_π(σ) can now be written as

reg_π(σ) = Σ_{u∈V} Σ_{v∈charge_σ(u)} reg_{u,v|σ} , (4.8)

where reg_{u,v|σ} is defined in Equation (4.6); indeed, any pair that is not inverted contributes nothing to the difference. Similarly, our estimator f(σ) can be written as

f(σ) = Σ_{u∈V} Σ_{i=0}^{⌈log n⌉} (|I_{u,i}|/p) Σ_{t=1}^{p} reg_{u,v_{u,i,t}|σ} · 1_{v_{u,i,t}∈charge_σ(u)} .

Observe that above we dropped the factor 1/2 because we count each pair u, v only once.

For any even integer M, let U_{σ,M} denote the set of all elements u ∈ V such that

M/2 < |π(u) − σ(u)| ≤ M .

Let U_{σ,≤M} denote ⋃_{M′≤M} U_{σ,M′}.

³For the sake of definition, assume an arbitrary indexing such that V = {u_i : i = 1, . . . , n}.

⁴Breaking ties using some canonical rule, such as charging to the greater of u, v viewed as integers.


Hereafter we shall remove the subscript π, because it is held fixed. Consider the following restrictions of reg(σ) and f(σ):

reg(σ, M) = Σ_{u∈U_{σ,M}} Σ_{v∈charge_σ(u)} reg_{u,v|σ} , (4.9)

f(σ, M) = Σ_{u∈U_{σ,M}} Σ_{i=0}^{⌈log n⌉} Σ_{t=1}^{p} (|I_{u,i}|/p) (reg_{u,v_{u,i,t}|σ} · 1_{v_{u,i,t}∈charge_σ(u)}) . (4.10)

Clearly, also here, f(σ, M) is an unbiased estimator of reg(σ, M). Let T_{σ,M} denote the set of all elements u ∈ V such that |π(u) − σ(u)| ≤ εM. We further split the expressions in (4.9)–(4.10) as follows:

reg(σ, M) = A(σ, M) + B(σ, M), and f(σ, M) = Â(σ, M) + B̂(σ, M), (4.11)

where

A(σ, M) = Σ_{u∈U_{σ,M}} Σ_{v∈charge_σ(u)∩T̄_{σ,M}} reg_{u,v|σ} , (4.12)

Â(σ, M) = Σ_{u∈U_{σ,M}} Σ_{i=0}^{⌈log n⌉} (|I_{u,i}|/p) Σ_{t=1}^{p} reg_{u,v_{u,i,t}|σ} · 1_{v_{u,i,t}∈charge_σ(u)∩T̄_{σ,M}} . (4.13)

We use a bar (e.g., T̄_{σ,M}) to denote set complement in V, and B(σ, M), B̂(σ, M) are analogous, with T_{σ,M} instead of T̄_{σ,M}, as follows:

B(σ, M) = Σ_{u∈U_{σ,M}} Σ_{v∈charge_σ(u)∩T_{σ,M}} reg_{u,v|σ} , (4.14)

B̂(σ, M) = Σ_{u∈U_{σ,M}} Σ_{i=0}^{⌈log n⌉} (|I_{u,i}|/p) Σ_{t=1}^{p} reg_{u,v_{u,i,t}|σ} · 1_{v_{u,i,t}∈charge_σ(u)∩T_{σ,M}} . (4.15)

We now estimate the deviation of Â(σ, M) from A(σ, M). Fix M and notice that the expression Â(σ, M) is completely determined by the non-zero elements of the vector prof(σ)[U_{σ,≤M} ∩ T̄_{σ,M}]. Let J_{σ,M} denote the number of non-zeros in prof(σ)[U_{σ,M}]. Each non-zero coordinate of prof(σ)[U_{σ,≤M} ∩ T̄_{σ,M}] is bounded below by εM in absolute value by definition. Let P(d, M) denote the number of possibilities for the vector prof(σ)[T̄_{σ,M}] for σ running over all permutations satisfying d_SF(σ, π) = d. We claim that

P(d, M) ≤ n^{2d/(εM)} . (4.16)

Indeed, there can be at most d/(εM) non-zeros in prof(σ)[T̄_{σ,M}], and each non-zero coordinate can trivially take at most n values. The bound follows.


Now, fix integers d and J and consider the subspace of permutations σ such that J_{σ,M} = J and d_SF(σ, π) = d. Define, for each u ∈ U_{σ,M}, i ∈ [⌈log n⌉], and t = 1, . . . , p, a random variable X_{u,i,t} as follows:

X_{u,i,t} = (|I_{u,i}|/p) reg_{u,v_{u,i,t}|σ} · 1_{v_{u,i,t}∈charge_σ(u)∩T̄_{σ,M}} .

Clearly, Â(σ, M) = Σ_{u∈U_{σ,M}} Σ_{i} Σ_{t} X_{u,i,t}. For any u ∈ V, let i_u = argmax_i {|I_{u,i}| ≤ 4M} and observe that, by our charging scheme, X_{u,i,t} = 0 almost surely for all i > i_u and t = 1, . . . , p. Also observe that for all u, i, t, |X_{u,i,t}| ≤ 2N^{−1}|I_{u,i}|/p ≤ 2^{i+1}N^{−1}/p almost surely. For a random variable X, we denote by ‖X‖_∞ the infimum over numbers α such that X ≤ α almost surely. We conclude:

Σ_{u∈U_{σ,M}} Σ_{i=0}^{i_u} Σ_{t=1}^{p} ‖X_{u,i,t}‖²_∞ ≤ Σ_{u∈U_{σ,M}} Σ_{i=0}^{i_u} N^{−2} p 2^{2i+2}/p² ≤ c₂ p^{−1} N^{−2} J M²

for some global constant c₂ > 0. (We used a bound on the sum of a geometric series.) Using the Hoeffding bound (see Appendix A), we conclude that the probability that Â(σ, M) deviates from its expected value A(σ, M) by more than some s > 0 is at most exp{−s²p/(2c₂JM²N^{−2})}. We also conclude that the probability that Â(σ, M) deviates from its expected value by more than εd/(N log n) is at most exp{−c₁ε²d²p/(JM² log² n)} for some global constant c₁ > 0. Hence, by taking p = O(ε^{−3}d^{−1}MJ log³ n) and union bounding over all P(d, M) possibilities for prof(σ)[T̄_{σ,M}], with a probability of at least 1 − n^{−7}, simultaneously for all σ satisfying J_{σ,M} = J and d_SF(σ, π) = d,

|Â(σ, M) − A(σ, M)| ≤ εd/(N log n) . (4.17)

But note that, trivially, JM ≤ d; hence, our choice of p in (4.5) is satisfactory. Finally, by union bounding over the O(n³ log n) possibilities for the values of J, d, and M = 1, 2, 4, . . . , we conclude that (4.17) holds for all permutations σ simultaneously, with a probability of at least 1 − n^{−3}.

Consider now B(σ, M) and B̂(σ, M), which we will need to further decompose, as demonstrated next. For u ∈ U_{σ,M}, we define a disjoint cover (T¹_{u,σ,M}, T²_{u,σ,M}) of charge_σ(u) ∩ T_{σ,M} as follows. If π(u) < σ(u), then

T¹_{u,σ,M} = {v ∈ T_{σ,M} : π(u) + εM < π(v) < σ(u) − εM} .

Otherwise,

T¹_{u,σ,M} = {v ∈ T_{σ,M} : σ(u) + εM < π(v) < π(u) − εM} .

Note that by definition, T¹_{u,σ,M} ⊆ charge_σ(u). The set T²_{u,σ,M} is thus taken to be

T²_{u,σ,M} = (charge_σ(u) ∩ T_{σ,M}) \ T¹_{u,σ,M} .


The expressions B(σ, M) and B̂(σ, M) now decompose as B₁(σ, M) + B₂(σ, M) and B̂₁(σ, M) + B̂₂(σ, M), respectively, as follows:

B₁(σ, M) = Σ_{u∈U_{σ,M}} Σ_{v∈T¹_{u,σ,M}} reg_{u,v|σ} (4.18)

B₂(σ, M) = Σ_{u∈U_{σ,M}} Σ_{v∈T²_{u,σ,M}} reg_{u,v|σ} (4.19)

B̂₁(σ, M) = Σ_{u∈U_{σ,M}} Σ_{i=0}^{⌈log n⌉} (|I_{u,i}|/p) Σ_{t=1}^{p} reg_{u,v_{u,i,t}|σ} · 1_{v_{u,i,t}∈T¹_{u,σ,M}} (4.20)

B̂₂(σ, M) = Σ_{u∈U_{σ,M}} Σ_{i=0}^{⌈log n⌉} (|I_{u,i}|/p) Σ_{t=1}^{p} reg_{u,v_{u,i,t}|σ} · 1_{v_{u,i,t}∈T²_{u,σ,M}} . (4.21)

Now notice that B̂₁(σ, M) can be uniquely determined from prof(σ)[T̄_{σ,M}]. Indeed, in order to identify T¹_{u,σ,M} for some u ∈ U_{σ,M}, it suffices to identify zeros in a subset of coordinates of prof(σ)[T̄_{σ,M}], where the subset depends only on prof(σ)[u]. Additionally, the value of cost_{u,v}(σ) − cost_{u,v}(π) can be “read” from prof(σ)[T̄_{σ,M}] (and, of course, Y(u, v)) if v ∈ T¹_{u,σ,M}. Hence, a Hoeffding bound and a union bound similar to the ones used for bounding |Â(σ, M) − A(σ, M)| can be used to bound (with high probability) the difference |B̂₁(σ, M) − B₁(σ, M)| uniformly for all σ and M = 1, 2, 4, . . . as well.

Bounding |B̂₂(σ, M) − B₂(σ, M)| can be performed with the following simple claim.

Claim 4.2 For u ∈ V and an integer q, we say that the sampling is successful at (u, q) if the random variable

|{(i, t) : π(v_{u,i,t}) ∈ [π(u) + (1 − ε)q, π(u) + (1 + ε)q] ∪ [π(u) − (1 + ε)q, π(u) − (1 − ε)q]}|

is at most twice its expected value. We say that the sampling is successful if it is successful at all u ∈ V and q ≤ n. If the sampling is successful, then uniformly for all σ and all M = 1, 2, 4, . . . ,

|B̂₂(σ, M) − B₂(σ, M)| = O(εJ_{σ,M}M/N) .

The sampling is successful with probability at least 1 − n^{−3} if p = O(ε^{−1} log n).

The last assertion in the claim follows from Chernoff bounds (see Appendix A). Note that our bound (4.5) on p is satisfactory, in virtue of the claim.

Summing up the errors |Â(σ, M) − A(σ, M)| and |B̂(σ, M) − B(σ, M)| over all M gives us the following assertion: with probability at least 1 − n^{−2}, uniformly for all σ,

|f(σ) − reg_π(σ)| ≤ εN^{−1} d_SF(π, σ) ≤ 2ε dist(π, σ) ,


where the last inequality is by Diaconis and Graham (1977). This concludes the proof.

Algorithm 2 below summarizes our specific ε-SRRA construction for LRPP. Note that, by the choice of the sample size p, the number of preference queries required for computing f is O(ε^{−3}n log⁴ n). Although we only provide the required magnitude of the sample size, observe that each step of the construction is computationally efficient. This allows us to empirically experiment with our ideas, as will be seen in the following sections.

Algorithm 2 SRRA for LRPP

Require: V, H, a pivot π ∈ H, and an estimation parameter ε ∈ (0, 1/5)
  p ← O(ε^{−3} log³ n)
  for u ∈ V do
    for i = 0, 1, . . . , ⌈log n⌉ do
      I_{u,i} ← {v : (2^i − 1)p < |π(u) − π(v)| < 2^{i+1}p}
      for t = 1, . . . , p do
        v_{u,i,t} ← a uniformly and independently sampled alternative from I_{u,i}
      end for
    end for
  end for
  return f : H → R, defined by

  f(σ) = Σ_{u∈V} Σ_{i=0}^{⌈log n⌉} (|I_{u,i}|/p) Σ_{t=1}^{p} (cost_{u,v_{u,i,t}}(σ) − cost_{u,v_{u,i,t}}(π))
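A direct, illustrative Python transcription of Algorithm 2 follows (ours; it fixes the hidden constant in p to 1, represents permutations as element-to-rank dictionaries, and takes the preference oracle Y as a callable, all of which are assumptions of this sketch):

```python
import math
import random

def srra_for_lrpp(V, pi, Y, eps, rng=random):
    """Return the estimator f of Algorithm 2. Each sampled pair costs one
    preference query Y(u, v) in {0, 1}; f(sigma) then estimates the regret
    er(sigma) - er(pi) without any further queries."""
    n = len(V)
    N = n * (n - 1)
    p = max(1, int(eps ** -3 * math.log(n) ** 3))   # p = O(eps^-3 log^3 n)
    samples = []                                    # (u, v, |I_{u,i}|, Y(u, v))
    for u in V:
        for i in range(int(math.ceil(math.log2(n))) + 1):
            I_ui = [v for v in V if v != u and
                    (2 ** i - 1) * p < abs(pi[u] - pi[v]) < 2 ** (i + 1) * p]
            if not I_ui:
                continue
            for _ in range(p):
                v = rng.choice(I_ui)                # uniform, with replacement
                samples.append((u, v, len(I_ui), Y(u, v)))

    def cost(sigma, u, v, y):
        # N^-1 * 1[sigma(u, v) != Y(u, v)]; sigma(u, v) = 1 iff u precedes v
        return (1 if (sigma[u] < sigma[v]) != (y == 1) else 0) / N

    def f(sigma):
        return sum(w / p * (cost(sigma, u, v, y) - cost(pi, u, v, y))
                   for u, v, w, y in samples)

    return f
```

In the noiseless case where Y is induced by some target permutation, f(π) evaluates to exactly 0 and f at the target estimates its regret relative to the pivot.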

We can now define a specific active-learning algorithm for LRPP by plugging the ε-SRRA construction defined in Algorithm 2 into our method’s meta-algorithm, which is defined in Corollary 3.1. This provides the following bound on the number of preference queries.

Corollary 4.1 There exists an active-learning algorithm for obtaining a solution π ∈ H for LRPP with er_D(π) ≤ (1 + O(ε))ν with total query complexity of O(ε^{−3}n log⁵ n). The algorithm succeeds with probability at least 1 − n^{−2}.

Corollary 4.1 tells us that the SRRA method provides a solution of cost (1 + ε)ν with query complexity that is slightly above linear in n (for constant ε), regardless of the magnitude of ν. By comparison, we saw in Section 4.3 that any known active-learning result that uses bounded disagreement coefficient and VC dimension arguments (only) guarantees a query complexity of O(nν^{−1}), tending to the pool size of n(n − 1) as ν becomes small. Note that ν = o(1) is quite realistic for this problem. For example, consider the following noise model: a ground truth permutation π* exists, Y(u, v) is obtained as a human response to the question of preference between u and v with respect to π*, and the human errs with a probability proportional to |π*(u) − π*(v)|^{−ρ}. That is to say, closer pairs of items in the ground truth permutation are more prone to confuse a human labeler. The resulting noise is ν = n^{−ρ} for some ρ > 0. (Note, however, that our work does not assume Bayesian noise, and we present this scenario for purposes of illustration only.)
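The distance-dependent labeler can be sketched as follows (an illustration only; as noted, the analysis itself does not assume this Bayesian model, and the choice of proportionality constant 1 here is ours):

```python
import random

def noisy_label(pi_star, u, v, rho, rng):
    """Answer 'is u preferred over v?' according to the ground truth pi_star
    (element -> rank), erring with probability
    min(1, |pi_star(u) - pi_star(v)|**(-rho))."""
    truth = 1 if pi_star[u] < pi_star[v] else 0
    p_err = min(1.0, abs(pi_star[u] - pi_star[v]) ** -rho)
    return 1 - truth if rng.random() < p_err else truth
```

Under this toy parameterization, adjacent pairs (rank distance 1) are always flipped, while pairs far apart in π* are answered correctly almost surely.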

In terms of query complexity, it turns out that our bound provides only a slight improvement on the divide-and-conquer active-learning algorithm for LRPP of Ailon (2012). Specifically, we improve the dependency on ε from ε^{−6} to ε^{−3}. Although our method provides only a minor improvement, it still defines the current state of the art for query-efficient LRPP; more importantly, it defines the first query-efficient LRPP algorithm that is applicable over an arbitrary set of permutations H ⊆ V!. We utilize this fact in the following section, instantiating ε-SRRAs for the set of permutations induced by hyperplanes in R^d.

4.5 LRPP over Linearly Induced Permutations in Constant-Dimensional Feature Space

A special class of interest is known as LRPP over linearly induced permutations in constant-dimensional feature space. We use the same definition of X as in Section 4.1, except that now each point v ∈ V is associated with a feature vector, which we denote using boldface: v ∈ R^d. The hypothesis class H now consists only of permutations π such that there exists a vector w_π ∈ R^d satisfying

π(u, v) = 1_{⟨w_π, u−v⟩>0} . (4.22)

(Here ⟨·, ·⟩ denotes the inner-product functional.) Observe that π ∈ H is indeed a permutation. First, π is pairwise consistent, 1_{⟨w_π,u−v⟩>0} = 1 − 1_{⟨w_π,v−u⟩>0}, because ⟨w_π, u − v⟩ = −⟨w_π, v − u⟩. Second, π is transitive: if π(u, v) = π(z, u) = 1 for some u, v, z ∈ V, meaning that ⟨w_π, u − v⟩ > 0 and ⟨w_π, z − u⟩ > 0, then from the linearity of the inner product we have ⟨w_π, z − v⟩ = ⟨w_π, (z − u) + (u − v)⟩ = ⟨w_π, z − u⟩ + ⟨w_π, u − v⟩ > 0.
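Equation (4.22) in code (a sketch of ours; features maps each element to its feature vector, and the helper names are not from the thesis):

```python
def induced_permutation(features, w):
    """pi(u, v) = 1 iff <w, u - v> > 0, i.e., iff u's linear score beats v's.
    A single scalar score per element makes transitivity automatic."""
    score = {v: sum(wi * xi for wi, xi in zip(w, x)) for v, x in features.items()}
    pi = lambda u, v: 1 if score[u] - score[v] > 0 else 0
    order = sorted(features, key=lambda v: -score[v])   # most preferred first
    rank = {v: i + 1 for i, v in enumerate(order)}
    return pi, rank

features = {'a': (1.0, 0.0), 'b': (0.0, 1.0), 'c': (1.0, 1.0)}
pi, rank = induced_permutation(features, (2.0, 1.0))    # scores: a=2, b=1, c=3
assert rank == {'c': 1, 'a': 2, 'b': 3}
assert pi('c', 'a') == pi('a', 'b') == pi('c', 'b') == 1   # transitive chain
```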

We will apply the powerful mathematical theory of geometric arrangements, which has been extensively studied in computational geometry (see, e.g., Chapter 8 of de Berg et al., 2008, for further details), whereby intersections of n hyperplanes in R^d induce a cell complex. Here we will examine the dual “space,” in which X is a set of hyperplanes in R^d and permutations in H are unique cells in the corresponding arrangement (cell subdivision is induced by intersections of these hyperplanes).

Each pair of alternatives (u, v) ∈ X is geometrically viewed as the halfspace H_{u,v} = {x : ⟨x, u − v⟩ > 0}, whose (closure) supporting hyperplane is h_{u,v} = {x : ⟨x, u − v⟩ = 0}. Let H denote the collection of these n(n − 1)/2 hyperplanes {h_{u,v} : (u, v) ∈ X}. Note that h_{u,v} = h_{v,u}, and thus it matches the unordered pair {u, v}. We will assume there exists an arbitrary indexing on V and will identify a pair with h_{u,v} iff u < v (see Remark 4.1). The collection H corresponds to the maximal-dimensional cells in the underlying arrangement A(H). Thus, from now on, we call A(H) the permutation arrangement and naturally identify full-dimensional cells with their induced permutations. We denote by C_π ⊆ R^d the unique cell corresponding to a permutation π ∈ H. Figure 4.3 depicts the elements of this construction when the embedding is in the plane R².

Figure 4.3: Arrangement and duality depicted. Assuming elements in V are arbitrarily indexed, the pair {u, v} with u < v defines a unique dual line h_{u,v} = {w : ⟨w, u − v⟩ = 0}. A point w ∈ R² lies above h_{u,v} iff ⟨w, u − v⟩ > 0. Observe that a cell in this arrangement, such as the depicted dark face, defines a permutation: indeed, all points w ∈ C_π lie on the same side of any line h_{u,v} (either above or below it).

Here we will use the disagreement-based SRRA construction defined in Section 3.2. It turns out that, due to the geometric structure, this SRRA construction provides non-trivial sample complexity guarantees, contrary to what we showed for the combinatorial setting of Section 4.1.

Bounding the VC dimension and disagreement coefficient. Using standard tools from combinatorial geometry, the VC dimension of H is at most d − 1. Roughly speaking, this property follows from the fact that in an arrangement of m hyperplanes in d-space, each of which meets the origin, the overall number of cells is at most O(m^{d−1}); see de Berg et al. (Chapter 8, 2008).

As for the uniform disagreement coefficient, we show below that it is bounded by O(n). Let π ∈ H be a permutation with a corresponding cell C_π in A(H). The ball B(π, r) is the geometric closure of the union of all cells corresponding to “realizable” permutations σ satisfying dist(σ, π) ≤ r. The corresponding disagreement region DIS(B(π, r)) corresponds to the set of ordered pairs (halfspaces) intersecting B(π, r). Next, we show:

Proposition 4.1 The measure of DIS(B(π, r)) in D_X is at most 8rn.


Proof By Diaconis and Graham (1977), Spearman's Footrule distance between any two permutations π and σ is at most twice N dist(π, σ), where N = n(n − 1). Hence, if dist(π, σ) is r, then any element u could only swap locations with a set of elements located up to 2rN positions away to the right or left. This yields a total of 4rN “swap-candidates” for each u. Thus, at most 4rNn inversions are possible. Note that each inversion corresponds to a hyperplane (unordered pair) that we cross; thus, the total number of ordered pairs is at most 8rNn. The probability measure of this set is at most 8rn, because we assign equal probability of N^{−1} to each possible pair in X. The result follows.

By Proposition 4.1, the uniform disagreement coefficient θ is always bounded by O(n). We now invoke Corollary 3.2 with µ = O(1/n^2) (which is tantamount to µ = 0 for this problem, because |X| = O(n^2) and we are using the uniform measure). We conclude:

Theorem 4.3 An ε-SRRA for LRPP in linearly induced permutations in d-dimensional feature space can be constructed with respect to any π ∈ H, with probability at least 1 − δ, using at most O(nd ε^{−2} log^2 n + n ε^{−2} (log n) log(δ^{−1} log n)) label queries.

Combining Theorem 4.3 and the iterative algorithm described in Corollary 3.1, we get the following:

Corollary 4.2 There exists an algorithm for obtaining a solution π ∈ H for LRPP in linearly induced permutations in d-dimensional feature space with erD(π) ≤ (1 + O(ε)) ν with total query complexity of

O(ε^{−2} nd log^3 n + n ε^{−2} (log^2 n) log(δ^{−1} log n)).    (4.23)

The algorithm succeeds with probability at least 1 − δ.

We compare this bound to that of Corollary 4.1. For the sake of comparison, assume δ = n^{−2}, so that (4.23) takes the simpler form of O(ε^{−2} nd log^3 n / log(1/ε)). This bound is better than that of Corollary 4.1 as long as the feature space dimension d is O(ε^{−2} log^2 n). For larger dimensions, Corollary 4.1 gives a better bound. It would be interesting to obtain a smoother interpolation between the geometric structure arising from the feature space and the combinatorial structure arising from the permutations. We refer the reader to Jamieson and Nowak (2011) for a recent result with improved query complexity under certain Bayesian noise assumptions.


4.6 Heuristics for Optimizing the SRRAs

So far we have neglected the optimization step of Algorithm 1, which seeks the minimizer of fi(σ) over H, where fi is an ε-SRRA with respect to the intermediate solution πi. This amounts to an NP-Hard optimization problem, minimum feedback arc-set in sparse digraphs. Thus, there is no hope of devising an efficient exact solver; instead, we replace this step with various heuristics discussed below. We will empirically experiment with these solvers in the next section.

Our first heuristic is a simple greedy hill-climbing solver that starts with a seed solution and repeatedly searches for a single-element move that improves the sparse cost. A single-element move is the result of removing a single alternative from the permutation and reinserting it into a different position. If such a move is found, it is performed; otherwise, the process is halted and the resulting permutation is returned as a local optimum. When applied as part of the optimization step of Algorithm 1, we take the seed permutation to be the previous SRRA iteration's solution. We refer to this solver as CLIMB.
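The single-element-move search above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the cost function here simply sums the weights of violated sampled preferences, and all names are our own.

```python
def pairwise_cost(perm, labeled_pairs):
    """Weighted disagreement of `perm` with a sparse sample of preferences.

    `labeled_pairs` maps an ordered pair (u, v) to a weight, meaning
    "u should precede v" with that weight.
    """
    pos = {x: i for i, x in enumerate(perm)}
    return sum(w for (u, v), w in labeled_pairs.items() if pos[u] > pos[v])

def climb(seed, labeled_pairs):
    """Greedy hill climbing over single-element moves (remove + reinsert)."""
    perm = list(seed)
    best = pairwise_cost(perm, labeled_pairs)
    improved = True
    while improved:
        improved = False
        for i in range(len(perm)):
            for j in range(len(perm)):
                if i == j:
                    continue
                cand = perm[:i] + perm[i + 1:]   # remove the element at i ...
                cand.insert(j, perm[i])          # ... reinsert at position j
                c = pairwise_cost(cand, labeled_pairs)
                if c < best:                     # first improving move wins
                    perm, best, improved = cand, c, True
                    break
            if improved:
                break
    return perm, best                            # a local optimum
```

For example, with the three consistent preferences 0 ≺ 1 ≺ 2 and the reversed seed [2, 1, 0], the solver reaches the zero-cost permutation [0, 1, 2].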

The second solver we considered is a weighted sparse variant of the SVMRank convex relaxation. We presented this solver in Example 3.1. Recall that such convex relaxation requires V to be embedded in Rd. Therefore, we matched each alternative u ∈ V with a (random) feature vector u ∈ Rd, where d ≥ 2n and each coordinate ui is chosen independently uniformly at random from the range (0, 2). Using standard arguments in computational geometry, our choice of d assures us that the entire set of n! permutations can almost surely be obtained by ordering the alternatives according to a linear score function u 7→ w · u. (Note that these random features are used just to relax a combinatorially hard optimization problem – they do not describe the elements in V in any way.) Recall that the SRRA construction of Algorithm 2 induces a weighted, sparse collection of labeled pairs. Accordingly, a weighted, sparse variant of SVMRank can be considered. Instead of considering all pairs u, v in the convex relaxed-cost function (3.7), we only consider those pairs that were sampled in the SRRA construction (Algorithm 2), with the corresponding weight. This approach has been described in detail in Ailon et al. (2012). We denote by COMBI-SVM the variant of Algorithm 1 in which Line 2 uses this relaxation to optimize for the next solution.
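A minimal sketch of the idea: minimize a hinge loss over only the sampled, weighted pairs, plus an l2 regularizer. We use plain subgradient descent for illustration; the actual COMBI-SVM solver, its objective constants, and its optimizer may differ, and all names below are our own.

```python
def svmrank_sparse(features, weighted_pairs, lam=0.01, lr=0.05, epochs=200):
    """Subgradient descent on a weighted, sparse SVMRank-style objective:

        (lam/2) * ||w||^2 + sum_{(u,v)} c_uv * max(0, 1 - <w, x_u - x_v>)

    Only the sampled pairs (u preferred over v, with weight c_uv)
    contribute, mirroring the sparse collection produced by the SRRA
    sampler.
    """
    d = len(next(iter(features.values())))
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wi for wi in w]                    # regularizer term
        for (u, v), c in weighted_pairs.items():
            diff = [a - b for a, b in zip(features[u], features[v])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1.0:                             # hinge is active:
                for i in range(d):                       # push w toward x_u - x_v
                    grad[i] -= c * diff[i]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def rank_by_score(features, w):
    """Order alternatives by the learned linear score, best first."""
    score = lambda u: sum(wi * xi for wi, xi in zip(w, features[u]))
    return sorted(features, key=score, reverse=True)
```

On a toy instance with three alternatives and the consistent sampled preferences 0 over 1, 1 over 2, 0 over 2, the learned scorer recovers the ordering [0, 1, 2].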

Our last suggested solver is a simple combination of the other two. First we obtain a solution using the convex relaxation defined by COMBI-SVM; then we use that solution as a warm start for CLIMB. Therefore, we denote this solver COMBI-SVM+CLIMB.

4.7 Empirical Proof of Concept

Due to the importance of the problem of ranking in many applications, it is imperative to test the theoretical guarantees described above. In this section,


we empirically examine our ideas on a few synthetic and real-world sets of data. Since datasets for the preference setting we examined are not, as far as we know, publicly available, we constructed our own benchmark data. The design and construction of such data is not trivial, mainly due to the effort necessary to define diverse problems, and due to the preferences' quadratic overhead. In view of these constraints, as well as the constraints on our resources, our data are limited in scale and size. Finally, for lack of competing algorithms, we compare our solutions with the basic random-sampling baseline. Thus, we refer to our tests as a “proof of concept.”

4.7.1 Synthetic Experiments

Datasets. We define nine synthetic problems. In each problem the set V contains n = 100 alternatives. Each alternative u ∈ V is embedded in Rd as a feature vector u, with d = 200. The features are i.i.d., each chosen uniformly at random from the range (0, 2).

In order to generate the “ground truth” labeling function Y : X → {0, 1}, we first chose a random coefficient vector w ∈ Rd such that ‖w‖2 = 1 and ordered the elements of V according to the value w · u. The underlying noiseless preference function Yperfect : X → {0, 1} is defined by

Yperfect(u, v) = 1 ⇐⇒ w · u > w · v .

(As expected, no ties were obtained.) We generated nine different noisy versions of Yperfect, using two different noise generation methods with varying parameters.

In the first method, for each u, v ∈ V, we flip the value of Yperfect(u, v) with probability β |pos(u) − pos(v)|^{−α}, where β > 0 and α > 0 are hyperparameters and pos(x) is the position of x ∈ V (“rank”) in the ordering induced by the perfect permutation. We call this model distance-decaying noise and denote the corresponding noisy labeling function by Ydist(β,α). We experimented with six different values of (β, α), namely (β, α) ∈ {(0.1, 1), (0.2, 1), (0.5, 0.5), (0.3, 2), (0.5, 2), (0.7, 2)}. The matrices Ydist(β,α) for the chosen values are depicted in Figure 4.4 (a)–(f). Note that the noise pattern for (β, α) = (0.5, 0.5) is very close to uniform noise.

The structured-confusions model, STRUCT, has a different noise structure. The inversions are condensed in a randomly located fixed-size rectangle, as seen in Figure 4.4 (g), (h), and (i). In order to make the noise levels of this and the above model comparable, the area of the error rectangle for YSTRUCT(γ) is the same as the amount of noise obtained in Ydist(β=γ,1).
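The generation of Yperfect and the distance-decaying noise can be sketched as follows. This is an illustrative reconstruction under our own naming; details such as how w is drawn and normalized may differ from the actual experiments.

```python
import random

def make_synthetic_preferences(n=100, d=200, beta=0.1, alpha=1.0, seed=0):
    """Sketch of the synthetic data generation: a random linear scorer
    defines Yperfect, then each unordered pair {u, v} is flipped with
    probability beta * |pos(u) - pos(v)|**(-alpha), i.e. the
    DIST(beta, alpha) distance-decaying noise model."""
    rng = random.Random(seed)
    feats = [[rng.uniform(0.0, 2.0) for _ in range(d)] for _ in range(n)]
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    score = [sum(wi * xi for wi, xi in zip(w, x)) for x in feats]
    order = sorted(range(n), key=lambda u: -score[u])
    pos = {u: i for i, u in enumerate(order)}      # rank of each alternative
    Y = {}
    for u in range(n):
        for v in range(u + 1, n):
            y = 1 if score[u] > score[v] else 0    # Yperfect(u, v)
            if rng.random() < beta * abs(pos[u] - pos[v]) ** (-alpha):
                y = 1 - y                          # distance-decaying flip
            Y[(u, v)], Y[(v, u)] = y, 1 - y        # keep antisymmetry
    return Y, pos
```

With beta = 0 no flips occur and the returned labels coincide with Yperfect.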

Results. We ran a version of Algorithm 1; at each iteration we generated a biased sample of labels (an SRRA) using Algorithm 2 with m = 1, which is the minimal possible value. Experimenting with other values of m yielded similar


(a) DIST(0.1,1) (b) DIST(0.2,1) (c) DIST(0.5,0.5)

(d) DIST(0.3,2) (e) DIST(0.5,2) (f) DIST(0.7,2)

(g) STRUCT(0.1) (h) STRUCT(0.2) (i) STRUCT(10)

Figure 4.4: Synthetic model-based preferences. The rows and columns are permuted according to the w-induced order; under the noiseless ranking Yperfect, the upper-right and lower-left triangles depicted here would have appeared as solid dark and solid light triangles, respectively.

results. We ran 15 iterations of the algorithm, measuring the error rate of the hypothesis obtained at each step. Note that we altered Algorithm 1 as follows: instead of taking the next hypothesis hi+1 to be the minimizer of fi(h′) over h′ ∈ H, we chose it to be the minimizer of f0(h′) + f1(h′) + · · · + fi(h′). In other words, we used all the samples that had been drawn up to that point before optimizing for the next hypothesis.

We computed the average of 10 such executions and plotted the average curve accompanied by standard deviation brackets. For comparison, we ran an experiment in which at each iteration, instead of generating a biased sample (using Algorithm 2), we generated a sample of the same size using uniformly-at-random drawn pairs. This allowed us to compare the active-learning scheme with standard (passive) learning. At each iteration of Algorithm 1, we used COMBI-SVM (but here, we did not need to draw random feature vectors – we used the same vectors drawn in the generation of Yperfect).


Figure 4.5: Learning curves (error rate vs. sample size) comparing RAND/COMBI-SVM with SRRA/COMBI-SVM on the synthetic datasets: (a) DIST(0.1, 1), (b) DIST(0.2, 1), (c) DIST(0.5, 0.5), (d) DIST(0.3, 2), (e) DIST(0.5, 2), (f) DIST(0.7, 2), (g) STRUCT(0.1), (h) STRUCT(0.2), (i) STRUCT(10). Results are averaged over 10 runs and accompanied by the standard error of the mean.

Figure 4.5 depicts the learning curves corresponding to the noise models studied. Observe that, as expected from the theory (see Section 4.4), the SRRA method is an obvious winner under low-noise conditions, while under high-noise conditions, the RAND curve representing the uniform sample (passive learning) is better.

An obvious criticism of our experiment is that for the data generated using the distance-proportional noise, the sampling of noise locations in Ydist(β,α) is suspiciously similar to the method by which the SRRA sampler (Algorithm 2) samples labels. First, note that this is true only when the hypothesis h serving as


argument to Algorithm 2 is very close to Yperfect; this is not the case, however, in the first iterations of Algorithm 1. Second, notice that the STRUCT noise model also exhibits better learning curves for small noise.

4.7.2 Real Data Experiments

The Food-Dish Datasets. We constructed three datasets based on a set of real-world alternatives and a full collection of preferences obtained by soliciting human responses via crowdsourcing services.

All three problems shared the same set V of 50 alternatives. Each alternative was a food dish. These natural, real-world objects were gathered from a recipe website.5 The goal was to order the alternatives based on the three label functions YCOST, YWINE, YDATE that were generated by soliciting responses to the following questions on pairs of alternatives:

(i) COST: Which dish costs more (assuming that the dishes appear on the menu of the same restaurant)?

(ii) WINE: Which dish goes better with red wine?

(iii) DATE: Which dish is more appropriate to order on a first date?

The datasets were labeled using Amazon's Mechanical Turk system for crowdsourcing.6 Mechanical Turk is a marketplace website where requesters, such as us, submit HITs (Human Intelligence Tasks) and workers respond to those HITs (see, e.g., Figure 4.6). A HIT may contain a simple question, the response to which can be used as a label. In our case, a HIT was created for each of the three preference questions above, for each of the 1225 possible pairs of alternatives from the ground set of 50 dish images. Each HIT was assigned to five independent workers.7 The value of YCOST, YWINE, and YDATE at (u, v) was taken to be the majority vote over the five corresponding responses.8
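The aggregation step is a plain majority vote, sketched below; names are illustrative.

```python
from collections import Counter

def aggregate_hit_responses(responses):
    """Majority vote over the (five) worker responses collected per pair.

    `responses` maps (u, v) to a list of 0/1 answers to "is u preferred
    over v?"; with an odd number of workers no ties can occur.
    """
    return {pair: Counter(votes).most_common(1)[0][0]
            for pair, votes in responses.items()}
```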

To decrease the heterogeneity of the responding population, we required that all workers be located in the US. To reduce dependencies between responses from the same worker, we prevented the same worker from responding to successive HITs sharing the same alternative.

Figure 4.7 depicts the responses YCOST, YWINE, and YDATE (after taking the majority for each pair). Observe that for COST, we obtain a cleaner preference matrix than for WINE, in the sense that the data are almost sortable. Similarly, the DATE data seem least sortable. This corresponds to our intuition that the cost criterion

5 http://www.epicurious.com/
6 https://www.mturk.com/mturk/welcome
7 Each worker has a unique Mechanical Turk ID. In addition, Amazon attempts to verify a one-to-one correspondence between users and worker IDs.
8 We could have equally well taken an average and used fractional preferences.


Both dishes are served at the same restaurant. Which one costs more?

Guidelines:

Select exactly a single dish: either Left -or- Right. Whenever both dishes suit the task to the same degree, you may choose either of the pair. Note: Please make sure you do not submit more than 100 HITs for this task. This is a restriction of our research, please respect it.

Feel free to add any valuable comment.

Left image    Right image

Please provide any comments you may have below, we appreciate your input!

Figure 4.6: Example of an Amazon Mechanical Turk HIT that provides a label for the COST dataset.

is the least subjective, and DATE is the most subjective of the three questions asked. Note our effort to re-order the rows and columns in the corresponding pictures, so that the top-right portion appears almost solid dark (corresponding to 1's), and the bottom-left almost solid bright (corresponding to 0's). We cannot do this perfectly because the problem is NP-Hard, so we used various heuristics and have no proof of optimality. This was done for illustration purposes only.

4.7.3 The Combinatorial Experiment

We performed an experiment similar to that performed with the synthetic data. Unlike the synthetic data case, however, we do not assume features for the data here; the entire structure is combinatorial, not geometric.

We compared the active-learning SRRA approach with the standard (passive)


(a) COST (b) WINE (c) DATE

Figure 4.7: Three crowdsourced preference matrices. A dark entry Wu,v indicates that the row alternative u is preferred over the column alternative v. A fully sortable, noiseless preference matrix would appear here as a solid dark upper triangle (the matrices are anti-symmetric).

learning in which labels are revealed for pairs chosen uniformly at random. In one experiment we used COMBI-SVM+CLIMB for both sampling techniques (depicted in the first row of Figure 4.8), and in the other we used CLIMB (depicted in the second row). One sees that active learning is better in most parts of the learning curve for COST (i.e., the left column). When using CLIMB, the advantage of SRRA is more significant. This is quite an interesting phenomenon, which tells us that the SRRA active-sampling technique helps avoid local minima in the greedy optimization heuristic. For WINE, SRRA seems better than random sampling that uses CLIMB in certain parts of the learning curve, and worse than random sampling that uses COMBI-SVM. For DATE, random sampling is better than active sampling with both solvers: this is not surprising given that the noise level there is the largest.

The Geometric Experiment

The main goal of this work was to test active sampling for the purpose of learning a combinatorial permutation. A more realistic scenario is one in which a practitioner tries to order alternatives using a set of attributes that reasonably describe these alternatives (i.e., features). Above, we used feature vectors (both for the synthetic and real-world data), but they were used to relax the combinatorial optimization difficulties, not to describe the objects.

Conveniently, the food-dish data also contains corresponding textual recipes. We generated bag-of-words features using stemming, stop-word elimination, and some standard contextual normalization of quantities and time expressions. Figure 4.9 depicts an example of a recipe text.
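A toy sketch of such a feature pipeline is given below. It is only illustrative: the stop-word list, the crude suffix-stripping stemmer, and the quantity normalization are our own stand-ins for whatever tools were actually used (e.g., a Porter stemmer and a full stop-word list).

```python
import re
from collections import Counter

# small illustrative stop-word list; a real system would use a fuller one
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "with", "for", "about"}

def crude_stem(token):
    """Very rough stand-in for a real stemmer (e.g., Porter's)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) > 2:
            return token[: -len(suffix)]
    return token

def bag_of_words(recipe_text):
    # quantities and times collapse to a single placeholder token
    text = re.sub(r"\d+(/\d+)?", " <num> ", recipe_text.lower())
    tokens = re.findall(r"<num>|[a-z]+", text)
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)
```

For instance, on the fragment "3 tablespoons olive oil; 1 tablespoon minced shallot", both quantities map to the `<num>` token and "tablespoons"/"tablespoon" map to the same stem.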

We again compared SRRA active learning with uniform sampling, using SVMRank with the bag-of-words features as a solver (SVM) in each iteration of Algorithm 1.

The comparison is shown in the third row of Figure 4.8. For COST (left column),


Figure 4.8: Comparing SRRA and random samplers (error rate vs. sample size). The first row corresponds to the COMBI-SVM+CLIMB solver, the second row to CLIMB, and the last to SVM. The first column corresponds to COST, the second to WINE, and the last to DATE. Each result is an average of 10 runs along with the standard error of the mean.

SRRA significantly beats random sampling after viewing at most 300 sample pairs, and thereafter performs comparably. For WINE (middle column), SRRA beats uniform random sampling after viewing at most 200 sample pairs, then falls behind. For DATE (right column), uniform random sampling is better throughout. Again, this fits the theoretical result, which predicts better performance for SRRA active learning in the case of low noise.

On the Improvements We Obtained. The biggest improvement is obtained when using CLIMB over COST (depicted in the leftmost graph in the second row of Figure 4.8). Here, the SRRA learning curve improves on the random sampling curve by 5.5% on average. In the first third of the learning phase, the advantage of SRRA is even greater: 8.3% on average; in the second third of the learning phase, SRRA beats random sampling by an average of 4.6%; and in the last


At a Glance

Main ingredients: Pear, Blue Cheese, Leafy Green, Steak; type: Quick & Easy.

Ingredients

3 tablespoons olive oil, divided; 1 tablespoon Sherry wine vinegar; 1 tablespoon minced shallot; 1 1/2 teaspoons honey; 2 8-to-9 ounce rib-eye steaks (each about 1 inch thick), trimmed; coarsely cracked black pepper; 2 cups finely sliced radicchio; 1 cup (packed) mixed baby greens; 1 small ripe Comice pear, quartered, cored, thinly sliced; 1/3 cup crumbled chilled blue cheese.

Preparation

Whisk 2 tablespoons olive oil, Sherry wine vinegar, minced shallot, and honey in large bowl. Season dressing to taste with salt and pepper. Sprinkle steaks generously with cracked black pepper and salt. Heat 1 tablespoon oil in heavy medium skillet over medium-high heat. Add steaks; cook to desired doneness, about 3 minutes per side for medium-rare. Transfer to plates. Add last 4 ingredients to dressing; toss. Mound salad alongside steaks and serve.

Figure 4.9: Example of a recipe text.

third of the learning phase, the advantage shrinks to 3.7% on average. Note, however, that these nice improvements diminish when comparing the curves we produced with the SVM and COMBI-SVM+CLIMB solvers. SRRA is better using the SVM solver over COST only at the beginning of the learning phase. However, observe that the learning phase here is completed after acquiring 500 labels. This amount of labels is twice n log n, which is the best we can wish for. When combining COMBI-SVM+CLIMB, the advantage of SRRA disappears. This can be explained by the fact that the warm start we get from applying COMBI-SVM provides a good guess that diminishes the advantage we observed when applying CLIMB with arbitrary guesses.

4.7.4 A Note on Related Algorithms and Datasets

There are only a few active-learning methods designed specifically for our setting, and none that provide provable guarantees. The divide-and-conquer algorithm of Ailon (2012) is, as far as we know, the only one that is analyzed under the same worst-case agnostic-noise setting as SRRA. Its core sampling technique and relaxation, however, are very similar to those of SRRA, and thus it is expected to behave similarly to SRRA. Another well-justified algorithm was recently given by Jamieson and Nowak (2011); however, they assume a different setting that does not lend itself to a comparison with our approach. Carterette et al. (2008) suggest a nice active-sampling heuristic for comparing the rankings of two search


engines, which is a setting that is related to, yet different from, ours. We tried to adapt the well-known uncertainty sampling (Tong and Koller, 2001; Schohn and Cohn, 2000; Campbell et al., 2000) to our setting over the three real datasets. The comparison did not indicate a clear winner (uncertainty sampling wins over WINE, loses over DATE, and draws over COST).

There are many public archives known as “learning to rank” datasets. To the best of our knowledge, the labels obtained in all of them are based on judgments of individual items, and not pairs. The exception is LETOR, where comparative labels are provided but, as far as we understand, are induced from individual judgments. It is challenging to design large-scale datasets using pairwise information for active learning because it is impossible to know in advance which set of alternatives the algorithms would want to access, and each set of alternatives gives rise to quadratically many possibilities. We leave for future work the construction of larger datasets, as well as a comparison with other heuristics.

4.7.5 A Note on Misconceptions and Beliefs

The lack of benchmarked empirical work in the context of ranking from pairwise preferences leaves room for various beliefs. This section examines a few of them.

Ranking by Preference Majority?

Ranking by preference majority is one of the simplest and most intuitive ranking heuristics: simply rank alternatives according to the number of pairs in which they have been preferred.9 We refer to this count as preference-wins.
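The preference-wins heuristic, and the Kendall-tau cost used to judge it below, can be sketched as follows (names are our own; W is the preference matrix as elsewhere in this chapter):

```python
import random

def preference_wins_ranking(W, alternatives, rng=None):
    """Rank alternatives by their preference-wins count.

    W[(u, v)] == 1 means u was preferred over v; ties in the win count
    are broken uniformly at random.
    """
    rng = rng or random.Random(0)
    wins = {u: sum(W.get((u, v), 0) for v in alternatives if v != u)
            for u in alternatives}
    return sorted(alternatives, key=lambda u: (wins[u], rng.random()),
                  reverse=True)

def kendall_tau_cost(ranking, W):
    """Number of stated preferences that `ranking` violates
    (most preferred alternative first)."""
    pos = {u: i for i, u in enumerate(ranking)}
    return sum(1 for (u, v), y in W.items() if y == 1 and pos[u] > pos[v])
```

On a cycle-free preference matrix this recovers the sorted order at zero cost; the point of this subsection is that on noisy data it can be clearly suboptimal.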

Figure 4.10 depicts two different views of the preferences defined by the COST dataset. In (a), the preference rows (and columns) are ordered according to the preference-wins; in (b), they are ordered by applying a local-improvement heuristic over the ranking of Figure 4.10(a).

Comparing the two rankings in terms of the Kendall-tau cost, the preference-majority ranking of Figure 4.10(a) entails a cost of 109, while the corresponding local-improvement ranking has a cost of only 78. This is a cost reduction of roughly 30%.

A similar phenomenon is observed in the other two datasets, indicating that the ranking by preference-majority rule is simply not the best choice.

Cycles of Preferences

The combinatorial hardness of the discussed ranking problem is due to the existence of (preference) cycles within the graph corresponding to W; without any cycles, this is merely a sorting problem. From a practitioner's perspective,

9 Ties are broken uniformly at random.


(a) Pref.-majority (b) Local-improve

Figure 4.10: Ranking by preference-majority versus local-improvement. A dark entry illustrates that the row alternative is preferred over that of the corresponding column. The REF ranking corresponds to the row indices, where the top and bottom rows are most and least preferred, respectively.

there is a common belief that natural sources inherently define such cycles, for example, due to human irrationality.

Thus, it is fair to ask whether our crowdsourced datasets contain cycles, and what their magnitude is. Table 4.1 provides an indication of the number of cycles in each of the natural-source datasets. The number of cycles is indicated by the number and size of the strongly connected components (scc's) in the graph. An scc with n vertices contains 2^n − (n + 1) cycles.
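The scc decomposition of a preference graph can be computed with any standard algorithm; below is a sketch using Kosaraju's two-pass DFS (an illustrative stand-in, not the tool actually used for Table 4.1):

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; `edges` holds directed pairs (u, v),
    here meaning "u was preferred over v"."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    order, seen = [], set()
    def dfs1(u):
        seen.add(u)
        for v in adj[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)            # record finishing order
    for u in nodes:
        if u not in seen:
            dfs1(u)
    components, assigned = [], set()
    def dfs2(u, comp):
        assigned.add(u)
        comp.append(u)
        for v in radj[u]:
            if v not in assigned:
                dfs2(v, comp)
    for u in reversed(order):      # second pass on the reversed graph
        if u not in assigned:
            comp = []
            dfs2(u, comp)
            components.append(comp)
    return components
```

For example, the preferences 1 over 2, 2 over 3, 3 over 1 (a cycle), and 3 over 4 yield one scc of size 3 and one of size 1.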

Observe the correlation between the number of cycles and the subjectivity magnitude of the preferences. The largest number of cycles occurs in the subjective dataset DATE, and the lowest number occurs in the absolute-value dataset ABS-VAL.

Table 4.1: The number of cycles in the different crowdsourced datasets. We present the number and size of strongly connected components (scc's).

Data set    #scc    Size of max scc

COST        2       43
WINE        6       45
DATE        1       50
ABS-VAL     50      1


4.8 Discussion

The main concern of this chapter was the query-efficient variant of learning to rank from pairwise preferences. We presented a specific SRRA construction that defines the state-of-the-art query complexity bounds for this problem (when the noise is low). Attempting to solve the problem with uniform sampling, or with any known method that uses disagreement-coefficient and VC-dimension arguments only, tends to fail (in the worst-case setting).

In addition to the theoretical results, we described several approaches for implementing our solution. We cannot hope for an exact solution for the ERM optimization step of our SRRA meta-algorithm (Algorithm 1). Thus, we devised several heuristic solvers for this optimization problem. We view the resulting implementations as relaxed versions of the (theoretically justified) SRRA constructions.

The lack of publicly available benchmark datasets designed for LRPP led us to generate such a benchmark. Our benchmark consisted of several synthetic and three real-world datasets. The design and construction of the real-world datasets were conducted via crowdsourcing. An empirical evaluation of our SRRA implementations over these datasets supports the theory that our method has advantages when the inherent noise is low. The advantage over the uniform-sampling baseline is marginal; nevertheless, it is significant. We assume that our results reflect the small magnitude of the datasets we used and the presence of constants that are hidden in the big-O notation in our analysis. In Chapter 6, we discuss a heuristic that integrates exploration and exploitation without implementing SRRA, yet resembling it in character. There, we show that a good “coverage” of the hypothesis space yields exceptional empirical results. Thus, this provides optimistic empirical support for our ideas overall.


Chapter 5

SRRA for Clustering with Side Information

Clustering with side information is a fairly new variant of clustering described independently by Demiriz et al. (1999) and Ben-Dor et al. (1999). In the machine-learning community it is also widely known as semi-supervised clustering (see, e.g., Basu, 2005). This variant defines a family of settings that add a certain form of supervision to the originally fully unsupervised clustering setting. The nature of the available feedback that provides the side information defines the specific setting.

The most natural and commonly discussed forms of supervision are single-item labels (see, e.g., Demiriz et al., 1999) and pairwise constraints (see, e.g., Ben-Dor et al., 1999). In the single-item-labels setting, the feedback matches items with those cluster indices to which they should (or should not) belong. While this form of supervision is fairly close to that provided in supervised learning, it differs here in that usually no statistical assumptions are made on the process that provides the labels (e.g., some cluster index labels may even be absent). The pairwise-constraint setting provides “locally” flavored feedback that states whether a pair of elements either “must” or “cannot” link together. We focus on this form of supervision.

In machine learning, there are two main approaches for utilizing pairwise side information. In the first approach, this information is used to fine-tune or to learn a distance function, which is then passed on to any standard clustering algorithm, such as k-means or k-medians (see, e.g., Klein et al., 2002; Xing et al., 2002; Cohn et al., 2000; Balcan et al., 2009b; Shamir and Tishby, 2011; Voevodski et al., 2012). The second approach, which is more related to our work, modifies the clustering algorithm’s objective to incorporate the pairwise constraints (see, e.g., Basu, 2005; Eriksson et al., 2011; Cesa-Bianchi et al., 2012). Basu (2005), whose thesis also serves as a comprehensive survey, championed this approach in conjunction with k-means and hidden Markov random field clustering algorithms.


Our work is closely related to a combinatorial-optimization theoretical setting known as correlation clustering. This setting deals with clustering of a finite set of elements. Full knowledge of all “must”- and “cannot”-link pairwise constraints is available (for free); however, similarities are noisy in the sense that they may induce non-transitivity noise (analogous to that which we discuss in Chapter 4). In other words, the optimal solution has a non-zero cost. The goal is to eradicate this noise.

Correlation clustering was defined by Bansal et al. (2004), and by Shamir et al. (2004) under the name cluster editing. The problem is NP-hard (see, e.g., Charikar et al., 2005). Constant-factor approximations are known for various minimization versions of this problem (Charikar and Wirth, 2004; Ailon et al., 2008). A PTAS is known for a minimization version in which the number of clusters is fixed to be k (Giotis and Guruswami, 2006), as in our setting.

Here, we consider a query-efficient variant of correlation clustering. Each pairwise information bit comes at a cost and must be treated frugally. The goal, then, is to clean up the non-transitivity noise while minimizing the information overhead.

From the theoretical standpoint of computational complexity, since the underlying combinatorial problem is NP-hard, a practitioner assuming worst-case scenarios would have to resort to heuristics in order to deal with the underlying optimization problems. The query-complexity standpoint, however, is far from understood. We believe that our results, though highly non-trivial, do not yet close the book on the question.

Our work isolates the use of information coming from pairwise clustering constraints and separates it from the geometry of the problem. In future work, it would be interesting to analyze our framework in conjunction with the geometric structure of the input. Interestingly, Eriksson et al. (2011) study active learning for clustering using the geometric input structure. Unlike our setting, though, they assume either no noise or parametric noise.

Our problem’s instance space can be thought of as the edges of a graph (for other work on active learning over graph structures, see, e.g., Cesa-Bianchi et al. (2012)). The main difference between our settings is that Cesa-Bianchi et al. assume that the graph structure is given in advance (yet the edge labels are hidden). We work over full graphs and thus cannot utilize structural information.

We are aware of no previous work that proves guarantees of the type we achieve in our setting.

The chapter is organized as follows. We start by defining the active-learning problem. We then explain why disagreement-based methods, such as the one discussed in Section 3.2.1, fail to provide meaningful guarantees. We then describe a specific SRRA construction and analyze its guarantees. The chapter concludes with a discussion of a hierarchical extension of the problem.


5.1 Problem Definition

We define the problem of pool-based active clustering with side information using the following notation. Let V be a set of points of size n, and let k be a given number. The instance space is the set of distinct ordered pairs of elements X = (V × V) \ {(u, u) : u ∈ V}; as before, we assume the labeling function is binary and deterministic. The label Y((u, v)) = 1 means that u and v should be clustered together, and Y((u, v)) = 0 means the opposite. Assume that Y((u, v)) = Y((v, u)) for all u, v. (Equivalently, assume that X contains only unordered distinct pairs without any constraint on Y. For notational convenience we preferred to define X as the set of ordered distinct pairs.)

The hypothesis class H is the set of equivalence relations over V with at most k equivalence classes. More precisely, every h ∈ H is identified with a disjoint cover V_1, ..., V_k of V (some V_i’s possibly empty), with h((u, v)) = 1 if and only if u, v ∈ V_j for some j. As usual, Y may induce a non-transitive relation (e.g., we could have Y((u, v)) = Y((v, z)) = 1 and Y((u, z)) = 0). In what follows, we will drop the double parentheses. Also, we will abuse notation by viewing h as both an equivalence relation and a disjoint cover V_1, ..., V_k of V. We take D to be the uniform measure on X. The error of h ∈ H is given as

$$\mathrm{er}_D(h) = N^{-1} \sum_{(u,v)\in\mathcal{X}} \mathbf{1}_{h(u,v)\neq Y(u,v)}\,,$$

where N = |X| = n(n − 1). Recall that ν denotes the hypothesis class noise rate inf_{h∈H} er_D(h). We define cost_{u,v}(h) to be the contribution N^{-1} 1_{h(u,v)≠Y(u,v)} of (u, v) ∈ X to er_D. The distance dist(h, h′) is given as

$$\mathrm{dist}(h,h') = N^{-1} \sum_{(u,v)\in\mathcal{X}} \mathbf{1}_{h(u,v)\neq h'(u,v)}\,.$$
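To make these definitions concrete, the following sketch (illustrative code, not from the thesis) computes er_D(h) and dist(h, h′) for clusterings represented as arrays of cluster indices:

```python
def pairs(n):
    """The instance space X: all ordered distinct pairs (u, v)."""
    return [(u, v) for u in range(n) for v in range(n) if u != v]

def h_label(assign, u, v):
    """h(u, v) = 1 iff u and v share a cluster index under `assign`."""
    return 1 if assign[u] == assign[v] else 0

def er(assign, Y, n):
    """er_D(h) = N^{-1} * (number of pairs on which h disagrees with Y)."""
    X = pairs(n)
    return sum(h_label(assign, u, v) != Y[u, v] for u, v in X) / len(X)

def dist(a1, a2, n):
    """dist(h, h') = N^{-1} * (number of pairs on which h and h' disagree)."""
    X = pairs(n)
    return sum(h_label(a1, u, v) != h_label(a2, u, v) for u, v in X) / len(X)

# Example: n = 5, k = 2; the labels agree with the clustering {0,1,2},{3,4}
# except for one noisy "cannot link" constraint on the pair {0, 1}.
n = 5
h = [0, 0, 0, 1, 1]
Y = {(u, v): h_label(h, u, v) for u, v in pairs(n)}
Y[0, 1] = Y[1, 0] = 0  # non-transitivity noise: the optimum has non-zero cost
print(er(h, Y, n))     # 2/20 = 0.1 (both ordered copies of the pair disagree)
print(dist(h, h, n))   # 0.0
```

Note that with ordered pairs, a single noisy unordered constraint contributes twice to the error.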

Figure 5.1: Depicting the problem as a graph (V, Y). A “must link” label Y(u, v) = 1 is drawn as a thick edge (marked on both sides), and a “cannot link” label as a dashed edge. A solution h ∈ H for the case k = 2 is depicted as the disjoint cover of V corresponding to the dark squares.

Sometimes it is convenient to examine the problem by using the graph G = (V, Y). Here we think of the elements of V as nodes and consider an edge (u, v) iff Y(u, v) = 1. (Another possibility is to define G as a full edge-weighted graph.) The error of h, a disjoint cover of V, is the sum of edges that cross partition subsets plus the sum of absent edges within each subset. For example, consider the graph and clustering h depicted in Figure 5.1. The cost of this clustering is er_D(h) = 1/N = 1/20. Observe that any solution incurs a non-zero error; thus, the depicted partition is optimal.

The active learner is given V, k, and oracle access to Y(·). Recall that the corresponding correlation-clustering PTAS is not query efficient: it requires knowledge of Y on all of X. Here, the learner begins only with the knowledge of V and has to pay in order for the labels Y to be revealed. What we are interested in is the ERM aspect of the problem, that is, the query complexity required for achieving low error. As before, we study the query complexity required for achieving a (1 + ε) approximation of the noise rate ν. From a learning-theoretic perspective, we want to find the best k-clustering that explains V, using as few queries as possible into X.

5.2 Passive Learning Is Not Useful

It is not difficult to find examples for which sampling pairwise constraints uniformly at random will fail to provide meaningful results. First, notice that VC bounds are useless in our context. We note that the VC dimension of H is Θ(n), as we prove in the next section. Using arguments similar to those of Section 4.2, VC theory tells us that to achieve an additive error of ε with a probability of at least 1 − exp(−n), we would have to sample O(ε^{-2} n log n) pairs uniformly at random with repetitions (and optimize the empirical estimation over H). However, an additive error of ε means that the estimated clustering disagrees with the optimal solution on εN constraints. Thus, an item is on average either separated from or in conflict with Θ(εn) more elements than in the optimal clustering. In practice, we would like to keep this error constant and would have to require an additive error of ε/n. In such a case, the corresponding VC sample bound turns out to be larger than N.

As a second example, consider the case in which V = V′ ∪ {u}: the set V′ can be perfectly k-clustered, and u must link only with a single element v′ ∈ V′ (i.e., Y(u, v′) = 1 and Y(u, v) = 0 for all v ∈ V′ \ {v′}). Clearly, ν = N^{-1} here. The only way we can ensure a (1 + ε)ν competitive solution is to establish that the sample contains (u, v′). The probability that we will “hit” this edge when sampling uniformly at random (e.g., with repetitions) with sample size o(N) is exponentially low, as can be shown using, e.g., the Chernoff bound described in Appendix A.
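As a back-of-the-envelope illustration of this second example (the numbers below are illustrative), the chance that a uniform sample of size much smaller than N contains the single informative pair is easy to estimate directly:

```python
# With uniform sampling with repetitions from the N = n(n-1) ordered pairs,
# the probability that all m samples miss the two ordered copies of the
# informative pair (u, v') is (1 - 2/N)^m, roughly exp(-2m/N).
def miss_probability(n, m):
    N = n * (n - 1)
    return (1.0 - 2.0 / N) ** m

n = 1000
N = n * (n - 1)
m = N // 100  # a sample of size N/100, i.e. much smaller than N
print(miss_probability(n, m))  # roughly exp(-0.02), about 0.98: the pair is almost surely missed
```

As m/N tends to zero, the miss probability tends to 1, so any o(N)-sized uniform sample almost surely fails to reveal the one constraint that matters.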


5.3 Disagreement Coefficient Arguments Are Not Sufficient for Effective Active Learning

Let us try to solve the problem using only disagreement coefficient (θ) and VC bound (d) arguments, by applying, for example, the SRRA method with the disagreement-SRRA construction suggested in Section 3.2.1.

We argue that the uniform disagreement coefficient of H is Ω(n). Pick any h ∈ H with corresponding partitioning V_1, ..., V_k. Consider the partition obtained by moving an element u ∈ V from its current part V_j to some other part V_{j′} for j′ ≠ j. In other words, consider the clustering h′ ∈ H given by {V_{j′} ∪ {u}, V_j \ {u}} ∪ ⋃_{i∉{j,j′}} {V_i}. Observe that dist(h, h′) = O(1/max_i |V_i|), which for a fixed k = o(n) is O(1/n). On the other hand, for any v ∈ V and for any u ∈ V, there is a choice of j′ such that h and the h′ obtained as above disagree on (u, v) ∈ X. Hence P_{D_X}[DIS(B(h, O(1/n)))] = 1, and taking r = O(1/n) in θ = sup_{h∈H} sup_{r>0} (P_{D_X}[DIS(B(h, r))]/r) yields θ = Ω(n).

It is also apparent that the VC dimension of H is Θ(n). Assume without loss of generality that n is even and fix an arbitrary indexing V = (u_1, u_2, u_3, ..., u_n). Clearly, the set {(u_1, u_2), (u_3, u_4), ..., (u_{n−1}, u_n)} is shattered by H (as long as k ≥ 2, of course).¹ On the other hand, for any set S ⊆ X of size n there exists a labeling Y(·) that defines an undirected cycle on the elements of V. Clearly, the edges of a cycle cannot be shattered by functions in H because if h(u_1, u_2) = h(u_2, u_3) = ··· = h(u_{ℓ−1}, u_ℓ) = 1 for h ∈ H, then also h(u_1, u_ℓ) = 1.

Using Corollary 3.2, we conclude that we need Ω(n²) preference labels to obtain an (ε, µ)-SRRA for any meaningful pair (ε, µ). This is useless, because the cardinality of X is O(n²). Similarly to the discussion at the end of Section 4.3, this can be improved using Remark 2.1 to Ω(nν^{-1}), which tends to be quadratic in n as ν becomes smaller. Next, we show how to construct more useful SRRAs for the problem, for arbitrarily small ν.
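The shattering claim for perfect matchings can be checked by brute force for small n; the following sketch (illustrative) verifies it for n = 4, k = 2:

```python
from itertools import product

# For n = 4 and k = 2, check that the perfect matching {(0,1), (2,3)} is
# shattered: every one of the 2^2 labelings of these two pairs is realized
# by some 2-clustering h, where h(u, v) = 1 iff u and v share a cluster.
n, k = 4, 2
matching = [(0, 1), (2, 3)]

realized = set()
for assign in product(range(k), repeat=n):  # all assignments into <= 2 clusters
    realized.add(tuple(int(assign[u] == assign[v]) for u, v in matching))

print(sorted(realized))  # [(0, 0), (0, 1), (1, 0), (1, 1)] -- all four labelings
```

The same brute-force check on the edges of a cycle would show a missing labeling, in line with the transitivity argument above.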

5.4 Better SRRA for Semi-Supervised k-Clustering

Fix h ∈ H, with h = {V_1, ..., V_k} (we allow empty V_i subsets). Order the V_i subsets with respect to their sizes so that |V_1| ≥ |V_2| ≥ ··· ≥ |V_k|. We construct an ε-SRRA with respect to h as follows. For each cluster V_i ∈ h and for each element u ∈ V_i we draw k − i + 1 independent samples S_{ui}, S_{u(i+1)}, ..., S_{uk} as follows. Each sample S_{uj} is a subset of V_j of size q (to be defined below), chosen uniformly with repetitions from V_j, where

$$q = c_2 \max\{\varepsilon^{-2}k^2,\ \varepsilon^{-3}k\}\log n \tag{5.1}$$

for some global c_2 > 0.

¹Notice that this set is a perfect matching: any vertex is incident to exactly one edge within the set. Also observe that H shatters any perfect matching.

Note that the collection of pairs {(u, v) ∈ X : v ∈ S_{ui} for some i} is, roughly speaking, biased in such a way that pairs containing elements in smaller clusters (with respect to h) are more likely to be selected. We define our estimator f to be, for any h′ ∈ H,

$$f(h') = \sum_{i=1}^{k}\frac{|V_i|}{q}\sum_{u\in V_i}\sum_{v\in S_{ui}} f_{u,v}(h') \;+\; 2\sum_{i=1}^{k}\sum_{u\in V_i}\sum_{j=i+1}^{k}\frac{|V_j|}{q}\sum_{v\in S_{uj}} f_{u,v}(h')\,, \tag{5.2}$$

where f_{u,v}(h′) = cost_{u,v}(h′) − cost_{u,v}(h) and cost_{u,v}(h) = N^{-1} 1_{h(u,v)≠Y(u,v)}. Note that the summations over S_{ui} above take into account the multiplicity of elements in the multiset S_{ui}.

Theorem 5.1  With probability at least 1 − n^{-3}, the function f is an ε-SRRA with respect to h.

5.4.1 Setup

Consider another k-clustering h′ ∈ H with corresponding partitioning V′_1, ..., V′_k of V. We can write dist(h, h′) as

$$\mathrm{dist}(h,h') = \sum_{(u,v)\in\mathcal{X}} \mathrm{dist}_{u,v}(h,h')\,,$$

where dist_{u,v}(h, h′) = N^{-1}(1_{h′(u,v)=1} 1_{h(u,v)=0} + 1_{h(u,v)=1} 1_{h′(u,v)=0}).

Let n_i denote |V_i| and recall that n_1 ≥ n_2 ≥ ··· ≥ n_k. In what follows, we remove the subscript in reg_h and rename it reg (h is held fixed). The function reg(h′) will now be written as

$$\mathrm{reg}(h') = \sum_{i=1}^{k}\sum_{u\in V_i}\left(\sum_{v\in V_i\setminus\{u\}} \mathrm{reg}_{u,v}(h') + 2\sum_{j=i+1}^{k}\sum_{v\in V_j} \mathrm{reg}_{u,v}(h')\right), \tag{5.3}$$

where reg_{u,v}(h′) = cost_{u,v}(h′) − cost_{u,v}(h).

Clearly, for each h′, it holds that f(h′) from (5.2) is an unbiased estimator of reg(h′). We now analyze its error. For each i, j ∈ [k], let V_{ij} denote V_i ∩ V′_j. This captures exactly the set of elements in the ith cluster in h and the jth cluster in h′. The distance dist(h, h′) can be written as follows:

$$\mathrm{dist}(h,h') = N^{-1}\left(\sum_{i=1}^{k}\sum_{j=1}^{k} |V_{ij}\times(V_i\setminus V_{ij})| + 2\sum_{j=1}^{k}\sum_{1\le i_1<i_2\le k} |V_{i_1 j}\times V_{i_2 j}|\right). \tag{5.4}$$


We call each Cartesian set product in (5.4) a distance-contributing rectangle. Note that unless a pair (u, v) appears in one of the distance-contributing rectangles, we have reg_{u,v}(h′) = f_{u,v}(h′) = 0. Hence, we can decompose reg(h′) and f(h′) in correspondence with the distance-contributing rectangles, as follows:

$$\mathrm{reg}(h') = \sum_{i=1}^{k}\sum_{j=1}^{k} G_{i,j}(h') + 2\sum_{j=1}^{k}\sum_{1\le i_1<i_2\le k} G_{i_1,i_2,j}(h') \tag{5.5}$$

$$f(h') = \sum_{i=1}^{k}\sum_{j=1}^{k} F_{i,j}(h') + 2\sum_{j=1}^{k}\sum_{1\le i_1<i_2\le k} F_{i_1,i_2,j}(h') \tag{5.6}$$

where

$$G_{i,j}(h') = \sum_{u\in V_{ij}}\ \sum_{v\in V_i\setminus V_{ij}} \mathrm{reg}_{u,v}(h') \tag{5.7}$$

$$F_{i,j}(h') = \frac{|V_i|}{q}\sum_{u\in V_{ij}}\ \sum_{v\in (V_i\setminus V_{ij})\cap S_{ui}} f_{u,v}(h') \tag{5.8}$$

$$G_{i_1,i_2,j}(h') = \sum_{u\in V_{i_1 j}}\ \sum_{v\in V_{i_2 j}} \mathrm{reg}_{u,v}(h') \tag{5.9}$$

$$F_{i_1,i_2,j}(h') = \frac{|V_{i_2}|}{q}\sum_{u\in V_{i_1 j}}\ \sum_{v\in V_{i_2 j}\cap S_{u i_2}} f_{u,v}(h') \tag{5.10}$$

(Note that the S_{ui} are multisets, and the inner sums in (5.8) and (5.10) may count elements multiple times.)

5.4.2 Simple Case

Lemma 5.1  With probability at least 1 − n^{-3}, the following holds simultaneously for all h′ ∈ H and all i, j ∈ [k]:

$$|G_{i,j}(h') - F_{i,j}(h')| \le \varepsilon N^{-1}\,|V_{ij}\times(V_i\setminus V_{ij})|\,. \tag{5.11}$$

Proof  The predicate (5.11) (for a given i, j) depends only on the set V_{ij} = V_i ∩ V′_j. Given a subset B ⊆ V_i, we say that h′ (i, j)-realizes B if V_{ij} = B.

Now fix i, j, and B ⊆ V_i. Assume h′ (i, j)-realizes B. Let β = |B| and γ = |V_i|. Consider the random variable F_{i,j}(h′). Think of the sample S_{ui} as a sequence S_{ui}(1), ..., S_{ui}(q), where each S_{ui}(s) is chosen uniformly at random from V_i for s = 1, ..., q. We can now rewrite F_{i,j}(h′) as follows:

$$F_{i,j}(h') = \frac{\gamma}{q}\sum_{u\in B}\sum_{s=1}^{q} Z\big(S_{ui}(s)\big)\,,$$

where

$$Z(v) = \begin{cases} f_{u,v}(h') & v\in V_i\setminus V_{ij}\\ 0 & \text{otherwise.}\end{cases}$$

For all s = 1, ..., q the random variable Z(S_{ui}(s)) is bounded by 2N^{-1} almost surely, and its moments satisfy:

$$\mathbb{E}\big[Z(S_{ui}(s))\big] = \frac{1}{\gamma}\sum_{v\in V_i\setminus V_{ij}} f_{u,v}(h')\,,\qquad \mathbb{E}\big[Z(S_{ui}(s))^2\big] \le \frac{4N^{-2}(\gamma-\beta)}{\gamma}\,. \tag{5.12}$$

We are now ready to apply the Bernstein inequality to the normalized variables (γ/q)Z(v), whose almost-sure bound is M = 2N^{-1}γ/q. We have

$$\begin{aligned}
\mathbb{P}\big[|F_{i,j}(h')-G_{i,j}(h')|\ge t\big]
&\le \exp\left(\frac{-t^2/2}{\sum_{s=1}^{q}\sum_{u\in V_{ij}}\frac{\gamma^2}{q^2}\Big(\frac{4N^{-2}(\gamma-\beta)}{\gamma}-\big(\frac{1}{\gamma}\sum_{v\in V_i\setminus V_{ij}} f_{u,v}(h')\big)^2\Big) + \frac{Mt}{3}}\right)\\
&\le \exp\left(\frac{-t^2/2}{\sum_{s=1}^{q}\sum_{u\in V_{ij}}\frac{\gamma^2}{q^2}\cdot\frac{4N^{-2}(\gamma-\beta)}{\gamma} + \frac{Mt}{3}}\right)\\
&\le \exp\left(\frac{-t^2/2}{\frac{4N^{-2}\beta(\gamma-\beta)\gamma}{q} + \frac{Mt}{3}}\right)\\
&\le \exp\left(\frac{-t^2/2}{\frac{4N^{-2}\beta(\gamma-\beta)\gamma}{q} + \frac{2N^{-1}\gamma t}{3q}}\right).
\end{aligned}$$

From this we conclude that for any t ≤ 6N^{-1}β(γ − β),

$$\mathbb{P}\big[|F_{i,j}(h')-G_{i,j}(h')|\ge t\big] \le \exp\left(-\frac{qt^2}{16\gamma\beta(\gamma-\beta)N^{-2}}\right).$$

Plugging in t = εN^{-1}β(γ − β), we conclude

$$\mathbb{P}\big[|F_{i,j}(h')-G_{i,j}(h')|\ge \varepsilon N^{-1}\beta(\gamma-\beta)\big] \le \exp\left(-\frac{q\varepsilon^2\beta(\gamma-\beta)}{16\gamma}\right).$$

Now note that the number of possible sets B ⊆ V_i of size β is at most n^{min{β, γ−β}}. Using the union bound and recalling our choice of q, the lemma follows.


5.4.3 More Involved Case

Proving the following is more involved.

Lemma 5.2  With probability at least 1 − n^{-3}, the following holds uniformly for all h′ ∈ H and for all i_1, i_2, j ∈ [k] with i_1 < i_2:

$$|F_{i_1,i_2,j}(h') - G_{i_1,i_2,j}(h')| \le \varepsilon N^{-1}\max\left\{|V_{i_1 j}\times V_{i_2 j}|,\ \frac{|V_{i_1 j}\times(V_{i_1}\setminus V_{i_1 j})|}{k},\ \frac{|V_{i_2 j}\times(V_{i_2}\setminus V_{i_2 j})|}{k}\right\}. \tag{5.13}$$

Proof  The predicate (5.13) (for a given i_1, i_2, j) depends only on the sets V_{i_1 j} = V_{i_1} ∩ V′_j and V_{i_2 j} = V_{i_2} ∩ V′_j. Given subsets B_1 ⊆ V_{i_1} and B_2 ⊆ V_{i_2}, we say that h′ (i_1, i_2, j)-realizes (B_1, B_2) if V_{i_1 j} = B_1 and V_{i_2 j} = B_2.

We now fix i_1 < i_2, j and B_1 ⊆ V_{i_1}, B_2 ⊆ V_{i_2}. Assume h′ (i_1, i_2, j)-realizes (B_1, B_2). For brevity, denote β_ι = |B_ι| and γ_ι = |V_{i_ι}| for ι = 1, 2. Using the Bernstein inequality as in Lemma 5.1, we conclude the following two inequalities:

$$\mathbb{P}\big[|G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')| > t\big] \le \exp\left(-\frac{c_3 t^2 q}{\beta_1\beta_2\gamma_2 N^{-2}}\right) \tag{5.14}$$

for any t in the range [0, N^{-1}β_1β_2] and some global c_3 > 0. For t in the range (N^{-1}β_1β_2, ∞) and some global c_4 we have

$$\mathbb{P}\big[|G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')| > t\big] \le \exp\left(-\frac{c_4 t q}{\gamma_2 N^{-1}}\right). \tag{5.15}$$

Consider the following three cases.

1. β_1β_2 ≥ max{β_1(γ_1 − β_1)/k, β_2(γ_2 − β_2)/k}. Hence β_1 ≥ (γ_2 − β_2)/k and β_2 ≥ (γ_1 − β_1)/k. In this case, we can plug t = εN^{-1}β_1β_2 into (5.14) to get

$$\mathbb{P}\big[|G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')| > \varepsilon N^{-1}\beta_1\beta_2\big] \le \exp\left(-\frac{c_3\varepsilon^2\beta_1\beta_2 q}{\gamma_2}\right). \tag{5.16}$$

Consider two subcases. (i) If β_2 ≥ γ_2/2, then the RHS of (5.16) is at most exp(−c_3ε²β_1q/2). The number of possible subsets B_1, B_2 of sizes β_1, β_2, respectively, is clearly at most n^{β_1+(γ_2−β_2)} ≤ n^{β_1+kβ_1}. Therefore, as long as q = O(ε^{-2}k² log n), with probability at least 1 − n^{-6} this case is taken care of in the following sense: simultaneously for all j, i_1 < i_2, all possible β_1 ≤ γ_1 = |V_{i_1}|, β_2 ≤ γ_2 = |V_{i_2}| satisfying the assumptions, for all B_1 ⊆ V_{i_1}, B_2 ⊆ V_{i_2} of sizes β_1, β_2, respectively, and for all h′ (i_1, i_2, j)-realizing (B_1, B_2), we have |G_{i_1,i_2,j}(h′) − F_{i_1,i_2,j}(h′)| ≤ εN^{-1}β_1β_2.

(ii) If β_2 < γ_2/2, then by our assumption β_1 ≥ (γ_2 − β_2)/k ≥ γ_2/(2k). Hence the RHS of (5.16) is at most exp(−c_3ε²β_2q/(2k)). The number of sets B_1, B_2 of sizes β_1, β_2, respectively, is clearly at most n^{(γ_1−β_1)+β_2} ≤ n^{β_2(1+k)}. Therefore, as long as q = O(ε^{-2}k² log n), with probability at least 1 − n^{-6} this case is taken care of in the same sense as above.

The requirement q = O(ε^{-2}k² log n) is satisfied by our choice, Equation (5.1).

2. β_2(γ_2 − β_2)/k ≥ max{β_1β_2, β_1(γ_1 − β_1)/k}. We consider two subcases.

(a) εβ_2(γ_2 − β_2)/k ≤ β_1β_2. Using (5.14) with t = εN^{-1}β_2(γ_2 − β_2)/k, we get

$$\mathbb{P}\big[|G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')| > \varepsilon N^{-1}\beta_2(\gamma_2-\beta_2)/k\big] \le \exp\left(-\frac{c_3\varepsilon^2\beta_2(\gamma_2-\beta_2)^2 q}{k^2\beta_1\gamma_2}\right). \tag{5.17}$$

Again, consider two subcases. (i) β_2 ≤ γ_2/2. In this case we conclude from Equation (5.17)

$$\mathbb{P}\big[|G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')| > \varepsilon N^{-1}\beta_2(\gamma_2-\beta_2)/k\big] \le \exp\left(-\frac{c_3\varepsilon^2\beta_2\gamma_2 q}{4k^2\beta_1}\right). \tag{5.18}$$

Now, note that by our assumption

$$\beta_1 \le (\gamma_2-\beta_2)/k \le \gamma_2/k \le \gamma_1/k\,, \tag{5.19}$$

the last inequality by virtue of our assumption γ_1 ≥ γ_2. Also by assumption,

$$\beta_1 \le \beta_2(\gamma_2-\beta_2)/(\gamma_1-\beta_1) \le \beta_2\gamma_2/(\gamma_1-\beta_1)\,. \tag{5.20}$$

Plugging (5.19) into the RHS of (5.20), we conclude that β_1 ≤ β_2γ_2/(γ_1(1 − 1/k)) ≤ 2β_2γ_2/γ_1 ≤ 2β_2. From here we conclude that the RHS of (5.18) is at most exp(−c_3ε²γ_2q/(8k²)). The number of sets B_1, B_2 of sizes β_1, β_2, respectively, is clearly at most n^{β_1+β_2} ≤ n^{2β_2+β_2} ≤ n^{3γ_2}. Hence, as long as q = O(ε^{-2}k² log n) (satisfied by our assumption), with probability at least 1 − n^{-6}, simultaneously for all j, i_1 < i_2, all possible β_1 ≤ γ_1 = |V_{i_1}|, β_2 ≤ γ_2 = |V_{i_2}| satisfying the assumptions, for all B_1 ⊆ V_{i_1}, B_2 ⊆ V_{i_2} of sizes β_1, β_2, respectively, and for all h′ (i_1, i_2, j)-realizing (B_1, B_2), we have |G_{i_1,i_2,j}(h′) − F_{i_1,i_2,j}(h′)| ≤ εN^{-1}β_2(γ_2 − β_2)/k.

In the second subcase, (ii) β_2 > γ_2/2, the RHS of (5.17) is at most exp(−c_3ε²(γ_2 − β_2)²q/(2k²β_1)). By our assumption, (γ_2 − β_2)/(kβ_1) ≥ 1; hence, this is at most exp(−c_3ε²(γ_2 − β_2)q/(2k)). The number of sets B_1, B_2 of sizes β_1, β_2, respectively, is clearly at most n^{β_1+(γ_2−β_2)} ≤ n^{(γ_2−β_2)/k+(γ_2−β_2)} ≤ n^{2(γ_2−β_2)}. Therefore, as long as q = O(ε^{-2}k log n) (satisfied by our assumption), with probability at least 1 − n^{-6}, using a similar counting and union-bound argument as above, this case is taken care of in the sense that |G_{i_1,i_2,j}(h′) − F_{i_1,i_2,j}(h′)| ≤ εN^{-1}β_2(γ_2 − β_2)/k.

(b) εβ_2(γ_2 − β_2)/k > β_1β_2. We now use (5.15) to conclude

$$\mathbb{P}\big[|G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')| > \varepsilon N^{-1}\beta_2(\gamma_2-\beta_2)/k\big] \le \exp\left(-\frac{c_4\varepsilon\beta_2(\gamma_2-\beta_2) q}{k\gamma_2}\right). \tag{5.21}$$

Again, we consider the cases (i) β_2 ≤ γ_2/2 and (ii) β_2 > γ_2/2 as above. In (i), we get that the RHS of (5.21) is at most exp(−c_4εβ_2q/(2k)). Now notice that by our assumptions,

$$\beta_1 < \varepsilon(\gamma_2-\beta_2)/k \le \gamma_2/2 \le \gamma_1/2\,. \tag{5.22}$$

Also by our assumptions, β_1 < β_2(γ_2 − β_2)/(γ_1 − β_1), which by (5.22) is at most 2β_2γ_2/γ_1 ≤ 2β_2. Hence, the number of possibilities for B_1, B_2 is at most n^{β_1+β_2} ≤ n^{3β_2}. In (ii), we get that the RHS of (5.21) is at most exp(−c_4ε(γ_2 − β_2)q/(2k)), and the number of possibilities for B_1, B_2 is at most n^{β_1+(γ_2−β_2)}, which is bounded by n^{2(γ_2−β_2)}, following from our assumptions. For both (i) and (ii), taking q = O(ε^{-1}k log n) ensures that, with probability at least 1 − n^{-6} and using a similar counting and union-bound argument as above, case (b) is taken care of in the sense that |G_{i_1,i_2,j}(h′) − F_{i_1,i_2,j}(h′)| ≤ εN^{-1}β_2(γ_2 − β_2)/k.

3. β_1(γ_1 − β_1)/k ≥ max{β_1β_2, β_2(γ_2 − β_2)/k}. We consider two subcases.

(a) εβ_1(γ_1 − β_1)/k ≤ β_1β_2. Using (5.14) with t = εN^{-1}β_1(γ_1 − β_1)/k, we get

$$\mathbb{P}\big[|G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')| > \varepsilon N^{-1}\beta_1(\gamma_1-\beta_1)/k\big] \le \exp\left(-\frac{c_3\varepsilon^2\beta_1(\gamma_1-\beta_1)^2 q}{k^2\beta_2\gamma_2}\right). \tag{5.23}$$

As before, consider the cases (i) β_2 ≤ γ_2/2 and (ii) β_2 > γ_2/2. For case (i), we use the fact that β_1(γ_1 − β_1) ≥ β_2(γ_2 − β_2) by assumption and notice that the RHS of (5.23) is at most exp(−c_3ε²β_2(γ_2 − β_2)(γ_1 − β_1)q/(k²β_2γ_2)). Hence, this is at most exp(−c_3ε²(γ_1 − β_1)q/(2k²)). The number of possibilities of B_1, B_2 of sizes β_1, β_2 is clearly at most n^{(γ_1−β_1)+β_2} ≤ n^{(γ_1−β_1)+(γ_1−β_1)/k} ≤ n^{2(γ_1−β_1)}. From this, we conclude that q = O(ε^{-2}k² log n) is sufficient for this case.

For case (ii), we bound the RHS of (5.23) by exp(−c_3ε²β_1(γ_1 − β_1)²q/(2k²β_2²)). Using the assumption that (γ_1 − β_1)/β_2 ≥ k, the latter expression is upper bounded by exp(−c_3ε²β_1q/2). Again, by our assumptions,

$$\beta_1 \ge \beta_2(\gamma_2-\beta_2)/(\gamma_1-\beta_1) \ge \big(\varepsilon(\gamma_1-\beta_1)/k\big)(\gamma_2-\beta_2)/(\gamma_1-\beta_1) = \varepsilon(\gamma_2-\beta_2)/k\,. \tag{5.24}$$

The number of possibilities of B_1, B_2 of sizes β_1, β_2 is clearly at most n^{β_1+(γ_2−β_2)}, which by (5.24) is bounded by n^{β_1+kβ_1/ε} ≤ n^{2kβ_1/ε}. From this we conclude that as long as q = O(ε^{-3}k log n) (satisfied by our choice), this case is taken care of in the sense repeatedly explained above.

(b) εβ_1(γ_1 − β_1)/k > β_1β_2. Using (5.15), we get

$$\mathbb{P}\big[|G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')| > \varepsilon N^{-1}\beta_1(\gamma_1-\beta_1)/k\big] \le \exp\left(-\frac{c_4\varepsilon\beta_1(\gamma_1-\beta_1) q}{k\gamma_2}\right). \tag{5.25}$$

We consider two subcases, (i) β_1 ≤ γ_1/2 and (ii) β_1 > γ_1/2. In case (i), we have

$$\frac{\beta_1(\gamma_1-\beta_1)}{\gamma_2} = \frac{1}{2}\cdot\frac{\beta_1(\gamma_1-\beta_1)}{\gamma_2} + \frac{1}{2}\cdot\frac{\beta_1(\gamma_1-\beta_1)}{\gamma_2} \ge \frac{1}{2}\cdot\frac{\beta_1\gamma_1}{2\gamma_2} + \frac{1}{2}\cdot\frac{\beta_2(\gamma_2-\beta_2)}{\gamma_2} \ge \beta_1/4 + \min\{\beta_2,\ \gamma_2-\beta_2\}/4\,.$$

(The first inequality uses γ_1 − β_1 ≥ γ_1/2 and the case-3 assumption β_1(γ_1 − β_1) ≥ β_2(γ_2 − β_2); the last step uses γ_1 ≥ γ_2 and max{β_2, γ_2 − β_2} ≥ γ_2/2.) Hence, the RHS of (5.25) is bounded from above by

$$\exp\left(-\frac{c_4\varepsilon q\,\big(\beta_1/4 + \min\{\beta_2,\gamma_2-\beta_2\}/4\big)}{k}\right).$$

The number of possibilities of B_1, B_2 of sizes β_1, β_2 is clearly at most n^{β_1+min{β_2, γ_2−β_2}}; hence, as long as q = O(ε^{-1}k log n) (satisfied by our choice), this case is taken care of in the sense repeatedly explained above. In case (ii), we can upper-bound the RHS of (5.25) by exp(−c_4εγ_1(γ_1 − β_1)q/(2kγ_2)) ≤ exp(−c_4ε(γ_1 − β_1)q/(2k)). The number of possibilities of B_1, B_2 of sizes β_1, β_2 is clearly at most n^{(γ_1−β_1)+β_2}, which, using our assumptions, is bounded from above by n^{(γ_1−β_1)+(γ_1−β_1)/k} ≤ n^{2(γ_1−β_1)}. Hence, as long as q = O(ε^{-1}k log n), this case is taken care of in the sense repeatedly explained above.


This concludes the proof of the lemma.

Consequently, we get the following:

Lemma 5.3  With probability at least 1 − n^{-3}, the following holds simultaneously for all k-clusterings h′ ∈ H:

$$|\mathrm{reg}(h') - f(h')| \le 5\varepsilon\,\mathrm{dist}(h', h)\,.$$

Proof  By the triangle inequality,

$$|\mathrm{reg}(h') - f(h')| \le \sum_{i=1}^{k}\sum_{j=1}^{k} |G_{i,j}(h')-F_{i,j}(h')| + 2\sum_{j=1}^{k}\sum_{1\le i_1<i_2\le k} |G_{i_1,i_2,j}(h')-F_{i_1,i_2,j}(h')|\,. \tag{5.26}$$

Using (5.5)–(5.6), then Lemmas 5.1–5.2 (assuming the success of the high-probability events), rearranging the sums, and finally using (5.4), we get:

$$\begin{aligned}
|\mathrm{reg}(h') - f(h')| &\le \sum_{i=1}^{k}\sum_{j=1}^{k}\varepsilon N^{-1}|V_{ij}\times(V_i\setminus V_{ij})|\\
&\quad + 2\varepsilon N^{-1}\sum_{j=1}^{k}\sum_{i_1<i_2}\Big(|V_{i_1j}\times V_{i_2j}| + k^{-1}|V_{i_1j}\times(V_{i_1}\setminus V_{i_1j})| + k^{-1}|V_{i_2j}\times(V_{i_2}\setminus V_{i_2j})|\Big)\\
&\le \sum_{i=1}^{k}\sum_{j=1}^{k}\varepsilon N^{-1}|V_{ij}\times(V_i\setminus V_{ij})| + 2\varepsilon N^{-1}\sum_{j=1}^{k}\sum_{i_1<i_2}|V_{i_1j}\times V_{i_2j}|\\
&\quad + 2\varepsilon N^{-1}\sum_{j=1}^{k}\sum_{i_1=1}^{k} k\cdot k^{-1}|V_{i_1j}\times(V_{i_1}\setminus V_{i_1j})| + 2\varepsilon N^{-1}\sum_{j=1}^{k}\sum_{i_2=1}^{k} k\cdot k^{-1}|V_{i_2j}\times(V_{i_2}\setminus V_{i_2j})|\\
&\le 5\varepsilon N^{-1}\sum_{i=1}^{k}\sum_{j=1}^{k}|V_{ij}\times(V_i\setminus V_{ij})| + 2\varepsilon N^{-1}\sum_{j=1}^{k}\sum_{i_1<i_2}|V_{i_1j}\times V_{i_2j}|\\
&\le 5\varepsilon\,\mathrm{dist}(h,h')\,,
\end{aligned}$$

as required.


Algorithm 3 SRRA for Semi-Supervised k-Clustering

Require: V, k, H, a pivot h = {V_i}_{i=1}^k ∈ H, estimation parameter ε ∈ (0, 1/5)
  q ← O(max{ε^{-2}k², ε^{-3}k} log n)
  Index the clusters of h such that |V_1| ≥ |V_2| ≥ ··· ≥ |V_k|
  for u ∈ V_i, i = 1, ..., k do
    for j = i, ..., k do
      S_{uj} ← sample q elements from V_j independently uniformly at random with repetitions
    end for
  end for
  return f : H → R, defined by

$$f(h') = \sum_{i=1}^{k}\frac{|V_i|}{q}\sum_{u\in V_i}\sum_{v\in S_{ui}}\big(\mathrm{cost}_{u,v}(h')-\mathrm{cost}_{u,v}(h)\big) + 2\sum_{i=1}^{k}\sum_{u\in V_i}\sum_{j=i+1}^{k}\frac{|V_j|}{q}\sum_{v\in S_{uj}}\big(\mathrm{cost}_{u,v}(h')-\mathrm{cost}_{u,v}(h)\big)\,.$$
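A direct Python rendering of Algorithm 3 may clarify the bookkeeping. This is a sketch under simplifying assumptions: clusterings are dictionaries mapping items to cluster indices, `Y(u, v)` stands for the (paid) label oracle, and the constant `c2` is an illustrative stand-in for the unspecified constant in q.

```python
import math
import random

def build_srra(clusters, Y, eps, c2=1.0, rng=random):
    """Algorithm 3 (sketch). `clusters` is the pivot h as a list of lists,
    ordered so that len(clusters[0]) >= ... >= len(clusters[-1])."""
    k = len(clusters)
    n = sum(len(V) for V in clusters)
    N = n * (n - 1)
    q = max(1, math.ceil(c2 * max(eps**-2 * k**2, eps**-3 * k) * math.log(n)))

    # Draw the samples S_uj and query the oracle only on the sampled pairs.
    samples = {}  # (u, j) -> list of (v, Y(u, v))
    for i, Vi in enumerate(clusters):
        for u in Vi:
            for j in range(i, k):
                S = [rng.choice(clusters[j]) for _ in range(q)] if clusters[j] else []
                samples[u, j] = [(v, Y(u, v)) for v in S if v != u]  # X excludes (u, u)

    pivot = {u: i for i, V in enumerate(clusters) for u in V}

    def cost(assign, u, v, y):
        """cost_{u,v}(h) = N^{-1} * 1[h(u, v) != Y(u, v)]."""
        return ((assign[u] == assign[v]) != (y == 1)) / N

    def f(h_prime):
        """Estimate reg(h') = er(h') - er(h), for h' given as {item: index}."""
        total = 0.0
        for i, Vi in enumerate(clusters):
            for u in Vi:
                for j in range(i, k):
                    w = (1 if j == i else 2) * len(clusters[j]) / q
                    for v, y in samples[u, j]:
                        total += w * (cost(h_prime, u, v, y) - cost(pivot, u, v, y))
        return total

    return f
```

With noiseless labels and the pivot equal to the target clustering, every sampled cost difference is non-negative, so f(pivot) = 0 and f(h′) ≥ 0 for every h′.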

5.4.4 Conclusion: f is an ε-SRRA

We conclude that f is an ε-SRRA estimator. Its construction pseudocode is presented for convenience in Algorithm 3. Clearly, the number of label queries required for obtaining this ε-SRRA estimator is O(n max{ε^{-2}k³, ε^{-3}k²} log n). Combining Theorem 5.1 with this bound and the iterative algorithm described in Corollary 3.1 (Algorithm 1), we obtain the following:

Corollary 5.1  There exists an active-learning algorithm for obtaining a solution h ∈ H for semi-supervised k-clustering with er_D(h) ≤ (1 + O(ε))ν with total query complexity of O(n max{ε^{-2}k³, ε^{-3}k²} log² n). The algorithm succeeds with probability at least 1 − n^{-2}.

We do not believe the ε^{-3} factor in the corollary is tight, and we speculate that it could be reduced to ε^{-2} by using more advanced measure-concentration tools. Note that we assume k is fixed. Indeed, in practice, k is often taken to be a global constant (or at most o(n)). Thus, the sample complexity of our active-learning method using these direct SRRA constructions is almost linear in n. As in the case of Corollary 4.1 and the ensuing discussion around LRPP, the result in Corollary 5.1 significantly outperforms known active-learning results that depend only on disagreement-coefficient and VC-dimension bounds, for small ν.


5.5 Hierarchical Correlation Clustering

A natural extension of the clustering setting is concerned with clustering in the face of structural constraints. Probably the most common structure is the tree hierarchy, such as those that arise in taxonomies. Here, we will examine hierarchies corresponding to uniform-depth trees with a fixed and known depth.

A depth-ℓ hierarchical clustering over a ground set V is a uniform-depth tree of height ℓ, with V sitting in the leaves. The root corresponds to V, and its children correspond to the level-ℓ clustering. Each sub-tree of the root is now a depth-(ℓ − 1) hierarchical clustering.

When viewed as a uniform-depth tree, a natural pseudometric arises over V, where the distance between two elements u, v ∈ V is half the edge length of the path connecting their corresponding leaves. Such a metric is called an ultrametric and satisfies certain strong structural properties (see Section 5.5.1). Alternatively, any ultrametric on a set V with distances in the set {0, ..., ℓ} defines such a tree and, in turn, a hierarchical clustering. For example, if ℓ = 2, then the distances between element pairs are either 0, 1, or 2. If two elements sit in the same tree leaf, meaning that they are in the same nested clusters, their distance is 0. If the two elements do not sit in the same leaf, but descend from the same child of the root, it is 1. If they descend from two different children of the root, it is 2. The extension to other values of ℓ is clear.
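Concretely, a uniform-depth tree can be encoded by giving each element its root-to-leaf address; the induced ultrametric is then the depth minus the length of the longest common prefix (a sketch with illustrative names):

```python
def tree_distance(addr_u, addr_v):
    """Ultrametric distance in a uniform-depth tree: ell minus the length
    of the common root-to-leaf prefix of the two addresses."""
    ell = len(addr_u)
    common = 0
    for a, b in zip(addr_u, addr_v):
        if a != b:
            break
        common += 1
    return ell - common

# ell = 2: same leaf -> 0, same child of the root -> 1, different children -> 2
print(tree_distance((0, 1), (0, 1)))  # 0
print(tree_distance((0, 1), (0, 2)))  # 1
print(tree_distance((0, 1), (1, 0)))  # 2
```

One can check directly that this distance satisfies the strong triangle inequality dist(u, w) ≤ max{dist(u, v), dist(v, w)} discussed in Section 5.5.1.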

In this work, we also assume a limit of k on the width of the sought solutions, defined as the maximal number of children of a node in the tree. In the language of hierarchical clusterings, this means that each cluster (at any level) may be further sub-clustered into at most k clusters in the next level.

The label of a pair of elements (u, v) ∈ X in our setting is a number in {0, ..., ℓ}, which is a noisy stipulation of the sought distance between u and v in the solution hierarchical clustering. By noisy, we mean that the corresponding matrix of labels need not be a metric (let alone an ultrametric). The goal is to fit the label matrix to an ultrametric on X, while satisfying the usual supervised-learning desiderata, namely, achieving low risk at a low query complexity.

Contrary to all former settings, the labeling function here is not binary. Still, we measure the quality of an output hierarchical clustering by its proximity to the label matrix. The risk metric we choose is the average absolute difference between corresponding distance-matrix entries. (In particular, we assume a uniform measure on the matrix elements.)

The size of the input space is quadratic in n; however, the logarithm of the number of hierarchical clusterings of depth ℓ and width k is at most ℓn log k. (There are at most k^ℓ leaves, and it suffices to assign a leaf to each v ∈ V; hence, k^{ℓn} possibilities.) Accordingly, we suspect that a small sample should suffice for the purpose of obtaining a good solution.

Technion - Computer Science Department - Ph.D. Thesis PHD-2013-04 - 2013

5.5.1 Definitions and Notations

Hierarchical clusterings are in a natural one-to-one correspondence with a geometric object called an ultrametric. In what follows, the two terms will be used interchangeably. As before, let V be a set of n items. Our instance space X consists of all distinct pairs of elements in V, namely, X = (V × V) \ {(u, u) : u ∈ V}, and its size |X| is N = n(n−1). Also, we assume that the underlying measure D over X is uniform.

An (integer) ultra-pseudometric over V is a function τ : X → ℕ that is a pseudometric on V and also satisfies the following strong triangle inequality for all distinct u, v, w ∈ V: τ((u, v)) ≤ max{τ((u, w)), τ((w, v))}. (Note that the strong triangle inequality implies the usual triangle inequality.)
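The strong triangle inequality can be checked mechanically. The following sketch (the helper name is ours, not the thesis') verifies it over all ordered triples of a small ground set:

```python
from itertools import permutations

# A small sketch: verify that a symmetric integer distance function tau
# satisfies the strong triangle inequality
# tau(u, v) <= max(tau(u, w), tau(w, v)) for all distinct u, v, w in V.
def is_ultra_pseudometric(tau, V):
    return all(tau[u, v] <= max(tau[u, w], tau[w, v])
               for u, v, w in permutations(V, 3))

V = [1, 2, 3, 4]
# Distances of a height-2 tree whose leaf clusters are {1, 2} and {3, 4}.
tau = {(u, v): 0 if u == v else (1 if {u, v} in ({1, 2}, {3, 4}) else 2)
       for u in V for v in V}
print(is_ultra_pseudometric(tau, V))  # True
```
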

[Figure 5.2 appears here: a height-3 tree whose root V has children C1, C2, C3; each Ci has three children among U1, …, U9; and the leaves host the 27 elements 1, …, 27.]

Figure 5.2: An example of an ultrametric tree with ℓ = k = 3. Let v_i denote the vertex indexed by i ∈ [27]. Observe that τ(v25, v27) = 1, that τ(v22, v1) = 3 is the maximal distance, and that V(R_{τ,2}) = {C1, C2, C3} defines the tree's first level.

Let U(V) denote the space of ultrametrics over V. It is well known (see, e.g., Ailon and Charikar, 2011) that ultrametrics have a one-to-one correspondence with a natural family of rooted trees in which each v ∈ V is hosted in a leaf, and the edge distance between the root and any leaf is a constant, which is called the height of the tree.2 If T is such a tree, then the corresponding ultrametric τ_T ∈ U(V) is defined by letting τ_T((u, v)) equal half the edge distance between the leaves hosting u and v in the tree.

2By stating that "each vertex is hosted in a leaf," we simply mean that there is a mapping from V to the set of leaves, and each leaf is identified with the corresponding pre-image.


Mapping τ to a tree T such that τ = τ_T is also easy. Let M = max_{(u,v)∈X} τ((u, v)). If M = 0, then the tree contains a single node hosting all vertices. Otherwise, the relation

R_{τ,i} = {(u, v) ∈ X : τ((u, v)) ≤ i} ∪ {(u, u) : u ∈ V}

is an equivalence relation (by the strong triangle inequality); the tree is constructed by creating a root node whose children are the roots of recursively constructed trees, one for each equivalence class of R_{τ,M−1} (with the corresponding restriction of τ to the members of the class). In what follows, we will let V(R_{τ,i}) denote the set of equivalence classes of R_{τ,i}; namely, any C ∈ V(R_{τ,i}) is a maximal subset of V satisfying (u, v) ∈ R_{τ,i} for all u, v ∈ C.
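The classes V(R_{τ,i}) can be computed greedily; for an ultra-pseudometric the greedy assignment below is well-defined because R_{τ,i} is an equivalence relation. A small sketch (helper names are ours):

```python
# Sketch: the equivalence classes V(R_{tau,i}) -- maximal subsets of V whose
# pairwise distances under tau are at most i. tau is a dict over ordered pairs.
def classes(tau, V, i):
    parts = []
    for u in V:
        for part in parts:
            if all(tau[u, v] <= i for v in part):  # u joins an existing class
                part.add(u)
                break
        else:
            parts.append({u})  # u starts a new class
    return parts

# Distances of a height-2 tree over {1, 2, 3, 4} with leaf clusters {1,2}, {3,4}.
V = [1, 2, 3, 4]
tau = {(u, v): 0 if u == v else (1 if {u, v} in ({1, 2}, {3, 4}) else 2)
       for u in V for v in V}
print(classes(tau, V, 1))  # [{1, 2}, {3, 4}]
```

Raising the threshold i coarsens the partition, matching the nesting of tree levels.
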

We will also need the notion of restrictions. For a set C ⊆ V, let X_C = (C × C) \ {(u, u) : u ∈ C}, namely, the set of distinct pairs of elements of C. For τ ∈ U(V), let τ|_C denote the restriction of τ to X_C; clearly, τ|_C ∈ U(C). When using restrictions, it will be implicitly understood that the underlying measure is uniform on X_C.

An ultrametric τ ∈ U(V) has width k if the number of children of any node in the corresponding tree is at most k. The depth of an ultrametric, depth(τ), is defined as the uniform edge distance between the root and any leaf in the corresponding tree, or alternatively, as the maximal distance between a pair of elements. Let U_{k,ℓ}(V) denote the space of ultrametrics on V of width k and depth at most ℓ, which we call the space of (k, ℓ)-ultrametrics.

We call the non-hierarchical clustering described in Section 5.1 a flat k-clustering. Here, we think of a flat clustering as a function h : X → {0, 1} such that h((u, v)) = 0 if {u, v} ⊆ V_i for some i, and otherwise h((u, v)) = 1. Note that this is opposite to the convention used so far (defined in Section 5.1): here h is a distance function, whereas it was formerly a similarity function. Using the distance semantics allows us to view a clustering as a pseudometric over V with distances in {0, 1}. More specifically, when using the labeling Y : X → {0, 1}, we rewrite the error rate to be er_D(h, Y) = N⁻¹ Σ_{(u,v)∈X} |h(u, v) − Y(u, v)|. Observe that now the set of flat clusterings over V becomes U_{k,1}(V); that is, U_{k,1}(V) is exactly the hypothesis class H defined in Section 5.1.

Learning (k, ℓ)-Ultrametrics

Consider the following supervised-learning scenario. The label function is now a symmetric function Y : X → [ℓ]. The setting is agnostic in the sense that Y may not even be a metric (let alone an ultrametric).

We are interested in learning an ultrametric τ ∈ U_{k,ℓ}(V). Our goal is to find a solution as close as possible to Y. We choose the risk function

risk_{D,ℓ}(τ) = N⁻¹ Σ_{(u,v)∈X} |τ((u, v)) − Y((u, v))| .


Note that here we apply the absolute-loss function ℓ(a − b) = |a − b| for any a, b ∈ ℕ. Similarly, we define the pseudometric dist(τ, τ′) by

dist(τ, τ′) = N⁻¹ Σ_{(u,v)∈X} |τ((u, v)) − τ′((u, v))| .

As in the case of flat k-clusterings, here again we have a supervised-learning problem, except that now the label space is no longer binary (it is [ℓ]), the hypothesis class is U_{k,ℓ}(V), and we measure performance with respect to the absolute loss instead of the zero-one loss.

Let τ* = argmin_{τ∈U_{k,ℓ}(V)} dist(τ, Y), and let ν = dist(τ*, Y) be the optimal risk. Throughout, we assume the agnostic setting, namely, ν > 0. The goal is to find a solution τ with small excess risk, defined as dist(τ, Y) − ν. The query complexity is defined as the number of instances (u, v) ∈ X for which Y((u, v)) is uncovered. We wish to keep this complexity measure low.

In what follows, we suppress the double-parentheses notation and write, e.g., Y(u, v) and τ(u, v). It is understood that Y and τ take a single argument in X.

5.5.2 SRRA for Learning Shallow Ultrametrics

We start by defining a reduction from distances between ultrametrics to distances between flat (correlation) clusterings. The idea is to compose such a distance from flat-clustering distances, one per tree level. This is summarized in the next lemmata. The proof is a straightforward induction over ℓ.

Lemma 5.4 Let τ, ω be two (k, ℓ)-ultrametrics over V. Denote the distance between two flat clusterings by d_flat. We have

dist(τ, ω) = Σ_{i=0}^{ℓ−1} d_flat(V(R_{τ,i}), V(R_{ω,i})) .

Proof We will apply induction over ℓ. Clearly, the claim is true when ℓ = 1 (i.e., the clustering is flat). Assume it holds for ℓ′ ≥ 1, and consider the case ℓ = ℓ′ + 1. Define τ̄, ω̄ to be the height-(ℓ−1) trees resulting from removing the leaves of τ and ω; that is, the leaves of τ̄, ω̄ are V(R_{τ,0}) and V(R_{ω,0}), respectively. For any pair (u, v) ∈ X, let dist_{u,v}(τ, ω) = |τ(u, v) − ω(u, v)|. Using the tree-based definition of the ultrametric pairwise distance, we have

dist_{u,v}(τ, ω) = dist_{u,v}(τ̄, ω̄) + |1{∃C_τ ∈ V(R_{τ,0}) : {u, v} ⊆ C_τ} − 1{∃C_ω ∈ V(R_{ω,0}) : {u, v} ⊆ C_ω}| ,

where we refer to the first summand as (i) and to the second as (ii). The proof is complete by noting that averaging (ii) over all pairs in X yields the flat-clustering distance d_flat(V(R_{τ,0}), V(R_{ω,0})), and by applying the inductive assumption to (i).


Similarly, we can decompose the ultrametric risk into the error rates of ℓ flat clusterings. Define the pairwise risk over a pair (u, v) ∈ X to be

risk_{u,v}(τ) = N⁻¹ |τ(u, v) − Y(u, v)| .

It is not difficult to show that

risk_{u,v}(τ) = N⁻¹ Σ_{i=0}^{ℓ−1} |1{∃C_τ ∈ V(R_{τ,i}) : {u, v} ⊆ C_τ} − 1{Y(u, v) ≤ i}| .

An inductive argument similar to the one described above will give us the following.

Lemma 5.5 Let τ be a (k, ℓ)-ultrametric over V. For any i ∈ {0, …, ℓ−1}, let er_D(i) denote the flat-clustering error rate corresponding to the disjoint cover V(R_{τ,i}) and the pairwise binary labeling 1{Y(u, v) ≤ i}. We have

risk_{D,ℓ}(τ) = Σ_{i=0}^{ℓ−1} er_D(i) .

We are now ready to define the SRRA construction. Define f_i to be the flat-clustering SRRA defined in Algorithm 3 with respect to the pivotal flat clustering V(R_{τ,i}). The estimator is then defined to be f = Σ_{i=0}^{ℓ−1} f_i.

Lemma 5.6 With probability at least 1 − n⁻², f is an ε-SRRA.

Proof We have

|f(ω) − reg_τ(ω)| ≤ Σ_{i=0}^{ℓ−1} |f_i − er_D(i)|  (5.27)
  ≤ Σ_{i=0}^{ℓ−1} ε · d_flat(V(R_{τ,i}), V(R_{ω,i}))  (5.28)
  = ε · dist(τ, ω) .  (5.29)

Inequality (5.27) is due to the definition of f and Lemma 5.5; (5.28) follows from Theorem 5.1; and (5.29) from Lemma 5.4.

We conclude the following.

Corollary 5.2 There exists an active-learning algorithm for learning (k, ℓ)-ultrametrics with a cost of at most (1 + ε)ν and query complexity

O(nℓ · max{ε⁻² k^{3ℓ}, ε⁻³ k^{2ℓ}} · log² n) .

The algorithm succeeds with probability > 1 − n⁻¹.


Observe that the query complexity of this decomposition is exponential in k and ℓ. Also note that ℓ ≤ log_k n. Thus, the bound is meaningful as long as ℓ ∈ o(log n). In practical terms, this means that this SRRA construction is good only for "shallow" hierarchies (compared to |V|). Yet, we stress that this is, as far as we know, the first active-learning method for learning ultrametrics with worst-case query-complexity guarantees.

5.6 Discussion

We presented the query-efficient variant of the correlation clustering problem. The problem deals with active learning over equivalence relations (with a bounded number of equivalence classes). Similar to LRPP, which dealt with order relations, this setting has a finite instance space and the distribution is redundant (a distribution-free setting). We showed that attempting to solve this problem with uniform sampling or with disagreement-coefficient-based methods fails to provide useful sample-complexity guarantees (in the worst-case setting). In contrast, our direct SRRA construction achieves non-trivial query-complexity guarantees that constitute the state of the art.

The specific correlation-clustering SRRA construction we define treats two types of disagreements: in-class and between-class disagreements. This made the analysis a bit more complex than in former SRRA constructions. Yet, the principles remain similar: the sampling should be denser over instances on which "close" solutions are more "likely" to disagree. Indeed, we give larger "weights" to small clusters (as defined by the pivot).

In addition, we discussed the related problem of query-efficient hierarchical correlation clustering, and more specifically, learning ultrametrics. Here, we presented a simple reduction that utilizes flat-clustering SRRA constructions and achieves a query complexity that is non-trivial for "shallow" hierarchies.


Chapter 6

Active Exploration for Graph-based Learning: Clustering with Side Information is a Big Plus

So far, we have been establishing a method of smooth relative regret approximation. A key point in the definition of SRRA functions is the fine-grained coverage of the hypothesis space. Although this is only subtle in the definition itself (Definition 3.1), it clearly arises in all of the SRRA constructions that we have discussed thus far (Chapters 3–5). Consider, for example, the disagreement-based SRRA construction presented in Section 3.2.1. The way it selects queries ensures, on the one hand, that it acquires labeled examples from all disagreement regions, and on the other hand, that queries are denser near the pivotal solution. This exactly accounts for integrating exploration and exploitation with respect to an intermediate solution.

While it is tempting to view SRRAs as the secret ingredient for balancing the exploration–exploitation tradeoff in active learning, it turns out that we need an intimate understanding of the relations between the hypothesis and instance spaces to actually utilize them. In Chapter 3, we relied on knowledge of disagreement regions; and in Chapters 4 and 5, we were able to break the hypothesis metric space into small combinatorial pieces. A natural question that arises is whether we can apply the method in the absence of such knowledge.

This chapter considers this question. Note, however, that we focus here on testing the effectiveness of our ideas from a practitioner's point of view. Therefore, we deal with a setting that has a few empirically good active-learning algorithms at its disposal. All the active-learning algorithms that we examine here are either known heuristics or variations on known ones. Our proposed solution can be viewed as a modular approach for active (transductive graph-based) learning that ensures coverage in the "spirit" of SRRA smoothness.


As before, we focus on active classification problems within a transductive setting; however, instances here are always constraint-free and described as points in a feature space. Given a sample of unlabeled examples and a labeling budget m, the learner must select m examples to be labeled by the teacher. The goal is to use the m labeled examples to classify the rest of the points in the sample.

Despite the attractiveness of the active-transductive learning setting demonstrated in the former chapters, most of the research contributions on active learning for classification have focused on inductive models. A few studies do consider the above active-transductive model (Zhu et al., 2003b; Herbster et al., 2005; Yu et al., 2006), but these works tend to rely on graph-based algorithms, which have been used extensively in transductive settings (Chapelle et al., 2006).1 One of our motivations for emphasizing the importance of coverage in active learning is the observation that the known active-transductive algorithms (as well as many active-inductive algorithms) tend to suffer from excessive "self-confidence," which can severely impair their performance. This flaw, which results in the neglect of entire areas of the input space during the early stages of the learning process, is demonstrated in Section 6.0.1. We propose a simple yet effective solution that enforces systematic exploration of the input space whenever it is necessary. Our +EXPLORE method (Begleiter et al., 2008) can be viewed both as a "patch" for fixing this deficiency and as a proposed modular approach for generating effective new active-transductive algorithms that clearly outperform currently available ones.

A few previous works consider the above-mentioned deficiency of self-confident active-inductive learning algorithms. All of their solutions are based on ensemble methods. In Baram et al. (2004), it was demonstrated that self-confident learners fail on XOR-like problems. Their solution provides a general framework for combining a set of active-learning schemes; the framework is based on online learning algorithms for the multi-armed bandit problem. A simpler solution by Osugi et al. (2005) relates the deficiency more directly to the classical exploration–exploitation problem: they switch between a self-confident learner and an "exploration" learner whenever the induced hypothesis does not change "much." The ensemble methods employed by Baram et al. (2004) and Osugi et al. (2005) depend on a number of hyper-parameters, and there is no clear way to calibrate them. A recent, simpler idea appearing in Guo and Greiner's work (2007) combines two active learners using a round-robin-like scheme; however, both learners employed in this solution are self-confident. Thus, the combined algorithm tends to suffer from the above deficiency.

Our proposed +EXPLORE solution is based on cluster covering. Other works have considered clustering within the context of active-inductive learning. The active method of Xu et al. (2003) queries cluster centers of the instances that lie

1Note that the semantics of the graphs here are different from what we considered in LRPP and semi-supervised clustering. This will be explained later in detail.


within the margin of the support-vector machine. The algorithm of Nguyen and Smeulders (2004) combines clustering and active learning; however, their method is not general-purpose, and the switch between the baseline active learner and the clustering depends on a predefined hyper-parameter.

6.0.1 Motivating Example, and a Preview

The active-transductive algorithm of Zhu et al. (2003b) is among the few known algorithms designed for the active-transductive game (more details on this algorithm appear in Section 6.2). The starting point of the current study was our empirical evaluation of this algorithm (as well as of others), which showed that it is a top performer in this setting. Our initial study also revealed a major deficiency, depicted in Figure 6.1. Recall that here we always treat instances as points in a real-valued feature space (R^d).

[Figure 6.1 appears here. Panel (a): three Gaussians, labeled +1, −1, −1, plus some noise. Panel (b): learning (error) curves, plotting error rate (0 to 0.5) against training-set size (0 to 100) for the algorithm of Zhu et al. (2003b) with and without +EXPLORE. Panel (c): the queries made by the algorithm of Zhu et al. (2003b), shown as dark squares.]

Figure 6.1: Motivating example.

The synthetic example in Figure 6.1(a) is a binary classification problem with three (non-isotropic) Gaussians and two "outlier" points that reside between the lower Gaussians. When applied to this example,2

the algorithm of Zhu et al. (2003b) exhibits behavior that is summarized by the upper learning curve of Figure 6.1(b): it does not query any point within the lowest Gaussian and fails to decrease its error below 15% within the first 100 active queries. Of course, this example was carefully constructed to emphasize this bad behavior, but it does represent a recurring pattern we observed with many "real" datasets. The +EXPLORE method we develop here salvages this algorithm on such learning problems, without reducing its already good performance on easier problems.

2We set the hyper-parameter k = 10 in the algorithm of Zhu et al. (2003b) and noticed that the algorithm failed with other k values on similar examples. The noise is essential for establishing this example, because this algorithm does handle noiseless XOR-like problems (see, e.g., Zhu et al., 2003b, Figure 2).


6.1 Problem Definition

We consider a distribution-free transductive setting (Vapnik, 1982, Chapter 10). We will use notation that differs slightly from the notation used so far.

The distribution-free transductive setting is defined as follows. Consider a fixed set S_{m+u} = {(x_i, y_i)}_{i=1}^{m+u} ⊆ R^d × {+1, −1} of m + u points, along with their binary labels. The learner is provided with the unlabeled full sample X_{m+u} = {x_i}_{i=1}^{m+u}. A training set S_m consisting of m labeled points (x_i, y_i) is selected from S_{m+u} uniformly at random among all subsets of size m and is given to the learner. The test set X_u of size u containing the remaining unlabeled points is also given to the learner. The learners we consider here generate soft-classification vectors h = (h_1, …, h_{m+u}) ∈ R^{m+u}, where h_i is the soft label of example x_i given by the hypothesis h. The algorithm outputs sign(h_i) as the actual (binary) classification of x_i. In the passive transductive setting, the goal of the learner is to predict the labels of the test points in X_u from (S_m, X_u), so as to minimize the transductive risk, (1/u) Σ_{x_i∈X_u} ℓ(sign(h_i), y_i), with respect to the 0/1 loss function ℓ. In the active setting, the m training points are actively selected by the learner. The examples to be queried are selected iteratively: at each iteration, the learner selects the next example to be queried and receives its label from a teacher.3 The goal of the active learner is to minimize the transductive risk over the remaining points that were not queried.4

Graph-based Transductive Learning. In the graph-based transductive setting, the learner receives, in addition to the pool of unlabeled points, an adjacency matrix over the points X_{m+u}. This (m + u) × (m + u) adjacency matrix reflects the similarity between the full-sample points. A common way to establish such a graph is by connecting k-nearest neighbors with respect to Euclidean distances; see, for example, our description of the adjacency-matrix construction in Section 6.3.1. The similarity can be utilized as additional side information for improving classification. It is important to note that this setting is different from those of LRPP and correlation clustering (Chapters 4 and 5). Here, instances correspond to vertices, and the graph structure is known (or partially known) in advance. In LRPP and correlation clustering, instances consist of edges in a graph, and the graph's structure is revealed in tandem with the acquisition of labels.

3There is also a "batch" form of selection of training examples, in which the learner selects the examples for the training set and then simultaneously queries all of them (e.g., see Yu et al., 2006). Batch querying is a special case of sequential querying and can be viewed as a limitation of the learner.

4Alternate optimization criteria may consider the entire learning curve of the active learner (see, e.g., Baram et al., 2004; Melville et al., 2005).


6.2 A Review of the Graph-Based Transductive Algorithms We Use

We focus on four graph-based transductive algorithms5 by Joachims (2003), Zhu et al. (2003a), Belkin et al. (2004), and Zhou et al. (2004). These algorithms generate smooth solutions; that is, the soft classification does not change much between nearby points.

Let y ∈ {−1, 1, 0}^{m+u} be a vector of known labels defined as follows: if x_i ∈ X_u, then the ith entry of y is 0; otherwise, the ith entry of y is y_i (the true label of x_i). All four algorithms minimize the objective function min_{h∈R^{m+u}} (hᵗRh + c(h − y)ᵗC(h − y)), where the left-hand term is a regularization term corresponding to the smoothness requirement and the right-hand term corresponds to the loss of the hypothesis h. The constant c ∈ R balances the regularization and loss terms.

The regularizer R is an (m + u) × (m + u) matrix induced by an adjacency matrix W. The adjacency matrix reflects the similarity between the full-sample points. In our experiments, we used the adjacency matrix corresponding to the kNN graph G, which is built as described in Section 6.3.1. All four algorithms use the graph Laplacian regularizer L = D − W or its normalized version L_norm = I − D^{−1/2} W D^{−1/2}, where D is a diagonal matrix with (i, i)th entry d_i = Σ_{j=1}^{m+u} w_ij, and I is the identity matrix. The cost matrix C is an (m + u) × (m + u) diagonal matrix whose (i, i)th entry is the misclassification cost of the ith example. All examples in the training (test) set share the same misclassification cost, denoted by C_l (C_u).

We now describe the four graph-based algorithms that we used: GRFM, SOFT, SGT, and CM. These algorithms differ essentially in their definition of the cost matrix C. The first three algorithms use the graph Laplacian regularizer R = L, and the last one (CM) uses the normalized graph Laplacian L_norm.

The Gaussian random field model (GRFM) algorithm (Zhu et al., 2003a) sets C_l = ∞ and C_u = 0. Hence, this algorithm puts an infinite penalty on empirical errors and thus enforces solutions with zero training error.

Algorithm SOFT. The algorithm of Belkin et al. (2004) also sets C_u = 0. However, in this algorithm C_l = 1, and hence empirical errors are allowed.6 We refer to this algorithm as SOFT.

The Spectral Graph Transducer (SGT) algorithm (Joachims, 2003) sets 0 < C_l < ∞ and C_u = 0. However, SGT adds two constraints: Σ_i h_i = 0 and Σ_i h_i² = m + u, imposing solutions that minimize the ratiocut induced in G by the positive and negative values in h.7

5The application of our active-learning scheme to other transductive algorithms is straightforward. See Zhu (2006) for a comprehensive survey of the existing transductive algorithms.

6In addition, this algorithm uses the constraint Σ_i h_i = 0, which is required for proving a risk bound.

Algorithm CM. Finally, the consistency method (CM) of Zhou et al. (2004) uses the normalized Laplacian regularizer R = L_norm and sets C_l = C_u = 1. This value of C_u forces the soft-classification values of the unlabeled (test) points that are far from the labeled ones to be close to zero.

6.3 An Exploration-Exploitation Routine and Its Implementation

Our starting point is an active-learning algorithm ALG = (P, Q), where P is a passive-learning algorithm and Q is a querying component. The passive learner P uses a given (S_m, X_u) to generate a transductive hypothesis h, and the querying component selects the next example to query, x ∈ X_u, using h and (S_m, X_u). In this section we describe the +EXPLORE routine, whose goal is to improve the performance of such algorithms by enforcing systematic exploration of unlabeled points. The proposed routine requires two components: an auxiliary querying function QA(S_m, X_u) and a switching function SW(S_m, X_u) that determines whether to generate the query using Q or QA. Note that QA and SW do not rely on h and thus do not suffer from the "self-confidence" deficiency.

Our routine can be viewed as a meta-algorithm that operates the given active learner and augments it with additional querying capabilities. The routine performs m iterations corresponding to the m required queries. At each iteration, the switching component decides which querying method to apply next. Upon termination, the routine uses the passive learner P and the aggregated training set S_m to classify the remaining test points.

The pseudocode in Figure 6.2 defines our proposed procedure. In the following sections, we describe our implementation of the decision function SW and the auxiliary querying component QA (Section 6.3.1). In our experiments, we examined several implementations of the Q and P components; these are described in Sections 6.2 and 6.3.2.

6.3.1 Implementation of QA and SW

Our implementation of QA and SW is based on a very simple and effective method of cluster covering. At each iteration, we cluster X_u. If there is an uncovered cluster containing no labeled points, our switching function SW decides to use QA, which selects a "representative" point from the largest uncovered cluster. Otherwise, SW operates the original querying function Q.

7Let A+(h) and A−(h) be the sets of indices of the positive and negative components of h, respectively. The ratiocut is defined as Σ_{i∈A+(h), j∈A−(h)} w_ij (1/|A+(h)| + 1/|A−(h)|).

Require: The unlabeled full sample X_{m+u}; an active-learning algorithm ALG = (P, Q); a switching component SW; and an auxiliary querying component QA.
Ensure: A classification of X_{m+u}.
  S_0 = ∅
  for i = 1 to m do
    h = P(S_{i-1}, X_{m+u-i+1})
    if SW(S_{i-1}, X_{m+u-i+1}) == Q then
      x = Q(h, S_{i-1}, X_{m+u-i+1})
    else
      x = QA(S_{i-1}, X_{m+u-i+1})
    end if
    Query the label y of the point x.
    S_i = S_{i-1} ∪ {(x, y)}
    X_{m+u-i} = X_{m+u-i+1} \ {x}
  end for
  return P(S_m, X_u)

Figure 6.2: +EXPLORE routine.
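The meta-algorithm loop can be sketched in Python. The components P, Q, SW, and QA below are hypothetical callables standing in for those described in the text (for simplicity, the switch here returns a boolean rather than one of the two components):

```python
# A schematic sketch of the +EXPLORE meta-algorithm. P (passive learner),
# Q (querying), SW (switch), and QA (auxiliary querying) are stand-ins.
def explore(X_pool, labels, m, P, Q, SW, QA):
    S = []                        # aggregated training set of (x, y) pairs
    U = list(X_pool)              # remaining unlabeled points
    for _ in range(m):
        h = P(S, U)               # transductive hypothesis on the current split
        x = Q(h, S, U) if SW(S, U) else QA(S, U)
        U.remove(x)
        S.append((x, labels[x]))  # query the teacher for the label of x
    return P(S, U)                # classify the remaining (test) points

# Dummy components, for illustration only.
P = lambda S, U: (len(S), len(U))  # stands in for a passive learner
Q = lambda h, S, U: U[0]
SW = lambda S, U: True
QA = lambda S, U: U[-1]
print(explore(list(range(5)), {i: 1 for i in range(5)}, 2, P, Q, SW, QA))  # (2, 3)
```
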

The clustering we perform in this implementation is semi-supervised, which means that it takes into account all available points and labels. Since the set of acquired labels grows during the active-learning session, the clustering we compute is dynamically improved after each iteration.

In the remainder of this section, we describe our implementation of the semi-supervised clustering. At the ith iteration, we build a graph G_i representing the current training and test sets (S_{i-1}, X_{m+u-i+1}). This is done in two steps. In the first step, we generate a symmetric kNN graph, denoted by G, which represents the unlabeled full sample X_{m+u}. In this graph, there is an edge between two points iff one of them is among the k "most similar" points to the other. We measure the similarity between x_i and x_j by the cosine similarity, d(x_1, x_2) = x_1·x_2/(||x_1|| ||x_2||). We note that this choice of metric is arbitrary and any metric could be used. If there exists an edge between the points x_i and x_j, then we set its weight to w_ij = d(x_i, x_j); otherwise, w_ij = 0.
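The symmetric kNN construction above can be sketched as follows (function names are ours; points are plain lists of floats):

```python
import math

# Sketch of the symmetric kNN graph: an edge (weighted by cosine similarity)
# joins two points iff one of them is among the k most similar to the other.
def cosine(x1, x2):
    dot = sum(a * b for a, b in zip(x1, x2))
    n1 = math.sqrt(sum(a * a for a in x1))
    n2 = math.sqrt(sum(b * b for b in x2))
    return dot / (n1 * n2)

def knn_graph(points, k):
    n = len(points)
    sims = [[cosine(points[i], points[j]) for j in range(n)] for i in range(n)]
    # k most similar neighbors of each point (excluding itself)
    nn = [set(sorted((j for j in range(n) if j != i),
                     key=lambda j: -sims[i][j])[:k]) for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in nn[i] or i in nn[j]):  # symmetrize
                W[i][j] = sims[i][j]
    return W

points = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
W = knn_graph(points, k=1)
```

By construction W is symmetric even though nearest-neighborhood itself is not.
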

Starting with G, we then construct the graph G_i, which encodes all known labels (in S_{i-1}). Many methods have been proposed for incorporating labeled points in clustering (see Kulis et al., 2005, and the references therein). We tried several of them and obtained rather weak results in our setting. Hence, we propose a novel heuristic, which is guided by the following commonly used principles (Kulis et al., 2005):


1. Points with different labels should not, in general, be "similar." Thus, we delete the edges between such points in S_{i-1} (by setting their weights to zero).

2. Points with the same label can be similar. Hence, if there exists apair of points xr and xs in Si−1 with the same label and there isno edge wrs between them, we add an edge whose weight is wrs =12

(minj:wrj 6=0wrj+ minj:wsj 6=0wsj

).
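The two principles above can be sketched in a few lines. This is a minimal illustration, assuming a weight matrix W and a dictionary mapping labeled indices to ±1 labels; names, data layout, and the pair-processing order are our own choices.

```python
import numpy as np

def incorporate_labels(W, labels):
    """Return a copy of W adjusted by the two labeling principles.

    `labels` maps point index -> label (+1/-1) for the labeled set;
    unlabeled points are absent from the mapping. Assumes every labeled
    point has at least one positive-weight edge in W.
    """
    W = W.copy()
    idx = list(labels)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            r, s = idx[a], idx[b]
            if labels[r] != labels[s]:
                # Principle 1: differently labeled points are not similar.
                W[r, s] = W[s, r] = 0.0
            elif W[r, s] == 0.0:
                # Principle 2: connect same-label points; the new weight
                # averages the lightest edge incident to each endpoint.
                wr = W[r][W[r] > 0].min()
                ws = W[s][W[s] > 0].min()
                W[r, s] = W[s, r] = 0.5 * (wr + ws)
    return W
```

Note that the result can depend on the order in which pairs are processed, since deletions change the minima used by later additions; the thesis text does not fix this order.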

After Gi is constructed, we cluster it using a graph-based (pairwise) clustering algorithm. In general, any unsupervised clustering algorithm can be used. We preferred an algorithm that includes some kind of reasonable mechanism for selecting the number of clusters. While automatic selection of the number of clusters is an ill-defined problem, there are some reasonable heuristics, such as the Eigenvector-Alignment mechanism of Zelnik-Manor and Perona (2004). In light of our familiarity and extensive experience with spectral techniques (such as those discussed in Ng et al., 2001; von Luxburg, 2007), we selected this method.

What remains to be described is our implementation of QA, the auxiliary querying component. As mentioned above, given the largest uncovered cluster, our goal is to select a representative point in this cluster. A representative point can be defined in several meaningful ways. For example, it can be taken to be the most central point in the cluster (in the sense of minimizing the maximal distance to any other point). While this approach makes sense, it is computationally expensive (for example, selecting a representative point using the Floyd-Warshall algorithm takes cubic time in the cluster size). Therefore, we defined the representative point as the one that is most similar to its neighbors, namely, the point with the largest sum of weights of its incident edges. This point can be identified in quadratic time in the cluster size.
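The chosen definition, the cluster member with the largest sum of incident edge weights, admits a direct sketch (illustrative code; the function name is our own):

```python
import numpy as np

def representative_point(W, cluster):
    """Pick the cluster member with the largest sum of incident edge
    weights, restricted to the cluster.  Quadratic in the cluster size,
    since it touches every pair within the cluster."""
    sub = W[np.ix_(cluster, cluster)]   # weights within the cluster only
    return cluster[int(np.argmax(sub.sum(axis=1)))]
```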

6.3.2 On Some Known and Some New Querying Components

In this section we describe all querying components Q that were used in our experiments. Several of them are known, whereas some are new. The first active querying method that we consider is a transductive variant of the worst-case heuristic of Campbell et al. (2000). This heuristic is motivated by the following considerations. We assume that the absolute value |hi| of the soft classification of the ith point is proportional to the true probability that its label yi is sign(hi), and we choose to query x = argmax_{xi∈Xu} min{(1 − hi)², (−1 − hi)²}. It can be verified8 that the values of h produced by the passive algorithms that we consider (see Section 6.2) are in [−1, 1]^{m+u}. Therefore, the solution for the above

8 If the absolute values of some of the hi exceed 1, then clipping them to sign(hi) will only reduce the training error and the regularization term.


optimization problem is the most uncertain point, argmin_{xi∈Xu} |hi|. We term this method "UNCERTAIN." Some active-transductive experiments with the UNCERTAIN querying function are presented in Zhu et al. (2003b) and Herbster et al. (2005).
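The UNCERTAIN rule is a one-liner (illustrative sketch; `h` is the soft-classification vector and `unlabeled` lists the indices of Xu):

```python
import numpy as np

def uncertain_query(h, unlabeled):
    """UNCERTAIN: query the unlabeled point whose soft classification is
    closest to zero, i.e. argmin over X_u of |h_i|."""
    unlabeled = np.asarray(unlabeled)
    return int(unlabeled[np.argmin(np.abs(h[unlabeled]))])
```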

When operated with SVM, UNCERTAIN coincides with the minimum margin method called SIMPLE in (Tong and Koller, 2001), which queries the point with the minimal distance to the separating hyperplane. We propose a transductive variant of the SIMPLE strategy. A graph cut between positive and negative vertices, induced by h, can be considered a transductive variant of the separating hyperplane. Hence, the transductive analogue, denoted CUT, queries the unlabeled point that is closest to the cut. The distance to the cut is measured according to the edge weights; since the weights are similarities, the heavier the path to the cut, the closer the point is. Ties are resolved by a random selection among the closest points to the cut. Unlike SVM, in graph-based algorithms the UNCERTAIN and CUT methods can query different points, although they mostly query points lying on the graph cut (i.e., the points xi such that there exists xj with wij ≠ 0 and sign(hi) ≠ sign(hj)).
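One simple reading of the CUT rule can be sketched as follows; the exact path computation is an assumption on our part (the text above does not pin it down), and here a point's closeness to the cut is scored by the heaviest single edge connecting it to an oppositely classified point:

```python
import numpy as np

def cut_query(W, h, unlabeled, rng=None):
    """CUT (one simple reading): score each unlabeled point by the
    heaviest edge crossing the cut at that point; heavier similarity
    means closer to the cut.  Ties are broken uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    signs = np.sign(h)
    scores = []
    for i in unlabeled:
        crossing = W[i] * (signs != signs[i])   # edges crossing the cut
        scores.append(crossing.max())
    scores = np.asarray(scores)
    best = np.flatnonzero(scores == scores.max())
    return int(np.asarray(unlabeled)[rng.choice(best)])
```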

We define a novel querying method that queries the most "coarse" point. This coarseness corresponds to the difference between the soft classifications of the point and its neighbors. Specifically, we define the coarseness of xi as ∑_{j=1}^{m+u} (hi − hj)² wij. Here, this method is called COARSE. Intuitively, COARSE queries points residing in regions that include many close points with opposite labels.
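The coarseness score admits a direct vectorized sketch (illustrative; `h` and `W` as above):

```python
import numpy as np

def coarse_query(W, h, unlabeled):
    """COARSE: query the unlabeled point x_i maximizing
    sum_j (h_i - h_j)^2 * w_ij."""
    coarseness = ((h[:, None] - h[None, :]) ** 2 * W).sum(axis=1)
    unlabeled = np.asarray(unlabeled)
    return int(unlabeled[np.argmax(coarseness[unlabeled])])
```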

The last two active querying methods we examined are those of Zhu et al. (2003b) and Herbster et al. (2005). The method of Zhu et al. (2003b), termed here REDUCE-RISK, queries the point that minimizes the expected transductive risk of the underlying passive algorithm. Since the true risk cannot be computed, Zhu et al. (2003b) approximated it by the overall uncertainty over the test set, ∑_{xi∈Xu} |hi|. The naïve implementation of REDUCE-RISK is computationally intensive, since for each query it needs to run the passive classifier Ω(u) times. They developed an efficient implementation for their passive algorithm (Zhu et al., 2003a).

The querying function of Herbster et al. (2005) queries the point that optimizes the trade-off between being uncertain and being central (namely, the distance from it to any point in the graph). This heuristic is motivated by the bound on the number of mistakes made by the underlying online algorithm of Herbster et al. (2005).

6.4 Empirical Evaluation

We empirically validated the efficiency of the +EXPLORE procedure using 14 different self-confident active-transductive algorithms and 11 standard datasets. Among these algorithms, two are known (Zhu et al., 2003b; Herbster et al., 2005) and the rest are novel.


6.4.1 Datasets and Experimental Setting

The comparison is made on 11 datasets: PIMA, BUPA, VOTING, TAE, IONOSPHERE, MUSH, MUSK, MONK, COIL, DIGIT, and TEXT. These datasets are used in the context of empirical validation of transductive algorithms. The first eight datasets are standard UCI datasets used by Blum and Chawla (2001);9 the image datasets COIL and DIGIT are used by Chapelle et al. (2006); and the 20-newsgroups binary sub-problem "Atheism versus Religion" (TEXT) was used by Zhu et al. (2003b). All datasets were shuffled and cut in half.10

We ran the GRFM, SOFT, SGT, and CM passive learners with the following querying components: CUT, UNCERTAIN, and COARSE. In addition, we experimented with the active-transductive algorithms of Zhu et al. (2003b) and Herbster et al. (2005). We term all these (P, Q)-combination algorithms SELF-CONF(P, Q).

Recall that SELF-CONF(P, Q) algorithms base their query on a transductive hypothesis and thus require an initial training set consisting of two examples. Hence, we report the mean error of such learners over five initializations chosen uniformly at random. Note that our +EXPLORE procedure always starts with exploration steps and thus implies a deterministic choice of the initial training set.

Data         P = SGT    SELF-CONF         +EXPLORE   SELF-CONF         +EXPLORE   Zhu et al.   Herbster et al.
                        (SGT, UNCERTAIN)              (CM, UNCERTAIN)

PIMA         29.8±0.4   31.1±1.0          27.5       28.8±0.5          27.2*      28.9±0.8     29.0±0.0
BUPA         38.5±0.7   36.6±1.6          36.6       35.6±1.0          34.1*      39.2±0.4     46.3±0.0
VOTING        5.6±1.0    0.6±0.5           0.6        0.5±0.2           0.0*       1.2±0.0      4.6±0.22
TAE          30.8±2.7   22.3±2.8           7.7*      15.4±2.1          11.5       20.0±1.5     36.2±0.9
IONOSPHERE   22.1±1.4   19.7±1.8          13.5*      19.4±1.5          18.3       15.3±1.0     28.6±0.0
MUSH          6.1±1.8    0.4±0.0           3.6        0.6±0.4           0.0*       3.3±0.6      8.0±0.3
MUSK         19.2±1.9   15.0±1.4          13.1       11.0±0.4*         12.0       15.8±0.2     25.6±0.0
MONK         19.2±1.9   10.4±1.6*         13.3       18.6±1.5          15.1       20.2±1.2     20.4±1.6
COIL         19.2±3.4   11.6±2.1          10.0        9.5±0.6           8.4*      37.1±3.7     46.7±3.2
DIGIT         3.4±0.7    0.3±0.0*          0.3*       1.5±0.2           1.3        1.0±0.0      3.9±1.5
TEXT          7.8±0.7    5.0±0.3           4.5*      10.5±0.5           9.4       11.1±0.2     13.3±0.8

Table 6.1: The error (%) of the "best" representatives of the +EXPLORE, Q, and P methods; the last two columns are the active-transductive algorithms of Zhu et al. (2003b) and Herbster et al. (2005). The lowest error in each row (over all columns) is marked with an asterisk.

9 Some of these UCI datasets contain nominal features, which we translated into a vector of indicator bits.

10 This was done to reduce the amount of running time, which took more than a month using 20 CPUs.


We report the best result in hindsight achieved over a grid of hyper-parameters. In general, no parameter selection scheme is known for (either transductive or inductive) active learning. Hence the goal of this section is to explore the potential of +EXPLORE on top of SELF-CONF. The grid k ∈ {5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, 100} was shared by all of the algorithms. The adjacency matrices W were built with cosine similarity. We used the following values for c, the hyper-parameter that balances the regularization and loss terms: {0.001, 0.01, 0.1, 1, 10, 100} in SOFT and CM; {0.1, 1, 10, 100, 1000, 3200} in SGT.11

6.4.2 The Efficiency of +EXPLORE

Figure 6.3(a)-(c) depicts scatter plots comparing the +EXPLORE, SELF-CONF, and RAND (P) methods. The comparison comprises 154 experiments and was carried out over all datasets using all SELF-CONF(P, Q) combinations and the known active-transductive algorithms of Zhu et al. (2003b) and Herbster et al. (2005). Notice that most of these experiments correspond to new SELF-CONF(P, Q) combinations that have not been tested before. In Figure 6.3(a)-(c), the points above/below the dividing line correspond to a loss/win of the y-axis method over the x-axis method. We only depict results for which there is no overlap between the corresponding mean errors ± the standard error of the mean (SEM).

[Figure 6.3 appears here: three scatter plots of mean error (axes ranging over 0-0.5): (a) SELF-CONF vs. PASSIVE (81 points: 68% below, 32% above the diagonal); (b) +EXPLORE vs. PASSIVE (122 points: 85% below, 15% above); (c) +EXPLORE vs. SELF-CONF (119 points: 83% below, 17% above).]

Figure 6.3: Comparing the three methods: RAND, SELF-CONF, and +EXPLORE. Each point comprises the mean error results of two methods (on the x-axis and the y-axis) over one of the datasets for a training size m = 50.

Observe that the SELF-CONF methods are better than RAND for only 55 out of 81 results. When applying SELF-CONF together with +EXPLORE, the advantage over RAND increases to 104 out of 122 results. Note that the number of significant wins over P increases by 89% when using +EXPLORE. This effect is confirmed by Figure 6.3(c), which depicts the clear advantage of +EXPLORE over SELF-CONF.

11 All other hyper-parameters of the SGT implementation were set to their default values.


Next, in Table 6.1, we compare the "best" representative of each method. These representatives were chosen according to the Friedman rank test (Demsar, 2006) at a 95% significance level. For completeness, we also include the results of the relevant active-transductive algorithms of Zhu et al. (2003b) and Herbster et al. (2005). The comparison shows that +EXPLORE achieves the best results on 9 out of the 11 datasets.

6.4.3 The Advantage of Adaptive Exploration

We sketch a few numerical examples that indicate the usefulness of performing a dynamic exploration. This adaptive nature of exploration is crucial for establishing the advantage added by +EXPLORE to SELF-CONF algorithms. Figure 6.4(a) depicts how a bad choice of only three points at the beginning of the learning rounds dramatically affects the performance. The three dots on the error curve of +EXPLORE correspond to the exploration steps (determining the initial training set).

Figure 6.4(b-c) depicts the positive effect of performing adaptive exploration. Observe in Figure 6.4(b) how the sequence of three exploration steps, starting around m = 33, separates the error curves of SELF-CONF and +EXPLORE. The sequence of explorations depicted in Figure 6.4(c) dramatically reduces the error rate from 0.2 to 0.

[Figure 6.4 appears here: three learning-curve plots of error rate (0-1) versus training-set size (0-50), each comparing SELF-CONF with SELF-CONF+EXPLORE: (a) (SGT, CUT) on MUSK; (b) (SOFT, CUT) on IONOSPHERE; (c) (CM, CUT) on MUSH.]

Figure 6.4: The effect of dynamic exploration: comparing the learning (error) curves of SELF-CONF with SELF-CONF+EXPLORE. Queries by exploration (using QA) are indicated by dark dots.

6.5 Discussion

One way to think of +EXPLORE is as a simple yet effective enhancement procedure that repairs the self-confidence deficiency of many active-transductive algorithms; however, we prefer to view it as a surrogate for the rigorous SRRA condition. +EXPLORE is literally an implementation of the key observation at the core of the SRRA method.


We empirically tested our proposed +EXPLORE method using the known active-transductive algorithms of Zhu et al. (2003b) and Herbster et al. (2005) and 12 new algorithms. The experiments clearly indicate that our +EXPLORE enhancement improves the performance of self-confident active learners in most cases. Moreover, state-of-the-art results are achieved when applying (SGT, UNCERTAIN) and (CM, UNCERTAIN) together with the +EXPLORE method.

The +EXPLORE method is a heuristic guided by the "spirit" of the SRRA smoothness condition; yet, its clear empirical advantage provides a very good indication of the potential effectiveness of our ideas.


Appendix A

Standard Concentration to the Mean Bounds

In our proofs, we use the following concentration to the mean bounds. These are rather well-known standard tools that are widely used in our context; however, we state them here for the sake of completeness.

We use the well-known Hoeffding inequality (e.g., Devroye et al., 1996, Chapter 8.1, page 122).

Theorem A.1 (Hoeffding) Let X1, X2, . . . , Xn be n i.i.d. random variables and B1, B2, . . . , Bn a sequence of positive numbers such that, for each i, E[Xi] = 0 and |Xi| ≤ Bi almost surely. Then,

P[∑_{i=1}^{n} Xi > t] ≤ exp(−t² / (2 ∑_{i=1}^{n} Bi²)).

We use the following version of the Chernoff bound for the tail distribution of a sum of 0-1 random variables that are not necessarily identically distributed. Such variables are known as Poisson trials (not to be confused with Poisson random variables). Bernoulli trials are the special case of Poisson trials in which all random variables have the same distribution. We take the bound from (Mitzenmacher and Upfal, 2005, Theorems 4.4 and 4.5, Chapter 4.2.1).

Theorem A.2 (Chernoff) Let X1, X2, . . . , Xn be independent Poisson trials such that Pr(Xi = 1) = pi. Let X = ∑_{i=1}^{n} Xi and µ = E[X]. Then, for 0 < δ ≤ 1,

P[X ≥ (1 + δ)µ] ≤ exp(−µδ²/3),

and

P[X ≤ (1 − δ)µ] ≤ exp(−µδ²/2).
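Both tails can likewise be checked empirically on heterogeneous Poisson trials (an illustrative sketch; the probabilities below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.1, 0.3, 0.5, 0.2] * 25)   # 100 Poisson trials
mu = p.sum()                              # mu = E[X] = sum of the p_i
delta, trials = 0.5, 20000
X = (rng.random((trials, p.size)) < p).sum(axis=1)
upper = np.mean(X >= (1 + delta) * mu)
lower = np.mean(X <= (1 - delta) * mu)
assert upper <= np.exp(-mu * delta**2 / 3)   # upper-tail bound
assert lower <= np.exp(-mu * delta**2 / 2)   # lower-tail bound
```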


We take the description of Bernstein's inequality from (e.g., Devroye et al., 1996, Theorem 8.2, page 124). (Recall that Var(X) = E[X²] − E[X]².)

Theorem A.3 (Bernstein) Let X1, X2, . . . , Xn be n i.i.d. random variables with E[Xi] = 0 for all i. Assume there exists a number M for which Xi ≤ M almost surely. Then,

P[∑_{i=1}^{n} Xi > t] ≤ exp(−(t²/2) / (nE[Xi²] + Mt/3)).


References

Nir Ailon. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. Journal of Machine Learning Research, 13:137–164, 2012.

Nir Ailon and Moses Charikar. Fitting tree metrics: Hierarchical clustering and phylogeny. SIAM J. Comput., 40(5):1275–1291, 2011.

Nir Ailon, Bernard Chazelle, Seshadhri Comandur, and Ding Liu. Estimating the distance to a monotone function. Random Struct. Algorithms, 31(3):371–383, 2007.

Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. Journal of the ACM, 55(5):23:1–23:27, October 2008.

Nir Ailon, Ron Begleiter, and Esther Ezra. Active learning using smooth relative regret approximations with applications. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of JMLR Workshop and Conference Proceedings, 2012.

Noga Alon. Ranking tournaments. SIAM Journal on Discrete Mathematics, 20, 2006.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: the adversarial multi-armed bandit problem. In FOCS, 1995.

Francis R. Bach. Active learning for misspecified generalized linear models. In Advances in Neural Information Processing Systems 19, pages 65–72, 2007.

Ran Bachrach, Shai Fine, and Eli Shamir. Learning using query by committee, linear separation and random walks. Theoretical Computer Science, 284(1), 2002.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In ICML, pages 65–72, 2006.


Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In COLT, pages 35–50, 2007.

Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman. The true sample complexity of active learning. In COLT, pages 45–56, 2008.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009a.

Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '09, pages 1068–1077, 2009b.

Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004.

Yoram Baram, Ran El-Yaniv, and Kobi Luz. Online choice of active learning algorithms. JMLR, 5:255–291, 2004.

Sugato Basu. Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments. PhD thesis, Department of Computer Sciences, University of Texas at Austin, 2005.

Ron Begleiter, Ran El-Yaniv, and Dmitry Pechyony. Repairing self-confident active-transductive learners using systematic exploration. Pattern Recognition Letters, 29(9):1245–1251, 2008.

Mikhail Belkin, Irina Matveeva, and Partha Niyogi. Regularization and semi-supervised learning on large graphs. In COLT, 2004.

Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281–297, 1999.

Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.

Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In NIPS, 2010.

Avrim Blum and Shuchi Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML, 2001.

Mark Braverman and Elchanan Mossel. Noisy sorting without resampling. In SODA, pages 268–276, 2008.


Mihai Badoiu and Kenneth L. Clarkson. Smaller core-sets for balls. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003.

Colin Campbell, Nello Cristianini, and Alex J. Smola. Query learning with large margin classifiers. In ICML, 2000.

Ben Carterette, Paul N. Bennett, David Maxwell Chickering, and Susan T. Dumais. Here or there: Preference judgments for relevance. In Proceedings of the European Conference on Information Retrieval (ECIR), 2008.

Rui Castro and Robert Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

Rui Castro, Rebecca Willett, and Robert Nowak. Faster rates in regression via active learning. In NIPS, 2005.

Giovanni Cavallanti, Nicolo Cesa-Bianchi, and Claudio Gentile. Linear classification and selective sampling under low noise conditions. In NIPS, pages 249–256, 2008.

Giovanni Cavallanti, Nicolo Cesa-Bianchi, and Claudio Gentile. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 83(1):71–102, 2011.

Nicolo Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. Worst-case analysis of selective sampling for linear-threshold algorithms. In Advances in Neural Information Processing Systems 17, 2004.

Nicolo Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. Active learning on trees and graphs. In COLT, pages 320–332, 2010.

Nicolo Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. A correlation clustering approach to link classification in signed networks. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of JMLR Workshop and Conference Proceedings, pages 34.1–34.20, 2012.

Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, 2006.

Moses Charikar and Anthony Wirth. Maximizing quadratic programs: Extending Grothendieck's inequality. In FOCS, pages 54–60. IEEE Computer Society, 2004.

Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. J. Comput. Syst. Sci., 71(3):360–383, 2005.


David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15:201–221, 1994.

David Cohn, Rich Caruana, and Andrew McCallum. Semi-supervised clustering with user feedback. Unpublished manuscript, 2000. URL http://www.cs.umass.edu/~mccallum/papers/semisup-aaai2000s.ps.

Don Coppersmith, Lisa K. Fleischer, and Atri Rudra. Ordering by weighted number of wins gives a good ranking for weighted tournaments. ACM Trans. Algorithms, 6:55:1–55:13, July 2010.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

Ido Dagan and Sean P. Engelson. Committee-based sampling for training probabilistic classifiers. In Proc. 12th International Conference on Machine Learning, pages 150–157. Morgan Kaufmann, 1995.

Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.

Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In ICML, pages 208–215, 2008.

Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of Perceptron-based active learning. In Proceedings of the Eighteenth Annual Conference on Learning Theory (COLT), 2005.

Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.

Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, Berlin, 3rd edition, 2008.

Ayhan Demiriz, Kristin Bennett, and Mark J. Embrechts. Semi-supervised clustering using genetic algorithms. In Artificial Neural Networks in Engineering (ANNIE-99), pages 809–814. ASME Press, 1999.

Janez Demsar. Statistical comparisons of classifiers over multiple data sets. JMLR, 7:1–30, 2006.

Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.


Persi Diaconis and R. L. Graham. Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society, 39(2):262–268, 1977.

Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605–1641, 2010.

Ran El-Yaniv and Yair Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13:255–279, 2012.

Brian Eriksson, Gautam Dasarathy, Aarti Singh, and Robert D. Nowak. Active clustering: Robust and efficient hierarchical clustering using adaptively selected similarities. Journal of Machine Learning Research - Proceedings Track, 15:260–268, 2011.

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.

Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Kernel query by committee (KQBC). Technical Report 2003-88, Leibniz Center, the Hebrew University, 2003.

Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Query by committee made real. In Proceedings of the 19th Conference on Neural Information Processing Systems (NIPS), 2005.

Ioannis Giotis and Venkatesan Guruswami. Correlation clustering with a fixed number of clusters. Theory of Computing, 2(1):249–266, 2006.

Sally A. Goldman and Michael J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, February 1995.

Yuhong Guo and Russ Greiner. Optimistic active-learning using mutual information. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.

Shirley Halevy and Eyal Kushilevitz. Distribution-free property-testing. SIAM J. Comput., 37(4):1107–1138, 2007.

Steve Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), 2007a.

Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML), 2007b.

Steve Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.


Steve Hanneke. Rates of convergence in active learning. Annals of Statistics, 39(1):333–361, 2011.

Steve Hanneke and Liu Yang. Negative results for active learning with convex losses. Journal of Machine Learning Research - Proceedings Track, 9:321–325, 2010.

Sariel Har-Peled. Geometric Approximation Algorithms. Mathematical Surveys and Monographs. American Mathematical Society, 2011.

Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2007.

David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, September 1992.

Tibor Hegedus. Generalized teaching dimensions and the query complexity of learning. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 108–117, 1995.

Lisa Hellerstein, Krishnan Pillaipakkamnatt, Vijay Raghavan, and Dawn Wilkins. How many queries are needed to learn? Journal of the ACM, 43(5):840–862, 1996.

Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, chapter 7, pages 115–132. The MIT Press, 2000.

Mark Herbster. Learning additive models online with fast evaluating kernels. In Proceedings of the 14th Annual Conference on Computational Learning Theory, 2001.

Mark Herbster, Massimiliano Pontil, and Lisa Wainer. Online learning over graphs. In ICML, 2005.

Kevin G. Jamieson and Rob Nowak. Active ranking using pairwise comparisons. In NIPS 24, pages 2240–2248, 2011.

Thorsten Joachims. Optimizing search engines using clickthrough data. In KDD, 2002.

Thorsten Joachims. Transductive learning via spectral graph partitioning. In ICML, 2003.


Matti Kaariainen. Active learning in the non-realizable case. In ALT, pages 63–77, 2006.

Nikos Karampatziakis and John Langford. Online importance weight aware updates. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 392–399, 2011.

Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, STOC '07, pages 95–103, 2007.

Dan Klein, Sepandar D. Kamvar, and Christopher D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In ICML, pages 307–314, 2002.

Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457–2485, 2010.

B. Kulis, S. Basu, I. Dhillon, and R. J. Mooney. Semi-supervised graph clustering: A kernel approach. In ICML, 2005.

Yi Li, Philip M. Long, and Aravind Srinivasan. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences, 62, 2001.

Michael Lindenbaum, Shaul Markovitch, and Dmitry Rusakov. Selective sampling for nearest neighbor classifiers. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 366–371, 1999.

Michael Lindenbaum, Shaul Markovitch, and Dmitry Rusakov. Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2):125–152, 2004.

Gaelle Loosli and Stephane Canu. Comments on the "Core vector machines: Fast SVM training on very large data sets". Journal of Machine Learning Research, 8, 2007.

David MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1992.

Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. Annals of Statistics, 27:1808–1829, 1999.

Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculte des Sciences de Toulouse, IX:245–303, 2000.

Andrew K. McCallum and Kamal Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of ICML-98, 15th International Conference on Machine Learning, pages 350–358, 1998.

Prem Melville and Raymond Mooney. Diverse ensembles for active learning. In ICML, 2004.

Prem Melville, Maytal Saar-Tsechansky, Foster Provost, and Raymond Mooney. Economical active feature-value acquisition through expected utility estimation. In KDD-05 Workshop on Utility-Based Data Mining, 2005.

Stanislav Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 13:67–90, 2012.

Michael Mitzenmacher and Eli Upfal. Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.

Andrew Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.

Hieu T. Nguyen and Arnold Smeulders. Active learning using pre-clustering. In ICML, 2004.

Francesco Orabona and Nicolo Cesa-Bianchi. Better algorithms for selective sampling. In ICML, pages 433–440, 2011.

Thomas Osugi, Deng Kun, and Stephen Scott. Balancing exploration and exploitation: A new algorithm for active machine learning. In ICDM, 2005.

Kira Radinsky and Nir Ailon. Ranking from pairs and triplets: information quality, evaluation methods and query complexity. In WSDM, pages 105–114, 2011.

Nicholas Roy and Andrew McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, 2001.

Greg Schohn and David Cohn. Less is more: Active learning with support vector machines. In ICML, 2000.

Burr Settles. Active Learning (Synthesis Lectures on Artificial Intelligence and Machine Learning). Morgan and Claypool, 2012.

H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Workshop on Computational Learning Theory, 1992.

Ohad Shamir and Naftali Tishby. Spectral clustering on a budget. Journal of Machine Learning Research - Proceedings Track, 15:661–669, 2011.

Ron Shamir, Roded Sharan, and Dekel Tsur. Cluster graph modification problems. Discrete Applied Math, 144:173–182, November 2004.

Neil Stewart, Gordon D. A. Brown, and Nick Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881–911, 2005.

Masashi Sugiyama. Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7:141–166, 2006.

Louis Leon Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, July 1927.

Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, 2001.

Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.

Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135–166, 2004.

Vladimir Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

Konstantin Voevodski, Maria-Florina Balcan, Heiko Roglin, Shang-Hua Teng, and Yu Xia. Active clustering of biological sequences. Journal of Machine Learning Research, 13:203–225, 2012.

Ulrike von Luxburg. A tutorial on spectral clustering. Technical Report TR-149, Max Planck Institute for Biological Cybernetics, 2007.

Liwei Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12:2269–2292, 2011.

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505–512. MIT Press, 2002.

Zhao Xu, Kai Yu, Volker Tresp, Xiaowei Xu, and Jizhi Wang. Representative sampling for text classification using support vector machines. In ECIR, 2003.

Liu Yang, Steve Hanneke, and Jaime G. Carbonell. Bayesian active learning using arbitrary binary valued queries. In ALT, pages 50–58, 2010.

Liu Yang, Steve Hanneke, and Jaime G. Carbonell. The sample complexity of self-verifying Bayesian active learning. Journal of Machine Learning Research - Proceedings Track, 15:816–822, 2011.

Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In ICML, 2006.

Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In NIPS, 2004.

Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Scholkopf. Learning with local and global consistency. In NIPS, 2004.

Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report TR-1530, University of Wisconsin-Madison, 2006.

Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003a.

Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003 workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003b.

Research on Theoretical and Practical Issues of Active Learning

Ron Begleiter

Research on Theoretical and Practical Issues of Active Learning

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Ron Begleiter

Submitted to the Senate of the Technion — Israel Institute of Technology

Adar 5773, Haifa, March 2013

The research was carried out under the supervision of Dr. Nir Ailon and Associate Prof. Ran El-Yaniv in the Faculty of Computer Science.

Acknowledgments

I thank the Technion for the generous financial support of my studies.

Abstract

Machine learning is a field of computer science concerned with the construction, characterization, and study of methods for learning from collections of data. In an era in which information is freely accessible, and in which the capacity to store and process large quantities of data continually improves, new and diverse applications of the field arise constantly. Examples include systems that identify disease from DNA sequences, autonomously driving vehicles, ranking of search-engine results, word completion in text messages (SMS), detection of suspicious behavior in financial systems, image retrieval from large image databases, and automatic translation of texts from one language to another.

One of the central questions in machine learning concerns the notion of generalization: the process of forming general rules from a collection of individual instances. The classical learning model addressing this problem is the supervised learning model. The goal in this model is, given a collection of instances together with their classifications, to choose, from a restricted set of classification rules, a rule that describes as faithfully as possible the classification of future instances. For example, given a collection of emails classified as spam or legitimate, choose a rule that can be applied to every new email message to determine whether it is legitimate. The entity choosing the rules is called the "learner"; the instances are called "examples" and are usually represented by the values of a collection of measurements called "features"; the classification of an example is called its "label"; a classification rule is called a "hypothesis"; and the set of available rules is called the "hypothesis class".

In many practical problems, the process by which the true label of an example is determined is expensive. In the spam-filtering example, the true labels are determined by human annotators, such as users of the email service, and thus demand their time and attention for a matter marginal to their immediate needs (reading or sending messages). In many cases labels are purchased for real money, and sometimes their cost derives from the time of a domain expert. This need led to the search for variants of the supervised learning model that can produce a good classification rule while economizing on the number of labeled examples. The main such models are semi-supervised learning, transductive learning, and active learning. In the first two models the learner receives, in addition to the labeled examples, access to unlabeled examples, which it can exploit to estimate the characteristics of the future examples it will be required to label. In the active learning model the learner is allowed a "dialogue" with the true labeling process, in which it may acquire labels itself and can therefore choose an optimal (and hence small) set of labeled examples.

This thesis addresses theoretical and practical aspects of the active learning model. The active learner is given control over the choice of the training examples from which it infers its generalization rule. A solution in the active learning model is measured by its "sample complexity": an upper bound on the number of labels the learner needs in order to infer a hypothesis whose classification error, relative to the true labeling, is below a small prescribed value.

The active learning model has two main variants: the online (stream-based) variant and the pool-based variant. In the online variant, the learner examines a sequence of individual examples; at each step it inspects a single example and decides whether to request its true label or to ignore it. In the pool-based variant, the active learner is given access to a pool of unlabeled examples (representing the "nature" of the examples in the problem), and at each step it may reveal the label of any one of them. In this work we consider the pool-based variant only, but we emphasize that our results and ideas can be carried over to the online variant as well.
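The pool-based protocol just described can be sketched as a simple loop. The following Python sketch is purely illustrative: the toy threshold learner, the confidence-based query strategy, and the synthetic 1-D data are hypothetical stand-ins, not the algorithm developed in this thesis.

```python
# Illustrative sketch of the pool-based active learning protocol.
# The learner, query strategy, and data are hypothetical stand-ins.

def pool_based_active_learning(pool, oracle, fit, confidence, budget):
    """Repeatedly reveal the label of the least-confident pool point
    and refit the hypothesis."""
    # Seed with the two extreme points so both labels are represented.
    labeled = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
    unlabeled = list(pool[1:-1])
    hypothesis = fit(labeled)
    for _ in range(budget):
        x = min(unlabeled, key=lambda p: confidence(hypothesis, p))
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))      # one label query
        hypothesis = fit(labeled)
    return hypothesis

# Toy instantiation: hypotheses are 1-D thresholds; the fitted threshold
# sits midway between the largest observed negative and the smallest
# observed positive example.
def fit(labeled):
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (max(neg) + min(pos)) / 2 if pos and neg else 0.5

def confidence(t, x):
    return abs(x - t)                       # far from the boundary = confident

pool = [i / 100 for i in range(101)]
oracle = lambda x: 1 if x >= 0.37 else 0    # hidden true labeling
h = pool_based_active_learning(pool, oracle, fit, confidence, budget=12)
```

With only 14 revealed labels (2 seeds plus 12 queries), the returned threshold lands between the closest observed negative and positive points near the true boundary, whereas passive learning would spend most of its labels far from it.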

The study of active learning began as early as the nineties, but it has gained significant momentum, and a solid theoretical understanding, over the last decade. Initially the model was examined under the assumption that the concept class contains the true classification rule. This is a simplifying assumption, equivalent to assuming that the problem is free of "noise"; indeed, under this assumption the model is called the "realizable" case, since a perfect classification rule can be attained. Here active learning amounts to a search in which each revealed label removes from the search space all hypotheses that disagree with that label. As more labels are revealed, the search space shrinks until we are left with the hypotheses that have made no errors so far, a set that always contains the target classification. This search analogy goes back to the classical algorithm of Cohn, Atlas, and Ladner [1994] and to the groundbreaking theoretical results of Dasgupta [2005]. Dasgupta introduced the first complexity measure for the active learning model, with which he analyzed the sample complexity required by an active search algorithm. He also exhibited a general case in which active learning yields no advantage over passive supervised learning.
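The shrinking search space can be made concrete with a toy version-space sketch in Python (a generic CAL-style illustration under simplifying assumptions: a finite class of 1-D thresholds and a noise-free oracle inside the class; this is not the algorithm developed in this thesis). Each revealed label eliminates every hypothesis that disagrees with it.

```python
# Version-space view of realizable active learning (toy sketch):
# hypotheses are 1-D thresholds; each revealed label eliminates every
# hypothesis that disagrees with it.

def predict(t, x):
    return 1 if x >= t else 0

thresholds = [i / 10 for i in range(11)]   # finite hypothesis class
version_space = list(thresholds)
oracle = lambda x: 1 if x >= 0.4 else 0    # true rule (inside the class)
grid = [i / 20 for i in range(21)]         # unlabeled pool

queries = 0
while len(version_space) > 1:
    # Query a point on which surviving hypotheses still disagree
    # (a point inside the region of disagreement).
    x = next(p for p in grid
             if len({predict(t, p) for t in version_space}) > 1)
    y = oracle(x)                          # reveal one true label
    version_space = [t for t in version_space if predict(t, x) == y]
    queries += 1
# The loop halts with only the target rule left in the version space.
```

Only points inside the region of disagreement are ever queried; points on which all surviving hypotheses agree are skipped, which is exactly the label savings the search analogy promises.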

Analyzing the problem under the realizable-case assumption is important, but it is not practical. First, instances of real-world problems are noisy, for example because their descriptions depend on measurements. Second, the choice of the hypothesis class derives directly from the learning algorithm (for example, choosing the SVM algorithm fixes the hypothesis class to be linear separators over a finite-dimensional geometric space). Such a choice is an engineering one and is therefore typically constrained, for example by the need for computational efficiency.

A more practical assumption takes into account the fact that the hypothesis class is limited, so that even the best rule in it errs relative to the true classification. We call this the agnostic-case assumption. We note that Dasgupta's results do not hold under this model, and the community therefore needed new ideas. The complexity measure known as the disagreement coefficient [Hanneke, 2007] is today the accepted way to explain the difficulty of a given active learning problem, and it is used both to analyze and to design suitable learning algorithms. The disagreement coefficient estimates, for the set of hypotheses forming a ball of fixed radius around the best hypothesis in the class, the probability mass of examples on which the hypotheses in the ball do not agree on a label. The measure captures the difficulty of choosing the right label query as the search approaches the neighborhood of the correct solution.
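In the standard notation of Hanneke's analysis, the quantities just described can be written as follows (a sketch of the commonly used definitions, where $\varepsilon$ denotes the target error scale and $h^*$ the best-in-class hypothesis):

```latex
% Region of disagreement of a set V of hypotheses:
\mathrm{DIS}(V) \;=\; \{\, x \;:\; \exists\, h_1, h_2 \in V,\ h_1(x) \neq h_2(x) \,\}

% Ball of radius r around the best-in-class hypothesis h^*:
B(h^*, r) \;=\; \{\, h \in \mathcal{H} \;:\; \Pr_x\!\left[h(x) \neq h^*(x)\right] \le r \,\}

% The disagreement coefficient bounds the probability mass of the
% disagreement region relative to the radius of the ball:
\theta \;=\; \sup_{r > \varepsilon} \frac{\Pr_x\!\left[\mathrm{DIS}\!\left(B(h^*, r)\right)\right]}{r}
```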

In this thesis we present a new approach to pool-based active learning under the agnostic-case assumption. At the core of our approach is a smoothness condition on the differences between error estimates of an arbitrary hypothesis from the class and those of a fixed pivotal hypothesis serving as a point of comparison. We show that there exists an active learning algorithm, based on such error estimates, that achieves fast learning rates. The algorithm we present is generic: its implementation depends on a particular definition of an error estimator and on a computational solution to the problem of finding the concept that minimizes the estimate (with high probability).

In this thesis we define three such estimators. The first is an estimator suited to general active learning problems. We show that this estimator guarantees a sample complexity at least as good as the best known guarantees; however, it does not improve upon them. In contrast, the subsequent estimators we present, which are defined for specific problems, do improve upon the best known guarantees. Moreover, we show that for these problems any known active learning algorithm that relies solely on Vapnik-Chervonenkis bounds and on the disagreement coefficient will fail to achieve a meaningful sample complexity (when the problem's noise is low). (We emphasize that these are the most basic assumptions for such algorithms.) This means that our approach departs from the known approaches and from the ideas underlying the complexity measures mentioned above.

The specific problems we study are learning orderings from pairwise preferences, and clustering according to pairwise relations. Both problems originate in combinatorial optimization, and both attract growing interest in the machine learning community. In the first problem we are given a finite set of objects and (active) access to labels encoding preferences over pairs of objects; the goal is to learn an ordering of the objects that agrees with the preferences. The second problem also concerns a finite set of objects and labels over pairs of objects; here the goal is to partition the set of objects into (disjoint) subsets, where the labels indicate whether a pair of objects must reside together in the same subset or must not reside together. In both problems the labels are in effect constraints, and the assumption is that no solution exists that violates none of them (an assumption fitting the agnostic case).

Beyond presenting algorithms and their theoretical analysis, this work also touches on the applied side. For the ordering problem we provide, through an empirical experiment, a proof of concept for the algorithm we defined; for this experiment we created a first-of-its-kind dataset based on human preferences, and the results corroborate our theoretical findings. In addition, we present a general active learning method, defined in the spirit of our approach, and demonstrate empirically that it beats the known approaches on a large collection of standard benchmark datasets. The heuristic method we present takes an active learning algorithm as input and adds to it a stage that ensures the label queries it asks are "smooth". In this sense the method is general and may improve any active learning algorithm.

The main results of this thesis are:

1. We present a new approach to active learning based on a smoothness condition over relative error estimates, accompanied by a generic algorithm that uses such estimates.

2. We present an implementation of such error estimators based solely on Vapnik-Chervonenkis bounds and on the disagreement coefficient. The result is a generic active learning algorithm operating under minimal assumptions. We show that the resulting algorithm attains the best known sample complexity bounds (under these assumptions).

3. For the problem of learning orderings from pairwise preferences, we present a construction of an estimator and show that plugging it into the active learning algorithm of our approach improves the best known sample complexity bounds. We show that any known active learning algorithm operating under the same assumptions will fail to achieve a reasonable result for this problem (assuming the problem's noise is low). In addition, we carry out an applied proof of concept of our idea on unique data that we created, demonstrating the potential of our method.

4. We study the problem of partitioning objects according to pairwise relations. We show that any known active learning algorithm relying on the general assumptions above can achieve only trivial sample complexity bounds (assuming the problem's noise is low). In contrast, an estimator that we define specifically for this problem, combined with the generic algorithm of our approach, yields an active learning algorithm that achieves meaningful sample complexity bounds, for the first time for this problem.

5. Beyond the theory-based approaches, we present a heuristic method in the spirit of our approach. The method wraps an arbitrary active learner and adds a component that ensures the chosen label queries cover the example space. A comprehensive empirical evaluation of the method against the best known methods demonstrates that our idea indeed improves, in most cases, upon the best known performance, thereby lending further support to our ideas.
