
Information Processing Letters 73 (2000) 11–16

PACS, simple-PAC and query learning✩

Jorge Castro 1, David Guijarro ∗
Department Llenguatges i Sistemes Informàtics, UPC, Campus Nord, Mòdul C5, Jordi Girona Salgado, 1-3, 08034 Barcelona, Spain

Received 11 February 1999; received in revised form 25 November 1999
Communicated by L.A. Hemaspaandra

Abstract

We study a distribution dependent form of PAC learning that uses probability distributions related to Kolmogorov complexity: the PACS model. We relate the query model and the simple-PAC model with PACS. Using these relationships and previous results in those models we are able to get alternative proofs to all known results in the PACS model in a systematic way. © 2000 Published by Elsevier Science B.V. All rights reserved.

Keywords: Theory of computation; PAC learning; Query learning; Kolmogorov complexity

1. Introduction

One of the most relevant models of learning is “PAC learning”, introduced by Valiant [13], which has been widely used to investigate the phenomenon of learning from examples. Informally, in this model one has to learn a target concept with high probability in polynomial time (and, a fortiori, from a polynomial number of examples), within a certain error, and under all probability distributions on the examples. Because of this last requirement, to learn under all distributions, PAC learning is also called distribution-free, or distribution-independent learning.

Distribution-independent learning is a strong re-quirement. Many concept classes are not known to bepolynomially learnable, or known not to be polynomi-ally learnable if RP6= NP, although some such con-

✩ Supported by the Esprit Long Term Research Project ALCOMIT (nr. 20244), the EC Working Group NeuroCOLT2 – EP27150,and the Spanish DGICYT (project KOALA BP95-0797).∗ Corresponding author. Email: [email protected] Email: [email protected].

cept classes are polynomially learnable under somefixed distributions [8,12].

With the aim of defining learning models where more concept classes were learnable, some researchers have proposed versions of PAC where the requirements are relaxed. In some of these proposals learning is only required under a finite set of distributions, and frequently this set is reduced to the uniform distribution. For instance, in [8] the class µDNF is shown to be learnable under the uniform distribution, and in [12] it is shown to be not learnable in the distribution-independent setting unless RP = NP. In other cases the power of the learning algorithm is enhanced with the ability to ask queries, as can be seen with monotone DNF in [13] and with DFAs in [1]. There have also been cases where both relaxations have been combined, see for instance [7].

In this line of research, Li and Vitányi proposed in [9] the simple-PAC learning model that, roughly speaking, replaces the condition of learning under all distributions of Valiant's original model by the request of learning under all simple distributions, provided the sample is given according to the “universal” distribution. This distribution assigns high probabilities to low Kolmogorov complexity examples. With the new model, Li and Vitányi [9] developed a theory of learning for simple concepts—concepts with low Kolmogorov complexity—that intuitively should be polynomially learnable. In fact, they showed several examples that strengthen this intuition.

Recently, Denis et al. in [5] have restricted the simple-PAC model to the case where a benign teacher might choose examples based on the knowledge of the target concept. In this framework, called the PACS model, examples with short descriptions with respect to the target concept have a high probability of being drawn. Note that in this model the distribution depends on the target concept. It has been shown that DNF [5] and DFAs [11] are learnable in the PACS model.

This note shows that, although PACS seems a powerful learning model, all its known positive results are derivable from known positive results in seemingly weaker models in a systematic way. By systematic way we mean that we will describe two reduction methods that produce all known positive results in the PACS model.

Our main results show, first, that any class learnable via queries, for any set of reasonable queries, is also PACS learnable. This implies, for instance, that DFAs are PACS learnable, providing an alternative proof to [11]. And second, that the remaining known results in PACS [5] are derivable from previously known algorithms in the simple-PAC model.

Section 2 is devoted to definitions and previous results, Section 3 presents the relation between query learning and PACS, and Section 4 relates PACS and simple-PAC.

2. Preliminaries

For definitions on learning from examples, PAC/query learning and representation classes we refer the reader to [2,13,14]. We use Σ for the alphabet that codifies examples and l(r) to denote, in a fixed representation class, the length of the shortest representation for a concept r in that class. The symmetric difference of sets A and B is denoted by A △ B.

A probability distribution P is called enumerable if there exists a recursive function g(x, k), nondecreasing in k, with P(x) = lim_{k→∞} g(x, k). The existence of universal distributions in the class of enumerable distributions can be proved. That means that there are enumerable distributions that multiplicatively dominate each enumerable distribution. It can be shown that one of the universal enumerable distributions, denoted by m, has the following properties:

∑_x m(x) < 1,

and

m(x) = 2^{−K(x)+O(1)},

where K(x) denotes the Kolmogorov complexity and the equality is known as the Coding Theorem. The universal distribution has many important properties. Under m, easily describable objects have high probability, and complex objects have low probability.
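As a quick illustration of this last remark (our own example, not taken from the paper), consider binary strings of length n. The string 0^n can be described by a short program that only needs n, so

K(0^n) ≤ 2 log n + O(1),  and hence  m(0^n) ≥ 2^{−2 log n − O(1)} = Ω(1/n²),

whereas for every n there are, by counting, strings x of length n with K(x) ≥ n, and for such strings m(x) ≤ 2^{−n+O(1)}.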

These results can be extended using the conditional Kolmogorov complexity K(x|y) instead of K(x) to define the universal conditional enumerable distribution m(x|y), which verifies

∀y ∈ Σ∗,  ∑_x m(x|y) < 1,

and

∀x, y ∈ Σ∗,  m(x|y) = 2^{−K(x|y)+O(1)},

where the equality is known as the Conditional Coding Theorem (see [10]).

Definition 1. A distribution D is simple iff it is multiplicatively dominated by the universal distribution. That is, there exists a constant c such that for all x,

c m(x) ≥ D(x).

It can be shown that simple distributions properly include enumerable ones and that there is a distribution which is not simple. The following theorem relates learning under simple distributions with learning under the universal distribution m.

Theorem 2 [9]. A concept class C is PAC learnable under the universal distribution m iff it is PAC learnable under all simple distributions, provided that in the learning phase the set of examples is drawn according to m.


In [9] it is shown how to exploit this completeness theorem to obtain new learning algorithms for DNF with simple terms and for simple reversible languages.

Recently, Denis et al. [5] introduced the PACS model where the learning system might be aided by a benign teacher who knows the target concept and uses this knowledge in selecting the examples. Under this model, examples with low conditional Kolmogorov complexity have high probability. Formally, the probability of drawing an example x for a target concept with representation r is given as

m_r(x) = µ_r 2^{−K(x|r)},

where µ_r satisfies

µ_r ∑_x 2^{−K(x|r)} = 1.

By Kraft's inequality it holds that µ_r ≥ 1 (see [10]). Note that this model is not a fixed-distribution version of PAC because the distribution depends on the target concept. Concept classes such as poly-term DNF and k-reversible DFA were shown to be learnable under the PACS model [5]. In a later work [11] the learnability of DFAs was also shown.
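To spell out the appeal to Kraft's inequality (our reading of the standard argument, not stated explicitly in the paper): K(·|r) is a prefix complexity, so the shortest programs form a prefix-free set and

∑_x 2^{−K(x|r)} ≤ 1,  whence  µ_r = (∑_x 2^{−K(x|r)})^{−1} ≥ 1.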

Note that m_r and m_λ, where λ denotes the empty word, are similar to the normalized counterparts of the enumerable distributions m(·|r) and m. It is easy to see that all m_r are simple and universal for the class of simple distributions, but none of the distributions m_r can be enumerable (see Chapter 4 of [10]). In this paper we use m_r and m_λ as universal distributions instead of m(·|r) and m (as Li and Vitányi proposed) in order to follow the choices in [5,11]. All the results below can also be shown if m and m(·|r) are the choices considered, and some of them can be obtained in an easier way. We note that Theorem 2 above remains true when m is replaced by m_λ. Now, we define simple-PAC.

Definition 3. A concept class C is simple-PAC learnable iff it is PAC learnable under m_λ.

We use the query learning model defined by Angluin in [2]. In this model the learner has to find an exact representation of the target concept asking queries to a teacher that knows it. The most used queries are membership queries and equivalence queries. A membership query is an example and the answer is YES if the example belongs to the concept and NO otherwise.

An equivalence query consists of a representation of a concept; the answer is YES if that concept is equal to the target concept, and a counterexample in the symmetric difference otherwise. The learner has to output an equivalent representation in polynomial time no matter which strategy is used by the teacher in answering the queries.

3. From query learning to PACS learning

We show below that any class exactly learnable by an algorithm that uses queries is also learnable in the PACS model. We prove the result first for membership and equivalence queries and discuss afterwards to what kinds of queries our proof technique extends.

We restrict our attention to query learning in the presence of bounded teachers. These teachers provide counterexamples up to a given length n, or reply affirmatively if there is no counterexample of length at most n; see [14] for a complete discussion. This is done for technical convenience only.

Let R be a representation class of a concept class C, and let L be an exact learning algorithm for R that uses membership and equivalence queries. So, for any r ∈ R, any positive integer n, and any bounded teacher T, algorithm L asks at most a polynomial number of queries p(l(r), n) and outputs a representation L_T(r) ∈ R such that L_T(r) is equivalent to r on Σ^{≤n}. In the following we fix a teacher T_min; this teacher answers all the equivalence queries with the lexicographically smallest counterexample (if there exists any at all).

Lemma 4. For a fixed target concept r and the teacher T_min, all membership queries and all counterexamples in the computation of L have complexity K(·|r) bounded by O(log(nl(r))).

Proof. Let A_r be a fixed algorithm that, given any hypothesis h, finds the lexicographically smallest example in h △ r whenever h △ r ≠ ∅. We enumerate all the membership queries and counterexamples in the computation of L with teacher T_min. By hypothesis, this enumeration ranges up to at most p(l(r), n). So, any membership or counterexample word q can be described by means of its number in the list of queries, any representation of the target concept r, and the algorithms L and A_r. Therefore, the Kolmogorov complexity of q conditional to r, K(q|r), is bounded by log p(l(r), n) + K(r|r) + |L| + K(A_r|r) + O(1). Since K(r|r), |L|, and K(A_r|r) are constants, we obtain the bound of the lemma. □

Let S_d^r(n) be the set {x ∈ Σ^{≤n} : m_r(x) ≥ 1/n^d}. Note that S_d^r(n) includes all the words x ∈ Σ^{≤n} such that K(x|r) ≤ d log n. The following standard lemma shows that, drawing a sample of polynomial size according to m_r, we get, with high probability, all examples of low conditional Kolmogorov complexity (see [9]).

Lemma 5. Drawing according to m_r a sample S of size n^d (d ln n + ln 1/δ), with probability greater than 1 − δ, S includes all the examples in S_d^r(n).
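As a reminder, the standard argument behind Lemma 5 runs as follows (our own sketch of the calculation, not spelled out in the paper). Since every x ∈ S_d^r(n) has m_r(x) ≥ 1/n^d and the probabilities sum to 1, the set S_d^r(n) has at most n^d elements. For a fixed x ∈ S_d^r(n), the probability of missing it in |S| independent draws is at most

(1 − n^{−d})^{|S|} ≤ e^{−|S|/n^d},

so by the union bound the probability of missing some element of S_d^r(n) is at most n^d e^{−|S|/n^d}, which is at most δ as soon as |S| ≥ n^d (d ln n + ln 1/δ).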

Finally, we show the main result.

Theorem 6. Let R be a representation class learnable by membership and equivalence queries; then R is also learnable in the PACS model.

Proof. This proof follows similar steps to the proof of Theorem 2 in [6].

Let r ∈ R be the target concept and let L be a membership and equivalence queries algorithm for R. Let d be the constant hidden in the O-notation of Lemma 4. Note that Lemma 4 says that S_d^r(nl(r)) ∩ Σ^{≤n} contains all smallest counterexamples and membership queries that appear in the interaction between L and T_min. Lemma 5 guarantees that a sample S drawn according to m_r of size (nl(r))^d (d ln(nl(r)) + ln 1/δ) contains S_d^r(nl(r)) with probability at least 1 − δ. Now, we simulate the computation of L, answering its membership queries according to their labels in S, and solving its equivalence queries by finding the lexicographically smallest counterexample in S. The simulation ends when no counterexample is found to an equivalence query or a membership query cannot be answered. So, with high probability the simulation outputs a hypothesis equivalent to the target. □
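The simulation in this proof is purely mechanical, so a small sketch may help to see it. The following Python fragment is our own illustration, not part of the paper; the interface of the query learner (next_query/receive_answer) and the use of length-lexicographic order for the smallest counterexample are assumptions made only for the sketch.

def run_pacs_simulation(learner, sample):
    # learner: hypothetical object exposing next_query() -> (kind, payload)
    # and receive_answer(answer); sample: dict mapping example strings to
    # boolean labels, assumed to contain S_d^r(nl(r)) with high probability.
    while True:
        kind, payload = learner.next_query()
        if kind == "membership":
            if payload not in sample:
                return None  # low-probability failure: query falls outside the sample
            learner.receive_answer(sample[payload])
        elif kind == "equivalence":
            hypothesis = payload  # a predicate: example string -> bool
            # counterexamples present in the sample, smallest (shortlex) first
            counterexamples = sorted(
                (x for x, label in sample.items() if hypothesis(x) != label),
                key=lambda x: (len(x), x))
            if not counterexamples:
                return hypothesis  # no counterexample in S: output the hypothesis
            learner.receive_answer(counterexamples[0])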

Observe that the PACS algorithm obtained with the proof of Theorem 6 is a probably exact learner, i.e., it gets with high probability an exact identification of the target.

It is also important to note that we do not use the full power of query learnability in Theorem 6. First, we only need the existence of a polynomial time learner when T_min is the teacher, i.e., a learner that may not work with other teachers. And second, we can also apply the same proof to another teacher T and type of query Q if there is a polynomial time algorithm that decides the answer given by T to an input of Q from that input and a sample that contains the answer that T would give. So Theorem 6 can be generalized to learning algorithms that work with teachers and queries that satisfy the previous condition. We do not formalize this issue to keep the paper readable and because it is straightforward. Moreover, our proof of Theorem 6 implies that any PAC algorithm with queries can be transformed into a PACS algorithm.

Also, from Theorem 6 it immediately follows that the class of DFAs, which Angluin showed learnable by membership and equivalence queries in [1], is learnable in the PACS model. This result was shown in a recent paper [11] by an argument specific to DFAs.

4. PACS versus simple-PAC

In this section we study the relationships between the PACS and simple-PAC learning models. Assuming some properties of the learning algorithms, we can show that both models are strongly related. Moreover, these properties are satisfied by all known learning algorithms in these models.

We start by relating the distributions m_λ and m_r. Let R be a representation class and r ∈ R be a concept.

Lemma 7. There exist constants k_1 ≠ 0 and k_2, independent of x and r, such that:
(1) m_r(x) 2^{−K(r)} k_1 ≤ m_λ(x).
(2) m_r(x) 2^{K(r)} k_2 ≥ m_λ(x).
(3) Let S be a sample of size

(2^{K(r)} n^d / k_1)(d ln n + ln 1/δ)

drawn according to m_λ. The probability of the event S ⊇ S_d^r(n) is at least 1 − δ.

Proof. We use the following well known inequalities that relate K(x) and K(x|r) (see [10]):
(a) K(x|r) ≤ K(x) + O(1).
(b) K(x) ≤ K(x|r) + K(r) + O(1).

To prove (1) we use (b) and get

µ_λ µ_r 2^{−K(x)} ≥ µ_λ µ_r 2^{−K(x|r)−K(r)−O(1)},

where µ_λ and µ_r are the constants such that m_λ(x) = µ_λ 2^{−K(x)} and m_r(x) = µ_r 2^{−K(x|r)}. Rewriting that last inequality in terms of probabilities we obtain

µ_r m_λ(x) ≥ µ_λ m_r(x) 2^{−K(r)−O(1)}.

To finish the proof it is enough to show that µ_λ/µ_r ≥ c, for some constant c. To see this, note that, using inequality (a),

1 = µ_λ ∑_x 2^{−K(x)} = µ_r ∑_x 2^{−K(x|r)} ≥ µ_r 2^{−O(1)} ∑_x 2^{−K(x)},

which shows that µ_λ/µ_r is bounded below by a constant greater than zero.

To prove (2) we use (a) and get

µ_λ µ_r 2^{−K(x|r)} ≥ µ_λ µ_r 2^{−K(x)−O(1)};

therefore,

µ_λ m_r(x) ≥ µ_r m_λ(x) 2^{−O(1)}.

Now it is enough to show that µ_λ/µ_r is O(2^{K(r)}), which can be done using inequality (b) as follows:

1 = µ_r ∑_x 2^{−K(x|r)} = µ_λ ∑_x 2^{−K(x)} ≥ µ_λ 2^{−O(1)−K(r)} ∑_x 2^{−K(x|r)},

which shows that µ_λ/µ_r is O(2^{K(r)}).

To show (3), note that (1) implies that the m_λ-probability of a word x in S_d^r(n) is at least k_1 2^{−K(r)} n^{−d}. Now, using a standard analysis as in Lemma 5, the statement follows. □

Let R be a representation class that is PACS learnable, and A be a learning-from-examples algorithm for R. We say that A is robust if A outputs good approximation hypotheses whenever it receives a sample that contains all the high-probability words. Formally, algorithm A is robust if there exist a constant d and a polynomial p such that for any r ∈ R, given a sample S of size at least p(1/ε, n, l(r)), if S ⊇ S_d^r(n) then A with input S outputs a hypothesis h such that m_r(h △ r) < ε. As far as we know, all PACS learning algorithms in the literature are robust (see [5,11]). Furthermore, all PACS algorithms that can be obtained by applying Theorem 6 are also robust.

For a representation class R we define simple(R) as the set {r ∈ R : K(r) ≤ c log(l(r))} for an arbitrary c that we fix for the rest of the paper, as Li and Vitányi do in [9]. We note that this definition of simple(R) is specific to this paper and more restrictive than the definitions of simple concepts used in [9,3].

Theorem 8. Let R be a PACS learnable representation class that has a robust learning algorithm. Then simple(R) is simple-PAC learnable.

Proof. Let A be a robust PACS learning algorithm for R, i.e., for some constant d and polynomial p, given a sample S of size p(1/ε′, n, l(r)) such that S ⊇ S_d^r(n), A outputs a hypothesis h such that m_r(h △ r) < ε′.

Let r be a concept with K(r) ≤ c log(l(r)). According to the last statement of Lemma 7, given an m_λ sample S′ of size

max( p(1/ε′, n, l(r)), (l(r)^c n^d / k_1)(d ln n + ln 1/δ) ),

with probability greater than 1 − δ, S′ includes all words in S_d^r(n). Therefore, A outputs a hypothesis h such that m_r(h △ r) < ε′ with m_λ probability greater than 1 − δ. Using statement (2) of Lemma 7,

m_λ(h △ r) ≤ m_r(h △ r) l(r)^c k_2 < ε′ l(r)^c k_2,

which shows that an appropriate choice of ε′, namely ε′ such that ε′ l(r)^c k_2 < ε, suffices for the result. □
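For concreteness, one admissible instantiation (our own, not stated in the paper) is ε′ = ε/(2 k_2 l(r)^c), which gives ε′ l(r)^c k_2 = ε/2 < ε and turns the sample size above into

max( p(2 k_2 l(r)^c / ε, n, l(r)), (l(r)^c n^d / k_1)(d ln n + ln 1/δ) ).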

An interesting consequence of Theorems 6 and 8 is that the simple concepts of a query learnable concept class are simple-PAC learnable in the sense of Li and Vitányi. Formally,

Corollary 9. If R is learnable by an algorithm that uses membership and equivalence queries, then simple(R) is simple-PAC learnable.

Now we discuss the converse of Theorem 8: under what conditions does simple-PAC learnability of simple(R) imply PACS learnability of R? First we note that, once r ∈ R is fixed, we can pass from K(·) to K(·|r) and vice versa simply by changing the universal Turing machine used as reference to define Kolmogorov complexity. Therefore, if there is a simple-PAC algorithm for the class simple(R) that works independently of the reference universal machine, the same algorithm proves that the whole class R is PACS learnable. This is true because, if we take K(·|r) as the reference to define Kolmogorov complexity, then r ∈ simple(R) and the simple-PAC algorithm will work. As before, all known simple-PAC algorithms are independent of the reference universal Turing machine. The previous comments prove the following theorem:

Theorem 10. If simple(R) is simple-PAC learnable with an algorithm that does not depend on the reference universal Turing machine used to define Kolmogorov complexity, then R is PACS learnable with the same algorithm.

Using the results that state that simple concepts of DNF, k-reversible automata (both in [9]) and decision lists (in [3,4]) are all simple-PAC learnable independently of the reference machine, together with our Theorem 10, we can prove the following corollary, which extends and gives an alternative proof of the results in [5].

Corollary 11. DNF, decision lists and k-reversible automata are PACS learnable.

We note that the “independence” condition is not trivially satisfied. If we use two different universal Turing machines U_1 and U_2, the corresponding universal distributions m_{U_1} and m_{U_2} are equal up to a multiplicative constant. This does not seem to be enough to guarantee that a simple-PAC learning algorithm that works with m_{U_1} has to work also with m_{U_2}.

5. Conclusions

This note relates the PACS model with query learning and simple-PAC. We have shown that any learning algorithm that uses queries can be transformed into a PACS algorithm. This implies that PACS is a (potentially) more powerful model than PAC with the aid of any reasonable set of queries. We have also given sufficient conditions for transforming a simple-PAC algorithm into a PACS algorithm and vice versa.

One justification for PACS in [5] is that its strength provides a necessary condition of learnability. In contrast, our results show that all known positive results in the PACS model are derivable from known results in other models. Therefore this justification requires that something new be proved learnable in that model that is not learnable, for instance, in the query model or the simple-PAC model.

References

[1] D. Angluin, Learning regular sets from queries and counterexamples, Inform. and Comput. 75 (1987) 87–106.

[2] D. Angluin, Queries and concept learning, Machine Learning 2 (4) (1988) 319–342.

[3] J. Castro, J.L. Balcázar, Simple-PAC learning of simple decision lists, in: Proc. 6th International Workshop on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence, Vol. 997, Springer, Berlin, 1995, pp. 239–248.

[4] F. Denis, R. Gilleron, PAC learning under helpful distributions, in: Proc. 8th International Workshop on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence, Vol. 1316, Springer, Berlin, 1997, pp. 132–145.

[5] F. Denis, C. D'Halluin, R. Gilleron, PAC learning with simple examples, in: Proc. 13th Annual Symposium on Theoretical Aspects of Computer Science, Lecture Notes in Comput. Sci., Vol. 1046, Springer, Berlin, 1996, pp. 231–242.

[6] S.A. Goldman, H.D. Mathias, Teaching a smarter learner, J. Comput. System Sci. 52 (2) (1996) 255–267.

[7] J. Jackson, An efficient membership-query algorithm for learning DNF with respect to the uniform distribution, J. Comput. System Sci. 55 (1997) 414–440.

[8] M. Kearns, M. Li, L.G. Valiant, Learning Boolean formulae, J. ACM 41 (6) (1995) 1298–1328.

[9] M. Li, P. Vitányi, Learning simple concepts under simple distributions, SIAM J. Comput. 20 (1991) 911–935.

[10] M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and its Applications, Springer, Berlin, 1993.

[11] R. Parekh, V. Honavar, Learning DFA from simple examples, in: Proc. 8th International Workshop on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence, Vol. 1316, Springer, Berlin, 1997, pp. 116–131.

[12] L. Pitt, L.G. Valiant, Computational limitations on learning from examples, J. ACM 35 (1989) 965–984.

[13] L. Valiant, A theory of the learnable, Comm. ACM 27 (1984) 1134–1142.

[14] O. Watanabe, A framework for polynomial time query learnability, Math. Systems Theory 27 (1992) 211–229.