
Dominating Distributions and Learnability

Gyora M. Benedek

ELBIT-EVS Ltd., Haifa, Israel

[email protected]

Abstract

We consider PAC-learning where the distribution is known to the student. The problem addressed here is characterizing when learnability with respect to distribution D1 implies learnability with respect to distribution D2.

The answer to the above question depends on the learnability model. If the number of examples need not be bounded by a polynomial, it is sufficient to require that all sets which have zero probability with respect to D1 also have zero probability with respect to D2. If the number of examples is required to be polynomial, then the probability with respect to D2 must be bounded by a multiplicative constant times that with respect to D1. More stringent conditions must hold if we insist that every hypothesis consistent with the examples be close to the target.

Finally, we address the learnability properties of classes of distributions.

* Currently visiting AT&T Bell Laboratories, Murray Hill, NJ 07974-0636.


Alon Itai *

Computer Science Department

Technion, Haifa, Israel

[email protected]

1 INTRODUCTION

We consider PAC-learning with respect to specific distributions and consider the following problem: when can a learning algorithm designed for distribution D1 be used (or modified) for learning with respect to another distribution, D2? The object of this line of research is to design learning algorithms for entire classes of distributions. Thus, we specify conditions on the distributions such that a learning algorithm with respect to D1 can be used (or modified) to get a learning algorithm with respect to distribution D2.

In PAC-learning (see precise definitions in Section 2), the "teacher" has a target concept t; she selects ℓ random (with respect to distribution D) examples and tells the "learner" (who doesn't know t) which examples belong to t. The learner, using only this information, has to find a hypothesis close to the target. The accuracy parameter, ε, bounds the distance between the learner's hypothesis and t. We allow the learner to make mistakes, i.e., not learn. The confidence parameter, δ, bounds the probability of this event.

Our first result (Section 3) is that for any two distributions D1 and D2 for which, for all measurable sets S, D1(S) = 0 implies D2(S) = 0, learning with respect to D1 implies learning with respect to D2.

Learning has polynomial sample complexity if the number of examples ℓ needed to learn to accuracy ε and confidence δ is bounded by a polynomial in ε⁻¹ and δ⁻¹. In Section 4 we characterize when polynomial learning with respect to D1 implies polynomial learning with respect to D2.

Our results give necessary conditions for a concept class to be learnable with respect to a class of distributions. For example, the results of Section 3 imply that if C is a concept class of Borel sets over the real segment [0, 1], then C is learnable with respect to distribution D if and only if it is learnable with respect to the uniform distribution. This result does not imply that the number of examples needed to learn with respect to D is bounded by a polynomial in the number of examples needed to learn with respect to the uniform distribution; for this, stronger requirements, such as those given in Section 4, are needed.

An orthogonal issue is that of solid learnability: under this model the learner may return any hypothesis consistent with the examples. In Section 5 we discuss when solid learnability with respect to D1 implies solid learnability with respect to D2.

2 MODELS FOR LEARNABILITY

The following definitions are the basic definitions of measure theory [7], which is the foundation of our work.

Definition 1 Let X be a set. R ⊆ 2^X is a σ-algebra over X if X ∈ R and R is closed under complements and countable unions.

The best known σ-algebra is that of the Borel sets: the smallest σ-algebra containing all the intervals of the real line.

Definition 2 Let R be a σ-algebra over X and D a function from R to the nonnegative real numbers. D is a distribution over R if

1. D(X) = 1.
2. D is additive, i.e., if A, B ∈ R are disjoint then D(A ∪ B) = D(A) + D(B).

From the definitions, ∅ ∈ R and D(∅) = 0; however, there may be additional sets S ∈ R for which D(S) = 0.

In the sequel we shall assume a fixed σ-algebra R, whose members will be called measurable sets, and investigate different distributions over R.

Our basic model of learning from examples is the PAC (probably approximately correct) model as defined in [5]. A concept class is a set C ⊆ R of concepts. The learner receives labeled examples according to distribution D. After receiving ℓ examples, the learner produces a hypothesis L((x1, c(x1)), ..., (xℓ, c(xℓ))) ∈ R (the boolean function c(x) denotes the characteristic function of c, i.e., c(x) = 1 iff x ∈ c). The function L is a learning function.

Y1, Y2 ∈ R are ε-close with respect to the distribution D if D(Y1 ⊕ Y2) ≤ ε (⊕ denotes the symmetric difference). Otherwise, Y1 and Y2 are ε-far with respect to the distribution D.

Definition 3 (Learnability for distribution D) [2]: A function L learns a concept class C with respect to distribution D if for every ε, δ > 0 there is an ℓ = ℓ(ε, δ, D) > 0 such that for every c ∈ C and every ℓ' ≥ ℓ,

P(L((x1, c(x1)), ..., (x_{ℓ'}, c(x_{ℓ'}))) is ε-close to c) ≥ 1 − δ.

The learning is distribution free if the number of examples, ℓ, is independent of the distribution [5].

Another important aspect is solid learnability [4]: a concept class is solidly learnable if, given a sufficiently large sample, every concept consistent with the sample is a good approximation of the target concept.


2.1 PREVIOUS RESULTS

2.1.1 Distribution Free Learning

Blumer et al. [5] analyzed distribution-free learnability and gave a combinatorial characterization of the learnable concept classes. They also showed that a concept class is learnable with respect to all well-behaved distributions if and only if it is solidly learnable.

2.1.2 Fixed Distribution

Benedek and Itai [2] considered learnability with respect to a fixed distribution D and gave the following characterization of learnable concept classes.

Definition 4 (Finite cover [2]): Let C be a concept class over R and ε > 0. A set K_ε ⊆ R is an ε-cover of C with respect to D if for every c ∈ C there is an h ∈ K_ε that is ε-close to c. C is finitely coverable with respect to D if for every ε > 0 there is a finite ε-cover of C (the size of the cover may depend on ε). The cardinality of a smallest ε-cover of C with respect to D is denoted by n_D(C, ε).
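As an illustration (ours, not the paper's), consider the class of initial segments C = {[0, a] : 0 ≤ a ≤ 1} over the Borel sets of [0, 1], with D the uniform distribution. The finite set K_ε = {[0, kε] : k = 0, 1, ..., ⌈1/ε⌉} is an ε-cover: for any a, taking k = ⌊a/ε⌋ gives D([0, a] ⊕ [0, kε]) = a − kε ≤ ε. Hence n_D(C, ε) ≤ ⌈1/ε⌉ + 1, so C is finitely coverable and, by Theorem 1 below, learnable with respect to the uniform distribution.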

The following lemma shows that we may assume that a cover consists of concepts.

Lemma 1 [2] C is finitely coverable with respect to D if and only if for every ε > 0 there is a finite subset C_ε ⊆ C which is an ε-cover of C with respect to D.

In view of the above lemma, throughout the paper we will assume that a cover is a subset of the concept class.

The following theorem characterizes learnability with respect to a fixed distribution.

Theorem 1 [2] C is learnable with respect to distribution D if and only if C is finitely coverable with respect to D.

Moreover, they introduce the following learning function.

The best-agreement learning function [2]:

Input: ℓ examples, ((x1, c(x1)), ..., (xℓ, c(xℓ))).

Let E_ℓ be the maximum integer such that 54·E_ℓ·ln(E_ℓ·n_D(C, 1/(2E_ℓ))) ≤ ℓ.

Let B = {b1, ..., b_{n_D}} be a minimum 1/(2E_ℓ)-cover of C, where n_D = n_D(C, 1/(2E_ℓ)).

Output: Any b_i such that the cardinality of {x_j : 1 ≤ j ≤ ℓ, c(x_j) ≠ b_i(x_j)} is minimum (among the n_D elements of the cover B).
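The selection step of this rule is straightforward to implement. The following Python sketch (ours, not the paper's) assumes the 1/(2E_ℓ)-cover is already supplied as a list of membership functions; computing E_ℓ and the cover itself depends on C and D and is left to the caller.

```python
import random
from typing import Callable, List, Sequence, Tuple

Concept = Callable[[float], int]  # characteristic function: returns 1 iff the point is in the concept

def best_agreement(sample: Sequence[Tuple[float, int]], cover: List[Concept]) -> Concept:
    """Return the cover element that disagrees with the labeled sample least often."""
    def disagreements(b: Concept) -> int:
        return sum(1 for x, label in sample if b(x) != label)
    return min(cover, key=disagreements)

# Toy usage: initial segments [0, a] over [0, 1], covered by the segments [0, k/10].
if __name__ == "__main__":
    random.seed(0)
    cover = [(lambda x, a=k / 10: 1 if x <= a else 0) for k in range(11)]
    target = lambda x: 1 if x <= 0.42 else 0
    sample = [(x, target(x)) for x in (random.random() for _ in range(200))]
    h = best_agreement(sample, cover)
    print(h(0.39), h(0.45))  # typically 1, 0: the chosen segment is close to [0, 0.42]
```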

Theorem 2 [2] Let C be a finitely coverable concept class with respect to distribution D. Then the best-agreement learning function learns C with respect to D with accuracy and confidence 1/E_ℓ.

Theorem 3 If C is learnable with respect to D with accuracy ε and confidence ½ using ℓ = ℓ(ε, ½) examples then there exists a 2ε-cover of size 2^{ℓ+1}.

Corollary 1 Let C be a concept class learnable by a polynomial sample (ℓ = ℓ(ε, δ) = (1/(εδ))^{O(1)}). Then the best-agreement learning function also requires only a polynomial sample; namely, (54/ε)(ln(1/ε) + ℓ·ln 2 + ln 2) examples suffice to learn C with accuracy and confidence ε.

2.2 ENUMERABLE AND POLYNOMIAL DISTRIBUTIONS

M. Li and P. Vitányi [8] considered enumerable distributions, those distributions D for which the set

{(x, y) : x ∈ N, y ∈ Q, D(x) > y}

is recursively enumerable. They showed the existence of a universal distribution U, and proved that if a concept class is learnable with respect to U it is learnable with respect to all enumerable distributions (and also with respect to a slightly larger class). These results were extended to polynomial distributions and also to continuous ones.


Our results apply to all pairs of distributions, not only enumerable ones. Also, our results are applicable to concept classes which are not learnable with respect to the universal distribution.

3 LEARNABILITY AND DOMINATING DISTRIBUTIONS

Definition 5 Let D1 and D2 be two distributions over a σ-algebra R. D1 dominates D2 if for every set S ∈ R, D1(S) = 0 implies D2(S) = 0.

In this section we show that every D1-learnable concept class is also learnable with respect to D2 if and only if D1 weakly-dominates D2 (Definition 9 below). First we present the following two lemmas.

Lemma 2 Let C be an infinite concept class finitely coverable with respect to D. Then there exist distinct concepts A_i ∈ C, for i = 1, 2, ..., such that A_i is 1/2^i-close to A_{i+1}.

Proof: Let C_0 = C. For i = 1, 2, ... we construct infinite concept classes C_i ⊆ C such that A_i ∈ C_{i−1} and every set in C_i is 2^{-i}-close to A_i. For i > 0 let K_1, ..., K_r be a 2^{-i}-cover of C_{i−1} with K_j ∈ C_{i−1}. (Since C_{i−1} ⊆ C and C is finitely coverable, so is C_{i−1}.) Since C_{i−1} is infinite, for some j infinitely many concepts of C_{i−1} are 2^{-i}-close to K_j. Let C_i be this set of concepts (excluding K_j itself) and let A_i = K_j. Then A_{i+1} ∈ C_i, so A_i is 2^{-i}-close to A_{i+1}. □

A generalization of the following lemma appears in Halmos [7, Theorem E, p. 38].

Lemma 3 Let G_1, G_2, ... be sets such that G_i ⊇ G_{i+1} and D(∩_{i=1}^∞ G_i) = 0. Then D(G_i) → 0.

Theorem 4 If a concept class C is learnable with respect to D1 and D1 dominates D2 then C is learnable with respect to D2.

Proof: By contradiction: let D1 dominate D2 and let C be learnable with respect to D1 and not learnable with respect to D2. By Theorem 1, C is finitely coverable with respect to D1. By the same theorem, there exists an α > 0 such that there is an infinite set C' ⊆ C of concepts which are pairwise α-far with respect to D2.

By Lemma 2 there exists an infinite sequence of concepts A_1, A_2, ... ∈ C' such that A_i is 2^{-i}-close to A_{i+1} with respect to D1. Let Δ_i = A_i ⊕ A_{i+1}; then D1(Δ_i) ≤ 1/2^i. Also, for every i,

D1(∪_{j=i}^∞ Δ_j) ≤ Σ_{j=i}^∞ D1(Δ_j) ≤ 1/2^{i−1}.

Let G_i = ∪_{j=i}^∞ Δ_j. Then

D1(∩_{i=1}^∞ G_i) = 0.   (1)

G_i ⊇ G_{i+1} and D2(G_i) ≥ α (since G_i contains Δ_i, and A_i, A_{i+1} are α-far with respect to D2), so by Lemma 3, D2(∩_{i=1}^∞ G_i) > 0. Since D1 dominates D2, also D1(∩_{i=1}^∞ G_i) > 0, contradicting (1). □

Suppose some point x ∈ X has positive probability, i.e., D({x}) > 0. Then this probability cannot be split. This suggests the following definition.

Definition 6 A measurable set A is an atom with respect to distribution D if D(A) > 0 and for every measurable subset B ⊆ A either D(B) = 0 or D(B) = D(A).

Note that some distributions may have atoms that are not singletons (e.g., if R consists of the empty set and all the Borel sets b for which [0, ½] ⊆ b ⊆ [0, 1], and D is the uniform distribution, then the segment [0, ½] is an atom; moreover, all atoms contain this segment).

Definition 7 Measurable sets E and F are D-equivalent if D(E ⊕ F) = 0.

It is easy to see that D-equivalence is an equivalence relation. Consider the D-equivalence classes. If a class contains an atom then all its members are atoms.

Lemma 4 If a and b are non-D-equivalent atoms then D(a ∩ b) = 0.

Proof: If D(a ∩ b) > 0 then, since a ∩ b ⊆ a and a is an atom, D(a ∩ b) = D(a). By the same argument D(a ∩ b) = D(b). Thus,

D(a ⊕ b) = (D(a) − D(a ∩ b)) + (D(b) − D(a ∩ b)) = 0.

Thus, a and b are D-equivalent, contrary to the hypothesis. □

The following lemma appears in [7, Ex. 10, p. 169].

Lemma 5 The set ℰ of D-equivalence classes of atoms is countable.

Proof: Let ℰ_n consist of those equivalence classes 𝒜 for which D(A) ≥ 1/n for A ∈ 𝒜. Because of the previous lemma, ℰ_n can contain at most n equivalence classes. Since the set of all equivalence classes is the union of the ℰ_n's, ℰ is a countable union of finite sets, and thus is countable. □

Definition 8 [7, p. 168] A set S is atom-free with respect to distribution D if it does not contain any atoms with respect to D.

The next lemma shows that it is easy to learn with respect to distributions which are determined by their behavior on atoms.

Lemma 6 Let D be a distribution such that D(S) = 0 for every atom-free S ∈ R. Then every concept class C is learnable with respect to D.

Proof: Let 𝒜_1, 𝒜_2, ... be the D-equivalence classes of atoms. Let A_j ∈ 𝒜_j and A = ∪_j A_j; w.l.o.g., D(A_j) ≥ D(A_{j+1}). (A is measurable since, by Lemma 5, it is a countable union of measurable sets.)

Let M be the minimum index j for which Σ_{i>j} D(A_i) < ε. (Lemma 4 implies that such an M exists.)

Then

S = { ∪_{j∈J} A_j : J ⊆ {1, ..., M} }

is an ε-cover of C with respect to D. Hence, C is finitely coverable with respect to D. Thus, C is learnable with respect to D. □

Lemma 7 Let X be the disjoint union of X_1 and X_2 ∈ R, and let C_i = {c ∩ X_i : c ∈ C}. Then C is learnable with respect to D if and only if, for i = 1, 2, C_i is learnable with respect to D.

Proof: Suppose C is learnable with respect to D. Let {b_1, ..., b_N} be an ε-cover. Then {b_1 ∩ X_i, ..., b_N ∩ X_i} is an ε-cover of C_i (with respect to D).

In the other direction, if {b_1^i, ..., b_{N_i}^i} (i = 1, 2) are ε/2-covers of C_1 and C_2, then

{ b_j^1 ∪ b_k^2 : 1 ≤ j ≤ N_1, 1 ≤ k ≤ N_2 }

is an ε-cover of C (with respect to D). Therefore, for all ε > 0 there exists a finite ε-cover. □

Definition 9 (weak domination) Distribution D1 weakly-dominates distribution D2 if for every set S ∈ R with D1(S) = 0, either D2(S) = 0 or S contains an atom with respect to D2.

Obviously, domination implies weak domination. The following theorem shows that weak domination is sufficient to preserve learnability.

Theorem 5 If a concept class C is learnable with respect to D1 and D1 weakly-dominates D2 then C is learnable with respect to D2.

Proof (sketch): Define A_j and A as in the proof of Lemma 6 (with D2 replacing D). Let X_1 = X − A and X_2 = A.

Let C_1, C_2 be as in Lemma 7. By that lemma, for i = 1, 2, C_i is learnable with respect to D1.

Since X_1 is atom-free with respect to D2, when R is restricted to X_1, D1 dominates D2. Hence, by Theorem 4, C_1 is learnable with respect to D2.

D2 restricted to X_2 satisfies the conditions of Lemma 6 and thus C_2 is learnable with respect to D2. Combining these results with Lemma 7 implies that C is learnable with respect to D2. □

Before showing that the weak-domination requirement is necessary in Theorem 5, we state the following lemma, whose proof appears in Halmos [7, Ex. 3, p. 174].

Lemma 8 If S is atom-free (with respect to distribution D) then for all 0 ≤ a ≤ 1 there exists a set S' ⊆ S such that D(S') = a·D(S).

Theorem 6 Let D1 and D2 be distributions such that D1 does not weakly-dominate D2. Then there exists a concept class C such that C is learnable with respect to D1 but not with respect to D2.

Proof: By definition, since D1 does not weakly-dominate D2, there exists a measurable set S for which

1. S is atom-free (with respect to distribution D2),
2. D1(S) = 0, and
3. D2(S) = α > 0.

We define a class C over S which does not have a finite α/5-cover with respect to D2. So, in particular, C is not finitely coverable, and thus by Theorem 1 it is not learnable with respect to D2. (Since D1(S) = 0, for all c ∈ C, D1(c) = 0 and thus c can be approximated by the empty set. Hence, C is learnable with respect to D1.)

By successively halving S (using Lemma 8) we obtain 2^m disjoint sets A_0^m, ..., A_{2^m−1}^m ⊆ S such that D2(A_i^m) = α/2^m. Let I_j = {i : the j-th bit of i is 1} and c_j^m = ∪_{i∈I_j} A_i^m (j = 0, ..., m−1). Let C = {c_j^m : m > 0, 0 ≤ j < m}.

Note that for all m, if j ≠ j' then D2(c_j^m ⊕ c_{j'}^m) = α/2.

Let k_1, ..., k_n be a minimum α/5-cover. Consider c_0^{n+1}, ..., c_n^{n+1}. By the pigeonhole principle there exist distinct j, j' such that both c_j^{n+1} and c_{j'}^{n+1} are α/5-close to the same k_i. Then

α/2 = D2(c_j^{n+1} ⊕ c_{j'}^{n+1}) ≤ D2(c_j^{n+1} ⊕ k_i) + D2(c_{j'}^{n+1} ⊕ k_i) ≤ 2α/5,

a contradiction. Thus k_1, ..., k_n cannot be an α/5-cover. □
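A small numerical sketch (ours, not the paper's) makes the key distance computation concrete: model S as 2^m cells of equal D2-mass α/2^m and check that the concepts c_j^m are pairwise exactly α/2 apart.

```python
from itertools import combinations

def concept(j: int, m: int) -> set:
    """c_j^m: the cells i in {0, ..., 2^m - 1} whose j-th bit is 1."""
    return {i for i in range(2 ** m) if (i >> j) & 1}

def d2_sym_diff(a: set, b: set, m: int, alpha: float) -> float:
    """D2-mass of the symmetric difference when each cell weighs alpha / 2^m."""
    return len(a ^ b) * alpha / 2 ** m

m, alpha = 6, 1.0
concepts = [concept(j, m) for j in range(m)]
# Two distinct concepts differ exactly on the cells where bits j and j' disagree,
# i.e., on half of the 2^m cells, so their distance is alpha/2 -- too far apart for
# any small alpha/5-cover (the pigeonhole step of Theorem 6).
assert all(abs(d2_sym_diff(a, b, m, alpha) - alpha / 2) < 1e-12
           for a, b in combinations(concepts, 2))
print("all pairwise D2-distances equal alpha/2 =", alpha / 2)
```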

4 POLYNOMIAL LEARNABILITY AND α-DOMINATING DISTRIBUTIONS

Definition 10 Let α ≥ 1 and let D1 and D2 be distributions over a σ-algebra R. D1 α-dominates D2 if for every set S ∈ R,

D2(S) ≤ α·D1(S).

Obviously, for every α ≥ 1, α-domination implies domination but not vice versa (i.e., there is no α ≥ 1 such that domination implies α-domination).

Example 1: Let X be the positive integers, D1(i) = 2^{−i}, and D2(i) = 6/(π²i²). Then D1 and D2 dominate each other, but D1 does not α-dominate D2 for any α. (D2 2-dominates D1.)
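A quick numerical check of this example (ours, not the paper's); since both distributions are discrete, it suffices to compare them on singletons.

```python
import math

d1 = lambda i: 2.0 ** -i
d2 = lambda i: 6.0 / (math.pi ** 2 * i ** 2)

# The ratio D2({i}) / D1({i}) is unbounded, so no constant alpha gives D2(S) <= alpha * D1(S):
# D1 dominates D2 (they share the same null sets) but does not alpha-dominate it.
for i in (1, 5, 10, 20):
    print(i, d2(i) / d1(i))

# In the other direction D1({i}) <= 2 * D2({i}) for every i (checked on a prefix; the tail
# is immediate since 2^-i shrinks far faster than 1/i^2), so D2 2-dominates D1.
assert all(d1(i) <= 2 * d2(i) for i in range(1, 1000))
```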

Theorem 7 If a concept class C is learnable with respect to D1 with polynomial sample complexity, and for some α ≥ 1, D1 α-dominates D2, then C is learnable with respect to D2 with polynomial sample complexity.

Before proving the theorem let us indicate why it is hard. Because D1 α-dominates D2, if a hypothesis is (ε/α)-close to t with respect to D1 it is ε-close to t with respect to D2. Thus, in order to learn to accuracy ε with respect to D2 it is sufficient to learn with accuracy ε/α with respect to D1. The problem lies in assuring confidence. Every sequence of ℓ sample points is a member of X^ℓ, on which we can only show that D1^ℓ α^ℓ-dominates D2^ℓ. Thus we might have to (exponentially) increase the sample size; moreover, for larger samples the domination will be satisfied only with an even larger parameter. To overcome these difficulties our proof takes a completely different track.

Proof: Without loss of generality, ε ≤ δ.

Let λ = ℓ_1(ε/(4α), ½) be the number of examples needed to learn C with respect to D1 with accuracy ε/(4α) and confidence ½. By Theorem 3, there exists an ε/(2α)-cover of C of size A = 2^{λ+1} with respect to D1, and by α-domination it is an ε/2-cover of C with respect to D2.

Therefore, with respect to D2, the best-agreement learning function learns C with accuracy 2·(ε/2) = ε and confidence ε ≤ δ using

ℓ_2(ε, δ) = (54/ε)(ln(1/ε) + ln A) = O((1/ε)·log(1/ε) + λ/ε)

examples.

Finally, since α is a constant and ℓ_1(ε, δ) is polynomial in 1/ε and 1/δ, so is ℓ_2(ε, δ). □

The previous theorem does not imply that the number of examples needed to learn with respect to D2 is a polynomial function of the number needed to learn with respect to D1, just that both are polynomials in 1/ε and 1/δ.

Example 2: Let X = {0, 1, ...}, c_n = {0, 1, ..., n}, and C = {c_n : n ≥ 0}. By Lemma 6 (alternatively, see [5]), for any distribution D over X, C is learnable with respect to D.

We now show a distribution-dependent algorithm to learn C (a Python sketch follows the listing). Given ε, δ > 0:

1. Let i_0 be the minimum integer satisfying Σ_{i ≥ i_0+1} D(i) < ε.
2. Obtain (ln δ^{−1})/D(i_0) examples.
3. Let k be the maximal positive example.
4. Return c_k.
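The following Python sketch (ours, not the paper's) implements the four steps above, assuming the known distribution is given as a probability mass function over the nonnegative integers together with a sampler, and representing each concept c_n by its threshold n.

```python
import math
import random
from typing import Callable

def learn_threshold(pmf: Callable[[int], float],
                    draw: Callable[[], int],
                    target_n: int,
                    eps: float, delta: float) -> int:
    """Learn c_n = {0, ..., n} under a known distribution (sketch of Example 2)."""
    # Step 1: smallest i0 such that the mass of {i0+1, i0+2, ...} is below eps.
    i0, tail = 0, 1.0 - pmf(0)
    while tail >= eps:
        i0 += 1
        tail -= pmf(i0)
    # Step 2: draw (ln 1/delta) / D(i0) examples.
    num_examples = math.ceil(math.log(1.0 / delta) / pmf(i0))
    sample = [draw() for _ in range(num_examples)]
    # Step 3: k = maximal positive example (points <= target_n are labeled positive);
    # default to 0 if no positive example was drawn.
    k = max((x for x in sample if x <= target_n), default=0)
    # Step 4: return c_k, represented by its threshold.
    return k

if __name__ == "__main__":
    random.seed(1)
    pmf = lambda i: 2.0 ** -(i + 1)  # D(i) = 2^-(i+1)
    draw = lambda: int(math.log2(1.0 / max(random.random(), 1e-12)))  # sampler for this pmf
    print(learn_threshold(pmf, draw, target_n=5, eps=0.05, delta=0.05))
```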

Claim 1 The algorithm learns C.

Let

T(0) = 1,   T(i + 1) = 2^{T(i)},   a = Σ_{i=0}^∞ 1/T(i).

We now define distributions D1 and D2: D1(i) = 1/(a·T(i)), and D2 is chosen so that, for all i, ½·D1(i) ≤ D2(i) ≤ 2·D1(i).

For k = 1, 2, let ℓ_k(ε, δ) be the number of examples needed by the above algorithm to learn C with respect to D_k with accuracy ε and confidence δ; it is O(log ε^{−1} · log δ^{−1}).

Claim 2 There exists no polynomial g such that for all ε, δ > 0, ℓ_1(ε, δ) ≤ g(ℓ_2(ε, δ)).

Proof: Choose n ≥ 10 and let ε = D1(n). In order to learn C to accuracy ε with respect to distribution D1, n must belong to the sample (if n does not belong to the sample there is no way to distinguish between c_n and c_{n−1}). Therefore, we need a sample of size

ℓ_1 = (ln δ^{−1})/D1(n) = a·T(n)·ln δ^{−1}.

For distribution D2 we do not need to receive examples of j ≥ n, since for such j, D1(j + 1) is smaller than D1(j) by an enormous factor, implying that Σ_{j≥n} D2(j) < D1(n) = ε. Thus we need a sample of size

ℓ_2 ≤ (ln δ^{−1})/D2(n − 1) ≤ (ln δ^{−1})/(D1(n − 1)/2) = 2a·T(n − 1)·ln δ^{−1},

while

ℓ_1 = a·(ln δ^{−1})·T(n) = a·(ln δ^{−1})·2^{T(n−1)} ≥ a·(ln δ^{−1})·2^{ℓ_2/(2a·ln δ^{−1})}.

For δ = e^{−1} this gives ℓ_1 ≥ a·2^{ℓ_2/(2a)}. Thus ℓ_1 cannot be bounded by a polynomial in ℓ_2. □

Using results of [2], with D1 converging fast and D2 converging slowly, we can show:

Theorem 8 Let D1 and D2 be two distributions such that D1 does not α-dominate D2 (for any α ≥ 1). Then there exists a concept class C polynomially learnable with respect to D1 and not polynomially learnable with respect to D2.

Note that even though α-domination implies that polynomial learnability is preserved, it does not imply that every algorithm that learns with respect to D1 also learns with respect to D2. This motivates the learning model of the next section.

5 SOLID LEARNABILITY

5.1 DEFINITION

Blumer et al. [5] characterized learnability by the VC-dimension. Moreover, they showed that, subject to some measurability constraints, every hypothesis consistent with the examples is a good approximation to the target; i.e., every learnable concept class satisfies the following more stringent requirement. See [4] for a comparison between solid learnability and (not necessarily solid) learnability.

Definition 11 A concept class C is solidly learnable if for all ε, δ > 0 there exists an ℓ = ℓ(ε, δ) such that for every target t ∈ C, if x = (x_1, ..., x_ℓ) is a random sample then, with probability greater than 1 − δ, every concept consistent with (x, t(x)) is ε-close to t.

Note that we do not require that the hypothesis be

found by an algorithm, so we allow the problem of

finding a consistent hypothesis to be undecidable.

5.2 POLYNOMIAL SOLID LEARNABILITY AND (α, β)-DOMINATION

Definition 12 Distributions D1 and D2 are (α, β)-dominating if D1 α-dominates D2 and D2 β-dominates D1.

To prove the sufficiency of (α, β)-domination for learning, we need the following lemma.

Lemma 9 Let S be measurable with respect to D1 and let D1 and D2 be as above. Then the probability that a sample of length m ≥ max(2β, 1 + 2β·ln α) drawn by D2 includes a point in S is at least D1(S).

Proof: The probability that a D2-sample of length m contains a point in S is 1 − D2(X − S)^m. For this probability to be at least D1(S) it is sufficient to have 1 − (1 − D2(S))^m = 1 − D2(X − S)^m ≥ D1(S), which is implied by

m ≥ ln D1(X − S) / ln D2(X − S).

To finish the proof we show that max(2β, 1 + 2β·ln α) ≥ ln D1(X − S) / ln D2(X − S). Consider two cases.

Case (i): D2(S) ≤ 1/(2β).

We use the following inequalities, which hold for every 0 ≤ x < 1:

x ≤ −ln(1 − x) ≤ x/(1 − x).   (2)

Since D2 β-dominates D1, D1(S) ≤ β·D2(S). The second inequality of (2) implies

−ln(1 − D1(S)) ≤ D1(S)/(1 − D1(S)) ≤ β·D2(S)/(1 − β·D2(S)).

By the first inequality of (2),

−ln(1 − D2(S)) ≥ D2(S).

To prove the lemma (for case (i)):

ln D1(X − S)/ln D2(X − S) = (−ln(1 − D1(S)))/(−ln(1 − D2(S))) ≤ [β·D2(S)/(1 − β·D2(S))]/D2(S) = β/(1 − β·D2(S)) ≤ β/(1 − ½) = 2β.

The last inequality holds since we assumed that D2(S) ≤ 1/(2β).

Case (ii): D2(S) > 1/(2β). Since D1 α-dominates D2, D1(X − S) ≥ D2(X − S)/α, implying

ln D1(X − S)/ln D2(X − S) ≤ (ln D2(X − S) − ln α)/ln D2(X − S) = 1 − ln α/ln D2(X − S) ≤ 1 − ln α/ln(1 − 1/(2β)) ≤ 1 + 2β·ln α. □
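A brute-force numerical check of this bound (ours, not the paper's): sweep over pairs (D1(S), D2(S)) that are consistent with (α, β)-domination of both S and its complement, and verify that m samples from D2 hit S with probability at least D1(S).

```python
import math

def valid(p1: float, p2: float, alpha: float, beta: float) -> bool:
    """(alpha, beta)-domination constraints applied to S and to X - S."""
    return (p2 <= alpha * p1 and p1 <= beta * p2 and
            (1 - p2) <= alpha * (1 - p1) and (1 - p1) <= beta * (1 - p2))

alpha, beta = 3.0, 2.0
m = math.ceil(max(2 * beta, 1 + 2 * beta * math.log(alpha)))

for i in range(1, 1000):
    for j in range(1, 1000):
        p1, p2 = i / 1000.0, j / 1000.0   # candidate values of D1(S) and D2(S)
        if valid(p1, p2, alpha, beta):
            # 1 - (1 - D2(S))^m is the probability that m D2-samples hit S.
            assert 1 - (1 - p2) ** m >= p1 - 1e-12
print("sample length m =", m, "suffices on the whole grid")
```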

Definition 13 (indicative) Let t ∈ C. A sample x ∈ X^ℓ is ε-indicative with respect to t if every hypothesis consistent with (x, t(x)) is ε-close to t (with respect to D1). A set G ⊆ X^ℓ is ε-indicative with respect to t if all its elements are ε-indicative with respect to t. In the sequel t will be omitted.

Theorem 9 Let C be a concept class solidly learnable with respect to distribution D1 with sample complexity ℓ = ℓ_1(ε/α, δ), where ε and δ are the accuracy and confidence parameters. Let D1 and D2 be (α, β)-dominating. Then C is solidly learnable with respect to D2 with the same accuracy and confidence parameters using ℓ_2(ε, δ) = m·ℓ examples, where m = max(2β, 1 + 2β·ln α).

Proof: Let I be the set of (ε/α)-indicative (with respect to D1) ℓ-samples. For notational convenience assume that ℓ = 2; the general case follows similarly. Let X_i denote a sample point drawn according to D1 and Y_i one drawn according to D2.

By solid learnability, 1 − δ ≤ D1(I) = P((X_1, X_2) ∈ I). Fixing the second coordinate and applying Lemma 9 to the set of points x_1 for which (x_1, X_2) ∈ I, a block of m points Y_1, ..., Y_m drawn by D2 contains such a point with probability at least the D1-probability of that set, so

1 − δ ≤ P(∃i, 1 ≤ i ≤ m : (Y_i, X_2) ∈ I).

(The measurability of {(Y_1, ..., Y_m, X_2) : ∃i, 1 ≤ i ≤ m, (Y_i, X_2) ∈ I} is proved in the appendix.) Applying Lemma 9 once more, this time to the second coordinate, for every i,

P((Y_i, X_2) ∈ I) ≤ P(∃j, m + 1 ≤ j ≤ 2m : (Y_i, Y_j) ∈ I).

Combining the two bounds, the probability that Y_1, ..., Y_{2m} contains a pair (Y_i, Y_j) which forms an indicative sample is at least 1 − δ. Since any concept consistent with Y_1, ..., Y_{2m} is consistent also with (Y_i, Y_j), with probability at least 1 − δ every hypothesis consistent with Y_1, ..., Y_{2m} is (ε/α)-close to the target (with respect to D1); since D1 α-dominates D2, such a hypothesis is ε-close to t with respect to D2. □

Corollary 2 Let D1, D2 be distributions such that D1 (α, β)-dominates D2. Then if a concept class C is polynomially solidly learnable with respect to D1, C is polynomially solidly learnable with respect to D2.

The following examples show that α-domination alone is not sufficient to preserve polynomial solid learnability.

Example 3: Let y_0 = 0, y_n = y_{n−1} + 6/(π²n²), and Y = ∪_n {y_n}. Then y_n → 1 as n → ∞. Let X = (0, 1) and X_n = (y_{n−1}, y_n] (the half-open segment).

The concept class C consists of the sets c ⊆ (0, 1) with the following property: if y_n ∈ c then X_n ⊆ c; otherwise X_n ∩ c is finite.

Let D2 be the uniform distribution, and D1 a distribution which gives equal weight to the segment (y_{n−1}, y_n) and to y_n; namely, D1({y_n}) = 3/(π²n²) and D1(S) = ½·D2(S) for measurable S ⊆ X − Y. Clearly, D1 2-dominates D2.

For n_ε = ⌈6/(επ²)⌉, D1(∪_{i>n_ε} X_i) < ε. Thus, in order to learn a target t, it is sufficient to exactly learn t ∩ X_i for all i ≤ n_ε. After the learner receives ℓ(ε, δ) = O(ε^{−2} log(n_ε/δ)) examples, with probability ≥ 1 − δ all the points y_1, ..., y_{n_ε} are included in the sample. The hypothesis of the learner consists of the union of all the X_i's for which y_i belongs to t. (If y_i ∉ t then X_i ∩ t is finite and has zero probability; therefore the symmetric difference between it and the empty set has zero probability with respect to D1.)

Consider learning t = (0, 1) with respect to D2. Since Y is countable, the probability of receiving any y_n is zero. Also, if the sample points are z_1, ..., z_n, the concept c = {z_1, ..., z_n} is consistent with t but is 1-far from it (with respect to D2).

This example does not contradict Theorem 7, since C is polynomially learnable with respect to D2, just not solidly so. (If X_n ⊄ t for some n ≤ n_ε, there is very small probability of not getting a negative example in X_n.)

Example 4: Let X, C, D1, and D2 be as in the previous example, and let D3({y_n}) = 4^{−n}, D3(S − Y) = ½·D2(S).

D1 2-dominates D3, but D3 only dominates D1; it does not α-dominate it for any α. In order to solidly learn with respect to D3 to accuracy ε, a learner needs to obtain y_{n_ε}, and to do so with confidence 1 − δ the learner needs 4^{n_ε}·log δ^{−1} examples. Since n_ε is polynomial in ε^{−1}, C is not polynomially solidly learnable with respect to D3.

5.3 BI-DOMINATION AND SOLID LEARNABILITY

For polynomial learnability we have been able to show when learnability with respect to one distribution implies learnability with respect to another distribution, for both the solid case (Corollary 2) and the nonsolid case (Theorems 7 and 8). For nonpolynomial solid learnability the same condition ((α, β)-domination) is sufficient. However, since the requirement for learnability is weaker, a weaker condition might suffice.

We have not been able to characterize when solid learnability with respect to distribution D1 implies solid learnability with respect to D2. We conjecture, however, that a result similar to Theorem 4 or 5 holds:

Conjecture 1 Let D1 and D2 mutually dominate each other. Then C is solidly learnable with respect to D1 if and only if it is solidly learnable with respect to D2.

Example 3 above shows that domination in one

direction is not sufficient.

6 LEARNABILITY WITH RESPECT TO A SET OF DISTRIBUTIONS

The results of the previous sections allow us to discuss learnability under a set of distributions.

Definition 14 Let C ⊆ R be a concept class and 𝒟 a set of distributions on X. Then C is learnable with respect to 𝒟 if C is learnable with respect to every distribution D ∈ 𝒟.

Similarly, we may define polynomial learnability with respect to 𝒟 and solid learnability with respect to 𝒟.

As an immediate consequence of Theorem 5 we have:

Corollary 3 Let 𝒟 be a class of distributions over X such that for every D1, D2 ∈ 𝒟, D1 weakly-dominates D2. If C ⊆ R is a concept class learnable with respect to some D ∈ 𝒟 then C is learnable with respect to 𝒟.

For polynomial learnability, Theorems 7 and 9 similarly imply:

Corollary 4 Let 𝒟 be a class of distributions over X such that for every pair of distributions D1, D2 ∈ 𝒟 there exists an α such that D1 α-dominates D2, and let C ⊆ R be a concept class (solidly) polynomially learnable with respect to some D ∈ 𝒟. Then C is (solidly) polynomially learnable with respect to 𝒟.

This corollary can be strengthened in several ways. First, if C is (solidly) learnable by O((εδ)^{−k}) examples with respect to D then C is learnable by O((εδ)^{−k}) examples with respect to every D' ∈ 𝒟. Note that while the exponent does not depend on α, the constant in the big-Oh does; thus, we may have a different constant for each distribution.

Second, if there exists an α such that for all D1, D2 ∈ 𝒟, D1 α-dominates D2, and C is (solidly) polynomially learnable with respect to some D ∈ 𝒟, then there exists a polynomial p(ε^{−1}, δ^{−1}) such that for all distributions D ∈ 𝒟, p(ε^{−1}, δ^{−1}) many examples suffice.

Moreover, using an (ε/α)-cover for D, we can apply Best Agreement to find a hypothesis which is ε-close to t with respect to every distribution D' ∈ 𝒟. Thus, the same algorithm learns with respect to all distributions in 𝒟.

7 EXTENSIONS

In [3], the authors consider learning when the

number of examples needed may vary with the

target concept.

Nonuniform learnability: A function L nonuniformly learns a concept class C with respect to distribution D if for every ε, δ > 0 there is an ℓ = ℓ(ε, δ, c, D) > 0 such that for every c ∈ C and every ℓ' ≥ ℓ,

P(L((x_1, c(x_1)), ..., (x_{ℓ'}, c(x_{ℓ'}))) is ε-close to c) ≥ 1 − δ.

The intuition behind this definition is that for a fixed target concept c ∈ C, with high probability, the hypotheses output by L get closer to c as the number of examples increases.

In the full paper we extend our results to nonuniform learnability, and obtain similar results for the polynomial and solid nonuniform cases.

ACKNOWLEDGEMENTS

It is a pleasure to thank Shai Ben-David for illuminating discussions and useful comments.

References

[1] Bartlett, P. L. and Williamson, R. C., Investigating the distribution assumptions in the PAC learning model, COLT '91, 24-32.

[2] Benedek, G. M. and Itai, A., Learnability by fixed distributions, Theoretical Computer Science 86 (1991), 377-389. (A preliminary version appeared in COLT '88.)

[3] Benedek, G. M. and Itai, A., Nonuniform learnability, 15th ICALP (1988), 82-92.

[4] Ben-David, S., Benedek, G. M. and Mansour, Y., A parameterization scheme for classifying models of learnability, to appear in Information and Computation. (A preliminary version appeared in COLT '89.)

[5] Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M., Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension, Proc. of 18th Symp. on Theory of Computing (1986), 273-282.

[6] Ehrenfeucht, A., Haussler, D., Kearns, M. and Valiant, L., A general lower bound on the number of examples needed for learning, COLT '88.

[7] Halmos, P. R., Measure Theory, Van Nostrand (1950).

[8] Li, M. and Vitányi, P., Learning simple concepts under simple distributions, to appear in SIAM J. on Computing; an extended abstract appeared in 30th FOCS (1989).

[9] Vapnik, V. N. and Chervonenkis, A. Ya., On the uniform convergence of relative frequencies of events to their probabilities, Th. Prob. and its Appl. 16(2) (1971), 264-280.

[10] Valiant, L. G., A theory of the learnable, Comm. ACM 27(11) (1984), 1134-1142.

APPENDIX: MEASURE THEORETIC PROPERTIES

The purpose of this appendix is to show that the sets defined in the proof of Theorem 9 are indeed measurable. For the sake of brevity, these properties are proved only in the context necessary for this work; however, they hold under more general conditions too.

An ℓ-sample is a point in X^ℓ. Thus, in order to discuss the probability of getting a good sample, we should know what the measurable sets in the product space X^ℓ are. Following Halmos [7], if R is a σ-algebra over X, then the σ-algebra R^ℓ over the product space X^ℓ is the σ-algebra generated by all the sets of the form S = S_1 × ... × S_ℓ where S_i ∈ R. (In other words, a set S ⊆ X^ℓ is a basic measurable set if S = S_1 × ... × S_ℓ, where each S_i ⊆ X is measurable. The class of measurable sets is the smallest class of sets which contains the basic measurable sets and is closed under countable unions and complementation.)

We can now prove some properties of product measures.

Lemma 10 (Permutation) Let π be a permutation of {1, ..., ℓ} and S ∈ R^ℓ. Then

π(S) = {(x_{π(1)}, ..., x_{π(ℓ)}) : (x_1, ..., x_ℓ) ∈ S} ∈ R^ℓ.

Proof: If S is measurable then S is derived by applying σ-algebra operations (complementations and countable unions) to basic measurable sets (products of sets of R). Thus we may derive π(S) by the same operations, except that the indices of the sets are permuted according to π. □

Lemma 11 If S ∈ R² then the set

S_* = {(y_1, ..., y_m, x) ∈ X^{m+1} : ∃i (y_i, x) ∈ S} ∈ R^{m+1}.

Proof:

S_* = {(y_1, x_2, ..., x_m, x) : (y_1, x) ∈ S and x_i ∈ X}
∪ {(x_1, y_2, x_3, ..., x_m, x) : (y_2, x) ∈ S and x_i ∈ X}
∪ ... ∪ {(x_1, ..., x_{m−1}, y_m, x) : (y_m, x) ∈ S and x_i ∈ X}.

The last element of the union is measurable since it is equal to X^{m−1} × S. The other sets are also measurable since each is a permutation of the last set. □