Dominating Distributions and Learnability
Gyora M. Benedek
ELBIT-EVS Ltd. Haifa, Israel
[email protected] .ac.il
Abstract
We consider PAC-learning where the distribution is known to the student. The problem addressed here is characterizing when learnability with respect to distribution D1 implies learnability with respect to distribution D2.

The answer to the above question depends on the learnability model. If the number of examples need not be bounded by a polynomial, it is sufficient to require that all sets which have zero probability with respect to D1 have zero probability with respect to D2. If the number of examples is required to be polynomial, then the probability with respect to D2 must be bounded by a multiplicative constant times that of D1. More stringent conditions must hold if we insist that every hypothesis consistent with the examples be close to the target.

Finally, we address the learnability properties of classes of distributions.
* Currently visiting AT&T Bell Laboratories, Murray Hill, NJ 07974-0636.
Alon Itai *
Computer Science Department
Technion, Haifa, Israel
[email protected]. ac.il
1 INTRODUCTION
We consider PAC-learning with respect to specific distributions and consider the following problem: when can a learning algorithm designed for distribution D1 be used (or modified) for learning with respect to another distribution D2? The object of this line of research is to design learning algorithms for entire classes of distributions. Thus, we specify conditions on the distributions such that a learning algorithm with respect to D1 can be used (or modified) to get a learning algorithm with respect to distribution D2.
In PAC-learning (see precise definitions in Section 2), the "teacher" has a target concept t; she selects ℓ random (with respect to distribution D) examples and tells the "learner" (who doesn't know t) which examples belong to t. The learner, using only this information, has to find a hypothesis close to the target. The accuracy parameter, ε, bounds the distance between the learner's hypothesis and t. We allow the learner to make mistakes, i.e., not learn. The confidence parameter, δ, is a bound on the probability of this event.
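To make the model concrete, here is a minimal sketch of this interaction for the toy concept class of initial segments [0, t) of the unit interval under the uniform distribution. The function names and the concept class are illustrative only, not from the paper.

```python
import random

def draw_sample(target_t, num_examples, rng):
    """The 'teacher': draw points uniformly and label them by the target [0, t)."""
    return [(x, x < target_t) for x in (rng.random() for _ in range(num_examples))]

def learn(sample):
    """The 'learner': return the tightest initial segment consistent with the sample."""
    return max((x for x, positive in sample if positive), default=0.0)

rng = random.Random(0)
t = 0.7
h = learn(draw_sample(t, 1000, rng))
# Under the uniform distribution, the error D([0,h) symm-diff [0,t)) is t - h.
print(f"target={t}, hypothesis=[0, {h:.4f}), error={t - h:.4f}")
```

With high probability the error t − h is small, and it shrinks as the number of examples grows; ε and δ quantify exactly this.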
Our first result (Section 3) is that for any two distributions D1 and D2 for which, for all measurable sets S, D1(S) = 0 implies D2(S) = 0, learning with respect to D1 implies learning with respect to D2.
Learning has polynomial sample complexity if the number of examples ℓ needed to learn to accuracy ε and confidence δ is bounded by a polynomial in ε⁻¹ and δ⁻¹. In Section 4 we characterize when polynomial learning with respect to D1 implies polynomial learning with respect to D2.
Our results give necessary conditions for a concept class to be learnable with respect to a class of distributions. For example, the results of Section 3 imply that if C is a concept class of Borel sets over the real segment [0, 1], C is learnable with respect to a distribution D if and only if it is learnable with respect to the uniform distribution. This result does not imply that the number of examples needed to learn with respect to D is bounded by a polynomial in the number of examples needed to learn with respect to the uniform distribution; for this, stronger requirements, such as those given in Section 4, are needed.
An orthogonal issue is that of solid learnability: under this model the learner may return any hypothesis consistent with the examples. In Section 5 we discuss when solid learnability with respect to D1 implies solid learnability with respect to D2.
2 MODELS FOR LEARNABILITY
The following definitions are the basic definitions of measure theory [7], which is the foundation of our work.
Definition 1 Let X be a set. R ⊆ 2^X is a σ-algebra over X if X ∈ R and R is closed under complements and countable unions.
The best known σ-algebra is that of the Borel sets: the smallest σ-algebra containing all the intervals of the real line.
Definition 2 Let R be a σ-algebra over X and D a function from R to the nonnegative real numbers. D is a distribution over R if

1. D(X) = 1.

2. D is additive, i.e., if A, B ∈ R are disjoint then D(A ∪ B) = D(A) + D(B).

From the definitions ∅ ∈ R and D(∅) = 0; however, there may be additional sets S ∈ R for which D(S) = 0.
In the sequel we shall assume a fixed σ-algebra R, whose members will be called measurable sets, and investigate different distributions over R.
Our basic model of learning from examples is the PAC (probably approximately correct) model as defined in [5]. A concept class is a set C ⊆ R of concepts. The learner receives labeled examples according to distribution D. After receiving ℓ examples, the learner produces a hypothesis L(((x1, c(x1)), ..., (xℓ, c(xℓ)))) ∈ R (the boolean function c(x) denotes the characteristic function of c, i.e., c(x) = 1 iff x ∈ c). The function L is a learning function.
Y1, Y2 ∈ R are ε-close with respect to the distribution D if D(Y1 △ Y2) < ε (△ denotes the symmetric difference). Otherwise, Y1 and Y2 are ε-far with respect to the distribution D.
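For distributions with finite support the distance D(Y1 △ Y2) can be computed directly. A small sketch, representing D as a dict from points to probabilities and concepts as Python sets; the names are ours:

```python
def distance(D, Y1, Y2):
    """D-measure of the symmetric difference of Y1 and Y2."""
    return sum(p for x, p in D.items() if (x in Y1) != (x in Y2))

def eps_close(D, Y1, Y2, eps):
    return distance(D, Y1, Y2) < eps

D = {0: 0.5, 1: 0.25, 2: 0.25}
print(eps_close(D, {0, 1}, {0, 2}, 0.6))  # distance = 0.25 + 0.25 = 0.5, so True
```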
Definition 3 (Learnability for distribution D) [2]: A function L learns a concept class C with respect to distribution D if for every ε, δ > 0 there is an ℓ = ℓ(ε, δ, D) > 0 such that for every c ∈ C and ℓ′ ≥ ℓ the probability

P(L(((x1, c(x1)), ..., (xℓ′, c(xℓ′)))) is ε-close to c) ≥ 1 − δ.
The learning is distribution free if the number of examples, ℓ, is independent of the distribution [5].
Another important aspect is solid learnability [4]. A concept class is solidly learnable if, given a sufficiently large sample, every concept consistent with the sample is a good approximation of the target concept.
2.1 PREVIOUS RESULTS
2.1.1 Distribution Free Learning
Blumer et al. [5] analyzed distribution free learnability and gave a combinatorial characterization of the learnable concept classes. They also showed that a concept class is learnable with respect to all well behaved distributions if and only if it is solidly learnable.
2.1.2 Fixed Distribution
Benedek and Itai [2] considered learnability with respect to a fixed distribution D and gave the following characterization of learnable concept classes.
Definition 4 (Finite cover [2]): Let C be a concept class over R and ε > 0. A set K_C ⊆ R is an ε-cover of C with respect to D if for every c ∈ C there is an h ∈ K_C ε-close to c. C is finitely coverable with respect to D if for every ε > 0 there is a finite ε-cover of C (the size of the cover may depend on ε). The cardinality of a smallest ε-cover of C with respect to D is denoted by n_D(C, ε).
The following lemma shows that we may assume
that C is covered by concepts.
Lemma 1 [2] C is finitely coverable with respect to D if and only if for every ε > 0 there is a finite subset C_ε ⊆ C which is an ε-cover of C with respect to D.
In view of the above lemma, throughout the paper we will assume that a cover is a subset of the concept class.
The following theorem characterizes learnability
with respect to a fixed distribution:
Theorem 1 [2] C is learnable with respect to distribution D if and only if C is finitely coverable with respect to D.
Moreover, they introduced the following learning function.
The best-agreement learning function [2]:

Input: ℓ examples, ((x1, c(x1)), ..., (xℓ, c(xℓ))).

Let E_ℓ be the maximum integer E such that 54·E·ln(E·n_D(C, 1/(2E))) ≤ ℓ.

Let B = {b1, ..., b_{n_D}} be a minimum 1/(2E_ℓ)-cover of C (n_D = n_D(C, 1/(2E_ℓ))).

Output: Any b_i such that the cardinality of {x_j : 1 ≤ j ≤ ℓ, c(x_j) ≠ b_i(x_j)} is minimum (among the n_D elements of the cover B).
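The following sketch shows how the best-agreement function could be realized over a finite domain, assuming concepts are Python sets and that a minimum-cover oracle min_cover(eps) and the cover-size function n_D(eps) are supplied; both are our assumptions, since the paper treats the cover abstractly.

```python
import math

def best_agreement(sample, min_cover, n_D):
    """sample: list of (x, label) pairs.  min_cover(eps) returns a minimum
    eps-cover of C as a list of sets; n_D(eps) returns its cardinality."""
    l = len(sample)
    # E_l: the maximum integer E with 54 * E * ln(E * n_D(1/(2E))) <= l
    # (we assume l is large enough that E = 1 qualifies).
    E = 1
    while 54 * (E + 1) * math.log((E + 1) * n_D(1 / (2 * (E + 1)))) <= l:
        E += 1
    cover = min_cover(1 / (2 * E))
    # Output: a cover element disagreeing with the fewest examples.
    return min(cover, key=lambda b: sum((x in b) != label for x, label in sample))
```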
Theorem 2 [2] Let C be a finitely coverable concept class with respect to distribution D. Then the best-agreement learning function learns C with respect to D with accuracy and confidence 1/E_ℓ.
Theorem 3 If C is learnable with respect to D with accuracy ε and confidence δ using ℓ = ℓ(ε, δ) examples then there exists a 2ε-cover of size 2^{ℓ+1}.
Corollary 1 Let C be a concept class learnable by a polynomial sample (ℓ = ℓ(ε, δ) = (1/(εδ))^{O(1)}); then the best-agreement learning function also requires only a polynomial sample: namely, (54/ε)(ln(1/ε) + ℓ·ln 2 + ln 2) examples suffice to learn C with accuracy ε and confidence ε.
2.2 ENUMERABLE AND POLYNOMIAL DISTRIBUTIONS
M. Li and P. Vitanyi [8] considered enumerable distributions, those distributions D for which the set

{(x, y) : x ∈ N, y ∈ Q, D(x) > y}

is recursively enumerable. They showed the existence of a universal distribution U, and proved that if a concept class is learnable with respect to U it is learnable with respect to all enumerable distributions (and also for a slightly larger class). These results were extended to polynomial distributions and also to continuous ones.
Our results apply to all pairs of distributions, not only enumerable ones. Also, our results are applicable to concept classes which are not learnable with respect to the universal distribution.
3 LEARNABILITY AND DOMINATING DISTRIBUTIONS
Definition 5 Let D1 and D2 be two distributions over a σ-algebra R. D1 dominates D2 if for every set S ∈ R, D1(S) = 0 implies D2(S) = 0.
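On a finite (or countable) domain every null set is a union of null points, so domination reduces to a pointwise check. A small sketch, with distributions as dicts:

```python
def dominates(D1, D2, points):
    """True iff D1(S) = 0 implies D2(S) = 0 for every S, i.e., every
    D1-null point is D2-null (sufficient on a finite domain)."""
    return all(D2.get(x, 0) == 0 for x in points if D1.get(x, 0) == 0)

points = [0, 1, 2]
D1 = {0: 0.5, 1: 0.5}          # D1(2) = 0
D2 = {0: 0.3, 1: 0.3, 2: 0.4}  # D2(2) > 0
print(dominates(D1, D2, points))  # False
print(dominates(D2, D1, points))  # True
```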
In this section we show that every D1-learnable concept class is also learnable with respect to D2 if and only if D1 weakly dominates D2 (Definition 9 below); plain domination is a simpler sufficient condition. First we present the following two lemmas.
Lemma 2 Let C be an infinite concept class finitely coverable with respect to D. Then there exist distinct concepts A_i ∈ C for i = 1, 2, ... such that A_i is 1/2^i-close to A_{i+1}.

Proof: Let C_0 = C. For i = 1, 2, ... we construct infinite concept classes C_i ⊆ C such that A_i ∈ C_{i−1} and every set in C_i is 2^{-i}-close to A_i. For i > 0 let K_1, ..., K_n be a 2^{-i}-cover of C_{i−1} (K_j ∈ C_{i−1}). (Since C_{i−1} ⊆ C and C is finitely coverable, so is C_{i−1}.) Since C_{i−1} is infinite, for some j infinitely many concepts of C_{i−1} are 2^{-i}-close to K_j. Let C_i be this set of concepts (excluding K_j itself) and let A_i = K_j. Then A_{i+1} ∈ C_i, so A_{i+1} is 2^{-i}-close to A_i. ❑
A generalization of the following lemma appears in Halmos [7, Theorem E, p. 38].

Lemma 3 Let G1, G2, ... be sets such that G_i ⊇ G_{i+1} and D(∩_{i=1}^∞ G_i) = 0. Then D(G_i) → 0.
Theorem 4 If a concept class C is learnable with
respect to D1 and D1 dominates D2 then C is
learnable with respect to D2.
Proof: By contradiction: let D1 dominate D2 and let C be learnable with respect to D1 and not learnable with respect to D2. By Theorem 1, C is finitely coverable with respect to D1. By the same theorem, there exists an α > 0 such that there is an infinite set C′ ⊆ C of concepts which are pairwise α-far with respect to D2.

By Lemma 2 there exists an infinite sequence of concepts A1, A2, ... ∈ C′ such that A_i is 2^{-i}-close to A_{i+1} with respect to D1. Let Δ_i = A_i △ A_{i+1}. Then D1(Δ_i) ≤ 1/2^i. Also for every i

D1(∪_{j=i}^∞ Δ_j) ≤ Σ_{j=i}^∞ D1(Δ_j) ≤ 1/2^{i−1}.

Let G_i = ∪_{j=i}^∞ Δ_j. Then

D1(∩_{i=1}^∞ G_i) = 0.   (1)

G_i ⊇ G_{i+1} and D2(G_i) ≥ D2(Δ_i) ≥ α, so by Lemma 3 applied to D2, D2(∩_{i=1}^∞ G_i) > 0.

Since D1 dominates D2, also D1(∩_{i=1}^∞ G_i) > 0, contradicting (1). ❑
Suppose some point x ∈ X has positive probability, i.e., D({x}) > 0. Then this probability cannot be split. This suggests the following definition.
Definition 6 A measurable set A is an atom with respect to distribution D if D(A) > 0 and for every measurable subset B ⊆ A either D(B) = 0 or D(B) = D(A).
Note that some distributions may have atoms that are not singletons. E.g., if R consists of the empty set and all the Borel sets b for which [0, 1/2] ⊆ b ⊆ [0, 1], and D is the uniform distribution, then the segment [0, 1/2] is an atom; moreover, all atoms contain this segment.
Definition 7 Measurable sets E and F are D-equivalent if D(E △ F) = 0.

It is easy to see that D-equivalence is an equivalence relation. Consider the D-equivalence classes. If a class contains an atom then all its members are atoms.
Lemma 4 If a and b are non-D-equivalent atoms then D(a ∩ b) = 0.

Proof: If D(a ∩ b) > 0 then, since a ∩ b ⊆ a and a is an atom, D(a ∩ b) = D(a). By the same argument D(a ∩ b) = D(b). Thus,

D(a △ b) = (D(a) − D(a ∩ b)) + (D(b) − D(a ∩ b)) = 0.

Thus, a and b are D-equivalent, contrary to the hypothesis. ❑

The following lemma appears in [7, Ex. 10, p. 169].

Lemma 5 The set Ã of D-equivalence classes of atoms is countable.

Proof: Let Ã_n consist of those equivalence classes whose members A satisfy D(A) ≥ 1/n. Because of the previous lemma, Ã_n can contain at most n equivalence classes. Since the set of all equivalence classes is the union of the Ã_n's, Ã is a countable union of finite sets, and thus is countable. ❑
Definition 8 (’7, p. 168]. A set S is atom-free
with respect to distribution D if it does not con-
tain any aioms with respect to D.
The next lemma shows that it is easy to learn with respect to distributions which are determined by their behavior on atoms.

Lemma 6 Let D be a distribution such that for every atom-free S ∈ R, D(S) = 0. Then every concept class C is learnable with respect to D.

Proof: Let 𝒜_1, 𝒜_2, ... be the D-equivalence classes of atoms. Let A_j ∈ 𝒜_j and A = ∪_j A_j. W.l.o.g., D(A_j) ≥ D(A_{j+1}). (A is measurable, since by Lemma 5 it is a countable union of measurable sets.)

Let M be the minimum index j for which Σ_{i>j} D(A_i) < ε. (Lemma 4 implies that such an M exists.)

S = { ∪_{j∈J} A_j : J ⊆ {1, ..., M} }

is an ε-cover of C with respect to D. Hence, C is finitely coverable with respect to D. Thus, C is learnable with respect to D. ❑
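For a purely atomic distribution on a countable domain (where the atoms are single points), the cover built in this proof can be written out directly. A sketch under that assumption; the helper name is ours:

```python
from itertools import chain, combinations

def atomic_cover(D, eps):
    """All unions of the M heaviest points, where M is minimal with
    tail mass below eps: an eps-cover of any concept class under D."""
    atoms = sorted(D, key=D.get, reverse=True)
    tail, M = 1.0, 0
    while tail >= eps:
        tail -= D[atoms[M]]
        M += 1
    heavy = atoms[:M]
    subsets = chain.from_iterable(combinations(heavy, r) for r in range(M + 1))
    return [set(s) for s in subsets]  # 2^M candidate hypotheses

print(len(atomic_cover({0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}, 0.2)))  # 8 = 2^3
```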
Lemma 7 Let X be the disjoint union of X1 and X2 ∈ R, and let C_i = {c ∩ X_i : c ∈ C}. Then C is learnable with respect to D if and only if for i = 1, 2, C_i is learnable with respect to D.

Proof: Suppose C is learnable with respect to D. Let {b1, ..., bN} be an ε-cover. Then {b1 ∩ X_i, ..., bN ∩ X_i} is an ε-cover of C_i (with respect to D).

In the other direction, if {b^i_1, ..., b^i_{N_i}} (i = 1, 2) are ε/2-covers of C_1, C_2, then

{b^1_j ∪ b^2_k : 1 ≤ j ≤ N_1, 1 ≤ k ≤ N_2}

is an ε-cover of C (with respect to D). Therefore, for all ε > 0 there exists a finite ε-cover. ❑
Definition 9 (weak domination) Distribution D1 weakly dominates distribution D2 if for every set S ∈ R for which D1(S) = 0, either D2(S) = 0 or S contains an atom with respect to D2.
Obviously, domination implies weak domination. The following theorem shows that weak domination is sufficient to preserve learnability.
Theorem 5 If a concept class C is learnable with respect to D1 and D1 weakly dominates D2 then C is learnable with respect to D2.
Proof (sketch): Define A_j and A as in the proof of Lemma 6 (with D2 replacing D). Let X1 = X − A and X2 = A.

Let C1, C2 be as in Lemma 7. By that lemma, for i = 1, 2, C_i is learnable with respect to D1.

Since X1 is atom-free with respect to D2, restricting R to X1, D1 dominates D2. Hence by Theorem 4, C1 is learnable with respect to D2.

D2 restricted to X2 satisfies the conditions of Lemma 6 and thus C2 is learnable with respect to D2. Combining these results with Lemma 7 implies that C is learnable with respect to D2. ❑
Before showing that the weak-domination requirement is necessary in Theorem 5, we state the following lemma, whose proof appears in Halmos [7, Ex. 3, p. 174].

Lemma 8 If S is atom-free (with respect to distribution D) then for all 0 ≤ a ≤ 1 there exists a set S′ ⊆ S such that

D(S′) = a·D(S).
Theorem 6 Let D1 and D2 be distributions such that D1 does not weakly dominate D2. Then there exists a concept class C such that C is learnable with respect to D1 but not with respect to D2.
Proof: By definition, since D1 does not weakly dominate D2, there exists a measurable set S for which

1. S is atom-free (with respect to distribution D2),

2. D1(S) = 0, and

3. D2(S) = a > 0.

We define a class C over S which does not have an a/5-cover with respect to D2. So, in particular, it is not finitely coverable, thus by Theorem 1 it is not learnable with respect to D2. (Since D1(S) = 0, for all c ∈ C, D1(c) = 0 and thus c can be approximated by the empty set. Hence, C is learnable with respect to D1.)

By successively halving S (using Lemma 8) we obtain, for every m, 2^m disjoint sets A^m_0, ..., A^m_{2^m−1} ⊆ S such that D2(A^m_i) = a/2^m. Let I^m_j = {i : the j-th bit of i is 1} and c^m_j = ∪_{i∈I^m_j} A^m_i (j = 0, ..., m−1). C = {c^m_j : m > 0, 0 ≤ j < m}. Note that for all m, if j ≠ j′ then D2(c^m_j △ c^m_{j′}) = a/2.

Let k1, ..., kn be a minimum a/5-cover. Consider c^{n+1}_0, ..., c^{n+1}_n. By the pigeonhole principle there exist distinct j, j′ such that both c^{n+1}_j and c^{n+1}_{j′} are a/5-close to the same k_i. Then

a/2 = D2(c^{n+1}_j △ c^{n+1}_{j′}) ≤ D2(c^{n+1}_j △ k_i) + D2(c^{n+1}_{j′} △ k_i) ≤ 2a/5 < a/2,

a contradiction. Thus, k1, ..., kn cannot be an a/5-cover. ❑
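The combinatorial core of this construction can be checked concretely by letting 2^m equal-mass cells stand in for the pieces produced by Lemma 8. A sketch (names ours):

```python
def c(j, m):
    """Indices of the cells making up c_j^m: those whose j-th bit is 1."""
    return {i for i in range(2 ** m) if (i >> j) & 1}

def d2(A, B, m, a=1.0):
    """D2-mass of the symmetric difference when each cell has mass a / 2^m."""
    return len(A ^ B) * a / 2 ** m

m = 4
print(d2(c(0, m), c(3, m), m))  # 0.5 = a/2; the same for every pair j != j'
```

Any two distinct bit positions disagree on exactly half of the 2^m indices, which is why every pair of these concepts is a/2-far.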
4 POLYNOMIAL LEARNABILITY AND α-DOMINATING DISTRIBUTIONS
Definition 10 Let α ≥ 1 and let D1 and D2 be distributions over a σ-algebra R. D1 α-dominates D2 if for every set S ∈ R

D2(S) ≤ α·D1(S).

Obviously, for every α ≥ 1, α-domination implies domination but not vice versa. (I.e., there is no α > 0 such that domination implies α-domination.)
Example 1: Let X be the positive integers, D1(i) = 2^{-i}, and D2(i) = 6/(πi)^2. Then D1 and D2 dominate each other, but D1 does not α-dominate D2 for any α. (D2 2-dominates D1.)
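A quick numeric check of this example (on a domain of singletons, pointwise ratios bound the ratios on all sets):

```python
import math

D1 = lambda i: 2.0 ** -i
D2 = lambda i: 6.0 / (math.pi * i) ** 2

print(max(D2(i) / D1(i) for i in range(1, 60)))  # unbounded as i grows
print(max(D1(i) / D2(i) for i in range(1, 60)))  # about 1.85 < 2
```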
Theorem 7 If a concept class C is learnable with respect to D1 with polynomial sample complexity, and for some α ≥ 1, D1 α-dominates D2, then C is learnable with respect to D2 with polynomial sample complexity.
Before proving the theorem let us indicate why it is hard. Because D1 α-dominates D2, if a hypothesis is (ε/α)-close to t with respect to D1 it is ε-close to t with respect to D2. Thus, in order to learn to accuracy ε with respect to D2 it is sufficient to learn with accuracy ε/α with respect to D1. The problem lies in assuring confidence. Every sequence of ℓ sample points is a member of X^ℓ, on which we can only show that D1^ℓ α^ℓ-dominates D2^ℓ. Thus we might have to (exponentially) increase the sample size. Moreover, for larger samples the domination will be satisfied only with an even larger parameter. To overcome these difficulties our proof takes a completely different track.
Proof: Without loss of generality, ε ≤ δ. Let λ = ℓ1(ε/(4α), δ) be the number of examples needed to learn C with respect to D1 with accuracy ε/(4α) and confidence δ. By Theorem 3, there exists an ε/(2α)-cover of size 2^{λ+1} of C with respect to D1, and it is an ε/2-cover of C with respect to D2.

Therefore, with respect to D2, the best-agreement learning function learns C with accuracy 2(ε/2) = ε and confidence ε using

ℓ2(ε, δ) = (54/ε)(ln(1/δ) + λ) = O((1/ε)·log(1/δ) + λ/ε)

examples.

Finally, since α is a constant and ℓ1(ε, δ) is polynomial in 1/ε and 1/δ, so is ℓ2(ε, δ). ❑
The previous theorem does not imply that the number of examples needed to learn with respect to D2 is a polynomial function of the number needed to learn with respect to D1, only that both are polynomial in 1/ε and 1/δ.
Example 2: Let X = {0, 1, ...}, c_n = {0, 1, ..., n}, and C = {c_n : n ≥ 0}. By Lemma 6 (alternatively see [5]), for any distribution D over X, C is learnable with respect to D.

We now show a distribution-dependent algorithm to learn C. Given ε, δ > 0:

1. Let i0 be the minimum integer satisfying Σ_{i≥i0+1} D(i) < ε.

2. Obtain (ln δ^{-1})/p_{i0} examples, where p_{i0} denotes the smallest positive mass D(i) for i ≤ i0.

3. Let k be the maximal positive example.

4. Return c_k.
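A runnable sketch of this algorithm for a distribution D given as a dict with finite support. The reading of p_{i0} as the smallest positive mass among i ≤ i0 is our assumption; the original definition is garbled in the source.

```python
import math, random

def learn_initial_segments(D, target_n, eps, delta, rng):
    support = sorted(D)
    tail = 1.0
    for i0 in support:                     # step 1: tail mass below eps
        tail -= D[i0]
        if tail < eps:
            break
    # step 2: sample size (ln 1/delta) / p_{i0}   (p_{i0}: assumed reading)
    p = min(D[i] for i in support if i <= i0 and D[i] > 0)
    m = math.ceil(math.log(1 / delta) / p)
    sample = rng.choices(support, weights=[D[i] for i in support], k=m)
    positives = [x for x in sample if x <= target_n]   # labels from target c_n
    return max(positives, default=-1)      # steps 3-4: k; hypothesis is c_k

D = {i: 2.0 ** -(i + 1) for i in range(20)}
D[19] *= 2  # make the masses sum exactly to 1
print(learn_initial_segments(D, target_n=5, eps=0.05, delta=0.1,
                             rng=random.Random(1)))
```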
Claim 1 The algorithm learns C.

Let T(0) = 1 and T(i+1) = 2^{T(i)}, and let

a = Σ_{i=0}^∞ 1/T(i).

We now define distributions D1, D2: let

D1(i) = 1/(a·T(i)),

and let D2 be such that for all i, (1/2)·D1(i) ≤ D2(i) ≤ 2·D1(i).
For k = 1, 2, let ℓ_k(ε, δ) be the number of examples needed by the above algorithm to learn C with respect to D_k with accuracy ε and confidence δ.
Claim 2 There exists no polynomial g such that for all ε, δ > 0,

ℓ1(ε, δ) ≤ g(ℓ2(ε, δ)).
Proof: Choose n ≥ 10 and let ε = D1(n). In order to learn C to accuracy ε with respect to distribution D1, n must belong to the sample (if n does not belong to the sample there is no way to distinguish between c_n and c_{n−1}). Therefore, we need a sample of size

ℓ1 = (ln δ^{-1})/D1(n) = a·T(n)·ln δ^{-1}.

For distribution D2 we do not need to receive examples of j ≥ n, since for such j, D1(j+1) ≤ (1/2)·D1(j), implying that

Σ_{j≥n+1} D2(j) ≤ 2·Σ_{j≥n+1} D1(j) ≤ 4·D1(n+1) < D1(n) = ε.

Thus we need a sample of size

ℓ2 ≤ (ln δ^{-1})/D2(n−1) ≤ (ln δ^{-1})/(D1(n−1)/2) = 2a·T(n−1)·ln δ^{-1},

while

ℓ1 = a·(ln δ^{-1})·T(n) = a·(ln δ^{-1})·2^{T(n−1)} ≥ a·(ln δ^{-1})·2^{ℓ2/(2a·ln δ^{-1})}.

For δ = e^{-1}: ℓ1 ≥ a·2^{ℓ2/(2a)}. Thus ℓ1 cannot be bounded by a polynomial in ℓ2. ❑
Using results of [2], with D1 converging fast and D2 converging slowly, we can show:

Theorem 8 Let D1 and D2 be two distributions such that D1 does not α-dominate D2 (for any α ≥ 1). Then there exists a concept class C polynomially learnable with respect to D1 and not polynomially learnable with respect to D2.
Note that even though α-domination implies that polynomial learnability is preserved, it does not imply that every algorithm that learns with respect to D1 also learns with respect to D2. This motivates the learning model of the next section.
5 SOLID LEARNABILITY

5.1 DEFINITION
Blumer et al. [5] characterized learnability by the VC-dimension. Moreover, they showed that, subject to some measurability constraints, every hypothesis consistent with the examples is a good approximation to the target; i.e., every learnable concept class satisfies the following more stringent requirement. See [4] for a comparison between solid learnability and (not necessarily solid) learnability.
Definition 11 A concept class C is solidly learnable if for all ε, δ > 0 there exists an ℓ = ℓ(ε, δ) such that for every target t ∈ C, if x = (x1, ..., xℓ) is a random sample then with probability greater than 1 − δ, every concept consistent with (x, t(x)) is ε-close to t.
Note that we do not require that the hypothesis be
found by an algorithm, so we allow the problem of
finding a consistent hypothesis to be undecidable.
5.2 POLYNOMIAL SOLID LEARNABILITY AND (α, β)-DOMINATION
Definition 12 Distributions D1 and D2 are (α, β)-dominating if D1 α-dominates D2 and D2 β-dominates D1.
To prove the sufficiency of (α, β)-domination for learning, we need the following lemma:

Lemma 9 Let S be measurable and let D1 and D2 be as above. Then the probability that a sample of length m ≥ max(2β, 1 + 2β·ln α) drawn by D2 includes a point in S is at least D1(S).
Proof: The probability that a D2-sample of length m contains a point in S is 1 − D2(X − S)^m. For this probability to be at least D1(S) it is sufficient to have 1 − (1 − D2(S))^m = 1 − D2(X − S)^m ≥ D1(S), which holds whenever

m ≥ ln D1(X − S) / ln D2(X − S).

To finish the proof we show that max(2β, 1 + 2β·ln α) ≥ ln D1(X − S) / ln D2(X − S). Consider two cases:

Case (i): D2(S) ≤ 1/(2β). We use the following inequalities, which hold for every 0 ≤ x < 1:

x ≤ −ln(1 − x) ≤ x/(1 − x).   (2)

Since D2 β-dominates D1, D1(S) ≤ β·D2(S). By the second inequality of (2),

−ln(1 − D1(S)) ≤ D1(S)/(1 − D1(S)) ≤ β·D2(S)/(1 − β·D2(S)).

By the first inequality of (2),

−ln(1 − D2(S)) ≥ D2(S).

To prove the lemma (for case (i)):

ln D1(X − S) / ln D2(X − S) = (−ln(1 − D1(S))) / (−ln(1 − D2(S))) ≤ β/(1 − β·D2(S)) ≤ 2β.

The last inequality holds since we assumed that D2(S) ≤ 1/(2β).

Case (ii): D2(S) > 1/(2β). Since D1 α-dominates D2, D2(X − S) ≤ α·D1(X − S), i.e., D1(X − S) ≥ D2(X − S)/α, implying

ln D1(X − S) / ln D2(X − S) ≤ (ln D2(X − S) − ln α) / ln D2(X − S) = 1 − ln α / ln D2(X − S) ≤ 1 − ln α / ln(1 − 1/(2β)) ≤ 1 + 2β·ln α.

❑
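A numeric spot-check of the lemma on a two-point space (S and its complement), where α- and β-domination can be verified set by set:

```python
import math

def check(d1_S, d2_S, alpha, beta):
    # (alpha, beta)-domination, checked on both S and its complement:
    assert d2_S <= alpha * d1_S and (1 - d2_S) <= alpha * (1 - d1_S)
    assert d1_S <= beta * d2_S and (1 - d1_S) <= beta * (1 - d2_S)
    m = math.ceil(max(2 * beta, 1 + 2 * beta * math.log(alpha)))
    # P(an m-sample by D2 hits S) must be at least D1(S):
    return 1 - (1 - d2_S) ** m >= d1_S

print(check(d1_S=0.30, d2_S=0.15, alpha=2.0, beta=2.0))  # True with m = 4
```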
Definition 13 (indicative) Let t ∈ C. x ∈ X^ℓ is ε-indicative with respect to t if every hypothesis consistent with (x, t(x)) is ε-close to t (with respect to D1). A set G ⊆ X^ℓ is ε-indicative with respect to t if all its elements are ε-indicative with respect to t. In the sequel t will be omitted.
Theorem 9 Let C be a concept class solidly learnable with respect to distribution D1 with sample complexity ℓ = ℓ1(ε/α, δ), where ε and δ are the accuracy and confidence parameters. Let D1 and D2 be (α, β)-dominating. Then C is solidly learnable with respect to D2 with the same accuracy and confidence parameters by ℓ2(ε, δ) = m·ℓ examples, where m = max(2β, 1 + 2β·ln α).
Proof: Let I be the set of (ε/α)-indicative (with respect to D1) ℓ-samples. For notational convenience assume that ℓ = 2; the general case follows similarly. Let X_i denote a sample point drawn by D1 and Y_i one drawn by D2.

Since C is solidly learnable with respect to D1,

1 − δ ≤ D1(I) = P((X1, X2) ∈ I).

Fix X2 and apply Lemma 9 to the set S = {x : (x, X2) ∈ I}: the probability that Y1, ..., Ym contains a point of S is at least D1(S). Hence

P((X1, X2) ∈ I) ≤ P((Y1, ..., Ym, X2) : ∃i, 1 ≤ i ≤ m, (Yi, X2) ∈ I).

(The measurability of {(Y1, ..., Ym, X2) : ∃i, 1 ≤ i ≤ m, (Yi, X2) ∈ I} is proved in the appendix.) Applying the same argument to the second coordinate, for each i,

P((Yi, X2) ∈ I) ≤ P((Yi, Y_{m+1}, ..., Y_{2m}) : ∃j, m+1 ≤ j ≤ 2m, (Yi, Yj) ∈ I).

Combining the two inequalities,

1 − δ ≤ P((Y1, ..., Y_{2m}) : ∃i ≤ m, ∃j > m, (Yi, Yj) ∈ I).

Thus, the probability that Y1, ..., Y_{2m} contains a pair (Yi, Yj) which forms an indicative sample is at least 1 − δ. Since any concept consistent with Y1, ..., Y_{2m} is consistent also with (Yi, Yj), there is probability at least 1 − δ that any hypothesis consistent with Y1, ..., Y_{2m} is (ε/α)-close to the target (with respect to D1), and since D1 α-dominates D2, the hypothesis is ε-close to t with respect to D2. ❑
Corollary 2 Let D1, D2 be distributions that are (α, β)-dominating. Then if a concept class C is polynomially solidly learnable with respect to D1, C is polynomially solidly learnable with respect to D2.
The following examples show that α-domination alone is not sufficient to preserve polynomial solid learnability.
Example 3: Let y0 = 0, y_n = y_{n−1} + 6/(πn)^2, and Y = ∪_n {y_n}. Then y_n → 1 as n → ∞. Let X = (0, 1), and X_n = (y_{n−1}, y_n] (the half-open segment).

The concept class C consists of sets c ⊆ (0, 1) with the following property: if y_n ∈ c then X_n ⊆ c; otherwise X_n ∩ c is finite.

Let D2 be the uniform distribution, and D1 a distribution which gives equal weight to the segment (y_{n−1}, y_n) and to y_n; namely, D1({y_n}) = 3/(πn)^2 and D1(S − Y) = (1/2)·D2(S). Clearly, D1 2-dominates D2.

For n_ε = ⌈6/(επ^2)⌉, D1(∪_{i>n_ε} X_i) < ε. Thus, in order to learn a target t, it is sufficient to exactly learn t ∩ X_i for all i ≤ n_ε. After the learner receives ℓ(ε, δ) = ε^{-1}·log(n_ε/δ) examples, with probability ≥ 1 − δ, all the points y1, ..., y_{n_ε} are included in the sample. The hypothesis of the learner consists of the union of all the X_i's for which y_i belongs to t. (If y_i ∉ t then X_i ∩ t is finite and has zero probability; therefore, the symmetric difference between it and the empty set has zero probability with respect to D1.)

Consider learning t = (0, 1) with respect to D2. Since Y is countable, the probability of receiving any y_n is zero. Also, if the sample points are z1, ..., z_n, the concept c = {z1, ..., z_n} is consistent with t but is 1-far from it (with respect to D2).

This example does not contradict Theorem 7, since C is polynomially learnable with respect to D2, just not solidly so. (If for n ≤ n_ε, X_n ⊄ t, there is very small probability of not getting a negative example in X_n.)
Example 4: Let X, C, D1, and D2 be as in the previous example, and let D3({y_n}) = 4^{-n}, D3(S − Y) = (2/3)·D2(S).

D1 2-dominates D3, but D3 only dominates D1; it does not α-dominate it for any α. In order to solidly learn with respect to D3 to accuracy ε, a learner needs to obtain y_{n_ε}, and to do so with confidence 1 − δ the learner needs 4^{n_ε}·log δ^{-1} examples. Since n_ε is polynomial in ε^{-1}, this sample size is exponential in ε^{-1}, so C is not polynomially solidly learnable.
5.3 BI-DOMINATION AND SOLID LEARNABILITY

For polynomial learnability we have been able to show when learnability with respect to one distribution implies learnability with respect to another distribution, for both the solid (Corollary 2) and the nonsolid case (Theorems 7 and 8). For non-polynomial solid learnability the same condition ((α, β)-domination) is sufficient. However, since the requirement for learnability is weaker, a weaker condition might suffice.
We have not been able to characterize when solid learnability with respect to distribution D1 implies solid learnability with respect to D2. We conjecture, however, that a result similar to Theorem 4 or 5 holds:

Conjecture 1 Let D1 and D2 mutually dominate each other. Then C is solidly learnable with respect to D1 if and only if it is solidly learnable with respect to D2.
Example 3 above shows that domination in one
direction is not sufficient.
6 LEARNABILITY WITH RESPECT TO A SET OF DISTRIBUTIONS
The results of the previous sections allow us to
discuss learnability under a set of distributions.
Definition 14 Let C ⊆ R be a concept class and 𝒟 a set of distributions on X. Then C is learnable with respect to 𝒟 if C is learnable with respect to every distribution D ∈ 𝒟.

Similarly, we may define polynomial learnability with respect to 𝒟, and solid learnability with respect to 𝒟.
As an immediate consequence of Theorem 5 we have:

Corollary 3 Let 𝒟 be a class of distributions over X such that for every D1, D2 ∈ 𝒟, D1 weakly dominates D2. If C ⊆ R is a concept class learnable with respect to some D ∈ 𝒟 then C is learnable with respect to 𝒟.
For polynomial learnability, Theorems 7 and 9 similarly imply:

Corollary 4 Let 𝒟 be a class of distributions over X such that for every pair of distributions D1, D2 ∈ 𝒟 there exists an α such that D1 α-dominates D2, and let C ⊆ R be a concept class (solidly) polynomially learnable with respect to some D ∈ 𝒟. Then C is (solidly) polynomially learnable with respect to 𝒟.
This corollary can be strengthened in several ways. First, if C is (solidly) learnable by O((εδ)^{-k}) examples with respect to D then C is learnable by O((εδ)^{-k}) examples with respect to every D′ ∈ 𝒟. Note that while the exponent does not depend on α, the constant in the big-Oh does. Thus, we may have a different constant for each distribution.

Second, if there exists an α such that for all D1, D2 ∈ 𝒟, D1 α-dominates D2, and C is (solidly) polynomially learnable with respect to some D ∈ 𝒟, then there exists a polynomial p(ε^{-1}, δ^{-1}) such that for all distributions D ∈ 𝒟, p(ε^{-1}, δ^{-1}) many examples suffice.
Moreover, using an (ε/α)-cover for D, we can apply the best-agreement learning function to find a hypothesis which is ε-close to t with respect to every distribution D′ ∈ 𝒟. Thus, the same algorithm learns with respect to all distributions in 𝒟.
7 EXTENSIONS
In [3], the authors consider learning when the number of examples needed may vary with the target concept.

Nonuniform learnability: A function L nonuniformly learns a concept class C with respect to distribution D if for every ε, δ > 0 and every c ∈ C there is an ℓ = ℓ(ε, δ, c, D) > 0 such that for every ℓ′ ≥ ℓ,

P(L(((x1, c(x1)), ..., (xℓ′, c(xℓ′)))) is ε-close to c) ≥ 1 − δ.

The intuition behind this definition is that for a fixed target concept c ∈ C, with high probability, the hypotheses output by L get closer to c as the number of examples increases.
In the full paper we extend our results to nonuniform learnability, and obtain similar results for the polynomial and solid nonuniform cases.
ACKNOWLEDGEMENTS

It is a pleasure to thank Shai Ben-David for illuminating discussions and useful comments.

References

[1] Bartlett, P. L. and Williamson, R. C., Investigating the distribution assumptions in the PAC learning model, COLT '91, 24-32.

[2] Benedek, G. M. and Itai, A., Learnability by fixed distributions, Theoretical Computer Science 86 (1991), 377-389. (A preliminary version appeared in COLT '88.)

[3] Benedek, G. M. and Itai, A., Nonuniform learnability, 15th ICALP (1988), 82-92.

[4] Ben-David, S., Benedek, G. M. and Mansour, Y., A parameterization scheme for classifying models of learnability, to appear in Information and Computation. (A preliminary version appeared in COLT '89.)

[5] Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M., Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension, Proc. of the 18th Symp. on Theory of Computing (1986), 273-282.

[6] Ehrenfeucht, A., Haussler, D., Kearns, M. and Valiant, L., A general lower bound on the number of examples needed for learning, COLT '88.

[7] Halmos, P. R., Measure Theory, Van Nostrand (1950).

[8] Li, M. and Vitanyi, P., Learning simple concepts under simple distributions, to appear in SIAM J. on Computing; an extended abstract appeared in 30th FOCS (1989).

[9] Vapnik, V. N. and Chervonenkis, A. Ya., On the uniform convergence of relative frequencies of events to their probabilities, Th. Prob. and its Appl. 16(2) (1971), 264-280.

[10] Valiant, L. G., A theory of the learnable, Comm. ACM 27(11) (1984), 1134-1142.
APPENDIX: MEASURE THEORETIC PROPERTIES
The purpose of this appendix is to show that the sets defined in the proof of Theorem 9 are indeed measurable. For the sake of brevity, these properties are proved only in the context necessary for this work; however, they hold under more general conditions too.

An ℓ-sample is a point in X^ℓ. Thus, in order to discuss the probability of getting a good sample we should know what the measurable sets in the product space X^ℓ are. Following Halmos [7], if R is a σ-algebra over X, then the σ-algebra R^ℓ over the product space X^ℓ is the σ-algebra generated from all the sets of the form S = S1 × ... × Sℓ where S_i ∈ R. (In other words, a set S ⊆ X^ℓ is a basic measurable set if S = S1 × ... × Sℓ, where each S_i ⊆ X is measurable. The class of measurable sets is the smallest class of sets which contains the basic measurable sets and is closed under countable unions and complementation.)
We can now prove some properties of product measures.

Lemma 10 (Permutation) Let π be a permutation of {1, ..., ℓ} and S ∈ R^ℓ. Then

π(S) = {(x_{π(1)}, ..., x_{π(ℓ)}) : (x1, ..., xℓ) ∈ S} ∈ R^ℓ.

Proof: If S is measurable then S is derived by applying σ-algebra operations (complementations and countable unions) to basic measurable sets (products of sets of R). Thus we may derive π(S) from the same operations, except that the indices of the sets are permuted according to π. ❑
Lemma 11 If S ∈ R^2 then the set

S_∃ = {(y1, ..., ym, x) ∈ X^{m+1} : ∃i (y_i, x) ∈ S} ∈ R^{m+1}.

Proof:

S_∃ = {(y1, x2, ..., xm, x) : (y1, x) ∈ S and x_i ∈ X}
∪ {(x1, y2, x3, ..., xm, x) : (y2, x) ∈ S and x_i ∈ X}
∪ ... ∪ {(x1, ..., x_{m−1}, ym, x) : (ym, x) ∈ S and x_i ∈ X}.

The last element of the union is measurable since it is equal to X^{m−1} × S. The other sets are also measurable since they are a permutation of the last set. ❑