
Learning intersections and thresholds of halfspaces

Adam Klivans (MIT/Harvard)
Ryan O’Donnell (MIT)

Rocco Servedio (Harvard)

Learning

We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means:

- a function f in C is selected, and also a probability distribution D over {+1,−1}^n

- the learning algorithm gets access to random examples <x, f(x)>, where the x’s are drawn from D

- goal: efficiently output a hypothesis h such that w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.

Learning example

Example: C is the class of all conjunctions of variables.

Perhaps the concept selected is:

x1 AND x2 AND x4.

One might see examples:

< (+ + − + − +), + >

< (− + − + − −), − >

< (+ + + − − +), − >

What is a learning algorithm for this class?
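One standard answer is the classical elimination algorithm: start from the conjunction of all 2n literals and delete every literal that a positive example falsifies. The resulting hypothesis only ever errs by outputting −1 on points the target labels +1, and an Occam’s-Razor argument gives the PAC guarantee. A minimal Python sketch (the function and variable names are mine, not from the slides):

```python
# Sketch of the classical elimination algorithm for learning conjunctions.
# Inputs: examples as (x, label) with x a tuple over {+1,-1}, label in {+1,-1}.

def learn_conjunction(examples, n):
    pos_literals = set(range(n))   # candidate literals "x_i must be +1"
    neg_literals = set(range(n))   # candidate literals "x_i must be -1"
    for x, label in examples:
        if label == +1:
            # A positive example falsifies every literal it does not satisfy.
            pos_literals -= {i for i in range(n) if x[i] != +1}
            neg_literals -= {i for i in range(n) if x[i] != -1}

    def hypothesis(x):
        ok = all(x[i] == +1 for i in pos_literals) and \
             all(x[i] == -1 for i in neg_literals)
        return +1 if ok else -1

    return hypothesis

# The slides' example: target is x1 AND x2 AND x4 (1-indexed).
examples = [
    ((+1, +1, -1, +1, -1, +1), +1),
    ((-1, +1, -1, +1, -1, -1), -1),
    ((+1, +1, +1, -1, -1, +1), -1),
]
h = learn_conjunction(examples, n=6)
print([h(x) for x, _ in examples])   # [1, -1, -1], consistent with the labels
```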

Halfspaces

Let h be a hyperplane in R^n:

h = {x : ∑_{i=1}^{n} wi xi = θ}.

h naturally induces a boolean function f : {+1,−1}^n → {+1,−1},

f(x) = sgn(∑_{i=1}^{n} wi xi − θ).

We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (wi ≡ 1, θ = 0).

Learning halfspaces

Learning halfspaces is a very old problem; dates back to models for the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62].

The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89].

Indeed, this works over any distribution on R^n, including distributions supported on {+1,−1}^n.

Learning halfspaces

Basic idea: given a bunch of examples, find a halfspace which classifies them correctly.

By some learning theory technology (“Occam’s Razor”), this is a good algorithm.

Consider the coefficients of a hypothesis halfspace to be unknowns, a1, …, an, θ.

Each example induces some linear constraints: e.g., < (+ + − + − −), + > induces a1+a2−a3+a4−a5−a6 > θ. Solve LP.
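A minimal sketch of this LP in Python (my own naming and use of scipy.optimize.linprog; the strict inequality is replaced by a margin of 1, which any halfspace that strictly separates the finite sample can be rescaled to satisfy):

```python
# Minimal sketch of learning a halfspace by LP (names and normalization are mine).
import numpy as np
from scipy.optimize import linprog

def fit_halfspace(X, y):
    """X: (m, n) array over {+1,-1}; y: (m,) labels in {+1,-1}.
    Finds (a, theta) with y_i * (a . x_i - theta) >= 1 for every example."""
    m, n = X.shape
    # Variables: a_1..a_n, theta.  linprog wants A_ub @ z <= b_ub, so rewrite
    # y_i*(a . x_i - theta) >= 1  as  -y_i*x_i . a + y_i*theta <= -1.
    A_ub = np.hstack([-(y[:, None] * X), y[:, None]])
    b_ub = -np.ones(m)
    c = np.zeros(n + 1)                      # feasibility only, no objective
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
    if not res.success:
        raise ValueError("no consistent halfspace found")
    return res.x[:n], res.x[n]

# Tiny usage example: labels come from sgn(x1 + x2 + x3 - 0.5) on random points.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(50, 6))
y = np.sign(X[:, 0] + X[:, 1] + X[:, 2] - 0.5).astype(int)
a, theta = fit_halfspace(X, y)
print(np.all(np.sign(X @ a - theta) == y))   # True
```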

Learning intersections of halfspaces

The next logical extension of this, and a very important one, is learning intersections of halfspaces.

Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas…

Learning them is also an important problem for computer vision, study of perceptrons.

But very little is known.

Prior work

- [Baum91]: poly time algorithm for intersection of two halfspaces through the origin under symmetric distributions (those satisfying D(x) = D(−x)).

- [BlumKannan, Vempala97]: learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere:

- not relevant for boolean halfspaces

- [KwekPitt98] gave a polynomial time alg., but requires membership queries

- also not relevant for boolean halfspaces

Our results

Theorem 1: The concept class of

arbitrary functions of k boolean halfspaces over {+1,−1}^n

is learnable under the uniform distribution to accuracy 1−ε in time:

n^O(k²/ε²).

This is polynomial time if k = O(1), ε = Ω(1).

(Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)

Our results

Theorem 2: The concept class of

intersections of k boolean halfspaces with weight bound W

is learnable under any probability distribution to accuracy 1−ε in time:

n^O(k log k log W)/ε.

So if the weights are polynomially bounded, one can learn an intersection of log many halfspaces in quasipolynomial time.

More results

Function | Halfspaces | Distrib. | Time
any fcn. of k | weight W | any | n^O(k² log k log W)/ε
weight-k threshold (e.g., inters. of k) | weight W | any | n^O(k log k log W)/ε
intersection of k | weight W | any | n^O(√W log k)/ε
read-once intersection of k | arbitrary | uniform | n^O((log(k)/ε)²)
read-once majority of k | arbitrary | uniform | n^Õ((log(k)/ε)⁴)

Sketch of techniques

For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations.

For uniform distribution results: show that functions of halfspaces have low noise sensitivity.

Both conclusions imply learning results generically.

Talk outline

Plan for the rest of the talk:

1. Prove the n^O(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.

(Sketch other arbit. dist. results.)

2. Prove the n^O(k²/ε²) bound for learning arbitrary functions of k halfspaces under the uniform distribution.

(Sketch other unif. dist. results.)

Polynomial threshold functions

A (multilinear) polynomial p : R^n → R is a PTF for f if it sign-represents f:

f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n.

- every boolean halfspace is a degree 1 PTF for itself

- every boolean function has a degree n PTF

By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time n^O(d)/ε.
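The [KS01] reduction is the same LP as before, just run over all monomials of degree at most d as features. A sketch with my own naming; the two-bit parity −x1·x2 is used as a toy target since it has no degree-1 PTF but a trivial degree-2 one:

```python
# Sketch: learning via polynomial threshold functions of degree d.
# Expand each example into all multilinear monomials of degree <= d,
# then run the same feasibility LP as for plain halfspaces.
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def monomial_features(X, d):
    m, n = X.shape
    cols = [np.ones(m)]                        # the empty monomial (constant)
    for deg in range(1, d + 1):
        for S in combinations(range(n), deg):
            cols.append(np.prod(X[:, list(S)], axis=1))
    return np.column_stack(cols)

def fit_ptf(X, y, d):
    Phi = monomial_features(X, d)              # roughly n^d columns
    m, D = Phi.shape
    # Require y_i * (c . Phi_i) >= 1, i.e.  -y_i*Phi_i . c <= -1.
    res = linprog(np.zeros(D), A_ub=-(y[:, None] * Phi), b_ub=-np.ones(m),
                  bounds=[(None, None)] * D)
    if not res.success:
        raise ValueError("no consistent degree-%d PTF found" % d)
    c = res.x
    return lambda Xnew: np.sign(monomial_features(Xnew, d) @ c)

# Usage: the target -x1*x2 needs degree 2.
rng = np.random.default_rng(1)
X = rng.choice([-1, 1], size=(40, 4))
y = -(X[:, 0] * X[:, 1])
h = fit_ptf(X, y, d=2)
print(np.all(h(X) == y))   # True
```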

PTFs for intersections of halfspaces

Suppose f and g are the affine forms defining two halfspaces: f(x) = ∑wi xi − θ, g(x) = ∑wi′ xi − θ′.

We would like a PTF for sgn(f) ∧ sgn(g).

Failed attempt 1: try f(x)·g(x):
- is > 0 if f(x) > 0 and g(x) > 0
- but is also > 0 if f(x) < 0 and g(x) < 0

Failed attempt 2: try f(x) + g(x):
- is > 0 if f(x) > 0 and g(x) > 0
- is < 0 if f(x) < 0 and g(x) < 0
- is ?? if f(x) > 0 and g(x) < 0

PTFs for intersections of halfspaces

The solution: apply a (polynomial?) function to f and g to make them look more like their sign.

Assume the weights are integers with ∑|wi| < W (and that f, g never vanish on the cube). Then for all x ∈ {+1,−1}^n,

f(x), g(x) ∈ [−W,−1] ∪ [1,W].

Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [−W,−1] ∪ [1,W].

BRS’s sgn-approximator

p(x) = (x−1)(x−2)²(x−4)²(x−8)²(x−16)²(x−32)²

Q(x) = (p(−x) − p(x)) / (p(−x) + p(x))

Q is a rational function of degree O(log k log W) such that:

Q(x) ∈ [1, 1+1/k] for x ∈ [1, W],
Q(x) ∈ [−1−1/k, −1] for x ∈ [−W, −1].

PTFs for intersections of halfspaces

Now given weight-W halfspaces h1, …, hk,

Q(h1(x)) + … + Q(hk(x)) − (k−½)

is a rational function whose sign is the intersection h1 ∧ … ∧ hk. Once taken to a common denominator, it has degree O(k log k log W).

Easy to get a polynomial: sgn(p/q) = sgn(p·q). So we have a PTF for the intersection of k weight-W halfspaces of degree O(k log k log W), hence a learning algorithm running in time n^O(k log k log W).
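As a numeric sanity check (entirely my own, not part of the talk), the sketch below instantiates the construction with the slide’s illustrative p (so W = 32) and k = 2, and verifies on all of {+1,−1}^4 that thresholding Q(h1) + Q(h2) at k − ½ reproduces the intersection of two small integer-weight halfspaces:

```python
# Numeric sanity check (my own; W = 32, k = 2) of the rational sgn-approximator
# and of the sign-representation of an intersection of two halfspaces.
from itertools import product

def p(x):
    # The slide's example polynomial: roots at the powers of two up to 32.
    val = (x - 1)
    for r in (2, 4, 8, 16, 32):
        val *= (x - r) ** 2
    return val

def Q(x):
    # Rational approximation of sgn on [-32, -1] U [1, 32].
    return (p(-x) - p(x)) / (p(-x) + p(x))

# Check the approximation guarantee with k = 2 (so error 1/k = 0.5 suffices).
assert all(1.0 <= Q(1 + t / 100) <= 1.5 for t in range(0, 3101))
assert all(-1.5 <= Q(-(1 + t / 100)) <= -1.0 for t in range(0, 3101))

# Two integer-weight halfspaces h_i(x) = sum_j w_ij x_j (theta = 0); odd weight
# sums guarantee each h_i(x) is a nonzero integer in [-W, W].
w1, w2 = (3, 2, 1, 1), (1, 1, 2, 1)
k, n = 2, 4
for x in product((-1, 1), repeat=n):
    h1 = sum(w * xi for w, xi in zip(w1, x))
    h2 = sum(w * xi for w, xi in zip(w2, x))
    intersection = 1 if (h1 > 0 and h2 > 0) else -1
    ptf_sign = 1 if Q(h1) + Q(h2) - (k - 0.5) > 0 else -1
    assert ptf_sign == intersection
print("intersection represented correctly on all", 2 ** n, "points")
```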

Talk outline

Plan for the talk:

1. Prove the n^O(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.

2. Prove the n^O(k²/ε²) bound for learning functions of k halfspaces under the uniform distribution.

Noise sensitivity

Let f : {+1,−1}^n → {+1,−1} be a boolean function. Pick x ∈ {+1,−1}^n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε.

defn: The noise sensitivity of f is:

NSε(f) = Pr[f(x) ≠ f(y)].

Noise sensitivity examples

• Let f be a projection to one bit,

f(x1, …, xn) = x1.

Then NSε(f) = ε.

• Suppose f depends on only k bits. Then NSε(f) ≤ kε.

• PARITY is the most noise-sensitive function:

NSε(PARITY_n) = ½ − ½(1−2ε)^n.
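These closed forms are easy to sanity-check with a small Monte Carlo estimator (a sketch; the function names and sample sizes are mine):

```python
# Monte Carlo estimate of noise sensitivity NS_eps(f) (sketch; names are mine).
import numpy as np

def noise_sensitivity(f, n, eps, samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(samples, n))
    flips = rng.random((samples, n)) < eps        # flip each bit w.p. eps
    y = np.where(flips, -x, x)
    return np.mean(f(x) != f(y))

n, eps = 10, 0.1
dictator = lambda x: x[:, 0]                      # f(x) = x_1
parity = lambda x: np.prod(x, axis=1)             # PARITY_n

print(noise_sensitivity(dictator, n, eps))        # ~ eps = 0.1
print(noise_sensitivity(parity, n, eps))          # ~ 1/2 - (1-2*eps)**n / 2 ≈ 0.446
```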

Noise sensitivity – study and apps.

• [Benjamini-Kalai-Schramm-98] – percolation, low-level circuit complexity

• [Kahn-Kalai-Linial-88] – random walks on the hypercube

• [Håstad-97] – probabilistically checkable proofs

• [Bshouty-Jackson-Tamon-99] – learning theory under noise

• [O-02] – Yao’s XOR Lemma, average case hardness of NP

• [Bourgain-02, Kindler-Safra-02, FKRSS-02] – study of juntas, Fourier analysis of boolean fcns.

Low noise sens. ⇒ fast learning

We show that if the noise sensitivity of all f in C is uniformly bounded:

NSε(f) ≤ α(ε),

then C is learnable under the uniform distribution in time:

n^(O(1)/α⁻¹(ε/3)).

Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.

Proof of NS-learning connection

Actually, the intuition is wrong. Here is the proper proof sketch:

Low noise sensitivity ⇒ Fourier spectrum concentrated at low levels; this uses the formula NSε(f) = ½ − ½ Σ_S (1−2ε)^|S| f̂(S)² and a Markovish inequality.

Low-level Fourier concentration ⇒ efficient uniform-distribution learning; this is the “Low-degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93].
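For small n the formula can be verified exactly by brute force; a sketch (the 4-bit halfspace below is my own choice of test function):

```python
# Brute-force check of NS_eps(f) = 1/2 - 1/2 * sum_S (1-2*eps)^{|S|} fhat(S)^2
# on a small example (the halfspace below is my own choice, not from the talk).
from itertools import product, combinations
from math import prod

n, eps = 4, 0.1
points = list(product((-1, 1), repeat=n))
f = {x: (1 if 2 * x[0] + x[1] + x[2] + x[3] > 0 else -1) for x in points}

# Left side: exact NS_eps(f), enumerating x and its eps-corruption y.
ns = 0.0
for x in points:
    for y in points:
        d = sum(a != b for a, b in zip(x, y))                # Hamming distance
        ns += eps ** d * (1 - eps) ** (n - d) * (f[x] != f[y])
ns /= len(points)

# Right side: Fourier coefficients fhat(S) = E_x[f(x) * prod_{i in S} x_i].
total = 0.0
for k in range(n + 1):
    for S in combinations(range(n), k):
        fhat = sum(f[x] * prod(x[i] for i in S) for x in points) / len(points)
        total += (1 - 2 * eps) ** k * fhat ** 2
print(ns, 0.5 - 0.5 * total)    # the two values agree
```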

Noise sensitivity of halfspaces

• one boolean halfspace: NSε = O(√ε)  [Y. Peres, ’98]

• any function of k halfspaces: NSε = O(k√ε)  [union bound]

• read-once intersection of k halfspaces: NSε = O(√ε log k)  [difficult probabilistic analysis]

• read-once majority of k halfspaces: NSε = Õ((ε log k)^¼)

Consequences

Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f ∈ C have NSε(f) ≤ α(ε).

α⁻¹(ε/3) = O(ε²/k²).

Hence we get Theorem 1: a uniform distribution learning algorithm running in time n^O(k²/ε²).

Noise sensitivity of a halfspace

We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε).

Suppose the halfspace is f = sgn(∑wi xi − θ). Without (much) loss of generality, one can assume θ = 0. Recall that the xi’s are selected uniformly at random from {+1,−1} and the sum is formed; then each xi is flipped independently with probability ε. We want to show that the probability that the sums before and after flipping land on opposite sides of 0 (call this event a “flop”, with probability P) is O(√ε).

Noise sensitivity of a halfspace

With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.)

We now model the problem thus: Pick signs xi at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X1 = ∑_{i=1..k} wi xi, X2 = ∑_{i=k+1..2k} wi xi, etc.

Noise sensitivity of a halfspace

Write S = X1 + … + Xn/k for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S' = S − X1, so the sum before flipping is S'+X1, and the sum after flipping is S'−X1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S'| < |X1|.

Noise sensitivity of a halfspace

sgn(X1) and S' are independent, so:

Pr[sgn(X1) ≠ sgn(S')] = ½.

sgn(X1) and |X1| are independent, so:

Pr[sgn(X1) ≠ sgn(S') | |S'| > |X1|] = ½

⇒ Pr[sgn(X1) ≠ sgn(S) | |S'| > |X1|] = ½   (if |S'| > |X1| then sgn(S) = sgn(S'))
⇒ Pr[sgn(X1) ≠ sgn(S) & no flop] = ½(1−P)
⇒ Pr[sgn(X1) ≠ sgn(S)] = ½(1−P)   (a flop forces |X1| > |S'|, hence sgn(S) = sgn(X1))
⇒ P = 2 E[½ − I[sgn(X1) ≠ sgn(S)]].

Noise sensitivity of a halfspace

Of course, there was nothing special about block X1 as opposed to any other block. So in fact,

P = 2 E[½ – I[sgn(Xi) ≠ sgn(S)]]   for all i = 1…n/k.

Write τ=sgn(S), σi=sgn(Xi), and average:

P = 2 E[½ – (k/n) ∑i I[τ ≠ σi]].

Noise sensitivity of a halfspace

P = 2 E[½ – (k/n) ∑i I[τ ≠ σi]]

The quantity inside the expectation is some random variable, a number which is either ½ – (k/n) ∑i I[1 ≠ σi] or ½ – (k/n) ∑i I[−1 ≠ σi].

If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise:

P ≤ 2 E[|½ – (k/n) ∑i I[σi=1]| + |½ – (k/n) ∑i I[σi=−1]|].

Noise sensitivity of a halfspace

P ≤ 2 E[ |½ – ε ∑_{i=1..1/ε} I[σi=1]| + |½ – ε ∑_{i=1..1/ε} I[σi=−1]| ]

But the σi’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/ε samples of an unbiased 0/1 random variable, i.e., O(√ε).
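A quick simulation (my own illustration, not from the talk) shows the √ε scaling for the plain majority halfspace; the ratio NSε/√ε stays roughly constant as ε varies:

```python
# Simulation (my own sketch) of the noise sensitivity of the majority function,
# a halfspace with all weights 1, illustrating the O(sqrt(eps)) behaviour.
import numpy as np

def ns_majority(n, eps, samples=50_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(samples, n))
    y = np.where(rng.random((samples, n)) < eps, -x, x)
    fx = np.sign(x.sum(axis=1))       # n odd, so the sum is never 0
    fy = np.sign(y.sum(axis=1))
    return np.mean(fx != fy)

n = 101
for eps in (0.01, 0.04, 0.16):
    print(eps, ns_majority(n, eps) / eps ** 0.5)   # ratio stays roughly constant
```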

Extensions

This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows.

To get the extended learning algorithms, one must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased, say the probability of +1 is p < ½, then NSε(h) ≤ min{2p, C·p·(ε log(1/p))^½}.

Talk outline

Plan for the talk:

1. Prove the n^O(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.

2. Prove the n^O(k²/ε²) bound for learning functions of k halfspaces under the uniform distribution.

Open technical challenges

• Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces. (For a new lower bound, see my talk tomorrow!)

• Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)^½)?

The huge open problem

It still remains open how to learn an intersection of two arbitrary boolean halfspaces under an arbitrary distribution in subexponential time!