
Learning intersections and thresholds of halfspaces

Adam Klivans (MIT/Harvard)
Ryan O’Donnell (MIT)

Rocco Servedio (Harvard)

Learning

We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means:

- a function f in C is selected, and also a probability distribution D over {+1,−1}^n

- the learning algorithm gets access to random examples <x, f(x)>, where the x’s are drawn from D

- goal: efficiently output a hypothesis h such that w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.

Learning example

Example: C is the class of all conjunctions of variables.

Perhaps the concept selected is:

x1 AND x2 AND x4.

One might see examples:

< (+ + − + − +), + >

< (− + − + − −), − >

< (+ + + − − +), − >

What is a learning algorithm for this class?
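One standard answer is the classical elimination algorithm: start from the conjunction of all 2n literals and delete every literal that a positive example falsifies. The resulting hypothesis only ever errs by outputting −1 on points the target labels +1, and an Occam’s-Razor argument gives the PAC guarantee. A minimal Python sketch (the function and variable names are mine, not from the slides):

```python
# Sketch of the classical elimination algorithm for learning conjunctions.
# Inputs: examples as (x, label) with x a tuple over {+1,-1}, label in {+1,-1}.

def learn_conjunction(examples, n):
    pos_literals = set(range(n))   # candidate literals "x_i must be +1"
    neg_literals = set(range(n))   # candidate literals "x_i must be -1"
    for x, label in examples:
        if label == +1:
            # A positive example falsifies every literal it does not satisfy.
            pos_literals -= {i for i in range(n) if x[i] != +1}
            neg_literals -= {i for i in range(n) if x[i] != -1}

    def hypothesis(x):
        ok = all(x[i] == +1 for i in pos_literals) and \
             all(x[i] == -1 for i in neg_literals)
        return +1 if ok else -1

    return hypothesis

# The slides' example: target is x1 AND x2 AND x4 (1-indexed).
examples = [
    ((+1, +1, -1, +1, -1, +1), +1),
    ((-1, +1, -1, +1, -1, -1), -1),
    ((+1, +1, +1, -1, -1, +1), -1),
]
h = learn_conjunction(examples, n=6)
print([h(x) for x, _ in examples])   # [1, -1, -1], consistent with the labels
```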

Halfspaces

Let h be a hyperplane in R^n:

h = {x : ∑_{i=1}^{n} wi xi = θ}.

h naturally induces a boolean function f : {+1,−1}^n → {+1,−1},

f(x) = sgn(∑_{i=1}^{n} wi xi − θ).

We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (wi ≡ 1, θ = 0).

Learning halfspaces

Learning halfspaces is a very old problem; dates back to models for the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62].

The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89].

Indeed, this works over any distribution on R^n, including distributions supported on {+1,−1}^n.

Learning halfspaces

Basic idea: given a bunch of examples, find a halfspace which classifies them correctly.

By some learning theory technology (“Occam’s Razor”), this is a good algorithm.

Consider the coefficients of a hypothesis halfspace to be unknowns, a1, …, an, θ.

Each example induces some linear constraints: e.g., < (+ + − + − −), + > induces a1+a2−a3+a4−a5−a6 > θ. Solve LP.
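A minimal sketch of this LP in Python (my own naming and use of scipy.optimize.linprog; the strict inequality is replaced by a margin of 1, which any halfspace that strictly separates the finite sample can be rescaled to satisfy):

```python
# Minimal sketch of learning a halfspace by LP (names and normalization are mine).
import numpy as np
from scipy.optimize import linprog

def fit_halfspace(X, y):
    """X: (m, n) array over {+1,-1}; y: (m,) labels in {+1,-1}.
    Finds (a, theta) with y_i * (a . x_i - theta) >= 1 for every example."""
    m, n = X.shape
    # Variables: a_1..a_n, theta.  linprog wants A_ub @ z <= b_ub, so rewrite
    # y_i*(a . x_i - theta) >= 1  as  -y_i*x_i . a + y_i*theta <= -1.
    A_ub = np.hstack([-(y[:, None] * X), y[:, None]])
    b_ub = -np.ones(m)
    c = np.zeros(n + 1)                      # feasibility only, no objective
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
    if not res.success:
        raise ValueError("no consistent halfspace found")
    return res.x[:n], res.x[n]

# Tiny usage example: labels come from sgn(x1 + x2 + x3 - 0.5) on random points.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(50, 6))
y = np.sign(X[:, 0] + X[:, 1] + X[:, 2] - 0.5).astype(int)
a, theta = fit_halfspace(X, y)
print(np.all(np.sign(X @ a - theta) == y))   # True
```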

Learning intersections of halfspaces

The next logical extension of this, and a very important one, is learning intersections of halfspaces.

Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas…

Learning them is also an important problem for computer vision, study of perceptrons.

But very little is known.

Prior work

- [Baum91]: poly time algorithm for intersection of two halfspaces through the origin under symmetric distributions (those satisfying D(x) = D(−x)).

- [BlumKannan, Vempala97]: learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere:

- not relevant for boolean halfspaces

- [KwekPitt98] gave a polynomial time alg., but requires membership queries

- also not relevant for boolean halfspaces

Our results

Theorem 1: The concept class of

arbitrary functions of k boolean halfspaces over {+1,−1}^n

is learnable under the uniform distribution to accuracy 1−ε in time:

n^O(k²/ε²).

This is polynomial time if k = O(1), ε = Ω(1).

(Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)

Our results

Theorem 2: The concept class of

intersections of k boolean halfspaces with weight bound W

is learnable under any probability distribution to accuracy 1−ε in time:

n^O(k log k log W)/ε.

So if the weights are polynomially bounded, one can learn an intersection of log many halfspaces in quasipolynomial time.

More results

Function | Halfspaces | Distrib. | Time
any fcn. of k | weight W | any | n^O(k² log k log W)/ε
weight-k threshold (e.g., inters. of k) | weight W | any | n^O(k log k log W)/ε
intersection of k | weight W | any | n^O(√W log k)/ε
read-once intersection of k | arbitrary | uniform | n^O((log(k)/ε)²)
read-once majority of k | arbitrary | uniform | n^Õ((log(k)/ε)⁴)

Sketch of techniques

For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations.

For uniform distribution results: show that functions of halfspaces have low noise sensitivity.

Both conclusions imply learning results generically.

Talk outline

Plan for the rest of the talk:

1. Prove the n^O(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.

(Sketch other arbit. dist. results.)

2. Prove the n^O(k²/ε²) bound for learning arbitrary functions of k halfspaces under the uniform distribution.

(Sketch other unif. dist. results.)

Polynomial threshold functions

A (multilinear) polynomial p : R^n → R is a PTF for f if it sign-represents f:

f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n.

- every boolean halfspace is a degree 1 PTF for itself

- every boolean function has a degree n PTF

By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time n^O(d)/ε.
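The [KS01] reduction is the same LP as before, just run over all monomials of degree at most d as features. A sketch with my own naming; the two-bit parity −x1·x2 is used as a toy target since it has no degree-1 PTF but a trivial degree-2 one:

```python
# Sketch: learning via polynomial threshold functions of degree d.
# Expand each example into all multilinear monomials of degree <= d,
# then run the same feasibility LP as for plain halfspaces.
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def monomial_features(X, d):
    m, n = X.shape
    cols = [np.ones(m)]                        # the empty monomial (constant)
    for deg in range(1, d + 1):
        for S in combinations(range(n), deg):
            cols.append(np.prod(X[:, list(S)], axis=1))
    return np.column_stack(cols)

def fit_ptf(X, y, d):
    Phi = monomial_features(X, d)              # roughly n^d columns
    m, D = Phi.shape
    # Require y_i * (c . Phi_i) >= 1, i.e.  -y_i*Phi_i . c <= -1.
    res = linprog(np.zeros(D), A_ub=-(y[:, None] * Phi), b_ub=-np.ones(m),
                  bounds=[(None, None)] * D)
    if not res.success:
        raise ValueError("no consistent degree-%d PTF found" % d)
    c = res.x
    return lambda Xnew: np.sign(monomial_features(Xnew, d) @ c)

# Usage: the target -x1*x2 needs degree 2.
rng = np.random.default_rng(1)
X = rng.choice([-1, 1], size=(40, 4))
y = -(X[:, 0] * X[:, 1])
h = fit_ptf(X, y, d=2)
print(np.all(h(X) == y))   # True
```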

PTFs for intersections of halfspaces

Suppose f and g are the affine forms defining two halfspaces: f(x) = ∑wi xi − θ, g(x) = ∑wi′ xi − θ′.

We would like a PTF for sgn(f) ∧ sgn(g).

Failed attempt 1: try f(x)·g(x):
- is > 0 if f(x) > 0 and g(x) > 0
- but is also > 0 if f(x) < 0 and g(x) < 0

Failed attempt 2: try f(x) + g(x):
- is > 0 if f(x) > 0 and g(x) > 0
- is < 0 if f(x) < 0 and g(x) < 0
- is ?? if f(x) > 0 and g(x) < 0

PTFs for intersections of halfspaces

The solution: apply a (polynomial?) function to f and g to make them look more like their sign.

Assume the weights are integers with ∑|wi| < W (and that f, g never vanish on the cube). Then for all x ∈ {+1,−1}^n,

f(x), g(x) ∈ [−W,−1] ∪ [1,W].

Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [−W,−1] ∪ [1,W].

BRS’s sgn-approximator

p(x) = (x−1)(x−2)²(x−4)²(x−8)²(x−16)²(x−32)²

Q(x) = (p(−x) − p(x)) / (p(−x) + p(x))

Q is a rational function of degree O(log k log W) such that:

Q(x) ∈ [1, 1+1/k] for x ∈ [1, W],
Q(x) ∈ [−1−1/k, −1] for x ∈ [−W, −1].

PTFs for intersections of halfspaces

Now given weight-W halfspaces h1, …, hk,

Q(h1(x)) + … + Q(hk(x)) − (k−½)

is a rational function whose sign is the intersection h1 ∧ … ∧ hk. Once taken to a common denominator, it has degree O(k log k log W).

Easy to get a polynomial: sgn(p/q) = sgn(p·q). So we have a PTF for the intersection of k weight-W halfspaces of degree O(k log k log W), hence a learning algorithm running in time n^O(k log k log W).
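As a numeric sanity check (entirely my own, not part of the talk), the sketch below instantiates the construction with the slide’s illustrative p (so W = 32) and k = 2, and verifies on all of {+1,−1}^4 that thresholding Q(h1) + Q(h2) at k − ½ reproduces the intersection of two small integer-weight halfspaces:

```python
# Numeric sanity check (my own; W = 32, k = 2) of the rational sgn-approximator
# and of the sign-representation of an intersection of two halfspaces.
from itertools import product

def p(x):
    # The slide's example polynomial: roots at the powers of two up to 32.
    val = (x - 1)
    for r in (2, 4, 8, 16, 32):
        val *= (x - r) ** 2
    return val

def Q(x):
    # Rational approximation of sgn on [-32, -1] U [1, 32].
    return (p(-x) - p(x)) / (p(-x) + p(x))

# Check the approximation guarantee with k = 2 (so error 1/k = 0.5 suffices).
assert all(1.0 <= Q(1 + t / 100) <= 1.5 for t in range(0, 3101))
assert all(-1.5 <= Q(-(1 + t / 100)) <= -1.0 for t in range(0, 3101))

# Two integer-weight halfspaces h_i(x) = sum_j w_ij x_j (theta = 0); odd weight
# sums guarantee each h_i(x) is a nonzero integer in [-W, W].
w1, w2 = (3, 2, 1, 1), (1, 1, 2, 1)
k, n = 2, 4
for x in product((-1, 1), repeat=n):
    h1 = sum(w * xi for w, xi in zip(w1, x))
    h2 = sum(w * xi for w, xi in zip(w2, x))
    intersection = 1 if (h1 > 0 and h2 > 0) else -1
    ptf_sign = 1 if Q(h1) + Q(h2) - (k - 0.5) > 0 else -1
    assert ptf_sign == intersection
print("intersection represented correctly on all", 2 ** n, "points")
```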

Talk outline

Plan for the talk:

1. Prove the n^O(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.

2. Prove the n^O(k²/ε²) bound for learning functions of k halfspaces under the uniform distribution.

Noise sensitivity

Let f : {+1,−1}^n → {+1,−1} be a boolean function. Pick x ∈ {+1,−1}^n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε.

defn: The noise sensitivity of f is:

NSε(f) = Pr[f(x) ≠ f(y)].

Noise sensitivity examples

• Let f be a projection to one bit,

f(x1, …, xn) = x1.

Then NSε(f) = ε.

• Suppose f depends on only k bits. Then NSε(f) ≤ kε.

• PARITY is the most noise-sensitive function:

NSε(PARITY_n) = ½ − ½(1−2ε)^n.
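These closed forms are easy to sanity-check with a small Monte Carlo estimator (a sketch; the function names and sample sizes are mine):

```python
# Monte Carlo estimate of noise sensitivity NS_eps(f) (sketch; names are mine).
import numpy as np

def noise_sensitivity(f, n, eps, samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(samples, n))
    flips = rng.random((samples, n)) < eps        # flip each bit w.p. eps
    y = np.where(flips, -x, x)
    return np.mean(f(x) != f(y))

n, eps = 10, 0.1
dictator = lambda x: x[:, 0]                      # f(x) = x_1
parity = lambda x: np.prod(x, axis=1)             # PARITY_n

print(noise_sensitivity(dictator, n, eps))        # ~ eps = 0.1
print(noise_sensitivity(parity, n, eps))          # ~ 1/2 - (1-2*eps)**n / 2 ≈ 0.446
```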

Noise sensitivity – study and apps.

• [Benjamini-Kalai-Schramm-98] – percolation, low-level circuit complexity

• [Kahn-Kalai-Linial-88] – random walks on the hypercube

• [Håstad-97] – probabilistically checkable proofs

• [Bshouty-Jackson-Tamon-99] – learning theory under noise

• [O-02] – Yao’s XOR Lemma, average case hardness of NP

• [Bourgain-02, Kindler-Safra-02, FKRSS-02] – study of juntas, Fourier analysis of boolean fcns.

Low noise sens. ⇒ fast learning

We show that if the noise sensitivity of all f in C is uniformly bounded:

NSε(f) ≤ α(ε),

then C is learnable under the uniform distribution in time:

n^(O(1)/α⁻¹(ε/3)).

Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.

Proof of NS-learning connection

Actually, the intuition is wrong. Here is the proper proof sketch:

Low noise sensitivity ⇒ Fourier spectrum concentrated at low levels; this uses the formula NSε(f) = ½ − ½ Σ_S (1−2ε)^|S| f̂(S)² and a Markovish inequality.

Low-level Fourier concentration ⇒ efficient uniform-distribution learning; this is the “Low-degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93].
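For small n the formula can be verified exactly by brute force; a sketch (the 4-bit halfspace below is my own choice of test function):

```python
# Brute-force check of NS_eps(f) = 1/2 - 1/2 * sum_S (1-2*eps)^{|S|} fhat(S)^2
# on a small example (the halfspace below is my own choice, not from the talk).
from itertools import product, combinations
from math import prod

n, eps = 4, 0.1
points = list(product((-1, 1), repeat=n))
f = {x: (1 if 2 * x[0] + x[1] + x[2] + x[3] > 0 else -1) for x in points}

# Left side: exact NS_eps(f), enumerating x and its eps-corruption y.
ns = 0.0
for x in points:
    for y in points:
        d = sum(a != b for a, b in zip(x, y))                # Hamming distance
        ns += eps ** d * (1 - eps) ** (n - d) * (f[x] != f[y])
ns /= len(points)

# Right side: Fourier coefficients fhat(S) = E_x[f(x) * prod_{i in S} x_i].
total = 0.0
for k in range(n + 1):
    for S in combinations(range(n), k):
        fhat = sum(f[x] * prod(x[i] for i in S) for x in points) / len(points)
        total += (1 - 2 * eps) ** k * fhat ** 2
print(ns, 0.5 - 0.5 * total)    # the two values agree
```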

Noise sensitivity of halfspaces

• one boolean halfspace: NSε = O(√ε)  [Y. Peres, ’98]

• any function of k halfspaces: NSε = O(k√ε)  [union bound]

• read-once intersection of k halfspaces: NSε = O(√ε log k)  [difficult probabilistic analysis]

• read-once majority of k halfspaces: NSε = Õ((ε log k)^¼)

Consequences

Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f ∈ C have NSε(f) ≤ α(ε).

α⁻¹(ε/3) = O(ε²/k²).

Hence we get Theorem 1: a uniform distribution learning algorithm running in time n^O(k²/ε²).

Noise sensitivity of a halfspace

We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε).

Suppose the halfspace is f = sgn(∑wi xi − θ). Without (much) loss of generality, one can assume θ = 0. Recall that the xi’s are selected uniformly at random from {+1,−1} and the sum is formed; then each xi is flipped independently with probability ε. We want to show that the probability that the sums before and after flipping land on opposite sides of 0 (call this event a “flop”, with probability P) is O(√ε).

Noise sensitivity of a halfspace

With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.)

We now model the problem thus: Pick signs xi at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X1 = ∑_{i=1..k} wi xi, X2 = ∑_{i=k+1..2k} wi xi, etc.

Noise sensitivity of a halfspace

Write S = X1 + … + Xn/k for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S' = S − X1, so the sum before flipping is S'+X1, and the sum after flipping is S'−X1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S'| < |X1|.

Noise sensitivity of a halfspace

sgn(X1) and S' are independent, so:

Pr[sgn(X1) ≠ sgn(S')] = ½.

sgn(X1) and |X1| are independent, so:

Pr[sgn(X1) ≠ sgn(S') | |S'| > |X1|] = ½

⇒ Pr[sgn(X1) ≠ sgn(S) | |S'| > |X1|] = ½   (if |S'| > |X1| then sgn(S) = sgn(S'))
⇒ Pr[sgn(X1) ≠ sgn(S) & no flop] = ½(1−P)
⇒ Pr[sgn(X1) ≠ sgn(S)] = ½(1−P)   (a flop forces |X1| > |S'|, hence sgn(S) = sgn(X1))
⇒ P = 2 E[½ − I[sgn(X1) ≠ sgn(S)]].

Noise sensitivity of a halfspace

Of course, there was nothing special about block X1 as opposed to any other block. So in fact,

P = 2 E[½ – I[sgn(Xi) ≠ sgn(S)]]   for all i = 1…n/k.

Write τ=sgn(S), σi=sgn(Xi), and average:

P = 2 E[½ – (k/n) ∑i I[τ ≠ σi]].

Noise sensitivity of a halfspace

P = 2 E[½ – (k/n) ∑i I[τ ≠ σi]]

The quantity inside the expectation is some random variable, a number which is either ½ – (k/n) ∑i I[1 ≠ σi] or ½ – (k/n) ∑i I[−1 ≠ σi].

If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise:

P ≤ 2 E[|½ – (k/n) ∑i I[σi=1]| + |½ – (k/n) ∑i I[σi=−1]|].

Noise sensitivity of a halfspace

P ≤ 2 E[ |½ – ε ∑_{i=1..1/ε} I[σi=1]| + |½ – ε ∑_{i=1..1/ε} I[σi=−1]| ]

But the σi’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/ε samples of an unbiased 0/1 random variable, i.e., O(√ε).
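A quick simulation (my own illustration, not from the talk) shows the √ε scaling for the plain majority halfspace; the ratio NSε/√ε stays roughly constant as ε varies:

```python
# Simulation (my own sketch) of the noise sensitivity of the majority function,
# a halfspace with all weights 1, illustrating the O(sqrt(eps)) behaviour.
import numpy as np

def ns_majority(n, eps, samples=50_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(samples, n))
    y = np.where(rng.random((samples, n)) < eps, -x, x)
    fx = np.sign(x.sum(axis=1))       # n odd, so the sum is never 0
    fy = np.sign(y.sum(axis=1))
    return np.mean(fx != fy)

n = 101
for eps in (0.01, 0.04, 0.16):
    print(eps, ns_majority(n, eps) / eps ** 0.5)   # ratio stays roughly constant
```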

Extensions

This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows.

To get the extended learning algorithms, one must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased, say the probability of +1 is p < ½, then NSε(h) ≤ min{2p, C·p·(ε log(1/p))^½}.

Talk outline

Plan for the talk:

1. Prove the n^O(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.

2. Prove the n^O(k²/ε²) bound for learning functions of k halfspaces under the uniform distribution.

Open technical challenges

• Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces. (For a new lower bound, see my talk tomorrow!)

• Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)^½)?

The huge open problem

It still remains open how to learn an intersection of two arbitrary boolean halfspaces under an arbitrary distribution in subexponential time!