learning intersections and thresholds of halfspaces adam klivans (mit/harvard) ryan o’donnell...
TRANSCRIPT
Learning intersections and thresholds of halfspaces
Adam Klivans (MIT/Harvard)Ryan O’Donnell (MIT)
Rocco Servedio (Harvard)
LearningWe consider the PAC model of [Valiant-84],
in which learning a “concept class” C of boolean functions means:
- a function f in C is selected, and also a probability distribution D over {+1,−1}n
- the learning algorithm gets access to random examples <x, f(x)>, where the x’s are drawn from D
- goal: efficiently output a hypothesis h such that w.h.p., Prx←D[ f(x) ≠ h(x)] < ε.
Learning exampleExample: C is the class of all conjunctions of
variables.
Perhaps the concept selected is:
x1 AND x2 AND x4.
One might see examples:
< (+ + − + − +), + >
< (− + − + − −), − >
< (+ + + − − +), − >
What is a learning algorithm for this class?
Halfspaces
Let h be a hyperplane in Rn:
h = {x : ∑wi xi = θ}.
h naturally induces a boolean function:f : {+1,−1}n → {+1,−1},
f (x) = sgn(∑wi xi − θ).
We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (wi ≡ 1, θ = 0).
i=1
n
Learning halfspaces
Learning halfspaces is a very old problem; dates back to models for the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62].
The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89].
Indeed, this works over any distribution on Rn, including those singling out {+1,−1}n.
Learning halfspaces
Basic idea: given a bunch of examples, find a halfspace which classifies them correctly.
By some learning theory technology (“Occam’s Razor”), this is a good algorithm.
Consider the coefficients of a hypothesis halfspace to be unknowns, a1, …, an, θ.
Each example induces some linear constraints: e.g., < (+ + − + − −), + > induces a1+a2−a3+a4−a5−a6 > θ. Solve LP.
Learning intersections of halfspaces
The next logical extension of this, and a very important one, is learning intersections of halfspaces.
Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas…
Learning them is also an important problem for computer vision, study of perceptrons.
But very little is known.
Prior work- [Baum91]: poly time algorithm for intersection of
two halfspaces through the origin under symmetric distributions (those satisfying D(x) = D(−x)).
- [BlumKannan,Vempala97] learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere:
- not relevant for boolean halfspaces
- [KwekPitt98] gave a polynomial time alg., but requires membership queries
- also not relevant for boolean halfspaces
Our results
Theorem 1: The concept class of
arbitrary functions of k boolean halfspaces over {+1,−1}n
is learnable under the uniform distribution to accuracy 1−ε in time:
nO(k²/ε²).
This is polynomial time if k = O(1), ε = Ω(1).(Prior to this, no algorithm could learn even an intersection of
2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)
Our results
Theorem 2: The concept class of
intersections of k boolean halfspaces with weight bound W
is learnable under any probability distribution to accuracy 1−ε in time:
nO(k log k log W)/ε.
So if the weights are polynomially bounded, one can learn an intersection of log many halfspaces in quasipolynomial time.
More results
Function Halfspaces Distrib. Time
any fcn. of k weight W any nO(k² log k logW)/εweight k threshold (e.g., inters. of k)
weight W any nO(k log k logW)/ε
intersection of k weight W any nO(√W log k)/ε
read-once intersection of k
arbitrary uniform nO((log(k)/ε)²)
read-once majority of k
arbitrary uniform nÕ((log(k)/ε) )4
Sketch of techniques
For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations.
For uniform distribution results: show that functions of halfspaces have low noise sensitivity.
Both conclusions imply learning results generically.
Talk outline
Plan for the rest of the talk:
1. Prove nO(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
(Sketch other arbit. dist. results.)
2. Prove nO(k²/ε²) bound for learning arbitrary functions of k halfspaces under the uniform distribution.
(Sketch other unif. dist. results.)
Polynomial threshold functions
A (multilinear) polynomial p : Rn→ R is a PTF for f if it sign-represents f :
f(x) = sgn(p(x)) for all x {+1,−1}n.
- every boolean halfspace is a degree 1 PTF for itself
- every boolean function has a degree n PTF
By linear programming [KS01]: if every function in a class C has a PTF of degree
d, then C is learnable in time nO(d)/ε.
PTFs for intersections of halfspaces
Suppose f and g are hyperplanes, f(x) = ∑wi xi−θ, g(x) = ∑wi' xi−θ' .
We would like a PTF for sgn(f) sgn(g).
Failed attempt 1:- try f(x)g(x): is >0 if f(x)>0 and g(x)>0
is >0 if f(x)<0 and g(x)<0 Failed attempt 2:
- try f(x)+g(x): is >0 if f(x)>0 and g(x)>0 is <0 if f(x)<0 and g(x)<0 is ?? if f(x)>0 and g(x)<0
PTFs for intersections of halfspacesThe solution: apply a (polynomial?) function
to f and g to make them look more like their sign.
Assume ∑|wi| < W. Then for all x {+1,−1}n,
f(x), g(x) [-W,-1] ∪ [1,W].Beigel et al. [BRS95] showed how to
construct a univariate rational function which is an essentially optimal approximator of the sgn function on [-W,-1] ∪ [1,W].
BRS’s sgn-approximator
p(x)=(x-1)(x-2)2(x-4)2(x-8)2(x-16)2(x-32)2
q(x) =
Q is a rational function of degree O(log k log W) such that:
Q(x) [1, 1+1/k] for x [1,W],Q(x) [-1-1/k, -1] for x [-W,-1].
p(-x)-p(x)p(-x)+p(x)
PTFs for intersections of halfspaces
Now given weight W halfspaces h1, …, hk, sgn(Q(h1(x)) + … + Q(hk(x)) − (k−½))
is a rational function which sign-represents the intersection. Once taken to a common denominator, it has degree O(k log k log W).
Easy to get a polynomial: sgn(p/q)=sgn(pq).So we have a PTF for the intersection of k
weight-W halfspaces of degree O(k log k log W). Hence a learning algorithm running in time nO(k log k log W).
Talk outline
Plan for the talk:
1. Prove nO(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
2. Prove nO(k²/ε²) bound for learning functions
of k halfspaces under the uniform distribution.
Noise sensitivity
Let f : {+1,−1}n → {+1,−1} be a boolean function. Pick x {+1,−1}n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε.
defn: The noise sensitivity of f is:
NSε(f) = Pr[f(x) ≠ f(y)].
Noise sensitivity examples
• Let f be a projection to one bit,
f(x1, …, xn) = x1.
Then NSε(f) = ε.
• Suppose f depends on only k bits.Then NSε(f) ≤ k ε.
• PARITY is the most noise-sensitive function:
NSε(PARITYn) = ½ − ½(1−2ε)n.
Noise sensitivity – study and apps.
• [Benjamini-Kalai-Schramm-98] – percolation, low-level circuit complexity
• [Kahn-Kalai-Linial-88] – random walks on the hypercube
• [Håstad-97] – probabilistically checkable proofs• [Bshouty-Jackson-Tamon-99] – learning theory
under noise• [O-02] – Yao’s XOR Lemma, average case
hardness of NP• [Bourgain-02, Kindler-Safra-02, FKRSS-02] –
study of juntas, Fourier analysis of boolean fcns.
Low noise sens. fast learning
We show that if the noise sensitivity of all f in C is uniformly bounded:
NSε(f) ≤ α(ε),
then C is learnable under the uniform distribution in time:
nO(1)/α (ε/3).
Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.
−1
Proof of NS-learning connection
Actually, the intuition is wrong. Here is the proper proof sketch:
Low noise sensitivity Fourier spectrum concentrated at low levels; this uses the formula: NSε(f ) = ½−½ Σ(1−2ε)|S| f(S)2 and a Markovish inequality.
Low level Fourier concentration efficient uniform distribution learning; this is by the “Low degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93].
ˆ
Noise sensitivity of halfspaces
Function NSεproof
one boolean halfspace
O(√ε) Y. Peres, ’98
any function of k halfspaces
O(k√ε) union bound
read-once intersection of k halfspaces
O(√ε log k) difficult probabilistic
analysisread-once majority of k halfspaces
Õ((ε log k)¼)
Consequences
Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f C have NSε(f) ≤ α(ε).
α−1(ε/3) = O(ε2/k2).
Hence we get Theorem 1:
a uniform distribution learning algorithm running
in time nO(k²/ε²).
Noise sensitivity of a halfspace
We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε).
Suppose the halfspace is f = sgn(∑wi xi−θ). Without (much) loss of generality, one can assume θ = 0. Recall that xi’s are selected randomly from {+1,−1} and the sum is formed; then each xi is flipped indep. with prob. ε. We want to show that the prob. the sums land on opposite sides of 0 – call this a “flop”, prob. P – is O(√ε).
Noise sensitivity of a halfspace
With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.)
We now model the problem thus: Pick signs xi at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X1= ∑wi xi, X2= ∑wi xi, etc.
i=1…k i=k+1…2k
Noise sensitivity of a halfspace
Write S = X1 + … + Xn/k for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S' = S − X1, so the sum before flipping is S'+X1, and the sum after flipping is S'−X1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S'| < |X1|.
Noise sensitivity of a halfspace
sgn(X1) and S' are independent, so:
Pr[sgn(X1) ≠ sgn(S')] = ½.
sgn(X1) and |X1| are independent, so:
Pr[sgn(X1) ≠ sgn(S') | |S'| > |X1|] = ½
Pr[sgn(X1) ≠ sgn(S) | |S'| > |X1|] = ½Pr[sgn(X1) ≠ sgn(S) & no flop] = ½(1−P)Pr[sgn(X1) ≠ sgn(S)] = ½(1−P)P = 2 E[½ – I[sgn(X1) ≠ sgn(S)]].
Noise sensitivity of a halfspace
Of course, there was nothing special about block X1 as opposed to any other block. So in fact,
P = 2 E[½ – I[sgn(Xi) ≠ sgn(S)]].
for all i = 1…n/k.
Write τ=sgn(S), σi=sgn(Xi), and average:
P = 2 E[½ – (k/n) ∑i I[τ ≠ σi]].
Noise sensitivity of a halfspace
P = 2 E[½ – (k/n) ∑i I[τ ≠ σi]]
The quantity inside the expectation is some random variable, a number which is either ½ – (k/n) ∑i I[1 ≠ σi] or ½ – (k/n) ∑i I[−1 ≠ σi].
If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise:
P ≤ 2 E[|½ – (k/n) ∑i I[σi=1]| + |½ – (k/n) ∑i I[σi=−1]|].
Noise sensitivity of a halfspace
P ≤ 2 E[ |½ – ε ∑ I[σi=1]| + |½ – ε ∑ I[σi=−1]| ]
But the σi’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/ε samples of an unbiased 0/1 random variable – i.e., O(√ε).
i=1…1/ε i=1…1/ε
Extensions
This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows.
To get the extended learning algorithms, must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased – say, the probability of + is p < ½, then: NSε(h ) ≤ min{2p, C p (ε log(1/p))½}.
Talk outline
Plan for the talk:
1. Prove nO(k log k log W) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.
2. Prove nO(k²/ε²) bound for learning functions
of k halfspaces under the uniform distribution.
Open technical challenges
• Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces.(For a new lower bound, see my talk tomorrow!)
• Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)½)?