
Sparsity Constrained Minimization via Mathematical Programming with Equilibrium Constraints

Ganzhao Yuan∗   Bernard Ghanem†

Abstract

Sparsity constrained minimization captures a wide spectrum of applications in both machine learning and signal processing. This class of problems is difficult to solve since it is NP-hard and existing solutions are primarily based on Iterative Hard Thresholding (IHT). In this paper, we consider a class of continuous optimization techniques based on Mathematical Programs with Equilibrium Constraints (MPECs) to solve general sparsity constrained problems. Specifically, we reformulate the problem as an equivalent biconvex MPEC, which we can solve using an exact penalty method or an alternating direction method. We elaborate on the merits of both proposed methods and analyze their convergence properties. Finally, we demonstrate the effectiveness and versatility of our methods on several important problems, including feature selection, segmented regression, MRF optimization, trend filtering and impulse noise removal. Extensive experiments show that our MPEC-based methods outperform state-of-the-art techniques, especially those based on IHT.

Keywords: Sparsity Constrained Optimization, Exact Penalty Method, Alternating Direction Method, MPEC, Convergence Analysis

1. Introduction

In this paper, we mainly focus on the following generalized sparsity constrained minimization problem:

min_x f(x), s.t. ‖Ax‖0 ≤ k   (1)

where x ∈ R^n, A ∈ R^{m×n} and k (0 < k < m) is a positive integer. ‖·‖0 is a function that counts the number of nonzero elements in a vector. To guarantee convergence, we assume that f(·) is convex and L-Lipschitz continuous (but not necessarily smooth), that A has a right inverse (i.e. rank(A) = m), and that there always exist bounded solutions to Eq (1).

∗King Abdullah University of Science & Technology (KAUST). Email: [email protected].
†King Abdullah University of Science & Technology (KAUST). Email: [email protected].

The optimization in Eq (1) describes many applications of interest in both machine learning and signal processing, including compressive sensing [19], Dantzig selector [11], feature selection [51], trend filtering [38], image restoration [65, 18], sparse classification and boosting [59, 61], subspace clustering [21], sparse coding [40, 4], portfolio selection, image smoothing [62], sparse inverse covariance estimation [26, 67], blind source separation [41], permutation problems [25, 36], joint power and admission control [42], Potts statistical functional [58], to name a few. Moreover, we notice that many binary optimization problems [57] can be reformulated as an ℓ0 norm optimization problem, since x ∈ {−1, +1}^n ⇔ ‖x − 1‖0 + ‖x + 1‖0 ≤ n. Thus, the binary constraint can be rewritten in a more compact form as ‖Ax − b‖0 ≤ n with A = [I_n, I_n]^T ∈ R^{2n×n} and b = [−1^T, 1^T]^T ∈ R^{2n}.
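To make the binary-to-ℓ0 reformulation above concrete, the following is a minimal NumPy sketch (ours, not from the paper) that checks numerically that ‖Ax − b‖0 ≤ n holds exactly when x is a ±1 vector, using A = [I_n; I_n] and b = [−1; 1]:

```python
import numpy as np

n = 5
A = np.vstack([np.eye(n), np.eye(n)])          # A = [I_n; I_n] in R^{2n x n}
b = np.concatenate([-np.ones(n), np.ones(n)])  # b = [-1; 1] in R^{2n}

def l0(v, tol=1e-10):
    """Count nonzero entries (the l0 'norm'), up to a numerical tolerance."""
    return int(np.sum(np.abs(v) > tol))

x_binary = np.array([1., -1., 1., 1., -1.])    # a feasible +/-1 vector
x_other  = np.array([1., 0.5, 1., 1., -1.])    # violates the binary constraint

# For x in {-1,+1}^n each entry of Ax - b vanishes in exactly one of the two
# blocks, so ||Ax - b||_0 = n; otherwise the count exceeds n.
print(l0(A @ x_binary - b))   # -> 5  (= n)
print(l0(A @ x_other - b))    # -> 6  (> n)
```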

A popular method to solve Eq (1) is to consider its ℓ1 norm convex relaxation, which simply replaces the ℓ0 norm function¹ with the ℓ1 norm function. When A is the identity and f(·) is a least-squares objective function, it has been proven that when the sampling matrix in f(·) satisfies certain incoherence conditions and x is sparse at its optimal solution, the problem can be solved exactly by this convex method [12]. However, when these two assumptions do not hold, this convex method can be far from satisfactory [65, 42].

Recently, another breakthrough in sparse optimization is the multi-stage convex relaxation method [70, 13]. In the work of [70], the local solution obtained by this method is shown to be superior to the global solution of the standard ℓ1 convex method, which is used as its initialization. However, this type of method seeks an approximate solution to the sparse regularized problem with a smooth objective, and cannot solve the sparsity constrained problem² considered here. Our method uses the same convex initialization strategy, but it is applicable to general ℓ0 norm minimization problems.

Challenges and Contributions: We recognize three main challenges hindering existing work.

¹Strictly speaking, ‖·‖0 is not a norm (but a pseudo-norm) since it is not positive homogeneous.

²Some may solve the constrained problem by tuning the continuous parameter of the regularized problem, but this can be hard when m is large.


Table 1: ℓ0 norm optimization techniques (method and representative reference, with description).

greedy descent methods [46]: only for smooth (typically quadratic) objective
ℓ1 norm relaxation [12]: ‖x‖0 ≈ ‖x‖1
k-support norm relaxation [1]: ‖x‖0 ≈ ‖x‖k-sup ≜ max_{0<v≤1} (Σ_i x_i²/v_i)^{1/2}, s.t. ⟨v, 1⟩ ≤ k
k-largest norm relaxation [64]: ‖x‖0 ≈ ‖x‖k-lar ≜ max_{−1≤v≤1} ⟨v, x⟩, s.t. ‖v‖1 ≤ k
SOCP convex relaxation [14]: ‖x‖0 ≤ k ⇒ ‖x‖1 ≤ √k ‖x‖2
SDP convex relaxation [14]: ‖x‖0 ≤ k ⇒ X = xxᵀ, ‖X‖1 ≤ k tr(X)
Schatten ℓp approximation [29]: ‖x‖0 ≈ ‖x‖p
re-weighted ℓ1 approximation [13]: ‖x‖0 ≈ ⟨1, ln(|x| + ε)⟩
ℓ1-2 DC approximation [63]: ‖x‖0 ≈ ‖x‖1 − ‖x‖2
0-1 mixed integer programming [8]: ‖x‖0 : ‖x‖∞ ≤ λ ⇔ min_{v∈{0,1}^n} ⟨1, v⟩ : |x| ≤ λv
iterative hard thresholding [44]: 0.5‖x − x′‖₂², s.t. ‖x‖0 ≤ k
non-separable MPEC [This paper]: ‖x‖0 = min_u ‖u‖1, s.t. ‖x‖1 = ⟨x, u⟩, −1 ≤ u ≤ 1
separable MPEC [65] [This paper]: ‖x‖0 = min_v ⟨1, 1 − v⟩, s.t. |x| ⊙ v = 0, 0 ≤ v ≤ 1

(a) The general problem in Eq (1) is NP-hard. There is little hope of finding the global minimum efficiently in all instances. In order to deal with this issue, we reformulate the ℓ0 norm minimization problem as an equivalent augmented optimization problem with a bilinear equality constraint, using a variational characterization of the ℓ0 norm function. Then, we propose two penalization/regularization methods (exact penalty and alternating direction) to solve it. The resulting algorithms seek a desirable exact solution to the original optimization problem. (b) Many existing convergence results for non-convex ℓ0 minimization problems tend to be limited to unconstrained problems or inapplicable to constrained optimization. We carefully analyze both our proposed algorithms and prove that they always converge to a first-order KKT point. To the best of our knowledge, this is the first attempt to solve general ℓ0 constrained minimization with guaranteed convergence. (c) Exact methods (e.g. methods using IHT) can produce unsatisfactory results, while approximation methods such as the Schatten ℓp norm method [29] and re-weighted ℓ1 [13] fail to control the sparsity of the solution. In comparison, experimental results show that our MPEC-based methods are more effective than the state-of-the-art.

Organization and Notations: This paper is organized as follows. Section 2 provides a brief description of the related work. Section 3 and Section 4 present our proposed exact penalty method and alternating direction method optimization frameworks, respectively. Section 5 discusses some features of the two proposed algorithms. Section 6 summarizes the experimental results. Finally, Section 7 concludes this paper. Throughout this paper, we use lowercase and uppercase boldfaced letters to denote real vectors and matrices, respectively. We use ⟨x, y⟩ and x ⊙ y to denote the Euclidean inner product and elementwise product between x and y. '≜' means define. σ(A) denotes the smallest singular value of A.

2. Related Work

There are mainly four classes of ℓ0 norm minimization algorithms in the literature: (i) greedy descent methods, (ii) convex approximate methods, (iii) non-convex approximate methods, and (iv) exact methods. We summarize the main existing algorithms in Table 1.

Greedy descent methods. They have a monotonically decreasing property and optimality guarantees in some cases [55], but they are limited to solving problems with smooth objective functions (typically the square function). (a) Matching pursuit [46] selects at each step the atom of the variable that is most correlated with the residual. (b) Orthogonal matching pursuit [54] uses a similar strategy but also creates an orthonormal set of atoms to ensure that selected components are not introduced in subsequent steps. (c) Gradient pursuit [9] is similar to matching pursuit, but it updates the sparse solution vector at each iteration with a directional update computed from gradient information. (d) Gradient support pursuit [3] iteratively chooses the index with the largest magnitude as the pursuit direction while maintaining the stable restricted Hessian property. (e) Other greedy methods have been proposed, including basis pursuit [15], regularized orthogonal matching pursuit [50], compressive sampling matching pursuit [49], and the forward-backward greedy method [71, 53].

Convex approximate methods. They seek convex approximate reformulations of the ℓ0 norm function. (a) The ℓ1 norm convex relaxation [12] provides a convex lower bound of the ℓ0 norm function on the unit ℓ∞ norm ball. It has been proven that, under certain incoherence assumptions, this method leads to a near-optimal sparse solution.


However, such assumptions may be violated in real applications. (b) The k-support norm provides the tightest convex relaxation of sparsity combined with an ℓ2 penalty. Moreover, it is tighter than the elastic net by exactly a factor of √2 [1, 48]. (c) The k-largest norm provides the tightest convex Boolean relaxation in the sense of a minimax game [64, 52, 16]. (d) Other convex Second-Order Cone Programming (SOCP) relaxations and Semi-Definite Programming (SDP) relaxations [14] have been considered in the literature.

Non-convex approximate methods. They seek non-convex approximate reformulations of the ℓ0 norm function. (a) The Schatten ℓp norm with p ∈ (0, 1) was considered by [29] to approximate the discrete ℓ0 norm function. It results in a locally gradient-Lipschitz-continuous function, to which some smooth optimization algorithms can be applied. (b) The re-weighted ℓ1 norm [13] iteratively minimizes the first-order Taylor series expansion of the objective function to find a local minimum. It is expected to refine the ℓ0 regularized problem, since its first iteration is equivalent to solving the ℓ1 norm problem. (c) The ℓ1-2 DC (difference of convex) approximation [63] was considered for sparse recovery. Exact stable sparse recovery error bounds were established under a restricted isometry property condition. (d) Other non-convex surrogate functions of the ℓ0 function have been proposed, including Smoothly Clipped Absolute Deviation (SCAD) [23], Logarithm [27], Minimax Concave Penalty (MCP) [69], Capped ℓ1 [70], Exponential-Type [28], Half-Quadratic [30], Laplace [56], and MC+ [47]. Please refer to [43, 31] for a detailed summary and discussion.

Exact methods. Despite the success of approximate methods, they are unappealing in cases when an exact solution is required. Therefore, many researchers have sought exact reformulations of the ℓ0 norm function. (a) 0-1 mixed integer programming [8, 6] assumes the solution has bound constraints. It can be solved by a tailored branch-and-bound algorithm, where the cardinality constraint is replaced by a surrogate constraint. (b) Hard thresholding [44, 5, 68, 35] iteratively sets the smallest (in magnitude) values to zero in a gradient descent format. It has been incorporated into the Quadratic Penalty Method (QPM) [44] and the Mean Doubly Alternating Direction Method (MD-ADM) [18]. However, we found they often converge to unsatisfactory results in practice. (c) The MPEC reformulation [65] considers the complementary system of the ℓ0 norm problem by introducing additional dual variables and minimizing the complementarity error of the MPEC problem. However, it is limited to ℓ0 norm regularized problems and its convergence results are weak.

From the above, we observe that existing methods either produce approximate solutions or are limited to smooth objectives f(·) with A = I. The only existing general-purpose exact methods for solving Eq (1) are the quadratic penalty method [44] and the mean doubly alternating direction method [18]. However, they often produce unsatisfactory results in practice. The unappealing shortcomings of existing solutions and renewed interest in MPECs [65, 24] motivate us to design new MPEC-based algorithms and convergence results for sparsity constrained minimization.

3. Exact Penalty Method

In this section, we present an exact penalty method [45, 34, 37, 7] for solving the problem in Eq (1). This method is based on a new equivalent non-separable MPEC reformulation of the ℓ0 norm function.

3.1. Non-Separable MPEC Reformulation

First of all, we present a new non-separable MPEC reformulation.

Lemma 1. For any given x ∈ R^n, it holds that

‖x‖0 = min_{−1≤u≤1} ‖u‖1, s.t. ‖x‖1 = ⟨u, x⟩,   (2)

and u* = sign(x) is the unique optimal solution of Eq (2). Here, the standard signum function sign is applied componentwise, and sign(0) = 0.

Proof. Refer to Appendix A.
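As a quick sanity check of Lemma 1 (a hypothetical NumPy sketch, not part of the paper), one can verify that u = sign(x) is feasible for Eq (2) and that its ℓ1 norm equals ‖x‖0:

```python
import numpy as np

x = np.array([0.0, -1.3, 0.0, 2.7, 0.4])

u_star = np.sign(x)                       # candidate solution of Eq (2); sign(0) = 0
l0_x   = int(np.sum(x != 0))              # ||x||_0

# Feasibility of the equilibrium constraint ||x||_1 = <u, x> and the bound -1 <= u <= 1.
assert np.isclose(np.abs(x).sum(), u_star @ x)
assert np.all(np.abs(u_star) <= 1)

# The objective value ||u*||_1 recovers the l0 norm of x.
print(np.abs(u_star).sum(), l0_x)         # -> 3.0 3
```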

Using Lemma 1, we can rewrite Eq (1) in an equivalentform as follows.

min_{x,u} f(x) + I(u), s.t. ‖Ax‖1 = ⟨Ax, u⟩   (3)

where I(u) is an indicator function on Θ with

I(u) = 0 if u ∈ Θ, and I(u) = ∞ if u ∉ Θ, where Θ ≜ {u | ‖u‖1 ≤ k, −1 ≤ u ≤ 1}.

3.2. Proposed Optimization Framework

We now give a detailed description of our solution algorithm for the optimization in Eq (3). Our solution is based on the exact penalty method, which penalizes the complementarity error directly by a penalty function. The resulting objective J_ρ : R^n × R^m → R is defined in Eq (4), where ρ is the tradeoff penalty parameter that is iteratively increased to enforce the constraints.

J_ρ(x, u) ≜ f(x) + I(u) + ρ(‖Ax‖1 − ⟨Ax, u⟩)   (4)

In each iteration, ρ is fixed and we use a proximal point method [2, 10] to minimize over x and u in an alternating fashion. Details of this exact penalty method are given in Algorithm 1. Note that the parameter T is the number of inner iterations for solving the bi-convex problem. We make the following observations about the algorithm.


(a) Initialization. We initialize u^0 to 0. This is for the sake of finding a reasonable local minimum in the first iteration, as it reduces to a convex ℓ1 norm minimization problem for the x-subproblem.
(b) Exact property. The key feature of this method is the boundedness of the penalty parameter ρ (see Theorem 1). Therefore, we terminate the optimization when the threshold is reached (see Eq (9)). This distinguishes it from the quadratic penalty method [44], where the penalty may become arbitrarily large for non-convex problems.
(c) u-Subproblem. Variable u in Eq (8) is updated by solving the following problem:

u^{t+1} = arg min_{−1≤u≤1} ½‖u − z‖₂², s.t. ‖u‖1 ≤ k,   (5)

where z = ρAx^{t+1}/µ and µ is a non-negative proximal constant. Due to the symmetry in the objective and constraints, we can without loss of generality assume that z ≥ 0, and flip the signs of the resulting solution at the end. As such, Eq (5) can be solved by

u^{t+1} = arg min_{0≤u≤1} ½‖u − |z|‖₂², s.t. ⟨u, 1⟩ ≤ k,   (6)

where the solution of Eq (5) is recovered via u^{t+1} = sign(z) ⊙ u^{t+1}. Eq (6) can be solved exactly in n log(n) time using the breakpoint search algorithm [33]. Note that this algorithm includes the simplex projection [20] as a special case. For completeness, we also include a Matlab code in Appendix D; a simple sketch of this projection is given after these observations.
(d) x-Subproblem. Variable x in Eq (7) is updated by solving an ℓ1 norm convex problem which has no closed-form solution. However, it can be solved using existing ℓ1 norm optimization solvers such as classical or linearized ADM [32] with convergence rate O(1/t).
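The following is a minimal NumPy sketch (ours, not the authors' Appendix D Matlab code) of the projection in Eq (6), i.e. the Euclidean projection of |z| onto {u : 0 ≤ u ≤ 1, ⟨u, 1⟩ ≤ k}. It uses bisection on the Lagrange multiplier of the sum constraint rather than the exact breakpoint search of [33], and returns the same minimizer up to the bisection tolerance.

```python
import numpy as np

def project_capped_simplex(z, k, iters=60):
    """Solve  min_u 0.5*||u - z||^2  s.t.  0 <= u <= 1,  sum(u) <= k.

    KKT conditions give u(theta) = clip(z - theta, 0, 1) for some multiplier
    theta >= 0; sum(u(theta)) is non-increasing in theta, so bisection applies.
    """
    u = np.clip(z, 0.0, 1.0)
    if u.sum() <= k:                     # sum constraint inactive: plain box projection
        return u
    lo, hi = 0.0, z.max()                # theta = z.max() drives every entry to zero
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if np.clip(z - theta, 0.0, 1.0).sum() > k:
            lo = theta
        else:
            hi = theta
    return np.clip(z - hi, 0.0, 1.0)

# Recovering the solution of Eq (5) from Eq (6), as described above:
z = np.array([0.9, -2.0, 0.1, 1.5])
u = np.sign(z) * project_capped_simplex(np.abs(z), k=2)
print(u, np.abs(u).sum())                # feasible for ||u||_1 <= 2, -1 <= u <= 1
```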

3.3. Theoretical Analysis

In Theorem 1, we show that when the penalty parameter ρ is larger than a threshold, the biconvex objective function in Eq (4) is equivalent to the original constrained MPEC problem in Eq (3). Moreover, Theorem 2 characterizes the convergence rate of our exact penalty method and determines the number of iterations beyond which a certain accuracy is guaranteed.

Theorem 1. Exactness of the Penalty Function. The penalty problem min_{x,u} J_ρ(x, u) admits the same local and global optimal solutions as the original problem in Eq (3) when ρ > L/σ(A).

Proof. Refer to Appendix B.

Algorithm 1 MPEC-EPM: An Exact Penalty Method for Solving MPEC Problem (3)
(S.0) Initialize x^0 = 0 ∈ R^n, u^0 = 0 ∈ R^m, ρ^0 > 0. Set t = 0 and µ > 0.
(S.1) Solve the following x-subproblem:

x^{t+1} = arg min_x J_ρ(x, u^t) + (µ/2)‖x − x^t‖₂²   (7)

(S.2) Solve the following u-subproblem:

u^{t+1} = arg min_u J_ρ(x^{t+1}, u) + (µ/2)‖u − u^t‖₂²   (8)

(S.3) Update the penalty in every T iterations:

ρ^{t+1} = min(L/σ(A), 2ρ^t)   (9)

(S.4) Set t := t + 1 and then go to Step (S.1).

Theorem 2. Convergence rate of Algorithm 1. Assume that ‖x^t‖ ≤ δ for all t. Algorithm 1 will converge to a first-order KKT point in at most ⌈(ln(Lδ) − ln(ερ^0))/ln 2⌉ outer iterations³ with accuracy at least ‖Ax‖1 − ⟨Ax, u⟩ ≤ ε.

Proof. Refer to Appendix B.
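To make the structure of Algorithm 1 concrete, here is a hypothetical NumPy sketch of the MPEC-EPM loop (ours, not the authors' Matlab implementation). It assumes a user-supplied routine solve_x_subproblem for the ℓ1-type x-subproblem in Eq (7) (e.g. a classical or linearized ADM solver [32]), reuses the project_capped_simplex routine sketched after the u-subproblem above, and stops on the complementarity error ‖Ax‖1 − ⟨Ax, u⟩ from Theorem 2.

```python
import numpy as np

def mpec_epm(solve_x_subproblem, A, k, L, rho0=0.1, mu=1.0, T=10, max_outer=30, tol=1e-6):
    """Illustrative sketch of Algorithm 1 (MPEC-EPM); not the authors' implementation.

    solve_x_subproblem(u, rho, x_prev) must return a minimizer of Eq (7), i.e.
    argmin_x f(x) + rho*(||Ax||_1 - <Ax, u>) + (mu/2)*||x - x_prev||_2^2.
    """
    m, n = A.shape
    sigma = np.linalg.svd(A, compute_uv=False).min()   # smallest singular value of A
    rho_max = L / sigma
    x, u, rho = np.zeros(n), np.zeros(m), rho0         # (S.0) initialization
    for _ in range(max_outer):
        for _ in range(T):                             # T inner alternating iterations
            x = solve_x_subproblem(u, rho, x)          # (S.1) x-subproblem, Eq (7)
            z = u + rho * (A @ x) / mu                 # (S.2) Eq (8) reduces to a projection
            u = np.sign(z) * project_capped_simplex(np.abs(z), k)
        rho = min(rho_max, 2.0 * rho)                  # (S.3) penalty update, Eq (9)
        if np.abs(A @ x).sum() - (A @ x) @ u < tol:    # complementarity error small: stop
            break
    return x
```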

4. Alternating Direction Method

This section presents a proximal alternating direction method (PADM) for solving Eq (1). This is mainly motivated by the recent popularity of ADM in the non-convex optimization literature. One direct solution is to apply ADM to the MPEC problem in Eq (3). However, this strategy may not be appealing, since it introduces a non-separable structure with no closed-form solution for its sub-problems. Instead, we consider a separable MPEC reformulation used in [65, 66].

4.1. Separable MPEC Reformulations

Lemma 2. [65] For any given x ∈ R^n, it holds that

‖x‖0 = min_{0≤v≤1} ⟨1, 1 − v⟩, s.t. v ⊙ |x| = 0,   (10)

and v* = 1 − sign(|x|) is the unique solution of Eq (10).
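Analogously to the check for Lemma 1, a small hypothetical NumPy snippet can confirm the separable characterization in Eq (10): v = 1 − sign(|x|) satisfies the complementarity constraint and ⟨1, 1 − v⟩ equals ‖x‖0.

```python
import numpy as np

x = np.array([0.0, -1.3, 0.0, 2.7, 0.4])

v_star = 1.0 - np.sign(np.abs(x))        # 1 on the zero entries of x, 0 elsewhere
assert np.allclose(v_star * np.abs(x), 0.0)         # v (elementwise*) |x| = 0
print(np.sum(1.0 - v_star), int(np.sum(x != 0)))    # -> 3.0 3, i.e. <1, 1-v> = ||x||_0
```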

Using Lemma 2, we can rewrite Eq (1) in an equivalentform as follows.

min_{x,v} f(x) + I(v), s.t. |Ax| ⊙ v = 0   (11)

where I(v) is an indicator function on Ω with

I(v) = 0 if v ∈ Ω, and I(v) = ∞ if v ∉ Ω, where Ω ≜ {v | ⟨1, 1 − v⟩ ≤ k, 0 ≤ v ≤ 1}.

³Every time we increase ρ, we call it one outer iteration.


4.2. Proposed Optimization Framework

To solve Eq (11) using PADM, we form the augmented Lagrangian function L(·) in Eq (12),

L(x, v, π) ≜ f(x) + I(v) + ⟨|Ax| ⊙ v, π⟩ + (α/2)‖|Ax| ⊙ v‖²   (12)

where π is the Lagrange multiplier associated with the constraint |Ax| ⊙ v = 0, and α > 0 is the penalty parameter. We detail the PADM iteration steps for Eq (11) in Algorithm 2, which has the following properties.
(a) Initialization. We set v^0 = 1 and π^0 = η1, where η is a small parameter. This finds a reasonable local minimum in the first iteration, as it reduces to an ℓ1 norm minimization problem for the x-subproblem.
(b) Monotone and boundedness property. For any feasible solution v in Eq (15), it holds that |Ax| ⊙ v ≥ 0. Using the fact that α > 0 and due to the π^t update rule, π^t is monotonically increasing. If we initialize π^0 > 0 in the first iteration, π is always positive. Another key feature of this method is the boundedness of the multiplier π (see Theorem 3). The method then reduces to an alternating minimization algorithm for a biconvex optimization problem [2, 10].
(c) v-Subproblem. Variable v in Eq (15) is updated by solving the following problem:

v^{t+1} = arg min_v ½ vᵀDv + ⟨v, b⟩, s.t. v ∈ Ω   (13)

where b ≜ π^t ⊙ |Ax^{t+1}| − µv^t and D is a diagonal matrix with d ≜ α|Ax^{t+1}| ⊙ |Ax^{t+1}| + µ on the main diagonal. Introducing the proximal term in the v-subproblem leads to a strongly convex problem. Thus, it can be solved exactly using [33]; a simple sketch is given after these observations.
(d) x-Subproblem. Variable x in Eq (14) is updated by solving a re-weighted ℓ1 norm optimization problem. Similar to Eq (7), it can be solved using classical/linearized ADM.
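Since D is diagonal, Eq (13) is a separable quadratic over Ω and can also be solved by bisection on the multiplier of the constraint ⟨1, 1 − v⟩ ≤ k (equivalently sum(v) ≥ m − k). The following NumPy sketch is our own illustration under that observation; it is not the breakpoint search of [33], but returns the same minimizer up to the bisection tolerance. The numerical values of α, µ, π^t and |Ax^{t+1}| below are made-up placeholders.

```python
import numpy as np

def solve_v_subproblem(d, b, k, iters=60):
    """Solve  min_v 0.5*v'Dv + <v, b>  s.t.  0 <= v <= 1,  <1, 1-v> <= k,
    with D = diag(d), d > 0.  The sparsity constraint reads sum(v) >= m - k.

    For a multiplier lam >= 0 on (m - k) - sum(v) <= 0, the minimizer is
    v_i(lam) = clip((lam - b_i)/d_i, 0, 1); sum(v(lam)) is non-decreasing in lam,
    so bisection on lam applies.
    """
    m = d.size
    target = m - k
    v = np.clip(-b / d, 0.0, 1.0)            # lam = 0: constraint possibly inactive
    if v.sum() >= target:
        return v
    lo, hi = 0.0, float(np.max(b + d))       # at lam = max(b + d), every v_i equals 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.clip((lam - b) / d, 0.0, 1.0).sum() < target:
            lo = lam
        else:
            hi = lam
    return np.clip((hi - b) / d, 0.0, 1.0)

# Example with the quantities defined above (placeholder values):
alpha, mu, k = 1.0, 0.5, 2
Ax_abs = np.array([0.0, 1.2, 0.0, 0.7, 2.3])
pi_t   = 0.1 * np.ones(5)
v_t    = np.ones(5)
d = alpha * Ax_abs * Ax_abs + mu
b = pi_t * Ax_abs - mu * v_t
print(solve_v_subproblem(d, b, k))
```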

4.3. Theoretical Analysis

In Theorem 3, we show that when the monotonically increasing multiplier π is larger than a threshold, the biconvex objective function in Eq (12) is equivalent to the original constrained MPEC problem in Eq (11). Moreover, based on the Kurdyka-Łojasiewicz inequality [2, 10], Theorem 4 establishes the convergence of the alternating direction method.

Theorem 3. Boundedness of multiplier and exactness of the augmented Lagrangian function. The augmented Lagrangian minimization problem min_{x,v} L(x, v, π) admits the same local and global optimal solutions as the original problem in Eq (11) when π > L/σ(A).

Proof. Refer to Appendix C.

Algorithm 2 MPEC-ADM: An Alternating Direction Method for Solving MPEC Problem (11)
(S.0) Initialize x^0 = 0 ∈ R^n, v^0 = 1 ∈ R^m, π^0 = η·1 ∈ R^m. Set t = 0 and µ > 0.
(S.1) Solve the following x-subproblem:

x^{t+1} = arg min_x L(x, v^t, π^t) + (µ/2)‖x − x^t‖²   (14)

(S.2) Solve the following v-subproblem:

v^{t+1} = arg min_v L(x^{t+1}, v, π^t) + (µ/2)‖v − v^t‖²   (15)

(S.3) Update the Lagrange multiplier:

π^{t+1} = π^t + α(|Ax^{t+1}| ⊙ v^{t+1})   (16)

(S.4) Set t := t + 1 and then go to Step (S.1).

Theorem 4. Convergence of Algorithm 2. Assume that x^t is bounded for all t. As t → ∞, Algorithm 2 converges to a first-order KKT point of the reformulated MPEC problem in Eq (11).

Proof. Refer to Appendix C.

5. On MPEC Optimization

In this paper, MPEC reformulations are considered to solve the ℓ0 norm problem. Mathematical programs with equilibrium constraints (MPECs) are optimization problems where the constraints include variational inequalities or complementarities. They are related to the Stackelberg game and are used in the study of economic equilibrium and engineering design. MPECs are difficult to deal with because their feasible region is not necessarily convex or even connected.

General MPEC motivation. The basic idea behind MPEC-based methods is to transform the thin, nonsmooth and nonconvex feasible region into a thick and smooth one by introducing regularization/penalization on the complementary/equilibrium constraints. In fact, several other nonlinear methods solve the MPEC problem with similar motivations. For example, an exact penalty method where the complementarity term is moved to the objective in the form of an ℓ1-penalty was proposed in [34]. The work in [22] suggested a smoothing family by replacing the complementarity system with the perturbed Fischer-Burmeister function. Also, a quadratic regularization technique to handle the complementarity constraints was proposed in [17].

In the defense of our methods. We propose two different penalization/regularization schemes to solve the equivalent MPECs. They have several merits. (a) Each reformulation is equivalent to the original ℓ0 norm problem. (b) They are continuous reformulations, so they facilitate KKT analysis and are amenable to the use of existing continuous optimization techniques to solve the convex subproblems. We argue that, from a practical viewpoint, improved solutions to Eq (1) can be obtained by reformulating the problem using MPEC and focusing on the complementarity constraints. (c) They find a good initialization, because they both reduce to a convex relaxation method in the first iteration. (d) Both methods exhibit strong convergence guarantees; this is because they essentially reduce to alternating minimization for bi-convex optimization problems [2, 10]. (e) They have a monotone/greedy property owing to the complementarity constraints brought on by MPEC. The complementary system characterizes the optimality of the KKT solution. We let w ≜ (x, u) (or w ≜ (x, v)). Our solution directly handles the complementary system of Eq (1), which takes the following form: ⟨p(w), q(w)⟩ = 0, p(w) ≥ 0, q(w) ≥ 0. Here p(·) and q(·) are non-negative mappings of w which are orthogonal to each other. The complementarity constraints enable all the special properties of MPEC that distinguish it from general nonlinear optimization. We penalize the complementarity error ε ≜ ⟨p(w), q(w)⟩ (which is always non-negative) and ensure that the error ε is decreasing in every iteration. See Figure 1 for a geometric interpretation of MPEC optimization.

MPEC-EPM vs. MPEC-ADM. We consider two algorithms based on two variational characterizations of the ℓ0 norm function (separable and non-separable MPEC⁴). Generally speaking, both methods have their own merits. (a) MPEC-EPM is simpler since it can directly use existing ℓ1 norm minimization solvers/codes, while MPEC-ADM involves additional computation for the quadratic term in the augmented Lagrangian function. (b) MPEC-EPM is less adaptive in its per-iteration optimization, since the penalty parameter ρ is monotonically increased until a threshold is reached. In comparison, MPEC-ADM is more adaptive, since a constant penalty also guarantees monotonically non-decreasing multipliers and convergence. (c) Regarding numerical robustness, while MPEC-EPM needs to solve the subproblem with certain accuracy, MPEC-ADM is more robust, since the dual Lagrangian function provides a support function and the solution never degenerates even if the x-subproblem is solved only approximately. (d) For matrix extensions, one can extend our results to the matrix rank problem via the non-separable MPEC⁵ and the separable MPEC⁶. The former reformulation of the rank function is more interesting, since it results in a convex function with respect to X.

⁴Note that the non-separable MPEC has one equilibrium constraint (‖x‖1 = ⟨x, u⟩) and the separable MPEC has n equilibrium constraints (|x| ⊙ v = 0). The terms separable and non-separable refer to whether the constraints can be decomposed into independent components.
⁵rank(X) = min_{‖U‖≤1} ‖U‖∗, s.t. ‖X‖∗ = ⟨X, U⟩.
⁶rank(X) = min_{0≤v≤1} ⟨1, 1 − v⟩, s.t. σ(X) ⊙ v = 0.

Figure 1: Effect of regularizing/smoothing the original optimization problem. (The plot shows the curve ⟨p(w), q(w)⟩ = ε in the plane spanned by p_i(w) and q_i(w).)

6. Experimental Validation

In this section, we demonstrate the effectiveness of our algorithms on five ℓ0 norm optimization tasks, namely feature selection, segmented regression, MRF optimization, trend filtering, and impulse noise removal. We compare against the following six methods: (i) convex ℓ1 minimization method (CVX)⁷, (ii) Direct Alternating Direction Method (DI-ADM)⁸, (iii) Mean Doubly Alternating Direction Method (MD-ADM) [18]⁹, (iv) Quadratic Penalty Method (QPM) [44], (v) MPEC-EPM (Algorithm 1) and (vi) MPEC-ADM (Algorithm 2). We vary the sparsity parameter k ∈ {0.01, 0.06, ..., 0.96} × n on different data sets and report the objective value of the optimization problem¹⁰. All algorithms are implemented in Matlab on an Intel 2.6 GHz CPU with 8 GB RAM.

⁷Since this approximate method cannot control the sparsity of the solution, we solve an ℓ1 regularized problem where the regularization parameter is swept over {2^{−10}, 2^{−8}, ..., 2^{10}}. Finally, the solution that leads to the smallest objective value after a hard thresholding projection (which reduces to setting the smallest n − k values of the solution in magnitude to 0) is selected.
⁸This is an ADM directly applied to Eq (1).
⁹This is an ADM that treats arithmetic means of the solution sequence as the actual output.
¹⁰All the algorithms generally terminate within 8 minutes, so we do not include the CPU time here. In fact, our methods converge within reasonable time. They are only 3-7 times slower than the classical convex methods. This is expected, since they are alternating methods and have the same computational complexity as the convex methods.


Figure 2: Feature selection based on logistic loss minimization. We set λ = 0.01. (Each of the twelve panels plots the objective value against the sparsity level on one dataset: (a) 'a6a', (b) 'gisette', (c) 'w2a', (d) 'rcv1binary', (e) 'a5a', (f) 'realsim', (g) 'mushrooms', (h) 'splice', (i) 'madelon', (j) 'w1a', (k) 'duke', (l) 'mnist'; the compared methods are CVX, GSP, GREEDY, QPM, DI-ADM, MD-ADM, MPEC-EPM and MPEC-ADM.)


Figure 3: Feature selection based on hinge loss minimization. We set λ = 0.01. (Each of the twelve panels plots the objective value against the sparsity level on one dataset: (a) 'a6a', (b) 'gisette', (c) 'w2a', (d) 'rcv1binary', (e) 'a5a', (f) 'realsim', (g) 'mushrooms', (h) 'splice', (i) 'madelon', (j) 'w1a', (k) 'duke', (l) 'mnist'; the compared methods are CVX, QPM, NI-ADM, MD-ADM, MPEC-EPM and MPEC-ADM.)


Figure 4: Segmented regression. Results for unit-column-norm A for different parameter settings. (Each panel plots the objective value against the sparsity level: (a) n = 1024, σ = 10; (b) n = 2048, σ = 10; (c) n = 4096, σ = 10; (d) n = 1024, σ = 5; (e) n = 2048, σ = 5; (f) n = 4096, σ = 5; the compared methods are CVX, DI-ADM, MD-ADM, QPM, MPEC-EPM and MPEC-ADM.)

6.1. Feature Selection

Given a set of labeled patterns {s_i, y_i}_{i=1}^p, where s_i is an instance with n features and y_i ∈ {±1} is the label, feature selection considers the following optimization problem:

min_x (λ/2)‖x‖₂² + Σ_{i=1}^p ℓ(⟨s_i, x⟩, y_i), s.t. ‖x‖0 ≤ k   (17)

where the loss function ℓ(·) is chosen to be the logistic loss, ℓ(r, t) = log(1 + exp(r)) − rt, or the hinge loss, ℓ(r, t) = max(0, 1 − rt).
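As an illustration of Eq (17), the following hypothetical NumPy sketch (with synthetic data, not the paper's Matlab benchmark code) evaluates the objective of a candidate x under the logistic or hinge loss and checks the ℓ0 constraint:

```python
import numpy as np

def feature_selection_objective(x, S, y, lam, loss="logistic"):
    """Evaluate Eq (17): (lam/2)*||x||_2^2 + sum_i loss(<s_i, x>, y_i)."""
    r = S @ x                                   # <s_i, x> for all instances
    if loss == "logistic":
        vals = np.log1p(np.exp(r)) - r * y      # log(1 + exp(r)) - r*t
    else:                                       # hinge loss
        vals = np.maximum(0.0, 1.0 - r * y)
    return 0.5 * lam * x @ x + vals.sum()

rng = np.random.default_rng(0)
S = rng.standard_normal((100, 20))              # p = 100 instances, n = 20 features
y = np.sign(S[:, 0] + 0.1 * rng.standard_normal(100))
x = np.zeros(20); x[[0, 3, 7]] = [1.0, -0.5, 0.2]   # a candidate with ||x||_0 = 3

k = 5
print(feature_selection_objective(x, S, y, lam=0.01))
print(int(np.sum(x != 0)) <= k)                 # check the constraint ||x||_0 <= k
```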

In our experiments, we test on 12 well-known benchmark datasets¹¹ which contain high dimensional (n ≥ 7000) data vectors and thousands of data instances. Since we are solving an optimization problem, we only consider measuring the quality of the solutions by comparing their objective values.

For logistic loss feature selection, since the objective is smooth, we also include comparisons with the state-of-the-art solver Gradient Support Pursuit (GSP)¹² and greedy hard thresholding (GREEDY). Based on our experimental results in Figure 2, we make the following observations. (i) Convex methods generally give good results in this task, but they fail on the 'duke' and 'madelon' datasets. (ii) The quadratic penalization techniques (NI-ADM, MD-ADM and QPM) generally demonstrate bad performance in this task. (iii) GREEDY and GSP often achieve good performance, but they are not stable in some cases. (iv) Both our MPEC-EPM and MPEC-ADM often achieve better performance than existing solutions. For hinge loss feature selection, we show our experimental results in Figure 3. We make the following observations. (i) Convex methods generally fail in this situation, as they usually produce much larger objectives. (ii) NI-ADM and MD-ADM outperform QPM in all test cases. (iii) MPEC-EPM and MPEC-ADM generally achieve lower objective values.

¹¹www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
¹²users.ece.gatech.edu/sbahmani7/GraSP.html

6.2. Segmented Regression

Given a design matrix A ∈ R^{m×n} and an observation vector b ∈ R^m, segmented regression involves solving the following optimization problem:

min_x ‖Aᵀ(Ax − b)‖∞, s.t. ‖x‖0 ≤ k   (18)

This regression model is closely related to the Dantzig selector [11] in the literature, where the ℓ0 norm is replaced by the ℓ1 norm in Eq (18). In our experiments, we consider design matrices with unit column norms. Similar to [11], we first generate an A with independent Gaussian entries and then normalize each column to have norm 1. We then select a support set


Figure 5: MRF optimization on 'cat' image. The objective value for the unary image b is f(b) = 96439.8. (Panel results: (a) by Graph Cut, f(x) = 95741.1; (b) by NI-ADM, f(x) = 95852.6; (c) by MD-ADM, f(x) = 95825.2; (d) by QPM, f(x) = 95827.3; (e) by MPEC-EPM, f(x) = 95798.8; (f) by MPEC-ADM, f(x) = 95748.0.)

Figure 6: Trend filtering result on 'snp500.dat' data [38]. (The five stacked panels show the results of NI-ADM, MD-ADM, QPM, MPEC-EPM and MPEC-ADM.) The objective values achieved by the five methods are 0.38, 0.38, 0.53, 0.32, 0.33, respectively.

of size 0.5 × min(m, n) uniformly at random. We finally set b = Ax + ε with ε ∼ N(0, σ²I). We vary n over {1024, 2048, 4096} and set m = n/8. We demonstrate our experimental results in Figure 4. The proposed penalization techniques (MPEC-EPM and MPEC-ADM) outperform the quadratic penalization techniques (NI-ADM, MD-ADM and QPM) consistently.

Table 2: Random-value impulse noise removal problems. The results separated by '/' are SNR0, SNR1 and SNR2, respectively.

Img. \ Alg. | CVX (i.e. ℓ1TV) | DI-ADM | MD-ADM | QPM | MPEC-EPM | MPEC-ADM
mandrill+10% | 0.83/3.06/5.00 | 0.78/3.68/7.06 | 0.84/3.71/7.10 | 0.85/3.76/7.16 | 0.85/3.71/6.90 | 0.86/3.87/7.64
mandrill+30% | 0.69/2.38/4.42 | 0.67/2.96/5.13 | 0.74/2.91/4.96 | 0.75/2.94/5.02 | 0.75/2.89/4.87 | 0.76/3.02/5.13
mandrill+50% | 0.55/1.80/3.33 | 0.58/2.32/3.86 | 0.65/2.23/3.64 | 0.66/2.30/3.79 | 0.66/2.22/3.58 | 0.67/2.39/3.88
mandrill+70% | 0.41/0.91/1.64 | 0.47/1.49/2.42 | 0.52/1.39/2.17 | 0.52/1.40/2.22 | 0.53/1.39/2.12 | 0.55/1.56/2.43
mandrill+90% | 0.31/0.02/0.09 | 0.33/0.14/0.13 | 0.35/0.04/-0.08 | 0.35/0.04/-0.03 | 0.36/0.07/-0.14 | 0.37/0.19/0.05
lenna+10% | 0.97/8.08/13.48 | 0.97/8.05/13.51 | 0.97/8.13/13.59 | 0.97/8.12/13.87 | 0.97/8.10/13.72 | 0.98/8.31/14.55
lenna+30% | 0.92/7.40/13.60 | 0.92/6.71/10.53 | 0.91/6.62/10.07 | 0.94/7.38/12.27 | 0.93/7.19/11.86 | 0.96/7.90/13.68
lenna+50% | 0.84/5.50/9.46 | 0.84/5.32/8.13 | 0.83/5.18/7.53 | 0.88/6.28/10.10 | 0.86/5.75/8.69 | 0.91/6.65/10.89
lenna+70% | 0.57/2.48/4.17 | 0.67/3.34/5.11 | 0.66/3.17/4.40 | 0.71/3.66/5.34 | 0.69/3.34/4.53 | 0.75/4.09/5.81
lenna+90% | 0.35/0.53/0.89 | 0.41/0.76/0.91 | 0.40/0.54/0.48 | 0.41/0.66/0.80 | 0.43/0.54/0.38 | 0.48/0.89/0.80
lake+10% | 0.94/8.03/14.66 | 0.95/8.21/13.64 | 0.96/8.31/13.81 | 0.96/8.40/14.52 | 0.96/8.33/14.14 | 0.96/8.45/14.72
lake+30% | 0.88/7.29/12.83 | 0.86/6.86/10.87 | 0.87/6.81/10.02 | 0.91/7.57/12.45 | 0.89/7.26/11.33 | 0.92/7.87/13.34
lake+50% | 0.61/4.89/8.31 | 0.76/5.69/8.52 | 0.77/5.65/8.02 | 0.83/6.57/10.36 | 0.80/5.98/8.78 | 0.85/6.75/10.56
lake+70% | 0.40/2.07/3.61 | 0.51/3.68/5.49 | 0.56/3.72/5.14 | 0.60/4.01/5.60 | 0.61/3.94/5.28 | 0.62/4.08/5.61
lake+90% | 0.27/0.54/0.87 | 0.25/0.94/1.06 | 0.27/0.87/0.65 | 0.26/0.75/0.65 | 0.29/0.81/0.44 | 0.28/0.90/0.71
jetplane+10% | 0.42/3.33/7.98 | 0.37/3.16/7.10 | 0.39/3.16/7.15 | 0.39/3.20/7.43 | 0.39/3.16/7.28 | 0.39/3.22/7.52
jetplane+30% | 0.36/2.94/7.00 | 0.35/2.70/5.69 | 0.38/2.68/5.60 | 0.39/2.98/6.68 | 0.39/2.87/6.24 | 0.40/3.14/7.23
jetplane+50% | 0.26/1.75/4.24 | 0.31/2.02/4.21 | 0.37/2.08/4.00 | 0.38/2.56/5.45 | 0.37/2.26/4.48 | 0.40/2.72/5.75
jetplane+70% | 0.24/-0.55/-0.17 | 0.24/0.62/1.46 | 0.31/0.83/1.49 | 0.30/0.81/1.44 | 0.34/0.98/1.55 | 0.33/0.92/1.38
jetplane+90% | 0.19/-1.81/-3.38 | 0.17/-1.98/-3.23 | 0.18/-1.96/-3.47 | 0.18/-2.18/-3.90 | 0.20/-1.66/-2.13 | 0.19/-2.05/-3.80
blonde+10% | 0.26/0.92/2.83 | 0.25/0.87/2.69 | 0.25/0.86/2.68 | 0.22/0.90/2.80 | 0.27/0.88/2.72 | 0.25/0.92/2.87
blonde+30% | 0.25/0.82/2.57 | 0.26/0.76/2.21 | 0.26/0.70/2.14 | 0.25/0.80/2.47 | 0.27/0.76/2.35 | 0.27/0.85/2.59
blonde+50% | 0.25/0.60/1.89 | 0.25/0.61/1.80 | 0.24/0.58/1.77 | 0.25/0.67/2.13 | 0.25/0.64/1.96 | 0.26/0.75/2.26
blonde+70% | 0.24/0.02/0.22 | 0.22/0.35/1.02 | 0.21/0.35/1.05 | 0.22/0.33/1.11 | 0.20/0.38/1.19 | 0.26/0.49/1.47
blonde+90% | 0.20/-0.81/-1.39 | 0.21/-0.62/-1.15 | 0.20/-0.65/-1.26 | 0.18/-0.75/-1.39 | 0.20/-0.52/-1.03 | 0.19/-0.45/-0.88

6.3. MRF Optimization

Markov Random Field (MRF) optimization involves solving the following problem [39]:

min_x ½ xᵀLx + bᵀx, s.t. x ∈ {0, 1}^n   (19)

where b is determined by the unary term defined for the graph and L is the Laplacian matrix, which is based on the binary term relating pairs of graph nodes together. The quadratic term is usually considered a smoothness prior on the node labels. The MRF formulation is widely used in many labeling applications, including image segmentation.

In our experiments, we include a comparison with the graph cut method [39], which is known to achieve the globally optimal solution for this specific class of binary problems. Figure 5 shows a qualitative result for image segmentation. MPEC-ADM produces a solution that is very close to the globally optimal one. Moreover, both our methods achieve lower objectives than the other compared methods.
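For a concrete sense of Eq (19), the following hypothetical NumPy sketch builds a tiny graph Laplacian from pairwise weights and evaluates the MRF objective for a binary labeling; the weights and unary term here are made-up toy values, not data from the 'cat' experiment.

```python
import numpy as np

# A 4-node chain graph with pairwise (binary-term) weights on its 3 edges.
edges   = [(0, 1), (1, 2), (2, 3)]
weights = [1.0, 2.0, 0.5]

W = np.zeros((4, 4))
for (i, j), w in zip(edges, weights):
    W[i, j] = W[j, i] = w
L = np.diag(W.sum(axis=1)) - W          # graph Laplacian of the pairwise weights

b = np.array([-1.0, 0.3, 0.2, -0.4])    # toy unary term

def mrf_objective(x):
    """Objective of Eq (19): 0.5*x'Lx + b'x for a binary labeling x in {0,1}^n."""
    return 0.5 * x @ L @ x + b @ x

print(mrf_objective(np.array([1., 0., 0., 1.])))   # x'Lx penalizes label disagreement
print(mrf_objective(np.array([1., 1., 0., 0.])))
```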

6.4. Trend Filtering

Given a time series y ∈ R^n, the goal of trend filtering is to find a nearest time series x such that x is sparse after a gradient mapping [38]. Trend filtering involves solving the following optimization, where D ∈ R^{(n−2)×n} is the second-order difference matrix:

min_x ½‖x − y‖₂², s.t. ‖Dx‖0 ≤ k   (20)

In our experiments, we perform trend filtering on the 'snp500.dat' data set [38]. For better visualization, we only use the first 300 time instances of the signal y, i.e. n = 300. The sparsity parameter k is set to 30. We stop all the algorithms when the same strict stopping criterion is met, in order to ensure that the hard constraint ‖Ax‖0 ≤ k is fully satisfied. As shown in Figure 6, the proposed methods (MPEC-EPM and MPEC-ADM) achieve more natural trend filtering results (refer to positions 120-170), since they achieve lower objective values (0.32 and 0.33).
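For reference, here is a short hypothetical NumPy sketch of the second-order difference operator D used in Eq (20), together with the sparsity count ‖Dx‖0 of a piecewise-linear signal (the signal is synthetic, not the 'snp500.dat' data):

```python
import numpy as np

def second_difference_matrix(n):
    """Build D in R^{(n-2) x n} with rows (..., 1, -2, 1, ...), the 2nd difference operator."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

n = 10
t = np.arange(n, dtype=float)
x = np.where(t < 5, t, 5.0 + 2.0 * (t - 5))   # piecewise linear with one slope change

D = second_difference_matrix(n)
print(int(np.sum(np.abs(D @ x) > 1e-10)))      # ||Dx||_0 = 1: one kink in the trend
```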

6.5. Impulse Noise Removal

Given a noisy image b which was contaminated by random-value impulse noise, we consider a denoised image x as a minimizer of the following constrained ℓ0TV model [65]:

min_x TV(x), s.t. ‖x − b‖0 ≤ k   (21)

where TV(x) ≜ ‖∇x‖_{p,1} denotes the total variation norm function, ‖x‖_{p,1} ≜ Σ_{i=1}^{n} (|x_i|^p + |x_{n+i}|^p)^{1/p}, ∇ ≜ [∇_x | ∇_y]ᵀ ∈ R^{2n×n}, and k specifies the noise level. Furthermore, ∇_x ∈ R^{n×n} and ∇_y ∈ R^{n×n} are two suitable linear transformation matrices that compute the discrete gradients. Following the experimental settings in [65], we use three ways to measure SNR, i.e. SNR0, SNR1 and SNR2. We show image recovery results in Table 2 when random-value impulse noise is added to the clean images 'mandrill', 'lenna', 'lake', 'jetplane' and 'blonde'. We also show a visual example of random-value impulse noise removal on the 'lenna' image in Figure 7. We conclude that both MPEC-EPM and MPEC-ADM achieve state-of-the-art performance, and that MPEC-ADM is generally better than MPEC-EPM on this impulse noise removal problem.

Figure 7: Random value impulse noise removal on 'lenna' image. (Panel results: (a) by CVX, SNR0 = 0.92; (b) by NI-ADM, SNR0 = 0.92; (c) by MD-ADM, SNR0 = 0.91; (d) by QPM, SNR0 = 0.94; (e) by MPEC-EPM, SNR0 = 0.93; (f) by MPEC-ADM, SNR0 = 0.96.)
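To illustrate the quantities in Eq (21), here is a hypothetical NumPy sketch that computes an isotropic total variation value (the p = 2 case of ‖∇x‖_{p,1}) and the constraint value ‖x − b‖0 for a small image; it only illustrates the model, not the solver.

```python
import numpy as np

def total_variation(img, p=2):
    """Isotropic (p=2) TV: sum over pixels of (|grad_x|^p + |grad_y|^p)^(1/p)."""
    gx = np.diff(img, axis=1, append=img[:, -1:])   # horizontal discrete gradient
    gy = np.diff(img, axis=0, append=img[-1:, :])   # vertical discrete gradient
    return np.sum((np.abs(gx) ** p + np.abs(gy) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 1.0, 8), (8, 1))   # smooth 8x8 ramp image

b = clean.copy()
noisy_idx = rng.choice(64, size=10, replace=False)  # corrupt 10 pixels with impulse noise
b.flat[noisy_idx] = rng.random(10)

x = clean                                            # a candidate denoised image
print(total_variation(x))                            # TV(x): small for the smooth ramp
print(int(np.sum(np.abs(x - b).ravel() > 1e-12)))    # ||x - b||_0, compare against k
```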

7. Conclusions and Future Work

This paper presents two methods (exact penalty and alternating direction) to solve the general sparsity constrained minimization problem. Although the problem is non-convex, we design effective penalization/regularization schemes to solve its equivalent MPEC. We also prove that both of our methods converge to first-order KKT points. Experimental results on a variety of sparse optimization applications demonstrate that our methods achieve state-of-the-art performance.

Our future work focuses on several directions. (i) We will investigate the optimality qualification of our multi-stage convex relaxation method for some specific objective functions (e.g. squared or logistic function). (ii) It is interesting to extend the present results to deal with the rank minimization problem and its applications. (iii) We are also interested in applying the proposed algorithms to other sparse optimization and binary optimization applications in machine learning and signal processing.

References

[1] Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with the k-support norm. In Neural Information Processing Systems (NIPS), pages 1466-1474, 2012.

[2] Hedy Attouch, Jerome Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research (MOR), 35(2):438-457, 2010.

[3] Sohail Bahmani, Bhiksha Raj, and Petros T. Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research (JMLR), 14(1):807-841, 2013.

[4] Chenglong Bao, Hui Ji, Yuhui Quan, and Zuowei Shen. Dictionary learning for sparse coding: Algorithms and convergence analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(7):1356-1369, 2016.

[5] Amir Beck and Yonina C. Eldar. Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization (SIOPT), 23(3):1480-1509, 2013.

[6] Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. arXiv preprint:1507.03133, 2015.

[7] Shujun Bi, Xiaolan Liu, and Shaohua Pan. Exact penalty decomposition method for zero-norm minimization based on MPEC formulation. SIAM Journal on Scientific Computing (SISC), 36(4), 2014.

[8] Daniel Bienstock. Computational study of a family of mixed-integer quadratic programming problems. Mathematical Programming, 74(2):121-140, 1996.

[9] Thomas Blumensath and Mike E. Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370-2382, 2008.

[10] Jerome Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459-494, 2014.

[11] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313-2351, 2007.

[12] Emmanuel J. Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005.

[13] Emmanuel J. Candes, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877-905, 2008.

[14] Antoni B. Chan, Nuno Vasconcelos, and Gert R. G. Lanckriet. Direct convex relaxations of sparse SVM. In International Conference on Machine Learning (ICML), pages 145-153, 2007.

[15] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing (SISC), 20(1):33-61, 1998.

[16] Jon Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing USA, 2011.

[17] Victor DeMiguel, Michael P. Friedlander, Francisco J. Nogales, and Stefan Scholtes. A two-sided relaxation scheme for mathematical programs with equilibrium constraints. SIAM Journal on Optimization (SIOPT), 16(2):587-609, 2005.

[18] Bin Dong and Yong Zhang. An efficient algorithm for l0 minimization in wavelet frame based image restoration. Journal of Scientific Computing, 54(2-3):350-368, 2013.

[19] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.

[20] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1 ball for learning in high dimensions. In International Conference on Machine Learning (ICML), pages 272-279, 2008.


[21] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition (CVPR), pages 2790-2797, 2009.

[22] Francisco Facchinei, Houyuan Jiang, and Liqun Qi. A smoothing method for mathematical programs with equilibrium constraints. Mathematical Programming, 85(1):107-134, 1999.

[23] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association (JASA), 96(456):1348-1360, 2001.

[24] Mingbin Feng, John E. Mitchell, Jong-Shi Pang, Xin Shen, and Andreas Wachter. Complementarity formulations of ℓ0-norm optimization problems. 2013.

[25] Fajwel Fogel, Rodolphe Jenatton, Francis R. Bach, and Alexandre d'Aspremont. Convex relaxations for permutation problems. SIAM Journal on Matrix Analysis and Applications (SIMAX), 36(4):1465-1488, 2015.

[26] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

[27] Jerome H. Friedman. Fast sparse regression and classification. International Journal of Forecasting, 28(3):722-738, 2012.

[28] Cuixia Gao, Naiyan Wang, Qi Yu, and Zhihua Zhang. A feasible nonconvex relaxation approach to feature selection. In AAAI Conference on Artificial Intelligence (AAAI), pages 356-361, 2011.

[29] Dongdong Ge, Xiaoye Jiang, and Yinyu Ye. A note on the complexity of ℓp minimization. Mathematical Programming, 129(2):285-299, 2011.

[30] Donald Geman and Chengda Yang. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing (TIP), 4(7):932-946, 1995.

[31] Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Huang, and Jieping Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In International Conference on Machine Learning (ICML), pages 37-45, 2013.

[32] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis (SINUM), 50(2):700-709, 2012.

[33] Richard V. Helgason, Jeffery L. Kennington, and H. S. Lall. A polynomially bounded algorithm for a singly constrained quadratic program. Mathematical Programming, 18(1):338-343, 1980.

[34] X. M. Hu and Daniel Ralph. Convergence of a penalty method for mathematical programming with complementarity constraints. Journal of Optimization Theory and Applications, 123(2):365-390, 2004.

[35] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Neural Information Processing Systems (NIPS), pages 685-693, 2014.

[36] Bo Jiang, Ya-Feng Liu, and Zaiwen Wen. lp-norm regularization algorithms for optimization over permutation matrices. Technical report, Optimization Online, November 2015.

[37] Abdeslam Kadrani, Jean-Pierre Dussault, and Abdelhamid Benchakroun. A new regularization scheme for mathematical programs with complementarity constraints. SIAM Journal on Optimization (SIOPT), 20(1):78-103, 2009.

[38] Seung-Jean Kim, Kwangmoo Koh, Stephen P. Boyd, and Dimitry M. Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339-360, 2009.

[39] Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 26(2):147-159, 2004.

[40] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Neural Information Processing Systems (NIPS), pages 801-808, 2006.

[41] Yuanqing Li, Andrzej Cichocki, and Shun-ichi Amari. Analysis of sparse representation and blind source separation. Neural Computation, 16(6):1193-1234, 2004.

[42] Ya-Feng Liu, Yu-Hong Dai, and Shiqian Ma. Joint power and admission control: Non-convex lq approximation and an effective polynomial time deflation approach. IEEE Transactions on Signal Processing, 63(14):3641-3656, 2015.

[43] Canyi Lu, Jinhui Tang, Shuicheng Yan, and Zhouchen Lin. Generalized nonconvex nonsmooth low-rank minimization. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4130-4137, 2014.

[44] Zhaosong Lu and Yong Zhang. Sparse approximation via penalty decomposition methods. SIAM Journal on Optimization (SIOPT), 23(4):2448-2478, 2013.

[45] Zhi-Quan Luo, Jong-Shi Pang, and Daniel Ralph. Mathematical Programs with Equilibrium Constraints. Cambridge University Press, 1996.

[46] Stephane G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.

[47] Rahul Mazumder, Jerome H. Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association (JASA), 2012.

[48] Andrew M. McDonald, Massimiliano Pontil, and Dimitris Stamos. Spectral k-support norm regularization. In Neural Information Processing Systems (NIPS), pages 3644-3652, 2014.

[49] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.

[50] Deanna Needell and Roman Vershynin. Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics, 9(3):317-334, 2009.

[51] Andrew Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In International Conference on Machine Learning (ICML), page 78, 2004.

[52] Mert Pilanci, Martin J. Wainwright, and Laurent El Ghaoui. Sparse learning via boolean relaxations. Mathematical Programming, 151(1):63-87, 2015.

[53] Nikhil S. Rao, Parikshit Shah, and Stephen J. Wright. Forward-backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798-5811, 2015.

[54] Joel Tropp, Anna C. Gilbert, et al. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655-4666, 2007.

[55] Joel A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.

[56] Joshua Trzasko and Armando Manduca. Highly undersampled magnetic resonance image reconstruction via homotopic minimization. IEEE Transactions on Medical Imaging, 28(1):106-121, 2009.

[57] Peng Wang, Chunhua Shen, and Anton van den Hengel. A fast semidefinite approach to solving binary quadratic problems. In Computer Vision and Pattern Recognition (CVPR), pages 1312-1319, 2013.


[58] Andreas Weinmann, Martin Storath, and Laurent Demaret. The L1-Potts functional for robust jump-sparse reconstruction. SIAM Journal on Numerical Analysis (SINUM), 53(1):644–673, 2015. 1

[59] Jason Weston, Andre Elisseeff, Bernhard Scholkopf, and Michael E. Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research (JMLR), 3:1439–1461, 2003. 1

[60] Bin Wu, Chao Ding, Defeng Sun, and Kim-Chuan Toh. On the Moreau-Yosida regularization of the vector k-norm related functions. SIAM Journal on Optimization (SIOPT), 24(2):766–794, 2014. 18

[61] Zhen James Xiang, Yongxin Taylor Xi, Uri Hasson, and Peter J. Ramadge. Boosting with spatial regularization. In Neural Information Processing Systems (NIPS), pages 2107–2115, 2009. 1

[62] Li Xu, Cewu Lu, Yi Xu, and Jiaya Jia. Image smoothing via ℓ0 gradient minimization. ACM Transactions on Graphics (TOG), 30(6):174, 2011. 1

[63] Penghang Yin, Yifei Lou, Qi He, and Jack Xin. Minimization of ℓ1−2 for compressed sensing. SIAM Journal on Scientific Computing (SISC), 37(1), 2015. 2, 3

[64] Jin Yu, Anders Eriksson, Tat-Jun Chin, and David Suter. An adversarial optimization approach to efficient outlier removal. In International Conference on Computer Vision (ICCV), pages 399–406, 2011. 2, 3

[65] Ganzhao Yuan and Bernard Ghanem. ℓ0TV: A new method for image restoration in the presence of impulse noise. In Computer Vision and Pattern Recognition (CVPR), pages 5369–5377, 2015. 1, 2, 3, 4, 11

[66] Ganzhao Yuan and Bernard Ghanem. A proximal alternating direction method for semi-definite rank minimization. In AAAI Conference on Artificial Intelligence (AAAI), 2016. 4

[67] Ming Yuan. High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research (JMLR), 11:2261–2286, 2010. 1

[68] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In International Conference on Machine Learning (ICML), pages 127–135, 2014. 3

[69] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, pages 894–942, 2010. 3

[70] Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research (JMLR), 11:1081–1107, 2010. 1, 3

[71] Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689–4708, 2011. 2

Appendix

The appendix is organized as follows. Section A presents the proof of Lemma 1. Section B presents the convergence analysis of the exact penalty method. Section C presents the convergence analysis of the alternating direction method. Section D presents the Matlab code of the breakpoint search algorithm.

A. Proof of Lemma 1

Here we provide the proof of our new non-separable MPEC reformulation of the ℓ0 norm function.

Proof. It is not hard to validate that the ℓ0 norm function of x ∈ Rm can be expressed as the following minimization problem over u ∈ Rm:

‖x‖0 = min−1≤u≤1 ‖u‖1, s.t. |x| = u ⊙ x   (22)

Note that the minimization problem in Eq (22) can be decomposed into m subproblems. When xi = 0, the optimal solution in position i will be achieved at ui = 0 by the minimization; when xi > 0 (xi < 0, respectively), the optimal solution in position i will be achieved at ui = 1 (ui = −1, respectively) by the constraint. In other words, u∗ = sign(x) will be achieved for Eq (22), leading to ‖x‖0 = ‖sign(x)‖1.

We now focus on the constraint in Eq (22). Due to the box constraint −1 ≤ u ≤ 1, it always holds that |x| − u ⊙ x ≥ 0 entry-wise. Therefore, the entries of |x| − u ⊙ x are nonnegative, and their sum vanishes if and only if every entry vanishes. We naturally have the following results:

|x| − u ⊙ x = 0 ⇔ 〈1, |x| − u ⊙ x〉 = 0 ⇔ ‖x‖1 = 〈u, x〉
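
As a quick numerical illustration of this proof (a hypothetical Matlab sketch, not part of the paper's code; the test vector x is an arbitrary assumption), one can check that u = sign(x) is feasible for Eq (22) and attains the ℓ0 norm:

% Sketch: verify the non-separable MPEC identity of Lemma 1 on a toy vector.
x = [0; 3; -2; 0; 0.5];                       % arbitrary sparse test vector (assumption)
u = sign(x);                                  % the minimizer identified in the proof
assert(all(abs(u) <= 1));                     % box constraint -1 <= u <= 1
assert(norm(abs(x) - u .* x, inf) < 1e-12);   % equilibrium constraint |x| = u .* x
assert(norm(u, 1) == nnz(x));                 % ||u||_1 equals ||x||_0
assert(abs(norm(x, 1) - u' * x) < 1e-12);     % complementarity ||x||_1 = <u, x>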

B. Convergence Analysis of Algorithm 1

This section presents the proofs of Theorem 1 and Theorem 2. Our main novelties are establishing the inversely proportional relationship between the weight of the ℓ1 norm and the sparsity of Ax (in the proof of exactness) and quantifying the upper bound on 〈w, |Ax|〉 (in the proof of the convergence rate). Our results make use of the specific structure of the ℓ1 norm problem.

The following lemma is useful; it establishes a lower bound on the norm of a matrix-vector product.

Lemma 2. Assume that A ∈ Rm×n has a right inverse. It always holds that ‖Aᵀc‖2 ≥ σ(A)‖c‖2 for all c ∈ Rm, where σ(A) is the smallest singular value of A.

Proof. We denote Z = AAᵀ − (σ(A))²I ∈ Rm×m. Since A has a right inverse, we have σ(A) > 0 and Z ⪰ 0. We let Z = LLᵀ and have the following inequalities:

‖Aᵀc‖₂² = 〈AAᵀ, ccᵀ〉 = 〈Z + (σ(A))²I, ccᵀ〉 = ‖Lᵀc‖₂² + (σ(A))²‖c‖₂² ≥ (σ(A))²‖c‖₂²

Taking the square root of both sides, we naturally complete the proof of this lemma.
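
As a sanity check (again a hypothetical Matlab sketch; the dimensions are arbitrary assumptions), Lemma 2 can be verified numerically for a random matrix with full row rank:

% Sketch: check ||A'c||_2 >= sigma(A)||c||_2 for a random full-row-rank A.
m = 5; n = 20;
A = randn(m, n);                 % has full row rank with probability 1
sigmaA = min(svd(A));            % smallest singular value of A
c = randn(m, 1);
assert(norm(A' * c) >= sigmaA * norm(c) - 1e-10);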

The following lemma is useful in our convergence analysis.

Lemma 3. Assume that h(·) is convex and L-Lipschitz continuous, and that w is a non-negative vector. We have the following results: (i) The optimal solution of the following optimization problem:

x∗ = arg minx h(x) + 〈w, |Ax|〉   (23)

will be achieved with |Ax∗|i = 0 when wi > L/σ(A), for all i ∈ [m]. (ii) Moreover, if ‖x∗‖ ≤ δ, it always holds that 〈w, |Ax∗|〉 ≤ δL.

Proof. (i) We now prove the first part of this lemma. The convex optimization problem in Eq (23) is equivalent to the following minimax saddle point problem:

(x∗, y∗) = arg minx max−1≤y≤1 h(x) + 〈y ⊙ w, Ax〉

According to the optimality with respect to y, we have the following result:

y∗i = 1 ⇒ (Ax∗)i > 0,  |y∗i| < 1 ⇒ (Ax∗)i = 0,  y∗i = −1 ⇒ (Ax∗)i < 0   (24)

According to the optimality with respect to x, we have the following result:

h′(x) + Aᵀ(y ⊙ w) = 0   (25)

where h′(x) denotes the subgradient of h(·) with respect to x.

We now prove that |yi| is strictly less than 1 when wi > L/σ(A); then we obtain |Ax∗|i = 0 due to the optimality condition in Eq (24). Our proof is as follows. We assume that yi ≠ 0, since otherwise |yi| = 0 < 1 and the conclusion holds. Then we derive the following inequalities:

|yi| < |yi| · (wi/L) · σ(A)
     ≤ |yi| · (wi/L) · ‖Aᵀ(y ⊙ w)‖ / ‖y ⊙ w‖
     ≤ |yi| · (wi/L) · L / ‖y ⊙ w‖
     ≤ |yi| · wi / ‖y ⊙ w‖∞
     ≤ 1   (26)

where the first step uses the choice that wi > L/σ(A) and the assumption that yi ≠ 0, the second step uses Lemma 2, the third step uses the fact that ‖Aᵀ(y ⊙ w)‖ = ‖h′(x)‖ ≤ L, which can be derived from Eq (25) and the L-Lipschitz continuity of h(·), the fourth step uses the norm inequality ‖c‖∞ ≤ ‖c‖2 ∀c, and the last step uses the nonnegativity of w.

We remark that such results are natural and commonplace, since it is well known that the ℓ1 norm induces sparsity.

(ii) We now prove the second part of this lemma. Recall that if x∗ solves Eq (23), then it must satisfy the following variational inequality [32]:

〈w, |Ax| − |Ax∗|〉 + 〈x − x∗, h′(x∗)〉 ≥ 0, ∀x   (27)

Letting x = 0 in Eq (27), using the nonnegativity of w, the Cauchy-Schwarz inequality, and the Lipschitz continuity of h(·), we have:

〈w, |Ax∗|〉 ≤ −〈x∗, h′(x∗)〉 ≤ ‖x∗‖ · ‖h′(x∗)‖ ≤ δL

The following lemma shows that the biconvex minimization problem will lead to basic feasibility for the complementarity constraint when the penalty parameter ρ is larger than a threshold.

Lemma 4. When ρ > L/σ(A), any local optimal solution of the minimization problem minx,u Jρ(x, u) will be achieved with ‖Ax‖1 − 〈Ax, u〉 = 0.

Proof. We let (x, u) be any local optimal solution of the biconvex minimization problem minx,u Jρ(x, u). We denote by O the index set of the k largest values of |Ax|, and define G ≜ {i | (Ax)i > 0} and S ≜ {j | (Ax)j < 0}. Moreover, we define I ≜ O ∩ G, J ≜ O ∩ S and K ≜ {1, 2, ..., m} \ (I ∪ J).

(i) First of all, we consider the minimization problem of the penalty function with respect to u (i.e. minu Jρ(x, u)). It reduces to the following minimization problem:

min−1≤u≤1 〈−Ax, u〉, s.t. ‖u‖1 ≤ k   (28)

The objective of Eq (28) essentially computes the k-largest norm function of Ax; see [60, 16]. It is not hard to validate that the optimal solution of Eq (28) can be computed as follows:

ui = +1 for i ∈ I,  ui = −1 for i ∈ J,  ui = 0 for i ∈ K.
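
For concreteness, a hypothetical Matlab sketch of this closed-form u-update (the names Ax and k are placeholders, not taken from the paper's code) could read:

% Sketch: closed-form minimizer of Eq (28), i.e. the u-subproblem of MPEC-EPM.
Ax = [3; -0.2; 0; -5; 1];  k = 2;             % example data (assumption)
m  = length(Ax);
[~, idx] = sort(abs(Ax), 'descend');
O  = idx(1:k);                                % indices of the k largest entries of |Ax|
u  = zeros(m, 1);
u(O) = sign(Ax(O));                           % +1 on I, -1 on J, 0 on K
% <-Ax, u> now equals minus the sum of the k largest entries of |Ax|.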

(ii) We now consider the minimization problem of the penalty function with respect to x (i.e. minx Jρ(x, u)). Clearly, we have 〈(Ax)I, uI〉 = ‖(Ax)I‖1 and 〈(Ax)J, uJ〉 = ‖(Ax)J‖1. The x-subproblem thus reduces to the following optimization problem:

minx f(x) + ρ(‖(Ax)K‖1 − 〈(Ax)K, uK〉).

Since uK = 0, it can be further simplified as:

minx f(x) + 〈w, |Ax|〉, where wi = ρ for i ∈ K and wi = 0 for i ∉ K.   (29)

Applying Lemma 3 with h(x) = f(x), we conclude that |Ax|K = 0 will be achieved when ρ > L/σ(A).

Since |Ax|K = 0, 〈(Ax)I, uI〉 = ‖(Ax)I‖1 and 〈(Ax)J, uJ〉 = ‖(Ax)J‖1, we conclude that the complementarity constraint ‖Ax‖1 − 〈Ax, u〉 = 0 is fully satisfied. This finishes the proof of the claim in the lemma.

We now show that when the penalty parameter ρ is larger than a threshold, the biconvex objective function Jρ(x, u) is equivalent to the original constrained MPEC problem.

Theorem 1. Exactness of the Penalty Function. The penalty problem minx,v Jρ(x, v) admits the same local and global optimal solutions as the original MPEC problem when ρ > L/σ(A).

Proof. First of all, based on the non-separable MPEC reformulation, we have the following Lagrangian function H : Rn × Rm × R → R:

H(x, u, χ) = f(x) + I(u) + χ(‖Ax‖1 − 〈Ax, u〉)

Based on the Lagrangian function¹³, we have the following KKT conditions for any KKT solution (x∗, u∗, χ∗):

0 ∈ ∂f(x∗) + χ∗Aᵀ∂|Ax∗| − χ∗Aᵀu∗
0 ∈ ∂I(u∗) − χ∗Ax∗   (30)
0 = ‖Ax∗‖1 − 〈Ax∗, u∗〉

The KKT solution is defined as the first-order minimizer (respectively maximizer) for the primal (respectively dual) variables of the Lagrange function. It can be simply derived from setting the (sub-)gradient of the Lagrange function to zero with respect to each block of variables.

Secondly, we focus on the penalty function Jρ(x, u). When ρ > L/σ(A), by Lemma 4 we have:

0 = ‖Ax‖1 − 〈Ax, u〉   (31)

By the local optimality of the penalty function Jρ(x, u), we have the following inclusions:

0 ∈ ∂f(x) + ρAᵀ∂|Ax| − ρAᵀu
0 ∈ ∂I(u) − ρAx   (32)

Since Eq (32) and Eq (31) coincide with Eq (30), we conclude that the solution of Jρ(x, u) admits the same local and global optimal solutions as the original non-separable MPEC reformulation, when ρ > L/σ(A).

Theorem 2. Convergence rate of Algorithm MPEC-EPM. Assume that ‖xt‖ ≤ δ for all t. Algorithm 1 will converge to a first-order KKT point in at most ⌈(ln(Lδ) − ln(ερ0))/ln 2⌉ outer iterations with accuracy at least ‖Ax‖1 − 〈Ax, v〉 ≤ ε.

13 Note that the Lagrangian function H has the same form as the penalty function J and the multiplier χ also plays the role of a penalty parameter. It always holds at the optimal solution that χ∗ > 0, since 〈Ax, u〉 ≤ ‖Ax‖1 · ‖u‖∞ ≤ ‖Ax‖1 and the equality holds at the optimal solution.


Proof. Assume Algorithm 1 takes s outer iterations to converge. We have the following inequalities:

‖Axs+1‖1 − 〈Axs+1, vs+1〉 = ‖(Axs+1)K‖1 ≤ δL/ρs

where the first step uses the notation and the results of Lemma 4 (see Eq (29)), and the second step uses the second part of Lemma 3 with h(x) = f(x), so that ρs‖(Axs+1)K‖1 = 〈w, |Axs+1|〉 ≤ δL. The above inequality implies that when ρs ≥ δL/ε, Algorithm MPEC-EPM achieves accuracy at least ‖Ax‖1 − 〈Ax, v〉 ≤ ε. Noticing that ρs = 2sρ0, we have that ε accuracy will be achieved when

2sρ0 ≥ δL/ε ⇒ 2s ≥ δL/(ερ0) ⇒ s ≥ (ln(δL) − ln(ερ0))/ln 2.

Thus, we finish the proof of this theorem.
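
As a rough illustration of this bound (the constants below are assumed values, not taken from the paper's experiments), the number of outer iterations can be evaluated directly in Matlab:

% Sketch: evaluate the outer-iteration bound of Theorem 2 under assumed constants.
L_f   = 10;      % Lipschitz constant of f (assumption)
delta = 5;       % bound on ||x^t|| (assumption)
eps_  = 1e-4;    % target accuracy for ||Ax||_1 - <Ax,v> (assumption)
rho0  = 0.1;     % initial penalty parameter (assumption)
s = ceil((log(L_f * delta) - log(eps_ * rho0)) / log(2));
fprintf('at most %d outer iterations under the doubling schedule rho^s = 2^s * rho0\n', s);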

C. Convergence Analysis of Algorithm 2

This section presents the proofs of Theorem 3 and Theorem 4. Note that our proof is also built upon Lemma 3.

Lemma 5. When π > L/σ(A), any local optimal solution of the minimization problem minx,v L(x, v, π) will be achieved with |Ax| ⊙ v = 0.

Proof. We assume that x and v are arbitrary local optimal solutions of the augmented Lagrangian function for a given π.

(i) Firstly, we focus on the v-subproblem; we have:

v = arg minv I(v) + 〈π, |Ax| ⊙ v〉 + (α/2)‖|Ax| ⊙ v‖₂²

Then there exists a constant θ ≥ 0 (that depends on the local optimal solutions x and v) such that it solves the following minimax saddle point problem:

maxθ≥0 min0≤v≤1 〈π, |Ax| ⊙ v〉 + (α/2)‖|Ax| ⊙ v‖₂² + θ(n − k − 〈v, 1〉)

Clearly, v can be computed as:

v = max(0, min(1, (θ − |Ax| ⊙ π) / (α |Ax| ⊙ |Ax|)))   (33)

where the division is taken element-wise.

We now define I ≜ {i | θ − |Ax|i · πi ≤ 0} and J ≜ {i | θ − |Ax|i · πi > 0}. Clearly, we have

vI = 0, vJ ≠ 0

(ii) Secondly, we focus on the x-subproblem:

minx f(x) + 〈|Ax| ⊙ v, π〉 + (α/2)‖|Ax| ⊙ v‖²

Noticing vJ ≠ 0, we apply Lemma 3 with w = π ⊙ v and h(x) = f(x) + (α/2)‖|Ax| ⊙ v‖²; then |Ax|J = 0 will be achieved whenever πJ ⊙ vJ > Lh/σ(A), where Lh is the Lipschitz constant of h(·). Since vI = 0, (Ax)J = 0 and I ∪ J = {1, 2, ..., m}, we have (α/2)‖|Ax| ⊙ v‖² = 0 and thus Lh = L. Moreover, incorporating (Ax)J = 0 into Eq (33), we have vJ = 1.

(iii) Finally, since π > 0 and θ ≥ 0, by the definition of I, we have |Ax|I ≥ 0. In summary, we have the following results:

vI = 0, vJ = 1, (Ax)I ≥ 0, (Ax)J = 0 (34)

We conclude that the complementarity constraint will hold automatically whenever πJ > (L/σ(A)) · (1/vJ) = L/σ(A), where J is the index set for which |Ax|J = 0. However, since we do not have any prior information on the index set J (i.e. the value of Ax∗), and J can be any subset of {1, 2, ..., m}, the condition πJ > L/σ(A) needs to be further restricted to π > L/σ(A). Thus, we complete the proof of this lemma.

Theorem 3. Boundedness of the multiplier and exactness of the Augmented Lagrangian Function. The augmented Lagrangian problem minx,v L(x, v, π) admits the same local and global optimal solutions as the original MPEC problem when π > L/σ(A).

Proof. First of all, based on the Lagrangian function L(·), we have the following KKT conditions for any KKT solution (x∗, v∗, π∗):

0 ∈ ∂f(x∗) + Aᵀ(∂|Ax∗| ⊙ v∗ ⊙ π∗)
0 ∈ ∂I(v∗) + |Ax∗| ⊙ π∗   (35)
0 = |Ax∗| ⊙ v∗

The KKT solution is defined as the first-order minimizer (respectively maximizer) for the primal (respectively dual) variables of the Lagrange function. It can be simply derived from setting the (sub-)gradient of the Lagrange function to zero with respect to each block of variables.

Secondly, we focus on the augmented Lagrangian function L(x, v, π). When π > L/σ(A), by Lemma 5 we have:

0 = |Ax| ⊙ v   (36)

By the local optimality of L(x, v, π), we have the following inclusions:

0 ∈ ∂f(x) + Aᵀ(∂|Ax| ⊙ v ⊙ π)
0 ∈ ∂I(v) + |Ax| ⊙ π   (37)


Since Eq (36) and Eq (37) coincide with Eq (35), we conclude that the solution of L(x, v, π) admits the same local and global optimal solutions as the original non-separable MPEC reformulation, when π > L/σ(A).

In the following, we present the proof of Theorem 4. For ease of discussion, we define:

s ≜ {x, v, π},  w ≜ {x, v}.

First of all, we prove the subgradient lower bound for the iterates gap in the following lemma.

Lemma 6. Assume that the iterates xt are bounded for all t; then there exists a constant ϖ > 0 such that the following inequality holds:

‖∂L(st+1)‖ ≤ ϖ‖st+1 − st‖   (38)

Proof. For notational simplicity, we denote p = ∂|Axt+1| and z = |Axt+1|. By the optimality conditions of the x-subproblem and the v-subproblem, we have:

0 ∈ Aᵀ(p ⊙ vt ⊙ (αz ⊙ vt + πt)) + f′(xt+1) + µ(xt+1 − xt)
0 ∈ (αz ⊙ vt+1 + πt) ⊙ z + ∂I(vt+1) + µ(vt+1 − vt)   (39)

By the definition of L we have that

∂Lx(st+1)
 = Aᵀ(p ⊙ vt+1 ⊙ α(z ⊙ vt+1 + πt+1/α)) + f′(xt+1)
 = Aᵀp ⊙ (vt+1 ⊙ α(z ⊙ vt+1 + πt+1/α) − vt ⊙ α(z ⊙ vt + πt/α)) + µ(xt − xt+1)
 = Aᵀp ⊙ (vt+1 ⊙ α(z ⊙ vt+1 + πt+1/α) − vt ⊙ α(z ⊙ vt + (πt+1 − πt+1 + πt)/α)) + µ(xt − xt+1)
 = Aᵀp ⊙ vt ⊙ (πt − πt+1) + µ(xt − xt+1) + Aᵀp ⊙ (vt+1 ⊙ (αz ⊙ vt+1 + πt+1) − vt ⊙ (αz ⊙ vt + πt+1))
 = Aᵀp ⊙ vt ⊙ (πt − πt+1) + µ(xt − xt+1) + Aᵀp ⊙ πt+1 ⊙ (vt+1 − vt) + Aᵀp ⊙ αz ⊙ (vt+1 ⊙ vt+1 − vt ⊙ vt)   (40)

The first step uses the definition of Lx(st+1), the second step uses Eq (39), the third step adds and subtracts πt+1 inside the bracket, and the last two steps rearrange the terms together with the multiplier update rule for π.

We assume that xt+1 is bounded by δ for all t, i.e. ‖xt+1‖ ≤ δ. By Theorem 3, πt+1 is also bounded; we assume it is bounded by a constant κ, i.e. ‖πt+1‖ ≤ κ. Using the facts that ‖∂‖Ax‖1‖∞ ≤ 1, ‖Ax‖2 ≤ ‖A‖‖x‖, ‖x ⊙ y‖2 ≤ ‖x‖∞‖y‖2, and Eq (40), we have the following inequalities:

‖∂Lx(st+1)‖
 ≤ ‖A‖ · ‖p‖∞ · ‖vt‖∞ · ‖πt − πt+1‖ + µ‖xt − xt+1‖ + ‖A‖ · ‖p‖∞ · ‖πt+1‖∞ · ‖vt+1 − vt‖ + ‖A‖ · ‖p‖∞ · α‖z‖ · ‖vt+1 + vt‖∞ · ‖vt+1 − vt‖
 ≤ ‖A‖‖πt − πt+1‖ + µ‖xt+1 − xt‖ + (κ + 2δα‖A‖) · ‖A‖ · ‖vt+1 − vt‖   (41)

Similarly, we have

∂Lv(st+1) = ∂I(vt+1) + αz ⊙ (πt+1/α + z ⊙ vt+1) = z ⊙ (πt+1 − πt) − µ(vt+1 − vt)

∂Lπ(st+1) = |Axt+1| ⊙ vt+1 = (1/α)(πt+1 − πt)

Then we derive the following inequalities:

‖∂Lv(st+1)‖ ≤ δ‖A‖‖πt − πt+1‖ + µ‖vt+1 − vt‖   (42)

‖∂Lπ(st+1)‖ ≤ (1/α)‖πt+1 − πt‖   (43)

Combining Eqs (41)-(43), we conclude that there exists ϖ > 0 such that the following inequality holds:

‖∂L(st+1)‖ ≤ ϖ‖st+1 − st‖

We thus complete the proof of this lemma.

The following proposition is useful in our convergence analysis.

Proposition 1. Assume that the iterates xt are bounded for all t; then we have the following inequality:

∑_{t=0}^{∞} ‖st − st+1‖² < +∞

In particular, the sequence ‖st − st+1‖ is asymptotically regular, namely ‖st − st+1‖ → 0 as t → ∞. Moreover, any cluster point of st is a stationary point of L.

Proof. Due to the initialization and the update rule of π, we conclude that πt is nonnegative and monotone non-decreasing. Moreover, using the boundedness result of Theorem 3, the monotone sequence πt converges; hence, as t → ∞, we have |Axt+1| ⊙ vt+1 → 0. Therefore, we conclude that as t → +∞ it must hold that

|Axt+1| ⊙ vt+1 = 0,  ∑_{i=1}^{t} ‖πi+1 − πi‖ < +∞,  ∑_{i=1}^{t} ‖πi+1 − πi‖² < +∞


On the other hand, we naturally derive the following inequalities:

L(xt+1, vt+1; πt+1)
 = L(xt+1, vt+1; πt) + 〈πt+1 − πt, |Axt+1| ⊙ vt+1〉
 = L(xt+1, vt+1; πt) + (1/α)‖πt+1 − πt‖²
 ≤ L(xt, vt+1; πt) − (µ/2)‖xt+1 − xt‖² + (1/α)‖πt+1 − πt‖²
 ≤ L(xt, vt; πt) − (µ/2)‖xt+1 − xt‖² − (µ/2)‖vt+1 − vt‖² + (1/α)‖πt+1 − πt‖²   (44)

The first step uses the definition of L; the second step uses the update rule of the Lagrangian multiplier π; the third and fourth steps use the µ-strong convexity of L with respect to x and v, respectively.

We define C = (1/α)∑_{i=1}^{t} ‖πi+1 − πi‖² + L(x0, v0; π0) − L(xt+1, vt+1; πt+1). Clearly, by the boundedness of xt, vt and πt, both C and L(xt+1, vt+1; πt+1) are bounded. Summing Eq (44) over i = 1, 2, ..., t, we have:

(µ/2)∑_{i=1}^{t} ‖wi+1 − wi‖² ≤ C   (45)

Therefore, combining Eq (44) and Eq (45), we have ∑_{t=1}^{+∞} ‖st+1 − st‖² < +∞; in particular, ‖st+1 − st‖ → 0.

By Eq (38), we have that:

‖∂L(st+1)‖ ≤ ϖ‖st+1 − st‖ → 0

which implies that any cluster point of st is a stationary point of L. This completes the proof of this proposition.

Remarks. Proposition 1 states that any cluster point is a stationary (KKT) point. Strictly speaking, this result does not imply the convergence of the algorithm. This is because the boundedness of ∑_{t=0}^{∞} ‖st − st+1‖² does not imply that the sequence st is convergent¹⁴. In what follows, we aim to prove a stronger result in Theorem 4.

Our analysis is mainly based on a recent non-convex analysis tool called the Kurdyka-Łojasiewicz (KL) inequality [2, 10]. One key condition of our proof requires that the Lagrangian function L(s) satisfies the so-called KL property on its effective domain. It is known that semi-algebraic functions satisfy the Kurdyka-Łojasiewicz property. We note that semi-algebraic functions include (i) real polynomial functions, (ii) finite sums and products of semi-algebraic functions, and (iii) indicator functions of semi-algebraic sets [10]. Using these facts repeatedly, the graph of L(s), i.e. {(s, z) | z = L(s)}, can be proved to be a semi-algebraic set. Therefore, the Lagrangian function L(s) is a semi-algebraic function. This is not surprising, since semi-algebraic functions are ubiquitous in applications [10]. We now present the following proposition established in [2].

14 One typical counter-example is st = ∑_{i=1}^{t} 1/i. Clearly, ∑_{t=0}^{∞} ‖st − st+1‖² = ∑_{t=1}^{∞} (1/t)² is bounded by π²/6; however, st is divergent, since st ≈ ln(t) + Ce as t → ∞, where Ce is the well-known Euler constant.

Proposition 2. For a given semi-algebraic function L(s) and any s̄ ∈ dom L, there exist θ ∈ [0, 1), η ∈ (0, +∞], a neighborhood S of s̄ and a concave and continuous function ϕ(ς) = ς^(1−θ), ς ∈ [0, η), such that for all s ∈ S satisfying L(s) ∈ (L(s̄), L(s̄) + η), the following inequality holds:

dist(0, ∂L(s)) · ϕ′(L(s) − L(s̄)) ≥ 1

where dist(0, ∂L(s)) = min{‖w∗‖ : w∗ ∈ ∂L(s)}.
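
To make the property more concrete, note that with the concave function ϕ(ς) = ς^(1−θ) given above we have ϕ′(ς) = (1 − θ)ς^(−θ), so the inequality in Proposition 2 can equivalently be written as

dist(0, ∂L(s)) ≥ (1/(1 − θ)) · (L(s) − L(s̄))^θ,

i.e. the subgradient of L cannot vanish faster than a power of the objective gap near s̄.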

We are now ready to prove the convergence of our alternating direction method.

Theorem 4. Assume that the iterates xt are bounded for all t. Then we have the following inequality:

∑_{t=0}^{+∞} ‖st − st+1‖ < ∞   (46)

Moreover, as t → +∞, Algorithm MPEC-ADM converges to a first-order KKT point of the reformulated MPEC problem.

Proof. For simplicity, we define rt = ϕ(L(st) − L(s∗)) − ϕ(L(st+1) − L(s∗)). We naturally derive the following inequalities:

(µ/2)‖wt+1 − wt‖² − (1/α)‖πt+1 − πt‖²
 ≤ L(st) − L(st+1)
 = (L(st) − L(s∗)) − (L(st+1) − L(s∗))
 ≤ rt / ϕ′(L(st) − L(s∗))
 ≤ rt · dist(0, ∂L(st))
 ≤ rt · ϖ (‖xt − xt−1‖ + ‖vt − vt−1‖ + ‖πt − πt−1‖)
 ≤ rt · ϖ (√2 ‖wt − wt−1‖ + ‖πt − πt−1‖)
 = rt ϖ √(16/µ) (√(µ/8) ‖wt − wt−1‖ + √(µ/16) ‖πt − πt−1‖)
 ≤ (√(µ/8) ‖wt − wt−1‖ + √(µ/16) ‖πt − πt−1‖)² + (rt ϖ √(4/µ))²

The first step uses Eq (44); the third step uses the concavity of ϕ, namely ϕ(a) − ϕ(b) ≥ ϕ′(a)(a − b) for all a, b ∈ R; the fourth step uses the KL property dist(0, ∂L(st)) ϕ′(L(st) − L(s∗)) ≥ 1 as in Proposition 2; the fifth step uses Eq (38); the sixth step uses the inequality ‖[x; y]‖ ≤ ‖x‖ + ‖y‖ ≤ √2‖[x; y]‖, where ';' in [·] denotes row-wise partitioning as in Matlab; the last step uses the fact that 2ab ≤ a² + b² for all a, b ∈ R. As t → ∞, we have:

(µ/2)‖wt+1 − wt‖² ≤ (4/µ)(rt ϖ)² + (√(µ/8) ‖wt − wt−1‖)²

Taking the square root of both sides and using the inequality √(a + b) ≤ √a + √b for all a, b ∈ R+, we have

2√(µ/8) ‖wt+1 − wt‖ ≤ 2rt ϖ/√µ + √(µ/8) ‖wt − wt−1‖


Then we have

√(µ/8) ‖wt+1 − wt‖ ≤ 2rt ϖ/√µ + √(µ/8)(‖wt − wt−1‖ − ‖wt+1 − wt‖)   (47)

Summing Eq (47) over i = 1, 2, ..., t, we have:

∑_{i=1}^{t} √(µ/8) ‖wi+1 − wi‖ ≤ (2ϖ/√µ) ∑_{i=1}^{t} ri + √(µ/8)(‖w1 − w0‖ + ‖wt+1 − wt‖)   (48)

The first term on the right-hand side of Eq (48) is bounded since ∑_{i=1}^{t} ri = ϕ(L(s0) − L(s∗)) − ϕ(L(st+1) − L(s∗)) is bounded. Therefore, we conclude that, as t → ∞, we obtain:

∑_{i=1}^{t} ‖wi+1 − wi‖ < +∞

By the boundedness of π in Theorem 3, we have ∑_{i=1}^{t} ‖πi+1 − πi‖ < +∞. Therefore, we obtain:

∑_{i=1}^{t} ‖si+1 − si‖ < +∞

Finally, by Eq (38) in Lemma 6, we have ‖∂L(st+1)‖ → 0 as t → ∞. In other words, in the limit we have the following results:

0 ∈ f′(xt+1) + Aᵀ(∂|Axt+1| ⊙ vt+1 ⊙ πt+1)
0 ∈ πt+1 ⊙ |Axt+1| + ∂I(vt+1)
0 = |Axt+1| ⊙ vt+1

which coincide with the KKT conditions in Eq (35). Therefore, we conclude that the solution (xt+1, vt+1, πt+1) converges to a first-order KKT point of the reformulated MPEC problem.

D. Matlab Code of the Breakpoint Search Algorithm [33]

function [x] = BreakPointSearch(d, a, s, oper)
% This program solves the following QP:
%   min_x 0.5*x'*diag(d)*x + a'*x
%   s.t.  sum(x) {oper} s,  0 <= x <= 1
% oper can be '==', '>=', or '<='.
% We assume d > 0.

n = length(d); assert(s >= 0 && s <= n);
y = sort([-a - d; -a], 'ascend');
l = 1; r = 2*n; L = n; R = 0;
while 1
    if (r - l == 1)
        S = (s - L) / (R - L);
        rho = y(l) + (y(r) - y(l)) * S;
        break;
    else
        m = floor(0.5 * (l + r));
        t = (-a - y(m)) ./ d;
        C = sum(max(0, min(t, 1)));
        if (C == s)
            rho = y(m); break;
        elseif (C > s)
            l = m; L = C;
        elseif (C < s)
            r = m; R = C;
        end
    end
end
if (oper(1) == '>')
    rho = min(rho, 0);
elseif (oper(1) == '<')
    rho = max(rho, 0);
end
t = (-a - rho) ./ d;
x = max(0, min(t, 1));
