
Research Article
Variable Selection and Parameter Estimation with the Atan Regularization Method

Yanxin Wang¹ and Li Zhu²

¹School of Science, Ningbo University of Technology, Ningbo 315211, China
²School of Applied Mathematics, Xiamen University of Technology, Xiamen 361024, China

Correspondence should be addressed to Yanxin Wang; wyxinbj@163.com

Received 23 September 2015; Revised 4 February 2016; Accepted 7 February 2016

Academic Editor: Steve Su

Copyright © 2016 Y. Wang and L. Zhu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Variable selection is fundamental to high-dimensional statistical modeling. Many variable selection techniques may be implemented by penalized least squares using various penalty functions. In this paper, an arctangent-type penalty which very closely resembles the $\ell_0$ penalty is proposed; we call it the Atan penalty. The Atan-penalized least squares procedure is shown to consistently select the correct model and is asymptotically normal, provided the number of variables grows more slowly than the number of observations. The Atan procedure is efficiently implemented using an iteratively reweighted Lasso algorithm. Simulation results and a data example show that the Atan procedure with a BIC-type criterion performs very well in a variety of settings.

1. Introduction

High-dimensional data arise frequently in modern applications in biology, economics, chemometrics, neuroscience, and other scientific fields. To facilitate the analysis, it is often reasonable and useful to assume that only a small number of covariates are relevant for modeling the response variable. Under this sparsity assumption, a widely used approach for analyzing high-dimensional data is regularized or penalized regression. This approach estimates the unknown regression coefficients by solving the following penalized regression problem:

$$
\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|) \right\}, \tag{1}
$$

where $\mathbf{X} = (x_1, x_2, \ldots, x_n)^T$ is the $n \times p$ design matrix, $\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$ is an $n$-dimensional response vector, $\beta = (\beta_1, \ldots, \beta_p)^T$ is the vector of unknown regression coefficients, $\|\cdot\|$ denotes the $L_2$ norm (Euclidean norm), and $p_\lambda(\cdot)$ is a penalty function which depends on a tuning parameter $\lambda > 0$.

In the above regularization framework, various penalty functions are used to perform variable selection by putting relatively large penalties on small coefficients, such as the best subset selection, $\ell_1$-penalized regression or Lasso [1], Bridge regression [2], SCAD [3], MCP [4], SICA [5], SELO [6], the Dantzig selector [7], and Bayesian variable selection methods [8]. The mainstream methods are the Lasso, the Dantzig selector, and the folded concave penalizations [9], such as SCAD and MCP. The best subset selection, namely the $\ell_0$ penalty, along with traditional model selection criteria such as AIC, BIC, and RIC [10-12], is attractive for variable selection, since it directly penalizes the number of nonzero coefficients. However, one drawback of the $\ell_0$-penalized least squares (PLS) procedure is the instability of the resulting estimators [13]. This results from the fact that the $\ell_0$ penalty is not continuous at 0. Another, perhaps more significant, drawback of the $\ell_0$ penalty is that implementing $\ell_0$ PLS procedures is NP-hard and may involve an exhaustive search over all possible models. Thus, implementing these procedures is computationally infeasible when the number of potential predictors is even moderately large, let alone for high-dimensional data.

The Lasso penalized regression is computationally attractive and enjoys great performance in prediction. However, the Lasso may not consistently select the correct model and is not necessarily asymptotically normal [14, 15]; a strong irrepresentable condition is necessary for the Lasso to be selection consistent [15, 16]. The folded concave penalization, unlike the Lasso, does not require the irrepresentable condition to achieve variable selection consistency and can correct the intrinsic estimation bias of the Lasso. Fan and Li [3] first systematically studied nonconvex penalized likelihood for fixed finite dimension $p$. They recommended the SCAD penalty, which enjoys the oracle property, for variable selection (a variable selection and estimation procedure is said to have the oracle property if it selects the true model $A$ with probability tending to one and if the estimated coefficients are asymptotically normal, with the same asymptotic variance as the least squares estimator based on the true model). Fan and Peng [17] extended these results by allowing $p$ to grow with $n$ at the rate $p = o(n^{1/5})$ or $p = o(n^{1/3})$. Lv and Fan [5] introduced the weak oracle property, which means that the estimator enjoys the same sparsity as the oracle estimator with asymptotic probability one and has consistency, and established regularity conditions under which the PLS estimator given by folded concave penalties has a nonasymptotic weak oracle property when the dimensionality $p$ can grow nonpolynomially with the sample size $n$. The theoretical properties enjoyed by the SELO [6] estimator allow the number of predictors $p$ to tend to infinity along with the number of observations $n$, provided $p/n \to 0$. For high-dimensional nonconvex penalized regression with $p > n$, Kim et al. [18] proved that the oracle estimator itself is a local minimum of SCAD-penalized least squares regression under very relaxed conditions. Zhang [4] proposed MCP and devised a novel PLUS algorithm which, when used together, can achieve the oracle property under certain regularity conditions. Recently, Fan et al. [9] have shown that the folded concave penalization methods enjoy the strong oracle property for high-dimensional sparse estimation. Important insight has also been gained through the recent work on theoretical analysis of the global solution [19-21].

The practical performance of PLS procedures depends heavily on the choice of the tuning parameter. The theoretically optimal tuning parameter does not have an explicit representation and depends on unknown factors, such as the variance of the unobserved random noise. Cross-validation is commonly adopted in practice to select the tuning parameter but is observed to often result in overfitting. In the case of fixed $p$, Wang et al. [22] proposed selecting the tuning parameter by minimizing a generalized BIC tuning parameter selector. Wang et al. [23] extended those results to the setting of a diverging number of parameters. Recently, Dicker et al. [6] proposed a BIC-like tuning parameter selector. Wang et al. [21] extended the work of [24, 25] on BIC to high-dimensional least squares regression; they proposed a high-dimensional BIC for a nonconvex penalized solution path.

In this paper, we propose an arctangent-type (Atan) penalty function which very closely approximates the $\ell_0$ penalty. Because the Atan penalty is continuous, the Atan estimator may be more stable than estimators obtained through $\ell_0$-penalized methods. The Atan penalty is a smooth function on $[0, \infty)$, and we use an iteratively reweighted Lasso algorithm. We formally establish the model selection oracle property enjoyed by the Atan estimator. In particular, the asymptotic normality of the Atan estimator is formally established. Our asymptotic framework allows the number of predictors $p \to \infty$ along with the number of observations $n$, provided $p/n \to 0$. Furthermore, a BIC-like tuning parameter selection procedure is implemented for Atan.

This paper is organized in the following way. In Section 2, we introduce PLS estimators and give a brief overview of existing nonconvex penalty terms; the Atan penalty is then presented. We discuss some of its theoretical properties in Section 3. In Section 4, we describe a simple and efficient algorithm for obtaining the Atan estimator. Simulation studies and an application of the proposed methodology are presented in Section 5. Conclusions are given in Section 6. The proofs are relegated to the Appendix.

2. The Atan-Penalized Least Squares Method

2.1. Linear Model and Penalized Least Squares. Suppose that $\{(y_i, x_i)\}_{i=1}^{n}$ is a random sample from the linear regression model
$$
\mathbf{y} = \mathbf{X}\beta^* + \varepsilon, \tag{2}
$$
where $\mathbf{X} = (x_1, x_2, \ldots, x_n)^T$ is the $n \times p$ design matrix, $\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$ is an $n$-dimensional response vector, and $\varepsilon$ is an $n$-dimensional noise vector whose entries are i.i.d. random errors with mean 0 and variance $\sigma^2$.

When discussing variable selection, it is convenient to have concise notation. Denote the columns of $\mathbf{X}$ by $\mathbf{x}_1, \ldots, \mathbf{x}_p \in R^n$ and the rows of $\mathbf{X}$ by $x_1, \ldots, x_n \in R^p$. Let $A = \{j : \beta^*_j \neq 0\}$ be the true model, and suppose that $p_0$ is the size of the true model; that is, suppose that $|A| = p_0$, where $|A|$ denotes the cardinality of $A$. In addition, for $S \subseteq \{1, 2, \ldots, p\}$, let $\beta_S = (\beta_j)_{j \in S}$ be the $|S|$-dimensional subvector of $\beta$ containing the entries indexed by $S$, and let $\mathbf{X}_S$ be the $n \times |S|$ matrix obtained from $\mathbf{X}$ by extracting the columns corresponding to $S$. Given a $p \times p$ matrix $C$ and subsets $S_1, S_2 \subseteq \{1, 2, \ldots, p\}$, let $C_{S_1, S_2}$ be the $|S_1| \times |S_2|$ submatrix of $C$ with rows determined by $S_1$ and columns determined by $S_2$.

Various penalty functions have been used in the variable selection literature for the linear regression model. Commonly used penalty functions include the $\ell_q$ penalties, $0 \le q \le 2$, the nonnegative garrotte [26], the elastic-net [27, 28], SCAD [3], and MCP [4]. In particular, the $\ell_1$-penalized least squares procedure is called the Lasso. However, Lasso estimates may be biased and inconsistent for model selection [3, 15]. This implies that the Lasso does not have the oracle property. The adaptive Lasso is a weighted version of the Lasso which has the oracle property [15]. Slightly abusing notation, the adaptive Lasso penalty is defined by $p_\lambda(\beta) = \lambda\omega_j|\beta|$, where $\omega_j$ is a data-dependent weight.

Fan and Li [3] proposed a continuously differentiable penalty function, called the SCAD penalty, whose derivative is defined by
$$
p'_\lambda(|\beta|) = \lambda\left\{ I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{(a-1)\lambda}\, I(|\beta| > \lambda) \right\}, \quad \text{for some } a > 2. \tag{3}
$$


Figure 1: (a) Plots of the penalties (including $\ell_0$, Lasso, SCAD, MCP, and Atan). (b) Plots of the Atan penalties with different $\gamma$ ($\gamma = 0.005, 0.1, 1, \infty$).

The authors of [3] suggested using $a = 3.7$ from a Bayesian perspective. The minimax concave penalty (MCP) [4] translates the flat part of the derivative of the SCAD to the origin and is given by
$$
p'_\lambda(|\beta|) = \frac{(a\lambda - |\beta|)_+}{a}, \tag{4}
$$
which minimizes the maximum concavity. Zhang [4] proved that the MCP procedure may select the correct model with probability tending to 1 and that the MCP estimator has good properties in terms of $\ell_p$-loss, provided $\lambda$ and $a$ satisfy certain conditions. Zhang's results in fact allow for $p \gg n$.

2.2. Atan Penalty. In this section, we first propose a novel nonconvex penalty that we call the Atan penalty. We then study its applications in sparse modeling.

The Atan penalty is defined by
$$
p_{\lambda,\gamma}(|\beta|) = \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\left(\frac{|\beta|}{\gamma}\right), \tag{5}
$$
for $\lambda \ge 0$ and $\gamma > 0$. It is clear that this penalty is concave in $|\beta|$. Moreover, we can establish its relationship with the $\ell_0$ and $\ell_1$ penalties. In particular, we have the following proposition.

Proposition 1. Let $p_{\lambda,\gamma}(|\beta|)$ be given in (5); then
$$
(a)\ \lim_{\gamma\to\infty} p_{\lambda,\gamma}(|\beta|) = \lambda|\beta|;
\qquad
(b)\ \lim_{\gamma\to 0} p_{\lambda,\gamma}(|\beta|) =
\begin{cases}
\lambda, & \text{if } |\beta| \neq 0, \\
0, & \text{if } |\beta| = 0.
\end{cases} \tag{6}
$$

The proposition shows that the limits of the Atan penalty as $\gamma \to 0$ and $\gamma \to \infty$ are the $\ell_0$ penalty and the $\ell_1$ penalty, respectively. The first-order derivative of $p_{\lambda,\gamma}(|\beta|)$ with respect to $|\beta|$ is
$$
p'_{\lambda,\gamma}(|\beta|) = \lambda\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \beta^2}. \tag{7}
$$
The Atan penalty function ($\gamma = 0.005$) is plotted in Figure 1(a), along with the SCAD, Lasso, MCP, and $\ell_0$ penalties. Figure 1(b) depicts the Atan penalty for different values of $\gamma$.
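In code, the penalty (5) and its derivative (7) are one-liners; a small sketch (names ours), which also lets one check Proposition 1 numerically:

```python
import numpy as np

def atan_penalty(b, lam, gam):
    # Atan penalty (5): lam * (gam + 2/pi) * arctan(|b| / gam)
    return lam * (gam + 2.0 / np.pi) * np.arctan(np.abs(b) / gam)

def atan_penalty_deriv(b, lam, gam):
    # first derivative (7) with respect to |b|: lam * gam * (gam + 2/pi) / (gam^2 + b^2)
    return lam * gam * (gam + 2.0 / np.pi) / (gam ** 2 + b ** 2)

# numeric check of Proposition 1 at |b| = 0.5, lam = 1:
# gam -> 0 gives ~1.0 (the l0 penalty), gam -> infinity gives ~0.5 (= lam * |b|)
print(atan_penalty(0.5, 1.0, 1e-8), atan_penalty(0.5, 1.0, 1e8))
```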

3. Theoretical Properties

In this section, we study the theoretical properties of the Atan estimator proposed in Section 2 in the situation where the number of parameters $p$ tends to $\infty$ with increasing sample size $n$. We discuss some conditions on the penalty and loss functions in Section 3.1. Our main results are presented in Section 3.2.

3.1. Regularity Conditions. We need to place the following conditions on the penalty functions:

(A) $n \to \infty$ and $p\sigma^2/n \to 0$.
(B) $\rho\sqrt{n/(p\sigma^2)} \to \infty$, where $\rho = \min_{j\in A}|\beta^*_j|$.
(C) $\lambda = O(1)$, $\lambda\sqrt{n/(p\sigma^2)} \to \infty$, and $\gamma = O(p^{1/2}\sigma^3 n^{-3/2})$.
(D) There exist constants $C_1, C_2 \in \mathbb{R}$ such that $C_1 < \lambda_{\min}((1/n)\mathbf{X}^T\mathbf{X}) < \lambda_{\max}((1/n)\mathbf{X}^T\mathbf{X}) < C_2$, where $\lambda_{\min}((1/n)\mathbf{X}^T\mathbf{X})$ and $\lambda_{\max}((1/n)\mathbf{X}^T\mathbf{X})$ are the smallest and largest eigenvalues of $(1/n)\mathbf{X}^T\mathbf{X}$, respectively.
(E) $\lim_{n\to\infty} n^{-1}\max_{1\le i\le n}\sum_{j=1}^{p} x_{ij}^2 = 0$.
(F) $E(|\varepsilon_i/\sigma|^{2+\delta}) < M$ for some $\delta > 0$ and $M < \infty$.


Since $p$ may vary with $n$, it is implicit that $\beta^*$ may vary with $n$. Additionally, we allow the model $A$ and the distribution of $\varepsilon$ (in particular, $\sigma^2$) to change with $n$. Condition (A) limits how $p$ and $\sigma^2$ may grow with $n$. This condition is substantially weaker than that required in [17], which requires $p^5/n \to 0$, slightly weaker than that required in [28], which requires $\log(p)/\log(n) \to \nu \in [0,1)$, and the same as that required in [6]. As mentioned in Section 1, other authors have studied PLS methods in settings where $p > n$; that is, their growth condition on $p$ is weaker than condition (A) [18]. Condition (B) gives a lower bound on the size of the smallest nonzero entry of $\beta^*$. Notice that the smallest nonzero entry of $\beta^*$ is allowed to vanish asymptotically, provided it does not do so faster than $\sqrt{p\sigma^2/n}$. Similar conditions are found in [17]. Condition (C) restricts the rates of the tuning parameters $\lambda$ and $\gamma$. Note that condition (C) does not constrain the minimum size of $\gamma$; indeed, no such constraint is required for our asymptotic results about the Atan estimator. Since the Atan penalty approaches the $\ell_0$ penalty as $\gamma \to 0$, this suggests that the Atan and $\ell_0$-penalized least squares estimators have similar asymptotic properties. On the other hand, in practice we have found that one should not take $\gamma$ too small, in order to preserve the stability of the Atan estimator. Condition (D) is an identifiability condition. Conditions (E) and (F) are used to prove the asymptotic normality of Atan estimators and are related to the Lindeberg condition of the Lindeberg-Feller central limit theorem. Conditions (A)-(F) imply that Atan has the oracle property and may correctly identify the model $A$, as we will see in Theorem 4.

3.2. Oracle Properties. Let
$$
Q_n(\beta) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \sum_{j=1}^{p} p_{\lambda,\gamma}(|\beta_j|) \tag{8}
$$
be the objective function, where $p_{\lambda,\gamma}(|\beta_j|)$ is the Atan penalty function.

Theorem 2. Suppose that conditions (A)-(D) hold; then for every $r \in (0,1)$ there exists a constant $C_0 > 0$ such that
$$
\liminf_{n\to\infty} P\left[\arg\min_{\beta} Q_n(\beta) \subseteq \left\{\beta \in R^p : \|\beta - \beta^*\| \le C\sqrt{\frac{p\sigma^2}{n}}\right\}\right] > 1 - r, \tag{9}
$$
whenever $C \ge C_0$. Consequently, there exists a sequence of local minimizers $\hat{\beta}$ of $Q_n(\beta)$ such that $\|\hat{\beta} - \beta^*\| = O_P(\sqrt{p\sigma^2/n})$.

Lemma 3. Assume that (A)-(D) hold and fix $C > 0$; then
$$
\lim_{n\to\infty} P\left[\arg\min_{\|\beta - \beta^*\| \le C\sqrt{p\sigma^2/n}} Q_n(\beta) \subseteq \{\beta \in R^p : \beta_{A^c} = 0\}\right] = 1, \tag{10}
$$
where $A^c = \{1, \ldots, p\} \setminus A$ is the complement of $A$ in $\{1, \ldots, p\}$.

Theorem 4 (oracle properties). Suppose that (A)-(F) hold; then there exists a sequence of $\sqrt{n/(p\sigma^2)}$-consistent local minima $\hat{\beta}$ of Atan such that:

(i) $\lim_{n\to\infty} P(\{j : \hat{\beta}_j \neq 0\} = A) = 1$;
(ii) $\sqrt{n}\,B_n(n^{-1}\mathbf{X}_A^T\mathbf{X}_A/\sigma^2)^{1/2}(\hat{\beta}_A - \beta^*_A) \to N(0, G)$ in distribution, where $B_n$ is an arbitrary $q \times |A|$ matrix such that $B_nB_n^T \to G$, and $G$ is a $q \times q$ nonnegative symmetric matrix.

4. Implementation

4.1. Iteratively Reweighted Lasso Algorithm. The computation for the Atan-penalized method is much more involved, because the resulting optimization problem is usually nonconvex and has multiple local minimizers. Several algorithms have been developed for computing folded concave penalized estimators, such as the local quadratic approximation (LQA) algorithm and the local linear approximation (LLA) algorithm [3, 29]. Both LQA and LLA are related to the majorization-minimization (MM) principle [30]. Recently, the coordinate descent algorithm was applied to solve the folded concave penalized least squares [31, 32]. Reference [4] devised a PLUS algorithm for solving the PLS using the MCP. Zhang [33] analyzed the capped-$\ell_1$ penalty for solving the PLS.

In this section, we present an iteratively reweighted Lasso (IRL) algorithm to solve the following minimization problem:
$$
\min_{\beta\in\mathbb{R}^p}\left\{\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda\sum_{j=1}^{p}\left(\gamma + \frac{2}{\pi}\right)\arctan\left(\frac{|\beta_j|}{\gamma}\right)\right\}. \tag{11}
$$

We show that the solution of the penalized least squares problem can be transformed into that of a series of convex weighted Lasso problems, to which existing Lasso algorithms can be efficiently applied.

Now, taking a first-order Taylor-series approximation of the Atan penalty about the current value $\beta_j = \beta^*_j$, we obtain the following overapproximation of (11):
$$
\min_{\beta\in\mathbb{R}^p}\left\{\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda\sum_{j=1}^{p}\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \beta^{*2}_j}|\beta_j|\right\}. \tag{12}
$$

Note that the linear approximation in (12) is analogous to the one proposed by Zou and Li [29], who argued strongly in favor of a one-step approximation. Instead, we offer the following algorithm for Atan-penalized least squares. Assume all the covariates have been standardized to have mean zero and unit variance. Without loss of generality, the span of regularization parameters is $\lambda_{\min} = \lambda_0 < \cdots < \lambda_L = \lambda_{\max}$, for $l = 0, \ldots, L$ [34]. Here we summarize the details of the IRL algorithm as in Algorithm 1, where the active set is defined as $A = \{j : \beta_j \neq 0,\ j = 1, 2, \ldots, p\}$.

In the situation with $n > p$, we set $\lambda_{\min} = 0$. However, when $n < p$, one attains the saturated model long before the regularization parameter reaches zero; in this case, we suggest $\lambda_{\min} = 10^{-4}\lambda_{\max}$ [34], and we used $\tau = 10^{-4}$. In practice, we have found that if the columns of $\mathbf{X}$ are standardized so that $\|\mathbf{x}_j\|^2 = n$ for $j = 1, \ldots, p$, then taking $\gamma = 0.005$ works well.


Algorithm 1 (iteratively reweighted Lasso (IRL) algorithm).

(1) Start with $\lambda_{\max} = \max_{1\le j\le p}|\mathbf{x}_j^T\mathbf{y}|/n$. Set $l = L$ and $\hat{\beta}^{(l)} = 0$.
Outer loop:
(2) Set $k = 0$ and $\tilde{\beta}^{(0)} = \hat{\beta}^{(l)}$.
(3) Increment $k = k + 1$ and initialize $\tilde{\beta}^{(k+1)} = \tilde{\beta}^{(k)}$.
(4) Update the weights
$$
\hat{w}_j = \frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\tilde{\beta}^{(k)}_j)^2}, \quad j = 1, \ldots, p.
$$
Inner loop: solve the Karush-Kuhn-Tucker (KKT) conditions for fixed $\hat{w}_j$,
$$
\frac{1}{n}\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta) - \lambda_l\hat{w}_j\operatorname{sgn}(\beta_j) = 0, \quad \text{if } j \in A,
$$
$$
\frac{1}{n}\left|\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)\right| < \lambda_l\hat{w}_j, \quad \text{if } j \notin A,
$$
and set $\tilde{\beta}^{(k+1)}$ to the solution.
(5) Go to Step (3).
(6) Repeat Steps (3)-(5) until $\|\tilde{\beta}^{(k+1)} - \tilde{\beta}^{(k)}\|^2 < \tau$. The estimate $\hat{\beta}^{(l)}$ is the limit point $\tilde{\beta}^{(\infty)}$ of the outer loop.
(7) Decrement $l = l - 1$ and move to $\lambda_{l}$. Return to (2), using the previous estimate as a warm start.
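To make Algorithm 1 concrete, here is a minimal Python sketch of the outer loop at a single $\lambda$, assuming scikit-learn's Lasso solver for the inner weighted problem (the weights are folded into the design by column rescaling rather than by solving the KKT conditions directly); function names and defaults are ours, not from the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

def atan_weights(beta, gam):
    # step (4): w_j = gam * (gam + 2/pi) / (gam^2 + beta_j^2)
    return gam * (gam + 2.0 / np.pi) / (gam ** 2 + beta ** 2)

def irl_atan(X, y, lam, gam=0.005, max_iter=100, tol=1e-4, beta0=None):
    """One run of the IRL outer loop at a fixed lambda (warm start via beta0)."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(max_iter):
        w = atan_weights(beta, gam)
        # weighted Lasso via rescaling: with theta_j = w_j * beta_j, the penalty
        # lam * sum_j w_j * |beta_j| becomes lam * ||theta||_1 on columns X_j / w_j
        Xw = X / w
        theta = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(Xw, y).coef_
        beta_new = theta / w
        if np.sum((beta_new - beta) ** 2) < tol:   # step (6) stopping rule
            return beta_new
        beta = beta_new
    return beta
```

A full path is obtained by calling `irl_atan` on a decreasing grid starting from $\lambda_{\max} = \max_j|\mathbf{x}_j^T\mathbf{y}|/n$, passing each solution as the warm start for the next $\lambda$, as in steps (1) and (7).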

Remark 5. Algorithm 1 is very much like an MM algorithm. Hunter and Li [30] were the first to advocate MM algorithms for variable selection when $n > p$. However, we should note that Algorithm 1 differs from the algorithm described in [30]. The use of MM algorithms in [30] is to justify a quadratic overapproximation to penalty functions with singularities at the origin. In the inner loop of Algorithm 1, we avoid such an approximation by solving the KKT conditions precisely and efficiently [35]. Our use of the MM principle in Algorithm 1 is to justify the local linear approximation to the Atan penalty in the outer loop.

Remark 6. An LLA to the SCAD penalty was also proposed by Zou and Li [29]. Algorithm 1 differs from the procedure proposed in [29] in that it constructs the entire coefficient path even when $p > n$, whereas the procedure in [29] computes the coefficient estimates for a fixed regularization parameter $\lambda$, starting with a root-$n$-consistent estimator of the true coefficient $\beta^*$ at the initial step. Moreover, even with the same initial estimate, the limit point from Algorithm 1 will differ from that of [29] after one iteration.

Remark 7. In the more general case where $p > n$, we used coordinate-wise optimization to compute the entire regularized coefficient path via Algorithm 1. Our experience is that coordinate-wise optimization works well in practice and converges very quickly.

4.2. Regularization Parameter Selection. Tuning parameter selection is an important issue in most PLS procedures, and there are relatively few studies on the choice of penalty parameters. Traditional model selection criteria, such as AIC [10] and BIC [11], suffer from a number of limitations. Their major drawback arises because parameter estimation and model selection are two different processes, which can result in instability [13] and complicated stochastic properties. To overcome the deficiency of traditional methods, Fan and Li [3] proposed the SCAD method, which estimates parameters while simultaneously selecting important variables. They selected the tuning parameter by minimizing GCV [1, 3, 26].

However, it is well known that GCV- and AIC-based methods are not consistent for model selection, in the sense that, as $n \to \infty$, they may select irrelevant predictors with nonvanishing probability [22, 36]. On the other hand, BIC-based tuning parameter selection roughly corresponds to maximizing the posterior probability of selecting the true model in an appropriate Bayesian formulation and has been shown to be consistent for model selection in several settings [22, 37, 38]. The BIC tuning parameter selector is defined by
$$
\mathrm{BIC}_\lambda = \log\left(\frac{\|\mathbf{y} - \mathbf{X}\hat{\beta}\|^2}{n}\right) + \widehat{\mathrm{DF}}_\lambda\,\frac{\log(n)}{n}, \tag{13}
$$
where $\widehat{\mathrm{DF}}_\lambda$ is the estimate of the degrees of freedom, given by
$$
\widehat{\mathrm{DF}}_\lambda = \operatorname{tr}\left\{\mathbf{X}\left(\mathbf{X}^T\mathbf{X} + n\hat{\Sigma}_\lambda\right)^{-1}\mathbf{X}^T\right\}, \tag{14}
$$
and $\hat{\Sigma}_\lambda = \operatorname{diag}\{p'_\lambda(|\hat{\beta}_1|)/|\hat{\beta}_1|, \ldots, p'_\lambda(|\hat{\beta}_p|)/|\hat{\beta}_p|\}$. The diagonal elements of $\hat{\Sigma}_\lambda$ are the coefficients of the quadratic terms in the local quadratic approximation to the SCAD penalty function $p_\lambda(\cdot)$ [3].

Dicker et al. [6] proposed a BIC-like procedure, implemented by minimizing
$$
\mathrm{BIC}_0(\lambda) = \log\left(\frac{\|\mathbf{y} - \mathbf{X}\hat{\beta}\|^2}{n - \widehat{\mathrm{df}}_0}\right) + \frac{\log(n)}{n}\,\widehat{\mathrm{df}}_0, \tag{15}
$$
where $\widehat{\mathrm{df}}_0$ is the number of nonzero coefficients in $\hat{\beta}$. To estimate the residual variance, they use $\hat{\sigma}^2 = (n - \widehat{\mathrm{df}}_0)^{-1}\|\mathbf{y} - \mathbf{X}\hat{\beta}\|^2$. This differs from other estimates of the residual variance used in PLS methods, where the denominator $n - \widehat{\mathrm{df}}_0$ is replaced by $n$ [22]; here, $n - \widehat{\mathrm{df}}_0$ is used to account for the degrees of freedom lost to estimation.

More work on the high-dimensional BIC for least squares regression and on tuning parameter selection for nonconvex penalized regression can be found in [24, 25, 39]. Here we used the BIC statistic as in (15).
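Criterion (15) is simple to evaluate in code; a sketch (names ours), where `fit_fn` stands for any solver returning $\hat{\beta}$ at a given $\lambda$, such as the IRL sketch of Section 4.1:

```python
import numpy as np

def bic0(X, y, beta_hat):
    # criterion (15): log(RSS / (n - df0)) + df0 * log(n) / n,
    # with df0 the number of nonzero estimated coefficients
    n = X.shape[0]
    df0 = np.count_nonzero(beta_hat)
    rss = np.sum((y - X @ beta_hat) ** 2)
    return np.log(rss / (n - df0)) + df0 * np.log(n) / n

def select_lambda(X, y, lams, fit_fn):
    # pick the lambda on the grid minimizing BIC0
    return min(lams, key=lambda lam: bic0(X, y, fit_fn(X, y, lam)))
```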


4.3. A Standard Error Formula. The standard errors for the estimated parameters can be obtained directly, because we are estimating parameters and selecting variables at the same time. Let $\hat{\beta} = \hat{\beta}(\lambda, \gamma)$ be a local minimizer of the Atan objective. Following [3, 17], the standard errors of $\hat{\beta}$ may be estimated by using quadratic approximations to the Atan penalty. Indeed, the approximation
$$
p_{\lambda,\gamma}(|\beta_j|) \approx p_{\lambda,\gamma}(|\beta_{j0}|) + \frac{1}{2}\,\frac{p'_{\lambda,\gamma}(|\beta_{j0}|)}{|\beta_{j0}|}\left(\beta_j^2 - \beta_{j0}^2\right), \quad \text{for } \beta_j \approx \beta_{j0}, \tag{16}
$$

suggests that the Atan objective may be replaced by the quadratic minimization problem
$$
\min_{\beta}\left\{\frac{1}{n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \sum_{j=1}^{p}\frac{p'_{\lambda,\gamma}(|\beta_{j0}|)}{|\beta_{j0}|}\,\beta_j^2\right\}, \tag{17}
$$

at least for the purposes of obtaining standard errors. Using this expression, we obtain a sandwich formula for the estimated standard error of $\hat{\beta}_{\hat{A}}$, where $\hat{A} = \{j : \hat{\beta}_j \neq 0\}$. Consider
$$
\widehat{\operatorname{cov}}(\hat{\beta}_{\hat{A}}) = \hat{\sigma}^2\left\{\mathbf{X}_{\hat{A}}^T\mathbf{X}_{\hat{A}} + n\Delta(\hat{\beta}_{\hat{A}})\right\}^{-1}\mathbf{X}_{\hat{A}}^T\mathbf{X}_{\hat{A}}\left\{\mathbf{X}_{\hat{A}}^T\mathbf{X}_{\hat{A}} + n\Delta(\hat{\beta}_{\hat{A}})\right\}^{-1}, \tag{18}
$$
where $\Delta(\beta) = \operatorname{diag}\{p'_{\lambda,\gamma}(|\beta_1|)/|\beta_1|, \ldots, p'_{\lambda,\gamma}(|\beta_p|)/|\beta_p|\}$, $\hat{\sigma}^2 = n^{-1}\|\mathbf{y} - \mathbf{X}\hat{\beta}\|^2$, and $\widehat{\mathrm{df}}_0 = |\hat{A}|$ is the number of elements in $\hat{A}$. Under the conditions of Theorem 4,
$$
\frac{B_n\mathbf{X}_{\hat{A}}^T\mathbf{X}_{\hat{A}}\,\widehat{\operatorname{cov}}(\hat{\beta}_{\hat{A}})\,B_n^T}{\sigma^2} \to G. \tag{19}
$$
This is a consistency result for $\widehat{\operatorname{cov}}(\hat{\beta}_{\hat{A}})$.
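Formula (18) translates directly into a few lines of numpy; a sketch under the definitions above (function name ours):

```python
import numpy as np

def atan_sandwich_cov(X, y, beta_hat, lam, gam):
    # sandwich estimate (18) for the covariance of the nonzero coefficients
    n = X.shape[0]
    A = np.flatnonzero(beta_hat)                    # selected model A-hat
    XA, bA = X[:, A], beta_hat[A]
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / n    # sigma^2-hat = ||y - X beta||^2 / n
    # Delta(beta): diagonal of p'_{lam,gam}(|b_j|) / |b_j|, using derivative (7)
    delta = lam * gam * (gam + 2.0 / np.pi) / (gam ** 2 + bA ** 2) / np.abs(bA)
    G = XA.T @ XA
    M = np.linalg.inv(G + n * np.diag(delta))
    return sigma2 * M @ G @ M
```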

5. Numerical Studies

5.1. Simulation Studies. We now investigate the sparsity recovery and estimation properties of the proposed estimator via numerical simulations. We compare the following estimators: the Lasso estimator (implemented using the R package glmnet [40]); the adaptive Lasso estimator (denoted by Alasso [15]); the SCAD estimator from the CD algorithm without calibration [31]; the MCP estimator from the CD algorithm with $a = 1.5$ [31]; the Dantzig selector [7]; and a Bayesian variable selection method [8]. For the proposed Atan estimator, we take $\gamma = 0.005$, and the BIC statistic (15) is used to select the tuning parameter $\lambda$. In the following, we report simulation results from four examples.

Example 8. In this example, simulation data are generated from the linear regression model
$$
y = \mathbf{x}^T\beta^* + \sigma\varepsilon, \tag{20}
$$
where $\beta^* = (3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0)^T$, $\varepsilon \sim N(0,1)$, and $\mathbf{x}$ follows a multivariate normal distribution with zero mean and covariance between the $i$th and $j$th elements equal to $\rho^{|i-j|}$, with $\rho = 0.5$. In our simulation, the sample size $n$ is set to 100 and 200, and $\sigma = 2$. For each case, we repeated the simulation 200 times.

For the linear model, the model error for $\hat{\mu} = \mathbf{x}^T\hat{\beta}$ is $\mathrm{ME}(\hat{\mu}) = (\hat{\beta} - \beta)^T E(\mathbf{x}\mathbf{x}^T)(\hat{\beta} - \beta)$. Simulation results are summarized in Table 1, in which MRME stands for the median of the ratios of the ME of a selected model to that of the unpenalized minimum square estimate under the full model. Both the columns labeled "C" and "IC" are measures of model complexity: column "C" shows the average number of nonzero coefficients correctly estimated to be nonzero, and column "IC" presents the average number of zero coefficients incorrectly estimated to be nonzero. In the column labeled "Underfit," we present the proportion of replications excluding any nonzero coefficients in the 200 replications. Likewise, we report the probability of selecting the exact subset model and the probability of including all three significant variables plus some noise variables in the columns "Correct-fit" and "Overfit," respectively.

As can be seen from Table 1, all variable selection procedures dramatically reduce model error. Atan has the smallest model error among all competitors, followed by Alasso, MCP, SCAD, Bayesian, and Dantzig. In terms of sparsity, Atan also has the highest probability of correct-fit. The Atan penalty performs better than the other penalties. Atan also has some advantages when the dimension $p$ is high, as can be seen in Example 9.

We now test the accuracy of our standard error formula (18). The median absolute deviation divided by 0.6745, denoted by SD in Table 2, of the three estimated nonzero coefficients in the 200 simulations can be regarded as the true standard error. The median of the 200 estimated SDs, denoted by SDm, and the median absolute deviation error of the 200 estimated standard errors divided by 0.6745, denoted by SDmad, gauge the overall performance of the standard error formula (18). Table 2 presents the results for the nonzero coefficients when the sample size is $n = 200$; the results for the other case, $n = 100$, are similar. Table 2 suggests that the sandwich formula performs surprisingly well.

Example 9. This example is from Wang et al. [23]. More specifically, we take $\beta^* = (11/4, -23/6, 37/12, -13/9, 1/3, 0, \ldots, 0)^T \in \mathbb{R}^p$ and $p = [4n^{1/4}] - 5$, where $[t]$ stands for the largest integer no larger than $t$. For this example, the predictor dimension $p$ is diverging, but the dimension of the true model is fixed to be 5. Results from the simulation study are found in Table 3. A conclusion similar to that of Example 8 can be drawn.

Example 10. In the simulation study presented here, we examined the performance of the various PLS methods for $p$ substantially larger than in the previous studies. In particular, we took $p = 339$, $n = 500$, $\sigma = 5$, and $\beta^* = (2I_{37}^T, -3I_{37}^T, I_{37}^T, \mathbf{0}_{228}^T)^T$, where $I_k \in R^k$ is the vector with all entries equal to 1; thus, $p_0 = 111$. We simulated 200 independent datasets $\{(y_1, x_1^T), \ldots, (y_n, x_n^T)\}$ in this study, and for each dataset we computed estimates of $\beta^*$. Results from this simulation study are found in Table 4.


Table 1: Simulation results for linear regression models of Example 8.

                     Number of zeros    Proportion of
Method     MRME      C        IC        Underfit   Correct-fit   Overfit

n = 100, σ = 2
Lasso      0.6213    3.0000   1.2050    0.0000     0.3700        0.6700
Alasso     0.3074    3.0000   0.3500    0.0000     0.7300        0.2700
SCAD       0.2715    3.0000   1.0650    0.0000     0.4400        0.5600
MCP        0.3041    3.0000   0.5650    0.0000     0.5800        0.4200
Dantzig    0.4623    3.0000   0.6546    0.0000     0.5700        0.4300
Bayesian   0.3548    3.0000   0.5732    0.0000     0.6300        0.3700
Atan       0.2550    3.0000   0.1750    0.0000     0.8450        0.1550

n = 200, σ = 2
Lasso      0.6027    3.0000   1.0700    0.0000     0.3550        0.6450
Alasso     0.2781    3.0000   0.1600    0.0000     0.8650        0.1350
SCAD       0.2900    3.0000   0.8550    0.0000     0.5250        0.4750
MCP        0.2752    3.0000   0.3650    0.0000     0.6850        0.3150
Dantzig    0.3863    3.0000   0.8576    0.0000     0.4920        0.5080
Bayesian   0.2563    3.0000   0.4754    0.0000     0.7150        0.2850
Atan       0.2508    3.0000   0.1000    0.0000     0.9050        0.0950

Table 2: Standard deviations of estimators for the linear regression model (n = 200).

           β̂₁                         β̂₂                         β̂₅
Method     SD       SDm (SDmad)       SD       SDm (SDmad)       SD       SDm (SDmad)
Lasso      0.1753   0.1453 (0.0100)   0.1730   0.1688 (0.0094)   0.1591   0.1301 (0.0079)
Alasso     0.1483   0.1638 (0.0095)   0.1642   0.1636 (0.0098)   0.1475   0.1439 (0.0073)
SCAD       0.1797   0.1634 (0.0105)   0.1819   0.1608 (0.0104)   0.1398   0.1438 (0.0076)
MCP        0.1602   0.1643 (0.0096)   0.1861   0.1656 (0.0097)   0.1464   0.1435 (0.0069)
Dantzig    0.1734   0.1645 (0.0115)   0.1723   0.1665 (0.0094)   0.1581   0.1538 (0.0074)
Bayesian   0.1568   0.1635 (0.0089)   0.1678   0.1649 (0.0092)   0.1367   0.1375 (0.0078)
Atan       0.1510   0.1643 (0.0108)   0.1591   0.1658 (0.0096)   0.1609   0.1434 (0.0083)

Perhaps the most striking aspect of the results presented in Table 4 is that hardly any method ever selected the correct model in this simulation study. However, given that $p$, $p_0$, and $\beta^*$ are substantially larger in this study than in the previous simulation studies, this may not be too surprising. Notice that, on average, Atan selects the most parsimonious models of all the methods and has the smallest model error. Atan's nearest competitor in terms of model error is Alasso: this implementation of Alasso has mean model error 0.2783, but its average selected model size, 103.5250, is larger than Atan's. Since $p_0 = 111$, it is clear that Atan underfits in some instances; in fact, all of the methods in this study underfit to some extent. This may be due to the fact that many of the nonzero entries in $\beta^*$ are small relative to the noise level, $\sigma = 5$.

Example 11. As an extension of this method, we consider the problem of simultaneous variable selection and estimation in the partially linear model
$$
Y = X'\beta + g(T) + \varepsilon, \tag{21}
$$
where $Y$ is a scalar response variate, $X$ is a $p$-vector covariate, and $T$ is a scalar covariate taking values in a compact interval (for simplicity, we assume this interval to be $[0,1]$); $\beta$ is a $p \times 1$ column vector of unknown regression parameters; the function $g(\cdot)$ is unknown; and the model error $\varepsilon$ is independent of $(X, T)$, with mean 0. Traditionally, it has generally been assumed that $\beta$ is finite dimensional; several standard approaches, such as the kernel method, the spline method [41], and local linear estimation [17], have been proposed.

In this study, we simulate $n = 100, 200$ points $T_i$, $i = 1, \ldots, n$, from the uniform distribution on $[0,1]$. For each $i$, the errors $e_{ij}$ are simulated to be normally distributed with the autocorrelated variance structure $AR(\rho)$, such that
$$
\operatorname{cov}(e_{ij}, e_{il}) = \rho^{|j-l|}, \quad 1 \le j, l \le 10. \tag{22}
$$
The $X_{ij}$'s are then formed as follows:
$$
\begin{aligned}
X_{i1} &= \sin(2T_i) + e_{i1}, & X_{i2} &= (0.5 + T_i)^{-2} + e_{i2}, \\
X_{i3} &= \exp(T_i) + e_{i3}, & X_{i5} &= (T_i - 0.7)^4 + e_{i5}, \\
X_{i6} &= T_i(1 + T_i^2)^{-1} + e_{i6}, & X_{i7} &= \sqrt{1 + T_i} + e_{i7}, \\
X_{i8} &= \log(3T_i + 8) + e_{i8}.
\end{aligned} \tag{23}
$$


Table 3: Simulation results for linear regression models of Example 9.

                     Number of zeros    Proportion of
Method     MRME      C        IC        Underfit   Correct-fit   Overfit

n = 200, p = 10
Lasso      0.9955    4.9900   2.0100    0.0100     0.1150        0.8750
Alasso     0.5740    4.8650   0.1100    0.1350     0.8150        0.0500
SCAD       0.5659    4.9800   0.5600    0.0200     0.5600        0.4200
MCP        0.6177    4.9100   0.1200    0.0900     0.8250        0.0850
Dantzig    0.6987    4.8900   0.6700    0.1100     0.4360        0.4540
Bayesian   0.5656    4.8650   0.2500    0.1350     0.6340        0.2310
Atan       0.5447    4.8900   0.1150    0.1100     0.8250        0.0650

n = 400, p = 12
Lasso      1.2197    5.0000   2.0650    0.0000     0.1400        0.8600
Alasso     0.4458    4.9950   0.0900    0.0050     0.9250        0.0700
SCAD       0.4481    5.0000   0.4850    0.0000     0.6350        0.3650
MCP        0.4828    5.0000   0.1150    0.0000     0.8950        0.1050
Dantzig    0.7879    5.0000   0.5670    0.0000     0.3200        0.6800
Bayesian   0.4237    5.0000   0.1800    0.0000     0.7550        0.2450
Atan       0.4125    4.9950   0.0250    0.0050     0.9700        0.0250

n = 800, p = 16
Lasso      1.2004    5.0000   2.5700    0.0000     0.0900        0.9100
Alasso     0.3156    5.0000   0.0700    0.0000     0.9300        0.0700
SCAD       0.3219    5.0000   0.6550    0.0000     0.5950        0.4050
MCP        0.3220    5.0000   0.0750    0.0000     0.9300        0.0700
Dantzig    0.5791    5.0000   0.5470    0.0000     0.3400        0.6600
Bayesian   0.3275    5.0000   0.2800    0.0000     0.6750        0.3250
Atan       0.3239    5.0000   0.0750    0.0000     0.9350        0.0650

Table 4: Simulation results for linear regression models of Example 10.

                     Number of zeros       Proportion of
Method     MRME      C          IC         Underfit   Correct-fit   Overfit

n = 500, σ = 5
Lasso      0.3492    110.2000   24.7450    0.5650     0.0000        0.4350
Alasso     0.2783    103.5250    6.1250    1.0000     0.0000        0.0000
SCAD       0.3012    106.0400   35.7150    0.9900     0.0000        0.0150
MCP        0.2791    103.3600    8.1100    1.0000     0.0000        0.0900
Dantzig    0.3120    108.3400   18.4650    0.7570     0.0000        0.2430
Bayesian   0.2876    104.4350    7.4700    0.9800     0.0000        0.0200
Atan       0.2794    101.4000    4.0600    1.0000     0.0000        0.0000

We investigate the scenario $p = 10$, with $X_{ij} = e_{ij}$ for $j = 4, 9, 10$. In this scenario, we have $\beta_j = j$ for $1 \le j \le 4$ and $\beta_j = 0$ otherwise, and $\varepsilon \sim N(0,1)$. For each case, we repeated the simulation 200 times. We investigate the function $g(t) = \cos(2\pi t)$. For comparison, we apply the different penalties to the parametric component, with a B-spline basis for the nonparametric component; see the sketch after this paragraph.

The results are summarized in Table 5. Columns 2-5 are the averages of the estimates of $\beta_j$, $j = 1, \ldots, 4$, respectively. Column 6 is the average number of the estimates of $\beta_j$, $5 \le j \le p$, which are 0, over the 200 simulations, and their medians are given in column 7.


Table 5: Example 11: comparison of estimators.

Estimator   β̂₁       β̂₂       β̂₃       β̂₄       C (avg)   C (median)   MME (SD)          RASE(ĝ(T))

n = 100, p = 10
SCAD        0.9617   2.0128   2.8839   4.0611   4.7450    5.0000       0.0695 (0.0625)   0.6269
Lasso       0.8278   1.9517   2.7984   3.9274   5.5200    6.0000       0.1727 (0.0790)   0.8079
Dantzig     0.8767   1.9544   2.7894   3.9302   5.4590    6.0000       0.1696 (0.0680)   0.7278
Atan        0.9710   2.0121   2.9865   4.0472   4.9820    5.0000       0.0658 (0.0630)   0.6214

n = 200, p = 10
SCAD        0.9836   1.9989   2.9283   4.0278   5.2000    5.0000       0.0230 (0.0210)   0.3604
Lasso       0.9219   1.9529   2.9124   3.9580   5.1450    5.0000       0.0500 (0.0315)   0.4220
Dantzig     0.9534   1.9675   2.9096   3.9345   5.1460    5.0000       0.0510 (0.0290)   0.4154
Atan        0.9904   1.9946   2.9548   4.0189   5.1250    5.0000       0.0190 (0.0245)   0.3460

Model errors are computed as $\mathrm{ME}(\hat{\mu}) = (\hat{\beta} - \beta)^T E(\mathbf{x}\mathbf{x}^T)(\hat{\beta} - \beta)$; their medians are listed in the 8th column, followed by the standard deviations of the model errors in parentheses. The performance of $\hat{g}(T)$ is assessed by the square root of average squared errors (RASE):
$$
\mathrm{RASE}^2(\hat{g}) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{g}(T_i) - g(T_i)\right)^2, \tag{24}
$$
where the $T_i$, $i = 1, \ldots, n$, are the observed data points at which the function $g$ is estimated. Column 9 summarizes the RASE values for the different situations.

We can see the following from Table 5. (1) Compared with the Lasso and Dantzig estimators, the SCAD and Atan estimators have smaller model errors, which is due to the fact that the SCAD and Atan estimators are unbiased while the Lasso and Dantzig estimators are biased, especially for the larger coefficients; moreover, the Atan and SCAD estimators are more stable. (2) Each method is able to select the important variables, but it is obvious that the Atan estimator has slightly stronger sparsity. (3) For the nonparametric component, the Atan and SCAD estimators have smaller RASE values.

5.2. Real Data Analysis. In this section, we apply the Atan regularization scheme to a prostate cancer example. The dataset in this example is derived from a study of prostate cancer in [42]. The dataset consists of the medical records of 97 patients who were about to receive a radical prostatectomy. The predictors are eight clinical measures: log(cancer volume) (lcavol), log(prostate weight) (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason), and percentage Gleason score 4 or 5 (pgg45). The response is the logarithm of the prostate-specific antigen (lpsa). One of the main aims here is to identify which predictors are more important in predicting the response.

The Lasso, Alasso, SCAD, MCP, and Atan methods are all applied to the data. We also compute the OLS estimate for the prostate cancer data. Results are summarized in Table 6. The OLS estimator does not perform variable selection. Lasso selects five variables in the final model; SCAD selects lcavol, lweight, lbph, and svi; and Alasso, MCP, and Atan select lcavol, lweight, and svi.

Table 6: Prostate cancer data: comparing different methods.

Method   R²       R²/R²_OLS   Variables selected
OLS      0.6615   1.0000      All
Lasso    0.5867   0.8870      (1, 2, 4, 5, 8)
Alasso   0.5991   0.9058      (1, 2, 5)
SCAD     0.6140   0.9283      (1, 2, 4, 5)
MCP      0.5999   0.9069      (1, 2, 5)
Atan     0.6057   0.9158      (1, 2, 5)

Thus, Atan selects a substantially simpler model than Lasso and SCAD. Furthermore, as indicated by the columns labeled $R^2$ ($R^2$ is equal to one minus the residual sum of squares divided by the total sum of squares) and $R^2/R^2_{\mathrm{OLS}}$ in Table 6, the Atan estimator describes more variability in the data than Alasso and MCP and nearly as much as the OLS estimator.

6. Conclusion and Discussion

In this paper, a new Atan penalty which very closely resembles the $\ell_0$ penalty is proposed. First, we establish the theory of the Atan estimator under mild conditions. The theory indicates that the Atan-penalized least squares procedure enjoys oracle properties even when the number of variables grows more slowly than the number of observations. Second, the iteratively reweighted Lasso algorithm makes our proposed estimator fast to implement. Third, we suggest a BIC-like tuning parameter selector to identify the true model consistently. Numerical studies further endorse our theoretical results and the advantage of the Atan estimator for model selection.

We do not address the situation where $p \gg n$ in this paper. In fact, the proposed Atan method can be easily extended for variable selection in the situation of $p \gg n$. Also, as is shown in Example 11, the Atan method can be applied to semiparametric and nonparametric models [43, 44]. Furthermore, there is a recent field of applications of variable selection, which is to look for impact points in functional data analysis [45, 46]. The possible combination of the Atan estimator with the questions in [45, 46] would be interesting. These problems are beyond the scope of this paper and will be interesting topics for future research.

Appendix

Proof of Theorem 2. Let $\alpha_n = \sqrt{p\sigma^2/n}$ and fix $r \in (0,1)$. To prove the theorem, it suffices to show that, if $C > 0$ is large enough, then
$$
Q_n(\beta^*) < \inf_{\|\mu\| = C} Q_n(\beta^* + \alpha_n\mu) \tag{A.1}
$$
holds for all $n$ sufficiently large, with probability at least $1 - r$. Define $D_n(\mu) = Q_n(\beta^* + \alpha_n\mu) - Q_n(\beta^*)$ and note that
$$
D_n(\mu) = \frac{1}{2n}\left(\alpha_n^2\|\mathbf{X}\mu\|^2 - 2\alpha_n\varepsilon^T\mathbf{X}\mu\right) + \sum_{j=1}^{p}\left[p_{\lambda,\gamma}(|\beta^*_j + \alpha_n\mu_j|) - p_{\lambda,\gamma}(|\beta^*_j|)\right]. \tag{A.2}
$$
The fact that $p_{\lambda,\gamma}$ is concave on $[0,\infty)$ implies that
$$
\begin{aligned}
p_{\lambda,\gamma}(|\beta^*_j + \alpha_n\mu_j|) - p_{\lambda,\gamma}(|\beta^*_j|)
&\ge p'_{\lambda,\gamma}(|\beta^*_j + \alpha_n\mu_j|)\left(|\beta^*_j + \alpha_n\mu_j| - |\beta^*_j|\right) \\
&\ge p'_{\lambda,\gamma}(|\beta^*_j + \alpha_n\mu_j|)\left(-\alpha_n|\mu_j|\right)
= -\lambda\alpha_n|\mu_j|\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta^*_j + \alpha_n\mu_j)^2},
\end{aligned} \tag{A.3}
$$
when $n$ is sufficiently large. Condition (B) implies that
$$
\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta^*_j + \alpha_n\mu_j)^2} \le \frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}. \tag{A.4}
$$
Thus, for $n$ big enough,
$$
D_n(\mu) \ge \frac{1}{2n}\left(\alpha_n^2\|\mathbf{X}\mu\|^2 - 2\alpha_n\varepsilon^T\mathbf{X}\mu\right) - Cp\lambda\alpha_n\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}. \tag{A.5}
$$
By (D),
$$
\frac{1}{2n}\alpha_n^2\|\mathbf{X}\mu\|^2 \ge \frac{\lambda_{\min}((1/n)\mathbf{X}^T\mathbf{X})}{2}\,C^2\alpha_n^2. \tag{A.6}
$$
On the other hand, (D) implies
$$
\frac{1}{n}\alpha_n\left|\varepsilon^T\mathbf{X}\mu\right| \le \frac{C\alpha_n}{\sqrt{n}}\left\|\frac{1}{\sqrt{n}}\mathbf{X}^T\varepsilon\right\| = O_P(C\alpha_n^2). \tag{A.7}
$$
Furthermore, (C) and (B) imply
$$
Cp\lambda\alpha_n\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2} = o(C\alpha_n^2). \tag{A.8}
$$
From (A.5)-(A.8), we conclude that, if $C > 0$ is large enough, then $\inf_{\|\mu\| = C} D_n(\mu) > 0$ holds for all $n$ sufficiently large, with probability at least $1 - r$. This proves Theorem 2.

Proof of Lemma 3. Suppose that $\beta \in R^p$ and that $\|\beta - \beta^*\| \le C\sqrt{p\sigma^2/n}$. Define $\tilde{\beta} \in R^p$ by $\tilde{\beta}_{A^c} = 0$ and $\tilde{\beta}_A = \beta_A$. Similar to the proof of Theorem 2, let
$$
D_n(\beta, \tilde{\beta}) = Q_n(\beta) - Q_n(\tilde{\beta}), \tag{A.9}
$$
where $Q_n(\beta)$ is defined in (8). Then
$$
\begin{aligned}
D_n(\beta, \tilde{\beta})
&= \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^2 - \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta}\|^2 + \sum_{j\in A^c} p_{\lambda,\gamma}(|\beta_j|) \\
&= \frac{1}{2n}\left\|\mathbf{y} - \mathbf{X}\tilde{\beta} - \mathbf{X}(\beta - \tilde{\beta})\right\|^2 - \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta}\|^2 + \sum_{j\in A^c} p_{\lambda,\gamma}(|\beta_j|) \\
&= \frac{1}{2n}(\beta - \tilde{\beta})^T\mathbf{X}^T\mathbf{X}(\beta - \tilde{\beta}) - \frac{1}{n}(\beta - \tilde{\beta})^T\mathbf{X}^T(\mathbf{y} - \mathbf{X}\tilde{\beta}) + \sum_{j\in A^c} p_{\lambda,\gamma}(|\beta_j|) \\
&= O_P\left(\|\beta - \tilde{\beta}\|\sqrt{\frac{p\sigma^2}{n}}\right) + \sum_{j\in A^c} p_{\lambda,\gamma}(|\beta_j|).
\end{aligned} \tag{A.10}
$$
On the other hand, since the Atan penalty is concave on $[0,\infty)$,
$$
p_{\lambda,\gamma}(|\beta_j|) \ge \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right)|\beta_j| \tag{A.11}
$$
for $j \in A^c$. Thus
$$
\sum_{j\in A^c} p_{\lambda,\gamma}(|\beta_j|) \ge \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right)\|\beta - \tilde{\beta}\|. \tag{A.12}
$$
By (C),
$$
\liminf_{n\to\infty}\arctan\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right) > 0. \tag{A.13}
$$
It follows from (A.10)-(A.12) that there is a constant $K > 0$ such that
$$
\frac{D_n(\beta, \tilde{\beta})}{\|\beta - \tilde{\beta}\|} \ge K\lambda + O_P\left(\sqrt{\frac{p\sigma^2}{n}}\right). \tag{A.14}
$$
Since $\lambda\sqrt{n/(p\sigma^2)} \to \infty$, the result follows.


Proof ofTheorem 4 Taken togetherTheorem 2 and Lemma 3imply that there exists a sequence of local minima of (8)such that minus 120573lowast = 119874

119875(radic1199011205902119899) and

119860119888 = 0 Part (i) of the

theorem follows immediatelyTo prove part (ii) observe that on the event 119895

119895= 0 =

119860 we must have

119860= 120573lowast

119860+ (X119879119860X119860)minus1

X119879119860120576 minus (119899

minus1X119879119860X119860)minus1

1199011015840

119860 (A15)

where 1199011015840119860= (1199011015840

120582120574(119895))119895isin119860

It follows that

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576

minus 119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

(A16)

whenever 119895 119895

= 0 = 119860 Now note that conditions (B)ndash(D)imply

1003817100381710038171003817100381710038171003817119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

1003817100381710038171003817100381710038171003817

= 119874119875(radic

119899119901

1205902

120582120574 (120574 + 2120587)

1205742 + 1205882) = 119900

119875(1)

(A17)

and thus

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576 + 119900119875 (1)

(A18)

To complete the proof of (ii) we use the Lindeberg-Fellercentral limit theorem to show that

119861119899(1205902X119879119860X119860)minus12

X119879119860120576 997888rarr 119873(0 119866) (A19)

in distribution Observe that

119861119899(1205902X119879119860X119860)minus12

X119879119860120576 =

119899

sum

119894=1

120596119894119899 (A20)

where 120596119894119899

= 119861119899(1205902X119879119860X119860)minus12

119909119894119860120576119894

Fix 1205750

gt 0 and let 120578119894119899

=

119909119879

119894119860(X119879119860X119860)minus12

119861119879

119899119861119899(X119879119860X119860)minus12

119909119894119860 Then

119864 [1003817100381710038171003817120596119894119899

100381710038171003817100381721003817100381710038171003817120596119894119899

10038171003817100381710038172gt 1205750] = 120578119894119899119864[

1205762

119894

12059021205781198941198991205762

119894

1205902gt 1205750]

le 120578119894119899119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

119875(1205781198941198991205762

119894

1205902gt 1205750)

120575(2+120575)

le 1205781+1205752+120575

119894119899120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

(A21)

Sincesum119899119894=1

120578119894119899

= tr(119861119879119899119861119899) rarr tr(119866) lt infin and since (E) implies



Lasso may not consistently select the correct model and is not necessarily asymptotically normal [14, 15]; a strong irrepresentable condition is necessary for the Lasso to be selection consistent [15, 16]. The folded concave penalization, unlike the Lasso, does not require the irrepresentable condition to achieve variable selection consistency and can correct the intrinsic estimation bias of the Lasso. Fan and Li [3] first systematically studied nonconvex penalized likelihood for fixed finite dimension $p$. They recommended the SCAD penalty, which enjoys the oracle property, for variable selection (a variable selection and estimation procedure is said to have the oracle property if it selects the true model $A$ with probability tending to one and if the estimated coefficients are asymptotically normal, with the same asymptotic variance as the least squares estimator based on the true model). Fan and Peng [17] extended these results by allowing $p$ to grow with $n$ at the rate $p = o(n^{1/5})$ or $p = o(n^{1/3})$. Lv and Fan [5] introduced the weak oracle property, which means that the estimator enjoys the same sparsity as the oracle estimator with asymptotic probability one and is consistent, and established regularity conditions under which the PLS estimator given by folded concave penalties has a nonasymptotic weak oracle property when the dimensionality $p$ grows nonpolynomially with the sample size $n$. The theoretical properties enjoyed by SELO [6] estimators allow the number of predictors $p$ to tend to infinity along with the number of observations $n$, provided $p/n \to 0$. For high-dimensional nonconvex penalized regression with $p > n$, Kim et al. [18] proved that the oracle estimator itself is a local minimum of SCAD penalized least squares regression under very relaxed conditions. Zhang [4] proposed the MCP and devised a novel PLUS algorithm which, used together, achieve the oracle property under certain regularity conditions. Recently, Fan et al. [9] have shown that the folded concave penalization methods enjoy the strong oracle property for high-dimensional sparse estimation. Important insight has also been gained through recent work on theoretical analysis of the global solution [19-21].

The practical performance of PLS procedures depends heavily on the choice of the tuning parameter. The theoretically optimal tuning parameter does not have an explicit representation and depends on unknown factors, such as the variance of the unobserved random noise. Cross-validation is commonly adopted in practice to select the tuning parameter but is observed to often result in overfitting. In the case of fixed $p$, Wang et al. [22] proposed selecting the tuning parameter by minimizing a generalized BIC tuning parameter selector, and Wang et al. [23] extended those results to the setting of a diverging number of parameters. Recently, Dicker et al. [6] proposed a BIC-like tuning parameter selector, and Wang et al. [21] extended the work of [24, 25] on BIC for high-dimensional least squares regression, proposing a high-dimensional BIC for a nonconvex penalized solution path.

In this paper, we propose an arctangent type (Atan) penalty function, which very closely approximates the $\ell_0$ penalty. Because the Atan penalty is continuous, the Atan estimator may be more stable than estimators obtained through $\ell_0$ penalized methods. The Atan penalty is a smooth function on $[0,\infty)$, and we use an iteratively reweighted Lasso algorithm to compute the estimator. We formally establish the model selection oracle property enjoyed by the Atan estimator; in particular, the asymptotic normality of the Atan estimator is formally established. Our asymptotic framework allows the number of predictors $p \to \infty$ along with the number of observations $n$, provided $p/n \to 0$. Furthermore, a BIC-like tuning parameter selection procedure is implemented for Atan.

This paper is organized in the following way. In Section 2, we introduce PLS estimators, give a brief overview of existing nonconvex penalty terms, and present the Atan penalty. We discuss some of its theoretical properties in Section 3. In Section 4, we describe a simple and efficient algorithm for obtaining the Atan estimator. Simulation studies and an application of the proposed methodology are presented in Section 5. Conclusions are given in Section 6, and the proofs are relegated to the Appendix.

2. The Atan-Penalized Least Squares Method

2.1. Linear Model and Penalized Least Squares. Suppose that $\{(y_i, x_i)\}_{i=1}^{n}$ is a random sample from the linear regression model
$$\mathbf{y} = \mathbf{X}\beta^{*} + \varepsilon, \quad (2)$$
where $\mathbf{X} = (x_1, x_2, \ldots, x_n)^T$ is the $n \times p$ design matrix, $\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$ is an $n$-dimensional response vector, and $\varepsilon$ is an $n$-dimensional noise vector whose entries are i.i.d. random errors with mean $0$ and variance $\sigma^2$.

When discussing variable selection, it is convenient to have concise notation. Denote the columns of $\mathbf{X}$ by $\mathbf{x}_1, \ldots, \mathbf{x}_p \in \mathbb{R}^n$ and the rows of $\mathbf{X}$ by $x_1, \ldots, x_n \in \mathbb{R}^p$. Let $A = \{j : \beta^{*}_j \neq 0\}$ be the true model and suppose that $p_0$ is the size of the true model; that is, $|A| = p_0$, where $|A|$ denotes the cardinality of $A$. In addition, for $S \subseteq \{1, 2, \ldots, p\}$, let $\beta_S = (\beta_j)_{j \in S}$ be the $|S|$-dimensional subvector of $\beta$ containing the entries indexed by $S$, and let $\mathbf{X}_S$ be the $n \times |S|$ matrix obtained from $\mathbf{X}$ by extracting the columns corresponding to $S$. Given a $p \times p$ matrix $C$ and subsets $S_1, S_2 \subseteq \{1, 2, \ldots, p\}$, let $C_{S_1 S_2}$ be the $|S_1| \times |S_2|$ submatrix of $C$ with rows determined by $S_1$ and columns determined by $S_2$.

Various penalty functions have been used in the variable selection literature for the linear regression model. Commonly used penalty functions include the $\ell_q$ penalties ($0 \le q \le 2$), the nonnegative garrote [26], the elastic-net [27, 28], SCAD [3], and MCP [4]. In particular, the $\ell_1$ penalized least squares procedure is called the Lasso. However, Lasso estimates may be biased and inconsistent for model selection [3, 15], which implies that the Lasso does not have the oracle property. The adaptive Lasso is a weighted version of the Lasso which does have the oracle property [15]; slightly abusing notation, the adaptive Lasso penalty is defined by $p_\lambda(\beta) = \lambda \omega_j |\beta|$, where $\omega_j$ is a data-dependent weight.

Fan and Li [3] proposed a continuously differentiable penalty function, called the SCAD penalty, which is defined through its derivative
$$p'_{\lambda}(|\beta|) = \lambda\Bigl\{ I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_{+}}{(a-1)\lambda}\, I(|\beta| > \lambda) \Bigr\} \quad \text{for some } a > 2. \quad (3)$$


[Figure 1: (a) plots of the penalties ($\ell_0$, Lasso, SCAD, MCP, and Atan) against $\beta$; (b) plots of the Atan penalty with different $\gamma$ ($\gamma = 0.005$, $0.1$, $1$, $\infty$).]

The authors of [3] suggested using $a = 3.7$ from a Bayesian perspective. The minimax concave penalty (MCP) [4] translates the flat part of the derivative of the SCAD to the origin and is given by
$$p'_{\lambda}(|\beta|) = \frac{(a\lambda - |\beta|)_{+}}{a}, \quad (4)$$
which minimizes the maximum concavity. Zhang [4] proved that the MCP procedure may select the correct model with probability tending to $1$ and that the MCP estimator has good properties in terms of $\ell_p$-loss, provided $\lambda$ and $a$ satisfy certain conditions; Zhang's results in fact allow for $p \gg n$.

2.2. Atan Penalty. In this section, we first propose a novel nonconvex penalty that we call the Atan penalty; we then study its applications in sparse modeling.

The Atan penalty is defined by
$$p_{\lambda,\gamma}(|\beta|) = \lambda\Bigl(\gamma + \frac{2}{\pi}\Bigr)\arctan\Bigl(\frac{|\beta|}{\gamma}\Bigr) \quad (5)$$
for $\lambda \ge 0$ and $\gamma > 0$. It is clear that this penalty is concave in $|\beta|$. Moreover, we can establish its relationship with the $\ell_0$ and $\ell_1$ penalties. In particular, we have the following proposition.

Proposition 1. Let $p_{\lambda,\gamma}(|\beta|)$ be given in (5); then
$$\text{(a)}\ \lim_{\gamma\to\infty} p_{\lambda,\gamma}(|\beta|) = \lambda|\beta|; \qquad \text{(b)}\ \lim_{\gamma\to 0} p_{\lambda,\gamma}(|\beta|) = \begin{cases}\lambda, & |\beta| \neq 0,\\ 0, & |\beta| = 0.\end{cases} \quad (6)$$

The proposition shows that the limits of the Atan penalty as $\gamma \to 0$ and $\gamma \to \infty$ are the $\ell_0$ penalty and the $\ell_1$ penalty, respectively. The first-order derivative of $p_{\lambda,\gamma}(|\beta|)$ with respect to $|\beta|$ is
$$p'_{\lambda,\gamma}(|\beta|) = \lambda\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^{2} + \beta^{2}}. \quad (7)$$
The Atan penalty function ($\gamma = 0.005$) is plotted in Figure 1(a) along with the SCAD, Lasso, MCP, and $\ell_0$ penalties; Figure 1(b) depicts the Atan penalty with different $\gamma$.
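To make the penalty concrete, the short sketch below (our own illustration; the paper itself provides no code, and the function names are hypothetical) evaluates (5) and (7) with NumPy and numerically checks the two limits of Proposition 1.

```python
import numpy as np

def atan_penalty(beta, lam, gamma):
    """Atan penalty (5): lambda * (gamma + 2/pi) * arctan(|beta| / gamma)."""
    return lam * (gamma + 2.0 / np.pi) * np.arctan(np.abs(beta) / gamma)

def atan_penalty_deriv(beta, lam, gamma):
    """First-order derivative (7) with respect to |beta|."""
    return lam * gamma * (gamma + 2.0 / np.pi) / (gamma**2 + beta**2)

beta, lam = 1.5, 1.0
# Proposition 1(a): gamma -> infinity recovers the l1 penalty lambda * |beta|.
print(atan_penalty(beta, lam, 1e8))   # ~1.5 = lam * |beta|
# Proposition 1(b): gamma -> 0 recovers the l0 penalty lambda * 1{beta != 0}.
print(atan_penalty(beta, lam, 1e-8))  # ~1.0 = lam
```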

3. Theoretical Properties

In this section, we study the theoretical properties of the Atan estimator proposed in Section 2 in the situation where the number of parameters $p$ tends to $\infty$ with increasing sample size $n$. We discuss conditions on the penalty and loss functions in Section 3.1; our main results are presented in Section 3.2.

3.1. Regularity Conditions. We need to place the following conditions on the penalty and loss functions:

(A) $n \to \infty$ and $p\sigma^{2}/n \to 0$.

(B) $\rho\sqrt{n/(p\sigma^{2})} \to \infty$, where $\rho = \min_{j\in A}|\beta^{*}_j|$.

(C) $\lambda = O(1)$, $\lambda\sqrt{n/(p\sigma^{2})} \to \infty$, and $\gamma = O(p^{1/2}\sigma^{3}n^{-3/2})$.

(D) There exist constants $C_1, C_2$ with $0 < C_1 < C_2 < \infty$ such that $C_1 < \lambda_{\min}((1/n)\mathbf{X}^T\mathbf{X}) < \lambda_{\max}((1/n)\mathbf{X}^T\mathbf{X}) < C_2$, where $\lambda_{\min}((1/n)\mathbf{X}^T\mathbf{X})$ and $\lambda_{\max}((1/n)\mathbf{X}^T\mathbf{X})$ are the smallest and largest eigenvalues of $(1/n)\mathbf{X}^T\mathbf{X}$, respectively.

(E) $\lim_{n\to\infty} n^{-1}\max_{1\le i\le n}\sum_{j=1}^{p} x_{ij}^{2} = 0$.

(F) $E(|\varepsilon_i/\sigma|^{2+\delta}) < M$ for some $\delta > 0$ and $M < \infty$.

Since $p$ may vary with $n$, it is implicit that $\beta^{*}$ may vary with $n$; additionally, we allow the model $A$ and the distribution of $\varepsilon$ (in particular, $\sigma^{2}$) to change with $n$. Condition (A) limits how $p$ and $\sigma^{2}$ may grow with $n$. This condition is substantially weaker than that required in [17], which requires $p^{5}/n \to 0$; slightly weaker than that required in [28], which requires $\log(p)/\log(n) \to \nu \in [0,1)$; and the same as that required in [6]. As mentioned in Section 1, other authors have studied PLS methods in settings where $p > n$; that is, their growth condition on $p$ is weaker than condition (A) [18]. Condition (B) gives a lower bound on the size of the smallest nonzero entry of $\beta^{*}$; the smallest nonzero entry of $\beta^{*}$ is allowed to vanish asymptotically, provided it does not do so faster than $\sqrt{p\sigma^{2}/n}$. Similar conditions are found in [17]. Condition (C) restricts the rates of the tuning parameters $\lambda$ and $\gamma$. Note that condition (C) does not constrain the minimum size of $\gamma$; indeed, no such constraint is required for our asymptotic results about the Atan estimator. Since the Atan penalty approaches the $\ell_0$ penalty as $\gamma \to 0$, this suggests that the Atan and $\ell_0$ penalized least squares estimators have similar asymptotic properties. On the other hand, in practice we have found that one should not take $\gamma$ too small, in order to preserve the stability of the Atan estimator. Condition (D) is an identifiability condition. Conditions (E) and (F) are used to prove the asymptotic normality of Atan estimators and are related to the Lindeberg condition of the Lindeberg-Feller central limit theorem. Conditions (A)-(F) imply that Atan has the oracle property and may correctly identify the model $A$, as we will see in Theorem 4.

3.2. Oracle Properties. Let
$$Q_n(\beta) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^{2} + \sum_{j=1}^{p} p_{\lambda,\gamma}(|\beta_j|) \quad (8)$$
be the objective function, where $p_{\lambda,\gamma}(|\beta_j|)$ is the Atan penalty.

Theorem 2. Suppose that conditions (A)-(D) hold. Then, for every $r \in (0,1)$, there exists a constant $C_0 > 0$ such that
$$\liminf_{n\to\infty} P\Bigl[\arg\min_{\beta} Q_n(\beta) \subseteq \Bigl\{\beta\in\mathbb{R}^{p} : \|\beta-\beta^{*}\| \le C\sqrt{\frac{p\sigma^{2}}{n}}\Bigr\}\Bigr] > 1 - r \quad (9)$$
whenever $C \ge C_0$. Consequently, there exists a sequence of local minimizers $\hat\beta$ of $Q_n(\beta)$ such that $\|\hat\beta - \beta^{*}\| = O_P(\sqrt{p\sigma^{2}/n})$.

Lemma 3. Assume that (A)-(D) hold and fix $C > 0$. Then
$$\lim_{n\to\infty} P\Bigl[\arg\min_{\|\beta-\beta^{*}\|\le C\sqrt{p\sigma^{2}/n}} Q_n(\beta) \subseteq \{\beta\in\mathbb{R}^{p} : \beta_{A^{c}} = 0\}\Bigr] = 1, \quad (10)$$
where $A^{c} = \{1,\ldots,p\}\setminus A$ is the complement of $A$ in $\{1,\ldots,p\}$.

Theorem 4 (oracle properties). Suppose that (A)-(F) hold. Then there exists a sequence of $\sqrt{n/(p\sigma^{2})}$-consistent local minima $\hat\beta$ of Atan such that:

(i) $\lim_{n\to\infty} P(\{j : \hat\beta_j \neq 0\} = A) = 1$;

(ii) $\sqrt{n}\,B_n (n^{-1}\mathbf{X}_A^T\mathbf{X}_A/\sigma^{2})^{1/2}(\hat\beta_A - \beta^{*}_A) \to N(0, G)$ in distribution, where $B_n$ is an arbitrary $q \times |A|$ matrix such that $B_n B_n^T \to G$, and $G$ is a $q \times q$ nonnegative-definite symmetric matrix.

4. Implementation

4.1. Iteratively Reweighted Lasso Algorithm. Computation for the Atan-penalized method is more involved, because the resulting optimization problem is usually nonconvex and has multiple local minimizers. Several algorithms have been developed for computing folded concave penalized estimators, such as the local quadratic approximation (LQA) algorithm and the local linear approximation (LLA) algorithm [3, 29]; both LQA and LLA are related to the majorization-minimization (MM) principle [30]. Recently, coordinate descent algorithms were applied to solve the folded concave penalized least squares [31, 32]; reference [4] devised a PLUS algorithm for solving the PLS using the MCP; and Zhang [33] analyzed the capped-$\ell_1$ penalty for solving the PLS.

In this section, we present an iteratively reweighted Lasso (IRL) algorithm to solve the minimization problem
$$\min_{\beta\in\mathbb{R}^{p}}\ \frac{1}{2n}\|\mathbf{y}-\mathbf{X}\beta\|^{2} + \lambda\sum_{j=1}^{p}\Bigl(\gamma+\frac{2}{\pi}\Bigr)\arctan\Bigl(\frac{|\beta_j|}{\gamma}\Bigr). \quad (11)$$
We show that the solution of this penalized least squares problem can be obtained by solving a series of convex weighted Lasso problems, to which existing Lasso algorithms can be efficiently applied.

Taking a first-order Taylor-series approximation of the Atan penalty about the current value $\beta_j = \beta^{*}_j$, we obtain the overapproximation of (11):
$$\min_{\beta\in\mathbb{R}^{p}}\ \frac{1}{2n}\|\mathbf{y}-\mathbf{X}\beta\|^{2} + \lambda\sum_{j=1}^{p} \frac{\gamma(\gamma+2/\pi)}{\gamma^{2}+\beta_j^{*2}}\,|\beta_j|. \quad (12)$$
Note that the linear approximation in (12) is analogous to the one proposed in Zou and Li [29], who argued strongly in favor of a one-step approximation. Instead, we offer the following algorithm for Atan-penalized least squares. Assume all the covariates have been standardized to have mean zero and unit variance. Without loss of generality, the span of regularization parameters is $\lambda_{\min} = \lambda_0 < \cdots < \lambda_L = \lambda_{\max}$, for $l = 0, \ldots, L$ [34]. Here we summarize the details of the IRL algorithm as Algorithm 1, where the active set is defined as $A = \{j : \beta_j \neq 0,\ j = 1, 2, \ldots, p\}$.

In the situation with $n > p$, we set $\lambda_{\min} = 0$. However, when $n < p$, one attains the saturated model long before the regularization parameter reaches zero; in this case we suggest $\lambda_{\min} = 10^{-4}\lambda_{\max}$ [34], and we used $\tau = 10^{-4}$. In practice, we have found that, if the columns of $\mathbf{X}$ are standardized so that $\|\mathbf{x}_j\|^{2} = n$ for $j = 1, \ldots, p$, then taking $\gamma = 0.005$ works well.


(1) Start with $\lambda_{\max} = \max_{1\le j\le p} |\mathbf{x}_j^T \mathbf{y}|/n$. Set $l = L$ and $\beta^{(l)} = 0$.
Outer loop:
(2) Set $k = 0$ and $\hat\beta^{(0)} = \beta^{(l)}$.
(3) Increment $k = k + 1$ and initialize $\hat\beta^{(k+1)} = \hat\beta^{(k)}$.
(4) Update the weights $\hat w_j = \gamma(\gamma + 2/\pi)/(\gamma^{2} + (\hat\beta_j^{(k)})^{2})$, $j = 1, \ldots, p$.
Inner loop: solve the Karush-Kuhn-Tucker (KKT) conditions for fixed $\hat w_j$:
$\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta) - \lambda_l \hat w_j \operatorname{sgn}(\beta_j) = 0$ if $j \in A$;
$|\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\beta)| < \lambda_l \hat w_j$ if $j \notin A$.
(5) Go to Step (3).
(6) Repeat Steps (3)-(5) until $\|\hat\beta^{(k+1)} - \hat\beta^{(k)}\|^{2} < \tau$. The estimate $\beta^{(l)}$ is the limit point $\hat\beta^{(\infty)}$ of the outer loop.
(7) Decrement $l$ to $l - 1$ (i.e., move to the next regularization parameter $\lambda_{l-1}$) and return to (2), using $\beta^{(l)}$ as a warm start.

Algorithm 1: Iteratively reweighted Lasso (IRL) algorithm.
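The following is a compact NumPy sketch of the same idea (our own illustration, not the authors' code; all names are hypothetical): the inner loop is a cyclic coordinate descent solver for the weighted Lasso (12), whose soft-thresholding updates satisfy the KKT conditions above, and the outer loops relinearize the Atan penalty along a decreasing $\lambda$ grid with warm starts. It assumes standardized columns and uses a fixed geometric $\lambda$ grid, a simplification relative to Algorithm 1.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso_cd(X, y, lam, w, beta0, n_sweeps=200, tol=1e-8):
    """Minimize (1/2n)||y - X b||^2 + lam * sum_j w_j |b_j| by coordinate descent."""
    n, p = X.shape
    beta = beta0.copy()
    col_ss = (X**2).sum(axis=0) / n          # ||x_j||^2 / n (assumed > 0)
    r = y - X @ beta                         # running residual
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            rho = X[:, j] @ r / n + col_ss[j] * beta[j]
            bj = soft_threshold(rho, lam * w[j]) / col_ss[j]
            r += X[:, j] * (beta[j] - bj)    # keep the residual consistent
            beta[j] = bj
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def atan_irl_path(X, y, n_lambda=100, gamma=0.005, tau=1e-4, n_outer=30):
    """Decreasing lambda grid with warm starts (outer loop of Algorithm 1)."""
    n, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / n
    lambdas = np.geomspace(lam_max, 1e-4 * lam_max, n_lambda)
    beta, path = np.zeros(p), []
    for lam in lambdas:
        for _ in range(n_outer):             # reweighting steps at this lambda
            w = gamma * (gamma + 2.0 / np.pi) / (gamma**2 + beta**2)
            beta_new = weighted_lasso_cd(X, y, lam, w, beta)
            converged = np.sum((beta_new - beta)**2) < tau
            beta = beta_new
            if converged:
                break
        path.append(beta.copy())
    return lambdas, np.array(path)
```

Because the reweighting at $\beta = 0$ inflates the penalty on each coordinate by the factor $(\gamma + 2/\pi)/\gamma$, the largest $\lambda$ values on the grid give the null model, and nonzero estimates appear only further down the path; this mirrors the warm-start behavior of Algorithm 1.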

Remark 5. Algorithm 1 is very much like an MM algorithm. Hunter and Li [30] were the first to advocate MM algorithms for variable selection when $n > p$. However, we should note that Algorithm 1 differs from the algorithm described in [30]: the use of MM in [30] is to justify a quadratic overapproximation to penalty functions with singularities at the origin, whereas in the inner loop of Algorithm 1 we avoid such an approximation by solving the KKT conditions precisely and efficiently [35]. Our use of the MM principle in Algorithm 1 is to justify the local linear approximation to the Atan penalty in the outer loop.

Remark 6. An LLA to the SCAD penalty was also proposed by Zou and Li [29]. Algorithm 1 differs from the procedure in [29] in that it constructs the entire coefficient path even when $p > n$, whereas the procedure in [29] computes coefficient estimates for a fixed regularization parameter $\lambda$, starting with a root-$n$ consistent estimator of the true coefficient $\beta^{*}$ at the initial step. Moreover, even with the same initial estimate, the limit point of Algorithm 1 will differ from that of [29] after one iteration.

Remark 7. In the more general case where $p > n$, we used coordinate-wise optimization to compute the entire regularized coefficient path via Algorithm 1. Our experience is that coordinate-wise optimization works well in practice and converges very quickly.

4.2. Regularization Parameter Selection. Tuning parameter selection is an important issue in most PLS procedures, and there are relatively few studies on the choice of penalty parameters. Traditional model selection criteria, such as AIC [10] and BIC [11], suffer from a number of limitations; their major drawback arises because parameter estimation and model selection are two different processes, which can result in instability [13] and complicated stochastic properties. To overcome this deficiency of traditional methods, Fan and Li [3] proposed the SCAD method, which estimates parameters while simultaneously selecting important variables; they selected the tuning parameter by minimizing GCV [1, 3, 26].

However, it is well known that GCV- and AIC-based methods are not consistent for model selection, in the sense that, as $n \to \infty$, they may select irrelevant predictors with nonvanishing probability [22, 36]. On the other hand, BIC-based tuning parameter selection roughly corresponds to maximizing the posterior probability of selecting the true model in an appropriate Bayesian formulation and has been shown to be consistent for model selection in several settings [22, 37, 38]. The BIC tuning parameter selector is defined by
$$\mathrm{BIC}_{\lambda} = \log\Bigl(\frac{\|\mathbf{y} - \mathbf{X}\hat\beta_{\lambda}\|^{2}}{n}\Bigr) + \mathrm{DF}_{\lambda}\,\frac{\log(n)}{n}, \quad (13)$$

where $\mathrm{DF}_{\lambda}$ is an estimate of the degrees of freedom, given by
$$\mathrm{DF}_{\lambda} = \operatorname{tr}\bigl\{\mathbf{X}(\mathbf{X}^T\mathbf{X} + n\Sigma_{\lambda})^{-1}\mathbf{X}^T\bigr\}, \quad (14)$$
and $\Sigma_{\lambda} = \operatorname{diag}\{p'_{\lambda}(|\hat\beta_1|)/|\hat\beta_1|, \ldots, p'_{\lambda}(|\hat\beta_p|)/|\hat\beta_p|\}$. The diagonal elements of $\Sigma_{\lambda}$ are the coefficients of the quadratic terms in the local quadratic approximation to the SCAD penalty function $p_{\lambda}(\cdot)$ [3].

Dicker et al. [6] proposed a BIC-like procedure implemented by minimizing
$$\mathrm{BIC}_{0} = \log\Bigl(\frac{\|\mathbf{y} - \mathbf{X}\hat\beta\|^{2}}{n - \hat p_0}\Bigr) + \hat p_0\,\frac{\log(n)}{n}, \quad (15)$$
where $\hat p_0$ is the number of nonzero coefficients of $\hat\beta$. To estimate the residual variance, they use $\hat\sigma^{2} = (n - \hat p_0)^{-1}\|\mathbf{y} - \mathbf{X}\hat\beta\|^{2}$. This differs from other estimates of the residual variance used in PLS methods, where the denominator $n - \hat p_0$ is replaced by $n$ [22]; here, $n - \hat p_0$ is used to account for the degrees of freedom lost to estimation.

More work on high-dimensional BIC for least squares regression and on tuning parameter selection for nonconvex penalized regression can be found in [24, 25, 39]. Here we use the BIC statistic as in (15).
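Operationally, the selector (15) amounts to scoring each solution on the $\lambda$ path; the sketch below (hypothetical helper names, building on the IRL sketch of Section 4.1) illustrates this.

```python
import numpy as np

def bic0(X, y, beta_hat):
    """BIC_0 of (15): log(RSS / (n - p0_hat)) + p0_hat * log(n) / n."""
    n = X.shape[0]
    p0_hat = int(np.count_nonzero(beta_hat))
    rss = float(np.sum((y - X @ beta_hat)**2))
    if n - p0_hat <= 0:                  # saturated fit: exclude such models
        return np.inf
    return np.log(rss / (n - p0_hat)) + p0_hat * np.log(n) / n

def select_by_bic0(X, y, lambdas, betas):
    """Pick the (lambda, beta) pair on the path minimizing BIC_0."""
    scores = [bic0(X, y, b) for b in betas]
    k = int(np.argmin(scores))
    return lambdas[k], betas[k]
```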


4.3. A Standard Error Formula. The standard errors for the estimated parameters can be obtained directly, because we are estimating parameters and selecting variables at the same time. Let $\hat\beta = \hat\beta(\lambda, \gamma)$ be a local minimizer of Atan. Following [3, 17], standard errors of $\hat\beta$ may be estimated by using quadratic approximations to Atan. Indeed, the approximation
$$p_{\lambda,\gamma}(|\beta_j|) \approx p_{\lambda,\gamma}(|\beta_{j0}|) + \frac{1}{2}\,\frac{p'_{\lambda,\gamma}(|\beta_{j0}|)}{|\beta_{j0}|}\,(\beta_j^{2} - \beta_{j0}^{2}), \quad \text{for } \beta_j \approx \beta_{j0}, \quad (16)$$
suggests that Atan may be replaced by the quadratic minimization problem
$$\min\ \Bigl\{\frac{1}{n}\|\mathbf{y} - \mathbf{X}\beta\|^{2} + \sum_{j=1}^{p} \frac{p'_{\lambda,\gamma}(|\beta_{j0}|)}{|\beta_{j0}|}\,\beta_j^{2}\Bigr\}, \quad (17)$$
at least for the purposes of obtaining standard errors. Using this expression, we obtain a sandwich formula for the estimated standard error of $\hat\beta_{\hat A}$, where $\hat A = \{j : \hat\beta_j \neq 0\}$. Consider
$$\widehat{\operatorname{cov}}(\hat\beta_{\hat A}) = \hat\sigma^{2}\bigl\{\mathbf{X}_{\hat A}^T\mathbf{X}_{\hat A} + n\Delta_{\hat A}(\hat\beta_{\hat A})\bigr\}^{-1}\,\mathbf{X}_{\hat A}^T\mathbf{X}_{\hat A}\,\bigl\{\mathbf{X}_{\hat A}^T\mathbf{X}_{\hat A} + n\Delta_{\hat A}(\hat\beta_{\hat A})\bigr\}^{-1}, \quad (18)$$
where $\Delta(\beta) = \operatorname{diag}\{p'_{\lambda,\gamma}(|\beta_1|)/|\beta_1|, \ldots, p'_{\lambda,\gamma}(|\beta_p|)/|\beta_p|\}$, $\hat\sigma^{2} = n^{-1}\|\mathbf{y} - \mathbf{X}\hat\beta\|^{2}$, and $\hat p_0 = |\hat A|$ is the number of elements in $\hat A$.

Under the conditions of Theorem 4,
$$\frac{B_n\,\mathbf{X}_{\hat A}^T\mathbf{X}_{\hat A}\,\widehat{\operatorname{cov}}(\hat\beta_{\hat A})\,B_n^T}{\sigma^{2}} \longrightarrow G. \quad (19)$$
This is a consistency result for $\widehat{\operatorname{cov}}(\hat\beta_{\hat A})$.
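The sandwich formula (18) is straightforward to implement once the active set is known. The following sketch (our own helper, with hypothetical names) returns the estimated covariance matrix for the selected coefficients; standard errors are the square roots of its diagonal.

```python
import numpy as np

def atan_sandwich_cov(X, y, beta_hat, lam, gamma):
    """Sandwich covariance (18) for the nonzero entries of beta_hat."""
    n = X.shape[0]
    A = np.flatnonzero(beta_hat)                     # selected set A_hat
    XA, bA = X[:, A], beta_hat[A]
    sigma2_hat = np.sum((y - X @ beta_hat)**2) / n
    # Delta = diag{ p'_{lambda,gamma}(|b_j|) / |b_j| }, j in A_hat, from (7)
    deriv = lam * gamma * (gamma + 2.0 / np.pi) / (gamma**2 + bA**2)
    Delta = np.diag(deriv / np.abs(bA))
    M = np.linalg.inv(XA.T @ XA + n * Delta)
    return sigma2_hat * M @ (XA.T @ XA) @ M          # formula (18)

# Standard errors: np.sqrt(np.diag(atan_sandwich_cov(X, y, beta_hat, lam, gamma)))
```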

5. Numerical Studies

5.1. Simulation Studies. We now investigate the sparsity recovery and estimation properties of the proposed estimator via numerical simulations. We compare the following estimators: the Lasso estimator (implemented using the R package glmnet [40]); the adaptive Lasso estimator (denoted by Alasso [15]); the SCAD estimator from the CD algorithm without calibration [31]; the MCP estimator from the CD algorithm with $a = 1.5$ [31]; the Dantzig selector [7]; and the Bayesian variable selection method [8]. For the proposed Atan estimator, we take $\gamma = 0.005$, and the BIC statistic (15) is used to select the tuning parameter $\lambda$. In the following, we report simulation results from four examples.

Example 8. In this example, simulation data are generated from the linear regression model
$$y = \mathbf{x}^T\beta^{*} + \sigma\varepsilon, \quad (20)$$
where $\beta^{*} = (3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0)^T$, $\varepsilon \sim N(0,1)$, and $\mathbf{x}$ follows a multivariate normal distribution with zero mean and covariance between the $i$th and $j$th elements given by $\rho^{|i-j|}$ with $\rho = 0.5$. In our simulation, the sample size $n$ is set to 100 and 200, and $\sigma = 2$. For each case, we repeated the simulation 200 times.
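For reproducibility, the design of Example 8 can be generated as in the following sketch (our own code; the seed and helper names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, sigma = 200, 12, 0.5, 2.0
beta_star = np.zeros(p)
beta_star[[0, 1, 4]] = [3.0, 1.5, 2.0]              # beta* = (3, 1.5, 0, 0, 2, 0, ...)
idx = np.arange(p)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))  # cov(x_i, x_j) = rho^{|i-j|}
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X @ beta_star + sigma * rng.standard_normal(n)

def model_error(beta_hat):
    """ME(b) = (b - beta*)' E(xx') (b - beta*); here E(xx') = Sigma."""
    d = beta_hat - beta_star
    return d @ Sigma @ d
```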

For the linear model, the model error of an estimate $\hat\beta$, with fitted values $\hat y = \mathbf{x}^T\hat\beta$, is $\mathrm{ME}(\hat\beta) = (\hat\beta - \beta)^T E(\mathbf{x}\mathbf{x}^T)(\hat\beta - \beta)$. Simulation results are summarized in Table 1, in which MRME stands for the median of the ratios of the ME of a selected model to that of the unpenalized least squares estimate under the full model. The columns "C" and "IC" are measures of model complexity: column "C" shows the average number of nonzero coefficients correctly estimated to be nonzero, and column "IC" presents the average number of zero coefficients incorrectly estimated to be nonzero. In the column labeled "underfit," we present the proportion of replications excluding any nonzero coefficients among the 200 replications. Likewise, we report the probability of selecting the exact subset model and the probability of including all three significant variables plus some noise variables in the columns "correct-fit" and "overfit," respectively.

As can be seen from Table 1, all variable selection procedures dramatically reduce model error. Atan has the smallest model error among all competitors, followed by Alasso, MCP, SCAD, Bayesian, and Dantzig. In terms of sparsity, Atan also has the highest probability of correct-fit, so the Atan penalty performs better than the other penalties. Atan also has some advantages when the dimension $p$ is high, as can be seen in Example 9.

We now test the accuracy of our standard error formula (18). The median absolute deviation of the three estimated nonzero coefficients across the 200 simulations, divided by 0.6745 and denoted by SD in Table 2, can be regarded as the true standard error. The median of the 200 estimated standard errors, denoted by SDm, and the median absolute deviation error of the 200 estimated standard errors divided by 0.6745, denoted by SDmad, gauge the overall performance of the standard error formula (18). Table 2 presents the results for the nonzero coefficients when the sample size is $n = 200$; the results for the other case, with $n = 100$, are similar. Table 2 suggests that the sandwich formula performs surprisingly well.

Example 9. This example is from Wang et al. [23]. More specifically, we take $\beta = (11/4, -23/6, 37/12, -13/9, 1/3, 0, \ldots, 0)^T \in \mathbb{R}^{p}$ and $p = [4n^{1/4}] - 5$, where $[t]$ stands for the largest integer no larger than $t$. For this example, the predictor dimension $p$ is diverging, but the dimension of the true model is fixed to be 5. Results from the simulation study are found in Table 3; a similar conclusion as in Example 8 can be drawn.

Example 10. In this simulation study, we examine the performance of the various PLS methods for $p$ substantially larger than in the previous studies. In particular, we took $p = 339$, $n = 500$, $\sigma^{2} = 5$, and $\beta^{*} = (2\cdot\mathbf{1}_{37}^T, -3\cdot\mathbf{1}_{37}^T, \mathbf{1}_{37}^T, \mathbf{0}_{228}^T)^T$, where $\mathbf{1}_k \in \mathbb{R}^{k}$ is the vector with all entries equal to 1; thus $p_0 = 111$. We simulated 200 independent datasets $\{(y_1, x_1^T), \ldots, (y_n, x_n^T)\}$ in this study, and for each dataset we computed estimates of $\beta^{*}$. Results from this simulation study are found in Table 4.


Table 1: Simulation results for the linear regression models of Example 8. ("C"/"IC": average numbers of correctly/incorrectly selected nonzero coefficients; last three columns: proportions.)

Method     MRME     C        IC       Underfit   Correct-fit   Overfit
n = 100, σ = 2
Lasso      0.6213   3.0000   1.2050   0.0000     0.3700        0.6700
Alasso     0.3074   3.0000   0.3500   0.0000     0.7300        0.2700
SCAD       0.2715   3.0000   1.0650   0.0000     0.4400        0.5600
MCP        0.3041   3.0000   0.5650   0.0000     0.5800        0.4200
Dantzig    0.4623   3.0000   0.6546   0.0000     0.5700        0.4300
Bayesian   0.3548   3.0000   0.5732   0.0000     0.6300        0.3700
Atan       0.2550   3.0000   0.1750   0.0000     0.8450        0.1550
n = 200, σ = 2
Lasso      0.6027   3.0000   1.0700   0.0000     0.3550        0.6450
Alasso     0.2781   3.0000   0.1600   0.0000     0.8650        0.1350
SCAD       0.2900   3.0000   0.8550   0.0000     0.5250        0.4750
MCP        0.2752   3.0000   0.3650   0.0000     0.6850        0.3150
Dantzig    0.3863   3.0000   0.8576   0.0000     0.4920        0.5080
Bayesian   0.2563   3.0000   0.4754   0.0000     0.7150        0.2850
Atan       0.2508   3.0000   0.1000   0.0000     0.9050        0.0950

Table 2: Standard deviations of the estimators for the linear regression model (n = 200); SD, SDm, and SDmad are reported for each nonzero coefficient.

           β̂₁                        β̂₂                        β̂₅
Method     SD       SDm (SDmad)      SD       SDm (SDmad)      SD       SDm (SDmad)
Lasso      0.1753   0.1453 (0.0100)  0.1730   0.1688 (0.0094)  0.1591   0.1301 (0.0079)
Alasso     0.1483   0.1638 (0.0095)  0.1642   0.1636 (0.0098)  0.1475   0.1439 (0.0073)
SCAD       0.1797   0.1634 (0.0105)  0.1819   0.1608 (0.0104)  0.1398   0.1438 (0.0076)
MCP        0.1602   0.1643 (0.0096)  0.1861   0.1656 (0.0097)  0.1464   0.1435 (0.0069)
Dantzig    0.1734   0.1645 (0.0115)  0.1723   0.1665 (0.0094)  0.1581   0.1538 (0.0074)
Bayesian   0.1568   0.1635 (0.0089)  0.1678   0.1649 (0.0092)  0.1367   0.1375 (0.0078)
Atan       0.1510   0.1643 (0.0108)  0.1591   0.1658 (0.0096)  0.1609   0.1434 (0.0083)

Perhaps the most striking aspect of the results presented in Table 4 is that hardly any method ever selected the correct model in this simulation study. However, given that $p$, $p_0$, and $\beta^{*}$ are substantially larger in this study than in the previous simulation studies, this may not be too surprising. Notice that, on average, Atan selects the most parsimonious models of all the methods and has the smallest model error. Atan's nearest competitor in terms of model error is Alasso: this implementation of Alasso has median relative model error 0.2783, but its average selected model size, 103.5250, is larger than Atan's. Since $p_0 = 111$, it is clear that Atan underfits in some instances; in fact, all of the methods in this study underfit to some extent. This may be due to the fact that many of the nonzero entries in $\beta^{*}$ are small relative to the noise level $\sigma^{2} = 5$.

Example 11. As an extension of this method, we consider the problem of simultaneous variable selection and estimation in the partially linear model
$$Y = X'\beta + g(T) + \varepsilon, \quad (21)$$
where $Y$ is a scalar response variate, $X$ is a $p$-vector covariate, and $T$ is a scalar covariate taking values in a compact interval (for simplicity, we assume this interval to be $[0,1]$); $\beta$ is a $p \times 1$ column vector of unknown regression parameters; the function $g(\cdot)$ is unknown; and the model error $\varepsilon$ is independent of $(X, T)$ with mean $0$. Traditionally, $\beta$ has generally been assumed to be finite-dimensional, and several standard approaches, such as the kernel method, the spline method [41], and local linear estimation [17], have been proposed.

In this study, we simulate $n = 100$ (200) points $T_i$, $i = 1, \ldots, 100$ (200), from the uniform distribution on $[0,1]$. For each $i$, the $e_{ij}$'s are simulated to be normally distributed with autocorrelated covariance structure $AR(\rho)$ such that
$$\operatorname{cov}(e_{ij}, e_{il}) = \rho^{|j-l|}, \quad 1 \le j, l \le 10. \quad (22)$$
The $X_{ij}$'s are then formed as follows:
$$\begin{aligned}
X_{i1} &= \sin(2T_i) + e_{i1}, & X_{i2} &= (0.5 + T_i)^{-2} + e_{i2}, \\
X_{i3} &= \exp(T_i) + e_{i3}, & X_{i5} &= (T_i - 0.7)^{4} + e_{i5}, \\
X_{i6} &= T_i(1 + T_i^{2})^{-1} + e_{i6}, & X_{i7} &= \sqrt{1 + T_i} + e_{i7}, \\
X_{i8} &= \log(3T_i + 8) + e_{i8}. & &
\end{aligned} \quad (23)$$


Table 3: Simulation results for the linear regression models of Example 9. (Columns as in Table 1.)

Method     MRME     C        IC       Underfit   Correct-fit   Overfit
n = 200, p = 10
Lasso      0.9955   4.9900   2.0100   0.0100     0.1150        0.8750
Alasso     0.5740   4.8650   0.1100   0.1350     0.8150        0.0500
SCAD       0.5659   4.9800   0.5600   0.0200     0.5600        0.4200
MCP        0.6177   4.9100   0.1200   0.0900     0.8250        0.0850
Dantzig    0.6987   4.8900   0.6700   0.1100     0.4360        0.4540
Bayesian   0.5656   4.8650   0.2500   0.1350     0.6340        0.2310
Atan       0.5447   4.8900   0.1150   0.1100     0.8250        0.0650
n = 400, p = 12
Lasso      1.2197   5.0000   2.0650   0.0000     0.1400        0.8600
Alasso     0.4458   4.9950   0.0900   0.0050     0.9250        0.0700
SCAD       0.4481   5.0000   0.4850   0.0000     0.6350        0.3650
MCP        0.4828   5.0000   0.1150   0.0000     0.8950        0.1050
Dantzig    0.7879   5.0000   0.5670   0.0000     0.3200        0.6800
Bayesian   0.4237   5.0000   0.1800   0.0000     0.7550        0.2450
Atan       0.4125   4.9950   0.0250   0.0050     0.9700        0.0250
n = 800, p = 16
Lasso      1.2004   5.0000   2.5700   0.0000     0.0900        0.9100
Alasso     0.3156   5.0000   0.0700   0.0000     0.9300        0.0700
SCAD       0.3219   5.0000   0.6550   0.0000     0.5950        0.4050
MCP        0.3220   5.0000   0.0750   0.0000     0.9300        0.0700
Dantzig    0.5791   5.0000   0.5470   0.0000     0.3400        0.6600
Bayesian   0.3275   5.0000   0.2800   0.0000     0.6750        0.3250
Atan       0.3239   5.0000   0.0750   0.0000     0.9350        0.0650

Table 4: Simulation results for the linear regression models of Example 10. (Columns as in Table 1.)

Method     MRME     C          IC        Underfit   Correct-fit   Overfit
n = 500, σ² = 5
Lasso      0.3492   110.2000   24.7450   0.5650     0.0000        0.4350
Alasso     0.2783   103.5250    6.1250   1.0000     0.0000        0.0000
SCAD       0.3012   106.0400   35.7150   0.9900     0.0000        0.0150
MCP        0.2791   103.3600    8.1100   1.0000     0.0000        0.0900
Dantzig    0.3120   108.3400   18.4650   0.7570     0.0000        0.2430
Bayesian   0.2876   104.4350    7.4700   0.9800     0.0000        0.0200
Atan       0.2794   101.4000    4.0600   1.0000     0.0000        0.0000


We investigate the scenario $p = 10$ with $X_{ij} = e_{ij}$ for $j = 4, 9, 10$. In this scenario, we have $\beta_j = j$ for $1 \le j \le 4$ and $\beta_j = 0$ otherwise, and $\varepsilon \sim N(0,1)$. For each case, we repeated the simulation 200 times. We take $g(t) = \cos(2\pi t)$. For comparison, we apply the different penalties to the parametric component, with a B-spline approximation for the nonparametric component.
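One simple way to implement this combination is to expand $g$ in a spline basis, project it out, and run the Atan solver on the projected data. The sketch below (our own illustration) uses a truncated-power basis as a stand-in for the paper's B-splines and a generic design in place of (23); every helper name is hypothetical.

```python
import numpy as np

def spline_basis(t, n_knots=5, degree=3):
    """Truncated-power spline basis on [0, 1] (a stand-in for B-splines)."""
    knots = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    powers = np.vstack([t**d for d in range(degree + 1)]).T
    truncated = np.maximum(t[:, None] - knots[None, :], 0.0) ** degree
    return np.hstack([powers, truncated])

rng = np.random.default_rng(1)
n, p = 200, 10
T = rng.uniform(0.0, 1.0, n)
X = rng.standard_normal((n, p))          # placeholder for the covariates in (23)
beta = np.array([1.0, 2.0, 3.0, 4.0] + [0.0] * (p - 4))
y = X @ beta + np.cos(2 * np.pi * T) + rng.standard_normal(n)

B = spline_basis(T)
H = B @ np.linalg.pinv(B)                # projection onto the spline space
y_til, X_til = y - H @ y, X - H @ X      # profile out the nonparametric part
# Run the Atan IRL sketch of Section 4 on (X_til, y_til); g is then
# recovered by regressing y - X @ beta_hat on the basis B.
```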

The results are summarized in Table 5. Columns 2-5 of Table 5 are the averages of the estimates of $\beta_j$, $j = 1, \ldots, 4$, respectively. Column 6 reports the average number of the estimates of $\beta_j$, $5 \le j \le p$, which are 0, over the 200 simulations, and their medians are given in column 7. The model errors, computed as $\mathrm{ME}(\hat\beta) = (\hat\beta - \beta)^T E(\mathbf{x}\mathbf{x}^T)(\hat\beta - \beta)$, have their medians listed in the 8th column, followed by the standard deviations of the model errors in parentheses.


Table 5: Example 11, comparison of estimators. ("C" columns: average and median counts of zero estimates among β̂₅, …, β̂ₚ.)

Estimator  β̂₁       β̂₂       β̂₃       β̂₄       C (mean)   C (median)   MME (SD)          RASE(ĝ(T))
n = 100, p = 10
SCAD       0.9617   2.0128   2.8839   4.0611   4.7450     5.0000       0.0695 (0.0625)   0.6269
Lasso      0.8278   1.9517   2.7984   3.9274   5.5200     6.0000       0.1727 (0.0790)   0.8079
Dantzig    0.8767   1.9544   2.7894   3.9302   5.4590     6.0000       0.1696 (0.0680)   0.7278
Atan       0.9710   2.0121   2.9865   4.0472   4.9820     5.0000       0.0658 (0.0630)   0.6214
n = 200, p = 10
SCAD       0.9836   1.9989   2.9283   4.0278   5.2000     5.0000       0.0230 (0.0210)   0.3604
Lasso      0.9219   1.9529   2.9124   3.9580   5.1450     5.0000       0.0500 (0.0315)   0.4220
Dantzig    0.9534   1.9675   2.9096   3.9345   5.1460     5.0000       0.0510 (0.0290)   0.4154
Atan       0.9904   1.9946   2.9548   4.0189   5.1250     5.0000       0.0190 (0.0245)   0.3460


The performance of $\hat g(T)$ is assessed by the square root of the average squared errors (RASE):
$$\mathrm{RASE}^{2}(\hat g) = \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat g(T_i) - g(T_i)\bigr)^{2}, \quad (24)$$
where $T_i$, $i = 1, \ldots, n$, are the observed data points at which the function $g$ is estimated. Column 9 summarizes the RASE values for the different situations.

We can see the following from Table 5. (1) Compared with the Lasso and Dantzig estimators, the SCAD and Atan estimators have smaller model errors, which is due to the fact that the SCAD and Atan estimators are unbiased, while the Lasso and Dantzig estimators are biased, especially for the larger coefficients; moreover, the Atan and SCAD estimators are more stable. (2) Each method is able to select the important variables, but it is obvious that the Atan estimator achieves slightly stronger sparsity. (3) For the nonparametric component, the Atan and SCAD estimators have smaller RASE values.

5.2. Real Data Analysis. In this section, we apply the Atan regularization scheme to a prostate cancer example. The dataset in this example is derived from a study of prostate cancer in [42] and consists of the medical records of 97 patients who were about to receive a radical prostatectomy. The predictors are eight clinical measures: log(cancer volume) (lcavol), log(prostate weight) (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason), and percentage Gleason score 4 or 5 (pgg45). The response is the logarithm of prostate-specific antigen (lpsa). One of the main aims here is to identify which predictors are more important in predicting the response.

The Lasso, Alasso, SCAD, MCP, and Atan are all applied to the data, and we also compute the OLS estimate for the prostate cancer data. Results are summarized in Table 6. The OLS estimator does not perform variable selection. Lasso selects five variables in the final model; SCAD selects lcavol, lweight, lbph, and svi; and Alasso, MCP, and Atan select lcavol, lweight, and svi. Thus, Atan selects a substantially simpler model than Lasso and SCAD. Furthermore, as indicated by the columns labeled $R^{2}$ ($R^{2}$ is equal to one minus the residual sum of squares divided by the total sum of squares) and $R^{2}/R^{2}_{\mathrm{OLS}}$ in Table 6, the Atan estimator describes more variability in the data than Alasso and MCP, and nearly as much as the OLS estimator.

Table 6: Prostate cancer data, comparing the different methods.

Method   R²       R²/R²_OLS   Variables selected
OLS      0.6615   1.0000      All
Lasso    0.5867   0.8870      (1, 2, 4, 5, 8)
Alasso   0.5991   0.9058      (1, 2, 5)
SCAD     0.6140   0.9283      (1, 2, 4, 5)
MCP      0.5999   0.9069      (1, 2, 5)
Atan     0.6057   0.9158      (1, 2, 5)


6. Conclusion and Discussion

In this paper, a new Atan penalty which very closely resembles the $\ell_0$ penalty is proposed. First, we establish the theory of the Atan estimator under mild conditions; the theory indicates that the Atan-penalized least squares procedure enjoys oracle properties even when the number of variables grows slower than the number of observations. Second, the iteratively reweighted Lasso algorithm makes the proposed estimator fast to implement. Third, we suggest a BIC-like tuning parameter selector that identifies the true model consistently. Numerical studies further endorse our theoretical results and the advantage of the Atan estimator for model selection.

We do not address the situation where $p \gg n$ in this paper; in fact, the proposed Atan method can be easily extended for variable selection in the situation of $p \gg n$. Also, as shown in Example 11, the Atan method can be applied to semiparametric and nonparametric models [43, 44]. Furthermore, a recent field of application of variable selection is the search for impact points in functional data analysis [45, 46]; the possible combination of the Atan estimator with the questions in [45, 46] would be interesting. These problems are beyond the scope of this paper and will be interesting topics for future research.

Appendix

Proof of Theorem 2. Let $\alpha_n = \sqrt{p\sigma^{2}/n}$ and fix $r \in (0,1)$. To prove the theorem, it suffices to show that, if $C > 0$ is large enough, then
$$Q_n(\beta^{*}) < \inf_{\|\mu\|=C} Q_n(\beta^{*} + \alpha_n\mu) \quad (A1)$$
holds for all $n$ sufficiently large, with probability at least $1 - r$. Define $D_n(\mu) = Q_n(\beta^{*} + \alpha_n\mu) - Q_n(\beta^{*})$ and note that
$$D_n(\mu) = \frac{1}{2n}\bigl(\alpha_n^{2}\|\mathbf{X}\mu\|^{2} - 2\alpha_n\varepsilon^T\mathbf{X}\mu\bigr) + \sum_{j=1}^{p}\bigl[p_{\lambda,\gamma}(|\beta^{*}_j + \alpha_n\mu_j|) - p_{\lambda,\gamma}(|\beta^{*}_j|)\bigr]. \quad (A2)$$
The fact that $p_{\lambda,\gamma}$ is concave on $[0,\infty)$ implies that
$$\begin{aligned}
p_{\lambda,\gamma}(|\beta^{*}_j + \alpha_n\mu_j|) - p_{\lambda,\gamma}(|\beta^{*}_j|) &\ge p'_{\lambda,\gamma}(|\beta^{*}_j + \alpha_n\mu_j|)\bigl(|\beta^{*}_j + \alpha_n\mu_j| - |\beta^{*}_j|\bigr) \\
&\ge p'_{\lambda,\gamma}(|\beta^{*}_j + \alpha_n\mu_j|)\bigl(-\alpha_n|\mu_j|\bigr) = -\lambda\alpha_n|\mu_j|\,\frac{\gamma(\gamma+2/\pi)}{\gamma^{2} + (\beta^{*}_j + \alpha_n\mu_j)^{2}}
\end{aligned} \quad (A3)$$
when $n$ is sufficiently large. Condition (B) implies that
$$\frac{\gamma(\gamma+2/\pi)}{\gamma^{2} + (\beta^{*}_j + \alpha_n\mu_j)^{2}} \le \frac{\gamma(\gamma+2/\pi)}{\gamma^{2} + \rho^{2}}. \quad (A4)$$
Thus, for $n$ big enough,
$$D_n(\mu) \ge \frac{1}{2n}\bigl(\alpha_n^{2}\|\mathbf{X}\mu\|^{2} - 2\alpha_n\varepsilon^T\mathbf{X}\mu\bigr) - Cp\lambda\alpha_n\,\frac{\gamma(\gamma+2/\pi)}{\gamma^{2}+\rho^{2}}. \quad (A5)$$
By (D),
$$\frac{1}{2n}\alpha_n^{2}\|\mathbf{X}\mu\|^{2} \ge \frac{\lambda_{\min}((1/n)\mathbf{X}^T\mathbf{X})}{2}\,C^{2}\alpha_n^{2}. \quad (A6)$$
On the other hand, (D) implies
$$\frac{1}{n}\alpha_n|\varepsilon^T\mathbf{X}\mu| \le \frac{C\alpha_n}{\sqrt{n}}\Bigl\|\frac{1}{\sqrt{n}}\mathbf{X}^T\varepsilon\Bigr\| = O_P(C\alpha_n^{2}). \quad (A7)$$
Furthermore, (C) and (B) imply
$$Cp\lambda\alpha_n\,\frac{\gamma(\gamma+2/\pi)}{\gamma^{2}+\rho^{2}} = o(C\alpha_n^{2}). \quad (A8)$$
From (A5)-(A8), we conclude that, if $C > 0$ is large enough, then $\inf_{\|\mu\|=C} D_n(\mu) > 0$ holds for all $n$ sufficiently large, with probability at least $1 - r$. This proves Theorem 2.

Proof of Lemma 3 Suppose that 120573 isin 119877119901 and that 120573 minus 120573

lowast le

119862radic1199011205902119899 Define isin 119877119901 by

119860119888 = 0 and

119860= 120573119860 Similar to

the proof of Theorem 2 let

119863119899(120573 ) = 119876

119899(120573) minus 119876

119899() (A9)

where 119876119899(120573) is defined in (8) Then

119863119899(120573 ) =

1

2119899

1003817100381710038171003817y minus X12057310038171003817100381710038172minus

1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899

10038171003817100381710038171003817y minus X minus X (120573 minus )

10038171003817100381710038171003817

2

minus1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899(120573 minus )

119879

X119879X (120573 minus )

minus1

2119899(120573 minus )

119879

X119879 (y minus X)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

= 119874119901(10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817radic1199011205902

119899)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

(A10)

On the other hand since the Atan penalty is concave on[0infin)

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816) ge 120582 (120574 +

2

120587) arctan( 119862

120574radic1198991199011205902)

10038161003816100381610038161003816120573119895

10038161003816100381610038161003816 (A11)

for $j \in A^c$. Thus,

$$\sum_{j \in A^c} p_{\lambda\gamma}\left(\left|\beta_j\right|\right) \ge \lambda \left(\gamma + \frac{2}{\pi}\right) \arctan\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right) \left\|\beta - \tilde{\beta}\right\|. \tag{A12}$$
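To make the concavity step behind (A11) fully explicit (a routine verification, recorded here for completeness): write $c = C\sqrt{p\sigma^2/n}$, so that $|\beta_j| \le \|\beta - \beta^*\| \le c$ for $j \in A^c$, and note that $c \le 1$ for all large $n$ by condition (A). Since $p_{\lambda\gamma}$ is concave with $p_{\lambda\gamma}(0) = 0$, the secant slope $t \mapsto p_{\lambda\gamma}(t)/t$ is nonincreasing, whence

$$p_{\lambda\gamma}(t) \ge \frac{p_{\lambda\gamma}(c)}{c}\, t = \lambda\left(\gamma + \frac{2}{\pi}\right) \frac{\arctan(c/\gamma)}{c}\, t \ge \lambda\left(\gamma + \frac{2}{\pi}\right) \arctan\left(\frac{c}{\gamma}\right) t, \qquad 0 \le t \le c \le 1,$$

and $\arctan(c/\gamma)$ is exactly the factor $\arctan(C/(\gamma\sqrt{n/(p\sigma^2)}))$ appearing in (A11) and (A12).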

By (C),

$$\liminf_{n \to \infty} \arctan\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right) > 0. \tag{A13}$$

It follows from (A10)–(A12) that there is a constant $K > 0$ such that

$$\frac{D_n(\beta, \tilde{\beta})}{\left\|\beta - \tilde{\beta}\right\|} \ge K\lambda + O_P\left(\sqrt{\frac{p\sigma^2}{n}}\right). \tag{A14}$$

Since $\lambda\sqrt{n/(p\sigma^2)} \to \infty$, the result follows.


Proof of Theorem 4. Taken together, Theorem 2 and Lemma 3 imply that there exists a sequence of local minima $\hat{\beta}$ of (8) such that $\|\hat{\beta} - \beta^*\| = O_P(\sqrt{p\sigma^2/n})$ and $\hat{\beta}_{A^c} = 0$. Part (i) of the theorem follows immediately.

To prove part (ii), observe that on the event $\{j; \hat{\beta}_j \neq 0\} = A$ we must have

$$\hat{\beta}_A = \beta_A^* + \left(\mathbf{X}_A^T \mathbf{X}_A\right)^{-1} \mathbf{X}_A^T \varepsilon - \left(n^{-1} \mathbf{X}_A^T \mathbf{X}_A\right)^{-1} \hat{p}_A', \tag{A15}$$

where $\hat{p}_A' = \left(p_{\lambda\gamma}'(\hat{\beta}_j)\right)_{j \in A}$.

It follows that

$$\begin{aligned} \sqrt{n}\, B_n \left(\frac{n^{-1} \mathbf{X}_A^T \mathbf{X}_A}{\sigma^2}\right)^{1/2} \left(\hat{\beta}_A - \beta_A^*\right) &= B_n \left(\sigma^2 \mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} \mathbf{X}_A^T \varepsilon \\ &\quad - n B_n \left(\sigma^2 \mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} \hat{p}_A' \end{aligned} \tag{A16}$$

whenever $\{j; \hat{\beta}_j \neq 0\} = A$. Now note that conditions (B)–(D) imply

$$\left\| n B_n \left(\sigma^2 \mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} \hat{p}_A' \right\| = O_P\left(\sqrt{\frac{np}{\sigma^2}}\, \frac{\lambda\gamma\left(\gamma + 2/\pi\right)}{\gamma^2 + \rho^2}\right) = o_P(1), \tag{A17}$$

and thus

$$\sqrt{n}\, B_n \left(\frac{n^{-1} \mathbf{X}_A^T \mathbf{X}_A}{\sigma^2}\right)^{1/2} \left(\hat{\beta}_A - \beta_A^*\right) = B_n \left(\sigma^2 \mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} \mathbf{X}_A^T \varepsilon + o_P(1). \tag{A18}$$

To complete the proof of (ii), we use the Lindeberg-Feller central limit theorem to show that

$$B_n \left(\sigma^2 \mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} \mathbf{X}_A^T \varepsilon \longrightarrow N(0, G) \tag{A19}$$

in distribution. Observe that

$$B_n \left(\sigma^2 \mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} \mathbf{X}_A^T \varepsilon = \sum_{i=1}^n \omega_{in}, \tag{A20}$$

where $\omega_{in} = B_n \left(\sigma^2 \mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} x_{iA} \varepsilon_i$.

Fix $\delta_0 > 0$ and let $\eta_{in} = x_{iA}^T \left(\mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} B_n^T B_n \left(\mathbf{X}_A^T \mathbf{X}_A\right)^{-1/2} x_{iA}$. Then

$$\begin{aligned} E\left[\left\|\omega_{in}\right\|^2; \left\|\omega_{in}\right\|^2 > \delta_0\right] &= \eta_{in} E\left[\frac{\varepsilon_i^2}{\sigma^2}; \eta_{in} \frac{\varepsilon_i^2}{\sigma^2} > \delta_0\right] \\ &\le \eta_{in} E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)} P\left(\eta_{in} \frac{\varepsilon_i^2}{\sigma^2} > \delta_0\right)^{\delta/(2+\delta)} \\ &\le \eta_{in}^{1+\delta/(2+\delta)} \delta_0^{-1} E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)}. \end{aligned} \tag{A21}$$

Since $\sum_{i=1}^n \eta_{in} = \operatorname{tr}(B_n^T B_n) \to \operatorname{tr}(G) < \infty$ and since (E) implies

$$\max_{1 \le i \le n} \eta_{in} \le \frac{\lambda_{\max}\left(B_n^T B_n\right)}{\lambda_{\min}\left(n^{-1} \mathbf{X}^T \mathbf{X}\right)} \max_{1 \le i \le n} \frac{1}{n} \sum_{j=1}^p x_{ij}^2 \longrightarrow 0, \tag{A22}$$

we must have

$$\begin{aligned} \sum_{i=1}^n E\left[\left\|\omega_{in}\right\|^2; \left\|\omega_{in}\right\|^2 > \delta_0\right] &\le \delta_0^{-1} E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)} \sum_{i=1}^n \eta_{in}^{1+\delta/(2+\delta)} \\ &\le \delta_0^{-1} E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)} \operatorname{tr}\left(B_n^T B_n\right) \max_{1 \le i \le n} \eta_{in}^{\delta/(2+\delta)} \longrightarrow 0. \end{aligned} \tag{A23}$$

Thus, the Lindeberg condition is satisfied and (A19) holds.
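As a sanity check on (A19), the normalized score $B_n(\sigma^2 \mathbf{X}_A^T \mathbf{X}_A)^{-1/2} \mathbf{X}_A^T \varepsilon$ can be simulated directly: for Gaussian errors its covariance is exactly $B_n B_n^T = G$ by construction, and the Lindeberg argument above extends this to general errors satisfying (F). A minimal Python sketch follows; the design, the choice of $B_n$, and the error law are arbitrary illustrations, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 500, 1.0
XA = rng.standard_normal((n, 3))       # design columns of the active set A
Bn = np.array([[1.0, 0.0, 0.0]])       # q = 1 row, so G = Bn @ Bn.T = [[1.0]]

# (sigma^2 * XA' XA)^{-1/2} via a symmetric eigendecomposition
vals, vecs = np.linalg.eigh(sigma**2 * (XA.T @ XA))
M_inv_sqrt = vecs @ np.diag(vals**-0.5) @ vecs.T

# 2000 Monte Carlo draws of B_n (sigma^2 X_A' X_A)^{-1/2} X_A' eps
z = np.array([(Bn @ M_inv_sqrt @ XA.T @ (sigma * rng.standard_normal(n)))[0]
              for _ in range(2000)])
print(z.mean(), z.var())               # approximately 0 and G = 1
```

The variance calculation here does not use normality of the errors; condition (F) and the Lindeberg step are what upgrade the exact mean and variance to asymptotic normality of the sum.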

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Tian Yuan Special Funds of the National Natural Science Foundation of China (Grant no. 11426192) and the Project of Education of Zhejiang Province (no. Y201533324).

References

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[2] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–148, 1993.

[3] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.

[4] C.-H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," The Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.

[5] J. Lv and Y. Fan, "A unified approach to model selection and sparse recovery using regularized least squares," The Annals of Statistics, vol. 37, no. 6, pp. 3498–3528, 2009.

[6] L. Dicker, B. Huang, and X. Lin, "Variable selection and estimation with the seamless-$L_0$ penalty," Statistica Sinica, vol. 23, no. 2, pp. 929–962, 2013.

[7] E. Candes and T. Tao, "The Dantzig selector: statistical estimation when p is much larger than n," The Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007.

[8] J. Ghosh and A. E. Ghattas, "Bayesian variable selection under collinearity," The American Statistician, vol. 69, no. 3, pp. 165–173, 2015.

[9] J. Fan, L. Z. Xue, and H. Zou, "Strong oracle optimality of folded concave penalized estimation," The Annals of Statistics, vol. 42, no. 3, pp. 819–849, 2014.

[10] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267–281, Akademiai Kiado, Budapest, Hungary, 1973.

[11] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[12] D. P. Foster and E. I. George, "The risk inflation criterion for multiple regression," The Annals of Statistics, vol. 22, no. 4, pp. 1947–1975, 1994.

[13] L. Breiman, "Heuristics of instability and stabilization in model selection," The Annals of Statistics, vol. 24, no. 6, pp. 2350–2383, 1996.

[14] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.

[15] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006.

[16] P. Zhao and B. Yu, "On model selection consistency of lasso," Journal of Machine Learning Research, vol. 7, pp. 2541–2563, 2006.

[17] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, pp. 928–961, 2004.

[18] Y. Kim, H. Choi, and H.-S. Oh, "Smoothly clipped absolute deviation on high dimensions," Journal of the American Statistical Association, vol. 103, no. 484, pp. 1665–1673, 2008.

[19] Y. Kim and S. Kwon, "Global optimality of nonconvex penalized estimators," Biometrika, vol. 99, no. 2, pp. 315–325, 2012.

[20] C.-H. Zhang and T. Zhang, "A general theory of concave regularization for high-dimensional sparse estimation problems," Statistical Science, vol. 27, no. 4, pp. 576–593, 2012.

[21] L. Wang, Y. Kim, and R. Li, "Calibrating nonconvex penalized regression in ultra-high dimension," The Annals of Statistics, vol. 41, no. 5, pp. 2505–2536, 2013.

[22] H. Wang, R. Li, and C.-L. Tsai, "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, vol. 94, no. 3, pp. 553–568, 2007.

[23] H. Wang, B. Li, and C. Leng, "Shrinkage tuning parameter selection with a diverging number of parameters," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 71, no. 3, pp. 671–683, 2009.

[24] J. Chen and Z. Chen, "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, vol. 95, no. 3, pp. 759–771, 2008.

[25] Y. Kim, S. Kwon, and H. Choi, "Consistent model selection criteria on high dimensions," Journal of Machine Learning Research, vol. 13, pp. 1037–1057, 2012.

[26] L. Breiman, "Better subset regression using the non-negative garrote," Technometrics, vol. 37, no. 4, pp. 373–384, 1995.

[27] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 67, no. 2, pp. 301–320, 2005.

[28] H. Zou and H. H. Zhang, "On the adaptive elastic-net with a diverging number of parameters," The Annals of Statistics, vol. 37, no. 4, pp. 1733–1751, 2009.

[29] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," The Annals of Statistics, vol. 36, no. 4, pp. 1509–1533, 2008.

[30] D. R. Hunter and R. Li, "Variable selection using MM algorithms," The Annals of Statistics, vol. 33, no. 4, pp. 1617–1642, 2005.

[31] P. Breheny and J. Huang, "Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection," The Annals of Applied Statistics, vol. 5, no. 1, pp. 232–253, 2011.

[32] J. Fan and J. Lv, "Nonconcave penalized likelihood with NP-dimensionality," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5467–5484, 2011.

[33] T. Zhang, "Multi-stage convex relaxation for feature selection," Bernoulli, vol. 19, no. 5, pp. 2277–2293, 2013.

[34] J. H. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.

[35] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.

[36] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422, pp. 486–494, 1993.

[37] H. Wang and C. Leng, "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, vol. 102, no. 479, pp. 1039–1048, 2007.

[38] H. Zou and T. Hastie, "On the 'degrees of freedom' of lasso," The Annals of Statistics, vol. 35, pp. 2173–2192, 2007.

[39] E. R. Lee, H. Noh, and B. U. Park, "Model selection via Bayesian information criterion for quantile regression models," Journal of the American Statistical Association, vol. 109, no. 505, pp. 216–229, 2014.

[40] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010.

[41] N. E. Heckman, "Spline smoothing in a partly linear model," Journal of the Royal Statistical Society, Series B: Methodological, vol. 48, no. 2, pp. 244–248, 1986.

[42] T. A. Stamey, J. N. Kabalin, J. E. McNeal et al., "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients," The Journal of Urology, vol. 141, no. 5, pp. 1076–1083, 1989.

[43] J. Huang, J. L. Horowitz, and F. Wei, "Variable selection in nonparametric additive models," The Annals of Statistics, vol. 38, no. 4, pp. 2282–2313, 2010.

[44] P. Radchenko, "High dimensional single index models," Journal of Multivariate Analysis, vol. 139, pp. 266–282, 2015.

[45] G. Aneiros and P. Vieu, "Variable selection in infinite-dimensional problems," Statistics & Probability Letters, vol. 94, pp. 12–20, 2014.

[46] G. Aneiros and P. Vieu, "Partial linear modelling with multi-functional covariates," Computational Statistics, vol. 30, no. 3, pp. 647–671, 2015.



Appendix

Proof of Theorem 2 Let 120572119899= radic1199011205902119899 and fix 119903 isin (0 1) To

prove the theorem it suffices to show that if 119862 gt 0 is largeenough then

119876119899(120573lowast) lt inf120583=119862

119876119899(120573lowast+ 120572119899120583) (A1)

holds for all 119899 sufficiently large with probability at least 1 minus 119903Define119863

119899(120583) = 119876

119899(120573lowast+ 120572119899120583) minus 119876

119899(120573lowast) and note that

119863119899(120583) =

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

+

119901

sum

119895=1

119901120582119886

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

(A2)

The fact that 119901120582120574

is concave on [0infin) implies that

119901120582120574

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (

10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816minus10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (minus120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816)

= minus120582120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2

(A3)

when 119899 is sufficiently largeCondition (B) implies that

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2le120574 (120574 + 2120587)

1205742 + 1205882 (A4)

Thus for 119899 big enough

119863119899(120583) ge

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

minus119862119901120582120572

119899120574 (120574 + 2120587)

1205742 + 1205882

(A5)

By (D)1

21198991205722

119899

1003817100381710038171003817X12058310038171003817100381710038172ge120582min2

11986221205722

119899 (A6)

On the other hand (D) implies1

119899120572119899

10038161003816100381610038161003816120576119879X12058310038161003816100381610038161003816 le

119862120572119899

radic119899

10038171003817100381710038171003817100381710038171003817

1

radic119899X119879120576

10038171003817100381710038171003817100381710038171003817= 119874119875(1198621205722

119899) (A7)

Furthermore (C) and (B) imply

119862119901120582120572119899120574 (120574 + 2120587)

1205742 + 1205882= 119900 (119862120572

2

119899) (A8)

From (A5)ndash(A8) we conclude that if 119862 gt 0 is large enoughthen inf

120583=119862119863119899(120583) gt 0 holds for all 119899 sufficiently large with

probability at least 1 minus 119903 This proves Theorem 2

Proof of Lemma 3 Suppose that 120573 isin 119877119901 and that 120573 minus 120573

lowast le

119862radic1199011205902119899 Define isin 119877119901 by

119860119888 = 0 and

119860= 120573119860 Similar to

the proof of Theorem 2 let

119863119899(120573 ) = 119876

119899(120573) minus 119876

119899() (A9)

where 119876119899(120573) is defined in (8) Then

119863119899(120573 ) =

1

2119899

1003817100381710038171003817y minus X12057310038171003817100381710038172minus

1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899

10038171003817100381710038171003817y minus X minus X (120573 minus )

10038171003817100381710038171003817

2

minus1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899(120573 minus )

119879

X119879X (120573 minus )

minus1

2119899(120573 minus )

119879

X119879 (y minus X)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

= 119874119901(10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817radic1199011205902

119899)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

(A10)

On the other hand since the Atan penalty is concave on[0infin)

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816) ge 120582 (120574 +

2

120587) arctan( 119862

120574radic1198991199011205902)

10038161003816100381610038161003816120573119895

10038161003816100381610038161003816 (A11)

for 119895 isin 119860119888 Thus

sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

ge 120582 (120574 +2

120587) arctan( 119862

120574radic1198991199011205902)

10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817

(A12)

By (C)

lim inf119899rarrinfin

arctan( 119862

120574radic1198991199011205902) gt 0 (A13)

It follows from (A10)ndash(A12) that there is constant119870 gt 0 suchthat

119863119899(120573 )

10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817

ge 119870120582 + 119874119875(radic

1199011205902

119899) (A14)

Since 120582radic1199011205902119899 rarr infin the result follows

Journal of Probability and Statistics 11

Proof ofTheorem 4 Taken togetherTheorem 2 and Lemma 3imply that there exists a sequence of local minima of (8)such that minus 120573lowast = 119874

119875(radic1199011205902119899) and

119860119888 = 0 Part (i) of the

theorem follows immediatelyTo prove part (ii) observe that on the event 119895

119895= 0 =

119860 we must have

119860= 120573lowast

119860+ (X119879119860X119860)minus1

X119879119860120576 minus (119899

minus1X119879119860X119860)minus1

1199011015840

119860 (A15)

where 1199011015840119860= (1199011015840

120582120574(119895))119895isin119860

It follows that

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576

minus 119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

(A16)

whenever 119895 119895

= 0 = 119860 Now note that conditions (B)ndash(D)imply

1003817100381710038171003817100381710038171003817119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

1003817100381710038171003817100381710038171003817

= 119874119875(radic

119899119901

1205902

120582120574 (120574 + 2120587)

1205742 + 1205882) = 119900

119875(1)

(A17)

and thus

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576 + 119900119875 (1)

(A18)

To complete the proof of (ii), we use the Lindeberg-Feller central limit theorem to show that

$$B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}\mathbf{X}_A^T\varepsilon \longrightarrow N(0, G) \tag{A.19}$$

in distribution. Observe that

$$B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}\mathbf{X}_A^T\varepsilon = \sum_{i=1}^{n}\omega_{in}, \tag{A.20}$$

where $\omega_{in} = B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}x_{iA}\varepsilon_i$. Fix $\delta_0 > 0$ and let $\eta_{in} = x_{iA}^T(\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}B_n^TB_n(\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}x_{iA}$, so that $\|\omega_{in}\|^2 = \eta_{in}\varepsilon_i^2/\sigma^2$. Then

$$\begin{aligned}
E\left[\|\omega_{in}\|^2;\ \|\omega_{in}\|^2 > \delta_0\right] &= \eta_{in}E\left[\frac{\varepsilon_i^2}{\sigma^2};\ \frac{\eta_{in}\varepsilon_i^2}{\sigma^2} > \delta_0\right] \\
&\leq \eta_{in}\,E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)} P\left(\frac{\eta_{in}\varepsilon_i^2}{\sigma^2} > \delta_0\right)^{\delta/(2+\delta)} \\
&\leq \eta_{in}^{1+\delta/(2+\delta)}\,\delta_0^{-1}\,E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)},
\end{aligned} \tag{A.21}$$

where the first inequality is Hölder's and the second follows from Markov's inequality (taking $\delta_0 \leq 1$ without loss of generality). Since $\sum_{i=1}^{n}\eta_{in} = \operatorname{tr}(B_n^TB_n) \to \operatorname{tr}(G) < \infty$ and since (E) implies

$$\max_{1\leq i\leq n}\eta_{in} \leq \frac{\lambda_{\max}(B_n^TB_n)}{\lambda_{\min}(n^{-1}\mathbf{X}^T\mathbf{X})}\max_{1\leq i\leq n}\frac{1}{n}\sum_{j=1}^{p}x_{ij}^2 \longrightarrow 0, \tag{A.22}$$

we must have

$$\begin{aligned}
\sum_{i=1}^{n}E\left[\|\omega_{in}\|^2;\ \|\omega_{in}\|^2 > \delta_0\right] &\leq \delta_0^{-1}E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)}\sum_{i=1}^{n}\eta_{in}^{1+\delta/(2+\delta)} \\
&\leq \delta_0^{-1}E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)}\operatorname{tr}(B_n^TB_n)\max_{1\leq i\leq n}\eta_{in}^{\delta/(2+\delta)} \longrightarrow 0.
\end{aligned} \tag{A.23}$$

Thus the Lindeberg condition is satisfied and (A.19) holds.
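Although (A.19) is verified analytically above, the limit is easy to probe numerically. The following minimal Monte Carlo sketch is our illustration, not from the paper: it assumes a fixed simulated design $\mathbf{X}_A$, Rademacher errors scaled by $\sigma$ (mean 0, variance $\sigma^2$, deliberately non-Gaussian so the central limit theorem is doing real work), and a row vector $B_n$ with $B_nB_n^T = G = 1$. It repeatedly forms the left-hand side of (A.19), i.e., the sum $\sum_i \omega_{in}$ of (A.20), and checks that its mean and variance match $N(0, G)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p0, sigma = 200, 3, 2.0           # sample size, |A|, noise scale (illustrative)
X_A = rng.normal(size=(n, p0))       # design restricted to the true model A
B = np.array([[0.6, 0.8, 0.0]])      # B_n with B_n B_n^T = G = 1

# inverse square root of sigma^2 * X_A^T X_A via eigendecomposition
w, V = np.linalg.eigh(sigma**2 * X_A.T @ X_A)
M_inv_sqrt = V @ np.diag(w**-0.5) @ V.T

# 5000 draws of B_n (sigma^2 X_A^T X_A)^{-1/2} X_A^T eps with Rademacher errors
z = np.array([
    (B @ M_inv_sqrt @ X_A.T @ (sigma * rng.choice([-1.0, 1.0], size=n))).item()
    for _ in range(5000)
])
print(z.mean(), z.var())             # should be close to 0 and to G = 1
```

Each draw of the loop realizes the sum in (A.20) for one error vector, and the printed variance near 1 reflects $B_nB_n^T = G = 1$.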

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Tian Yuan Special Funds of the National Natural Science Foundation of China (Grant no. 11426192) and the Project of Education of Zhejiang Province (no. Y201533324).

References

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[2] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–148, 1993.

[3] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.

[4] C.-H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," The Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.

[5] J. Lv and Y. Fan, "A unified approach to model selection and sparse recovery using regularized least squares," The Annals of Statistics, vol. 37, no. 6, pp. 3498–3528, 2009.

[6] L. Dicker, B. Huang, and X. Lin, "Variable selection and estimation with the seamless-L0 penalty," Statistica Sinica, vol. 23, no. 2, pp. 929–962, 2013.

[7] E. Candes and T. Tao, "The Dantzig selector: statistical estimation when p is much larger than n," The Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007.

[8] J. Ghosh and A. E. Ghattas, "Bayesian variable selection under collinearity," The American Statistician, vol. 69, no. 3, pp. 165–173, 2015.

[9] J. Fan, L. Z. Xue, and H. Zou, "Strong oracle optimality of folded concave penalized estimation," The Annals of Statistics, vol. 42, no. 3, pp. 819–849, 2014.

[10] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267–281, Akademiai Kiado, Budapest, Hungary, 1973.

[11] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[12] D. P. Foster and E. I. George, "The risk inflation criterion for multiple regression," The Annals of Statistics, vol. 22, no. 4, pp. 1947–1975, 1994.

[13] L. Breiman, "Heuristics of instability and stabilization in model selection," The Annals of Statistics, vol. 24, no. 6, pp. 2350–2383, 1996.

[14] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.

[15] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006.

[16] P. Zhao and B. Yu, "On model selection consistency of lasso," Journal of Machine Learning Research, vol. 7, pp. 2541–2563, 2006.

[17] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, pp. 928–961, 2004.

[18] Y. Kim, H. Choi, and H.-S. Oh, "Smoothly clipped absolute deviation on high dimensions," Journal of the American Statistical Association, vol. 103, no. 484, pp. 1665–1673, 2008.

[19] Y. Kim and S. Kwon, "Global optimality of nonconvex penalized estimators," Biometrika, vol. 99, no. 2, pp. 315–325, 2012.

[20] C.-H. Zhang and T. Zhang, "A general theory of concave regularization for high-dimensional sparse estimation problems," Statistical Science, vol. 27, no. 4, pp. 576–593, 2012.

[21] L. Wang, Y. Kim, and R. Li, "Calibrating nonconvex penalized regression in ultra-high dimension," The Annals of Statistics, vol. 41, no. 5, pp. 2505–2536, 2013.

[22] H. Wang, R. Li, and C.-L. Tsai, "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, vol. 94, no. 3, pp. 553–568, 2007.

[23] H. Wang, B. Li, and C. Leng, "Shrinkage tuning parameter selection with a diverging number of parameters," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 71, no. 3, pp. 671–683, 2009.

[24] J. Chen and Z. Chen, "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, vol. 95, no. 3, pp. 759–771, 2008.

[25] Y. Kim, S. Kwon, and H. Choi, "Consistent model selection criteria on high dimensions," Journal of Machine Learning Research, vol. 13, pp. 1037–1057, 2012.

[26] L. Breiman, "Better subset regression using the non-negative garrote," Technometrics, vol. 37, no. 4, pp. 373–384, 1995.

[27] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 67, no. 2, pp. 301–320, 2005.

[28] H. Zou and H. H. Zhang, "On the adaptive elastic-net with a diverging number of parameters," The Annals of Statistics, vol. 37, no. 4, pp. 1733–1751, 2009.

[29] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," The Annals of Statistics, vol. 36, no. 4, pp. 1509–1533, 2008.

[30] D. R. Hunter and R. Li, "Variable selection using MM algorithms," The Annals of Statistics, vol. 33, no. 4, pp. 1617–1642, 2005.

[31] P. Breheny and J. Huang, "Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection," The Annals of Applied Statistics, vol. 5, no. 1, pp. 232–253, 2011.

[32] J. Fan and J. Lv, "Nonconcave penalized likelihood with NP-dimensionality," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5467–5484, 2011.

[33] T. Zhang, "Multi-stage convex relaxation for feature selection," Bernoulli, vol. 19, no. 5, pp. 2277–2293, 2013.

[34] J. H. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.

[35] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.

[36] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422, pp. 486–494, 1993.

[37] H. Wang and C. Leng, "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, vol. 102, no. 479, pp. 1039–1048, 2007.

[38] H. Zou and T. Hastie, "On the 'degrees of freedom' of lasso," The Annals of Statistics, vol. 35, pp. 2173–2192, 2007.

[39] E. R. Lee, H. Noh, and B. U. Park, "Model selection via Bayesian information criterion for quantile regression models," Journal of the American Statistical Association, vol. 109, no. 505, pp. 216–229, 2014.

[40] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010.

[41] N. E. Heckman, "Spline smoothing in a partly linear model," Journal of the Royal Statistical Society, Series B: Methodological, vol. 48, no. 2, pp. 244–248, 1986.

[42] T. A. Stamey, J. N. Kabalin, J. E. McNeal et al., "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients," The Journal of Urology, vol. 141, no. 5, pp. 1076–1083, 1989.

[43] J. Huang, J. L. Horowitz, and F. Wei, "Variable selection in nonparametric additive models," The Annals of Statistics, vol. 38, no. 4, pp. 2282–2313, 2010.

[44] P. Radchenko, "High dimensional single index models," Journal of Multivariate Analysis, vol. 139, pp. 266–282, 2015.

[45] G. Aneiros and P. Vieu, "Variable selection in infinite-dimensional problems," Statistics & Probability Letters, vol. 94, pp. 12–20, 2014.

[46] G. Aneiros and P. Vieu, "Partial linear modelling with multi-functional covariates," Computational Statistics, vol. 30, no. 3, pp. 647–671, 2015.

Page 4: Research Article Variable Selection and Parameter Estimation … · 2020. 1. 12. · garrotte [], elastic-net [ , ], SCAD [], and MCP []. In particular, 1 penalized least squares

4 Journal of Probability and Statistics

Since 119901 may vary with 119899 it is implicit that 120573lowast may varywith 119899 Additionally we allow model 119860 and the distributionof 120576 (in particular 1205902) to change with 119899 Condition (A) limitshow119901 and 1205902may growwith 119899This condition is substantiallyweaker than that required in [17] which requires 1199015119899 rarr 0and slightly weaker than that required in [28] which requireslog(119901) log(119899) rarr ] isin [0 1) and the same as that requiredin [6] As mentioned in Section 1 other authors have studiedPLS methods in settings where 119901 gt 119899 that is their growthcondition on 119901 is weaker than condition (A) [18] Condition(B) gives a lower bound on the size of the smallest nonzeroentry of 120573lowast Notice that the smallest nonzero entry of 120573lowastis allowed to vanish asymptotically provided it does not doso faster than radic1199011205902119899 Similar conditions are found in [17]Condition (C) restricts the rates of tuning parameters 120582 and120574 Note that condition (C) does not constrain the minimumsize of 120574 Indeed no such constraint is required for ourasymptotic results about the Atan estimator Since the Atanpenalty approaches 119897

0penalty as 120574 rarr 0 this suggests that

the Atan and 1198970penalized least squares estimator have similar

asymptotic properties On the other hand in practice wehave found that one should not take 120574 too small in order topreserve stability of the Atan estimator Condition (D) is anidentifiability condition Conditions (E) and (F) are used toprove asymptotic normality ofAtan estimators and are relatedto the Lindeberg condition of the Lindeberg-Feller centrallimit theorem Conditions (A)ndash(F) imply that Atan has theoracle property andmay correctly identifymodel119860 as we willsee inTheorem 4

32 Oracle Properties Let

119876119899(120573) =

1

2119899

1003817100381710038171003817y minus X12057310038171003817100381710038172+

119901

sum

119895=1

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816) (8)

be the objective function and 119901120582120574(|120573119895|) is the Atan penalty

function

Theorem 2 Suppose that conditions (A)ndash(D) hold then forevery 119903 isin (0 1) there exists a constant 119862

0gt 0 such that

lim inf 119875119899rarrinfin

[

[

arg min120573

119876119899(120573)

sube

120573 isin 1198771199011003817100381710038171003817120573 minus 120573

lowast1003817100381710038171003817 le 119862radic1199011205902

119899

]

]

gt 1 minus 119903

(9)

whenever119862 ge 1198620 Consequently there exists a sequence of local

minimizers of119876119899(120573) and such that minus120573lowast = 119874

119875(radic1199011205902119899)

Lemma 3 Assume that (A)ndash(D) hold and fix 119862 gt 0 then

lim119899rarrinfin

119875[[

[

argmin120573minus120573lowastle119862radic119901120590

2119899

119876119899(120573) sube 120573 isin 119877

119901 120573119860119888 = 0

]]

]

= 1

(10)

where119860119888 = 1 119901 119860 is the complement of119860 in 1 119901

Theorem 4 (oracle properties) Suppose that (A)ndash(F) holdthen there exists a sequence ofradic1198991199011205902-consistent localminimaof Atan such that

(i) lim119899rarrinfin

119875(119895 119895

= 0 = 119860) = 1

(ii) radic119899119861119899(119899minus1119883119879

1198601198831198601205902)12

(119860minus 120573lowast

119860) rarr 119873(0 119866)

in distribution where 119861119899is any arbitrary 119902 times |119860| matrix such

that 119861119899119861119879

119899rarr 119866 and 119866 is 119902 times 119902 nonegative symmetric matrix

4 Implementation

41 Iteratively Reweighted Lasso Algorithm The computationfor the Atan-penalized method is much more involvedbecause the resulting optimization problem is usually non-convex and hasmultiple local minimizers Several algorithmshave been developed for computing the folded concave penal-ized estimators such as the local quadratic approximation(LQA) algorithm and the local linear approximation (LLA)algorithm [3 29] Both LQA and LLA are related to themajorization-minorization (MM) principle [30] Recentlycoordinate descent algorithm was applied to solve the foldedconcave penalized least squares [31 32] Reference [4] deviseda PLUS algorithm for solving the PLS using the MCP Zhang[33] analyzed the capped-119897

1penalty for solving the PLS

In this section we present an iteratively reweighted Lasso(IRL) algorithm to solve the followingminimization problem

min120573isinR119901

1

2119899

1003817100381710038171003817y minus X12057310038171003817100381710038172+ 120582

119901

sum

119895=1

(120574 +2

120587) arctan(

10038161003816100381610038161003816120573119895

10038161003816100381610038161003816

120574)

(11)

We show that the solution of the penalty least squares prob-lem can be transformed into that of a series of convexweighted Lasso estimators to which the existing Lasso algo-rithms can be efficiently applied

Now taking a first-order Taylor-series approximation ofthe Atan penalty about the current value 120573

119895= 120573lowast

119895 we obtain

the overapproximation of (11)

min120573isinR119901

1

2119899

1003817100381710038171003817y minus X12057310038171003817100381710038172+ 120582

119901

sum

119895=1

120574 (120574 + 2120587)

1205742 + 120573lowast2119895

10038161003816100381610038161003816120573119895

10038161003816100381610038161003816

(12)

Note that the linear approximation in (12) is analogous to oneproposed in Zou and Li [29] and they argued strongly in favorof a one-step approximation Instead we offer the followingalgorithm for Atan-penalized least squares Assume all thecovariates have been standardized to havemean zero and unitvarianceWithout loss of generality the span of regularizationparameters is 120582min = 120582

0lt sdot sdot sdot lt 120582

119871= 120582max for 119897 = 0 119871

[34] Here we summarize the details of IRL algorithm as inAlgorithm 1 where the active set is defined as 119860 = 119895 | 120573

119895=

0 119895 = 1 2 119901In the situation with 119899 gt 119901 we set 120582min = 0 However

when 119899 lt 119901 one attains the saturated model long before theregularization parameter reaches zero In this case we suggest120582min = 10

minus4120582max [34] and we used 120591 = 10

minus4 In practice wehave found that if the columns of X are standardized so thatx1198952= 119899 for 119895 = 1 119901 then taking 120574 = 0005 works well

Journal of Probability and Statistics 5

(1) Start with 120582max = max1le119895le119901

|X119879119910|119899 Set 119897 = 119871 and 120573(119897)= 0

Outer loop(2) Set 119896 = 0 and

(0)

= 120573(119897)

(3) Increment 119896 = 119896 + 1 and (119896+1)

= (119896)

(4) Update the weights

119895= 120574(120574 + 2120587)(120574

2+ ((119896)

119895)2) 119895 = 1 119901

Inner loopSolve the Karush-Kuhn-Tucker (KKT) conditions for fixed

119895

119909119879

119895(119910 minus X120573) minus 120582

119897119895sgn(120573

119895) = 0 if 119895 isin 119860

|119909119879

119895(119910 minus X120573)| lt 120582

119897119895 if 119895 notin 119860

(5) Goto Step (3)

(6) Repeat Steps (3)ndash(5) until (119896+1)

minus (119896)

2

lt 120591The estimate 120573

(119897)is the limit point of the outer loop

infin

(7) Decrement 119897 = 119897 minus 1 and 120582

119897= 120582119897minus1 Return to (2) using 120573

(119897)as a warm start

Algorithm 1 Iteratively reweighted Lasso (IRL) algorithm

Remark 5 Algorithm 1 is very much like MM algorithmsHunter and Li [30] were the first to advocate MM algorithmsfor variable selection when 119899 gt 119901 However we should noticethat Algorithm 1 differs from the algorithm described in [30]The use of MM algorithms in [30] is to justify a quadraticoverapproximation to penalty functions with singularities atthe origin In the inner loop of Algorithm 1 we avoid suchapproximation by solving the KKT conditions precisely andefficiently [35] Our use of MM algorithms in Algorithm 1 isto justify the local linear approximation toAtan penalty in theouter loop

Remark 6 LLA to SCAD penalty were also proposed by Zouand Li [29] Algorithm 1 differs from that proposed in [29] inthat it constructs the entire coefficient path even when 119901 gt

119899 whereas the procedure in [29] computes the coefficientestimates for fixed regularization parameter 120582 starting with aroot-119899 consistent estimator of true coefficient 120573lowast at the initialstep Moreover even with the same initial estimate the limitpoint from Algorithm 1 will differ from [29] after one itera-tion

Remark 7 In the more general case where 119901 gt 119899 we usedcoordinate-wise optimization to compute the entire regu-larized coefficient path via Algorithm 1 Our experience isthat coordinate-wise optimization works well in practice andconverges very quickly

42 Regularity Parameter Selection Tuning parameter selec-tion is an important issue in most PLS procedures There arerelatively few studies on the choice of penalty parametersTraditional model selection criteria such as AIC [10] andBIC [11] suffer from a number of limitations Their majordrawback arises because parameter estimation and modelselection are two different processes which can result ininstability [13] and complicated stochastic properties Toovercome the deficiency of traditional methods Fan and Li[3] proposed the SCADmethod which estimates parameters

while simultaneously selecting important variables Theyselected tuning parameter by minimizing GCV [1 3 26]

However it is well known that GCV and AIC-basedmethods are not consistent for model selection in the sensethat as 119899 rarr infin they may select irrelevant predictors withnonvanishing probability [22 36] On the other hand BIC-based tuning parameter selection roughly corresponds tomaximizing the posterior probability of selecting the truemodel in an appropriate Bayesian formulation and has beenshown to be consistent for model selection in several settings[22 37 38] The BIC tuning parameter selector is defined by

BIC = log(10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

119899) + DF

log (119899)119899

(13)

where DF is the estimate of the degrees of freedom given by

DF = tr X (X119879 + 119899Σ120582)119879

X119879 (14)

andΣ120582= diag1199011015840

120582(|1|)|1| 119901

1015840

120582(|119901|)|119901|The diagonal

elements of Σ120582are coefficients of quadratic terms in the local

quadratic approximation to SCAD penalty function 119901120582(sdot) [3]

Dicker et al [6] proposed BIC-like procedures imple-mented by minimizing

BIC0= log(

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

119899 minus 0

) +log (119899)

1198990 (15)

To estimate the residual variance they use 2 = (119899minus0)minus1yminus

X2 This differs from other estimates of the residualvariance used in PLSmethods where the denominator 119899minus

0

is replaced by 119899 [22] here 119899minus0is used to account for degrees

of freedom lost to estimationMore works on the high-dimensional BIC for the least

squares regression to tuning parameter selection for noncon-vex penalized regression can be seen in [24 25 39] Here weused BIC statistic as (15)

6 Journal of Probability and Statistics

43 A Standard Error Formula The standard errors for theestimated parameters can be obtained directly because weare estimating parameters and selecting variables at the sametime Let = (120582 120574) be a local minimizer of Atan Following[3 17] standard errors of may be estimated by using quad-ratic approximations to Atan Indeed the approximation

119901120582119886

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816) asymp 119901120582120574

(100381610038161003816100381610038161205731198950

10038161003816100381610038161003816)

+1

2100381610038161003816100381610038161205731198950

10038161003816100381610038161003816

1199011015840

120582120574(100381610038161003816100381610038161205731198950

10038161003816100381610038161003816) (1205732

119895minus 1205732

1198950)

for 120573119895asymp 1205731198950

(16)

suggests that Atan may be replaced by the quadratic mini-mization problem

min

1

119899

1003817100381710038171003817y minus X12057310038171003817100381710038172+

119901

sum

119895=1

1199011015840

120582119886(100381610038161003816100381610038161205731198950

10038161003816100381610038161003816)

100381610038161003816100381610038161205731198950

10038161003816100381610038161003816

1205732

119895

(17)

at least for the purposes of obtaining standard errors Usingthis expression we obtain a sandwich formula for the esti-mated standard error of

where = 119895

119895= 0 Consider

cov () = 2X119879X+ 119899Δ

()minus1

sdot X119879XX119879X+ 119899Δ

()minus1

(18)

where Δ(120573) = diag1199011015840120582119886(|1205731|)|1205731| 119901

1015840

120582119886(|120573119901|)|120573119901| 2 =

119899minus1y minus X2 and

0= || is the number of elements in ||

Under the conditions of Theorem 4

119861119899119883119879

119883cov (

) 119861119879

119899

1205902997888rarr 119866 (19)

This is a consistency result for cov()

5 Numerical Studies

51 Simulation Studies We now investigate the sparsityrecovery and estimation properties of the proposed estimatorvia numerical simulations We compare the following esti-mators the Lasso estimator (implemented using R packageglmnet [40]) the adaptive Lasso estimator (denoted byAlasso [15]) the SCAD estimator from the CD algorithmwithout calibration [31] the MCP estimator from the CDalgorithm with 119886 = 15 [31] the Dantzig selector [7] andBayesian variable select method [8] For the proposed Atanestimator we take 120574 = 0005 and BIC statistic (15) is usedto select the tuning parameter 120582 In the following we reportsimulation results from four examples

Example 8 In this example simulation data are generatedfrom the linear regression model

119910 = x119879120573lowast + 120590120576 (20)

where 120573lowast = (3 15 0 0 2 0 0 0 0 0 0 0)119879 120576 sim 119873(0 1) and

x is multivariate normal distribution with zero mean and

covariance between the 119894th and 119895th elements being 120588|119894minus119895| with120588 = 05 In our simulation sample size 119899 is set to be 100 and200 120590 = 2 For each case we repeated the simulation 200times

For linear model model error for = x119879 is ME() =

( minus 120573)119879119864(xx119879)( minus 120573) Simulation results are summarized

in Table 1 in which MRME stands for median of ratios ofME of a selected model to that of the unpenalized minimumsquare estimate under the full model Both the columns ofldquoCrdquo and ldquoICrdquo are measures of model complexity Column ldquoCrdquoshows the average number of nonzero coefficients correctlyestimated to be nonzero and column ldquoICrdquo presents theaverage number of zero coefficients incorrectly estimated tobe nonzero In the column labeled ldquounderfitrdquo we presentthe proportion of excluding any nonzero coefficients in 200replications Likewise we report the probability of selectingthe exact subset model and the probability of including allthree significant variables and some noise variables in thecolumns ldquocorrect-fitrdquo and ldquooverfitrdquo respectively

As can be seen from Table 1 all variable selection proce-dures dramatically reduce model error Atan has the smallestmodel error among all competitors followed by AlassoMCPSCAD Bayesian and Dantzig In terms of sparsity Atan alsohas the highest probability of correct-fit The Atan penaltyperforms better than the other penalties Also Atan has someadvantages when dimensional 119901 is high which can be seen inExample 9

We now test the accuracy of our standard error formula(18) The median absolute deviation divided by 06745denoted by SD in Table 2 of 3 estimated coefficients in the200 simulations can be regarded as the true standard errorThe median of the 200 estimated SDs denoted by SDm andthe median absolute deviation error of the 200 estimatedstandard errors divided by 06745 denoted by SDmad gaugethe overall performance of standard error formula (18)Table 2 presents the results for nonzero coefficients whensample size 119899 = 200 The results for the other case with 119899 =

100 are similar Table 2 suggests that the sandwich formulaperforms surprisingly well

Example 9 The example is from Wang et al [23] Morespecifically we take 120573 = (114 minus236 3712 minus139 13 0 0 0)

119879isin R119901 and 119901 = [4119899

14] minus 5 and [119905] stands for the

largest integer no larger than 119905 For this example predictordimension 119901 is diverging but the dimension of the truemodel is fixed to be 5 Results from the simulation study arefound in Table 3 A similar conclusion as in Example 8 can befound

Example 10 In this simulation study presented here weexamined the performance of the various PLS methods for 119901substantially larger than in the previous studies In particularwe took 119901 = 339 119899 = 500 1205905 = 5 and 120573

lowast= (2119868

119879

37 minus3119868119879

37

119868119879

37 0119879

228) where 119868119896 isin 119877

119896 is the vector with all entries equalto 1 Thus 119901

0= 111 We simulated 200 independent datasets

(1199101 119909119879

1) (119910

119899 119909119879

119899) in this study and for each dataset we

computed estimates of 120573lowast Results from this simulation studyare found in Table 4

Journal of Probability and Statistics 7

Table 1 Simulation results for linear regression models of Example 8

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 100 120590 = 2

Lasso 06213 30000 12050 00000 03700 06700Alasso 03074 30000 03500 00000 07300 02700SCAD 02715 30000 10650 00000 04400 05600MCP 03041 30000 05650 00000 05800 04200Dantzig 04623 30000 06546 00000 05700 04300Bayesian 03548 30000 05732 00000 06300 03700Atan 02550 30000 01750 00000 08450 01550

119899 = 200 120590 = 2

Lasso 06027 30000 10700 00000 03550 06450Alasso 02781 30000 01600 00000 08650 01350SCAD 02900 30000 08550 00000 05250 04750MCP 02752 30000 03650 00000 06850 03150Dantzig 03863 30000 08576 00000 04920 05080Bayesian 02563 30000 04754 00000 07150 02850Atan 02508 30000 01000 00000 09050 00950

Table 2 Standard deviations of estimators for the linear regression model (119899 = 200)

Method 1

2

5

SD SDm (SDmad) SD SDm (SDmad) SD SDm (SDmad)Lasso 01753 01453 (00100) 01730 01688 (00094) 01591 01301 (00079)Alasso 01483 01638 (00095) 01642 01636 (00098) 01475 01439 (00073)SCAD 01797 01634 (00105) 01819 01608 (00104) 01398 01438 (00076)MCP 01602 01643 (00096) 01861 01656 (00097) 01464 01435 (00069)Dantzig 01734 01645 (00115) 01723 01665 (00094) 01581 01538 (00074)Bayesian 01568 01635 (00089) 01678 01649 (00092) 01367 01375 (00078)Atan 01510 01643 (00108) 01591 01658 (00096) 01609 01434 (00083)

Perhaps the most striking aspect of the results presentedin Table 4 is that hardly no method ever selected the correctmodel in this simulation study However given that 119901 119901

0

and 120573lowast are substantially larger in this study than in the

previous simulation studies this may not be too surprisingNotice that on average Atan selects the most parsimoniousmodels of all methods and has the smallest model errorAtanrsquos nearest competitor in terms of model error is AlassoThis implementation of Alasso has mean model error 02783but its average selected model size is 1035250 larger thanAtanrsquos Since 119901

0= 111 it is clear that Atan underfits in some

instances In fact all of the methods in this study underfitto some extent This may be due to the fact that many ofthe nonzero entries in 120573

lowast are small relative to the noise level1205902= 5

Example 11 As an extension of this method we consider theproblem of simultaneous variable selection and estimation inthe partially linear model

119884 = 1198831015840120573 + 119892 (119879) + 120576 (21)

where119884 is a scalar response variate119883 is a 119901-vector covariate119879 is a scalar covariate and takes values in a compact interval(for simplicity we assume this interval to be [0 1]) 120573 is 119901 times

1 column vector of unknown regression parameter function119892(sdot) is unknown and model error 120576 is independent of (119883 119879)withmean 0 Traditionally it has generally been assumed that120573 is finite dimension several standard approaches such as thekernel method the spline method [41] and the local linearestimation [17] have been proposed

In this study we simulate 119899 = 100 200 points 119879119894 119894 =

1 100(200) from the uniform distribution on [0 1] Foreach 119894 1198901015840

119894119895119904 are simulated to be normally distributed with

autocorrelated variance structure 119860119877(120588) such that

cov (119890119894119895 119890119894119897) = 120588|119894minus119895|

1 le 119894 119895 le 10 (22)

119883119894119895rsquos are then formed as follows

1198831198941= sin (2119879

119894) + 1198901198941

1198831198942= (05 + 119879

119894)minus2

+ 1198901198942

8 Journal of Probability and Statistics

Table 3 Simulation results for linear regression models of Example 9

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 200 119901 = 10

Lasso 09955 49900 20100 00100 01150 08750Alasso 05740 48650 01100 01350 08150 00500SCAD 05659 49800 05600 00200 05600 04200MCP 06177 49100 01200 00900 08250 00850Dantzig 06987 48900 06700 01100 04360 04540Bayesian 05656 48650 02500 01350 06340 02310Atan 05447 48900 01150 01100 08250 00650

119899 = 400 119901 = 12

Lasso 12197 50000 20650 00000 01400 08600Alasso 04458 49950 00900 00050 09250 00700SCAD 04481 50000 04850 00000 06350 03650MCP 04828 50000 01150 00000 08950 01050Dantzig 07879 50000 05670 00000 03200 06800Bayesian 04237 50000 01800 00000 07550 02450Atan 04125 49950 00250 00050 09700 00250

119899 = 800 119901 = 16

Lasso 12004 50000 25700 00000 00900 09100Alasso 03156 50000 00700 00000 09300 00700SCAD 03219 50000 06550 00000 05950 04050MCP 03220 50000 00750 00000 09300 00700Dantzig 05791 50000 05470 00000 03400 06600Bayesian 03275 50000 02800 00000 06750 03250Atan 03239 50000 00750 00000 09350 00650

Table 4 Simulation results for linear regression models of Example 10

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 500 120590 = 5

Lasso 03492 1102000 247450 05650 00000 04350Alasso 02783 1035250 61250 10000 00000 00000SCAD 03012 1060400 357150 09900 00000 00150MCP 02791 1033600 81100 10000 00000 00900Dantzig 03120 1083400 184650 07570 00000 02430Bayesian 02876 1044350 74700 09800 00000 00200Atan 02794 1014000 40600 10000 00000 00000

1198831198943= exp (119879

119894) + 1198901198943

1198831198945= (119879119894minus 07)

4+ 1198901198945

1198831198946= 119879119894(1 + 119879

2

119894)minus1

+ 1198901198946

1198831198945= radic1 + 119879

119894+ 1198901198947

1198831198948= log (3119879

119894+ 8) + 119890

1198948

(23)

We investigate the scenario 119901 = 10 119883119894119895= 119890119894119895 119895 = 4 9 10

In the scenario we have 120573119895= 119895 1 le 119895 le 4 and 120573

119895= 0 with

others 120576 sim 119873(0 1) For each case we repeated the simulation200 times We investigate 119892(sdot) functions 119892(119905) = cos (2120587119905)For comparison we apply the different penalties in the para-metric component with the B-spline in the nonparametriccomponent

The results are summarized in Table 5 Columns 2ndash5 inTable 5 are the averages of the estimates of 120573

119895 119895 = 1 4

respectively Column 6 is the number of estimates of 120573119895

Journal of Probability and Statistics 9

Table 5 Example 11 comparison of estimators

Estimator 1205731

1205732

1205733

1205734

119862 MME (SD) RASE (119892(119879))119899 = 100 119901 = 10

SCAD 09617 20128 28839 40611 47450 50000 00695 (00625) 06269Lasso 08278 19517 27984 39274 55200 60000 01727 (00790) 08079Dantzig 08767 19544 27894 39302 54590 60000 01696 (00680) 07278Atan 09710 20121 29865 40472 49820 50000 00658 (00630) 06214

119899 = 200 119901 = 10

SCAD 09836 19989 29283 40278 52000 50000 00230 (00210) 03604Lasso 09219 19529 29124 39580 51450 50000 00500 (00315) 04220Dantzig 09534 19675 29096 39345 51460 50000 00510 (00290) 04154Atan 09904 19946 29548 40189 51250 50000 00190 (00245) 03460

5 le 119895 le 119901 which are 0 averaged over 200 simulationsand their medians are given in column 7 Model errors arecomputed as ME() = ( minus 120573)

119879119864(xx119879)( minus 120573) Their medians

are listed in the 8th column followed by the model errorsstandard deviations in parentheses

The performance of (119879) is assessed by the square root ofaverage squared errors (RASE)

RASE2 = 1

119899

119899

sum

119894=1

( (119879119894) minus 119892 (119879

119894))2 (24)

where 119879119894 119894 = 1 119899 are the observed data points at which

function 119892 is estimated Column 9 summarizes the RASEvalues for the different situations

We can see the following from Table 5 (1) Comparingwith the Lasso and Dantzig estimator the SCAD and Atanestimators have a smallermodel error which is due to the factthat the SCAD and Atan estimators are unbiased estimatorswhile the Lasso and Dantzig are biased especially for thelarger coefficient Moreover the Atan and SCAD estimatorsare more stable (2) Each method is able to select importantvariables but it is obvious that the Atan estimator has slightlystronger sparsity (3) For the nonparametric component Atanand SCAD estimator have smaller RASE values

52 RealDataAnalysis In this sectionwe apply theAtan reg-ularization scheme to a prostate cancer example The datasetin this example is derived from a study of prostate cancer in[42]The dataset consists of themedical records of 97 patientswho were about to receive a radical prostatectomy Thepredictors are eight clinical measures log(cancer volume)(lcavol) log (prostate weight) (lweight) age the logarithmof the amount of benign prostatic hyperplasia (lbph) sem-inal vesicle invasion (svi) log (capsular penetration) (lcp)gleason score (gleason) and percentage gleason score 4 or5 (pgg45) The response is the logarithm of prostate-specificantigen (lpsa) One of the main aims here is to identify whichpredictors are more important in predicting the response

The Lasso Alasso SCAD MCP and Atan are all appliedto the data We also compute the OLS estimate of the prostatecancer data Results are summarized in Table 6 The OLSestimator does not perform variable selection Lasso selectsfive variables in the final model SCAD selects lcavol lweight

Table 6 Prostate cancer data comparing different methods

Method 1198772

11987721198772

OLS Variables selectedOLS 06615 10000 AllLasso 05867 08870 (1 2 4 5 8)Alasso 05991 09058 (1 2 5)SCAD 06140 09283 (1 2 4 5)MCP 05999 09069 (1 2 5)Atan 06057 09158 (1 2 5)

lbph and svi in the final model while Alasso MCP andAtan select lcavol lweight and svi Thus Atan selects asubstantially simpler model than Lasso SCAD Furthermoreas indicated by the columns labeled 119877

2 (1198772 is equal to oneminus the residual sum of squares divided by the total sum ofsquares) and11987721198772OLS in Table 6 the Atan estimator describesmore variability in the data than Alasso and MCP and nearlyas much as OLS estimator

6 Conclusion and Discussion

In this paper a new Atan penalty which very closely resem-bles 119871

0penalty is proposed First we establish the theory

of the Atan estimator under mild conditions The theoryindicates that the Atan-penalized least squares procedureenjoys oracle properties even when the number of variablesgrows slower than the number of observations Second theiteratively reweighted Lasso algorithm makes our proposedestimator implementation fast Third we suggest a BIC-liketuning parameter selector to identify the true model con-sistently Numerical studies further endorse our theoreticalresults and the advantage of the Atan estimator for modelselection

We do not address the situation where 119901 ≫ 119899 in thispaper In fact the proposed Atan method can be easilyextended for variable selection in the situation of119901 ≫ 119899 Alsoas it is shown in Example 11 the Atan method can be appliedto semiparametric model and nonparametric model [43 44]Furthermore there is a recent field of applications of variableselection which is to look for impact points in functional dataanalysis [45 46]The possible combination of Atan estimator

10 Journal of Probability and Statistics

with the questions in [45 46] would be interesting Theseproblems are beyond the scope of this paper and will beinteresting topics for future research

Appendix

Proof of Theorem 2 Let 120572119899= radic1199011205902119899 and fix 119903 isin (0 1) To

prove the theorem it suffices to show that if 119862 gt 0 is largeenough then

119876119899(120573lowast) lt inf120583=119862

119876119899(120573lowast+ 120572119899120583) (A1)

holds for all 119899 sufficiently large with probability at least 1 minus 119903Define119863

119899(120583) = 119876

119899(120573lowast+ 120572119899120583) minus 119876

119899(120573lowast) and note that

119863119899(120583) =

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

+

119901

sum

119895=1

119901120582119886

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

(A2)

The fact that 119901120582120574

is concave on [0infin) implies that

119901120582120574

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (

10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816minus10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (minus120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816)

= minus120582120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2

(A3)

when 119899 is sufficiently largeCondition (B) implies that

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2le120574 (120574 + 2120587)

1205742 + 1205882 (A4)

Thus for 119899 big enough

119863119899(120583) ge

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

minus119862119901120582120572

119899120574 (120574 + 2120587)

1205742 + 1205882

(A5)

By (D)1

21198991205722

119899

1003817100381710038171003817X12058310038171003817100381710038172ge120582min2

11986221205722

119899 (A6)

On the other hand (D) implies1

119899120572119899

10038161003816100381610038161003816120576119879X12058310038161003816100381610038161003816 le

119862120572119899

radic119899

10038171003817100381710038171003817100381710038171003817

1

radic119899X119879120576

10038171003817100381710038171003817100381710038171003817= 119874119875(1198621205722

119899) (A7)

Furthermore (C) and (B) imply

119862119901120582120572119899120574 (120574 + 2120587)

1205742 + 1205882= 119900 (119862120572

2

119899) (A8)

From (A5)ndash(A8) we conclude that if 119862 gt 0 is large enoughthen inf

120583=119862119863119899(120583) gt 0 holds for all 119899 sufficiently large with

probability at least 1 minus 119903 This proves Theorem 2

Proof of Lemma 3 Suppose that 120573 isin 119877119901 and that 120573 minus 120573

lowast le

119862radic1199011205902119899 Define isin 119877119901 by

119860119888 = 0 and

119860= 120573119860 Similar to

the proof of Theorem 2 let

119863119899(120573 ) = 119876

119899(120573) minus 119876

119899() (A9)

where 119876119899(120573) is defined in (8) Then

119863119899(120573 ) =

1

2119899

1003817100381710038171003817y minus X12057310038171003817100381710038172minus

1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899

10038171003817100381710038171003817y minus X minus X (120573 minus )

10038171003817100381710038171003817

2

minus1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899(120573 minus )

119879

X119879X (120573 minus )

minus1

2119899(120573 minus )

119879

X119879 (y minus X)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

= 119874119901(10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817radic1199011205902

119899)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

(A10)

On the other hand since the Atan penalty is concave on[0infin)

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816) ge 120582 (120574 +

2

120587) arctan( 119862

120574radic1198991199011205902)

10038161003816100381610038161003816120573119895

10038161003816100381610038161003816 (A11)

for 119895 isin 119860119888 Thus

sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

ge 120582 (120574 +2

120587) arctan( 119862

120574radic1198991199011205902)

10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817

(A12)

By (C)

lim inf119899rarrinfin

arctan( 119862

120574radic1198991199011205902) gt 0 (A13)

It follows from (A10)ndash(A12) that there is constant119870 gt 0 suchthat

119863119899(120573 )

10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817

ge 119870120582 + 119874119875(radic

1199011205902

119899) (A14)

Since 120582radic1199011205902119899 rarr infin the result follows

Journal of Probability and Statistics 11

Proof ofTheorem 4 Taken togetherTheorem 2 and Lemma 3imply that there exists a sequence of local minima of (8)such that minus 120573lowast = 119874

119875(radic1199011205902119899) and

119860119888 = 0 Part (i) of the

theorem follows immediatelyTo prove part (ii) observe that on the event 119895

119895= 0 =

119860 we must have

119860= 120573lowast

119860+ (X119879119860X119860)minus1

X119879119860120576 minus (119899

minus1X119879119860X119860)minus1

1199011015840

119860 (A15)

where 1199011015840119860= (1199011015840

120582120574(119895))119895isin119860

It follows that

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576

minus 119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

(A16)

whenever 119895 119895

= 0 = 119860 Now note that conditions (B)ndash(D)imply

1003817100381710038171003817100381710038171003817119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

1003817100381710038171003817100381710038171003817

= 119874119875(radic

119899119901

1205902

120582120574 (120574 + 2120587)

1205742 + 1205882) = 119900

119875(1)

(A17)

and thus

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576 + 119900119875 (1)

(A18)

To complete the proof of (ii) we use the Lindeberg-Fellercentral limit theorem to show that

119861119899(1205902X119879119860X119860)minus12

X119879119860120576 997888rarr 119873(0 119866) (A19)

in distribution Observe that

119861119899(1205902X119879119860X119860)minus12

X119879119860120576 =

119899

sum

119894=1

120596119894119899 (A20)

where 120596119894119899

= 119861119899(1205902X119879119860X119860)minus12

119909119894119860120576119894

Fix 1205750

gt 0 and let 120578119894119899

=

119909119879

119894119860(X119879119860X119860)minus12

119861119879

119899119861119899(X119879119860X119860)minus12

119909119894119860 Then

119864 [1003817100381710038171003817120596119894119899

100381710038171003817100381721003817100381710038171003817120596119894119899

10038171003817100381710038172gt 1205750] = 120578119894119899119864[

1205762

119894

12059021205781198941198991205762

119894

1205902gt 1205750]

le 120578119894119899119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

119875(1205781198941198991205762

119894

1205902gt 1205750)

120575(2+120575)

le 1205781+1205752+120575

119894119899120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

(A21)

Sincesum119899119894=1

120578119894119899

= tr(119861119879119899119861119899) rarr tr(119866) lt infin and since (E) implies

max1le119894le119899

120578119894119899120582min (119899

minus1X119879X) 120582max (119861119879

119899119861119899)max1le119894le119899

1

119899

119901

sum

119895=1

1199092

119894119895

997888rarr 0

(A22)

we must have119899

sum

119894=1

119864 [1003817100381710038171003817120596119894119899

100381710038171003817100381721003817100381710038171003817120596119894119899

10038171003817100381710038172gt 1205750]

= 120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575) 119899

sum

119894=1

1205781+120575(2+120575)

119894119899

le 120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

tr (119861119879119899119861119899)max1le119894le119899

120578120575(2+120575)

119894119899

997888rarr 0

(A23)

Thus the Lindeberg condition is satisfied and (A19)holds

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the Tian Yuan Special Funds ofthe National Natural Science Foundation of China (Grant no11426192) and the Project of Education of Zhejiang Province(no Y201533324)

References

[1] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society Series B Methodologicalvol 58 no 1 pp 267ndash288 1996

[2] I E Frank and J H Friedman ldquoA statistical view of somechemometrics regression toolsrdquoTechnometrics vol 35 no 2 pp109ndash148 1993

[3] J Fan and R Li ldquoVariable selection via nonconcave penalizedlikelihood and its oracle propertiesrdquo Journal of the AmericanStatistical Association vol 96 no 456 pp 1348ndash1360 2001

[4] C-H Zhang ldquoNearly unbiased variable selection under mini-max concave penaltyrdquoThe Annals of Statistics vol 38 no 2 pp894ndash942 2010

[5] J Lv and Y Fan ldquoA unified approach to model selection andsparse recovery using regularized least squaresrdquo The Annals ofStatistics vol 37 no 6 pp 3498ndash3528 2009

[6] L Dicker B Huang and X Lin ldquoVariable selection and estima-tion with the seamless-119871

0penaltyrdquo Statistica Sinica vol 23 no

2 pp 929ndash962 2013[7] E Candes and T Tao ldquoThe Dantzig selector statistical estima-

tion when p is much larger than nrdquoThe Annals of Statistics vol35 no 6 pp 2313ndash2351 2007

[8] J Ghosh and A E Ghattas ldquoBayesian variable selection undercollinearityrdquo The American Statistician vol 69 no 3 pp 165ndash173 2015

12 Journal of Probability and Statistics

[9] J Fan L Z Xue andH Zou ldquoStrong oracle optimality of foldedconcave penalized estimationrdquo Annals of Statistics vol 42 no3 pp 819ndash849 2014

[10] H Akaike ldquoInformation theory and an extension of the max-imum likelihood principlerdquo in Proceedings of the 2nd Interna-tional Symposium on Information Theory B N Petrov and FCsaki Eds pp 267ndash281 Akademiai Kiado Budapest Hungary1973

[11] G Schwarz ldquoEstimating the dimension of a modelrdquo Annals ofStatistics vol 6 no 2 pp 461ndash464 1978

[12] D P Foster and E I George ldquoThe risk inflation criterion formultiple regressionrdquo The Annals of Statistics vol 22 no 4 pp1947ndash1975 1994

[13] L Breiman ldquoHeuristics of instability and stabilization in modelselectionrdquoThe Annals of Statistics vol 24 no 6 pp 2350ndash23831996

[14] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoAnnals of Statistics vol 28 no 5 pp 1356ndash1378 2000

[15] H Zou ldquoThe adaptive lasso and its oracle propertiesrdquo Journal ofthe American Statistical Association vol 101 no 476 pp 1418ndash1429 2006

[16] P Zhao and B Yu ldquoOn model selection consistency of lassordquoJournal of Machine Learning Research vol 7 pp 2541ndash25632006

[17] J Fan and H Peng ldquoNonconcave penalized likehood with adiverging number parametersrdquo Annals of Statistics vol 32 pp928ndash961 2004

[18] YKimHChoi andH-SOh ldquoSmoothly clipped absolute devi-ation on high dimensionsrdquo Journal of the American StatisticalAssociation vol 103 no 484 pp 1665ndash1673 2008

[19] Y Kim and S Kwon ldquoGlobal optimality of nonconvex penalizedestimatorsrdquo Biometrika vol 99 no 2 pp 315ndash325 2012

[20] C-H Zhang and T Zhang ldquoA general theory of concave reg-ularization for high-dimensional sparse estimation problemsrdquoStatistical Science vol 27 no 4 pp 576ndash593 2012

[21] L Wang Y Kim and R Li ldquoCalibrating nonconvex penalizedregression in ultra-high dimensionrdquoTheAnnals of Statistics vol41 no 5 pp 2505ndash2536 2013



Algorithm 1 (iteratively reweighted Lasso (IRL) algorithm).

(1) Start with $\lambda_{\max} = \max_{1 \le j \le p} |x_j^T y| / n$. Set $l = L$ and $\hat{\beta}^{(l)} = 0$.

Outer loop:
(2) Set $k = 0$ and $\beta^{(0)} = \hat{\beta}^{(l)}$.
(3) Increment $k = k + 1$.
(4) Update the weights
$$\omega_j = \frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta_j^{(k)})^2}, \quad j = 1, \ldots, p.$$
Inner loop: compute $\beta^{(k+1)}$ by solving the Karush-Kuhn-Tucker (KKT) conditions for fixed $\omega_j$:
$$x_j^T (y - X\beta) - \lambda_l \omega_j \operatorname{sgn}(\beta_j) = 0, \quad \text{if } j \in A,$$
$$|x_j^T (y - X\beta)| < \lambda_l \omega_j, \quad \text{if } j \notin A.$$
(5) Go to Step (3).
(6) Repeat Steps (3)-(5) until $\|\beta^{(k+1)} - \beta^{(k)}\|_2 < \tau$. The estimate $\hat{\beta}^{(l)}$ is the limit point $\beta^{(\infty)}$ of the outer loop.
(7) Decrement $l = l - 1$ and move to the next grid value $\lambda_l$. Return to (2), using $\hat{\beta}^{(l+1)}$ as a warm start.
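To make the recursion concrete, the following Python sketch implements one point on the path: the outer-loop weight update of Algorithm 1 together with a weighted Lasso inner step. It is only an illustration, not the authors' code: the inner step is solved here with scikit-learn's coordinate-descent Lasso via column rescaling rather than by a LARS-type KKT solver [35], and the function name and defaults are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso

def atan_irl(X, y, lam, gamma=0.005, max_outer=50, tol=1e-6):
    """Iteratively reweighted Lasso for the Atan penalty at a fixed lambda (sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_outer):
        # Weight update from Algorithm 1: w_j = gamma*(gamma + 2/pi)/(gamma^2 + beta_j^2)
        w = gamma * (gamma + 2.0 / np.pi) / (gamma ** 2 + beta ** 2)
        # Weighted Lasso via column rescaling: with X_w = X / w and b = w * beta,
        # (1/2n)||y - X_w b||^2 + lam*||b||_1 equals the weighted criterion in beta.
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000).fit(X / w, y)
        beta_new = fit.coef_ / w
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Warm starts over a decreasing grid of $\lambda$ values are obtained by passing the previous solution in as the initial `beta`, exactly as in step (7).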

Remark 5. Algorithm 1 is very much like an MM algorithm. Hunter and Li [30] were the first to advocate MM algorithms for variable selection when $n > p$. However, we should notice that Algorithm 1 differs from the algorithm described in [30]. The use of MM algorithms in [30] is to justify a quadratic overapproximation to penalty functions with singularities at the origin. In the inner loop of Algorithm 1, we avoid such an approximation by solving the KKT conditions precisely and efficiently [35]. Our use of MM ideas in Algorithm 1 is to justify the local linear approximation to the Atan penalty in the outer loop.

Remark 6. A local linear approximation (LLA) to the SCAD penalty was also proposed by Zou and Li [29]. Algorithm 1 differs from the procedure proposed in [29] in that it constructs the entire coefficient path, even when $p > n$, whereas the procedure in [29] computes the coefficient estimates for a fixed regularization parameter $\lambda$, starting with a root-$n$-consistent estimator of the true coefficient $\beta^*$ at the initial step. Moreover, even with the same initial estimate, the limit point of Algorithm 1 will differ from that of [29] after one iteration.

Remark 7. In the more general case where $p > n$, we used coordinate-wise optimization to compute the entire regularized coefficient path via Algorithm 1. Our experience is that coordinate-wise optimization works well in practice and converges very quickly.

4.2. Regularization Parameter Selection. Tuning parameter selection is an important issue in most PLS procedures, yet there are relatively few studies on the choice of penalty parameters. Traditional model selection criteria, such as AIC [10] and BIC [11], suffer from a number of limitations. Their major drawback arises because parameter estimation and model selection are two different processes, which can result in instability [13] and complicated stochastic properties. To overcome this deficiency of traditional methods, Fan and Li [3] proposed the SCAD method, which estimates parameters while simultaneously selecting important variables; they selected the tuning parameter by minimizing GCV [1, 3, 26].

However, it is well known that GCV- and AIC-based methods are not consistent for model selection, in the sense that, as $n \to \infty$, they may select irrelevant predictors with nonvanishing probability [22, 36]. On the other hand, BIC-based tuning parameter selection roughly corresponds to maximizing the posterior probability of selecting the true model in an appropriate Bayesian formulation and has been shown to be consistent for model selection in several settings [22, 37, 38]. The BIC tuning parameter selector is defined by

$$\mathrm{BIC}_\lambda = \log\left(\frac{\|y - X\hat{\beta}_\lambda\|^2}{n}\right) + \mathrm{DF}_\lambda \, \frac{\log(n)}{n}, \quad (13)$$

where $\mathrm{DF}_\lambda$ is the estimate of the degrees of freedom given by

$$\mathrm{DF}_\lambda = \operatorname{tr}\left\{X\left(X^T X + n\Sigma_\lambda\right)^{-1} X^T\right\}, \quad (14)$$

and $\Sigma_\lambda = \operatorname{diag}\{p'_\lambda(|\hat{\beta}_1|)/|\hat{\beta}_1|, \ldots, p'_\lambda(|\hat{\beta}_p|)/|\hat{\beta}_p|\}$. The diagonal elements of $\Sigma_\lambda$ are the coefficients of the quadratic terms in the local quadratic approximation to the SCAD penalty function $p_\lambda(\cdot)$ [3].

Dicker et al. [6] proposed BIC-like procedures implemented by minimizing

$$\mathrm{BIC}_0 = \log\left(\frac{\|y - X\hat{\beta}\|^2}{n - \widehat{\mathrm{df}}_0}\right) + \frac{\log(n)}{n}\,\widehat{\mathrm{df}}_0, \quad (15)$$

where $\widehat{\mathrm{df}}_0$ is the number of nonzero estimated coefficients. To estimate the residual variance, they use $\hat{\sigma}^2 = (n - \widehat{\mathrm{df}}_0)^{-1}\|y - X\hat{\beta}\|^2$. This differs from other estimates of the residual variance used in PLS methods, where the denominator $n - \widehat{\mathrm{df}}_0$ is replaced by $n$ [22]; here $n - \widehat{\mathrm{df}}_0$ is used to account for degrees of freedom lost to estimation.

More work on high-dimensional BIC for least squares regression and on tuning parameter selection for nonconvex penalized regression can be found in [24, 25, 39]. Here we use the BIC statistic as in (15).
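As an illustration of the selection rule (15), the following sketch searches a grid of $\lambda$ values; `select_lambda_bic` and the grid are ours, and the sketch reuses the `atan_irl` function shown after Algorithm 1.

```python
import numpy as np

def select_lambda_bic(X, y, lambdas, gamma=0.005):
    """Pick lambda minimizing the BIC-type criterion (15); returns (bic, lambda, beta)."""
    n = len(y)
    best = (np.inf, None, None)
    for lam in lambdas:
        beta = atan_irl(X, y, lam, gamma=gamma)   # IRL sketch from Section 4.1
        df0 = np.count_nonzero(beta)              # estimated degrees of freedom
        if df0 >= n:                              # guard against degenerate fits
            continue
        rss = np.sum((y - X @ beta) ** 2)
        bic = np.log(rss / (n - df0)) + np.log(n) / n * df0
        if bic < best[0]:
            best = (bic, lam, beta)
    return best
```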


4.3. A Standard Error Formula. The standard errors for the estimated parameters can be obtained directly, because we are estimating parameters and selecting variables at the same time. Let $\hat{\beta} = \hat{\beta}(\lambda, \gamma)$ be a local minimizer of the Atan objective. Following [3, 17], standard errors of $\hat{\beta}$ may be estimated by using quadratic approximations to the Atan penalty. Indeed, the approximation

$$p_{\lambda\gamma}(|\beta_j|) \approx p_{\lambda\gamma}(|\beta_{j0}|) + \frac{1}{2|\beta_{j0}|}\,p'_{\lambda\gamma}(|\beta_{j0}|)\,(\beta_j^2 - \beta_{j0}^2), \quad \text{for } \beta_j \approx \beta_{j0}, \quad (16)$$

suggests that the Atan problem may be replaced by the quadratic minimization problem

$$\min_\beta \left\{\frac{1}{n}\|y - X\beta\|^2 + \sum_{j=1}^{p}\frac{p'_{\lambda\gamma}(|\beta_{j0}|)}{|\beta_{j0}|}\,\beta_j^2\right\}, \quad (17)$$

at least for the purposes of obtaining standard errors. Using this expression, we obtain a sandwich formula for the estimated standard error of $\hat{\beta}_{\hat{A}}$, where $\hat{A} = \{j : \hat{\beta}_j \neq 0\}$. Consider

$$\widehat{\operatorname{cov}}(\hat{\beta}_{\hat{A}}) = \hat{\sigma}^2 \left\{X_{\hat{A}}^T X_{\hat{A}} + n\Delta_{\hat{A}}(\hat{\beta}_{\hat{A}})\right\}^{-1} X_{\hat{A}}^T X_{\hat{A}} \left\{X_{\hat{A}}^T X_{\hat{A}} + n\Delta_{\hat{A}}(\hat{\beta}_{\hat{A}})\right\}^{-1}, \quad (18)$$

where $\Delta(\beta) = \operatorname{diag}\{p'_{\lambda\gamma}(|\beta_1|)/|\beta_1|, \ldots, p'_{\lambda\gamma}(|\beta_p|)/|\beta_p|\}$, $\hat{\sigma}^2 = n^{-1}\|y - X\hat{\beta}\|^2$, and $\widehat{\mathrm{df}}_0 = |\hat{A}|$ is the number of elements in $\hat{A}$.

Under the conditions of Theorem 4,

$$\frac{B_n\,(X_{\hat{A}}^T X_{\hat{A}})\,\widehat{\operatorname{cov}}(\hat{\beta}_{\hat{A}})\,B_n^T}{\sigma^2} \longrightarrow G. \quad (19)$$

This is a consistency result for $\widehat{\operatorname{cov}}(\hat{\beta}_{\hat{A}})$.
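A small sketch of formula (18): the Atan derivative $p'_{\lambda\gamma}(t) = \lambda(\gamma + 2/\pi)\gamma/(\gamma^2 + t^2)$ enters through $\Delta$. The function name and the zero threshold `eps` are illustrative choices, not from the paper.

```python
import numpy as np

def atan_sandwich_cov(X, y, beta, lam, gamma=0.005, eps=1e-8):
    """Sandwich covariance (18) for the selected coefficients beta_A (sketch)."""
    n = X.shape[0]
    A = np.flatnonzero(np.abs(beta) > eps)          # selected set A-hat
    XA, bA = X[:, A], beta[A]
    sigma2 = np.sum((y - X @ beta) ** 2) / n        # sigma^2-hat = n^{-1}||y - X beta||^2
    # p'(|beta_j|) for the Atan penalty
    d_pen = lam * (gamma + 2.0 / np.pi) * gamma / (gamma ** 2 + bA ** 2)
    Delta = np.diag(d_pen / np.abs(bA))
    M = np.linalg.inv(XA.T @ XA + n * Delta)
    return sigma2 * M @ (XA.T @ XA) @ M             # cov-hat of beta_A
```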

5. Numerical Studies

5.1. Simulation Studies. We now investigate the sparsity recovery and estimation properties of the proposed estimator via numerical simulations. We compare the following estimators: the Lasso estimator (implemented using the R package glmnet [40]); the adaptive Lasso estimator (denoted by Alasso [15]); the SCAD estimator from the CD algorithm without calibration [31]; the MCP estimator from the CD algorithm with $a = 1.5$ [31]; the Dantzig selector [7]; and the Bayesian variable selection method [8]. For the proposed Atan estimator, we take $\gamma = 0.005$, and the BIC statistic (15) is used to select the tuning parameter $\lambda$. In the following, we report simulation results from four examples.

Example 8. In this example, simulation data are generated from the linear regression model

$$y = x^T \beta^* + \sigma\varepsilon, \quad (20)$$

where $\beta^* = (3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0)^T$, $\varepsilon \sim N(0, 1)$, and $x$ follows a multivariate normal distribution with zero mean and covariance between the $i$th and $j$th components equal to $\rho^{|i-j|}$, with $\rho = 0.5$. In our simulation, the sample size $n$ is set to 100 and 200, and $\sigma = 2$. For each case, we repeated the simulation 200 times.

For the linear model, the model error of $\hat{\mu} = x^T\hat{\beta}$ is $\mathrm{ME}(\hat{\mu}) = (\hat{\beta} - \beta)^T E(xx^T)(\hat{\beta} - \beta)$. Simulation results are summarized in Table 1, in which MRME stands for the median of the ratios of the ME of a selected model to that of the unpenalized least squares estimate under the full model. The columns "C" and "IC" are measures of model complexity: column "C" shows the average number of nonzero coefficients correctly estimated to be nonzero, and column "IC" presents the average number of zero coefficients incorrectly estimated to be nonzero. In the column labeled "Underfit," we present the proportion of the 200 replications that excluded any nonzero coefficients. Likewise, we report the probability of selecting the exact subset model and the probability of including all three significant variables plus some noise variables in the columns "Correct-fit" and "Overfit," respectively.
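The design of Example 8 is straightforward to reproduce; the following sketch generates one dataset from (20) and evaluates the model error. Function names are illustrative.

```python
import numpy as np

def simulate_example8(n=100, sigma=2.0, rho=0.5, seed=0):
    """One dataset from model (20) with the Example 8 design."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0, 0, 0, 0, 0])
    p = beta.size
    # E(xx^T) has (i, j) entry rho^{|i-j|}
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta, Sigma

def model_error(beta_hat, beta, Sigma):
    """ME(mu-hat) = (beta_hat - beta)' E(xx') (beta_hat - beta)."""
    d = beta_hat - beta
    return d @ Sigma @ d
```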

As can be seen from Table 1, all variable selection procedures dramatically reduce model error. Atan has the smallest model error among all competitors, followed by Alasso, MCP, SCAD, Bayesian, and Dantzig. In terms of sparsity, Atan also has the highest probability of correct fit; the Atan penalty thus performs better than the other penalties. Atan also has some advantages when the dimension $p$ is high, as can be seen in Example 9.

We now test the accuracy of our standard error formula (18). The median absolute deviation of the three nonzero coefficient estimates across the 200 simulations, divided by 0.6745 and denoted by SD in Table 2, can be regarded as the true standard error. The median of the 200 estimated standard errors, denoted by SDm, and the median absolute deviation of the 200 estimated standard errors divided by 0.6745, denoted by SDmad, gauge the overall performance of the standard error formula (18). Table 2 presents the results for the nonzero coefficients when the sample size is $n = 200$; the results for the other case, $n = 100$, are similar. Table 2 suggests that the sandwich formula performs surprisingly well.

Example 9. This example is from Wang et al. [23]. More specifically, we take $\beta = (11/4, -23/6, 37/12, -13/9, 1/3, 0, \ldots, 0)^T \in \mathbb{R}^p$ and $p = [4n^{1/4}] - 5$, where $[t]$ stands for the largest integer no larger than $t$. For this example, the predictor dimension $p$ is diverging, but the dimension of the true model is fixed to be 5. Results from the simulation study are found in Table 3; a similar conclusion as in Example 8 can be drawn.

Example 10. In the simulation study presented here, we examined the performance of the various PLS methods for $p$ substantially larger than in the previous studies. In particular, we took $p = 339$, $n = 500$, $\sigma^2 = 5$, and $\beta^* = (2I_{37}^T, -3I_{37}^T, I_{37}^T, 0_{228}^T)^T$, where $I_k \in \mathbb{R}^k$ is the vector with all entries equal to 1. Thus, $p_0 = 111$. We simulated 200 independent datasets $(y_1, x_1^T), \ldots, (y_n, x_n^T)$ in this study, and for each dataset we computed estimates of $\beta^*$. Results from this simulation study are found in Table 4.
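For concreteness, the coefficient vector of this example can be written in one line (an illustrative sketch):

```python
import numpy as np

# beta* = (2*1_37', -3*1_37', 1_37', 0_228')': p = 339 with p_0 = 111 nonzero entries
beta_star = np.concatenate([2 * np.ones(37), -3 * np.ones(37), np.ones(37), np.zeros(228)])
assert beta_star.size == 339 and np.count_nonzero(beta_star) == 111
```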


Table 1: Simulation results for linear regression models of Example 8.

n = 100, σ = 2
Method    MRME    C       IC      Underfit  Correct-fit  Overfit
Lasso     0.6213  3.0000  1.2050  0.0000    0.3700       0.6700
Alasso    0.3074  3.0000  0.3500  0.0000    0.7300       0.2700
SCAD      0.2715  3.0000  1.0650  0.0000    0.4400       0.5600
MCP       0.3041  3.0000  0.5650  0.0000    0.5800       0.4200
Dantzig   0.4623  3.0000  0.6546  0.0000    0.5700       0.4300
Bayesian  0.3548  3.0000  0.5732  0.0000    0.6300       0.3700
Atan      0.2550  3.0000  0.1750  0.0000    0.8450       0.1550

n = 200, σ = 2
Method    MRME    C       IC      Underfit  Correct-fit  Overfit
Lasso     0.6027  3.0000  1.0700  0.0000    0.3550       0.6450
Alasso    0.2781  3.0000  0.1600  0.0000    0.8650       0.1350
SCAD      0.2900  3.0000  0.8550  0.0000    0.5250       0.4750
MCP       0.2752  3.0000  0.3650  0.0000    0.6850       0.3150
Dantzig   0.3863  3.0000  0.8576  0.0000    0.4920       0.5080
Bayesian  0.2563  3.0000  0.4754  0.0000    0.7150       0.2850
Atan      0.2508  3.0000  0.1000  0.0000    0.9050       0.0950

Table 2: Standard deviations of estimators for the linear regression model (n = 200).

          β̂_1                      β̂_2                      β̂_5
Method    SD      SDm (SDmad)      SD      SDm (SDmad)      SD      SDm (SDmad)
Lasso     0.1753  0.1453 (0.0100)  0.1730  0.1688 (0.0094)  0.1591  0.1301 (0.0079)
Alasso    0.1483  0.1638 (0.0095)  0.1642  0.1636 (0.0098)  0.1475  0.1439 (0.0073)
SCAD      0.1797  0.1634 (0.0105)  0.1819  0.1608 (0.0104)  0.1398  0.1438 (0.0076)
MCP       0.1602  0.1643 (0.0096)  0.1861  0.1656 (0.0097)  0.1464  0.1435 (0.0069)
Dantzig   0.1734  0.1645 (0.0115)  0.1723  0.1665 (0.0094)  0.1581  0.1538 (0.0074)
Bayesian  0.1568  0.1635 (0.0089)  0.1678  0.1649 (0.0092)  0.1367  0.1375 (0.0078)
Atan      0.1510  0.1643 (0.0108)  0.1591  0.1658 (0.0096)  0.1609  0.1434 (0.0083)

Perhaps the most striking aspect of the results presented in Table 4 is that hardly any method ever selected the correct model in this simulation study. However, given that $p$, $p_0$, and $\beta^*$ are substantially larger in this study than in the previous simulation studies, this may not be too surprising. Notice that, on average, Atan selects the most parsimonious models of all the methods and has the smallest model error. Atan's nearest competitor in terms of model error is Alasso: this implementation of Alasso has MRME 0.2783, but its average selected model size, 103.5250, is larger than Atan's. Since $p_0 = 111$, it is clear that Atan underfits in some instances; in fact, all of the methods in this study underfit to some extent. This may be due to the fact that many of the nonzero entries in $\beta^*$ are small relative to the noise level $\sigma^2 = 5$.

Example 11. As an extension of this method, we consider the problem of simultaneous variable selection and estimation in the partially linear model

$$Y = X^T\beta + g(T) + \varepsilon, \quad (21)$$

where $Y$ is a scalar response variate, $X$ is a $p$-vector covariate, and $T$ is a scalar covariate taking values in a compact interval (for simplicity, we assume this interval to be $[0, 1]$). Here $\beta$ is a $p \times 1$ column vector of unknown regression parameters, the function $g(\cdot)$ is unknown, and the model error $\varepsilon$ is independent of $(X, T)$ with mean 0. Traditionally, it has generally been assumed that $\beta$ is finite dimensional; several standard approaches, such as the kernel method, the spline method [41], and local linear estimation [17], have been proposed.

In this study, we simulate $n = 100, 200$ points $T_i$, $i = 1, \ldots, n$, from the uniform distribution on $[0, 1]$. For each $i$, the $e_{ij}$'s are simulated to be normally distributed with autocorrelated variance structure $AR(\rho)$, such that

$$\operatorname{cov}(e_{ij}, e_{il}) = \rho^{|j-l|}, \quad 1 \le j, l \le 10. \quad (22)$$

The $X_{ij}$'s are then formed as follows:

$$X_{i1} = \sin(2T_i) + e_{i1},$$
$$X_{i2} = (0.5 + T_i)^{-2} + e_{i2},$$
$$X_{i3} = \exp(T_i) + e_{i3},$$
$$X_{i5} = (T_i - 0.7)^4 + e_{i5},$$
$$X_{i6} = T_i(1 + T_i^2)^{-1} + e_{i6},$$
$$X_{i7} = \sqrt{1 + T_i} + e_{i7},$$
$$X_{i8} = \log(3T_i + 8) + e_{i8}, \quad (23)$$

and the remaining coordinates, $X_{ij}$ for $j = 4, 9, 10$, are pure noise, as described below.


Table 3: Simulation results for linear regression models of Example 9.

n = 200, p = 10
Method    MRME    C       IC      Underfit  Correct-fit  Overfit
Lasso     0.9955  4.9900  2.0100  0.0100    0.1150       0.8750
Alasso    0.5740  4.8650  0.1100  0.1350    0.8150       0.0500
SCAD      0.5659  4.9800  0.5600  0.0200    0.5600       0.4200
MCP       0.6177  4.9100  0.1200  0.0900    0.8250       0.0850
Dantzig   0.6987  4.8900  0.6700  0.1100    0.4360       0.4540
Bayesian  0.5656  4.8650  0.2500  0.1350    0.6340       0.2310
Atan      0.5447  4.8900  0.1150  0.1100    0.8250       0.0650

n = 400, p = 12
Method    MRME    C       IC      Underfit  Correct-fit  Overfit
Lasso     1.2197  5.0000  2.0650  0.0000    0.1400       0.8600
Alasso    0.4458  4.9950  0.0900  0.0050    0.9250       0.0700
SCAD      0.4481  5.0000  0.4850  0.0000    0.6350       0.3650
MCP       0.4828  5.0000  0.1150  0.0000    0.8950       0.1050
Dantzig   0.7879  5.0000  0.5670  0.0000    0.3200       0.6800
Bayesian  0.4237  5.0000  0.1800  0.0000    0.7550       0.2450
Atan      0.4125  4.9950  0.0250  0.0050    0.9700       0.0250

n = 800, p = 16
Method    MRME    C       IC      Underfit  Correct-fit  Overfit
Lasso     1.2004  5.0000  2.5700  0.0000    0.0900       0.9100
Alasso    0.3156  5.0000  0.0700  0.0000    0.9300       0.0700
SCAD      0.3219  5.0000  0.6550  0.0000    0.5950       0.4050
MCP       0.3220  5.0000  0.0750  0.0000    0.9300       0.0700
Dantzig   0.5791  5.0000  0.5470  0.0000    0.3400       0.6600
Bayesian  0.3275  5.0000  0.2800  0.0000    0.6750       0.3250
Atan      0.3239  5.0000  0.0750  0.0000    0.9350       0.0650

Table 4: Simulation results for linear regression models of Example 10.

n = 500, σ² = 5
Method    MRME    C         IC       Underfit  Correct-fit  Overfit
Lasso     0.3492  110.2000  24.7450  0.5650    0.0000       0.4350
Alasso    0.2783  103.5250  6.1250   1.0000    0.0000       0.0000
SCAD      0.3012  106.0400  35.7150  0.9900    0.0000       0.0150
MCP       0.2791  103.3600  8.1100   1.0000    0.0000       0.0900
Dantzig   0.3120  108.3400  18.4650  0.7570    0.0000       0.2430
Bayesian  0.2876  104.4350  7.4700   0.9800    0.0000       0.0200
Atan      0.2794  101.4000  4.0600   1.0000    0.0000       0.0000


We investigate the scenario $p = 10$ with $X_{ij} = e_{ij}$ for $j = 4, 9, 10$. In this scenario, we have $\beta_j = j$ for $1 \le j \le 4$ and $\beta_j = 0$ otherwise, and $\varepsilon \sim N(0, 1)$. For each case, we repeated the simulation 200 times. We investigate the function $g(t) = \cos(2\pi t)$. For comparison, we apply the different penalties to the parametric component, with a B-spline basis for the nonparametric component, as sketched below.
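One way to carry out this "penalty on the parametric part, B-spline on the nonparametric part" fit is to project the spline basis out of $X$ and $y$ and then run the penalized fit on the residuals. The following sketch uses scipy's BSpline; the basis size and degree are illustrative choices of ours, not the paper's.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(t, n_interior=4, degree=3):
    """Design matrix of a clamped cubic B-spline basis on [0, 1]."""
    knots = np.concatenate([np.zeros(degree + 1),
                            np.linspace(0.0, 1.0, n_interior + 2)[1:-1],
                            np.ones(degree + 1)])
    n_basis = len(knots) - degree - 1
    B = np.empty((len(t), n_basis))
    for j in range(n_basis):
        coef = np.zeros(n_basis)
        coef[j] = 1.0                      # j-th basis function
        B[:, j] = BSpline(knots, coef, degree)(t)
    return B

def profile_out(B, M):
    """Return (I - P_B) M, the part of M orthogonal to the spline space."""
    return M - B @ np.linalg.lstsq(B, M, rcond=None)[0]

# Usage sketch: fit the Atan-penalized regression of profile_out(B, y) on
# profile_out(B, X); then recover g by smoothing y - X @ beta_hat against T.
```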

The results are summarized in Table 5. Columns 2-5 in Table 5 are the averages of the estimates of $\beta_j$, $j = 1, \ldots, 4$, respectively. Column 6 is the average number of the estimates of $\beta_j$, $5 \le j \le p$, that are 0 over the 200 simulations, and their medians are given in column 7.


Table 5: Example 11: comparison of estimators.

n = 100, p = 10
Estimator  β̂_1     β̂_2     β̂_3     β̂_4     C (mean)  C (median)  MME (SD)         RASE(ĝ(T))
SCAD       0.9617  2.0128  2.8839  4.0611  4.7450    5.0000      0.0695 (0.0625)  0.6269
Lasso      0.8278  1.9517  2.7984  3.9274  5.5200    6.0000      0.1727 (0.0790)  0.8079
Dantzig    0.8767  1.9544  2.7894  3.9302  5.4590    6.0000      0.1696 (0.0680)  0.7278
Atan       0.9710  2.0121  2.9865  4.0472  4.9820    5.0000      0.0658 (0.0630)  0.6214

n = 200, p = 10
Estimator  β̂_1     β̂_2     β̂_3     β̂_4     C (mean)  C (median)  MME (SD)         RASE(ĝ(T))
SCAD       0.9836  1.9989  2.9283  4.0278  5.2000    5.0000      0.0230 (0.0210)  0.3604
Lasso      0.9219  1.9529  2.9124  3.9580  5.1450    5.0000      0.0500 (0.0315)  0.4220
Dantzig    0.9534  1.9675  2.9096  3.9345  5.1460    5.0000      0.0510 (0.0290)  0.4154
Atan       0.9904  1.9946  2.9548  4.0189  5.1250    5.0000      0.0190 (0.0245)  0.3460

Model errors are computed as $\mathrm{ME}(\hat{\mu}) = (\hat{\beta} - \beta)^T E(xx^T)(\hat{\beta} - \beta)$; their medians are listed in the 8th column, followed by the standard deviations of the model errors in parentheses.

The performance of $\hat{g}(T)$ is assessed by the square root of average squared errors (RASE),

$$\mathrm{RASE}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{g}(T_i) - g(T_i)\right)^2, \quad (24)$$

where $T_i$, $i = 1, \ldots, n$, are the observed data points at which the function $g$ is estimated. Column 9 summarizes the RASE values for the different situations.
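Evaluating (24) is a one-liner; in the sketch below, `g_hat` and `g` are assumed to be callables.

```python
import numpy as np

def rase(g_hat, g, T):
    """Square root of the average squared error of g_hat at the observed T_i."""
    return np.sqrt(np.mean((g_hat(T) - g(T)) ** 2))
```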

We can see the following from Table 5. (1) Compared with the Lasso and Dantzig estimators, the SCAD and Atan estimators have smaller model errors; this is due to the fact that the SCAD and Atan estimators are nearly unbiased, while the Lasso and Dantzig estimators are biased, especially for the larger coefficients. Moreover, the Atan and SCAD estimators are more stable. (2) Each method is able to select the important variables, but the Atan estimator exhibits slightly stronger sparsity. (3) For the nonparametric component, the Atan and SCAD estimators have smaller RASE values.

5.2. Real Data Analysis. In this section, we apply the Atan regularization scheme to a prostate cancer example. The dataset in this example is derived from a study of prostate cancer [42]. It consists of the medical records of 97 patients who were about to receive a radical prostatectomy. The predictors are eight clinical measures: log(cancer volume) (lcavol), log(prostate weight) (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason), and percentage Gleason score 4 or 5 (pgg45). The response is the logarithm of prostate-specific antigen (lpsa). One of the main aims here is to identify which predictors are more important in predicting the response.

The Lasso, Alasso, SCAD, MCP, and Atan are all applied to the data. We also compute the OLS estimate for the prostate cancer data. Results are summarized in Table 6. The OLS estimator does not perform variable selection. Lasso selects five variables in the final model. SCAD selects lcavol, lweight,

Table 6: Prostate cancer data: comparing different methods.

Method  R²      R²/R²_OLS  Variables selected
OLS     0.6615  1.0000     All
Lasso   0.5867  0.8870     (1, 2, 4, 5, 8)
Alasso  0.5991  0.9058     (1, 2, 5)
SCAD    0.6140  0.9283     (1, 2, 4, 5)
MCP     0.5999  0.9069     (1, 2, 5)
Atan    0.6057  0.9158     (1, 2, 5)

lbph, and svi in the final model, while Alasso, MCP, and Atan select lcavol, lweight, and svi. Thus, Atan selects a substantially simpler model than Lasso and SCAD. Furthermore, as indicated by the columns labeled $R^2$ ($R^2$ is equal to one minus the residual sum of squares divided by the total sum of squares) and $R^2/R^2_{\mathrm{OLS}}$ in Table 6, the Atan estimator describes more variability in the data than Alasso and MCP, and nearly as much as the OLS estimator.
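The $R^2$ and $R^2/R^2_{\mathrm{OLS}}$ columns are simple to compute from any fitted coefficient vector; a small sketch (names are ours):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss

# Ratio reported in Table 6:
# r_squared(y, X @ beta_method) / r_squared(y, X @ beta_ols)
```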

6. Conclusion and Discussion

In this paper, a new Atan penalty, which very closely resembles the $L_0$ penalty, is proposed. First, we establish the theory of the Atan estimator under mild conditions. The theory indicates that the Atan-penalized least squares procedure enjoys oracle properties even when the number of variables grows slower than the number of observations. Second, the iteratively reweighted Lasso algorithm makes the proposed estimator fast to implement. Third, we suggest a BIC-like tuning parameter selector to identify the true model consistently. Numerical studies further endorse our theoretical results and the advantage of the Atan estimator for model selection.

We do not address the situation where $p \gg n$ in this paper; in fact, the proposed Atan method can easily be extended to variable selection with $p \gg n$. Also, as shown in Example 11, the Atan method can be applied to semiparametric and nonparametric models [43, 44]. Furthermore, a recent field of application of variable selection is the search for impact points in functional data analysis [45, 46]; the possible combination of the Atan estimator with the questions in [45, 46] would be interesting. These problems are beyond the scope of this paper and will be interesting topics for future research.

Appendix

Proof of Theorem 2. Let $\alpha_n = \sqrt{p\sigma^2/n}$ and fix $r \in (0, 1)$. To prove the theorem, it suffices to show that if $C > 0$ is large enough, then

$$Q_n(\beta^*) < \inf_{\|\mu\| = C} Q_n(\beta^* + \alpha_n\mu) \quad (A.1)$$

holds for all $n$ sufficiently large, with probability at least $1 - r$. Define $D_n(\mu) = Q_n(\beta^* + \alpha_n\mu) - Q_n(\beta^*)$ and note that

$$D_n(\mu) = \frac{1}{2n}\left(\alpha_n^2\|X\mu\|^2 - 2\alpha_n\varepsilon^T X\mu\right) + \sum_{j=1}^{p}\left[p_{\lambda\gamma}(|\beta_j^* + \alpha_n\mu_j|) - p_{\lambda\gamma}(|\beta_j^*|)\right]. \quad (A.2)$$

The fact that $p_{\lambda\gamma}$ is concave on $[0, \infty)$ implies that

$$p_{\lambda\gamma}(|\beta_j^* + \alpha_n\mu_j|) - p_{\lambda\gamma}(|\beta_j^*|) \ge p'_{\lambda\gamma}(|\beta_j^* + \alpha_n\mu_j|)\left(|\beta_j^* + \alpha_n\mu_j| - |\beta_j^*|\right) \ge p'_{\lambda\gamma}(|\beta_j^* + \alpha_n\mu_j|)\left(-\alpha_n|\mu_j|\right) = -\lambda\alpha_n|\mu_j|\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta_j^* + \alpha_n\mu_j)^2} \quad (A.3)$$

when $n$ is sufficiently large. Condition (B) implies that

$$\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta_j^* + \alpha_n\mu_j)^2} \le \frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}. \quad (A.4)$$

Thus, for $n$ big enough,

$$D_n(\mu) \ge \frac{1}{2n}\left(\alpha_n^2\|X\mu\|^2 - 2\alpha_n\varepsilon^T X\mu\right) - Cp\lambda\alpha_n\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}. \quad (A.5)$$

By (D),

$$\frac{1}{2n}\,\alpha_n^2\|X\mu\|^2 \ge \frac{\lambda_{\min}}{2}\,C^2\alpha_n^2. \quad (A.6)$$

On the other hand, (D) implies

$$\frac{1}{n}\,\alpha_n|\varepsilon^T X\mu| \le \frac{C\alpha_n}{\sqrt{n}}\left\|\frac{1}{\sqrt{n}}X^T\varepsilon\right\| = O_P(C\alpha_n^2). \quad (A.7)$$

Furthermore, (C) and (B) imply

$$Cp\lambda\alpha_n\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2} = o(C\alpha_n^2). \quad (A.8)$$

From (A.5)-(A.8), we conclude that if $C > 0$ is large enough, then $\inf_{\|\mu\| = C} D_n(\mu) > 0$ holds for all $n$ sufficiently large, with probability at least $1 - r$. This proves Theorem 2.

Proof of Lemma 3. Suppose that $\beta \in \mathbb{R}^p$ and that $\|\beta - \beta^*\| \le C\sqrt{p\sigma^2/n}$. Define $\tilde{\beta} \in \mathbb{R}^p$ by $\tilde{\beta}_{A^c} = 0$ and $\tilde{\beta}_A = \beta_A$. Similar to the proof of Theorem 2, let

$$D_n(\beta, \tilde{\beta}) = Q_n(\beta) - Q_n(\tilde{\beta}), \quad (A.9)$$

where $Q_n(\beta)$ is defined in (8). Then

$$D_n(\beta, \tilde{\beta}) = \frac{1}{2n}\|y - X\beta\|^2 - \frac{1}{2n}\|y - X\tilde{\beta}\|^2 + \sum_{j \in A^c} p_{\lambda\gamma}(|\beta_j|)$$
$$= \frac{1}{2n}\|y - X\tilde{\beta} - X(\beta - \tilde{\beta})\|^2 - \frac{1}{2n}\|y - X\tilde{\beta}\|^2 + \sum_{j \in A^c} p_{\lambda\gamma}(|\beta_j|)$$
$$= \frac{1}{2n}(\beta - \tilde{\beta})^T X^T X(\beta - \tilde{\beta}) - \frac{1}{n}(\beta - \tilde{\beta})^T X^T(y - X\tilde{\beta}) + \sum_{j \in A^c} p_{\lambda\gamma}(|\beta_j|)$$
$$= O_P\left(\|\beta - \tilde{\beta}\|\sqrt{\frac{p\sigma^2}{n}}\right) + \sum_{j \in A^c} p_{\lambda\gamma}(|\beta_j|). \quad (A.10)$$

On the other hand, since the Atan penalty is concave on $[0, \infty)$,

$$p_{\lambda\gamma}(|\beta_j|) \ge \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right)|\beta_j| \quad (A.11)$$

for $j \in A^c$. Thus,

$$\sum_{j \in A^c} p_{\lambda\gamma}(|\beta_j|) \ge \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right)\|\beta - \tilde{\beta}\|. \quad (A.12)$$

By (C),

$$\liminf_{n \to \infty} \arctan\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right) > 0. \quad (A.13)$$

It follows from (A.10)-(A.12) that there is a constant $K > 0$ such that

$$\frac{D_n(\beta, \tilde{\beta})}{\|\beta - \tilde{\beta}\|} \ge K\lambda + O_P\left(\sqrt{\frac{p\sigma^2}{n}}\right). \quad (A.14)$$

Since $\lambda/\sqrt{p\sigma^2/n} \to \infty$, the result follows.


Proof of Theorem 4. Taken together, Theorem 2 and Lemma 3 imply that there exists a sequence of local minima $\hat{\beta}$ of (8) such that $\|\hat{\beta} - \beta^*\| = O_P(\sqrt{p\sigma^2/n})$ and $\hat{\beta}_{A^c} = 0$. Part (i) of the theorem follows immediately.

To prove part (ii), observe that on the event $\{j : \hat{\beta}_j \neq 0\} = A$, we must have

$$\hat{\beta}_A = \beta_A^* + (X_A^T X_A)^{-1} X_A^T\varepsilon - (n^{-1} X_A^T X_A)^{-1}\hat{p}'_A, \quad (A.15)$$

where $\hat{p}'_A = (p'_{\lambda\gamma}(\hat{\beta}_j))_{j \in A}$. It follows that

$$\sqrt{n}\,B_n\left(\frac{n^{-1} X_A^T X_A}{\sigma^2}\right)^{1/2}(\hat{\beta}_A - \beta_A^*) = B_n(\sigma^2 X_A^T X_A)^{-1/2} X_A^T\varepsilon - nB_n(\sigma^2 X_A^T X_A)^{-1/2}\hat{p}'_A \quad (A.16)$$

whenever $\{j : \hat{\beta}_j \neq 0\} = A$. Now note that conditions (B)-(D) imply

$$\left\|nB_n(\sigma^2 X_A^T X_A)^{-1/2}\hat{p}'_A\right\| = O_P\left(\sqrt{\frac{np}{\sigma^2}}\,\frac{\lambda\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}\right) = o_P(1), \quad (A.17)$$

and thus

$$\sqrt{n}\,B_n\left(\frac{n^{-1} X_A^T X_A}{\sigma^2}\right)^{1/2}(\hat{\beta}_A - \beta_A^*) = B_n(\sigma^2 X_A^T X_A)^{-1/2} X_A^T\varepsilon + o_P(1). \quad (A.18)$$

To complete the proof of (ii), we use the Lindeberg-Feller central limit theorem to show that

$$B_n(\sigma^2 X_A^T X_A)^{-1/2} X_A^T\varepsilon \longrightarrow N(0, G) \quad (A.19)$$

in distribution. Observe that

$$B_n(\sigma^2 X_A^T X_A)^{-1/2} X_A^T\varepsilon = \sum_{i=1}^{n}\omega_{in}, \quad (A.20)$$

where $\omega_{in} = B_n(\sigma^2 X_A^T X_A)^{-1/2} x_{iA}\varepsilon_i$.

Fix $\delta_0 > 0$ and let $\eta_{in} = x_{iA}^T(X_A^T X_A)^{-1/2} B_n^T B_n (X_A^T X_A)^{-1/2} x_{iA}$. Then

$$E\left[\|\omega_{in}\|^2;\ \|\omega_{in}\|^2 > \delta_0\right] = \eta_{in} E\left[\frac{\varepsilon_i^2}{\sigma^2};\ \frac{\eta_{in}\varepsilon_i^2}{\sigma^2} > \delta_0\right] \le \eta_{in}\left[E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)\right]^{2/(2+\delta)}\left[P\left(\frac{\eta_{in}\varepsilon_i^2}{\sigma^2} > \delta_0\right)\right]^{\delta/(2+\delta)} \le \eta_{in}^{1+\delta/(2+\delta)}\,\delta_0^{-\delta/(2+\delta)}\left[E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)\right]^{2/(2+\delta)}. \quad (A.21)$$

Since $\sum_{i=1}^{n}\eta_{in} = \operatorname{tr}(B_n^T B_n) \to \operatorname{tr}(G) < \infty$ and since (E) implies

$$\max_{1 \le i \le n}\eta_{in} \le \lambda_{\min}(n^{-1}X^T X)^{-1}\,\lambda_{\max}(B_n^T B_n)\,\max_{1 \le i \le n}\frac{1}{n}\sum_{j=1}^{p} x_{ij}^2 \longrightarrow 0, \quad (A.22)$$

we must have

$$\sum_{i=1}^{n} E\left[\|\omega_{in}\|^2;\ \|\omega_{in}\|^2 > \delta_0\right] \le \delta_0^{-\delta/(2+\delta)}\left[E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)\right]^{2/(2+\delta)}\sum_{i=1}^{n}\eta_{in}^{1+\delta/(2+\delta)} \le \delta_0^{-\delta/(2+\delta)}\left[E\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)\right]^{2/(2+\delta)}\operatorname{tr}(B_n^T B_n)\max_{1 \le i \le n}\eta_{in}^{\delta/(2+\delta)} \longrightarrow 0. \quad (A.23)$$

Thus the Lindeberg condition is satisfied and (A.19) holds.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Tian Yuan Special Funds of the National Natural Science Foundation of China (Grant no. 11426192) and the Project of Education of Zhejiang Province (no. Y201533324).

References

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267-288, 1996.

[2] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109-148, 1993.

[3] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348-1360, 2001.

[4] C.-H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," The Annals of Statistics, vol. 38, no. 2, pp. 894-942, 2010.

[5] J. Lv and Y. Fan, "A unified approach to model selection and sparse recovery using regularized least squares," The Annals of Statistics, vol. 37, no. 6, pp. 3498-3528, 2009.

[6] L. Dicker, B. Huang, and X. Lin, "Variable selection and estimation with the seamless-L0 penalty," Statistica Sinica, vol. 23, no. 2, pp. 929-962, 2013.

[7] E. Candes and T. Tao, "The Dantzig selector: statistical estimation when p is much larger than n," The Annals of Statistics, vol. 35, no. 6, pp. 2313-2351, 2007.

[8] J. Ghosh and A. E. Ghattas, "Bayesian variable selection under collinearity," The American Statistician, vol. 69, no. 3, pp. 165-173, 2015.

[9] J. Fan, L. Z. Xue, and H. Zou, "Strong oracle optimality of folded concave penalized estimation," The Annals of Statistics, vol. 42, no. 3, pp. 819-849, 2014.

[10] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267-281, Akademiai Kiado, Budapest, Hungary, 1973.

[11] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.

[12] D. P. Foster and E. I. George, "The risk inflation criterion for multiple regression," The Annals of Statistics, vol. 22, no. 4, pp. 1947-1975, 1994.

[13] L. Breiman, "Heuristics of instability and stabilization in model selection," The Annals of Statistics, vol. 24, no. 6, pp. 2350-2383, 1996.

[14] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.

[15] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418-1429, 2006.

[16] P. Zhao and B. Yu, "On model selection consistency of lasso," Journal of Machine Learning Research, vol. 7, pp. 2541-2563, 2006.

[17] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, pp. 928-961, 2004.

[18] Y. Kim, H. Choi, and H.-S. Oh, "Smoothly clipped absolute deviation on high dimensions," Journal of the American Statistical Association, vol. 103, no. 484, pp. 1665-1673, 2008.

[19] Y. Kim and S. Kwon, "Global optimality of nonconvex penalized estimators," Biometrika, vol. 99, no. 2, pp. 315-325, 2012.

[20] C.-H. Zhang and T. Zhang, "A general theory of concave regularization for high-dimensional sparse estimation problems," Statistical Science, vol. 27, no. 4, pp. 576-593, 2012.

[21] L. Wang, Y. Kim, and R. Li, "Calibrating nonconvex penalized regression in ultra-high dimension," The Annals of Statistics, vol. 41, no. 5, pp. 2505-2536, 2013.

[22] H. Wang, R. Li, and C.-L. Tsai, "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, vol. 94, no. 3, pp. 553-568, 2007.

[23] H. Wang, B. Li, and C. Leng, "Shrinkage tuning parameter selection with a diverging number of parameters," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 71, no. 3, pp. 671-683, 2009.

[24] J. Chen and Z. Chen, "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, vol. 95, no. 3, pp. 759-771, 2008.

[25] Y. Kim, S. Kwon, and H. Choi, "Consistent model selection criteria on high dimensions," Journal of Machine Learning Research, vol. 13, pp. 1037-1057, 2012.

[26] L. Breiman, "Better subset regression using the non-negative garrote," Technometrics, vol. 37, no. 4, pp. 373-384, 1995.

[27] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 67, no. 2, pp. 301-320, 2005.

[28] H. Zou and H. H. Zhang, "On the adaptive elastic-net with a diverging number of parameters," The Annals of Statistics, vol. 37, no. 4, pp. 1733-1751, 2009.

[29] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," The Annals of Statistics, vol. 36, no. 4, pp. 1509-1533, 2008.

[30] D. R. Hunter and R. Li, "Variable selection using MM algorithms," The Annals of Statistics, vol. 33, no. 4, pp. 1617-1642, 2005.

[31] P. Breheny and J. Huang, "Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection," The Annals of Applied Statistics, vol. 5, no. 1, pp. 232-253, 2011.

[32] J. Fan and J. Lv, "Nonconcave penalized likelihood with NP-dimensionality," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5467-5484, 2011.

[33] T. Zhang, "Multi-stage convex relaxation for feature selection," Bernoulli, vol. 19, no. 5, pp. 2277-2293, 2013.

[34] J. H. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302-332, 2007.

[35] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407-499, 2004.

[36] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422, pp. 486-494, 1993.

[37] H. Wang and C. Leng, "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, vol. 102, no. 479, pp. 1039-1048, 2007.

[38] H. Zou, T. Hastie, and R. Tibshirani, "On the 'degrees of freedom' of the lasso," The Annals of Statistics, vol. 35, pp. 2173-2192, 2007.

[39] E. R. Lee, H. Noh, and B. U. Park, "Model selection via Bayesian information criterion for quantile regression models," Journal of the American Statistical Association, vol. 109, no. 505, pp. 216-229, 2014.

[40] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, pp. 1-22, 2010.

[41] N. E. Heckman, "Spline smoothing in a partly linear model," Journal of the Royal Statistical Society, Series B: Methodological, vol. 48, no. 2, pp. 244-248, 1986.

[42] T. A. Stamey, J. N. Kabalin, J. E. McNeal, et al., "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients," The Journal of Urology, vol. 141, no. 5, pp. 1076-1083, 1989.

[43] J. Huang, J. L. Horowitz, and F. Wei, "Variable selection in nonparametric additive models," The Annals of Statistics, vol. 38, no. 4, pp. 2282-2313, 2010.

[44] P. Radchenko, "High dimensional single index models," Journal of Multivariate Analysis, vol. 139, pp. 266-282, 2015.

[45] G. Aneiros and P. Vieu, "Variable selection in infinite-dimensional problems," Statistics & Probability Letters, vol. 94, pp. 12-20, 2014.

[46] G. Aneiros and P. Vieu, "Partial linear modelling with multi-functional covariates," Computational Statistics, vol. 30, no. 3, pp. 647-671, 2015.


Page 6: Research Article Variable Selection and Parameter Estimation … · 2020. 1. 12. · garrotte [], elastic-net [ , ], SCAD [], and MCP []. In particular, 1 penalized least squares

6 Journal of Probability and Statistics

43 A Standard Error Formula The standard errors for theestimated parameters can be obtained directly because weare estimating parameters and selecting variables at the sametime Let = (120582 120574) be a local minimizer of Atan Following[3 17] standard errors of may be estimated by using quad-ratic approximations to Atan Indeed the approximation

119901120582119886

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816) asymp 119901120582120574

(100381610038161003816100381610038161205731198950

10038161003816100381610038161003816)

+1

2100381610038161003816100381610038161205731198950

10038161003816100381610038161003816

1199011015840

120582120574(100381610038161003816100381610038161205731198950

10038161003816100381610038161003816) (1205732

119895minus 1205732

1198950)

for 120573119895asymp 1205731198950

(16)

suggests that Atan may be replaced by the quadratic mini-mization problem

min

1

119899

1003817100381710038171003817y minus X12057310038171003817100381710038172+

119901

sum

119895=1

1199011015840

120582119886(100381610038161003816100381610038161205731198950

10038161003816100381610038161003816)

100381610038161003816100381610038161205731198950

10038161003816100381610038161003816

1205732

119895

(17)

at least for the purposes of obtaining standard errors Usingthis expression we obtain a sandwich formula for the esti-mated standard error of

where = 119895

119895= 0 Consider

cov () = 2X119879X+ 119899Δ

()minus1

sdot X119879XX119879X+ 119899Δ

()minus1

(18)

where Δ(120573) = diag1199011015840120582119886(|1205731|)|1205731| 119901

1015840

120582119886(|120573119901|)|120573119901| 2 =

119899minus1y minus X2 and

0= || is the number of elements in ||

Under the conditions of Theorem 4

119861119899119883119879

119883cov (

) 119861119879

119899

1205902997888rarr 119866 (19)

This is a consistency result for cov()

5 Numerical Studies

51 Simulation Studies We now investigate the sparsityrecovery and estimation properties of the proposed estimatorvia numerical simulations We compare the following esti-mators the Lasso estimator (implemented using R packageglmnet [40]) the adaptive Lasso estimator (denoted byAlasso [15]) the SCAD estimator from the CD algorithmwithout calibration [31] the MCP estimator from the CDalgorithm with 119886 = 15 [31] the Dantzig selector [7] andBayesian variable select method [8] For the proposed Atanestimator we take 120574 = 0005 and BIC statistic (15) is usedto select the tuning parameter 120582 In the following we reportsimulation results from four examples

Example 8 In this example simulation data are generatedfrom the linear regression model

119910 = x119879120573lowast + 120590120576 (20)

where 120573lowast = (3 15 0 0 2 0 0 0 0 0 0 0)119879 120576 sim 119873(0 1) and

x is multivariate normal distribution with zero mean and

covariance between the 119894th and 119895th elements being 120588|119894minus119895| with120588 = 05 In our simulation sample size 119899 is set to be 100 and200 120590 = 2 For each case we repeated the simulation 200times

For linear model model error for = x119879 is ME() =

( minus 120573)119879119864(xx119879)( minus 120573) Simulation results are summarized

in Table 1 in which MRME stands for median of ratios ofME of a selected model to that of the unpenalized minimumsquare estimate under the full model Both the columns ofldquoCrdquo and ldquoICrdquo are measures of model complexity Column ldquoCrdquoshows the average number of nonzero coefficients correctlyestimated to be nonzero and column ldquoICrdquo presents theaverage number of zero coefficients incorrectly estimated tobe nonzero In the column labeled ldquounderfitrdquo we presentthe proportion of excluding any nonzero coefficients in 200replications Likewise we report the probability of selectingthe exact subset model and the probability of including allthree significant variables and some noise variables in thecolumns ldquocorrect-fitrdquo and ldquooverfitrdquo respectively

As can be seen from Table 1 all variable selection proce-dures dramatically reduce model error Atan has the smallestmodel error among all competitors followed by AlassoMCPSCAD Bayesian and Dantzig In terms of sparsity Atan alsohas the highest probability of correct-fit The Atan penaltyperforms better than the other penalties Also Atan has someadvantages when dimensional 119901 is high which can be seen inExample 9

We now test the accuracy of our standard error formula(18) The median absolute deviation divided by 06745denoted by SD in Table 2 of 3 estimated coefficients in the200 simulations can be regarded as the true standard errorThe median of the 200 estimated SDs denoted by SDm andthe median absolute deviation error of the 200 estimatedstandard errors divided by 06745 denoted by SDmad gaugethe overall performance of standard error formula (18)Table 2 presents the results for nonzero coefficients whensample size 119899 = 200 The results for the other case with 119899 =

100 are similar Table 2 suggests that the sandwich formulaperforms surprisingly well

Example 9 The example is from Wang et al [23] Morespecifically we take 120573 = (114 minus236 3712 minus139 13 0 0 0)

119879isin R119901 and 119901 = [4119899

14] minus 5 and [119905] stands for the

largest integer no larger than 119905 For this example predictordimension 119901 is diverging but the dimension of the truemodel is fixed to be 5 Results from the simulation study arefound in Table 3 A similar conclusion as in Example 8 can befound

Example 10 In this simulation study presented here weexamined the performance of the various PLS methods for 119901substantially larger than in the previous studies In particularwe took 119901 = 339 119899 = 500 1205905 = 5 and 120573

lowast= (2119868

119879

37 minus3119868119879

37

119868119879

37 0119879

228) where 119868119896 isin 119877

119896 is the vector with all entries equalto 1 Thus 119901

0= 111 We simulated 200 independent datasets

(1199101 119909119879

1) (119910

119899 119909119879

119899) in this study and for each dataset we

computed estimates of 120573lowast Results from this simulation studyare found in Table 4

Journal of Probability and Statistics 7

Table 1 Simulation results for linear regression models of Example 8

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 100 120590 = 2

Lasso 06213 30000 12050 00000 03700 06700Alasso 03074 30000 03500 00000 07300 02700SCAD 02715 30000 10650 00000 04400 05600MCP 03041 30000 05650 00000 05800 04200Dantzig 04623 30000 06546 00000 05700 04300Bayesian 03548 30000 05732 00000 06300 03700Atan 02550 30000 01750 00000 08450 01550

119899 = 200 120590 = 2

Lasso 06027 30000 10700 00000 03550 06450Alasso 02781 30000 01600 00000 08650 01350SCAD 02900 30000 08550 00000 05250 04750MCP 02752 30000 03650 00000 06850 03150Dantzig 03863 30000 08576 00000 04920 05080Bayesian 02563 30000 04754 00000 07150 02850Atan 02508 30000 01000 00000 09050 00950

Table 2 Standard deviations of estimators for the linear regression model (119899 = 200)

Method 1

2

5

SD SDm (SDmad) SD SDm (SDmad) SD SDm (SDmad)Lasso 01753 01453 (00100) 01730 01688 (00094) 01591 01301 (00079)Alasso 01483 01638 (00095) 01642 01636 (00098) 01475 01439 (00073)SCAD 01797 01634 (00105) 01819 01608 (00104) 01398 01438 (00076)MCP 01602 01643 (00096) 01861 01656 (00097) 01464 01435 (00069)Dantzig 01734 01645 (00115) 01723 01665 (00094) 01581 01538 (00074)Bayesian 01568 01635 (00089) 01678 01649 (00092) 01367 01375 (00078)Atan 01510 01643 (00108) 01591 01658 (00096) 01609 01434 (00083)

Perhaps the most striking aspect of the results presentedin Table 4 is that hardly no method ever selected the correctmodel in this simulation study However given that 119901 119901

0

and 120573lowast are substantially larger in this study than in the

previous simulation studies this may not be too surprisingNotice that on average Atan selects the most parsimoniousmodels of all methods and has the smallest model errorAtanrsquos nearest competitor in terms of model error is AlassoThis implementation of Alasso has mean model error 02783but its average selected model size is 1035250 larger thanAtanrsquos Since 119901

0= 111 it is clear that Atan underfits in some

instances In fact all of the methods in this study underfitto some extent This may be due to the fact that many ofthe nonzero entries in 120573

lowast are small relative to the noise level1205902= 5

Example 11 As an extension of this method we consider theproblem of simultaneous variable selection and estimation inthe partially linear model

119884 = 1198831015840120573 + 119892 (119879) + 120576 (21)

where119884 is a scalar response variate119883 is a 119901-vector covariate119879 is a scalar covariate and takes values in a compact interval(for simplicity we assume this interval to be [0 1]) 120573 is 119901 times

1 column vector of unknown regression parameter function119892(sdot) is unknown and model error 120576 is independent of (119883 119879)withmean 0 Traditionally it has generally been assumed that120573 is finite dimension several standard approaches such as thekernel method the spline method [41] and the local linearestimation [17] have been proposed

In this study we simulate 119899 = 100 200 points 119879119894 119894 =

1 100(200) from the uniform distribution on [0 1] Foreach 119894 1198901015840

119894119895119904 are simulated to be normally distributed with

autocorrelated variance structure 119860119877(120588) such that

cov (119890119894119895 119890119894119897) = 120588|119894minus119895|

1 le 119894 119895 le 10 (22)

119883119894119895rsquos are then formed as follows

1198831198941= sin (2119879

119894) + 1198901198941

1198831198942= (05 + 119879

119894)minus2

+ 1198901198942

8 Journal of Probability and Statistics

Table 3 Simulation results for linear regression models of Example 9

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 200 119901 = 10

Lasso 09955 49900 20100 00100 01150 08750Alasso 05740 48650 01100 01350 08150 00500SCAD 05659 49800 05600 00200 05600 04200MCP 06177 49100 01200 00900 08250 00850Dantzig 06987 48900 06700 01100 04360 04540Bayesian 05656 48650 02500 01350 06340 02310Atan 05447 48900 01150 01100 08250 00650

119899 = 400 119901 = 12

Lasso 12197 50000 20650 00000 01400 08600Alasso 04458 49950 00900 00050 09250 00700SCAD 04481 50000 04850 00000 06350 03650MCP 04828 50000 01150 00000 08950 01050Dantzig 07879 50000 05670 00000 03200 06800Bayesian 04237 50000 01800 00000 07550 02450Atan 04125 49950 00250 00050 09700 00250

119899 = 800 119901 = 16

Lasso 12004 50000 25700 00000 00900 09100Alasso 03156 50000 00700 00000 09300 00700SCAD 03219 50000 06550 00000 05950 04050MCP 03220 50000 00750 00000 09300 00700Dantzig 05791 50000 05470 00000 03400 06600Bayesian 03275 50000 02800 00000 06750 03250Atan 03239 50000 00750 00000 09350 00650

Table 4 Simulation results for linear regression models of Example 10

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 500 120590 = 5

Lasso 03492 1102000 247450 05650 00000 04350Alasso 02783 1035250 61250 10000 00000 00000SCAD 03012 1060400 357150 09900 00000 00150MCP 02791 1033600 81100 10000 00000 00900Dantzig 03120 1083400 184650 07570 00000 02430Bayesian 02876 1044350 74700 09800 00000 00200Atan 02794 1014000 40600 10000 00000 00000

1198831198943= exp (119879

119894) + 1198901198943

1198831198945= (119879119894minus 07)

4+ 1198901198945

1198831198946= 119879119894(1 + 119879

2

119894)minus1

+ 1198901198946

1198831198945= radic1 + 119879

119894+ 1198901198947

1198831198948= log (3119879

119894+ 8) + 119890

1198948

(23)

We investigate the scenario 119901 = 10 119883119894119895= 119890119894119895 119895 = 4 9 10

In the scenario we have 120573119895= 119895 1 le 119895 le 4 and 120573

119895= 0 with

others 120576 sim 119873(0 1) For each case we repeated the simulation200 times We investigate 119892(sdot) functions 119892(119905) = cos (2120587119905)For comparison we apply the different penalties in the para-metric component with the B-spline in the nonparametriccomponent

The results are summarized in Table 5. Columns 2-5 of Table 5 are the averages of the estimates of $\beta_j$, $j = 1, \ldots, 4$, respectively. Column 6 is the average number of estimates of $\beta_j$, $5 \le j \le p$, that are 0 over the 200 simulations, and their medians are given in column 7. Model errors are computed as $\mathrm{ME}(\hat{\beta}) = (\hat{\beta} - \beta)^T E(\mathbf{x}\mathbf{x}^T)(\hat{\beta} - \beta)$; their medians are listed in the 8th column, followed by the standard deviations of the model errors in parentheses.

Table 5: Example 11: comparison of estimators.

Estimator   β1       β2       β3       β4       C (mean)  C (median)  MME (SD)          RASE
n = 100, p = 10
SCAD        0.9617   2.0128   2.8839   4.0611   4.7450    5.0000      0.0695 (0.0625)   0.6269
Lasso       0.8278   1.9517   2.7984   3.9274   5.5200    6.0000      0.1727 (0.0790)   0.8079
Dantzig     0.8767   1.9544   2.7894   3.9302   5.4590    6.0000      0.1696 (0.0680)   0.7278
Atan        0.9710   2.0121   2.9865   4.0472   4.9820    5.0000      0.0658 (0.0630)   0.6214
n = 200, p = 10
SCAD        0.9836   1.9989   2.9283   4.0278   5.2000    5.0000      0.0230 (0.0210)   0.3604
Lasso       0.9219   1.9529   2.9124   3.9580   5.1450    5.0000      0.0500 (0.0315)   0.4220
Dantzig     0.9534   1.9675   2.9096   3.9345   5.1460    5.0000      0.0510 (0.0290)   0.4154
Atan        0.9904   1.9946   2.9548   4.0189   5.1250    5.0000      0.0190 (0.0245)   0.3460

The performance of $\hat{g}(T)$ is assessed by the square root of average squared errors (RASE),
$$\mathrm{RASE}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{g}(T_i) - g(T_i)\right)^2, \quad (24)$$
where $T_i$, $i = 1, \ldots, n$, are the observed data points at which the function $g$ is estimated. Column 9 summarizes the RASE values for the different situations. A sketch of both error measures follows.
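Both performance measures reduce to a few lines. The sketch below assumes that estimates `beta_hat` and `g_hat` (the fitted $\hat{g}(T_i)$ values) are available from the fit sketched above, and it approximates $E(\mathbf{x}\mathbf{x}^T)$ by a large Monte Carlo draw of the covariates, which is our own shortcut rather than the paper's recipe.

```python
import numpy as np

def model_error(beta_hat, beta_true, Exx):
    """ME(beta_hat) = (beta_hat - beta)' E(xx') (beta_hat - beta)."""
    d = beta_hat - beta_true
    return d @ Exx @ d

def rase(g_hat, g_true):
    """Square root of average squared errors for the nonparametric part, eq. (24)."""
    return np.sqrt(np.mean((g_hat - g_true) ** 2))

# E(xx') approximated from a large simulated draw of the covariates (a shortcut),
# reusing make_example11 from the sketch above:
_, X_big, _, _ = make_example11(n=100_000, seed=1)
Exx = X_big.T @ X_big / X_big.shape[0]
```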

We can see the following from Table 5. (1) Compared with the Lasso and Dantzig estimators, the SCAD and Atan estimators have smaller model errors; this is due to the fact that the SCAD and Atan estimators are nearly unbiased, while the Lasso and Dantzig estimators are biased, especially for the larger coefficients. Moreover, the Atan and SCAD estimators are more stable. (2) Each method is able to select the important variables, but the Atan estimator clearly yields slightly sparser models. (3) For the nonparametric component, the Atan and SCAD estimators have smaller RASE values.

5.2. Real Data Analysis. In this section, we apply the Atan regularization scheme to a prostate cancer example. The dataset in this example is derived from a study of prostate cancer in [42]. The dataset consists of the medical records of 97 patients who were about to receive a radical prostatectomy. The predictors are eight clinical measures: log(cancer volume) (lcavol), log(prostate weight) (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason), and percentage Gleason scores 4 or 5 (pgg45). The response is the logarithm of the prostate-specific antigen level (lpsa). One of the main aims here is to identify which predictors are more important in predicting the response.

The Lasso, Alasso, SCAD, MCP, and Atan are all applied to the data. We also compute the OLS estimate for the prostate cancer data. Results are summarized in Table 6. The OLS estimator does not perform variable selection. Lasso selects five variables in the final model; SCAD selects lcavol, lweight, lbph, and svi in the final model, while Alasso, MCP, and Atan select lcavol, lweight, and svi. Thus, Atan selects a substantially simpler model than Lasso and SCAD. Furthermore, as indicated by the columns labeled $R^2$ ($R^2$ is equal to one minus the residual sum of squares divided by the total sum of squares) and $R^2/R^2_{\mathrm{OLS}}$ in Table 6, the Atan estimator describes more variability in the data than Alasso and MCP, and nearly as much as the OLS estimator. (A brief sketch of this comparison appears below.)

Table 6: Prostate cancer data: comparing different methods.

Method   R²       R²/R²_OLS   Variables selected
OLS      0.6615   1.0000      All
Lasso    0.5867   0.8870      (1, 2, 4, 5, 8)
Alasso   0.5991   0.9058      (1, 2, 5)
SCAD     0.6140   0.9283      (1, 2, 4, 5)
MCP      0.5999   0.9069      (1, 2, 5)
Atan     0.6057   0.9158      (1, 2, 5)
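To reproduce the flavor of Table 6, the comparison can be scripted along the following lines. The file name `prostate.csv`, the column names, and the use of scikit-learn's cross-validated Lasso are our own illustrative assumptions; the paper's own tuning uses the BIC-type criterion, and the Atan fit would use the reweighted scheme sketched in Section 6.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LassoCV

# Assumed local copy of the Stamey et al. prostate data with the eight
# predictors and lpsa as response (file and column names are assumptions).
data = pd.read_csv("prostate.csv")
predictors = ["lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45"]
X, y = data[predictors].to_numpy(), data["lpsa"].to_numpy()

def r2(y, y_hat):
    """R^2 = 1 - RSS/TSS, as used in Table 6."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

ols = LinearRegression().fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)

r2_ols = r2(y, ols.predict(X))
r2_lasso = r2(y, lasso.predict(X))
print("OLS   R^2:", round(r2_ols, 4))
print("Lasso R^2:", round(r2_lasso, 4),
      "ratio:", round(r2_lasso / r2_ols, 4),
      "selected:", np.flatnonzero(lasso.coef_) + 1)  # 1-based, as in Table 6
```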

6. Conclusion and Discussion

In this paper, a new Atan penalty which very closely resembles the $L_0$ penalty is proposed. First, we establish the theory of the Atan estimator under mild conditions. The theory indicates that the Atan-penalized least squares procedure enjoys oracle properties even when the number of variables grows slower than the number of observations. Second, the iteratively reweighted Lasso algorithm makes our proposed estimator fast to implement (a sketch follows this paragraph). Third, we suggest a BIC-like tuning parameter selector to identify the true model consistently. Numerical studies further endorse our theoretical results and the advantage of the Atan estimator for model selection.
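The following is a minimal sketch of one iteration scheme consistent with this description: each outer step solves a weighted Lasso problem whose weights are the Atan penalty's derivative at the current iterate. The convergence tolerance, the Lasso warm start, the default $\gamma$, and the coordinate-descent details are our own choices, not the paper's specification.

```python
import numpy as np

def atan_dpen(b, lam, gam):
    """Derivative of the Atan penalty p(t) = lam*(gam + 2/pi)*arctan(t/gam), t >= 0."""
    return lam * gam * (gam + 2.0 / np.pi) / (gam ** 2 + b ** 2)

def weighted_lasso_cd(X, y, w, beta0, n_iter=200, tol=1e-8):
    """Coordinate descent for (1/2n)||y - X b||^2 + sum_j w_j |b_j|."""
    n, p = X.shape
    beta = beta0.copy()
    r = y - X @ beta                      # current residual
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            r += X[:, j] * beta[j]        # partial residual without feature j
            z = X[:, j] @ r / n
            b_new = np.sign(z) * max(abs(z) - w[j], 0.0) / col_ss[j]
            r -= X[:, j] * b_new
            max_delta = max(max_delta, abs(b_new - beta[j]))
            beta[j] = b_new
        if max_delta < tol:
            break
    return beta

def atan_irl(X, y, lam, gam=0.01, n_outer=20):
    """Iteratively reweighted Lasso for Atan-penalized least squares (a sketch)."""
    p = X.shape[1]
    # Warm start from a plain Lasso fit (our choice, to avoid the heavy
    # penalty that the Atan derivative places at exactly zero):
    beta = weighted_lasso_cd(X, y, np.full(p, lam), np.zeros(p))
    for _ in range(n_outer):
        w = atan_dpen(np.abs(beta), lam, gam)  # reweight at current iterate
        beta = weighted_lasso_cd(X, y, w, beta)
    return beta
```

Because $p'_{\lambda,\gamma}(t)$ decays like $1/t^2$, large coefficients are quickly released from shrinkage while small ones keep facing a penalty weight close to $\lambda(\gamma + 2/\pi)/\gamma$; this reweighting mechanism is what drives the near-$L_0$ behavior, and $\lambda$ would be tuned with the BIC-type criterion described above.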

We do not address the situation where $p \gg n$ in this paper. In fact, the proposed Atan method can be easily extended to variable selection in the situation of $p \gg n$. Also, as shown in Example 11, the Atan method can be applied to semiparametric and nonparametric models [43, 44]. Furthermore, a recent field of application of variable selection is the search for impact points in functional data analysis [45, 46]. The possible combination of the Atan estimator with the questions in [45, 46] would be interesting. These problems are beyond the scope of this paper and will be interesting topics for future research.

Appendix
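The bounds below repeatedly use the Atan penalty and its derivative. For convenience we restate them here, in the form defined earlier in the paper (the restatement is ours):

```latex
\[
  p_{\lambda,\gamma}(t) \;=\; \lambda\Bigl(\gamma + \tfrac{2}{\pi}\Bigr)\arctan\!\Bigl(\tfrac{t}{\gamma}\Bigr),
  \qquad
  p'_{\lambda,\gamma}(t) \;=\; \frac{\lambda\,\gamma\,(\gamma + 2/\pi)}{\gamma^{2} + t^{2}},
  \qquad t \ge 0 .
\]
```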

Proof of Theorem 2. Let $\alpha_n = \sqrt{p\sigma^2/n}$ and fix $r \in (0, 1)$. To prove the theorem, it suffices to show that if $C > 0$ is large enough, then
$$Q_n(\beta^*) < \inf_{\|\mu\| = C} Q_n(\beta^* + \alpha_n\mu) \quad (A.1)$$
holds for all $n$ sufficiently large, with probability at least $1 - r$. Define $D_n(\mu) = Q_n(\beta^* + \alpha_n\mu) - Q_n(\beta^*)$ and note that
$$D_n(\mu) = \frac{1}{2n}\left(\alpha_n^2\|\mathbf{X}\mu\|^2 - 2\alpha_n\varepsilon^T\mathbf{X}\mu\right) + \sum_{j=1}^{p}\left[p_{\lambda,\gamma}(|\beta_j^* + \alpha_n\mu_j|) - p_{\lambda,\gamma}(|\beta_j^*|)\right]. \quad (A.2)$$
The fact that $p_{\lambda,\gamma}$ is concave on $[0, \infty)$ implies that
$$\begin{aligned}
p_{\lambda,\gamma}(|\beta_j^* + \alpha_n\mu_j|) - p_{\lambda,\gamma}(|\beta_j^*|)
&\ge p'_{\lambda,\gamma}(|\beta_j^* + \alpha_n\mu_j|)\left(|\beta_j^* + \alpha_n\mu_j| - |\beta_j^*|\right)\\
&\ge p'_{\lambda,\gamma}(|\beta_j^* + \alpha_n\mu_j|)\left(-\alpha_n|\mu_j|\right)\\
&= -\lambda\alpha_n|\mu_j|\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta_j^* + \alpha_n\mu_j)^2}
\end{aligned} \quad (A.3)$$
when $n$ is sufficiently large. Condition (B) implies that
$$\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta_j^* + \alpha_n\mu_j)^2} \le \frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}. \quad (A.4)$$
Thus, for $n$ big enough,
$$D_n(\mu) \ge \frac{1}{2n}\left(\alpha_n^2\|\mathbf{X}\mu\|^2 - 2\alpha_n\varepsilon^T\mathbf{X}\mu\right) - Cp\lambda\alpha_n\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}. \quad (A.5)$$
By (D),
$$\frac{1}{2n}\alpha_n^2\|\mathbf{X}\mu\|^2 \ge \frac{\lambda_{\min}}{2}\,C^2\alpha_n^2. \quad (A.6)$$
On the other hand, (D) implies
$$\frac{1}{n}\alpha_n\left|\varepsilon^T\mathbf{X}\mu\right| \le \frac{C\alpha_n}{\sqrt{n}}\left\|\frac{1}{\sqrt{n}}\mathbf{X}^T\varepsilon\right\| = O_P(C\alpha_n^2). \quad (A.7)$$
Furthermore, (C) and (B) imply
$$Cp\lambda\alpha_n\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2} = o(C\alpha_n^2). \quad (A.8)$$
From (A.5)-(A.8), we conclude that if $C > 0$ is large enough, then $\inf_{\|\mu\| = C} D_n(\mu) > 0$ holds for all $n$ sufficiently large, with probability at least $1 - r$. This proves Theorem 2.

Proof of Lemma 3. Suppose that $\beta \in \mathbb{R}^p$ and that $\|\beta - \beta^*\| \le C\sqrt{p\sigma^2/n}$. Define $\tilde{\beta} \in \mathbb{R}^p$ by $\tilde{\beta}_{A^c} = 0$ and $\tilde{\beta}_A = \beta_A$. Similar to the proof of Theorem 2, let
$$D_n(\beta, \tilde{\beta}) = Q_n(\beta) - Q_n(\tilde{\beta}), \quad (A.9)$$
where $Q_n(\beta)$ is defined in (8). Then
$$\begin{aligned}
D_n(\beta, \tilde{\beta})
&= \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^2 - \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta}\|^2 + \sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|)\\
&= \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta} - \mathbf{X}(\beta - \tilde{\beta})\|^2 - \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta}\|^2 + \sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|)\\
&= \frac{1}{2n}(\beta - \tilde{\beta})^T\mathbf{X}^T\mathbf{X}(\beta - \tilde{\beta}) - \frac{1}{n}(\beta - \tilde{\beta})^T\mathbf{X}^T(\mathbf{y} - \mathbf{X}\tilde{\beta}) + \sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|)\\
&= O_P\!\left(\|\beta - \tilde{\beta}\|\sqrt{\frac{p\sigma^2}{n}}\right) + \sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|).
\end{aligned} \quad (A.10)$$
On the other hand, since the Atan penalty is concave on $[0, \infty)$,
$$p_{\lambda,\gamma}(|\beta_j|) \ge \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\!\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right)|\beta_j| \quad (A.11)$$
for $j \in A^c$. Thus,
$$\sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|) \ge \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\!\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right)\|\beta - \tilde{\beta}\|. \quad (A.12)$$
By (C),
$$\liminf_{n \to \infty}\,\arctan\!\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right) > 0. \quad (A.13)$$
It follows from (A.10)-(A.12) that there is a constant $K > 0$ such that
$$\frac{D_n(\beta, \tilde{\beta})}{\|\beta - \tilde{\beta}\|} \ge K\lambda + O_P\!\left(\sqrt{\frac{p\sigma^2}{n}}\right). \quad (A.14)$$
Since $\lambda\sqrt{n/(p\sigma^2)} \to \infty$, the result follows.


Proof of Theorem 4. Taken together, Theorem 2 and Lemma 3 imply that there exists a sequence of local minima $\hat{\beta}$ of (8) such that $\|\hat{\beta} - \beta^*\| = O_P(\sqrt{p\sigma^2/n})$ and $\hat{\beta}_{A^c} = 0$. Part (i) of the theorem follows immediately.

To prove part (ii), observe that on the event $\{j : \hat{\beta}_j \ne 0\} = A$, we must have
$$\hat{\beta}_A = \beta_A^* + (\mathbf{X}_A^T\mathbf{X}_A)^{-1}\mathbf{X}_A^T\varepsilon - (n^{-1}\mathbf{X}_A^T\mathbf{X}_A)^{-1}\hat{p}'_A, \quad (A.15)$$
where $\hat{p}'_A = (p'_{\lambda,\gamma}(\hat{\beta}_j))_{j \in A}$. It follows that
$$\sqrt{n}\,B_n\left(\frac{n^{-1}\mathbf{X}_A^T\mathbf{X}_A}{\sigma^2}\right)^{1/2}(\hat{\beta}_A - \beta_A^*) = B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}\mathbf{X}_A^T\varepsilon - nB_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}\hat{p}'_A \quad (A.16)$$
whenever $\{j : \hat{\beta}_j \ne 0\} = A$. Now note that conditions (B)-(D) imply
$$\left\|nB_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}\hat{p}'_A\right\| = O_P\!\left(\sqrt{\frac{np}{\sigma^2}}\,\frac{\lambda\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}\right) = o_P(1), \quad (A.17)$$
and thus
$$\sqrt{n}\,B_n\left(\frac{n^{-1}\mathbf{X}_A^T\mathbf{X}_A}{\sigma^2}\right)^{1/2}(\hat{\beta}_A - \beta_A^*) = B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}\mathbf{X}_A^T\varepsilon + o_P(1). \quad (A.18)$$
To complete the proof of (ii), we use the Lindeberg-Feller central limit theorem to show that
$$B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}\mathbf{X}_A^T\varepsilon \longrightarrow N(0, G) \quad (A.19)$$
in distribution. Observe that
$$B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}\mathbf{X}_A^T\varepsilon = \sum_{i=1}^{n}\omega_{in}, \quad (A.20)$$
where $\omega_{in} = B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}x_{iA}\varepsilon_i$. Fix $\delta_0 > 0$ and let $\eta_{in} = x_{iA}^T(\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}B_n^TB_n(\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}x_{iA}$. Then
$$\begin{aligned}
E\left[\|\omega_{in}\|^2;\ \|\omega_{in}\|^2 > \delta_0\right]
&= \eta_{in}E\!\left[\frac{\varepsilon_i^2}{\sigma^2};\ \frac{\eta_{in}\varepsilon_i^2}{\sigma^2} > \delta_0\right]\\
&\le \eta_{in}\,E\!\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)}P\!\left(\frac{\eta_{in}\varepsilon_i^2}{\sigma^2} > \delta_0\right)^{\delta/(2+\delta)}\\
&\le \eta_{in}^{1+\delta/(2+\delta)}\,\delta_0^{-1}\,E\!\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)}.
\end{aligned} \quad (A.21)$$
Since $\sum_{i=1}^{n}\eta_{in} = \operatorname{tr}(B_n^TB_n) \to \operatorname{tr}(G) < \infty$, and since (E) implies
$$\max_{1 \le i \le n}\eta_{in} \le \lambda_{\min}(n^{-1}\mathbf{X}^T\mathbf{X})^{-1}\lambda_{\max}(B_n^TB_n)\max_{1 \le i \le n}\frac{1}{n}\sum_{j=1}^{p}x_{ij}^2 \longrightarrow 0, \quad (A.22)$$
we must have
$$\begin{aligned}
\sum_{i=1}^{n}E\left[\|\omega_{in}\|^2;\ \|\omega_{in}\|^2 > \delta_0\right]
&\le \delta_0^{-1}E\!\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)}\sum_{i=1}^{n}\eta_{in}^{1+\delta/(2+\delta)}\\
&\le \delta_0^{-1}E\!\left(\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right)^{2/(2+\delta)}\operatorname{tr}(B_n^TB_n)\max_{1 \le i \le n}\eta_{in}^{\delta/(2+\delta)} \longrightarrow 0.
\end{aligned} \quad (A.23)$$
Thus, the Lindeberg condition is satisfied and (A.19) holds.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Tian Yuan Special Funds of the National Natural Science Foundation of China (Grant no. 11426192) and the Project of Education of Zhejiang Province (no. Y201533324).

References

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267-288, 1996.
[2] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109-148, 1993.
[3] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348-1360, 2001.
[4] C.-H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," The Annals of Statistics, vol. 38, no. 2, pp. 894-942, 2010.
[5] J. Lv and Y. Fan, "A unified approach to model selection and sparse recovery using regularized least squares," The Annals of Statistics, vol. 37, no. 6, pp. 3498-3528, 2009.
[6] L. Dicker, B. Huang, and X. Lin, "Variable selection and estimation with the seamless-L0 penalty," Statistica Sinica, vol. 23, no. 2, pp. 929-962, 2013.
[7] E. Candes and T. Tao, "The Dantzig selector: statistical estimation when p is much larger than n," The Annals of Statistics, vol. 35, no. 6, pp. 2313-2351, 2007.
[8] J. Ghosh and A. E. Ghattas, "Bayesian variable selection under collinearity," The American Statistician, vol. 69, no. 3, pp. 165-173, 2015.
[9] J. Fan, L. Z. Xue, and H. Zou, "Strong oracle optimality of folded concave penalized estimation," The Annals of Statistics, vol. 42, no. 3, pp. 819-849, 2014.
[10] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267-281, Akademiai Kiado, Budapest, Hungary, 1973.
[11] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[12] D. P. Foster and E. I. George, "The risk inflation criterion for multiple regression," The Annals of Statistics, vol. 22, no. 4, pp. 1947-1975, 1994.
[13] L. Breiman, "Heuristics of instability and stabilization in model selection," The Annals of Statistics, vol. 24, no. 6, pp. 2350-2383, 1996.
[14] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356-1378, 2000.
[15] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418-1429, 2006.
[16] P. Zhao and B. Yu, "On model selection consistency of lasso," Journal of Machine Learning Research, vol. 7, pp. 2541-2563, 2006.
[17] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, pp. 928-961, 2004.
[18] Y. Kim, H. Choi, and H.-S. Oh, "Smoothly clipped absolute deviation on high dimensions," Journal of the American Statistical Association, vol. 103, no. 484, pp. 1665-1673, 2008.
[19] Y. Kim and S. Kwon, "Global optimality of nonconvex penalized estimators," Biometrika, vol. 99, no. 2, pp. 315-325, 2012.
[20] C.-H. Zhang and T. Zhang, "A general theory of concave regularization for high-dimensional sparse estimation problems," Statistical Science, vol. 27, no. 4, pp. 576-593, 2012.
[21] L. Wang, Y. Kim, and R. Li, "Calibrating nonconvex penalized regression in ultra-high dimension," The Annals of Statistics, vol. 41, no. 5, pp. 2505-2536, 2013.
[22] H. Wang, R. Li, and C.-L. Tsai, "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, vol. 94, no. 3, pp. 553-568, 2007.
[23] H. Wang, B. Li, and C. Leng, "Shrinkage tuning parameter selection with a diverging number of parameters," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 71, no. 3, pp. 671-683, 2009.
[24] J. Chen and Z. Chen, "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, vol. 95, no. 3, pp. 759-771, 2008.
[25] Y. Kim, S. Kwon, and H. Choi, "Consistent model selection criteria on high dimensions," Journal of Machine Learning Research, vol. 13, pp. 1037-1057, 2012.
[26] L. Breiman, "Better subset regression using the non-negative garrote," Technometrics, vol. 37, no. 4, pp. 373-384, 1995.
[27] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 67, no. 2, pp. 301-320, 2005.
[28] H. Zou and H. H. Zhang, "On the adaptive elastic-net with a diverging number of parameters," The Annals of Statistics, vol. 37, no. 4, pp. 1733-1751, 2009.
[29] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," The Annals of Statistics, vol. 36, no. 4, pp. 1509-1533, 2008.
[30] D. R. Hunter and R. Li, "Variable selection using MM algorithms," The Annals of Statistics, vol. 33, no. 4, pp. 1617-1642, 2005.
[31] P. Breheny and J. Huang, "Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection," The Annals of Applied Statistics, vol. 5, no. 1, pp. 232-253, 2011.
[32] J. Fan and J. Lv, "Nonconcave penalized likelihood with NP-dimensionality," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5467-5484, 2011.
[33] T. Zhang, "Multi-stage convex relaxation for feature selection," Bernoulli, vol. 19, no. 5, pp. 2277-2293, 2013.
[34] J. H. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302-332, 2007.
[35] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407-499, 2004.
[36] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422, pp. 486-494, 1993.
[37] H. Wang and C. Leng, "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, vol. 102, no. 479, pp. 1039-1048, 2007.
[38] H. Zou and T. Hastie, "On the 'degrees of freedom' of lasso," The Annals of Statistics, vol. 35, pp. 2173-2192, 2007.
[39] E. R. Lee, H. Noh, and B. U. Park, "Model selection via Bayesian information criterion for quantile regression models," Journal of the American Statistical Association, vol. 109, no. 505, pp. 216-229, 2014.
[40] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, pp. 1-22, 2010.
[41] N. E. Heckman, "Spline smoothing in a partly linear model," Journal of the Royal Statistical Society, Series B: Methodological, vol. 48, no. 2, pp. 244-248, 1986.
[42] T. A. Stamey, J. N. Kabalin, J. E. McNeal, et al., "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients," The Journal of Urology, vol. 141, no. 5, pp. 1076-1083, 1989.
[43] J. Huang, J. L. Horowitz, and F. Wei, "Variable selection in nonparametric additive models," The Annals of Statistics, vol. 38, no. 4, pp. 2282-2313, 2010.
[44] P. Radchenko, "High dimensional single index models," Journal of Multivariate Analysis, vol. 139, pp. 266-282, 2015.
[45] G. Aneiros and P. Vieu, "Variable selection in infinite-dimensional problems," Statistics & Probability Letters, vol. 94, pp. 12-20, 2014.
[46] G. Aneiros and P. Vieu, "Partial linear modelling with multi-functional covariates," Computational Statistics, vol. 30, no. 3, pp. 647-671, 2015.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article Variable Selection and Parameter Estimation … · 2020. 1. 12. · garrotte [], elastic-net [ , ], SCAD [], and MCP []. In particular, 1 penalized least squares

Journal of Probability and Statistics 7

Table 1 Simulation results for linear regression models of Example 8

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 100 120590 = 2

Lasso 06213 30000 12050 00000 03700 06700Alasso 03074 30000 03500 00000 07300 02700SCAD 02715 30000 10650 00000 04400 05600MCP 03041 30000 05650 00000 05800 04200Dantzig 04623 30000 06546 00000 05700 04300Bayesian 03548 30000 05732 00000 06300 03700Atan 02550 30000 01750 00000 08450 01550

119899 = 200 120590 = 2

Lasso 06027 30000 10700 00000 03550 06450Alasso 02781 30000 01600 00000 08650 01350SCAD 02900 30000 08550 00000 05250 04750MCP 02752 30000 03650 00000 06850 03150Dantzig 03863 30000 08576 00000 04920 05080Bayesian 02563 30000 04754 00000 07150 02850Atan 02508 30000 01000 00000 09050 00950

Table 2 Standard deviations of estimators for the linear regression model (119899 = 200)

Method 1

2

5

SD SDm (SDmad) SD SDm (SDmad) SD SDm (SDmad)Lasso 01753 01453 (00100) 01730 01688 (00094) 01591 01301 (00079)Alasso 01483 01638 (00095) 01642 01636 (00098) 01475 01439 (00073)SCAD 01797 01634 (00105) 01819 01608 (00104) 01398 01438 (00076)MCP 01602 01643 (00096) 01861 01656 (00097) 01464 01435 (00069)Dantzig 01734 01645 (00115) 01723 01665 (00094) 01581 01538 (00074)Bayesian 01568 01635 (00089) 01678 01649 (00092) 01367 01375 (00078)Atan 01510 01643 (00108) 01591 01658 (00096) 01609 01434 (00083)

Perhaps the most striking aspect of the results presentedin Table 4 is that hardly no method ever selected the correctmodel in this simulation study However given that 119901 119901

0

and 120573lowast are substantially larger in this study than in the

previous simulation studies this may not be too surprisingNotice that on average Atan selects the most parsimoniousmodels of all methods and has the smallest model errorAtanrsquos nearest competitor in terms of model error is AlassoThis implementation of Alasso has mean model error 02783but its average selected model size is 1035250 larger thanAtanrsquos Since 119901

0= 111 it is clear that Atan underfits in some

instances In fact all of the methods in this study underfitto some extent This may be due to the fact that many ofthe nonzero entries in 120573

lowast are small relative to the noise level1205902= 5

Example 11 As an extension of this method we consider theproblem of simultaneous variable selection and estimation inthe partially linear model

119884 = 1198831015840120573 + 119892 (119879) + 120576 (21)

where119884 is a scalar response variate119883 is a 119901-vector covariate119879 is a scalar covariate and takes values in a compact interval(for simplicity we assume this interval to be [0 1]) 120573 is 119901 times

1 column vector of unknown regression parameter function119892(sdot) is unknown and model error 120576 is independent of (119883 119879)withmean 0 Traditionally it has generally been assumed that120573 is finite dimension several standard approaches such as thekernel method the spline method [41] and the local linearestimation [17] have been proposed

In this study we simulate 119899 = 100 200 points 119879119894 119894 =

1 100(200) from the uniform distribution on [0 1] Foreach 119894 1198901015840

119894119895119904 are simulated to be normally distributed with

autocorrelated variance structure 119860119877(120588) such that

cov (119890119894119895 119890119894119897) = 120588|119894minus119895|

1 le 119894 119895 le 10 (22)

119883119894119895rsquos are then formed as follows

1198831198941= sin (2119879

119894) + 1198901198941

1198831198942= (05 + 119879

119894)minus2

+ 1198901198942

8 Journal of Probability and Statistics

Table 3 Simulation results for linear regression models of Example 9

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 200 119901 = 10

Lasso 09955 49900 20100 00100 01150 08750Alasso 05740 48650 01100 01350 08150 00500SCAD 05659 49800 05600 00200 05600 04200MCP 06177 49100 01200 00900 08250 00850Dantzig 06987 48900 06700 01100 04360 04540Bayesian 05656 48650 02500 01350 06340 02310Atan 05447 48900 01150 01100 08250 00650

119899 = 400 119901 = 12

Lasso 12197 50000 20650 00000 01400 08600Alasso 04458 49950 00900 00050 09250 00700SCAD 04481 50000 04850 00000 06350 03650MCP 04828 50000 01150 00000 08950 01050Dantzig 07879 50000 05670 00000 03200 06800Bayesian 04237 50000 01800 00000 07550 02450Atan 04125 49950 00250 00050 09700 00250

119899 = 800 119901 = 16

Lasso 12004 50000 25700 00000 00900 09100Alasso 03156 50000 00700 00000 09300 00700SCAD 03219 50000 06550 00000 05950 04050MCP 03220 50000 00750 00000 09300 00700Dantzig 05791 50000 05470 00000 03400 06600Bayesian 03275 50000 02800 00000 06750 03250Atan 03239 50000 00750 00000 09350 00650

Table 4 Simulation results for linear regression models of Example 10

Method MRME Number of zeros Proportion ofC IC Underfit Correct-fit Overfit

119899 = 500 120590 = 5

Lasso 03492 1102000 247450 05650 00000 04350Alasso 02783 1035250 61250 10000 00000 00000SCAD 03012 1060400 357150 09900 00000 00150MCP 02791 1033600 81100 10000 00000 00900Dantzig 03120 1083400 184650 07570 00000 02430Bayesian 02876 1044350 74700 09800 00000 00200Atan 02794 1014000 40600 10000 00000 00000

1198831198943= exp (119879

119894) + 1198901198943

1198831198945= (119879119894minus 07)

4+ 1198901198945

1198831198946= 119879119894(1 + 119879

2

119894)minus1

+ 1198901198946

1198831198945= radic1 + 119879

119894+ 1198901198947

1198831198948= log (3119879

119894+ 8) + 119890

1198948

(23)

We investigate the scenario 119901 = 10 119883119894119895= 119890119894119895 119895 = 4 9 10

In the scenario we have 120573119895= 119895 1 le 119895 le 4 and 120573

119895= 0 with

others 120576 sim 119873(0 1) For each case we repeated the simulation200 times We investigate 119892(sdot) functions 119892(119905) = cos (2120587119905)For comparison we apply the different penalties in the para-metric component with the B-spline in the nonparametriccomponent

The results are summarized in Table 5 Columns 2ndash5 inTable 5 are the averages of the estimates of 120573

119895 119895 = 1 4

respectively Column 6 is the number of estimates of 120573119895

Journal of Probability and Statistics 9

Table 5 Example 11 comparison of estimators

Estimator 1205731

1205732

1205733

1205734

119862 MME (SD) RASE (119892(119879))119899 = 100 119901 = 10

SCAD 09617 20128 28839 40611 47450 50000 00695 (00625) 06269Lasso 08278 19517 27984 39274 55200 60000 01727 (00790) 08079Dantzig 08767 19544 27894 39302 54590 60000 01696 (00680) 07278Atan 09710 20121 29865 40472 49820 50000 00658 (00630) 06214

119899 = 200 119901 = 10

SCAD 09836 19989 29283 40278 52000 50000 00230 (00210) 03604Lasso 09219 19529 29124 39580 51450 50000 00500 (00315) 04220Dantzig 09534 19675 29096 39345 51460 50000 00510 (00290) 04154Atan 09904 19946 29548 40189 51250 50000 00190 (00245) 03460

5 le 119895 le 119901 which are 0 averaged over 200 simulationsand their medians are given in column 7 Model errors arecomputed as ME() = ( minus 120573)

119879119864(xx119879)( minus 120573) Their medians

are listed in the 8th column followed by the model errorsstandard deviations in parentheses

The performance of (119879) is assessed by the square root ofaverage squared errors (RASE)

RASE2 = 1

119899

119899

sum

119894=1

( (119879119894) minus 119892 (119879

119894))2 (24)

where 119879119894 119894 = 1 119899 are the observed data points at which

function 119892 is estimated Column 9 summarizes the RASEvalues for the different situations

We can see the following from Table 5 (1) Comparingwith the Lasso and Dantzig estimator the SCAD and Atanestimators have a smallermodel error which is due to the factthat the SCAD and Atan estimators are unbiased estimatorswhile the Lasso and Dantzig are biased especially for thelarger coefficient Moreover the Atan and SCAD estimatorsare more stable (2) Each method is able to select importantvariables but it is obvious that the Atan estimator has slightlystronger sparsity (3) For the nonparametric component Atanand SCAD estimator have smaller RASE values

52 RealDataAnalysis In this sectionwe apply theAtan reg-ularization scheme to a prostate cancer example The datasetin this example is derived from a study of prostate cancer in[42]The dataset consists of themedical records of 97 patientswho were about to receive a radical prostatectomy Thepredictors are eight clinical measures log(cancer volume)(lcavol) log (prostate weight) (lweight) age the logarithmof the amount of benign prostatic hyperplasia (lbph) sem-inal vesicle invasion (svi) log (capsular penetration) (lcp)gleason score (gleason) and percentage gleason score 4 or5 (pgg45) The response is the logarithm of prostate-specificantigen (lpsa) One of the main aims here is to identify whichpredictors are more important in predicting the response

The Lasso Alasso SCAD MCP and Atan are all appliedto the data We also compute the OLS estimate of the prostatecancer data Results are summarized in Table 6 The OLSestimator does not perform variable selection Lasso selectsfive variables in the final model SCAD selects lcavol lweight

Table 6 Prostate cancer data comparing different methods

Method 1198772

11987721198772

OLS Variables selectedOLS 06615 10000 AllLasso 05867 08870 (1 2 4 5 8)Alasso 05991 09058 (1 2 5)SCAD 06140 09283 (1 2 4 5)MCP 05999 09069 (1 2 5)Atan 06057 09158 (1 2 5)

lbph and svi in the final model while Alasso MCP andAtan select lcavol lweight and svi Thus Atan selects asubstantially simpler model than Lasso SCAD Furthermoreas indicated by the columns labeled 119877

2 (1198772 is equal to oneminus the residual sum of squares divided by the total sum ofsquares) and11987721198772OLS in Table 6 the Atan estimator describesmore variability in the data than Alasso and MCP and nearlyas much as OLS estimator

6 Conclusion and Discussion

In this paper a new Atan penalty which very closely resem-bles 119871

0penalty is proposed First we establish the theory

of the Atan estimator under mild conditions The theoryindicates that the Atan-penalized least squares procedureenjoys oracle properties even when the number of variablesgrows slower than the number of observations Second theiteratively reweighted Lasso algorithm makes our proposedestimator implementation fast Third we suggest a BIC-liketuning parameter selector to identify the true model con-sistently Numerical studies further endorse our theoreticalresults and the advantage of the Atan estimator for modelselection

We do not address the situation where 119901 ≫ 119899 in thispaper In fact the proposed Atan method can be easilyextended for variable selection in the situation of119901 ≫ 119899 Alsoas it is shown in Example 11 the Atan method can be appliedto semiparametric model and nonparametric model [43 44]Furthermore there is a recent field of applications of variableselection which is to look for impact points in functional dataanalysis [45 46]The possible combination of Atan estimator

10 Journal of Probability and Statistics

with the questions in [45 46] would be interesting Theseproblems are beyond the scope of this paper and will beinteresting topics for future research

Appendix

Proof of Theorem 2 Let 120572119899= radic1199011205902119899 and fix 119903 isin (0 1) To

prove the theorem it suffices to show that if 119862 gt 0 is largeenough then

119876119899(120573lowast) lt inf120583=119862

119876119899(120573lowast+ 120572119899120583) (A1)

holds for all 119899 sufficiently large with probability at least 1 minus 119903Define119863

119899(120583) = 119876

119899(120573lowast+ 120572119899120583) minus 119876

119899(120573lowast) and note that

119863119899(120583) =

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

+

119901

sum

119895=1

119901120582119886

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

(A2)

The fact that 119901120582120574

is concave on [0infin) implies that

119901120582120574

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (

10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816minus10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (minus120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816)

= minus120582120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2

(A3)

when 119899 is sufficiently largeCondition (B) implies that

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2le120574 (120574 + 2120587)

1205742 + 1205882 (A4)

Thus for 119899 big enough

119863119899(120583) ge

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

minus119862119901120582120572

119899120574 (120574 + 2120587)

1205742 + 1205882

(A5)

By (D)1

21198991205722

119899

1003817100381710038171003817X12058310038171003817100381710038172ge120582min2

11986221205722

119899 (A6)

On the other hand (D) implies1

119899120572119899

10038161003816100381610038161003816120576119879X12058310038161003816100381610038161003816 le

119862120572119899

radic119899

10038171003817100381710038171003817100381710038171003817

1

radic119899X119879120576

10038171003817100381710038171003817100381710038171003817= 119874119875(1198621205722

119899) (A7)

Furthermore (C) and (B) imply

119862119901120582120572119899120574 (120574 + 2120587)

1205742 + 1205882= 119900 (119862120572

2

119899) (A8)

From (A5)ndash(A8) we conclude that if 119862 gt 0 is large enoughthen inf

120583=119862119863119899(120583) gt 0 holds for all 119899 sufficiently large with

probability at least 1 minus 119903 This proves Theorem 2

Proof of Lemma 3 Suppose that 120573 isin 119877119901 and that 120573 minus 120573

lowast le

119862radic1199011205902119899 Define isin 119877119901 by

119860119888 = 0 and

119860= 120573119860 Similar to

the proof of Theorem 2 let

119863119899(120573 ) = 119876

119899(120573) minus 119876

119899() (A9)

where 119876119899(120573) is defined in (8) Then

119863119899(120573 ) =

1

2119899

1003817100381710038171003817y minus X12057310038171003817100381710038172minus

1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899

10038171003817100381710038171003817y minus X minus X (120573 minus )

10038171003817100381710038171003817

2

minus1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899(120573 minus )

119879

X119879X (120573 minus )

minus1

2119899(120573 minus )

119879

X119879 (y minus X)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

= 119874119901(10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817radic1199011205902

119899)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

(A10)

On the other hand since the Atan penalty is concave on[0infin)

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816) ge 120582 (120574 +

2

120587) arctan( 119862

120574radic1198991199011205902)

10038161003816100381610038161003816120573119895

10038161003816100381610038161003816 (A11)

for 119895 isin 119860119888 Thus

sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

ge 120582 (120574 +2

120587) arctan( 119862

120574radic1198991199011205902)

10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817

(A12)

By (C)

lim inf119899rarrinfin

arctan( 119862

120574radic1198991199011205902) gt 0 (A13)

It follows from (A10)ndash(A12) that there is constant119870 gt 0 suchthat

119863119899(120573 )

10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817

ge 119870120582 + 119874119875(radic

1199011205902

119899) (A14)

Since 120582radic1199011205902119899 rarr infin the result follows

Journal of Probability and Statistics 11

Proof ofTheorem 4 Taken togetherTheorem 2 and Lemma 3imply that there exists a sequence of local minima of (8)such that minus 120573lowast = 119874

119875(radic1199011205902119899) and

119860119888 = 0 Part (i) of the

theorem follows immediatelyTo prove part (ii) observe that on the event 119895

119895= 0 =

119860 we must have

119860= 120573lowast

119860+ (X119879119860X119860)minus1

X119879119860120576 minus (119899

minus1X119879119860X119860)minus1

1199011015840

119860 (A15)

where 1199011015840119860= (1199011015840

120582120574(119895))119895isin119860

It follows that

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576

minus 119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

(A16)

whenever 119895 119895

= 0 = 119860 Now note that conditions (B)ndash(D)imply

1003817100381710038171003817100381710038171003817119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

1003817100381710038171003817100381710038171003817

= 119874119875(radic

119899119901

1205902

120582120574 (120574 + 2120587)

1205742 + 1205882) = 119900

119875(1)

(A17)

and thus

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576 + 119900119875 (1)

(A18)

To complete the proof of (ii) we use the Lindeberg-Fellercentral limit theorem to show that

119861119899(1205902X119879119860X119860)minus12

X119879119860120576 997888rarr 119873(0 119866) (A19)

in distribution Observe that

119861119899(1205902X119879119860X119860)minus12

X119879119860120576 =

119899

sum

119894=1

120596119894119899 (A20)

where 120596119894119899

= 119861119899(1205902X119879119860X119860)minus12

119909119894119860120576119894

Fix 1205750

gt 0 and let 120578119894119899

=

119909119879

119894119860(X119879119860X119860)minus12

119861119879

119899119861119899(X119879119860X119860)minus12

119909119894119860 Then

119864 [1003817100381710038171003817120596119894119899

100381710038171003817100381721003817100381710038171003817120596119894119899

10038171003817100381710038172gt 1205750] = 120578119894119899119864[

1205762

119894

12059021205781198941198991205762

119894

1205902gt 1205750]

le 120578119894119899119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

119875(1205781198941198991205762

119894

1205902gt 1205750)

120575(2+120575)

le 1205781+1205752+120575

119894119899120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

(A21)

Sincesum119899119894=1

120578119894119899

= tr(119861119879119899119861119899) rarr tr(119866) lt infin and since (E) implies

max1le119894le119899

120578119894119899120582min (119899

minus1X119879X) 120582max (119861119879

119899119861119899)max1le119894le119899

1

119899

119901

sum

119895=1

1199092

119894119895

997888rarr 0

(A22)

we must have119899

sum

119894=1

119864 [1003817100381710038171003817120596119894119899

100381710038171003817100381721003817100381710038171003817120596119894119899

10038171003817100381710038172gt 1205750]

= 120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575) 119899

sum

119894=1

1205781+120575(2+120575)

119894119899

le 120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

tr (119861119879119899119861119899)max1le119894le119899

120578120575(2+120575)

119894119899

997888rarr 0

(A23)

Thus the Lindeberg condition is satisfied and (A19)holds

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the Tian Yuan Special Funds ofthe National Natural Science Foundation of China (Grant no11426192) and the Project of Education of Zhejiang Province(no Y201533324)

References

[1] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society Series B Methodologicalvol 58 no 1 pp 267ndash288 1996

[2] I E Frank and J H Friedman ldquoA statistical view of somechemometrics regression toolsrdquoTechnometrics vol 35 no 2 pp109ndash148 1993

[3] J Fan and R Li ldquoVariable selection via nonconcave penalizedlikelihood and its oracle propertiesrdquo Journal of the AmericanStatistical Association vol 96 no 456 pp 1348ndash1360 2001

[4] C-H Zhang ldquoNearly unbiased variable selection under mini-max concave penaltyrdquoThe Annals of Statistics vol 38 no 2 pp894ndash942 2010

[5] J Lv and Y Fan ldquoA unified approach to model selection andsparse recovery using regularized least squaresrdquo The Annals ofStatistics vol 37 no 6 pp 3498ndash3528 2009

[6] L Dicker B Huang and X Lin ldquoVariable selection and estima-tion with the seamless-119871

0penaltyrdquo Statistica Sinica vol 23 no

2 pp 929ndash962 2013[7] E Candes and T Tao ldquoThe Dantzig selector statistical estima-

tion when p is much larger than nrdquoThe Annals of Statistics vol35 no 6 pp 2313ndash2351 2007

[8] J Ghosh and A E Ghattas ldquoBayesian variable selection undercollinearityrdquo The American Statistician vol 69 no 3 pp 165ndash173 2015

12 Journal of Probability and Statistics

[9] J Fan L Z Xue andH Zou ldquoStrong oracle optimality of foldedconcave penalized estimationrdquo Annals of Statistics vol 42 no3 pp 819ndash849 2014

[10] H Akaike ldquoInformation theory and an extension of the max-imum likelihood principlerdquo in Proceedings of the 2nd Interna-tional Symposium on Information Theory B N Petrov and FCsaki Eds pp 267ndash281 Akademiai Kiado Budapest Hungary1973

[11] G Schwarz ldquoEstimating the dimension of a modelrdquo Annals ofStatistics vol 6 no 2 pp 461ndash464 1978

[12] D P Foster and E I George ldquoThe risk inflation criterion formultiple regressionrdquo The Annals of Statistics vol 22 no 4 pp1947ndash1975 1994

[13] L Breiman ldquoHeuristics of instability and stabilization in modelselectionrdquoThe Annals of Statistics vol 24 no 6 pp 2350ndash23831996

[14] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoAnnals of Statistics vol 28 no 5 pp 1356ndash1378 2000

[15] H Zou ldquoThe adaptive lasso and its oracle propertiesrdquo Journal ofthe American Statistical Association vol 101 no 476 pp 1418ndash1429 2006

[16] P Zhao and B Yu ldquoOn model selection consistency of lassordquoJournal of Machine Learning Research vol 7 pp 2541ndash25632006

[17] J Fan and H Peng ldquoNonconcave penalized likehood with adiverging number parametersrdquo Annals of Statistics vol 32 pp928ndash961 2004

[18] YKimHChoi andH-SOh ldquoSmoothly clipped absolute devi-ation on high dimensionsrdquo Journal of the American StatisticalAssociation vol 103 no 484 pp 1665ndash1673 2008

[19] Y Kim and S Kwon ldquoGlobal optimality of nonconvex penalizedestimatorsrdquo Biometrika vol 99 no 2 pp 315ndash325 2012

[20] C-H Zhang and T Zhang ldquoA general theory of concave reg-ularization for high-dimensional sparse estimation problemsrdquoStatistical Science vol 27 no 4 pp 576ndash593 2012

[21] L Wang Y Kim and R Li ldquoCalibrating nonconvex penalizedregression in ultra-high dimensionrdquoTheAnnals of Statistics vol41 no 5 pp 2505ndash2536 2013

[22] H Wang R Li and C-L Tsai ldquoTuning parameter selectors forthe smoothly clipped absolute deviation methodrdquo Biometrikavol 94 no 3 pp 553ndash568 2007

[23] H Wang B Li and C Leng ldquoShrinkage tuning parameterselection with a diverging number of parametersrdquo Journal of theRoyal Statistical Society Series B Statistical Methodology vol 71no 3 pp 671ndash683 2009

[24] J Chen and Z Chen ldquoExtended Bayesian information criteriafor model selection with large model spacesrdquo Biometrika vol95 no 3 pp 759ndash771 2008

[25] Y Kim S Kwon and H Choi ldquoConsistent model selectioncriteria on high dimensionsrdquo Journal of Machine LearningResearch vol 13 pp 1037ndash1057 2012

[26] L Breiman ldquoBetter subset regression using the non-negativegarroterdquo Technometrics vol 37 no 4 pp 373ndash384 1995

[27] H Zou and T Hastie ldquoRegularization and variable selection viathe elastic netrdquo Journal of the Royal Statistical Society Series BStatistical Methodology vol 67 no 2 pp 301ndash320 2005

[28] H Zou and H H Zhang ldquoOn the adaptive elastic-net with adiverging number of parametersrdquo The Annals of Statistics vol37 no 4 pp 1733ndash1751 2009

[29] H Zou and R Li ldquoOne-step sparse estimates in nonconcavepenalized likelihood modelsrdquo The Annals of Statistics vol 36no 4 pp 1509ndash1533 2008

[30] D R Hunter and R Li ldquoVariable selection using MM algo-rithmsrdquo The Annals of Statistics vol 33 no 4 pp 1617ndash16422005

[31] P Breheny and J Huang ldquoCoordinate descent algorithms fornonconvex penalized regression with applications to biologicalfeature selectionrdquo The Annals of Applied Statistics vol 5 no 1pp 232ndash253 2011

[32] J Fan and J Lv ldquoNonconcave penalized likelihood with NP-dimensionalityrdquo IEEE Transactions on Information Theory vol57 no 8 pp 5467ndash5484 2011

[33] T Zhang ldquoMulti-stage convex relaxation for feature selectionrdquoBernoulli vol 19 no 5 pp 2277ndash2293 2013

[34] J H Friedman T Hastie H Hofling and R Tibshirani ldquoPath-wise coordinate optimizationrdquo The Annals of Applied Statisticsvol 1 no 2 pp 302ndash332 2007

[35] B Efron T Hastie I Johnstone and R Tibshirani ldquoLeast angleregressionrdquo The Annals of Statistics vol 32 no 2 pp 407ndash4992004

[36] J Shao ldquoLinear model selection by cross-validationrdquo Journal ofthe American Statistical Association vol 88 no 422 pp 486ndash494 1993

[37] H Wang and C Leng ldquoUnified LASSO estimation by leastsquares approximationrdquo Journal of the American StatisticalAssociation vol 102 no 479 pp 1039ndash1048 2007

[38] H Zou and T Hastie ldquoOn the lsquodegrees of freedomrsquo of lassordquoAnnals of Statistics vol 35 pp 2173ndash2192 2007

[39] E R Lee H Noh and B U Park ldquoModel selection via Bayesianinformation criterion for quantile regressionmodelsrdquo Journal ofthe American Statistical Association vol 109 no 505 pp 216ndash229 2014

[40] J Friedman T Hastie and R Tibshirani ldquoRegularization pathsfor generalized linear models via coordinate descentrdquo Journal ofStatistical Software vol 33 no 1 pp 1ndash22 2010

[41] N E Heckman ldquoSpline smoothing in a partly linear modelrdquoJournal of the Royal Statistical Society Series B Methodologicalvol 48 no 2 pp 244ndash248 1986

[42] T A Stamey J N Kabalin J E McNeal et al ldquoProstate specificantigen in the diagnosis and treatment of adenocarcinoma ofthe prostate II Radical prostatectomy treated patientsrdquo TheJournal of Urology vol 141 no 5 pp 1076ndash1083 1989

[43] J Huang J L Horowitz and F Wei ldquoVariable selection innonparametric additivemodelsrdquoTheAnnals of Statistics vol 38no 4 pp 2282ndash2313 2010

[44] P Radchenko ldquoHigh dimensional single index modelsrdquo Journalof Multivariate Analysis vol 139 pp 266ndash282 2015

[45] G Aneiros and P Vieu ldquoVariable selection in infinite-dimen-sional problemsrdquo Statistics amp Probability Letters vol 94 pp 12ndash20 2014

[46] G Aneiros and P Vieu ldquoPartial linear modelling with multi-functional covariatesrdquo Computational Statistics vol 30 no 3pp 647ndash671 2015

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article Variable Selection and Parameter Estimation … · 2020. 1. 12. · garrotte [], elastic-net [ , ], SCAD [], and MCP []. In particular, 1 penalized least squares

8 Journal of Probability and Statistics

Table 3: Simulation results for the linear regression models of Example 9. MRME is the median of relative model errors; C and IC are the average numbers of zero coefficients correctly and incorrectly estimated as zero; the last three columns give the proportions of underfitted, correctly fitted, and overfitted models.

Method     MRME     C        IC       Underfit   Correct-fit   Overfit

n = 200, p = 10
Lasso      0.9955   4.9900   2.0100   0.0100     0.1150        0.8750
Alasso     0.5740   4.8650   0.1100   0.1350     0.8150        0.0500
SCAD       0.5659   4.9800   0.5600   0.0200     0.5600        0.4200
MCP        0.6177   4.9100   0.1200   0.0900     0.8250        0.0850
Dantzig    0.6987   4.8900   0.6700   0.1100     0.4360        0.4540
Bayesian   0.5656   4.8650   0.2500   0.1350     0.6340        0.2310
Atan       0.5447   4.8900   0.1150   0.1100     0.8250        0.0650

n = 400, p = 12
Lasso      1.2197   5.0000   2.0650   0.0000     0.1400        0.8600
Alasso     0.4458   4.9950   0.0900   0.0050     0.9250        0.0700
SCAD       0.4481   5.0000   0.4850   0.0000     0.6350        0.3650
MCP        0.4828   5.0000   0.1150   0.0000     0.8950        0.1050
Dantzig    0.7879   5.0000   0.5670   0.0000     0.3200        0.6800
Bayesian   0.4237   5.0000   0.1800   0.0000     0.7550        0.2450
Atan       0.4125   4.9950   0.0250   0.0050     0.9700        0.0250

n = 800, p = 16
Lasso      1.2004   5.0000   2.5700   0.0000     0.0900        0.9100
Alasso     0.3156   5.0000   0.0700   0.0000     0.9300        0.0700
SCAD       0.3219   5.0000   0.6550   0.0000     0.5950        0.4050
MCP        0.3220   5.0000   0.0750   0.0000     0.9300        0.0700
Dantzig    0.5791   5.0000   0.5470   0.0000     0.3400        0.6600
Bayesian   0.3275   5.0000   0.2800   0.0000     0.6750        0.3250
Atan       0.3239   5.0000   0.0750   0.0000     0.9350        0.0650

Table 4: Simulation results for the linear regression models of Example 10 (columns as in Table 3).

n = 500, σ = 5
Method     MRME     C          IC        Underfit   Correct-fit   Overfit
Lasso      0.3492   110.2000   24.7450   0.5650     0.0000        0.4350
Alasso     0.2783   103.5250    6.1250   1.0000     0.0000        0.0000
SCAD       0.3012   106.0400   35.7150   0.9900     0.0000        0.0150
MCP        0.2791   103.3600    8.1100   1.0000     0.0000        0.0900
Dantzig    0.3120   108.3400   18.4650   0.7570     0.0000        0.2430
Bayesian   0.2876   104.4350    7.4700   0.9800     0.0000        0.0200
Atan       0.2794   101.4000    4.0600   1.0000     0.0000        0.0000
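For readers who want to reproduce summaries of this kind, the sketch below computes the C, IC, and fit-proportion columns from a stack of replicated estimates. It is our own illustration, not code from the paper: the zero-detection tolerance is an assumption, and MRME additionally requires the model errors of a baseline fit (not shown).

```python
import numpy as np

def selection_summary(beta_hats, beta_true, tol=1e-8):
    """Column summaries in the style of Tables 3-4: average counts of correct (C)
    and incorrect (IC) zeros, and the proportions of underfitted, correctly
    fitted, and overfitted models. beta_hats has shape (replications, p)."""
    zero = np.abs(beta_true) < tol              # positions of the true zeros
    est_zero = np.abs(beta_hats) < tol          # estimated zeros, per replication
    C = est_zero[:, zero].sum(axis=1).mean()    # true zeros correctly set to zero
    IC = est_zero[:, ~zero].sum(axis=1).mean()  # true nonzeros wrongly zeroed
    underfit = est_zero[:, ~zero].any(axis=1)   # missed at least one signal
    overfit = (~est_zero[:, zero]).any(axis=1) & ~underfit  # kept a spurious variable only
    correct = ~underfit & ~overfit              # selected exactly the true model
    return C, IC, underfit.mean(), correct.mean(), overfit.mean()
```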

$$X_{i3} = \exp(T_i) + e_{i3}, \qquad X_{i5} = (T_i - 0.7)^4 + e_{i5}, \qquad X_{i6} = T_i\left(1 + T_i^2\right)^{-1} + e_{i6},$$
$$X_{i7} = \sqrt{1 + T_i} + e_{i7}, \qquad X_{i8} = \log(3T_i + 8) + e_{i8}. \tag{23}$$

We investigate the scenario p = 10, with X_ij = e_ij for j = 4, 9, 10. In this scenario we set β_j = j for 1 ≤ j ≤ 4 and β_j = 0 otherwise, with ε ∼ N(0, 1). For each case we repeated the simulation 200 times. The nonparametric function is g(t) = cos(2πt). For comparison, we apply the different penalties to the parametric component, with a B-spline approximation for the nonparametric component.
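As an illustration, here is a minimal sketch of this data-generating process. The distribution of T_i, the noise scale of the e_ij, and the construction of the first two covariates are our assumptions (they are specified on an earlier page of the paper, not shown here).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10

T = rng.uniform(0, 1, size=n)                # index of the nonparametric part (assumed U(0,1))
e = rng.standard_normal((n, p + 1))          # e_{ij} noise terms (assumed standard normal)
X = np.empty((n, p))
X[:, 0], X[:, 1] = e[:, 1], e[:, 2]          # placeholders for X_{i1}, X_{i2} (defined earlier)
X[:, 2] = np.exp(T) + e[:, 3]                # X_{i3} in (23)
X[:, 4] = (T - 0.7) ** 4 + e[:, 5]           # X_{i5}
X[:, 5] = T / (1 + T ** 2) + e[:, 6]         # X_{i6}
X[:, 6] = np.sqrt(1 + T) + e[:, 7]           # X_{i7}
X[:, 7] = np.log(3 * T + 8) + e[:, 8]        # X_{i8}
X[:, 3], X[:, 8], X[:, 9] = e[:, 4], e[:, 9], e[:, 10]  # X_{ij} = e_{ij} for j = 4, 9, 10

beta = np.array([1.0, 2.0, 3.0, 4.0] + [0.0] * (p - 4))  # beta_j = j for 1 <= j <= 4
y = X @ beta + np.cos(2 * np.pi * T) + rng.standard_normal(n)  # g(t) = cos(2*pi*t), eps ~ N(0,1)
```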

The results are summarized in Table 5. Columns 2–5 of Table 5 report the averages of the estimates of β_j, j = 1, …, 4, respectively. Column 6 gives the number of estimates of β_j, 5 ≤ j ≤ p, that are 0, averaged over the 200 simulations; their medians are given in column 7.


Table 5: Example 11, comparison of estimators. C gives the average and the median (over 200 replications) of the number of β_j, 5 ≤ j ≤ p, estimated as zero; MME is the median of model errors (SD in parentheses).

Estimator   β1       β2       β3       β4       C (avg)   C (median)   MME (SD)          RASE(ĝ(T))

n = 100, p = 10
SCAD        0.9617   2.0128   2.8839   4.0611   4.7450    5.0000       0.0695 (0.0625)   0.6269
Lasso       0.8278   1.9517   2.7984   3.9274   5.5200    6.0000       0.1727 (0.0790)   0.8079
Dantzig     0.8767   1.9544   2.7894   3.9302   5.4590    6.0000       0.1696 (0.0680)   0.7278
Atan        0.9710   2.0121   2.9865   4.0472   4.9820    5.0000       0.0658 (0.0630)   0.6214

n = 200, p = 10
SCAD        0.9836   1.9989   2.9283   4.0278   5.2000    5.0000       0.0230 (0.0210)   0.3604
Lasso       0.9219   1.9529   2.9124   3.9580   5.1450    5.0000       0.0500 (0.0315)   0.4220
Dantzig     0.9534   1.9675   2.9096   3.9345   5.1460    5.0000       0.0510 (0.0290)   0.4154
Atan        0.9904   1.9946   2.9548   4.0189   5.1250    5.0000       0.0190 (0.0245)   0.3460

Model errors are computed as ME(β̂) = (β̂ − β)^T E(xx^T)(β̂ − β). Their medians (MME) are listed in the 8th column, followed by the standard deviations of the model errors in parentheses.

The performance of ĝ(T) is assessed by the square root of average squared errors (RASE),

$$\mathrm{RASE}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{g}(T_i) - g(T_i)\right)^2, \tag{24}$$

where the T_i, i = 1, …, n, are the observed data points at which the function g is estimated. Column 9 summarizes the RASE values for the different situations.
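As a sketch, the two accuracy measures just defined can be computed as follows; the function and variable names are ours, and the covariance matrix E(xx^T) is assumed to be known or estimated separately.

```python
import numpy as np

def model_error(beta_hat, beta, Sigma):
    """ME(beta_hat) = (beta_hat - beta)^T E(xx^T) (beta_hat - beta)."""
    d = beta_hat - beta
    return float(d @ Sigma @ d)

def rase(g_hat_vals, g_vals):
    """Square root of average squared errors, eq. (24), from the fitted and
    true values of g at the observed points T_1, ..., T_n."""
    return float(np.sqrt(np.mean((g_hat_vals - g_vals) ** 2)))
```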

We can see the following from Table 5. (1) Compared with the Lasso and Dantzig estimators, the SCAD and Atan estimators have smaller model errors. This is due to the fact that SCAD and Atan produce nearly unbiased estimates, while the Lasso and the Dantzig selector are biased, especially for the larger coefficients; moreover, the Atan and SCAD estimators are more stable. (2) Each method is able to select the important variables, but the Atan estimator is clearly slightly sparser. (3) For the nonparametric component, the Atan and SCAD estimators have smaller RASE values.

5.2. Real Data Analysis. In this section we apply the Atan regularization scheme to a prostate cancer example. The dataset in this example is derived from a study of prostate cancer [42]. It consists of the medical records of 97 patients who were about to receive a radical prostatectomy. The predictors are eight clinical measures: log(cancer volume) (lcavol), log(prostate weight) (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). The response is the logarithm of the prostate-specific antigen level (lpsa). One of the main aims here is to identify which predictors are more important in predicting the response.

The Lasso, Alasso, SCAD, MCP, and Atan are all applied to the data. We also compute the OLS estimate for the prostate cancer data. The results are summarized in Table 6. The OLS estimator does not perform variable selection. Lasso selects five variables in the final model; SCAD selects lcavol, lweight, lbph, and svi.

Table 6: Prostate cancer data: comparison of different methods.

Method   R²       R²/R²_OLS   Variables selected
OLS      0.6615   1.0000      All
Lasso    0.5867   0.8870      (1, 2, 4, 5, 8)
Alasso   0.5991   0.9058      (1, 2, 5)
SCAD     0.6140   0.9283      (1, 2, 4, 5)
MCP      0.5999   0.9069      (1, 2, 5)
Atan     0.6057   0.9158      (1, 2, 5)

Meanwhile, Alasso, MCP, and Atan select lcavol, lweight, and svi. Thus Atan selects a substantially simpler model than Lasso or SCAD. Furthermore, as indicated by the columns labeled R² (R² equals one minus the residual sum of squares divided by the total sum of squares) and R²/R²_OLS in Table 6, the Atan estimator describes more variability in the data than Alasso and MCP, and nearly as much as the OLS estimator.
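A minimal sketch of such a comparison (here only OLS and the Lasso, via scikit-learn) is shown below. The file name "prostate.csv" and its column layout are assumptions; the in-sample R² is computed in the same way as in Table 6, and the Lasso tuning here uses cross-validation rather than the paper's BIC-type criterion.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, LassoCV

# Stamey et al. [42] prostate data; the CSV path and column names are assumptions.
df = pd.read_csv("prostate.csv")
predictors = ["lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45"]
X, y = df[predictors].to_numpy(), df["lpsa"].to_numpy()

ols = LinearRegression().fit(X, y)
r2_ols = ols.score(X, y)                      # R^2 of the full OLS fit

lasso = LassoCV(cv=5).fit(X, y)
r2_lasso = lasso.score(X, y)
selected = [name for name, b in zip(predictors, lasso.coef_) if abs(b) > 1e-8]

print(f"OLS   R^2 = {r2_ols:.4f}")
print(f"Lasso R^2 = {r2_lasso:.4f}, R^2/R^2_OLS = {r2_lasso / r2_ols:.4f}")
print("Lasso selects:", selected)
```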

6. Conclusion and Discussion

In this paper a new Atan penalty, which very closely resembles the L0 penalty, is proposed. First, we establish the theory of the Atan estimator under mild conditions. The theory indicates that the Atan-penalized least squares procedure enjoys oracle properties even when the number of variables grows slower than the number of observations. Second, the iteratively reweighted Lasso algorithm makes the implementation of our proposed estimator fast. Third, we suggest a BIC-like tuning parameter selector to identify the true model consistently. Numerical studies further support our theoretical results and the advantage of the Atan estimator for model selection.

We do not address the situation where p ≫ n in this paper. In fact, the proposed Atan method can be easily extended to variable selection with p ≫ n. Also, as shown in Example 11, the Atan method can be applied to semiparametric and nonparametric models [43, 44]. Furthermore, a recent field of application of variable selection is the search for impact points in functional data analysis [45, 46]; the possible combination of the Atan estimator with the questions in [45, 46] would be interesting. These problems are beyond the scope of this paper and will be interesting topics for future research.

Appendix

Proof of Theorem 2. Let α_n = √(pσ²/n) and fix r ∈ (0, 1). To prove the theorem, it suffices to show that if C > 0 is large enough, then

$$Q_n(\beta^*) < \inf_{\|\mu\| = C} Q_n(\beta^* + \alpha_n \mu) \tag{A.1}$$

holds for all n sufficiently large, with probability at least 1 − r. Define D_n(μ) = Q_n(β* + α_nμ) − Q_n(β*) and note that

$$D_n(\mu) = \frac{1}{2n}\left(\alpha_n^2\|\mathbf{X}\mu\|^2 - 2\alpha_n\varepsilon^T\mathbf{X}\mu\right) + \sum_{j=1}^{p}\left[p_{\lambda,\gamma}\!\left(|\beta^*_j + \alpha_n\mu_j|\right) - p_{\lambda,\gamma}\!\left(|\beta^*_j|\right)\right]. \tag{A.2}$$

The fact that p_{λ,γ} is concave on [0, ∞), with derivative p'_{λ,γ}(t) = λγ(γ + 2/π)/(γ² + t²), implies that

$$
\begin{aligned}
p_{\lambda,\gamma}\!\left(|\beta^*_j + \alpha_n\mu_j|\right) - p_{\lambda,\gamma}\!\left(|\beta^*_j|\right)
&\ge p'_{\lambda,\gamma}\!\left(|\beta^*_j + \alpha_n\mu_j|\right)\left(|\beta^*_j + \alpha_n\mu_j| - |\beta^*_j|\right) \\
&\ge p'_{\lambda,\gamma}\!\left(|\beta^*_j + \alpha_n\mu_j|\right)\left(-\alpha_n|\mu_j|\right)
= -\lambda\alpha_n|\mu_j|\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta^*_j + \alpha_n\mu_j)^2}
\end{aligned}
\tag{A.3}
$$

when n is sufficiently large. Condition (B) implies that

$$\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + (\beta^*_j + \alpha_n\mu_j)^2} \le \frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}. \tag{A.4}$$

Thus, for n big enough,

$$D_n(\mu) \ge \frac{1}{2n}\left(\alpha_n^2\|\mathbf{X}\mu\|^2 - 2\alpha_n\varepsilon^T\mathbf{X}\mu\right) - Cp\lambda\alpha_n\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}. \tag{A.5}$$

By (D),

$$\frac{1}{2n}\,\alpha_n^2\|\mathbf{X}\mu\|^2 \ge \frac{\lambda_{\min}}{2}\,C^2\alpha_n^2. \tag{A.6}$$

On the other hand, (D) implies

$$\frac{1}{n}\,\alpha_n\left|\varepsilon^T\mathbf{X}\mu\right| \le \frac{C\alpha_n}{\sqrt{n}}\left\|\frac{1}{\sqrt{n}}\mathbf{X}^T\varepsilon\right\| = O_P\!\left(C\alpha_n^2\right). \tag{A.7}$$

Furthermore, (C) and (B) imply

$$Cp\lambda\alpha_n\,\frac{\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2} = o\!\left(C\alpha_n^2\right). \tag{A.8}$$

From (A.5)–(A.8) we conclude that if C > 0 is large enough, then inf_{‖μ‖=C} D_n(μ) > 0 holds for all n sufficiently large, with probability at least 1 − r. This proves Theorem 2.

Proof of Lemma 3. Suppose that β ∈ R^p and that ‖β − β*‖ ≤ C√(pσ²/n). Define β̃ ∈ R^p by β̃_{A^c} = 0 and β̃_A = β_A. Similar to the proof of Theorem 2, let

$$D_n(\beta, \tilde{\beta}) = Q_n(\beta) - Q_n(\tilde{\beta}), \tag{A.9}$$

where Q_n(β) is defined in (8). Then

$$
\begin{aligned}
D_n(\beta, \tilde{\beta}) &= \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^2 - \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta}\|^2 + \sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|) \\
&= \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta} - \mathbf{X}(\beta - \tilde{\beta})\|^2 - \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta}\|^2 + \sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|) \\
&= \frac{1}{2n}(\beta - \tilde{\beta})^T\mathbf{X}^T\mathbf{X}(\beta - \tilde{\beta}) - \frac{1}{n}(\beta - \tilde{\beta})^T\mathbf{X}^T(\mathbf{y} - \mathbf{X}\tilde{\beta}) + \sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|) \\
&= O_P\!\left(\|\beta - \tilde{\beta}\|\sqrt{\frac{p\sigma^2}{n}}\right) + \sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|).
\end{aligned}
\tag{A.10}
$$

On the other hand, since the Atan penalty is concave on [0, ∞),

$$p_{\lambda,\gamma}(|\beta_j|) \ge \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\!\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right)|\beta_j| \tag{A.11}$$

for j ∈ A^c. Thus

$$\sum_{j \in A^c} p_{\lambda,\gamma}(|\beta_j|) \ge \lambda\left(\gamma + \frac{2}{\pi}\right)\arctan\!\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right)\|\beta - \tilde{\beta}\|. \tag{A.12}$$

By (C),

$$\liminf_{n \to \infty}\,\arctan\!\left(\frac{C}{\gamma\sqrt{n/(p\sigma^2)}}\right) > 0. \tag{A.13}$$

It follows from (A.10)–(A.12) that there is a constant K > 0 such that

$$\frac{D_n(\beta, \tilde{\beta})}{\|\beta - \tilde{\beta}\|} \ge K\lambda + O_P\!\left(\sqrt{\frac{p\sigma^2}{n}}\right). \tag{A.14}$$

Since λ/√(pσ²/n) → ∞, the result follows.


Proof of Theorem 4. Taken together, Theorem 2 and Lemma 3 imply that there exists a sequence of local minima β̂ of (8) such that ‖β̂ − β*‖ = O_P(√(pσ²/n)) and β̂_{A^c} = 0. Part (i) of the theorem follows immediately.

To prove part (ii), observe that on the event {j : β̂_j ≠ 0} = A we must have

$$\hat{\beta}_A = \beta^*_A + \left(\mathbf{X}_A^T\mathbf{X}_A\right)^{-1}\mathbf{X}_A^T\varepsilon - \left(n^{-1}\mathbf{X}_A^T\mathbf{X}_A\right)^{-1}\hat{p}'_A, \tag{A.15}$$

where $\hat{p}'_A = \left(p'_{\lambda,\gamma}(\hat{\beta}_j)\right)_{j \in A}$. It follows that

$$
\sqrt{n}\,B_n\left(\frac{n^{-1}\mathbf{X}_A^T\mathbf{X}_A}{\sigma^2}\right)^{1/2}\left(\hat{\beta}_A - \beta^*_A\right)
= B_n\left(\sigma^2\mathbf{X}_A^T\mathbf{X}_A\right)^{-1/2}\mathbf{X}_A^T\varepsilon - nB_n\left(\sigma^2\mathbf{X}_A^T\mathbf{X}_A\right)^{-1/2}\hat{p}'_A
\tag{A.16}
$$

whenever {j : β̂_j ≠ 0} = A. Now note that conditions (B)–(D) imply

$$\left\|nB_n\left(\sigma^2\mathbf{X}_A^T\mathbf{X}_A\right)^{-1/2}\hat{p}'_A\right\| = O_P\!\left(\sqrt{\frac{np}{\sigma^2}}\,\frac{\lambda\gamma(\gamma + 2/\pi)}{\gamma^2 + \rho^2}\right) = o_P(1), \tag{A.17}$$

and thus

$$\sqrt{n}\,B_n\left(\frac{n^{-1}\mathbf{X}_A^T\mathbf{X}_A}{\sigma^2}\right)^{1/2}\left(\hat{\beta}_A - \beta^*_A\right) = B_n\left(\sigma^2\mathbf{X}_A^T\mathbf{X}_A\right)^{-1/2}\mathbf{X}_A^T\varepsilon + o_P(1). \tag{A.18}$$

To complete the proof of (ii), we use the Lindeberg–Feller central limit theorem to show that

$$B_n\left(\sigma^2\mathbf{X}_A^T\mathbf{X}_A\right)^{-1/2}\mathbf{X}_A^T\varepsilon \longrightarrow N(0, G) \tag{A.19}$$

in distribution. Observe that

$$B_n\left(\sigma^2\mathbf{X}_A^T\mathbf{X}_A\right)^{-1/2}\mathbf{X}_A^T\varepsilon = \sum_{i=1}^{n}\omega_{in}, \tag{A.20}$$

where $\omega_{in} = B_n(\sigma^2\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}x_{iA}\varepsilon_i$. Fix δ₀ > 0 and let $\eta_{in} = x_{iA}^T(\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}B_n^TB_n(\mathbf{X}_A^T\mathbf{X}_A)^{-1/2}x_{iA}$. Then

$$
\begin{aligned}
E\left[\|\omega_{in}\|^2;\ \|\omega_{in}\|^2 > \delta_0\right]
&= \eta_{in}\,E\left[\frac{\varepsilon_i^2}{\sigma^2};\ \frac{\eta_{in}\varepsilon_i^2}{\sigma^2} > \delta_0\right] \\
&\le \eta_{in}\left[E\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right]^{2/(2+\delta)}\left[P\!\left(\frac{\eta_{in}\varepsilon_i^2}{\sigma^2} > \delta_0\right)\right]^{\delta/(2+\delta)} \\
&\le \eta_{in}^{1+\delta/(2+\delta)}\,\delta_0^{-1}\left[E\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right]^{2/(2+\delta)}.
\end{aligned}
\tag{A.21}
$$

Since $\sum_{i=1}^{n}\eta_{in} = \operatorname{tr}(B_n^TB_n) \to \operatorname{tr}(G) < \infty$ and since (E) implies

$$\max_{1 \le i \le n}\eta_{in} \le \left[\lambda_{\min}\left(n^{-1}\mathbf{X}^T\mathbf{X}\right)\right]^{-1}\lambda_{\max}\left(B_n^TB_n\right)\max_{1 \le i \le n}\frac{1}{n}\sum_{j=1}^{p}x_{ij}^2 \longrightarrow 0, \tag{A.22}$$

we must have

$$
\begin{aligned}
\sum_{i=1}^{n}E\left[\|\omega_{in}\|^2;\ \|\omega_{in}\|^2 > \delta_0\right]
&\le \delta_0^{-1}\left[E\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right]^{2/(2+\delta)}\sum_{i=1}^{n}\eta_{in}^{1+\delta/(2+\delta)} \\
&\le \delta_0^{-1}\left[E\left|\frac{\varepsilon_i}{\sigma}\right|^{2+\delta}\right]^{2/(2+\delta)}\operatorname{tr}\left(B_n^TB_n\right)\max_{1 \le i \le n}\eta_{in}^{\delta/(2+\delta)} \longrightarrow 0.
\end{aligned}
\tag{A.23}
$$

Thus the Lindeberg condition is satisfied and (A.19) holds.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Tian Yuan Special Funds of the National Natural Science Foundation of China (Grant no. 11426192) and the Project of Education of Zhejiang Province (no. Y201533324).

References

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.
[2] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–148, 1993.
[3] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
[4] C.-H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," The Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.
[5] J. Lv and Y. Fan, "A unified approach to model selection and sparse recovery using regularized least squares," The Annals of Statistics, vol. 37, no. 6, pp. 3498–3528, 2009.
[6] L. Dicker, B. Huang, and X. Lin, "Variable selection and estimation with the seamless-L0 penalty," Statistica Sinica, vol. 23, no. 2, pp. 929–962, 2013.
[7] E. Candes and T. Tao, "The Dantzig selector: statistical estimation when p is much larger than n," The Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007.
[8] J. Ghosh and A. E. Ghattas, "Bayesian variable selection under collinearity," The American Statistician, vol. 69, no. 3, pp. 165–173, 2015.
[9] J. Fan, L. Z. Xue, and H. Zou, "Strong oracle optimality of folded concave penalized estimation," The Annals of Statistics, vol. 42, no. 3, pp. 819–849, 2014.
[10] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267–281, Akademiai Kiado, Budapest, Hungary, 1973.
[11] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[12] D. P. Foster and E. I. George, "The risk inflation criterion for multiple regression," The Annals of Statistics, vol. 22, no. 4, pp. 1947–1975, 1994.
[13] L. Breiman, "Heuristics of instability and stabilization in model selection," The Annals of Statistics, vol. 24, no. 6, pp. 2350–2383, 1996.
[14] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.
[15] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006.
[16] P. Zhao and B. Yu, "On model selection consistency of lasso," Journal of Machine Learning Research, vol. 7, pp. 2541–2563, 2006.
[17] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, pp. 928–961, 2004.
[18] Y. Kim, H. Choi, and H.-S. Oh, "Smoothly clipped absolute deviation on high dimensions," Journal of the American Statistical Association, vol. 103, no. 484, pp. 1665–1673, 2008.
[19] Y. Kim and S. Kwon, "Global optimality of nonconvex penalized estimators," Biometrika, vol. 99, no. 2, pp. 315–325, 2012.
[20] C.-H. Zhang and T. Zhang, "A general theory of concave regularization for high-dimensional sparse estimation problems," Statistical Science, vol. 27, no. 4, pp. 576–593, 2012.
[21] L. Wang, Y. Kim, and R. Li, "Calibrating nonconvex penalized regression in ultra-high dimension," The Annals of Statistics, vol. 41, no. 5, pp. 2505–2536, 2013.
[22] H. Wang, R. Li, and C.-L. Tsai, "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, vol. 94, no. 3, pp. 553–568, 2007.
[23] H. Wang, B. Li, and C. Leng, "Shrinkage tuning parameter selection with a diverging number of parameters," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 71, no. 3, pp. 671–683, 2009.
[24] J. Chen and Z. Chen, "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, vol. 95, no. 3, pp. 759–771, 2008.
[25] Y. Kim, S. Kwon, and H. Choi, "Consistent model selection criteria on high dimensions," Journal of Machine Learning Research, vol. 13, pp. 1037–1057, 2012.
[26] L. Breiman, "Better subset regression using the non-negative garrote," Technometrics, vol. 37, no. 4, pp. 373–384, 1995.
[27] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 67, no. 2, pp. 301–320, 2005.
[28] H. Zou and H. H. Zhang, "On the adaptive elastic-net with a diverging number of parameters," The Annals of Statistics, vol. 37, no. 4, pp. 1733–1751, 2009.
[29] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," The Annals of Statistics, vol. 36, no. 4, pp. 1509–1533, 2008.
[30] D. R. Hunter and R. Li, "Variable selection using MM algorithms," The Annals of Statistics, vol. 33, no. 4, pp. 1617–1642, 2005.
[31] P. Breheny and J. Huang, "Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection," The Annals of Applied Statistics, vol. 5, no. 1, pp. 232–253, 2011.
[32] J. Fan and J. Lv, "Nonconcave penalized likelihood with NP-dimensionality," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5467–5484, 2011.
[33] T. Zhang, "Multi-stage convex relaxation for feature selection," Bernoulli, vol. 19, no. 5, pp. 2277–2293, 2013.
[34] J. H. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.
[35] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[36] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422, pp. 486–494, 1993.
[37] H. Wang and C. Leng, "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, vol. 102, no. 479, pp. 1039–1048, 2007.
[38] H. Zou and T. Hastie, "On the 'degrees of freedom' of the lasso," The Annals of Statistics, vol. 35, pp. 2173–2192, 2007.
[39] E. R. Lee, H. Noh, and B. U. Park, "Model selection via Bayesian information criterion for quantile regression models," Journal of the American Statistical Association, vol. 109, no. 505, pp. 216–229, 2014.
[40] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010.
[41] N. E. Heckman, "Spline smoothing in a partly linear model," Journal of the Royal Statistical Society, Series B: Methodological, vol. 48, no. 2, pp. 244–248, 1986.
[42] T. A. Stamey, J. N. Kabalin, J. E. McNeal, et al., "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients," The Journal of Urology, vol. 141, no. 5, pp. 1076–1083, 1989.
[43] J. Huang, J. L. Horowitz, and F. Wei, "Variable selection in nonparametric additive models," The Annals of Statistics, vol. 38, no. 4, pp. 2282–2313, 2010.
[44] P. Radchenko, "High dimensional single index models," Journal of Multivariate Analysis, vol. 139, pp. 266–282, 2015.
[45] G. Aneiros and P. Vieu, "Variable selection in infinite-dimensional problems," Statistics & Probability Letters, vol. 94, pp. 12–20, 2014.
[46] G. Aneiros and P. Vieu, "Partial linear modelling with multi-functional covariates," Computational Statistics, vol. 30, no. 3, pp. 647–671, 2015.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: Research Article Variable Selection and Parameter Estimation … · 2020. 1. 12. · garrotte [], elastic-net [ , ], SCAD [], and MCP []. In particular, 1 penalized least squares

Journal of Probability and Statistics 9

Table 5 Example 11 comparison of estimators

Estimator 1205731

1205732

1205733

1205734

119862 MME (SD) RASE (119892(119879))119899 = 100 119901 = 10

SCAD 09617 20128 28839 40611 47450 50000 00695 (00625) 06269Lasso 08278 19517 27984 39274 55200 60000 01727 (00790) 08079Dantzig 08767 19544 27894 39302 54590 60000 01696 (00680) 07278Atan 09710 20121 29865 40472 49820 50000 00658 (00630) 06214

119899 = 200 119901 = 10

SCAD 09836 19989 29283 40278 52000 50000 00230 (00210) 03604Lasso 09219 19529 29124 39580 51450 50000 00500 (00315) 04220Dantzig 09534 19675 29096 39345 51460 50000 00510 (00290) 04154Atan 09904 19946 29548 40189 51250 50000 00190 (00245) 03460

5 le 119895 le 119901 which are 0 averaged over 200 simulationsand their medians are given in column 7 Model errors arecomputed as ME() = ( minus 120573)

119879119864(xx119879)( minus 120573) Their medians

are listed in the 8th column followed by the model errorsstandard deviations in parentheses

The performance of (119879) is assessed by the square root ofaverage squared errors (RASE)

RASE2 = 1

119899

119899

sum

119894=1

( (119879119894) minus 119892 (119879

119894))2 (24)

where 119879119894 119894 = 1 119899 are the observed data points at which

function 119892 is estimated Column 9 summarizes the RASEvalues for the different situations

We can see the following from Table 5 (1) Comparingwith the Lasso and Dantzig estimator the SCAD and Atanestimators have a smallermodel error which is due to the factthat the SCAD and Atan estimators are unbiased estimatorswhile the Lasso and Dantzig are biased especially for thelarger coefficient Moreover the Atan and SCAD estimatorsare more stable (2) Each method is able to select importantvariables but it is obvious that the Atan estimator has slightlystronger sparsity (3) For the nonparametric component Atanand SCAD estimator have smaller RASE values

52 RealDataAnalysis In this sectionwe apply theAtan reg-ularization scheme to a prostate cancer example The datasetin this example is derived from a study of prostate cancer in[42]The dataset consists of themedical records of 97 patientswho were about to receive a radical prostatectomy Thepredictors are eight clinical measures log(cancer volume)(lcavol) log (prostate weight) (lweight) age the logarithmof the amount of benign prostatic hyperplasia (lbph) sem-inal vesicle invasion (svi) log (capsular penetration) (lcp)gleason score (gleason) and percentage gleason score 4 or5 (pgg45) The response is the logarithm of prostate-specificantigen (lpsa) One of the main aims here is to identify whichpredictors are more important in predicting the response

The Lasso Alasso SCAD MCP and Atan are all appliedto the data We also compute the OLS estimate of the prostatecancer data Results are summarized in Table 6 The OLSestimator does not perform variable selection Lasso selectsfive variables in the final model SCAD selects lcavol lweight

Table 6 Prostate cancer data comparing different methods

Method 1198772

11987721198772

OLS Variables selectedOLS 06615 10000 AllLasso 05867 08870 (1 2 4 5 8)Alasso 05991 09058 (1 2 5)SCAD 06140 09283 (1 2 4 5)MCP 05999 09069 (1 2 5)Atan 06057 09158 (1 2 5)

lbph and svi in the final model while Alasso MCP andAtan select lcavol lweight and svi Thus Atan selects asubstantially simpler model than Lasso SCAD Furthermoreas indicated by the columns labeled 119877

2 (1198772 is equal to oneminus the residual sum of squares divided by the total sum ofsquares) and11987721198772OLS in Table 6 the Atan estimator describesmore variability in the data than Alasso and MCP and nearlyas much as OLS estimator

6 Conclusion and Discussion

In this paper a new Atan penalty which very closely resem-bles 119871

0penalty is proposed First we establish the theory

of the Atan estimator under mild conditions The theoryindicates that the Atan-penalized least squares procedureenjoys oracle properties even when the number of variablesgrows slower than the number of observations Second theiteratively reweighted Lasso algorithm makes our proposedestimator implementation fast Third we suggest a BIC-liketuning parameter selector to identify the true model con-sistently Numerical studies further endorse our theoreticalresults and the advantage of the Atan estimator for modelselection

We do not address the situation where 119901 ≫ 119899 in thispaper In fact the proposed Atan method can be easilyextended for variable selection in the situation of119901 ≫ 119899 Alsoas it is shown in Example 11 the Atan method can be appliedto semiparametric model and nonparametric model [43 44]Furthermore there is a recent field of applications of variableselection which is to look for impact points in functional dataanalysis [45 46]The possible combination of Atan estimator

10 Journal of Probability and Statistics

with the questions in [45 46] would be interesting Theseproblems are beyond the scope of this paper and will beinteresting topics for future research

Appendix

Proof of Theorem 2 Let 120572119899= radic1199011205902119899 and fix 119903 isin (0 1) To

prove the theorem it suffices to show that if 119862 gt 0 is largeenough then

119876119899(120573lowast) lt inf120583=119862

119876119899(120573lowast+ 120572119899120583) (A1)

holds for all 119899 sufficiently large with probability at least 1 minus 119903Define119863

119899(120583) = 119876

119899(120573lowast+ 120572119899120583) minus 119876

119899(120573lowast) and note that

119863119899(120583) =

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

+

119901

sum

119895=1

119901120582119886

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

(A2)

The fact that 119901120582120574

is concave on [0infin) implies that

119901120582120574

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (

10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816minus10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (minus120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816)

= minus120582120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2

(A3)

when 119899 is sufficiently largeCondition (B) implies that

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2le120574 (120574 + 2120587)

1205742 + 1205882 (A4)

Thus for 119899 big enough

119863119899(120583) ge

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

minus119862119901120582120572

119899120574 (120574 + 2120587)

1205742 + 1205882

(A5)

By (D)1

21198991205722

119899

1003817100381710038171003817X12058310038171003817100381710038172ge120582min2

11986221205722

119899 (A6)

On the other hand (D) implies1

119899120572119899

10038161003816100381610038161003816120576119879X12058310038161003816100381610038161003816 le

119862120572119899

radic119899

10038171003817100381710038171003817100381710038171003817

1

radic119899X119879120576

10038171003817100381710038171003817100381710038171003817= 119874119875(1198621205722

119899) (A7)

Furthermore (C) and (B) imply

119862119901120582120572119899120574 (120574 + 2120587)

1205742 + 1205882= 119900 (119862120572

2

119899) (A8)

From (A5)ndash(A8) we conclude that if 119862 gt 0 is large enoughthen inf

120583=119862119863119899(120583) gt 0 holds for all 119899 sufficiently large with

probability at least 1 minus 119903 This proves Theorem 2

Proof of Lemma 3 Suppose that 120573 isin 119877119901 and that 120573 minus 120573

lowast le

119862radic1199011205902119899 Define isin 119877119901 by

119860119888 = 0 and

119860= 120573119860 Similar to

the proof of Theorem 2 let

119863119899(120573 ) = 119876

119899(120573) minus 119876

119899() (A9)

where 119876119899(120573) is defined in (8) Then

119863119899(120573 ) =

1

2119899

1003817100381710038171003817y minus X12057310038171003817100381710038172minus

1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899

10038171003817100381710038171003817y minus X minus X (120573 minus )

10038171003817100381710038171003817

2

minus1

2119899

10038171003817100381710038171003817y minus X10038171003817100381710038171003817

2

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

=1

2119899(120573 minus )

119879

X119879X (120573 minus )

minus1

2119899(120573 minus )

119879

X119879 (y minus X)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

= 119874119901(10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817radic1199011205902

119899)

+ sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

(A10)

On the other hand since the Atan penalty is concave on[0infin)

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816) ge 120582 (120574 +

2

120587) arctan( 119862

120574radic1198991199011205902)

10038161003816100381610038161003816120573119895

10038161003816100381610038161003816 (A11)

for 119895 isin 119860119888 Thus

sum

119895isin119860119888

119901120582120574

(10038161003816100381610038161003816120573119895

10038161003816100381610038161003816)

ge 120582 (120574 +2

120587) arctan( 119862

120574radic1198991199011205902)

10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817

(A12)

By (C)

lim inf119899rarrinfin

arctan( 119862

120574radic1198991199011205902) gt 0 (A13)

It follows from (A10)ndash(A12) that there is constant119870 gt 0 suchthat

119863119899(120573 )

10038171003817100381710038171003817120573 minus

10038171003817100381710038171003817

ge 119870120582 + 119874119875(radic

1199011205902

119899) (A14)

Since 120582radic1199011205902119899 rarr infin the result follows

Journal of Probability and Statistics 11

Proof ofTheorem 4 Taken togetherTheorem 2 and Lemma 3imply that there exists a sequence of local minima of (8)such that minus 120573lowast = 119874

119875(radic1199011205902119899) and

119860119888 = 0 Part (i) of the

theorem follows immediatelyTo prove part (ii) observe that on the event 119895

119895= 0 =

119860 we must have

119860= 120573lowast

119860+ (X119879119860X119860)minus1

X119879119860120576 minus (119899

minus1X119879119860X119860)minus1

1199011015840

119860 (A15)

where 1199011015840119860= (1199011015840

120582120574(119895))119895isin119860

It follows that

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576

minus 119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

(A16)

whenever 119895 119895

= 0 = 119860 Now note that conditions (B)ndash(D)imply

1003817100381710038171003817100381710038171003817119899119861119899(1205902X119879119860X119860)minus12

1199011015840

119860

1003817100381710038171003817100381710038171003817

= 119874119875(radic

119899119901

1205902

120582120574 (120574 + 2120587)

1205742 + 1205882) = 119900

119875(1)

(A17)

and thus

radic119899119861119899(119899minus1X119879119860X119860

1205902)

12

(119860minus 120573lowast

119860)

= 119861119899(1205902X119879119860X119860)minus12

X119879119860120576 + 119900119875 (1)

(A18)

To complete the proof of (ii) we use the Lindeberg-Fellercentral limit theorem to show that

119861119899(1205902X119879119860X119860)minus12

X119879119860120576 997888rarr 119873(0 119866) (A19)

in distribution Observe that

119861119899(1205902X119879119860X119860)minus12

X119879119860120576 =

119899

sum

119894=1

120596119894119899 (A20)

where 120596119894119899

= 119861119899(1205902X119879119860X119860)minus12

119909119894119860120576119894

Fix 1205750

gt 0 and let 120578119894119899

=

119909119879

119894119860(X119879119860X119860)minus12

119861119879

119899119861119899(X119879119860X119860)minus12

119909119894119860 Then

119864 [1003817100381710038171003817120596119894119899

100381710038171003817100381721003817100381710038171003817120596119894119899

10038171003817100381710038172gt 1205750] = 120578119894119899119864[

1205762

119894

12059021205781198941198991205762

119894

1205902gt 1205750]

le 120578119894119899119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

119875(1205781198941198991205762

119894

1205902gt 1205750)

120575(2+120575)

le 1205781+1205752+120575

119894119899120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

(A21)

Sincesum119899119894=1

120578119894119899

= tr(119861119879119899119861119899) rarr tr(119866) lt infin and since (E) implies

max1le119894le119899

120578119894119899120582min (119899

minus1X119879X) 120582max (119861119879

119899119861119899)max1le119894le119899

1

119899

119901

sum

119895=1

1199092

119894119895

997888rarr 0

(A22)

we must have119899

sum

119894=1

119864 [1003817100381710038171003817120596119894119899

100381710038171003817100381721003817100381710038171003817120596119894119899

10038171003817100381710038172gt 1205750]

= 120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575) 119899

sum

119894=1

1205781+120575(2+120575)

119894119899

le 120575minus1

0119864(

1003816100381610038161003816100381610038161003816

120576119894

120590

1003816100381610038161003816100381610038161003816

2+120575

)

2(2+120575)

tr (119861119879119899119861119899)max1le119894le119899

120578120575(2+120575)

119894119899

997888rarr 0

(A23)

Thus the Lindeberg condition is satisfied and (A19)holds

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the Tian Yuan Special Funds ofthe National Natural Science Foundation of China (Grant no11426192) and the Project of Education of Zhejiang Province(no Y201533324)

References

[1] R Tibshirani ldquoRegression shrinkage and selection via the lassordquoJournal of the Royal Statistical Society Series B Methodologicalvol 58 no 1 pp 267ndash288 1996

[2] I E Frank and J H Friedman ldquoA statistical view of somechemometrics regression toolsrdquoTechnometrics vol 35 no 2 pp109ndash148 1993

[3] J Fan and R Li ldquoVariable selection via nonconcave penalizedlikelihood and its oracle propertiesrdquo Journal of the AmericanStatistical Association vol 96 no 456 pp 1348ndash1360 2001

[4] C-H Zhang ldquoNearly unbiased variable selection under mini-max concave penaltyrdquoThe Annals of Statistics vol 38 no 2 pp894ndash942 2010

[5] J Lv and Y Fan ldquoA unified approach to model selection andsparse recovery using regularized least squaresrdquo The Annals ofStatistics vol 37 no 6 pp 3498ndash3528 2009

[6] L Dicker B Huang and X Lin ldquoVariable selection and estima-tion with the seamless-119871

0penaltyrdquo Statistica Sinica vol 23 no

2 pp 929ndash962 2013[7] E Candes and T Tao ldquoThe Dantzig selector statistical estima-

tion when p is much larger than nrdquoThe Annals of Statistics vol35 no 6 pp 2313ndash2351 2007

[8] J Ghosh and A E Ghattas ldquoBayesian variable selection undercollinearityrdquo The American Statistician vol 69 no 3 pp 165ndash173 2015

12 Journal of Probability and Statistics

[9] J Fan L Z Xue andH Zou ldquoStrong oracle optimality of foldedconcave penalized estimationrdquo Annals of Statistics vol 42 no3 pp 819ndash849 2014

[10] H Akaike ldquoInformation theory and an extension of the max-imum likelihood principlerdquo in Proceedings of the 2nd Interna-tional Symposium on Information Theory B N Petrov and FCsaki Eds pp 267ndash281 Akademiai Kiado Budapest Hungary1973

[11] G Schwarz ldquoEstimating the dimension of a modelrdquo Annals ofStatistics vol 6 no 2 pp 461ndash464 1978

[12] D P Foster and E I George ldquoThe risk inflation criterion formultiple regressionrdquo The Annals of Statistics vol 22 no 4 pp1947ndash1975 1994

[13] L Breiman ldquoHeuristics of instability and stabilization in modelselectionrdquoThe Annals of Statistics vol 24 no 6 pp 2350ndash23831996

[14] K Knight and W Fu ldquoAsymptotics for lasso-type estimatorsrdquoAnnals of Statistics vol 28 no 5 pp 1356ndash1378 2000

[15] H Zou ldquoThe adaptive lasso and its oracle propertiesrdquo Journal ofthe American Statistical Association vol 101 no 476 pp 1418ndash1429 2006

[16] P Zhao and B Yu ldquoOn model selection consistency of lassordquoJournal of Machine Learning Research vol 7 pp 2541ndash25632006

[17] J Fan and H Peng ldquoNonconcave penalized likehood with adiverging number parametersrdquo Annals of Statistics vol 32 pp928ndash961 2004

[18] YKimHChoi andH-SOh ldquoSmoothly clipped absolute devi-ation on high dimensionsrdquo Journal of the American StatisticalAssociation vol 103 no 484 pp 1665ndash1673 2008

[19] Y Kim and S Kwon ldquoGlobal optimality of nonconvex penalizedestimatorsrdquo Biometrika vol 99 no 2 pp 315ndash325 2012

[20] C-H Zhang and T Zhang ldquoA general theory of concave reg-ularization for high-dimensional sparse estimation problemsrdquoStatistical Science vol 27 no 4 pp 576ndash593 2012

[21] L Wang Y Kim and R Li ldquoCalibrating nonconvex penalizedregression in ultra-high dimensionrdquoTheAnnals of Statistics vol41 no 5 pp 2505ndash2536 2013

[22] H Wang R Li and C-L Tsai ldquoTuning parameter selectors forthe smoothly clipped absolute deviation methodrdquo Biometrikavol 94 no 3 pp 553ndash568 2007

[23] H Wang B Li and C Leng ldquoShrinkage tuning parameterselection with a diverging number of parametersrdquo Journal of theRoyal Statistical Society Series B Statistical Methodology vol 71no 3 pp 671ndash683 2009

[24] J Chen and Z Chen ldquoExtended Bayesian information criteriafor model selection with large model spacesrdquo Biometrika vol95 no 3 pp 759ndash771 2008

[25] Y Kim S Kwon and H Choi ldquoConsistent model selectioncriteria on high dimensionsrdquo Journal of Machine LearningResearch vol 13 pp 1037ndash1057 2012

[26] L Breiman ldquoBetter subset regression using the non-negativegarroterdquo Technometrics vol 37 no 4 pp 373ndash384 1995

[27] H Zou and T Hastie ldquoRegularization and variable selection viathe elastic netrdquo Journal of the Royal Statistical Society Series BStatistical Methodology vol 67 no 2 pp 301ndash320 2005

[28] H Zou and H H Zhang ldquoOn the adaptive elastic-net with adiverging number of parametersrdquo The Annals of Statistics vol37 no 4 pp 1733ndash1751 2009

[29] H Zou and R Li ldquoOne-step sparse estimates in nonconcavepenalized likelihood modelsrdquo The Annals of Statistics vol 36no 4 pp 1509ndash1533 2008

[30] D R Hunter and R Li ldquoVariable selection using MM algo-rithmsrdquo The Annals of Statistics vol 33 no 4 pp 1617ndash16422005

[31] P Breheny and J Huang ldquoCoordinate descent algorithms fornonconvex penalized regression with applications to biologicalfeature selectionrdquo The Annals of Applied Statistics vol 5 no 1pp 232ndash253 2011

[32] J Fan and J Lv ldquoNonconcave penalized likelihood with NP-dimensionalityrdquo IEEE Transactions on Information Theory vol57 no 8 pp 5467ndash5484 2011

[33] T Zhang ldquoMulti-stage convex relaxation for feature selectionrdquoBernoulli vol 19 no 5 pp 2277ndash2293 2013

[34] J H Friedman T Hastie H Hofling and R Tibshirani ldquoPath-wise coordinate optimizationrdquo The Annals of Applied Statisticsvol 1 no 2 pp 302ndash332 2007

[35] B Efron T Hastie I Johnstone and R Tibshirani ldquoLeast angleregressionrdquo The Annals of Statistics vol 32 no 2 pp 407ndash4992004

[36] J Shao ldquoLinear model selection by cross-validationrdquo Journal ofthe American Statistical Association vol 88 no 422 pp 486ndash494 1993

[37] H Wang and C Leng ldquoUnified LASSO estimation by leastsquares approximationrdquo Journal of the American StatisticalAssociation vol 102 no 479 pp 1039ndash1048 2007

[38] H Zou and T Hastie ldquoOn the lsquodegrees of freedomrsquo of lassordquoAnnals of Statistics vol 35 pp 2173ndash2192 2007

[39] E R Lee H Noh and B U Park ldquoModel selection via Bayesianinformation criterion for quantile regressionmodelsrdquo Journal ofthe American Statistical Association vol 109 no 505 pp 216ndash229 2014

[40] J Friedman T Hastie and R Tibshirani ldquoRegularization pathsfor generalized linear models via coordinate descentrdquo Journal ofStatistical Software vol 33 no 1 pp 1ndash22 2010

[41] N E Heckman ldquoSpline smoothing in a partly linear modelrdquoJournal of the Royal Statistical Society Series B Methodologicalvol 48 no 2 pp 244ndash248 1986

[42] T A Stamey J N Kabalin J E McNeal et al ldquoProstate specificantigen in the diagnosis and treatment of adenocarcinoma ofthe prostate II Radical prostatectomy treated patientsrdquo TheJournal of Urology vol 141 no 5 pp 1076ndash1083 1989

[43] J Huang J L Horowitz and F Wei ldquoVariable selection innonparametric additivemodelsrdquoTheAnnals of Statistics vol 38no 4 pp 2282ndash2313 2010

[44] P Radchenko ldquoHigh dimensional single index modelsrdquo Journalof Multivariate Analysis vol 139 pp 266ndash282 2015

[45] G Aneiros and P Vieu ldquoVariable selection in infinite-dimen-sional problemsrdquo Statistics amp Probability Letters vol 94 pp 12ndash20 2014

[46] G Aneiros and P Vieu ldquoPartial linear modelling with multi-functional covariatesrdquo Computational Statistics vol 30 no 3pp 647ndash671 2015

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: Research Article Variable Selection and Parameter Estimation … · 2020. 1. 12. · garrotte [], elastic-net [ , ], SCAD [], and MCP []. In particular, 1 penalized least squares

10 Journal of Probability and Statistics

with the questions in [45 46] would be interesting Theseproblems are beyond the scope of this paper and will beinteresting topics for future research

Appendix

Proof of Theorem 2 Let 120572119899= radic1199011205902119899 and fix 119903 isin (0 1) To

prove the theorem it suffices to show that if 119862 gt 0 is largeenough then

119876119899(120573lowast) lt inf120583=119862

119876119899(120573lowast+ 120572119899120583) (A1)

holds for all 119899 sufficiently large with probability at least 1 minus 119903Define119863

119899(120583) = 119876

119899(120573lowast+ 120572119899120583) minus 119876

119899(120573lowast) and note that

119863119899(120583) =

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

+

119901

sum

119895=1

119901120582119886

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

(A2)

The fact that 119901120582120574

is concave on [0infin) implies that

119901120582120574

(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) minus 119901120582120574

(10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (

10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816minus10038161003816100381610038161003816120573lowast

119895

10038161003816100381610038161003816)

ge 1199011015840

120582120574(10038161003816100381610038161003816120573lowast

119895+ 120572119899120583119895

10038161003816100381610038161003816) (minus120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816)

= minus120582120572119899

10038161003816100381610038161003816120583119895

10038161003816100381610038161003816

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2

(A3)

when 119899 is sufficiently largeCondition (B) implies that

120574 (120574 + 2120587)

1205742 + (120573lowast119895+ 120572119899120583119895)2le120574 (120574 + 2120587)

1205742 + 1205882 (A4)

Thus for 119899 big enough

119863119899(120583) ge

1

2119899(1205722

119899

1003817100381710038171003817X12058310038171003817100381710038172minus 2120572119899120576119879X120583)

minus119862119901120582120572

119899120574 (120574 + 2120587)

1205742 + 1205882

(A5)

By (D)1

21198991205722

119899

1003817100381710038171003817X12058310038171003817100381710038172ge120582min2

11986221205722

119899 (A6)

On the other hand (D) implies1

119899120572119899

10038161003816100381610038161003816120576119879X12058310038161003816100381610038161003816 le

119862120572119899

radic119899

10038171003817100381710038171003817100381710038171003817

1

radic119899X119879120576

10038171003817100381710038171003817100381710038171003817= 119874119875(1198621205722

119899) (A7)

Furthermore (C) and (B) imply

119862119901120582120572119899120574 (120574 + 2120587)

1205742 + 1205882= 119900 (119862120572

2

119899) (A8)

From (A5)ndash(A8) we conclude that if 119862 gt 0 is large enoughthen inf

120583=119862119863119899(120583) gt 0 holds for all 119899 sufficiently large with

probability at least 1 minus 119903 This proves Theorem 2

Proof of Lemma 3 Suppose that 120573 isin 119877119901 and that 120573 minus 120573

lowast le

119862radic1199011205902119899 Define isin 119877119901 by

119860119888 = 0 and

119860= 120573119860 Similar to

the proof of Theorem 2 let

119863119899(120573 ) = 119876

119899(120573) minus 119876

119899() (A9)

where 119876119899(120573) is defined in (8) Then

$$\begin{aligned}
D_n(\beta, \tilde{\beta}) &= \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\beta\|^{2} - \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta}\|^{2} + \sum_{j \in A^{c}} p_{\lambda,\gamma}(|\beta_j|)\\
&= \frac{1}{2n}\bigl\|\mathbf{y} - \mathbf{X}\tilde{\beta} - \mathbf{X}(\beta - \tilde{\beta})\bigr\|^{2} - \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\tilde{\beta}\|^{2} + \sum_{j \in A^{c}} p_{\lambda,\gamma}(|\beta_j|)\\
&= \frac{1}{2n}(\beta - \tilde{\beta})^{T}\mathbf{X}^{T}\mathbf{X}(\beta - \tilde{\beta}) - \frac{1}{n}(\beta - \tilde{\beta})^{T}\mathbf{X}^{T}\bigl(\mathbf{y} - \mathbf{X}\tilde{\beta}\bigr) + \sum_{j \in A^{c}} p_{\lambda,\gamma}(|\beta_j|)\\
&= O_P\Bigl(\|\beta - \tilde{\beta}\|\sqrt{p\sigma^{2}/n}\Bigr) + \sum_{j \in A^{c}} p_{\lambda,\gamma}(|\beta_j|). \quad (\mathrm{A10})
\end{aligned}$$

On the other hand, since the Atan penalty is concave on $[0,\infty)$,

$$p_{\lambda,\gamma}(|\beta_j|) \ge \lambda\Bigl(\gamma + \frac{2}{\pi}\Bigr)\arctan\Biggl(\frac{C}{\gamma\sqrt{n/(p\sigma^{2})}}\Biggr)|\beta_j| \quad (\mathrm{A11})$$

for $j \in A^{c}$. Thus,

$$\sum_{j \in A^{c}} p_{\lambda,\gamma}(|\beta_j|) \ge \lambda\Bigl(\gamma + \frac{2}{\pi}\Bigr)\arctan\Biggl(\frac{C}{\gamma\sqrt{n/(p\sigma^{2})}}\Biggr)\|\beta - \tilde{\beta}\|. \quad (\mathrm{A12})$$

By (C),

$$\liminf_{n \to \infty}\,\arctan\Biggl(\frac{C}{\gamma\sqrt{n/(p\sigma^{2})}}\Biggr) > 0. \quad (\mathrm{A13})$$

It follows from (A10)–(A12) that there is a constant $K > 0$ such that

$$\frac{D_n(\beta, \tilde{\beta})}{\|\beta - \tilde{\beta}\|} \ge K\lambda + O_P\Biggl(\sqrt{\frac{p\sigma^{2}}{n}}\Biggr). \quad (\mathrm{A14})$$

Since $\lambda\sqrt{n/(p\sigma^{2})} \to \infty$, the result follows.
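It may help to make explicit the chord bound behind (A11); the restatement below is ours. If $p$ is concave on $[0,\infty)$ with $p(0) = 0$, then

$$p(t) \ge \frac{p(T)}{T}\,t \quad \text{for } 0 \le t \le T.$$

Applying this with $T = C\sqrt{p\sigma^{2}/n}$ gives $p_{\lambda,\gamma}(|\beta_j|) \ge \lambda(\gamma + 2/\pi)\arctan(T/\gamma)\,|\beta_j|/T$, and since $T \le 1$ for all large $n$, the factor $1/T \ge 1$ may be dropped, which yields exactly (A11).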


Proof of Theorem 4. Taken together, Theorem 2 and Lemma 3 imply that there exists a sequence of local minima $\hat{\beta}$ of (8) such that $\|\hat{\beta} - \beta^{*}\| = O_P(\sqrt{p\sigma^{2}/n})$ and $\hat{\beta}_{A^{c}} = 0$. Part (i) of the theorem follows immediately.

To prove part (ii), observe that on the event $\{j : \hat{\beta}_j \ne 0\} = A$ we must have

$$\hat{\beta}_{A} = \beta_{A}^{*} + \bigl(\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1}\mathbf{X}_{A}^{T}\varepsilon - \bigl(n^{-1}\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1}\hat{p}_{A}', \quad (\mathrm{A15})$$

where $\hat{p}_{A}' = \bigl(p_{\lambda,\gamma}'(\hat{\beta}_j)\bigr)_{j \in A}$.
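Equation (A15) can be read off from the stationarity condition of (8) on the active set; the intermediate display below is our reconstruction (with the sign convention for $\hat{p}_{A}'$ absorbing $\operatorname{sgn}(\hat{\beta}_j)$):

$$\frac{1}{n}\mathbf{X}_{A}^{T}\bigl(\mathbf{y} - \mathbf{X}_{A}\hat{\beta}_{A}\bigr) = \hat{p}_{A}' \;\Longrightarrow\; \hat{\beta}_{A} = \bigl(\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1}\mathbf{X}_{A}^{T}\mathbf{y} - \bigl(n^{-1}\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1}\hat{p}_{A}',$$

and substituting $\mathbf{y} = \mathbf{X}_{A}\beta_{A}^{*} + \varepsilon$ on the event $\{j : \hat{\beta}_j \ne 0\} = A$ yields (A15).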

It follows that

$$\begin{aligned}
\sqrt{n}\,B_n\bigl(n^{-1}\mathbf{X}_{A}^{T}\mathbf{X}_{A}/\sigma^{2}\bigr)^{1/2}\bigl(\hat{\beta}_{A} - \beta_{A}^{*}\bigr)
&= B_n\bigl(\sigma^{2}\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1/2}\mathbf{X}_{A}^{T}\varepsilon\\
&\quad - nB_n\bigl(\sigma^{2}\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1/2}\hat{p}_{A}' \quad (\mathrm{A16})
\end{aligned}$$

whenever $\{j : \hat{\beta}_j \ne 0\} = A$. Now note that conditions (B)–(D) imply

$$\Bigl\|nB_n\bigl(\sigma^{2}\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1/2}\hat{p}_{A}'\Bigr\| = O_P\Biggl(\sqrt{\frac{np}{\sigma^{2}}}\,\frac{\lambda\gamma(\gamma + 2/\pi)}{\gamma^{2} + \rho^{2}}\Biggr) = o_P(1), \quad (\mathrm{A17})$$

and thus

$$\sqrt{n}\,B_n\bigl(n^{-1}\mathbf{X}_{A}^{T}\mathbf{X}_{A}/\sigma^{2}\bigr)^{1/2}\bigl(\hat{\beta}_{A} - \beta_{A}^{*}\bigr) = B_n\bigl(\sigma^{2}\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1/2}\mathbf{X}_{A}^{T}\varepsilon + o_P(1). \quad (\mathrm{A18})$$

To complete the proof of (ii), we use the Lindeberg–Feller central limit theorem to show that

$$B_n\bigl(\sigma^{2}\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1/2}\mathbf{X}_{A}^{T}\varepsilon \longrightarrow N(0, G) \quad (\mathrm{A19})$$

in distribution. Observe that

$$B_n\bigl(\sigma^{2}\mathbf{X}_{A}^{T}\mathbf{X}_{A}\bigr)^{-1/2}\mathbf{X}_{A}^{T}\varepsilon = \sum_{i=1}^{n}\omega_{in}, \quad (\mathrm{A20})$$

where $\omega_{in} = B_n(\sigma^{2}\mathbf{X}_{A}^{T}\mathbf{X}_{A})^{-1/2}x_{iA}\varepsilon_i$.

Fix $\delta_0 > 0$ and let $\eta_{in} = x_{iA}^{T}(\mathbf{X}_{A}^{T}\mathbf{X}_{A})^{-1/2}B_n^{T}B_n(\mathbf{X}_{A}^{T}\mathbf{X}_{A})^{-1/2}x_{iA}$. Then

$$\begin{aligned}
E\Bigl[\|\omega_{in}\|^{2};\ \|\omega_{in}\|^{2} > \delta_0\Bigr]
&= \eta_{in}E\Biggl[\frac{\varepsilon_i^{2}}{\sigma^{2}};\ \frac{\eta_{in}\varepsilon_i^{2}}{\sigma^{2}} > \delta_0\Biggr]\\
&\le \eta_{in}\,E\Bigl(\Bigl|\frac{\varepsilon_i}{\sigma}\Bigr|^{2+\delta}\Bigr)^{2/(2+\delta)} P\Biggl(\frac{\eta_{in}\varepsilon_i^{2}}{\sigma^{2}} > \delta_0\Biggr)^{\delta/(2+\delta)}\\
&\le \eta_{in}^{1+\delta/(2+\delta)}\,\delta_0^{-1}\,E\Bigl(\Bigl|\frac{\varepsilon_i}{\sigma}\Bigr|^{2+\delta}\Bigr)^{2/(2+\delta)}. \quad (\mathrm{A21})
\end{aligned}$$

Since $\sum_{i=1}^{n}\eta_{in} = \operatorname{tr}(B_n^{T}B_n) \to \operatorname{tr}(G) < \infty$ and since (E) implies

$$\max_{1 \le i \le n}\eta_{in} \le \lambda_{\min}\bigl(n^{-1}\mathbf{X}^{T}\mathbf{X}\bigr)^{-1}\lambda_{\max}\bigl(B_n^{T}B_n\bigr)\max_{1 \le i \le n}\frac{1}{n}\sum_{j=1}^{p}x_{ij}^{2} \longrightarrow 0, \quad (\mathrm{A22})$$

we must have

$$\begin{aligned}
\sum_{i=1}^{n}E\Bigl[\|\omega_{in}\|^{2};\ \|\omega_{in}\|^{2} > \delta_0\Bigr]
&\le \delta_0^{-1}E\Bigl(\Bigl|\frac{\varepsilon_i}{\sigma}\Bigr|^{2+\delta}\Bigr)^{2/(2+\delta)}\sum_{i=1}^{n}\eta_{in}^{1+\delta/(2+\delta)}\\
&\le \delta_0^{-1}E\Bigl(\Bigl|\frac{\varepsilon_i}{\sigma}\Bigr|^{2+\delta}\Bigr)^{2/(2+\delta)}\operatorname{tr}\bigl(B_n^{T}B_n\bigr)\max_{1 \le i \le n}\eta_{in}^{\delta/(2+\delta)} \longrightarrow 0. \quad (\mathrm{A23})
\end{aligned}$$

Thus the Lindeberg condition is satisfied and (A19) holds.
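As an informal illustration of the limit (A19), one can simulate the normalized score on a fixed active set and compare its empirical covariance with $G = \lim_{n} B_nB_n^{T}$. A minimal Python sketch, assuming Gaussian errors and taking $B_n$ to be the identity (so $G = I_q$); all names here are ours, not from the original:

import numpy as np

def inv_sqrt(S):
    # Symmetric inverse square root of a positive definite matrix
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w**-0.5) @ V.T

rng = np.random.default_rng(0)
n, q, reps, sigma = 500, 3, 2000, 1.0   # n observations, q = |A| active covariates
X_A = rng.normal(size=(n, q))           # fixed design restricted to the active set A
Bn = np.eye(q)                          # illustrative choice of B_n, so G = I_q

W = Bn @ inv_sqrt(sigma**2 * (X_A.T @ X_A))
Z = np.empty((reps, q))
for r in range(reps):
    eps = rng.normal(scale=sigma, size=n)   # errors; any law with 2 + delta moments works
    Z[r] = W @ (X_A.T @ eps)                # one draw of B_n (sigma^2 X_A' X_A)^{-1/2} X_A' eps

print(np.cov(Z.T))   # empirical covariance, close to G = Bn @ Bn.T (identity here)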

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Tian Yuan Special Funds of the National Natural Science Foundation of China (Grant no. 11426192) and the Project of Education of Zhejiang Province (no. Y201533324).

References

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B: Methodological, vol. 58, no. 1, pp. 267–288, 1996.

[2] I. E. Frank and J. H. Friedman, "A statistical view of some chemometrics regression tools," Technometrics, vol. 35, no. 2, pp. 109–148, 1993.

[3] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.

[4] C.-H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," The Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.

[5] J. Lv and Y. Fan, "A unified approach to model selection and sparse recovery using regularized least squares," The Annals of Statistics, vol. 37, no. 6, pp. 3498–3528, 2009.

[6] L. Dicker, B. Huang, and X. Lin, "Variable selection and estimation with the seamless-L0 penalty," Statistica Sinica, vol. 23, no. 2, pp. 929–962, 2013.

[7] E. Candes and T. Tao, "The Dantzig selector: statistical estimation when p is much larger than n," The Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007.

[8] J. Ghosh and A. E. Ghattas, "Bayesian variable selection under collinearity," The American Statistician, vol. 69, no. 3, pp. 165–173, 2015.

[9] J. Fan, L. Z. Xue, and H. Zou, "Strong oracle optimality of folded concave penalized estimation," The Annals of Statistics, vol. 42, no. 3, pp. 819–849, 2014.

[10] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267–281, Akademiai Kiado, Budapest, Hungary, 1973.

[11] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[12] D. P. Foster and E. I. George, "The risk inflation criterion for multiple regression," The Annals of Statistics, vol. 22, no. 4, pp. 1947–1975, 1994.

[13] L. Breiman, "Heuristics of instability and stabilization in model selection," The Annals of Statistics, vol. 24, no. 6, pp. 2350–2383, 1996.

[14] K. Knight and W. Fu, "Asymptotics for lasso-type estimators," The Annals of Statistics, vol. 28, no. 5, pp. 1356–1378, 2000.

[15] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006.

[16] P. Zhao and B. Yu, "On model selection consistency of lasso," Journal of Machine Learning Research, vol. 7, pp. 2541–2563, 2006.

[17] J. Fan and H. Peng, "Nonconcave penalized likelihood with a diverging number of parameters," The Annals of Statistics, vol. 32, pp. 928–961, 2004.

[18] Y. Kim, H. Choi, and H.-S. Oh, "Smoothly clipped absolute deviation on high dimensions," Journal of the American Statistical Association, vol. 103, no. 484, pp. 1665–1673, 2008.

[19] Y. Kim and S. Kwon, "Global optimality of nonconvex penalized estimators," Biometrika, vol. 99, no. 2, pp. 315–325, 2012.

[20] C.-H. Zhang and T. Zhang, "A general theory of concave regularization for high-dimensional sparse estimation problems," Statistical Science, vol. 27, no. 4, pp. 576–593, 2012.

[21] L. Wang, Y. Kim, and R. Li, "Calibrating nonconvex penalized regression in ultra-high dimension," The Annals of Statistics, vol. 41, no. 5, pp. 2505–2536, 2013.

[22] H. Wang, R. Li, and C.-L. Tsai, "Tuning parameter selectors for the smoothly clipped absolute deviation method," Biometrika, vol. 94, no. 3, pp. 553–568, 2007.

[23] H. Wang, B. Li, and C. Leng, "Shrinkage tuning parameter selection with a diverging number of parameters," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 71, no. 3, pp. 671–683, 2009.

[24] J. Chen and Z. Chen, "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, vol. 95, no. 3, pp. 759–771, 2008.

[25] Y. Kim, S. Kwon, and H. Choi, "Consistent model selection criteria on high dimensions," Journal of Machine Learning Research, vol. 13, pp. 1037–1057, 2012.

[26] L. Breiman, "Better subset regression using the non-negative garrote," Technometrics, vol. 37, no. 4, pp. 373–384, 1995.

[27] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol. 67, no. 2, pp. 301–320, 2005.

[28] H. Zou and H. H. Zhang, "On the adaptive elastic-net with a diverging number of parameters," The Annals of Statistics, vol. 37, no. 4, pp. 1733–1751, 2009.

[29] H. Zou and R. Li, "One-step sparse estimates in nonconcave penalized likelihood models," The Annals of Statistics, vol. 36, no. 4, pp. 1509–1533, 2008.

[30] D. R. Hunter and R. Li, "Variable selection using MM algorithms," The Annals of Statistics, vol. 33, no. 4, pp. 1617–1642, 2005.

[31] P. Breheny and J. Huang, "Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection," The Annals of Applied Statistics, vol. 5, no. 1, pp. 232–253, 2011.

[32] J. Fan and J. Lv, "Nonconcave penalized likelihood with NP-dimensionality," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5467–5484, 2011.

[33] T. Zhang, "Multi-stage convex relaxation for feature selection," Bernoulli, vol. 19, no. 5, pp. 2277–2293, 2013.

[34] J. H. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.

[35] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.

[36] J. Shao, "Linear model selection by cross-validation," Journal of the American Statistical Association, vol. 88, no. 422, pp. 486–494, 1993.

[37] H. Wang and C. Leng, "Unified LASSO estimation by least squares approximation," Journal of the American Statistical Association, vol. 102, no. 479, pp. 1039–1048, 2007.

[38] H. Zou and T. Hastie, "On the 'degrees of freedom' of lasso," The Annals of Statistics, vol. 35, pp. 2173–2192, 2007.

[39] E. R. Lee, H. Noh, and B. U. Park, "Model selection via Bayesian information criterion for quantile regression models," Journal of the American Statistical Association, vol. 109, no. 505, pp. 216–229, 2014.

[40] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, no. 1, pp. 1–22, 2010.

[41] N. E. Heckman, "Spline smoothing in a partly linear model," Journal of the Royal Statistical Society, Series B: Methodological, vol. 48, no. 2, pp. 244–248, 1986.

[42] T. A. Stamey, J. N. Kabalin, J. E. McNeal et al., "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients," The Journal of Urology, vol. 141, no. 5, pp. 1076–1083, 1989.

[43] J. Huang, J. L. Horowitz, and F. Wei, "Variable selection in nonparametric additive models," The Annals of Statistics, vol. 38, no. 4, pp. 2282–2313, 2010.

[44] P. Radchenko, "High dimensional single index models," Journal of Multivariate Analysis, vol. 139, pp. 266–282, 2015.

[45] G. Aneiros and P. Vieu, "Variable selection in infinite-dimensional problems," Statistics & Probability Letters, vol. 94, pp. 12–20, 2014.

[46] G. Aneiros and P. Vieu, "Partial linear modelling with multi-functional covariates," Computational Statistics, vol. 30, no. 3, pp. 647–671, 2015.
