Bayesian optimal adaptive estimation using a sieve prior
YES IV Workshop
Julyan Arbel, [email protected]
ENSAE-CREST-Université Paris Dauphine
November 9, 2010
1 / 21
Outline
1 Motivations
2 Assumptions
3 Results
4 White noise model
5 Conclusion
2 / 21
Introduction
• Posterior concentration rate and risk convergence rate in a Bayesian nonparametric setting.
• Results in the same spirit as those of Ghosal, Ghosh and van der Vaart (2000) and Ghosal and van der Vaart (2007), in the specific case of models which are suitable for the use of sieve priors.
• Use of a family of sieve priors (introduced by Zhao (2000) in the white noise model).
• Infinite-dimensional parameter from a Sobolev smoothness class.
4 / 21
Notations
• Let a model $(X^{(n)}, \mathcal{A}^{(n)}, P_\theta^{(n)} : \theta \in \Theta)$ with observations $X^{(n)} = (X_i^n)_{1 \le i \le n}$, and
$$\Theta = \bigcup_{k=1}^{\infty} \mathbb{R}^k.$$
• Denote by $\theta_0$ the parameter associated with the true model. Densities are denoted $p_\theta^{(n)}$ ($p_0^{(n)}$ for $\theta_0$). The first $k$ coordinates of $\theta_0$ are denoted $\theta_{0k}$.
• A sieve prior $\pi$ on $\Theta$ is defined as follows:
$$\pi(\theta) = \sum_k \pi_k \, \pi_k(\theta), \qquad \sum_k \pi_k = 1,$$
and, under $\pi_k$,
$$\frac{\theta_i}{\tau_i} \sim g, \qquad \text{where } \tau_i > 0.$$
5 / 21
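The two-stage structure of this prior (draw a dimension $k$ with weight $\pi_k$, then scaled i.i.d. draws from $g$) can be sketched in a few lines of Python. The concrete choices here are illustrative, not prescribed by the slides: $g$ standard Gaussian, constant scales $\tau_i \equiv \tau$, weights $\pi_k \propto e^{-bk\log k}$, and a truncation at `k_max` for sampling purposes.

```python
import numpy as np

def sample_sieve_prior(k_max=50, b=1.0, tau=1.0, rng=None):
    """Draw theta from a (truncated) sieve prior.

    Dimension k has weight pi_k proportional to exp(-b k log k); then
    theta_i / tau ~ g, with g a standard Gaussian here. Truncation at
    k_max and the choices of g and tau are illustrative only.
    """
    rng = np.random.default_rng() if rng is None else rng
    ks = np.arange(1, k_max + 1)
    log_w = -b * ks * np.log(ks)          # log pi_k up to a constant
    w = np.exp(log_w - log_w.max())
    w = w / w.sum()                       # normalize so sum_k pi_k = 1
    k = int(rng.choice(ks, p=w))
    theta = tau * rng.standard_normal(k)  # theta_i = tau * (g-draw)
    return k, theta

k, theta = sample_sieve_prior(rng=np.random.default_rng(0))
```

The $e^{-bk\log k}$ decay makes small dimensions overwhelmingly likely a priori; the data then pull the posterior toward larger $k$ when the truth requires it.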
We define four different divergences:
$$K(f,g) = \int f \log(f/g)\, d\mu,$$
$$V_{p,0}(f,g) = \int f \left|\log(f/g) - K(f,g)\right|^p d\mu,$$
$$\widetilde K(f,g) = \int p_0^{(n)} \left|\log(f/g)\right| d\mu,$$
$$\widetilde V_{p,0}(f,g) = \int p_0^{(n)} \left|\log(f/g) - K(f,g)\right|^p d\mu.$$
6 / 21
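As a numerical aside (not part of the slides), $K$ and $V_{p,0}$ can be evaluated by quadrature once densities are tabulated on a grid, here with $\mu$ the Lebesgue measure. For the worked example $f = N(0,1)$, $g = N(1,1)$ the closed forms are $K = 1/2$ and $V_{2,0} = 1$, since $\log(f/g) = 1/2 - x$:

```python
import numpy as np

def kl_and_v(f, g, grid, p=2):
    """Approximate K(f,g) = int f log(f/g) dmu and
    V_{p,0}(f,g) = int f |log(f/g) - K(f,g)|^p dmu
    by a Riemann sum on a uniform grid (mu = Lebesgue measure)."""
    dx = grid[1] - grid[0]
    lr = np.log(f) - np.log(g)
    K = float(np.sum(f * lr) * dx)
    V = float(np.sum(f * np.abs(lr - K) ** p) * dx)
    return K, V

# Worked example: f = N(0,1), g = N(1,1); exact values K = 1/2, V_{2,0} = 1.
x = np.linspace(-12.0, 12.0, 200_001)
f = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
g = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
K, V = kl_and_v(f, g, x)
```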
Define a Kullback-Leibler neighborhood
$$B_n = \left\{ \theta : K\left(p_0^{(n)}, p_\theta^{(n)}\right) \le n\varepsilon_n^2,\; V_{p,0}\left(p_0^{(n)}, p_\theta^{(n)}\right) \le \left(n\varepsilon_n^2\right)^{p/2} \right\}.$$
We use a semimetric $d_n$ on $\Theta$, and define $\Theta_n = \left\{\theta \in \mathbb{R}^{k_n},\ \|\theta\| \le \omega_n\right\}$ with $k_n = k_0\, n\varepsilon_n^2/\log n$ and $\omega_n$ some power of $n$.
The posterior distribution is defined by
$$\pi\left(B \mid X^{(n)}\right) = \frac{\int_B p_\theta^{(n)}\left(X^{(n)}\right) d\pi(\theta)}{\int_\Theta p_\theta^{(n)}\left(X^{(n)}\right) d\pi(\theta)}.$$
7 / 21
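For intuition on how this posterior balances the sieve components, here is an illustrative sketch in a hypothetical conjugate normal sequence model (not the general setting of the slides): $X_i = \theta_i + \epsilon_i/\sqrt{n}$, with $\theta_i \sim N(0,\tau^2)$ for $i \le k$ under the $k$-th component and $\theta_i = 0$ beyond, so the posterior weight of dimension $k$ is $\pi_k$ times the marginal likelihood treating the first $k$ coordinates as signal.

```python
import numpy as np

def posterior_dim_weights(X, n, tau=1.0, b=1.0, k_max=30):
    """Posterior weights pi(k | X) over the sieve dimension in a
    conjugate normal sequence model (illustrative setup only).

    Under component k: X_i ~ N(0, tau^2 + 1/n) for i <= k and
    X_i ~ N(0, 1/n) for i > k. Prior weight pi_k ~ exp(-b k log k).
    """
    X = np.asarray(X, dtype=float)
    assert len(X) >= k_max

    def norm_logpdf(x, s2):
        return -0.5 * np.log(2 * np.pi * s2) - x**2 / (2 * s2)

    l_sig = norm_logpdf(X, tau**2 + 1.0 / n)   # coordinate treated as signal
    l_noise = norm_logpdf(X, 1.0 / n)          # coordinate treated as zero
    ks = np.arange(1, k_max + 1)
    log_marg = np.cumsum(l_sig - l_noise)[:k_max] + l_noise.sum()
    log_post = log_marg - b * ks * np.log(ks)  # add log prior weight
    w = np.exp(log_post - log_post.max())
    return w / w.sum()

rng = np.random.default_rng(0)
n = 400
theta0 = np.concatenate([np.full(5, 1.0), np.zeros(25)])  # 5 active coords
X = theta0 + rng.standard_normal(30) / np.sqrt(n)
w = posterior_dim_weights(X, n)
```

With five clearly active coordinates, the posterior mass over $k$ concentrates near dimension 5, illustrating how the data select the effective dimension.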
Outline
1 Motivations
2 Assumptions
3 Results
4 White noise model
5 Conclusion
8 / 21
Assumptions
Assumption 1 On the prior
Assume there exist $a, b, c, d > 0$ such that $\pi_k$ and $g$ satisfy
$$e^{-ak\log k} \le \pi_k \le e^{-bk\log k},$$
$$A e^{-A_1 |t|^d} \le g(t) \le B e^{-B_1 |t|^d},$$
$$\exists\, T, \tau_0 > 0 \text{ s.t. } \min_{i \le k_n} \tau_i \ge n^{-T} \text{ and } \max_{i > 0} \tau_i \le \tau_0 < \infty,$$
$$\sum_{i=1}^{k_n} |\theta_{0i}|^d / \tau_i^d \le C k_n \log n.$$
Assumption 2 On the rate of convergence
The rate of convergence $\varepsilon_n$ is bounded below by the two inequalities
$$K\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le n\varepsilon_n^2, \quad \text{and} \quad V_{p,0}\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le \left(n\varepsilon_n^2\right)^{p/2}.$$
9 / 21
Assumption 3 On divergences
$\widetilde K$ and $\widetilde V_{p,0}$ satisfy
$$\widetilde K\left(p_{0k_n}^{(n)}, p_\theta^{(n)}\right) \le C n \left\|\theta_{0k_n} - \theta\right\|^2, \quad \widetilde V_{p,0}\left(p_{0k_n}^{(n)}, p_\theta^{(n)}\right) \le C n^{p/2} \left\|\theta_{0k_n} - \theta\right\|^p.$$
Assumption 4 On semimetric $d_n$
There exist $G_0, G > 0$ such that, for any two $\theta, \theta'$,
$$d_n(\theta, \theta') \le C k_n^{G_0} \left\|\theta - \theta'\right\|^G.$$
10 / 21
Assumption 5 Test condition
There exist constants $c_1, \zeta > 0$ such that for every $\varepsilon > 0$ and for each $\theta_1$ such that $d_n(\theta_1, \theta_0) > \varepsilon$, one can construct a test statistic $\phi_n \in [0,1]$ which satisfies
$$E_0^{(n)} \phi_n \le e^{-c_1 n \varepsilon^2}, \qquad \sup_{d_n(\theta, \theta_1) < \zeta\varepsilon} E_\theta^{(n)}\left(1 - \phi_n\right) \le e^{-c_1 n \varepsilon^2}.$$
11 / 21
Outline
1 Motivations
2 Assumptions
3 Results
4 White noise model
5 Conclusion
12 / 21
Results
Theorem Posterior concentration rate
The rate of convergence of the posterior distribution relative to $d_n$ is $\varepsilon_n$:
$$E_0^{(n)} \pi\left(d_n^2(\theta, \theta_0) \ge M\varepsilon_n^2 \mid X^{(n)}\right) \to 0.$$
Corollary Risk convergence rate
If the assumptions are satisfied with $p > 2$, and if $d_n$ is bounded, then the integrated posterior risk given $\theta_0$ and $\pi$ converges at least at the same rate $\varepsilon_n$:
$$R_n^{d_n}(\theta_0, \pi) = E_0^{(n)} E^\pi\left[d_n^2(\theta, \theta_0) \mid X^{(n)}\right] = O\left(\varepsilon_n^2\right).$$
13 / 21
Suppose the true parameter $\theta_0$ has Sobolev regularity $\beta > 1/2$:
$$\Theta_\beta(Q_0) = \left\{ \theta : \sum_{i=1}^{\infty} \theta_i^2\, i^{2\beta} \le Q_0 < \infty \right\}.$$
Then the assumptions of the following Corollary hold in the Gaussian white noise model and in regression. For these models, the rate given in the following Corollary coincides with the minimax rate (up to a $\log n$ term): it is in this sense adaptive optimal.
14 / 21
Corollary
If $\theta_0 \in \Theta_\beta(Q_0)$ and
$$K\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le C n \left\|\theta_0 - \theta_{0k_n}\right\|^2, \quad V_{p,0}\left(p_0^{(n)}, p_{0k_n}^{(n)}\right) \le C n^{p/2} \left\|\theta_0 - \theta_{0k_n}\right\|^p,$$
then the rate $\varepsilon_n$ is
$$\varepsilon_n = \varepsilon_0 \left(\frac{\log n}{n}\right)^{\frac{\beta}{2\beta+1}}.$$
15 / 21
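As a quick numerical illustration (not from the slides, with the constant $\varepsilon_0$ set to 1), the rate $(\log n / n)^{\beta/(2\beta+1)}$ interpolates between slow rates for rough truths and nearly-parametric rates for smooth ones:

```python
import numpy as np

def eps_n(n, beta):
    """Adaptive rate eps_n = (log n / n)^(beta / (2 beta + 1)),
    up to the constant eps_0 (set to 1 here for illustration)."""
    return (np.log(n) / n) ** (beta / (2.0 * beta + 1.0))

n = 10**6
# Smoother truth (larger beta) gives a faster (smaller) rate at fixed n,
# and the rate improves as n grows at fixed beta.
rates = {beta: eps_n(n, beta) for beta in (0.6, 1.0, 2.0, 5.0)}
```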
Outline
1 Motivations
2 Assumptions
3 Results
4 White noise model
5 Conclusion
16 / 21
White noise model
$$dX^n(t) = f_0(t)\, dt + \frac{1}{\sqrt{n}}\, dW(t), \quad 0 \le t \le 1.$$
By Fourier expansion on a basis $(\varphi_i)$, this is equivalent to the normal mean model
$$X_i^n = \theta_{0i} + \frac{1}{\sqrt{n}}\, \xi_i, \quad i = 1, 2, \ldots$$
Global $L^2$ loss:
$$R_n^{L^2} = E_0^{(n)} \left\| \hat f_n - f_0 \right\|^2 = E_0^{(n)} \sum_{i=1}^{\infty} \left(\hat\theta_{ni} - \theta_{0i}\right)^2.$$
Pointwise $\ell^2$ loss at a point $t$ (with $a_i = \varphi_i(t)$):
$$R_n^{\ell^2} = E_0^{(n)} \left(\hat f_n(t) - f_0(t)\right)^2 = E_0^{(n)} \left( \sum_{i=1}^{\infty} a_i \left(\hat\theta_{ni} - \theta_{0i}\right) \right)^2.$$
17 / 21
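To connect the sequence model to the global $L^2$ risk, here is a Monte-Carlo aside: the truncation estimator and the specific $\theta_0$ below are illustrative stand-ins, not the posterior-based procedure of the slides.

```python
import numpy as np

def mc_global_l2_risk(beta=1.0, n=10**4, reps=200, imax=2000, seed=0):
    """Monte-Carlo estimate of the global L2 risk of a simple truncation
    estimator in the sequence model X_i = theta_0i + xi_i / sqrt(n).

    theta_0i = i^{-(beta+1)} belongs to the Sobolev class Theta_beta
    (sum_i theta_0i^2 i^{2 beta} = sum_i i^{-2} < infinity); the estimator
    keeps the first k_n ~ (n / log n)^{1/(2 beta + 1)} coordinates.
    """
    rng = np.random.default_rng(seed)
    i = np.arange(1, imax + 1)
    theta0 = i ** (-(beta + 1.0))
    kn = int((n / np.log(n)) ** (1.0 / (2.0 * beta + 1.0)))
    risks = np.empty(reps)
    for r in range(reps):
        X = theta0 + rng.standard_normal(imax) / np.sqrt(n)
        theta_hat = np.where(i <= kn, X, 0.0)  # keep k_n coords, drop tail
        risks[r] = np.sum((theta_hat - theta0) ** 2)
    return float(risks.mean())

risk = mc_global_l2_risk()
```

The estimated risk decomposes, roughly, into a variance term $k_n/n$ and a squared-bias tail $\sum_{i > k_n} \theta_{0i}^2$; both are far below the risk of the zero estimator $\sum_i \theta_{0i}^2$.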
Results in the white noise model
We show that the model satisfies Assumptions 1 to 5.
Proposition
Under global loss, the concentration and risk rates are adaptive optimal:
$$E_0^{(n)} \pi\left(\|\theta - \theta_0\|^2 \ge M\varepsilon_n^2 \mid X^{(n)}\right) \to 0,$$
$$R_n^{L^2}(\theta_0, \pi) = E_0^{(n)} E^\pi\left[\|\theta - \theta_0\|^2 \mid X^{(n)}\right] = O\left(\varepsilon_n^2\right).$$
18 / 21
Pointwise loss
Pointwise $\ell^2$ loss does not satisfy Assumption 4. We can show the following lower bound on the rate of the associated risk.
Proposition
Under pointwise loss, a lower bound on the frequentist risk rate is given by
$$\sup_{\theta_0 \in \Theta_\beta(Q_0)} R_n^{\ell^2}(\theta_0, \pi) \gtrsim n^{-\frac{2\beta-1}{2\beta+1}} \log^2 n.$$
A globally optimal estimator cannot be pointwise optimal (result stated by Cai, Low and Zhao, 2007). There is thus a penalty here from global to pointwise loss of (up to a $\log n$ term)
$$n^{\frac{1}{2\beta(2\beta+1)}}.$$
19 / 21
Outline
1 Motivations
2 Assumptions
3 Results
4 White noise model
5 Conclusion
20 / 21
Conclusion
• We first derived posterior concentration and risk convergence rates for a variety of models that accommodate a sieve prior.
• In a second result we obtained a lower bound on the frequentist risk under pointwise loss; that is to say, the sieve prior does not achieve the optimal rate under pointwise loss.
• Further work should focus on the posterior concentration rate under pointwise loss.
21 / 21