Properties of Multivariate q-Gaussian Distributions
and its application to Smoothed Functional
Algorithms for Stochastic Optimization
A Project Report
Submitted in partial fulfilment of the
requirements for the Degree of
Master of Engineering
in
System Science and Automation
by
Debarghya Ghoshdastidar
ELECTRICAL ENGINEERING
and
COMPUTER SCIENCE & AUTOMATION
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA
JUNE 2012
TO
People who work,
People who dream,
and
People who work
even in their dreams
Acknowledgements
I thank Dr. Ambedkar Dukkipati and Prof. Shalabh Bhatnagar for giving me the opportunity to work on this problem. Their invaluable guidance made this work possible. I also thank the other faculty members of the CSA, EE, ECE and Mathematics departments for blessing me with immense knowledge during coursework. I sincerely thank Prof. Christophe Vignat, Ecole Polytechnique Federale de Lausanne, for his advice, which helped me improve my work.
I would like to thank all the members of the Algorithmic Algebra Lab and the Stochastic Systems Lab for their support. The names of Abhranil, Saswata, Prabu Chandran, Gaurav and Maria need special mention.
Last but not least, I thank my family and close friends for providing constant encouragement even in the toughest times.
Publications based on this Thesis
1. D. Ghoshdastidar, A. Dukkipati, and S. Bhatnagar. q-Gaussian based smoothed
functional algorithms for stochastic optimization. In International Symposium on
Information Theory. IEEE, 2012.
Abstract
The importance of the q-Gaussian distribution is due to its power-law nature and its close association with the popular Gaussian distribution. This distribution arises from nonextensive information theory, and it has the interesting property that uncorrelated q-Gaussian random variates exhibit a special kind of inter-dependence.
In this work, we study some key properties related to higher order moments and co-moments of the multivariate q-Gaussian distribution. We use the important features of this distribution to improve upon the smoothing properties of a Gaussian kernel. Based on this, we propose a Smoothed Functional scheme for gradient and Hessian estimation using the q-Gaussian distribution.
Using the above estimation technique, we propose four two-timescale algorithms for optimization of a stochastic objective function using gradient descent and Newton-based search methods. We prove that the proposed algorithms converge to a local optimum. The performance of the algorithms is demonstrated by simulation results on a queuing model. The results show that the q-Gaussian based schemes provide better tuning of the algorithms, which improves their performance compared to their Gaussian counterparts.
Contents
Acknowledgements

Publications based on this Thesis

Abstract

Notation and Abbreviations

1 Introduction
  1.1 Stochastic Optimization
  1.2 Nonextensive Information Theory
  1.3 Motivation and Summary of current work
  1.4 Organization of this report

2 Background and Preliminaries
  2.1 Problem Framework
  2.2 Smoothed Functionals
  2.3 q-Gaussian distribution

3 Some properties of multivariate q-Gaussian
  3.1 Generalized co-moments of joint q-Gaussian distribution
  3.2 q-Gaussian as a Smoothing Kernel
  3.3 Optimization using q-Gaussian SF

4 q-Gaussian based Gradient Descent Algorithms
  4.1 Gradient Estimation with q-Gaussian SF
    4.1.1 One-simulation q-Gaussian SF Gradient Estimate
    4.1.2 Two-simulation q-Gaussian SF Gradient Estimate
  4.2 Proposed Gradient Descent Algorithms
  4.3 Convergence of Gradient SF Algorithms
    4.3.1 Convergence of Gq-SF1 Algorithm
    4.3.2 Convergence of Gq-SF2 Algorithm

5 q-Gaussian based Newton Search Algorithms
  5.1 Hessian Estimation using q-Gaussian SF
    5.1.1 One-simulation q-Gaussian SF Hessian Estimate
    5.1.2 Two-simulation q-Gaussian SF Hessian Estimate
  5.2 Proposed Newton-based Algorithms
  5.3 Convergence of Newton SF Algorithms
    5.3.1 Convergence of Nq-SF1 Algorithm
    5.3.2 Convergence of Nq-SF2 Algorithm

6 Simulations using Proposed Algorithms
  6.1 Numerical Setting
  6.2 Implementation Issues
  6.3 Experimental Results

7 Conclusions

Bibliography
Notation and Abbreviations
N             Set of all natural numbers, {0, 1, 2, . . .}
R             Set of all real numbers
R+            Set of all positive real numbers
P             Probability of an event
Ep            Expectation of a random variable with probability distribution p
PS            Projection onto a set S
‖.‖           Euclidean norm if the argument is a vector, and induced 2-norm for a matrix
∇^n_x         nth derivative with respect to the vector x
x(i)          ith element of the vector x
xn or x(n)    nth element of a sequence
Ai,j          Element in the ith row and jth column of the matrix A
I_{N×N}       N × N identity matrix
Yn            State of a stochastic process at the nth time instant
Chapter 1
Introduction
1.1 Stochastic Optimization
Optimization lies at the heart of engineering sciences. Modern systems, which cannot
be modeled or analyzed using a deterministic approach, require optimization techniques
based on a stochastic framework. “Methods for stochastic optimization provide a means of coping with inherent system noise and coping with models or systems that are highly nonlinear, high dimensional, or otherwise inappropriate for classical deterministic methods of optimization” (Gentle et al. [17]). Stochastic techniques play a key role in optimization problems where the objective function does not have an analytic expression. Such problems are often encountered in discrete event systems, which are quite common in the engineering and financial worlds. Most often, the data, obtained via statistical surveys or simulation, contain only noisy estimates of the objective function to be optimized.
One of the most commonly used solution methodologies involves stochastic approximation algorithms, particularly the Robbins-Monro algorithm [34], which is used to find the zeros of a given function. Based on this approach, gradient descent algorithms have been developed, in which the parameters controlling the system proceed towards the zeros of the gradient of the average cost. However, these algorithms require an estimate of the cost gradient. Kiefer and Wolfowitz [25] provide such a gradient estimate using several parallel simulations of the system. More efficient techniques for gradient estimation, using one or
two simulations, have been developed based on the smoothed functional approach [24, 35], perturbation analysis [22, 39] and likelihood ratios [29]. A stochastic variation of Newton-based optimization methods, also known as adaptive Newton-based schemes, has also been studied in the literature. These algorithms require an estimate of the Hessian of the average cost, along with the gradient estimate. Such estimates may be obtained by finite differences [16, 36], simultaneous perturbation [40] or the smoothed functional scheme [5].
Special mention must be made of the smoothed functional (SF) approach proposed by Katkovnik and Kulchitsky [24]. In this method, the gradient of the expected cost is approximated by its convolution with a multivariate normal distribution. Such a technique requires just one simulation and hence proves to be efficient. A two-simulation SF algorithm based on a finite difference gradient estimate has been developed in [41], which performs better than one-simulation algorithms. Rubinstein [35] showed that the gradient of the cost function can be approximated by its convolution with any function that satisfies certain conditions. Such functions are often referred to as smoothing kernels. Bhatnagar [5] extended the method from gradient estimation to the Hessian case.
When the above estimation schemes are employed in gradient or Newton based optimization methods, the time complexity of the algorithms increases, since each update iteration requires the estimation procedure. A more efficient approach is to simultaneously perform gradient estimation and parameter updates using different step-size schedules. This class of algorithms constitutes the multi-timescale stochastic approximation algorithms [6]. Two-timescale stochastic optimization algorithms have been developed using simultaneous perturbations [8, 9] and smoothed functionals [7].
The main issue with such algorithms is that, although convergence of the algorithm to a local optimum is guaranteed, the global optimum cannot be achieved in practice. Bhatnagar [5] provides a detailed comparison of the performance of various multi-timescale algorithms for stochastic optimization in a queuing system. The results presented there indicate that the two-simulation SF schemes, using the multivariate Gaussian distribution, outperform other algorithms. The results also show that the performance of the SF algorithms depends considerably on several tuning parameters, such as the variance of the
normal distribution, and also the step-sizes.
We look into smoothing kernels arising out of a generalization of classical information theory, known as nonextensive information theory, which has gained popularity in several domains in recent years.
1.2 Nonextensive Information Theory
Shannon [38] introduced the concept of entropy as a measure of the uncertainty associated with a probability distribution. A continuous form of the Shannon entropy functional has been extensively studied in statistical mechanics, probability and statistics. It is defined as

H(p) = -\int_{\mathcal{X}} p(x) \ln p(x) \, dx,   (1.1)

where p(.) is a pdf defined on the sample space \mathcal{X}. Kullback's minimum discrimination theorem [27] establishes important connections between statistics and information theory. A special case of this theorem is Jaynes' maximum entropy principle [23], by which the exponential family can be obtained by maximizing the Shannon entropy functional subject to certain moment constraints.
Several generalizations of Shannon's entropy have been studied in the literature [26, 33, 31, 13]. These generalized entropy functionals find extensive use in physics, communications, computer science, statistics, etc. One of the most recently studied generalized information measures is due to Tsallis [44]. Although this generalization had been introduced earlier by Havrda and Charvat [20], Tsallis provided interpretations in the context of statistical mechanics. Dukkipati et al. [15, 14] provide a measure theoretic formulation of the continuous form of the Tsallis entropy functional, defined as

H_q(p) = \frac{1 - \int_{\mathcal{X}} [p(x)]^q \, dx}{q - 1}, \quad q \in \mathbb{R}.   (1.2)

This functional results when the natural logarithm in (1.1) is replaced by the q-logarithm, defined as \ln_q(x) = \frac{x^{1-q} - 1}{1-q}, q \in \mathbb{R}, q \neq 1. The Shannon entropy is retrieved from (1.2) as q \to 1. It is called nonextensive because of its pseudo-additive nature [44]. Borges [10] presents a detailed discussion of the mathematical structure induced by this nonextensive entropy functional. Suyari [42] generalized the Shannon-Khinchin axioms to this case.
The popularity of Tsallis entropy in computer science and statistics is due to the power-law nature of the distributions obtained by maximizing it. It has been observed that most real-world data exhibit power-law behavior [30, 3]. Recently, Tsallis entropy has been used to study this behavior in different settings such as finance, earthquakes and network traffic [1, 2]. Compared to the exponential family, the Tsallis distributions, i.e., the family of distributions resulting from maximization of Tsallis entropy, have an additional shape parameter q, as in (1.2), which controls the nature of the power-law tails.
1.3 Motivation and Summary of current work
Though stochastic optimization algorithms are guaranteed to converge to a local optimum of the objective function, the challenge is to achieve the global optimum. Smoothed functional schemes have become popular due to their smoothing effect on local fluctuations. However, the Gaussian smoothing kernel, which is used in practice, does not provide “ideal performance”. Although the cost can be reduced with proper tuning of parameters, globally optimal performance cannot be achieved. Hence, new methods are sought.
We propose a new SF method where the smoothing kernel is a q-Gaussian distribution,
which is a power-law generalization of the Gaussian distribution, resulting from nonex-
tensive information theory. First, we prove an important result related to the co-moment
of multivariate q-Gaussian distribution. We show that the multivariate q-Gaussian dis-
tribution satisfies all the conditions for smoothing kernels discussed in [35]. The “shape
parameter” q, which controls the power-law behavior of q-Gaussian, also controls the
smoothness of the convolution, thereby providing additional tuning.
We illustrate methods for gradient and Hessian estimation using q-Gaussian smooth-
ing kernel. We also present two-timescale algorithms for stochastic optimization using
q-Gaussian based SF, and prove the convergence of the proposed algorithms to the neigh-
bourhood of a local optimum. Further, we perform simulations on a queuing network to
illustrate the benefits of the q-Gaussian based SF algorithms compared to their Gaussian
counterparts.
1.4 Organization of this report
The rest of the report is organized as follows. The framework for the optimization problem and some preliminaries are presented in Chapter 2. Some general results regarding the multivariate q-Gaussian distribution are proved in Chapter 3, where we also provide some insights regarding optimization using the q-Gaussian SF. Chapter 4 deals with gradient descent algorithms using the q-Gaussian SF: it presents the approach to gradient estimation, the corresponding algorithms and their convergence. Similar results and proposed algorithms for adaptive Newton-based search are presented in Chapter 5. Chapter 6 deals with the implementation of the proposed algorithms and with simulations based on a numerical setting. Finally, Chapter 7 provides the concluding remarks.
Chapter 2
Background and Preliminaries
In this chapter, we first describe the framework of the optimization problem. We then discuss the Smoothed Functional scheme, which is commonly used to estimate derivatives of a stochastic function. Finally, we give a brief overview of the generalization of the Gaussian distribution studied in nonextensive information theory.
2.1 Problem Framework
We assume there exists a stochastic process, and a cost function associated with it. A few
assumptions are made regarding the process and its associated cost, which are reasonable
even in a real-world scenario.
Let {Yn : n ∈ N} ⊂ Rd be a parameterized Markov process, depending on a tunable parameter θ ∈ C, where C is a compact and convex subset of RN. Let Pθ(x, dy) denote the transition kernel of {Yn} when the operative parameter is θ ∈ C. Let h : Rd → R+ ∪ {0} be a Lipschitz continuous cost function associated with the process.

Assumption I. The process {Yn} is ergodic for any given θ as the operative parameter, i.e., as L → ∞,

\frac{1}{L} \sum_{m=0}^{L-1} h(Y_m) \to E_{\nu_\theta}[h(Y)],

where νθ is the stationary distribution of {Yn}.
Our objective is to minimize the long-run average cost

J(\theta) = \lim_{L \to \infty} \frac{1}{L} \sum_{m=0}^{L-1} h(Y_m) = \int_{\mathbb{R}^d} h(x) \, \nu_\theta(dx),   (2.1)

by choosing an appropriate θ ∈ C. The existence of the above limit is assured by Assumption I and the fact that h is continuous, hence measurable. In addition, we assume that the average cost J(θ) satisfies the following condition:

Assumption II. The function J(.) is twice continuously differentiable with respect to any θ ∈ C, with a bounded third derivative.
Definition 2.1 (Non-anticipative sequence). A random sequence of parameter vectors, (θ(n))_{n≥0} ⊂ C, controlling a process {Yn} ⊂ Rd, is said to be non-anticipative if the conditional probability P(Y_{n+1} ∈ dy | F_n) = Pθ(Y_n, dy) almost surely for all n ≥ 0 and all Borel sets dy ⊂ Rd, where F_n = σ(θ(m), Y_m, m ≤ n) is the associated σ-field.

It can be verified that under a non-anticipative parameter sequence (θ(n)), the given process along with the updates, (Y_n, θ(n))_{n≥0}, is Markov. We assume the existence of a stochastic Lyapunov function.
Assumption III. Let (θ(n)) be a sequence of random parameters, obtained using an iterative scheme, controlling the process {Yn}, and let F_n = σ(θ(m), Y_m, m ≤ n), n ≥ 0, be the sequence of associated σ-fields. There exist ε0 > 0, a compact set K ⊂ Rd, and a continuous function V : Rd → R+ ∪ {0}, with lim_{‖x‖→∞} V(x) = ∞, such that under any non-anticipative sequence (θ(n))_{n≥0},

(i) sup_n E[V(Y_n)²] < ∞, and

(ii) E[V(Y_{n+1}) | F_n] ≤ V(Y_n) − ε0, whenever Y_n ∉ K, n ≥ 0.

While Assumption II is a technical requirement, Assumption III ensures that the process under a tunable parameter remains stable. Assumption III would not be required if, in addition, the single-stage cost function h were bounded.
2.2 Smoothed Functionals
Here, we present the idea behind the smoothed functional approach proposed by Katkovnik and Kulchitsky [24]. We consider a real-valued function f : C → R, defined over a compact set C. Its smoothed functional is defined as

S_\beta[f(\theta)] = \int_{-\infty}^{\infty} G_\beta(\eta) f(\theta - \eta) \, d\eta = \int_{-\infty}^{\infty} G_\beta(\theta - \eta) f(\eta) \, d\eta,   (2.2)

where G_β : RN → R is a kernel function, with a parameter β taking values in R. The idea behind using smoothed functionals is that if f(θ) is not well-behaved, i.e., it has a fluctuating character, then S_β[f(θ)] is “better-behaved”. This helps an optimization algorithm with objective function f(θ) avoid getting stuck in local minima and move towards the global minimum. The parameter β controls the degree of smoothness.
Rubinstein [35] established that the SF algorithm achieves these properties if the kernel function satisfies the following sufficient conditions:

(P1) G_\beta(\eta) = \frac{1}{\beta^N} G\!\left(\frac{\eta}{\beta}\right), where G(x) denotes G_\beta(x) for the case β = 1, i.e., G\!\left(\frac{\eta}{\beta}\right) = G_1\!\left(\frac{\eta^{(1)}}{\beta}, \frac{\eta^{(2)}}{\beta}, \ldots, \frac{\eta^{(N)}}{\beta}\right),

(P2) G_β(η) is piecewise differentiable in η,

(P3) G_β(η) is a probability distribution function, i.e., S_β[f(θ)] = E_{G_β(η)}[f(θ − η)],

(P4) lim_{β→0} G_β(η) = δ(η), where δ(η) is the Dirac delta function, and

(P5) lim_{β→0} S_β[f(θ)] = f(θ).
A two-sided form of the SF is defined as

S'_\beta[f(\theta)] = \frac{1}{2} \int_{-\infty}^{\infty} G_\beta(\eta) \big( f(\theta - \eta) + f(\theta + \eta) \big) \, d\eta.

This can be rewritten as

S'_\beta[f(\theta)] = \frac{1}{2} \int_{-\infty}^{\infty} G_\beta(\theta - \eta) f(\eta) \, d\eta + \frac{1}{2} \int_{-\infty}^{\infty} G_\beta(\eta - \theta) f(\eta) \, d\eta.   (2.3)
The normal distribution satisfies the above conditions and has been used as a kernel in [24, 41]. The SF approach provides a method [7] for estimating the gradient or Hessian of any function satisfying Assumptions I–III, as shown in [5], where the Gaussian smoothing kernel is used. The gradient estimator obtained using (2.2) is given by

\nabla_\theta J(\theta) \approx \frac{1}{\beta M L} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} \eta(n) h(Y_m)   (2.4)

for large M, L and small β. The stochastic process {Ym} is governed by the parameter (θ(n) + βη(n)), where θ(n) ∈ C ⊂ RN is obtained through an iterative scheme, and η(n) = (η^{(1)}(n), . . . , η^{(N)}(n))^T is an N-dimensional vector composed of i.i.d. standard normal random variables η^{(1)}(n), . . . , η^{(N)}(n). Similarly, a two-simulation gradient estimator has been suggested using (2.3), of the following form

\nabla_\theta J(\theta) \approx \frac{1}{2\beta M L} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} \eta(n) \big( h(Y_m) - h(Y'_m) \big)   (2.5)

for large M, L and small β, where {Ym} and {Y'm} are two processes governed by the parameters (θ(n) + βη(n)) and (θ(n) − βη(n)), respectively, with θ(n) and η(n) defined as earlier. The respective one- and two-simulation estimates for the Hessian case are given by

\nabla^2_\theta J(\theta) \approx \frac{1}{\beta^2 M L} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} H(\eta(n)) h(Y_m)   (2.6)

and

\nabla^2_\theta J(\theta) \approx \frac{1}{2\beta^2 M L} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} H(\eta(n)) \big[ h(Y_m) + h(Y'_m) \big],   (2.7)

where H(η) is the matrix with entries H(\eta)_{i,j} = \big(\eta^{(i)}\big)^2 - 1 for i = j, and \eta^{(i)} \eta^{(j)} for i \neq j.
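To make the Gaussian SF estimators concrete, the following Python sketch assembles the one-simulation estimates (2.4) and (2.6) for a generic single-stage cost. The function `simulate_cost` and its signature are placeholders standing in for the parameterized simulation {Ym}; they are not part of the original text.

```python
import numpy as np

def gaussian_sf_estimates(theta, simulate_cost, beta=0.1, M=100, L=50, rng=None):
    """One-simulation Gaussian SF estimates of the gradient (2.4) and Hessian (2.6).

    `simulate_cost(param, L)` is a placeholder that returns L single-stage costs
    h(Y_0), ..., h(Y_{L-1}) from a simulation run with the given parameter.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(theta)
    grad = np.zeros(N)
    hess = np.zeros((N, N))
    for _ in range(M):
        eta = rng.standard_normal(N)                  # perturbation ~ N(0, I)
        costs = simulate_cost(theta + beta * eta, L)  # h(Y_m), m = 0..L-1
        mean_cost = np.mean(costs)
        grad += eta * mean_cost
        H = np.outer(eta, eta) - np.eye(N)            # H(eta): off-diag eta_i*eta_j, diag eta_i^2 - 1
        hess += H * mean_cost
    return grad / (beta * M), hess / (beta ** 2 * M)
```

The q-Gaussian counterparts of these estimators are developed in Chapters 3–5.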
2.3 q-Gaussian distribution
The q-Gaussian distribution was initially developed to describe the process of Levy superdiffusion [32], but has later been studied in other fields, such as finance [37]. Its importance lies in its power-law nature, due to which the tails of the q-Gaussian decay at a slower rate than those of the Gaussian distribution, depending on the choice of q.
It results from maximizing Tsallis entropy under certain ‘deformed’ moment constraints, known as normalized q-expectations, defined as

\langle f \rangle_q = \frac{\int_{\mathbb{R}} f(x) p(x)^q \, dx}{\int_{\mathbb{R}} p(x)^q \, dx}.   (2.8)

This form of expectation considers an escort distribution

p_q(x) = \frac{p(x)^q}{\int_{\mathbb{R}} p(x)^q \, dx},

and has been shown to be compatible with the foundations of nonextensive statistics [46].
Prato and Tsallis [32] maximized Tsallis entropy under the constraints \langle x \rangle_q = \mu_q and \langle (x - \mu_q)^2 \rangle_q = \beta_q^2, which are known as the q-mean and q-variance, respectively. These are generalizations of the standard first and second moments, and tend to the usual mean and variance, respectively, as q → 1. This results in the q-Gaussian distribution of the form

G_{q,\beta}(x) = \frac{1}{\beta_q K_q} \left( 1 - \frac{(1-q)}{(3-q)\beta_q^2} (x - \mu_q)^2 \right)_{+}^{\frac{1}{1-q}} \quad \text{for all } x \in \mathbb{R},   (2.9)

where y_+ = \max(y, 0) denotes the Tsallis cut-off condition [45], which ensures that the above expression is well-defined, and K_q is the normalizing constant, given by

K_q =
\begin{cases}
\dfrac{\sqrt{\pi}\sqrt{3-q}}{\sqrt{1-q}} \dfrac{\Gamma\!\left(\frac{2-q}{1-q}\right)}{\Gamma\!\left(\frac{5-3q}{2(1-q)}\right)} & \text{for } -\infty < q < 1, \\[2ex]
\dfrac{\sqrt{\pi}\sqrt{3-q}}{\sqrt{q-1}} \dfrac{\Gamma\!\left(\frac{3-q}{2(q-1)}\right)}{\Gamma\!\left(\frac{1}{q-1}\right)} & \text{for } 1 < q < 3,
\end{cases}
with Γ being the Gamma function¹, which exists because its arguments are positive over the specified intervals.

The function defined in (2.9) is not integrable for q ≥ 3, and hence the q-Gaussian is a probability density function only when q < 3. Further, it has been shown by Prato and Tsallis [32] that the variance of the above distribution is finite only for q < 5/3, and is given by \beta = \sqrt{\frac{3-q}{5-3q}}\, \beta_q. In this report, we work with the expression involving the variance, instead of the q-variance, as it is more compatible with the analysis in the stochastic setting.
A multivariate form of the q-Gaussian distribution has been proposed in [47]. Vignat and Plastino [48] provided an explicit form of this distribution, which will be used in this report. Taking the usual covariance matrix of the N-variate distribution to be of the form E[XX^T] = β²I_{N×N}, it is defined as

G_{q,\beta}(X) = \frac{1}{\beta^N K_{q,N}} \left( 1 - \frac{(1-q)}{\big((N+4)-(N+2)q\big)} \frac{\|X\|^2}{\beta^2} \right)_{+}^{\frac{1}{1-q}} \quad \text{for all } X \in \mathbb{R}^N,   (2.10)

where K_{q,N} is the normalizing constant given by

K_{q,N} =
\begin{cases}
\left( \dfrac{(N+4)-(N+2)q}{1-q} \right)^{\frac{N}{2}} \pi^{N/2} \dfrac{\Gamma\!\left(\frac{2-q}{1-q}\right)}{\Gamma\!\left(\frac{2-q}{1-q} + \frac{N}{2}\right)} & \text{for } q < 1, \\[2ex]
\left( \dfrac{(N+4)-(N+2)q}{q-1} \right)^{\frac{N}{2}} \pi^{N/2} \dfrac{\Gamma\!\left(\frac{1}{q-1} - \frac{N}{2}\right)}{\Gamma\!\left(\frac{1}{q-1}\right)} & \text{for } 1 < q < \left(1 + \frac{2}{N+2}\right).
\end{cases}   (2.11)
The multivariate normal distribution is obtained as a special case when q → 1. A similar distribution can also be obtained by maximizing Renyi entropy [12]. In this report, we study the multivariate q-Gaussian distribution and develop smoothed functional algorithms based on it.
¹The Gamma function is defined as \Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt for z ∈ C, Re(z) > 0.
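As a quick illustration of (2.10)–(2.11), the following Python sketch evaluates the N-variate q-Gaussian density at a point. The function name and interface are ours, not from the report, and the Gaussian limit q = 1 is not handled.

```python
import numpy as np
from scipy.special import gammaln

def q_gaussian_density(x, q, beta=1.0):
    """Density of the N-variate q-Gaussian (2.10) with covariance beta^2 * I,
    for q < 1 or 1 < q < 1 + 2/(N+2); q = 1 is the Gaussian limit (not handled)."""
    x = np.asarray(x, dtype=float)
    N = x.size
    c = (N + 4) - (N + 2) * q                      # recurring constant (N+4)-(N+2)q
    if q < 1:
        log_K = 0.5 * N * np.log(c / (1 - q)) + 0.5 * N * np.log(np.pi) \
                + gammaln((2 - q) / (1 - q)) - gammaln((2 - q) / (1 - q) + N / 2)
    else:
        log_K = 0.5 * N * np.log(c / (q - 1)) + 0.5 * N * np.log(np.pi) \
                + gammaln(1 / (q - 1) - N / 2) - gammaln(1 / (q - 1))
    base = 1 - (1 - q) * np.dot(x, x) / (c * beta ** 2)
    if base <= 0:                                  # Tsallis cut-off: outside the support
        return 0.0
    return base ** (1 / (1 - q)) / (beta ** N * np.exp(log_K))
```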
Chapter 3
Some properties of multivariate
q-Gaussian
Before going into further analysis of q-Gaussians as smoothing kernels, we look at the support set of the multivariate q-Gaussian distribution with covariance β²I_{N×N}. We denote the support set as

\Omega_q =
\begin{cases}
\left\{ x \in \mathbb{R}^N : \|x\|^2 < \dfrac{\big((N+4)-(N+2)q\big)\beta^2}{(1-q)} \right\} & \text{for } q < 1, \\[1ex]
\mathbb{R}^N & \text{for } 1 < q < \left(1 + \dfrac{2}{N+2}\right).
\end{cases}   (3.1)

A standard q-Gaussian distribution has mean zero and unit variance, so the support set of the standard distribution is obtained by substituting β = 1 above.
3.1 Generalized co-moments of joint q-Gaussian distribution
We first state the following result, which provides an expression for the moments of an N-variate q-Gaussian distributed random vector. This is a consequence of the results presented in [19]. The result is considered only for q < (1 + 2/(N+2)), since above this interval the variance of the q-Gaussian is not finite [32].
Proposition 3.1. Suppose X = (X^{(1)}, X^{(2)}, . . . , X^{(N)}) ∈ R^N is a random vector whose components are uncorrelated and identically distributed, each according to a q-Gaussian distribution with zero mean and unit variance, where the parameter q ∈ (−∞, 1) ∪ (1, 1 + 2/(N+2)). Also, let \rho(X) = \left(1 - \frac{(1-q)}{\big((N+4)-(N+2)q\big)}\|X\|^2\right). Then, for any b, b_1, b_2, . . . , b_N ∈ Z^+ ∪ {0} we have

E_{G_q}\!\left[ \frac{\big(X^{(1)}\big)^{b_1} \big(X^{(2)}\big)^{b_2} \cdots \big(X^{(N)}\big)^{b_N}}{\big(\rho(X)\big)^{b}} \right] =
\begin{cases}
K \left( \dfrac{(N+4)-(N+2)q}{1-q} \right)^{\sum_{i=1}^{N} \frac{b_i}{2}} \left( \prod_{i=1}^{N} \dfrac{b_i!}{2^{b_i} \left(\frac{b_i}{2}\right)!} \right), & \text{if } b_i \text{ is even for all } i = 1, 2, \ldots, N, \\[2ex]
0 & \text{otherwise},
\end{cases}   (3.2)

where

K =
\begin{cases}
\dfrac{\Gamma\!\left(\frac{1}{1-q} - b + 1\right) \Gamma\!\left(\frac{1}{1-q} + 1 + \frac{N}{2}\right)}{\Gamma\!\left(\frac{1}{1-q} + 1\right) \Gamma\!\left(\frac{1}{1-q} - b + 1 + \frac{N}{2} + \sum_{i=1}^{N} \frac{b_i}{2}\right)} & \text{if } q \in (-\infty, 1), \\[3ex]
\dfrac{\Gamma\!\left(\frac{1}{q-1}\right) \Gamma\!\left(\frac{1}{q-1} + b - \frac{N}{2} - \sum_{i=1}^{N} \frac{b_i}{2}\right)}{\Gamma\!\left(\frac{1}{q-1} + b\right) \Gamma\!\left(\frac{1}{q-1} - \frac{N}{2}\right)} & \text{if } q \in \left(1, 1 + \frac{2}{N+2}\right),
\end{cases}   (3.3)

which exists only if the above Gamma functions exist. Further, the Gamma functions exist under the condition b < \left(1 + \frac{1}{1-q}\right) if q < 1, and \left(\frac{1}{q-1} + b - \frac{N}{2} - \sum_{i=1}^{N} \frac{b_i}{2}\right) > 0 for 1 < q < \left(1 + \frac{2}{N+2}\right).
Proof. Since ρ(X) is non-negative over Ω_q, we have

E_{G_q(X)}\!\left[ \frac{\big(X^{(1)}\big)^{b_1} \big(X^{(2)}\big)^{b_2} \cdots \big(X^{(N)}\big)^{b_N}}{\big(\rho(X)\big)^{b}} \right]
= \frac{1}{K_{q,N}} \int_{\Omega_q} \big(x^{(1)}\big)^{b_1} \big(x^{(2)}\big)^{b_2} \cdots \big(x^{(N)}\big)^{b_N} \left( 1 - \frac{(1-q)}{\big((N+4)-(N+2)q\big)} \|x\|^2 \right)^{\frac{1}{1-q} - b} dx.
The second equality in (3.2) is easily proved: if b_i is odd for some i = 1, . . . , N, then the integrand above is an odd function, and its integral over Ω_q, which is symmetric with respect to every axis by definition, is zero. For the other cases, since the integrand is even, the integral is the same over every orthant. Hence, we may consider the integration over the first orthant, i.e., where each component is positive. For q < 1, we can reduce the above integral, using [19, Eq. 4.635], to obtain

E_{G_q(X)}\!\left[ \frac{\big(X^{(1)}\big)^{b_1} \big(X^{(2)}\big)^{b_2} \cdots \big(X^{(N)}\big)^{b_N}}{\big(\rho(X)\big)^{b}} \right]
= \frac{\prod_{i=1}^{N} \Gamma\!\left(\frac{b_i+1}{2}\right)}{K_{q,N}\, \Gamma(\bar{b})} \left( \frac{(N+4)-(N+2)q}{1-q} \right)^{\bar{b}} \int_0^1 (1-y)^{\left(\frac{1}{1-q}-b\right)} y^{(\bar{b}-1)} \, dy,

where we set \bar{b} = \left( \frac{N}{2} + \sum_{i=1}^{N} \frac{b_i}{2} \right). We can observe that the integral is in the form of a Beta function¹. Since the b_i's are even, we can expand \Gamma\!\left(\frac{b_i+1}{2}\right) using the expression for the Gamma function at half-integers to get \Gamma\!\left(\frac{b_i+1}{2}\right) = \frac{b_i!}{2^{b_i}\left(\frac{b_i}{2}\right)!}\sqrt{\pi}. The claim is obtained by substituting K_{q,N} from (2.11) and using the relation B(m, n) = \frac{\Gamma(m)\Gamma(n)}{\Gamma(m+n)}. It is easy to verify that all the Gamma functions in the equality exist provided b < \left(1 + \frac{1}{1-q}\right). The result for the interval 1 < q < \left(1 + \frac{2}{N+2}\right) can be proved in a similar way (see Eqs. 4.635 and 4.636 of [19]).
Corollary 3.2. In the limiting case as q → 1,

\lim_{q \to 1} E_{G_q(X)}\!\left[ \frac{\big(X^{(1)}\big)^{b_1} \big(X^{(2)}\big)^{b_2} \cdots \big(X^{(N)}\big)^{b_N}}{\big(\rho(X)\big)^{b}} \right] = \prod_{i=1}^{N} E_{G(X)}\!\left[ \big(X^{(i)}\big)^{b_i} \right].

Proof. As q → 1, the term \frac{(1-q)}{\big((N+4)-(N+2)q\big)}\|X\|^2 → 0, and hence ρ(X) → 1. Also, in the limiting case we retrieve the normal distribution, for which uncorrelatedness implies independence. So the above expression reduces to a product of individual moments.
¹The Beta function is defined as B(m, n) = \int_0^1 t^{m-1} (1-t)^{n-1} \, dt for m, n ∈ C, Re(m), Re(n) > 0.
3.2 q-Gaussian as a Smoothing Kernel
The first step in applying q-Gaussians for SF algorithms is to ensure that the q-Gaussian
satisfies Rubinstein conditions [35].
Proposition 3.3. The N-dimensional q-Gaussian distribution satisfies the kernel properties (P1)–(P5) for all q < \left(1 + \frac{2}{N+2}\right), q ≠ 1.

Proof. (P1) From (2.10), it is evident that G_{q,\beta}(x) = \frac{1}{\beta^N} G_q\!\left(\frac{x}{\beta}\right).

(P2) For 1 < q < \left(1 + \frac{2}{N+2}\right), G_{q,\beta}(x) > 0 for all x ∈ R^N. Thus,

\nabla_x G_{q,\beta}(x) = - \frac{2x}{\big((N+4)-(N+2)q\big)\beta^2} \, \frac{G_{q,\beta}(x)}{\left( 1 - \frac{(1-q)}{((N+4)-(N+2)q)\beta^2} \|x\|^2 \right)}.   (3.4)

For q < 1, (3.4) holds when x ∈ Ω_q. On the other hand, when x ∉ Ω_q, we have G_{q,\beta}(x) = 0 and hence ∇_x G_{q,\beta}(x) = 0. Thus, G_{q,\beta}(x) is differentiable for q > 1, and piecewise differentiable for q < 1.

(P3) G_{q,\beta}(x) is a probability distribution for q < \left(1 + \frac{2}{N+2}\right), and hence the corresponding SF S_{q,\beta}(.), parameterized by both q and β, can be written as S_{q,\beta}[f(\theta)] = E_{G_{q,\beta}(x)}[f(\theta - x)].

(P4) G_{q,\beta} is a probability distribution satisfying lim_{β→0} G_{q,\beta}(0) = ∞. So, lim_{β→0} G_{q,\beta}(x) = δ(x).

(P5) This property holds due to convergence in mean:

\lim_{\beta \to 0} S_{q,\beta}[f(\theta)] = \int_{-\infty}^{\infty} \lim_{\beta \to 0} G_{q,\beta}(x) f(\theta - x) \, dx = \int_{-\infty}^{\infty} \delta(x) f(\theta - x) \, dx = f(\theta).

Hence the claim.
From the above result, it follows that the q-Gaussian can be used as a kernel function. Hence, given a particular value q ∈ (−∞, 1) ∪ (1, 1 + 2/(N+2)) and some β > 0, the one-sided and two-sided SFs of any function f : R^N → R are respectively given by

S_{q,\beta}[f(\theta)] = \int_{\Omega_q} G_{q,\beta}(\theta - x) f(x) \, dx,   (3.5)

S'_{q,\beta}[f(\theta)] = \frac{1}{2} \int_{\Omega_q} G_{q,\beta}(\theta - x) f(x) \, dx + \frac{1}{2} \int_{\Omega_q} G_{q,\beta}(x - \theta) f(x) \, dx,   (3.6)

where the nature of the SFs is controlled by both q and β.
3.3 Optimization using q-Gaussian SF
We present an example to illustrate the smoothing properties of the Gaussian and q-Gaussian kernels for different values of q, to motivate the proposed SF method using the q-Gaussian. The results are for the one-dimensional case, where we use the above mentioned kernels to find the SF of the function f(x) = x^2 - \frac{1}{4} e^{-x^2} \cos(8\pi x). The corresponding one-sided SFs (3.5) are shown in Figure 3.1.

[Figure 3.1: (Top) the unsmoothed function; (middle row) SFs for the Gaussian and q-Gaussian kernels using β = 0.05 for q = 0.5, 1 (Gaussian case) and 1.5, respectively; (bottom row) SFs using β = 0.09 for the same values of q.]
The figures illustrate that for a particular value of β, different degrees of smoothness can be obtained by varying q. For a low value of β, the SFs obtained using q-Gaussians with q > 1 are smoother than the Gaussian SF, but as β increases, q-Gaussians with q < 1 become smoother.
We now illustrate the effect of function smoothing on optimization. The global minimum of the function is at x = 0, and there are several local extrema. We search for the global minimum of f(x) in the interval [−1, 1] using standard gradient-descent and Newton-based methods. In each iteration, we project the update onto [−1, 1] to satisfy the space constraint; we denote this by the projection function P_{[−1,1]}. We use a ‘suitable’ decreasing step-size so that the Newton-based algorithm converges. The update rules are given in Table 3.1, along with the distance of the update from the global optimum after 100 iterations.
Algorithm           Update rule                                                   Distance from optimum
Gradient-descent    x_{n+1} = x_n − (1/n) ∇_x f(x_n)                              0.00007538
Newton's method     x_{n+1} = x_n − (1/n) (∇²_x f(x_n))^{-1} ∇_x f(x_n)           0.96199169

Table 3.1: Optimization of the unsmoothed function.
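For reference, a minimal Python sketch of the two update rules in Table 3.1 is given below (100 iterations, step-size 1/n, projection onto [−1, 1]). The starting point x0 and the finite-difference second derivative are our own assumptions, so the numbers produced need not match the table exactly.

```python
import numpy as np

def f(x):    # test function from Section 3.3
    return x**2 - 0.25 * np.exp(-x**2) * np.cos(8 * np.pi * x)

def df(x):   # first derivative of f
    return (2 * x + 0.5 * x * np.exp(-x**2) * np.cos(8 * np.pi * x)
            + 2 * np.pi * np.exp(-x**2) * np.sin(8 * np.pi * x))

def d2f(x):  # second derivative via central difference (keeps the sketch short)
    h = 1e-5
    return (df(x + h) - df(x - h)) / (2 * h)

def optimize(x0, newton=False, iters=100):
    """Projected gradient-descent / Newton updates of Table 3.1 with step-size 1/n."""
    x = x0
    for n in range(1, iters + 1):
        step = df(x) / d2f(x) if newton else df(x)
        x = np.clip(x - step / n, -1.0, 1.0)   # projection P_[-1,1]
    return x

# e.g. optimize(0.3) for gradient descent, optimize(0.3, newton=True) for Newton's method
```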
Table 3.2 shows the result of optimization when the one-sided smoothed version of f is used as the objective function. The cases in which the final distance from the optimum is relatively small (< 0.01) are of particular interest. It can be observed that the gradient-based approach gives better performance than Newton's method. Moreover, for higher values of β, the error decreases, which is expected as the function becomes smoother. However, the scenario changes in the stochastic case, where there is an additional error that increases for higher values of β.
Gradient-descent:
q \ β       0.050        0.100        0.150
0.00        0.00000003   0.10686676   0.00000019
0.25        0.00000012   0.10177132   0.00000014
0.50        0.24282084   0.08734088   0.00000019
0.75        0.24294724   0.00000433   0.00000017
Gaussian    0.00000014   0.00000011   0.00332414
1.25        0.00067325   0.22314089   0.00000254
1.50        0.72345646   0.23916780   0.00888054

Newton's method:
q \ β       0.050        0.100        0.150
0.00        0.24342130   0.11581702   0.01461908
0.25        0.24127418   0.09847975   0.00193110
0.50        0.48939828   0.08670668   0.00016341
0.75        0.24615714   0.00164180   0.00063249
Gaussian    0.73622977   0.00206699   0.00018992
1.25        0.49121175   0.23889724   0.00013602
1.50        0.48663598   0.00239447   0.22118680

Table 3.2: Distance from the optimum after 100 iterations for gradient-descent (top) and Newton's method (bottom), using the one-sided smoothed function as the objective.
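The smoothed objectives behind Table 3.2 can be reproduced (up to quadrature error) with a small sketch like the one below, which evaluates the one-sided SF (3.5) of the test function by numerical convolution with the univariate q-Gaussian kernel of Section 2.3. The grid, function names and interface are our own choices.

```python
import numpy as np
from scipy.special import gamma

def q_gaussian_1d(x, q, beta):
    """Univariate q-Gaussian density with variance beta^2 (the N = 1 case of (2.10)-(2.11)),
    valid for q < 5/3, q != 1."""
    c = 5 - 3 * q                                   # (N+4)-(N+2)q with N = 1
    if q < 1:
        K = np.sqrt(np.pi * c / (1 - q)) * gamma((2 - q) / (1 - q)) / gamma((2 - q) / (1 - q) + 0.5)
    else:
        K = np.sqrt(np.pi * c / (q - 1)) * gamma(1 / (q - 1) - 0.5) / gamma(1 / (q - 1))
    base = np.maximum(1 - (1 - q) * x**2 / (c * beta**2), 0.0)   # Tsallis cut-off
    return base ** (1 / (1 - q)) / (beta * K)

def smoothed(f, theta, q, beta, grid=np.linspace(-3, 3, 4001)):
    """One-sided SF (3.5): numerical convolution of f with the q-Gaussian kernel."""
    w = q_gaussian_1d(theta - grid, q, beta)
    return np.trapz(w * f(grid), grid)

# e.g. smoothed(f, 0.2, q=1.5, beta=0.05), with f from the previous sketch
```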
Chapter 4
q-Gaussian based Gradient Descent
Algorithms
4.1 Gradient Estimation with q-Gaussian SF
The objective is to estimate the gradient of the average cost, ∇_θJ(θ), using the SF approach, where the existence of ∇_θJ(θ) follows from Assumption II.
4.1.1 One-simulation q-Gaussian SF Gradient Estimate
Rubinstein [35] defined the gradient of the smoothed functional (the smoothed gradient) as

\nabla_\theta S_{q,\beta}[J(\theta)] = \int_{\Omega_q} \nabla_\theta G_{q,\beta}(\theta - \eta) J(\eta) \, d\eta;

recall that Ω_q is the support set defined in (3.1). As there is no functional relationship between θ and η over Ω_q, i.e., \frac{d\eta^{(j)}}{d\theta^{(i)}} = 0 for all i, j,

\nabla^{(i)}_\theta G_{q,\beta}(\theta - \eta)
= \frac{1}{\beta^N K_{q,N}} \, \frac{2\big(\eta^{(i)} - \theta^{(i)}\big)}{\beta^2 \big((N+4)-(N+2)q\big)} \left( 1 - \frac{(1-q) \sum_{k=1}^{N} \big(\theta^{(k)} - \eta^{(k)}\big)^2}{\big((N+4)-(N+2)q\big)\beta^2} \right)^{\frac{q}{1-q}}
= \frac{2}{\beta^2 \big((N+4)-(N+2)q\big)} \, \frac{\big(\eta^{(i)} - \theta^{(i)}\big)}{\rho\!\left(\frac{\theta - \eta}{\beta}\right)} \, G_{q,\beta}(\theta - \eta),   (4.1)
where \rho(\eta) = \left( 1 - \frac{(1-q)}{\big((N+4)-(N+2)q\big)} \|\eta\|^2 \right). Hence, substituting η' = (η − θ)/β, and using the symmetry of G_{q,\beta}(.) and ρ(.), we can write

\nabla_\theta S_{q,\beta}[J(\theta)]
= \frac{2}{\beta \big((N+4)-(N+2)q\big)} \int_{\Omega_q} \frac{\eta'}{\rho(\eta')} \, G_q(\eta') \, J(\theta + \beta\eta') \, d\eta'
= \frac{2}{\beta \big((N+4)-(N+2)q\big)} \, E_{G_q(\eta')}\!\left[ \frac{\eta'}{\rho(\eta')} J(\theta + \beta\eta') \,\Big|\, \theta \right].   (4.2)
In the sequel (Proposition 4.10), we show that ‖∇_θS_{q,β}[J(θ)] − ∇_θJ(θ)‖ → 0 as β → 0. Hence, for large M and small β, the form of the gradient estimate suggested by (4.2) is

\nabla_\theta J(\theta) \approx \frac{2}{\beta \big((N+4)-(N+2)q\big) M} \sum_{n=0}^{M-1} \frac{\eta(n) \, J(\theta + \beta\eta(n))}{\rho(\eta(n))},   (4.3)
where η(0), η(1), . . . , η(M−1) are uncorrelated, identically distributed standard q-Gaussian random vectors. Considering that in two-timescale algorithms (discussed in the next section) the value of θ is updated concurrently with the gradient estimation procedure, we estimate ∇_θJ(θ(n)) at each stage. Using an approximation of (2.1), we can write (4.3) as

\nabla_\theta J(\theta(n)) \approx \frac{2}{\beta M L \big((N+4)-(N+2)q\big)} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} \frac{\eta(n) \, h(Y_m)}{\left( 1 - \frac{(1-q)}{((N+4)-(N+2)q)} \|\eta(n)\|^2 \right)}   (4.4)

for large L, where the process {Y_m} has the same transition kernel as defined in Assumption I, except that it is governed by the parameter (θ(n) + βη(n)).
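A direct transcription of the one-simulation estimate (4.4) is sketched below. Here `sample_q_gaussian` is a placeholder for a standard N-variate q-Gaussian generator (generation is discussed separately, in Section 6.2 of the report), and `simulate_cost` again stands in for the parameterized simulation.

```python
import numpy as np

def gq_sf1_gradient(theta, simulate_cost, sample_q_gaussian, q, beta=0.05, M=100, L=50):
    """One-simulation q-Gaussian SF gradient estimate (4.4).

    `sample_q_gaussian(N)` must return a standard N-variate q-Gaussian vector (placeholder),
    and `simulate_cost(param, L)` returns L single-stage costs under that parameter.
    """
    N = len(theta)
    c = (N + 4) - (N + 2) * q
    grad = np.zeros(N)
    for _ in range(M):
        eta = sample_q_gaussian(N)
        rho = 1 - (1 - q) * np.dot(eta, eta) / c          # rho(eta) from Proposition 3.1
        mean_cost = np.mean(simulate_cost(theta + beta * eta, L))
        grad += eta * mean_cost / rho
    return 2 * grad / (beta * c * M)
```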
4.1.2 Two-simulation q-Gaussian SF Gradient Estimate
In a similar manner, based on (2.3), the gradient of the two-sided SF can be written as

\nabla_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{2} \int_{\Omega_q} \nabla_\theta G_{q,\beta}(\theta - \eta) J(\eta) \, d\eta + \frac{1}{2} \int_{\Omega_q} \nabla_\theta G_{q,\beta}(\eta - \theta) J(\eta) \, d\eta.   (4.5)
The first integral can be obtained as in (4.2). The second integral is evaluated as

\int_{\Omega_q} \nabla_\theta G_{q,\beta}(\eta - \theta) J(\eta) \, d\eta
= \frac{2}{\beta^2 \big((N+4)-(N+2)q\big)} \int_{\Omega_q} \frac{(\eta - \theta)}{\rho\!\left(\frac{\eta - \theta}{\beta}\right)} \, G_{q,\beta}(\eta - \theta) J(\eta) \, d\eta
= - \frac{2}{\beta \big((N+4)-(N+2)q\big)} \int_{\Omega_q} \frac{\eta'}{\rho(\eta')} \, G_q(\eta') \, J(\theta - \beta\eta') \, d\eta',

where η' = (θ − η)/β. Thus, we obtain the gradient as a conditional expectation

\nabla_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{\beta \big((N+4)-(N+2)q\big)} \, E_{G_q(\eta)}\!\left[ \frac{\eta}{\rho(\eta)} \big( J(\theta + \beta\eta) - J(\theta - \beta\eta) \big) \,\Big|\, \theta \right].   (4.6)
In the sequel (Proposition 4.12) we show that ‖∇_θS'_{q,β}[J(θ)] − ∇_θJ(θ)‖ → 0 as β → 0, which can be used to approximate (4.6) as

\nabla_\theta J(\theta(n)) \approx \frac{1}{\beta M L \big((N+4)-(N+2)q\big)} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} \frac{\eta(n) \big( h(Y_m) - h(Y'_m) \big)}{\left( 1 - \frac{(1-q)}{((N+4)-(N+2)q)} \|\eta(n)\|^2 \right)}   (4.7)

for large M, L and small β, where {Y_m} and {Y'_m} are governed by (θ(n) + βη(n)) and (θ(n) − βη(n)), respectively.
4.2 Proposed Gradient Descent Algorithms
In this section, we propose two-timescale algorithms corresponding to the estimates ob-
tained in (4.4) and (4.7). Let (a(n)) and (b(n)) be two step-size sequences satisfying the
following assumption.
Assumption IV. (a(n))_{n≥0} and (b(n))_{n≥0} are positive step-size sequences satisfying

\sum_{n=0}^{\infty} a(n)^2 < \infty, \quad \sum_{n=0}^{\infty} b(n)^2 < \infty, \quad \sum_{n=0}^{\infty} a(n) = \sum_{n=0}^{\infty} b(n) = \infty, \quad \text{and} \quad a(n) = o(b(n)), \text{ i.e., } \frac{a(n)}{b(n)} \to 0 \text{ as } n \to \infty.
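For instance, an illustrative choice (not prescribed by the report) is a(n) = 1/(n+1) and b(n) = 1/(n+1)^{2/3}: both \sum a(n)^2 and \sum b(n)^2 converge, both \sum a(n) and \sum b(n) diverge, and a(n)/b(n) = (n+1)^{-1/3} \to 0, so Assumption IV holds.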
For θ = (θ^{(1)}, . . . , θ^{(N)})^T ∈ R^N, let P_C(θ) = (P_C(θ^{(1)}), . . . , P_C(θ^{(N)}))^T represent the projection of θ onto the set C. The quantities (Z^{(i)}(n), i = 1, . . . , N)_{n≥0} are used to estimate ∇_θJ(θ) via the recursions below.

The algorithms require the generation of N-dimensional random vectors consisting of uncorrelated q-Gaussian distributed random variates. However, there is no standard algorithm in the literature for this multivariate case. This issue is addressed later, in Section 6.2.
Algorithm 1 : The Gq-SF1 Algorithm

1: Fix M, L, q and β.
2: Set Z^{(i)}(0) = 0, i = 1, . . . , N.
3: Fix the parameter vector θ(0) = (θ^{(1)}(0), θ^{(2)}(0), . . . , θ^{(N)}(0))^T.
4: for n = 0 to M − 1 do
5:   Generate a random vector η(n) = (η^{(1)}(n), η^{(2)}(n), . . . , η^{(N)}(n))^T from a standard N-dimensional q-Gaussian distribution.
6:   for m = 0 to L − 1 do
7:     Generate the simulation Y_{nL+m} governed by the parameter (θ(n) + βη(n)).
8:     for i = 1 to N do
9:       Z^{(i)}(nL+m+1) = (1 − b(n)) Z^{(i)}(nL+m) + b(n) \left[ \dfrac{2\eta^{(i)}(n) h(Y_{nL+m})}{\beta\big((N+4)-(N+2)q\big)\left(1 - \frac{(1-q)}{((N+4)-(N+2)q)}\|\eta(n)\|^2\right)} \right].
10:      end for
11:   end for
12:   for i = 1 to N do
13:     θ^{(i)}(n+1) = P_C\big(θ^{(i)}(n) − a(n) Z^{(i)}(nL)\big).
14:   end for
15:   Set θ(n+1) = (θ^{(1)}(n+1), θ^{(2)}(n+1), . . . , θ^{(N)}(n+1))^T.
16: end for
17: Output θ(M) = (θ^{(1)}(M), . . . , θ^{(N)}(M))^T as the final parameter vector.
The Gq-SF2 algorithm is similar to the Gq-SF1 algorithm, except that we use two
parallel simulations YnL+m and Y ′nL+m governed with parameters (θ(n) + βη(n)) and
(θ(n)−βη(n)) respectively, and update the gradient estimate, in Step 9, using the single-
stage cost function of both simulations as in (4.7).
Algorithm 2 : The Gq-SF2 Algorithm

1: Fix M, L, q and β.
2: Set Z^{(i)}(0) = 0, i = 1, . . . , N.
3: Fix the parameter vector θ(0) = (θ^{(1)}(0), θ^{(2)}(0), . . . , θ^{(N)}(0))^T.
4: for n = 0 to M − 1 do
5:   Generate a random vector η(n) = (η^{(1)}(n), η^{(2)}(n), . . . , η^{(N)}(n))^T from a standard N-dimensional q-Gaussian distribution.
6:   for m = 0 to L − 1 do
7:     Generate two simulations Y_{nL+m} and Y'_{nL+m} governed by the control parameters (θ(n) + βη(n)) and (θ(n) − βη(n)), respectively.
8:     for i = 1 to N do
9:       Z^{(i)}(nL+m+1) = (1 − b(n)) Z^{(i)}(nL+m) + b(n) \left[ \dfrac{\eta^{(i)}(n)\big(h(Y_{nL+m}) - h(Y'_{nL+m})\big)}{\beta\big((N+4)-(N+2)q\big)\left(1 - \frac{(1-q)}{((N+4)-(N+2)q)}\|\eta(n)\|^2\right)} \right].
10:      end for
11:   end for
12:   for i = 1 to N do
13:     θ^{(i)}(n+1) = P_C\big(θ^{(i)}(n) − a(n) Z^{(i)}(nL)\big).
14:   end for
15:   Set θ(n+1) = (θ^{(1)}(n+1), θ^{(2)}(n+1), . . . , θ^{(N)}(n+1))^T.
16: end for
17: Output θ(M) = (θ^{(1)}(M), . . . , θ^{(N)}(M))^T as the final parameter vector.
4.3 Convergence of Gradient SF Algorithms
We now examine the convergence of the algorithms proposed in Section 4.2. The analysis presented in the shorter version of this report [18] was along the lines of [5]. Here, we deviate from that approach and provide a more straightforward technique to prove the convergence of the algorithms to a local optimum.
4.3.1 Convergence of Gq-SF1 Algorithm
First, let us consider the update along the faster timescale, i.e., Step 9 of the Gq-SF1 algorithm. Defining b̄(p) = b(n) for nL ≤ p < (n + 1)L, it follows from Assumption IV that a(n) = o(b̄(p)), Σ_p b̄(p) = ∞ and Σ_p b̄(p)² < ∞. We can rewrite Step 9 as the following iteration, which runs for L steps:

Z(p+1) = Z(p) + b̄(p)\big[ g(Y_p) - Z(p) \big],   (4.8)

where for all nL ≤ p < (n + 1)L, g(Y_p) = \dfrac{2\eta(n)\, h(Y_p)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))}. Here, ρ(.) is defined as in Proposition 3.1, and {Y_p : p ∈ N} is a Markov process with the stationary distribution ν_{(θ(n)+βη(n))}, as the parameter updates θ(n) and η(n) are held fixed. We state the following two results from [11], adapted to our scenario. These results lead to the stability and convergence of iteration (4.8).
ν(θ(n)+βη(n)), as the parameter updates θ(n) and η(n) are held fixed. We state the following
two results from [11], adapted to our scenario. These results lead to the stability and
convergence of iteration (4.8).
Lemma 4.1. Consider the iteration x_{p+1} = x_p + γ(p) f(x_p, Y_p). Let the following conditions hold:

1. {Y_p : p ∈ N} is a Markov process satisfying Assumptions I and III,

2. for each x ∈ R^N with x_p ≡ x for all p ∈ N, {Y_p} has a unique invariant probability measure ν_x,

3. f(., .) is Lipschitz continuous in its first argument, uniformly w.r.t. the second, and

4. (γ(p))_{p≥0} are step-sizes satisfying \sum_{p=0}^{\infty} γ(p) = ∞ and \sum_{p=0}^{\infty} γ²(p) < ∞.

Then the update x_p converges to the stable fixed points of the ordinary differential equation (ODE)

\dot{x}(t) = \bar{f}\big(x(t), \nu_{x(t)}\big),

where \bar{f}(x, \nu_x) = E_{\nu_x}\big[f(x, Y)\big].

Lemma 4.2. Suppose the limit f_\infty\big(x(t)\big) = \lim_{a \uparrow \infty} \frac{\bar{f}\big(a x(t), \nu_{a x(t)}\big)}{a} exists uniformly on compacts, and furthermore, the ODE \dot{x}(t) = f_\infty\big(x(t)\big) is well-posed and has the origin as its unique globally asymptotically stable equilibrium. Then \sup_p \|x_p\| < \infty almost surely, where (x_p)_{p \in \mathbb{N}} are obtained as per the recursion in Lemma 4.1.

It can be observed that iteration (4.8) is a special case of Lemma 4.1, where the invariant measure ν_{(θ(n)+βη(n))} is independent of the updates (Z(p))_{nL < p < (n+1)L}. So we consider the following ODEs:

\dot{\theta}(t) = 0,   (4.9)

\dot{Z}(t) = \frac{2\eta(t)\, J\big(\theta(t) + \beta\eta(t)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(t))} - Z(t).   (4.10)
In view of (4.9) and the fact that these iterations are on the faster timescale, we let θ(t) ≡ θ and η(t) ≡ η (constants) in (4.10). Hence, we consider

\dot{Z}(t) = \frac{2\eta\, J(\theta + \beta\eta)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta)} - Z(t).   (4.11)

Lemma 4.3. The sequence of updates (Z(p)) is uniformly bounded with probability 1.

Proof. It can be easily verified that iteration (4.8) satisfies the first four conditions of Lemma 4.1, the fifth not being applicable. Thus, by Lemma 4.1, (Z(p)) converges to the ODE (4.11), since

E_{\nu_{(\theta+\beta\eta)}}\!\left[ \frac{2\eta\, h(Y_p)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta)} - Z(t) \right] = \frac{2\eta\, J(\theta + \beta\eta)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta)} - Z(t).

We can also see that

\lim_{a \uparrow \infty} \frac{1}{a}\left( \frac{2\eta\, J(\theta + \beta\eta)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta)} - a Z(t) \right) = -Z(t).

Hence, Lemma 4.2 can be used to arrive at the claim.
We now define time steps t_n such that t_0 = 0 and t_n = \sum_{i=0}^{n-1} b(i) for n ≥ 1. We also define an interpolated trajectory Z̄(t) such that Z̄(t_n) = Z(nL) for all n, and over the interval [t_n, t_{n+1}] it is the linear interpolation between Z(nL) and Z(nL + L). Based on the definition given below, we can consider the interpolated trajectory Z̄(t) as a perturbation of the trajectory Z(t) in (4.11).

Definition 4.4 ((τ-µ) perturbation). For specified constants τ > 0, µ > 0 and a given ODE

\dot{x}(t) = f\big(x(t)\big),   (4.12)

a (τ-µ) perturbed trajectory of (4.12) is a map y : [0, ∞) → R^N satisfying the following condition: there exists an increasing sequence (τ_k)_{k≥0} ⊂ [0, ∞) with τ_{k+1} − τ_k ≥ τ for all k, such that on each interval [τ_k, τ_{k+1}] there exists a solution x^k(t) of (4.12) with |x^k(t) − y(t)| < µ for all t ∈ [τ_k, τ_{k+1}].
We state the following lemma due to [21], which is based on the above definition.

Lemma 4.5. Let K = {x : f(x) = 0} be the asymptotically stable attracting set of (4.12). Then, given τ, ε > 0, there exists µ_0 > 0 such that for all µ ∈ [0, µ_0], any (τ-µ) perturbation of (4.12) converges to the ε-neighbourhood of K, defined as K^ε = {x : ‖x − x_0‖ < ε, x_0 ∈ K}.

Lemma 4.6. Given τ, µ > 0, (θ(t_n + ·), Z̄(t_n + ·)) is eventually a bounded (τ-µ) perturbation of (4.9)–(4.11).

Proof. Since a(n) = o(b(n)) and the parameter is fixed at θ(n) over [t_n, t_{n+1}], we can write the parameter update (Step 13 of Gq-SF1) as

θ^{(i)}(n+1) = P_C\big(θ^{(i)}(n) − b(n)ζ(n)\big),

where ζ(n) = o(1). Thus, the parameter update recursion appears quasi-static when viewed from the timescale of (b(n)). Now, we define (τ_n) such that τ_0 = 0 and, for n ≥ 1, τ_n = min{t_k : t_k ≥ τ_{n−1} + τ}. We also define the functions θ^n(t), Z^n(t), t ∈ [τ_n, τ_{n+1}], which are the solutions of the ODEs

\dot{\theta}^n(t) = 0,

\dot{Z}^n(t) = \frac{2\eta(n)\, J\big(\theta(n) + \beta\eta(n)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))} - Z^n(t)

over the interval (τ_n, τ_{n+1}), with θ^n(τ_n) = θ(n) and Z^n(τ_n) = Z(n), respectively, i.e., the values given by the algorithm. Using arguments based on Gronwall's lemma¹ [8], it can now be shown that

\lim_{n\to\infty} \sup_{t \in [\tau_n, \tau_{n+1}]} \|Z^n(t) - \bar{Z}(t)\| = 0 \quad \text{with probability 1.}   (4.13)

The claim follows as a consequence of (4.13).
¹Gronwall's lemma states that for continuous functions u(.), v(.) ≥ 0 and scalars C, K, T ≥ 0,

u(t) \leq C + K \int_0^t u(s) v(s) \, ds \quad \text{for all } t \in [0, T]

implies

u(t) \leq C \exp\!\left( K \int_0^t v(s) \, ds \right) \quad \text{for all } t \in [0, T].
Corollary 4.7. \left\| Z(nL) - \frac{2\eta(n)\, J\big(\theta(n) + \beta\eta(n)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))} \right\| \to 0 almost surely as n → ∞.

Proof. The claim follows by applying Lemma 4.5 to (4.11) for every µ > 0.

Thus we can claim that the updates Z(nL) eventually track the function \frac{2\eta(n)\, J(\theta(n)+\beta\eta(n))}{\beta((N+4)-(N+2)q)\rho(\eta(n))}. So, Steps 13 and 15 of the Gq-SF1 algorithm can be written as

θ(n+1) = P_C\!\left( θ(n) − a(n) \left[ \frac{2\eta(n)\, J\big(\theta(n) + \beta\eta(n)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))} \right] \right)
= P_C\Big( θ(n) + a(n) \big[ -\nabla_{\theta(n)} J\big(\theta(n)\big) + \Delta\big(\theta(n)\big) + \xi_n \big] \Big),   (4.14)
where the error in the gradient estimate is given by

\Delta\big(\theta(n)\big) = \nabla_{\theta(n)} J\big(\theta(n)\big) - \nabla_{\theta(n)} S_{q,\beta}\big[J\big(\theta(n)\big)\big]   (4.15)

and the noise term is

\xi_n = \nabla_{\theta(n)} S_{q,\beta}\big[J\big(\theta(n)\big)\big] - \frac{2\eta(n)\, J\big(\theta(n) + \beta\eta(n)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))}
= \frac{2}{\beta\big((N+4)-(N+2)q\big)} \left( E_{G_q(\eta)}\!\left[ \frac{\eta(n)}{\rho(\eta(n))} J\big(\theta(n) + \beta\eta(n)\big) \,\Big|\, \theta(n) \right] - \frac{\eta(n)}{\rho(\eta(n))} J\big(\theta(n) + \beta\eta(n)\big) \right),   (4.16)

which is a martingale difference term. Let F_n = σ(θ(0), . . . , θ(n), η(0), . . . , η(n − 1)) denote the σ-field generated by the mentioned quantities. We can observe that (F_n)_{n≥0} is a filtration, where ξ_0, . . . , ξ_{n−1} are F_n-measurable for each n > 0.
We state the following result from [28], adapted to our scenario, which leads to the
convergence of the updates in (4.14).
Lemma 4.8. Consider the iteration x_{n+1} = P_C\big(x_n + γ(n)(f(x_n) + ξ_n)\big), where

1. P_C is the projection onto a constraint set C, which is closed and bounded,

2. f(.) is a continuous function,

3. (γ(n))_{n≥0} is a positive sequence satisfying γ(n) ↓ 0, \sum_{n=0}^{\infty} γ(n) = ∞, and

4. \sum_{i=0}^{m} γ(i)ξ_i converges a.s. as m → ∞.

Under the above conditions, the update (x_n) converges to the ODE

\dot{x}(t) = \bar{P}_C\big(f(x(t))\big),   (4.17)

where \bar{P}_C\big(f(x)\big) = \lim_{\varepsilon \downarrow 0} \left( \frac{P_C\big(x + \varepsilon f(x)\big) - x}{\varepsilon} \right). Further, if f(.) = −∇_x g(.), where g is a continuously differentiable function, then the invariant set of (4.17) is given by

K = \big\{ x \in C \,\big|\, \bar{P}_C(f(x)) = 0 \big\} = \big\{ x \in C \,\big|\, \nabla_x g(x)^T \bar{P}_C\big(-\nabla_x g(x)\big) = 0 \big\}.
The following result shows that the noise term ξ_n satisfies the last condition in Lemma 4.8.

Lemma 4.9. Let M_n = \sum_{k=0}^{n-1} a(k)\xi_k. Then, for all values of q ∈ (0, 1) ∪ (1, 1 + \frac{2}{N+2}), (M_n, F_n)_{n∈N} is an almost surely convergent martingale sequence.

Proof. We can easily observe that for all k ≥ 0,

E[\xi_k | \mathcal{F}_k] = \frac{2}{\beta\big((N+4)-(N+2)q\big)} \left( E\!\left[ \frac{\eta(k)}{\rho(\eta(k))} J\big(\theta(k) + \beta\eta(k)\big) \,\Big|\, \theta(k) \right] - E\!\left[ \frac{\eta(k)}{\rho(\eta(k))} J\big(\theta(k) + \beta\eta(k)\big) \,\Big|\, \mathcal{F}_k \right] \right).

So E[ξ_k | F_k] = 0, since θ(k) is F_k-measurable, whereas η(k) is independent of F_k. It follows that (ξ_n, F_n)_{n∈N} is a martingale difference sequence, and hence (M_n, F_n)_{n∈N} is a martingale sequence.

By the Lipschitz continuity of h, there exists α_1 > 0 such that for all p, |h(Y_p)| ≤ α_1(1 + ‖Y_p‖), and hence, by Assumption III, we can claim

E\big[h(Y_p)^2\big] \leq 2\alpha_1^2 \big( 1 + E\big[\|Y_p\|^2\big] \big) < \infty \quad a.s.
Thus, applying Jensen's inequality, we have J(θ(k) + βη)² ≤ E[h(Y_p)²] < ∞ for all η, which implies sup_η [J(θ + βη)²] < ∞ for all θ ∈ C. Now,

E\big[\|\xi_k\|^2 \,\big|\, \mathcal{F}_k\big] = \sum_{j=1}^{N} E\left[ \big(\xi_k^{(j)}\big)^2 \,\Big|\, \mathcal{F}_k \right]
\leq \frac{8}{\beta^2\big((N+4)-(N+2)q\big)^2} \sum_{j=1}^{N} E\left[ \left( E\!\left[ \frac{\eta^{(j)}(k)}{\rho(\eta(k))} J\big(\theta(k)+\beta\eta(k)\big) \,\Big|\, \theta(k) \right] \right)^{2} + \left( \frac{\eta^{(j)}(k)}{\rho(\eta(k))} J\big(\theta(k)+\beta\eta(k)\big) \right)^{2} \,\Big|\, \mathcal{F}_k \right].

By Jensen's inequality, we have

E\big[\|\xi_k\|^2 \,\big|\, \mathcal{F}_k\big] \leq \frac{16}{\beta^2\big((N+4)-(N+2)q\big)^2} \sum_{j=1}^{N} E\left[ \frac{\big(\eta^{(j)}(k)\big)^2}{\rho(\eta(k))^2} J\big(\theta(k)+\beta\eta(k)\big)^2 \,\Big|\, \theta(k) \right]
\leq \frac{16}{\beta^2\big((N+4)-(N+2)q\big)^2} \sup_{\eta}\Big( J\big(\theta(k)+\beta\eta\big)^2 \Big) \sum_{j=1}^{N} E\left[ \frac{\big(\eta^{(j)}\big)^2}{\rho(\eta)^2} \right].

We apply Proposition 3.1 to study the nature of E\left[\frac{(\eta^{(j)})^2}{\rho(\eta)^2}\right]. In this case, b = 2 and b_i = 2 if i = j, with b_i = 0 otherwise. So \sum_{i=1}^{N}\frac{b_i}{2} = 1 and \prod_{i=1}^{N}\left(\frac{b_i!}{2^{b_i}(\frac{b_i}{2})!}\right) = \frac{1}{2}.

For q < 1, we have the condition b < \left(1 + \frac{1}{1-q}\right), which can be satisfied only if q > 0. Using the relation Γ(n+1) = nΓ(n), we can write

E\left[\frac{(\eta^{(j)})^2}{\rho(\eta)^2}\right] = \left(\frac{(N+4)-(N+2)q}{1-q}\right)\frac{1}{2}\,\frac{\Gamma\!\left(\frac{1}{1-q}-1\right)\Gamma\!\left(\frac{1}{1-q}+1+\frac{N}{2}\right)}{\Gamma\!\left(\frac{1}{1-q}+1\right)\Gamma\!\left(\frac{1}{1-q}+\frac{N}{2}\right)} = \frac{\big((N+4)-(N+2)q\big)\big((N+2)-Nq\big)}{4q}.   (4.18)

We can verify that for 1 < q < \left(1 + \frac{2}{N+2}\right), the condition mentioned in Proposition 3.1 is always satisfied. Hence,

E\left[\frac{(\eta^{(j)})^2}{\rho(\eta)^2}\right] = \frac{\big((N+4)-(N+2)q\big)\big((N+2)-Nq\big)}{4q}.   (4.19)

Thus, it can be seen that E\left[\frac{(\eta^{(j)})^2}{\rho(\eta)^2}\right] < \infty for all q ∈ (0, 1) ∪ (1, 1 + \frac{2}{N+2}) and j = 1, 2, . . . , N, and so E[‖ξ_k‖² | F_{k−1}] < ∞ for all k. Hence, since \sum_n a(n)^2 < \infty,

\sum_{n=0}^{\infty} E\big[\|M_{n+1} - M_n\|^2\big] = \sum_{n=0}^{\infty} a(n)^2 E\big[\|\xi_n\|^2\big] < \infty.

The claim follows from the martingale convergence theorem [49].
Although the previous result shows that the noise is bounded, (4.18) shows that the
noise term can become quite large for very low values of q (close to 0). Now, we deal with
the error term ∆(θ(n)) in (4.14).
Proposition 4.10. For a given q < \left(1 + \frac{2}{N+2}\right), q ≠ 1, and for all θ ∈ C, the error term \left\| \nabla_\theta S_{q,\beta}[J(\theta)] - \nabla_\theta J(\theta) \right\| = o(\beta).

Proof. For small β > 0, using a Taylor series expansion of J(θ + βη) around θ ∈ C,

J(\theta + \beta\eta) = J(\theta) + \beta\eta^T \nabla_\theta J(\theta) + \frac{\beta^2}{2} \eta^T \nabla^2_\theta J(\theta)\, \eta + o(\beta^2).

So we can write (4.2) as

\nabla_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\big((N+4)-(N+2)q\big)} \left( \frac{J(\theta)}{\beta} E_{G_q(\eta)}\!\left[ \frac{\eta}{\rho(\eta)} \right] + E_{G_q(\eta)}\!\left[ \frac{\eta\eta^T}{\rho(\eta)} \right] \nabla_\theta J(\theta) + \frac{\beta}{2} E_{G_q(\eta)}\!\left[ \frac{\eta\eta^T \nabla^2_\theta J(\theta)\, \eta}{\rho(\eta)} \,\Big|\, \theta \right] + o(\beta) \right).   (4.20)

We consider each term in (4.20). The ith component of the first term is E_{G_q(\eta)}\!\left[\frac{\eta^{(i)}}{\rho(\eta)}\right] = 0 by Proposition 3.1 for all i = 1, . . . , N. Similarly, the ith component of the third term can be written as

\left( \frac{\beta}{2} E_{G_q(\eta)}\!\left[ \frac{\eta\eta^T \nabla^2_\theta J(\theta)\, \eta}{\rho(\eta)} \right] \right)^{(i)} = \frac{\beta}{2} \sum_{j=1}^{N} \sum_{k=1}^{N} \big[\nabla^2_\theta J(\theta)\big]_{j,k} \, E_{G_q(\eta)}\!\left[ \frac{\eta^{(i)}\eta^{(j)}\eta^{(k)}}{\rho(\eta)} \right].

It can be observed that in all cases, each term in the summation is an odd function, and so from Proposition 3.1 the third term in (4.20) is zero. Using a similar argument, the off-diagonal terms in E_{G_q(\eta)}\!\left[\frac{\eta\eta^T}{\rho(\eta)}\right] are zero, while the diagonal terms are of the form E_{G_q(\eta)}\!\left[\frac{(\eta^{(i)})^2}{\rho(\eta)}\right], which exists for all q ∈ (−∞, 1) ∪ (1, 1 + \frac{2}{N+2}) as the conditions in Proposition 3.1 are always satisfied on this interval. Further, we can compute that for all q ∈ (−∞, 1 + \frac{2}{N+2}), q ≠ 1,

E_{G_q(\eta)}\!\left[ \frac{(\eta^{(i)})^2}{\rho(\eta)} \right] = \frac{(N+4)-(N+2)q}{2}.   (4.21)

The claim follows by substituting the above expression in (4.20).
Now, we consider the following ODE for the slowest timescale recursion:

\dot{\theta}(t) = \bar{P}_C\big( -\nabla_\theta J(\theta(t)) \big),   (4.22)

where \bar{P}_C\big(f(x)\big) = \lim_{\varepsilon \downarrow 0} \left( \frac{P_C\big(x + \varepsilon f(x)\big) - x}{\varepsilon} \right). In accordance with Lemma 4.8, it can be observed that the stable points of (4.22) lie in the set

K = \big\{ \theta \in C \,\big|\, \bar{P}_C(-\nabla_\theta J(\theta)) = 0 \big\} = \big\{ \theta \in C \,\big|\, \nabla_\theta J(\theta)^T \bar{P}_C\big(-\nabla_\theta J(\theta)\big) = 0 \big\}.   (4.23)
We have the following key result, which shows that iteration (4.14) tracks the ODE (4.22) and hence proves the convergence of our algorithm.

Theorem 4.11. Under Assumptions I–IV, given ε > 0 and q ∈ (0, 1) ∪ (1, 1 + \frac{2}{N+2}), there exists β_0 > 0 such that for all β ∈ (0, β_0], the sequence (θ(n)) obtained using Gq-SF1 converges to a point in the ε-neighbourhood of the stable attractor of (4.22), with probability 1, as n → ∞.

Proof. It follows immediately from Lemmas 4.8 and 4.9 that the update in (4.14) converges to the ODE

\dot{\theta}(t) = \bar{P}_C\big( -\nabla_\theta J(\theta(t)) + \Delta(\theta(t)) \big).   (4.24)

But, from Proposition 4.10, we have ‖Δ(θ(n))‖ = o(β) for all n. Then, for any given ε, τ > 0, we can invoke Lemma 4.5 by considering the sequence (τ_k)_{k≥0} with τ_0 = 0 and τ_k = \sum_{p=1}^{n_k} a(p), where the indices are n_k = \min\{n : \sum_{p=1}^{n} a(p) \geq τ_{k−1} + τ\}. This gives a β_0 such that for all β ∈ (0, β_0], any (τ-µ) perturbation of (4.22) converges to the ε-neighbourhood of (4.23). Hence the claim.
4.3.2 Convergence of Gq-SF2 Algorithm
Since the proof of convergence here is along the lines of that for Gq-SF1, we do not describe it explicitly; we only briefly describe the modifications required in this case. On the faster timescale, as n → ∞, the updates Z(nL) track the function

\frac{\eta(n)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))} \Big( J\big(\theta(n) + \beta\eta(n)\big) - J\big(\theta(n) - \beta\eta(n)\big) \Big).

So we can rewrite the slower timescale update for the Gq-SF2 algorithm in a manner similar to (4.14), where the noise term is

\xi_n = \frac{1}{\beta\big((N+4)-(N+2)q\big)} \left( E_{G_q(\eta(n))}\!\left[ \frac{\eta(n)}{\rho(\eta(n))} J\big(\theta(n)+\beta\eta(n)\big) \,\Big|\, \theta(n) \right] - \frac{\eta(n)}{\rho(\eta(n))} J\big(\theta(n)+\beta\eta(n)\big) \right)
- \frac{1}{\beta\big((N+4)-(N+2)q\big)} \left( E_{G_q(\eta(n))}\!\left[ \frac{\eta(n)}{\rho(\eta(n))} J\big(\theta(n)-\beta\eta(n)\big) \,\Big|\, \theta(n) \right] - \frac{\eta(n)}{\rho(\eta(n))} J\big(\theta(n)-\beta\eta(n)\big) \right),

which can be divided into two parts, each of which is bounded (as in Lemma 4.9). We discuss the error term in the following proposition.

Proposition 4.12. For a given q < \left(1 + \frac{2}{N+2}\right), q ≠ 1, \left\| \nabla_\theta S'_{q,\beta}[J(\theta)] - \nabla_\theta J(\theta) \right\| = o(\beta) for all θ ∈ C.

Proof. Using a Taylor expansion, we have for small β

J(\theta + \beta\eta) - J(\theta - \beta\eta) = 2\beta\eta^T \nabla_\theta J(\theta) + o(\beta^2).

One can use arguments similar to those in Proposition 4.10 to rewrite (4.6) as

\nabla_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{\big((N+4)-(N+2)q\big)} E_{G_q(\eta)}\!\left[ \frac{2}{\rho(\eta)} \eta\eta^T \right] \nabla_\theta J(\theta) + o(\beta),

which leads to the claim.

This leads to a result similar to Theorem 4.11, which proves convergence of the Gq-SF2 algorithm.

Theorem 4.13. Under Assumptions I–IV, given ε > 0 and q ∈ (0, 1) ∪ (1, 1 + \frac{2}{N+2}), there exists β_0 > 0 such that for all β ∈ (0, β_0], the sequence (θ(n)) obtained using Gq-SF2 converges to a point in the ε-neighbourhood of the stable attractor of (4.22), with probability 1, as n → ∞.

Theorems 4.11 and 4.13 establish, for a given ε > 0, the existence of some β_0 > 0 such that the proposed gradient-descent algorithms converge to the ε-neighbourhood of a local minimum. However, these results do not give the precise value of β_0. Further, they do not guarantee that this neighbourhood lies in close proximity to a global minimum.
Chapter 5
q-Gaussian based Newton Search
Algorithms
5.1 Hessian Estimation using q-Gaussian SF
In this chapter, we extend the proposed gradient-based algorithms by incorporating an additional Hessian estimate. This leads to algorithms similar to Newton-based search. The existence of ∇²_θJ(θ) is guaranteed by Assumption II; we estimate it using the SF approach.
5.1.1 One-simulation q-Gaussian SF Hessian Estimate
Following Rubinstein [35], we define the smoothed Hessian, or the Hessian of the SF, as

\nabla^2_\theta S_{q,\beta}[J(\theta)] = \int_{\Omega_q} \nabla^2_\theta G_{q,\beta}(\theta - \eta) J(\eta) \, d\eta,   (5.1)

where Ω_q is the support set of the q-Gaussian distribution as defined earlier in (3.1). Now, the (i, j)th element of ∇²_θ G_{q,β}(θ − η) is

\big[\nabla^2_\theta G_{q,\beta}(\theta - \eta)\big]_{i,j} = \frac{4q}{\beta^4\big((N+4)-(N+2)q\big)^2} \, \frac{\big(\theta^{(i)} - \eta^{(i)}\big)\big(\theta^{(j)} - \eta^{(j)}\big)}{\rho\!\left(\frac{\theta-\eta}{\beta}\right)^2} \, G_{q,\beta}(\theta - \eta)

for i ≠ j, where \rho(X) = \left(1 - \frac{(1-q)}{((N+4)-(N+2)q)}\|X\|^2\right). For i = j, we have

\big[\nabla^2_\theta G_{q,\beta}(\theta - \eta)\big]_{i,i} = - \frac{2}{\beta^2\big((N+4)-(N+2)q\big)} \, \frac{G_{q,\beta}(\theta - \eta)}{\rho\!\left(\frac{\theta-\eta}{\beta}\right)} + \frac{4q}{\beta^4\big((N+4)-(N+2)q\big)^2} \, \frac{\big(\theta^{(i)} - \eta^{(i)}\big)^2}{\rho\!\left(\frac{\theta-\eta}{\beta}\right)^2} \, G_{q,\beta}(\theta - \eta).

Thus, we can write

\nabla^2_\theta G_{q,\beta}(\theta - \eta) = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)} \, H\!\left(\frac{\theta - \eta}{\beta}\right) G_{q,\beta}(\theta - \eta),   (5.2)

where

H(\eta)_{i,j} =
\begin{cases}
\dfrac{2q}{\big((N+4)-(N+2)q\big)} \dfrac{\eta^{(i)}\eta^{(j)}}{\rho(\eta)^2} & \text{for } i \neq j, \\[2ex]
\dfrac{2q}{\big((N+4)-(N+2)q\big)} \dfrac{\big(\eta^{(i)}\big)^2}{\rho(\eta)^2} - \dfrac{1}{\rho(\eta)} & \text{for } i = j.
\end{cases}   (5.3)

The function H(.) is a generalization of a similar function given in [5], which is recovered as q → 1. Hence, from (5.1), we have

\nabla^2_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)} \int_{\Omega_q} G_{q,\beta}(\theta - \eta) \, H\!\left(\frac{\theta - \eta}{\beta}\right) J(\eta) \, d\eta.

Substituting η' = (η − θ)/β, we can write

\nabla^2_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)} \, E_{G_q(\eta')}\!\Big[ H(\eta') \, J(\theta + \beta\eta') \,\Big|\, \theta \Big].   (5.4)

As in the gradient case, in the sequel (Proposition 5.4) we show that ‖∇²_θS_{q,β}[J(θ)] − ∇²_θJ(θ)‖ → 0 as β → 0. Hence, we obtain an estimate of ∇²_θJ(θ(n)) of the form

\nabla^2_\theta J(\theta(n)) \approx \frac{2}{\beta^2 M L \big((N+4)-(N+2)q\big)} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} H\big(\eta(n)\big) h(Y_m)   (5.5)

for large M, L and small β, where the process {Y_m} is governed by the parameter (θ(n) + βη(n)).
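For illustration, the matrix H(η) of (5.3) and the one-simulation Hessian estimate (5.5) can be coded directly; as in the earlier sketches, the q-Gaussian sampler and the simulator are placeholders, not part of the report.

```python
import numpy as np

def q_hessian_H(eta, q):
    """The matrix H(eta) of (5.3): off-diagonal 2q*eta_i*eta_j/(c*rho^2),
    diagonal 2q*eta_i^2/(c*rho^2) - 1/rho, with c = (N+4)-(N+2)q."""
    N = len(eta)
    c = (N + 4) - (N + 2) * q
    rho = 1 - (1 - q) * np.dot(eta, eta) / c
    H = 2 * q * np.outer(eta, eta) / (c * rho ** 2)
    H[np.diag_indices(N)] -= 1 / rho
    return H

def nq_sf1_hessian(theta, simulate_cost, sample_q_gaussian, q, beta=0.05, M=100, L=50):
    """One-simulation q-Gaussian SF Hessian estimate (5.5)."""
    N = len(theta)
    c = (N + 4) - (N + 2) * q
    hess = np.zeros((N, N))
    for _ in range(M):
        eta = sample_q_gaussian(N)
        mean_cost = np.mean(simulate_cost(theta + beta * eta, L))
        hess += q_hessian_H(eta, q) * mean_cost
    return 2 * hess / (beta ** 2 * c * M)
```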
5.1.2 Two-simulation q-Gaussian SF Hessian Estimate
Similarly, the Hessian of the two-sided SF can be defined as

\nabla^2_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{2} \int_{\Omega_q} \nabla^2_\theta G_{q,\beta}(\theta - \eta) J(\eta) \, d\eta + \frac{1}{2} \int_{\Omega_q} \nabla^2_\theta G_{q,\beta}(\eta - \theta) J(\eta) \, d\eta
= E_{G_q(\eta)}\!\left[ \frac{1}{\beta^2\big((N+4)-(N+2)q\big)} H(\eta) \big( J(\theta + \beta\eta) + J(\theta - \beta\eta) \big) \right].   (5.6)

Using Proposition 5.8 (discussed later), we obtain the estimate

\nabla^2_\theta J(\theta(n)) \approx \frac{1}{\beta^2 M L \big((N+4)-(N+2)q\big)} \sum_{n=0}^{M-1} \sum_{m=0}^{L-1} H\big(\eta(n)\big) \big[ h(Y_m) + h(Y'_m) \big]   (5.7)

for large M, L and small β, where {Y_m} and {Y'_m} are governed by (θ(n) + βη(n)) and (θ(n) − βη(n)), respectively.
5.2 Proposed Newton-based Algorithms
In this section, we propose two-timescale algorithms that perform a Newton based search,
and require one and two simulations, respectively. In particular, the one-simulation
(resp. two-simulation) algorithm uses gradient and Hessian estimates obtained from (4.2)
and (5.4) (resp. (4.6) and (5.6)). One of the problems with Newton-based algorithms is
that the Hessian has to be positive definite for the algorithm to progress in the descent
direction. This is satisfied in the neighbourhood of a local minimum, but it may not
always hold. Hence, the estimate obtained during the recursion has to be projected onto the
space of symmetric positive definite matrices. Let P_pd : ℝ^{N×N} → {symmetric matrices
with eigenvalues > ε} be the function that projects any N × N matrix onto the set of
symmetric positive definite matrices whose minimum eigenvalue is greater than ε, for some ε > 0. We
assume that the projection P_pd satisfies the following:
Assumption V. If (A_n)_{n∈ℕ}, (B_n)_{n∈ℕ} ⊂ ℝ^{N×N} are sequences of matrices such that
lim_{n→∞} ‖A_n − B_n‖ = 0, then lim_{n→∞} ‖P_pd(A_n) − P_pd(B_n)‖ = 0 as well.
Such a projection always exists since the set of positive definite matrices is dense in
ℝ^{N×N}. Methods for performing the projection and inverse computation are discussed later.
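One straightforward realization of P_pd (a sketch of one possible choice; the Jacobi-style alternative is discussed in Section 6.2) is to symmetrize the matrix, eigen-decompose it, and raise every eigenvalue below ε to ε. The function name project_pd is illustrative.

```python
import numpy as np

def project_pd(W, eps=0.1):
    """A possible P_pd: symmetrize, eigen-decompose, and clip all eigenvalues
    from below at eps, so the result is symmetric with minimum eigenvalue >= eps."""
    W_sym = 0.5 * (W + W.T)
    eigvals, eigvecs = np.linalg.eigh(W_sym)
    eigvals = np.maximum(eigvals, eps)
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# Example: the projected matrix is positive definite, hence invertible.
W = np.array([[1.0, 2.0], [2.0, -3.0]])       # an indefinite Hessian estimate
M = np.linalg.inv(project_pd(W, eps=0.1))     # plays the role of M(nL) = P_pd(W(nL))^{-1}
```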
The basic approach of the algorithms is similar to the proposed gradient descent algorithms,
and we use two step-size sequences, (a(n))_{n≥0} and (b(n))_{n≥0}, satisfying Assumption IV.
In the recursions below, the estimate of ∇_θ J(θ) is obtained from the sequence
(Z^{(i)}(n), i = 1, ..., N)_{n≥0}, while (W_{i,j}(n), i, j = 1, ..., N)_{n≥0} estimates ∇²_θ J(θ).
Algorithm 3 : The Nq-SF1 Algorithm

1:  Fix M, L, q, β and ε.
2:  Set Z^{(i)}(0) = 0, W_{i,j}(0) = 0 for i, j = 1, ..., N.
3:  Fix the parameter vector θ(0) = (θ^{(1)}(0), θ^{(2)}(0), ..., θ^{(N)}(0))^T.
4:  for n = 0 to M − 1 do
5:    Generate a standard N-dimensional q-Gaussian distributed random vector η(n).
6:    for m = 0 to L − 1 do
7:      Generate the simulation Y_{nL+m} governed with parameter (θ(n) + βη(n)).
8:      for i = 1 to N do
9:        Z^{(i)}(nL+m+1) = (1 − b(n)) Z^{(i)}(nL+m)
            + b(n) [ 2 η^{(i)}(n) h(Y_{nL+m}) / ( β ((N+4) − (N+2)q) ρ(η(n)) ) ],
          where ρ(η(n)) = 1 − (1−q) ‖η(n)‖² / ((N+4) − (N+2)q).
10:       W_{i,i}(nL+m+1) = (1 − b(n)) W_{i,i}(nL+m)
            + b(n) [ 2 h(Y_{nL+m}) / ( β² ((N+4) − (N+2)q) ) ]
              [ 2q (η^{(i)}(n))² / ( ((N+4) − (N+2)q) ρ(η(n))² ) − 1 / ρ(η(n)) ].
11:       for j = i + 1 to N do
12:         W_{i,j}(nL+m+1) = (1 − b(n)) W_{i,j}(nL+m)
              + b(n) [ 4q η^{(i)}(n) η^{(j)}(n) h(Y_{nL+m}) / ( β² ((N+4) − (N+2)q)² ρ(η(n))² ) ].
13:         W_{j,i}(nL+m+1) = W_{i,j}(nL+m+1).
14:       end for
15:     end for
16:   end for
17:   Project W(nL) using P_pd, and compute its inverse. Let M(nL) = P_pd(W(nL))^{-1}.
18:   for i = 1 to N do
19:     θ^{(i)}(n+1) = P_C( θ^{(i)}(n) − a(n) Σ_{j=1}^{N} M_{i,j}(nL) Z^{(j)}(nL) ).
20:   end for
21:   Set θ(n+1) = (θ^{(1)}(n+1), θ^{(2)}(n+1), ..., θ^{(N)}(n+1))^T.
22: end for
23: Output θ(M) = (θ^{(1)}(M), θ^{(2)}(M), ..., θ^{(N)}(M))^T as the final parameter vector.
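The per-sample recursions in Steps 9 to 13 can be written compactly in matrix form. Below is a rough NumPy sketch of one inner-loop update (the function name and interface are illustrative, not the thesis code); the full algorithm wraps this in the two nested loops and applies the projection and inverse of Step 17.

```python
import numpy as np

def nqsf1_inner_update(Z, W, eta, h_val, b_n, beta, q):
    """One pass of Steps 9-13 of Nq-SF1 for all coordinates at once.
    Z : current gradient estimate (length-N array)
    W : current Hessian estimate (N x N array)
    eta : perturbation eta(n); h_val : single-stage cost h(Y_{nL+m})."""
    N = eta.size
    c = (N + 4) - (N + 2) * q
    rho = 1.0 - (1.0 - q) * np.dot(eta, eta) / c
    # Step 9: gradient estimate update.
    g = 2.0 * eta * h_val / (beta * c * rho)
    Z_new = (1.0 - b_n) * Z + b_n * g
    # Steps 10-13: Hessian estimate update (diagonal and off-diagonal terms of H(eta)).
    Hmat = (2.0 * q / c) * np.outer(eta, eta) / rho ** 2
    Hmat[np.diag_indices(N)] -= 1.0 / rho
    W_new = (1.0 - b_n) * W + b_n * (2.0 * h_val / (beta ** 2 * c)) * Hmat
    return Z_new, W_new
```

For the two-simulation variant (Algorithm 4 below), the same sketch applies with 2·h_val replaced by h(Y_{nL+m}) − h(Y'_{nL+m}) in the gradient term and by h(Y_{nL+m}) + h(Y'_{nL+m}) in the Hessian term.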
In the Nq-SF2 algorithm, we generate two simulations Y_{nL+m} and Y'_{nL+m}, governed with
parameters (θ(n) + βη(n)) and (θ(n) − βη(n)), respectively, instead of a single simulation.
The single-stage cost functions of both simulations are used to update the gradient and
Hessian estimates, which in turn drive the optimization update rule (Step 19).
Algorithm 4 : The Nq-SF2 Algorithm

1:  Fix M, L, q, β and ε.
2:  Set Z^{(i)}(0) = 0, W_{i,j}(0) = 0 for i, j = 1, ..., N.
3:  Fix the parameter vector θ(0) = (θ^{(1)}(0), θ^{(2)}(0), ..., θ^{(N)}(0))^T.
4:  for n = 0 to M − 1 do
5:    Generate a standard N-dimensional q-Gaussian distributed random vector η(n).
6:    for m = 0 to L − 1 do
7:      Generate two independent simulations Y_{nL+m} and Y'_{nL+m} governed with
        parameters (θ(n) + βη(n)) and (θ(n) − βη(n)), respectively.
8:      for i = 1 to N do
9:        Z^{(i)}(nL+m+1) = (1 − b(n)) Z^{(i)}(nL+m)
            + b(n) [ η^{(i)}(n) (h(Y_{nL+m}) − h(Y'_{nL+m})) / ( β ((N+4) − (N+2)q) ρ(η(n)) ) ],
          with ρ(η(n)) as in Algorithm 3.
10:       W_{i,i}(nL+m+1) = (1 − b(n)) W_{i,i}(nL+m)
            + b(n) [ (h(Y_{nL+m}) + h(Y'_{nL+m})) / ( β² ((N+4) − (N+2)q) ) ]
              [ 2q (η^{(i)}(n))² / ( ((N+4) − (N+2)q) ρ(η(n))² ) − 1 / ρ(η(n)) ].
11:       for j = i + 1 to N do
12:         W_{i,j}(nL+m+1) = (1 − b(n)) W_{i,j}(nL+m)
              + b(n) [ 2q η^{(i)}(n) η^{(j)}(n) (h(Y_{nL+m}) + h(Y'_{nL+m})) / ( β² ((N+4) − (N+2)q)² ρ(η(n))² ) ].
13:         W_{j,i}(nL+m+1) = W_{i,j}(nL+m+1).
14:       end for
15:     end for
16:   end for
17:   Project W(nL) using P_pd, and compute its inverse. Let M(nL) = P_pd(W(nL))^{-1}.
18:   for i = 1 to N do
19:     θ^{(i)}(n+1) = P_C( θ^{(i)}(n) − a(n) Σ_{j=1}^{N} M_{i,j}(nL) Z^{(j)}(nL) ).
20:   end for
21:   Set θ(n+1) = (θ^{(1)}(n+1), θ^{(2)}(n+1), ..., θ^{(N)}(n+1))^T.
22: end for
23: Output θ(M) = (θ^{(1)}(M), θ^{(2)}(M), ..., θ^{(N)}(M))^T as the final parameter vector.
5.3 Convergence of Newton SF Algorithms
5.3.1 Convergence of Nq-SF1 Algorithm
The convergence analysis is quite similar to that of Gq-SF1, along with an additional
Hessian update. Let us consider the updates in the faster timescale, i.e., Steps 9 and 10
of the Nq-SF1 algorithm. We use b(p) defined in Section 4.3, i.e., b(p) = b(n) for all
nL ≤ p < (n+1)L. Observing that θ(n) and η(n) are fixed during these updates, we can
rewrite Step 9 as the following iteration, which runs for L steps.
\[
Z(p+1) = Z(p) + b(p)\big(g_1(Y_p) - Z(p)\big), \tag{5.8}
\]
where $g_1(Y_p) = \dfrac{2\,\eta(n)\, h(Y_p)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(n))}$ for nL ≤ p < (n+1)L. Here, ρ(·) is as defined in (5.4), and {Y_p}_{p∈ℕ} is a Markov process with the stationary distribution ν(θ(n)+βη(n)).
Similarly, the update of the Hessian matrix can be expressed as
\[
W(p+1) = W(p) + b(p)\big(g_2(Y_p) - W(p)\big), \tag{5.9}
\]
where $g_2(Y_p) = \dfrac{2\,H(\eta(n))\, h(Y_p)}{\beta^2\big((N+4)-(N+2)q\big)}$ for nL ≤ p < (n+1)L.
The iterations (5.8) and (5.9) are independent of each other for a fixed n, and hence
they can be dealt with separately. It follows from Lemma 4.1 that the updates lead to
the following ODEs:
\[
\dot{\theta}(t) = 0, \tag{5.10}
\]
\[
\dot{Z}(t) = \frac{2\,\eta(t)\, J\big(\theta(t) + \beta\eta(t)\big)}{\beta\big((N+4)-(N+2)q\big)\rho(\eta(t))} - Z(t), \tag{5.11}
\]
\[
\dot{W}(t) = \frac{2\,H(\eta(t))\, J\big(\theta(t) + \beta\eta(t)\big)}{\beta^2\big((N+4)-(N+2)q\big)} - W(t), \tag{5.12}
\]
where, as before, θ(t) ≡ θ and η(t) ≡ η in (5.11)–(5.12) in lieu of (5.10). The results
related to the gradient updates stated in Section 4.3 (Lemmas 4.3, 4.6 and Corollary 4.7)
still hold. We prove similar results for the Hessian update.
Lemma 5.1. The sequence (W(p)) is uniformly bounded with probability 1.

Proof. Observing the fact that
\[
E_{\nu(\theta+\beta\eta)}\!\left[\frac{2\,H(\eta)\, h(Y_p)}{\beta^2\big((N+4)-(N+2)q\big)} - W(t)\right]
= \frac{2\,H(\eta)\, J(\theta+\beta\eta)}{\beta^2\big((N+4)-(N+2)q\big)} - W(t),
\]
the proof turns out to be exactly similar to that of Lemma 4.3.
We again refer to the time steps t_n with t_0 = 0 and $t_n = \sum_{i=0}^{n-1} b(i)$ for n ≥ 1. Defining
W̄(t) as the interpolation between W(nL) and W(nL+L) over the interval [t_n, t_{n+1}], we
have the following.
Lemma 5.2. Given τ, μ > 0, (θ̄(t_n + ·), W̄(t_n + ·)) is eventually a bounded (τ, μ)-perturbation of (4.9)–(4.11).

Corollary 5.3. $\left\| W(nL) - \dfrac{2\,H(\eta(n))\, J(\theta(n) + \beta\eta(n))}{\beta^2\big((N+4)-(N+2)q\big)} \right\| \to 0$ almost surely as n → ∞.
Thus, both the Z(·) and W(·) updates eventually track the gradient and Hessian of
S_{q,β}[J(θ)]. So, after incorporating the projection considered in Step 17, we can write
Steps 17 and 19 of the Nq-SF1 algorithm as follows:
\[
\begin{aligned}
\theta(n+1) &= P_C\!\left(\theta(n) - a(n)\left[ P_{pd}\!\left(\frac{2\,H(\eta(n))\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}
\frac{2\,\eta(n)\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta\rho(\eta(n))\big((N+4)-(N+2)q\big)} \right] \right)\\
&= P_C\!\left(\theta(n) + a(n)\left[ -P_{pd}\big(\nabla^2_{\theta(n)} J(\theta(n))\big)^{-1}\nabla_{\theta(n)} J(\theta(n)) + \Delta\big(\theta(n)\big) + \xi_n \right]\right),
\end{aligned} \tag{5.13}
\]
where, using (4.2) and (5.4), we have
\[
\Delta\big(\theta(n)\big) = P_{pd}\big(\nabla^2_{\theta(n)} J(\theta(n))\big)^{-1}\nabla_{\theta(n)} J(\theta(n))
- P_{pd}\big(\nabla^2_{\theta(n)} S_{q,\beta}[J(\theta(n))]\big)^{-1}\nabla_{\theta(n)} S_{q,\beta}[J(\theta(n))] \tag{5.14}
\]
and
\[
\begin{aligned}
\xi_n = {}& E\!\left[ P_{pd}\!\left(\frac{2\,H(\eta(n))\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}
\frac{2\,\eta(n)\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta\rho(\eta(n))\big((N+4)-(N+2)q\big)} \,\middle|\, \theta(n)\right]\\
&- P_{pd}\!\left(\frac{2\,H(\eta(n))\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}
\frac{2\,\eta(n)\, J\big(\theta(n)+\beta\eta(n)\big)}{\beta\rho(\eta(n))\big((N+4)-(N+2)q\big)}.
\end{aligned} \tag{5.15}
\]
The objective is to show that the error term Δ(θ(n)) and the noise term ξ_n satisfy
conditions similar to those mentioned in Section 4.3. Then, we can use similar arguments
to prove the convergence of Nq-SF1. Considering the error term, we have the following
proposition regarding convergence of the Hessian of the SF to the Hessian of the objective
function J as β → 0.
Proposition 5.4. For a given q ∈ (0,1) ∪ (1, 1 + 2/(N+2)), we have
\[
\left\| \nabla^2_\theta S_{q,\beta}[J(\theta)] - \nabla^2_\theta J(\theta) \right\| = o(\beta)
\]
for all θ ∈ C and β > 0.
Proof. We use Taylor's expansion of J(θ + βη) around θ ∈ C to rewrite (5.4) as
\[
\begin{aligned}
\nabla^2_\theta S_{q,\beta}[J(\theta)] = \frac{2}{\beta^2\big((N+4)-(N+2)q\big)}
\Big( & E\big[H(\eta) J(\theta) \,\big|\, \theta\big]
+ \beta\, E\big[H(\eta)\, \eta^T \nabla_\theta J(\theta) \,\big|\, \theta\big]\\
&+ \frac{\beta^2}{2}\, E\big[H(\eta)\, \eta^T \nabla^2_\theta J(\theta)\, \eta \,\big|\, \theta\big]
+ \frac{\beta^3}{6}\, E\big[H(\eta)\, \eta^T \big(\nabla^3_\theta J(\theta)\, \eta\big) \eta \,\big|\, \theta\big]
+ o(\beta^3) \Big).
\end{aligned} \tag{5.16}
\]
Let us consider each of the terms in (5.16). It is evident that for all i, j = 1, ..., N with i ≠ j,
E[H(η)_{i,j}] = 0. Even for the diagonal elements, we have for all i = 1, 2, ..., N,
\[
E\big[H(\eta)_{i,i}\big] = \frac{2q}{(N+4)-(N+2)q}\, E\!\left[\frac{\big(\eta^{(i)}\big)^2}{\rho(\eta)^2}\right] - E\!\left[\frac{1}{\rho(\eta)}\right]. \tag{5.17}
\]
For q < 1, a simple application of Proposition 3.1 shows that
\[
E\!\left[\frac{1}{\rho(\eta)}\right]
= \frac{\Gamma\!\left(\frac{1}{1-q}\right)\Gamma\!\left(\frac{1}{1-q}+1+\frac{N}{2}\right)}{\Gamma\!\left(\frac{1}{1-q}+1\right)\Gamma\!\left(\frac{1}{1-q}+\frac{N}{2}\right)}
= \frac{\frac{1}{1-q}+\frac{N}{2}}{\frac{1}{1-q}}
= \frac{N+2-Nq}{2}. \tag{5.18}
\]
Similarly, $E\!\left[\frac{1}{\rho(\eta)}\right] = \frac{N+2-Nq}{2}$ for q ∈ (1, 1 + 2/(N+2)). Substituting this expression in (5.17),
we get E[H(η)_{i,i}] = 0. Thus, the first term in (5.16) is zero. Expanding the inner product,
the (i,j)-th element of the second term can be written as
\[
E\big[H(\eta)\, \eta^T \nabla_\theta J(\theta) \,\big|\, \theta\big]_{i,j}
= \sum_{k=1}^{N} \big[\nabla_\theta J(\theta)\big]^{(k)} E\big[\eta^{(k)} H(\eta)_{i,j}\big]
= \begin{cases}
\dfrac{2q}{(N+4)-(N+2)q} \displaystyle\sum_{k=1}^{N} \big[\nabla_\theta J(\theta)\big]^{(k)} E\!\left[\dfrac{\eta^{(i)}\eta^{(j)}\eta^{(k)}}{\rho(\eta)^2}\right] & \text{if } i \neq j,\\[3ex]
\dfrac{2q}{(N+4)-(N+2)q} \displaystyle\sum_{k=1}^{N} \big[\nabla_\theta J(\theta)\big]^{(k)} E\!\left[\dfrac{\big(\eta^{(i)}\big)^2\eta^{(k)}}{\rho(\eta)^2}\right]
- \displaystyle\sum_{k=1}^{N} \big[\nabla_\theta J(\theta)\big]^{(k)} E\!\left[\dfrac{\eta^{(k)}}{\rho(\eta)}\right] & \text{if } i = j.
\end{cases}
\]
In all the expectations above, the total number of exponents in the numerator is odd;
hence, for any combination of i, j, k, the functions are odd and so, by Proposition 3.1,
the expectations are zero in all cases. Thus, the second term in (5.16) is zero. For
similar reasons, the fourth term in (5.16) is also zero. Now, we consider the third term.
For i ≠ j, we have
\[
E\Big[H(\eta)_{i,j}\,\big(\eta^T \nabla^2_\theta J(\theta)\, \eta\big) \,\Big|\, \theta\Big]
= \frac{2q}{(N+4)-(N+2)q} \sum_{k,l=1}^{N} \big[\nabla^2_\theta J(\theta)\big]_{k,l}\,
E\!\left[\frac{\eta^{(i)}\eta^{(j)}\eta^{(k)}\eta^{(l)}}{\rho(\eta)^2}\right],
\]
which is zero unless i = k, j = l or i = l, j = k. So, using the fact that ∇²_θ J(θ) is
symmetric, i.e., [∇²_θ J(θ)]_{k,l} = [∇²_θ J(θ)]_{l,k}, we can write
\[
E\Big[H(\eta)_{i,j}\,\big(\eta^T \nabla^2_\theta J(\theta)\, \eta\big) \,\Big|\, \theta\Big]
= \frac{4q\, \big[\nabla^2_\theta J(\theta)\big]_{i,j}}{(N+4)-(N+2)q}\,
E\!\left[\frac{\big(\eta^{(i)}\big)^2\big(\eta^{(j)}\big)^2}{\rho(\eta)^2}\right]. \tag{5.19}
\]
Referring to Proposition 3.1, we have in this case b = b_i = b_j = 2 and b_k = 0 for all other
k. So $\sum_{k=1}^{N} \frac{b_k}{2} = 2$ and $\frac{b_i!}{2^{b_i}(b_i/2)!} = \frac{b_j!}{2^{b_j}(b_j/2)!} = \frac{1}{2}$. For any q ∈ (0,1) ∪ (1, 1 + 2/(N+2)), the
arguments of the Gamma functions are positive, and hence, the Gamma functions exist.
For q < 1, we have
\[
E\!\left[\frac{\big(\eta^{(i)}\big)^2\big(\eta^{(j)}\big)^2}{\rho(\eta)^2}\right]
= \left(\frac{(N+4)-(N+2)q}{1-q}\right)^2 \left(\frac{1}{4}\right)
\frac{\Gamma\!\left(\frac{1}{1-q}-1\right)\Gamma\!\left(\frac{1}{1-q}+1+\frac{N}{2}\right)}{\Gamma\!\left(\frac{1}{1-q}+1\right)\Gamma\!\left(\frac{1}{1-q}+1+\frac{N}{2}\right)}
= \left(\frac{(N+4)-(N+2)q}{1-q}\right)^2 \frac{1}{4}\,
\frac{1}{\frac{1}{1-q}\left(\frac{1}{1-q}-1\right)}
= \frac{\big((N+4)-(N+2)q\big)^2}{4q},
\]
while for q > 1, we again have $E\!\left[\frac{(\eta^{(i)})^2(\eta^{(j)})^2}{\rho(\eta)^2}\right] = \frac{\big((N+4)-(N+2)q\big)^2}{4q}$.
Hence, we obtain
\[
E\Big[H(\eta)_{i,j}\,\big(\eta^T \nabla^2_\theta J(\theta)\, \eta\big) \,\Big|\, \theta\Big]
= \big((N+4)-(N+2)q\big)\, \big[\nabla^2_\theta J(\theta)\big]_{i,j}
\]
for i ≠ j by substituting in (5.19). Now, for i = j, using (5.3),
\[
E\Big[H(\eta)_{i,i}\,\big(\eta^T \nabla^2_\theta J(\theta)\, \eta\big) \,\Big|\, \theta\Big]
= \frac{2q}{(N+4)-(N+2)q} \sum_{k,l=1}^{N} \big[\nabla^2_\theta J(\theta)\big]_{k,l}\,
E\!\left[\frac{\big(\eta^{(i)}\big)^2\eta^{(k)}\eta^{(l)}}{\rho(\eta)^2}\right]
- \sum_{k,l=1}^{N} \big[\nabla^2_\theta J(\theta)\big]_{k,l}\,
E\!\left[\frac{\eta^{(k)}\eta^{(l)}}{\rho(\eta)}\right].
\]
Observing that the above expectations are zero for k ≠ l, we have
\[
\begin{aligned}
E\Big[H(\eta)_{i,i}\,\big(\eta^T \nabla^2_\theta J(\theta)\, \eta\big) \,\Big|\, \theta\Big]
= {}& \frac{2q\, \big[\nabla^2_\theta J(\theta)\big]_{i,i}}{(N+4)-(N+2)q}\,
E\!\left[\frac{\big(\eta^{(i)}\big)^4}{\rho(\eta)^2}\right]
+ \frac{2q}{(N+4)-(N+2)q} \sum_{k \neq i} \big[\nabla^2_\theta J(\theta)\big]_{k,k}\,
E\!\left[\frac{\big(\eta^{(i)}\big)^2\big(\eta^{(k)}\big)^2}{\rho(\eta)^2}\right]\\
&- \sum_{k=1}^{N} \big[\nabla^2_\theta J(\theta)\big]_{k,k}\,
E\!\left[\frac{\big(\eta^{(k)}\big)^2}{\rho(\eta)}\right].
\end{aligned} \tag{5.20}
\]
We again refer to Proposition 3.1 to compute each term in (5.20). For the first term,
b = 2 and b_i = 4. So we can verify that the conditions in Proposition 3.1 hold for all values of
q ∈ (0,1) ∪ (1, 1 + 2/(N+2)). We have
\[
E\!\left[\frac{\big(\eta^{(i)}\big)^4}{\rho(\eta)^2}\right]
= \left(\frac{(N+4)-(N+2)q}{1-q}\right)^2 \left(\frac{4!}{2^4\, 2!}\right)
\frac{\Gamma\!\left(\frac{1}{1-q}-1\right)}{\Gamma\!\left(\frac{1}{1-q}+1\right)}
= \frac{3\big((N+4)-(N+2)q\big)^2}{4q}
\]
for q ∈ (0, 1). The same result also holds for 1 < q < 1 + 2/(N+2). The second term in (5.20)
is similar to the one in (5.19), and can be computed in the same way. From (4.21), we
have $E_{G_q(\eta)}\!\left[\frac{(\eta^{(k)})^2}{\rho(\eta)}\right] = \frac{(N+4)-(N+2)q}{2}$. Substituting all these terms in (5.20) results in
the following.
\[
\begin{aligned}
E\Big[H(\eta)_{i,i}\,\big(\eta^T \nabla^2_\theta J(\theta)\, \eta\big) \,\Big|\, \theta\Big]
&= \big((N+4)-(N+2)q\big)\left(\frac{3}{2}\big[\nabla^2_\theta J(\theta)\big]_{i,i}
+ \frac{1}{2}\sum_{k \neq i}\big[\nabla^2_\theta J(\theta)\big]_{k,k}
- \frac{1}{2}\sum_{k=1}^{N}\big[\nabla^2_\theta J(\theta)\big]_{k,k}\right)\\
&= \big((N+4)-(N+2)q\big)\,\big[\nabla^2_\theta J(\theta)\big]_{i,i}.
\end{aligned}
\]
The claim follows by substituting all the above expressions in (5.16).
The following result is a direct consequence of Propositions 4.10 and 5.4.
Corollary 5.5. Under Assumption V, ‖Δ(θ)‖ = o(β) for all q ∈ (0, 1 + 2/(N+2)), q ≠ 1.
Proof. We write Δ(·) as
\[
\Delta(\theta) = \Big(P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1} - P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big)\nabla_\theta J(\theta)
+ P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\big(\nabla_\theta J(\theta) - \nabla_\theta S_{q,\beta}[J(\theta)]\big),
\]
which implies that
\[
\big\|\Delta(\theta)\big\| \leq
\Big\|P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1} - P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big\|\, \big\|\nabla_\theta J(\theta)\big\|
+ \Big\|P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big\|\, \big\|\nabla_\theta J(\theta) - \nabla_\theta S_{q,\beta}[J(\theta)]\big\|. \tag{5.21}
\]
Since ∇_θ J(θ) is continuously differentiable on the compact set C, we have sup_{θ∈C} ‖∇_θ J(θ)‖ < ∞.
Also, since P_pd(·) yields a positive definite matrix, its inverse always exists, i.e., for any given
matrix A, ‖(P_pd(A))^{-1}‖ < ∞ for any matrix norm. Thus, in order to justify
the claim, we need to show that the other terms are o(β). From Proposition 4.10, it follows
that ‖∇_θ J(θ) − ∇_θ S_{q,β}[J(θ)]‖ = o(β), and we can write
\[
\begin{aligned}
&\Big\|P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1} - P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big\|\\
&\qquad= \Big\|P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1}\, P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big(P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big) - P_{pd}\big(\nabla^2_\theta J(\theta)\big)\Big)\Big\|\\
&\qquad\leq \Big\|P_{pd}\big(\nabla^2_\theta J(\theta)\big)^{-1}\Big\|\,\Big\|P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big)^{-1}\Big\|\,\Big\|P_{pd}\big(\nabla^2_\theta S_{q,\beta}[J(\theta)]\big) - P_{pd}\big(\nabla^2_\theta J(\theta)\big)\Big\|.
\end{aligned}
\]
The first two terms are finite due to the positive definiteness of P_pd(·), while from Proposition 5.4, the third term is o(β). Hence the claim.
The following result deals with the noise term ξ_n in a way similar to the Gq-SF1
algorithm. For this, we consider the filtration (F_n)_{n≥0} as defined earlier in Chapter 4, i.e.,
F_n = σ(θ(0), ..., θ(n), η(0), ..., η(n−1)).

Lemma 5.6. Defining $M_n = \sum_{k=0}^{n-1} a(k)\xi_k$, (M_n, F_n)_{n≥0} is an almost surely convergent
martingale sequence for all q ∈ (0,1) ∪ (1, 1 + 2/(N+2)).
Proof. As θ(k) is F_k-measurable, while η(k) is independent of F_k for all k ≥ 0, we can
conclude that E[ξ_k | F_k] = 0. Thus (ξ_k, F_k)_{k≥0} is a martingale difference sequence and
(M_k, F_k)_{k≥0} is a martingale sequence. As shown in Lemma 4.9,
\[
\begin{aligned}
E\big[\|\xi_k\|^2 \,\big|\, F_k\big]
&\leq 4\, E\left[\left\| P_{pd}\!\left(\frac{2\,H(\eta)\, J(\theta+\beta\eta)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}
\frac{2\,\eta\, J(\theta+\beta\eta)}{\beta\rho(\eta)\big((N+4)-(N+2)q\big)} \right\|^2\right]\\
&\leq \frac{16}{\beta^2\big((N+4)-(N+2)q\big)^2}\,
E\left[\left\| P_{pd}\!\left(\frac{2\,H(\eta)\, J(\theta+\beta\eta)}{\beta^2\big((N+4)-(N+2)q\big)}\right)^{-1}\right\|^2
\left\|\frac{\eta}{\rho(\eta)}\right\|^2 J(\theta+\beta\eta)^2\right]\\
&\leq \frac{16}{\beta^2\big((N+4)-(N+2)q\big)^2}\,
\sup_\eta J(\theta+\beta\eta)^2\, \sup_\eta\left(\frac{1}{\Lambda_{\min}^2}\right)
\sum_{j=1}^{N} E\!\left[\frac{\big(\eta^{(j)}\big)^2}{\rho(\eta)^2}\right],
\end{aligned}
\]
where Λ_min is the minimum eigenvalue of the projected Hessian matrix, which is always
greater than ε by the definition of P_pd. The claim follows using similar arguments as in
Proposition 4.9.
Thus, we have the main theorem, which affirms the convergence of the Nq-SF1 algorithm. The proof of this theorem is exactly the same as that of Theorem 4.11.

Theorem 5.7. Under Assumptions I–V, given ε > 0 and q ∈ (0,1) ∪ (1, 1 + 2/(N+2)),
there exists β_0 > 0 such that for all β ∈ (0, β_0], the sequence (θ(n)) obtained using Nq-SF1
converges almost surely as n → ∞ to a point in the ε-neighbourhood of the set of stable
attractors of the ODE
\[
\dot{\theta}(t) = P_C\Big({-P_{pd}}\big(\nabla^2_{\theta(t)} J(\theta(t))\big)^{-1}\nabla_{\theta(t)} J(\theta(t))\Big), \tag{5.22}
\]
where the domain of attraction is given by
\[
K = \Big\{\theta \in C \,\Big|\, \nabla_\theta J(\theta)^T\, P_C\Big({-P_{pd}}\big(\nabla^2_\theta J(\theta)\big)^{-1}\nabla_\theta J(\theta)\Big) = 0\Big\}. \tag{5.23}
\]
5.3.2 Convergence of Nq-SF2 Algorithm
The convergence of the Nq-SF2 algorithm to a local minimum can be shown by extending
the results in the previous section in the same way as done for the Gq-SF2 algorithm.
We only show the result regarding convergence of the smoothed Hessian to the Hessian of the
objective function as β → 0, which has been used in (5.7).

Proposition 5.8. For a given q ∈ (0,1) ∪ (1, 1 + 2/(N+2)), for all θ ∈ C and β > 0,
\[
\big\| \nabla^2_\theta S'_{q,\beta}[J(\theta)] - \nabla^2_\theta J(\theta) \big\| = o(\beta).
\]

Proof. For small β > 0, using Taylor's expansion of J(θ + βη) and J(θ − βη) around θ ∈ C,
\[
J(\theta + \beta\eta) + J(\theta - \beta\eta) = 2 J(\theta) + \beta^2\, \eta^T \nabla^2_\theta J(\theta)\, \eta + o(\beta^3).
\]
Thus, the Hessian of the two-sided SF (5.6) is
\[
\nabla^2_\theta S'_{q,\beta}[J(\theta)] = \frac{1}{\beta^2\big((N+4)-(N+2)q\big)}
\Big( E\big[2 H(\eta) J(\theta) \,\big|\, \theta\big]
+ \beta^2\, E\big[H(\eta)\, \eta^T \nabla^2_\theta J(\theta)\, \eta \,\big|\, \theta\big]
+ o(\beta^3) \Big),
\]
which can be computed as in Proposition 5.4 to arrive at the claim.
The key result related to the convergence of Nq-SF2 follows.

Theorem 5.9. Under Assumptions I–V, given ε > 0 and q ∈ (0,1) ∪ (1, 1 + 2/(N+2)),
there exists β_0 > 0 such that for all β ∈ (0, β_0], the sequence (θ(n)) obtained using Nq-SF2
converges almost surely as n → ∞ to a point in the ε-neighbourhood of the set of stable
attractors of the ODE (5.22).

Thus, the convergence analysis of all the proposed algorithms shows that by choosing a
small β > 0 and any q ∈ (0, 1 + 2/(N+2)), all the SF algorithms converge to a local optimum.
In the special case of q → 1, we retrieve the algorithms presented in [5], and so the
corresponding convergence analysis holds.
Chapter 6
Simulations using Proposed
Algorithms
6.1 Numerical Setting
We consider a multi-node network of M/G/1 queues with feedback as shown in the figure
below. The setting here is a generalized version of that considered in [5].
Figure 6.1: Queuing Network.
There are K nodes, which are fed with independent Poisson external arrival processes
with rates λ1, λ2, . . . , λK , respectively. After departing from the ith node, a customer either
leaves the system with probability pi or enters the (i+ 1)th node with probability (1−pi).
Once the service at the Kth node is completed, the customer may rejoin the 1st node with
probability (1 − p_K). The service time processes of each node, {S^i_n(θ_i)}_{n≥1}, i = 1, 2, ..., K,
are defined as
\[
S^i_n(\theta_i) = U_i(n)\left(\frac{1}{R_i} + \big\|\theta_i(n) - \bar{\theta}_i\big\|^2\right), \tag{6.1}
\]
where, for all i = 1, 2, ..., K, the R_i are constants and the U_i(n) are independent samples drawn
from the uniform distribution on (0, 1). The service time of each node depends on the
N_i-dimensional tunable parameter vector θ_i, whose individual components lie in a certain
interval $\big[(\theta^{(j)}_i)_{\min}, (\theta^{(j)}_i)_{\max}\big]$, j = 1, 2, ..., N_i, i = 1, 2, ..., K. Here, θ_i(n) represents the n-th
update of the parameter vector at the i-th node, and θ̄_i represents the target parameter
vector corresponding to the i-th node.
The cost function is chosen to be the sum of the total waiting times of all the customers
in the system. For the cost to be minimum, S^i_n(θ_i) should be minimum, and hence we
should have θ_i(n) = θ̄_i, i = 1, ..., K. Let us denote
\[
\theta = \big(\theta_1^T, \theta_2^T, \ldots, \theta_K^T\big)^T
\quad\text{and}\quad
\bar{\theta} = \big(\bar{\theta}_1^T, \bar{\theta}_2^T, \ldots, \bar{\theta}_K^T\big)^T.
\]
It is evident that θ, θ̄ ∈ ℝ^N, where $N = \sum_{i=1}^{K} N_i$. In order to compare the performance
of the various algorithms, we consider the performance measure to be the Euclidean distance
between θ(n) and θ̄, given by
\[
\big\|\theta(n) - \bar{\theta}\big\| = \left[\sum_{i=1}^{K}\sum_{j=1}^{N_i}\Big(\theta^{(j)}_i(n) - \bar{\theta}^{(j)}_i\Big)^2\right]^{1/2}.
\]
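As a small illustration of this setting, the sketch below samples a service time according to (6.1) and evaluates the performance measure; the function names are illustrative, the target and initial vectors are those used later in the two-node experiment of Section 6.3, and the actual single-stage waiting-time cost comes from simulating the queues, which is not shown here.

```python
import numpy as np

def service_time(theta_i, theta_bar_i, R_i, rng):
    """Sample S_n^i(theta_i) = U_i(n) * (1/R_i + ||theta_i(n) - theta_bar_i||^2), cf. (6.1)."""
    U = rng.uniform(0.0, 1.0)
    return U * (1.0 / R_i + np.sum((theta_i - theta_bar_i) ** 2))

def distance_to_target(theta, theta_bar):
    """Performance measure: Euclidean distance between the parameter update and the target."""
    return np.linalg.norm(theta - theta_bar)

rng = np.random.default_rng(0)
theta_bar = np.full(4, 0.3)                 # target parameter vector
theta = np.array([0.1, 0.1, 0.6, 0.6])      # initial parameter theta(0)
print(service_time(theta[:2], theta_bar[:2], R_i=10, rng=rng))
print(distance_to_target(theta, theta_bar))
```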
6.2 Implementation Issues
Before discussing the simulation results, we address some issues regarding the
implementation of the proposed algorithms.

All the algorithms require generation of a multivariate q-Gaussian distributed random
vector whose individual components are uncorrelated and identically distributed. This
implies that the random variables are q-independent [47]. In the limiting case q → 1,
q-independence is equivalent to independence of the random variables, and hence we can use
standard algorithms to generate i.i.d. samples. This is, however, not possible for q-Gaussians
with q ≠ 1. Thistleton et al. [43] proposed an algorithm for generating one-dimensional q-
Gaussian distributed random variables using a generalized Box-Muller transformation. This
method can be easily extended to the two-variable case. However, there exists no standard
algorithm for generating N-variate q-Gaussian random vectors. Hence, we perform the
simulations with random vectors consisting of N i.i.d. samples of univariate q-Gaussian
random variables.
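A rough sketch of the generalized Box-Muller method of [43], as we read it (the exact standardization used in the simulations may differ), maps uniform variates through the q-logarithm with a transformed index q' = (1+q)/(3−q); the function names below are illustrative.

```python
import numpy as np

def log_q(x, q):
    """q-logarithm: (x^(1-q) - 1)/(1-q), reducing to log(x) as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_gaussian_samples(q, size, rng):
    """Generalized Box-Muller sketch after Thistleton et al. [43]:
    with U1, U2 ~ Uniform(0,1) and q' = (1+q)/(3-q),
    Z = sqrt(-2 ln_{q'} U1) * cos(2 pi U2) is univariate q-Gaussian distributed."""
    q_prime = (1.0 + q) / (3.0 - q)
    u1 = rng.uniform(size=size)
    u2 = rng.uniform(size=size)
    return np.sqrt(-2.0 * log_q(u1, q_prime)) * np.cos(2.0 * np.pi * u2)

# An N-dimensional perturbation vector is then formed from N i.i.d. univariate samples,
# as done in the simulations here.
rng = np.random.default_rng(0)
eta = q_gaussian_samples(q=0.9, size=4, rng=rng)
```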
The projection of the Hessian in the Newton based algorithms requires some discussion.
We require the projected matrix to have eigenvalues bounded below by ε > 0. This
can be done by performing an eigen-decomposition of the Hessian update W(nL), and
computing the projected matrix by setting all eigenvalues less than ε equal to ε.
However, this method requires a large amount of time and memory for the eigen-decomposition.
To reduce the computational effort, various methods have been studied
in the literature [4, 40, 50] for obtaining projected Hessian estimates. We consider a variation
of Newton's method where the off-diagonal elements of the matrix are set to zero. This
is similar to the Jacobi variant algorithms discussed in [5], and it simplifies the update, as
Steps 11–15 in the algorithms are no longer required. The projection of the Hessian can then be
easily obtained by projecting each diagonal element to [ε, ∞), and the inverse can be
computed directly. The simulations are shown using this method.
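A minimal sketch of this Jacobi-style shortcut (illustrative names; ε as fixed in the algorithms):

```python
import numpy as np

def jacobi_project_and_invert(W, eps=0.1):
    """Jacobi variant: keep only the diagonal of the Hessian estimate,
    project each diagonal entry to [eps, inf), and invert entrywise.
    Returns the diagonal of M(nL) = P_pd(W(nL))^{-1}."""
    d = np.maximum(np.diag(W), eps)
    return 1.0 / d

# Usage inside the parameter update (Step 19), with Z the gradient estimate and
# np.clip playing the role of the box projection P_C onto [0.1, 0.6]:
# theta_new = np.clip(theta - a_n * jacobi_project_and_invert(W) * Z, 0.1, 0.6)
```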
The analysis on the faster timescale for the Newton based algorithms shows that the gradient
and Hessian updates are independent, and hence their convergence to the smoothed
gradient and Hessian, respectively, can be analyzed independently. This also provides
scope to update the gradient and Hessian on different timescales without affecting the
convergence of the algorithms. Bhatnagar [5] used a faster timescale to update the Hessian,
due to which the inverse of the Hessian converges much faster than the gradient
update. This improves the performance of the N-SF algorithms. We also study the effect
of this timescale on the proposed algorithms.
6.3 Experimental Results
For the simulations, we first consider a two-queue network with the arrival rates at the
nodes being λ_1 = 0.2 and λ_2 = 0.1, respectively. We consider that all customers leaving
node-1 enter node-2, i.e., p_1 = 0, while customers serviced at node-2 may leave the
system with probability p_2 = 0.4. We also fix the constants in the service time at R_1 = 10
and R_2 = 20. The service parameters for both nodes are two-dimensional vectors,
N_1 = N_2 = 2, with components lying in the interval $\big[(\theta^{(j)}_i)_{\min}, (\theta^{(j)}_i)_{\max}\big] = [0.1, 0.6]$
for i = 1, 2, j = 1, 2. Thus, the constrained space C is given by C = [0.1, 0.6]^4 ⊂ ℝ^4. We
fix the target parameter vector at θ̄ = (0.3, 0.3, 0.3, 0.3)^T.

The simulations were performed on an Intel Core i5 machine with 3.7 GiB of memory.
We run the algorithms varying the values of q and β, while all the other parameters
are held fixed at M = 10000, L = 100 and ε = 0.1. For all the cases, the initial parameter
is taken to be θ(0) = (0.1, 0.1, 0.6, 0.6)^T. The step-sizes (a(n))_{n≥0}, (b(n))_{n≥0} were
taken to be a(n) = 1/n and b(n) = 1/n^{3/4}, respectively. For each (q, β) pair, 20 independent runs
were performed, which took about 10 seconds. We compare the performance of all the
proposed algorithms with the four SF algorithms proposed in [5], which use Gaussian
smoothing. Figure 6.2 shows the convergence behavior of the proposed q-Gaussian based
algorithms with q = 0.9, and that of the corresponding Gaussian based algorithms, where
the smoothness parameter is β = 0.1 for all cases.
Figure 6.2: Convergence behavior of various algorithms for β = 0.1.
The following tables present the performance of the gradient based algorithms and the
Jacobi variant of the Newton based algorithms. The values of β are chosen over the range
where the Gaussian SF algorithms perform well [5], whereas the values of q are chosen
such that they are spread uniformly over the range (0, 1 + 2/(N+2)) = (0, 4/3). It may be noted
that q = 1 corresponds to the Gaussian case. We show the distance of the final update
θ(M) from the optimum θ̄, averaged over 20 trials. The variance of the updates is also
shown to indicate the stability of the algorithms.
We observe that the two-simulation algorithms always perform better than their one-simulation
counterparts. The results show that, in general, the Gq-SF2 algorithm gives
better performance than Nq-SF2. This is due to the approximation considered in the
Hessian update, as it has been observed in the literature that the Newton based approach
usually performs better than the gradient case. We now focus on the effect of the value
of q used in the algorithms. For each value of β, the instances where the distance for the
q-Gaussian is less than Gaussian-SF are highlighted. Also, the least distance obtained for
each β is marked.

In the case of the gradient descent algorithms, it can be seen that better results are usually
obtained for q > 1. Also, over the interval (0, 1), the distance is smaller when the value of q is
close to 1 (the Gaussian case). In fact, q = 0.9 seems to give better performance than other
values of q when all values of β are considered. This observation is similar to the results
in [18]. The poor performance for low values of q is due to the noise term addressed in
Lemma 4.9, for which the variance may become quite high as q gets closer to zero. The
Newton based algorithms, however, perform well over a wider range of q. This may be
due to the fact that, in these algorithms, the parameter updates are such that they lead to
some amount of cancellation of the noise effects in the gradient and Hessian updates. But, in
all the algorithms, it is observed that for higher values of β (for example, β = 0.5), when
the Gaussian algorithms tend to become unstable, q-Gaussian algorithms with q < 1 are
relatively more stable, whereas those with q > 1 show unstable results (high variance). Thus, it can
be claimed that for higher values of q, the region of stability, (0, β0), mentioned in the
convergence theorems becomes smaller.
q \ β      0.01            0.025           0.05            0.075           0.1             0.25            0.5
0.1        0.2353±0.0946   0.1763±0.0751   0.1024±0.0620   0.0975±0.0993   0.0668±0.0553   0.0772±0.0772   0.0954±0.0986
0.2        0.2378±0.1056   0.1809±0.1040   0.0974±0.0594   0.0741±0.0424   0.1052±0.0992   0.0670±0.0392   0.1380±0.1088
0.3        0.2452±0.1142   0.1624±0.0975   0.0802±0.0389   0.0943±0.0737   0.0545±0.0485   0.0607±0.0365   0.1212±0.0908
0.4        0.2603±0.1031   0.1659±0.0820   0.1074±0.0733   0.0862±0.0587   0.0715±0.0361   0.0891±0.0796   0.1566±0.1217
0.5        0.2524±0.1157   0.1675±0.1085   0.1267±0.0802   0.0810±0.0508   0.0841±0.0841   0.0713±0.0786   0.1346±0.1039
0.6        0.2174±0.0724   0.1444±0.0558   0.0751±0.0509   0.0888±0.0981   0.0610±0.0359   0.0892±0.0730   0.1467±0.1082
0.7        0.2011±0.0786   0.1151±0.0515   0.0616±0.0244   0.0794±0.0853   0.0647±0.0668   0.0603±0.0396   0.0950±0.0350
0.8        0.1754±0.1072   0.1012±0.0691   0.0481±0.0379   0.0333±0.0145   0.0298±0.0130   0.0563±0.0966   0.1037±0.0393
0.9        0.1798±0.0614   0.0768±0.0286   0.0371±0.0160   0.0262±0.0078   0.0244±0.0083   0.0243±0.0075   0.0936±0.0302
Gaussian   0.1664±0.0929   0.0653±0.0286   0.0387±0.0152   0.0274±0.0096   0.0243±0.0075   0.0258±0.0115   0.1173±0.0530
1.1        0.1523±0.0652   0.0767±0.0203   0.0398±0.0133   0.0296±0.0114   0.0204±0.0081   0.0246±0.0104   0.1326±0.0550
1.2        0.2018±0.0765   0.0746±0.0245   0.0353±0.0148   0.0274±0.0125   0.0264±0.0072   0.0312±0.0114   0.2208±0.0727
1.3        0.2012±0.0727   0.0961±0.0215   0.0413±0.0205   0.0306±0.0079   0.0275±0.0085   0.0336±0.0107   0.3238±0.0776
Table 6.1: Performance of Gq-SF1 algorithm for different values of q and β.
q \ β      0.01            0.025           0.05            0.075           0.1             0.25            0.5
0.1        0.1587±0.1472   0.0623±0.0301   0.0734±0.1287   0.0184±0.0045   0.0168±0.0059   0.0243±0.0182   0.0290±0.0171
0.2        0.0702±0.0569   0.0455±0.0344   0.0204±0.0098   0.0759±0.1204   0.0185±0.0088   0.0393±0.0437   0.0670±0.0607
0.3        0.1423±0.0775   0.0659±0.0748   0.1402±0.1787   0.0210±0.0110   0.0228±0.0125   0.0409±0.0648   0.0549±0.0474
0.4        0.1101±0.0599   0.1124±0.1545   0.0274±0.0170   0.0190±0.0095   0.0169±0.0095   0.0146±0.0036   0.0538±0.0301
0.5        0.1243±0.0740   0.0460±0.0344   0.0268±0.0255   0.0256±0.0170   0.0163±0.0121   0.0480±0.0595   0.0801±0.0557
0.6        0.1090±0.0708   0.0387±0.0198   0.0285±0.0164   0.0536±0.0554   0.0363±0.0380   0.0277±0.0186   0.0754±0.0384
0.7        0.0588±0.0400   0.0307±0.0076   0.0262±0.0203   0.0173±0.0097   0.0754±0.1131   0.0335±0.0314   0.0804±0.1089
0.8        0.0544±0.0176   0.0238±0.0034   0.0196±0.0093   0.0093±0.0043   0.0075±0.0050   0.0150±0.0056   0.0473±0.0232
0.9        0.0490±0.0203   0.0213±0.0070   0.0114±0.0037   0.0090±0.0017   0.0061±0.0021   0.0088±0.0034   0.0466±0.0166
Gaussian   0.0501±0.0118   0.0215±0.0067   0.0112±0.0037   0.0062±0.0017   0.0066±0.0020   0.0097±0.0032   0.0742±0.0351
1.1        0.0517±0.0125   0.0236±0.0042   0.0113±0.0032   0.0077±0.0026   0.0062±0.0015   0.0104±0.0034   0.0773±0.0325
1.2        0.0531±0.0195   0.0204±0.0075   0.0107±0.0050   0.0071±0.0025   0.0056±0.0023   0.0113±0.0036   0.1471±0.0685
1.3        0.0529±0.0112   0.0208±0.0064   0.0101±0.0029   0.0079±0.0021   0.0064±0.0007   0.0159±0.0053   0.2196±0.0543
Table 6.2: Performance of Gq-SF2 algorithm for different values of q and β.
q \ β      0.01            0.025           0.05            0.075           0.1             0.25            0.5
0.1        0.2913±0.0714   0.1971±0.0675   0.0855±0.0229   0.0575±0.0165   0.0518±0.0205   0.0480±0.0187   0.1145±0.0458
0.2        0.2640±0.0774   0.1322±0.0484   0.0882±0.0376   0.0587±0.0210   0.0515±0.0168   0.0448±0.0148   0.0813±0.0329
0.3        0.3055±0.0966   0.1952±0.0596   0.0894±0.0352   0.0641±0.0206   0.0469±0.0187   0.0460±0.0153   0.1221±0.0420
0.4        0.2592±0.0823   0.1891±0.0865   0.0940±0.0375   0.0588±0.0173   0.0505±0.0187   0.0480±0.0253   0.1057±0.0319
0.5        0.2892±0.0837   0.1549±0.0612   0.0805±0.0326   0.0589±0.0193   0.0517±0.0172   0.0493±0.0194   0.1010±0.0315
0.6        0.2344±0.0858   0.1356±0.0614   0.0737±0.0276   0.0580±0.0149   0.0470±0.0203   0.0450±0.0191   0.0973±0.0395
0.7        0.2660±0.0923   0.1621±0.0682   0.0826±0.0305   0.0674±0.0207   0.0457±0.0144   0.0440±0.0162   0.1308±0.0412
0.8        0.2737±0.0805   0.1776±0.0851   0.0842±0.0237   0.0556±0.0209   0.0462±0.0178   0.0445±0.0117   0.1055±0.0364
0.9        0.2597±0.0826   0.1554±0.0647   0.0739±0.0300   0.0524±0.0175   0.0391±0.0102   0.0431±0.0177   0.1340±0.0686
Gaussian   0.2356±0.0844   0.1226±0.0593   0.0761±0.0321   0.0539±0.0199   0.0440±0.0131   0.0427±0.0208   0.1804±0.0676
1.1        0.2299±0.1069   0.1420±0.0423   0.0681±0.0194   0.0548±0.0177   0.0455±0.0179   0.0452±0.0142   0.1919±0.0773
1.2        0.2126±0.0650   0.1267±0.0609   0.0754±0.0333   0.0720±0.0189   0.0450±0.0231   0.0396±0.0142   0.2898±0.0786
1.3        0.2621±0.0887   0.1920±0.0835   0.0970±0.0366   0.0664±0.0265   0.0608±0.0234   0.0702±0.0352   0.3631±0.0776
Table 6.3: Performance of Jacobi variant of Nq-SF1 algorithm for different values of q and β.
q \ β      0.01            0.025           0.05            0.075           0.1             0.25            0.5
0.1        0.0788±0.0372   0.0281±0.0092   0.0165±0.0061   0.0120±0.0026   0.0088±0.0035   0.0121±0.0052   0.0421±0.0197
0.2        0.0883±0.0322   0.0305±0.0082   0.0182±0.0073   0.0129±0.0030   0.0087±0.0026   0.0097±0.0045   0.0287±0.0123
0.3        0.0951±0.0345   0.0372±0.0107   0.0167±0.0040   0.0116±0.0042   0.0075±0.0034   0.0098±0.0034   0.0348±0.0112
0.4        0.0726±0.0241   0.0251±0.0089   0.0162±0.0055   0.0109±0.0023   0.0113±0.0043   0.0103±0.0036   0.0365±0.0092
0.5        0.0954±0.0329   0.0277±0.0097   0.0145±0.0041   0.0112±0.0037   0.0108±0.0036   0.0101±0.0029   0.0279±0.0140
0.6        0.0808±0.0297   0.0318±0.0132   0.0147±0.0042   0.0115±0.0039   0.0102±0.0026   0.0098±0.0029   0.0346±0.0128
0.7        0.0693±0.0328   0.0247±0.0108   0.0130±0.0052   0.0099±0.0043   0.0091±0.0028   0.0105±0.0035   0.0458±0.0157
0.8        0.0796±0.0308   0.0320±0.0100   0.0164±0.0051   0.0102±0.0043   0.0113±0.0034   0.0100±0.0035   0.0339±0.0090
0.9        0.0612±0.0220   0.0292±0.0055   0.0146±0.0027   0.0108±0.0035   0.0104±0.0037   0.0105±0.0044   0.0484±0.0221
Gaussian   0.0718±0.0222   0.0248±0.0116   0.0117±0.0046   0.0115±0.0031   0.0087±0.0023   0.0100±0.0028   0.0690±0.0287
1.1        0.0680±0.0263   0.0240±0.0111   0.0147±0.0055   0.0127±0.0037   0.0074±0.0025   0.0090±0.0053   0.0969±0.0432
1.2        0.0629±0.0219   0.0257±0.0110   0.0141±0.0040   0.0103±0.0041   0.0095±0.0036   0.0114±0.0053   0.1024±0.0498
1.3        0.0827±0.0372   0.0523±0.0212   0.0209±0.0097   0.0134±0.0044   0.0104±0.0033   0.0257±0.0146   0.2108±0.0849
Table 6.4: Performance of Jacobi variant of Nq-SF2 algorithm for different values of q and β.
We study the effect of the timescales on the proposed algorithms. Since both the
one-simulation and two-simulation algorithms involve the step-sizes in a similar way, we
consider only the Gq-SF2 and Nq-SF2 algorithms. As mentioned in the previous section, we can
update the Hessian on a different timescale, independently of the gradient estimation. We
use a step-size sequence (c(n)) for the Hessian estimation. We fix the step-size sequence
corresponding to the slower timescale at a(n) = 1/n, n ≥ 1, and let the other sequences be
of the form b(n) = 1/n^γ and c(n) = 1/n^δ.
q \ δ      0.55            0.65            0.75            Gq-SF2
0.1        0.0133±0.0059   0.0095±0.0031   0.0087±0.0030   0.0229±0.0064
0.2        0.0089±0.0030   0.0095±0.0024   0.0097±0.0038   0.0243±0.0253
0.3        0.0094±0.0038   0.0110±0.0041   0.0100±0.0036   0.0473±0.0778
0.4        0.0112±0.0039   0.0104±0.0040   0.0097±0.0031   0.0250±0.0216
0.5        0.0111±0.0053   0.0108±0.0030   0.0101±0.0045   0.0403±0.0330
0.6        0.0114±0.0047   0.0100±0.0051   0.0080±0.0014   0.0505±0.0944
0.7        0.0121±0.0033   0.0101±0.0040   0.0087±0.0031   0.0098±0.0040
0.8        0.0109±0.0028   0.0098±0.0029   0.0096±0.0032   0.0101±0.0053
0.9        0.0084±0.0031   0.0089±0.0040   0.0095±0.0044   0.0085±0.0018
Gaussian   0.0093±0.0028   0.0076±0.0033   0.0088±0.0032   0.0067±0.0021
1.1        0.0075±0.0042   0.0070±0.0023   0.0080±0.0028   0.0068±0.0028
1.2        0.0086±0.0030   0.0070±0.0019   0.0078±0.0036   0.0079±0.0012
1.3        0.0072±0.0031   0.0091±0.0024   0.0102±0.0048   0.0086±0.0028

Table 6.5: Performance of Gq-SF2 and Nq-SF2 (Jacobi variant) algorithms for different values of q and δ, where β = 0.1 and the step-sizes are a(n) = 1/n, b(n) = 1/n^{0.75}, c(n) = 1/n^δ.
q \ δ      0.55            0.65            0.75            0.85            Gq-SF2
0.1        0.0123±0.0076   0.0343±0.0554   0.0115±0.0025   0.0115±0.0031   0.0302±0.0197
0.2        0.0416±0.0618   0.0123±0.0038   0.0091±0.0031   0.0105±0.0029   0.0232±0.0162
0.3        0.0173±0.0098   0.0147±0.0124   0.0092±0.0044   0.0119±0.0032   0.0452±0.0373
0.4        0.0255±0.0222   0.0187±0.0088   0.0121±0.0037   0.0096±0.0036   0.0187±0.0114
0.5        0.0203±0.0096   0.0108±0.0041   0.0103±0.0028   0.0105±0.0027   0.0424±0.0702
0.6        0.0160±0.0140   0.0126±0.0038   0.0096±0.0036   0.0114±0.0027   0.0223±0.0202
0.7        0.0180±0.0224   0.0115±0.0036   0.0078±0.0033   0.0098±0.0035   0.0118±0.0095
0.8        0.0148±0.0060   0.0083±0.0031   0.0095±0.0031   0.0104±0.0027   0.0055±0.0028
0.9        0.0096±0.0042   0.0101±0.0032   0.0075±0.0025   0.0098±0.0034   0.0076±0.0018
Gaussian   0.0096±0.0024   0.0067±0.0042   0.0094±0.0027   0.0077±0.0025   0.0068±0.0015
1.1        0.0096±0.0027   0.0095±0.0039   0.0069±0.0032   0.0074±0.0032   0.0060±0.0013
1.2        0.0086±0.0029   0.0071±0.0023   0.0082±0.0020   0.0087±0.0019   0.0075±0.0010
1.3        0.0087±0.0020   0.0080±0.0037   0.0064±0.0023   0.0139±0.0057   0.0073±0.0023

Table 6.6: Performance of Gq-SF2 and Nq-SF2 (Jacobi variant) algorithms for different values of q and δ, where β = 0.1 and the step-sizes are a(n) = 1/n, b(n) = 1/n^{0.85}, c(n) = 1/n^δ.
In order to satisfy Assumption IV, we need to consider γ, δ ∈ (0.5, 1). We study
the relative performance of the Gq-SF2 and Nq-SF2 (Jacobi variant) algorithms when the
Hessian is updated on various timescales. We vary δ such that b(n) = o(c(n)), due to
reasons discussed in [5]. Tables 6.5 and 6.6 show the effect of δ on the Nq-SF2 algorithm
for various values of q. The value of β is held fixed at β = 0.1, and γ is fixed at 0.75
and 0.85, respectively. The value of δ is varied from 0.55 to γ. We mark the cases where
Nq-SF2 performs better than Gq-SF2.

We perform similar experiments in a higher dimensional case. For this, we consider
a five-node network with external arrival rate λ_i = 0.2, i = 1, ..., 5, at each node. The
probability of leaving the system after service at each node is p_i = 0.2 for all nodes.
The service process of each node is controlled by a 10-dimensional parameter vector, and
a constant R_i = 10, i = 1, ..., 5. Thus, we have a 50-dimensional constrained
optimization problem, where each component can vary over the interval [0.1, 0.6] and the
target is 0.3. The parameters of the algorithms are held fixed at M = 10000, L = 100 and
ε = 0.1. Each component of the initial parameter vector is taken to be θ^{(i)}(0) = 0.6 for
all i = 1, 2, ..., 50. The step-sizes were taken to be a(n) = 1/n, b(n) = 1/n^{0.75} and c(n) = 1/n^{0.65}.
For each (q, β) pair, 20 independent runs were performed, which took about 1 minute on
average.
Figure 6.3: Convergence behavior of various algorithms for β = 0.1.
It can be observed that the trend changes in the higher dimensional case. Smaller
values of q become more significant, as the noise term does not scale up to be very high
compared to the 4-dimensional case. In fact, for the Nq-SF2 algorithm, low values of q
(close to 0) make the algorithm more stable and also improve performance significantly.
However, we cannot identify a clear trend for the best case in the Gq-SF2 algorithm, although
small values of q still give better performance.
q \ β      0.025           0.05            0.075           0.1             0.25
0.1        0.7237±0.0514   0.6323±0.0477   0.5721±0.0481   0.5325±0.0329   0.3128±0.0305
0.2        0.6858±0.0834   0.5921±0.0398   0.5278±0.0431   0.4899±0.0279   0.2769±0.0387
0.3        0.6695±0.0591   0.5385±0.0386   0.4913±0.0347   0.4422±0.0328   0.3113±0.1233
0.4        0.6827±0.1260   0.5426±0.1293   0.4687±0.1018   0.4683±0.1974   0.3005±0.1387
0.5        0.7196±0.1721   0.5039±0.0935   0.4798±0.1551   0.4118±0.1226   0.5363±0.3535
0.6        0.8145±0.1709   0.7290±0.3343   0.5164±0.1147   0.5883±0.2394   0.7116±0.2943
0.7        1.1038±0.1390   0.9438±0.2537   0.7993±0.2745   0.7315±0.2545   0.9571±0.3265
0.8        1.1176±0.2065   0.9393±0.2182   0.8644±0.2965   0.7553±0.2653   0.9573±0.2190
0.9        0.9266±0.2015   0.6436±0.3200   0.5634±0.2493   0.5634±0.2893   0.8520±0.2093
Gaussian   0.9548±0.0546   0.8618±0.0208   0.6182±0.0089   0.7225±0.0115   0.9768±0.0720

Table 6.7: Performance of Gq-SF2 algorithm for different values of q and β.
q \ β      0.025           0.05            0.075           0.1             0.25
0.1        1.0572±0.1257   0.3125±0.0316   0.2306±0.0279   0.2013±0.0233   0.4786±0.0745
0.2        1.0994±0.1117   0.3604±0.0638   0.2318±0.0202   0.2205±0.0279   0.6431±0.1135
0.3        1.2004±0.0975   0.5092±0.1137   0.2917±0.0242   0.2664±0.0307   0.8322±0.1238
0.4        1.2188±0.0887   0.7111±0.1847   0.3529±0.0470   0.3280±0.0463   1.0707±0.1847
0.5        1.3280±0.1547   0.9714±0.2154   0.5552±0.1214   0.4624±0.1020   1.2536±0.1482
0.6        1.4026±0.1281   1.1468±0.1503   0.8221±0.1575   0.7008±0.1915   1.3794±0.1196
0.7        1.4614±0.1136   1.2622±0.1383   1.0808±0.1629   0.9673±0.1896   1.4844±0.1196
0.8        1.4725±0.0939   1.3228±0.1452   1.1361±0.1636   1.0394±0.1783   1.5204±0.1283
0.9        1.5142±0.0971   1.3295±0.1019   1.1470±0.1248   1.1067±0.1631   1.5617±0.1039
Gaussian   1.4739±0.0820   1.2346±0.0934   1.0459±0.1641   1.0301±0.1617   1.6134±0.0823

Table 6.8: Performance of Nq-SF2 (Jacobi variant) algorithm for different values of q and β.
Chapter 7
Conclusions
The power-law behavior of q-Gaussians provides better control over the smoothing of
functions as compared to the Gaussian distribution. This property allows better tuning of
algorithms that use such distributions as smoothing kernels, giving a better trade-off
between local fluctuations and the overall error incurred while smoothing.

We have extended the Gaussian smoothed functional approach for gradient and Hessian
estimation to the q-Gaussian case, and developed optimization algorithms based on this.
We propose four two-timescale algorithms using gradient and Newton based search. These
algorithms generalize those proposed in [5]. We use a queuing network example to show
that for some values of q, the results provided by the proposed algorithms are significantly
better than the Gaussian SF algorithms.

We also present proofs of convergence of the proposed algorithms, providing the conditions
under which the algorithms converge to a local minimum of the objective function. In the
course of the convergence analysis, we come across some interesting properties of the
multivariate q-Gaussian distribution. We show that the q-Gaussian satisfies the Rubinstein
conditions [35], and also provide an expression for the higher order generalized co-moments
of the distribution.
Bibliography
[1] S. Abe and N. Suzuki. Itineration of the internet over nonequilibrium stationary
states in Tsallis statistics. Physical Review E, 67(016106), 2003.
[2] S. Abe and N. Suzuki. Scale-free statistics of time interval between successive earth-
quakes. Physica A: Statistical Mechanics and its Applications, 350:588–596, 2005.
[3] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science,
286:509–512, 1999.
[4] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
[5] S. Bhatnagar. Adaptive Newton-based multivariate smoothed functional algorithms
for simulation optimization. ACM Transactions on Modeling and Computer Simula-
tion, 18(1):27–62, 2007.
[6] S. Bhatnagar and V. S. Borkar. Two timescale stochastic approximation scheme
for simulation-based parametric optimization. Probability in the Engineering and
Informational Sciences, 12:519–531, 1998.
[7] S. Bhatnagar and V. S. Borkar. Multiscale chaotic SPSA and smoothed functional
algorithms for simulation optimization. Simulation, 79(9):568–580, 2003.
[8] S. Bhatnagar, M. C. Fu, S. I. Marcus, and S. Bhatnagar. Two timescale algorithms for
simulation optimization of hidden Markov models. IIE Transactions, 33(3):245–258,
2001.
[9] S. Bhatnagar, M. C. Fu, S. I. Marcus, and I. J. Wang. Two-timescale simultaneous
perturbation stochastic approximation using deterministic perturbation sequences.
ACM Transactions on Modeling and Computer Simulation, 13(2):180–209, 2003.
[10] E. P. Borges. A possible deformed algebra and calculus inspired in nonextensive
thermostatistics. Physica A: Statistical Mechanics and its Applications, 340:95–101,
2004.
[11] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cam-
bridge University Press, 2008.
[12] J. Costa, A. Hero, and C. Vignat. On solutions to multivariate maximum α-entropy
problems. Energy Minimization Methods in Computer Vision and Pattern Recogni-
tion, Lecture Notes in Computer Science, 2683:211–226, 2003.
[13] Z. Daroczy. Generalized information functions. Information and Control, 16(1):36–
51, 1970.
[14] A. Dukkipati, S. Bhatnagar, and M. N. Murty. Gelfand-Yaglom-Perez theorem for
generalized relative entropy functionals. Information Sciences, 177(24):5707–5714,
2007.
[15] A. Dukkipati, S. Bhatnagar, and M. N. Murty. On measure-theoretic aspects of
nonextensive entropy functionals and corresponding maximum entropy prescriptions.
Physica A: Statistical Mechanics and its Applications, 384(2):758–774, 2007.
[16] V. Fabian. Stochastic approximation. In J. J. Rustagi, editor, Optimizing methods
in Statistics, pages 439–470, New York, 1971. Academic Press.
[17] J. E. Gentle, W. Hardle, and Y. Mori. Handbook of Computational Statistics: Con-
cepts and Methods. Springer, 2004.
[18] D. Ghoshdastidar, A. Dukkipati, and S. Bhatnagar. q-Gaussian based smoothed
functional algorithms for stochastic optimization. In International Symposium on
Information Theory. IEEE, 2012.
[19] I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series and Products (5th ed.).
Elsevier, 1994.
[20] J. Havrda and F. Charvat. Quantification method of classification processes: Concept
of structural a-entropy. Kybernetika, 3(1):30–35, 1967.
[21] M. W. Hirsch. Convergent activation dynamics in continuous time networks. Neural
Networks, 2:331–349, 1989.
[22] Y. C. Ho and X. R. Cao. Perturbation Analysis of Discrete Event Dynamical Systems.
Kluwer Academic Publishers, 1991.
[23] E. T. Jaynes. Information theory and statistical mechanics. The Physical Review,
106(4):620–630, 1957.
[24] V. Y. A. Katkovnik and Y. U. Kulchitsky. Convergence of a class of random search
algorithms. Automation Remote Control, 8:1321–1326, 1972.
[25] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function.
Annals of Mathematical Statistics, 23:462–466, 1952.
[26] A. N. Kolmogorov. New metric invariant of transitive dynamical systems and endo-
morphisms of Lebesgue spaces. Doklady of Russian Academy of Sciences, 119(5):861–
864, 1958.
[27] S. Kullback. Information theory and statistics. John Wiley and Sons, N.Y., 1959.
[28] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained
and Unconstrained Systems. Springer-Verlag, New York, 1978.
[29] P. L’Ecuyer and P. W. Glynn. Stochastic optimization by simulation: Convergence
proofs for the GI/G/1 queue in steady state. Management Science, 40(11):1562–1578,
1994.
[30] V. Pareto. Manuale di economica politica. Societa Editrice Libraria, 1906.
[31] A. Perez. Risk estimates in terms of generalized f -entropies. In Proceedings of the
Colloquium on Information Theory, Debrecen 1967, pages 299–315, Budapest, 1968.
Journal Bolyai Mathematical Society.
[32] D. Prato and C. Tsallis. Nonextensive foundation of Levy distributions. Physical
Review E., 60(2):2398–2401, 1999.
[33] A. Renyi. On measures of entropy and information. In Fourth Berkeley Symposium
on Mathematical Statistics and Probability, 1960, volume 1, pages 547–561, Berkeley,
California, 1961. University of California Press.
[34] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathe-
matical Statistics, 22(3):400–407, 1951.
[35] R. Y. Rubinstein. Simulation and Monte-Carlo Method. John Wiley, New York,
1981.
[36] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro proce-
dure. Annals of Statistics, 13:236–245, 1985.
[37] A. H. Sato. q-Gaussian distributions and multiplicative stochastic processes for
analysis of multiple financial time series. Journal of Physics: Conference Series,
201(012008), 2010.
[38] C. E. Shannon. A mathematical theory of communication. The Bell System Technical
Journal, 27, 1948.
[39] J. C. Spall. Multivariate stochastic approximation using a simultaneous perturbation
gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–334,
1992.
[40] J. C. Spall. Adaptive stochastic approximation by the simultaneous perturbation
method. IEEE Transactions on Automatic Control, 45:1839–1853, 2000.
[41] M. A. Styblinski and T. S. Tang. Experiments in nonconvex optimization: Stochastic
approximation with function smoothing and simulated annealing. Neural Networks,
3(4):467–483, 1990.
[42] H. Suyari. Generalization of Shannon-Khinchin axioms to nonextensive systems and
the uniqueness theorem for the nonextensive entropy. IEEE Transactions on Infor-
mation Theory, 50:1783–1787, 2004.
[43] W. J. Thistleton, J. A. Marsh, K. Nelson, and C. Tsallis. Generalized Box-Muller
method for generating q-Gaussian random deviates. IEEE Transactions on Informa-
tion Theory, 53(12):4805–4810, 2007.
[44] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical
Physics, 52(1-2):479–487, 1988.
[45] C. Tsallis. Some comments on Boltzmann-Gibbs statistical mechanics. Chaos, Soli-
tons & Fractals, 6:539–559, 1995.
[46] C. Tsallis, R. S. Mendes, and A. R. Plastino. The role of constraints within gener-
alized nonextensive statistics. Physica A: Statistical Mechanics and its Applications,
261(3–4):534–554, 1998.
[47] S. Umarov and C. Tsallis. Multivariate generalizations of the q-central limit theorem.
arXiv:cond-mat/0703533, 2007.
[48] C. Vignat and A. Plastino. Central limit theorem and deformed exponentials. Journal
of Physics A: Mathematical and Theoretical, 20(45), 2007.
[49] D. Williams. Probability with Martingales. Cambridge University Press, 1991.
[50] X. Zhu and J. C. Spall. A modified second-order SPSA optimization algorithm for
finite samples. International Journal of Adaptive Control and Signal Process, 16:397–
409, 2002.