Chapter 10 1
RESAMPLING AND PORTFOLIO ANALYSIS
Introduction
• Computer simulation is widely used in OR.
• Applications of simulation to statistics are
widespread.
• Topic of this chapter: simulation technique called the
“bootstrap” or “resampling”
• Will study the effects of estimation error on portfolio
selection.
• “bootstrap” from the phrase “pulling oneself up by
one’s bootstraps”
– “bootstrap” = “resampling”
• Statistics from a random sample are random
variables.
– The sample is only one of many possible samples.
– Each possible sample gives a possible value of the sample mean $\bar{X}$.
– We only see one value of $\bar{X}$,
∗ but it was selected at random from the many
possible values.
– Thus, $\bar{X}$ is a random variable.
• Confidence intervals and hypothesis tests are based
on the randomness of statistics.
– Example: confidence coefficient tells us the
probability that an interval constructed from a
random sample will contain the parameter.
– Confidence intervals are sometimes derived using
probability theory.
– Often the necessary probability calculations are
intractable.
– In that case we can replace theoretical calculations
by Monte Carlo simulation.
• How do we simulate sampling from an unknown
population?
• We cannot do this exactly.
• However, a sample is a good representative of the
population.
• We can simulate sampling from the population by
sampling from the sample.
– this is usually called resampling
[Figure: schematic of resampling. A population with mean µ yields many possible samples, each with its own $\bar{X}$. One sample is observed and the others are samples not taken. Resamples are drawn from the observed sample.]
• Each resample has the same sample size, n, as the
original sample.
• We are trying to simulate the original sampling,
so we want the resampling to be as similar as possible
to the original sampling.
• The resamples are drawn with replacement.
• Only sampling with replacement gives independent
observations.
• We want the resamples to be i.i.d. just like the
original sample.
• If the resamples were drawn without replacement
then every resample would be exactly the same as the
original sample.
– So the resamples would show no random variation.
– This wouldn’t be very satisfactory, of course.
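The contrast between the two sampling schemes can be sketched in a few lines of Python (standard library only). The six data values are the ones used in the small example later in these slides; the seed is an arbitrary choice made only so that the run is reproducible.

```python
import random

sample = [82, 93, 99, 103, 104, 110]
n = len(sample)
rng = random.Random(0)   # arbitrary seed, only for reproducibility

# Sampling n values WITH replacement: values may repeat or be left out.
with_replacement = rng.choices(sample, k=n)

# Sampling n values WITHOUT replacement: only a reordering of the sample.
without_replacement = rng.sample(sample, k=n)

print("with replacement:   ", sorted(with_replacement))
print("without replacement:", sorted(without_replacement))
```

After sorting, the draw without replacement always reproduces the original sample, which is why it shows no random variation.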
• The number of resamples taken should be large.
• Just how large depends on context and will be
discussed more fully later.
• Sometimes, tens of thousands of resamples are taken.
• We will let B denote the number of resamples.
• There is some good news and some bad news about
the bootstrap.
– The good news is that computer simulation
replaces difficult mathematics.
• The bad news is that resampling is a new and
unfamiliar concept.
– It takes some time to become comfortable
with resampling.
– The problem is not that resampling is all that
conceptually complex.
– Rather, the problem is that students don’t have
much experience with even a single random sample
from a population.
– Resampling is even more complex than that:
∗ two layers of sampling and multiple resamples.
• However, studying resampling gives us a better
understanding of sampling.
Confidence intervals for the mean
• Before resampling the efficient frontier, we will look
at a simpler problem.
• Suppose we wish to construct a confidence interval for
the population mean.
• Start with the so-called t-statistic:
$$t = \frac{\mu - \bar{X}}{s/\sqrt{n}}. \qquad (1)$$
• The denominator of t, $s/\sqrt{n}$, is just the standard
error of the mean.
• Sampling from a normal population
– the probability distribution of t is the
t-distribution with n− 1 degrees of freedom.
• Denote by $t_{\alpha/2}$ the α/2 upper t-value,
– this is the 1 − α/2 quantile of this distribution.
• Thus t in (1) has probability α/2 of exceeding $t_{\alpha/2}$.
• Because of the symmetry of the t-distribution, the
probability is also α/2 that t is less than $-t_{\alpha/2}$.
• Therefore, for normally distributed data, the
probability is 1 − α that
$$-t_{\alpha/2} \le t \le t_{\alpha/2}. \qquad (2)$$
• Substituting (1) into (2), after a bit of algebra we find
that the probability is 1 − α that
$$\bar{X} - t_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2}\frac{s}{\sqrt{n}}, \qquad (3)$$
– which shows that
$$\bar{X} \pm \frac{s}{\sqrt{n}}\, t_{\alpha/2}$$
is a 1 − α confidence interval for µ, assuming
normally distributed data.
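The "bit of algebra" that leads from (2) to (3) can be written out as follows. This is only a spelled-out version of the step, assuming s > 0: multiply through by the standard error $s/\sqrt{n}$, then add $\bar{X}$ to all three parts.

```latex
\begin{align*}
-t_{\alpha/2} \le \frac{\mu - \bar{X}}{s/\sqrt{n}} \le t_{\alpha/2}
&\iff -t_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu - \bar{X} \le t_{\alpha/2}\frac{s}{\sqrt{n}} \\
&\iff \bar{X} - t_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2}\frac{s}{\sqrt{n}}
\end{align*}
```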
• What if we are not sampling from a normal
distribution?
• In that case, the distribution of t in (1) will not be the
t-distribution, but rather some other distribution that
is not known to us.
• There are two problems.
– First, we don’t know the population distribution.
– Second, a difficult probability calculation is
necessary to get the distribution of the t-statistic
from the population distribution.
∗ This calculation has only been done for normal
populations.
• Considering these difficulties, can we get a confidence
interval?
– “yes, by resampling.”
• Take a large number, say B, of resamples from the
original sample.
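• Let $\bar{X}_{\mathrm{boot},b}$ and $s_{\mathrm{boot},b}$ be the sample mean and
standard deviation of the bth resample.
• Define
$$t_{\mathrm{boot},b} = \frac{\bar{X} - \bar{X}_{\mathrm{boot},b}}{s_{\mathrm{boot},b}/\sqrt{n}}. \qquad (4)$$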
• Notice that $t_{\mathrm{boot},b}$ is defined in the same way as t
except for two changes.
– First, $\bar{X}$ and s in t are replaced by $\bar{X}_{\mathrm{boot},b}$ and
$s_{\mathrm{boot},b}$ in $t_{\mathrm{boot},b}$.
– Second, µ in t is replaced by $\bar{X}$ in $t_{\mathrm{boot},b}$.
– The last point is a bit subtle, and you should stop
to think about it.
∗ A resample is taken using the original sample as
the population.
∗ Thus, for the resample, the population mean is
$\bar{X}$!
• Resamples are independent.
– Therefore, $t_{\mathrm{boot},1}, t_{\mathrm{boot},2}, \ldots$ is a random sample
from the distribution of the t-statistic.
• After B values of $t_{\mathrm{boot},b}$ have been calculated:
– we find the 2.5% and 97.5% percentiles of this
collection of $t_{\mathrm{boot},b}$ values.
– Call these percentiles $t_L$ and $t_U$.
• More specifically, we find $t_L$ and $t_U$ as follows:
– The B values of $t_{\mathrm{boot},b}$ are sorted from smallest to
largest.
– Then we calculate Bα/2 and round to the nearest
integer.
∗ example: if α = .05 and B = 1000, then
$K_L = (1000)(.05)/2 = 25$
– Suppose the result is $K_L$. Then the $K_L$th sorted
value of $t_{\mathrm{boot},b}$ is $t_L$.
– Similarly, let $K_U$ be B(1 − α/2) rounded to the
nearest integer; then the $K_U$th sorted
value of $t_{\mathrm{boot},b}$ is $t_U$.
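The rank rule above can be sketched in Python as follows. The function names `percentile_ranks` and `pick_t_values` are hypothetical helpers introduced only for this illustration. One caveat is an assumption worth checking in practice: Python's `round` sends exact halves to the nearest even integer, so for values of B and α where Bα/2 ends in .5 the rank may differ by one from other rounding conventions, and for very small B the lower rank can round to 0.

```python
def percentile_ranks(B, alpha):
    """Return the 1-based ranks (K_L, K_U) used to pick t_L and t_U."""
    k_lower = round(B * alpha / 2)
    k_upper = round(B * (1 - alpha / 2))
    return k_lower, k_upper


def pick_t_values(t_boot, alpha):
    """Sort the bootstrap t-values and return (t_L, t_U)."""
    ordered = sorted(t_boot)
    k_lower, k_upper = percentile_ranks(len(ordered), alpha)
    # Python lists are 0-based, so the value with rank K sits at index K - 1.
    return ordered[k_lower - 1], ordered[k_upper - 1]


print(percentile_ranks(1000, 0.05))   # ranks for alpha = .05 and B = 1000
```

With α = .05 and B = 1000 the ranks are 25 and 975, matching the example in the slides.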
• If the original population is skewed, then we do not
necessarily expect that $t_L = -t_U$.
• However, this fact causes us no problem since the
bootstrap allows us to estimate $t_L$ and $t_U$ without
assuming any relationship between them.
• Now we replace $-t_{\alpha/2}$ and $t_{\alpha/2}$ in the confidence
interval (3) by $t_L$ and $t_U$, respectively.
• Finally, the bootstrap confidence interval for µ is
$$\left(\bar{X} + t_L\frac{s}{\sqrt{n}},\; \bar{X} + t_U\frac{s}{\sqrt{n}}\right).$$
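The whole procedure can be put together in a short Python sketch (standard library only). Several things here are assumptions for illustration rather than part of the slides: the function name `bootstrap_t_interval`, the seeds, the choice B = 2000, the synthetic exponential data, and the decision to redraw a resample whose values are all equal (its t-value is undefined because its standard deviation is zero).

```python
import math
import random
import statistics


def bootstrap_t_interval(sample, alpha=0.05, B=2000, seed=0):
    """Bootstrap-t confidence interval for the population mean."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    n = len(sample)
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)

    t_values = []
    while len(t_values) < B:
        resample = rng.choices(sample, k=n)       # n draws with replacement
        s_boot = statistics.stdev(resample)
        if s_boot == 0:
            # Assumption (not in the slides): redraw a degenerate resample.
            continue
        se_boot = s_boot / math.sqrt(n)
        t_values.append((xbar - statistics.mean(resample)) / se_boot)

    t_values.sort()
    k_lower = max(round(B * alpha / 2), 1)          # rank K_L (1-based)
    k_upper = min(round(B * (1 - alpha / 2)), B)    # rank K_U (1-based)
    t_lower = t_values[k_lower - 1]
    t_upper = t_values[k_upper - 1]
    return xbar + t_lower * se, xbar + t_upper * se


# Hypothetical skewed data: 40 draws from an exponential distribution, mean 10.
data_rng = random.Random(1)
data = [data_rng.expovariate(1 / 10) for _ in range(40)]
lower, upper = bootstrap_t_interval(data)
print(f"95% bootstrap-t interval: ({lower:.2f}, {upper:.2f})")
```

Because the data are skewed, the two endpoints are generally not the same distance from the sample mean, which is the asymmetry the slides describe.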
• The bootstrap has solved both problems mentioned
above.
– We do not need to know the population
distribution since we can estimate it by the sample.
– Moreover, we don’t need to calculate the
distribution of the t-statistic using probability
theory.
∗ Instead we can simulate from this distribution.
We will use the notation
$$\mathrm{SE} = \frac{s}{\sqrt{n}} \quad\text{and}\quad \mathrm{SE}_{\mathrm{boot}} = \frac{s_{\mathrm{boot}}}{\sqrt{n}}.$$
Example: We start with a very small sample of size six
to illustrate how the bootstrap works.
• The sample is 82, 93, 99, 103, 104, 110.
– $\bar{X} = 98.50$
– SE is 4.02.
• The first bootstrap sample is 82, 82, 93, 93, 103, 110.
– In this bootstrap sample,
∗ 82 and 93 were sampled twice,
∗ 103 and 110 were sampled once,
∗ the other elements of the original sample were
not sampled.
• For this bootstrap sample
– $\bar{X}_{\mathrm{boot}} = 93.83$,
– $\mathrm{SE}_{\mathrm{boot}} = 4.57$,
– $t_{\mathrm{boot}} = (98.5 - 93.83)/4.57 = 1.02$.
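The numbers for the first bootstrap sample can be checked with a few lines of Python. This is only a verification sketch; it assumes, as the definitions above suggest, that the standard deviation uses the n − 1 divisor.

```python
import math
import statistics

original = [82, 93, 99, 103, 104, 110]
resample = [82, 82, 93, 93, 103, 110]   # first bootstrap sample from the slides
n = len(original)

xbar = statistics.mean(original)
xbar_boot = statistics.mean(resample)
se_boot = statistics.stdev(resample) / math.sqrt(n)   # stdev divides by n - 1
t_boot = (xbar - xbar_boot) / se_boot

print(f"xbar_boot = {xbar_boot:.2f}, SE_boot = {se_boot:.2f}, t_boot = {t_boot:.2f}")
```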
• The second bootstrap sample is 82, 103, 110, 110,
110, 110.
– In this bootstrap sample,
∗ 82 and 103 were sampled once,
∗ 110 was sampled four times,
∗ the other elements of the original sample were
not sampled.
It may seem strange at first that 110 was resampled four
times.
• The number of times 110 appears in a resample is
binomial with parameters p = 1/6 and n = 6.
• The probability that 110 occurs exactly four times in the
resample is
$$\frac{6!}{4!\,2!}\left(\frac{1}{6}\right)^{4}\left(\frac{5}{6}\right)^{2} = 0.00804.$$
• The probability that one of the six elements of the
original sample will occur exactly four times in a
resample is (6)(.00804) = .0482.
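The two probabilities above can be reproduced directly from the binomial formula. The short Python check below is a sketch and uses only `math.comb` from the standard library.

```python
from math import comb

n, p = 6, 1 / 6          # six draws, each equal to 110 with probability 1/6
k = 4

p_four = comb(n, k) * p**k * (1 - p)**(n - k)
# At most one value can appear four times in six draws, so the six events
# are mutually exclusive and their probabilities simply add.
p_any_four = 6 * p_four

print(f"P(110 appears exactly 4 times) = {p_four:.5f}")
print(f"P(some value appears exactly 4 times) = {p_any_four:.4f}")
```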
• For this bootstrap sample
– $\bar{X}_{\mathrm{boot}} = 104.17$,
– $\mathrm{SE}_{\mathrm{boot}} = 4.58$,
– $t_{\mathrm{boot}} = (98.5 - 104.17)/4.58 = -1.24$.
• The third bootstrap sample is 82, 82, 93, 99, 104, 110.
– For this bootstrap sample
∗ $\bar{X}_{\mathrm{boot}} = 95.00$,
∗ $\mathrm{SE}_{\mathrm{boot}} = 4.70$,
∗ $t_{\mathrm{boot}} = (98.5 - 95.00)/4.70 = 0.74$.
• If this example were to continue:
– more bootstrap samples would be drawn
– all bootstrap t values would be saved in order
to compute quantiles of the bootstrap t values.
• Since the sample size is so small, this example is not
very realistic and we will not continue it.
Example: Suppose that we have a random sample with
a more realistic size of 40 from some population and
$\bar{X} = 107$ and s = 12.6.
• Let’s find the “normal theory” 95% confidence
interval for the population mean µ.
• With 39 degrees of freedom, $t_{.025} = 2.02$.
• Therefore, the confidence interval for µ is
$$107 \pm 2.02\,\frac{12.6}{\sqrt{40}} = (102.97, 111.03).$$
• Suppose that we use resampling instead of normal
theory and that we use 1,000 resamples.
• This gives us 1,000 values of $t_{\mathrm{boot},b}$. We rank them
from smallest to largest.
• The 2.5% percentile is the one with rank 25 =
(1000)(.025).
• Suppose the 25th smallest value of $t_{\mathrm{boot},b}$ is −1.98.
• The 97.5% percentile is the value of $t_{\mathrm{boot},b}$ with rank
975.
• Suppose that its value is 2.25.
• Then
– $t_L = -1.98$,
– $t_U = 2.25$,
– the 95% confidence interval for µ is
$$\left(107 - 1.98\,\frac{12.6}{\sqrt{40}},\; 107 + 2.25\,\frac{12.6}{\sqrt{40}}\right) = (103.06, 111.48).$$
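The arithmetic for this interval can be checked with a short Python sketch. The percentile values −1.98 and 2.25 are the supposed values quoted in the example, not the output of an actual resampling run.

```python
import math

xbar, s, n = 107.0, 12.6, 40
t_lower, t_upper = -1.98, 2.25      # bootstrap percentiles quoted in the example

se = s / math.sqrt(n)               # standard error of the mean
lower = xbar + t_lower * se
upper = xbar + t_upper * se

print(f"SE = {se:.3f}")
print(f"bootstrap-t interval: ({lower:.2f}, {upper:.2f})")
```

The interval is not centered at 107 because the two percentiles differ in absolute value.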
Example: Log-returns for MSCI-Argentina
• MSCI-Argentina is the Morgan Stanley Capital International (MSCI)
index for Argentina
– roughly comparable to the S&P 500 for the US.
• The log-returns for this index from January 1988 to
January 2002, inclusive, are used in this example.
• A normal plot is found later.
• There is evidence of non-normality, in particular, that
the log-returns are heavy-tailed, especially the left
tail.
• The bootstrap was implemented with B = 10,000.
• The t-values were
– $t_L = -1.93$
– $t_U = 1.98$.
• To assess the Monte Carlo variability, the bootstrap
was repeated two more times with results:
– $t_L = -1.98$ and $t_U = 1.96$
– $t_L = -1.94$ and $t_U = 1.94$.
• We see that B = 10,000 gives reasonable accuracy
but that the third significant digit is still uncertain.
• Also, using normal theory, $t_L = -t_{.025} = -1.974$ and
$t_U = t_{.025} = 1.974$, which are similar to the bootstrap
values that do not assume normality.
• Therefore, the use of the bootstrap in this example
confirms that the normal theory confidence interval is
satisfactory.
• In other examples, particularly with strongly skewed
data and small sample sizes, the normal theory
confidence interval will be less satisfactory.
Here is some MATLAB code that does resampling:
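n = size(sample,1) ;              % number of observations (rows of sample)
B = 2000 ;                        % number of resamples
for b=1:B
   select = ceil(n*rand(n,1)) ;   % n random integers between 1 and n
   resample = sample(select,:) ;  % rows drawn with replacement
   % Put code in here to calculate
   % the statistics that are needed in your application.
end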
• “select”, which equals “ceil(n*rand(n,1))”, will be n
random integers between 1 and n.
• “resample” will be a resample from “sample” using
the indices in “select”.
• For example, if n equals 6, the sample is
$s_1, s_2, \ldots, s_6$, and select equals [3 2 6 5 2 5], then
the resample is $s_3, s_2, s_6, s_5, s_2, s_5$.
• 2000 resamples are taken, that is, B = 2000.
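The index example above can be traced in a few lines of Python. This sketch is only an illustration of the indexing step; the strings "s1" to "s6" stand in for the six sample values.

```python
sample = ["s1", "s2", "s3", "s4", "s5", "s6"]
select = [3, 2, 6, 5, 2, 5]          # 1-based indices, as in the MATLAB code

# Python lists are 0-based, so subtract 1 from each index.
resample = [sample[i - 1] for i in select]
print(resample)
```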
Here is some MATLAB code that does a 95% bootstrap-t
confidence interval by resampling:
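n = length(sample) ;                % sample size (sample is a column vector)
xbar = mean(sample) ;               % plays the role of the population mean
B = 2000 ;                          % number of resamples
t = zeros(B,1) ;
for b=1:B
   select = ceil(n*rand(n,1)) ;     % n random integers between 1 and n
   resample = sample(select,:) ;    % resample drawn with replacement
   t(b) = (xbar - mean(resample)) / (std(resample)/sqrt(n)) ;
end
t = sort(t) ;
t_L = t( round(.05*B/2) ) ;         % 2.5% percentile of the bootstrap t-values
t_U = t( round((1-.05/2)*B) ) ;     % 97.5% percentile of the bootstrap t-values
ci = xbar + [t_L t_U]*std(sample)/sqrt(n) ;   % bootstrap-t confidence interval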
Resampling the efficient frontier
• One application of optimal portfolio selection is the
allocation of capital to different market segments.
• Michaud (1998) discusses a global asset allocation
problem where capital must be allocated to
– U.S. stocks and government/corporate bonds,
– Euros,
– the Canadian, French, German, Japanese, and
U.K. equity markets.
Here we look at a similar example where we allocate
capital to the equity markets of ten different countries.
Monthly log-returns for these markets were calculated
from:
• 1 = MSCI Hong Kong
• 2 = MSCI Singapore
• 3 = MSCI Brazil
• 4 = MSCI Argentina
• 5 = MSCI UK
• 6 = MSCI Germany
• 7 = MSCI Canada
• 8 = MSCI France
• 9 = MSCI Japan
• 10 = S&P 500
• The data are in the file “countries.txt” on the course
web site and came from Datastream.
• “MSCI” means “Morgan Stanley Capital International.”
• The data are from January 1988 to January 2002,
inclusive, so there are 169 months (14 years and one
month) of data.
[Figure: time series plots of the returns, one panel each for Hong Kong, Singapore, Brazil, Argentina, UK, Germany, Canada, France, Japan, and US. Vertical axis: return, from −1 to 1. Horizontal axis: 0 to 150.]
[Figure: one panel each for Hong Kong, Singapore, Brazil, Argentina, UK, Germany, Canada, France, Japan, and US, with vertical axis labeled “corr” (from −0.2 to 0.2) and horizontal axis from 0 to 20.]
[Figure: normal probability plots of the returns, one panel each for Hong Kong, Singapore, Brazil, Argentina, UK, Germany, Canada, France, Japan, and US. Vertical axis: probability, from 0.003 to 0.997.]
• If we are planning to invest in a number of
international capital markets, then we might like to
know the efficient frontier,
– of course, we can never know it exactly.
• At best, we can estimate the efficient frontier using
estimated expected returns and the estimated
covariance matrix of returns.
• How close is the estimated efficient frontier to the
unknown true efficient frontier?
• Each resample consists of 168 returns drawn with
replacement from the 168 returns of the original
sample.
• From the resampling perspective, the original sample
is treated as the population and the efficient frontier
calculated from the original sample is viewed as the
true efficient frontier.
• We can recalculate the efficient frontier using each of
the resamples and compare these re-estimated
efficient frontiers to the “true efficient frontier.”
[Figure: six panels, each plotting expected return ($\mu_P$) against standard deviation of return ($\sigma_P$), with curves labeled “achieved” and “optimal”.]
• To be more precise, let µ and Ω be the mean vector
and covariance matrix estimated from the original
sample.
• For a given target for the expected portfolio return,
$\mu_P$, let $\omega_{\mu_P}$ be the efficient portfolio weight vector
with g and h estimated from the original sample.
• Let $\omega_{\mu_P,b}$ be the efficient portfolio weight vector with
g and h estimated from the bth resample.
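• Then the solid blue curves in the figure are
$\omega_{\mu_P}^T \mu = \mu_P$ plotted against $\sqrt{\omega_{\mu_P}^T \Omega\, \omega_{\mu_P}}$
for a grid of $\mu_P$ values. The dashed red curves are
$\omega_{\mu_P,b}^T \mu$ plotted against $\sqrt{\omega_{\mu_P,b}^T \Omega\, \omega_{\mu_P,b}}$.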
• Unfortunately, the red resampled efficient frontier
curve lies below the blue true efficient frontier curve.
• Because of estimation error, our estimated efficient
frontiers are suboptimal.
• Also, $\omega_{\mu_P,b}^T \mu$ does not, in general, equal $\mu_P$.
• We do not achieve the expected return µP that we
have targeted because of estimation error when
estimating µ.
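The resampling experiment described above can be sketched in Python with only the standard library. Several things are assumptions for illustration and are not taken from the slides: the data are synthetic returns for three hypothetical assets (not the MSCI data), the number of resamples is 200, all function names are hypothetical, and the efficient weights are found by solving the Lagrange conditions for minimizing $\omega^T \Omega \omega$ subject to $\omega^T \mu = \mu_P$ and weights summing to one (short sales allowed), rather than through g and h.

```python
import math
import random


def mean_vector(returns):
    """Column means of a list of return rows."""
    n = len(returns)
    return [sum(row[j] for row in returns) / n for j in range(len(returns[0]))]


def covariance_matrix(returns):
    """Sample covariance matrix (divisor n - 1)."""
    n, k = len(returns), len(returns[0])
    m = mean_vector(returns)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in returns) / (n - 1)
             for j in range(k)] for i in range(k)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x


def efficient_weights(mu, omega, target):
    """Minimize w' omega w subject to w' mu = target and sum(w) = 1."""
    k = len(mu)
    # First-order (Lagrange) conditions written as one linear system.
    kkt = [[2 * omega[i][j] for j in range(k)] + [mu[i], 1.0] for i in range(k)]
    kkt.append(mu[:] + [0.0, 0.0])
    kkt.append([1.0] * k + [0.0, 0.0])
    return solve(kkt, [0.0] * k + [target, 1.0])[:k]


def portfolio_point(w, mu, omega):
    """Return (standard deviation, expected return) for weights w."""
    k = len(w)
    variance = sum(w[i] * omega[i][j] * w[j] for i in range(k) for j in range(k))
    return math.sqrt(variance), sum(w[i] * mu[i] for i in range(k))


rng = random.Random(0)
# Hypothetical monthly returns for three assets (NOT the MSCI data).
true_means, true_sds = [0.005, 0.010, 0.015], [0.04, 0.06, 0.09]
returns = [[rng.gauss(m, s) for m, s in zip(true_means, true_sds)]
           for _ in range(168)]

# From the resampling perspective these are the "true" mu and Omega.
mu, omega = mean_vector(returns), covariance_matrix(returns)
target = 0.012
optimal = portfolio_point(efficient_weights(mu, omega, target), mu, omega)

achieved = []
for b in range(200):
    resample = rng.choices(returns, k=len(returns))   # rows drawn with replacement
    w_b = efficient_weights(mean_vector(resample), covariance_matrix(resample), target)
    achieved.append(portfolio_point(w_b, mu, omega))  # evaluated with "true" mu, Omega

print("optimal (sd, return):", optimal)
print("first achieved (sd, return):", achieved[0])
```

Each achieved point is evaluated with the original-sample µ and Ω, so it cannot have a smaller standard deviation than the original efficient frontier at its own achieved return, and its achieved return generally differs from the target.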
[Figure: three subplots. Upper: expected return ($\mu_P$) against standard deviation of return ($\sigma_P$), with legend entries “achieved”, “optimal”, and “eff. frontier”. Middle: histogram (frequency) of $\sigma_P - \sigma_{P,\mathrm{opt}}$. Lower: histogram (frequency) of $\mu_P - .012$.]
• We concentrate on estimating only a single point on
the efficient frontier, the point where the expected
portfolio return is 0.012.
• This point is shown in the upper subplot as a large
black asterisk and is the point
$$\left(\sqrt{\omega_{.012}^T \Omega\, \omega_{.012}},\; .012\right).$$
• Each small blue asterisk in the upper subplot is the
estimate of this point from a resample and is
$$\left(\sqrt{\omega_{.012,b}^T \Omega\, \omega_{.012,b}},\; \omega_{.012,b}^T \mu\right).$$
• The middle subplot is a histogram of the values of
$$\sqrt{\omega_{.012,b}^T \Omega\, \omega_{.012,b}} - \sqrt{\omega_{.012}^T \Omega\, \omega_{.012}}.$$
• The lower subplot is a histogram of the values of
$$\omega_{.012,b}^T \mu - .012.$$
• Question: “what is the main problem here,
– mis-estimation of the expected returns,
– mis-estimation of the covariance matrix of the
returns,
– or both?”
One of the fun things we can do with resampling is to
play the game of “what if?”
• In particular, we can ask, “what would happen if we
knew the true expected returns and only had to
estimate the covariance matrix?”
• We can also ask the opposite question, “what would
happen if we knew the true covariance matrix and
only had to estimate the expected returns?”
• By playing these “what if” games, we can address our
question about the relative effects of mis-estimation
of µ and mis-estimation of the covariance matrix.
• In the next figure, when we estimate the efficient frontier
for each resample, we use the mean returns from the
original sample, which from the resampling
perspective are the population values.
• Only the covariance matrix is estimated from the
resample.
• The lower subplot is a histogram of the values of
$\omega_{.012,b}^T \mu - .012$.
[Figure: two subplots. First: expected return ($\mu_P$) against standard deviation of return ($\sigma_P$), with legend entries “achieved”, “optimal”, and “eff. frontier”. Second: histogram (frequency) of $\sigma_P - \sigma_{P,\mathrm{opt}}$.]
• In the following figure, when we estimate the efficient
frontier for each resample, we use the covariance
matrix from the original sample.
• Only the expected returns are estimated from the
resample.
[Figure: three subplots. Upper: expected return ($\mu_P$) against standard deviation of return ($\sigma_P$), with legend entries “achieved”, “optimal”, and “eff. frontier”. Middle: histogram (frequency) of $\sigma_P - \sigma_{P,\mathrm{opt}}$. Lower: histogram (frequency) of $\mu_P - .012$.]
• “What would happen if we had more data?”
[Figure: three subplots. Upper: expected return ($\mu_P$) against standard deviation of return ($\sigma_P$), with legend entries “achieved”, “optimal”, and “eff. frontier”. Middle: histogram (frequency) of $\sigma_P - \sigma_{P,\mathrm{opt}}$. Lower: histogram (frequency) of $\mu_P - .012$.]