
Solved Exercises and Problems of Statistical Inference
casado-d.org/edu/ExercisesProblemsStatisticalInference.pdf

David Casado

Complutense University of Madrid ∟ Faculty of Economic and Business Sciences
∟ Department of Statistics and Operational Research II ∟ David Casado de Lucas

You can decide not to print this file and consult it in digital format – paper and ink will be saved. Otherwise, print it on recycled paper, double-sided and with less ink. Be ecological. Thank you very much.

5 June 2015


Contents

Links, Keywords and Descriptions .......... 1 – 6
Inference Theory (IT) .......... 7 – 12
    Framework and Scope of the Methods .......... 7
    Some Remarks .......... 7 – 9
    Sampling Probability Distribution .......... 9 – 12
Point Estimations (PE) .......... 13 – 73
    Methods for Estimating .......... 13 – 27
    Properties of Estimators .......... 27 – 64
    Methods and Properties .......... 64 – 73
Confidence Intervals (CI) .......... 74 – 93
    Methods for Estimating .......... 74 – 81
    Minimum Sample Size .......... 81 – 83
    Methods and Sample Size .......... 83 – 93
Hypothesis Tests (HT) .......... 94 – 142
    Parametric .......... 94 – 125
        Based on T .......... 94 – 117
        Based on Λ .......... 117 – 122
        Analysis of Variance (ANOVA) .......... 122 – 125
    Nonparametric .......... 126 – 137
    Parametric and Nonparametric .......... 138 – 142
PE – CI – HT .......... 143 – 153
Additional Exercises .......... 154 – 169
Appendixes
    Probability Theory .......... 170 – 191
        Some Reminders .......... 170 – 175
            Markov's Inequality. Chebyshev's Inequality .......... 170 – 171
            Probability and Moments Generating Functions. Characteristic Function .......... 171 – 172
    Mathematics .......... 191 – 209
        Some Reminders .......... 191 – 192
            Limits .......... 192 – 194
References .......... 210
Tables of Statistics .......... 211 – 217
Probability Tables .......... 218 – 222
Index .......... 223 – 225

Prologue

These exercises and problems are a necessary complement to the theory included in Notes of Statistical Inference, available at http://www.casado-d.org/edu/NotesStatisticalInference-Slides.pdf. Nevertheless, some important theoretical details are also included in the remarks at the beginning of each chapter. Those Notes are intended for teaching purposes, and they do not include the advanced mathematical justifications and calculations included in this document.

Although we can only study linearly and step by step, it is worth noticing that in Statistical Inference methods are usually related, as tasks are in the real world. Thus, in most exercises and problems we have made clear which suppositions are used and how they should be properly verified. In some cases, several statistical methods have been “naturally” combined in the statement. Many steps and even sentences are repeated in most exercises of the same type, both to insist on them and to make each exercise readable on its own. The advanced exercises are marked with the symbol (*).

Code written in Courier New is the R code with which we have done some calculations; you can copy and paste this code from the file. I also include some notes to help, to the best of my knowledge, students whose mother tongue is not English.

Acknowledgements

This document has been created with Linux, LibreOffice, OpenOffice.org, GIMP and R. I thank those who make this software available for free. I donate funds to these kinds of projects from time to time.


Links, Keywords and Explanations

Inference Theory (IT)

Framework and Scope of the Methods
> [Keywords] infinite populations, independent populations, normality, asymptoticness, descriptive statistics.
> [Description] The conditions under which the statistics considered here can be applied are listed.

Some Remarks
> [Keywords] partial knowledge, randomness, certainty, dimensional analysis, validity, use of the samples, calculations.
> [Description] The partial knowledge justifies both the random character of the mathematical variables used to explain the variables of the real-world problems and the impossibility of reaching maximum certainty when using samples instead of the whole population. The validity of the results must be understood within the scenario made up of the assumptions, the methods, the certainty and the data.

Sampling Probability Distribution

Exercise 1it-spd

> [Keywords] inference theory, joint distribution, sampling distribution, sample mean, probability function.
> [Description] From a simple probability distribution for X, the joint distribution of a sample (X1, X2) and the sampling distribution of the sample mean X̄ are determined.
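As an illustration of the kind of computation this exercise asks for, the sketch below enumerates the sampling distribution of the sample mean of a sample (X1, X2). The probability values are invented for the example, not taken from the exercise; the book's own code is in R, but a Python sketch serves the same purpose.

```python
from itertools import product
from collections import defaultdict

# Hypothetical population distribution (not the one in the exercise):
# P(X=0)=0.5, P(X=1)=0.3, P(X=2)=0.2
pmf = {0: 0.5, 1: 0.3, 2: 0.2}

# Joint distribution of a simple random sample (X1, X2): independence
# gives P(x1, x2) = P(x1) * P(x2); accumulate onto the sample mean.
sampling = defaultdict(float)
for (x1, p1), (x2, p2) in product(pmf.items(), repeat=2):
    sampling[(x1 + x2) / 2] += p1 * p2

for mean_value in sorted(sampling):
    print(mean_value, round(sampling[mean_value], 4))
```

The printed values form the probability function of X̄; they must sum to one.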

Point Estimations (PE)

Methods for Estimating

Exercise 1pe-m
> [Keywords] point estimations, binomial distribution, Bernoulli distribution, method of the moments, maximum likelihood method, plug-in principle.
> [Description] For the binomial distribution, the two methods are applied to estimate the second parameter (probability) when the first (number of trials) is known. In the second method, the maximum can be found by looking at the derivatives. Both methods provide the same estimator. The plug-in principle allows using the previous estimator to obtain others for the mean and the variance.
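For the binomial case described above, both methods lead to p̂ = x̄/m when the number of trials m is known. A minimal Python sketch with invented values (the exercise's data are not reproduced here):

```python
import random

# For X ~ Binomial(m, p) with m known, both the method of moments and
# maximum likelihood give the same estimator: p_hat = sample_mean / m.
m = 10          # known number of trials (illustrative)
true_p = 0.3    # used only to simulate data (illustrative)

random.seed(1)
data = [sum(random.random() < true_p for _ in range(m)) for _ in range(500)]

p_hat = (sum(data) / len(data)) / m       # estimator of p
mean_hat = m * p_hat                      # plug-in estimate of the mean
var_hat = m * p_hat * (1 - p_hat)         # plug-in estimate of the variance

print(round(p_hat, 3), round(mean_hat, 3), round(var_hat, 3))
```

The plug-in step simply substitutes p̂ into the formulas E[X] = mp and Var(X) = mp(1 − p).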

Exercise 2pe-m
> [Keywords] point estimations, geometric distribution, method of the moments, maximum likelihood method, plug-in principle.
> [Description] For the geometric distribution, the two methods are applied to estimate the parameter. In the second method, the maximum can be found by looking at the derivatives. Both methods provide the same estimator. The plug-in principle is applied to use the previous estimator to obtain others for the mean and the variance.

Exercise 3pe-m
> [Keywords] point estimations, Poisson distribution, method of the moments, maximum likelihood method, plug-in principle.
> [Description] For the Poisson distribution, the two methods are applied to estimate the parameter. In the second method, the maximum can be found by looking at the derivatives. The two methods provide the same estimator. The plug-in principle is applied to use the previous estimator to obtain others for the mean and the variance.

Exercise 4pe-m
> [Keywords] point estimations, normal distribution, method of the moments, maximum likelihood method.
> [Description] For the normal distribution, the two methods are applied to estimate the two parameters of this distribution at the same time. In the second method, the maximum can be found by looking at the derivatives. The two methods provide the same estimator.

Exercise 5pe-m
> [Keywords] point estimations, (continuous) uniform distribution, method of the moments, maximum likelihood method, plug-in principle, integrals.
> [Description] For the continuous uniform distribution, the two methods are applied to estimate the parameter. In the second method, the maximum cannot be found by looking at the derivatives, and this task is done by applying simple qualitative reasoning. The two methods provide different estimators. The plug-in principle allows using the previous estimator to obtain others for the mean and the variance. As a mathematical exercise, the theoretical expressions of the mean and the variance are calculated.
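The uniform case is the standard example where the two methods disagree: for X ~ Uniform(0, θ), the method of moments gives 2x̄ while the likelihood is maximised at the sample maximum. A sketch with invented values, assuming the one-parameter Uniform(0, θ) version of the exercise:

```python
import random

# For X ~ Uniform(0, theta), E[X] = theta/2, so the method of moments gives
# theta_mm = 2 * sample_mean, while the likelihood is maximised at the
# sample maximum, theta_ml = max(data): no derivative step is possible.
random.seed(2)
theta = 5.0  # illustrative true value, used only to simulate data
data = [random.uniform(0, theta) for _ in range(200)]

theta_mm = 2 * sum(data) / len(data)   # method of the moments
theta_ml = max(data)                   # maximum likelihood

print(round(theta_mm, 3), round(theta_ml, 3))
```

Note that θ̂_ML never exceeds the true θ, which is the source of the bias the related exercises discuss.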

Exercise 6pe-m
> [Keywords] point estimations, (translated) exponential distribution, method of the moments, maximum likelihood method, plug-in principle, integrals.
> [Description] For a translation of the exponential distribution, the two methods are applied to estimate the parameter. In the second method, the maximum can be found by looking at the derivatives. The two methods provide the same estimator. The plug-in principle is applied to use the previous estimator to obtain others for the mean. As a mathematical exercise, the theoretical expressions of the mean and the variance of the distribution are calculated.

Exercise 7pe-m
> [Keywords] point estimations, method of the moments, maximum likelihood method, plug-in principle, integrals.
> [Description] For a distribution given through its density function, the two methods are applied to estimate the parameter. In the second method, the maximum cannot be found by looking at the derivatives, and this task is done by applying simple qualitative reasoning. The two methods provide different estimators. The plug-in principle is applied to obtain other estimators for the mean and the variance. Additionally, the theoretical expressions of the mean and the variance of this distribution are calculated.

Properties of Estimators

Exercise 1pe-p

> [Keywords] point estimations, probability, normal distribution, sample mean, completion (standardization).
> [Description] For a normal distribution with known parameters, the probability that the sample mean is larger than a given value is calculated.
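The completion (standardization) step this exercise practises reduces to one line once X̄ ~ N(μ, σ/√n) is known. A sketch with invented numbers, not those of the exercise:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical setting: X ~ N(mu=10, sigma=2), sample size n=25.
# Then X_bar ~ N(mu, sigma/sqrt(n)); standardise to use the N(0,1) table.
mu, sigma, n, c = 10, 2, 25, 10.5
z = (c - mu) / (sigma / sqrt(n))   # completion (standardization)
prob = 1 - phi(z)                  # P(X_bar > c)
print(round(prob, 4))
```

Here z = 1.25, so the answer matches what a N(0,1) table gives for the upper tail at 1.25.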

Exercise 2pe-p
> [Keywords] point estimations, probability, normal distribution, sample quasivariance, completion.
> [Description] For a normal distribution with known standard deviation, the probability that the sample quasivariance is larger than a given value is calculated.

Exercise 3pe-p
> [Keywords] point estimations, probability, Bernoulli distribution, sample proportion, completion (standardization), asymptoticness.
> [Description] For a Bernoulli distribution with known parameter, the probability that the sample proportion is between two given values is calculated.

Exercise 4pe-p
> [Keywords] point estimations, probability and quantile, normal distribution, sample mean, sample quasivariance, completion.
> [Description] For two (independent) normal distributions with known parameters, probabilities and quantiles of several events involving the sample mean or the sample quasivariance are calculated or found, respectively.

Exercise 5pe-p
> [Keywords] point estimations, probability, normal distribution, total sum, completion, bound.
> [Description] For two (independent) normal distributions with known parameters, the probabilities of several events involving the total sum are calculated.

Exercise 6pe-p
> [Keywords] point estimations, trimmed sample mean, mean square error, consistency, sample mean, rate of convergence.
> [Description] To study the population mean, the mean square error and the consistency of the trimmed sample mean are studied. The speed of convergence is analysed through a comparison with that of the (ordinary) sample mean.

Exercise 7pe-p
> [Keywords] point estimations, chi-square distribution, mean square error, consistency.
> [Description] To study twice the mean of a chi-square population, the mean square error and the consistency of a given estimator are studied.

Exercise 8pe-p
> [Keywords] point estimations, mean square error, relative efficiency.
> [Description] For a sample of size two, the mean square errors of two given estimators are calculated and compared by using the relative efficiency.
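A common version of this kind of exercise compares two unbiased linear combinations of (X1, X2); the pair below is hypothetical, not necessarily the one in the exercise. The simulation approximates the mean square errors whose ratio is the relative efficiency:

```python
import random

# Hypothetical estimators of the mean from a sample (X1, X2):
#   T1 = (X1 + X2)/2   and   T2 = (X1 + 2*X2)/3.
# Both are unbiased; their theoretical variances are sigma^2/2 and 5*sigma^2/9.
# The population N(5, 1) is illustrative, not taken from the exercise.
random.seed(3)
mu = 5.0
reps = 100_000
se1 = se2 = 0.0
for _ in range(reps):
    x1, x2 = random.gauss(mu, 1), random.gauss(mu, 1)
    se1 += ((x1 + x2) / 2 - mu) ** 2
    se2 += ((x1 + 2 * x2) / 3 - mu) ** 2

mse1, mse2 = se1 / reps, se2 / reps
print(round(mse1, 3), round(mse2, 3), round(mse1 / mse2, 3))  # last value: relative efficiency
```

The simulated MSEs should sit near 1/2 and 5/9, confirming that the evenly weighted mean is the more efficient of the two.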

Exercise 9pe-p
> [Keywords] point estimations, sample mean, mean square error, consistency, efficiency (under normality), Cramér-Rao's lower bound.
> [Description] It is proved that the sample mean is always a consistent estimator of the population mean. When the population is normally distributed, this estimator is also efficient.

Exercise 10pe-p
> [Keywords] point estimations, (continuous) uniform distribution, probability function, sample mean, consistency, efficiency, unbiasedness.
> [Description] For a population variable following the continuous uniform distribution, the density function is plotted. The consistency and the efficiency of the sample mean, as an estimator of the population mean, are studied. Looking at the bias obtained, a new unbiased estimator of the population mean is built, and its consistency is proved.

Exercise 11pe-p
> [Keywords] point estimations, geometric distribution, sufficiency, likelihood function, factorization theorem.
> [Description] When a population variable follows the geometric distribution, a (minimum-dimension) sufficient statistic for studying the parameter is found by applying the factorization theorem.

Exercise 12pe-p (*)
> [Keywords] point estimations, basic estimators, population mean, Bernoulli distribution, population proportion, normality, population variance, mean square error, consistency, rate of convergence.
> [Description] The mean square error is calculated for all basic estimators of the mean, the proportion (for Bernoulli populations) and the variance (for normal populations). Then, their consistencies in mean of order two and in probability are studied. For two populations, the two-variable limits that appear are studied by splitting them into two one-variable limits or by bounding them.

Exercise 13pe-p (*)
> [Keywords] point estimations, basic estimators, normality, population variance, mean square error, consistency, rate of convergence.
> [Description] For the basic estimators of the variance of normal populations, the mean square errors are compared for one and two populations. The computer is used to compare graphically the coefficients that appear in the expression of the mean square errors. Besides, the consistency is also graphically studied.

Exercise 14pe-p (*)
> [Keywords] point estimations, Bernoulli distribution, normal distribution, mean square error, consistency, pooled sample proportion, pooled sample variance, rate of convergence.
> [Description] The mean square error is calculated for some pooled estimators of the proportion (for Bernoulli populations) and the variance (for normal populations). Then, their consistencies in mean of order two and in probability are studied. For pooled estimators, one sample size tending to infinity suffices, that is, one sample can “do the whole work”. Each pooled estimator, for the proportion of a Bernoulli population and for the variance of a normal population, is compared with the “natural” estimator consisting in the semisum of the estimators of the two populations. The computer is also used to compare graphically the coefficients that appear in the expression of the mean square errors. The consistency can be studied graphically.

Methods and Properties

Exercise 1pe

> [Keywords] point estimations, method of the moments, mean square error, consistency, maximum likelihood method.
> [Description] Given the density function of a population variable, the method of the moments is applied to find an estimator of the parameter; the mean square error of this estimator is calculated; finally, its consistency is studied. On the other hand, the maximum likelihood method is applied too; the maximum cannot be found by using the derivatives, and some qualitative reasoning is necessary. A simple analytical calculation suffices to see how the likelihood function depends upon the parameter. The two methods provide different estimators.

Exercise 2pe
> [Keywords] point estimations, Rayleigh distribution, method of the moments, mean square error, consistency, maximum likelihood method.
> [Description] Supposing a population variable following the Rayleigh distribution, the method of the moments is applied to build an estimator of the parameter; the mean square error of this estimator is calculated and its consistency is studied. The maximum likelihood method is also applied to build an estimator of the parameter. For this population distribution, the two methods provide different estimators. As a mathematical exercise, the expressions of the mean and the variance are calculated.

Exercise 3pe
> [Keywords] point estimations, exponential distribution, method of the moments, maximum likelihood method, sufficiency, likelihood function, factorization theorem, sample mean, efficiency, consistency, plug-in principle.
> [Description] A deep statistical study of the exponential distribution is carried out. To estimate the parameter, two estimators are obtained by applying both the method of the moments and the maximum likelihood method. For this population distribution, both methods provide the same estimator. A sufficient statistic is found. The sample mean is studied as an estimator of the parameter and the inverse of the parameter. In this exercise, it is highlighted how important the mathematical notation may be in doing calculations.

Confidence Intervals (CI)

Methods for Estimating

Exercise 1ci-m
> [Keywords] confidence intervals, method of the pivot, asymptoticness, normal distribution, margin of error.
> [Description] The method of the pivot is applied twice to construct asymptotic confidence intervals for the mean and the standard deviation of a normally distributed population variable with unknown mean and variance. For the first interval, the expression of the margin of error is used to obtain the confidence when the length of the interval is one unit.
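The pivot-based interval for the mean reduces to x̄ ± margin, where the margin of error is the quantile times the estimated standard error. A sketch with invented numbers (not the exercise's data), using the large-sample N(0,1) pivot:

```python
from math import sqrt

# Large-sample confidence interval for the mean, built from the pivot
# Z = (X_bar - mu) / (S / sqrt(n)) ~ N(0,1). Illustrative values only.
x_bar, s, n = 52.3, 4.1, 100
z = 1.96                      # N(0,1) quantile for 95% confidence
margin = z * s / sqrt(n)      # margin of error = half the interval length
ci = (x_bar - margin, x_bar + margin)
print(round(margin, 4), tuple(round(v, 4) for v in ci))
```

Solving the margin-of-error formula for the quantile (given a target length, as in the exercise) or for n (as in the sample-size exercises) uses the same expression rearranged.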

Exercise 2ci-m
> [Keywords] confidence intervals, method of the pivot, asymptoticness, normal distribution, margin of error.
> [Description] The method of the pivot is applied to construct an asymptotic confidence interval for the mean of a population variable with unknown variance. There was a previous estimate of the mean, which lies inside the interval obtained. The value of the margin of error is explicitly given.

Exercise 3ci-m
> [Keywords] confidence intervals, method of the pivot, Bernoulli distribution, asymptoticness.
> [Description] The method of the pivot is applied to construct an asymptotic confidence interval for the proportion of a population variable following the Bernoulli distribution.

Exercise 4ci-m
> [Keywords] confidence intervals, asymptoticness, method of the pivot, Bernoulli distribution, pooled sample proportion.
> [Description] A confidence interval for the difference between two proportions is constructed by applying the method of the pivot. The interval allows us to make a decision about the equality of the proportions, which is equivalent to applying a two-tailed hypothesis test. As an advanced task, the exercise is repeated with the pooled sample proportion in the denominator of the statistic (estimation of the variances of the populations), not in the numerator (estimation of the difference between the means).

Minimum Sample Size

Exercise 1ci-s

> [Keywords] confidence intervals, minimum sample size, normal distribution, method of the pivot, margin of error, Chebyshev's inequality.
> [Description] To find the minimum number of data necessary to theoretically guarantee the desired precision, two methods are applied: one based on the expression of the margin of error and the other based on Chebyshev's inequality.
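The two sample-size methods mentioned above can be sketched side by side. The numbers (σ = 2, error bound e = 0.5, α = 0.05) are illustrative, not taken from the exercise:

```python
from math import ceil

# Minimum n so that P(|X_bar - mu| < e) >= 1 - alpha, with sigma known.
sigma, e, alpha = 2.0, 0.5, 0.05

# (a) Margin of error under normality: e = z_{alpha/2} * sigma / sqrt(n).
z = 1.96  # N(0,1) quantile for alpha = 0.05
n_normal = ceil((z * sigma / e) ** 2)

# (b) Chebyshev's inequality: P(|X_bar - mu| >= e) <= sigma^2 / (n * e^2),
# so it suffices that sigma^2 / (n * e^2) <= alpha.
n_chebyshev = ceil(sigma ** 2 / (alpha * e ** 2))

print(n_normal, n_chebyshev)
```

Chebyshev's bound requires far more data because it assumes nothing about the shape of the distribution; this is the trade-off the exercise illustrates.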

Methods and Sample Size

Exercise 1ci

> [Keywords] confidence intervals, minimum sample size, normal distribution, method of the pivot, margin of error, Chebyshev's inequality.
> [Description] A confidence interval for the mean of a normal population is built by applying the method of the pivotal quantity. The dependence of the length of the interval on the confidence is analysed qualitatively. Given all the other quantities, the minimum sample size is calculated in two different ways: with the method based on the expression of the margin of error and with the method based on Chebyshev's inequality.

Exercise 2ci
> [Keywords] confidence intervals, minimum sample size, asymptoticness, normal distribution, method of the pivot, margin of error, Chebyshev's inequality.
> [Description] An asymptotic confidence interval for the mean of a population random variable is constructed by applying the method of the pivotal quantity. The equivalent exact confidence interval can be obtained under the supposition that the variable is normally distributed. Given all the other quantities, the minimum sample size is calculated in two different ways: with the method based on the expression of the margin of error and with the method based on Chebyshev's inequality.

Exercise 3ci
> [Keywords] confidence intervals, minimum sample size, normal distribution, method of the pivot, margin of error, Chebyshev's inequality.
> [Description] A confidence interval for the mean of a normal population is built by applying the method of the pivotal quantity. Given all the other quantities, the minimum sample size is calculated in two different ways: with the method based on the expression of the margin of error and with the method based on Chebyshev's inequality. The dependence of the length of the interval upon the confidence is analysed qualitatively.

Exercise 4ci
> [Keywords] confidence intervals, minimum sample size, normal distribution, method of the pivot, margin of error, Chebyshev's inequality.
> [Description] The method of the pivot allows us to construct a confidence interval for the difference between the means of two (independent) normal populations. Given the other quantities and supposing equal sample sizes, the minimum value is calculated by applying two different methods: one based on the expression of the margin of error and the other based on Chebyshev's inequality.

Hypothesis Tests (HT)

Parametric

Based on T

Exercise 1ht-T

> [Keywords] hypothesis tests, normal distribution, two-tailed test, population mean, critical region, p-value, type I error, type II error, power function.
> [Description] A decision on the equality of the population mean (of a variable) to a given number is made by applying a two-sided test and looking at both the critical values and the p-value. The two types of error are determined. With the help of a computer, the power function is plotted.
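The ingredients named above (critical value, p-value, power function) fit in a short sketch for the known-variance z-test; all numbers are invented for the illustration, not taken from the exercise:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Two-sided test of H0: mu = mu0 against H1: mu != mu0, sigma known.
# Hypothetical data: mu0=100, sigma=15, n=36, observed x_bar=106.
mu0, sigma, n, x_bar, alpha = 100, 15, 36, 106, 0.05
z_obs = (x_bar - mu0) / (sigma / sqrt(n))
p_value = 2 * (1 - phi(abs(z_obs)))      # two-tailed p-value
reject = p_value < alpha

# Power function: P(reject H0 | mu = mu1), with z_crit the critical value.
z_crit = 1.96
def power(mu1):
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return phi(-z_crit - shift) + 1 - phi(z_crit - shift)

print(round(z_obs, 2), round(p_value, 4), reject, round(power(108), 3))
```

Plotting `power` over a grid of alternatives reproduces the power-function figures the exercises ask for (the book does this in R).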

Exercise 2ht-T
> [Keywords] hypothesis tests, normal population, one-tailed test, population standard deviation, critical region, p-value, type I error, type II error, power function.
> [Description] A decision on whether the population standard deviation (of a variable) is smaller than a given number is made by applying a one-tailed test and looking at both the critical values and the p-value. The expression of the type II error is found. With the help of a computer, the power function is plotted. A qualitative analysis of the form of the alternative hypothesis is done. The assumption that the population variable follows the normal distribution is necessary to apply the results for studying the variance.

Exercise 3ht-T
> [Keywords] hypothesis tests, normal population, one- and two-tailed tests, population variance, critical region, p-value, type I error, type II error, power function.
> [Description] The equality of the population variance (of a variable) to a given number is tested by considering both one- and two-tailed alternative hypotheses. Decisions are made after looking at both the critical values and the p-value. In the two cases, the expression of the type II error is found and the power function is plotted with the help of a computer. The power functions are graphically compared, and the figure shows that the one-sided test is uniformly more powerful than the two-sided test.

Exercise 4ht-T
> [Keywords] hypothesis tests, normal population, one- and two-tailed tests, population variance, critical region, p-value, type I error, type II error, power function, statistical cook.
> [Description] From the hypotheses of a one-sided test on the population variance (of a variable), different ways are qualitatively and quantitatively considered for the opposite decision to be made.

Exercise 5ht-T
> [Keywords] hypothesis tests, normal populations, one- and two-tailed tests, population standard deviation, critical region, p-value, type I error, type II error, power function.
> [Description] A decision on whether the population standard deviation (of a variable) is equal to a given value is made by applying three possible alternative hypotheses and looking at both the critical values and the p-value. The type II error is calculated and the power function is plotted. The power functions are graphically compared: the figure shows that the one-sided tests are uniformly more powerful than the two-sided test.

Exercise 6ht-T
> [Keywords] hypothesis tests, Bernoulli populations, one-tailed tests, population proportion, critical region, p-value, type I error, type II error, power function.
> [Description] A decision on whether the population proportion is higher in one population is made after allocating this inequality in the null hypothesis, firstly, and in the alternative hypothesis, secondly. Two methodologies are considered, one based on the critical values and the other based on the p-value. In both tests, the type II error is calculated and the power function is plotted. The symmetry of the power functions of the two cases is highlighted. As an advanced section, the pooled sample proportion is used to estimate the variance of the populations (in the denominator of the statistic), but not to estimate the difference between the population proportions (in the numerator of the statistic).

Based on Λ

Exercise 1ht-Λ

> [Keywords] hypothesis tests, Neyman-Pearson's lemma, likelihood ratio test, critical region, Poisson distribution, exponential distribution, Bernoulli distribution, normal distribution.
> [Description] The critical region is theoretically studied for the null hypothesis that a parameter of the distribution equals a given value against four different alternative hypotheses. The form of the region is related to the maximum likelihood estimator.

Analysis of Variance (ANOVA)

Exercise 1ht-av

> [Keywords] hypothesis tests, normal populations, analysis of variance, critical region, p-value, type I error, type II error.
> [Description] The analysis of variance is applied to test whether the means of three independent normal populations, whose variances are supposed to be equal, are the same. Calculations are repeated three times with different levels of “manual work”.
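The “manual work” version of the one-way ANOVA amounts to computing the between- and within-group sums of squares and their F ratio. A sketch with three invented samples (not the exercise's data):

```python
# One-way ANOVA by hand for three hypothetical samples; equal variances
# are assumed, as in the exercise.
groups = [
    [23.0, 25.0, 21.0, 24.0],
    [27.0, 29.0, 26.0, 30.0],
    [22.0, 24.0, 23.0, 21.0],
]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total number of observations
grand = sum(sum(g) for g in groups) / n  # grand mean

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

# F statistic: ratio of the mean squares, to be compared with F(k-1, n-k).
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 2))
```

The decision then compares `f_stat` with the F(k−1, n−k) critical value at the chosen significance level, or equivalently computes the p-value.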

Nonparametric

Exercise 1ht-np

> [Keywords] hypothesis tests, chi-square tests, independence tests, critical region, p-value, type I error, table of frequencies.
> [Description] The independence between two qualitative variables or factors is tested by applying the chi-square statistic.
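The chi-square independence statistic can be computed directly from a table of frequencies. The 2×2 counts below are invented for the illustration, not taken from the exercise:

```python
# Chi-square test of independence for a hypothetical 2x2 table of
# frequencies (rows: levels of factor A, columns: levels of factor B).
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total  # expected count under H0
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)
```

The observed statistic is compared with the chi-square(df) critical value; for df = 1 at the 5% level that value is 3.841.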

Exercise 2ht-np
> [Keywords] hypothesis tests, chi-square tests, goodness-of-fit tests, critical region, p-value, type I error, table of frequencies.
> [Description] The goodness of fit to the whole Poisson family, firstly, and to a member of the Poisson distribution family, secondly, is tested by applying the chi-square statistic. The importance of using the sample information, instead of poorly justified assumptions, is highlighted when the results of both sections are compared.

Exercise 3ht-np
> [Keywords] hypothesis tests, chi-square tests, goodness-of-fit tests, independence tests, homogeneity tests, critical region, p-value, type I error, table of frequencies.
> [Description] The same table of frequencies is viewed as coming from three different scenarios. Chi-square goodness-of-fit, independence and homogeneity tests are respectively applied.

Parametric and Nonparametric

Exercise 1ht

> [Keywords] hypothesis tests, Bernoulli distribution, goodness-of-fit chi-square test, position signs test, critical region, p-value, type I error, type II error, power function, table of frequencies.
> [Description] The same problem is dealt with by considering three different approaches: one parametric test and two kinds of nonparametric test. In this case, the same decision is made.

PE – CI – HT

Exercise 1pe-ci-ht

> [Keywords] point estimations, confidence intervals, method of the pivot, normal distribution, t distribution, pooled sample variance.
> [Description] The probability of an event involving the difference between the means of two independent normal populations is calculated with and without the supposition that the variances of the populations are the same. The method of the pivot is applied to construct a confidence interval for the quotient of the standard deviations.

Exercise 2pe-ci-ht
> [Keywords] confidence intervals, point estimations, normal distribution, method of the pivot, probability, pooled sample variance.
> [Description] For the difference of the means of two (independent) normally distributed variables, a confidence interval is constructed by applying the method of the pivotal quantity. Since the equality of the means is included in a high-confidence interval, the pooled sample variance is considered in calculating a probability involving the difference of the sample means.

Exercise 3pe-ci-ht
> [Keywords] hypothesis tests, confidence intervals, Bernoulli populations, one-tailed tests, population proportion, critical region, p-value, type I error, type II error, power function, method of the pivot.
> [Description] A decision on whether the population proportion is smaller or equal in one population than in the other is made by looking at both the critical values and the p-value. The type II error is calculated and the power function is plotted. By applying the method of the pivot, a confidence interval for the difference of the population proportions is built. This interval can be seen as the acceptance region of the equivalent two-sided hypothesis test. In this case, the same decision is made with the test and with the interval.

Exercise 4pe-ci-ht
> [Keywords] point estimations, hypothesis tests, standard power function density, method of the moments, maximum likelihood method, plug-in principle, Neyman-Pearson's lemma, likelihood ratio tests, critical region.
> [Description] Given the probability function of a population random variable, estimators are built by applying both the method of the moments and the maximum likelihood method. Then, the plug-in principle allows us to obtain estimators for the mean and the variance of the distribution of the variable. In testing the equality of the parameter to a given value, the form of the critical region is theoretically studied when four different types of alternative hypotheses are considered.

Additional Exercises (Solved, but neither ordered by difficulty, described, nor referred to in the final index.)



Appendixes

Probability Theory (PT)

Some Reminders
Markov's Inequality. Chebyshev's Inequality
Probability and Moments Generating Functions. Characteristic Function.

Exercise 1pt
> [Keywords] probability, quantile, probability tables, probability function, binomial distribution, Poisson distribution, uniform distribution, normal distribution, chi-square distribution, t distribution, F distribution.
> [Description] For each of these distributions, the probability of a simple event is calculated both by using probability tables and by using the mass function, or, on the contrary, a quantile is found by using the probability tables or a statistical software program.

Exercise 2pt
> [Keywords] probability, normal distribution, total sum, sample mean, completion (standardization).
> [Description] For a quantity that follows the normal distribution with known parameters, the probability of an event involving the quantity is calculated after properly completing the two sides of the inequality, that is, after properly rewriting the event.

Exercise 3pt (*)
> [Keywords] probability, Bernoulli distribution, binomial distribution, geometric distribution, Poisson distribution, exponential distribution, normal distribution, raw or crude population moments, series, integral, probability generating function, moment generating function, characteristic function, differential equation, integral equation, complex analysis.
> [Description] For the distributions mentioned, the first two raw or crude population moments are calculated in as many ways as possible. Their levels of difficulty differ, but the aim is to practice. Some calculations require strong mathematical justifications. Several interesting analytical techniques are used: changing the order of summation in series, using Taylor series, characterizing a function through a differential or integral equation, et cetera.

Mathematics (M)
Some Reminders
Limits

Exercise 1m (*)

> [Keywords] real analysis, integral, exponential function, bind, Fubini's theorem, integration by substitution, multiple integrals, polar coordinates.
> [Description] It is well known that the function exp(−x²) has no elementary antiderivative. The definite integral is calculated in three cases that appear frequently, e.g. when working with the density function of the normal or the Rayleigh distributions. By applying Fubini's theorem for improper integrals, calculations are translated to the two-dimensional real space, where polar coordinates are used to solve the multiple integral easily.

Exercise 2m
> [Keywords] real analysis, limits, sequence, indeterminate forms.
> [Description] Several limits of one-variable sequences, similar to those necessary for other exercises, are calculated.

Exercise 3m (*)
> [Keywords] real analysis, limits, sequence, indeterminate forms, polar coordinates.
> [Description] Several limits of two-variable sequences, similar to those necessary for other exercises, are calculated.

Exercise 4m (*)
> [Keywords] algebra, geometry, real analysis, linear transformation, rotation, movement, frontier, rectangular coordinates.
> [Description] Several approaches are used to find the frontier and the regions determined by a discrete relation in the plane.

References

Tables of Statistics (T)
> [Keywords] estimators, statistics T, parametric tests, likelihood ratio, analysis of variance (ANOVA), nonparametric tests, chi-square tests, Kolmogorov–Smirnov tests, runs test (of randomness), signs test (of position), Wilcoxon signed-rank test (of position).
> [Description] The statistics applied in the exercises are tabulated in this appendix. Some theoretical remarks are included.

Probability Tables (P)
> [Keywords] normal distribution, t distribution, chi-square distribution, F distribution.
> [Description] A probability table with the most frequently used values is included for each of the abovementioned distributions.

Index


Inference Theory

[IT] Framework and Scope of the Methods

Populations
[Ap1] When the entire populations can be studied, no inference is needed. Thus, here we suppose that we do not have such total knowledge.

[Ap2] Populations will be supposed to be independent—matched or paired data must be treated in a slightly different way.

Samples
[As1] Sample sizes are supposed to be much smaller than population sizes—a correction factor is not necessary for these (effectively) infinite populations.

[As2] At the same time, we consider either any amount of normally distributed data or many data (large samples) from any distribution.

[As3] Data will be supposed to have been selected randomly, with the same probability and independently; that is, by applying simple random sampling.

Methods
[Am1] Before applying inferential methods, data should be analysed to guarantee that nothing strange will spoil the inference—we suppose that such descriptive analysis and data treatment have been done.

[Am2] We are able to learn only linearly, but in practice methods need not be applied in the order in which they are presented here—e.g. nonparametric hypothesis tests may be used to check assumptions before applying parametric methods.

[IT] Some Remarks

Partial Knowledge and Randomness
The partial knowledge mentioned in the previous section has crucial consequences. The use of only some elements of the population implies that—we can only hypothesize about the other elements—variables must be assigned a random character, on the one hand, and results will have no total certainty in the sense that statements will be set with some probability, on the other hand. For example: a 95% confidence in applying a method must be interpreted as any other probability: the results are true with probability 0.95 and false with probability 1−0.95 (frequently, we will never know whether the method has failed or not). See remark 1pt, in the appendix of Probability Theory, on the interpretation of the concept of probability.

In Probability Theory, random variables are dimensionless quantities; in real-life problems, variables almost always are not. Since usually this fact does not cause trouble in Statistics, we do not pay much attention to the units of measurement, and we can understand that the magnitude of the real-life variable, with no unit of measurement, is the part that is being modeled by using the proper probability distribution with the proper parameter values (of course, units of measurement are not random). To get used to paying attention to the units of measurement and to managing them, they have been written in most numerical expressions.


Regarding the interpretation of the whole statistical processes that we will apply either to practice their use or to solve particular real-world problems, we highlight the main points on which results are usually based:

(i) Assumptions.
(ii) The method applied, including particular details of its steps, mathematical theorems, statistic T, etc.
(iii) Certainty with which the method is applied: probability, confidence or significance.
(iv) The data available.

In Statistics, results may change severely when assumptions are really false, another method is applied, a different certainty is considered, or data carry no proper information (quantity, quality, representativity, etc.). Throughout this document, we do insist on the cautions that statisticians and readers of statistical works must take in interpreting results. Even if you are not interested in “statistically cooking” data, you had better know the recipes... (Some of them have been included in the notes mentioned in the prologue.)

Use of the Samples

Let X = (X1,...,Xn) be the data from a population. The information they contain is extracted and used through appropriate mathematical functions: estimators and statistics. When applying the methods, since we usually need to calculate a probability or to find a quantile, expressions must be written in terms of those appropriate quantities whose sampling distribution is known.

In trying to make estimators or statistics appear, some Mathematics is needed. We do not repeat it whenever it is applied in this document. For example, standardization is a strictly increasing transformation that does not change inequalities when it is applied to both sides, and the positive branch of the square root must be considered to work with population or sample variances and standard deviations (these concepts are nonnegative by definition, while the square root is a general mathematical tool applied to this particular situation). As an example of those mathematical explanations not repeated again and again, we include the following:

Remark: Since variances are nonnegative by definition and the positive branch of the square root function is strictly increasing, it holds that $\sigma_X^2=\sigma_Y^2 \leftrightarrow \sigma_X=\sigma_Y$ (similarly for inequalities). For general numbers a and b, it holds only that $a^2=b^2 \leftrightarrow |a|=|b|$. From a strict mathematical point of view, for the standard deviation we should write $\sigma=+\sqrt{\sigma^2}=|\sqrt{\sigma^2}|$.

Finally, at the end of the possible theoretical part of exercises, we do not insist that a sample (X1,...,Xn) would in practice be used by entering its values in the theoretical expressions obtained as a solution. Estimators and statistics are random quantities until specific data are used.

Useful Questions
To build the answer, users can find it useful to ask themselves:

On the Populations

● How many populations are there?
● Are their probability distributions known?

On the Samples

● If populations are not normally distributed, are the sample sizes large enough to apply asymptotic results?
● Do we know the data themselves, or only some quantities calculated from them?


On the Assumptions

● What is supposed to be true? Does it seem reasonable? Do we need to prove it?
● Should it be checked for the populations: the random character, the independence of the populations, the goodness-of-fit to the supposed models, the homogeneity between the populations, et cetera?
● Should it be checked for the samples: the within-sample randomness and independence, the between-samples independence, et cetera?
● Are there other assumptions (neither mathematical nor statistical)?

On the Statistical Problem

● What are the quantities to be studied statistically?
● Concretely, what is the statistical problem: point estimation, confidence interval, hypothesis test, etc.?

On the Statistical Tools

● Which are the estimators, the statistics and the methods that will be applied?

On the Quantities

● Which are the units of measurement? Are all the units equal?
● How large are the magnitudes? Do they seem reasonable? Are all of them coherent (variability is positive, probabilities and relative frequencies are between 0 and 1, etc.)?

On the Interpretation

● What is the statistical interpretation of the solution?
● How is the statistical solution interpreted in the framework of the problem we are working on?
● Do the qualitative results seem reasonable (as expected)?
● Do the quantities seem reasonable (signs, order of magnitude, etc.)?

They may want to consult some other pieces of advice that we have written in Guide for Students of Statistics, available at http://www.casado-d.org/edu/GuideForStudentsOfStatistics-Slides.pdf.

[IT] Sampling Probability Distribution

Remark 1it: The notation and the expression of the most basic estimators, for one population, are

$$\bar{X}=\frac{1}{n}\sum_{i=1}^{n}X_i \qquad \hat{\eta}=\frac{\sum_{i=1}^{n}X_i}{n} \qquad V^2=\frac{1}{n}\sum_{j=1}^{n}(X_j-\mu)^2 \qquad s^2=\frac{1}{n}\sum_{j=1}^{n}(X_j-\bar{X})^2 \qquad S^2=\frac{1}{n-1}\sum_{j=1}^{n}(X_j-\bar{X})^2$$

For two populations, other basic estimators are made with these:

$$\bar{X}-\bar{Y} \qquad V_X^2,\; V_Y^2 \qquad s_X^2,\; s_Y^2 \qquad S_X^2,\; S_Y^2 \qquad \hat{\eta}_X-\hat{\eta}_Y$$

Finally, all these estimators are used to make statistics whose sampling distribution is known.
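As an aside, these expressions are easy to check numerically. The following sketch is illustrative only and not part of the original text (the snippets in this document itself are written in R; here Python is used, with a made-up sample):

```python
def basic_estimators(x, mu=None):
    """Compute the one-population estimators of Remark 1it."""
    n = len(x)
    xbar = sum(x) / n                                   # sample mean
    s2 = sum((xj - xbar) ** 2 for xj in x) / n          # variance with factor 1/n
    S2 = sum((xj - xbar) ** 2 for xj in x) / (n - 1)    # quasi-variance, 1/(n-1)
    # V^2 requires the population mean mu to be known
    V2 = None if mu is None else sum((xj - mu) ** 2 for xj in x) / n
    return xbar, s2, S2, V2

xbar, s2, S2, V2 = basic_estimators([4, 4, 3, 5, 6])
print(xbar, s2, S2)
```

Note that s² and S² differ only in the factor 1/n versus 1/(n−1), which is why S² is the unbiased one.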

Exercise 1it-spd

Given a population (variable) X following the probability distribution determined by the following values and probabilities:

Value x         1      2      3
Probability p   3/9    1/9    5/9

Determine:

(a) The joint probability distribution of the sample X = (X1,X2)
(b) The sampling probability distribution of the sample mean $\bar{X}$

(Based on an exercise from the materials in Spanish prepared by my colleagues.)

Discussion: The distribution of X is totally determined, since we know all the information necessary to calculate any quantity—e.g. the mean:

$$\mu_X = E(X)=\sum_{\Omega}x_j\cdot P_X(x_j)=\sum_{\{1,2,3\}}x_j\cdot p_j = 1\cdot\frac{3}{9}+2\cdot\frac{1}{9}+3\cdot\frac{5}{9}=\frac{20}{9}=2.222222$$

Instead of a table, a function is sometimes used to provide the values and the probabilities—the mass or density function. We can represent this function with the computer:

values = c(1, 2, 3)
probabilities = c(3/9, 1/9, 5/9)
plot(values, probabilities, type='h', xlab='Value', ylab='Probability',
     ylim=c(0,1), main='Mass Function', lwd=7)

The sampling probability distribution of $\bar{X}$ is determined once we give the possible values and the probabilities with which they can be taken. Before doing that, we describe the probability distribution of the random vector X = (X1,X2).

(A) Joint probability distribution of the sample

Since the Xj are independent in any simple random sample, the probability that X = (X1,X2) takes the value x1 = (1,1), for example, is calculated as follows (note the intersection):

$$f_{\mathbf{X}}(1,1)=P_{\mathbf{X}}(\{X_1=1\}\cap\{X_2=1\})=P_{X_1}(X_1=1)\cdot P_{X_2}(X_2=1)=\frac{3}{9}\cdot\frac{3}{9}=\frac{1}{9}$$

To fill in the following table, the other probabilities are calculated in the same way.

Joint Probability Distribution of (X1,X2)

Value (x1,x2)   (1,1)    (1,2)    (1,3)    (2,1)    (2,2)    (2,3)    (3,1)    (3,2)    (3,3)
Probability     3/9·3/9  3/9·1/9  3/9·5/9  1/9·3/9  1/9·1/9  1/9·5/9  5/9·3/9  5/9·1/9  5/9·5/9
                = 1/9    = 1/27   = 5/27   = 1/27   = 1/81   = 5/81   = 5/27   = 5/81   = 25/81

Notice that (1,3) and (3,1), for example, contain the same information. The values and their probabilities can be given by extension (table or figure) or by comprehension (function).

## Install this package if you don't have it (run the following line without #)
# install.packages('scatterplot3d')
valuesX1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3)
valuesX2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
probabilities = c(1/9, 1/27, 5/27, 1/27, 1/81, 5/81, 5/27, 5/81, 25/81)
library('scatterplot3d')  # To load the package
scatterplot3d(valuesX1, valuesX2, probabilities, type='h', xlab='Value X1',
              ylab='Value X2', zlab='Probability', xlim=c(0, 4), ylim=c(0, 4),
              zlim=c(0,1), main='Mass Function', lwd=7)

That the total sum of probabilities is equal to one can be checked:

$$\sum_{\Omega}f_{\mathbf{X}}(x_j)=\sum_{\Omega}p_j=\frac{1}{9}+\frac{1}{27}+\frac{5}{27}+\frac{1}{27}+\frac{1}{81}+\frac{5}{81}+\frac{5}{27}+\frac{5}{81}+\frac{25}{81}=\frac{9+3+15+3+1+5+15+5+25}{81}=\frac{81}{81}=1$$

From the information in the table it is possible to calculate any quantity—e.g. the first-order joint moment:

$$\mu^{\mathbf{X}}_{1,1}=E(X_1\cdot X_2)=\sum_{\Omega}x_j\cdot f_{\mathbf{X}}(x_j)=1\cdot 1\cdot\frac{1}{9}+1\cdot 2\cdot\frac{1}{27}+\cdots+3\cdot 2\cdot\frac{5}{81}+3\cdot 3\cdot\frac{25}{81}=4.938272$$
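As a cross-check of these calculations, the joint distribution and the joint moment can be recomputed with exact fractions. This is an illustrative sketch, not from the original text (the document's own snippets use R; here Python is used):

```python
from fractions import Fraction as F

p = {1: F(3, 9), 2: F(1, 9), 3: F(5, 9)}             # mass function of X
joint = {(a, b): p[a] * p[b] for a in p for b in p}  # independence: product

assert sum(joint.values()) == 1                      # probabilities sum to one
m11 = sum(a * b * q for (a, b), q in joint.items())  # E(X1*X2)
print(m11, float(m11))
```

The exact value of the joint moment is 400/81 ≈ 4.938272, in agreement with the sum above.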

(B) Sampling probability distribution of the sample mean

The sample mean $\bar{X}(\mathbf{X})=\bar{X}(X_1,X_2)$ is a random quantity, since so are X1 and X2. Each pair of values (x1,x2) of (X1,X2) gives a value $\bar{x}$ for $\bar{X}$; on the contrary, a value $\bar{x}$ of $\bar{X}$ can correspond to different pairs of values (x1,x2). Then, we will fill in a table with all values and merge those that are equal. For example:

$$\bar{X}(1,1)=\frac{1+1}{2}=\frac{2}{2}=1$$

The other values $\bar{x}$ of $\bar{X}$ are calculated in the same way to fill in the following table:

Value (x1,x2)    (1,1)    (1,2)    (1,3)    (2,1)    (2,2)    (2,3)    (3,1)    (3,2)    (3,3)
Probability      1/9      1/27     5/27     1/27     1/81     5/81     5/27     5/81     25/81
Value of mean    (1+1)/2  (1+2)/2  (1+3)/2  (2+1)/2  (2+2)/2  (2+3)/2  (3+1)/2  (3+2)/2  (3+3)/2
                 = 1      = 3/2    = 2      = 3/2    = 2      = 5/2    = 2      = 5/2    = 3

The sample mean $\bar{X}$ can take five different values while (X1,X2) could take nine different possible values (x1,x2). Thus, the probability for $\bar{X}$ to take the value 2, for example, is calculated as follows (note the union):

$$P_{\bar{X}}(2)=P(\{(1,3)\}\cup\{(2,2)\}\cup\{(3,1)\})=P((1,3))+P((2,2))+P((3,1))=\frac{5}{27}+\frac{1}{81}+\frac{5}{27}=\frac{31}{81}$$

In the same way,

$$P_{\bar{X}}(1)=P(\{(1,1)\})=\frac{1}{9}$$

$$P_{\bar{X}}\left(\frac{3}{2}\right)=P(\{(1,2)\}\cup\{(2,1)\})=\frac{1}{27}+\frac{1}{27}=\frac{2}{27}$$

$$P_{\bar{X}}\left(\frac{5}{2}\right)=P(\{(2,3)\}\cup\{(3,2)\})=\frac{5}{81}+\frac{5}{81}=\frac{10}{81}$$

$$P_{\bar{X}}(3)=P(\{(3,3)\})=\frac{25}{81}$$

Then, the sampling probability distribution of the sample mean $\bar{X}$ is determined, in this case, by

Probability Distribution of $\bar{X}$

Value            1      3/2    2      5/2    3
Probability      1/9    2/27   31/81  10/81  25/81

We can check that the total sum of probabilities is equal to one:

$$\sum_{\Omega}P_{\bar{X}}(x_j)=\sum_{\Omega}p_j=\frac{1}{9}+\frac{2}{27}+\frac{31}{81}+\frac{10}{81}+\frac{25}{81}=\frac{9+6+31+10+25}{81}=\frac{81}{81}=1$$

From the information in the table above it is possible to calculate any quantity—e.g. the mean:

$$\mu_{\bar{X}} = E(\bar{X})=\sum_{\Omega}x_j\cdot P_{\bar{X}}(x_j)=1\cdot\frac{1}{9}+\frac{3}{2}\cdot\frac{2}{27}+2\cdot\frac{31}{81}+\frac{5}{2}\cdot\frac{10}{81}+3\cdot\frac{25}{81}=\frac{9+9+62+25+75}{81}=\frac{180}{81}=2.222222$$

It is worth noticing that this value is equal to the value that we obtained at the beginning, which agrees with the well-known theoretical property:

$$\mu_{\bar{X}} = E(\bar{X}) = E(X) = \mu_X$$
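The whole enumeration can also be reproduced programmatically with exact fractions. The following sketch is illustrative only (the document's own snippets use R; here Python is used):

```python
from fractions import Fraction as F

p = {1: F(3, 9), 2: F(1, 9), 3: F(5, 9)}    # mass function of X
pmf = {}                                     # sampling distribution of the mean
for a in p:
    for b in p:
        xbar = F(a + b, 2)                   # value of the mean for this pair
        pmf[xbar] = pmf.get(xbar, 0) + p[a] * p[b]

assert sum(pmf.values()) == 1                # probabilities sum to one
mu = sum(x * q for x, q in pmf.items())      # E(Xbar)
print(sorted(pmf.items()), mu)
```

The computed mean is 20/9, equal to E(X), as the theoretical property states.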

Values and probabilities can also be provided by using a function—the mass or density function, which can be represented with the help of a computer:

values = c(1, 3/2, 2, 5/2, 3)
probabilities = c(1/9, 2/27, 31/81, 10/81, 25/81)
plot(values, probabilities, type='h', xlab='Value', ylab='Probability',
     ylim=c(0,1), main='Mass Function', lwd=7)

Conclusion: For a simple distribution for X and a small sample size X = (X1,X2), we have written both the joint probability distribution of the sample X and the sampling distribution of $\bar{X}$. This helps us to understand the concept of the sampling distribution of any random quantity (not only the sample mean), whether or not we are able to write it down or even to know it (e.g. thanks to a theorem).


My notes:


Point Estimations

[PE] Methods for Estimating

Remark 1pe: When necessary, the expectations E(X) and E(X²) are usually given in the statement; once E(X) is given, either Var(X) or E(X²) can equivalently be given, since Var(X) = E(X²) − E(X)². If not given, these expectations can be calculated from their definitions by summing or integrating, for discrete and continuous variables, respectively (this is sometimes an advanced mathematical exercise).

Remark 2pe: If the method of the moments is used to estimate m parameters (frequently 1 or 2), the first m equations of the system usually suffice; nevertheless, if not all the parameters appear in the first-order moments of X, the smallest m moments—and equations—for which the parameters appear must be considered. For example, if μ₁ = 0 or if the interest relies directly on σ² because μ is known, the first-order equation μ₁ = μ = E(X) = m₁ does not involve σ and hence the second-order equation μ₂ = E(X²) = Var(X) + E(X)² = σ² + μ² = m₂ must be considered instead.

Remark 3pe: When looking for local maxima or minima of differentiable functions, the first-order derivatives are set equal to zero. After that, to discriminate between maxima and minima, the second-order derivatives are studied. For most of the functions we will work with, this second step can be solved by applying some qualitative reasoning on the sign of the quantities involved and the possible values of the data xi. When this does not suffice, the values found in the first step, say θ₀, must be substituted in the expression of the second step. On the other hand, global maxima and minima cannot in general be found using the derivatives, and some qualitative reasoning must be applied. It is important to highlight that, in applying the maximum likelihood method, the purpose is to find the maximum, whatever the mathematical way.

Exercise 1pe-m

If X is a population variable that follows a binomial distribution of parameters κ and η, and X = (X1,...,Xn) is a simple random sample:

(a) Apply the method of the moments to obtain an estimator of the parameter η.

(b) Apply the maximum likelihood method to obtain an estimator of the parameter η.

(c) When κ = 10 and x = (x1,...,x5) = (4, 4, 3, 5, 6), use the estimators obtained in the two previous sections to construct final estimates of the parameter η and the measures μ and σ².

Hint: (i) In the first two sections treat the parameter κ as if it were known. (ii) In the likelihood function, join the combinatorial terms into a product; this product does not depend on the parameter η and hence its derivative will be zero.

Discussion: This statement is mathematical, although in the last section we are given some data to be substituted. In practice, that the binomial can be used to explain X should be supported. The variable X is dimensionless. For the binomial distribution, E(X) = κ·η and Var(X) = κ·η·(1−η).

(See the appendixes to see how the mean and the variance of this distribution can be calculated.) Particularly, the results obtained here can be applied to the Bernoulli distribution with κ = 1.

(a) Method of the moments

(a1) Population and sample moments: The probability distribution has two parameters originally, but we have to study only one. The first-order moments are

$$\mu_1(\eta)=E(X)=\kappa\cdot\eta \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n)=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}$$

(a2) System of equations: Since the parameter of interest η appears in the first-order population moment of X, the first equation is enough to apply the method:

$$\mu_1(\eta)=m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \kappa\cdot\eta=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x} \;\rightarrow\; \eta=\frac{1}{\kappa}\bar{x}$$

(a3) The estimator:

$$\hat{\eta}_M=\frac{1}{\kappa}\bar{X}$$

(b) Maximum likelihood method

(b1) Likelihood function: For the binomial distribution the mass function is $f(x;\kappa,\eta)=\binom{\kappa}{x}\eta^{x}(1-\eta)^{\kappa-x}$. We are interested only in η, so

$$L(x_1,x_2,\ldots,x_n;\eta)=\prod_{j=1}^{n}f(x_j;\eta)=\prod_{j=1}^{n}\binom{\kappa}{x_j}\eta^{x_j}(1-\eta)^{\kappa-x_j}=\left[\prod_{j=1}^{n}\binom{\kappa}{x_j}\right]\eta^{\sum_{j=1}^{n}x_j}(1-\eta)^{n\kappa-\sum_{j=1}^{n}x_j}$$

(b2) Optimization problem: The logarithm function is applied to facilitate the calculations,

$$\log L(x_1,x_2,\ldots,x_n;\eta)=\log\left[\prod_{j=1}^{n}\binom{\kappa}{x_j}\right]+\left(\sum_{j=1}^{n}x_j\right)\log(\eta)+\left(n\kappa-\sum_{j=1}^{n}x_j\right)\log(1-\eta)$$

To discover the local or relative extreme values, the necessary condition is

$$0=\frac{d}{d\eta}\log L=0+\left(\sum_{j=1}^{n}x_j\right)\frac{1}{\eta}-\left(n\kappa-\sum_{j=1}^{n}x_j\right)\frac{1}{1-\eta} \;\rightarrow\; \frac{n\kappa-\sum_{j=1}^{n}x_j}{1-\eta}=\frac{\sum_{j=1}^{n}x_j}{\eta}$$

$$\rightarrow\; \eta\,n\kappa-\eta\sum_{j=1}^{n}x_j=\sum_{j=1}^{n}x_j-\eta\sum_{j=1}^{n}x_j \;\rightarrow\; \eta\,n\kappa=\sum_{j=1}^{n}x_j \;\rightarrow\; \eta_0=\frac{\sum_{j=1}^{n}x_j}{n\kappa}=\frac{1}{\kappa}\bar{x}$$

To verify that the only candidate is a local or relative maximum, the sufficient condition is

$$\frac{d^2}{d\eta^2}\log L=-\frac{\sum_{j=1}^{n}x_j}{\eta^2}-\frac{n\kappa-\sum_{j=1}^{n}x_j}{(1-\eta)^2}<0$$

since κ ≥ xj and therefore $n\kappa\ge\sum_{j=1}^{n}x_j$. This holds for any value, including η₀.

(b3) The estimator:

$$\hat{\eta}_{ML}=\frac{1}{\kappa}\bar{X}$$

(c) Estimation of η, μ and σ²

For κ = 10 and x = (x1,...,x5) = (4, 4, 3, 5, 6):

From the method of the moments: $\hat{\eta}_M=\frac{1}{\kappa}\bar{x}=\frac{1}{10}\cdot\frac{1}{5}(4+4+3+5+6)=0.44$.

From the maximum likelihood method, as the same estimator was obtained: $\hat{\eta}_{ML}=0.44$.

Since μ = E(X) = κ·η, an estimator of η induces an estimator of μ by applying the plug-in principle:

From the method of the moments: $\hat{\mu}_M=\kappa\cdot\hat{\eta}_M=4.4$.

From the maximum likelihood method: $\hat{\mu}_{ML}=4.4$.

Finally, since σ² = Var(X) = κ·η·(1−η), an estimator of η induces an estimator of σ² too:

From the method of the moments: $\hat{\sigma}^2_M=\kappa\cdot\hat{\eta}_M(1-\hat{\eta}_M)=10\cdot 0.44\cdot(1-0.44)=2.464$.

From the maximum likelihood method: $\hat{\sigma}^2_{ML}=\kappa\cdot\hat{\eta}_{ML}(1-\hat{\eta}_{ML})=2.464$.
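These computations can be reproduced numerically; the following sketch (illustrative only; the document's own snippets use R) recomputes the estimates from the given data and also checks that the estimate of η is not improved by small moves of the log-likelihood in either direction:

```python
from math import comb, log

kappa = 10
x = [4, 4, 3, 5, 6]                        # the sample given in the statement

xbar = sum(x) / len(x)
eta_hat = xbar / kappa                     # method of moments = maximum likelihood
mu_hat = kappa * eta_hat                   # plug-in estimate of the mean
var_hat = kappa * eta_hat * (1 - eta_hat)  # plug-in estimate of the variance

def loglik(eta):
    """Binomial log-likelihood of the sample."""
    return sum(log(comb(kappa, xi)) + xi * log(eta) + (kappa - xi) * log(1 - eta)
               for xi in x)

# eta_hat should not be improved by moving slightly in either direction
assert loglik(eta_hat) >= loglik(eta_hat - 0.01)
assert loglik(eta_hat) >= loglik(eta_hat + 0.01)

print(round(eta_hat, 2), round(mu_hat, 2), round(var_hat, 3))  # 0.44 4.4 2.464
```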

Conclusion: We can see that for the binomial population the two methods provide the same estimator for η. The value of κ must be known to use the expression obtained. In this particular case, the value 0.44 indicates that, for each underlying trial (a Bernoulli variable), one value seems more probable than the other. On the other hand, the quality of the estimator obtained should be studied, especially if the two methods had provided different estimators. As a particular case, κ = 1 for the Bernoulli distribution.

Exercise 2pe-m

A random quantity X is supposed to follow a geometric distribution. Let X = (X1,...,Xn) be a simple random sample.

A) Apply the method of the moments to find an estimator of the parameter η.

B) Apply the maximum likelihood method to find an estimator of the parameter η.

C) Given a sample such that $\sum_{j=1}^{27}x_j=134$, apply the formulas obtained in the two previous sections to give final estimates of η. Finally, give estimates of the mean and the variance of X.

Discussion: This statement is mathematical, although we are given some data in the last section. The random variable X is dimensionless. For the geometric distribution, E(X) = 1/η and Var(X) = (1−η)/η².

(See the appendixes to see how the mean and the variance of this distribution can be calculated.)

A) Method of the moments

a1) Population and sample moments: The population distribution has only one parameter, so one equation suffices. The first-order moments of the model X and the sample x are, respectively,

$$\mu_1(\eta)=E(X)=\frac{1}{\eta} \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n)=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}$$

a2) System of equations: Since the parameter of interest η appears in the first-order moment of X, the first equation suffices:

$$\mu_1(\eta)=m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \frac{1}{\eta}=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x} \;\rightarrow\; \eta=\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)^{-1}=\frac{1}{\bar{x}}$$

a3) The estimator:

$$\hat{\eta}_M=\left(\frac{1}{n}\sum_{j=1}^{n}X_j\right)^{-1}=\frac{1}{\bar{X}}$$


B) Maximum likelihood method

b1) Likelihood function: For the geometric distribution, the mass function is $f(x;\eta)=\eta\,(1-\eta)^{x-1}$, so

$$L(x_1,x_2,\ldots,x_n;\eta)=\prod_{j=1}^{n}f(x_j;\eta)=\eta(1-\eta)^{x_1-1}\cdots\eta(1-\eta)^{x_n-1}=\eta^{n}(1-\eta)^{\left(\sum_{j=1}^{n}x_j\right)-n}$$

b2) Optimization problem: The logarithm function is applied to make calculations easier

$$\log L(x_1,x_2,\ldots,x_n;\eta)=n\cdot\log(\eta)+\left[\left(\sum_{j=1}^{n}x_j\right)-n\right]\cdot\log(1-\eta)$$

The population distribution has only one parameter, so a one-dimensional function must be maximized. To find the local or relative extreme values, the necessary condition is:

$$0=\frac{d}{d\eta}\log L=\frac{n}{\eta}-\frac{\left(\sum_{j=1}^{n}x_j\right)-n}{1-\eta} \;\rightarrow\; n-n\eta=\eta\sum_{j=1}^{n}x_j-\eta n \;\rightarrow\; n=\eta\sum_{j=1}^{n}x_j \;\rightarrow\; \eta_0=\frac{n}{\sum_{j=1}^{n}x_j}=\frac{1}{\bar{x}}$$

To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\eta^2}\log L=-\frac{n}{\eta^2}-\frac{\left(\sum_{j=1}^{n}x_j\right)-n}{(1-\eta)^2}<0$$

as $\left(\sum_{j=1}^{n}x_j\right)-n\ge 0$ (note that xj ≥ 1, and the first term is always strictly negative). This holds for any value, including $\eta_0=1/\bar{x}$.

b3) The estimator:

$$\hat{\eta}_{ML}=\left(\frac{1}{n}\sum_{j=1}^{n}X_j\right)^{-1}=\frac{1}{\bar{X}}$$

C) Estimation of η, μ, and σ²

Since n = 27 and $\sum_{j=1}^{27}x_j=134$:

From the method of the moments: $\hat{\eta}_M=\frac{1}{\bar{x}}=\frac{1}{\frac{1}{27}\sum_{j=1}^{27}x_j}=\frac{27}{134}=0.201$.

From the maximum likelihood method, as the same estimator was obtained: $\hat{\eta}_{ML}=0.201$.

Since μ = E(X) = 1/η, an estimator of η induces an estimator of μ:

From the method of the moments: $\hat{\mu}_M=\frac{1}{\hat{\eta}_M}=\frac{134}{27}=4.96$.

From the maximum likelihood method, since the same estimator was obtained: $\hat{\mu}_{ML}=4.96$.

Note: From the numerical point of view, calculating 134/27 is expected to have smaller error than calculating 1/0.201.

Finally, since $\sigma^2=Var(X)=\frac{1-\eta}{\eta^2}$:

From the method of the moments: $\hat{\sigma}^2_M=\frac{1-\hat{\eta}_M}{\hat{\eta}_M^2}=\frac{1-\frac{27}{134}}{\left(\frac{27}{134}\right)^2}=\frac{(134-27)\cdot 134}{27^2}=19.67$.

From the maximum likelihood method: $\hat{\sigma}^2_{ML}=19.67$.

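These numbers are quickly recomputed; the sketch below is illustrative only (nothing in it comes from the original text beyond the given data n = 27 and Σxj = 134; the document's own snippets use R):

```python
from math import log

n, s = 27, 134                          # sample size and sum given in the statement
eta_hat = n / s                         # 1 / xbar, from both methods
mu_hat = s / n                          # plug-in estimate of E(X)
var_hat = (1 - eta_hat) / eta_hat ** 2  # plug-in estimate of Var(X)

def loglik(eta):
    """Geometric log-likelihood: n*log(eta) + (sum - n)*log(1 - eta)."""
    return n * log(eta) + (s - n) * log(1 - eta)

# eta_hat should not be improved by small moves in either direction
assert loglik(eta_hat) >= loglik(eta_hat - 0.01)
assert loglik(eta_hat) >= loglik(eta_hat + 0.01)

print(round(eta_hat, 3), round(mu_hat, 2), round(var_hat, 2))  # 0.201 4.96 19.67
```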

Conclusion: For the geometric model, the two methods provide the same estimator for η. We have used the estimator of η to obtain an estimator of μ. On the other hand, the quality of the estimator obtained should be studied, especially if the two methods had provided different estimators.

Exercise 3pe-m

A real-world variable is modeled by using a random variable X that follows a Poisson distribution. Given a simple random sample of size n,

A) Apply the method of the moments to obtain an estimator of the parameter λ.

B) Apply the maximum likelihood method to obtain an estimator of the parameter λ.

C) Use these estimators to build estimators of the mean μ and the variance σ2 of the distribution.

Discussion: Although a real-world population is mentioned, this statement is mathematical. It is implicitly assumed that the Poisson model is appropriate to study that variable (it can be supposed to be dimensionless). In a statistical study, this supposition should be evaluated, e.g. by applying a hypothesis test, before looking for an estimator of the population parameter. For a Poisson random variable, E(X) = λ and Var(X) = λ.

(See the appendixes to see how the mean and the variance of this distribution can be calculated.)

A) Method of the moments

a1) Population and sample moments: The population distribution has only one parameter, so one equation suffices. The first-order moments of the model X and the sample x are, respectively,

$$\mu_1(\lambda)=E(X)=\lambda \quad\text{and}\quad m_1(x_1,x_2,\ldots,x_n)=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}$$

a2) System of equations: Since the parameter of interest λ appears in the first-order moment of X, the first equation suffices. The system has only one trivial equation:

$$\mu_1(\lambda)=m_1(x_1,x_2,\ldots,x_n) \;\rightarrow\; \lambda=\frac{1}{n}\sum_{j=1}^{n}x_j=\bar{x}$$

a3) The estimator:

$$\hat{\lambda}_M=\frac{1}{n}\sum_{j=1}^{n}X_j=\bar{X}$$


B) Maximum likelihood method

b1) Likelihood function: We write the product and reorder the terms that are similar:

L( x1 , x2 , ... , x n ;λ)=∏ j=1

nf ( x j ;λ)=∏ j=1

n λx j

x j !e−λ= λ

x1

x1!e−λ⋅λ

x2

x2 !e−λ⋯ λ

xn

xn !e−λ= λ

∑ j=1

n

x j

∏ j=1

nx j !

e−nλ

b2) Optimization problem: The logarithm function is applied to make calculations easier:

log [L( x1 , x2 , ... , xn ;λ)]=log [λ∑ j=1

n

x j

]+log [e−nλ]−log [∏ j=1

nx j ! ]=(∑ j=1

nx j)log [λ]−nλ− log [∏ j=1

nx j ! ]

The population distribution has only one parameter, so a one-dimensional function must be maximized. To find the local extreme values, the necessary condition is:

$$0 = \frac{d}{d\lambda}\log[L(x_1, x_2, \ldots, x_n; \lambda)] = \Big(\sum_{j=1}^{n} x_j\Big)\frac{1}{\lambda} - n \quad\rightarrow\quad \lambda_0 = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x}$$

To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\lambda^2}\log[L(x_1, x_2, \ldots, x_n; \lambda)] = \Big(\sum_{j=1}^{n} x_j\Big)\frac{-1}{\lambda^2} < 0$$

since x ∈ {0, 1, 2, ...} implies $\sum_{j=1}^{n} x_j \ge 0$, with strict inequality unless every observation is zero. Then the second derivative is negative, in particular at λ₀.

b3) The estimator: For λ, it is obtained after substituting the lower-case letters xⱼ (numbers representing THE sample we have) by upper-case letters Xⱼ (random variables representing ANY possible sample we may have):

$$\hat{\lambda}_{ML} = \frac{1}{n}\sum_{j=1}^{n} X_j = \bar{X}$$

C) Estimation of μ and σ²

To obtain estimators of the mean and the variance, we take into account that for this model μ = E(X) = λ and σ² = Var(X) = λ, so by applying the plug-in principle:

$$\hat{\mu} = \hat{\lambda} = \bar{X}, \qquad \hat{\sigma}^2 = \hat{\lambda} = \bar{X}$$

Conclusion: For the Poisson model, the two methods provide the same estimator for λ, and therefore for μ and σ² (when the plug-in principle is applied). On the other hand, the quality of the estimator obtained should be studied (though the sample mean is a well-known estimator).
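As a quick numerical illustration (the data below are hypothetical, not from the statement), the common estimator of the two methods can be computed in a few lines of Python:

```python
# Hypothetical Poisson counts; both the method of the moments and maximum
# likelihood estimate lambda by the sample mean.
sample = [2, 0, 3, 1, 2, 4, 1, 3]
lam_hat = sum(sample) / len(sample)

# Plug-in estimates of the mean and the variance (both equal lambda).
mu_hat = lam_hat
var_hat = lam_hat
print(lam_hat)  # 2.0
```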

Exercise 4pe-m

A random variable X follows the normal distribution. Let X = (X1,...,Xn) be a simple random sample of X (seen as the population). To obtain an estimator of the parameters θ = (μ,σ), apply:

(A) The method of the moments (B) The maximum likelihood method

Discussion: This statement is mathematical. For the normal distribution, the density function is $f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.


(For this distribution, the mean and the variance are directly μ and σ²; this is proved in the appendixes.)

(A) Method of the moments

(a1) Population and sample moments

The population distribution has two parameters, so two equations are considered. The first-order moments are

$$\mu_1(\mu,\sigma) = E(X) = \mu \quad\text{and}\quad m_1(x_1, x_2, \ldots, x_n) = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x}$$

while the second-order moments are

$$\mu_2(\mu,\sigma) = E(X^2) = Var(X) + E(X)^2 = \sigma^2 + \mu^2 \quad\text{and}\quad m_2(x_1, x_2, \ldots, x_n) = \frac{1}{n}\sum_{j=1}^{n} x_j^2$$

(a2) System of equations

$$\begin{cases}\mu_1(\mu,\sigma) = m_1(x_1, x_2, \ldots, x_n)\\ \mu_2(\mu,\sigma) = m_2(x_1, x_2, \ldots, x_n)\end{cases} \;\rightarrow\; \begin{cases}\mu = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x}\\ \sigma^2 + \mu^2 = \frac{1}{n}\sum_{j=1}^{n} x_j^2\end{cases} \;\rightarrow\; \begin{cases}\mu = \bar{x}\\ \sigma^2 = \Big(\frac{1}{n}\sum_{j=1}^{n} x_j^2\Big) - \bar{x}^2 = s_x^2\end{cases}$$

where $Var(X) = E(X^2) - E(X)^2$ and $s_x^2 = \Big(\frac{1}{n}\sum_{j=1}^{n} x_j^2\Big) - \Big(\frac{1}{n}\sum_{j=1}^{n} x_j\Big)^2 = \overline{x^2} - \bar{x}^2$ have been used.

(a3) The estimator

$$\hat{\theta}_M = \begin{cases}\hat{\mu}_M = \bar{X}\\ \hat{\sigma}_M = s_X\end{cases}$$

(B) Maximum likelihood method

(b1) Likelihood function

The density function of the Gaussian distribution is $f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Then,

$$L(x_1, x_2, \ldots, x_n; \mu, \sigma) = \prod_{j=1}^{n} f(x_j; \mu, \sigma) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_j-\mu)^2}{2\sigma^2}} = \Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)^n e^{-\frac{1}{2\sigma^2}\sum_{j=1}^{n} (x_j-\mu)^2}$$

(b2) Optimization problem

Logarithm: The logarithm function is applied to make calculations easier:

$$\log[L(x_1, x_2, \ldots, x_n; \mu, \sigma)] = -\frac{n}{2}\log[2\pi\sigma^2] - \frac{1}{2\sigma^2}\sum_{j=1}^{n} (x_j-\mu)^2$$

Maximum: The population distribution has two parameters, so a two-dimensional function must be maximized. To discover the local extreme values, the necessary conditions are:

$$\begin{cases}\frac{\partial}{\partial\mu}\log[L(x_1, x_2, \ldots, x_n; \mu, \sigma)] = 0\\ \frac{\partial}{\partial\sigma}\log[L(x_1, x_2, \ldots, x_n; \mu, \sigma)] = 0\end{cases} \;\rightarrow\; \begin{cases}-\frac{1}{2\sigma^2}\sum_{j=1}^{n} 2(x_j-\mu)(-1) = 0\\ -\frac{n}{\sigma} - \frac{1}{2}\Big[\sum_{j=1}^{n} (x_j-\mu)^2\Big]\Big(\frac{-2\sigma}{\sigma^4}\Big) = 0\end{cases}$$

$$\begin{cases}\frac{1}{\sigma^2}\sum_{j=1}^{n} (x_j-\mu) = 0\\ -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^{n} (x_j-\mu)^2 = 0\end{cases} \;\rightarrow\; \begin{cases}\sum_{j=1}^{n} (x_j-\mu) = 0\\ -n + \frac{1}{\sigma^2}\sum_{j=1}^{n} (x_j-\mu)^2 = 0\end{cases} \;\rightarrow\; \begin{cases}\sum_{j=1}^{n} x_j = n\mu\\ \sum_{j=1}^{n} (x_j-\mu)^2 = n\sigma^2\end{cases}$$

$$\begin{cases}\mu = \frac{1}{n}\sum_{j=1}^{n} x_j\\ \sigma^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j-\mu)^2\end{cases} \;\rightarrow\; \begin{cases}\mu = \bar{x}\\ \sigma^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j-\bar{x})^2 = s_x^2\end{cases} \;\rightarrow\; \begin{cases}\mu = \bar{x}\\ \sigma = s_x\end{cases}$$

To verify that the only candidate is a (local) maximum, the sufficient conditions on the partial derivatives of second order are:

$$A = \frac{\partial^2}{\partial\mu^2}\log[L(x_1, \ldots, x_n; \mu, \sigma)] = \frac{\partial}{\partial\mu}\Big[\frac{1}{\sigma^2}\sum_{j=1}^{n} (x_j-\mu)\Big] = \frac{1}{\sigma^2}\sum_{j=1}^{n} (-1) = -\frac{n}{\sigma^2}$$

$$B = \frac{\partial^2}{\partial\mu\,\partial\sigma}\log[L(x_1, \ldots, x_n; \mu, \sigma)] = \frac{\partial}{\partial\sigma}\Big[\frac{1}{\sigma^2}\sum_{j=1}^{n} (x_j-\mu)\Big] = \frac{-2\sigma}{\sigma^4}\sum_{j=1}^{n} (x_j-\mu) = -\frac{2}{\sigma^3}\sum_{j=1}^{n} (x_j-\mu)$$

$$C = \frac{\partial^2}{\partial\sigma^2}\log[L(x_1, \ldots, x_n; \mu, \sigma)] = \frac{\partial}{\partial\sigma}\Big[-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^{n} (x_j-\mu)^2\Big] = \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\sum_{j=1}^{n} (x_j-\mu)^2$$

To calculate D = B² − AC, substituting the pair (μ, σ) = (x̄, s_x) in A, B and C simplifies the work:

$$A\big|_{(\bar{x}, s_x)} = -\frac{n}{s_x^2} < 0 \qquad B\big|_{(\bar{x}, s_x)} = -\frac{2}{s_x^3}\sum_{j=1}^{n} (x_j-\bar{x}) = 0 \qquad C\big|_{(\bar{x}, s_x)} = \frac{n}{s_x^2} - \frac{3}{s_x^4}\sum_{j=1}^{n} (x_j-\bar{x})^2 = -\frac{2n}{s_x^2}$$

$$\rightarrow\quad D\big|_{(\bar{x}, s_x)} = 0^2 - \Big(-\frac{n}{s_x^2}\Big)\Big(-\frac{2n}{s_x^2}\Big) = -\frac{2n^2}{s_x^4} < 0$$

as $\sum_{j=1}^{n} (x_j-\bar{x}) = \big(\sum_{j=1}^{n} x_j\big) - n\bar{x} = 0$ and $\sum_{j=1}^{n} (x_j-\bar{x})^2 = n s_x^2$. Then, $\log[L(x; \mu, \sigma)]$ has a maximum at (μ, σ) = (x̄, s_x), since it is a local extreme value and D < 0, A < 0.

(b3) The estimator

$$\hat{\theta}_{ML} = \begin{cases}\hat{\mu}_{ML} = \bar{X}\\ \hat{\sigma}_{ML} = s_X\end{cases}$$

Conclusion: Since in this case there are two parameters, both the parameter and its estimator can be thought of as two-dimensional quantities: θ = (μ, σ) and $\hat{\theta} = (\hat{\mu}, \hat{\sigma})$. On the other hand, the quality of the estimator obtained should be studied, especially if the two methods had provided different estimators.
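The ML estimators above can be evaluated directly on data. A minimal sketch in Python, with hypothetical observations, showing that the ML variance estimator uses divisor n (not the quasivariance divisor n−1):

```python
import math

# Hypothetical data; the ML estimators are the sample mean and the sample
# standard deviation computed with divisor n.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]
n = len(sample)

mu_hat = sum(sample) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n  # divisor n, not n-1
sigma_hat = math.sqrt(sigma2_hat)
```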


Exercise 5pe-m

The uniform distribution U[0,θ] has

$$f(x;\theta) = \begin{cases}\frac{1}{\theta} & \text{if } x \in [0,\theta]\\ 0 & \text{otherwise}\end{cases}$$

as a density function. Let X = (X1,...,Xn) be a simple random sample of a population X following this probability distribution.

A) Apply the method of the moments to find an estimator of the parameter θ.

B) Apply the maximum likelihood method to find an estimator of the parameter θ.

Use this estimator to build others for the mean and the variance of X.

Discussion: This statement is mathematical, and there is no supposition that would require justification. The random variable X is dimensionless. We are given the density function of the distribution of X, though for this distribution it could be deduced from the fact that all values have the same probability. For the general continuous uniform distribution U[a,b], μ = (a+b)/2 and σ² = (b−a)²/12.

Note: If we had not remembered the first population moments, with the notation of this exercise we could do

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_0^{\theta} x\,\frac{1}{\theta}\,dx = \frac{1}{\theta}\Big[\frac{x^2}{2}\Big]_0^{\theta} = \frac{1}{\theta}\Big(\frac{\theta^2}{2} - 0\Big) = \frac{\theta}{2}$$

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_0^{\theta} x^2\,\frac{1}{\theta}\,dx = \frac{1}{\theta}\Big[\frac{x^3}{3}\Big]_0^{\theta} = \frac{1}{\theta}\Big(\frac{\theta^3}{3} - 0\Big) = \frac{\theta^2}{3}$$

so

$$\mu = E(X) = \frac{\theta}{2} \quad\text{and}\quad \sigma^2 = Var(X) = E(X^2) - E(X)^2 = \frac{\theta^2}{3} - \Big(\frac{\theta}{2}\Big)^2 = \theta^2\Big(\frac{1}{3} - \frac{1}{4}\Big) = \frac{\theta^2}{12}$$

A) Method of the moments

a1) Population and sample moments: For uniform distributions, discrete or continuous, the mean is the middle value. Then, the first-order moments of the distribution and of the sample are

$$\mu_1(\theta) = \frac{0+\theta}{2} = \frac{\theta}{2} \quad\text{and}\quad m_1(x_1, \ldots, x_n) = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x}$$

a2) System of equations:

$$\mu_1(\theta) = m_1(x_1, x_2, \ldots, x_n) \;\rightarrow\; \frac{\theta}{2} = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x} \;\rightarrow\; \theta_0 = \frac{2}{n}\sum_{j=1}^{n} x_j = 2\bar{x}$$

a3) The estimator:

$$\hat{\theta}_M = \frac{2}{n}\sum_{j=1}^{n} X_j = 2\bar{X}$$


B) Maximum likelihood method

b1) Likelihood function: The density function is $f(x;\theta) = \frac{1}{\theta}$ for $0 \le x \le \theta$, so

$$L(x_1, x_2, \ldots, x_n; \theta) = \prod_{j=1}^{n} f(x_j;\theta) = \prod_{j=1}^{n} \frac{1}{\theta} = \frac{1}{\theta^n}$$

b2) Optimization problem: First, we try to discover the maximum by applying the technique based on the derivatives. The logarithm function is applied,

$$\log[L(x_1, x_2, \ldots, x_n; \theta)] = \log[\theta^{-n}] = -n\log(\theta),$$

and the first condition leads to a useless equation:

$$0 = \frac{d}{d\theta}\log[L(x_1, x_2, \ldots, x_n; \theta)] = -\frac{n}{\theta} \quad\rightarrow\quad ?$$

Then, we realize that global minima and maxima cannot always be found through the derivatives (only if they are also local extremes). In fact, it is easy to see that the function L monotonically decreases with θ, and therefore monotonically increases when θ decreases (this pattern, or just the opposite, tends to happen when the probability function changes monotonically with the parameter, e.g. when the parameter appears only once in the expression). As a consequence, it has no local extreme values. Since, on the other hand, $0 \le x_j \le \theta$, ∀j,

$$\begin{cases}L \uparrow \text{ when } \theta \downarrow\\ \text{but } x_j \le \theta,\ \forall j\end{cases} \quad\rightarrow\quad \theta_0 = \max_j\{x_j\}$$

b3) The estimator: It is obtained after substituting the lower-case letters xⱼ (numbers representing THE sample we have) by upper-case letters Xⱼ (random variables representing ANY possible sample we may have):

$$\hat{\theta}_{ML} = \max_j\{X_j\}$$

C) Estimation of μ and σ²

To obtain estimators of the mean, we take into account that $\mu = E(X) = \frac{\theta}{2}$ and apply the plug-in principle:

$$\hat{\mu}_M = \frac{\hat{\theta}_M}{2} = \frac{2\bar{X}}{2} = \bar{X} \qquad\qquad \hat{\mu}_{ML} = \frac{\hat{\theta}_{ML}}{2} = \frac{\max_j\{X_j\}}{2}$$

To obtain estimators of the variance, since $\sigma^2 = Var(X) = \frac{\theta^2}{12}$,

$$\hat{\sigma}_M^2 = \frac{\hat{\theta}_M^2}{12} = \frac{(2\bar{X})^2}{12} = \frac{(\bar{X})^2}{3} \qquad\qquad \hat{\sigma}_{ML}^2 = \frac{\hat{\theta}_{ML}^2}{12} = \frac{(\max_j\{X_j\})^2}{12}$$

Conclusion: For the uniform distribution, the two methods provide different estimators of the parameter, and hence of the mean. The quality of the estimators obtained should be studied.
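The disagreement between the two methods is easy to see numerically. A short Python sketch with hypothetical data (any values in [0, θ] would do):

```python
# Hypothetical observations from U[0, theta]; the two methods disagree.
sample = [0.9, 2.3, 1.7, 3.4, 0.5]
n = len(sample)

theta_mom = 2 * sum(sample) / n   # method of the moments: twice the sample mean
theta_mle = max(sample)           # maximum likelihood: the sample maximum

# Plug-in estimators of the mean and the variance.
mu_mom, mu_mle = theta_mom / 2, theta_mle / 2
var_mom, var_mle = theta_mom ** 2 / 12, theta_mle ** 2 / 12
```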


Exercise 6pe-m

A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,

$$f(x;\theta) = \begin{cases}0 & \text{if } x < 3\\ \frac{1}{\theta}\,e^{-\frac{x-3}{\theta}} & \text{if } x \ge 3\end{cases}$$

A) Apply the method of the moments to find an estimator of the parameter θ.

B) Apply the maximum likelihood method to find an estimator of the parameter θ.

C) Use the estimators obtained to build estimators of the mean μ and the variance σ2.

Hint: Use that E(X) = θ + 3 and E(X²) = 2θ² + 6θ + 9.

Discussion: This statement is mathematical. The random variable X is supposed to be dimensionless. The probability function and the first two moments are given, which is enough to apply the two methods. In the last step, the plug-in principle will be applied.

Note: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and exponentials are functions “of different type”):

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_3^{\infty} x\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = \Big[-x e^{-\frac{x-3}{\theta}} - \int 1\cdot\big(-e^{-\frac{x-3}{\theta}}\big)\,dx\Big]_3^{\infty} = \Big[-(x+\theta)\,e^{-\frac{x-3}{\theta}}\Big]_3^{\infty} = 3+\theta.$$

That $\int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx$ has been used with

• $u = x \;\rightarrow\; u' = 1$
• $v' = \frac{1}{\theta}e^{-\frac{x-3}{\theta}} \;\rightarrow\; v = \int \frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = -e^{-\frac{x-3}{\theta}}$

On the other hand, eˣ changes faster than xᵏ for any k. To calculate E(X²):

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_3^{\infty} x^2\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = \Big[-x^2 e^{-\frac{x-3}{\theta}} + 2\int x\,e^{-\frac{x-3}{\theta}}\,dx\Big]_3^{\infty} = \big(3^2 - 0\big) + 2\theta\int_3^{\infty} x\,\frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = 9 + 2\theta\mu = 9 + 2\theta(3+\theta) = 2\theta^2 + 6\theta + 9.$$

Integration by parts has been applied again: $\int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx$ with

• $u = x^2 \;\rightarrow\; u' = 2x$
• $v' = \frac{1}{\theta}e^{-\frac{x-3}{\theta}} \;\rightarrow\; v = \int \frac{1}{\theta}e^{-\frac{x-3}{\theta}}\,dx = -e^{-\frac{x-3}{\theta}}$

Again, eˣ changes faster than xᵏ for any k.

A) Method of the moments

a1) Population and sample moments: There is only one parameter, so one equation suffices. The first-order moments of the model X and the sample x are, respectively,

$$\mu_1(\theta) = E(X) = \theta + 3 \quad\text{and}\quad m_1(x_1, x_2, \ldots, x_n) = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x}$$

a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first equation suffices:

$$\mu_1(\theta) = m_1(x_1, x_2, \ldots, x_n) \;\rightarrow\; \theta + 3 = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x} \;\rightarrow\; \theta = \bar{x} - 3$$

a3) The estimator:

$$\hat{\theta}_M = \bar{X} - 3$$

B) Maximum likelihood method

b1) Likelihood function: For this probability distribution, the density function is $f(x;\theta) = \frac{1}{\theta}e^{-\frac{x-3}{\theta}}$, so

$$L(x_1, x_2, \ldots, x_n; \theta) = \prod_{j=1}^{n} f(x_j;\theta) = \prod_{j=1}^{n} \frac{1}{\theta}e^{-\frac{x_j-3}{\theta}} = \frac{1}{\theta^n}\,e^{-\frac{1}{\theta}\sum_{j=1}^{n} (x_j-3)}$$

b2) Optimization problem: The logarithm function is applied to make calculations easier:

$$\log[L(x_1, x_2, \ldots, x_n; \theta)] = \log(\theta^{-n}) - \frac{1}{\theta}\sum_{j=1}^{n} (x_j-3) = -n\log(\theta) - \frac{1}{\theta}\sum_{j=1}^{n} (x_j-3)$$

The population distribution has only one parameter, so a one-dimensional function must be maximized. To find the local or relative extreme values, the necessary condition is:

$$0 = \frac{d}{d\theta}\log[L(x_1, x_2, \ldots, x_n; \theta)] = -n\frac{1}{\theta} + \frac{1}{\theta^2}\sum_{j=1}^{n} (x_j-3) \quad\rightarrow\quad \frac{n}{\theta} = \frac{1}{\theta^2}\sum_{j=1}^{n} (x_j-3)$$

$$\rightarrow\quad \theta = \frac{1}{n}\sum_{j=1}^{n} (x_j-3) = \frac{1}{n}\sum_{j=1}^{n} x_j - \frac{1}{n}\sum_{j=1}^{n} 3 = \bar{x} - 3 \quad\rightarrow\quad \theta_0 = \bar{x} - 3$$

To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\theta^2}\log[L(x_1, x_2, \ldots, x_n; \theta)] = \frac{d}{d\theta}\Big[-\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{j=1}^{n} (x_j-3)\Big] = \frac{n}{\theta^2} - \frac{2}{\theta^3}\sum_{j=1}^{n} (x_j-3) \;\overset{?}{<}\; 0$$

The first term is always positive but the second is always negative, so we had better substitute the candidate:

$$\frac{d^2}{d\theta^2}\log[L(x_1, x_2, \ldots, x_n; \theta)]\Big|_{\theta_0} = \frac{n}{\theta_0^2} - \frac{2}{\theta_0^3}\,n(\bar{x}-3) = \frac{n}{\theta_0^2} - \frac{2}{\theta_0^3}\,n\theta_0 = -\frac{n}{\theta_0^2} < 0$$

b3) The estimator:

$$\hat{\theta}_{ML} = \bar{X} - 3$$

C) Estimation of μ and σ²

c1) For the mean: By using the hint and the plug-in principle,

From the method of the moments: $\hat{\mu}_M = \hat{\theta}_M + 3 = \bar{X} - 3 + 3 = \bar{X}$.
From the maximum likelihood method, as the same estimator was obtained: $\hat{\mu}_{ML} = \bar{X}$.

c2) For the variance: We must write it in terms of the first two moments of X,

$$\sigma^2 = Var(X) = E(X^2) - E(X)^2 = 2\theta^2 + 6\theta + 9 - (\theta+3)^2 = 2\theta^2 + 6\theta + 9 - \theta^2 - 6\theta - 9 = \theta^2$$

Then,

From the method of the moments: $\hat{\sigma}_M^2 = \hat{\theta}_M^2 = (\bar{X}-3)^2 = (\bar{X})^2 - 6\bar{X} + 9$.
From the maximum likelihood method: $\hat{\sigma}_{ML}^2 = \hat{\theta}_{ML}^2 = (\bar{X}-3)^2 = (\bar{X})^2 - 6\bar{X} + 9$.

Conclusion: For this model, the two methods provide the same estimator. We have used the estimator of θ to obtain estimators of μ and σ². The quality of the estimator obtained should be studied, especially if the two methods had provided different estimators. Regarding the original probability distribution: (i) the expression reminds us of the exponential distribution; (ii) the term x−3 suggests a translation; and (iii) the variance θ² is the same as the variance of the exponential distribution. After translating all possible values x, the mean is also translated but the variance is not. Thus, the distribution of the statement is a translation of the exponential distribution.

In fact, the distribution with probability function

$$f(x;\theta) = \frac{1}{\theta}\,e^{-\frac{x-\delta}{\theta}}, \quad x > \delta$$

(and zero elsewhere) is termed the two-parameter exponential distribution. It is a translation of size δ of the usual exponential distribution. A particular, simple case is obtained for θ = 1 and δ = 0, since $f(x) = e^{-x}$, x > 0.
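A brief numerical sketch of the common estimator, with hypothetical observations (all at least 3, as the support requires):

```python
# Hypothetical observations; both methods give theta_hat = xbar - 3.
sample = [3.4, 5.1, 4.2, 3.8, 6.3]
xbar = sum(sample) / len(sample)
theta_hat = xbar - 3

# Plug-in estimators: mu = theta + 3 and sigma^2 = theta^2.
mu_hat = theta_hat + 3
var_hat = theta_hat ** 2
```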

Exercise 7pe-m

A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,

$$f(x;\theta) = \begin{cases}\frac{3x^2}{\theta^3} & \text{if } 0 \le x \le \theta\\ 0 & \text{otherwise}\end{cases}$$

A) Apply the method of the moments to find an estimator of the parameter θ.

B) Apply the maximum likelihood method to find an estimator of the parameter θ.

C) Use the estimators obtained to build estimators of the mean μ and the variance σ2.

Hint: Use that E(X) = 3θ/4 and Var(X) = 3θ²/80.

Discussion: This statement is mathematical. The random variable X is supposed to be dimensionless. The probability function and the first two moments are given, which is enough to apply the two methods. In the last step, the plug-in principle will be applied.

Note: If E(X) had not been given in the statement, it could have been calculated by integrating:

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_0^{\theta} x\,\frac{3x^2}{\theta^3}\,dx = \frac{3}{\theta^3}\Big[\frac{x^4}{4}\Big]_0^{\theta} = \frac{3}{4}\theta$$

On the other hand, if Var(X) had not been given in the statement, it could have been calculated by using a property and integrating:

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_0^{\theta} x^2\,\frac{3x^2}{\theta^3}\,dx = \frac{3}{\theta^3}\Big[\frac{x^5}{5}\Big]_0^{\theta} = \frac{3}{5}\theta^2.$$

Now,

$$\mu = E(X) = \frac{3}{4}\theta \quad\text{and}\quad \sigma^2 = Var(X) = E(X^2) - E(X)^2 = \frac{3}{5}\theta^2 - \Big(\frac{3}{4}\theta\Big)^2 = \Big(\frac{3}{5} - \frac{3^2}{4^2}\Big)\theta^2 = \frac{3}{80}\theta^2.$$

A) Method of the moments

a1) Population and sample moments: There is only one parameter, so one equation suffices. The first-order moments of the model X and the sample x are, respectively,

$$\mu_1(\theta) = E(X) = \frac{3}{4}\theta \quad\text{and}\quad m_1(x_1, x_2, \ldots, x_n) = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x}$$

a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first equation suffices:

$$\mu_1(\theta) = m_1(x_1, x_2, \ldots, x_n) \;\rightarrow\; \frac{3}{4}\theta = \frac{1}{n}\sum_{j=1}^{n} x_j = \bar{x} \;\rightarrow\; \theta_0 = \frac{4}{3}\bar{x}$$

a3) The estimator:

$$\hat{\theta}_M = \frac{4}{3}\bar{X}$$

B) Maximum likelihood method

b1) Likelihood function: For this probability distribution, the density function is $f(x;\theta) = \frac{3x^2}{\theta^3}$, so

$$L(x_1, x_2, \ldots, x_n; \theta) = \prod_{j=1}^{n} f(x_j;\theta) = \prod_{j=1}^{n} \frac{3x_j^2}{\theta^3} = \frac{3^n}{\theta^{3n}}\prod_{j=1}^{n} x_j^2$$

b2) Optimization problem: The logarithm function is applied to make calculations easier:

$$\log[L(x_1, x_2, \ldots, x_n; \theta)] = \log(3^n) - 3n\log(\theta) + \log\Big(\prod_{j=1}^{n} x_j^2\Big)$$

Now, if we try to find the maximum by looking at the first-order derivatives, a useless equation is obtained:

$$0 = \frac{d}{d\theta}\log[L(x_1, x_2, \ldots, x_n; \theta)] = -\frac{3n}{\theta} \quad\rightarrow\quad ?$$

Then, we realize that global minima and maxima cannot in general be found through the derivatives (only if they are also local). It is easy to see that the function L monotonically increases when θ decreases (this pattern, or just the opposite, tends to happen when the probability function changes monotonically with the parameter, e.g. when the parameter appears only once in the expression). As a consequence, it has no local extreme values. On the other hand, $0 \le x_j \le \theta$, ∀j, so

$$\begin{cases}L \uparrow \text{ when } \theta \downarrow\\ \text{but } x_j \le \theta,\ \forall j\end{cases} \quad\rightarrow\quad \theta_0 = \max_j\{x_j\}$$

b3) The estimator:

$$\hat{\theta}_{ML} = \max_j\{X_j\}$$

C) Estimation of μ and σ²

c1) For the mean: By using the hint and the plug-in principle,

From the method of the moments: $\hat{\mu}_M = \frac{3}{4}\hat{\theta}_M = \frac{3}{4}\cdot\frac{4}{3}\bar{X} = \bar{X}$.
From the maximum likelihood method: $\hat{\mu}_{ML} = \frac{3}{4}\hat{\theta}_{ML} = \frac{3}{4}\max_j\{X_j\}$.

c2) For the variance: By using that principle again,

From the method of the moments: $\hat{\sigma}_M^2 = \frac{3}{80}\hat{\theta}_M^2 = \frac{3}{80}\Big(\frac{4}{3}\bar{X}\Big)^2 = \frac{1}{15}(\bar{X})^2$.
From the maximum likelihood method: $\hat{\sigma}_{ML}^2 = \frac{3}{80}\hat{\theta}_{ML}^2 = \frac{3}{80}\big(\max_j\{X_j\}\big)^2$.

Conclusion: For this model, the two methods provide different estimators. The quality of the estimators obtained should be studied. We have used the estimator of θ to obtain estimators of μ and σ².
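As with the uniform case, the disagreement can be illustrated numerically (the data below are hypothetical):

```python
# Hypothetical observations in [0, theta]; the two methods disagree here too.
sample = [1.2, 2.9, 2.1, 3.3, 2.6]
n = len(sample)

theta_mom = 4 * sum(sample) / (3 * n)   # from (3/4)*theta = xbar
theta_mle = max(sample)

mu_mom = 3 * theta_mom / 4              # equals the sample mean
mu_mle = 3 * theta_mle / 4
var_mom = 3 * theta_mom ** 2 / 80       # equals (xbar^2)/15
var_mle = 3 * theta_mle ** 2 / 80
```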

[PE] Properties of Estimators

Remark 4pe: As regards the sample sizes, we can talk about static situations where we study the dependence of the concepts on the sizes, or the possible relation between the sizes, say nX = c·nY. On the other hand, we can talk about dynamic situations where the same dependences are studied asymptotically while the sample sizes are always increasing, say nX(k) = c(k)·nY(k), where k is the index of a sequence of statistical schemes with those sample sizes. (Statistically, we are interested in sequences with nondecreasing sample sizes; mathematically, all possible sequences should be taken into account.) The static and the dynamic situations are respectively represented in the following figures: [figures not reproduced]

Remark 5pe: We do not usually use the definition of the mean square error but the result at the end of the following equalities:

$$MSE(\hat{\theta}) = E([\hat{\theta}-\theta]^2) = E([\hat{\theta}-E(\hat{\theta})+E(\hat{\theta})-\theta]^2) = E\big([\hat{\theta}-E(\hat{\theta})]^2 + [E(\hat{\theta})-\theta]^2 + 2[\hat{\theta}-E(\hat{\theta})]\cdot[E(\hat{\theta})-\theta]\big)$$
$$= E([\hat{\theta}-E(\hat{\theta})]^2) + [E(\hat{\theta})-\theta]^2 + 2E(\hat{\theta})\cdot[E(\hat{\theta})-\theta] - 2E(\hat{\theta})\cdot[E(\hat{\theta})-\theta] = Var(\hat{\theta}) + b(\hat{\theta})^2$$

Remark 6pe: To study the consistency in probability we have been taught a sufficient—but not necessary—condition that is equivalent to the consistency in mean of order two (managing the definition is quite complex). Thus, this type of consistency is proved when the condition is fulfilled, which is sufficient—but not necessary—for the consistency in probability. By using Chebyshev's inequality:

$$P(|\hat{\theta}-\theta| \ge \epsilon) \le \frac{E((\hat{\theta}-\theta)^2)}{\epsilon^2} = \frac{MSE(\hat{\theta})}{\epsilon^2} \quad\rightarrow\quad \lim_{n\to\infty} P(|\hat{\theta}-\theta| \ge \epsilon) \le \frac{\lim_{n\to\infty} MSE(\hat{\theta})}{\epsilon^2}$$

If the sufficient condition is not fulfilled, the estimator under study is not consistent in mean of order two, but it can still be consistent in probability—this type of consistency should be studied in a different way. Additionally, since $MSE(\hat{\theta})$, $b(\hat{\theta})^2$ and $Var(\hat{\theta})$ are nonnegative, the mean square error is zero if and only if the other two are zero at the same time, and vice versa. The same happens for their limits. That is why we are allowed to split the limit of the mean square error into two limits.

Exercise 1pe-p

The efficiency (in lumens per watt, u) of light bulbs of a certain type has a population mean of 9.5u and a standard deviation of 0.5u, according to production specifications. The specifications for a room in which eight of these bulbs (the simple random sample) are to be installed call for the average efficiency of the eight bulbs to exceed 10u. Find the probability that this specification for the room will be met, assuming that efficiency measurements are normally distributed.

(From Mathematical Statistics with Applications, Mendenhall, W., D.D. Wackerly and R.L. Scheaffer, Duxbury Press.)


Discussion: The supposition that efficiency measurements follow the distribution N(μ = 9.5u, σ² = 0.5²u²) should be tested by applying an appropriate statistical technique. The event is defined in terms of X̄. We think about making the proper statistic appear, and hence being allowed to use its sampling distribution.

Identification of the variable and selection of the statistic: The variable is the efficiency of the light bulbs, while the estimator is the sample mean of eight elements. Since the population is normal and the two population parameters are known, we will consider the (dimensionless) statistic:

$$T(X;\mu) = \frac{\bar{X}-\mu}{\sqrt{\frac{\sigma^2}{n}}} \sim N(0,1)$$

Rewriting the event: Although in this case the sampling distribution of X̄ is known, as $\bar{X} \sim N\big(\mu, \frac{\sigma^2}{n}\big)$, we need to standardize before consulting the table of the standard normal distribution:

$$P(\bar{X} > 10) = P\Bigg(\frac{\bar{X}-\mu}{\sqrt{\frac{\sigma^2}{n}}} > \frac{10-\mu}{\sqrt{\frac{\sigma^2}{n}}}\Bigg) = P\Bigg(T > \frac{10-9.5}{\sqrt{\frac{0.5^2}{8}}}\Bigg) = P\Bigg(T > \frac{0.5\sqrt{8}}{\sqrt{0.5^2}}\Bigg) = P\big(T > \sqrt{8}\big) = 0.0023$$

where in this case the language R has been used:

> 1 - pnorm(sqrt(8),0,1)
[1] 0.002338867

Conclusion: The production specifications will be met, for the room mentioned, with a probability of 0.0023; that is, they will hardly be met.
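If neither a table nor R is at hand, the same probability can be computed with the standard normal CDF Φ(z) = (1 + erf(z/√2))/2, available through Python's standard library (a sketch reproducing the calculation above):

```python
import math

def phi(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 9.5, 0.5, 8
z = (10 - mu) / (sigma / math.sqrt(n))   # = sqrt(8), about 2.83
p = 1 - phi(z)
print(round(p, 4))  # 0.0023
```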

Exercise 2pe-p

When a production process is working properly, the resistance of the components follows a normal distribution with standard deviation 4.68u. A simple random sample with four components is taken. What is the probability that the sample quasivariance will be bigger than 30u²?

Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable resistance should be evaluated by using proper statistical techniques. The question involves S². Again, it is necessary to make the proper statistic appear, in order to use its sampling distribution.

Identification of the variable: R ≡ Resistance (of one component), R ~ N(μ, σ² = 4.68²u²)

Sample and statistic: R₁, R₂, R₃, R₄ (the resistance of four components is measured) → n = 4

$$S^2 = \frac{1}{4-1}\sum_{j=1}^{4} (R_j - \bar{R})^2 \quad\text{(sample quasivariance)}$$

Search for a known distribution: The quantity required is P(S² > 30). To calculate the probability of an event, we need to know the distribution of the random quantity involved. In this case, we do not know the sampling distribution of S², but since R follows a normal distribution we are allowed to use


$$T = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$

Then, by completing the inequality with the necessary constants (until making T appear):

$$P(S^2 > 30) = P\Big(\frac{(n-1)S^2}{\sigma^2} > \frac{(n-1)\,30}{\sigma^2}\Big) = P\Big(T > \frac{(4-1)\,30}{4.68^2}\Big) = P(T > 4.11)$$

where $T \sim \chi^2_3$. Multiplying and dividing by positive quantities has not changed the inequality.

Table of the χ² distribution: Since n−1 = 4−1 = 3, it is necessary to look at the third row. The probabilities in the table are given for events of the form P(T < x) (or P(T ≤ x), as the distribution is continuous), and therefore the complementary event must be considered:

$$P(T > 4.11) = 1 - P(T \le 4.11) = 1 - 0.75 = 0.25$$

Conclusion: The probability of the event is 0.25. This means that S² will sometimes take a value larger than 30u², when evaluated at specific data x coming from the mentioned distribution.
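Without a χ² table, P(T > 4.11) for T ~ χ²₃ can be approximated by simulation, using the fact that a χ²₃ variable is a sum of three squared independent standard normals (a rough Monte Carlo sketch; the seed and sample count are arbitrary choices):

```python
import random

random.seed(0)
reps = 200_000
# A chi-square variable with 3 degrees of freedom is Z1^2 + Z2^2 + Z3^2.
hits = sum(
    1 for _ in range(reps)
    if sum(random.gauss(0, 1) ** 2 for _ in range(3)) > 4.11
)
p = hits / reps
print(round(p, 2))  # close to the tabulated 0.25
```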

Exercise 3pe-p

A simple random sample of 270 homes was taken from a large population of older homes to estimate the proportion of homes with unsafe wiring. If, in fact, 20% of homes have unsafe wiring, what is the probability that the sample proportion will be between 16% and 24%?

Hint: Since probabilities and proportions are measured in a 0-to-1 scale, write all quantities in this scale.

(From Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)

LINGUISTIC NOTE (From: The Careful Writer: A Modern Guide to English Usage. Bernstein, T.M. Atheneum)

home, house. It is a tribute to the unquenchable sentimentalism of users of English that one of the matters of usage that seem to agitate them the most is the use of home to designate a structure designed for residential purposes. Their contention is that what the builder erects is a house and that the occupants then fashion it into a home.

That is, or at least was, basically true, but the distinction has become blurred. Nor is this solely the doing of the real estate operators. They do, indeed, lure prospective buyers not with the thought of mere masonry but with glowing picture of comfort, congeniality, and family collectivity that make a house into a home. But the prospective buyers are their co-conspirators; they, too, view the premises not as a heap of stone and wood but as a potential abode.

There may be areas in which the words are not used interchangeably. In legal or quasi-legal terminology we speak of a “house and lot,” not a “home and lot.” The police and fire departments usually speak of a robbery or a fire in a house, not a home, at Main Street and First Avenue. And the individual most often buys a home, but sells his house (there, apparently, speaks sentiment again). But in most areas the distinction between the words has become obfuscated. When a flood or a fire destroys a community, it wipes out not merely houses but homes as well, and homes has come to be accepted in this sense. No one would discourage the sentimentalists from trying to pry the two words apart, but it would be rash to predict much success for them.

Discussion: The information of this “real-world study” must be translated into the mathematical language. Since there are two possible situations, each home can be “modeled” by using a Bernoulli variable. Although given in a 0-to-100 scale, the population and sample proportions—always in a 0-to-1 scale—are involved. The dimensionless character of a proportion is due to its definition. Note that if the data (x1,...,xn) are taken and we have access to them, there is nothing random any longer. The lack of knowledge, as if we had to select n elements to build (X1,...,Xn), justifies the use of Probability Theory.

Identification of the variable and selection of the statistic: The variable having unsafe wiring can take two possible values: 0 (not having unsafe wiring) and 1 (having it, if one wants to register or count this fact). The theoretical proportion of older homes with unsafe wiring is known: η = 0.20 (20%). For this framework—a large sample from a Bernoulli population with parameter η—we select the dimensionless, asymptotic statistic:

$$T(X;\eta) = \frac{\hat{\eta}-\eta}{\sqrt{\frac{?(1-?)}{n}}} \;\overset{d}{\rightarrow}\; N(0,1)$$

where ? is substituted by the best information available about the parameter: η or $\hat{\eta}$. Here we know η.

Rewriting the event: We are asked for the probability $P(0.16 < \hat{\eta} < 0.24)$, but to calculate it we need to rewrite the event until making T appear:

$$P(0.16 < \hat{\eta} < 0.24) = P\Bigg(\frac{0.16-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}} < \frac{\hat{\eta}-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}} < \frac{0.24-\eta}{\sqrt{\frac{\eta(1-\eta)}{n}}}\Bigg)$$
$$= P\Bigg(T < \frac{0.24-0.20}{\sqrt{\frac{0.20(1-0.20)}{270}}}\Bigg) - P\Bigg(T \le \frac{0.16-0.20}{\sqrt{\frac{0.20(1-0.20)}{270}}}\Bigg) = P(T < 1.64) - P(T \le -1.64)$$

(In these calculations, we have standardized and then decomposed, but it is also possible to decompose and then standardize.) Now, let us assume that we have a table of the standard normal distribution including positive quantiles only. By using a simple plot with the density function of this distribution, it is easy to see (look at the areas) that for the second probability $P(T \le -1.64) = P(T \ge +1.64) = 1 - P(T < +1.64)$, so

$$P(T < 1.64) - P(T \le -1.64) = P(T < 1.64) - [1 - P(T < 1.64)] = 2\cdot P(T < 1.64) - 1 = 2\cdot 0.9495 - 1 = 0.90.$$

Alternatively, by using the language R:

> pnorm(1.64,0,1) - pnorm(-1.64,0,1)
[1] 0.8989948

Conclusion: The probability of the event is 0.90, which means that the sample proportion of older homes with unsafe wiring, calculated from the sample X = (X1,...,X270), will take a value between 0.16 and 0.24 with this probability. As a percentage: the proportion of the 270 homes with unsafe wiring will be between 16% and 24% with 90% certainty.
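The same asymptotic normal calculation can be reproduced with Python's standard library, via Φ(z) = (1 + erf(z/√2))/2 (a sketch, using the exact z-values rather than the rounded ±1.64):

```python
import math

def phi(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

eta, n = 0.20, 270
se = math.sqrt(eta * (1 - eta) / n)          # standard error of the proportion
p = phi((0.24 - eta) / se) - phi((0.16 - eta) / se)
print(round(p, 2))  # 0.9
```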

Exercise 4pe-p

Simple random samples X = (X1,...,X11) and Y = (Y1,...,Y6) are taken from two independent populations

$$X \sim N(\mu_X = 1, \sigma_X^2 = 1) \quad\text{and}\quad Y \sim N(\mu_Y = 2, \sigma_Y^2 = 0.5)$$

Calculate or find:

(1) The probability $P(S_Y^2 \le 1.5)$.


(2) The quantile c such that $P(\bar{X} > c) = 0.25$.

(3) The probability $P(\bar{X} - 0.1 > 0.1 + \bar{Y})$.

(4) The quantile c such that $P\Big(\frac{S_X^2}{S_Y^2} \le c\Big) = 0.9$.

(Advanced Item) The probability $P(\bar{X} - 0.1 > 0.1 - \bar{Y})$.

Discussion: There are two independent normal populations whose parameters are known. The variances, not the standard deviations, are given. It is required to calculate probabilities or find quantiles for events involving the sample means and the sample quasivariances. In the first two sections, only one of the populations is involved. The sample sizes are 11 and 6, respectively. The variables X and Y are dimensionless, and so are both sides of the inequalities.

(1) The event involves the estimator S², which reminds us of the statistic $T = \frac{(n_Y-1)S_Y^2}{\sigma_Y^2} \sim \chi^2_{n_Y-1}$. Then,

$$P(S_Y^2 \le 1.5) = P\Big(\frac{(n_Y-1)S_Y^2}{\sigma_Y^2} \le \frac{(n_Y-1)\,1.5}{\sigma_Y^2}\Big) = P\Big(T \le \frac{(6-1)\,1.5}{0.5}\Big) = P\Big(T \le \frac{5\cdot 1.5}{0.5}\Big) = P(T \le 15) = 0.99$$

(2) The event involves X̄, so we think about the statistic $T = \frac{\bar{X}-\mu_X}{\sqrt{\frac{\sigma_X^2}{n_X}}} \sim N(0,1)$. Then,

$$0.25 = P(\bar{X} > c) = P\Bigg(\frac{\bar{X}-\mu_X}{\sqrt{\frac{\sigma_X^2}{n_X}}} > \frac{c-\mu_X}{\sqrt{\frac{\sigma_X^2}{n_X}}}\Bigg) = P\Bigg(T > \frac{c-1}{\sqrt{\frac{1}{11}}}\Bigg)$$

or, equivalently,

$$1 - 0.25 = 0.75 = P\Bigg(T \le \frac{c-1}{\sqrt{\frac{1}{11}}}\Bigg)$$

Now, the quantile found in the table of the standard normal distribution must verify that

$$r_{0.25} = l_{0.75} = 0.674 = \frac{c-1}{\sqrt{\frac{1}{11}}} \quad\rightarrow\quad c = 0.674\sqrt{\tfrac{1}{11}} + 1 = 1.20$$

(3) To work with the means of two populations, we use T = ((X̄ − Ȳ) − (μX − μY))/√(σX²/nX + σY²/nY) ∼ N(0,1), so

P(X̄ − 0.1 > 0.1 + Ȳ) = P(X̄ − Ȳ > 0.2) = P(((X̄ − Ȳ) − (μX − μY))/√(σX²/nX + σY²/nY) > (0.2 − (μX − μY))/√(σX²/nX + σY²/nY)) = P(T > (0.2 − (1 − 2))/√(1/11 + 0.5/6))

= P(T > 2.87) = 1 − P(T ≤ 2.87) = 1 − 0.9979 = 0.0021
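A quick numerical cross-check of item (3), in Python instead of the book's R (scipy assumed available):

```python
# Cross-check of item (3): P(Xbar - Ybar > 0.2) for independent normal
# sample means (scipy assumed available).
import math
from scipy import stats

mu_x, sigma2_x, n_x = 1.0, 1.0, 11
mu_y, sigma2_y, n_y = 2.0, 0.5, 6
se = math.sqrt(sigma2_x / n_x + sigma2_y / n_y)
t = (0.2 - (mu_x - mu_y)) / se           # about 2.87
p = stats.norm.sf(t)                     # upper-tail probability
print(round(t, 2), round(p, 4))
```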

(4) To work with the variances of two populations, T = (SX²⋅σY²)/(SY²⋅σX²) ∼ F_{nX−1,nY−1} is used:

0.9 = P(SX²/SY² ≤ c) = P((σY²SX²)/(σX²SY²) ≤ c⋅σY²/σX²) = P(T ≤ c⋅σY²/σX²) = P(T ≤ c⋅0.5/1) = P(T ≤ c/2)

The quantile found in the table of the distribution F_{nX−1,nY−1} = F_{11−1,6−1} = F_{10,5} is 3.30, which allows us to find the unknown c:

r_{0.1} = l_{0.9} = 3.30 = c/2 → c = 6.60.
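The exact F quantile agrees with the tabled 3.30; a Python counterpart of the book's R call qf(0.9, 10, 5) (scipy assumed available):

```python
# Cross-check of item (4): exact 0.9 quantile of F_{10,5}, matching the
# R call qf(0.9, 10, 5) shown later in the document (scipy assumed).
from scipy import stats

q = stats.f.ppf(0.9, dfn=10, dfd=5)      # about 3.2974 (tabled as 3.30)
c = 2 * q                                # since P(T <= c/2) = 0.9
print(round(q, 4), round(c, 2))
```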

(Advanced Item) In this case, allocating the two sample means on the left side of the inequality leads to

P(X̄ − 0.1 > 0.1 − Ȳ) = P(X̄ + Ȳ > 0.2)

We remember that

X̄ ∼ N(μX, σX²/nX) and Ȳ ∼ N(μY, σY²/nY)

so the rules that govern the sums—and hence subtractions—of normally distributed variables imply both

X̄ − Ȳ ∼ N(μX − μY, σX²/nX + σY²/nY) and X̄ + Ȳ ∼ N(μX + μY, σX²/nX + σY²/nY)

(Note that in both cases the variances are added—uncertainty increases.) Although the difference is used more frequently, to compare two populations, the sampling distribution of the sum of the sample means is also known thanks to the rules for normal variables; alternatively, we could still use the first result by writing X̄ + Ȳ = X̄ − (−Ȳ) and using that −Ȳ has mean and variance equal to −μY and σY²/nY. Either way, after standardizing:

T = ((X̄ + Ȳ) − (μX + μY))/√(σX²/nX + σY²/nY) ∼ N(0,1)

This is the "mathematical tool" necessary to work with X̄ + Ȳ. Now,

P(X̄ − 0.1 > 0.1 − Ȳ) = P(X̄ + Ȳ > 0.2) = P(((X̄ + Ȳ) − (μX + μY))/√(σX²/nX + σY²/nY) > (0.2 − (μX + μY))/√(σX²/nX + σY²/nY)) = P(T > (0.2 − (1 + 2))/√(1/11 + 0.5/6))

= P(T > −6.71) = 1 − P(T ≤ −6.71) = 1

The quantile 6.71 is not usually in the tables of the N(0,1), so we can consider that P(T ≤ −6.71) ≈ 0. Or we can use the programming language R (see the output below).

Conclusion: For each case, we have selected the appropriate statistic. After completing the expression of the event, the statistic T appears. Then, since the (sampling) distribution of T is known, the tables can be used to calculate probabilities or to find quantiles. In the latter case, the unknown c is found from the quantile of T.


R checks (for the advanced item and for item (4), respectively):

> 1-pnorm(-6.71,0,1)
[1] 1

> qf(0.9, 10, 5)
[1] 3.297402


Exercise 5pe-p

Suppose that you manage a bank where the amounts of daily deposits and daily withdrawals are given by independent random variables with normal distributions. For deposits, the mean is ₤12,000 and the standard deviation is ₤4,000; for withdrawals, the mean is ₤10,000 and the standard deviation is ₤5,000.

(a) For a week, calculate or bound the probability that the five withdrawals will add up to more than ₤55,000.

(b) For a particular day, calculate or bound the probability that withdrawals will exceed deposits by more than ₤5,000.

Imagine that you are to launch a new monthly product. A prospective study indicated that profits (in million dollars) can be modeled through the random quantity Q = (X+1)/2.325, where X follows a t distribution with twenty degrees of freedom.

(c) For a particular month, calculate or bound the probability that profits will be smaller than $10⁶ (one million dollars).

(Based on an exercise of Business Statistics, Douglas Downing and Jeffrey Clark, Barron's.)

Discussion: There are several suppositions implicit in the statement, namely: (i) the normal distribution can reasonably be used to model the two variables of interest D and W; (ii) withdrawals and deposits are independent; and (iii) X can reasonably be modeled by using the t distribution. These suppositions should first be evaluated by using proper statistical techniques. To solve this exercise, the rules on sums and differences of normally distributed variables must be used.

Identification of variables and distributions: If D and W represent the random variables daily sum of deposits and daily sum of withdrawals, respectively, from the statement we have that

D ∼ N(μD = ₤12,000, σD² = ₤²4,000²) and W ∼ N(μW = ₤10,000, σW² = ₤²5,000²)

(a) Since the variables are measured daily, in a week we have five measurements (one for each working day).

Translation into the mathematical language: We are asked for the probability

P(W₁ + W₂ + W₃ + W₄ + W₅ > 55,000) = P(∑_{j=1}^5 Wj > 55,000)

Search for a known distribution: To calculate or bound this probability, we need to know the distribution of the sum or, alternatively, to relate it to any quantity whose distribution we know. By using the rules that govern the sums and subtractions of normal variables,

∑_{j=1}^5 Wj ∼ N(5μW, 5σW²)

Rewriting the event: We can easily rewrite the event in terms of the standardized version of this normal distribution:

P(∑_{j=1}^5 Wj > 55,000) = P((∑_{j=1}^5 Wj − 5μW)/√(5σW²) > (55,000 − 5μW)/√(5σW²)) = P(Z > (55,000 − 50,000)/√(5⋅5,000²)) = P(Z > 0.4472)


Consulting the table: Finally, it is enough to consult the table of the standard normal distribution Z. On the one hand, the table gives values for the quantiles 0.44 and 0.45, so we could round 0.4472 to the closest value 0.45 or, more exactly, bound the probability. On the other hand, our table provides lower-tail probabilities, so we will consider the complementary events. Since the upper-tail probability decreases as the threshold grows,

P(Z > 0.44) > P(Z > 0.4472) > P(Z > 0.45)

1 − P(Z ≤ 0.44) > P(Z > 0.4472) > 1 − P(Z ≤ 0.45)

1 − 0.6700 > P(Z > 0.4472) > 1 − 0.6736

0.3300 > P(Z > 0.4472) > 0.3264

Then,

0.3264 < P(∑_{j=1}^5 Wj > 55,000) < 0.3300
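The exact probability indeed falls between the two table bounds; a Python cross-check (scipy assumed available, paralleling the book's R snippets):

```python
# Cross-check of (a): the exact probability P(Z > 1/sqrt(5)) should lie
# inside the table-based bounds 0.3264 and 0.3300 (scipy assumed).
import math
from scipy import stats

z = 5000 / math.sqrt(5 * 5000**2)        # = 1/sqrt(5), about 0.4472
p = stats.norm.sf(z)
print(round(z, 4), round(p, 4))
```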

Note: It is also possible to relate the total sum to the sample mean

P(∑_{j=1}^5 Wj > 55,000) = P((1/5)∑_{j=1}^5 Wj > (1/5)⋅55,000) = P(W̄ > 11,000)

and use that

W̄ = (1/5)∑_{j=1}^5 Wj ∼ N(μW, σW²/5) → (W̄ − μW)/√(σW²/5) ∼ N(0,1)

(b) Translation into the mathematical language: We are asked for the probability P(W > D + 5,000).

Search for a known distribution: To calculate or bound this probability, we rewrite the event until all random quantities are on the left side of the inequality:

P(W > D + 5,000) = P(W − D > 5,000)

Now we need to know the distribution of W − D or, alternatively, of a quantity involving this difference. By again using the rules that govern the sums and differences of normal variables, it holds that

W − D ∼ N(μW − μD, σW² + σD²) = N(₤10,000 − ₤12,000, ₤²5,000² + ₤²4,000²)

Rewriting the event: We can easily express the event in terms of the standardized version of W – D:

P(W − D > 5,000) = P(((W − D) − (μW − μD))/√(σW² + σD²) > (5,000 − (μW − μD))/√(σW² + σD²))

= P(((W − D) − (−2,000))/√(25⋅10⁶ + 16⋅10⁶) > (5,000 − (−2,000))/√(25⋅10⁶ + 16⋅10⁶)) = P(Z > 7⋅10³/√(41⋅10⁶)) = P(Z > 1.0932)

Consulting the table: We can bound the probability as follows:

P(Z > 1.0900) > P(Z > 1.0932) > P(Z > 1.1000)

1 − P(Z ≤ 1.0900) > P(Z > 1.0932) > 1 − P(Z ≤ 1.1000)

1 − 0.8621 > P(Z > 1.0932) > 1 − 0.8643

0.1379 > P(Z > 1.0932) > 0.1357

Then, 0.1357 < P(W > D + 5,000) < 0.1379
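Again, the exact value sits inside the table bounds; a Python cross-check (scipy assumed available):

```python
# Cross-check of (b): exact P(W - D > 5000) with W - D ~ N(-2000, 41e6)
# (scipy assumed available).
import math
from scipy import stats

mean = 10000 - 12000                     # mu_W - mu_D
sd = math.sqrt(5000**2 + 4000**2)        # sqrt(41)*1000
p = stats.norm.sf((5000 - mean) / sd)    # P(Z > 1.0932)
print(round(p, 4))
```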


(c) Translation into the mathematical language: We are asked for P(((X+1)/2.325)⋅10⁶ < 1⋅10⁶) = P((X+1)/2.325 < 1).

Search for a known distribution: We do not know the distribution of (X+1)/2.325, but we know that

X ∼ t₂₀

Rewriting the event: The event can easily be rewritten in terms of this known distribution:

P((X+1)/2.325 < 1) = P(X + 1 < 2.325) = P(X < 2.325 − 1) = P(X < 1.325)

Consulting the table: Finally, it is enough to consult the table of the t distribution. The quantity 1.325 appears in our table of lower-tail probabilities, so

P(X < 1.325) = 0.900
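The t-table value can be reproduced numerically; a Python cross-check (scipy assumed available):

```python
# Cross-check of (c): P(X < 1.325) for X ~ t with 20 degrees of freedom
# (scipy assumed available).
from scipy import stats

p = stats.t.cdf(1.325, df=20)
print(round(p, 3))                       # close to the tabled 0.900
```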

Conclusion: For a week, the probability that the five withdrawals will add up to more than ₤55,000 is around 0.33. For a particular day, the probability that withdrawals will exceed deposits by more than ₤5,000 is around 0.14. For a particular month, the probability that profits will be smaller than one million dollars is 0.9, that is, quite high.

Exercise 6pe-p

To study the mean of a population variable X, μ = E(X), a simple random sample of size n is considered. Imagine that we do not trust the first and the last data, so we think about using the statistic

X̃ = (1/(n−2)) ∑_{j=2}^{n−1} Xj = (1/(n−2))(X₂ + X₃ + ⋯ + X_{n−1})

Calculate the expectation and the variance of this statistic. Calculate the mean square error (MSE) and its limit when n tends to infinity. Study the consistency. Compare the previous error with that of the ordinary sample mean.

Discussion: The statement of this exercise is mathematical. Here we are interested in the mean. The quantity X is dimensionless. We cannot apply the definitions directly, and the mean and the variance of the statistic must be written in terms of the mean and the variance of X by applying the basic properties of these measures.

Expectation and variance: The basic properties of the mean and the variance are applied:

E(X̃) = E((1/(n−2))(X₂ + ⋯ + X_{n−1})) = (1/(n−2))(E(X₂) + ⋯ + E(X_{n−1})) = (1/(n−2))(n−2)μ = μ

Var(X̃) = Var((1/(n−2))(X₂ + ⋯ + X_{n−1})) = (1/(n−2)²) ∑_{j=2}^{n−1} Var(Xj) = (1/(n−2)²)(n−2)σ² = σ²/(n−2)

When n increases, that is, when the sample consists of more and more data, the limits are, respectively:

lim_{n→∞} E(X̃) = μ and lim_{n→∞} Var(X̃) = lim_{n→∞} σ²/(n−2) = 0


Consistency: The previous limits show that X̃ has some basic desirable properties: (asymptotic) unbiasedness and vanishing variance. This pair is equivalent to the vanishing of the mean square error (MSE), that is, to the consistency in mean of order two—a sufficient, but not necessary, condition for the consistency in probability.

Comparison of errors:

MSE(X̃) = σ²/(n−2)    MSE(X̄) = σ²/n

Since σ² appears in the two positive quantities, by looking at the coefficients it is easy to see that

MSE(X̄) < MSE(X̃)

(for n larger than 2). This result is due to the fact that the sample mean uses all the data available, though only the number of data—not their quality, since all of them are supposed to follow the same distribution—is considered in calculating the mean square error. In the limit, −2 is negligible. We can plot the coefficients (they are also the mean square errors when σ = 1).

# Grid of values for 'n'
n = seq(from=3, to=10, by=1)
# The two sequences of coefficients
coeff1 = 1/(n-2)
coeff2 = 1/n
# The plot
allValues = c(coeff1, coeff2)
yLim = c(min(allValues), max(allValues))
x11(); par(mfcol=c(1,3))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')

This code generates an array of three figures. Asymptotically, both estimators behave similarly, since n−2 ≈ n.

Conclusion: X̃ is a consistent estimator of μ. The estimator is appropriate for estimating μ. When nothing suggests removing data, it is better to keep them in the sample.

Advanced theory: The estimator in the statement is the usual sample mean when the sample has n−2 data instead of n (leaving out these two data can be seen as a sort of data treatment implemented in the method, not in the previous analysis of the data). When either of the two left-out data is not trustworthy, using this estimator makes sense; otherwise, it does not exploit the information available efficiently. On the other hand, the sample mean can be affected by tiny or huge values (outliers). To make the sample mean robust, this estimator is sometimes considered after ordering the data from the smallest to the largest; if X_(j) is the j-th datum in the sample already reordered:

X̃ = (1/(n−2)) ∑_{j=2}^{n−1} X_(j) = (1/(n−2))(X_(2) + X_(3) + ⋯ + X_(n−1))

This new robust estimator of the population mean μ is called the trimmed sample mean, and any number of data can be left out—not only two.
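The trimming idea above can be sketched in a few lines; a minimal Python illustration (numpy and scipy assumed available; the data values are made up for the example):

```python
# A small sketch of the trimmed sample mean described above (numpy and
# scipy assumed available): dropping the smallest and largest ordered
# data blunts the effect of the outlier 100.
import numpy as np
from scipy import stats

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 100.0])
xs = np.sort(x)
manual = xs[1:-1].mean()               # leave out X_(1) and X_(n)
library = stats.trim_mean(x, 0.1)      # cuts 10% of the data per tail
print(manual, library, x.mean())       # the plain mean is dragged upwards
```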


Exercise 7pe-p

A population variable X follows the χ² distribution with κ degrees of freedom. We consider a statistic T that uses the information contained in the simple random sample X = (X1, X2,...,Xn). If

T(X) = T(X1, X2, ..., Xn) = 2X̄ − 1,

calculate its expectation and variance. Calculate the mean square error of T. As an estimator of twice the mean of the population law, is T a consistent estimator?

Hint: If X follows the χ² distribution with κ degrees of freedom, μ = E(X) = κ and σ² = Var(X) = 2κ.

Discussion: Even if a population is mentioned, this statement is mathematical. To calculate the value of these two properties of the sampling distribution of T, we have to apply the general properties of the expectation and the variance. The knowledge about the distribution of X will be used in the last steps. This is a dimensionless quantity. The mean square error is defined in terms of these quantities.

Expectation or mean:

E(T(X)) = E(2[(1/n)∑_{j=1}^n Xj] − 1) = E((2/n)∑_{j=1}^n Xj) − E(1) = (2/n)E(∑_{j=1}^n Xj) − 1 = (2/n)∑_{j=1}^n E(Xj) − 1 = (2/n)⋅n⋅E(X) − 1 = 2κ − 1 (since μ = E(X) = κ)

Variance:

Var(T(X)) = Var(2[(1/n)∑_{j=1}^n Xj] − 1) = Var((2/n)∑_{j=1}^n Xj) = (2/n)²Var(∑_{j=1}^n Xj) = (4/n²)∑_{j=1}^n Var(Xj) = (4/n²)⋅n⋅Var(X) = 8κ/n (since σ² = Var(X) = 2κ, and the independence of the Xj in a simple random sample justifies summing the variances)

Mean square error: Since b(T) = E(T) − 2E(X) = (2κ − 1) − 2κ = −1, then

MSE(T) = b(T)² + Var(T) = 1 + 8κ/n → 1 when n → ∞

Consistency: Although the variance of T tends to zero when n increases, the bias does not (thus, T is asymptotically biased). Hence, the mean square error does not tend to zero either, and nothing can be said about the consistency in probability in this way (although we can say that it is not consistent in mean of order two).

Conclusion: Since the mean square error tends to 1, in general T is not a "good" estimator of 2μ even with many data.
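The non-vanishing MSE can be seen in a simulation; a Python sketch (numpy assumed available; κ = 4 and n = 50 are arbitrary illustration values):

```python
# Simulation sketch of Exercise 7 (numpy assumed available): for
# X ~ chi-square with kappa degrees of freedom, T = 2*Xbar - 1 has mean
# 2*kappa - 1 and variance 8*kappa/n, so its MSE as an estimator of
# 2*kappa stays close to 1 + 8*kappa/n (here 1.64), not to 0.
import numpy as np

rng = np.random.default_rng(0)
kappa, n, reps = 4, 50, 200_000
T = 2 * rng.chisquare(kappa, size=(reps, n)).mean(axis=1) - 1
mse = np.mean((T - 2 * kappa) ** 2)
print(round(mse, 2), 1 + 8 * kappa / n)
```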


Exercise 8pe-p

Given a simple random sample of size n = 2, that is, X = (X1, X2), the following estimators of μ = E(X) are defined:

μ̂₁ = (1/2)X₁ + (1/2)X₂    μ̂₂ = (1/3)X₁ + (2/3)X₂

1) Calculate their mean square error.

2) Calculate the relative efficiency. Which one would you use to estimate μ?

(Based on an exercise of Statistics for Business and Economics. Newbold, P., W. Carlson and B. Thorne. Pearson-Prentice Hall.)

Discussion: This statement is basically mathematical. The relative efficiency is defined in terms of the mean square errors of the estimators.

(1) Means: By applying the basic properties of the expectation or mean,

E(μ̂₁) = E((1/2)X₁ + (1/2)X₂) = (1/2)E(X₁) + (1/2)E(X₂) = (1/2)E(X) + (1/2)E(X) = (1/2)μ + (1/2)μ = μ

E(μ̂₂) = E((1/3)X₁ + (2/3)X₂) = (1/3)E(X₁) + (2/3)E(X₂) = (1/3)E(X) + (2/3)E(X) = (1/3)μ + (2/3)μ = μ

Variances: By applying the basic properties of the variance,

Var(μ̂₁) = Var((1/2)X₁ + (1/2)X₂) = (1/2)²Var(X₁) + (1/2)²Var(X₂) = (1/4)σ² + (1/4)σ² = (1/2)σ²

Var(μ̂₂) = Var((1/3)X₁ + (2/3)X₂) = (1/3)²Var(X₁) + (2/3)²Var(X₂) = (1/9)σ² + (4/9)σ² = (5/9)σ²

Mean square errors:

MSE(μ̂₁) = b(μ̂₁)² + Var(μ̂₁) = [E(μ̂₁) − μ]² + Var(μ̂₁) = [μ − μ]² + (1/2)σ² = (1/2)σ²

MSE(μ̂₂) = b(μ̂₂)² + Var(μ̂₂) = [E(μ̂₂) − μ]² + Var(μ̂₂) = [μ − μ]² + (5/9)σ² = (5/9)σ²

(2) Relative efficiency:

Since the bias is zero for unbiased estimators, the mean square error is equal to the variance and we will prefer the estimator with the smallest variance. An easy way of comparing two estimators consists in using the concept of relative efficiency, which is a simple quotient (take into account which estimator you allocate in the numerator). When this quotient is over one, the estimator in the denominator has the smaller mean square error, and vice versa. In this case,

e(μ̂₁, μ̂₂) = MSE(μ̂₂)/MSE(μ̂₁) = ((5/9)σ²)/((1/2)σ²) = 10/9 > 1 → μ̂₁ is preferred for estimating μ.
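The ratio 10/9 can also be recovered by simulation; a Python sketch (numpy assumed available; the values μ = 3 and σ = 2 are arbitrary illustration choices):

```python
# Simulation sketch of Exercise 8 (numpy assumed available): both
# estimators are unbiased and the ratio of their mean square errors
# should approach 10/9.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, reps = 3.0, 2.0, 500_000
x1, x2 = rng.normal(mu, sigma, size=(2, reps))
mse1 = np.mean((0.5 * x1 + 0.5 * x2 - mu) ** 2)        # ~ sigma^2/2
mse2 = np.mean((x1 / 3 + 2 * x2 / 3 - mu) ** 2)        # ~ 5*sigma^2/9
print(round(mse2 / mse1, 2))
```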

Conclusion: Both estimators are unbiased, while the first has smaller variance; then, the first is preferred. We have not mathematically proved that this first estimator minimizes the variance, so we cannot say that it is an efficient estimator.


Exercise 9pe-p

The mean μ = E(X) of any population can be estimated from a simple random sample of size n through X̄. Prove that:

(a) This estimator is always consistent.

(b) For X normally distributed (normal population), this estimator is efficient.

Discussion: This statement is theoretical. The first section of this exercise needs calculations similar to those of previous exercises. To prove the efficiency, we have to apply its definition.

(a) Consistency: The expectation of the sample mean is always—for any population—the population mean. Nevertheless, we repeat the calculations:

E(X̄) = E((1/n)∑_{j=1}^n Xj) = (1/n)E(∑_{j=1}^n Xj) = (1/n)∑_{j=1}^n E(Xj) = (1/n)⋅n⋅E(X) = E(X) = μ

The variance of the sample mean is always—for any population—the population variance divided by n. We repeat the calculations too (the independence of the Xj in a simple random sample justifies summing the variances):

Var(X̄) = Var((1/n)∑_{j=1}^n Xj) = (1/n)²Var(∑_{j=1}^n Xj) = (1/n²)∑_{j=1}^n Var(Xj) = (1/n²)⋅n⋅Var(X) = σ²/n

The bias is defined as b(X̄) = E(X̄) − μ = 0. We prove the consistency (in probability) by using the sufficient—but not necessary—condition (consistency in mean of order two):

lim_{n→∞} MSE(X̄) = lim_{n→∞} [b(X̄)² + Var(X̄)] = lim_{n→∞} [0 + σ²/n] = 0

Then, it is consistent in mean of order two and therefore in probability.

(b) Efficiency: It is necessary to prove that the two conditions of the definition are fulfilled:

i. The expectation of X̄ is always μ = E(X), that is, X̄ is always an unbiased estimator of μ.
ii. X̄ has minimum variance, which happens—because of a theoretical result—when Var(X̄) attains the Cramér-Rao lower bound

1 / ( n⋅E[ (∂log[f(X;θ)]/∂θ)² ] )

where θ = μ in this case, and f(x;θ) is the probability function of the population law where the nonrandom variable x is substituted by the random variable X (otherwise, it is not possible to talk about expectation, since f(x;θ) is not random when θ is a parameter).

The unbiasedness is proved. On the other hand, we compute the Cramér-Rao lower bound step by step:

(1) Function (with X in place of x):

f(X;μ) = (1/√(2πσ²)) e^{−(X−μ)²/(2σ²)}


(2) Logarithm of the function:

log[f(X;μ)] = log(1/√(2πσ²)) + log(e^{−(X−μ)²/(2σ²)}) = −log(√(2πσ²)) − (X−μ)²/(2σ²)

(3) Partial derivative of the logarithm of the function:

∂/∂μ (log[f(X;μ)]) = 0 − (1/(2σ²))⋅2(X−μ)⋅(−1) = (X−μ)/σ²

(4) Expectation of the squared partial derivative of the logarithm of the function: In this step, we must rewrite the terms so as to make σ² = Var(X) = E((X − E(X))²) = E((X − μ)²) appear.

E[(∂log[f(X;μ)]/∂μ)²] = E[((X−μ)/σ²)²] = (1/σ⁴)E[(X−μ)²] = (1/σ⁴)Var(X) = σ²/σ⁴ = 1/σ²

(5) Cramér-Rao's lower bound:

1 / ( n⋅E[ (∂log[f(X;μ)]/∂μ)² ] ) = 1 / ( n⋅(1/σ²) ) = σ²/n

The variance of the estimator, calculated in section (a), attains the bound and hence the estimator has minimum variance. Since both conditions are fulfilled, the efficiency is proved.

Conclusion: We have proved that the sample mean X̄ is always—for any population—a consistent estimator of the population mean μ. For a normal population, it is also efficient.

Advanced theory: When log[f(x;θ)] is twice differentiable with respect to θ, the Cramér-Rao bound can equivalently be written as

−1 / ( n⋅E[ ∂²log[f(X;θ)]/∂θ² ] )

Concerning the regularity conditions, Wikipedia (http://en.wikipedia.org/wiki/Fisher_information) refers to eq. (2.5.16) of Theory of Point Estimation, Lehmann, E. L. and G. Casella, 1998, Springer. Let us assume that this alternative expression can be applied; then, step (3) would be

∂²/∂μ² (log[f(X;μ)]) = ∂/∂μ ((X−μ)/σ²) = (1/σ²)⋅(−1) = −1/σ²

step (4) would be

E[∂²log[f(X;μ)]/∂μ²] = E[−1/σ²] = −1/σ²

and, finally, step (5) would be

−1 / ( n⋅E[ ∂²log[f(X;μ)]/∂μ² ] ) = −1 / ( n⋅(−1/σ²) ) = σ²/n

We would have obtained the same result with easier calculations, although the fulfillment of the regularity conditions would have to be verified first.


Exercise 10pe-p

Let θ be the parameter of a population random variable X that follows a continuous uniform distribution on the interval [θ−2, θ+1], and let X = (X1,...,Xn) be a simple random sample; then,

(a) Plot the density function of the variable X.

(b) Study the consistency of the sample mean X when it is used to estimate the parameter θ.

(c) Study the efficiency of the sample mean X when it is used to estimate the parameter θ.

(d) Find an unbiased estimator of θ and study its consistency.

Hint: Use that E(X) = θ – 1/2 and Var(X) = 3/4.

Discussion: This statement is mathematical. We should know the density function of the continuous uniform distribution, although it could also be deduced from the fact that all possible values have the same probability. The quantity X is dimensionless.

(a) Density function: For this distribution, all values have the same probability, so the density function must be a flat curve: its height is 1/((θ+1)−(θ−2)) = 1/3 over the interval [θ−2, θ+1], and 0 elsewhere. For the case θ > 2 the whole interval lies to the right of the origin (the figure is similar for any other θ).

This plot is not necessary for the following sections.

(b) Study the consistency (in probability) of X̄ as an estimator of θ

We apply the sufficient condition of consistency in mean of order two: lim_{n→∞} MSE(θ̂) = 0, which holds if and only if lim_{n→∞} b(θ̂) = 0 and lim_{n→∞} Var(θ̂) = 0.

(b1) Bias: By applying a property of the sample mean and the information of the statement,

E(X̄) = E(X) = θ − 1/2 → b(X̄) = E(X̄) − θ = θ − 1/2 − θ = −1/2 → lim_{n→∞} b(X̄) = −1/2

(It is asymptotically biased.) Since one condition of the pair is not verified, it is not necessary to check the other, and neither the fulfillment of the consistency in probability nor the opposite can be proved this way (though the estimator is not consistent in the mean-square sense).

(c) Study the efficiency of X̄ as an estimator of θ

The definition of efficiency consists of two conditions: unbiasedness and minimum variance (the latter is checked by comparing the variance with the Cramér-Rao lower bound).

(c1) Unbiasedness: In the previous section it was proved that X̄ is a biased estimator of θ.

The first condition does not hold, and hence it is not necessary to check the second one. The conclusion is that X̄ is not an efficient estimator of θ.


(d) An unbiased estimator of θ and its consistency

In (b) we found that b(X̄) = −1/2, which suggests correcting the previous estimator by adding 1/2, that is: θ̂ = X̄ + 1/2. To study its consistency (in probability), we apply the sufficient condition mentioned in section (b) (the consistency in mean of order two).

(d1) Bias: By applying a property of the sample mean and the information of the statement,

E(θ̂) = E(X̄) + 1/2 = θ − 1/2 + 1/2 = θ → b(θ̂) = E(θ̂) − θ = θ − θ = 0 → lim_{n→∞} b(θ̂) = 0

(d2) Variance: By applying a property of the sample mean and the information of the statement,

Var(θ̂) = Var(X̄ + 1/2) = Var(X̄) = Var(X)/n = 3/(4n) → lim_{n→∞} Var(θ̂) = lim_{n→∞} 3/(4n) = 0

As a conclusion, the mean square error (MSE) tends to zero and hence the proposed estimator θ̂ = X̄ + 1/2 is a consistent—in mean square error and hence in probability—estimator of θ.
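The bias correction can be checked by simulation; a Python sketch (numpy assumed available; θ = 5 and n = 40 are arbitrary illustration values):

```python
# Simulation sketch of Exercise 10 (numpy assumed available): for X
# uniform on [theta-2, theta+1], Xbar is biased by -1/2 for theta while
# Xbar + 1/2 is (approximately) unbiased.
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 5.0, 40, 100_000
xbar = rng.uniform(theta - 2, theta + 1, size=(reps, n)).mean(axis=1)
print(round(np.mean(xbar) - theta, 2), round(np.mean(xbar + 0.5) - theta, 2))
```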

Conclusion: We could prove neither the consistency nor the efficiency of X̄. Nevertheless, the bias has allowed us to build an unbiased, consistent estimator of the parameter. The efficiency of this new estimator could be studied, but it is not required in the statement.

Exercise 11pe-p

A population random quantity X is supposed to follow a geometric distribution. Let X = (X1,...,Xn) be a simple random sample. By applying the factorization theorem below, find a sufficient statistic T(X) = T(X1,...,Xn) for the parameter. Give explanations.

Discussion: The factorization theorem can be applied both to prove that a given statistic is sufficient and to find sufficient statistics. On the other hand, for the distribution involved we know that f(x;η) = η⋅(1−η)^{x−1}, x = 1, 2,...

Likelihood function:

L(X;η) = ∏_{j=1}^n f(Xj;η) = f(X1;η)⋅f(X2;η)⋯f(Xn;η) = η⋅(1−η)^{X1−1}⋅η⋅(1−η)^{X2−1}⋯η⋅(1−η)^{Xn−1}


= η^n⋅(1−η)^{X1−1+X2−1+⋯+Xn−1} = η^n⋅(1−η)^{(∑_{j=1}^n Xj)−n}

Theorem (factorization): T(X) is sufficient for the parameter if and only if the likelihood can be written as L(X;η) = g(T(X);η)⋅h(X).

We must try allocating each term of the likelihood function:

➔ η^n depends only on the parameter, not on the Xj. Then, it would be part of g.

➔ (1−η)^{(∑_{j=1}^n Xj)−n} depends on both the parameter and the data Xj, and these two kinds of information neither are mixed nor can mathematically be separated. Then, it would be part of g, and the only possible sufficient statistic, if the theorem holds, is T = ∑_{j=1}^n Xj.

By considering g(T(X);η) = η^n⋅(1−η)^{−n}⋅(1−η)^{∑_{j=1}^n Xj} and h(X) = 1, the theorem holds and hence the statistic T(X) = ∑_{j=1}^n Xj is sufficient for studying η. The idea behind this kind of statistic is that it "summarizes the important information (about the parameter)" contained in the sample. In fact, the statistic T has essentially the same information as any one-to-one transformation of it, particularly the sample mean, since T(X) = ∑_{j=1}^n Xj = nX̄.

Conclusion: The factorization theorem has been used to find a sufficient statistic (for the parameter). Since the total sum appears, we complete the expression to write the result in terms of the sample mean. Both statistics contain the same information about the parameter of the distribution.
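The factorization says the likelihood depends on the data only through the total sum; a small numerical illustration in Python (numpy assumed available; the two samples are made-up examples with equal size and total):

```python
# Numerical illustration of sufficiency (numpy assumed available): two
# samples with the same size and the same total give exactly the same
# geometric likelihood for every eta, so the likelihood depends on the
# data only through T = sum of the X_j.
import numpy as np

def likelihood(x, eta):
    # f(x; eta) = eta * (1 - eta)^(x - 1), x = 1, 2, ...
    return float(np.prod(eta * (1 - eta) ** (np.asarray(x) - 1)))

a = [1, 2, 6]   # n = 3, sum = 9
b = [3, 3, 3]   # same n, same sum
same = all(np.isclose(likelihood(a, e), likelihood(b, e))
           for e in np.linspace(0.05, 0.95, 19))
print(same)
```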

Exercise 12pe-p (*)

For population variables X and Y, simple random samples of sizes nX and nY are taken. Calculate the mean square error of the following estimators, possibly by using proper statistics (involving them) whose sampling distribution is known.

(A) For any populations: X̄ and X̄ − Ȳ

(B) For Bernoulli populations: η̂ and η̂X − η̂Y

(C) For normal populations: V² and VX²/VY²; s² and sX²/sY²; S² and SX²/SY²

Suppose that the two populations are independent. Study the consistency in mean of order two and then the consistency in probability.

Discussion: In this exercise, the most important estimators are involved. The basic properties of the expectation and the variance allow us to calculate the mean square error. In most cases, the estimators will be completed for a proper quantity (with known sampling distribution) to appear, whose properties are then used.

Although the estimators of the third section can be used for any X and Y, the calculations for normally distributed variables are easier due to the use of additional information—the knowledge about statistics and their sampling distributions. Thus, the results of this section are based on the normality of the variables X and Y. (Some of the quantities are also valid for any variables.)


The mean square errors are found for static situations, but the idea of limit involves dynamic situations. Statistically speaking, we want to study the behaviour of the estimators when the number of data increases—we can imagine a sequence of schemes where more and more data are added to the samples, that is, with the sample sizes always increasing. (From the mathematical point of view, limits must be studied for any possible way in which the sample sizes tend to infinity.)

Fortunately, the limits of the two-variable functions—sequences, really—that appear in this exercise can easily be solved either by decomposing them into two limits of one-variable functions or by bounding the two-variable sequences. That the limits are studied when nX and nY tend to infinity facilitates the calculations (e.g. a constant like −2 is negligible when it appears in a factor).

(A) For any populations

(a1) For the sample mean X̄

It holds that

E(X̄) = E((1/n)∑_{j=1}^n Xj) = (1/n)∑_{j=1}^n E(Xj) = (1/n)⋅n⋅E(X) = E(X) = μ

Var(X̄) = Var((1/n)∑_{j=1}^n Xj) = (1/n²)∑_{j=1}^n Var(Xj) = (1/n²)⋅n⋅Var(X) = Var(X)/n = σ²/n

MSE(X̄) = [E(X̄) − μ]² + Var(X̄) = 0² + σ²/n = σ²/n

Then,
• The estimator X̄ is unbiased for μ, whatever the sample size.
• The estimator X̄ is consistent (in mean of order two and therefore in probability) for μ, since

lim_{n→∞} MSE(X̄) = lim_{n→∞} σ²/n = 0

It is both sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.

(a2) For the difference between the sample means X̄ − Ȳ

By using the previous results,

E(X̄ − Ȳ) = E(X̄) − E(Ȳ) = μX − μY

Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = σX²/nX + σY²/nY

MSE(X̄ − Ȳ) = [E(X̄ − Ȳ) − (μX − μY)]² + Var(X̄ − Ȳ) = σX²/nX + σY²/nY

The mean square error of X̄ − Ȳ is the sum of the mean square errors of X̄ and Ȳ. On the other hand,

• The estimator X̄ − Ȳ is unbiased for μX − μY, whatever the sample sizes.
• The estimator X̄ − Ȳ is consistent (in the mean-square sense and hence in probability) for μX − μY, as

lim_{nX→∞, nY→∞} MSE(X̄ − Ȳ) = lim_{nX→∞, nY→∞} (σX²/nX + σY²/nY) = 0

It is both sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

(B) For Bernoulli populations

(b1) For the sample proportion η̂

Since η̂ is a particular case of the sample mean,

E(η̂) = μ = η

Var(η̂) = σ²/n = η(1−η)/n

MSE(η̂) = [E(η̂) − η]² + Var(η̂) = η(1−η)/n

Then,

• The estimator η̂ is unbiased whatever the sample size.
• It is consistent for η, it being both sufficient and necessary that the sample size tend to infinity.

(b2) For the difference between the sample proportions η̂X − η̂Y

Again, this is a particular case of a difference between sample means,

E(η̂X − η̂Y) = μX − μY = ηX − ηY

Var(η̂X − η̂Y) = σX²/nX + σY²/nY = ηX(1−ηX)/nX + ηY(1−ηY)/nY

MSE(η̂X − η̂Y) = σX²/nX + σY²/nY = ηX(1−ηX)/nX + ηY(1−ηY)/nY

Then,

• The estimator η̂X − η̂Y is unbiased for ηX − ηY, whatever the sample sizes.
• It is also consistent for ηX − ηY, it being both sufficient and necessary that the two sample sizes tend to infinity.

(C) For normal populations

(c1) For the variance of the sample $V^2$

By using $T=\frac{nV^2}{\sigma^2}\sim\chi^2_{n}$ and the properties of the chi-square distribution,

$$\mathrm{E}(V^2)=\mathrm{E}\Big(\frac{\sigma^2}{n}\,\frac{nV^2}{\sigma^2}\Big)=\frac{\sigma^2}{n}\,\mathrm{E}\Big(\frac{nV^2}{\sigma^2}\Big)=\frac{\sigma^2}{n}\,n=\sigma^2$$

$$\mathrm{Var}(V^2)=\mathrm{Var}\Big(\frac{\sigma^2}{n}\,\frac{nV^2}{\sigma^2}\Big)=\frac{\sigma^4}{n^2}\,\mathrm{Var}\Big(\frac{nV^2}{\sigma^2}\Big)=\frac{\sigma^4}{n^2}\,2n=\frac{2}{n}\sigma^4$$

$$\mathrm{MSE}(V^2)=[\mathrm{E}(V^2)-\sigma^2]^2+\mathrm{Var}(V^2)=\frac{2}{n}\sigma^4$$

Then,

• The estimator $V^2$ is unbiased for $\sigma^2$, whatever the sample size.


• The estimator $V^2$ is consistent (in mean of order two and therefore in probability) for $\sigma^2$, since

$$\lim_{n\to\infty}\mathrm{MSE}(V^2)=\lim_{n\to\infty}\frac{2\sigma^4}{n}=0$$

It is sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the variance. (For the expectation, it is easy to find in the literature direct calculations that lead to the same value for any variables—not necessarily normal.)
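The two chi-square facts used above, $\mathrm{E}(V^2)=\sigma^2$ and $\mathrm{Var}(V^2)=2\sigma^4/n$, can be verified by simulation. A minimal sketch (an addition to the text; mu, sigma, n and the replication count are arbitrary choices), in Python:

```python
# Numerical check of E(V^2) = sigma^2 and Var(V^2) = 2*sigma^4/n, where
# V^2 = (1/n) * sum((X_j - mu)^2) uses the known population mean.
# All parameter values are arbitrary illustrative choices.
import random

random.seed(2)
mu, sigma, n, reps = 0.0, 1.5, 25, 20000

vals = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    v2 = sum((x - mu) ** 2 for x in sample) / n
    vals.append(v2)

mean_v2 = sum(vals) / reps
var_v2 = sum((v - mean_v2) ** 2 for v in vals) / reps

print(round(mean_v2, 2))   # theoretical value: sigma^2 = 2.25
print(round(var_v2, 2))    # theoretical value: 2*sigma^4/n = 0.405
```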

(c2) For the quotient between the variances of the samples $V_X^2/V_Y^2$

By using $T=\frac{V_X^2\sigma_Y^2}{V_Y^2\sigma_X^2}\sim F_{n_X,\,n_Y}$ and the properties of the F distribution,

$$\mathrm{E}\Big(\frac{V_X^2}{V_Y^2}\Big)=\mathrm{E}\Big(\frac{\sigma_X^2}{\sigma_Y^2}\,\frac{V_X^2\sigma_Y^2}{V_Y^2\sigma_X^2}\Big)=\frac{\sigma_X^2}{\sigma_Y^2}\,\mathrm{E}\Big(\frac{V_X^2\sigma_Y^2}{V_Y^2\sigma_X^2}\Big)=\frac{\sigma_X^2}{\sigma_Y^2}\,\frac{n_Y}{n_Y-2}=\frac{n_Y}{n_Y-2}\,\frac{\sigma_X^2}{\sigma_Y^2}\qquad(n_Y>2)$$

$$\mathrm{Var}\Big(\frac{V_X^2}{V_Y^2}\Big)=\mathrm{Var}\Big(\frac{\sigma_X^2}{\sigma_Y^2}\,\frac{V_X^2\sigma_Y^2}{V_Y^2\sigma_X^2}\Big)=\Big(\frac{\sigma_X^2}{\sigma_Y^2}\Big)^2\,\mathrm{Var}\Big(\frac{V_X^2\sigma_Y^2}{V_Y^2\sigma_X^2}\Big)=\frac{2n_Y^2(n_X+n_Y-2)}{n_X(n_Y-2)^2(n_Y-4)}\,\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n_Y>4)$$

$$\mathrm{MSE}\Big(\frac{V_X^2}{V_Y^2}\Big)=\Big[\mathrm{E}\Big(\frac{V_X^2}{V_Y^2}\Big)-\frac{\sigma_X^2}{\sigma_Y^2}\Big]^2+\mathrm{Var}\Big(\frac{V_X^2}{V_Y^2}\Big)=\Big[\frac{n_Y}{n_Y-2}\,\frac{\sigma_X^2}{\sigma_Y^2}-\frac{\sigma_X^2}{\sigma_Y^2}\Big]^2+\Big(\frac{\sigma_X^2}{\sigma_Y^2}\Big)^2\frac{2n_Y^2(n_X+n_Y-2)}{n_X(n_Y-2)^2(n_Y-4)}$$
$$=\Big[\Big(\frac{n_Y}{n_Y-2}-1\Big)^2+\frac{2n_Y^2(n_X+n_Y-2)}{n_X(n_Y-2)^2(n_Y-4)}\Big]\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n_Y>4)$$

Then,

• The estimator $V_X^2/V_Y^2$ is biased for $\sigma_X^2/\sigma_Y^2$, but it is asymptotically unbiased since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{E}\Big(\frac{V_X^2}{V_Y^2}\Big)=\lim_{n_Y\to\infty}\Big(\frac{n_Y}{n_Y-2}\,\frac{\sigma_X^2}{\sigma_Y^2}\Big)=\frac{\sigma_X^2}{\sigma_Y^2}\lim_{n_Y\to\infty}\frac{1}{1-\frac{2}{n_Y}}=\frac{\sigma_X^2}{\sigma_Y^2}$$

Mathematically, only $n_Y$ must tend to infinity. Statistically, since the populations can be named and allocated in either order, it is deduced that both sample sizes must tend to infinity. In fact, it is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

• The estimator $V_X^2/V_Y^2$ is consistent (in mean of order two and therefore in probability) for $\sigma_X^2/\sigma_Y^2$, since it is asymptotically unbiased and, dividing numerator and denominator by $n_Xn_Y^3$,

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{Var}\Big(\frac{V_X^2}{V_Y^2}\Big)=\frac{\sigma_X^4}{\sigma_Y^4}\lim_{n_X,\,n_Y\to\infty}\frac{2n_Y^2(n_X+n_Y-2)}{n_X(n_Y-2)^2(n_Y-4)}=\frac{\sigma_X^4}{\sigma_Y^4}\lim_{n_X,\,n_Y\to\infty}\frac{2\Big(\dfrac{1}{n_Y}+\dfrac{1}{n_X}-\dfrac{2}{n_Xn_Y}\Big)}{\Big(1-\dfrac{2}{n_Y}\Big)^2\Big(1-\dfrac{4}{n_Y}\Big)}=0$$

The numerator tends to zero if and only if so do both sample sizes. In short, it is sufficient and necessary that the two sample sizes tend to infinity—this limit has been studied in the mathematical appendix.


In another exercise, this estimator is compared with the other two estimators of the quotient of variances.
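The key F-distribution fact used above, $\mathrm{E}(V_X^2/V_Y^2)=\frac{n_Y}{n_Y-2}\,\frac{\sigma_X^2}{\sigma_Y^2}$, can be illustrated by simulation. A minimal sketch (an addition to the text; the sizes, standard deviations and replication count are arbitrary choices), in Python:

```python
# Monte Carlo check of E(V_X^2 / V_Y^2) = n_Y/(n_Y - 2) * sigma_X^2/sigma_Y^2
# for two independent normal populations with known (zero) means.
# All parameter values are arbitrary illustrative choices.
import random

random.seed(3)
nx, ny, sx, sy, reps = 10, 12, 1.0, 2.0, 40000

total = 0.0
for _ in range(reps):
    vx = sum(random.gauss(0, sx) ** 2 for _ in range(nx)) / nx   # V_X^2
    vy = sum(random.gauss(0, sy) ** 2 for _ in range(ny)) / ny   # V_Y^2
    total += vx / vy
emp = total / reps

theory = ny / (ny - 2) * sx**2 / sy**2
print(round(emp, 2), round(theory, 2))
```

Note the factor $n_Y/(n_Y-2)>1$: the quotient systematically overestimates $\sigma_X^2/\sigma_Y^2$ for small $n_Y$, which is the bias discussed above.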

(c3) For the sample variance $s^2$

By using $T=\frac{ns^2}{\sigma^2}\sim\chi^2_{n-1}$ and the properties of the chi-square distribution,

$$\mathrm{E}(s^2)=\mathrm{E}\Big(\frac{\sigma^2}{n}\,\frac{ns^2}{\sigma^2}\Big)=\frac{\sigma^2}{n}\,\mathrm{E}\Big(\frac{ns^2}{\sigma^2}\Big)=\frac{\sigma^2}{n}(n-1)=\frac{n-1}{n}\sigma^2$$

$$\mathrm{Var}(s^2)=\mathrm{Var}\Big(\frac{\sigma^2}{n}\,\frac{ns^2}{\sigma^2}\Big)=\frac{\sigma^4}{n^2}\,\mathrm{Var}\Big(\frac{ns^2}{\sigma^2}\Big)=\frac{\sigma^4}{n^2}\,2(n-1)=\frac{2(n-1)}{n^2}\sigma^4$$

$$\mathrm{MSE}(s^2)=[\mathrm{E}(s^2)-\sigma^2]^2+\mathrm{Var}(s^2)=\Big[\frac{n-1}{n}\sigma^2-\sigma^2\Big]^2+\frac{2(n-1)}{n^2}\sigma^4=\Big(\frac{2}{n}-\frac{1}{n^2}\Big)\sigma^4$$

Then,

• The estimator $s^2$ is biased but asymptotically unbiased (for $\sigma^2$), since

$$\lim_{n\to\infty}\mathrm{E}(s^2)=\lim_{n\to\infty}\frac{n-1}{n}\sigma^2=\sigma^2\lim_{n\to\infty}\Big(1-\frac{1}{n}\Big)=\sigma^2$$

It is sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.

• The estimator $s^2$ is consistent (in mean of order two and therefore in probability) for $\sigma^2$, since

$$\lim_{n\to\infty}\mathrm{MSE}(s^2)=\lim_{n\to\infty}\Big[\Big(\frac{2}{n}-\frac{1}{n^2}\Big)\sigma^4\Big]=0$$

It is sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the variance. (For the expectation, it is easy to find in the literature direct calculations that lead to the same value for any variables—not necessarily normal.)

(c4) For the quotient between the sample variances $s_X^2/s_Y^2$

By using $T=\frac{S_X^2\sigma_Y^2}{S_Y^2\sigma_X^2}=\frac{n_X(n_Y-1)}{n_Y(n_X-1)}\,\frac{s_X^2\sigma_Y^2}{s_Y^2\sigma_X^2}\sim F_{n_X-1,\,n_Y-1}$ and the properties of the F distribution,

$$\mathrm{E}\Big(\frac{s_X^2}{s_Y^2}\Big)=\frac{n_Y(n_X-1)}{n_X(n_Y-1)}\,\frac{\sigma_X^2}{\sigma_Y^2}\,\mathrm{E}\Big(\frac{n_X(n_Y-1)}{n_Y(n_X-1)}\,\frac{s_X^2\sigma_Y^2}{s_Y^2\sigma_X^2}\Big)=\frac{n_Y(n_X-1)}{n_X(n_Y-1)}\,\frac{\sigma_X^2}{\sigma_Y^2}\,\frac{n_Y-1}{(n_Y-1)-2}=\frac{n_Y(n_X-1)}{n_X(n_Y-3)}\,\frac{\sigma_X^2}{\sigma_Y^2}\qquad(n_Y-1>2)$$

$$\mathrm{Var}\Big(\frac{s_X^2}{s_Y^2}\Big)=\frac{n_Y^2(n_X-1)^2}{n_X^2(n_Y-1)^2}\,\frac{\sigma_X^4}{\sigma_Y^4}\,\mathrm{Var}\Big(\frac{n_X(n_Y-1)}{n_Y(n_X-1)}\,\frac{s_X^2\sigma_Y^2}{s_Y^2\sigma_X^2}\Big)=\frac{n_Y^2(n_X-1)^2}{n_X^2(n_Y-1)^2}\,\frac{\sigma_X^4}{\sigma_Y^4}\,\frac{2(n_Y-1)^2(n_X-1+n_Y-1-2)}{(n_X-1)(n_Y-1-2)^2(n_Y-1-4)}$$
$$=\frac{2n_Y^2(n_X-1)(n_X+n_Y-4)}{n_X^2(n_Y-3)^2(n_Y-5)}\,\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n_Y-1>4)$$


$$\mathrm{MSE}\Big(\frac{s_X^2}{s_Y^2}\Big)=\Big[\mathrm{E}\Big(\frac{s_X^2}{s_Y^2}\Big)-\frac{\sigma_X^2}{\sigma_Y^2}\Big]^2+\mathrm{Var}\Big(\frac{s_X^2}{s_Y^2}\Big)=\Big[\frac{n_Y(n_X-1)}{n_X(n_Y-3)}\,\frac{\sigma_X^2}{\sigma_Y^2}-\frac{\sigma_X^2}{\sigma_Y^2}\Big]^2+\frac{2n_Y^2(n_X-1)(n_X+n_Y-4)}{n_X^2(n_Y-3)^2(n_Y-5)}\,\frac{\sigma_X^4}{\sigma_Y^4}$$
$$=\Big\{\Big[\frac{n_Y(n_X-1)}{n_X(n_Y-3)}-1\Big]^2+\frac{2n_Y^2(n_X-1)(n_X+n_Y-4)}{n_X^2(n_Y-3)^2(n_Y-5)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n_Y-1>4)$$

Then,

• The estimator $s_X^2/s_Y^2$ is biased for $\sigma_X^2/\sigma_Y^2$, but it is asymptotically unbiased since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{E}\Big(\frac{s_X^2}{s_Y^2}\Big)=\lim_{n_X,\,n_Y\to\infty}\Big[\frac{n_Y(n_X-1)}{n_X(n_Y-3)}\,\frac{\sigma_X^2}{\sigma_Y^2}\Big]=\frac{\sigma_X^2}{\sigma_Y^2}\lim_{n_X,\,n_Y\to\infty}\frac{n_Xn_Y-n_Y}{n_Xn_Y-3n_X}=\frac{\sigma_X^2}{\sigma_Y^2}\lim_{n_X,\,n_Y\to\infty}\frac{1-\frac{1}{n_X}}{1-\frac{3}{n_Y}}=\frac{\sigma_X^2}{\sigma_Y^2}$$

It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

• The estimator $s_X^2/s_Y^2$ is consistent (in mean of order two and therefore in probability) for $\sigma_X^2/\sigma_Y^2$, as it is asymptotically unbiased and, dividing numerator and denominator by $n_X^2n_Y^3$,

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{Var}\Big(\frac{s_X^2}{s_Y^2}\Big)=\frac{\sigma_X^4}{\sigma_Y^4}\lim_{n_X,\,n_Y\to\infty}\frac{2n_Y^2(n_X-1)(n_X+n_Y-4)}{n_X^2(n_Y-3)^2(n_Y-5)}=\frac{\sigma_X^4}{\sigma_Y^4}\lim_{n_X,\,n_Y\to\infty}\frac{2\Big(1-\dfrac{1}{n_X}\Big)\Big(\dfrac{1}{n_Y}+\dfrac{1}{n_X}-\dfrac{4}{n_Xn_Y}\Big)}{\Big(1-\dfrac{3}{n_Y}\Big)^2\Big(1-\dfrac{5}{n_Y}\Big)}=0$$

It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the quotient of variances.

(c5) For the sample quasivariance $S^2$

By using $T=\frac{(n-1)S^2}{\sigma^2}\sim\chi^2_{n-1}$ and the properties of the chi-square distribution,

$$\mathrm{E}(S^2)=\mathrm{E}\Big(\frac{\sigma^2}{n-1}\,\frac{(n-1)S^2}{\sigma^2}\Big)=\frac{\sigma^2}{n-1}\,\mathrm{E}\Big(\frac{(n-1)S^2}{\sigma^2}\Big)=\frac{\sigma^2}{n-1}(n-1)=\sigma^2$$

$$\mathrm{Var}(S^2)=\mathrm{Var}\Big(\frac{\sigma^2}{n-1}\,\frac{(n-1)S^2}{\sigma^2}\Big)=\frac{\sigma^4}{(n-1)^2}\,\mathrm{Var}\Big(\frac{(n-1)S^2}{\sigma^2}\Big)=\frac{\sigma^4}{(n-1)^2}\,2(n-1)=\frac{2}{n-1}\sigma^4$$

$$\mathrm{MSE}(S^2)=[\mathrm{E}(S^2)-\sigma^2]^2+\mathrm{Var}(S^2)=\frac{2}{n-1}\sigma^4$$

Then,

• The estimator $S^2$ is unbiased for $\sigma^2$, whatever the sample size.

• The estimator $S^2$ is consistent (in mean of order two and therefore in probability) for $\sigma^2$, since

$$\lim_{n\to\infty}\mathrm{MSE}(S^2)=\lim_{n\to\infty}\frac{2\sigma^4}{n-1}=0$$

It is sufficient and necessary that the sample size tend to infinity—see the mathematical appendix.


In another exercise, this estimator is compared with the other two estimators of the variance. (For the expectation, it is easy to find in the literature direct calculations that lead to the same value for any variables—not necessarily normal.)

(c6) For the quotient between the sample quasivariances $S_X^2/S_Y^2$

By using $T=\frac{S_X^2\sigma_Y^2}{S_Y^2\sigma_X^2}\sim F_{n_X-1,\,n_Y-1}$ and the properties of the F distribution,

$$\mathrm{E}\Big(\frac{S_X^2}{S_Y^2}\Big)=\frac{\sigma_X^2}{\sigma_Y^2}\,\mathrm{E}\Big(\frac{S_X^2\sigma_Y^2}{S_Y^2\sigma_X^2}\Big)=\frac{\sigma_X^2}{\sigma_Y^2}\,\frac{n_Y-1}{(n_Y-1)-2}=\frac{n_Y-1}{n_Y-3}\,\frac{\sigma_X^2}{\sigma_Y^2}\qquad(n_Y-1>2)$$

$$\mathrm{Var}\Big(\frac{S_X^2}{S_Y^2}\Big)=\Big(\frac{\sigma_X^2}{\sigma_Y^2}\Big)^2\,\mathrm{Var}\Big(\frac{S_X^2\sigma_Y^2}{S_Y^2\sigma_X^2}\Big)=\frac{\sigma_X^4}{\sigma_Y^4}\,\frac{2(n_Y-1)^2(n_X-1+n_Y-1-2)}{(n_X-1)(n_Y-1-2)^2(n_Y-1-4)}=\frac{2(n_Y-1)^2(n_X+n_Y-4)}{(n_X-1)(n_Y-3)^2(n_Y-5)}\,\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n_Y-1>4)$$

$$\mathrm{MSE}\Big(\frac{S_X^2}{S_Y^2}\Big)=\Big[\mathrm{E}\Big(\frac{S_X^2}{S_Y^2}\Big)-\frac{\sigma_X^2}{\sigma_Y^2}\Big]^2+\mathrm{Var}\Big(\frac{S_X^2}{S_Y^2}\Big)=\Big[\frac{n_Y-1}{n_Y-3}\,\frac{\sigma_X^2}{\sigma_Y^2}-\frac{\sigma_X^2}{\sigma_Y^2}\Big]^2+\frac{2(n_Y-1)^2(n_X+n_Y-4)}{(n_X-1)(n_Y-3)^2(n_Y-5)}\,\frac{\sigma_X^4}{\sigma_Y^4}$$
$$=\Big\{\Big(\frac{n_Y-1}{n_Y-3}-1\Big)^2+\frac{2(n_Y-1)^2(n_X+n_Y-4)}{(n_X-1)(n_Y-3)^2(n_Y-5)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n_Y-1>4)$$

Then,

• The estimator $S_X^2/S_Y^2$ is biased for $\sigma_X^2/\sigma_Y^2$, but it is asymptotically unbiased since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{E}\Big(\frac{S_X^2}{S_Y^2}\Big)=\lim_{n_X,\,n_Y\to\infty}\Big(\frac{n_Y-1}{n_Y-3}\,\frac{\sigma_X^2}{\sigma_Y^2}\Big)=\frac{\sigma_X^2}{\sigma_Y^2}\lim_{n_Y\to\infty}\frac{1-\frac{1}{n_Y}}{1-\frac{3}{n_Y}}=\frac{\sigma_X^2}{\sigma_Y^2}$$

Mathematically, only $n_Y$ must tend to infinity. Statistically, since the populations can be named and allocated in either order, it is deduced that both sample sizes must tend to infinity. In fact, it is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

• The estimator $S_X^2/S_Y^2$ is consistent (in mean of order two and therefore in probability) for $\sigma_X^2/\sigma_Y^2$, as it is asymptotically unbiased and, dividing numerator and denominator by $n_Xn_Y^3$,

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{Var}\Big(\frac{S_X^2}{S_Y^2}\Big)=\frac{\sigma_X^4}{\sigma_Y^4}\lim_{n_X,\,n_Y\to\infty}\frac{2(n_Y-1)^2(n_X+n_Y-4)}{(n_X-1)(n_Y-3)^2(n_Y-5)}=\frac{\sigma_X^4}{\sigma_Y^4}\lim_{n_X,\,n_Y\to\infty}\frac{2\Big(1-\dfrac{1}{n_Y}\Big)^2\Big(\dfrac{1}{n_Y}+\dfrac{1}{n_X}-\dfrac{4}{n_Xn_Y}\Big)}{\Big(1-\dfrac{1}{n_X}\Big)\Big(1-\dfrac{3}{n_Y}\Big)^2\Big(1-\dfrac{5}{n_Y}\Big)}=0$$

It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

In another exercise, this estimator is compared with the other two estimators of the quotient of variances.

Conclusion: For the most important estimators, the mean square error has been calculated either directly (in a few cases) or by making a proper statistic appear. The consistencies in mean of order two and in probability have been proved. Some limits for functions of two variables arose. These kinds of limits are not trivial in general, as there is an infinite number of ways for the sizes to tend to infinity. Nevertheless, those appearing here could be calculated directly or after some simple algebraic transformations (multiplying and dividing by the proper quantity, as they were limits of sequences of the indeterminate form infinity-over-infinity).

On the other hand, it is worth noticing that there are in general several matters to be considered in selecting among different estimators of the same quantity:

(a) The error can be measured by using a quantity different from the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.

Exercise 13pe-p (*)

In the following situations, compare the mean square errors of the following estimators when simple random samples, taken from normal populations, are considered:

(A) $V^2$, $s^2$, $S^2$

(B) $\dfrac{V_X^2}{V_Y^2}$, $\dfrac{s_X^2}{s_Y^2}$, $\dfrac{S_X^2}{S_Y^2}$ (consider only the case $n_X=n=n_Y$)

In the second section, suppose that the populations are independent.

Discussion: The expressions of the mean square error of these estimators have been calculated in another exercise. Comparing the coefficients is easy in some cases, but sequences may sometimes cross one another, and then the comparisons must be done analytically—by solving equalities and inequalities—or graphically. We plot the sequences (lines between dots are used to facilitate the identification).

The mean square errors were found for static situations, but the idea of limit involves dynamic situations. By using a computer, it is also possible to study—either analytically or graphically—the asymptotic behaviour of the estimators (but this is not a "whole mathematical proof"). It is worth noticing that the formulas and results of this exercise are valid for normal populations (because of the theoretical results on which they are based); in the general case, the expressions for the mean square error of these estimators are more complex. For two populations, there is an infinite number of mathematical ways for the two sample sizes to tend to infinity (see the figure); the case $n_X=n=n_Y$, in the last figure, will be considered.


My notes:


(A) For $V^2$, $s^2$ and $S^2$

The expressions of their mean square errors are:

$$\mathrm{MSE}(V^2)=\frac{2}{n}\sigma^4\qquad\mathrm{MSE}(s^2)=\Big(\frac{2}{n}-\frac{1}{n^2}\Big)\sigma^4\qquad\mathrm{MSE}(S^2)=\frac{2}{n-1}\sigma^4$$

Since $\sigma^4$ appears in all these positive quantities, by looking at the coefficients it is easy to see that, for $n$ larger than two,

$$\mathrm{MSE}(s^2)<\mathrm{MSE}(V^2)<\mathrm{MSE}(S^2)$$

That is, the sequences—indexed by $n$—do not cross one another. We can plot the coefficients (they are also the mean square errors when $\sigma=1$).

# Grid of values for 'n'
n = seq(from=2, to=10, by=1)
# The three sequences of coefficients
coeff1 = 2/n
coeff2 = 2/n - 1/(n^2)
coeff3 = 2/(n-1)
# The plot
allValues = c(coeff1, coeff2, coeff3)
yLim = c(min(allValues), max(allValues))
x11(); par(mfcol=c(1,4))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff3, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')
points(n, coeff3, type='b')

This code generates the following array of figures:

Asymptotically, the three estimators behave similarly, since

$$\frac{2}{n}-\frac{1}{n^2}\approx\frac{2}{n}\approx\frac{2}{n-1}.$$

(B) For $\dfrac{V_X^2}{V_Y^2}$, $\dfrac{s_X^2}{s_Y^2}$ and $\dfrac{S_X^2}{S_Y^2}$

The expressions of their mean square errors, when $n_X=n=n_Y$, are:

$$\mathrm{MSE}\Big(\frac{V_X^2}{V_Y^2}\Big)=\Big\{\Big[\frac{n}{n-2}-1\Big]^2+\frac{2n^2(n+n-2)}{n(n-2)^2(n-4)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}=\Big\{\Big[\frac{n}{n-2}-1\Big]^2+\frac{4n(n-1)}{(n-2)^2(n-4)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n>4)$$

$$\mathrm{MSE}\Big(\frac{s_X^2}{s_Y^2}\Big)=\Big\{\Big[\frac{n(n-1)}{n(n-3)}-1\Big]^2+\frac{2n^2(n-1)(n+n-4)}{n^2(n-3)^2(n-5)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}=\Big\{\Big[\frac{n-1}{n-3}-1\Big]^2+\frac{4(n-1)(n-2)}{(n-3)^2(n-5)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n-1>4)$$

$$\mathrm{MSE}\Big(\frac{S_X^2}{S_Y^2}\Big)=\Big\{\Big[\frac{n-1}{n-3}-1\Big]^2+\frac{2(n-1)^2(n+n-4)}{(n-1)(n-3)^2(n-5)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}=\Big\{\Big[\frac{n-1}{n-3}-1\Big]^2+\frac{4(n-1)(n-2)}{(n-3)^2(n-5)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}\qquad(n-1>4)$$

For equal sample sizes, the mean square error of the last two estimators is the same (but they may behave differently under criteria other than the mean square error, e.g. their expectation). We can plot the coefficients (they are also the mean square errors when $\sigma_X=\sigma_Y$), for $n>5$.


# Grid of values for 'n'
n = seq(from=6, to=15, by=1)
# The three sequences of coefficients
coeff1 = ((n/(n-2))-1)^2 + (4*n*(n-1))/(((n-2)^2)*(n-4))
coeff2 = (((n-1)/(n-3))-1)^2 + (4*(n-1)*(n-2))/(((n-3)^2)*(n-5))
coeff3 = coeff2
# The plot
allValues = c(coeff1, coeff2, coeff3)
yLim = c(min(allValues), max(allValues))
x11(); par(mfcol=c(1,4))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff3, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')
points(n, coeff3, type='b')

This code generates the following array of figures:

This shows that, for normal populations and samples of sizes $n_X=n=n_Y$, it seems that

$$\mathrm{MSE}\Big(\frac{V_X^2}{V_Y^2}\Big)\overset{?}{\le}\mathrm{MSE}\Big(\frac{s_X^2}{s_Y^2}\Big)=\mathrm{MSE}\Big(\frac{S_X^2}{S_Y^2}\Big)$$

and the sequences do not cross one another. Really, a figure is not a mathematical proof, so we do the following calculations:

$$\Big(\frac{n}{n-2}-1\Big)^2+\frac{4n(n-1)}{(n-2)^2(n-4)}\overset{?}{\le}\Big(\frac{n-1}{n-3}-1\Big)^2+\frac{4(n-1)(n-2)}{(n-3)^2(n-5)}$$

$$\frac{4(n-4)+4n(n-1)}{(n-2)^2(n-4)}\overset{?}{\le}\frac{4(n-5)+4(n-1)(n-2)}{(n-3)^2(n-5)}\;\;\leftrightarrow\;\;\frac{n-4+n^2-n}{(n-2)^2(n-4)}\overset{?}{\le}\frac{n^2-2n-3}{(n-3)^2(n-5)}$$

$$\frac{(n-2)(n+2)}{(n-2)^2(n-4)}\overset{?}{\le}\frac{(n-3)(n+1)}{(n-3)^2(n-5)}\;\;\leftrightarrow\;\;(n+2)(n-3)(n-5)\overset{?}{\le}(n+1)(n-2)(n-4)$$

$$n^3-6n^2-n+30\overset{?}{\le}n^3-5n^2+2n+8\;\;\leftrightarrow\;\;22\overset{?}{\le}n(n+3)$$

This inequality is true for $n\ge4$, since it holds for $n=4$ and the right-hand side increases with $n$. Thus, we can guarantee that, for $n>5$,

$$\mathrm{MSE}\Big(\frac{V_X^2}{V_Y^2}\Big)\le\mathrm{MSE}\Big(\frac{s_X^2}{s_Y^2}\Big)=\mathrm{MSE}\Big(\frac{S_X^2}{S_Y^2}\Big)$$

Asymptotically, by using equivalent infinities,

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{MSE}\Big(\frac{V_X^2}{V_Y^2}\Big)=\lim_{n_X,\,n_Y\to\infty}\Big\{\Big[\Big(\frac{n_Y}{n_Y-2}-1\Big)^2+\frac{2n_Y^2(n_X+n_Y-2)}{n_X(n_Y-2)^2(n_Y-4)}\Big]\frac{\sigma_X^4}{\sigma_Y^4}\Big\}$$
$$=\lim_{n_X,\,n_Y\to\infty}\Big\{\Big[\Big(\frac{n_Y}{n_Y}-1\Big)^2+\frac{2n_Y^2(n_X+n_Y)}{n_Xn_Y^2\,n_Y}\Big]\frac{\sigma_X^4}{\sigma_Y^4}\Big\}=\lim_{n_X,\,n_Y\to\infty}\Big[\frac{2(n_X+n_Y)}{n_Xn_Y}\,\frac{\sigma_X^4}{\sigma_Y^4}\Big]=0$$


$$\lim_{n_X,\,n_Y\to\infty}\mathrm{MSE}\Big(\frac{s_X^2}{s_Y^2}\Big)=\lim_{n_X,\,n_Y\to\infty}\Big\{\Big[\frac{n_Y(n_X-1)}{n_X(n_Y-3)}-1\Big]^2+\frac{2n_Y^2(n_X-1)(n_X+n_Y-4)}{n_X^2(n_Y-3)^2(n_Y-5)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}$$
$$=\lim_{n_X,\,n_Y\to\infty}\Big\{\Big[\frac{n_Yn_X}{n_Xn_Y}-1\Big]^2+\frac{2n_Y^2\,n_X(n_X+n_Y)}{n_X^2n_Y^2\,n_Y}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}=\lim_{n_X,\,n_Y\to\infty}\Big[\frac{2(n_X+n_Y)}{n_Xn_Y}\,\frac{\sigma_X^4}{\sigma_Y^4}\Big]=0$$

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{MSE}\Big(\frac{S_X^2}{S_Y^2}\Big)=\lim_{n_X,\,n_Y\to\infty}\Big\{\Big[\frac{n_Y-1}{n_Y-3}-1\Big]^2+\frac{2(n_Y-1)^2(n_X+n_Y-4)}{(n_X-1)(n_Y-3)^2(n_Y-5)}\Big\}\frac{\sigma_X^4}{\sigma_Y^4}$$
$$=\lim_{n_X,\,n_Y\to\infty}\Big\{\Big[\Big(\frac{n_Y}{n_Y}-1\Big)^2+\frac{2n_Y^2(n_X+n_Y)}{n_Xn_Y^2\,n_Y}\Big]\frac{\sigma_X^4}{\sigma_Y^4}\Big\}=\lim_{n_X,\,n_Y\to\infty}\Big[\frac{2(n_X+n_Y)}{n_Xn_Y}\,\frac{\sigma_X^4}{\sigma_Y^4}\Big]=0$$

The three estimators behave similarly, since the quantitative behaviour of their mean square errors is characterized by the same limit, namely:

$$\lim_{n_X,\,n_Y\to\infty}\Big[\frac{2(n_X+n_Y)}{n_Xn_Y}\,\frac{\sigma_X^4}{\sigma_Y^4}\Big]=0.$$

(It is worth noticing that this common asymptotic behaviour arises when the limits are solved by using equivalent infinities—it cannot be seen when the limits are solved in other ways.)

Conclusion: The expressions of the mean square error of these estimators allow us to compare them, to study their consistency and even their rate of convergence. We have proved the following result:

Proposition
(1) For a normal population,
$$\mathrm{MSE}(s^2)<\mathrm{MSE}(V^2)<\mathrm{MSE}(S^2)$$
(2) For two independent normal populations, when $n_X=n=n_Y$,
$$\mathrm{MSE}\Big(\frac{V_X^2}{V_Y^2}\Big)\le\mathrm{MSE}\Big(\frac{s_X^2}{s_Y^2}\Big)=\mathrm{MSE}\Big(\frac{S_X^2}{S_Y^2}\Big)$$

Note: For one population, $V^2$ has a higher error than $s^2$, even though the information about the value of the population mean $\mu$ is used by the former while it is estimated in the other two estimators. For two populations, the information about the value of the two population means $\mu_X$ and $\mu_Y$ is used in the first quotient while they must be estimated in the other two estimators. Either way, the population mean in itself does not play an important role in studying the variance, which is based on relative distances, but any estimation using the same data reduces the amount of information available and the degrees of freedom by one unit.

Again, it is worth noticing that there are in general several matters to be considered in selecting among different estimators of the same quantity:

(a) The error can be measured by using a quantity different from the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account—not all of them require the same number of operations.
(d) We may have some quantities already available.


My notes:


Exercise 14pe-p (*)

For population variables X and Y, simple random samples of sizes $n_X$ and $n_Y$ are taken. Calculate the mean square error of the following estimators (use the results of previous exercises).

(A) For two independent Bernoulli populations: $\dfrac12(\hat{\eta}_X+\hat{\eta}_Y)$ and $\hat{\eta}_p$

(B) For two independent normal populations: $\dfrac12(V_X^2+V_Y^2)$, $\dfrac12(s_X^2+s_Y^2)$, $\dfrac12(S_X^2+S_Y^2)$, $V_p^2$, $s_p^2$ and $S_p^2$

where

$$\hat{\eta}_p=\frac{n_X\hat{\eta}_X+n_Y\hat{\eta}_Y}{n_X+n_Y}\qquad V_p^2=\frac{n_XV_X^2+n_YV_Y^2}{n_X+n_Y}\qquad s_p^2=\frac{n_Xs_X^2+n_Ys_Y^2}{n_X+n_Y}\qquad S_p^2=\frac{(n_X-1)S_X^2+(n_Y-1)S_Y^2}{n_X+n_Y-2}$$

(Similarly for Y.) Try to compare the mean square errors. Study the consistency in mean of order two and then the consistency in probability.

Discussion: The expressions of the mean square error of the basic estimators involved in this exercise have been calculated in another exercise, and they will be used in calculating the mean square errors of the new estimators. The errors are calculated for static situations, but limits are studied in dynamic situations.

Comparing the coefficients is easy in some cases, but sequences can sometimes cross one another, and then the comparisons must be done analytically—by solving equalities and inequalities—or graphically. By using a computer, it is also possible to study—either analytically or graphically—the behaviour of the estimators. The results obtained here are valid for two independent Bernoulli populations and two independent normal populations, respectively. On the other hand, we must find the expression of the error for the new estimators based on semisums:

$$\mathrm{MSE}\Big(\frac12(\hat{\theta}_1+\hat{\theta}_2)\Big)=\Big[\mathrm{E}\Big(\frac12(\hat{\theta}_1+\hat{\theta}_2)\Big)-\theta\Big]^2+\mathrm{Var}\Big(\frac12(\hat{\theta}_1+\hat{\theta}_2)\Big)$$

and, for unbiased estimators,

$$\mathrm{MSE}\Big(\frac12(\hat{\theta}_1+\hat{\theta}_2)\Big)=0+\frac14\big[\mathrm{Var}(\hat{\theta}_1)+\mathrm{Var}(\hat{\theta}_2)\big]$$

(A) For Bernoulli populations: $\frac12(\hat{\eta}_X+\hat{\eta}_Y)$ and $\hat{\eta}_p$

(a1) For the semisum of the sample proportions $\frac12(\hat{\eta}_X+\hat{\eta}_Y)$

By using previous results and that $\mu=\eta$ and $\sigma^2=\eta(1-\eta)$,

$$\mathrm{E}\Big(\frac12(\hat{\eta}_X+\hat{\eta}_Y)\Big)=\frac12\big[\mathrm{E}(\hat{\eta}_X)+\mathrm{E}(\hat{\eta}_Y)\big]=\frac12(\eta_X+\eta_Y)=\eta$$


$$\mathrm{Var}\Big(\frac12(\hat{\eta}_X+\hat{\eta}_Y)\Big)=\frac{1}{2^2}\big[\mathrm{Var}(\hat{\eta}_X)+\mathrm{Var}(\hat{\eta}_Y)\big]=\frac14\Big(\frac{\eta_X(1-\eta_X)}{n_X}+\frac{\eta_Y(1-\eta_Y)}{n_Y}\Big)=\frac14\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\eta(1-\eta)$$

$$\mathrm{MSE}\Big(\frac12(\hat{\eta}_X+\hat{\eta}_Y)\Big)=\Big[\frac12(\eta_X+\eta_Y)-\eta\Big]^2+\frac14\Big(\frac{\eta_X(1-\eta_X)}{n_X}+\frac{\eta_Y(1-\eta_Y)}{n_Y}\Big)=\frac14\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\eta(1-\eta)$$

Then,

• The estimator $\frac12(\hat{\eta}_X+\hat{\eta}_Y)$ is unbiased for $\eta$, whatever the sample sizes.

• The estimator $\frac12(\hat{\eta}_X+\hat{\eta}_Y)$ is consistent (in the mean-square sense and therefore in probability) for $\eta$, since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{MSE}\Big(\frac12(\hat{\eta}_X+\hat{\eta}_Y)\Big)=\lim_{n_X,\,n_Y\to\infty}\Big[\frac14\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\eta(1-\eta)\Big]=0$$

It is sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.

(a2) For the pooled sample proportion $\hat{\eta}_p$

Firstly, we write $\hat{\eta}_p=\frac{1}{n_X+n_Y}(n_X\hat{\eta}_X+n_Y\hat{\eta}_Y)$. Now, by using previous results,

$$\mathrm{E}(\hat{\eta}_p)=\frac{1}{n_X+n_Y}\big[n_X\mathrm{E}(\hat{\eta}_X)+n_Y\mathrm{E}(\hat{\eta}_Y)\big]=\frac{n_X\eta_X+n_Y\eta_Y}{n_X+n_Y}=\eta$$

$$\mathrm{Var}(\hat{\eta}_p)=\frac{1}{(n_X+n_Y)^2}\big[n_X^2\mathrm{Var}(\hat{\eta}_X)+n_Y^2\mathrm{Var}(\hat{\eta}_Y)\big]=\frac{n_X\eta_X(1-\eta_X)+n_Y\eta_Y(1-\eta_Y)}{(n_X+n_Y)^2}=\frac{1}{n_X+n_Y}\eta(1-\eta)$$

$$\mathrm{MSE}(\hat{\eta}_p)=\Big(\frac{n_X\eta_X+n_Y\eta_Y}{n_X+n_Y}-\eta\Big)^2+\frac{n_X\eta_X(1-\eta_X)+n_Y\eta_Y(1-\eta_Y)}{(n_X+n_Y)^2}=\frac{1}{n_X+n_Y}\eta(1-\eta)$$

Then,

• The estimator $\hat{\eta}_p$ is unbiased for $\eta$, whatever the sample sizes.

• The estimator $\hat{\eta}_p$ is consistent (in mean of order two and therefore in probability) for $\eta$, since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{MSE}(\hat{\eta}_p)=\lim_{n_X,\,n_Y\to\infty}\frac{\eta(1-\eta)}{n_X+n_Y}=0$$

If this mean square error is compared with those based on single samples, we can see that the new denominator is the sum of both sample sizes. Again, it is worth noticing that it is sufficient and necessary that at least one sample size tend to infinity, not necessarily both: the denominator still tends to infinity. The interpretation of this fact is that, in estimating, one sample can do "the whole work."

(a3) Comparison of $\frac12(\hat{\eta}_X+\hat{\eta}_Y)$ and $\hat{\eta}_p$

Case $n_X=n=n_Y$:

$$\mathrm{MSE}\Big(\frac12(\hat{\eta}_X+\hat{\eta}_Y)\Big)=\frac{\eta(1-\eta)}{2n}=\mathrm{MSE}(\hat{\eta}_p)$$

In fact, by looking at the expressions of the estimators themselves, $\hat{\eta}_p=\frac12(\hat{\eta}_X+\hat{\eta}_Y)$ in this case.

General case

The expressions of their mean square errors are (the sample proportion is unbiased):

$$\mathrm{MSE}\Big(\frac12(\hat{\eta}_X+\hat{\eta}_Y)\Big)=\frac14\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\eta(1-\eta)\qquad\mathrm{MSE}(\hat{\eta}_p)=\frac{1}{n_X+n_Y}\eta(1-\eta)$$

Then

$$\frac14\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\ge\frac{1}{n_X+n_Y}\;\leftrightarrow\;(n_X+n_Y)\Big(\frac{n_X+n_Y}{n_Xn_Y}\Big)\ge4\;\leftrightarrow\;n_X^2+n_Y^2+2n_Xn_Y\ge4n_Xn_Y\;\leftrightarrow\;(n_X-n_Y)^2\ge0$$

and the last inequality always holds. Then, the pooled estimator is always better than or equal to the semisum of the sample proportions. Both estimators have the same mean square error—though their behaviour may differ under criteria other than the mean square error—only when $n_X=n_Y$. Thus, $(n_X-n_Y)^2$ can be seen as a measure of the convenience of using the pooled sample proportion, since it shows how different the two errors are. The inequality also shows a symmetric situation, in the sense that it does not matter which sample size is bigger: the measure depends on the difference. We have proved the following result:

Proposition
For two independent Bernoulli populations with the same parameter, the pooled sample proportion has a smaller or equal mean square error than the semisum of the sample proportions. Besides, both are equivalent only when the sample sizes are equal.

We can plot the coefficients (they are also the mean square errors when $\eta(1-\eta)=1$) for a sequence of sample sizes, indexed by $k$, such that $n_Y(k)=2n_X(k)$, for example (but this is only one possible way for the sample sizes to tend to infinity):

# Grid of values for 'n'
c = 2
n = seq(from=2, to=10, by=1)
# The sequences of coefficients
coeff1 = (1 + 1/c)/(4*n)
coeff2 = 1/((1+c)*n)
# The plot
allValues = c(coeff1, coeff2)
yLim = c(min(allValues), max(allValues))
x11(); par(mfcol=c(1,3))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='b')

This code generates the following array of figures:

The reader can repeat this figure by using values closer to and farther from 1 than c(k) = 2.

(B) For normal populations

(b1) For the semisum of the variances of the samples $\frac12(V_X^2+V_Y^2)$

By using previous results,


$$\mathrm{E}\Big(\frac12(V_X^2+V_Y^2)\Big)=\frac12\big[\mathrm{E}(V_X^2)+\mathrm{E}(V_Y^2)\big]=\frac12(\sigma_X^2+\sigma_Y^2)=\sigma^2$$

$$\mathrm{Var}\Big(\frac12(V_X^2+V_Y^2)\Big)=\frac{1}{2^2}\big[\mathrm{Var}(V_X^2)+\mathrm{Var}(V_Y^2)\big]=\frac12\Big(\frac{\sigma_X^4}{n_X}+\frac{\sigma_Y^4}{n_Y}\Big)=\frac12\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\sigma^4$$

$$\mathrm{MSE}\Big(\frac12(V_X^2+V_Y^2)\Big)=\Big[\frac12(\sigma_X^2+\sigma_Y^2)-\sigma^2\Big]^2+\frac12\Big(\frac{\sigma_X^4}{n_X}+\frac{\sigma_Y^4}{n_Y}\Big)=\frac12\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\sigma^4$$

Then,

• The estimator $\frac12(V_X^2+V_Y^2)$ is unbiased for $\sigma^2$, whatever the sample sizes.

• The estimator $\frac12(V_X^2+V_Y^2)$ is consistent (in the mean-square sense and therefore in probability) for $\sigma^2$, since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{MSE}\Big(\frac12(V_X^2+V_Y^2)\Big)=\lim_{n_X,\,n_Y\to\infty}\frac12\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\sigma^4=0$$

It is sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.

(b2) For the semisum of the sample variances $\frac12(s_X^2+s_Y^2)$

By using previous results,

$$\mathrm{E}\Big(\frac12(s_X^2+s_Y^2)\Big)=\frac12\big[\mathrm{E}(s_X^2)+\mathrm{E}(s_Y^2)\big]=\frac12\Big(\frac{n_X-1}{n_X}\sigma_X^2+\frac{n_Y-1}{n_Y}\sigma_Y^2\Big)=\frac12\Big(\frac{n_X-1}{n_X}+\frac{n_Y-1}{n_Y}\Big)\sigma^2$$

$$\mathrm{Var}\Big(\frac12(s_X^2+s_Y^2)\Big)=\frac{1}{2^2}\big[\mathrm{Var}(s_X^2)+\mathrm{Var}(s_Y^2)\big]=\frac12\Big[\frac{n_X-1}{n_X^2}\sigma_X^4+\frac{n_Y-1}{n_Y^2}\sigma_Y^4\Big]=\frac12\Big[\frac{n_X-1}{n_X^2}+\frac{n_Y-1}{n_Y^2}\Big]\sigma^4$$

$$\mathrm{MSE}\Big(\frac12(s_X^2+s_Y^2)\Big)=\Big[\frac12\Big(\frac{n_X-1}{n_X}+\frac{n_Y-1}{n_Y}\Big)\sigma^2-\sigma^2\Big]^2+\frac12\Big[\frac{n_X-1}{n_X^2}+\frac{n_Y-1}{n_Y^2}\Big]\sigma^4$$
$$=\Big\{\Big[-\frac12\Big(\frac{1}{n_X}+\frac{1}{n_Y}\Big)\Big]^2+2\,\frac{n_Xn_Y^2-n_Y^2+n_X^2n_Y-n_X^2}{4n_X^2n_Y^2}\Big\}\sigma^4=\Big[\frac{(n_X+n_Y)^2}{4n_X^2n_Y^2}+2\,\frac{n_Xn_Y^2-n_Y^2+n_X^2n_Y-n_X^2}{4n_X^2n_Y^2}\Big]\sigma^4$$
$$=\Big[\frac{2n_Xn_Y+2n_Xn_Y^2+2n_X^2n_Y-n_X^2-n_Y^2}{4n_X^2n_Y^2}\Big]\sigma^4=\Big[\frac{2n_Xn_Y(n_X+n_Y)-(n_X-n_Y)^2}{4n_X^2n_Y^2}\Big]\sigma^4$$
$$=\frac12\Big[\frac{n_X+n_Y}{n_Xn_Y}-\frac{(n_X-n_Y)^2}{2n_X^2n_Y^2}\Big]\sigma^4=\frac12\Big[\frac{1}{n_X}+\frac{1}{n_Y}-\frac{(n_X-n_Y)^2}{2n_X^2n_Y^2}\Big]\sigma^4$$

Then,

• The estimator $\frac12(s_X^2+s_Y^2)$ is biased but asymptotically unbiased for $\sigma^2$, since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{E}\Big(\frac12(s_X^2+s_Y^2)\Big)=\sigma^2\lim_{n_X,\,n_Y\to\infty}\frac12\Big(\frac{n_X-1}{n_X}+\frac{n_Y-1}{n_Y}\Big)=\sigma^2\,\frac12(1+1)=\sigma^2$$

It is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

• The estimator $\frac12(s_X^2+s_Y^2)$ is consistent (in the mean-square sense and therefore in probability) for $\sigma^2$, because it is asymptotically unbiased and

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{Var}\Big(\frac12(s_X^2+s_Y^2)\Big)=\sigma^4\lim_{n_X,\,n_Y\to\infty}\frac12\Big(\frac{n_X-1}{n_X^2}+\frac{n_Y-1}{n_Y^2}\Big)=0$$

Again, it is sufficient and necessary that the two sample sizes tend to infinity—see the mathematical appendix.

(b3) For the semisum of the sample quasivariances $\frac12(S_X^2+S_Y^2)$

By using previous results,

$$\mathrm{E}\Big(\frac12(S_X^2+S_Y^2)\Big)=\frac12\big[\mathrm{E}(S_X^2)+\mathrm{E}(S_Y^2)\big]=\frac12(\sigma_X^2+\sigma_Y^2)=\sigma^2$$

$$\mathrm{Var}\Big(\frac12(S_X^2+S_Y^2)\Big)=\frac{1}{2^2}\big[\mathrm{Var}(S_X^2)+\mathrm{Var}(S_Y^2)\big]=\frac12\Big(\frac{\sigma_X^4}{n_X-1}+\frac{\sigma_Y^4}{n_Y-1}\Big)=\frac12\Big(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\Big)\sigma^4$$

$$\mathrm{MSE}\Big(\frac12(S_X^2+S_Y^2)\Big)=\Big[\frac12(\sigma_X^2+\sigma_Y^2)-\sigma^2\Big]^2+\frac12\Big(\frac{\sigma_X^4}{n_X-1}+\frac{\sigma_Y^4}{n_Y-1}\Big)=\frac12\Big(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\Big)\sigma^4$$

Then,

• The estimator $\frac12(S_X^2+S_Y^2)$ is unbiased for $\sigma^2$, whatever the sample sizes.

• The estimator $\frac12(S_X^2+S_Y^2)$ is consistent (in the mean-square sense and therefore in probability) for $\sigma^2$, since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{MSE}\Big(\frac12(S_X^2+S_Y^2)\Big)=\lim_{n_X,\,n_Y\to\infty}\frac12\Big(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\Big)\sigma^4=0$$

It is sufficient and necessary that both sample sizes tend to infinity—see the mathematical appendix.

(b4) For the pooled variance of the samples $V_p^2$

We can write $V_p^2=\dfrac{n_XV_X^2+n_YV_Y^2}{n_X+n_Y}=\dfrac{1}{n_X+n_Y}(n_XV_X^2+n_YV_Y^2)$. By using previous results,

$$\mathrm{E}(V_p^2)=\frac{n_X\mathrm{E}(V_X^2)+n_Y\mathrm{E}(V_Y^2)}{n_X+n_Y}=\frac{n_X\sigma_X^2+n_Y\sigma_Y^2}{n_X+n_Y}=\sigma^2$$

$$\mathrm{Var}(V_p^2)=\frac{n_X^2\mathrm{Var}(V_X^2)+n_Y^2\mathrm{Var}(V_Y^2)}{(n_X+n_Y)^2}=2\,\frac{n_X\sigma_X^4+n_Y\sigma_Y^4}{(n_X+n_Y)^2}=\frac{2}{n_X+n_Y}\sigma^4$$

$$\mathrm{MSE}(V_p^2)=\Big(\frac{n_X\sigma_X^2+n_Y\sigma_Y^2}{n_X+n_Y}-\sigma^2\Big)^2+2\,\frac{n_X\sigma_X^4+n_Y\sigma_Y^4}{(n_X+n_Y)^2}=\frac{2}{n_X+n_Y}\sigma^4$$

Then,

• The estimator $V_p^2$ is unbiased for $\sigma^2$, whatever the sample sizes.

• The estimator $V_p^2$ is consistent (in mean of order two and therefore in probability) for $\sigma^2$, since

$$\lim_{n_X,\,n_Y\to\infty}\mathrm{MSE}(V_p^2)=\sigma^4\lim_{n_X,\,n_Y\to\infty}\frac{2}{n_X+n_Y}=0$$

It is worth noticing that it is sufficient and necessary that at least one sample size tend to infinity, not necessarily both: the denominator still tends to infinity. The interpretation of this fact is that, in estimating, one sample can do "the whole work."

(b5) For the pooled sample variance $s_p^2$

We can write $s_p^2=\dfrac{n_Xs_X^2+n_Ys_Y^2}{n_X+n_Y}=\dfrac{1}{n_X+n_Y}(n_Xs_X^2+n_Ys_Y^2)$. By using previous results,

$$\mathrm{E}(s_p^2)=\frac{n_X\mathrm{E}(s_X^2)+n_Y\mathrm{E}(s_Y^2)}{n_X+n_Y}=\frac{(n_X-1)\sigma_X^2+(n_Y-1)\sigma_Y^2}{n_X+n_Y}=\frac{n_X+n_Y-2}{n_X+n_Y}\sigma^2$$

$$\mathrm{Var}(s_p^2)=\frac{n_X^2\mathrm{Var}(s_X^2)+n_Y^2\mathrm{Var}(s_Y^2)}{(n_X+n_Y)^2}=2\,\frac{(n_X-1)\sigma_X^4+(n_Y-1)\sigma_Y^4}{(n_X+n_Y)^2}=2\,\frac{n_X+n_Y-2}{(n_X+n_Y)^2}\sigma^4$$

$$\mathrm{MSE}(s_p^2)=\Big(\frac{n_X+n_Y-2}{n_X+n_Y}\sigma^2-\sigma^2\Big)^2+2\,\frac{n_X+n_Y-2}{(n_X+n_Y)^2}\sigma^4=\Big[\frac{(n_X+n_Y-2-n_X-n_Y)^2}{(n_X+n_Y)^2}+2\,\frac{n_X+n_Y-2}{(n_X+n_Y)^2}\Big]\sigma^4=\frac{2}{n_X+n_Y}\sigma^4$$

Then,

• The estimator $s_p^2$ is biased for $\sigma^2$, but asymptotically unbiased:

$$\lim_{n_X\to\infty,\,n_Y\to\infty} \frac{n_X+n_Y-2}{n_X+n_Y}\,\sigma^2 = \sigma^2$$

(The calculation above for the mean suggests that a $-2$ in the denominator of the definition would provide an unbiased estimator; see the estimator in the following section.)

• The estimator $s_p^2$ is consistent (in mean of order two and therefore in probability) for $\sigma^2$, since

$$\lim_{n_X\to\infty,\,n_Y\to\infty} MSE(s_p^2) = \sigma^4 \lim_{n_X\to\infty,\,n_Y\to\infty} \frac{2}{n_X+n_Y} = 0$$

It is worth noticing that it is sufficient and necessary that at least one sample size tends to infinity, not necessarily both. In that case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do "the whole work."

(b6) For the (bias-corrected) pooled sample variance $S_p^2$

We can write $S_p^2 = \frac{(n_X-1)S_X^2 + (n_Y-1)S_Y^2}{n_X+n_Y-2} = \frac{1}{n_X+n_Y-2}\left[(n_X-1)S_X^2 + (n_Y-1)S_Y^2\right]$. By using previous results,

$$E(S_p^2) = \frac{(n_X-1)E(S_X^2) + (n_Y-1)E(S_Y^2)}{n_X+n_Y-2} = \frac{(n_X-1)\sigma_X^2 + (n_Y-1)\sigma_Y^2}{n_X+n_Y-2} = \sigma^2$$

$$Var(S_p^2) = \frac{(n_X-1)^2\,Var(S_X^2) + (n_Y-1)^2\,Var(S_Y^2)}{(n_X+n_Y-2)^2} = 2\,\frac{(n_X-1)\sigma_X^4 + (n_Y-1)\sigma_Y^4}{(n_X+n_Y-2)^2} = \frac{2}{n_X+n_Y-2}\,\sigma^4$$

$$MSE(S_p^2) = \left[\frac{(n_X-1)\sigma_X^2 + (n_Y-1)\sigma_Y^2}{n_X+n_Y-2} - \sigma^2\right]^2 + 2\,\frac{(n_X-1)\sigma_X^4 + (n_Y-1)\sigma_Y^4}{(n_X+n_Y-2)^2} = \frac{2}{n_X+n_Y-2}\,\sigma^4$$

Then,

• The estimator $S_p^2$ is unbiased for $\sigma^2$, whatever the sample sizes.


• The estimator $S_p^2$ is consistent (in mean of order two and therefore in probability) for $\sigma^2$, since

$$\lim_{n_X\to\infty,\,n_Y\to\infty} MSE(S_p^2) = \lim_{n_X\to\infty,\,n_Y\to\infty} \frac{2\sigma^4}{n_X+n_Y-2} = 0$$

It is worth noticing that it is sufficient and necessary that at least one sample size tends to infinity, not necessarily both. In that case, the denominator tends to infinity. The interpretation of this fact is that, in estimating, one sample can do "the whole work."

(b7) Comparison of $\frac{1}{2}(V_X^2+V_Y^2)$, $\frac{1}{2}(s_X^2+s_Y^2)$, $\frac{1}{2}(S_X^2+S_Y^2)$, $V_p^2$, $s_p^2$ and $S_p^2$

Case $n_X = n = n_Y$:

$$MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = \frac{1}{2}\left(2\,\frac{1}{n}\right)\sigma^4 = \frac{1}{n}\,\sigma^4 \qquad MSE\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \frac{1}{2}\left(2\,\frac{1}{n}-0\right)\sigma^4 = \frac{1}{n}\,\sigma^4$$

$$MSE\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) = \frac{1}{2}\left(2\,\frac{1}{n-1}\right)\sigma^4 = \frac{1}{n-1}\,\sigma^4$$

$$MSE(V_p^2) = \frac{2}{2n}\,\sigma^4 = \frac{1}{n}\,\sigma^4 \qquad MSE(s_p^2) = \frac{2}{2n}\,\sigma^4 = \frac{1}{n}\,\sigma^4 \qquad MSE(S_p^2) = \frac{2}{2n-2}\,\sigma^4 = \frac{1}{n-1}\,\sigma^4$$

Since $\sigma^4$ appears in all these positive quantities, by looking at the coefficients it is easy to see the relation

$$MSE\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = MSE(V_p^2) = MSE(s_p^2) < MSE(S_p^2) = MSE\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right)$$

(For the individual estimators, the order $MSE(s^2) < MSE(V^2) < MSE(S^2)$ was obtained in another exercise.) This relation has been obtained for the case $n_X = n = n_Y$ and (independent) normal populations. We can plot the coefficients (they are also the mean square errors when $\sigma = 1$).

# Grid of values for 'n'
n = seq(from=10, to=20, by=1)
# The six sequences of coefficients (only three distinct values appear)
coeff1 = 1/n
coeff2 = coeff1
coeff3 = 1/(n-1)
coeff4 = coeff1
coeff5 = coeff1
coeff6 = coeff3
# The plot
allValues = c(coeff1, coeff2, coeff3, coeff4, coeff5, coeff6)
yLim = c(min(allValues), max(allValues))
x11(); par(mfcol=c(1,7))
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 1', type='l')
plot(n, coeff2, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 2', type='l')
plot(n, coeff3, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 3', type='b')
plot(n, coeff4, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 4', type='l')
plot(n, coeff5, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 5', type='l')
plot(n, coeff6, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='Coefficients 6', type='b')
plot(n, coeff1, xlim=c(min(n),max(n)), ylim=yLim, xlab=' ', ylab=' ', main='All coefficients', type='l')
points(n, coeff2, type='l')
points(n, coeff3, type='b')
points(n, coeff4, type='l')
points(n, coeff5, type='l')
points(n, coeff6, type='b')

This code generates the following array of figures:


By using this code, it is also possible to study, either analytically or graphically, the asymptotic behaviour of these estimators (though only with simulated data from some particular distributions for X, which would not be a "whole mathematical proof"). It is worth noticing that the formulas obtained in this exercise are valid for normal populations (because of the theoretical results on which they are based). In the general case, the expressions for the mean square error of these estimators are more complex.

General case

The expressions of their mean square errors are:

$$MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\sigma^4$$

$$MSE\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = \frac{1}{2}\left[\frac{1}{n_X}+\frac{1}{n_Y}-\frac{(n_X-n_Y)^2}{2\,n_X^2\,n_Y^2}\right]\sigma^4$$

$$MSE\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right) = \frac{1}{2}\left(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\right)\sigma^4$$

$$MSE(V_p^2) = \frac{2}{n_X+n_Y}\,\sigma^4 \qquad MSE(s_p^2) = \frac{2}{n_X+n_Y}\,\sigma^4 \qquad MSE(S_p^2) = \frac{2}{n_X+n_Y-2}\,\sigma^4$$

We have simplified the expressions as much as possible, and now a general comparison can be tackled by doing some pairwise comparisons. Firstly, by looking at the coefficients,

$$MSE\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) \le MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) < MSE\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right)$$

and the equality is reached only when $n_X = n = n_Y$. On the other hand,

$$MSE(V_p^2) = MSE(s_p^2) < MSE(S_p^2)$$

Now, we would like to allocate $V_p^2$, $s_p^2$ and $S_p^2$ in the first chain. To compare $V_p^2$ and $s_p^2$ with $\frac{1}{2}(V_X^2+V_Y^2)$,

$$\frac{2}{n_X+n_Y} \le \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right) \;\leftrightarrow\; 4n_Xn_Y \le (n_X+n_Y)^2 \;\leftrightarrow\; 4n_Xn_Y \le n_X^2+n_Y^2+2n_Xn_Y \;\leftrightarrow\; 0 \le (n_X-n_Y)^2$$


That is,

$$MSE(V_p^2) = MSE(s_p^2) \le MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right)$$

and the equality is attained only when $n_X = n = n_Y$. To compare $S_p^2$ with $\frac{1}{2}(V_X^2+V_Y^2)$,

$$\frac{2}{n_X+n_Y-2} \le \frac{1}{2}\left(\frac{1}{n_X}+\frac{1}{n_Y}\right) \;\leftrightarrow\; 4n_Xn_Y \le (n_X+n_Y)(n_X+n_Y-2) \;\leftrightarrow\; 2(n_X+n_Y) \le (n_X-n_Y)^2$$

That is,

$$\begin{cases} MSE(S_p^2) \le MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) & \text{if } 2(n_X+n_Y) \le (n_X-n_Y)^2 \\[4pt] MSE(S_p^2) \ge MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) & \text{if } 2(n_X+n_Y) \ge (n_X-n_Y)^2 \end{cases}$$

Intuitively, in the region around the bisector line the difference of the sample sizes is small, and therefore the pooled sample variance is worse; on the other hand, in the complementary region the square of the difference is bigger than twice the sum of the sizes, and therefore the pooled sample variance is better. The frontier seems to be parabolic. Some work can be done to find the frontier determined by the equality and the two regions on both sides; this is done in the mathematical appendix. Now, we write some brute-force lines for the computer to plot the points of the frontier:

N = 100
vectorNx = vector(mode="numeric", length=0)
vectorNy = vector(mode="numeric", length=0)
# Brute-force search for the integer points on the frontier 2*(nx+ny) == (nx-ny)^2
for (nx in 1:N) {
  for (ny in 1:N) {
    if (2*(nx+ny) == (nx-ny)^2) {
      vectorNx = c(vectorNx, nx)
      vectorNy = c(vectorNy, ny)
    }
  }
}
plot(vectorNx, vectorNy, xlim=c(0,N+1), ylim=c(0,N+1), xlab='nx', ylab='ny',
     main='Frontier of the region', type='p')

To compare $S_p^2$ with $\frac{1}{2}(S_X^2+S_Y^2)$,

$$\frac{2}{n_X+n_Y-2} \le \frac{1}{2}\left(\frac{1}{n_X-1}+\frac{1}{n_Y-1}\right) \;\leftrightarrow\; 4(n_X-1)(n_Y-1) \le (n_X+n_Y-2)^2$$
$$\leftrightarrow\; 4n_Xn_Y-4n_X-4n_Y+4 \le n_X^2+n_Y^2+2n_Xn_Y+4-4n_X-4n_Y \;\leftrightarrow\; 0 \le (n_X-n_Y)^2$$

That is,

$$MSE(S_p^2) \le MSE\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right)$$

and the equality is attained only if the sample sizes are the same. We can summarize all the results of this section in the following statement:


Proposition

For two independent normal populations, when $n_X = n = n_Y$,

(a) $MSE\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) = MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) = MSE(V_p^2) = MSE(s_p^2) < MSE(S_p^2) = MSE\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right)$

In the general case, when the sample sizes can be different,

(b) $MSE\left(\tfrac{1}{2}(s_X^2+s_Y^2)\right) \le MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right) < MSE\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right)$

(c) $MSE(V_p^2) = MSE(s_p^2) < MSE(S_p^2)$

(d) $MSE(V_p^2) = MSE(s_p^2) \le MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right)$

(e) $MSE(S_p^2) \le MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right)$ if $2(n_X+n_Y) \le (n_X-n_Y)^2$, and $MSE(S_p^2) \ge MSE\left(\tfrac{1}{2}(V_X^2+V_Y^2)\right)$ if $2(n_X+n_Y) \ge (n_X-n_Y)^2$

(f) $MSE(S_p^2) \le MSE\left(\tfrac{1}{2}(S_X^2+S_Y^2)\right)$

In (b), (d) and (f), the equality is attained when $n_X = n = n_Y$.
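The relations (b)–(f) can be verified numerically from the general-case MSE formulas. A quick Python sketch (the grid of sizes from 2 to 30 and $\sigma = 1$ are arbitrary choices; the document's own code is in R):

```python
# Check relations (b)-(f) of the proposition on a grid of sample sizes (sigma = 1).
def mse_semisum_v(nx, ny):   # (1/2)(V_X^2 + V_Y^2)
    return 0.5 * (1.0 / nx + 1.0 / ny)

def mse_semisum_s(nx, ny):   # (1/2)(s_X^2 + s_Y^2)
    return 0.5 * (1.0 / nx + 1.0 / ny - (nx - ny) ** 2 / (2.0 * nx ** 2 * ny ** 2))

def mse_semisum_S(nx, ny):   # (1/2)(S_X^2 + S_Y^2)
    return 0.5 * (1.0 / (nx - 1) + 1.0 / (ny - 1))

def mse_vp(nx, ny):          # V_p^2 (and s_p^2, which has the same MSE)
    return 2.0 / (nx + ny)

def mse_Sp(nx, ny):          # S_p^2
    return 2.0 / (nx + ny - 2)

ok = True
for nx in range(2, 31):
    for ny in range(2, 31):
        a = mse_semisum_s(nx, ny)
        b = mse_semisum_v(nx, ny)
        c = mse_semisum_S(nx, ny)
        vp = mse_vp(nx, ny)
        Sp = mse_Sp(nx, ny)
        ok &= a <= b < c                 # relation (b)
        ok &= vp < Sp                    # relation (c)
        ok &= vp <= b                    # relation (d)
        if 2 * (nx + ny) <= (nx - ny) ** 2:
            ok &= Sp <= b + 1e-12        # relation (e), first case
        else:
            ok &= Sp >= b - 1e-12        # relation (e), second case
        ok &= Sp <= c                    # relation (f)
```

After the loop, `ok` should remain `True` for the whole grid.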

Note: I have tried to compare $V_p^2$, $s_p^2$ and $S_p^2$ with $\frac{1}{2}(s_X^2+s_Y^2)$, but I have not managed to solve the inequalities. On the other hand, these relations show that, for two independent normal populations, there exist estimators with smaller mean square error than the pooled sample variance $S_p^2$. Nevertheless, there are other criteria different from the mean square error, and, additionally, the pooled sample variance has also some advantages (see the advanced theory at the end).

Conclusion: For some pooled estimators, the mean square errors have been calculated either directly or by making a proper statistic appear. The consistencies in mean of order two and in probability have been proved. By using theoretical expressions for the mean square error, the behaviour of the pooled estimators of the proportion (Bernoulli populations) and of the variance (normal populations) has been compared with "natural" estimators consisting of the semisum of the individual estimators for each population.

Once more, it is worth noticing that there are in general several matters to be considered in selecting among different estimators of the same quantity:

(a) The error can be measured by using a quantity different from the mean square error.
(b) For large sample sizes, the differences provided by the formulas above may be negligible.
(c) The computational or manual effort in calculating the quantities must also be taken into account; not all of them require the same number of operations.
(d) We may have some quantities already available.

Advanced Theory: The previous estimators can be written as a sum $\omega_X\hat\theta_X + \omega_Y\hat\theta_Y$ with weights $\omega = (\omega_X,\omega_Y)$ such that $\omega_X+\omega_Y = 1$. As regards the interpretation, the weights can be seen as a measure of the importance that each estimator is given in the global formula. For weights that depend on the sample sizes, it is possible for one estimator to acquire all the importance when the sample sizes increase in the proper way. On the contrary, when the weights are constant, the possible effect (positive or negative) due to each estimator is bounded. The errors were calculated assuming the data are representative of the population; but if the quality of one sample is always poor, the other sample cannot do the whole estimation when the weights do not depend on the sizes.

[PE] Methods and Properties

Exercise 1pe

We have reliable information that suggests the probability distribution with density function

$$f(x;\theta) = \frac{2}{\theta^2}(\theta-x), \quad x \in [0,\theta],$$

as a model for studying the population quantity X. Let X = (X1,...,Xn) be a simple random sample.

(a) Apply the method of the moments to find an estimator $\hat\theta_M$ of the parameter θ.

(b) Calculate the bias and the mean square error of the estimator $\hat\theta_M$.

(c) Study the consistency of $\hat\theta_M$.

(d) Try to apply the maximum likelihood method to find an estimator $\hat\theta_{ML}$ of the parameter θ.

(e) Obtain estimators of the mean and the variance.

Hint: Use that $\mu = E(X) = \theta/3$ and $E(X^2) = \theta^2/6$.

Discussion: This statement is mathematical. The assumptions are supposed to have been checked. We are given the density function of the distribution of X (a dimensionless quantity). The exercise involves two methods of estimation, the definitions of the bias and the mean square error, and the sufficient condition for the consistency (in probability). The first two population moments are provided.

Note: If E(X) and E(X²) had not been given in the statement, they could have been calculated by applying the definition and solving the integrals:

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_0^\theta x\,\frac{2(\theta-x)}{\theta^2}\,dx = \frac{2}{\theta^2}\left(\int_0^\theta x\theta\,dx - \int_0^\theta x^2\,dx\right) = \frac{2}{\theta^2}\left(\theta\left[\frac{x^2}{2}\right]_0^\theta - \left[\frac{x^3}{3}\right]_0^\theta\right) = \frac{2}{\theta^2}\left(\frac{\theta^3}{2} - \frac{\theta^3}{3}\right) = \frac{1}{3}\theta$$

$$E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x;\theta)\,dx = \int_0^\theta x^2\,\frac{2(\theta-x)}{\theta^2}\,dx = \frac{2}{\theta^2}\left(\theta\left[\frac{x^3}{3}\right]_0^\theta - \left[\frac{x^4}{4}\right]_0^\theta\right) = \frac{2}{\theta^2}\left(\frac{\theta^4}{3} - \frac{\theta^4}{4}\right) = \frac{1}{6}\theta^2$$

(a) Method of the moments

(a1) Population and sample moments

The distribution has only one parameter, so one equation suffices. By using the information in the hint:


$$\mu_1(\theta) = \frac{1}{3}\theta \quad\text{and}\quad m_1(x_1,x_2,...,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

(a2) System of equations

$$\mu_1(\theta) = m_1(x_1,x_2,...,x_n) \;\to\; \frac{1}{3}\theta = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x} \;\to\; \theta_0 = \frac{3}{n}\sum_{j=1}^n x_j = 3\bar{x}$$

(a3) The estimator

It is obtained after substituting the lower-case letters $x_j$ by upper-case letters $X_j$:

$$\hat\theta_M = \frac{3}{n}\sum_{j=1}^n X_j = 3\bar{X}$$

(b) Bias and mean square error

(b1) Bias

To apply the definition $b(\hat\theta_M) = E(\hat\theta_M) - \theta$ we need to calculate the expectation:

$$E(\hat\theta_M) = E(3\bar{X}) = 3E(\bar{X}) = 3E(X) = 3\,\frac{\theta}{3} = \theta$$

where we have used the properties of the expectation, a property of the sample mean and the information in the statement. Now

$$b(\hat\theta_M) = E(\hat\theta_M) - \theta = \theta - \theta = 0$$

and we can see that the estimator is unbiased (we could see it also from the calculation of the expectation).

(b2) Mean square error

We do not usually apply the definition $MSE(\hat\theta_M) = E\left((\hat\theta_M-\theta)^2\right)$ but a property derived from it, for which we need to calculate the variance:

$$Var(\hat\theta_M) = Var(3\bar{X}) = 3^2\,\frac{Var(X)}{n} = 3^2\,\frac{E(X^2)-E(X)^2}{n} = \frac{3^2}{n}\left[\frac{\theta^2}{6} - \left(\frac{\theta}{3}\right)^2\right] = \frac{3^2\theta^2}{n}\left(\frac{1}{6}-\frac{1}{9}\right) = \frac{3^2\theta^2}{n}\,\frac{1}{18} = \frac{\theta^2}{2n}$$

where we have used the properties of the variance, a property of the sample mean and the information in the statement. Then

$$MSE(\hat\theta_M) = b(\hat\theta_M)^2 + Var(\hat\theta_M) = 0^2 + \frac{\theta^2}{2n} = \frac{\theta^2}{2n}$$

(c) Consistency

We try applying the sufficient condition $\lim_{n\to\infty} MSE(\hat\theta) = 0$ or, equivalently, $\lim_{n\to\infty} b(\hat\theta) = 0$ and $\lim_{n\to\infty} Var(\hat\theta) = 0$. Since

$$\lim_{n\to\infty} MSE(\hat\theta_M) = \lim_{n\to\infty} \frac{\theta^2}{2n} = 0$$

it is concluded that the estimator is consistent (in mean of order two and hence in probability) for estimating θ.
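The unbiasedness and the rate $\theta^2/(2n)$ can be illustrated by simulation. A Python sketch follows (samples are drawn by inverting the CDF of this density, $F(x) = (2\theta x - x^2)/\theta^2$, which gives $x = \theta(1-\sqrt{1-u})$; the values of θ, n and the number of replications are arbitrary):

```python
import math
import random

# Monte Carlo sanity check of the bias and MSE of theta_M = 3*Xbar.
# Inverse-CDF sampling: F(x) = (2*theta*x - x^2)/theta^2 = u gives
# x = theta*(1 - sqrt(1 - u)) for u uniform on [0, 1).
random.seed(7)

theta = 2.0
n = 50
reps = 4000

estimates = []
for _ in range(reps):
    sample = [theta * (1.0 - math.sqrt(1.0 - random.random())) for _ in range(n)]
    estimates.append(3.0 * sum(sample) / n)

mean_est = sum(estimates) / reps
mse_est = sum((e - theta) ** 2 for e in estimates) / reps
theoretical_mse = theta ** 2 / (2 * n)   # = 0.04 here
```

With these values, `mean_est` should be close to θ = 2 and `mse_est` close to 0.04.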

(d) Maximum likelihood method

(d1) Likelihood function: The density function is $f(x;\theta) = \frac{2}{\theta^2}(\theta-x)$ for $0 \le x \le \theta$, so

$$L(x_1,x_2,...,x_n;\theta) = \prod_{j=1}^n f(x_j;\theta) = \frac{2^n}{\theta^{2n}}\prod_{j=1}^n (\theta-x_j)$$

(d2) Optimization problem: First, we try to find the maximum by applying the technique based on the derivatives. The logarithm function is applied,

$$\log[L(x_1,x_2,...,x_n;\theta)] = n\log(2) - 2n\log(\theta) + \sum_{j=1}^n \log(\theta-x_j)$$

and the first condition leads to a useless equation:

$$0 = \frac{d}{d\theta}\log[L(x_1,x_2,...,x_n;\theta)] = -2n\,\frac{1}{\theta} + \sum_{j=1}^n \frac{1}{\theta-x_j} \;\to\; ?$$

Then, we realize that global minima and maxima cannot always be found through the derivatives (only if they are also local extremes). In this case, it is difficult even to know whether L monotonically decreases with θ or not, since part of L increases and another part decreases; which one changes more? We study the j-th element of the product, that is, $f(x_j;\theta)$. Its first derivative is

$$f'(x_j;\theta) = 2\,\frac{\theta^2-(\theta-x_j)\,2\theta}{\theta^4} = 2\,\frac{\theta(2x_j-\theta)}{\theta^4}, \quad \text{so it has an extreme at } \theta = 2x_j$$

This implies that L is the product of n terms, each having its extreme at a different point, so L does not change monotonically with the parameter θ.

(d3) The estimator: → ?

(e) Estimators of the mean and the variance

To obtain estimators of the mean, we take into account that $\mu = E(X) = \frac{\theta}{3}$ and apply the plug-in principle:

$$\hat\mu_M = \frac{\hat\theta_M}{3} = \frac{3\bar{X}}{3} = \bar{X} \qquad \hat\mu_{ML} = \frac{\hat\theta_{ML}}{3} = \frac{\max_j\{X_j\}}{3}$$

To obtain estimators of the variance, since $\sigma^2 = Var(X) = E(X^2)-E(X)^2 = \frac{\theta^2}{6}-\frac{\theta^2}{9} = \frac{\theta^2}{18}$,

$$\hat\sigma_M^2 = \frac{\hat\theta_M^2}{18} = \frac{(3\bar{X})^2}{18} = \frac{\bar{X}^2}{2} \qquad \hat\sigma_{ML}^2 = \;?$$

Conclusion: The method of the moments is applied to obtain an estimator that is unbiased for any sample size n and behaves well for large n (many data). The maximum likelihood method cannot be applied, since it is difficult to optimize the likelihood function by considering either its expression or the behaviour of the density function.

Exercise 2pe

Let X be a random variable following the Rayleigh distribution, whose probability density function is

$$f(x;\theta) = \frac{x}{\theta^2}\,e^{-\frac{x^2}{2\theta^2}}, \quad x \ge 0,\; (\theta > 0)$$

such that $E(X) = \theta\sqrt{\frac{\pi}{2}}$ and $Var(X) = \frac{4-\pi}{2}\,\theta^2$. Let X = (X1,...,Xn) be a simple random sample.


(a) Apply the method of the moments to find an estimator $\hat\theta_M$ of the parameter θ.

(b) For $\hat\theta_M$, calculate the bias and the mean square error, and study the consistency.

(c) Apply the maximum likelihood method to find an estimator $\hat\theta_{ML}$ of the parameter θ.

CULTURAL NOTE (From: Wikipedia.)

In probability theory and statistics, the Rayleigh distribution is a continuous probability distribution for positive-valued random variables. A Rayleigh distribution is often observed when the overall magnitude of a vector is related to its directional components. One example where the Rayleigh distribution naturally arises is when wind velocity is analyzed into its orthogonal 2-dimensional vector components. Assuming that the magnitudes of each component are uncorrelated, normally distributed with equal variance, and zero mean, then the overall wind speed (vector magnitude) will be characterized by a Rayleigh distribution. A second example of the distribution arises in the case of random complex numbers whose real and imaginary components are i.i.d. (independently and identically distributed) Gaussian with equal variance and zero mean. In that case, the absolute value of the complex number is Rayleigh-distributed. The distribution is named after Lord Rayleigh.

Discussion: This is a theoretical exercise where we must apply two methods of point estimation. The basicproperties must be considered for the estimator obtained through the first method.

Note: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and exponentials are functions "of different type"):

$$E(X) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx = \int_0^\infty x\,\frac{x}{\theta^2}e^{-\frac{x^2}{2\theta^2}}\,dx = \left[-x e^{-\frac{x^2}{2\theta^2}}\right]_0^\infty + \int_0^\infty e^{-\frac{x^2}{2\theta^2}}\,dx = 0 + \int_0^\infty e^{-\left(\frac{x}{\sqrt{2\theta^2}}\right)^2}\,dx = \sqrt{2\theta^2}\int_0^\infty e^{-t^2}\,dt = \sqrt{2\theta^2}\,\frac{\sqrt{\pi}}{2} = \theta\sqrt{\frac{\pi}{2}}$$

where $\int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx$ has been used with

• $u = x \;\to\; u' = 1$

• $v' = \frac{x}{\theta^2}e^{-\frac{x^2}{2\theta^2}} \;\to\; v = \int \frac{x}{\theta^2}e^{-\frac{x^2}{2\theta^2}}\,dx = -e^{-\frac{x^2}{2\theta^2}}$

Then, we have applied the change of variable

$$\frac{x}{\sqrt{2\theta^2}} = t \;\to\; x = t\sqrt{2\theta^2} \;\to\; dx = \sqrt{2\theta^2}\,dt$$

We calculate the variance by using the first two moments. For the second moment, we can apply integration by parts again (the exponent of x decreases each time):

$$E(X^2) = \int_0^\infty x^2\,\frac{x}{\theta^2}e^{-\frac{x^2}{2\theta^2}}\,dx = \left[-x^2 e^{-\frac{x^2}{2\theta^2}}\right]_0^\infty + \int_0^\infty 2x\,e^{-\frac{x^2}{2\theta^2}}\,dx = 0 + 2\theta^2\int_0^\infty \frac{x}{\theta^2}e^{-\frac{x^2}{2\theta^2}}\,dx = 2\theta^2$$

where $\int u(x)\,v'(x)\,dx = u(x)\,v(x) - \int u'(x)\,v(x)\,dx$ has been used with

• $u = x^2 \;\to\; u' = 2x$

• $v' = \frac{x}{\theta^2}e^{-\frac{x^2}{2\theta^2}} \;\to\; v = -e^{-\frac{x^2}{2\theta^2}}$

The variance is $Var(X) = E(X^2) - E(X)^2 = 2\theta^2 - \theta^2\frac{\pi}{2} = \theta^2\,\frac{4-\pi}{2}$. (In substituting the limits, it has been taken into account that $e^x$ changes faster than $x^k$ for any k. On the other hand, in an advanced table of integrals like those physicists or engineers use, one can find $\int_0^{+\infty} e^{-ax^2}\,dx$ (see the appendixes of Mathematics) or $\int_0^{+\infty} x^2 e^{-ax^2}\,dx$ directly.)


(a) Method of the moments

Since there appears only one parameter in the density function, one equation suffices; moreover, since the expression of μ = E(X) involves θ, the equation and the solution are:

$$\mu_1(\theta) = \bar{x} \;\to\; \theta\sqrt{\frac{\pi}{2}} = \bar{x} \;\to\; \theta = \sqrt{\frac{2}{\pi}}\,\bar{x} \;\to\; \hat\theta_M = \sqrt{\frac{2}{\pi}}\,\bar{X}$$

(b) Bias, mean square error and consistency

Mean or expectation: $E(\hat\theta_M) = E\left(\sqrt{\frac{2}{\pi}}\,\bar{X}\right) = \sqrt{\frac{2}{\pi}}\,E(\bar{X}) = \sqrt{\frac{2}{\pi}}\,E(X) = \sqrt{\frac{2}{\pi}}\,\theta\sqrt{\frac{\pi}{2}} = \theta$

Bias: $b(\hat\theta_M) = E(\hat\theta_M) - \theta = \theta - \theta = 0 \;\to\; \hat\theta_M$ is an unbiased estimator of θ.

Variance: $Var(\hat\theta_M) = Var\left(\sqrt{\frac{2}{\pi}}\,\bar{X}\right) = \frac{2}{\pi}\,Var(\bar{X}) = \frac{2}{\pi}\,\frac{Var(X)}{n} = \frac{2}{\pi}\,\frac{(4-\pi)\theta^2}{2n} = \frac{(4-\pi)\theta^2}{\pi n}$

Mean square error: $MSE(\hat\theta_M) = b(\hat\theta_M)^2 + Var(\hat\theta_M) = 0 + \frac{(4-\pi)\theta^2}{\pi n} = \frac{(4-\pi)\theta^2}{\pi n}$

Consistency: $\lim_{n\to\infty} MSE(\hat\theta_M) = \lim_{n\to\infty} \frac{(4-\pi)\theta^2}{\pi n} = 0$, and therefore $\hat\theta_M$ is consistent (for θ).

(c) Maximum likelihood method

Likelihood function:

$$L(\mathbf{x};\theta) = \prod_{j=1}^n f(x_j;\theta) = f(x_1;\theta)\cdots f(x_n;\theta) = \frac{x_1 e^{-\frac{x_1^2}{2\theta^2}}}{\theta^2}\cdots\frac{x_n e^{-\frac{x_n^2}{2\theta^2}}}{\theta^2} = \frac{\left(\prod_{j=1}^n x_j\right) e^{-\frac{1}{2\theta^2}\sum_j x_j^2}}{\theta^{2n}}$$

Log-likelihood function:

To facilitate the differentiation, $\theta^{2n}$ is moved to the numerator and a property of the logarithm is applied:

$$\log(L(\mathbf{x};\theta)) = \log\left(\prod_{j=1}^n x_j\right) - \frac{1}{2\theta^2}\sum_{j=1}^n x_j^2 + \log(\theta^{-2n}) = \log\left(\prod_{j=1}^n x_j\right) - \frac{\sum_{j=1}^n x_j^2}{2}\,\frac{1}{\theta^2} - 2n\log(\theta)$$

Search for the maximum:

$$0 = \frac{d}{d\theta}\log(L(\mathbf{x};\theta)) = -\frac{\sum_{j=1}^n x_j^2}{2}\,\frac{-2}{\theta^3} - \frac{2n}{\theta} = \frac{\sum_{j=1}^n x_j^2}{\theta^3} - \frac{2n}{\theta} \;\to\; \frac{2n}{\theta} = \frac{\sum_{j=1}^n x_j^2}{\theta^3} \;\to\; \theta^2 = \frac{\sum_{j=1}^n x_j^2}{2n}$$

Now we prove the condition on the second derivative:

$$\frac{d^2}{d\theta^2}\log(L(\mathbf{x};\theta)) = \frac{d}{d\theta}\left(\frac{\sum_{j=1}^n x_j^2}{\theta^3} - \frac{2n}{\theta}\right) = -3\,\frac{\sum_{j=1}^n x_j^2}{\theta^4} + \frac{2n}{\theta^2}$$

The first term is negative and the second is positive, but it is difficult to check qualitatively whether the second is larger in absolute value than the first. Then, the extreme obtained is substituted:

$$\frac{d^2}{d\theta^2}\log(L(\mathbf{x};\theta))\bigg|_{\theta^2=\frac{1}{2n}\sum_j x_j^2} = -3\,\frac{(2n)^2\sum_{j=1}^n x_j^2}{\left(\sum_{j=1}^n x_j^2\right)^2} + \frac{(2n)^2}{\sum_{j=1}^n x_j^2} = -3\,\frac{4n^2}{\sum_{j=1}^n x_j^2} + \frac{4n^2}{\sum_{j=1}^n x_j^2} = -2\,\frac{4n^2}{\sum_{j=1}^n x_j^2} < 0$$

Thus, the extreme is really a maximum.


The estimator:

$$\hat\theta_{ML} = \sqrt{\frac{\sum_{j=1}^n X_j^2}{2n}}$$
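This maximizer can be verified numerically on a simulated dataset. A Python sketch (the Rayleigh CDF is $F(x) = 1 - e^{-x^2/(2\theta^2)}$, inverted as $x = \theta\sqrt{-2\log(1-u)}$; the values of θ and n are arbitrary):

```python
import math
import random

# Check, on one simulated dataset, that theta_ML = sqrt(sum(x^2)/(2n))
# maximizes the Rayleigh log-likelihood. Inverse-CDF sampling:
# F(x) = 1 - exp(-x^2/(2*theta^2))  =>  x = theta*sqrt(-2*log(1 - u)).
random.seed(3)
theta_true = 1.5
n = 200
xs = [theta_true * math.sqrt(-2.0 * math.log(1.0 - random.random())) for _ in range(n)]

def loglik(theta):
    # Rayleigh log-likelihood: sum of log f(x_j; theta)
    return sum(math.log(x / theta ** 2) - x ** 2 / (2 * theta ** 2) for x in xs)

theta_ml = math.sqrt(sum(x ** 2 for x in xs) / (2 * n))
```

The log-likelihood evaluated at `theta_ml` should exceed its value at nearby parameter values.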

Discussion: The Rayleigh distribution is one of the few cases for which the two methods provide different estimators of the parameter. In the first case, we could easily calculate the mean and the variance, as the estimator was linear in the $X_j$; nevertheless, in the second case the nonlinearities ($X_j^2$ and the square root) make those calculations difficult.

Exercise 3pe

Before commercializing a new model of light bulb, a deep statistical study on its duration (measured in days, d) must be carried out. The population variable duration is expected to follow the exponential probability model:

$$f(x;\lambda) = \lambda e^{-\lambda x}, \quad x \ge 0,\; (\lambda > 0)$$

Let X = (X1,...,Xn) be a simple random sample. Then, we want to:

(a) Apply the method of the moments to find an estimator of the parameter λ.

(b) Apply the maximum likelihood method to find an estimator of the parameter λ.

(c) Find a sufficient statistic (see the hint below).

(d) Prove that $\bar{X}$ is not an efficient estimator of λ.

(e) Prove that $\bar{X}$ is a consistent estimator of $\lambda^{-1}$.

(f) Prove that $\bar{X}$ is an efficient estimator of $\lambda^{-1}$. To cope with this, use the following alternative, equivalent notation in terms of $\theta = \lambda^{-1}$:

$$f(x;\theta) = \frac{1}{\theta}\,e^{-\frac{x}{\theta}}, \quad x \ge 0$$

Now you must prove that $\bar{X}$ is an efficient estimator of θ, and you can easily calculate $\frac{d}{d\theta}$, while only experts can calculate $\frac{d}{d\lambda^{-1}}$.

(g) The empirical part of the study, based on the measurement of 55 independent light bulbs, has yielded a total sum of $\sum_{j=1}^{55} x_j = 598\,\mathrm{d}$. Introduce this information in the expressions obtained in previous sections to give final estimates of λ.

(h) Give an estimate of the mean μ = E(X).

Hint: For section (c), apply the factorization theorem and make it clear what the two parts are. In the theorem: (1) g and h are nonnegative; (2) T cannot depend on θ; (3) g depends only on the sample and the parameter, and it depends on the sample through T; (4) h can be 1; and (5) since h is any function of the sample, it may involve T.

Discussion: First of all, the supposition that the exponential distribution can reasonably be used to model the variable duration should be tested. One aim of this exercise is to show how many of the methods and properties involved in previous exercises can appear in the same statistical analysis. The quality of the estimators obtained is studied. (See the appendixes for how the mean and the variance of this distribution could be calculated, if necessary.)

(a) Method of the moments

(a1) Population and sample moments: The population distribution has only one parameter, so one equation suffices. The first-order moments of the model X and the sample x are, respectively,

$$\mu_1(\lambda) = E(X) = \frac{1}{\lambda} \quad\text{and}\quad m_1(x_1,x_2,...,x_n) = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x}$$

(a2) System of equations: Since the parameter of interest λ appears in the first moment of X, the solution is:

$$\mu_1(\lambda) = m_1(x_1,x_2,...,x_n) \;\to\; \frac{1}{\lambda} = \frac{1}{n}\sum_{j=1}^n x_j = \bar{x} \;\to\; \lambda = \left(\frac{1}{n}\sum_{j=1}^n x_j\right)^{-1} = \frac{1}{\bar{x}}$$

(a3) The estimator:

$$\hat\lambda_M = \left(\frac{1}{n}\sum_{j=1}^n X_j\right)^{-1} = \frac{1}{\bar{X}}$$

(b) Maximum likelihood method

(b1) Likelihood function: For an exponential random variable the density function is $f(x;\lambda) = \lambda e^{-\lambda x}$, so we write the product and join the terms that are similar:

$$L(x_1,x_2,...,x_n;\lambda) = \prod_{j=1}^n f(x_j;\lambda) = \prod_{j=1}^n \lambda e^{-\lambda x_j} = \lambda e^{-\lambda x_1}\cdot\lambda e^{-\lambda x_2}\cdots\lambda e^{-\lambda x_n} = \lambda^n e^{-\lambda\sum_{j=1}^n x_j}$$

(b2) Optimization problem: The logarithm function is applied to make calculations easier:

$$\log[L(x_1,x_2,...,x_n;\lambda)] = \log[\lambda^n] + \log\left[e^{-\lambda\sum_{j=1}^n x_j}\right] = n\log[\lambda] - \lambda\sum_{j=1}^n x_j$$

The population distribution has only one parameter, and hence a one-dimensional function must be maximized. To find the local or relative extreme values, the necessary condition is:

$$0 = \frac{d}{d\lambda}\log[L(x_1,x_2,...,x_n;\lambda)] = \frac{n}{\lambda} - \sum_{j=1}^n x_j \;\to\; \lambda_0 = \left(\frac{1}{n}\sum_{j=1}^n x_j\right)^{-1} = \frac{1}{\bar{x}}$$

To verify that the only candidate is a (local) maximum, the sufficient condition is:

$$\frac{d^2}{d\lambda^2}\log[L(x_1,x_2,...,x_n;\lambda)] = -\frac{n}{\lambda^2} < 0$$

which holds for any value, particularly for $\lambda_0 = \frac{1}{\bar{x}}$.

(b3) The estimator:

$$\hat\lambda_{ML} = \left(\frac{1}{n}\sum_{j=1}^n X_j\right)^{-1} = \frac{1}{\bar{X}}$$
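That $1/\bar{x}$ maximizes the exponential log-likelihood can be checked directly for a small dataset. A Python sketch (the data values are made up for illustration):

```python
import math

# For a small made-up dataset, check that lambda_ML = 1/xbar maximizes the
# exponential log-likelihood n*log(lam) - lam*sum(x).
xs = [0.4, 1.3, 2.2, 0.7, 3.1, 1.9]
n = len(xs)
s = sum(xs)

def loglik(lam):
    return n * math.log(lam) - lam * s

lam_ml = n / s   # = 1/xbar = 0.625 for these data
```

Evaluating `loglik` at `lam_ml` and at nearby values shows the maximum.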

(c) Sufficient statistic

Both to prove that a given statistic is sufficient and to find a sufficient statistic, we apply the factorization theorem (see the hint).

(c1) Likelihood function: Computed previously, it is $L(X_1,X_2,...,X_n;\lambda) = \lambda^n e^{-\lambda\sum_{j=1}^n X_j}$.

(c2) Theorem: $L(X_1,X_2,...,X_n;\lambda) = g(T(X_1,X_2,...,X_n);\lambda)\cdot h(X_1,X_2,...,X_n)$. We must analyse each term of the likelihood function.

➔ $\lambda^n$ depends only on the parameter, so it would be part of g.

➔ $e^{-\lambda\sum_{j=1}^n X_j}$ depends on both the parameter and the sample, and it is not possible to separate mathematically both types of information; then, this term would be part of g too. Moreover, the only candidate to be a sufficient statistic is $T(\mathbf{X}) = T(X_1,...,X_n) = \sum_{j=1}^n X_j$.

Since the condition holds for $g(T(X_1,X_2,...,X_n);\lambda) = \lambda^n e^{-\lambda\sum_{j=1}^n X_j}$ and $h(X_1,X_2,...,X_n) = 1$, the statistic $T(\mathbf{X}) = \sum_{j=1}^n X_j$ is sufficient. This means that it "summarizes the important information (about the parameter)" contained in the sample. The previous statistic contains the same information as any one-to-one transformation of it, concretely the sample mean $T(\mathbf{X}) = \frac{1}{n}\sum_{j=1}^n X_j = \bar{X}$.

(d) $\bar{X}$ is not an efficient estimator of λ

The definition of efficiency consists of two conditions: unbiasedness and minimum variance (the latter is checked by comparing the variance with the Cramér-Rao bound).

(d1) Unbiasedness: By applying a property of the sample mean and the information of the statement,

$$E(\bar{X}) = E(X) = \frac{1}{\lambda} \;\to\; b(\bar{X}) = E(\bar{X}) - \lambda = \frac{1}{\lambda} - \lambda \ne 0$$

The first condition does not hold for all values of λ, and hence it is not necessary to check the second one.

Note: The previous bias is zero when $\frac{1}{\lambda} - \lambda = 0 \leftrightarrow \lambda = \pm\sqrt{1} \to \lambda = 1$ (for f(x) to be a probability function, λ must be positive, so the solution $-1$ is not taken into account). Thus, when λ = 1, the estimator may still be efficient if the second condition holds.

(e) $\bar{X}$ is a consistent estimator of $\lambda^{-1}$

To prove the consistency (in probability), we will apply any of the following sufficient conditions (consistency in mean of order two):

$$\lim_{n\to\infty} MSE(\hat\theta) = 0 \;\leftrightarrow\; \lim_{n\to\infty} b(\hat\theta) = 0 \;\text{ and }\; \lim_{n\to\infty} Var(\hat\theta) = 0$$

(e1) Bias: By applying a property of the sample mean and the information of the statement,

$$E(\bar{X}) = E(X) = \frac{1}{\lambda} \;\to\; b(\bar{X}) = E(\bar{X}) - \lambda^{-1} = \frac{1}{\lambda} - \frac{1}{\lambda} = 0 \;\to\; \lim_{n\to\infty} b(\bar{X}) = \lim_{n\to\infty} 0 = 0$$

(e2) Variance: By applying a property of the sample mean and the information of the statement,

$$Var(\bar{X}) = \frac{Var(X)}{n} = \frac{1}{\lambda^2\,n} \;\to\; \lim_{n\to\infty} Var(\bar{X}) = \lim_{n\to\infty} \frac{1}{\lambda^2\,n} = 0$$

As a conclusion, the mean square error (MSE) tends to zero, which is sufficient (but not necessary) for the consistency (in probability).

(f) $\bar{X}$ is an efficient estimator of $\lambda^{-1}$

Now, we are recommended to use the notation $f(x;\theta) = \frac{1}{\theta}e^{-\frac{x}{\theta}}$, where $\theta = \lambda^{-1}$.

(f1) Unbiasedness: By applying a property of the sample mean and the information of the statement,

$$E(\bar{X}) = E(X) = \theta \;\to\; b(\bar{X}) = E(\bar{X}) - \lambda^{-1} = \theta - \theta = 0$$

The first condition holds, and hence it is necessary to check the second one.

(f2) Minimum variance: We compare the variance with the Cramér-Rao bound. The variance is:

$$Var(\bar{X}) = \frac{Var(X)}{n} = \frac{\theta^2}{n}$$

On the other hand, the bound is calculated step by step:

i. Function (with X in place of x):

$$f(X;\theta) = \frac{1}{\theta}\,e^{-\frac{X}{\theta}}$$

ii. Logarithm of the function:

$$\log[f(X;\theta)] = \log(\theta^{-1}) + \log\left(e^{-\frac{X}{\theta}}\right) = -\log(\theta) - \frac{X}{\theta}$$

iii. Derivative of the logarithm of the function:

$$\frac{\partial}{\partial\theta}\left(\log[f(X;\theta)]\right) = -\frac{1}{\theta} - X\cdot\frac{-1}{\theta^2} = -\frac{1}{\theta} + \frac{X}{\theta^2} = \frac{X-\theta}{\theta^2}$$

iv. Expectation of the squared partial derivative of the logarithm of the function: We rewrite the expression so as to make $\sigma^2 = Var(X) = E\left((X-E(X))^2\right) = E\left((X-\mu)^2\right)$ appear. In this case, it also holds that $\sigma^2 = Var(X) = E\left((X-\theta)^2\right) = \theta^2$. Then

$$E\left[\left(\frac{\partial\log[f(X;\theta)]}{\partial\theta}\right)^2\right] = E\left[\frac{(X-\theta)^2}{\theta^4}\right] = \frac{Var(X)}{\theta^4} = \frac{\theta^2}{\theta^4} = \frac{1}{\theta^2}$$

v. Theoretical Cramér-Rao lower bound:

$$\frac{1}{n\,E\left[\left(\frac{\partial\log[f(X;\theta)]}{\partial\theta}\right)^2\right]} = \frac{1}{n\,\frac{1}{\theta^2}} = \frac{\theta^2}{n}$$

The variance of the estimator attains the bound, so the estimator has minimum variance. The fulfillment of the two conditions proves that $\bar{X}$ is an efficient estimator of $\lambda^{-1} = \theta$.
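The value $1/\theta^2$ obtained in step iv can be confirmed with a numerical integration. A Python sketch using the midpoint rule (θ, the truncation point and the number of subintervals are arbitrary choices; the tail beyond the truncation is negligible):

```python
import math

# Midpoint-rule check that E[(d/d(theta) log f(X; theta))^2] = 1/theta^2
# for f(x; theta) = (1/theta)*exp(-x/theta).
theta = 2.0
m = 200000
upper = 60.0 * theta   # truncation of the integral; the tail is negligible

def f(x):
    return math.exp(-x / theta) / theta

def score(x):
    # d/d(theta) of log f(x; theta) = -1/theta + x/theta^2
    return -1.0 / theta + x / theta ** 2

h = upper / m
info = sum(score(x) ** 2 * f(x) for x in ((k + 0.5) * h for k in range(m))) * h
```

With θ = 2, `info` should be close to 0.25.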

(g) Estimation of λ

It is necessary to use the only information available: $\sum_{j=1}^{55} x_j = 598\,\mathrm{d}$.

From the method of the moments: $\hat\lambda_M = \left(\frac{1}{n}\sum_{j=1}^n x_j\right)^{-1} = \left(\frac{1}{55}\,598\,\mathrm{d}\right)^{-1} = 0.09197\,\mathrm{d}^{-1}$.

From the maximum likelihood method, since the same estimator was obtained: $\hat\lambda_{ML} = 0.09197\,\mathrm{d}^{-1}$.

(h) Estimation of μ

Since μ=E (X )=1λ

, an estimator of λ induces, by applying the plug-in principle, an estimator of μ:

72 Solved Exercises and Problems of Statistical Inference

Page 76: Solved Exercises and Problems of Statistical Inferencecasado-d.org/edu/ExercisesProblemsStatisticalInference.pdf · Solved Exercises and Problems of Statistical Inference ... Mathematics

From the method of the moments: μ̂M = (λ̂M)⁻¹ = 598d/55 = 10.87d.

From the maximum likelihood method: μ̂ML = 10.87d.

Note: From the numerical point of view, calculating 598/55 is expected to have smaller error than calculating 1/0.091973.

Conclusion: We can see that for the exponential model the two methods provide the same estimator for λ. The estimator obtained has been used to obtain an estimator of the population mean. The mean duration estimate of the new model of light bulb was 10.87 days. On the other hand, some desirable properties of the estimator have been proved. A different, equivalent notation has been used to facilitate the proof of one of these properties, which emphasizes the importance of the notation in doing calculations.
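These point estimates are easy to verify numerically. A minimal Python sketch (the total 598d over 55 bulbs is taken from the statement; the variable names are illustrative):

```python
total, n = 598.0, 55   # sum of the observed durations (days) and sample size

xbar = total / n       # sample mean: plug-in estimate of mu = 1/lambda
lam_hat = 1 / xbar     # moments/maximum-likelihood estimate of lambda (1/days)

print(round(lam_hat, 5), round(xbar, 2))   # 0.09197 10.87
```

As the note above points out, computing μ̂ directly as 598/55 avoids the rounding error introduced by first rounding λ̂ and then inverting it.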


My notes:


Confidence Intervals

[CI] Methods for Estimating

Remark 1ci: Confidence can be interpreted as a probability (so it is, although we sometimes use a 0-to-100 scale). See remark 1pt, in the appendix of Probability Theory, on the interpretation of the concept of probability.

Remark 2ci: Since there is an infinite number of pairs of quantiles a1 and a2 such that P(a1 ≤ T ≤ a2) = 1−α, those determining tails of probability α/2 are considered by convention. This criterion is also applied for two-tailed hypothesis tests.

Remark 3ci: When the Central Limit Theorem can be applied, asymptotic results on averages are relatively independent of the initial population. Therefore, in some exercises there are no suppositions on the distribution of the population variables.

Exercise 1ci-m

To forecast the yearly inflation (in percent, %), a simple random sample has been gathered:

1.5 2.1 1.9 2.3 2.5 3.2 3.0

It is assumed that the variable inflation follows a normal distribution.

(a) By using these data, construct a 99% confidence interval for the mean of the inflation.

(b) Experts have the opinion that the previous interval is too wide, and they want a total length of one unit. Find the level of confidence for this new interval.

(c) Construct a confidence interval of 90% for the standard deviation.

Discussion: The intervals will be built by applying the method of the pivot, and then the expression of the margin of error is determined. Since variances are nonnegative by definition and the positive branch of the square root function is strictly increasing, the interval for the standard deviation is obtained by applying the square root to the interval for the variance.

Identification of the variable

X ≡ Predicted inflation (of one country) X ~ N(μ, σ²)

Sample information

Theoretical (simple random) sample: X1,..., X7 s.r.s. → n = 7

Empirical sample: x1,..., x7 → 1.5 2.1 1.9 2.3 2.5 3.2 3.0

In this exercise, we know the values of the sample xi. This allows calculating any quantity we want.

(a) Confidence interval for the mean: To choose the proper pivot, we take into account:

• The variable of interest follows a normal distribution.
• The population variance σ² is unknown, so it must be estimated by the sample (quasi)variance.
• The sample size is small, n = 7, so we should not think about the asymptotic framework.

From a table of statistics (e.g. in [T]), the pivot



T(X; μ) = (X̄ − μ)/√(S²/n) = (X̄ − μ)/(S/√n) ∼ tn−1

is selected. Then

1 − α = P(lα/2 ≤ T(X; μ) ≤ rα/2) = P(−rα/2 ≤ (X̄ − μ)/√(S²/n) ≤ +rα/2) = P(−rα/2√(S²/n) ≤ X̄ − μ ≤ +rα/2√(S²/n))
= P(−X̄ − rα/2√(S²/n) ≤ −μ ≤ −X̄ + rα/2√(S²/n)) = P(X̄ + rα/2√(S²/n) ≥ μ ≥ X̄ − rα/2√(S²/n))

so

I1−α = [X̄ − rα/2√(S²/n), X̄ + rα/2√(S²/n)]

where rα/2 is the quantile such that P(T > rα/2) = α/2. Let us calculate the quantities in the formula:

• x̄ = (1/7)∑j=1..7 xj = 2.36%

• The level of confidence is 99%, and hence α = 0.01. The quantile is found in the table of the t distribution with κ = 7−1 degrees of freedom: rα/2 = r0.01/2 = r0.005 = 3.71

• By using the data, S² = (1/(7−1))∑j=1..7 (xj − x̄)² = (1/(7−1))[(1.5% − 2.36%)² + ⋯ + (3.0% − 2.36%)²] = 0.36%²

• Finally, n = 7

Then, the interval is

I0.99 = [2.36% − 3.71√(0.36%²/7), 2.36% + 3.71√(0.36%²/7)] = [1.52%, 3.20%]

whose length is 3.20% − 1.52% = 1.68%.

(b) Confidence level: The length of the interval, the distance between the two endpoints, is twice the margin of error when T follows a symmetric distribution.

L = (X̄ + rα/2√(S²/n)) − (X̄ − rα/2√(S²/n)) = 2·rα/2√(S²/n)

In this section L is given and α must be found; nevertheless, it is necessary to find rα/2 previously.

rα/2 = L√n/(2S) = (1%·√7)/(2·0.6%) = 2.20

In the table of the t law it is found that α/2 = 0.035, so α = 0.07 and 1−α = 0.93. The confidence level is 93%.
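The quantile above can be checked in a couple of lines; a minimal Python sketch (S = 0.6%, the quasi-standard deviation from section (a)):

```python
import math

L, n, S = 1.0, 7, 0.6   # desired length (in %), sample size, quasi-sd (in %)

# Solve L = 2 * r * sqrt(S^2 / n) for the quantile r
r = L * math.sqrt(n) / (2 * S)
print(round(r, 2))   # about 2.20; the t table with 6 d.f. then gives alpha/2 = 0.035
```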

(c) Confidence interval for the standard deviation: To choose the new statistic:

• The variable of interest follows a normal distribution.
• The quantity of interest is the standard deviation σ.
• The population mean μ is unknown.
• The sample size is small, n = 7, so we should not think about the asymptotic framework.

From a table of statistics (e.g. in [T]), the proper pivot


> 1 - pt(2.20, 7-1)
[1] 0.03505109


T(X; σ) = (n−1)S²/σ² ∼ χ²n−1

is selected. Then

1 − α = P(lα/2 ≤ (n−1)S²/σ² ≤ rα/2) = P(lα/2/((n−1)S²) ≤ 1/σ² ≤ rα/2/((n−1)S²)) = P((n−1)S²/lα/2 ≥ σ² ≥ (n−1)S²/rα/2)

and hence the interval is

I1−α = [(n−1)S²/rα/2, (n−1)S²/lα/2]

The quantities in the formula are:

• Sample size n = 7, so n−1 = 6
• S² = 0.36%²
• Since α = 0.1 and κ = n−1 = 6, the quantiles are l0.05 = 1.64 and r0.05 = 12.6

By substituting and applying the square root function, the interval is

I0.9 = [√(6·0.36%²/12.6), √(6·0.36%²/1.64)] = [0.414%, 1.148%]

Conclusion: The length in section (b) is smaller than in section (a); that is, the interval is narrower and the confidence is smaller.
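Both intervals of this exercise can be reproduced with a short Python sketch; the quantiles (3.71 for the t distribution, 1.64 and 12.6 for the chi-square, all with 6 degrees of freedom) are the table values used above:

```python
import math
from statistics import mean, stdev   # stdev uses the n-1 divisor (quasivariance)

x = [1.5, 2.1, 1.9, 2.3, 2.5, 3.2, 3.0]   # sampled inflation values, in %
n = len(x)
xbar = mean(x)
S2 = stdev(x) ** 2

# (a) 99% interval for the mean; 3.71 is the t quantile r_0.005 with 6 d.f.
half = 3.71 * math.sqrt(S2 / n)
ci_mu = (xbar - half, xbar + half)

# (c) 90% interval for sigma; 1.64 and 12.6 are chi-square quantiles with 6 d.f.
ci_sigma = (math.sqrt((n - 1) * S2 / 12.6), math.sqrt((n - 1) * S2 / 1.64))

print([round(v, 2) for v in ci_mu], [round(v, 3) for v in ci_sigma])
```

Working with the unrounded sample moments gives endpoints that may differ from the hand calculation in the last displayed digit.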

Exercise 2ci-m

In the library of a university, the mean duration (in days, d) of the borrowing period seems to be 20d. A simple random sample of 100 books is analysed, and the values 18d and 8d² are obtained for the sample mean and the sample variance, respectively. Construct a 99% confidence interval for the mean duration of the borrowings to check if the initial population value is inside.

Discussion: For so many data, asymptotic results are considered. The method of the pivotal quantity can also be applied. The dimension of the variable duration is time, while the unit of measurement is days.

Identification of the variable:

X ≡ Duration (of one borrowing) X ~ ?

Sample information:

Theoretical (simple random) sample: X1,...,X100 s.r.s. → n = 100

Empirical sample: x1,...,x100 → x̄ = 18d, s² = 8d²

The values xj of the sample are unknown; instead, the evaluation of some statistics is given. These quantities must be sufficient for the calculations, and, therefore, formulas must be written in terms of X̄ and S².


My notes:

> qchisq(c(0.05, 1-0.05), 7-1)
[1]  1.635383 12.591587


Confidence interval: To select the pivot, we take into account:

• Nothing is said about the probability distribution of the variable of interest.
• The sample size is big, n = 100 (>30), so an asymptotic expression can be considered.
• The population variance is unknown, but it is estimated through the sample variance.

From a table of statistics (e.g. in [T]), the proper pivot

T(X; μ) = (X̄ − μ)/√(S²/n) →d N(0,1)

is chosen, where S² is the sample quasivariance. By applying the method of the pivotal quantity:

1 − α = P(lα/2 ≤ T(X; μ) ≤ rα/2) = P(−rα/2 ≤ (X̄ − μ)/√(S²/n) ≤ +rα/2) = P(−rα/2√(S²/n) ≤ X̄ − μ ≤ +rα/2√(S²/n))
= P(−X̄ − rα/2√(S²/n) ≤ −μ ≤ −X̄ + rα/2√(S²/n)) = P(X̄ + rα/2√(S²/n) ≥ μ ≥ X̄ − rα/2√(S²/n))

Then, the interval is

I1−α = [X̄ − rα/2√(S²/n), X̄ + rα/2√(S²/n)]

where rα/2 is the quantile such that P(Z > rα/2) = α/2.

Substitution: We calculate the quantities involved in the formula,

• Sample mean x̄ = 18d
• For a confidence of 99%, α = 0.01 and rα/2 = 2.58
• To calculate S², the property (n−1)S² = ∑j=1..100 (xj − x̄)² = n·s² is used: S² = (n/(n−1))·s² = (100/99)·8d² = 8.1d²
• n = 100

The interval is

I0.99 = [18d − 2.58·√(8.1d²)/√100, 18d + 2.58·√(8.1d²)/√100] = [17.27d, 18.73d]

Conclusion: The mean duration estimate of the borrowings belongs to the interval obtained with 99% confidence. The initial value μ = 20d is not inside the high-confidence interval obtained; that is, it is not supported by the data. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
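As a numerical check, the interval can be recomputed from the summary statistics alone; a minimal Python sketch (2.58 is the normal quantile used above):

```python
import math

n, xbar, s2 = 100, 18.0, 8.0   # sample size, sample mean (d), sample variance (d^2)
r = 2.58                       # normal quantile r_0.005 for 99% confidence

S2 = n / (n - 1) * s2          # quasivariance from the sample variance
half = r * math.sqrt(S2 / n)
ci = (xbar - half, xbar + half)
print([round(v, 2) for v in ci])   # [17.27, 18.73]
```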

Exercise 3ci-m

The accounting firm Price Waterhouse periodically monitors the U.S. Postal Service's performance. One parameter of interest is the percentage of mail delivered on time. In a simple random sample of 332,000


My notes:


mailed items, Price Waterhouse determined that 282,200 items were delivered on time (Tampa Tribune, March 26, 1995). Use this information to estimate with 99% confidence the true percentage of items delivered on time by the U.S. Postal Service.

(Taken from: Statistics. J.T. McClave and T. Sincich. Pearson.)

Discussion: The population is characterized by a Bernoulli variable, since for each item there are only two possible values. We must construct a confidence interval for the proportion (a percent is a proportion expressed in a 0-to-100 scale). Proportions have no dimension.

Identification of the variable:

X ≡ Delivered on time (one item)? X ~ B(η)

Confidence interval

For this kind of population and amount of data, we use the statistic:

T(X; η) = (η̂ − η)/√(?(1−?)/n) →d N(0,1)

where ? is substituted by η or η̂. For confidence intervals η is unknown and no value is supposed, and hence it is estimated through the sample proportion. By applying the method of the pivot:

1 − α = P(lα/2 ≤ T(X; η) ≤ rα/2) = P(−rα/2 ≤ (η̂ − η)/√(η̂(1−η̂)/n) ≤ +rα/2) = P(−rα/2√(η̂(1−η̂)/n) ≤ η̂ − η ≤ +rα/2√(η̂(1−η̂)/n))
= P(−η̂ − rα/2√(η̂(1−η̂)/n) ≤ −η ≤ −η̂ + rα/2√(η̂(1−η̂)/n)) = P(η̂ + rα/2√(η̂(1−η̂)/n) ≥ η ≥ η̂ − rα/2√(η̂(1−η̂)/n))

Then, the interval is

I1−α = [η̂ − rα/2√(η̂(1−η̂)/n), η̂ + rα/2√(η̂(1−η̂)/n)]

Substitution: We calculate the quantities in the formula,

• n = 332000
• η̂ = 282200/332000 = 0.850
• 99% → 1−α = 0.99 → α = 0.01 → α/2 = 0.005 → rα/2 = r0.005 = l0.995 = 2.58

So

I0.99 = [0.850 − 2.58√(0.850(1−0.850)/332000), 0.850 + 2.58√(0.850(1−0.850)/332000)] = [0.848, 0.852]

Conclusion: With a confidence of 0.99, measured in a 0-to-1 scale, the value of η will be in the interval



obtained. On average, 99% of the times the method applied provides a right interval. Nonetheless, frequently we do not know the real η and therefore we will never know whether the method has failed or not. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
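The interval for the proportion can be reproduced in a few lines; a minimal Python sketch (2.58 is the normal quantile used above):

```python
import math

n, on_time = 332000, 282200
r = 2.58   # normal quantile r_0.005 for 99% confidence

p = on_time / n                          # sample proportion of items on time
half = r * math.sqrt(p * (1 - p) / n)    # margin of error
print(round(p, 3), round(p - half, 3), round(p + half, 3))   # 0.85 0.848 0.852
```

Note how narrow the interval is: with n = 332000 the standard error of the sample proportion is tiny.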

Exercise 4ci-m

Two independent groups, A and B, consist of 100 people each of whom have a disease. A serum is given to group A but not to group B, which are termed treatment and control groups, respectively; otherwise, the two groups are treated identically. Two simple random samples have yielded that in the two groups, 75 and 65 people, respectively, recover from the disease. To study the effect of the serum, build a 95% confidence interval for the difference ηA–ηB. Does the interval contain the case ηA = ηB?

Discussion: There are two independent Bernoulli populations. The interval for the difference of proportions is built by applying the method of the pivot. Proportions are, by definition, dimensionless quantities.

Identification of the variable: Having got better or not is a dichotomic situation,

A ≡ Recuperating (an individual of the treatment group)? A ~ B(ηA)

B ≡ Recuperating (an individual of the control group)? B ~ B(ηB)

(1) Pivot: We take into account that:

• There are two independent Bernoulli populations.
• Both sample sizes are large, 100, so an asymptotic approximation can be applied.

From a table of statistics (e.g. in [T]), the following pivot is selected

T(A, B; ηA, ηB) = ((η̂A − η̂B) − (ηA − ηB))/√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB) →d N(0,1)

(2) Event rewriting:

1 − α = P(lα/2 ≤ T(A, B; ηA, ηB) ≤ rα/2) ≈ P(−rα/2 ≤ ((η̂A − η̂B) − (ηA − ηB))/√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB) ≤ +rα/2)
= P(−rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB) ≤ (η̂A − η̂B) − (ηA − ηB) ≤ +rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB))
= P(−(η̂A − η̂B) − rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB) ≤ −(ηA − ηB) ≤ −(η̂A − η̂B) + rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB))
= P((η̂A − η̂B) + rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB) ≥ ηA − ηB ≥ (η̂A − η̂B) − rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB))


My notes:


(3) The interval:

I1−α = [(η̂A − η̂B) − rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB), (η̂A − η̂B) + rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB)]

where rα/2 is the value of the standard normal distribution such that P(Z>rα/2)=α/2.

Substitution: We need to calculate the quantities involved in the previous formula,

• nA = 100 and nB = 100.

• Theoretical (simple random) sample: A1,...,A100 s.r.s. (each value is 1 or 0).
Empirical sample: a1,...,a100 → ∑j=1..100 aj = 75 → η̂A = (1/100)∑j aj = 75/100 = 0.75

Theoretical (simple random) sample: B1,...,B100 s.r.s. (each value is 1 or 0).
Empirical sample: b1,...,b100 → ∑j=1..100 bj = 65 → η̂B = (1/100)∑j bj = 65/100 = 0.65.

• 95% → 1−α = 0.95 → α = 0.05 → α/2 = 0.025 → rα/2 = 1.96.

Then,

I0.95 = (0.75 − 0.65) ∓ 1.96√(0.75(1−0.75)/100 + 0.65(1−0.65)/100) = [−0.0263, 0.226]

The case ηA = ηB is included in the interval.

Conclusion: The lack-of-effect case (ηA = ηB) cannot be excluded when the decision has 95% confidence. Since η ∈ (0,1), any “reasonable” estimator of η should provide values in this range or close to it. Because of the natural uncertainty of the sampling process (randomness and variability), in this case the smaller endpoint of the interval was –0.0263, which can be interpreted as being 0. When an interval of high confidence is far from 0, the case ηA = ηB can clearly be discarded or rejected. Finally, it is important to notice that a confidence interval can be used to make decisions about hypotheses on the parameter values—it is equivalent to a two-sided hypothesis test, as the interval is also two-sided. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Advanced theory: When the assumption ηA = η = ηB seems reasonable (notice that this case is included in the 95% confidence interval just calculated), it makes sense to try to estimate the common variance of the estimator as well as possible. This can be done by using the pooled sample proportion

η̂p = (nA·η̂A + nB·η̂B)/(nA + nB)

in estimating η(1−η) for the denominator; nonetheless, the pooled estimator should not be considered in the numerator, as (η̂p − η̂p) = 0 whatever the data are. The statistic would be:

T̃(A, B) = ((η̂A − η̂B) − (ηA − ηB))/√(η̂p(1−η̂p)/nA + η̂p(1−η̂p)/nB) →d N(0,1)

Now, the expression of the interval would be

Ĩ1−α = [(η̂A − η̂B) − rα/2√(η̂p(1−η̂p)/nA + η̂p(1−η̂p)/nB), (η̂A − η̂B) + rα/2√(η̂p(1−η̂p)/nA + η̂p(1−η̂p)/nB)]

The quantities involved in the previous formula are

• nA = 100 and nB = 100



• Since η̂A = 0.75 and η̂B = 0.65, the pooled estimate is

η̂p = (nA·η̂A + nB·η̂B)/(nA + nB) = n(η̂A + η̂B)/(2n) = (0.75 + 0.65)/2 = 0.70

• 95% → 1−α = 0.95 → α = 0.05 → α/2 = 0.025 → rα/2 = 1.96

Then,

Ĩ0.95 = (0.75 − 0.65) ∓ 1.96√(2·0.70(1−0.70)/100) = [−0.0270, 0.227]

One way to measure how different the results are consists in directly comparing the length—twice the margin of error—in both cases:

L = 0.226 − (−0.0263) = 0.2523     L̃ = 0.227 − (−0.0270) = 0.254

Even if the latter length is larger, it is theoretically more trustable than the former when ηA = η = ηB is true. The general expressions of these lengths can be found too:

L = 2rα/2√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB)     L̃ = 2rα/2√(η̂p(1−η̂p)/nA + η̂p(1−η̂p)/nB)

Another way to measure how different the results are can be based on comparing the statistics:

T̃(A, B) = ((η̂A − η̂B) − (ηA − ηB))/√(η̂p(1−η̂p)/nA + η̂p(1−η̂p)/nB)
= [((η̂A − η̂B) − (ηA − ηB))/√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB)] · [√(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB)/√(η̂p(1−η̂p)/nA + η̂p(1−η̂p)/nB)]
= T(A, B) · √(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB)/√(η̂p(1−η̂p)/nA + η̂p(1−η̂p)/nB)

→ L/L̃ = √(η̂A(1−η̂A)/nA + η̂B(1−η̂B)/nB)/√(η̂p(1−η̂p)/nA + η̂p(1−η̂p)/nB) = T̃/T     (so L·T = L̃·T̃)

Thus, the quantity

√(η̂A(1−η̂A)/n + η̂B(1−η̂B)/n)/√(η̂p(1−η̂p)/n + η̂p(1−η̂p)/n) = √(η̂A(1−η̂A) + η̂B(1−η̂B))/√(2·η̂p(1−η̂p)) = 0.994

can be seen as a measure of the effect of using the pooled sample proportion. This effect is small in this exercise, but it could be higher in other situations. As regards the case ηA = η = ηB, it is also included in this interval, which is not informative, as it has been used as an assumption; nevertheless, the exclusion of this case would have contradicted the initial assumption.
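The unpooled and pooled intervals, and the 0.994 ratio above, can be reproduced numerically; a minimal Python sketch:

```python
import math

nA = nB = 100
pA, pB = 0.75, 0.65   # sample proportions of the treatment and control groups
r = 1.96              # normal quantile for 95% confidence

diff = pA - pB
se = math.sqrt(pA*(1 - pA)/nA + pB*(1 - pB)/nB)   # unpooled standard error
pp = (nA*pA + nB*pB) / (nA + nB)                  # pooled sample proportion
sep = math.sqrt(pp*(1 - pp)/nA + pp*(1 - pp)/nB)  # pooled standard error

print(round(diff - r*se, 4), round(diff + r*se, 4))    # unpooled interval
print(round(diff - r*sep, 4), round(diff + r*sep, 4))  # pooled interval
print(round(se / sep, 3))                              # ratio, about 0.994
```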

[CI] Minimum Sample Size

Remark 4ci: In calculating the minimum sample size to guarantee a given precision by applying the method based on the margin of error, the result is obtained using other results: the theorem giving the sampling distribution of the pivot T and the method of the pivot.


My notes:


When the proper statistic T is based on the supposition that the population variable X follows a given parametric probability distribution, the whole process can be seen as a parametric approach; when T is based on an asymptotic result, the nonparametric Central Limit Theorem is indirectly being applied. On the other hand, the method based on the Chebyshev's inequality is valid whichever the probability distribution of the population variable X and nonnegative function h(x). The Central Limit Theorem, being a nonparametric result, seems more powerful than the Chebyshev's inequality, which is based on a rough bound (see the appendixes). As a consequence, we expect the method based on this inequality to overestimate the minimum sample size. On the contrary, the number provided by the method based on the margin of error may be less trustable if the assumptions on which it is based are false.

Remark 5ci: Once there is a discrete quantity in an equation, the unknown cannot take any possible value. This implies that, strictly speaking, equalities like

E = rα/2√(σ²/n)     σ²/(n·E²) = α

may never be fulfilled for continuous E, α, σ and discrete n. Solving the equality and rounding the result upward is an alternative to solving the inequalities

Eg ≥ E = rα/2√(σ²/n)     σ²/(n·Eg²) ≤ α

where the purpose is to find the minimum n for which the (possible discrete values of the) margin of error is smaller than or equal to the given precision Eg.
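The remark can be illustrated numerically: solving the equality and rounding upward gives the same n as scanning for the smallest n whose margin of error does not exceed Eg. A minimal Python sketch with the values of the next exercise (σ = 1.8mm, Eg = 0.5mm, r = 2.58):

```python
import math

r, sigma, Eg = 2.58, 1.8, 0.5   # normal quantile, population sd, desired margin

# Solving the equality Eg = r*sqrt(sigma^2/n) for n and rounding upward...
n_exact = (r * sigma / Eg) ** 2
n_min = math.ceil(n_exact)

# ...matches the smallest integer n whose margin of error is <= Eg:
n_scan = next(n for n in range(1, 10**6) if r * math.sqrt(sigma**2 / n) <= Eg)

print(round(n_exact, 2), n_min, n_scan)   # 86.27 87 87
```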

Exercise 1ci-s

The lengths (in millimeters, mm) of metal rods produced by an industrial process are normally distributed with a standard deviation of 1.8mm. Based on a simple random sample of nine observations from this population, the 99% confidence interval was found for the population mean length to extend from 194.65mm to 197.75mm. Suppose that a production manager believes that the interval is too wide for practical use and, instead, requires a 99% confidence interval extending no further than 0.50mm on each side of the sample mean. How large a sample is needed to achieve such an interval? Apply both the method based on the confidence interval and the method based on the Chebyshev's inequality.

(From: Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)

Discussion: There is one normal population with known standard deviation. By using a sample of nine elements, a 99% confidence interval was built, I1 = [194.65mm, 197.75mm], of length 197.75mm − 194.65mm = 3.1mm and margin of error 3.1mm/2 = 1.55mm. A narrower interval is desired, and the number of data necessary in the new sample must be calculated. More data will be necessary for the new margin of error to be smaller (0.50 < 1.55) while the other quantities—standard deviation and confidence—are the same.

Identification of the variable:

X ≡ Length (of one metal rod) X ~ N(μ, σ² = 1.8²mm²)

Sample information:

Theoretical (simple random) sample: X1,..., Xn s.r.s. (the lengths of n rods are taken)

Margin of error:

We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot to take the expression from the formula of the interval.

I1−α = [X̄ − rα/2√(σ²/n), X̄ + rα/2√(σ²/n)]

If we remember the expression, we can use it. Either way, the margin of error (for one normal population with known variance) is:

E = rα/2√(σ²/n)

Sample size

Method based on the confidence interval: We want the margin of error E to be smaller than or equal to the given Eg,

Eg ≥ E = rα/2√(σ²/n) → Eg² ≥ rα/2²·σ²/n → n ≥ rα/2²·(σ/Eg)² = 2.58²·(1.8mm/0.5mm)² = 86.27 → n ≥ 87

since rα/2 = r0.01/2 = r0.005 = 2.58. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring, while it changes when inverting.)

Method based on the Chebyshev's inequality: For unbiased estimators, it holds that:

P(|θ̂ − θ| ≥ E) = P(|θ̂ − E(θ̂)| ≥ E) ≤ Var(θ̂)/E² ≤ α

so Var(θ̂) = Var(X̄) = σ²/n and

σ²/(n·Eg²) ≤ α → n ≥ (1/α)·(σ/Eg)² = (1/0.01)·(1.8²mm²/0.5²mm²) = 1296 → n ≥ 1296

Conclusion: At least n data are necessary to guarantee that the margin of error is equal to 0.50mm (this margin can be thought of as “the maximum error in probability”, in the sense that the distance or error |θ̂ − θ| will be smaller than Eg with a probability of 1−α = 0.99, but larger with a probability of α = 0.01). Any number of data larger than n would guarantee—and go beyond—the precision desired. As expected, more data are necessary (87 > 9) to increase the accuracy (narrower interval) with the same confidence. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

[CI] Methods and Sample Size

Exercise 1ci

The mark of an aptitude exam follows a normal distribution with standard deviation equal to 28.2. A simple random sample with nine students yields the following results:

∑j=1..9 xj = 1,098     ∑j=1..9 xj² = 138,148

a) Find a 90% confidence interval for the population mean μ.

b) Discuss without calculations whether the length of a 95% confidence interval will be smaller than, greater than or equal to the length of the interval of the previous section.

c) How large must the minimum sample size be to obtain a 90% confidence interval with length (distance between the endpoints) equal to 10? Apply the method based on the confidence interval and also the method based on the Chebyshev's inequality.


My notes:


Discussion: The supposition that the normal distribution is an appropriate model for the variable mark should be evaluated. The method of the pivot will be applied. After obtaining the theoretical expression of the interval, it is possible to reason on the relation confidence–length. Given the length of the interval, the expression also allows us to calculate the minimum number of data necessary. The mark can be seen as a quantity without any dimension. Finally, it is worth noticing that an approximation is used, since the mark is a discrete variable while the normal distribution is continuous.

Identification of the variable:

X ≡ Mark (of one student) X ~ N(μ, σ² = 28.2²)

Sample information:

Theoretical (simple random) sample: X1,..., X9 s.r.s. (the marks of nine students are to be taken) → n = 9

Empirical sample: x1,...,x9 → ∑j=1..9 xj = 1,098, ∑j=1..9 xj² = 138,148 (the marks have been taken)

We can see that the sample values xj themselves are unknown in this exercise; instead, information calculated from them is provided; this information must be sufficient for carrying out the calculations.

a) Method of the pivotal quantity: To choose the proper statistic with which the confidence interval is calculated, we take into account that:

• The variable follows a normal distribution.
• We are given the value of the population standard deviation σ.
• The sample size is small, n = 9, so asymptotic formulas cannot be applied.

From a table of statistics (e.g. in [T]), the pivot

T(X; μ) = (X̄ − μ)/√(σ²/n) ∼ N(0,1)

is selected. Then

1 − α = P(lα/2 ≤ T(X; μ) ≤ rα/2) = P(−rα/2 ≤ (X̄ − μ)/√(σ²/n) ≤ +rα/2) = P(−rα/2√(σ²/n) ≤ X̄ − μ ≤ +rα/2√(σ²/n))
= P(−X̄ − rα/2√(σ²/n) ≤ −μ ≤ −X̄ + rα/2√(σ²/n)) = P(X̄ + rα/2√(σ²/n) ≥ μ ≥ X̄ − rα/2√(σ²/n))

so

I1−α = [X̄ − rα/2√(σ²/n), X̄ + rα/2√(σ²/n)]

where rα/2 is the value of the standard normal distribution verifying P(Z > rα/2) = α/2, that is, the value such that an area equal to α/2 is on the right (upper tail).

Substitution: We calculate the quantities in the formula,

• x̄ = (1/9)∑j=1..9 xj = 1,098/9 = 122
• A 90% confidence level implies that α = 0.1, and the quantile rα/2 = r0.05 = 1.645 is in the table.
• From the statement, σ = 28.2
• Finally, n = 9

Thus, the interval is

I0.9 = [122 − 1.645·28.2/√9, 122 + 1.645·28.2/√9] = [106.54, 137.46]
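The interval in (a) can be verified numerically; a minimal Python sketch (1.645 is the normal quantile used above):

```python
import math

n, sum_x, sigma = 9, 1098, 28.2
r = 1.645   # normal quantile r_0.05 for 90% confidence

xbar = sum_x / n                 # 1098 / 9 = 122
half = r * sigma / math.sqrt(n)  # margin of error
print(round(xbar - half, 2), round(xbar + half, 2))   # 106.54 137.46
```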

b) Length of the interval: To answer this question it is possible to argue that, when all the parameters but the length are fixed, if higher certainty is desired it is necessary to widen the interval, that is, to increase the distance between the two endpoints. The formal way to justify this idea consists in using the formula of the interval:

L = (X̄ + rα/2√(σ²/n)) − (X̄ − rα/2√(σ²/n)) = 2·rα/2√(σ²/n)

Now, if σ and n remain unchanged, to study how L changes with α it is enough to see how the quantile “moves”. For the 95% interval:

• α = 0.05 → α decreases with respect to the value in section (a)
• Now rα/2 must leave less area (probability) on the right → rα/2 increases → L increases

In short, when the tails (α) get smaller the interval (1−α) gets wider, and vice versa.

c) Sample size:

Method based on the confidence interval: Now the 90% confidence interval of the first section is revisited. For given α and Lg, the value of n must be found. From the expression of the length,

Lg ≥ L = 2rα/2√(σ²/n) → Lg² ≥ 2²·rα/2²·σ²/n → n ≥ (2·rα/2·σ/Lg)² = (2·1.645·28.2/10)² = 86.08 → n ≥ 87

(Only when inverting must the inequality be changed.)

Method based on the Chebyshev's inequality: For unbiased estimators:

P(|θ̂ − θ| ≥ E) = P(|θ̂ − E(θ̂)| ≥ E) ≤ Var(θ̂)/E² ≤ α

so Var(θ̂) = Var(X̄) = σ²/n and

σ²/(n·Eg²) ≤ α → n ≥ σ²/(α·Eg²) = 28.2²/(0.1·(10/2)²) = 318.10 → n ≥ 319

Conclusion: Given the other quantities, confidence grows with the length, and vice versa. If a value greater than n were considered, a higher accuracy interval would be obtained; nevertheless, in practice usually this would also imply higher expense of both time and money. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
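The two minimum sample sizes of section (c) can be reproduced in a few lines; a minimal Python sketch:

```python
import math

sigma, alpha, Lg = 28.2, 0.1, 10.0
r = 1.645        # normal quantile r_0.05
Eg = Lg / 2      # margin of error: half of the desired length

n_ci = math.ceil((2 * r * sigma / Lg) ** 2)      # interval-based method
n_cheb = math.ceil(sigma**2 / (alpha * Eg**2))   # Chebyshev-based method
print(n_ci, n_cheb)   # 87 319
```

As remark 4ci anticipates, the Chebyshev bound asks for far more data than the method based on the sampling distribution.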


My notes:


Exercise 2ci

A 64-element simple random sample of petrol consumption (litres per 100 kilometers, u) in private cars has been taken, yielding a mean consumption of 9.36u and a standard deviation of 1.4u. Then:

a) Obtain a 96% confidence interval for the mean consumption.

b) Assume both normality (for the consumption) and variance σ² = 2u². How large must the sample be if, with the same confidence, we want the maximum error to be a quarter of litre? Apply the method based on the confidence interval and the method based on the Chebyshev's inequality.

(From the 2007 exams for access to the Spanish university.)

Discussion: For 64 data, asymptotic results can be applied. The method of the pivotal quantity will be applied. The role of the number 100 is no other than being part of the units in which the data are measured. For the second section, additional suppositions—added by myself—are considered; in a real-world situation they should be evaluated.

Identification of the variable:

C ≡ Consumption (of one private car, measured in litres per 100 kilometers) C ~ ?

Sample information:

Theoretical (simple random) sample: C = (C1,...,C64) s.r.s. → n = 64

Empirical sample: c = (c1,...,c64) → c̄ = 9.36u, s = 1.4u

The values cj of the sample are unknown; instead, the evaluation of some statistics is given. These quantities must be sufficient for the calculations, so formulas must involve C̄ and s².

a) Confidence interval: To select the pivot, we take into account:

• Nothing is said about the probability distribution of the variable of interest.
• The sample size is big, n = 64 (>30), so an asymptotic expression can be used.
• The population variance is unknown, but it is estimated by the sample variance.

From a table of statistics (e.g. in [T]), the following pivot is selected

T(C; μ) = (C̄ − μ)/√(S²/n) →d N(0,1)

where S² will be calculated by applying the relation n·s² = (n−1)·S². By applying the method of the pivot:

1 − α = P(lα/2 ≤ T(C; μ) ≤ rα/2) = P(−rα/2 ≤ (C̄ − μ)/√(S²/n) ≤ +rα/2) = P(−rα/2√(S²/n) ≤ C̄ − μ ≤ +rα/2√(S²/n))
= P(−C̄ − rα/2√(S²/n) ≤ −μ ≤ −C̄ + rα/2√(S²/n)) = P(C̄ + rα/2√(S²/n) ≥ μ ≥ C̄ − rα/2√(S²/n))

Then, the confidence interval is



I1−α = [C̄ − rα/2√(S²/n), C̄ + rα/2√(S²/n)]

where rα/2 is the quantile such that P(Z > rα/2) = α/2.

Substitution: We calculate the quantities in the formula,

• Sample mean c̄ = 9.36u.
• For a confidence of 96%, α = 0.04 and rα/2 = r0.04/2 = r0.02 = l0.98 = 2.054.
• The sample quasivariance is S² = (n/(n−1))·s² = (64/63)·1.4²u² = 1.99u².
• Finally, n = 64.

The interval is

I0.96 = [9.36u − 2.054√(1.99u²/64), 9.36u + 2.054√(1.99u²/64)] = [9.00u, 9.72u]

b) Minimum sample size:

Method based on the confidence interval: To select the pivot, we take into account the new suppositions:

• The variable of interest follows a normal distribution
• The population mean is being studied
• The population variance is known

From a table of statistics (e.g. in [T]), the following pivot is selected (now the exact sampling distribution is known, instead of the asymptotic distribution)

T(C̄; μ) = (C̄ − μ) / √(σ²/n) ∼ N(0,1)

By doing calculations similar to those of the previous section or exercise, the interval is

I₁₋α = [ C̄ − rα/2·√(σ²/n) , C̄ + rα/2·√(σ²/n) ]

from which the expression of the margin of error is obtained, namely: E = rα/2·√(σ²/n). Values can be substituted either before or after manipulating the inequality; this time let us use numbers from the beginning:

Eg = (1/4)u ≥ E = 2.054·√(2u²/n) → (1/4)²u² ≥ 2.054²·(2u²/n) → n ≥ 4²·2.054²·2 = 135.01 → n ≥ 136

(When inverting, the inequality must be reversed.)

Method based on Chebyshev's inequality: For unbiased estimators:

P(|θ̂ − θ| ≥ E) = P(|θ̂ − E(θ̂)| ≥ E) ≤ Var(θ̂)/E² ≤ α

so Var(θ̂) = Var(C̄) = σ²/n


(σ²/n)/Eg² ≤ α → n ≥ σ²/(α·Eg²) = 2u² / (0.04·((1/4)u)²) = 800
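The two minimum sample sizes can be verified numerically. A minimal sketch in Python (this document's own snippets use R); it deliberately uses the rounded table quantile 2.054, as the text does:

```python
# Minimum sample size for margin of error 1/4 u: both methods.
from math import ceil

sigma2 = 2.0   # assumed population variance (u²)
E_g = 0.25     # desired margin of error (u)
alpha = 0.04
r = 2.054      # tabulated N(0,1) quantile r_{α/2}

n_pivot = ceil(r**2 * sigma2 / E_g**2)     # confidence-interval method
n_cheby = ceil(sigma2 / (alpha * E_g**2))  # Chebyshev method

print(n_pivot, n_cheby)  # → 136 800
```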

Conclusion: The unknown mean petrol consumption of the population of private cars belongs to the interval obtained with 96% confidence. For 64 data, the margin of error was 2.054·√(1.99u²/64) = 0.36u, while 136 data are needed for the margin to be 1/4 = 0.250. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Exercise 3ci

You have been hired by a consortium of dairy farmers to conduct a survey about the consumption of milk. Based on results from a pilot study, assume that σ = 8.7oz. Suppose that the amount of milk is normally distributed. If you want to estimate the mean amount of milk consumed daily by adults:

(a) How many adults must you survey if you want 95% confidence that your sample mean is in error by no more than 0.5oz? Apply both the method based on the confidence interval and the method based on Chebyshev's inequality.

(b) Calculate the margin of error if the number of data in the sample were twice the minimum (rounded) value that you obtained. Is the margin of error now half the value it was?

(Based on an exercise of: Elementary Statistics. Triola, M.F. Pearson.)

CULTURAL NOTE (From: Wikipedia.)

A fluid ounce (abbreviated fl oz, fl. oz. or oz. fl., old forms ℥, fl ℥, f℥, ƒ℥) is a unit of volume (also called capacity) typically used for measuring liquids. It is equivalent to approximately 30 millilitres. Whilst various definitions have been used throughout history, two remain in common use: the imperial and the United States customary fluid ounce. An imperial fluid ounce is 1⁄20 of an imperial pint, 1⁄160 of an imperial gallon or approximately 28.4 ml. A US fluid ounce is 1⁄16 of a US fluid pint, 1⁄128 of a US fluid gallon or approximately 29.6 ml. The fluid ounce is distinct from the ounce, a unit of mass; however, it is sometimes referred to simply as an "ounce" where context makes the meaning clear.

Discussion: There is one normal population with known standard deviation. In both sections, the answer can be found by using the expression of the margin of error.

Identification of the variable:

X ≡ Amount of milk (consumed daily by an adult)  X ~ N(μ, σ² = 8.7²oz²)

Sample information:

Theoretical (simple random) sample: X1,...,Xn s.r.s. (the amount is measured for n adults)

Formula for the margin of error:

We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot to take the expression from the formula of the interval.


I₁₋α = [ X̄ − rα/2·√(σ²/n) , X̄ + rα/2·√(σ²/n) ]

If we remembered the expression, we could directly use it. Either way, the margin of error (for one normal population with known variance) is:

E = rα/2·√(σ²/n)

(a) Sample size

Method based on the confidence interval: The equation involves four quantities, and we can calculate any of them once the others are known. Here:

Eg ≥ E = rα/2·√(σ²/n) → Eg² ≥ rα/2²·σ²/n → n ≥ rα/2²·(σ/Eg)² = 1.96²·(8.7oz/0.5oz)² = 1163.08 → n ≥ 1164

since rα/2 = r0.05/2 = r0.025 = 1.96. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring, while it changes when inverting.)

Method based on Chebyshev's inequality: For unbiased estimators:

P(|θ̂ − θ| ≥ E) = P(|θ̂ − E(θ̂)| ≥ E) ≤ Var(θ̂)/E² ≤ α

so Var(θ̂) = Var(X̄) = σ²/n

(σ²/n)/Eg² ≤ α → n ≥ (1/α)·(σ/Eg)² = (1/0.05)·(8.7²oz²/0.5²oz²) = 6055.2 → n ≥ 6056
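Both sample-size formulas are one-liners to verify. A sketch in Python (the document's code is otherwise R; names are mine):

```python
# Minimum sample sizes for the milk survey, both methods.
from math import ceil

sigma = 8.7    # oz, assumed from the pilot study
E_g = 0.5      # oz, maximum tolerated error
alpha = 0.05
r = 1.96       # N(0,1) quantile r_{0.025}

n_pivot = ceil(r**2 * (sigma / E_g)**2)        # CI method
n_cheby = ceil((1 / alpha) * (sigma / E_g)**2) # Chebyshev method

print(n_pivot, n_cheby)  # → 1164 6056
```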

(b) Margin of error

Way 1: Just by substituting.

E = rα/2·√(σ²/n) = 1.96·√(8.7²oz²/(2·1164)) = 0.3534oz

When the sample size is doubled, the margin of error is not reduced by half but by less than this amount.

Way 2 (suggested to me by a student): By managing the algebraic expression.

Ẽ = rα/2·√(σ²/ñ) = rα/2·√(σ²/(2n)) = (1/√2)·rα/2·√(σ²/n) = E/√2 = 0.5oz/√2 = 0.3535oz

Now it is easy to see that if the sample size is multiplied by 2, the margin of error is divided by √2. Besides, more generally:

Proposition: For the confidence interval estimation of the mean of a normal population with known variance, based on the method of the pivot, when the sample size is multiplied by any scalar c the margin of error is divided by √c.
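The proposition can be illustrated directly from the formula E = rα/2·√(σ²/n). A minimal sketch in Python (names are mine; the numbers reuse this exercise's values):

```python
# E(c·n) = E(n)/√c for the normal-mean margin of error.
from math import sqrt

def margin(r, sigma2, n):
    """Margin of error E = r * sqrt(sigma² / n)."""
    return r * sqrt(sigma2 / n)

r, sigma2, n = 1.96, 8.7**2, 1164
E1 = margin(r, sigma2, n)
for c in (2, 4, 9):
    Ec = margin(r, sigma2, c * n)
    # dividing the margin by √c is the same as multiplying n by c
    assert abs(Ec - E1 / sqrt(c)) < 1e-12
```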

(Notice that 0.5 is slightly larger than the real margin of error after rounding n upward; that is why there is a small difference between the results of the two ways.)

Conclusion: At least 1164 or 6056 data are necessary to guarantee that the margin of error is equal to 0.50 (this margin can be thought of as "the maximum error in probability", in the sense that the distance or error |θ̂ − θ| will be smaller than Eg with a probability of 1−α = 0.95, but larger with a probability of α = 0.05). When the sample size is multiplied by c, the margin of error is divided by √c. Using more data would also guarantee the desired precision. The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Exercise 4ci

A company makes two products, A and B, that can be considered independent and whose demands follow the distributions N(μA, σA² = 70²u²) and N(μB, σB² = 60²u²), respectively. After analysing 500 shops, the two simple random samples yield ā = 156 and b̄ = 128.

(a) Build 95 and 98 percent confidence intervals for the difference between the population means.

(b) What are the margins of error? If sales are measured in the unit u = number of boxes, what is the unit of measure of the margin of error?

(c) A margin of error equal to 10 is desired; how many shops are necessary? Apply both the method based on the confidence interval and the method based on Chebyshev's inequality.

(d) If only product A is considered, as if product B had not been analysed, how many shops are necessary to guarantee a margin of error equal to 10? Again, apply the two methods.

LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B. Heaton. Longman.)

company. an organization that makes or sells goods or that sells services: 'My father works for an insurance company.' 'IBM is one of the biggest companies in the electronics industry.'

factory. a place where goods such as furniture, carpets, curtains, clothes, plates, toys, bicycles, sports equipment, drinks and packaged food are produced: 'The company's UK factory produces 500 golf trolleys a week.'

industry. (1) all the people, factories, companies etc involved in a major area of production: 'the steel industry', 'the clothing industry' (2) all industries considered together as a single thing: 'Industry has developed rapidly over the years at the expense of agriculture.'

mill. (1) a place where a particular type of material is made: 'a cotton mill', 'a textile mill', 'a steel mill', 'a paper mill' (2) a place where flour is made from grain: 'a flour mill'

plant. a factory or building where vehicles, engines, weapons, heavy machinery, drugs or industrial chemicals are produced, where chemical processes are carried out, or where power is generated: 'Vauxhall-Opel's UK car plants', 'Honda's new engine plant at Swindon', 'a sewage plant', 'a wood treatment plant', 'ICI's ₤100m plant', 'the Sellafield nuclear reprocessing plant in Cumbria'

works. an industrial building where materials such as cement, steel, and bricks are produced, or where industrial processes are carried out: 'The drop in car and van sales has led to redundancies in the country's steel works.'

Discussion: The supposition that the normal distribution is appropriate to model both variables should be statistically tested. The independence of the two populations should be tested as well. The method of the pivot will be applied. After obtaining the theoretical expression of the interval, it is possible to argue about the relation between confidence and length. Given the length of the interval, the expression allows us to calculate the minimum number of data necessary. The numbers of units demanded can be seen as dimensionless quantities. An approximation is implicitly being used in this exercise, since the number of units demanded is a discrete variable while the normal distribution is continuous.

(a) Confidence interval

The variables are


A ≡ Number of units of product A sold (in one shop)  A ~ N(μA, σA² = 70²u²)

B ≡ Number of units of product B sold (in one shop)  B ~ N(μB, σB² = 60²u²)

(a1) Pivot: We know that

• There are two independent normal populations
• We are interested in μA − μB
• Variances are known

Then, from a table of statistics (e.g. in [T]), we select

T(Ā, B̄; μA, μB) = [ (Ā − B̄) − (μA − μB) ] / √(σA²/nA + σB²/nB) ∼ N(0,1)

(a2) Event rewriting

1−α = P(lα/2 ≤ T(Ā,B̄; μA,μB) ≤ rα/2) = P(−rα/2 ≤ [(Ā−B̄)−(μA−μB)] / √(σA²/nA + σB²/nB) ≤ +rα/2)
    = P(−rα/2·√(σA²/nA + σB²/nB) ≤ (Ā−B̄)−(μA−μB) ≤ +rα/2·√(σA²/nA + σB²/nB))
    = P(−(Ā−B̄) − rα/2·√(σA²/nA + σB²/nB) ≤ −(μA−μB) ≤ −(Ā−B̄) + rα/2·√(σA²/nA + σB²/nB))
    = P((Ā−B̄) + rα/2·√(σA²/nA + σB²/nB) ≥ μA−μB ≥ (Ā−B̄) − rα/2·√(σA²/nA + σB²/nB))

(a3) The interval

I₁₋α = [ (Ā−B̄) − rα/2·√(σA²/nA + σB²/nB) , (Ā−B̄) + rα/2·√(σA²/nA + σB²/nB) ]

Substitution: The quantities in the formula are

• ā = 156u and b̄ = 128u
• σA² = 70²u² and σB² = 60²u²
• nA = 500 and nB = 500
• At 95%, 1−α = 0.95 → α = 0.05 → α/2 = 0.025 → rα/2 = r0.025 = l0.975 = 1.96
• At 98%, 1−α = 0.98 → α = 0.02 → α/2 = 0.01 → rα/2 = r0.01 = l0.99 = 2.326

Thus, at 95%

I₀.₉₅ = [ (156−128) − 1.96·√(70²/500 + 60²/500) , (156−128) + 1.96·√(70²/500 + 60²/500) ] = [19.92, 36.08]

and at 98%

I₀.₉₈ = [ (156−128) − 2.326·√(70²/500 + 60²/500) , (156−128) + 2.326·√(70²/500 + 60²/500) ] = [18.41, 37.59]
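The two intervals can be checked numerically. A sketch in Python (the document's own snippets are in R; variable names are mine):

```python
# Confidence intervals for μA − μB, two independent normal populations,
# known variances.
from math import sqrt

a_bar, b_bar = 156, 128
varA, varB = 70**2, 60**2
nA = nB = 500

def interval(r):
    """CI for the difference of means, given the N(0,1) quantile r."""
    half = r * sqrt(varA / nA + varB / nB)
    d = a_bar - b_bar
    return round(d - half, 2), round(d + half, 2)

print(interval(1.96))   # → (19.92, 36.08)
print(interval(2.326))  # → (18.41, 37.59)
```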

(b) Margin of error: Regarding the units, they can be treated as any other algebraic letter representing a numerical quantity. The quantile and the sample sizes are dimensionless, while the variances are expressed in the unit u² (because of the square in the definition σ² = E([X−E(X)]²)) when data X are measured in the unit u. At 95%

E₀.₉₅ = rα/2·√(σA²/nA + σB²/nB) = 1.96·√(70²u²/500 + 60²u²/500) = 1.96·√(70²/500 + 60²/500)·√u² = 8.08u

and at 98%

E₀.₉₈ = rα/2·√(σA²/nA + σB²/nB) = 2.326·√(70²u²/500 + 60²u²/500) = 9.59u

(c) Minimum sample sizes

Method based on the confidence interval: Since here both sample sizes are equal to the number of shops,

Eg ≥ E = rα/2·√(σA²/n + σB²/n) → Eg² ≥ rα/2²·(σA² + σB²)/n → n ≥ rα/2²·(σA² + σB²)/Eg² = rα/2²·(σA/Eg)² + rα/2²·(σB/Eg)²

and hence at 95% and 98%, respectively,

n ≥ 1.96²·(70²u² + 60²u²)/(10²u²) = 326.54 → n ≥ 327  and  n ≥ 2.326²·(70²u² + 60²u²)/(10²u²) = 459.87 → n ≥ 460

Method based on Chebyshev's inequality: For unbiased estimators:

P(|θ̂ − θ| ≥ E) = P(|θ̂ − E(θ̂)| ≥ E) ≤ Var(θ̂)/E² ≤ α

If Var(θ̂) = Var(Ā) + Var(B̄) = σA²/n + σB²/n, then

(σA²/n + σB²/n)/Eg² = (σA² + σB²)/(n·Eg²) ≤ α → n ≥ (σA² + σB²)/(α·Eg²) = (1/α)·(σA/Eg)² + (1/α)·(σB/Eg)²

so

n ≥ (70²u² + 60²u²)/(0.05·10²u²) = 1700  and  n ≥ (70²u² + 60²u²)/(0.02·10²u²) = 4250
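The four minimum sample sizes of this section reduce to simple arithmetic. A sketch in Python (names are mine):

```python
# Part (c): sample sizes for the two-product margin of error Eg = 10.
from math import ceil

varA, varB = 70**2, 60**2   # known population variances (u²)
Eg2 = 10**2                 # squared margin of error

n95_pivot = ceil(1.96**2 * (varA + varB) / Eg2)   # CI method, 95%
n98_pivot = ceil(2.326**2 * (varA + varB) / Eg2)  # CI method, 98%
n95_cheby = ceil((varA + varB) / (0.05 * Eg2))    # Chebyshev, 95%
n98_cheby = ceil((varA + varB) / (0.02 * Eg2))    # Chebyshev, 98%

print(n95_pivot, n98_pivot, n95_cheby, n98_cheby)  # → 327 460 1700 4250
```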

(d) Minimum sample size nA

Method based on the confidence interval: In this case, when the method of the pivotal quantity is applied (we do not repeat the calculations here), the interval and the margin of error are, respectively,

I₁₋α = [ Ā − rα/2·√(σA²/nA) , Ā + rα/2·√(σA²/nA) ]  and  E = rα/2·√(σA²/nA)

(Note that this case can be thought of as a particular case where the second population has values B = 0, μB = 0 and σB² = 0.) Then,

Eg ≥ E = rα/2·√(σA²/nA) → Eg² ≥ rα/2²·σA²/nA → nA ≥ rα/2²·σA²/Eg²


and hence at 95% and 98%, respectively,

nA ≥ 1.96²·70²u²/(10²u²) = 188.24 → nA ≥ 189  and  nA ≥ 2.326²·70²u²/(10²u²) = 265.10 → nA ≥ 266

Method based on Chebyshev's inequality: For unbiased estimators:

P(|θ̂ − θ| ≥ E) = P(|θ̂ − E(θ̂)| ≥ E) ≤ Var(θ̂)/E² ≤ α

If Var(θ̂) = Var(Ā) = σA²/nA, then

σA²/(nA·Eg²) ≤ α → nA ≥ σA²/(α·Eg²)

so

nA ≥ 70²/(0.05·10²) = 980  and  nA ≥ 70²u²/(0.02·10²u²) = 2450
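As with part (c), the one-population sizes are a direct computation. A sketch in Python (names are mine):

```python
# Part (d): sample sizes when only product A is considered.
from math import ceil

varA = 70**2   # known population variance of A (u²)
Eg2 = 10**2    # squared margin of error

nA95 = ceil(1.96**2 * varA / Eg2)        # CI method, 95%
nA98 = ceil(2.326**2 * varA / Eg2)       # CI method, 98%
nA95_cheby = ceil(varA / (0.05 * Eg2))   # Chebyshev, 95%
nA98_cheby = ceil(varA / (0.02 * Eg2))   # Chebyshev, 98%

print(nA95, nA98, nA95_cheby, nA98_cheby)  # → 189 266 980 2450
```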

Conclusion: As expected, when the probability of the tails α decreases, the margin of error (and hence the length) increases. For either one or two products and a given margin of error, the more confidence (less significance) we want, the more data we need. Since 500 shops were actually surveyed to attain this margin of error, there has been a waste of time and money: fewer shops would have sufficed for the desired accuracy (95% or 98%). When two independent quantities are added or subtracted, the error or uncertainty of the result can be as large as the total of the two individual errors or uncertainties; this also holds for random quantities (if they are dependent, a correction term, the covariance, appears). For this reason, to guarantee the same margin of error, more data are necessary in each of the two samples; notice that for two populations the minimum value is larger than or equal to the sum of the minimum values that would be necessary for each population individually (for the same precision and confidence). The minimum sample sizes provided by the two methods are quite different (see remark 4ci). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)


Hypothesis Tests

Remark 1ht: Like confidence, the concept of significance can be interpreted as a probability (so they are, although we sometimes use a 0-to-100 scale). See remark 1pt, in the appendix of Probability Theory, on the interpretation of the concept of probability.

Remark 2ht: The quantities α, p-value, β, 1–β and φ are probabilities, so their values must be between 0 and 1.

Remark 3ht: For two-tailed tests, since there are infinitely many pairs of quantiles such that P(a1 ≤ T0 ≤ a2) = 1−α, those that determine tails of probability α/2 are considered by convention. This criterion is also applied for confidence intervals.

Remark 4ht: To apply the second methodology, bounding the p-value is sometimes enough to compare it with α. To do that, the proper closest value included in the table is used.

Remark 5ht: In calculating the p-value for two-tailed tests, by convention the probability of the tail determined by T0(x,y) is doubled. When T0(X,Y) follows an asymmetric distribution, it is difficult to identify the tail if the value of T0(x,y) is close to the median. In fact, knowing the median is not necessary, since if we select the wrong tail, twice its probability will be greater than 1 and we will realize that the other tail should have been considered. Alternatively, it is always possible to calculate the two probabilities (on the left and on the right) and double the minimum of them (this is useful in writing code for software programs).

Remark 6ht: When more than one test can be applied to make a decision about the same hypotheses, the most powerful should beconsidered (if it exists).

Remark 7ht: After making a decision, it is possible to evaluate the strength with which it was made: for the first methodology, by comparing the distance from the statistic to the critical values (or, better, the area between this set of values and the density function of T0) and, for the second methodology, by looking at the magnitude of the p-value.

Remark 8ht: For small sample sizes, n = 2 or n = 3, the critical region (obtained by applying either methodology) can be plotted in two- or three-dimensional space.

[HT] Parametric

Remark 9ht: There are four types of pairs of hypotheses:
(1) simple versus simple
(2) simple versus one-sided composite
(3) one-sided composite versus one-sided composite
(4) simple versus two-sided composite
We will directly apply the Neyman-Pearson lemma in the first case. When the solution of the first case does not depend upon any particular value of the parameter θ1 under H1, the same test will be uniformly most powerful for the second case. In addition, when there is a uniformly most powerful test for the second case, it will also be uniformly most powerful for the third case.

Remark 10ht: Given H0 and α, different decisions can be made for one- and two-tailed tests. That is why: (i) describing the details of the framework is of great importance in Statistics; and (ii) as a general rule, all trustworthy information must be used, which implies that a one-sided test should be used when there is information that strongly suggests so (compare the estimate calculated from the sample with the hypothesized values).

Remark 11ht: For parametric tests, α(θ) = P(Reject H0 | θ∈Θ0) and 1−β(θ) = P(Reject H0 | θ∈Θ1), so to plot the power function φ(θ) = P(Reject H0 | θ∈Θ0∪Θ1) it is usually enough to enter θ∈Θ0 in the analytical expression of 1−β(θ). This is the method that we have used in some exercises where the computer has been used.

Remark 12ht: A reasonable testing process should verify that

1−β(θ1) = P(T0∈Rc | θ∈Θ1) > P(T0∈Rc | θ∈Θ0) = α(θ0)

with 1−β(θ1) ≈ α(θ0) when θ1 ≈ θ0. This can be noticed in the power functions plotted in some exercises, where there is a local minimum at θ0.

Remark 13ht: Since one-sided tests are, in their range of parameter values, more powerful than the corresponding two-sided test, the best way of testing an equality consists in accepting it when it is compared with the two types of inequality. Similarly, the best way to test an inequality consists in accepting it when it is allocated either in the null hypothesis or in the alternative hypothesis. (These ideas, among others, are rigorously explained in the materials of professor Alfonso Novales Cinca.)

[HT-p] Based on T

Exercise 1ht-T

The lifetime of a machine (measured in years, y) follows a normal distribution with variance equal to 4y². A simple random sample of size 100 yields a sample mean equal to 1.3y. Test the null hypothesis that the population mean is equal to 1.5y by applying a two-tailed test with 5 percent significance level. What is the type I error? Calculate the type II error when the population mean is 2y. Find the general expression of the type II error and then use a computer to plot the power function.

Discussion: First of all, the supposition that the normal distribution reasonably explains the lifetime of the machine should be evaluated by using proper statistical techniques. Nevertheless, the purpose of this exercise is basically to apply the decision-making methodologies.

Statistic: Since

• There is one normal population
• The population variance is known

the statistic

T(X̄; μ) = (X̄ − μ) / √(σ²/n) ∼ N(0,1)

is selected from a table of statistics (e.g. in [T]). Two particular cases of T will be used:

T0(X) = (X̄ − μ0) / √(σ²/n) ∼ N(0,1)  and  T1(X) = (X̄ − μ1) / √(σ²/n) ∼ N(0,1)

To apply any of the two methodologies, the value of T0 at the specific sample x = (x1,...,x100) is necessary:

T0(x) = (x̄ − μ0) / √(σ²/n) = (1.3 − 1.5) / √(4/100) = −0.2·10/2 = −1

Hypotheses: The two-tailed test is determined by

H0: μ = μ0 = 1.5  and  H1: μ = μ1 ≠ 1.5

For these hypotheses,


Decision: To make the final decision about the hypotheses, two main methodologies are available. To apply the first one, the critical values a1 and a2 that determine the rejection region are found by applying the definition of type I error, with α = 0.05 at μ0 = 1.5, and the criterion of leaving half the probability in each tail:

α(1.5) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X;μ)∈Rc | H0) = P({T0(X) < a1} ∪ {T0(X) > a2})

α(1.5)/2 = P(T0(X) < a1) → a1 = lα/2 = −1.96
α(1.5)/2 = P(T0(X) > a2) → a2 = rα/2 = +1.96

→ Rc = {T0(X) < −1.96} ∪ {T0(X) > +1.96} = {|T0(X)| > +1.96}

The decision is: T0(x) = −1 → T0(x)∉Rc → H0 is not rejected.

The second methodology is based on the calculation of the p-value:

pV = P(X more rejecting than x | H0 true) = P(|T0(X)| > |T0(x)|) = P(|T0(X)| > |−1|) = 2·P(T0(X) < −1) = 2·0.1587 = 0.32

→ pV = 0.32 > 0.05 = α → H0 is not rejected.
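The statistic and the two-tailed p-value can be recomputed in a few lines. A sketch in Python (the document's own code uses R; names are mine):

```python
# z-test statistic and two-tailed p-value for H0: μ = 1.5.
from math import sqrt
from statistics import NormalDist

n, sigma2 = 100, 4
x_bar, mu0 = 1.3, 1.5

T0 = (x_bar - mu0) / sqrt(sigma2 / n)     # standardized statistic
p_value = 2 * NormalDist().cdf(-abs(T0))  # double the tail probability

print(round(T0, 2), round(p_value, 2))  # → -1.0 0.32
```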

Type II error: To calculate β, we have to work under H1, that is, with T1. Nonetheless, the critical region is expressed in terms of T0. Thus, the mathematical trick of adding and subtracting the same quantity is applied:

β(μ1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X)∉Rc | H1) = P(|T0(X)| ≤ 1.96 | H1)
 = P(−1.96 ≤ T0(X) ≤ +1.96 | H1) = P(−1.96 ≤ (X̄−μ0)/√(σ²/n) ≤ +1.96 | H1)
 = P(−1.96 ≤ (X̄−μ1+μ1−μ0)/√(σ²/n) ≤ +1.96 | H1)
 = P(−1.96 − (μ1−μ0)/√(σ²/n) ≤ T1(X) ≤ +1.96 − (μ1−μ0)/√(σ²/n))
 = P(T1(X) ≤ +1.96 − (μ1−μ0)/√(σ²/n)) − P(T1(X) < −1.96 − (μ1−μ0)/√(σ²/n))

For the particular value μ1 = 2,

β(2) = P(T1(X) ≤ −0.54) − P(T1(X) < −4.46) = 0.29
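The analytical expression just derived can be coded directly. A sketch in Python (a cross-check of the same computation the text later does in R; names are mine):

```python
# Type II error of the two-tailed z-test, as a function of μ1.
from math import sqrt
from statistics import NormalDist

n, sigma2, mu0 = 100, 4, 1.5
z = 1.96  # critical value r_{0.025}

def beta(mu1):
    """β(μ1) = Φ(z − shift) − Φ(−z − shift), shift = (μ1−μ0)/√(σ²/n)."""
    shift = (mu1 - mu0) / sqrt(sigma2 / n)
    nd = NormalDist()
    return nd.cdf(z - shift) - nd.cdf(-z - shift)

print(round(beta(2), 2))  # → 0.29
```

At μ1 = μ0 the function returns 1−α = 0.95, as remark 12ht anticipates.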

By using a computer, many more values μ1 ≠ 2 can be considered so as to numerically determine the power curve 1−β(μ1) of the test and to plot the power function.

φ(μ) = P(Reject H0) = { α(μ) if μ∈Θ0 ; 1−β(μ) if μ∈Θ1 }

# Population
variance = 4
# Sample and inference
n = 100
alpha = 0.05
theta0 = 1.5  # Value under the null hypothesis H0
q = qnorm(1-alpha/2, 0, 1)
theta1 = seq(from=0, to=+3, 0.01)
paramSpace = sort(unique(c(theta1, theta0)))
PowerFunction = 1 - pnorm(+q - (paramSpace-theta0)/sqrt(variance/n), 0, 1) + pnorm(-q - (paramSpace-theta0)/sqrt(variance/n), 0, 1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

> pnorm(-0.54,0,1) - pnorm(-4.46,0,1)
[1] 0.2945944

Conclusion: The hypothesis that 1.5y is the mean of the distribution of the lifetime is not rejected. As expected, when the true value is supposed to be 2, far from 1.5, the probability of rejecting 1.5 is 1−β(2) = 0.71, that is, high. This value has been calculated by hand; additionally, after finding the analytical expression of the curve 1−β, also by hand, the computer allows the power function to be plotted. This theoretical curve, not depending on the sample information, is symmetric with respect to μ0 = 1.5. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Exercise 2ht-T

A company produces electric devices operated by a thermostatic control. The standard deviation of the temperature at which these controls actually operate should not exceed 2.0ºF. For a simple random sample of 20 of these controls, the sample quasi-standard deviation of operating temperatures was 2.39ºF. Stating any assumptions you need (write them), test at the 5% level the null hypothesis that the population standard deviation is not larger than 2.0ºF against the alternative that it is. Apply the two methodologies and calculate the type II error at σ² = 4.5ºF². Use a computer to plot the power function. On the other hand, between the two alternative hypotheses H1: σ = σ1 > 2 or H1: σ = σ1 ≠ 2, which one would you have selected? Why?

Hint: Be careful to use S² and σ² wherever you work with a variance instead of a standard deviation.

(Based on an exercise of Statistics for Business and Economics. Newbold, P., W.L. Carlson and B.M. Thorne. Pearson.)

LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B. Heaton. Longman.)

actual = real (as opposed to what is believed, planned or expected): 'People think he is over fifty but his actual age is forty-eight.' 'Although buses are supposed to run every fifteen minutes, the actual waiting time can be up to an hour.'
present/current = happening or existing now: 'No one can drive that car in its present condition.' 'Her current boyfriend works for Shell.'

LINGUISTIC NOTE (From: Common Errors in English Usage. Brians, P. William, James & Co.)

"Device" is a noun. A can-opener is a device. "Devise" is a verb. You can devise a plan for opening a can with a sharp rock instead. Only in law is "devise" properly used as a noun, meaning something deeded in a will.


Discussion: Because of the mathematical theorems available, we are able to study the variance only for normally distributed random variables. Thus, we need the supposition that the temperature follows a normal distribution. In practice, this normality should be evaluated.

Statistic: We know that

• There is one normal population
• The population mean is unknown

and hence the following (dimensionless) statistic, involving the sample quasivariance, is chosen

T(X; σ) = (n−1)·S²/σ² ∼ χ²ₙ₋₁

We will work with the two following particular cases:

T0(X) = (n−1)·S²/σ0² ∼ χ²ₙ₋₁  and  T1(X) = (n−1)·S²/σ1² ∼ χ²ₙ₋₁

To make the decision, we need to evaluate the statistic T0 at the specific data available x:

T0(x) = (20−1)·2.39²ºF² / (2²ºF²) = 27.13

Hypothesis test

Hypotheses: H0: σ² = σ0² ≤ 2²  and  H1: σ² = σ1² > 2²

Then,

Decision: To determine the rejection region, under H0, the critical value a is found by applying the definition of type I error, with α = 0.05 at σ0² = 4ºF²:

α(4) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X;θ)∈Rc | H0) = P(T0(X) > a)

→ a = rα = r0.05 = 30.14 → Rc = {T0(X) > 30.14}

To make the final decision: T0(x) = 27.13 < 30.14 → T0(x)∉Rc → H0 is not rejected.

The second methodology requires the calculation of the p-value:

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 27.13) = 0.102

→ pV = 0.102 > 0.05 = α → H0 is not rejected.
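The value of the statistic and the first-methodology decision reduce to plain arithmetic. A sketch in Python (the χ²₁₉ critical value 30.14 is taken from the table, as in the text; names are mine):

```python
# Chi-square statistic for the thermostatic controls.
n = 20
S = 2.39          # sample quasi-standard deviation (ºF)
sigma0_sq = 2.0**2

T0 = (n - 1) * S**2 / sigma0_sq  # (n−1)·S²/σ0²
critical = 30.14                 # tabulated upper 0.05 quantile of χ²₁₉

print(round(T0, 2), T0 > critical)  # → 27.13 False
```

Since T0 does not exceed the critical value, H0 is not rejected, matching the text.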

Type II error: To calculate β, we have to work under H1, that is, with T1. Since the critical region is already expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:

β(σ1²) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X)∉Rc | H1) = P(T0(X) ≤ 30.14 | H1)
 = P((n−1)·S²/σ0² ≤ 30.14 | H1) = P([(n−1)·S²/σ1²]·(σ1²/σ0²) ≤ 30.14 | H1) = P(T1(X) ≤ 30.14·σ0²/σ1²)

For the particular value σ1² = 4.5ºF²,

β(4.5) = P(T1(X) ≤ 30.14·4/4.5) = P(T1(X) ≤ 26.79) = 0.89

> 1 - pchisq(27.13, 20-1)
[1] 0.1016613
> pchisq(26.79, 20-1)
[1] 0.8903596

By using a computer, many other values σ1² ≠ 4.5ºF² can be considered so as to numerically determine the power curve 1−β(σ1²) of the test and to plot the power function.

φ(σ²) = P(Reject H0) = { α(σ²) if σ²∈Θ0 ; 1−β(σ²) if σ²∈Θ1 }

# Sample and inference
n = 20
alpha = 0.05
theta0 = 4  # Value under the null hypothesis H0
q = qchisq(1-alpha, n-1)
theta1 = seq(from=4, to=15, 0.01)
paramSpace = sort(unique(c(theta1, theta0)))
PowerFunction = 1 - pchisq(q*theta0/paramSpace, n-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Conclusion: The null hypothesis H0: σ = σ0 ≤ 2 is not rejected. When any of these factors is different, the decision might be the opposite. As regards the most appropriate alternative hypothesis, the value of S suggests that the test with σ1 > 2 is more powerful than the test with σ1 ≠ 2 (the test with σ1 < 2 against the equality would be the least powerful, as both the methodologies, H0 being the default hypothesis, and the data "tend to help H0"). (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Exercise 3ht-T

Let X = (X1,...,Xn) be a simple random sample with 25 data taken from a normal population variable X. The sample information is summarized in

∑ⱼ₌₁²⁵ xj = 105  and  ∑ⱼ₌₁²⁵ xj² = 579.24

(a) Should the hypothesis H0: σ² = 4 be rejected when H1: σ² > 4 and α = 0.05? Calculate β(5).

(b) And when H1: σ² ≠ 4 and α = 0.05? Calculate β(5).

Use a computer to plot the power function.

Discussion: The supposition that the normal distribution is appropriate to model X should be statistically tested; here, this statement is taken as a theoretical assumption.

Statistic: We know that

• There is one normal population
• The population mean is unknown

and hence the following statistic is selected

T(X; σ) = n·s²/σ² ∼ χ²ₙ₋₁

We will work with the two following particular cases:

T0(X) = n·s²/σ0² ∼ χ²ₙ₋₁  and  T1(X) = n·s²/σ1² ∼ χ²ₙ₋₁

To make the decision, we need to evaluate the statistic at the specific data available x:

T0(x) = 25·[ (1/25)·∑ xj² − ((1/25)·∑ xk)² ] / 4 = 25·5.53/4 = 34.56

where, to calculate the sample variance, the general property s² = (1/n)·∑ⱼ Xj² − ((1/n)·∑ⱼ Xj)² has been used.
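The computation of s² from the two sums, and then of T0, can be checked directly. A sketch in Python (names are mine):

```python
# Sample variance from the sums, then the chi-square statistic.
n = 25
sum_x, sum_x2 = 105, 579.24

s2 = sum_x2 / n - (sum_x / n)**2  # s² = (1/n)Σxj² − ((1/n)Σxj)²
T0 = n * s2 / 4                   # statistic under σ0² = 4

print(round(s2, 2), round(T0, 2))  # → 5.53 34.56
```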

(a) One-tailed alternative hypothesis

Hypotheses: H0: σ² = σ0² = 4  and  H1: σ² = σ1² > 4

For these hypotheses,

Decision: To determine the rejection region, under H0, the critical value a is found by applying the definition of type I error, with α = 0.05 at σ0² = 4:

α(4) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X; θ) ∈ Rc | H0) = P(T0(X) > a)



→ a = r_α = r_{0.05} = 36.4 → Rc = {T0(X) > 36.4}

To make the final decision: T0(x) = 34.56 < 36.4 → T0(x) ∉ Rc → H0 is not rejected.

The second methodology requires the calculation of the p-value:

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 34.56) = 0.075

→ pV = 0.075 > 0.05 = α → H0 is not rejected.
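The p-value can also be cross-checked without statistical tables. A minimal Python sketch (standard library only) evaluates the χ² upper tail through the series expansion of the regularized lower incomplete gamma function:

```python
import math

def chi2_sf(x, k):
    """P(X > x) for X ~ chi-square with k degrees of freedom,
    computed from the series for the regularized lower incomplete gamma."""
    a, t = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= t / (a + n)
        total += term
    lower = total * math.exp(a * math.log(t) - t - math.lgamma(a))
    return 1.0 - lower

print(round(chi2_sf(34.56, 24), 4))  # ≈ 0.0752, matching 1 - pchisq(34.56, 25-1)
```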

Type II error: To calculate β, we have to work under H1, that is, with T1. Since the critical region is expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:

β(σ1²) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X) ∉ Rc | H1) = P(T0(X) ≤ 36.4 | H1)
= P(n s²/σ0² ≤ 36.4 | H1) = P((n s²/σ1²)·(σ1²/σ0²) ≤ 36.4 | H1) = P(T1(X) ≤ 36.4·σ0²/σ1²)

For the particular value σ1² = 5,

β(5) = P(T1(X) ≤ 36.4·4/5) = P(T1(X) ≤ 29.12) = 0.78

By using a computer, many other values σ1² ≠ 5 can be considered so as to numerically determine the power curve 1−β(σ1²) of the test and to plot the power function.

φ(σ²) = P(Reject H0) = { α(σ²) if σ² ∈ Θ0 ; 1−β(σ²) if σ² ∈ Θ1 }

# Sample and inference

n = 25

alpha = 0.05

theta0 = 4 # Value under the null hypothesis H0

q = qchisq(1-alpha,n-1)

theta1 = seq(from=4,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pchisq(q*theta0/paramSpace, n-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

(b) Two-tailed alternative hypothesis

Hypotheses: H0: σ² = σ0² = 4 and H1: σ² = σ1² ≠ 4


> 1 - pchisq(34.56, 25-1)
[1] 0.07519706

> pchisq(29.12, 25-1)
[1] 0.7843527


For these hypotheses,

Decision: Now there are two tails, determined by two critical values a1 and a2 that are found by applying the definition of type I error, with α = 0.05 at σ0² = 4, and the criterion of leaving half the probability in each tail:

α(4) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X; θ) ∈ Rc | H0) = P(T0(X) < a1) + P(T0(X) > a2)

We always consider two tails with the same probability,

α(4)/2 = P(T0(X) < a1) → a1 = r_{1−α/2} = 12.4
α(4)/2 = P(T0(X) > a2) → a2 = r_{α/2} = 39.4
→ Rc = {T0(X) < 12.4} ∪ {T0(X) > 39.4}

To make the final decision: T0(x) = 34.56 → T0(x) ∉ Rc → H0 is not rejected

To base the decision on the p-value, we calculate twice the probability of the tail:

pV = P(X more rejecting than x | H0 true) = 2·P(T0(X) > T0(x)) = 2·P(T0(X) > 34.56) = 2·0.075 = 0.15

→ pV = 0.15 > 0.05 = α → H0 is not rejected

Note: The wrong tail would have been selected if we had obtained a p-value bigger than 1.

Type II error: To calculate β,

β(σ1²) = P(Type II error) = P(Accept H0 | H1 true) = 1 − P(T(X; θ) ∈ Rc | H1)
= 1 − P({T0(X) < 12.4} ∪ {T0(X) > 39.4} | H1) = 1 − [P(n s²/σ0² < 12.4 | H1) + P(n s²/σ0² > 39.4 | H1)]
= 1 − [P(n s²/σ1² < 12.4·σ0²/σ1² | H1) + 1 − P(n s²/σ1² ≤ 39.4·σ0²/σ1² | H1)]
= −P(n s²/σ1² < 12.4·σ0²/σ1² | H1) + P(n s²/σ1² ≤ 39.4·σ0²/σ1² | H1)
= P(T1(X) ≤ 39.4·σ0²/σ1²) − P(T1(X) < 12.4·σ0²/σ1²)

For the particular value σ1² = 5,

β(5) = P(T1(X) ≤ 31.52) − P(T1(X) < 9.92) = 0.86 − 0.0051 = 0.85

Again, the computer allows the power function to be plotted.

# Sample and inference

n = 25

alpha = 0.05

theta0 = 4 # Value under the null hypothesis H0

q = qchisq(c(alpha/2,1-alpha/2),25-1)


> 1 - pchisq(34.56, 25-1)
[1] 0.07519706

> pchisq(c(9.92, 31.52), 25-1)
[1] 0.00513123 0.86065162


theta1 = seq(from=0,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pchisq(q[2]*theta0/paramSpace, n-1) + pchisq(q[1]*theta0/paramSpace, n-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Comparison of the power functions: For the one-tailed test, the power of the test at σ1² = 5 is 1−β(5) = 1−0.78 = 0.22, while for the two-tailed test it is 1−β(5) = 1−0.85 = 0.15. As expected, this latter test has smaller power (higher type II error), since in the former test additional information is being used when one tail is previously discarded. Now we compare the power functions of the two tests graphically, for the common values (> 4), by using the code

# Sample and inference
n = 25
alpha = 0.05
theta0 = 4 # Value under the null hypothesis H0
q = qchisq(c(alpha/2,1-alpha/2),25-1)
theta1 = seq(from=0,to=15,0.01)
paramSpace1 = sort(unique(c(theta1,theta0)))
PowerFunction1 = 1 - pchisq(q[2]*theta0/paramSpace1, n-1) + pchisq(q[1]*theta0/paramSpace1, n-1)
q = qchisq(1-alpha,n-1)
theta1 = seq(from=4,to=15,0.01)
paramSpace2 = sort(unique(c(theta1,theta0)))
PowerFunction2 = 1 - pchisq(q*theta0/paramSpace2, n-1)
plot(paramSpace1, PowerFunction1, xlim=c(0,15), xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
lines(paramSpace2, PowerFunction2, lty=2)

It can be noticed that the curve of the one-sided test lies above the curve of the two-sided test for any σ² > 4,



which makes it uniformly more powerful. In this exercise, from the sample information we could have calculated the estimator S² of σ² so as to see whether its value is far from 4 and therefore one of the two one-sided tests should be considered better.

Conclusion: The hypothesis that the population variance is equal to 4 is not rejected in either of the two sections. Although it has not happened in this case, different decisions may be made for the one- and two-tailed cases. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Exercise 4ht-T

Imagine that you are hired as a cook. Not an ordinary one but a "statistical cook." For a normal population, in testing the two hypotheses

H0: σ² = σ0² = 4
H1: σ² = σ1² > 4

the data (sample x of size n = 11 such that S² = 7.6 u²) and the significance (α = 0.05) have led to rejecting the null hypothesis because T0(x) = 19 > 18.31 = r_{0.05}, where T0 is the usual statistic. A decision depends on several factors:

• Methodology
• Statistic T0
• Form of the alternative hypothesis H1
• Significance α
• Data x


Since the chef—your boss—wants the null hypothesis H0 not to be rejected, find three different ways to scientifically make the opposite decision by changing any of the previous factors. Give qualitative explanations and, if possible, quantitative ones.

Discussion: Metaphorically, Statistics can be thought of as the kitchen with its utensils and appliances, the first two factors as the recipe, and the next three items as the ingredients—if H1, α or x are inappropriate, there is little to do and it does not matter how good the kitchen, the recipe and you are. Our statistical knowledge allows us to change only the last three elements. The statistic to study the variance of a normal population is

T(X) = (n−1)S²/σ² ∼ χ²_{n−1} so, under H0, T0(x) = (n−1)S²/σ0² = (11−1)·7.6 u²/(4 u²) = 76/4 = 19.



[Figure: χ²_{10} density with the critical value r_{0.05} = 18.31 and the evaluation T0(x) = 19 in the rejection tail, which has probability α; 1−α is left to its left.]


Qualitative reasoning: By looking at the figure above, we consider that:

A) If a two-tailed test is considered (H1: σ² = σ1² ≠ 4), the critical value would be r_{α/2} instead of r_α and, then, the evaluation T0(x) may not lie in the rejection region (tails).

B) Equivalently, for the original one-tailed test, the critical value r_α increases when the significance α decreases, perhaps with the same implication as in the previous item.

C) Finally, for the same one-sided alternative hypothesis and significance, that is, for the same critical value r_α, the evaluation T0(x) would lie out of the critical region (tail) if the data x—the values themselves or only the sample size—are such that T0(x) < r_α = 18.31.

D) Additionally, a fourth way could consist of some combinations of the previous ways.

Quantitative reasoning: The previous qualitative explanations can be supported with calculations.

A) For the two-tailed test, now the critical value would be r_{0.05/2} = r_{0.025} = 20.48. Then
T0(x) = 19 < 20.48 = r_{0.025} → T0(x) ∉ Rc → H0 is not rejected.

B) The same effect is obtained if, for the original one-tailed H1, the significance is taken to be 0.025 instead of 0.05. Any other value smaller than 0.025 would lead to the same result. Is 0.025—suggested by the previous item—the smallest possible value? The answer is obtained by using the p-value, since it is sometimes defined as the smallest significance level at which the null hypothesis is rejected. Then, since

pV = P(X more rejecting than x | H0 true) = P(T0(X) > 19) = 0.0403

for any α < 0.0403 it would hold that

0.0403 = pV > α → H0 is not rejected

C) Finally, for the original test and the same value for n, since

T̃0(x) = (n−1)S̃²/σ0² = (S̃²/S²)·(n−1)S²/σ0² = (S̃²/(7.6 u²))·19 < 18.31 = r_α

the opposite decision would be made for any sample quasivariance such that

S̃² < (18.31/19)·7.6 u² = 7.324 u² → T̃0(x) ∉ Rc → H0 is not rejected

On the other hand, for the original test and the same value for S, since

T̃0(x) = (ñ−1)S²/σ0² = ((ñ−1)/(n−1))·(n−1)S²/σ0² = ((ñ−1)/(11−1))·19 < 18.31 = r_α

the opposite decision would be made for any sample size such that

ñ < 18.31·(11−1)/19 + 1 = 10.63684 ↔ ñ ≤ 10 → T̃0(x) ∉ Rc → H0 is not rejected
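The two thresholds obtained in item C) are plain arithmetic and can be re-derived with a short Python sketch (standard library only; the variable names are illustrative):

```python
r_alpha = 18.31   # critical value r_0.05 of the chi-square with 10 df
t0 = 19.0         # observed value (11-1)*7.6/4
n, S2 = 11, 7.6

# Largest quasivariance that avoids rejection, keeping n fixed
S2_max = r_alpha * S2 / t0
# Sample-size bound that avoids rejection, keeping S^2 fixed
n_bound = r_alpha * (n - 1) / t0 + 1
print(round(S2_max, 3), round(n_bound, 5))  # → 7.324 10.63684
```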

D) Some combinations of the previous changes can easily be proved to lead to not rejecting H0.

Conclusion: This exercise highlights how careful one must be when either writing or reading statistical works.


> 1 - pchisq(19, 11-1)
[1] 0.04026268



Exercise 5ht-T

A variable is supposed to be normally distributed in two independent biological populations. The two population variances must be compared. After gathering information through simple random samples of sizes nX = 11, nY = 10, respectively, we are given the value of the estimators

S_X² = (1/(nX−1)) ∑_{j=1}^{nX} (x_j − x̄)² = 6.8 and s_Y² = (1/nY) ∑_{j=1}^{nY} (y_j − ȳ)² = 7.1

For α = 0.1, test: (a) H0: σX = σY against H1: σX < σY

(b) H0: σX = σY against H1: σX > σY

(c) H0: σX = σY against H1: σX ≠ σY

In each section, calculate the analytical expression of the type II error and plot the power function by using a computer.

Discussion: In a real-world situation, suppositions should be proved. We must pay careful attention to the details: the sample quasivariance is provided for one group, while the sample variance is given for the other.

Statistic: From the information in the statement,

• There are two independent normal populations
• The population means are unknown

the statistic

T(X, Y; σX, σY) = (S_X²/σX²) / (S_Y²/σY²) = S_X² σY² / (S_Y² σX²) ∼ F_{nX−1, nY−1}

is selected from a table of statistics (e.g. in [T]). It will be used in two forms (we can write σX²/σY² = θ1):

T0(X, Y) = (S_X²/σX²) / (S_Y²/σY²) = S_X²/S_Y² ∼ F_{nX−1, nY−1} and T1(X, Y) = (S_X²/(θ1·σY²)) / (S_Y²/σY²) = (1/θ1)·S_X²/S_Y² ∼ F_{nX−1, nY−1}

(On the other hand, the pooled sample variance S_p² should not be considered even under H0: σX = σ = σY, as T0 = S_p²/S_p² = 1 whatever the data are.) To apply either of the two methodologies we need to evaluate T0 at

the samples x and y:

T0(x, y) = S_X²/S_Y² = S_X² / ((nY/(nY−1))·s_Y²) = 6.8 / ((10/9)·7.1) = 0.86

Since we were given the sample quasivariance of population X, but the sample variance of population Y, the general property n s² = (n−1)S² has been used to calculate S_Y².
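The conversion from s_Y² to S_Y² and the resulting value of the statistic can be verified with a few lines of Python (standard library only):

```python
nx, ny = 11, 10
Sx2 = 6.8   # sample quasivariance of X (given)
sy2 = 7.1   # sample variance of Y (given)

# n*s^2 = (n-1)*S^2, hence S_Y^2 = nY/(nY-1) * s_Y^2
Sy2 = ny / (ny - 1) * sy2
t0 = Sx2 / Sy2
print(round(Sy2, 3), round(t0, 2))  # → 7.889 0.86
```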

106 Solved Exercises and Problems of Statistical Inference

Page 110: Solved Exercises and Problems of Statistical Inferencecasado-d.org/edu/ExercisesProblemsStatisticalInference.pdf · Solved Exercises and Problems of Statistical Inference ... Mathematics

(a) One-tailed alternative hypothesis σX < σY

Hypotheses: H0: σX² = σY² and H1: σX² < σY²

Or, equivalently, H0: σX²/σY² = θ0 = 1 and H1: σX²/σY² = θ1 < 1

For these hypotheses,

Decision: To determine the critical region, under H0, the critical value a is found by applying the definition of type I error, with α = 0.1 at θ0 = 1:

α(1) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X, Y) < a | H0) = P(T0(X, Y) < a)

→ 0.1 = P(T0(X, Y) < a) = P(1/T0(X, Y) > 1/a) → 1/a = 2.35
→ a = r_{1−α} = 1/2.35 = 0.43 → Rc = {T0(X, Y) < 0.43}

To make the final decision about the hypotheses:

T0(x, y) = 0.86 → T0(x, y) ∉ Rc → H0 is not rejected.

The second methodology requires the calculation of the p-value:

pV = P(X, Y more rejecting than x, y | H0 true) = P(T0(X, Y) < T0(x, y)) = P(T0(X, Y) < 0.86) = 0.41

→ pV = 0.41 > 0.1 = α → H0 is not rejected.

Power function: To calculate β, we have to work under H1, that is, with T1. Since in this case the critical region is already expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:

β(θ1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X, Y) ∉ Rc | H1) = P(T0(X, Y) ≥ 0.43 | H1)
= P(S_X²/S_Y² ≥ 0.43 | H1) = P((1/θ1)·S_X²/S_Y² ≥ 0.43/θ1 | H1) = P(T1(X, Y) ≥ 0.43/θ1) = 1 − P(T1(X, Y) < 0.43/θ1)

By using a computer, many values θ1 can be considered so as to determine the power curve 1−β(θ1) of the test and to plot the power function.

φ(θ) = P(Reject H0) = { α(θ) if θ ∈ Θ0 ; 1−β(θ) if θ ∈ Θ1 }

# Sample and inference

nx = 11; ny = 10

alpha = 0.1

theta0 = 1

q = qf(alpha,nx-1,ny-1)

theta1 = seq(from=0,to=1,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = pf(q/paramSpace, nx-1, ny-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')


> pf(0.86, 11-1, 10-1)
[1] 0.406005

(From the definition of the F distribution, it is easy to see that if X follows an F_{k1,k2} then 1/X follows an F_{k2,k1}. We use this property to consult our table.)


(b) One-tailed alternative hypothesis σX > σY

Hypotheses: H0: σX² = σY² and H1: σX² > σY²

Or, equivalently, H0: σX²/σY² = θ0 = 1 and H1: σX²/σY² = θ1 > 1

For these hypotheses,

Decision: To apply the methodology based on the rejection region, the critical value a is found by applying the definition of type I error, with α = 0.1 at θ0 = 1:

α(1) = P(Type I error) = P(Reject H0 | H0 true) = P(T(X, Y) > a | H0) = P(T0(X, Y) > a)
→ a = r_α = 2.42 → Rc = {T0(X, Y) > 2.42}

The final decision is: T0(x, y) = 0.86 → T0(x, y) ∉ Rc → H0 is not rejected.

The second methodology requires the calculation of the p-value:

pV = P(X, Y more rejecting than x, y | H0 true) = P(T0(X, Y) > T0(x, y)) = P(T0(X, Y) > 0.86) = 1 − 0.41 = 0.59

→ pV = 0.59 > 0.1 = α → H0 is not rejected.

Power function: Now

β(θ1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X, Y) ∉ Rc | H1) = P(T0(X, Y) ≤ 2.42 | H1)
= P(S_X²/S_Y² ≤ 2.42 | H1) = P((1/θ1)·S_X²/S_Y² ≤ 2.42/θ1 | H1) = P(T1(X, Y) ≤ 2.42/θ1)

By using a computer, many values θ1 can be considered so as to plot the power function.


> pf(0.86, 11-1, 10-1)
[1] 0.406005


# Sample and inference

nx = 11; ny = 10

alpha = 0.1

theta0 = 1

q = qf(1-alpha,nx-1,ny-1)

theta1 = seq(from=1,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pf(q/paramSpace, nx-1, ny-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

(c) Two-tailed alternative hypothesis σX ≠ σY

Hypotheses: H0: σX² = σY² and H1: σX² ≠ σY²

Or, equivalently, H0: σX²/σY² = θ0 = 1 and H1: σX²/σY² = θ1 ≠ 1

For these hypotheses,

Decision: For the first methodology, the critical region must be determined by applying the definition of type I error, with α = 0.1 at θ0 = 1, and the criterion of leaving half the probability in each tail:

α(1) = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X, Y) < a1) + P(T0(X, Y) > a2)
→ α(1)/2 = P(T0(X, Y) < a1) → a1 = l_{α/2} = 0.33
α(1)/2 = P(T0(X, Y) > a2) → a2 = r_{α/2} = 3.14
→ Rc = {T0(X, Y) < 0.33} ∪ {T0(X, Y) > 3.14}

The decision depends on whether the evaluation of T0 is in the rejection region:

T0(x, y) = 0.86 → T0(x, y) ∉ Rc → H0 is not rejected.


> qf(c(0.05, 0.95), 11-1, 10-1)
[1] 0.3310838 3.1372801


To apply the methodology based on the p-value, we calculate the median qf(0.5, 11-1, 10-1) = 1.007739; thus, since T0(x, y) = 0.86 is in the left-hand tail:

pV = P(X, Y more rejecting than x, y | H0 true) = 2·P(T0(X, Y) < T0(x, y)) = 2·P(T0(X, Y) < 0.86) = 2·0.41 = 0.82

→ pV = 0.82 > 0.1 = α → H0 is not rejected.

If you cannot calculate the median, try the tail you trust most and change it if a value bigger than 1 is obtained after doubling the probability.

Power function: Now

β(θ1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X, Y) ∉ Rc | H1)
= P(0.33 ≤ T0(X, Y) ≤ 3.14 | H1) = P(0.33 ≤ S_X²/S_Y² ≤ 3.14 | H1) = P(0.33/θ1 ≤ (1/θ1)·S_X²/S_Y² ≤ 3.14/θ1 | H1)
= P(0.33/θ1 ≤ T1(X, Y) ≤ 3.14/θ1) = P(T1(X, Y) ≤ 3.14/θ1) − P(T1(X, Y) < 0.33/θ1)

By using a computer, many values θ1 can be considered in order to plot the power function.

# Sample and inference

nx = 11; ny = 10

alpha = 0.1

theta0 = 1

q = qf(c(alpha/2, 1-alpha/2),nx-1,ny-1)

theta1 = seq(from=0,to=15,0.01)

paramSpace = sort(unique(c(theta1,theta0)))

PowerFunction = 1 - pf(q[2]/paramSpace, nx-1, ny-1) + pf(q[1]/paramSpace, nx-1, ny-1)

plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Comparison of the power functions: Now we compare the power functions of the three tests graphically, by using the code

# Sample and inference
nx = 11; ny = 10
alpha = 0.1
theta0 = 1
q = qf(c(alpha/2, 1-alpha/2),nx-1,ny-1)
theta1 = seq(from=0,to=15,0.01)
paramSpace1 = sort(unique(c(theta1,theta0)))
PowerFunction1 = 1 - pf(q[2]/paramSpace1, nx-1, ny-1) + pf(q[1]/paramSpace1, nx-1, ny-1)
q = qf(alpha,nx-1,ny-1)
theta1 = seq(from=0,to=1,0.01)
paramSpace2 = sort(unique(c(theta1,theta0)))
PowerFunction2 = pf(q/paramSpace2, nx-1, ny-1)
q = qf(1-alpha,nx-1,ny-1)



theta1 = seq(from=1,to=15,0.01)
paramSpace3 = sort(unique(c(theta1,theta0)))
PowerFunction3 = 1 - pf(q/paramSpace3, nx-1, ny-1)
plot(paramSpace1, PowerFunction1, xlim=c(0,15), xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')
lines(paramSpace2, PowerFunction2, lty=2)
lines(paramSpace3, PowerFunction3, lty=2)

It can be seen that the curves of the one-sided tests lie above the curve of the two-sided test for any θ1—in its region each one-sided test has more power than the two-sided test, since additional information is used when one tail is discarded. Then, each of the two one-sided tests is uniformly more powerful than the two-sided test in their respective common domains.

Conclusion: The hypothesis that the population variance is equal in the two biological populations is not rejected when tested against any of the three alternative hypotheses. Although it has not happened in this case, different decisions can be made for the one- and two-tailed tests. In this exercise, the empirical value T0(x) = S_X²/S_Y² = 0.86 suggests the alternative hypothesis H1: σX²/σY² < 1. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Exercise 6ht-T

Two simple random samples of 700 citizens of Italy and Russia yielded, respectively, that 53% of Italian people and 47% of Russian people wish to visit Spain within the next ten years. Should we conclude, with confidence 0.99, that the Italians' desire is higher than the Russians'? Determine the critical region and make a decision. What is the type I error? Calculate the p-value and apply the methodology based on the p-value to make a decision.

1) Allocate the question in the null hypothesis. Calculate the type II error for the value –0.1.

2) Allocate the question in the alternative hypothesis. Calculate the type II error for the value +0.1.

Use a computer to plot the power function.




Discussion: After reading the statement (possibly twice, if necessary), we realize that there are two independent populations whose citizens have been set a question with two possible answers (a dichotomic situation). Then, each individual can be thought of as—modeled through—a Bernoulli variable. In practice, the implicit supposition that the same parameter η governs the behaviour of all the individuals should still be evaluated for each population (a sort of homogeneity to analyse whether or not several subpopulations should be considered). The independence of the two populations should be studied as well. Either way, in this exercise we will merely apply the testing methodologies.

The sample proportions of those who said 'yes' are given: η̂I = 0.53 and η̂R = 0.47, respectively. If ηI and ηR are the theoretical proportions of the populations, that is, the quantities we want to compare, we need to test the hypothesis ηI > ηR (one-tailed test).

Should this hypothesis be written as a null or as an alternative hypothesis? In general, since we fix the type I error in our methodologies, strong sample evidence is necessary to reject H0. Thus, the decision of allocating the condition to be tested in H0 or H1 depends on our choice (usually on what "making a wrong decision" means or implies for the specific framework we are working in). We are going to solve both cases. From a theoretical point of view, H0: ηI ≥ ηR is essentially the same as H0: ηI = ηR.

As a final remark, in this exercise it holds that 0.53 + 0.47 = 1; this happens just by chance, since these two quantities are independent and can take any value in [0,1]. On the other hand, proportions are always dimensionless.

Statistic: We know that

• There are two independent Bernoulli populations
• The sample sizes are larger than 30

so we use the asymptotic result involving two proportions:

T(I, R) = [(η̂I − η̂R) − (ηI − ηR)] / √( ?I(1−?I)/nI + ?R(1−?R)/nR ) →d N(0,1)

where each ? must be substituted by the best possible information: supposed or estimated. Two particular versions of this statistic will be used:

T0(I, R) = [(η̂I − η̂R) − θ0] / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) →d N(0,1) and
T1(I, R) = [(η̂I − η̂R) − θ1] / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) →d N(0,1)

To determine the critical region or to calculate the p-value, both under H0, we need the value of the statistic for the particular samples available:

T0(i, r) = [(0.53 − 0.47) − 0] / √( 0.53·(1−0.53)/700 + 0.47·(1−0.47)/700 ) = 2.25
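The value 2.25 follows from elementary arithmetic, as the following Python sketch shows (standard library only; the variable names are illustrative):

```python
import math

ni = nr = 700
eta_i, eta_r = 0.53, 0.47   # sample proportions

se = math.sqrt(eta_i * (1 - eta_i) / ni + eta_r * (1 - eta_r) / nr)
t0 = ((eta_i - eta_r) - 0) / se
print(round(t0, 2))  # → 2.25
```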

1) Question in H0

Hypotheses: If we want to allocate the question in the null hypothesis to reject it only when the data strongly suggest so,

H0: ηI − ηR = θ0 ≥ 0 and H1: ηI − ηR = θ1 < 0

By looking at the alternative hypothesis, we deduce the form of the critical region:



The quantity c can be thought of as a margin over θ0 not to exclude cases where ηI − ηR = θ0 = 0 really holds while values slightly smaller than θ0 are due to mere random effects.

Decision: To apply the first methodology, the critical value a that determines the rejection region is found by applying the definition of type I error, with the value α = 1 − 0.99 = 0.01 at θ0 = 0:

α(0) = P(Type I error) = P(Reject H0 | H0 true) = P(T(I, R) ∈ Rc | H0) = P(T0(I, R) < a)
→ a = l_{0.01} = −2.326 → Rc = {T0(I, R) < −2.326}

The decision is: T0(i, r) = 2.25 → T0(i, r) ∉ Rc → H0 is not rejected.

As regards the value of the type I error, it is α by definition. The second methodology is based on the calculation of the p-value:

pV = P(I, R more rejecting than i, r | H0 true) = P(T0(I, R) < T0(i, r)) = P(T0(I, R) < 2.25) = 0.988

→ pV = 0.988 > 0.01 = α → H0 is not rejected.

Type II error: To calculate β, we have to work under H1. Since the critical region is expressed in terms of T0 and we must use T1, we are going to apply the mathematical trick of adding and subtracting the same quantity:

β(θ1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(I, R) ∉ Rc | H1)
= P( [(η̂I − η̂R) − θ0] / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) ≥ −2.326 | H1 )
= P( [(η̂I − η̂R) + 0 − θ1] / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) + θ1 / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) ≥ −2.326 | H1 )
= P( T1(I, R) ≥ −2.326 − θ1 / √( 0.53·(1−0.53)/700 + 0.47·(1−0.47)/700 ) )

For the particular value θ1 = −0.1,

β(−0.1) = P( T1(I, R) ≥ −2.326 − (−0.1) / √( 0.53·(1−0.53)/700 + 0.47·(1−0.47)/700 ) ) = P(T1(I, R) ≥ 1.42) = 0.078

By using a computer, many other values θ1 ≠ −0.1 can be considered so as to numerically determine the power curve 1−β(θ1) of the test and to plot the power function.



φ(θ) = P(Reject H0) = { α(θ) if θ ∈ Θ0 ; 1−β(θ) if θ ∈ Θ1 }

# Sample and inference
ni = 700; nr = 700
sPi = 0.53; sPr = 0.47
alpha = 0.01
theta0 = 0 # Value under the null hypothesis H0
q = qnorm(alpha,0,1)
theta1 = seq(from=-0.25,to=0,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = pnorm(q-paramSpace/sqrt(sPi*(1-sPi)/ni + sPr*(1-sPr)/nr),0,1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

This code generates the following figure:

2) Question in H1

Hypotheses: If we want to allocate the question in the alternative hypothesis to accept it only when the data strongly suggest so,

H0: ηI − ηR = θ0 ≤ 0 and H1: ηI − ηR = θ1 > 0

By looking at the alternative hypothesis, we deduce the form of the critical region:

The quantity c can be thought of as a margin over θ0 not to exclude cases where ηI − ηR = θ0 = 0 really holds while values slightly larger than θ0 are due to mere random effects.

Decision: To apply the first methodology, the critical value a is calculated as follows:

α(0) = P(Type I error) = P(Reject H0 | H0 true) = P(T(I, R) ∈ Rc | H0) = P(T0(I, R) > a)
→ a = r_{0.01} = 2.326 → Rc = {T0(I, R) > 2.326}

The decision is: T0(i, r) = 2.25 → T0(i, r) ∉ Rc → H0 is not rejected.

The second methodology consists in calculating the p-value:



pV = P(I, R more rejecting than i, r | H0 true) = P(T0(I, R) > T0(i, r)) = P(T0(I, R) > 2.25) = 1 − P(T0(I, R) ≤ 2.25) = 0.0122

→ pV = 0.0122 > 0.01 = α → H0 is not rejected.
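The normal tail probability 0.0122 can be cross-checked with `math.erf` from the Python standard library:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_value = 1.0 - norm_cdf(2.25)
print(round(p_value, 4))  # → 0.0122
```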

Type II error: Finally, to calculate β:

β(θ1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(I, R) ∉ Rc | H1)
= P( [(η̂I − η̂R) − θ0] / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) ≤ 2.326 | H1 )
= P( [(η̂I − η̂R) + 0 − θ1] / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) + θ1 / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) ≤ 2.326 | H1 )
= P( T1(I, R) ≤ 2.326 − θ1 / √( 0.53·(1−0.53)/700 + 0.47·(1−0.47)/700 ) )

For the particular value θ1 = 0.1,

β(0.1) = P( T1(I, R) ≤ 2.326 − 0.1 / √( 0.53·(1−0.53)/700 + 0.47·(1−0.47)/700 ) ) = P(T1(I, R) ≤ −1.42) = 0.078

By using a computer, many more values θ1 ≠ 0.1 can be considered so as to numerically determine the power curve 1−β(θ1) of the test and to plot the power function.

φ(θ) = P(Reject H0) = { α(θ) if θ ∈ Θ0 ; 1−β(θ) if θ ∈ Θ1 }

# Sample and inference
ni = 700; nr = 700
sPi = 0.53; sPr = 0.47
alpha = 0.01
theta0 = 0 # Value under the null hypothesis H0



q = qnorm(1-alpha,0,1)
theta1 = seq(from=0,to=+0.25,0.01)
paramSpace = sort(unique(c(theta1,theta0)))
PowerFunction = 1 - pnorm(q-paramSpace/sqrt(sPi*(1-sPi)/ni + sPr*(1-sPr)/nr),0,1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

This code generates the figure above.

Conclusion: The hypothesis that the two proportions are equal is not rejected when the question is allocated in either the alternative or the null hypothesis (the best way of testing an equality). That is, it seems that both populations wish to visit Spain with the same desire. The sample information η̂I = 0.53 and η̂R = 0.47 suggested the alternative hypothesis H1: ηI − ηR > 0. The two power functions show how symmetric the situations are. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)

Advanced theory: Under the hypothesis H0: ηI = η = ηR, it makes sense to try to estimate the common variance η(1−η) of the estimator—in the denominator—as well as possible. This can be done by using the pooled sample proportion η̂p = (nI η̂I + nR η̂R)/(nI + nR). Nevertheless, the pooled estimator should not be considered in the numerator, since (η̂p − η̂p) = 0 whatever the data are. Now, the statistic under the null hypothesis is:

T̃0(I, R) = [(η̂I − η̂R) − θ0] / √( η̂p(1−η̂p)/nI + η̂p(1−η̂p)/nR )
= { [(η̂I − η̂R) − θ0] / √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) } · { √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) / √( η̂p(1−η̂p)/nI + η̂p(1−η̂p)/nR ) }
= T0(I, R) · √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) / √( η̂p(1−η̂p)/nI + η̂p(1−η̂p)/nR ) →d N(0,1)

Then,

η̂p = (700·0.53 + 700·0.47)/(700 + 700) = (0.53 + 0.47)/(1 + 1) = 1/2 = 0.5
→ √( η̂I(1−η̂I)/nI + η̂R(1−η̂R)/nR ) / √( η̂p(1−η̂p)/nI + η̂p(1−η̂p)/nR ) = 0.9981983



$$\rightarrow\; \tilde{T}_0(i,r) = 2.25\cdot 0.9981983 = 2.24.$$

The same decisions are made with $T_0$ and $\tilde{T}_0$ because of the little effect of using η̂p in this exercise (see the value of the quotient of square roots above); in other situations, the two ways may lead to paradoxical results. As regards the calculations of the type II error, both the mathematical trick of multiplying and dividing by the same quantity and the mathematical trick of adding and subtracting the same quantity should be applied now. For section (a):

$$\begin{aligned}
\beta(\theta_1) &= P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1 \text{ true}) = P(\tilde{T}_0(I,R) \notin R_c \mid H_1)\\[2pt]
&= P\left(\frac{(\hat{\eta}_I - \hat{\eta}_R) - \theta_0}{\sqrt{\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I} + \frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}} \ge -2.325 \,\middle|\, H_1\right)\\[2pt]
&= P\left(\frac{(\hat{\eta}_I - \hat{\eta}_R) - 0 - \theta_1 + \theta_1}{\sqrt{\frac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I} + \frac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \ge -2.325 \cdot \frac{\sqrt{\frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_I} + \frac{\hat{\eta}_p(1-\hat{\eta}_p)}{n_R}}}{\sqrt{\frac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I} + \frac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \,\middle|\, H_1\right)\\[2pt]
&= P\left(\frac{(\hat{\eta}_I - \hat{\eta}_R) - \theta_1}{\sqrt{\frac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I} + \frac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} + \frac{\theta_1}{\sqrt{\frac{\hat{\eta}_I(1-\hat{\eta}_I)}{n_I} + \frac{\hat{\eta}_R(1-\hat{\eta}_R)}{n_R}}} \ge -2.325\cdot 1.002 \,\middle|\, H_1\right)\\[2pt]
&= P\left(T_1(I,R) \ge -2.330 - \frac{\theta_1}{\sqrt{\frac{0.53(1-0.53)}{700} + \frac{0.47(1-0.47)}{700}}}\right)
\end{aligned}$$

For the particular value θ1 = –0.1,

$$\beta(-0.1) = P\left(T_1(I,R) \ge -2.330 - \frac{-0.1}{\sqrt{\frac{0.53(1-0.53)}{700} + \frac{0.47(1-0.47)}{700}}}\right) = P(T_1(I,R) \ge 1.41) = 0.079.$$

Similarly for section (b).

[HT-p] Based on Λ

Exercise 1ht-Λ

A random quantity X follows a Poisson distribution. Let X = (X1,...,Xn) be a simple random sample. By applying the results involving Neyman–Pearson's lemma and the likelihood ratio, study the critical region (estimator that arises and form) for the following pairs of hypotheses.


$$\begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 \end{cases}\qquad
\begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 > \lambda_0 \end{cases}\qquad
\begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 < \lambda_0 \end{cases}\qquad
\begin{cases} H_0: \lambda \le \lambda_0 \\ H_1: \lambda = \lambda_1 > \lambda_0 \end{cases}\qquad
\begin{cases} H_0: \lambda \ge \lambda_0 \\ H_1: \lambda = \lambda_1 < \lambda_0 \end{cases}$$

Discussion: This is a theoretical exercise where no assumption should be evaluated. First of all, Neyman–Pearson's lemma will be applied. We expect the maximum-likelihood estimator of the parameter—calculated in a previous exercise—and the “usual” critical region form to appear. If the critical region does not depend on any particular value θ1, the uniformly most powerful test will have been found.

Poisson distribution: For the Poisson distribution,

Identification of the variable: X ~ Pois(λ)

Hypothesis test: $\begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 \end{cases}$

Likelihood function and likelihood ratio:

$$L(X;\lambda) = \frac{\lambda^{\sum_{j=1}^n X_j}}{\prod_{j=1}^n X_j!}\, e^{-n\lambda} \quad\text{and}\quad \Lambda(X;\lambda_0,\lambda_1) = \frac{L(X;\lambda_0)}{L(X;\lambda_1)} = \left(\frac{\lambda_0}{\lambda_1}\right)^{\sum_{j=1}^n X_j} e^{-n(\lambda_0-\lambda_1)}$$

Rejection region:

$$R_c = \{\Lambda < k\} = \left\{\left(\frac{\lambda_0}{\lambda_1}\right)^{\sum_{j=1}^n X_j} e^{-n(\lambda_0-\lambda_1)} < k\right\} = \left\{\left(\sum_{j=1}^n X_j\right)\log\left(\frac{\lambda_0}{\lambda_1}\right) - n(\lambda_0-\lambda_1) < \log(k)\right\}$$
$$= \left\{\left(\sum_{j=1}^n X_j\right)\log\left(\frac{\lambda_0}{\lambda_1}\right) < \log(k) + n(\lambda_0-\lambda_1)\right\} = \left\{n\bar{X}\log\left(\frac{\lambda_0}{\lambda_1}\right) < \log(k) + n(\lambda_0-\lambda_1)\right\}$$

Now it is necessary that λ1 ≠ λ0 and

• if λ1 < λ0 then $\log\left(\frac{\lambda_0}{\lambda_1}\right) > 0$ and hence $R_c = \left\{\bar{X} < \dfrac{\log(k) + n(\lambda_0-\lambda_1)}{n\log\left(\frac{\lambda_0}{\lambda_1}\right)}\right\}$

• if λ1 > λ0 then $\log\left(\frac{\lambda_0}{\lambda_1}\right) < 0$ and hence $R_c = \left\{\bar{X} > \dfrac{\log(k) + n(\lambda_0-\lambda_1)}{n\log\left(\frac{\lambda_0}{\lambda_1}\right)}\right\}$

This suggests the estimator $\bar{X} = \hat{\lambda}_{ML}$ (calculated in a previous exercise) and regions of the form

$$R_c = \{\Lambda<k\} = \cdots = \{\hat{\lambda}_{ML}<c\} = \cdots = \{T_0<a\} \quad\text{or}\quad R_c = \{\Lambda<k\} = \cdots = \{\hat{\lambda}_{ML}>c\} = \cdots = \{T_0>a\}$$
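The monotonicity that drives this derivation can be checked numerically. A small R sketch with assumed values λ0 = 2, λ1 = 3 (so λ1 > λ0) shows that Λ decreases as Σxj grows, which is why {Λ < k} becomes a region of the form {X̄ > c}:

```r
# Likelihood ratio of Pois(lambda0) vs Pois(lambda1) as a function of sum(x)
lambda0 <- 2; lambda1 <- 3; n <- 10   # assumed values, for illustration only
S <- 0:60                             # possible values of sum(x)
Lambda <- (lambda0/lambda1)^S * exp(-n*(lambda0 - lambda1))
all(diff(Lambda) < 0)                 # TRUE: Lambda is decreasing in sum(x)
```

With λ1 < λ0 the same sketch (swapping the two values) shows Λ increasing in Σxj, giving the {X̄ < c} form.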

Hypothesis tests: $\begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 > \lambda_0 \end{cases}\qquad \begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 < \lambda_0 \end{cases}$

In applying the methodologies, and given α, the same critical value c or a will be obtained for any λ1, since it only depends upon λ0 through $\hat{\lambda}_{ML}$ or T0:


α = P(Type I error) = P(T0 < a)  or  α = P(Type I error) = P(T0 > a)

This implies that the uniformly most powerful test has been found.

Hypothesis tests: $\begin{cases} H_0: \lambda \le \lambda_0 \\ H_1: \lambda = \lambda_1 > \lambda_0 \end{cases}\qquad \begin{cases} H_0: \lambda \ge \lambda_0 \\ H_1: \lambda = \lambda_1 < \lambda_0 \end{cases}$

A uniformly most powerful test for H0: λ = λ0 is also uniformly most powerful for H0: λ ≤ λ0 (and, analogously, for H0: λ ≥ λ0).

Exponential distribution: For the exponential distribution,

Identification of the variable: X ~ Exp(λ)

Hypothesis test: $\begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 \end{cases}$

Likelihood function and likelihood ratio:

$$L(X;\lambda) = \lambda^n e^{-\lambda\sum_{j=1}^n X_j} \quad\text{and}\quad \Lambda(X;\lambda_0,\lambda_1) = \frac{L(X;\lambda_0)}{L(X;\lambda_1)} = \left(\frac{\lambda_0}{\lambda_1}\right)^n e^{-(\lambda_0-\lambda_1)\sum_{j=1}^n X_j}$$

Rejection region:

Rc = {Λ < k }= {( λ0

λ1)

n

e−(λ0−λ1 )∑ j=1

nX j < k }= {n log(λ0

λ1)−(λ0−λ1)∑ j=1

nX j < log(k )}

={(λ1−λ0)∑ j=1

nX j < log(k )−n log(

λ0

λ1 )}= {(λ1−λ0)n X < log(k )−n log(λ0

λ1 )}

Now it is necessary that λ1≠λ0 and

• if λ1<λ0 then (λ1−λ0)<0 and Rc={ X >

log(k )−n log(λ0

λ1 )n(λ1−λ0)

}={1X

<n (λ1−λ0)

log(k )−n log(λ0

λ1 )}• if λ1>λ0 then (λ1−λ0)>0 and Rc={ X <

log(k )−n log(λ0

λ1 )n(λ1−λ0)

}={1X

>n (λ1−λ0)

log(k )−n log(λ0

λ1 )}

This suggests the estimator1

X=λML (calculated in a previous exercise) and regions of the form

Rc = {Λ<k }=⋯={λML<c }=⋯= {T 0<a } or Rc = {Λ<k }=⋯={λML>c }=⋯= {T 0>a }

Hypothesis tests: $\begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 > \lambda_0 \end{cases}\qquad \begin{cases} H_0: \lambda = \lambda_0 \\ H_1: \lambda = \lambda_1 < \lambda_0 \end{cases}$

In applying the methodologies, and given α, the same critical value c or a will be obtained for any λ1, since it only depends upon λ0 through $\hat{\lambda}_{ML}$ or T0:

α = P(Type I error) = P(T0 < a)  or  α = P(Type I error) = P(T0 > a)

This implies that the uniformly most powerful test has been found.

Hypothesis tests: $\begin{cases} H_0: \lambda \le \lambda_0 \\ H_1: \lambda = \lambda_1 > \lambda_0 \end{cases}\qquad \begin{cases} H_0: \lambda \ge \lambda_0 \\ H_1: \lambda = \lambda_1 < \lambda_0 \end{cases}$

A uniformly most powerful test for H0: λ = λ0 is also uniformly most powerful for H0: λ ≤ λ0 (and, analogously, for H0: λ ≥ λ0).

Bernoulli distribution: For the Bernoulli law,

Identification of the variable: X ~ B(η)

Hypothesis test: $\begin{cases} H_0: \eta = \eta_0 \\ H_1: \eta = \eta_1 \end{cases}$

Likelihood function and likelihood ratio:

$$L(X;\eta) = \eta^{\sum_{j=1}^n X_j}(1-\eta)^{n-\sum_{j=1}^n X_j} \quad\text{and}\quad \Lambda(X;\eta_0,\eta_1) = \frac{L(X;\eta_0)}{L(X;\eta_1)} = \left(\frac{\eta_0}{\eta_1}\right)^{\sum_{j=1}^n X_j}\left(\frac{1-\eta_0}{1-\eta_1}\right)^{n-\sum_{j=1}^n X_j}$$

Rejection region:

$$R_c = \{\Lambda < k\} = \left\{\left(\frac{\eta_0}{\eta_1}\right)^{\sum_{j=1}^n X_j}\left(\frac{1-\eta_0}{1-\eta_1}\right)^{n-\sum_{j=1}^n X_j} < k\right\} = \left\{\left(\sum_{j=1}^n X_j\right)\log\left(\frac{\eta_0}{\eta_1}\right) + \left(n-\sum_{j=1}^n X_j\right)\log\left(\frac{1-\eta_0}{1-\eta_1}\right) < \log(k)\right\}$$
$$= \left\{\left(\sum_{j=1}^n X_j\right)\left[\log\left(\frac{\eta_0}{\eta_1}\right) - \log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right] < \log(k) - n\log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right\}$$
$$= \left\{n\bar{X}\log\left(\frac{\eta_0(1-\eta_1)}{\eta_1(1-\eta_0)}\right) < \log(k) - n\log\left(\frac{1-\eta_0}{1-\eta_1}\right)\right\}$$

Now it is necessary that η1 ≠ η0 and

• if η1 < η0 then $\log\left(\frac{\eta_0(1-\eta_1)}{\eta_1(1-\eta_0)}\right) > 0$ and $R_c = \left\{\bar{X} < \dfrac{\log(k) - n\log\left(\frac{1-\eta_0}{1-\eta_1}\right)}{n\log\left(\frac{\eta_0(1-\eta_1)}{\eta_1(1-\eta_0)}\right)}\right\}$

• if η1 > η0 then $\log\left(\frac{\eta_0(1-\eta_1)}{\eta_1(1-\eta_0)}\right) < 0$ and $R_c = \left\{\bar{X} > \dfrac{\log(k) - n\log\left(\frac{1-\eta_0}{1-\eta_1}\right)}{n\log\left(\frac{\eta_0(1-\eta_1)}{\eta_1(1-\eta_0)}\right)}\right\}$

This suggests the estimator $\bar{X} = \hat{\eta}_{ML}$ (calculated in a previous exercise) and regions of the form

$$R_c = \{\Lambda<k\} = \cdots = \{\hat{\eta}_{ML}<c\} = \cdots = \{T_0<a\} \quad\text{or}\quad R_c = \{\Lambda<k\} = \cdots = \{\hat{\eta}_{ML}>c\} = \cdots = \{T_0>a\}$$

Hypothesis tests: $\begin{cases} H_0: \eta = \eta_0 \\ H_1: \eta = \eta_1 > \eta_0 \end{cases}\qquad \begin{cases} H_0: \eta = \eta_0 \\ H_1: \eta = \eta_1 < \eta_0 \end{cases}$

In applying the methodologies, and given α, the same critical value c or a will be obtained for any η1, since it only depends upon η0 through $\hat{\eta}_{ML}$ or T0:

α=P (Type I error)=P (T 0<a) or α=P (Type I error)=P (T 0>a)

This implies that the uniformly most powerful test has been found.

Hypothesis tests: $\begin{cases} H_0: \eta \le \eta_0 \\ H_1: \eta = \eta_1 > \eta_0 \end{cases}\qquad \begin{cases} H_0: \eta \ge \eta_0 \\ H_1: \eta = \eta_1 < \eta_0 \end{cases}$

A uniformly most powerful test for H0: η = η0 is also uniformly most powerful for H0: η ≤ η0 (and, analogously, for H0: η ≥ η0).

Normal distribution: For the normal distribution,

Identification of the variable: X ~ N(μ, σ²)

Hypothesis test: $\begin{cases} H_0: \mu = \mu_0 \\ H_1: \mu = \mu_1 \end{cases}$

Likelihood function and likelihood ratio:

$$L(X;\mu) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2\sigma^2}\sum_{j=1}^n (X_j-\mu)^2}$$

and

$$\Lambda(X;\mu_0,\mu_1) = \frac{L(X;\mu_0)}{L(X;\mu_1)} = e^{-\frac{1}{2\sigma^2}\left[\sum_{j=1}^n (X_j-\mu_0)^2 - \sum_{j=1}^n (X_j-\mu_1)^2\right]} = e^{-\frac{1}{2\sigma^2}\sum_{j=1}^n \left(X_j^2 - 2\mu_0 X_j + \mu_0^2 - X_j^2 + 2\mu_1 X_j - \mu_1^2\right)}$$
$$= e^{-\frac{1}{2\sigma^2}\sum_{j=1}^n \left(\mu_0^2 - \mu_1^2 - 2\mu_0 X_j + 2\mu_1 X_j\right)} = e^{-\frac{1}{2\sigma^2}\left[n(\mu_0^2-\mu_1^2) - 2(\mu_0-\mu_1)\sum_{j=1}^n X_j\right]} = e^{\frac{(\mu_0-\mu_1)}{\sigma^2}\sum_{j=1}^n X_j}\, e^{-\frac{n(\mu_0^2-\mu_1^2)}{2\sigma^2}}$$

Rejection region:

$$R_c = \{\Lambda < k\} = \left\{e^{\frac{(\mu_0-\mu_1)}{\sigma^2}\sum_{j=1}^n X_j}\, e^{-\frac{n(\mu_0^2-\mu_1^2)}{2\sigma^2}} < k\right\} = \left\{\frac{(\mu_0-\mu_1)}{\sigma^2}\sum_{j=1}^n X_j - \frac{n(\mu_0^2-\mu_1^2)}{2\sigma^2} < \log(k)\right\}$$
$$= \left\{(\mu_0-\mu_1)\left(\sum_{j=1}^n X_j\right) < \log(k)\,\sigma^2 + \frac{n}{2}(\mu_0^2-\mu_1^2)\right\} = \left\{(\mu_0-\mu_1)\,n\bar{X} < \log(k)\,\sigma^2 + \frac{n}{2}(\mu_0^2-\mu_1^2)\right\}$$

Now it is necessary that μ1 ≠ μ0 and

• if μ1 < μ0 then (μ0−μ1) > 0 and $R_c = \left\{\bar{X} < \dfrac{\log(k)\,\sigma^2 + \frac{n}{2}(\mu_0^2-\mu_1^2)}{n(\mu_0-\mu_1)}\right\}$

• if μ1 > μ0 then (μ0−μ1) < 0 and $R_c = \left\{\bar{X} > \dfrac{\log(k)\,\sigma^2 + \frac{n}{2}(\mu_0^2-\mu_1^2)}{n(\mu_0-\mu_1)}\right\}$

This suggests the estimator $\bar{X} = \hat{\mu}_{ML}$ (calculated in a previous exercise) and regions of the form

$$R_c = \{\Lambda<k\} = \cdots = \{\hat{\mu}_{ML}<c\} = \cdots = \{T_0<a\} \quad\text{or}\quad R_c = \{\Lambda<k\} = \cdots = \{\hat{\mu}_{ML}>c\} = \cdots = \{T_0>a\}$$

Hypothesis tests: $\begin{cases} H_0: \mu = \mu_0 \\ H_1: \mu = \mu_1 > \mu_0 \end{cases}\qquad \begin{cases} H_0: \mu = \mu_0 \\ H_1: \mu = \mu_1 < \mu_0 \end{cases}$

In applying the methodologies, and given α, the same critical value c or a will be obtained for any μ1, since it only depends upon μ0 through $\hat{\mu}_{ML}$ or T0:

α=P (Type I error)=P (T 0<a) or α=P (Type I error)=P (T 0>a)

This implies that the uniformly most powerful test has been found.

Hypothesis tests: $\begin{cases} H_0: \mu \le \mu_0 \\ H_1: \mu = \mu_1 > \mu_0 \end{cases}\qquad \begin{cases} H_0: \mu \ge \mu_0 \\ H_1: \mu = \mu_1 < \mu_0 \end{cases}$

A uniformly most powerful test for H0: μ = μ0 is also uniformly most powerful for H0: μ ≤ μ0 (and, analogously, for H0: μ ≥ μ0).

Conclusion: Well-known theoretical results have been applied to study the optimal form of the critical region for different pairs of hypotheses. Since both the likelihood ratio and the maximum likelihood estimator use the likelihood function, the critical region of the tests can be expressed in terms of this estimator.

[HT-p] Analysis of Variance (ANOVA)

Exercise 1ht-av

The fog index is used to measure the reading difficulty of a written text: the higher the value of the index, the more difficult the reading level. We want to know if the reading difficulty index is different for three magazines: Scientific American, Fortune, and the New Yorker. Three independent random samples of 6 advertisements were taken, and the fog indices for the 18 advertisements were measured, as recorded in the following table:

SCIENTIFIC AMERICAN FORTUNE NEW YORKER

15.75 12.63 9.27

11.55 11.46 8.28

11.16 10.77 8.15

9.92 9.93 6.37

9.23 9.87 6.37

8.20 9.42 5.66


Apply an analysis of variance to test whether the average level of difficulty is the same in the three magazines.

(From Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)

Discussion: The analysis of variance can be applied when populations are normally distributed and their variances are equal, that is, $X_p \sim N(\mu_p, \sigma_p^2)$ with $\sigma_p = \sigma,\ \forall p$. These suppositions should be evaluated (this will be done at the end of the exercise). If the equality of the means is rejected, additional analyses would be necessary to identify which means are different—this information is not provided by the analysis of variance. On the other hand, the calculations involved in this analysis are so tedious that almost everybody uses the computer. Finally, the unit of measurement of the index, u, is unknown to us.

Statistic: There is one factor identifying the population out of the three possible ones (we do not consider other magazines), so a one-factor fixed-effects analysis will be applied. The statistic is

$$T(X_{SA}, X_{FO}, X_{NY}) = \frac{MSG}{MSW} \quad\text{with}\quad T_0 = \frac{MSG}{MSW} \sim F_{P-1,\,n-P} \equiv F_{3-1,\,18-3} \equiv F_{2,15}$$

Some calculations are necessary to evaluate the statistic $T(x_{SA}, x_{FO}, x_{NY})$. First of all, we look at the three sample means:

$$\bar{X}_{SA} = \frac{1}{6}\sum_{j=1}^6 X_{SA,j} = \frac{15.75u + \cdots + 8.20u}{6} = 10.97u$$
$$\bar{X}_{FO} = \frac{1}{6}\sum_{j=1}^6 X_{FO,j} = \frac{12.63u + \cdots + 9.42u}{6} = 10.68u$$
$$\bar{X}_{NY} = \frac{1}{6}\sum_{j=1}^6 X_{NY,j} = \frac{9.27u + \cdots + 5.66u}{6} = 7.35u$$

The magnitude of the first and the third seems quite different, which suggests that the population means may be different. Nevertheless, we should not trust intuition.

$$\bar{X} = \frac{1}{18}\sum_{j=1}^{n} X_j = \frac{15.75u + \cdots + 5.66u}{18} = 9.67u$$

$$SSG = \sum_{p=1}^{P} n_p(\bar{X}_p - \bar{X})^2 = n_{SA}(\bar{X}_{SA}-\bar{X})^2 + n_{FO}(\bar{X}_{FO}-\bar{X})^2 + n_{NY}(\bar{X}_{NY}-\bar{X})^2$$
$$= 6\,(10.97u-9.67u)^2 + 6\,(10.68u-9.67u)^2 + 6\,(7.35u-9.67u)^2 = 48.53u^2$$

$$MSG = \frac{1}{P-1}SSG = \frac{48.53u^2}{3-1} = 24.26u^2$$

$$SSW = \sum_{p=1}^{P}\sum_{j=1}^{n_p}(X_{p,j}-\bar{X}_p)^2 = \sum_{j=1}^{6}(X_{SA,j}-\bar{X}_{SA})^2 + \sum_{j=1}^{6}(X_{FO,j}-\bar{X}_{FO})^2 + \sum_{j=1}^{6}(X_{NY,j}-\bar{X}_{NY})^2$$
$$= (15.75u-10.97u)^2 + \cdots + (8.20u-10.97u)^2 + (12.63u-10.68u)^2 + \cdots + (9.42u-10.68u)^2 + (9.27u-7.35u)^2 + \cdots + (5.66u-7.35u)^2 = 52.22u^2$$

$$MSW = \frac{1}{n-P}SSW = \frac{52.22u^2}{18-3} = 3.48u^2$$

and, finally,

$$T_0(x_{SA}, x_{FO}, x_{NY}) = \frac{MSG}{MSW} = \frac{24.26u^2}{3.48u^2} = 6.97$$

Hypotheses and form of the critical region:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_P \quad\text{and}\quad H_1: \exists\, a, b \,/\, \mu_a \neq \mu_b$$


For this statistic,

By applying the definition of α:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T ∈ Rc | H0) = P(T0 > a)

→ a = rα = 6.359 → Rc = {T0(X_SA, X_FO, X_NY) > 6.359}

Decision: Finally, it is necessary to check whether this region “suggested by H0” is compatible with the value that the data provide for the statistic. If they are not compatible, because the value seems extreme when the hypothesis is true, we will trust the data and reject the hypothesis H0.

Since T0(x_SA, x_FO, x_NY) = 6.97 > 6.359 → T0(x) ∈ Rc → H0 is rejected.

The second methodology is based on the calculation of the p-value:

pV = P((X_SA, X_FO, X_NY) more rejecting than (x_SA, x_FO, x_NY) | H0 true)
= P(T0(X_SA, X_FO, X_NY) > T0(x_SA, x_FO, x_NY)) = P(T0 > 6.97) = 0.0072

→ pV = 0.007243 < 0.01 = α → H0 is rejected.

Conclusion: As suggested by the sample means, the population means of the three magazines are not all equal, with a confidence of 0.99 (measured on a 0-to-1 scale). Pairwise comparisons could be applied to identify the differences.

Code to apply the analysis “semimanually”

We have done the calculations not by hand but with the programming language R. The code is:

# To enter the three samples

SA = c(15.75, 11.55, 11.16, 9.92, 9.23, 8.20)

FO = c(12.63, 11.46, 10.77, 9.93, 9.87, 9.42)

NY = c(9.27, 8.28, 8.15, 6.37, 6.37, 5.66)

# To join the samples in a unique vector

Data = c(SA, FO, NY)

# To calculate the sample mean of the three groups and the total sample mean

mean(SA) ; mean(FO) ; mean(NY) ; mean(Data)

# To calculate the measures and the statistic (for large datasets, the previous means should have been saved)

SSG = 6*((mean(SA) - mean(Data))^2) + 6*((mean(FO) - mean(Data))^2) + 6*((mean(NY) - mean(Data))^2)

MSG = SSG/(3-1)

SSW = sum((SA - mean(SA))^2) + sum((FO - mean(FO))^2) + sum((NY - mean(NY))^2)

MSW = SSW/(18-3)

T0 = MSG/MSW

# To find the quantile 'a' that determines the critical region

a = qf(0.99, 2, 15)

# To calculate the p-value

pValue = 1 - pf(T0, 2, 15)

(In the console, write the name of a quantity to print its value.)

Code to apply the analysis with R

Statistical software programs have many built-in functions to apply the most basic methods. Now we use R to obtain the analysis of variance table. As regards the syntax, it is based on the linear regression framework,


> 1 - pf(6.97, 2, 15)
[1] 0.007235116

> qf(0.99, 2, 15)
[1] 6.358873


$X_{p,j} = \mu_p + \epsilon_{p,j}$, where this linear dependence of X on the factor effect μp is denoted by Data ~ Group (see the call to the function aov below).

## After running the first block of lines of the previous code:

# To create a vector with the membership labels

Group = factor(c(rep("SA",length(SA)), rep("FO",length(FO)), rep("NY",length(NY))))

# To apply a one-factor analysis of variance

objectAV = aov(Data ~ Group)

# To print the table with the results

summary(objectAV)

The ANOVA table is

            Df Sum Sq Mean Sq F value  Pr(>F)
Group        2  48.53  24.264    6.97 0.00723 **
Residuals   15  52.22   3.481
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Compare these quantities with those obtained in the previous calculations.) An equivalent way of applying the analysis of variance with R consists in substituting the lines

# To apply a one-factor analysis of variance

objectAV = aov(Data ~ Group)

# To print the table with the results

summary(objectAV)

by the lines

# To fit a linear regression model

Model = lm(Data ~ Group)

# To apply and print the analysis of variance

anova(Model)

Code to check the assumptions

By using a computer it is also easy to evaluate the fulfillment of the assumptions.

# To enter the three samples

SA = c(15.75, 11.55, 11.16, 9.92, 9.23, 8.20)

FO = c(12.63, 11.46, 10.77, 9.93, 9.87, 9.42)

NY = c(9.27, 8.28, 8.15, 6.37, 6.37, 5.66)

# To join the samples in a unique vector

Data = c(SA, FO, NY)

# To create a vector with the membership labels

Group = factor(c(rep("SA",length(SA)), rep("FO",length(FO)), rep("NY",length(NY))))

# To test the normality of the sample SA by applying two different hypothesis tests

shapiro.test(SA)

ks.test(SA, "pnorm", mean=mean(SA), sd=sd(SA))

# To test the normality of the sample FO by applying two different hypothesis tests

shapiro.test(FO)

ks.test(FO, "pnorm", mean=mean(FO), sd=sd(FO))

# To test the normality of the sample NY by applying two different hypothesis tests

shapiro.test(NY)

ks.test(NY, "pnorm", mean=mean(NY), sd=sd(NY))

# To test the equality of the variances

bartlett.test(Data ~ Group)


[HT] Nonparametric

Remark 14ht: Nonparametric methods involve questions not based on parameters, and therefore it is not usually necessary to evaluate some kinds of supposition that were present in the parametric hypothesis tests.

Exercise 1ht-np

Occupational Hazards. The following table is based on data from the U.S. Department of Labor, Bureau of Labor Statistics.

                                     Police  Cashiers  Taxi Drivers  Guards
Homicide                                 82       107            70      59
Cause of death other than homicide       92         9            29      42
                                                                      n = 490

A) Use the data in the table, coming from a simple random sample, to test the claim that occupation is independent of whether the cause of death was homicide. Use a significance α = 0.05 and apply a nonparametric chi-square test.

B) Does any particular occupation appear to be most prone to homicides? If so, which one?

(Based on an exercise of Essentials of Statistics, Mario F. Triola, Pearson)

LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B. Heaton. Longman.)

job. Your job is what you do to earn your living: 'You'll never get a job if you don't have any qualifications.' 'She'd like to change her job but can't find anything better.' Your job is also the particular type of work that you do: 'John's new job sounds really interesting.' 'I know she works for the BBC but I'm not sure what job she does.' A job may be full-time or part-time (NOT half-time or half-day): 'All she could get was a part-time job at a petrol station.'

do (for a living). When you want to know about the type of work that someone does, the usual questions are What do you do? What does she do for a living? etc.: 'What does your father do?' - 'He's a police inspector.'

occupation. Occupation and job have similar meanings. However, occupation is far less common than job and is used mainly in formal and official styles: 'Please give brief details of your employment history and present occupation.' 'People in manual occupations seem to suffer less from stress.'

post/position. The particular job that you have in a company or organization is your post or position: 'She's been appointed to the post of deputy principal.' 'He's applied for the position of sales manager.' Post and position are used mainly in formal styles and often refer to jobs which have a lot of responsibility.

career. Your career is your working life, or the series of jobs that you have during your working life: 'The scandal brought his career in politics to a sudden end.' 'Later on in his career, he became first secretary at the British Embassy in Washington.' Your career is also the particular kind of work for which you are trained and that you intend to do for a long time: 'I wanted to find out more about careers in publishing.'

trade. A trade is a type of work in which you do or make things with your hands: 'Most of the men had worked in skilled trades such as carpentry or printing.' 'My grandfather was a bricklayer by trade.'

profession. A profession is a type of work such as medicine, teaching, or law which requires a high level of training or education: 'Until recently, medicine has been a male-dominated profession.' 'She entered the teaching profession in 1987.'

LINGUISTIC NOTE (From: The Careful Writer: A Modern Guide to English Usage. Bernstein, T.M. Atheneum)

occupations. The words people use affectionately, humorously, or disparagingly to describe their own occupations are their own affair. They may say, “I'm in show business” (or, more likely, “show biz”), or “I'm in the advertising racket,” or “I'm in the oil game,” or “I'm in the garment line.” But outsiders should use more caution, more discretion, and more precision. For instance, it is improper to write, “Mr. Danaher has been in the law business in Washington.” Law is a profession. Similarly, to say someone is “in the teaching game” would undoubtedly give offense to teachers. Unless there is some special reason to be slangy or colloquial, the advisable thing to do is to accord every occupation the dignity it deserves.


Discussion: In this exercise, it is clear from the statement that we need to test the independence of two variables. A particular sample (x1,...,x490) was grouped, and we are given the absolute frequencies in the empirical table. By looking at the table, the cashier occupation appears to be most prone to homicides.

Statistic: Since we have to apply a test of independence, from a table of statistics (e.g. in [T]) we select

$$T_0(X) = \sum_{l=1}^{L}\sum_{k=1}^{K}\frac{(N_{lk}-e_{lk})^2}{e_{lk}} \;\xrightarrow{d}\; \chi^2_{(L-1)(K-1)}$$

for L and K classes, respectively.

Hypotheses: The null hypothesis supposes that the two variables are independent,

H0: X, Y independent  and  H1: X, Y dependent

or, probabilistically,

$$H_0: f(x,y) = f_X(x)\cdot f_Y(y) \quad\text{and}\quad H_1: f(x,y) \neq f_X(x)\cdot f_Y(y)$$

This implies that the probability at any cell is the product of the marginal probabilities of its row and column. Note that two underlying probability distributions are supposed for X and Y, although we do not care about them, and we will directly estimate the probabilities from the empirical table.

By substituting in the expression of the statistic (with expected frequencies $e_{lk} = \frac{(\text{row } l \text{ total})\cdot(\text{column } k \text{ total})}{n}$),

$$T_0(x) = \frac{\left(82 - \frac{318\cdot 174}{490}\right)^2}{\frac{318\cdot 174}{490}} + \cdots + \frac{\left(42 - \frac{172\cdot 101}{490}\right)^2}{\frac{172\cdot 101}{490}} = 65.52$$

This value, calculated under H0 and using the data, is necessary both to determine the critical region and to calculate the p-value.

On the other hand, for any chi-square test T0 is a nonnegative measure of the dissimilarity between the two tables; therefore, a value close to zero means that the two tables are similar, while the critical region is always of the form:

Decision: There are L = 2 and K = 4 classes, respectively, so

$$T_0(X) \;\xrightarrow{d}\; \chi^2_{(L-1)(K-1)} \equiv \chi^2_{(2-1)(4-1)} \equiv \chi^2_3$$

For the first methodology, to calculate a the definition of the type I error is applied with α = 0.05:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T(X) ∈ Rc | H0) ≈ P(T0(X) > a)


→ a = rα = 7.81 → Rc = {T0(X) > 7.81}

The decision is: T0(x) = 65.52 ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 65.52) = 3.885781·10⁻¹⁴

→ pV < 0.05 = α → H0 is rejected.

Instead of using the computer, we can consider the last value in our table to bound the p-value (statisticians want to discover its value, while we want only to check whether or not it is smaller than α):

pV = P(T0(X) > 65.52) < P(T0(X) > 11.3) = 0.01 → pV < 0.01 < 0.05 = α → H0 is rejected

Conclusion: The hypothesis that the two variables are independent is rejected. This means that there seems to be an association between occupation and cause of death. (Remember: statistical results depend on the assumptions, the methods, the certainty and the data.)
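The whole test can also be reproduced with R's built-in function chisq.test (for tables larger than 2×2 no continuity correction is applied), which returns the same statistic, degrees of freedom and p-value:

```r
# Chi-square test of independence for the occupation-homicide table
M <- matrix(c(82, 107, 70, 59,
              92,   9, 29, 42),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("Homicide", "Other"),
                            c("Police", "Cashiers", "Taxi", "Guards")))
out <- chisq.test(M)
out$statistic   # about 65.52
out$parameter   # df = 3
out$expected    # expected table under independence
```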

Exercise 2ht-np

World War II Bomb Hits in London. To carry out an analysis, South London was divided into 576 areas. For the variable N ≡ number of bombs in the k-th area (any), a simple random sample (x1,...,x576) was gathered and grouped in the following table:

EMPIRICAL

Number of Bombs      0    1    2   3  4  5 or more
Number of Regions  229  211   93  35  7          1   n = 576

Data taken from: An application of the Poisson distribution. Clarke, R. D. Journal of the Institute of Actuaries [JIA] (1946) 72: 481. http://www.actuaries.org.uk/research-and-resources/documents/application-poisson-distribution

By applying the chi-square goodness-of-fit methodology,

(1) Test at 95% confidence whether N can be supposed to follow a Poisson distribution.

(2) Test at 95% confidence whether N can be supposed to follow a Poisson distribution with λ = 0.8.

Discussion: We must apply the chi-square methodology to study whether the data statistically fit the specified models. In the second section, a value for the parameter is given. For this probability model, we have to calculate or estimate the probabilities in order to obtain the expected absolute frequencies. Finally, by using the statistic T we will compare the two tables and make a decision.

Statistic: Since we have to apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select


> 1 - pchisq(65.52, 3)
[1] 3.885781e-14


$$T_0(X) = \sum_{k=1}^{K}\frac{(N_k-e_k)^2}{e_k} \;\xrightarrow{d}\; \chi^2_{K-s-1}$$

where K is the number of classes and s is the number of parameters of F0 that we need to estimate so as to use this distribution for obtaining the class probabilities or approximations of them.

(1) Fit to the Poisson family

Hypotheses: For this nonparametric goodness-of-fit test, the hypotheses are

$$H_0: N \sim F_0 = \mathrm{Pois}(\lambda) \quad\text{and}\quad H_1: N \sim F \neq \mathrm{Pois}(\lambda)$$

(It can be thought that both hypotheses are composite.) To fill in the expected table (under H0), the formula $e_k = n\cdot p_k$ will be applied. To estimate $p_k$, the distribution supposed under H0 must be used. And to use the distribution, an estimator $\hat\lambda$ of the parameter is necessary. Once we have this estimator, the probabilities are calculated by using software, tables, or the mass function plus the plug-in principle: $f(x;\hat\lambda)$.

On the other hand, to estimate λ we take into account that for this distribution the expectation (and also the variance) is equal to the parameter. Since the sample mean estimates the expectation, in this case it can be used to estimate λ too. (If we had not remembered this property, we would have applied the method of the moments or the maximum likelihood method to obtain this estimator.) Then,

$$\hat\lambda = \hat\mu = \bar{x} = \frac{1}{576}\sum_{j=1}^{576} x_j$$

Since our data are grouped, we can imagine that (look at the table): 229 data are 0's, 211 are 1's, 93 are 2's, 35 are 3's, 7 are 4's, and, finally, 1 is unknown but equal to or higher than 5, so we can consider 5 or even 6.

$$\hat\lambda = \frac{229\cdot 0 + 211\cdot 1 + 93\cdot 2 + 35\cdot 3 + 7\cdot 4 + 1\cdot 5}{576} = 0.93$$
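The grouped-data estimate can be computed in R (treating the open class "5 or more" as 5, as in the text):

```r
# Sample mean of the grouped data as an estimate of lambda
bombs   <- 0:5                       # class values ("5 or more" taken as 5)
regions <- c(229, 211, 93, 35, 7, 1)
lambdaHat <- sum(bombs*regions)/sum(regions)   # about 0.93
lambdaHat
```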

By using the plug-in principle and the calculator we obtain

$$p_0 = P_{\hat\lambda}(X=0) = f_{\hat\lambda}(0) = \frac{0.93^0}{0!}e^{-0.93} = 0.395 \qquad p_1 = P_{\hat\lambda}(X=1) = f_{\hat\lambda}(1) = \frac{0.93^1}{1!}e^{-0.93} = 0.367$$
$$p_2 = P_{\hat\lambda}(X=2) = f_{\hat\lambda}(2) = \frac{0.93^2}{2!}e^{-0.93} = 0.171 \qquad p_3 = P_{\hat\lambda}(X=3) = f_{\hat\lambda}(3) = \frac{0.93^3}{3!}e^{-0.93} = 0.0529$$
$$p_4 = P_{\hat\lambda}(X=4) = f_{\hat\lambda}(4) = \frac{0.93^4}{4!}e^{-0.93} = 0.0123 \qquad p_{5+} = 1 - P_{\hat\lambda}(X\le 4) = 1-(0.395+\cdots+0.0123) = 0.00270$$

Poisson (λ = 0.93)

Values          0      1      2      3       4       5 or more
Probabilities   0.395  0.367  0.171  0.0529  0.0123  0.00270     (total 1)

Now, we fill in the expected table by using the formula $e_k = n\cdot p_k$.

EXPECTED (UNDER H0)

Number of Bombs       0       1       2      3      4     5 or more
Number of Regions     227.26  211.35  98.28  30.47  7.08  1.55     n = 576

We have really done the calculations with the programming language R. By using a calculator, some quantities may be slightly different due to technical effects (number of decimal digits, accuracy, etc.).


> dpois(c(0,1,2,3,4), 0.93)
[1] 0.39455371 0.36693495 0.17062475 0.05289367 0.01229778
> 1 - sum(dpois(c(0,1,2,3,4), 0.93))
[1] 0.002695135

> 576*dpois(c(0,1,2,3,4), 0.93)
[1] 227.262937 211.354532 98.279857 30.466756 7.083521
> 576*(1 - sum(dpois(c(0,1,2,3,4), 0.93)))
[1] 1.552398

To guarantee the quality of the chi-square methodology, the expected absolute frequencies are usually required to be larger than four (≥ 5). For this reason, we merge the last two classes in both the empirical and the expected tables.

EMPIRICAL

Number of Bombs       0    1    2   3   4 or more
Number of Regions     229  211  93  35  7+1=8       n = 576

EXPECTED (UNDER H0)

Number of Bombs       0       1       2      3      4 or more
Number of Regions     227.26  211.35  98.28  30.47  7.08+1.55=8.63  n = 576

We evaluate T0, which is necessary to apply either of the two methodologies.

$$T_0(x) = \frac{(229-227.26)^2}{227.26} + \cdots + \frac{(8-8.63)^2}{8.63} = 1.019$$

We have calculated the value of T0 with the computer too:
> empirical = c(229, 211, 93, 35, 8)
> expected = 576*c(dpois(c(0,1,2,3), 0.93), (1-sum(dpois(c(0,1,2,3), 0.93))))
> sum(((empirical-expected)^2)/expected)
[1] 1.018862

For this kind of test, the critical region always has the following form:

Decision: There are K = 5 classes (after merging two of them) and s = 1 estimation, so

$$T_0(X) \;\xrightarrow{d}\; \chi^2_{K-s-1} \equiv \chi^2_{5-1-1} \equiv \chi^2_3$$

If we apply the methodology based on the critical region, the necessary quantile a is calculated from the definition of the type I error, with the given α = 0.05:

α = P(Type I error) = P(Reject H0 | H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = rα = 7.81 → Rc = {T0(X) > 7.81}

Then, the decision is: T0(x) = 1.019 < 7.81 → T0(x) ∉ Rc → H0 is not rejected.

If we apply the alternative methodology based on the p-value,

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 1.019) = 0.80

→ pV = 0.80 > 0.05 = α → H0 is not rejected.


> 1 - pchisq(1.019, 3)
[1] 0.7966547

> qchisq(1-0.05, 3)
[1] 7.814728


(2) Fit to a member of the Poisson family

Hypotheses: For this nonparametric goodness-of-fit test, the hypotheses are

$$H_0: N \sim F_0 = \mathrm{Pois}(0.8) \quad\text{and}\quad H_1: N \sim F \neq \mathrm{Pois}(0.8)$$

(It can be thought that the null hypothesis is simple while the alternative hypothesis is composite.) To fill in the expected table (under H0), the formula $e_k = n\cdot p_k$ will be applied, where the probabilities can be taken from a table or can be calculated by substituting in the mass function $f(x;\lambda)$,

p0 = Pλ(X=0) = fλ(0) = (0.8⁰/0!)·e^(−0.8) = 0.449        p1 = Pλ(X=1) = fλ(1) = (0.8¹/1!)·e^(−0.8) = 0.359

p2 = Pλ(X=2) = fλ(2) = (0.8²/2!)·e^(−0.8) = 0.144        p3 = Pλ(X=3) = fλ(3) = (0.8³/3!)·e^(−0.8) = 0.0383

p4 = Pλ(X=4) = fλ(4) = (0.8⁴/4!)·e^(−0.8) = 0.00767      p5+ = 1 − Pλ(X ≤ 4) = ⋯ = 0.00141

Poisson (λ = 0.8)

Values           0      1      2      3       4        5 or more
Probabilities    0.449  0.359  0.144  0.0383  0.00767  0.00141     (total 1)

Now, we fill in the expected table by using the formula ek = n·pk.

EXPECTED (UNDER H0)

Number of Bombs      0       1       2      3      4     5 or more
Number of Regions    258.81  207.05  82.82  22.09  4.42  0.813      n = 576

Again, we have done these calculations with the programming language R.

> dpois(c(0,1,2,3,4), 0.8)
[1] 0.449328964 0.359463171 0.143785269 0.038342738 0.007668548
> (1-sum(dpois(c(0,1,2,3,4), 0.8)))
[1] 0.00141131

> 576*dpois(c(0,1,2,3,4), 0.8)
[1] 258.813483 207.050787  82.820315  22.085417   4.417083
> 576*(1-sum(dpois(c(0,1,2,3,4), 0.8)))
[1] 0.8129146

As in the previous case, we merge the last two classes for all the expected absolute frequencies to be larger than four.

EMPIRICAL

Number of Bombs      0       1       2      3      4 or more
Number of Regions    229     211     93     35     7+1=8              n = 576

EXPECTED (UNDER H0)

Number of Bombs      0       1       2      3      4 or more
Number of Regions    258.81  207.05  82.82  22.09  4.42+0.813=5.233   n = 576

We calculate the value of T0 with the computer as well:

> empirical = c(229, 211, 93, 35, 8)
> expected = 576*c(dpois(c(0,1,2,3), 0.8), (1-sum(dpois(c(0,1,2,3), 0.8))))
> sum(((empirical-expected)^2)/expected)
[1] 13.77982

so

T0(x) = (229 − 258.81)²/258.81 + ⋯ + (8 − 5.233)²/5.233 = 13.78

On the other hand, for this kind of test the critical region always has the form Rc = {T0(X) > a}.

Decision: Now K = 5 and s = 0, since no estimation has been needed, so

T0(X) →d χ²K−s−1 ≡ χ²5−0−1 ≡ χ²4

Now T0 follows the χ² distribution with 4 degrees of freedom (it was 3 in the previous section).

α = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = rα = 9.49 → Rc = {T0(X) > 9.49}

Then, the decision is: T0(x) = 13.78 > 9.49 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x ∣ H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 13.78) = 0.0080

→ pV = 0.0080 < 0.05 = α → H0 is rejected.

Conclusion: The hypothesis that bomb hits can reasonably be modeled by using the Poisson family has not been rejected. In this case, the data provided the estimate λ̂ = 0.93. Nevertheless, when the value λ = 0.8 is imposed, the hypothesis that bomb hits can be modeled by using a Pois(λ = 0.8) model is rejected. This shows that:

i. Even a quite reasonable model may not fit the data if inappropriate parameter values are considered. This emphasizes the importance of using good parameter estimation methods.

ii. Estimating the parameter value was better than fixing a value close to the estimate. As statisticians say: “let the data talk”. This highlights the necessity of testing all suppositions, which implies that nonparametric procedures should sometimes be applied before the parametric ones: in this case, before supposing that the Poisson family is proper and imposing a value for the parameter, the whole Poisson family must be considered.

(Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)

Advanced theory: Mendenhall, W., D.D. Wackerly and R.L. Scheaffer say (Mathematical Statistics with Applications, Duxbury Press) that the expected absolute frequencies can be as low as 1 in some situations, according to Cochran, W.G., “The χ² Test of Goodness of Fit”, Annals of Mathematical Statistics, 23 (1952), pp. 315–345. To take the most advantage of this exercise, we repeat the previous calculations without merging the last two classes.

(1) Fit to the Poisson family

We evaluate T0, which is necessary to apply either of the two methodologies.

T0(x) = (229 − 227.26)²/227.26 + ⋯ + (1 − 1.55)²/1.55 = 1.167

Now there are K = 6 classes and s = 1 estimation, so T0(X) →d χ²K−s−1 ≡ χ²6−1−1 ≡ χ²4. If we apply the methodology based on the critical region, the necessary quantile a is calculated from the definition of the type I error, with the given α = 0.05:

α = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = rα = 9.49 → Rc = {T0(X) > 9.49}

> 1-pchisq(13.78, 4)
[1] 0.00803134

> qchisq(1-0.05, 4)
[1] 9.487729

Then, the decision is: T0(x) = 1.167 < 9.49 → T0(x) ∉ Rc → H0 is not rejected.

If we apply the alternative methodology based on the p-value,

pV = P(X more rejecting than x ∣ H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 1.167) = 0.88

→ pV = 0.88 > 0.05 = α → H0 is not rejected.

(2) Fit to a member of the Poisson family

We calculate the value of T0:

T0(x) = (229 − 258.81)²/258.81 + ⋯ + (1 − 0.813)²/0.813 = 13.87

Since K = 6 and s = 0, now T0(X) →d χ²K−s−1 ≡ χ²6−0−1 ≡ χ²5. Then,

α = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = rα = 11.07 → Rc = {T0(X) > 11.07}

Then, the decision is: T0(x) = 13.87 > 11.07 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x ∣ H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 13.87) = 0.0165

→ pV = 0.0165 < 0.05 = α → H0 is rejected.

In both sections the same decisions have been made, which implies that this is one of those situations where merging the last two classes does not seem essential.
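The effect of merging can be checked directly by recomputing the four statistics in R (a sketch with our own helper function, not the book's code):

```r
# T0 for lambda = 0.93 (estimated) and lambda = 0.8 (imposed),
# with and without merging the last two classes.
T0 <- function(lambda, merged) {
  emp <- if (merged) c(229, 211, 93, 35, 8) else c(229, 211, 93, 35, 7, 1)
  m <- length(emp) - 1
  expected <- 576 * c(dpois(0:(m - 1), lambda), 1 - ppois(m - 1, lambda))
  sum((emp - expected)^2 / expected)
}
round(c(T0(0.93, TRUE), T0(0.93, FALSE), T0(0.8, TRUE), T0(0.8, FALSE)), 2)
# about 1.02, 1.17, 13.78 and 13.87: merging barely changes the statistics
```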

Exercise 3ht-np

Three financial products have been commercialized and the presence of interest in them has been registered for some individuals. It is possible to imagine different situations where the following data could have been obtained.

          Product 1   Product 2   Product 3
Group 1   10          18          9           37
Group 2   20          13          15          48
          30          31          24          85

(a) A simple random sample of 48 people of the second group was allocated after considering the variable product. Test at α = 0.01 whether this variable follows the distribution determined by the sample of the first group.


> 1-pchisq(1.167, 4)
[1] 0.8835014

> 1-pchisq(13.87, 5)
[1] 0.01645663

> qchisq(1-0.05, 4)
[1] 9.487729

> qchisq(1-0.05, 5)
[1] 11.0705



(b) A simple random sample of 85 people with interest in any of the products was allocated after considering the two variables group and product. Test at α = 0.01 the independence of the two variables.

(c) From two independent groups, simple random samples of 37 people and 48 people are surveyed, respectively. Test at α = 0.01 the homogeneity of the distribution of the variable product in the groups.

Discussion: In this exercise, the same table is looked at as containing data obtained from three different schemes. The chi-square methodology will be applied in all sections through three kinds of test: goodness-of-fit, independence and homogeneity. In the first case, a probability distribution F0 is specified, while in the last two cases the underlying distributions have no interest by themselves.

(a) Goodness-of-fit test

Statistic: To apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select

T0(X) = Σk=1..K (Nk − ek)²/ek →d χ²K−s−1

where there are K classes and s parameters must be estimated to determine the probabilities.

Hypotheses: For a nonparametric goodness-of-fit test, the null hypothesis assumes that the theoretical probabilities of the second group follow the probabilities determined by the sample of the first group. If Fk represents the distribution of the variable product in the k-th population,

H0: F2 ∼ F1 and H1: F2 ∼ F ≠ F1

The variable of the first group determines the following distribution F1:

Value         1       2       3
Probability   10/37   18/37   9/37

Now, under H0 the formula ek = n·pk allows us to fill in the expected table:

Value      1                    2                    3
Expected   48·(10/37) = 12.97   48·(18/37) = 23.35   48·(9/37) = 11.68     n = 48

Then, we need the evaluation

T0(x) = (20 − 48·10/37)²/(48·10/37) + (13 − 48·18/37)²/(48·18/37) + (15 − 48·9/37)²/(48·9/37) = 9.34

On the other hand, for this kind of test the critical region always has the form Rc = {T0(X) > a}.


Page 138: Solved Exercises and Problems of Statistical Inferencecasado-d.org/edu/ExercisesProblemsStatisticalInference.pdf · Solved Exercises and Problems of Statistical Inference ... Mathematics

Decision: Since there are K = 3 classes and s = 0 (no parameter has to be estimated to determine the probabilities),

T0(X) →d χ²K−s−1 ≡ χ²3−0−1 ≡ χ²2

If we apply the methodology based on the critical region, the necessary quantile a is calculated from the definition of the type I error, with the given α = 0.01:

α = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = rα = 9.21 → Rc = {T0(X) > 9.21}

Then, the decision is: T0(x) = 9.34 > 9.21 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x ∣ H0 true) = P(T0(X) > T0(x)) ≈ P(T0(X) > 9.34) < P(T0(X) > 9.21) = 1 − 0.99 = 0.01

→ pV < 0.01 = α → H0 is rejected.
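The same test can be run with R's built-in chisq.test, passing the first group's sample proportions as the null probabilities (a sketch; the variable names are ours):

```r
# Goodness of fit of group 2's counts to the distribution of group 1.
res <- chisq.test(x = c(20, 13, 15), p = c(10, 18, 9) / 37)
round(unname(res$statistic), 2)  # about 9.34
res$p.value                      # about 0.0094, just below alpha = 0.01
```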

(b) Independence test

Statistic: To apply a test of independence, from a table of statistics (e.g. in [T]) we select

T0(X) = Σl=1..L Σk=1..K (Nlk − elk)²/elk →d χ²(L−1)(K−1)

for L and K classes, respectively. Underlying distributions are supposed—but not specified—for the variables X and Y, and the probabilities are directly estimated from the sample information.

Hypotheses: For a nonparametric independence test, the null hypothesis assumes that the probability at any cell is the product of the marginal probabilities of its row and column,

H0: X, Y independent and H1: X, Y dependent

or, probabilistically,

H0: f(x, y) = fX(x)·fY(y) and H1: f(x, y) ≠ fX(x)·fY(y)

Under H0, the formula elk = n·plk = n·pl·pk = n·(Nl·/n)·(N·k/n) allows us to fill in the expected table:

           Product 1           Product 2           Product 3
Group 1    37·30/85 = 13.06    37·31/85 = 13.49    37·24/85 = 10.45    37
Group 2    48·30/85 = 16.94    48·31/85 = 17.51    48·24/85 = 13.55    48
           30                  31                  24                  85

Then,

T0(x) = (10 − 37·30/85)²/(37·30/85) + ⋯ + (15 − 48·24/85)²/(48·24/85) = 4.29

For this kind of test, the critical region always has the following form: Rc = {T0(X) > a}.


(Note: 9.34 is not in our table while 9.21 is.)

Page 139: Solved Exercises and Problems of Statistical Inferencecasado-d.org/edu/ExercisesProblemsStatisticalInference.pdf · Solved Exercises and Problems of Statistical Inference ... Mathematics

Decision: There are L = 2 and K = 3 classes, respectively, so

T0(X) →d χ²(L−1)(K−1) ≡ χ²(2−1)(3−1) ≡ χ²2

For the first methodology, to calculate a the definition of type I error is applied with α = 0.01:

α = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = rα = 9.21 → Rc = {T0(X) > 9.21}

The decision is: T0(x) = 4.29 < 9.21 → T0(x) ∉ Rc → H0 is not rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x ∣ H0 true) = P(T0(X) > T0(x)) ≈ P(T0(X) > 4.29) > P(T0(X) > 4.61) = 1 − 0.9 = 0.1

→ pV > 0.1 > 0.01 = α → H0 is not rejected.
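Applied to the contingency table, chisq.test reproduces both the expected table and the decision (a sketch; Yates' continuity correction is only used for 2×2 tables, so none is applied here):

```r
# Independence test on the 2x3 table of groups by products.
counts <- matrix(c(10, 18, 9,
                   20, 13, 15), nrow = 2, byrow = TRUE)
res <- chisq.test(counts)
round(unname(res$statistic), 2)  # about 4.29
res$p.value                      # about 0.117, far above alpha = 0.01
round(res$expected, 2)           # e_lk = N_l. * N_.k / n
```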

(c) Homogeneity test

Statistic: To apply a test of homogeneity, from a table of statistics (e.g. in [T]) we select

T0(X) = Σl=1..L Σk=1..K (Nlk − elk)²/elk →d χ²(L−1)(K−1)

for L groups and K classes. An underlying distribution is supposed—but not specified—for the variable X, and the probabilities are directly estimated from the sample information. (Note that the membership of a group can be seen as the value of a factor.)

Hypotheses: For a nonparametric homogeneity test, the null hypothesis assumes that the marginal probabilities in any column are the same for the two groups, that is, are independent of the group or stratum. This means that the variable of interest X follows the same probability distribution in each (sub)group or stratum. If G represents the variable group, mathematically,

H0: F(x∣G) = F(x) and H1: F(x∣G) ≠ F(x)

Under H0, the formula elk = nl·plk = nl·pk = nl·(N·k/n) allows us to fill in the expected table, whose values coincide with those of the previous section.



Then

T0(x) = (10 − 37·30/85)²/(37·30/85) + ⋯ + (15 − 48·24/85)²/(48·24/85) = 4.29

For this kind of test, the critical region always has the following form: Rc = {T0(X) > a}.

Decision: For L = 2 groups and K = 3 classes,

T0(X) →d χ²(L−1)(K−1) ≡ χ²(2−1)(3−1) ≡ χ²2

If we apply the methodology based on the critical region, to calculate the quantile a the definition of type I error is applied with α = 0.01:

α = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = rα = 9.21 → Rc = {T0(X) > 9.21}

To make the decision: T0(x) = 4.29 < 9.21 → T0(x) ∉ Rc → H0 is not rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x ∣ H0 true) = P(T0(X) > T0(x)) ≈ P(T0(X) > 4.29) > P(T0(X) > 4.61) = 1 − 0.9 = 0.1

→ pV > 0.1 > 0.01 = α → H0 is not rejected.

Conclusion (advanced): Neither the independence nor the homogeneity has been rejected, while the hypothesis supposing that the variable product follows in population 2 the distribution determined by the sample of group 1 has been rejected. On the one hand, the distribution determined by one sample, involved in section (a), is in general different from the common supposed underlying distribution involved in section (b), which is estimated by using the samples of both groups. Thus, it can be thought that this underlying distribution “is between the two samples”, by which we can justify the decisions made in (a), (b) and (c). Group 2 has more weight in determining that distribution, since it has more elements. It is worth noticing the similarity between the independence and the homogeneity tests: same distribution and evaluation for the statistic, same critical region, et cetera. (As regards the application of the methodologies, bounding the p-value is sometimes enough to discover whether it is smaller than α or not, but in general statisticians want to find its value.)


(Note: 4.29 is not in our table while 4.61 is.)



[HT] Parametric and Nonparametric

Exercise 1ht

To test whether a coin is fair or not, it has independently been tossed 100,000 times (the outcomes form a simple random sample), and 50,347 of the tosses were heads. Should the fairness of the coin, as null hypothesis, be rejected when α = 0.1?

(a) Apply a parametric test. By using a computer, plot the power function.

(b) Apply the nonparametric chi-square goodness-of-fit test.

(c) Apply the nonparametric position signs test.

Discussion: In this exercise, no supposition should be evaluated: in (a) because the Bernoulli model is “the only proper one” to model a coin, and in (b) and (c) because they involve nonparametric tests. The sections of this exercise need the same calculations as in previous exercises.

(a) Parametric test

Statistic: From a table of statistics (e.g. in [T]), since the population variable is Bernoulli and the asymptotic framework can be considered (since n is big), the statistic

T(X; η) = (η̂ − η) / √(?(1−?)/n) →d N(0,1)

is selected, where the symbol ? is substituted by the best information available. In testing hypotheses, it will be used in two forms:

T0(X) = (η̂ − η0) / √(η0(1−η0)/n) →d N(0,1)   and   T1(X) = (η̂ − η1) / √(η1(1−η1)/n) →d N(0,1)

where the supposed knowledge about the value of η is used in the denominators to estimate the variance (we do not have nor suppose this information when T is used to build a confidence interval, or for tests with two populations). Regardless of the methodology to be applied, the following value will be necessary:

T0(x) = (50,347/100,000 − 1/2) / √((1/2)(1 − 1/2)/100,000) = 2.19

where η0 = 1/2 when the coin is supposed to be fair.

Hypotheses: Since a parametric test must be applied, the coin—dichotomic situation—is modeled by a Bernoulli random variable, and the hypotheses are

H0: η = η0 = 1/2 and H1: η = η1 ≠ 1/2

Note that the question is about the value of the parameter η while the Bernoulli distribution is supposed under both hypotheses; in some nonparametric tests, this distribution is not even supposed in general (although the



only reasonable distribution to model a coin is the Bernoulli). For this kind of alternative hypothesis, the critical region takes the form Rc = {∣T0(X)∣ > a}.

Decision: To determine Rc, the quantiles are calculated from the type I error with α = 0.1 at η0 = 1/2:

α(1/2) = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T(X; θ) ∈ Rc ∣ H0) = P(∣T0(X)∣ > a)

→ a = rα/2 = 1.645 → Rc = {∣T0(X)∣ > 1.645}

Thus, the decision is: T0(x) = 2.19 > 1.645 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x ∣ H0 true) = P(∣T0(X)∣ > ∣T0(x)∣) = 2·P(T0(X) < −2.19) = 2·0.0143 = 0.0286

→ pV = 0.0286 < 0.1 = α → H0 is rejected.
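The two-sided asymptotic test can be written out in a few lines of R (a sketch under the names used above):

```r
# Asymptotic test of H0: eta = 1/2 for the Bernoulli proportion.
n <- 100000; heads <- 50347
eta0 <- 1/2; alpha <- 0.1
T0 <- (heads/n - eta0) / sqrt(eta0 * (1 - eta0) / n)
a <- qnorm(1 - alpha/2)      # 1.645
pV <- 2 * pnorm(-abs(T0))    # two-sided p-value
round(c(T0 = T0, a = a, pV = pV), 4)
```

With the unrounded statistic 2.1946 the p-value is 0.0282; rounding to 2.19 and using the table value 0.0143 gives the 0.0286 quoted above.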

Power function: To calculate β, we have to work under H1. Since in this case the critical region is already expressed in terms of T0 and we must use T1, we apply the mathematical tricks of multiplying and dividing by the same quantity and of adding and subtracting the same quantity:

β(η1) = P(Type II error) = P(Accept H0 ∣ H1 true) = P(T0(X) ∉ Rc ∣ H1) = P(∣T0(X)∣ ≤ 1.645 ∣ H1)

= P( −1.645 ≤ (η̂ − η0)/√(η0(1−η0)/n) ≤ +1.645 ∣ H1 )

= P( −1.645·√(η0(1−η0))/√(η1(1−η1)) ≤ (η̂ − η0)/√(η1(1−η1)/n) ≤ +1.645·√(η0(1−η0))/√(η1(1−η1)) ∣ H1 )

= P( −1.645·√(η0(1−η0))/√(η1(1−η1)) ≤ (η̂ − η1 + η1 − η0)/√(η1(1−η1)/n) ≤ +1.645·√(η0(1−η0))/√(η1(1−η1)) ∣ H1 )

= P( −1.645·√(η0(1−η0))/√(η1(1−η1)) − (η1−η0)/√(η1(1−η1)/n) ≤ (η̂ − η1)/√(η1(1−η1)/n) ≤ +1.645·√(η0(1−η0))/√(η1(1−η1)) − (η1−η0)/√(η1(1−η1)/n) ∣ H1 )

= P( [−1.645·√(η0(1−η0)) − √n·(η1−η0)]/√(η1(1−η1)) ≤ T1 ≤ [+1.645·√(η0(1−η0)) − √n·(η1−η0)]/√(η1(1−η1)) )

= P( T1 ≤ [+1.645·√(η0(1−η0)) − √n·(η1−η0)]/√(η1(1−η1)) ) − P( T1 < [−1.645·√(η0(1−η0)) − √n·(η1−η0)]/√(η1(1−η1)) )



By using a computer, many more values η1 ≠ 0.5 can be considered to plot the power function

ϕ(η) = P(Reject H0) = { α(η) if η ∈ Θ0 ; 1 − β(η) if η ∈ Θ1 }

# Sample and inference
n = 100000
alpha = 0.1
theta0 = 0.5   # Value under the null hypothesis H0
q = qnorm(c(alpha/2, 1-alpha/2), 0, 1)
theta1 = seq(from=0, to=1, 0.01)
paramSpace = sort(unique(c(theta1, theta0)))
PowerFunction = 1 - pnorm((q[2]*sqrt(theta0*(1-theta0)) - sqrt(n)*(paramSpace-theta0))/sqrt(paramSpace*(1-paramSpace)), 0, 1) + pnorm((q[1]*sqrt(theta0*(1-theta0)) - sqrt(n)*(paramSpace-theta0))/sqrt(paramSpace*(1-paramSpace)), 0, 1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

With this code the power function is plotted:

(b) Nonparametric chi-square goodness-of-fit test

Statistic: To apply a goodness-of-fit test, from a table of statistics (e.g. in [T]) we select

T0(X) = Σk=1..K (Nk − ek)²/ek →d χ²K−s−1

where there are K classes, and s parameters have to be estimated to determine F0 and hence the probabilities.

Hypotheses: For a nonparametric goodness-of-fit test, the null hypothesis supposes that the sample was generated by a Bernoulli distribution with η0 = 1/2, while the alternative hypothesis supposes that it was generated by a different distribution (Bernoulli or not, although this distribution is here “the reasonable way” of modeling a coin).

H0: X ∼ F0 = B(1/2) and H1: X ∼ F ≠ B(1/2)

For the distribution F0, the table of probabilities is

Value         –1 (tail)   +1 (head)
Probability   1/2         1/2

and, under H0, the formula ek = n·pk = n·Pθ(k-th class) = 100,000·(1/2) = 50,000 allows us to fill in the expected table:



and

T0(x) = Σk=1..2 (nk − ek)²/ek = (50,347 − 50,000)²/50,000 + (49,653 − 50,000)²/50,000 = 4.82

On the other hand, for this kind of test, the critical region always has the following form: Rc = {T0(X) > a}.

Decision: There are K = 2 classes and s = 0 (no parameter has been estimated), so

T0(X) →d χ²K−s−1 ≡ χ²2−0−1 ≡ χ²1

If we apply the methodology based on the critical region, the definition of type I error, with α = 0.1, is applied to calculate the quantile a:

α = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T0(X) ∈ Rc) ≈ P(T0(X) > a)

→ a = rα = 2.71 → Rc = {T0(X) > 2.71}

Then, the decision is: T0(x) = 4.82 > 2.71 → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x) = P(T0(X) > T0(x)) ≈ P(T0(X) > 4.82) < P(T0(X) > 3.84) = 0.05

→ pV < 0.05 < 0.1 = α → H0 is rejected.

Note: Bounding the p-value is sometimes enough to make the decision—4.82 is not in our table while 3.84 is.
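The built-in chisq.test gives the same statistic and p-value (a sketch; with two equiprobable classes the statistic is the square of the z value from section (a)):

```r
# Chi-square goodness-of-fit test for the coin: heads versus tails.
res <- chisq.test(c(50347, 49653), p = c(1/2, 1/2))
round(unname(res$statistic), 4)  # 4.8164 = 2.1946^2
res$p.value                      # about 0.028, below alpha = 0.1
```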

(c) Nonparametric position sign test

Statistic: To apply a position sign test, from a table of statistics (e.g. in [T]) we select

T0(X) = Number{Xj − θ0 > 0} ∼ Bin(n, P(Xj > θ0))

Here θ0 = 0 and P(Xj > 0) = 1/2, so Me(T0) = E(T0) = n/2.

Hypotheses: For a nonparametric position test, if head and tail are equivalently translated into the numbers +1 and –1, respectively, the hypotheses are

H0: Me(X) = θ0 = 0 and H1: Me(X) = θ1 ≠ 0

For these hypotheses,



We need the evaluation

∣T0(x) − 100,000/2∣ = ∣50,347 − 50,000∣ = 347

Decision: In the first methodology, the quantile a is calculated by applying the definition of the type I error with α = 0.1. On the one hand, we know the distribution of T0, while, on the other hand, Rc was easily written in terms of T0 − n/2, whose distribution is involved in a well-known asymptotic result—the Central Limit Theorem for the Bin(n, 1/2). (Moreover, the probabilities of the binomial distribution are not tabulated for n = 100,000.) Then,

α = P(Type I error) = P(Reject H0 ∣ H0 true) = P(T0(X) ∈ Rc) = P(∣T0(X) − n/2∣ > a)

= P( ∣T0(X) − n/2∣ / √(n·(1/2)(1 − 1/2)) > a / √(n·(1/2)(1 − 1/2)) ) ≈ P(∣Z∣ > 2a/√n)

→ rα/2 = 1.645 = 2a/√n → a ≈ 1.645·√100,000/2 ≈ 260.097 → Rc = {∣T0(X) − n/2∣ > 260.097}

The final decision is: ∣T0(x) − 100,000/2∣ = 347 > 260.097 = a → T0(x) ∈ Rc → H0 is rejected.

If we apply the methodology based on the p-value,

pV = P(X more rejecting than x ∣ H0 true) = P(∣T0(X) − n/2∣ > ∣T0(x) − n/2∣)

= P( ∣T0(X) − n/2∣ / √(n·(1/2)(1 − 1/2)) > ∣T0(x) − n/2∣ / √(n·(1/2)(1 − 1/2)) )

≈ P( ∣Z∣ > ∣(50,347 − 50,000) / √(100,000·(1/2)(1 − 1/2))∣ )

= P(∣Z∣ > 2.19) = 2·P(Z < −2.19) = 2·0.0143 = 0.0286

→ pV = 0.0286 < 0.1 = α → H0 is rejected.
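Since under H0 the number of heads is exactly Bin(100,000, 1/2), the sign test can also be carried out exactly with binom.test; a sketch comparing it with the normal approximation used above (the variable names are ours):

```r
# Exact binomial sign test versus the CLT approximation.
n <- 100000; heads <- 50347
exact <- binom.test(heads, n, p = 1/2)$p.value
approx <- 2 * pnorm(-abs(heads - n/2) / sqrt(n * (1/2) * (1/2)))
round(c(exact = exact, approx = approx), 4)  # both close, both below 0.1
```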

Conclusion: (1) In this case the three different tests agree to make the same decision, but this may not happen in other situations. When it is possible to compare the power functions and there exists a uniformly most powerful test, the decision of the most powerful should be considered. In general, (proper) parametric tests are expected to have more power than the nonparametric ones in testing the same hypotheses. (2) With two classes, the chi-square test does not distinguish any two distributions such that the two class probabilities are (½, ½); that is, in this case the test provides a decision about the symmetry of the distribution (chi-square tests work with class probabilities, not with the distributions themselves). (3) In this exercise the parametric test and the nonparametric test of the signs are essentially the same. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)




PE – CI – HT

Exercise 1pe-ci-ht

From a previous pilot study, concerning the monthly amount of money (in $) that male and female students of a community spend on cell phones, the following hypotheses are reasonably supported:

i. The variable amount of money follows a normal distribution in both populations.

ii. The population means are μM = $14.2 and μF = $13.5, respectively.

iii. The two populations are independent.

Two independent simple random samples of sizes nM = 53 and nF = 49 are considered, from which the following statistics have been calculated:

SM² = 4.99 $² and SF² = 5.02 $²

Then,

A) Calculate the probability P(M̄ − F̄ ≤ 1.27). Repeat the calculations with the supposition that σM and σF are equal.

B) Build a 95% confidence interval for the quotient σM/σF.

Discussion: The pilot statistical study mentioned in the statement should cover the evaluation of all suppositions. The hypothesis that σM = σF should be evaluated as well. The interval will be built by applying the method of the pivot.

Identification of the variable:

M ≡ Amount of money spent by a male (one)     M ∼ N(μM, σM²)

F ≡ Amount of money spent by a female (one)   F ∼ N(μF, σF²)

Selection of the statistics: We know that:

• There are two independent normal populations
• Standard deviations σM and σF are unknown, and we compare them through σM/σF

From a table of statistics (e.g. in [T]) we select

T(M̄, F̄; μM, μF) = [(M̄ − F̄) − (μM − μF)] / √(SM²/nM + SF²/nF) ∼ tκ

with κ = (SM²/nM + SF²/nF)² / [ (1/(nM−1))·(SM²/nM)² + (1/(nF−1))·(SF²/nF)² ]

T(M̄, F̄; μM, μF) = [(M̄ − F̄) − (μM − μF)] / √(Sp²/nM + Sp²/nF) ∼ tnM+nF−2

with Sp² = (nM·sM² + nF·sF²)/(nM + nF − 2) = [(nM−1)·SM² + (nF−1)·SF²]/(nM + nF − 2)

T(M, F; σM, σF) = (SM²/σM²) / (SF²/σF²) = SM²σF² / (SF²σM²) ∼ FnM−1, nF−1

Because of the information available, the first and the second statistics allow studying M̄ − F̄ (the second for the particular case where σM = σF), while the third allows studying σM/σF.

(A) Calculation of the probability:

P(M̄ − F̄ ≤ 1.27) = P( [(M̄ − F̄) − (μM − μF)] / √(SM²/nM + SF²/nF) ≤ [1.27 − (μM − μF)] / √(SM²/nM + SF²/nF) )

= P( T ≤ [1.27 − (14.2 − 13.5)] / √(4.99/53 + 5.02/49) ) = P(T ≤ 1.29)

with T ∼ tκ where

κ = (SM²/nM + SF²/nF)² / [ (1/(nM−1))·(SM²/nM)² + (1/(nF−1))·(SF²/nF)² ] = (4.99/53 + 5.02/49)² / [ (1/52)·(4.99/53)² + (1/48)·(5.02/49)² ] = 99.33

Should we round this value downward, κ = 99, or upward, κ = 100? We will use this exercise to show that

➢ For two large values κ1 and κ2 of the parameter, the t distributions provide close values

➢ For a large value of κ, the t distribution provides values close to those of the standard normal distribution (the tκ distribution tends with κ to the standard normal distribution)

By using the programming language R:

• If we round κ = 99.33 down to 99, the probability is 0.8999721

• If we round κ = 99.33 up to 100, the probability is 0.8999871

• If we use the N(0,1), the probability is 0.9014747

On the other hand, when the variances are supposed to be equal they can and should be estimated jointly by using the pooled sample variance.

Sp² = [(nM−1)·SM² + (nF−1)·SF²] / (nM + nF − 2) = [(53−1)·4.99 $² + (49−1)·5.02 $²] / (53 + 49 − 2) = 5.0044 $² ≈ 5 $²

Then,

P(M̄ − F̄ ≤ 1.27) = P( [(M̄ − F̄) − (μM − μF)] / √(Sp²/nM + Sp²/nF) ≤ [1.27 − (μM − μF)] / √(Sp²/nM + Sp²/nF) )

= P( T ≤ [1.27 − (14.2 − 13.5)] / √(5/53 + 5/49) ) = P(T ≤ 1.29)

with T ∼ tnM+nF−2 = t53+49−2 = t100.

• By using the table of the t distribution, the probability is 0.9.

• By using the language R, the probability is 0.8993372.


> pt(1.29, 100)
[1] 0.8999871

> pt(1.29, 99)
[1] 0.8999721

> pnorm(1.29)
[1] 0.9014747

> pt((1.27-14.2+13.5)/sqrt((5/53)+(5/49)), 100)
[1] 0.8993372


(B) Method of the pivotal quantity:

1 − α = P(lα/2 ≤ T ≤ rα/2) = P( lα/2 ≤ SM²σF²/(σM²SF²) ≤ rα/2 ) = P( lα/2·SF²/SM² ≤ σF²/σM² ≤ rα/2·SF²/SM² ) = P( SM²/(lα/2·SF²) ≥ σM²/σF² ≥ SM²/(rα/2·SF²) )

so confidence intervals for σM²/σF² and σM/σF are respectively given by

I1−α = [ SM²/(rα/2·SF²) , SM²/(lα/2·SF²) ] and then I1−α = [ √(SM²/(rα/2·SF²)) , √(SM²/(lα/2·SF²)) ]

In the calculations, multiplying by a quantity and inverting can be applied in either order.

Substitution: We calculate the quantities in the formula,

• SM² = 4.99 $² and SF² = 5.02 $²

• 95% → 1−α = 0.95 → α = 0.05 → α/2 = 0.025 → lα/2 = 0.57 and rα/2 = 1.76

Then

I0.95 = [ √(4.99/(1.76·5.02)) , √(4.99/(0.57·5.02)) ] = [0.75, 1.32]
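The whole interval can be computed in R with exact F quantiles instead of the rounded table values (a sketch; the variable names are ours):

```r
# 95% confidence interval for sigma_M / sigma_F from the F pivot.
sM2 <- 4.99; sF2 <- 5.02
nM <- 53; nF <- 49; alpha <- 0.05
q <- qf(c(alpha/2, 1 - alpha/2), nM - 1, nF - 1)   # l and r quantiles
ci <- sqrt(c(sM2 / (q[2] * sF2), sM2 / (q[1] * sF2)))
round(ci, 2)  # 0.75 and 1.32, matching the interval above
```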

Conclusion: First of all, in this case there is very little difference between the two ways of estimating the variance. On the other hand, as the variances are related through a quotient, the interpretation is not direct: the dimensionless, multiplicative factor c in σM² = c·σF² is, with 95% confidence, in the interval obtained. The interval (with dimensionless endpoints) contains the value 1, so it may happen that the variability of the amount of money spent is the same for males and females—we cannot reject this hypothesis (note that confidence intervals can be used to make decisions). (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)

Exercise 2pe-ci-ht

The electric light bulbs of manufacturer X have a mean lifetime of 1400 hours (h), while those of manufacturer Y have a mean lifetime of 1200 h. Simple random samples of 125 bulbs of each brand are tested. From these datasets the sample quasivariances Sx² = 156 h² and Sy² = 159 h² are computed. If manufacturers are supposed to be independent and their lifetimes are supposed to be normally distributed:

a) Build a 99% confidence interval for the quotient of standard deviations σX/σY. Is the value σX/σY = 1, that is, the case σX = σY, included in the interval?

b) By using the proper statistic T, find k such that P(X̄ − Ȳ ≤ k) = 0.4.

Hint: (i) Firstly, build an interval for the quotient σX²/σY²; secondly, apply the positive square root function. (ii) If a random variable ξ follows a F124, 124 then P(ξ ≤ 0.628) = 0.005 and P(ξ ≤ 1.59) = 0.995. (iii) If ξ follows a t248, then P(ξ ≤ –0.25) = 0.4.

(Based on an exercise of Statistics, Spiegel, M.R., and L.J. Stephens, McGraw–Hill.)
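The quantiles given in the hint can be verified in R (a quick check, not part of the original solution):

```r
# F and t quantiles quoted in the hint.
q <- qf(c(0.005, 0.995), 124, 124)
q             # about 0.628 and 1.59
qt(0.4, 248)  # about -0.25
```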

LINGUISTIC NOTE (From: Longman Dictionary of Common Errors. Turton, N.D., and J.B. Heaton. Longman.)

electric means carrying, producing, produced by, powered by, or charged with electricity: 'an electric wire', 'an electric generator', 'an electric shock', 'an electric current', 'an electric light bulb', 'an electric toaster'. For machines and devices that are powered by electricity but do not have transistors, microchips, valves, etc, use electric (NOT electronic): 'an electric guitar', 'an electric train set', 'an electric razor'.

145 Solved Exercises and Problems of Statistical Inference

> qf(c(0.025, 1-0.025), 53-1, 49-1)
[1] 0.5723433 1.7583576

My notes:


electrical means associated with electricity: 'electrical systems', 'a course in electrical engineering', 'an electrical engineer'. To refer to the general class of things that are powered by electricity, use electrical (NOT electric): 'electrical equipment', 'We stock all the latest electrical kitchen appliances'.

electronic is used to refer to equipment which is designed to work by means of an electric current passing through a large number of transistors, microchips, valves etc, and components of this equipment: 'an electronic calculator', 'tiny electronic components'. Compare: 'an electronic calculator' BUT 'an electric oven'. An electronic system is one that uses equipment of this type: 'electronic surveillance', 'e-mail' (= electronic mail, a system for sending messages very quickly by means of computers).

electronics (WITH s) refers to (1) the branch of science and technology concerned with the study, design or use of electronic equipment: 'a student of electronics'; (2) (used as a modifier) anything that is connected with this branch: 'the electronics industry'.

Discussion: There are two independent normal populations. All suppositions should be evaluated. Their means are known while their variances are estimated from samples of size 125. A 99% confidence interval for σX/σY is required. The interval will be built by applying the method of the pivot. If the value σX/σY = 1 belongs to this interval of confidence 0.99, the probability of the second section can reasonably be calculated under the supposition σX = σ = σY—this implies that the common variance σ² is jointly estimated by using the pooled sample quasivariance Sp². On the other hand, this exercise shows the natural order in which the statistical techniques must sometimes be applied in practice: the supposition σX = σY is empirically supported—by applying a confidence interval or a hypothesis test—before using it in calculating the probability. Since the standard deviations have the same units of measurement as the data (hours), their quotient is dimensionless, and so are the endpoints of the interval.

Identification of the variables:

X ≡ Lifetime of a light bulb of manufacturer X, X ~ N(μX = 1400h, σX²)

Y ≡ Lifetime of a light bulb of manufacturer Y, Y ~ N(μY = 1200h, σY²)

(a) Confidence interval

Selection of the statistics: We know that:

• There are two independent normal populations
• The standard deviations σX and σY are unknown, and we compare them through σX/σY

From a table of statistics (e.g. in [T]) we select a (dimensionless) statistic. To compare the variances of two independent normal populations, we have two candidates:

T(X, Y; σX, σY) = (VX²·σY²)/(VY²·σX²) ~ F(nX, nY)   and   T(X, Y; σX, σY) = (SX²·σY²)/(SY²·σX²) ~ F(nX−1, nY−1)

where VX² = (1/n)·Σⱼ(Xⱼ − μ)² and SX² = (1/(n−1))·Σⱼ(Xⱼ − X̄)², respectively (similarly for population Y).

We would use the first if we were given VX² and VY², or if we had enough information to calculate them (we know the means but not the data themselves). In this exercise we can use only the second statistic.

Method of the pivot:

1 − α = P(lα/2 ≤ T ≤ rα/2) = P( lα/2 ≤ (SX²·σY²)/(σX²·SY²) ≤ rα/2 )
= P( lα/2·SY²/SX² ≤ σY²/σX² ≤ rα/2·SY²/SX² ) = P( SX²/(lα/2·SY²) ≥ σX²/σY² ≥ SX²/(rα/2·SY²) )

so confidence intervals for σX²/σY² and σX/σY are respectively given by

I(1−α) = [ SX²/(rα/2·SY²), SX²/(lα/2·SY²) ]   and   I(1−α) = [ √(SX²/(rα/2·SY²)), √(SX²/(lα/2·SY²)) ]



(In the previous calculations, multiplying by a quantity and inverting can be applied either way.)

Substitution: We calculate the quantities in the formula,

• SX² = 156h² and SY² = 159h²
• 99% → 1−α = 0.99 → α = 0.01 → α/2 = 0.005 → P(ξ ≤ lα/2) = α/2 = 0.005 and P(ξ ≤ rα/2) = 1−α/2 = 0.995 → lα/2 = 0.628 and rα/2 = 1.59

where the information in (ii) of the hint has been used. Then

I(0.99) = [ √(156h²/(1.59·159h²)), √(156h²/(0.628·159h²)) ] = [0.786, 1.25]

The value σX/σY=1 is in the interval of confidence 0.99 (99%), so the supposition σX=σY is strongly supported.
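The endpoints can be reproduced numerically. A minimal check in Python (the book's own snippets use R); the two F(124,124) quantiles are taken from the hint rather than computed:

```python
import math

# Sample quasivariances (h^2) and the F(124,124) quantiles given in the hint
sx2, sy2 = 156.0, 159.0
l, r = 0.628, 1.59  # P(xi <= l) = 0.005, P(xi <= r) = 0.995

# 99% CI for sigma_X^2 / sigma_Y^2, then take positive square roots
low = math.sqrt(sx2 / (r * sy2))
high = math.sqrt(sx2 / (l * sy2))
print(round(low, 3), round(high, 3))  # endpoints of the interval for sigma_X/sigma_Y
```

Since the quantiles are hard-coded, this only verifies the arithmetic, not the distributional result itself.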

(b) Probability

To work with the difference of the means of two independent normal populations when σX = σY, we consider:

T(X, Y; μX, μY) = ( (X̄ − Ȳ) − (μX − μY) ) / √(Sp²/nX + Sp²/nY) ~ t(nX+nY−2)

where Sp² = ((nX−1)·SX² + (nY−1)·SY²)/(nX+nY−2) = (124·156h² + 124·159h²)/(125+125−2) = 157.5h² is the pooled sample quasivariance.

The quantile can be found after rewriting the event as follows:

0.4 = P(X̄ − Ȳ ≤ k) = P( ( (X̄−Ȳ) − (μX−μY) ) / √(Sp²/nX + Sp²/nY) ≤ ( k − (μX−μY) ) / √(Sp²/nX + Sp²/nY) )
= P( T ≤ ( k − (1400−1200) ) / √(157.5/125 + 157.5/125) )

Now, by using the information in (iii) of the hint,

l(0.4) = −0.25 = ( k − (1400h − 1200h) ) / √(157.5h²/125 + 157.5h²/125)  →  k = 200h − 0.25·√(2·157.5h²/125) = 199.60h

Conclusion: A confidence interval has been obtained for the quotient of the standard deviations. The dimensionless value of θ = σX/σY is between 0.786 and 1.250 with confidence 99%; alternatively, as the standard deviations are related through a quotient, an equivalent interpretation is the following: the (dimensionless) multiplicative factor θ in σX = θ·σY is, with 99% confidence, in the interval obtained. Since the value θ = 1 is in this high-confidence interval, it may happen that the variability of the two lifetimes is the same—we cannot reject this hypothesis (note that confidence intervals can be used to make decisions); besides, it is reasonable to use the supposition σX = σY in calculating the probability of the second section. If any two simple random samples of size 125 were considered, the difference of the sample means would be smaller than 199.60h with a probability of 0.4. Once two particular samples are substituted, randomness is not involved any more and the inequality x̄ − ȳ ≤ k = 199.60h is true or false. The endpoints of the interval have no dimension, like the quotient σX/σY or the multiplicative factor θ. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)


My notes:


Exercise 3pe-ci-ht

In 1990, 25% of births were by mothers of more than 30 years of age. This year a simple random sample of size 120 births has been taken, yielding the result that 34 of them were by mothers of over 30 years of age.

a) With a significance of 10%, can it be accepted that the proportion of births by mothers of over 30 years of age is still ≤ 25%, against that it has increased? Select the statistic, write the critical region and make a decision. Calculate the p-value and make a decision. If the critical region is Rc = {η̂ > 0.30}, calculate β (probability of the type II error) for η1 = 0.35. Plot the power function with the help of a computer.

b) Obtain a 90% confidence interval for the proportion. Use it to make a decision about the value of η, which is equivalent to having applied a two-sided (nondirectional) hypothesis test in the first section.

(First half of section a and first calculation in b, from 2007's exams for accessing the Spanish university; I have added the other parts.)

Discussion: In this exercise, no supposition should be evaluated. The number 30 plays a role only in defining the population under study. The Bernoulli model is "the only proper one" to register the presence or absence of a condition. Percents must be rewritten in a 0-to-1 scale. Since the default option is that the proportion has not changed, the equality is allocated in the null hypothesis. On the other hand, proportions are dimensionless by definition.

(a) Hypothesis test

Statistic: From a table of statistics (e.g. in [T]), since the population variable follows the Bernoulli distribution and the asymptotic framework can be considered (large n), the statistic

T(X; η) = (η̂ − η) / √(?(1−?)/n)  →d  N(0,1)

is selected, where the symbol ? is substituted by the best information available: η or η̂. In testing hypotheses, it will be used in two forms:

T0(X) = (η̂ − η0) / √(η0(1−η0)/n)  →d  N(0,1)   and   T1(X) = (η̂ − η1) / √(η1(1−η1)/n)  →d  N(0,1)

where the supposed knowledge about the value of η is used in the denominators to estimate the variance (we do not have this information when T is used to build a confidence interval, like in the next section). Regardless of the testing methodology to be applied, the evaluation of the statistic is necessary to make the decision. Since η0 = 0.25,

T0(x) = ( 34/120 − 0.25 ) / √(0.25·(1−0.25)/120) = 0.843

Hypotheses:

H0: η = η0 ≤ 0.25   and   H1: η = η1 > 0.25



For this alternative hypothesis, the critical region takes the form

Rc = {η̂ > c} = { (η̂ − η0)/√(η0(1−η0)/n) > (c − η0)/√(η0(1−η0)/n) } = {T0 > a}

Decision: To determine Rc, the quantile is calculated from the type I error with α = 0.1 at η0 = 0.25:

α(0.25) = P(Type I error) = P(Reject H0 | H0 true) = P(T0 > a)

→ a = r(0.1) = l(0.9) = 1.28  →  Rc = {T0(X) > 1.28}.

Now, the decision is: T0(x) = 0.843 < 1.28 → T0(x) ∉ Rc → H0 is not rejected.

p-value:

pV = P(X more rejecting than x | H0 true) = P(T0(X) > T0(x)) = P(T0(X) > 0.843) = 0.200

→ pV = 0.200 > 0.1 = α → H0 is not rejected.
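The statistic and the p-value can be cross-checked with Python's standard library (a sketch in Python rather than the book's R):

```python
from math import sqrt
from statistics import NormalDist

n, eta0 = 120, 0.25
eta_hat = 34 / n

# Test statistic under H0 and its asymptotic one-sided p-value
t0 = (eta_hat - eta0) / sqrt(eta0 * (1 - eta0) / n)
p_value = 1 - NormalDist().cdf(t0)
print(round(t0, 3), round(p_value, 3))
```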

Type II error: To calculate β, we have to work under H1. Since the critical region has been expressed in terms of T0, and we must use T1, we could apply the mathematical trick of adding and subtracting the same quantity. Nevertheless, this way is useful when the value c in Rc = {η̂ > c} has not been calculated yet; now, since we have been told that Rc = {η̂ > 0.3}, it is easier to directly standardize with η1:

β(η1) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X) ∉ Rc | H1) = P(η̂ ≤ 0.3 | H1)
= P( (η̂ − η1)/√(η1(1−η1)/n) ≤ (0.3 − η1)/√(η1(1−η1)/n) | H1 ) = P( T1 ≤ (0.3 − η1)/√(η1(1−η1)/n) )

For the particular value η1 = 0.35,

β(0.35) = P( T1 ≤ (0.3 − 0.35)/√(0.35·(1−0.35)/120) ) = P(T1 ≤ −1.15) = 0.125

By using a computer, many more values η1 ≠ 0.35 can be considered to plot the power function

φ(η) = P(Reject H0) = α(η) if η ∈ Θ0;  1 − β(η) if η ∈ Θ1

# Sample and inference
n = 120
alpha = 0.1
theta0 = 0.25 # Value under the null hypothesis H0
c = 0.3
theta1 = seq(from=0.25, to=1, 0.01)
paramSpace = sort(unique(c(theta1, theta0)))
PowerFunction = 1 - pnorm((c-paramSpace)/sqrt(paramSpace*(1-paramSpace)/n), 0, 1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

This code generates the power function:


> pnorm(-1.15,0,1)
[1] 0.125


(b) Confidence interval

Statistic: From a table of statistics (e.g. in [T]), the same statistic is selected

T(X; η) = (η̂ − η) / √(?(1−?)/n)  →d  N(0,1)

where the symbol ? is substituted by the best information available. In testing hypotheses we were also studying the unknown quantity η, although it was provisionally supposed to be known under the hypotheses; for confidence intervals, we are not working under any hypothesis and η must be estimated in the denominator:

T(X; η) = (η̂ − η) / √(η̂(1−η̂)/n)  →d  N(0,1)

The interval is obtained with the same calculations as in previous exercises involving a Bernoulli population,

I(1−α) = [ η̂ − rα/2·√(η̂(1−η̂)/n), η̂ + rα/2·√(η̂(1−η̂)/n) ]

where rα/2 is the value of the standard normal distribution such that P(Z > rα/2) = α/2. By using

• n = 120.
• Sample proportion: η̂ = 34/120 = 0.283.
• 90% → 1−α = 0.9 → α = 0.1 → α/2 = 0.05 → r(0.05) = l(0.95) = 1.645.

the particular interval (for these data) appears

I(0.9) = [ 0.283 − 1.645·√(0.283·(1−0.283)/120), 0.283 + 1.645·√(0.283·(1−0.283)/120) ] = [0.215, 0.351]

Thinking about the interval as an acceptance region, since η0 = 0.25 ∈ I, the hypothesis that η may still be 0.25 is not rejected.
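The interval can be verified numerically; a sketch in Python, keeping the text's rounding η̂ = 0.283:

```python
from math import sqrt
from statistics import NormalDist

n = 120
eta_hat = round(34 / n, 3)  # 0.283, as rounded in the text
r = NormalDist().inv_cdf(0.95)  # ~1.645 for a two-sided 90% interval

half = r * sqrt(eta_hat * (1 - eta_hat) / n)
low, high = eta_hat - half, eta_hat + half
print(round(low, 3), round(high, 3))
```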

Conclusion: With confidence 90%, the proportion of births by mothers of over 30 years of age seems to be 0.25 at most. The same decision is still made by considering the confidence interval that would correspond to a two-sided (nondirectional) test with the same confidence, that is, by allowing the new proportion to be different because it had either increased or decreased. (Remember: statistical results depend on: the assumptions, the methods, the certainty and the data.)

Exercise 4pe-ci-ht

A random quantity X is supposed to follow a distribution whose probability function is, for θ > 0,

f(x; θ) = θ·x^(θ−1) if 0 ≤ x ≤ 1;  0 otherwise

A) Apply the method of the moments to find an estimator of the parameter θ.

B) Apply the maximum likelihood method to find an estimator of the parameter θ.

C) Use the estimators obtained to build others for the mean μ and the variance σ2.

D) Let X = (X1,...,Xn) be a simple random sample. By applying the results involving the Neyman-Pearson lemma and the likelihood ratio, study the critical region for the following pairs of hypotheses.

{H0: θ = θ0 vs H1: θ = θ1}   {H0: θ = θ0 vs H1: θ = θ1 > θ0}   {H0: θ = θ0 vs H1: θ = θ1 < θ0}   {H0: θ ≤ θ0 vs H1: θ = θ1 > θ0}   {H0: θ ≥ θ0 vs H1: θ = θ1 < θ0}

Hint: Use that E(X) = θ/(θ+1) and E(X²) = θ/(θ+2).

Discussion: This statement is basically mathematical. The random variable X is dimensionless. (This probability distribution, with standard power-function density, is a particular case of the Beta distribution.)

Note: If E(X) had not been given in the statement, it could have been calculated by integrating:

E(X) = ∫ x·f(x;θ) dx = ∫₀¹ x·θx^(θ−1) dx = θ·∫₀¹ x^θ dx = θ·[ x^(θ+1)/(θ+1) ]₀¹ = θ/(θ+1)

Besides, E(X²) could have been calculated as follows:

E(X²) = ∫ x²·f(x;θ) dx = ∫₀¹ x²·θx^(θ−1) dx = θ·∫₀¹ x^(θ+1) dx = θ·[ x^(θ+2)/(θ+2) ]₀¹ = θ/(θ+2)

Now,

μ = E(X) = θ/(θ+1)   and   σ² = Var(X) = E(X²) − E(X)² = θ/(θ+2) − ( θ/(θ+1) )² = θ/((θ+2)(θ+1)²).

A) Method of the moments

a1) Population and sample moments: There is only one parameter—one equation is needed. The first-order moments of the model X and the sample x are, respectively,

μ1(θ) = E(X) = θ/(θ+1)   and   m1(x1, x2, ..., xn) = (1/n)·Σⱼ xⱼ = x̄

a2) System of equations: Since the parameter of interest θ appears in the first-order moment of X, the first equation suffices:

μ1(θ) = m1(x1, x2, ..., xn)  →  θ/(θ+1) = (1/n)·Σⱼ xⱼ = x̄  →  θ = θx̄ + x̄  →  θ = x̄/(1 − x̄)


My notes:


a3) The estimator:

θ̂M = X̄/(1 − X̄)

B) Maximum likelihood method

b1) Likelihood function: For this probability distribution, the density function is f(x;θ) = θx^(θ−1), so

L(x1, x2, ..., xn; θ) = ∏ⱼ f(xⱼ; θ) = ∏ⱼ θxⱼ^(θ−1) = θⁿ·(∏ⱼ xⱼ)^(θ−1)

b2) Optimization problem: The logarithm function is applied to make calculations easier

log[L(x1, x2, ..., xn; θ)] = n·log(θ) + (θ−1)·log(∏ⱼ xⱼ)

To find the local or relative extreme values, the necessary condition is:

0 = d/dθ log[L(x1, x2, ..., xn; θ)] = n/θ + log(∏ⱼ xⱼ)  →  n/θ = −log(∏ⱼ xⱼ)  →  θ0 = −n / log(∏ⱼ xⱼ)

To verify that the only candidate is a (local) maximum, the sufficient condition is:

d²/dθ² log[L(x1, x2, ..., xn; θ)] = d/dθ [ n/θ + log(∏ⱼ xⱼ) ] = −n/θ² < 0

The second derivative is always negative, also at the value θ0.

b3) The estimator:

θ̂ML = −n / log(∏ⱼ Xⱼ)

C) Estimation of μ and σ²

c1) For the mean: By applying the plug-in principle,

From the method of the moments: μ̂M = θ̂M/(θ̂M + 1) = ( X̄/(1−X̄) ) / ( X̄/(1−X̄) + 1 ) = ( X̄/(1−X̄) ) / ( (X̄ + 1 − X̄)/(1−X̄) ) = X̄.

From the maximum likelihood method: μ̂ML = θ̂ML/(θ̂ML + 1) = ( −n/log(∏ⱼXⱼ) ) / ( −n/log(∏ⱼXⱼ) + 1 ) = n / ( n − log(∏ⱼXⱼ) ).

c2) For the variance: Instead of substituting in the large expression of σ², we use functional notation

From the method of the moments: σ̂M² = σ²(θ̂M), with σ²(θ) and θ̂M given above.

From the maximum likelihood method: σ̂ML² = σ²(θ̂ML), with σ²(θ) and θ̂ML given above.

D) Neyman-Pearson's lemma and likelihood ratio

d1) For the hypotheses: H0: θ = θ0 vs H1: θ = θ1



The likelihood function and the likelihood ratio are

L(X; θ) = θⁿ·(∏ⱼXⱼ)^(θ−1)   and   Λ(X; θ0, θ1) = L(X; θ0)/L(X; θ1) = (θ0/θ1)ⁿ·(∏ⱼXⱼ)^(θ0−θ1)

Then, the critical or rejection region is

Rc = {Λ < k} = { (θ0/θ1)ⁿ·(∏ⱼXⱼ)^(θ0−θ1) < k } = { n·log(θ0/θ1) + (θ0−θ1)·log(∏ⱼXⱼ) < log(k) }
= { (θ0−θ1)·log(∏ⱼXⱼ) < log(k) − n·log(θ0/θ1) } = { 1/[(θ0−θ1)·log(∏ⱼXⱼ)] > 1/[log(k) − n·log(θ0/θ1)] }
= { (1/(θ0−θ1))·( −n/log(∏ⱼXⱼ) ) < −n/[log(k) − n·log(θ0/θ1)] } = { (1/(θ0−θ1))·θ̂ML < −n/[log(k) − n·log(θ0/θ1)] }

Now it is necessary that θ1 ≠ θ0 and

• if θ1 < θ0 then (θ0−θ1) > 0 and hence Rc = { θ̂ML < −n(θ0−θ1)/[log(k) − n·log(θ0/θ1)] }

• if θ1 > θ0 then (θ0−θ1) < 0 and hence Rc = { θ̂ML > −n(θ0−θ1)/[log(k) − n·log(θ0/θ1)] }

This suggests regions of the form

Rc = {Λ < k} = ⋯ = {θ̂ML < c}   or   Rc = {Λ < k} = ⋯ = {θ̂ML > c}

The form of the critical region can qualitatively be justified as follows: if θ1 < θ0, the hypothesis H1 will be accepted when an estimator of θ is in the lower tail, and vice versa.
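A hedged numerical illustration (hypothetical θ0, θ1 and samples, not from the original text): for θ1 > θ0 the likelihood ratio Λ decreases as θ̂ML increases, so {Λ < k} and {θ̂ML > c} order samples in the same way.

```python
import math

theta0, theta1 = 1.0, 2.0  # hypothetical simple hypotheses, theta1 > theta0

def lr_and_ml(x):
    # Likelihood ratio Lambda and ML estimator for a sample from (0, 1)
    n = len(x)
    log_prod = sum(math.log(v) for v in x)
    lam = (theta0 / theta1) ** n * math.exp((theta0 - theta1) * log_prod)
    theta_ml = -n / log_prod
    return lam, theta_ml

lam_a, ml_a = lr_and_ml([0.5, 0.6, 0.4])   # sample with smaller values
lam_b, ml_b = lr_and_ml([0.9, 0.95, 0.8])  # sample with larger values

# Larger theta_ml goes with smaller Lambda: same ordering, opposite direction
assert ml_b > ml_a and lam_b < lam_a
print(round(lam_a, 3), round(ml_a, 3), round(lam_b, 3), round(ml_b, 3))
```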

Hypothesis tests {H0: θ = θ0 vs H1: θ = θ1 > θ0} and {H0: θ = θ0 vs H1: θ = θ1 < θ0}:

In applying the methodologies, the same critical value c will be obtained for any θ1, since it depends upon θ0 only through θ̂ML:

α = P(Type I error) = P(θ̂ML < c)   or   α = P(Type I error) = P(θ̂ML > c)

This implies that the uniformly most powerful test has been found.

Hypothesis tests {H0: θ ≤ θ0 vs H1: θ = θ1 > θ0} and {H0: θ ≥ θ0 vs H1: θ = θ1 < θ0}:

A uniformly most powerful test for H0: θ = θ0 is also uniformly most powerful for H0: θ ≤ θ0.

Conclusion: For the probability distribution determined by the given function, two methods of point estimation have been applied. In this case, the two methods provide different estimators. By applying the plug-in principle, estimators of the mean and the variance have also been obtained. The form of the critical region has been studied by applying the Neyman-Pearson lemma and the likelihood ratio.



Additional Exercises

Exercise 1ae

Assume that the height (in centimeters, cm) of any student of a group follows a normal distribution with variance 55cm². If a simple random sample of 25 students is considered, calculate the probability that the sample quasivariance will be bigger than 64.625cm².

Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable height should be evaluated by using proper statistical techniques.

Identification of the variable and selection of the statistic: The variable is the height, the population distribution is normal, the sample size is 25, and we are asked for the probability of an event expressed in terms of one of the usual statistics: P(S² > 64.625).

Search for a known distribution: Since we do not know the sampling distribution of S², we cannot calculate this probability directly. Instead, just after reading 'sample quasivariance' we should think about the following theoretical result

T = (n−1)·S²/σ² ~ χ²(n−1), or, in this case, T = (25−1)·S²/55cm² ~ χ²(25−1).

Rewriting the event: The event has to be rewritten by completing some terms until (the dimensionless statistic) T appears. Additionally, when the table of the χ² distribution gives lower-tail probabilities P(X ≤ x), it is necessary to consider the complementary event:

P(S² > 64.625) = P( (25−1)·S²/55cm² > (25−1)·64.625cm²/55cm² ) = P(T > 28.2) = 1 − P(T ≤ 28.2) = 1 − 0.75 = 0.25.

In these calculations, one property of the transformations has been applied: multiplying or dividing by a positive quantity does not modify an inequality.

Conclusion: The probability of the event is 0.25. This means that S² will sometimes take a value bigger than 64.625cm², when evaluated at specific data x coming from the population distribution.

Exercise 2ae

Let X be a random variable with probability function

f(x; θ) = θ·x^(θ−1)/3^θ, x ∈ [0, 3]


My notes:



such that E(X) = 3θ/(θ+1). Supposing a simple random sample X = (X1,...,Xn) is available, apply the method of the moments to find an estimator θ̂M of the parameter θ.

Discussion: This statement is mathematical. Although it is given, the expectation of X could be calculated as follows

μ1(θ) = E(X) = ∫ x·f(x;θ) dx = ∫₀³ x·θx^(θ−1)/3^θ dx = (θ/3^θ)·[ x^(θ+1)/(θ+1) ]₀³ = (θ/3^θ)·3^(θ+1)/(θ+1) = 3θ/(θ+1)

Method of the moments

Population and sample moments: The first-order moments are

μ1(θ) = 3θ/(θ+1)   and   m1(x1, x2, ..., xn) = (1/n)·Σⱼ xⱼ = x̄

System of equations: Since the parameter θ appears in the first-order moment of X, the first equation is sufficient to apply the method:

μ1(θ) = m1(x1, x2, ..., xn)  →  3θ/(θ+1) = (1/n)·Σⱼ xⱼ = x̄  →  3θ = θx̄ + x̄  →  θ(3 − x̄) = x̄  →  θ = x̄/(3 − x̄)

The estimator:

θ̂M = X̄/(3 − X̄)

Exercise 3ae

A poll of 1000 individuals over the age of 65 years, being a simple random sample, was taken to determine the percent of the population in this age group who had an Internet connection. It was found that 387 of the 1000 had one. Find a 95% confidence interval for η.

(Taken from an exercise of Statistics, Spiegel and Stephens, McGraw-Hill)

Discussion: Asymptotic results can be applied for this large sample of a Bernoulli population. The cutoff age value determines the population of the statistical analysis, but it plays no other role. Both η and η̂ are dimensionless.

Identification of the variable: Having the connection or not is a dichotomic situation; then

X ≡ Connected (an individual)? X ~ Bern(η)

(1) Pivot: We take into account that:

• There is one Bernoulli population
• The sample size is big, n = 1000, so an asymptotic approximation can be applied

A statistic is selected from a table (e.g. in [T]):

T(X; η) = (η̂ − η) / √(η̂(1−η̂)/n)  →d  N(0,1)

(2) Event rewriting:

1 − α = P(lα/2 ≤ T(X; η) ≤ rα/2) = P( −rα/2 ≤ (η̂ − η)/√(η̂(1−η̂)/n) ≤ +rα/2 )
= P( −rα/2·√(η̂(1−η̂)/n) ≤ η̂ − η ≤ +rα/2·√(η̂(1−η̂)/n) )
= P( −η̂ − rα/2·√(η̂(1−η̂)/n) ≤ −η ≤ −η̂ + rα/2·√(η̂(1−η̂)/n) )
= P( η̂ + rα/2·√(η̂(1−η̂)/n) ≥ η ≥ η̂ − rα/2·√(η̂(1−η̂)/n) )

(3) The interval: Then,

I(1−α) = [ η̂ − rα/2·√(η̂(1−η̂)/n), η̂ + rα/2·√(η̂(1−η̂)/n) ]

where rα/2 is the value of the standard normal distribution verifying P(Z > rα/2) = α/2.

Substitution: We need to calculate the quantities involved in the previous formula,

• n = 1000
• Theoretical (simple random) sample: X1,...,X1000 s.r.s. (each value is 1 or 0)
  Empirical sample: x1,...,x1000 → Σⱼ xⱼ = 387 → η̂ = (1/1000)·Σⱼ xⱼ = 387/1000 = 0.387
• 95% → 1−α = 0.95 → α = 0.05 → α/2 = 0.025 → rα/2 = 1.96

Finally,

I(0.95) = [ 0.387 − 1.96·√(0.387·(1−0.387)/1000), 0.387 + 1.96·√(0.387·(1−0.387)/1000) ] = [0.357, 0.417]

Conclusion: The unknown proportion of individuals over the age of 65 years with an Internet connection is inside the range [0.357, 0.417] with a probability of 0.95, and outside the interval with a probability of 0.05. Perhaps a 0-to-100 scale facilitates the interpretation: the percent of individuals is in [35.7%, 41.7%] with 95% confidence. Proportions and probabilities are always dimensionless quantities, though they may be expressed in percent.

Exercise 4ae

A company is interested in studying its clients' behaviour. For this purpose, the mean time between consecutive demands of service is modelled by a random variable whose density function is:

f(x; θ) = (1/θ)·e^(−(x−2)/θ), x ≥ 2, (θ > 0)

The estimator provided by the method of the moments is θ̂M = X̄ − 2.


My notes:


1st: Is it an unbiased estimator of the parameter? Why?
2nd: Calculate its mean square error. Is it a consistent estimator of the parameter? Why?

Note: E(X) = θ + 2 and Var(X) = θ²

Discussion: The two sections are based on the calculation of the mean and the variance of the estimator given in the statement. Then, the formulas of the bias and the mean square error must be used. Finally, the limit of the mean square error is studied.

Mean and variance of θ̂M:

E(θ̂M) = E(X̄ − 2) = E(X̄) − E(2) = E(X) − 2 = θ + 2 − 2 = θ

Var(θ̂M) = Var(X̄ − 2) = Var(X̄) = Var(X)/n = σ²/n = θ²/n

Unbiasedness: The estimator is unbiased, as the expression of the mean shows. Alternatively, we calculate the bias

b(θ̂M) = E(θ̂M) − θ = θ − θ = 0

Mean square error and consistency: The mean square error is

MSE(θ̂M) = b(θ̂M)² + Var(θ̂M) = 0² + θ²/n = θ²/n

The population variance θ² does not depend on the sample, particularly on the sample size n. Then,

lim(n→∞) MSE(θ̂M) = lim(n→∞) θ²/n = 0

Note: In general, the population variance can be finite or infinite (for some "strange" probability distributions we do not consider in this subject). If the variance is infinite, σ² = ∞, neither Var(θ̂M) nor MSE(θ̂M) exists, in the sense that they are infinite; in this particular exercise it is finite, θ² < ∞. In the former case, the mean square error would not exist and the consistency (in probability) could not be studied in this way. In the latter case, the mean square error would exist and tend to zero (consistency in the mean-square sense), which is sufficient for the estimator of θ to be consistent (in probability).

Conclusion: The calculations of the mean and the variance are quite easy. They show that the estimator is unbiased and, if the variance is finite, consistent.

Advanced Theory: If E(X) had not been given in the statement, it could have been calculated by applying integration by parts (since polynomials and exponentials are functions "of different type"):

E(X) = ∫ x·f(x;θ) dx = ∫₂^∞ x·(1/θ)·e^(−(x−2)/θ) dx = [ −x·e^(−(x−2)/θ) − ∫ 1·(−e^(−(x−2)/θ)) dx ]₂^∞
= [ −x·e^(−(x−2)/θ) − θ·e^(−(x−2)/θ) ]₂^∞ = [ (x+θ)·e^(−(x−2)/θ) ]_∞^2 = 2 + θ.

That ∫ u(x)·v'(x) dx = u(x)·v(x) − ∫ u'(x)·v(x) dx has been used with

• u = x → u' = 1
• v' = (1/θ)·e^(−(x−2)/θ) → v = ∫ (1/θ)·e^(−(x−2)/θ) dx = −e^(−(x−2)/θ)

On the other hand, e^x changes faster than x^k for any k. To calculate E(X²):



E(X²) = ∫ x²·f(x;θ) dx = ∫₂^∞ x²·(1/θ)·e^(−(x−2)/θ) dx = [ −x²·e^(−(x−2)/θ) + 2∫ x·e^(−(x−2)/θ) dx ]₂^∞
= [ x²·e^(−(x−2)/θ) ]_∞^2 + 2θ·∫₂^∞ x·(1/θ)·e^(−(x−2)/θ) dx = (2² − 0) + 2θμ = 4 + 2θ(2+θ) = 2θ² + 4θ + 4.

Again, integration by parts has been applied: ∫ u(x)·v'(x) dx = u(x)·v(x) − ∫ u'(x)·v(x) dx with

• u = x² → u' = 2x
• v' = (1/θ)·e^(−(x−2)/θ) → v = ∫ (1/θ)·e^(−(x−2)/θ) dx = −e^(−(x−2)/θ)

Again, e^x changes faster than x^k for any k. Finally, the variance is

σ² = E(X²) − E(X)² = 2θ² + 4θ + 4 − (θ+2)² = 2θ² + 4θ + 4 − θ² − 4θ − 4 = θ².

Regarding the original probability distribution: (i) the expression reminds us of the exponential distribution; (ii) the term x−2 suggests a translation; and (iii) the variance θ² is the same as the variance of the exponential distribution. After translating all possible values x, the mean is also translated but the variance is not. Thus, the distribution of the statement is a translation of the exponential distribution, which has two equivalent notations: with mean θ or, equivalently, when θ = λ⁻¹, with rate λ.

Exercise 5ae

Is There Intelligent Life on Other Planets? In a 1997 Marist Institute survey of 935 randomly selected Americans, 60% of the sample answered "yes" to the question "Do you think there is intelligent life on other planets?" (http://maristpoll.marist.edu/tag/mipo/). Let's use this sample estimate to calculate a 90% confidence interval for the proportion of all Americans who believe there is intelligent life on other planets. What are the margin of error and the length of the interval?

(From Mind on Statistics. Utts, J.M., and R.F. Heckard. Thomson)

LINGUISTIC NOTE (From: Common Errors in English Usage. Brians, P. William, James & Co.)

American. Many Canadians and Latin Americans are understandably irritated when U.S. citizens refer to themselves simply as "Americans." Canadians (and only Canadians) use the term "North American" to include themselves in a two-member group with their neighbor to the south, though geographers usually include Mexico in North America. When addressing an international audience composed largely of people from the Americas, it is wise to consider their sensitivities. However, it is pointless to try to ban this usage in all contexts. Outside of the Americas, "American" is universally understood to refer to things relating to the U.S. There is no good substitute. Brazilians, Argentineans, and Canadians all have unique terms to refer to themselves. None of them refer routinely to themselves as "Americans" outside of contexts like the "Organization of American States." Frank Lloyd Wright promoted "Usonian," but it never caught on. For better or worse, "American" is standard English for "citizen or resident of the United States of America."

LINGUISTIC NOTE (From: Wikipedia.)

American (word). The meaning of the word American in the English language varies according to the historical, geographical, and political context in which it is used. American is derived from America, a term originally denoting all of the New World (also called the Americas). In some expressions, it retains this Pan-American sense, but its usage has evolved over time and, for various historical reasons, the word came to denote people or things specifically from the United States of America.


My notes:


In modern English, Americans generally refers to residents of the United States; among native English speakers this usage is almost universal, with any other use of the term requiring specification.[1] However, this default use has been the source of controversy, particularly among Latin Americans, who feel that using the term solely for the United States misappropriates it.[2][3] They argue instead that "American" should denote persons or things from anywhere in North, Central or South America, not just the United States, which is only a part of North America.

Discussion: There are several complementary pieces of information in the statement that help us to identify the distribution of the population variable X (Bernoulli distribution) and select the proper statistic T:

(a) The meaning of the question—for each item there are two possible values: “yes” or “no”.
(b) The value 60% suggests that this is a proportion expressed in percent.
(c) The words Let's use this sample estimate and confidence interval for the proportion.

Thus, we must construct a confidence interval for the proportion η (a percent is a proportion expressed in a 0-to-100 scale) of one Bernoulli population. The sample information available consists of two data: the sample size n = 935 and the sample proportion η̂ = 0.6. The relation between these quantities is the following:

η̂ = (1/n) Σ_{j=1}^n X_j = (Σ_{j=1}^n X_j)/n = #1's/n (= #Yeses/n)

Although it is not necessary, we could calculate the number of ones:

0.6 = η̂ = #1's/935 → #1's = 935·0.6 = 561

Now, if we had not realized that 0.6 was the sample proportion, we would do η̂ = 561/935 = 0.6.

Identification of the variable:

X ≡ Answered with “yes” (one American)? X ~ B(η)

Confidence interval

For this kind of population and amount of data, we use the statistic:

T(X; η) = (η̂ − η)/√(?(1−?)/n) →d N(0,1)

where ? is substituted by η or η̂. For confidence intervals η is unknown and no value is supposed, and hence it is estimated through the sample proportion. By applying the method of the pivot:

1−α = P(lα/2 ≤ T(X; η) ≤ rα/2) = P(−rα/2 ≤ (η̂−η)/√(η̂(1−η̂)/n) ≤ +rα/2)
= P(−rα/2·√(η̂(1−η̂)/n) ≤ η̂−η ≤ +rα/2·√(η̂(1−η̂)/n))
= P(−η̂−rα/2·√(η̂(1−η̂)/n) ≤ −η ≤ −η̂+rα/2·√(η̂(1−η̂)/n))
= P(η̂+rα/2·√(η̂(1−η̂)/n) ≥ η ≥ η̂−rα/2·√(η̂(1−η̂)/n))


Then, the interval is

I_{1−α} = [η̂ − rα/2·√(η̂(1−η̂)/n), η̂ + rα/2·√(η̂(1−η̂)/n)]

Substitution: We calculate the quantities in the formula,

• n = 935
• η̂ = 0.6
• 90% → 1−α = 0.90 → α = 0.10 → α/2 = 0.05 → rα/2 = r0.05 = l0.95 = 1.645

So

I_{0.90} = [0.6 − 1.645·√(0.6(1−0.6)/935), 0.6 + 1.645·√(0.6(1−0.6)/935)] = [0.574, 0.626]

Margin of error and length

To calculate the previous endpoints we had calculated the margin of error, which is

E = rα/2·√(η̂(1−η̂)/n) = 1.645·√(0.6(1−0.6)/935) = 0.0264

The length is twice the margin of error:

L = 2·E = 2·0.0264 = 0.0527

In general, even if T follows an asymmetric distribution and we do not talk about margin of error, the length can always be calculated as the difference between the upper and the lower endpoints:

L = 0.626 − 0.574 = 0.052

Conclusion: Since the population proportion is in the interval (0,1) by definition, the values obtained seem reasonable. Both endpoints are over 0.5, which means that most US citizens think there is intelligent life on other planets. With a confidence of 0.90, measured in a 0-to-1 scale, the value of η will be in the interval obtained. As regards the methodology applied, on average it provides a right interval 90% of the times. Nonetheless, frequently we do not know the real η, and therefore we will never know whether the method has failed or not.
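The interval above can be cross-checked numerically. A minimal sketch follows, written in Python with only the standard library (the rest of this document uses R); the figures are exactly those of the exercise:

```python
from math import sqrt
from statistics import NormalDist

# Data from the exercise: 561 "yes" answers out of n = 935
n = 935
p_hat = 0.6

# Upper-tail quantile r_{alpha/2} for a 90% interval: r_0.05 = l_0.95 ~ 1.645
r = NormalDist().inv_cdf(0.95)

E = r * sqrt(p_hat * (1 - p_hat) / n)              # margin of error
print(round(E, 4))                                 # 0.0264
print([round(p_hat - E, 3), round(p_hat + E, 3)])  # [0.574, 0.626]
```

Using the exact quantile 1.6449 instead of the rounded 1.645 does not change the endpoints at this precision.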

Exercise 6ae

It is desired to know the proportion η of female students at university. To that end, a simple random sample of n students is to be gathered. Obtain the estimators η̂M and η̂ML for that proportion, by applying the method of the moments and the maximum likelihood method.

Discussion: This statement is mathematical, really. Although it is given in the statement, the expectation of X could be calculated as follows

μ1(η) = E(X) = Σ_Ω x·f(x; η) = Σ_{x=0}^{1} x·η^x·(1−η)^{1−x} = 0·1·(1−η) + 1·η·1 = η


My notes:


Method of the moments

Population and sample moments: The probability distribution has one parameter. The first-order moments are

μ1(η) = E(X) = η and m1(x_1, x_2, ..., x_n) = (1/n) Σ_{j=1}^n x_j = x̄

System of equations: Since the parameter η appears in the first-order moment of X, the first equation is sufficient to apply the method:

μ1(η) = m1(x_1, x_2, ..., x_n) → η = (1/n) Σ_{j=1}^n x_j = x̄

The estimator: η̂M = X̄

Maximum likelihood method

Likelihood function: For the distribution, the mass function is f(x; η) = η^x·(1−η)^{1−x}.

L(x_1, x_2, ..., x_n; η) = Π_{j=1}^n f(x_j; η) = η^{x_1}(1−η)^{1−x_1} ⋯ η^{x_n}(1−η)^{1−x_n} = η^{Σ_{j=1}^n x_j}·(1−η)^{n−Σ_{j=1}^n x_j}

Optimization problem: The logarithm function is applied to facilitate the calculations,

log[L(x_1, x_2, ..., x_n; η)] = log[η^{Σ_{j=1}^n x_j}] + log[(1−η)^{n−Σ_{j=1}^n x_j}] = (Σ_{j=1}^n x_j)·log(η) + (n − Σ_{j=1}^n x_j)·log(1−η).

To find the local or relative extreme values, the necessary condition is

0 = d/dη log[L(x_1, x_2, ..., x_n; η)] = (Σ_{j=1}^n x_j)·(1/η) + (n − Σ_{j=1}^n x_j)·(−1/(1−η))

→ (n − Σ_{j=1}^n x_j)/(1−η) = (Σ_{j=1}^n x_j)/η → ηn − η·Σ_{j=1}^n x_j = Σ_{j=1}^n x_j − η·Σ_{j=1}^n x_j → ηn = Σ_{j=1}^n x_j → η0 = (Σ_{j=1}^n x_j)/n = x̄

To verify that the only candidate is a local or relative maximum, the sufficient condition is

d²/dη² log[L(x_1, x_2, ..., x_n; η)] = (Σ_{j=1}^n x_j)·(−1/η²) − (n − Σ_{j=1}^n x_j)·(−1/(1−η)²)·(−1) = −(Σ_{j=1}^n x_j)/η² − (n − Σ_{j=1}^n x_j)/(1−η)² < 0

since 1 ≥ x_j and therefore n ≥ Σ_{j=1}^n x_j ↔ n − Σ_{j=1}^n x_j ≥ 0. This holds for any value, including η0.

The estimator: η̂ML = X̄

Conclusion: The two methods provide the same estimator.
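As a quick numerical illustration (a Python sketch with only the standard library; the 0/1 sample below is hypothetical): over a grid of candidate values of η, the log-likelihood of a Bernoulli sample is indeed maximized at the sample mean.

```python
from math import log

# A small illustrative 0/1 sample (hypothetical data)
x = [1, 0, 1, 1, 0, 1, 0, 1]
n, s = len(x), sum(x)

def loglik(eta):
    # log L = (sum x_j) log(eta) + (n - sum x_j) log(1 - eta)
    return s * log(eta) + (n - s) * log(1 - eta)

eta_ml = s / n  # the sample mean, 0.625 here

# The log-likelihood at the MLE beats that at any other candidate value
grid = [k / 100 for k in range(1, 100)]
best = max(grid + [eta_ml], key=loglik)
print(eta_ml, best)  # 0.625 0.625
```

The grid search is only a sanity check; the analytical argument above already proves the maximum is at x̄.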


My notes:


Exercise 7ae

A population variable X is supposed to follow the continuous uniform distribution with parameters λ1 = 0 and λ2. A simple random sample of size n is considered to estimate λ2. Apply the method of the moments to build an estimator.

Discussion: The distribution considered has two parameters, though one of them is known.

Method of the moments

Population and sample moments: The first-order moments are

μ1(λ2) = E(X) = (0 + λ2)/2 and m1(x_1, x_2, ..., x_n) = (1/n) Σ_{j=1}^n x_j = x̄

System of equations: Since the parameter of interest λ2 appears in the first-order population moment of X, the first equation is enough to apply the method:

μ1(λ2) = m1(x_1, x_2, ..., x_n) → λ2/2 = (1/n) Σ_{j=1}^n x_j = x̄ → λ2 = 2x̄

The estimator: λ̂2 = 2X̄

Conclusion: To estimate the parameter λ2, the method of the moments suggests twice the sample mean.
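A simulation can make the estimator concrete (a Python sketch with only the standard library; the true value λ2 = 10 and the seed are arbitrary choices):

```python
import random

# Simulate a sample from Unif(0, lambda2) with lambda2 = 10 (hypothetical value)
random.seed(1)
lambda2 = 10.0
sample = [random.uniform(0, lambda2) for _ in range(10000)]

# Method-of-moments estimate: twice the sample mean
lambda2_hat = 2 * sum(sample) / len(sample)
print(round(lambda2_hat, 2))  # close to 10
```

With a large sample the estimate concentrates around the true λ2, as the law of large numbers suggests.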

Exercise 8ae

Plastic sheets produced by a machine are constantly monitored for possible fluctuations in thickness (measured in millimeters, mm). If the true variance in thicknesses exceeds 2.25 square millimeters, there is cause for concern about product quality. The production process continues while the variance seems smaller than the cutoff. Thickness measurements for a simple random sample of 10 sheets produced in a particular shift were taken, giving the following results:

(226, 226, 227, 226, 225, 228, 225, 226, 229, 227)

Test, at the 5% significance level, the hypothesis that the population variance is smaller than 2.25 mm². Suppose that thickness is normally distributed. Calculate the type II error β(2), find the general expression of β(σ²) and plot the power function.

(Based on an exercise of: Statistics for Business and Economics, Newbold, P., W.L. Carlson and B.M. Thorne, Pearson.)

Discussion: The supposition of normality should be evaluated. This statistical problem requires us to study the variance of a normal population. Concretely, it requires the application of a hypothesis test to see whether or not the value considered as reasonable has been exceeded. In other exercises we are given some quantities already calculated; here we are given the crude data, from which we can calculate any quantity. The hypothesis is allocated at H1 for the production process to continue only when high-quality sheets are being made (and for the equality to be in H0).

162 Solved Exercises and Problems of Statistical Inference

My notes:

Page 166: Solved Exercises and Problems of Statistical Inferencecasado-d.org/edu/ExercisesProblemsStatisticalInference.pdf · Solved Exercises and Problems of Statistical Inference ... Mathematics

Statistic: Since

• There is one normal population
• The variance must be studied
• The mean is unknown

the following statistic will be used

T(X; σ) = n·s²/σ² = (n−1)·S²/σ² ∼ χ²_{n−1}

As particular cases, when doing calculations under any hypothesis a value for σ² is supposed:

T0(X) = (n−1)·S²/σ0² ∼ χ²_{n−1} and T1(X) = (n−1)·S²/σ1² ∼ χ²_{n−1}

We will need the quantities

X̄ = (1/n) Σ_{j=1}^n X_j = (1/10)(226 mm + 226 mm + ⋯ + 227 mm) = 226.5 mm

S² = (1/(n−1)) Σ_{j=1}^n (X_j − X̄)² = (1/(10−1))[(226 mm − 226.5 mm)² + ⋯ + (227 mm − 226.5 mm)²] = 1.61 mm²

and

T0(x) = (10−1)·1.61 mm²/2.25 mm² = 6.44

One-sided (directional) hypothesis test

Hypotheses and form of the critical region: H0: σ² = σ0² ≥ 2.25 and H1: σ² = σ1² < 2.25.

For these hypotheses, Rc = {S² < c} = ⋯ = {T0 < a}

By applying the definition of α:

α(2.25) = P(Type I error) = P(Reject H0 | H0 true) = P(T ∈ Rc | H0) = P(T0 < a)

→ a = lα = 3.33 → Rc = {T0(X) < 3.33}

Decision: Finally, it is necessary to check if this region “suggested by H0” is compatible with the value that the data provide for the statistic. If they are not compatible, because the value seems extreme when the hypothesis is true, we will trust the data and reject the hypothesis H0.

Since T0(x) = 6.44 > 3.33 → T0(x) ∉ Rc → H0 is not rejected.

The second methodology is based on the calculation of the p-value:

pV = P(X more rejecting than x | H0 true) = P(T0(X) < T0(x)) = P(T0 < 6.44) = 0.305

→ pV = 0.305 > 0.05 = α → H0 is not rejected.

Type II error and power function: To calculate β, we have to work under H1, that is, with T1. Since the critical region is expressed in terms of T0, the mathematical trick of multiplying and dividing by the same quantity is applied:


> pchisq(6.44, 10-1)
[1] 0.3047995


β(σ1²) = P(Type II error) = P(Accept H0 | H1 true) = P(T0(X) ∉ Rc | H1) = P(T0(X) ≥ 3.33 | H1)
= P((n−1)·S²/σ0² ≥ 3.33 | H1) = P([(n−1)·S²/σ1²]·[σ1²/σ0²] ≥ 3.33 | H1) = P(T1(X) ≥ 3.33·σ0²/σ1²)

For the particular value σ1² = 2,

β(2) = P(T1(X) ≥ 3.33·2.25/2) = P(T1(X) ≥ 3.75) = 0.927

By using a computer, many other values σ1² ≠ 2 can be considered so as to numerically determine the power of the test 1−β(σ1²) and to plot the power function.

φ(σ²) = P(Reject H0) = { α(σ²) if σ² ∈ Θ0 ; 1−β(σ²) if σ² ∈ Θ1 }

# Sample and inference
n = 10
alpha = 0.05
theta0 = 2.25 # Value under the null hypothesis H0
q = qchisq(alpha, n-1)
theta1 = seq(from=0, to=2.25, 0.01)
paramSpace = sort(unique(c(theta1, theta0)))
PowerFunction = pchisq(q*theta0/paramSpace, n-1)
plot(paramSpace, PowerFunction, xlab='Theta', ylab='Probability of rejecting theta0', main='Power Function', type='l')

Conclusion: Since T0(x) does not fall in the critical region, at the 5% significance level the data do not support the alternative hypothesis that σ² is smaller than 2.25 mm²; the hypothesis H0: σ² ≥ 2.25 mm² cannot be discarded, so there is still cause for concern about the quality of the product. On average, the method we are applying provides a right decision 95% of the times when H0 is true; however, since frequently we do not know the true value of σ², we never know whether the decision is right or not.

Exercise 9ae

If 132 of 200 male voters and 90 of 159 female voters favor a certain candidate running for governor of


> 1 - pchisq(3.75, 10-1)
[1] 0.9270832

My notes:


Illinois, find a 99% confidence interval for the difference between the actual proportions of male and female voters who favor the candidate.

(From: Mathematical Statistics with Applications. Miller, I., and M. Miller. Pearson.)

Discussion: There are two independent Bernoulli populations whose proportions must be compared (the populations would not be independent if, for example, males and females had been selected from the same couples or families). The value 1 has been used to count the number of voters who favor the candidate. The method of the pivot will be used.

Identification of the variable: Favoring or not is a dichotomic situation,

M ≡ Favoring the candidate M ~ Bern(ηM)

F ≡ Favoring the candidate F ~ Bern(ηF)

(1) Pivot: We take into account that:

• There are two independent Bernoulli populations
• Both sample sizes are large, so an asymptotic approximation can be applied

From a table of statistics (e.g. in [T]), the following pivot is selected

T(M, F; ηM, ηF) = [(η̂M − η̂F) − (ηM − ηF)] / √(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF) →d N(0,1)

(2) Event rewriting:

1−α = P(lα/2 ≤ T(M, F; ηM, ηF) ≤ rα/2) ≈ P(−rα/2 ≤ [(η̂M−η̂F) − (ηM−ηF)]/√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF) ≤ +rα/2)
= P(−rα/2·√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF) ≤ (η̂M−η̂F) − (ηM−ηF) ≤ +rα/2·√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF))
= P(−(η̂M−η̂F) − rα/2·√(·) ≤ −(ηM−ηF) ≤ −(η̂M−η̂F) + rα/2·√(·))
= P((η̂M−η̂F) + rα/2·√(·) ≥ ηM−ηF ≥ (η̂M−η̂F) − rα/2·√(·))

where √(·) denotes the same square root as in the previous lines.

(3) The interval:

I_{1−α} = [(η̂M−η̂F) − rα/2·√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF), (η̂M−η̂F) + rα/2·√(η̂M(1−η̂M)/nM + η̂F(1−η̂F)/nF)]

where rα/2 is the value of the standard normal distribution such that P(Z > rα/2) = α/2.

Substitution: We need to calculate the quantities involved in the previous formula,

• nM = 200 and nF = 159.

• Theoretical (simple random) sample: M1,...,M200 s.r.s. (each value is 1 or 0).


Empirical sample: m_1,...,m_200 → Σ_{j=1}^{200} m_j = 132 → η̂M = (1/200) Σ_{j=1}^{200} m_j = 132/200 = 0.66.

Theoretical (simple random) sample: F_1,...,F_159 s.r.s. (each value is 1 or 0)

Empirical sample: f_1,...,f_159 → Σ_{j=1}^{159} f_j = 90 → η̂F = (1/159) Σ_{j=1}^{159} f_j = 90/159 = 0.566.

• 99% → 1−α = 0.99 → α = 0.01 → α/2 = 0.005 → rα/2 = 2.576.

Then,

I_{0.99} = (0.66 − 0.566) ∓ 2.576·√(0.66(1−0.66)/200 + 0.566(1−0.566)/159) = [−0.03906, 0.2270]

Conclusion: The case ηM = ηF cannot formally be excluded when the decision is made with 99% confidence. Since η ∈ (0,1), any “reasonable” estimator of η should provide values in this range or close to it; but because of the natural uncertainty of the sampling process (randomness and variability), in this case the smallest endpoint of the interval was −0.03906, which can be interpreted as being 0. When an interval of high confidence is far from 0, the case ηM = ηF can clearly be rejected. Finally, it is important to notice that a confidence interval can be used to make decisions about hypotheses on the parameter values.
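The interval can be recomputed in a few lines (a Python sketch, standard library only; the exact normal quantile is used in place of the rounded 2.576, so the endpoints match those above up to rounding):

```python
from math import sqrt
from statistics import NormalDist

nM, nF = 200, 159
pM, pF = 132 / 200, 90 / 159  # 0.66 and about 0.566

r = NormalDist().inv_cdf(1 - 0.01 / 2)  # r_{0.005}, about 2.576
E = r * sqrt(pM * (1 - pM) / nM + pF * (1 - pF) / nF)
diff = pM - pF
print([round(diff - E, 3), round(diff + E, 3)])  # [-0.039, 0.227]
```

Since the lower endpoint is (slightly) below 0, the hypothesis of equal proportions cannot be excluded at this confidence level, as the conclusion states.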

Exercise 10ae

For two Bernoulli populations with the same parameter, prove that the pooled sample proportion is an unbiased estimator of the population proportion. For two normal populations, prove that the pooled sample variance is an unbiased estimator of the population variance.

Discussion: It is necessary to calculate the expectation of the pooled sample proportion by using its expression and the basic properties of the mean. Alternatively, the most general pooled sample variance can be used. For Bernoulli populations, the mean and the variance can be written as μ = η and σ² = η(1−η).

Mean of η̂p: This estimator can be used when ηX = η = ηY. On the other hand, E(η̂) = E(X̄) = η.

E(η̂p) = E[(nX·η̂X + nY·η̂Y)/(nX + nY)] = [nX·E(η̂X) + nY·E(η̂Y)]/(nX + nY) = [(nX + nY)·E(X̄)]/(nX + nY) = η

Then, the bias is b(η̂p) = E(η̂p) − η = η − η = 0.

Mean of Sp²: This estimator can be used when σX² = σ² = σY². On the other hand, E(S²) = σ².

E(Sp²) = E[((nX−1)·SX² + (nY−1)·SY²)/(nX + nY − 2)] = [(nX−1)·E(SX²) + (nY−1)·E(SY²)]/(nX + nY − 2) = [(nX − 1 + nY − 1)/(nX + nY − 2)]·σ² = σ²

Then, the bias is b(Sp²) = E(Sp²) − σ² = σ² − σ² = 0.


My notes:


Exercise 11ae

A research worker wants to determine the average time it takes a mechanic to rotate the tires of a car, and she wants to be able to assert with 95% confidence that the mean of her sample is off by at most 0.50 minute. If she can presume from past experience that σ = 1.6 minutes (min), how large a sample will she have to take?

(From Probability and Statistics for Engineers. Johnson, R. Pearson Prentice Hall.)

Discussion: In calculating the minimum sample size, the only case we consider (in our subject) is that of one normal population with known standard deviation. Thus, we can suppose that this is the distribution of X.

Identification of the variable:

X ≡ Time (of one rotation) X ~ N(μ, σ = 1.6 min)

Sample information:

Theoretical (simple random) sample: X1,..., Xn s.r.s. (the time measurement of n rotations will be considered)

Margin of error:

We need the expression of the margin of error. If we do not remember it, we can apply the method of the pivot to take the expression from the formula of the interval.

I_{1−α} = [X̄ − rα/2·√(σ²/n), X̄ + rα/2·√(σ²/n)]

If we remembered the expression, we could use it directly. Either way, the margin of error (for one normal population with known variance) is:

E = rα/2·√(σ²/n)

Sample size

Method based on the confidence interval: We want the margin of error E to be smaller than or equal to the given Eg,

Eg ≥ E = rα/2·√(σ²/n) → Eg² ≥ rα/2²·σ²/n → n ≥ (rα/2·σ/Eg)² = (1.96·1.6 min/0.50 min)² = 6.272² = 39.3 → n ≥ 40

since rα/2 = r0.05/2 = r0.025 = l0.975 = 1.96. (The inequality changes neither when multiplying or dividing by positive quantities nor when squaring.)

Conclusion: At least n = 40 data are necessary to guarantee that the margin of error is 0.50 min at most. Any number of data larger than n would guarantee—and go beyond—the precision desired. (This margin can be thought of as “the maximum error in probability”, in the sense that the distance or error |θ̂−θ| will be smaller than E with a probability of 1−α = 0.95, but larger with a probability of α = 0.05.)
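The computation can be sketched in a few lines (Python, standard library only; the exact normal quantile 1.95996… is used instead of the rounded 1.96, which does not change the answer):

```python
from math import ceil
from statistics import NormalDist

sigma, E_g, conf = 1.6, 0.50, 0.95
r = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # upper 2.5% quantile, ~1.96

# Smallest integer n with n >= (r * sigma / E_g)^2
n_min = ceil((r * sigma / E_g) ** 2)
print(n_min)  # 40
```

Rounding up with `ceil` is essential: truncating to 39 would leave a margin of error slightly above the target 0.50 min.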

Exercise 12ae

To estimate the average tree height of a forest, a simple random sample with 20 elements is considered,


My notes:


yielding

x̄ = 14.70 u and S = 6.34 u

where u denotes a unit of length and S² is the sample quasivariance. If the population variable height is supposed to follow a normal distribution, find a 95 percent confidence interval. What is the margin of error?

Discussion: In this exercise, the supposition that the normal distribution reasonably explains the variable height should be evaluated by using proper statistical techniques. To build the interval and find the margin of error, the method of the pivotal quantity will be applied.

(1) Pivot: From the information in the statement, we know that:

• The variable follows a normal distribution
• The population variance σ² is unknown, so it must be estimated
• The sample size is n = 20 (asymptotic results cannot be considered)

To apply this method, we need a statistic with known distribution, easy to manage and involving μ. From a table of statistics (e.g. in [T]), we select

T(X; μ) = (X̄ − μ)/√(S²/n) ∼ t_{n−1}

where X = (X_1, X_2, ..., X_n) is a simple random sample, S² is the sample quasivariance and t_κ denotes the t distribution with κ degrees of freedom.

(2) Event rewriting: The interval is built as follows.

1−α = P(lα/2 ≤ T(X; μ) ≤ rα/2) = P(−rα/2 ≤ (X̄−μ)/√(S²/n) ≤ +rα/2) = P(−rα/2·√(S²/n) ≤ X̄−μ ≤ +rα/2·√(S²/n))
= P(−X̄ − rα/2·√(S²/n) ≤ −μ ≤ −X̄ + rα/2·√(S²/n)) = P(X̄ + rα/2·√(S²/n) ≥ μ ≥ X̄ − rα/2·√(S²/n))

(3) The interval:

I_{1−α} = [X̄ − rα/2·√(S²/n), X̄ + rα/2·√(S²/n)] = X̄ ∓ rα/2·√(S²/n)

Note: We have simplified the notation, but it is important to notice that the quantities rα/2 and S depend on the sample size n.

To use this general formula with the specific data we have, the quantiles of the t distribution with κ = n−1 = 20−1 = 19 degrees of freedom are necessary

95% → 0.95 = 1−α → α = 0.05

In the table of the t distribution, we must search the quantile provided for the probability p = 1−α/2 = 0.975 in a lower-tail probability table, or p = α/2 = 0.025 in an upper-tail probability table; if a two-tailed table is used, the quantile given for p = 1−α = 0.950 must be used. Whichever the table used, the quantile is 2.093. Finally,

I_{0.95} = x̄ ∓ r0.05/2·√(s²/20) = 14.70 u ∓ 2.093·(6.34 u/√20) = 14.70 u ∓ 2.97 u = [11.73 u, 17.67 u]

By applying the definition of the margin of error,


E = rα/2·√(S²/n) = 2.093·(6.34 u/√20) = 2.97 u

Conclusion: With 95% confidence we can say that the mean tree height is in the interval obtained. The margin of error, which is expressed in the same unit of measure as the data, can be thought of as the maximum distance—when the interval contains the true value—between the real unknown mean and the middle point of the interval, that is, “the maximum error in probability”.
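The arithmetic of this interval can be verified as follows (a Python sketch; the t quantile 2.093 is taken from the table, as in the text, since the t quantile function is not in the Python standard library):

```python
from math import sqrt

n = 20
xbar, S = 14.70, 6.34  # in units u
t_quantile = 2.093     # upper 2.5% quantile of t with 19 df, from tables

E = t_quantile * S / sqrt(n)  # margin of error
print(round(E, 2), [round(xbar - E, 2), round(xbar + E, 2)])
# 2.97 [11.73, 17.67]
```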


My notes:


Appendixes

[Ap] Probability Theory

Remark 1pt: The probability is a measure in a 0-to-1 scale of the chances with which an event involving a random quantity occurs; alternatively, it can be interpreted as the proportion of times it occurs when many values of the random quantity are considered repeatedly and independently. For example, an event of confidence 1−α = 0.95 can be considered in two equivalent ways: (i) that a measure of its occurring, within [0,1], takes the value 0.95; or (ii) that when the experiment is independently repeated many times, the event will occur more or less 95% of the times. Once values for random quantities are determined, the event will have occurred or not, but no probability is involved any more.

Some Reminders

● Markov's Inequality. Chebyshev's Inequality. For any (real) random variable X, any (real) function h(x) taking nonnegative values, and any (real) positive a > 0,

E(h(X)) = ∫_Ω h(x) dP = ∫_{h(X)<a} h(x) dP + ∫_{h(X)≥a} h(x) dP ≥ ∫_{h(X)≥a} h(x) dP ≥ ∫_{h(X)≥a} a dP = a·∫_{h(X)≥a} dP = a·P(h(X) ≥ a)

Then, Markov's inequality is obtained

P(h(X) ≥ a) ≤ E(h(X))/a

For discrete X, the same proof can be rewritten with sums (or, alternatively, sums can be seen as a particular case of Riemann–Stieltjes integrals). For continuous X, the measure dP can be written as f(x)dx for well-behaved distributions—called absolutely continuous—such that a density function f(x) exists. The following cases have special interest:

When h(x) = (x−μ)², we have that: P((X−μ)² ≥ a) ≤ E((X−μ)²)/a = Var(X)/a

When h(x) = |x|, we have that: P(|X| ≥ a) ≤ E(|X|)/a

When h(x) = |x−μ|^r, we have that: P(|X−μ|^r ≥ a) ≤ E(|X−μ|^r)/a

When h(x) = x but X itself is a nonnegative random variable: P(X ≥ a) ≤ E(X)/a

The Chebyshev's inequality can be obtained as follows (a proof similar to that above can be written too)

P(|X−μ| ≥ a) = P((X−μ)² ≥ a²) ≤ E((X−μ)²)/a² = Var(X)/a²

(The positive branch of the square root is a strictly increasing function and the events in the two probabilities are the same. A similar inequality can be obtained with r instead of 2.) We can make a = kσ to calculate the probability that X takes values farther from μ than k times σ. For example,


a = 2σ → P({|X−μ| ≥ 2σ}) ≤ σ²/(4σ²) = 1/4 = 0.25

a = 3σ → P({|X−μ| ≥ 3σ}) ≤ σ²/(9σ²) = 1/9 ≈ 0.11

Interpretation of the first case: the probability that X takes a value farther from the mean μ than twice the standard deviation 2σ is 0.25 at most.

All these inequalities are true whichever the probability distribution of X, and the proof above is based on bounding in a rough way. They are nonparametric or distribution-free inequalities. As a consequence, it seems reasonable to expect that there will be “more powerful” inequalities either when additional or stronger nonparametric results are used or when a parametric approach is considered (for example, in calculating the minimum sample size necessary to guarantee a given precision, we can also apply methods using statistics T based on asymptotic or parametric results).
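The roughness of the bound can be seen empirically (a Python simulation sketch; the exponential distribution, with μ = σ = 1, and the seed are arbitrary choices):

```python
import random

# Empirical check of Chebyshev's bound P(|X - mu| >= 2*sigma) <= 0.25
# for the exponential distribution with rate 1, where mu = sigma = 1
random.seed(3)
reps = 100000
hits = sum(abs(random.expovariate(1.0) - 1.0) >= 2.0 for _ in range(reps))
freq = hits / reps
print(round(freq, 3), freq <= 0.25)
# the observed frequency is near exp(-3) ~ 0.0498, well below the bound 0.25
```

For this distribution the true probability is e^(−3) ≈ 0.05, five times smaller than the distribution-free bound 0.25, which illustrates why parametric information yields sharper statements.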

● Generating Functions. (This section has been extracted from Probability and Random Processes. Grimmett, G., and D. Stirzaker. Oxford University Press, 3rd ed.) In Probability, generating functions are useful tools to work with—e.g. when convolutions or sums of independent variables are considered. Let a = (a_0, a_1, a_2,...) be a sequence. The simplest one is the (ordinary) generating function of a, defined

G_a(t) = Σ_{i=0}^∞ a_i·t^i, t ∈ ℝ for which the sum converges

The sequence may in principle be reconstructed from the function by setting a_j = G_a^(j)(0)/j!. This function is especially useful when the a_i are probabilities. The exponential generating function of a is

G_a(t) = Σ_{j=0}^∞ a_j·t^j/j!, t ∈ ℝ for which the sum converges

On the other hand, the probability generating function of a random variable X taking nonnegative integer values is defined as

G(t) = E(t^X), t ∈ ℝ for which there is convergence

(Some authors give a definition for z ∈ ℂ, and the radius of convergence is one at least.) “There are two major applications of probability generating functions: in calculating moments, and in calculating the distributions of sums of independent variables.”

Theorem: E(X) = G'(1), and, more generally, E(X(X−1)⋯(X−k+1)) = G^(k)(1).

“Of course, G^(k)(1) is shorthand for lim_{s↑1} G^(k)(s) whenever the radius of convergence of G is 1.”

Particularly, to calculate the first two raw moments:

E(X) = G^(1)(1)

E(X(X−1)) = E(X²) − E(X) = G^(2)(1) → E(X²) = G^(2)(1) + E(X) = G^(2)(1) + G^(1)(1)

“If you are more interested in the moments of X than in its mass function, you may prefer to work not with G but with the function M” called moment generating function and defined by

M(t) = G(e^t) = E(e^{tX}), t ∈ ℝ for which there is convergence

It is, under convergence, the exponential generating function of the moments E(X^k). It holds that

Theorem: E(X) = M'(0), and, more generally, E(X^k) = M^(k)(0).

Particularly, to calculate the first two raw moments,

E(X) = M^(1)(0)

E(X²) = M^(2)(0)
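As a short worked example with the Bernoulli distribution used throughout this document, X ∼ B(η):

```latex
M(t) = E(e^{tX}) = (1-\eta)\,e^{t\cdot 0} + \eta\,e^{t\cdot 1} = 1 - \eta + \eta e^{t}

M^{(1)}(t) = \eta e^{t} \;\Rightarrow\; E(X) = M^{(1)}(0) = \eta

M^{(2)}(t) = \eta e^{t} \;\Rightarrow\; E(X^{2}) = M^{(2)}(0) = \eta

\mathrm{Var}(X) = E(X^{2}) - E(X)^{2} = \eta - \eta^{2} = \eta(1-\eta)
```

which recovers the mean μ = η and variance σ² = η(1−η) used in the exercises above.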

“Moment generating functions provide a very useful technique but suffer the disadvantage that the integrals which define them may not always be finite. Rather than explore their properties in detail we move on immediately to another class of functions that are equally useful and whose finiteness is guaranteed.” The characteristic function is defined by

φ(t) = E(e^{itX}), t ∈ ℝ, i = √−1

“First and foremost, from the knowledge of φ we can recapture the distribution of X.” “The characteristic function of a distribution is closely related to its moment generating function”. (Moment generating functions are related to Laplace transforms while characteristic functions are related to Fourier transforms.)

Theorem: (a) If φ^(k)(0) exists then E(|X^k|) < ∞ if k is even, and E(|X^{k−1}|) < ∞ if k is odd.
(b) If E(|X^k|) < ∞ then φ^(k)(0) = i^k·E(X^k), so E(X^k) = φ^(k)(0)/i^k.

Then, to calculate the first two crude moments,

E(X) = φ^(1)(0)/i

E(X²) = φ^(2)(0)/i²

Summary of results for calculating the (crude) moments E(X^k).

Generating Functions to Calculate (Raw) Moments

Probability Generating Function (discrete X) | Definition: G(t) = E(t^X), t ∈ ℝ | Theorem: E(X(X−1)⋯(X−k+1)) = G^(k)(1)
Moment Generating Function | Definition: M(t) = E(e^{tX}), t ∈ ℝ | Theorem: E(X^k) = M^(k)(0)
Characteristic Function | Definition: φ(t) = E(e^{itX}), t ∈ ℝ, i = √−1 | Theorem: E(X^k) = φ^(k)(0)/i^k

Existence: Techniques for series and integrals must be used to determine the values of t ∈ ℝ that guarantee the convergence and hence the existence of the generating function.

When possible, we drop the subindex of the functions to simplify the notation. The reader can consult the literature on Probability to see whether it is allowed to differentiate inside the series or the integrals, which is equivalent to differentiating inside the expectation. On the other hand, there are other generating functions in the literature: joint probability generating function, joint characteristic function, cumulant generating function, et cetera.


Exercise 1pt

In the following cases, calculate the probability or find the quantile:

(a) X ∼ Pois(2.7), P(1 ≤ X < 3)
(b) X ∼ Bin(11, 0.3), P(X ≤ 2)
(c) X ∼ UnifDisc(6), P(X ∈ {2, 5})
(d) X ∼ UnifCont(2, 5), P(X ≥ 3.5)
(e) X ∼ N(μ = −1, σ² = 4), P(X > −4.4)
(f) X ∼ χ²_16, P(X ≤ a) = 0.025
(g) X ∼ t_27, P(X > a) = 0.1
(h) X ∼ F_{10,8}, P(X > 5.81)
(i) X ∼ F_{15,6}, P(X > a) = 0.01
(j) X ∼ t_12, P({X ≤ 1.356} ∪ {X > 3.055})

Discussion: Several distributions, discrete and continuous, are involved in this exercise. Different ways can be considered to find the answers: the probability function f(x), the probability tables or a statistical software program. Sometimes events need to be rewritten or decomposed. For discrete distributions, tables can contain either individual {X=x} or cumulative {X≤x} (or {X>x}) probabilities; for continuous distributions, only cumulative probabilities.

(a) The parameter value is λ = 2.7, and for the Poisson distribution the possible values are always 0, 1, 2... If the table provides cumulative probabilities of the form P(X≤x),

P(1 ≤ X < 3) = P(X ≤ 2) − P(X ≤ 0) = ⋯

If the table provides individual probabilities,

P(1 ≤ X < 3) = P(X = 1) + P(X = 2) = 0.1815 + 0.2450 = 0.4265

By using the mass function,

P(1 ≤ X < 3) = P(X = 1) + P(X = 2) = (2.7¹/1!)·e^(−2.7) + (2.7²/2!)·e^(−2.7) = 0.1814549 + 0.2449641 = 0.426419

Finally, by using the statistical software program R, whose function gives cumulative probabilities,

> ppois(2, 2.7) - ppois(0, 2.7)
[1] 0.426419

To plot the probability function

values = seq(0, 10)
probabilities = dpois(values, 2.7)
plot(values, probabilities, type="h", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="Pois(2.7)")
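The same mass-function computation, sketched in Python (standard library only; the rest of this exercise uses R):

```python
from math import exp, factorial

# P(1 <= X < 3) for X ~ Pois(2.7), summing the mass function at x = 1 and x = 2
lam = 2.7
p = sum(lam ** k / factorial(k) * exp(-lam) for k in (1, 2))
print(round(p, 6))  # 0.426419
```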

(b) The parameter values are κ = 11 and η = 0.3, so the possible values are 0, 1, 2,..., 11. If the table of thebinomial distribution gives individual probabilities P(X = x),

P (X ≤2)=P (X =0)+P (X =1)+P (X =2)=0.0198+0.0932+0.1998=0.3128

If cumulative probabilities were given in the table, the probability P (X ≤2) would be provided directly. Onthe other hand, the mass function can be used too,

P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2) = C(11,0)·0.3⁰·(1−0.3)^{11−0} + C(11,1)·0.3¹·(1−0.3)^{11−1} + C(11,2)·0.3²·(1−0.3)^{11−2}

= (11!/(0!(11−0)!))·1·0.7¹¹ + (11!/(1!(11−1)!))·0.3·0.7¹⁰ + (11!/(2!(11−2)!))·0.3²·0.7⁹

= 0.7¹¹ + 11·0.3·0.7¹⁰ + (11·10/2)·0.3²·0.7⁹ = 0.01977327 + 0.09321683 + 0.1997504 = 0.3127405

Finally, by using the statistical software program R, whose function gives cumulative probabilities,

> pbinom(2, 11, 0.3)
[1] 0.3127405

To plot the probability function

values = seq(0, 11)
probabilities = dbinom(values, 11, 0.3)
plot(values, probabilities, type="h", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="Bin(11, 0.3)")
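As with part (a), the binomial sum can be verified independently; a Python sketch with only the standard library (dbinom is a hypothetical helper named after R's):

```python
from math import comb

def dbinom(x, n, p):
    # binomial mass function: C(n, x) * p^x * (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

# P(X <= 2) for X ~ Bin(11, 0.3)
prob = sum(dbinom(x, 11, 0.3) for x in range(3))
print(round(prob, 7))  # 0.3127405, matching pbinom(2, 11, 0.3)
```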

(c) The parameter value is κ = 6, so the possible values are 1, 2,..., 6. This probability distribution is so simple that no table is needed. Since the event can be decomposed into two disjoint elementary outcomes,

P(X ∈ {2, 5}) = P(X=2) + P(X=5) = 1/6 + 1/6 = 2/6 = 1/3

To plot the probability function

values = seq(1, 6)
probabilities = rep(1/6, length(values))
plot(values, probabilities, type="h", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="UnifDisc(6)")

(d) The parameter values are κ1 = 2 and κ2 = 5, so the possible values are the real numbers in the interval [2, 5] (or with open endpoints, depending on the definition of the uniform distribution that you are considering). No table is necessary for this distribution, and if we realize that 3.5 is the middle value between 2 and 5, no calculation is needed either,

P (X ≥3.5)=0.5

If not, we can use the density function,

P(X ≥ 3.5) = ∫_{3.5}^{5} 1/(5−2) dx = (1/3)·(5−3.5) = 1.5/3 = 0.5

To plot the probability function

values = seq(2, 5)
probabilities = rep(1/(5-2), length(values))
plot(values, probabilities, type="l", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="UnifCont(2, 5)")

For continuous distributions, the probability of any isolated value is zero, so P(X ≥ 3.5) = P(X > 3.5).

(e) Here the parameter values are μ = –1 and σ² = 4, and the value of a normally distributed random variable can always be any real number. Because of the standardization, a table with probabilities and quantiles for the standard normal distribution suffices. In using this table, we must pay attention to the form of the events whose probabilities are provided

P(X > −4.4) = P((X−μ)/√σ² > (−4.4−μ)/√σ²) = P(Z > (−4.4−(−1))/√4) = P(Z > −1.7) = P(Z < 1.7) = 0.9554

Writing the event in terms of +1.7 is necessary when the table contains only positive quantiles. The standardization can be applied before or after considering the complementary event. If we try solving the integral,

P(X > −4.4) = ∫_{−4.4}^{+∞} f(x) dx = ∫_{−4.4}^{+∞} (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} dx = ?

we are not able to find an antiderivative of f(x)... because no elementary one exists. Then, we may remember that the antiderivative of e^{−x²} cannot be expressed in terms of elementary functions, and that the definite integral of f(x) can be solved exactly only for some limits of integration but can always be solved numerically. On the other hand, by using the statistical software program R, whose function contains cumulative probabilities for events of the form {X<x},

> 1 - pnorm(-4.4, -1, sqrt(4))
[1] 0.9554345

To plot the probability function

values = seq(-10, +10, length=100)
probabilities = dnorm(values, -1, 2)
plot(values, probabilities, type="l", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="N(-1, sd=2)")
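The normal probability can also be recomputed without tables through the error function; a minimal Python cross-check (pnorm is a hypothetical helper mimicking R's, not part of the original solution):

```python
from math import erf, sqrt

def pnorm(x, mean, sd):
    # normal CDF expressed through the error function
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

# P(X > -4.4) for X ~ N(-1, sd=2)
p = 1 - pnorm(-4.4, -1, 2)
print(round(p, 7))  # 0.9554345, matching 1 - pnorm(-4.4, -1, sqrt(4)) in R
```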

(f) The parameter value is κ = 16. The set of possible values is always composed of all positive real numbers. Most tables of the chi-square distribution provide the probability of events of the form P(X>x). In this case, it is necessary to consider the complementary event before looking for the quantile:

P (X ≤a)=0.025 ↔ P (X >a)=1−0.025=0.975 → a = 6.91

We do not use the density function, as it is too complex. By using the statistical software program R, whose function gives quantiles for events of the form {X<x},

> qchisq(0.025, 16)
[1] 6.907664

To plot the probability function

values = seq(0, +40, length=100)
probabilities = dchisq(values, 16)
plot(values, probabilities, type="l", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="Chi-Sq(16)")

(g) Now the parameter value is κ = 27. A variable enjoying the t distribution can take any real value. Most tables of this distribution provide the probability of events of the form P(X>x). In this case, it is not necessary to rewrite the event:

P (X >a)=0.1 → a = 1.314

The density function is too complex to be used. The statistical software program R allows us to do the computation (the function provides quantiles for events of the form {X<x}),

> qt(1-0.1, 27)
[1] 1.313703

To plot the probability function

values = seq(-5, +5, length=100)
probabilities = dt(values, 27)
plot(values, probabilities, type="l", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="t(27)")

(h) The parameter values for this F distribution are κ1 = 10 and κ2 = 8. The possible values are always all positive real numbers. Again, most tables of this distribution provide the probability for events of the form {X>x}, so:

P (X >5.81)=0.01

The density function is also complex. Finally, by using the computer,

> 1 - pf(5.81, 10, 8)
[1] 0.01002326

To plot the probability function

values = seq(0, 10, length=100)
probabilities = df(values, 10, 8)
plot(values, probabilities, type="l", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="F(10,8)")

(i) Now, the parameter values are κ1 = 15 and κ2 = 6. Then:

P (X >a)=0.01 → a = 7.56

The density function is also complex. By again using the computer,

> qf(1-0.01, 15, 6)
[1] 7.558994

To plot the probability function

values = seq(0, 10, length=100)
probabilities = df(values, 15, 6)
plot(values, probabilities, type="l", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="F(15, 6)")

(j) Since the parameter value is κ = 12, after decomposing the event into two disjoint tails

P({X ≤ 1.356} ∪ {X > 3.055}) = P({X ≤ 1.356}) + P({X > 3.055}) = 1 − P({X > 1.356}) + P({X > 3.055}) = 1 − 0.1 + 0.005 = 0.905

The density function is also complex. Finally,

> pt(1.356, 12) + 1 - pt(3.055, 12)
[1] 0.9049621

To plot the probability function

values = seq(-10, +10, length=100)
probabilities = dt(values, 12)
plot(values, probabilities, type="l", lwd=2, ylim=c(0,1), xlab="Value", ylab="Probability", main="t(12)")

Exercise 2pt

Weekly maintenance costs (measured in dollars, $) for a certain factory, recorded over a long period of time and adjusted for inflation, tend to have an approximately normal distribution with an average of $420 and a standard deviation of $30. If $450 is budgeted for next week, what is an approximate probability that this budgeted figure will be exceeded?

(Taken from Mathematical Statistics with Applications. W. Mendenhall, D.D. Wackerly and R.L. Scheaffer. Duxbury Press)

Discussion: We need to extract the mathematical information from the statement. There is a quantity, the weekly maintenance costs, say C, that can be assumed to follow the distribution

C ∼ N(μ = $420, σ = $30) or, in terms of the variance, C ∼ N(μ = $420, σ² = 30² $² = 900 $²)

(In practice, this supposition should be evaluated.) We are asked for the probability P(C > 450). Since C does not follow a standard normal distribution, we standardize both sides of the inequality, by using μ = E(C) = $420 and σ² = Var(C) = 30² $², to be able to use the table of the standard normal distribution:

P(C > 450) = P((C−μ)/√σ² > (450−μ)/√σ²) = P(T > ($450−$420)/$30) = P(T > 30/30)

= P(T > 1) = 1 − P(T ≤ 1) = 1 − 0.8413 = 0.1587
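The table value used above can be double-checked with the error function; a short Python sketch (purely a numeric verification, not part of the original solution):

```python
from math import erf, sqrt

# P(C > 450) for C ~ N(420, sd=30): standardize to P(T > 1)
z = (450 - 420) / 30
p = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # 1 - Phi(1)
print(round(p, 4))  # 0.1587, matching the table-based answer
```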

Exercise 3pt (*)

Find the first two raw (or crude) moments of a random variable X when it enjoys:

(1) The Bernoulli distribution
(2) The binomial distribution
(3) The geometric distribution
(4) The Poisson distribution
(5) The exponential distribution
(6) The normal distribution


Use the following concepts to do the calculations in several ways: (i) their definition; (ii) the probability generating function; (iii) the moment generating function; (iv) the characteristic function; or (v) others. Then, find the mean and the variance of X.

Discussion: Different methods can be applied to calculate the first two moments. We have practiced as many of them as possible, both to learn as much as possible and to compare their difficulty; besides, some of them are more powerful than others. Some of these calculations are advanced. To work with characteristic functions, the definitions and rules of the analysis for complex functions of a real variable must be considered, and even some calculations may be easier if we work with the theory for complex functions of a complex variable. Most of these definitions and rules are “natural generalizations” of those of real analysis, but we must be careful not to apply them without the necessary justification.

(1) The Bernoulli distribution

By applying the definitions

E(X) = ∑_{x=0}^{1} x η^x (1−η)^{1−x} = 0·1·(1−η) + 1·η·1 = η

E(X²) = ∑_{x=0}^{1} x² η^x (1−η)^{1−x} = 0²·η⁰·(1−η) + 1²·η¹·1 = η

By using the probability generating function

G(t) = E(t^X) = ∑_{x=0}^{1} t^x η^x (1−η)^{1−x} = t⁰·1·(1−η) + t¹·η·1 = 1 − η + ηt

This function exists for any t. Now, the usual definitions and rules of the mathematical analysis for real functions of a real variable imply that

E(X) = G^(1)(1) = [η]_{t=1} = η

E(X²) = G^(2)(1) + E(X) = [0]_{t=1} + η = η

By using the moment generating function

M(t) = E(e^{tX}) = ∑_{x=0}^{1} e^{tx} η^x (1−η)^{1−x} = e^{t·0}·1·(1−η) + e^{t·1}·η·1 = 1 − η + ηe^t

This function exists for any real t. Because of the mathematical real analysis,

E(X) = M^(1)(0) = [ηe^t]_{t=0} = η

E(X²) = M^(2)(0) = [ηe^t]_{t=0} = η

By using the characteristic function

φ(t) = E(e^{itX}) = ∑_{x=0}^{1} e^{itx} η^x (1−η)^{1−x} = e^{i·t·0}·1·(1−η) + e^{i·t·1}·η·1 = 1 − η + ηe^{it}

This complex function exists for any real t. Complex analysis is considered to do

E(X) = φ^(1)(0)/i = [ηe^{it} i]_{t=0}/i = ηi/i = η

E(X²) = φ^(2)(0)/i² = [ηe^{it} i²]_{t=0}/i² = ηi²/i² = η


Mean and variance

μ = E(X) = η

σ² = Var(X) = E(X²) − E(X)² = η − η² = η(1−η)
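All four methods give the same answers. As a quick numeric sanity check (a Python sketch; η = 0.3 is just an illustrative value), the Bernoulli moments can be computed by direct enumeration over the two possible values:

```python
# Exact moments of a Bernoulli(eta) variable by enumeration
eta = 0.3
pmf = {0: 1 - eta, 1: eta}
m1 = sum(x * p for x, p in pmf.items())       # E(X)
m2 = sum(x**2 * p for x, p in pmf.items())    # E(X^2)
var = m2 - m1**2                              # Var(X)
print(m1, m2, var)  # eta, eta and eta*(1-eta)
```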

(2) The binomial distribution

By applying the definitions

E(X) = ∑_{x=0}^{κ} x C(κ,x) η^x (1−η)^{κ−x} = ?

E(X²) = ∑_{x=0}^{κ} x² C(κ,x) η^x (1−η)^{κ−x} = ?

(Here C(κ,x) = κ!/(x!(κ−x)!) denotes the binomial coefficient.) A possible way consists in writing X as the sum of κ independent Bernoulli variables: X = ∑_{j=1}^{κ} Y_j.

E(X) = E(∑_{j=1}^{κ} Y_j) = ∑_{j=1}^{κ} E(Y_j) = κ⋅η

E(X²) = E([∑_{j=1}^{κ} Y_j]²) = ?

This way can also be used to calculate the variance easily, but not to calculate the second moment:

σ² = Var(X) = Var(∑_{j=1}^{κ} Y_j) = ∑_{j=1}^{κ} Var(Y_j) = κ⋅η(1−η).

By using the probability generating function

G(t) = E(t^X) = ∑_{x=0}^{κ} t^x C(κ,x) η^x (1−η)^{κ−x} = (1−η)^κ ∑_{x=0}^{κ} C(κ,x) (ηt/(1−η))^x = (1−η)^κ (1 + ηt/(1−η))^κ = [(1−η)(1 + ηt/(1−η))]^κ = (1 − η + ηt)^κ

where the binomial theorem (see the appendixes of Mathematics) has been applied. Alternatively, this function can also be calculated by looking at X as a sum of Bernoulli variables Y_j and applying a property for probability generating functions of a sum of independent random variables,

G(t) = [G_Y(t)]^κ = (1 − η + ηt)^κ

This function exists for any t. Again, real analysis allows us to do

E(X) = G^(1)(1) = [κ(1−η+ηt)^{κ−1} η]_{t=1} = κ·1^{κ−1}·η = κη

E(X²) = G^(2)(1) + E(X) = [κ(κ−1)(1−η+ηt)^{κ−2} η²]_{t=1} + κη = κ(κ−1)η² + κη = κη(κη − η + 1)

By using the moment generating function

M(t) = E(e^{tX}) = ∑_{x=0}^{κ} e^{tx} C(κ,x) η^x (1−η)^{κ−x} = (1−η)^κ ∑_{x=0}^{κ} C(κ,x) (ηe^t/(1−η))^x = (1−η)^κ (1 + ηe^t/(1−η))^κ = [(1−η)(1 + ηe^t/(1−η))]^κ = (1 − η + ηe^t)^κ

Again, it is also possible to look at X as a sum of Bernoulli variables Y_j and apply a property for moment generating functions of a sum of independent random variables,

M(t) = [M_Y(t)]^κ = (1 − η + ηe^t)^κ

This function exists for any real t. Because of the mathematical real analysis,

E(X) = M^(1)(0) = [κ(1−η+ηe^t)^{κ−1} ηe^t]_{t=0} = κη

E(X²) = M^(2)(0) = [κ(κ−1)(1−η+ηe^t)^{κ−2}(ηe^t)² + κ(1−η+ηe^t)^{κ−1} ηe^t]_{t=0} = κ(κ−1)η² + κη = κη(κη − η + 1)

By using the characteristic function

φ(t) = E(e^{itX}) = ∑_{x=0}^{κ} e^{itx} C(κ,x) η^x (1−η)^{κ−x} = (1−η)^κ ∑_{x=0}^{κ} C(κ,x) (ηe^{it}/(1−η))^x = (1−η)^κ (1 + ηe^{it}/(1−η))^κ = [(1−η)(1 + ηe^{it}/(1−η))]^κ = (1 − η + ηe^{it})^κ

Once more, by looking at X as a sum of Bernoulli variables Y_j and applying a property for characteristic functions of a sum of independent random variables,

φ(t) = [φ_Y(t)]^κ = (1 − η + ηe^{it})^κ

This complex function exists for any real t. Again, complex analysis is considered in doing,

E(X) = φ^(1)(0)/i = [κ(1−η+ηe^{it})^{κ−1} ηe^{it} i]_{t=0}/i = κηi/i = κη

E(X²) = φ^(2)(0)/i² = [κ(κ−1)(1−η+ηe^{it})^{κ−2}(ηe^{it} i)² + κ(1−η+ηe^{it})^{κ−1} ηe^{it} i²]_{t=0}/i² = (κ(κ−1)η² i² + κη i²)/i² = κ(κ−1)η² + κη = κη(κη − η + 1)

Mean and variance

μ = E(X) = κη

σ² = Var(X) = E(X²) − E(X)² = κη(κη − η + 1) − (κη)² = κη(1−η)
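The closed forms can be verified numerically by summing over the whole support; a Python sketch for the illustrative values κ = 11 and η = 0.3 used in the first exercise:

```python
from math import comb

# X ~ Bin(k, eta): check E(X) = k*eta, E(X^2) = k*eta*(k*eta - eta + 1),
# Var(X) = k*eta*(1 - eta) by direct summation over the support
k, eta = 11, 0.3
pmf = [comb(k, x) * eta**x * (1 - eta)**(k - x) for x in range(k + 1)]
m1 = sum(x * p for x, p in enumerate(pmf))
m2 = sum(x**2 * p for x, p in enumerate(pmf))
print(m1, m2, m2 - m1**2)  # approximately 3.3, 13.2 and 2.31
```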

(3) The geometric distribution

By applying the definitions

E(X) = ∑_{x=1}^{+∞} x·η·(1−η)^{x−1} = ?

E(X²) = ∑_{x=1}^{+∞} x²·η·(1−η)^{x−1} = ?

As an example, I include a way to calculate E(X) that I found. To prove that any moment of order r is finite or, equivalently, that the series is (absolutely) convergent, we apply the ratio test for nonnegative series:

lim_{x→∞} a_{x+1}/a_x = lim_{x→∞} |(x+1)^r·η·(1−η)^x| / |x^r·η·(1−η)^{x−1}| = lim_{x→∞} ((x+1)^r/x^r)·|1−η| = |1−η| < 1

Mathematically, the radius of convergence is 1, that is,

−1 < 1−η < +1 ↔ −2 < −η < 0 ↔ 0 < η < 2

Probabilistically, the meaning of the variable η (in the geometric distribution, it is a probability between 0 and 1) implies that the series is convergent for any η. Either way, this implies that ∑_{x=1}^{+∞} x^r·η·(1−η)^{x−1} < ∞.

Once the convergence has been proved, the rules of “the usual arithmetic for finite quantities” can be applied. The convergence of the series is crucial for the following calculations.

E(X) = ∑_{x=1}^{+∞} x·η·(1−η)^{x−1} = η ∑_{x=1}^{+∞} x·(1−η)^{x−1} = η[1 + 2(1−η) + 3(1−η)² + ⋯]

= η[∑_{x=0}^{+∞}(1−η)^x + ∑_{x=1}^{+∞}(1−η)^x + ⋯] = η[∑_{x=0}^{+∞}(1−η)^x + (1−η)∑_{x=0}^{+∞}(1−η)^x + ⋯]

= η[∑_{x=0}^{+∞}(1−η)^x]·[1 + (1−η) + ⋯] = η[∑_{x=0}^{+∞}(1−η)^x]² = η(1/(1−(1−η)))² = η(1/η)² = 1/η,

where the formula of the geometric sequence (see the appendixes of Mathematics) has been used. Alternatively, μ can be calculated by applying the formula available in the literature for arithmetico-geometric series.

By using the probability generating function

G(t) = E(t^X) = ∑_{x=1}^{+∞} t^x·η·(1−η)^{x−1} = ηt ∑_{x=0}^{+∞} [t(1−η)]^x = ηt/(1 − (1−η)t)

Given η, this function exists for t such that |t(1−η)| < 1 (otherwise, the series does not converge), as the following criterion shows

lim_{x→∞} a_{x+1}/a_x = lim_{x→∞} |t(1−η)|^{x+1} / |t(1−η)|^x = |t(1−η)| < 1.

The definitions and rules of the mathematical analysis for real functions of a real variable give

E(X) = G^(1)(1) = [(η[1−(1−η)t] − ηt[−(1−η)]) / [1−(1−η)t]²]_{t=1} = [η / [1−(1−η)t]²]_{t=1} = η/η² = 1/η

E(X²) = G^(2)(1) + E(X) = [η·2[1−(1−η)t](1−η) / [1−(1−η)t]⁴]_{t=1} + 1/η = [2η(1−η) / [1−(1−η)t]³]_{t=1} + 1/η = 2(1−η)/η² + 1/η = (2−η)/η²

By using the moment generating function

M(t) = E(e^{tX}) = ∑_{x=1}^{+∞} e^{tx}·η·(1−η)^{x−1} = ηe^t ∑_{x=0}^{+∞} [e^t(1−η)]^x = ηe^t/(1 − (1−η)e^t)

This function exists for any real t such that |e^t(1−η)| < 1 (otherwise, the series does not converge), as the following criterion shows

lim_{x→∞} a_{x+1}/a_x = lim_{x→∞} |e^t(1−η)|^{x+1} / |e^t(1−η)|^x = |e^t(1−η)| < 1.

Because of the mathematical real analysis,

E(X) = M^(1)(0) = [(ηe^t[1−(1−η)e^t] − ηe^t[−(1−η)e^t]) / [1−(1−η)e^t]²]_{t=0} = [ηe^t[1−(1−η)e^t+(1−η)e^t] / [1−(1−η)e^t]²]_{t=0} = [ηe^t / [1−(1−η)e^t]²]_{t=0} = η/η² = 1/η

E(X²) = M^(2)(0) = [(ηe^t[1−(1−η)e^t]² − ηe^t·2[1−(1−η)e^t][−(1−η)e^t]) / [1−(1−η)e^t]⁴]_{t=0} = [ηe^t[1−(1−η)e^t+2(1−η)e^t] / [1−(1−η)e^t]³]_{t=0} = [ηe^t[1+(1−η)e^t] / [1−(1−η)e^t]³]_{t=0} = η(2−η)/η³ = (2−η)/η²

By using the characteristic function

φ(t) = E(e^{itX}) = ∑_{x=1}^{+∞} e^{itx}·η·(1−η)^{x−1} = ηe^{it} ∑_{x=0}^{+∞} [e^{it}(1−η)]^x = ηe^{it}/(1 − (1−η)e^{it})

This complex function exists for any real t such that |e^{it}(1−η)| < 1, where |z| denotes the modulus of a complex number z (otherwise, the series does not converge), as the following criterion shows

lim_{x→∞} a_{x+1}/a_x = lim_{x→∞} |e^{it}(1−η)|^{x+1} / |e^{it}(1−η)|^x = |e^{it}(1−η)| < 1.

Once more, complex analysis allows us to do,

E(X) = φ^(1)(0)/i = (1/i)[(ηe^{it}i[1−(1−η)e^{it}] − ηe^{it}[−(1−η)e^{it}i]) / [1−(1−η)e^{it}]²]_{t=0} = (1/i)[ηe^{it}i[1−(1−η)e^{it}+(1−η)e^{it}] / [1−(1−η)e^{it}]²]_{t=0} = (1/i)[ηe^{it}i / [1−(1−η)e^{it}]²]_{t=0} = (1/i)(ηi/η²) = 1/η

E(X²) = φ^(2)(0)/i² = (1/i²)[(ηe^{it}i²[1−(1−η)e^{it}]² − ηe^{it}i·2[1−(1−η)e^{it}][−(1−η)e^{it}i]) / [1−(1−η)e^{it}]⁴]_{t=0} = (1/i²)[ηe^{it}i²[1−(1−η)e^{it}+2(1−η)e^{it}] / [1−(1−η)e^{it}]³]_{t=0} = (1/i²)[ηe^{it}i²[1+(1−η)e^{it}] / [1−(1−η)e^{it}]³]_{t=0} = (1/i²)·ηi²(2−η)/η³ = (2−η)/η²

Mean and variance

μ = E(X) = 1/η

σ² = Var(X) = E(X²) − E(X)² = (2−η)/η² − (1/η)² = (1−η)/η²
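A quick numeric sanity check of μ = 1/η and E(X²) = (2−η)/η²; a Python sketch that truncates the series far out (η = 0.4 is an illustrative value, and 200 terms are far more than needed since (1−η)^x decays geometrically):

```python
# Geometric distribution on {1, 2, ...}: truncated-series check of the moments
eta = 0.4
m1 = sum(x * eta * (1 - eta)**(x - 1) for x in range(1, 200))
m2 = sum(x**2 * eta * (1 - eta)**(x - 1) for x in range(1, 200))
print(round(m1, 8), round(m2, 8))  # 1/eta = 2.5 and (2-eta)/eta^2 = 10.0
```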

Advanced theory: Additional way 1: In Cálculo de probabilidades I, by Vélez, R., and V. Hernández, UNED, the first four moments are calculated as follows (I write the calculations for the first two moments, with the notation we are using)

E(X) = ∑_{x=1}^{+∞} x·η·(1−η)^{x−1} = η ∑_{x=1}^{+∞} x·(1−η)^{x−1} = η d/d(1−η)(∑_{x=1}^{+∞} (1−η)^x) = η d/d(1−η)((1−η)/(1−(1−η))) = η (1·[1−(1−η)] − (1−η)(−1))/[1−(1−η)]² = η/η² = 1/η

E(X²) = ∑_{x=1}^{+∞} x²·η·(1−η)^{x−1} = η ∑_{x=1}^{+∞} (x+1)x·(1−η)^{x−1} − η ∑_{x=1}^{+∞} x·(1−η)^{x−1}

= η d²/d(1−η)²(∑_{x=1}^{+∞} (1−η)^{x+1}) − E(X) = η d²/d(1−η)²((1−η)²/(1−(1−η))) − E(X)

= η d/d(1−η)((2(1−η)[1−(1−η)] − (1−η)²(−1))/[1−(1−η)]²) − E(X)

= η d/d(1−η)((2(1−η) − 2(1−η)² + (1−η)²)/[1−(1−η)]²) − E(X) = η d/d(1−η)((2(1−η) − (1−η)²)/[1−(1−η)]²) − E(X)

= η ([2 − 2(1−η)][1−(1−η)]² − [2(1−η) − (1−η)²]·2[1−(1−η)](−1))/[1−(1−η)]⁴ − E(X)

= η (2[1−(1−η)]² + 2[2(1−η) − (1−η)²])/[1−(1−η)]³ − E(X)

= (2η² + 4(1−η) − 2(1−η)²)/η² − η/η² = (2η² + 4 − 4η − 2 − 2η² + 4η − η)/η² = (2−η)/η²

(We have already justified the convergence of the series involved.) Additional way 2: In trying to find a way based on calculating the main part of the series by using an ordinary differential equation, as I had previously done for the Poisson distribution (in the next section), I found the following way that is essentially the same as the additional way above. A series can be differentiated and integrated term by term inside the circle of convergence (the radius of convergence was one, which included all possible values for η). The expression of the mean suggests the following definition for g(η):

E(X) = ∑_{x=1}^{+∞} x·η·(1−η)^{x−1} = η·g(η) → g(η) = ∑_{x=1}^{+∞} x·(1−η)^{x−1}

and it follows, since g is a well-behaved function of η, that

G(η) = ∫ g(η) dη = ∑_{x=1}^{+∞} ∫ x·(1−η)^{x−1} dη = −∑_{x=1}^{+∞} (1−η)^x + c = −(1−η)/(1−(1−η)) + c = (η−1)/η + c

I spent some time searching for a differential equation... and I found this integral one. Now, by solving it,

g(η) = G'(η) = (η − (η−1))/η² + 0 = 1/η²

(This is a general method to calculate some infinite series.) Finally, the mean is

E(X) = η·g(η) = η·(1/η²) = 1/η

For the second moment, we define

E(X²) = ∑_{x=1}^{+∞} x²·η·(1−η)^{x−1} = η·g(η) → g(η) = ∑_{x=1}^{+∞} x²·(1−η)^{x−1}

and it follows that

G(η) = ∫ g(η) dη = ∑_{x=1}^{+∞} x ∫ x·(1−η)^{x−1} dη = −∑_{x=1}^{+∞} x(1−η)^x + c

= −((1−η)/η) ∑_{x=1}^{+∞} x·η·(1−η)^{x−1} + c = c + (η−1)/η²

Now, by solving this trivial integral equation,

g(η) = G'(η) = 0 + (η² − (η−1)·2η)/η⁴ = (η² − 2η² + 2η)/η⁴ = (2−η)/η³

Finally, the second moment is

E(X²) = η·g(η) = η·(2−η)/η³ = (2−η)/η²


Remark: Working with the whole series of μ(η) or σ²(η), as functions of η, is more difficult than working with the previous functions g(η), since the variable η would appear twice instead of once (I spent some time until I realized it).

(4) The Poisson distribution

By applying the definitions

E(X) = ∑_{x=0}^{+∞} x·(λ^x/x!)e^{−λ} = ?

E(X²) = ∑_{x=0}^{+∞} x²·(λ^x/x!)e^{−λ} = ?

To prove that any moment of order r is finite or, equivalently, that the series is convergent, we apply the ratio test for nonnegative series:

lim_{x→∞} a_{x+1}/a_x = lim_{x→∞} |(x+1)^r·(λ^{x+1}/(x+1)!)e^{−λ}| / |x^r·(λ^x/x!)e^{−λ}| = lim_{x→∞} ((x+1)^r/x^r)·λ/(x+1) = 0 < 1

This implies that ∑_{x=0}^{+∞} x^r·(λ^x/x!)e^{−λ} = e^{−λ} ∑_{x=0}^{+∞} x^r·λ^x/x! < ∞. Once the (absolute) convergence has been proved, the rules of “the usual arithmetic for finite quantities” could be applied. Nevertheless, working with factorial numbers in series makes it easy to prove the convergence but difficult to find the value.

By using the probability generating function

G(t) = E(t^X) = ∑_{x=0}^{+∞} t^x·(λ^x/x!)e^{−λ} = e^{−λ} ∑_{x=0}^{+∞} (tλ)^x/x! = e^{−λ}e^{tλ} = e^{λ(t−1)}

This function exists for any t, as the following criterion shows

lim_{x→∞} a_{x+1}/a_x = lim_{x→∞} |(tλ)^{x+1}/(x+1)!| / |(tλ)^x/x!| = lim_{x→∞} |tλ|/(x+1) = 0 < 1.

Now, the definitions and rules of the mathematical analysis for real functions of a real variable give

E(X) = G^(1)(1) = [e^{λ(t−1)}λ]_{t=1} = λ

E(X²) = G^(2)(1) + E(X) = [e^{λ(t−1)}λ²]_{t=1} + E(X) = λ² + λ

By using the moment generating function

M(t) = E(e^{tX}) = ∑_{x=0}^{+∞} e^{tx}·(λ^x/x!)e^{−λ} = e^{−λ} ∑_{x=0}^{+∞} (e^t λ)^x/x! = e^{−λ}e^{e^t λ} = e^{λ(e^t − 1)}

This function exists for any real t, as the following criterion shows

lim_{x→∞} a_{x+1}/a_x = lim_{x→∞} |(e^t λ)^{x+1}/(x+1)!| / |(e^t λ)^x/x!| = lim_{x→∞} |e^t λ|/(x+1) = 0 < 1.

Because of the mathematical real analysis,

E(X) = M^(1)(0) = [e^{λ(e^t−1)}λe^t]_{t=0} = λ

E(X²) = M^(2)(0) = [e^{λ(e^t−1)}(λe^t)² + e^{λ(e^t−1)}λe^t]_{t=0} = [e^{λ(e^t−1)}λe^t(λe^t + 1)]_{t=0} = λ(λ+1) = λ² + λ

By using the characteristic function

φ(t) = E(e^{itX}) = ∑_{x=0}^{+∞} e^{itx}·(λ^x/x!)e^{−λ} = e^{−λ} ∑_{x=0}^{+∞} (e^{it}λ)^x/x! = e^{−λ}e^{e^{it}λ} = e^{λ(e^{it}−1)}

This function exists for any real t, as the following criterion shows

lim_{x→∞} a_{x+1}/a_x = lim_{x→∞} |(e^{it}λ)^{x+1}/(x+1)!| / |(e^{it}λ)^x/x!| = lim_{x→∞} |e^{it}λ|/(x+1) = 0 < 1.

The definitions and rules of the analysis for complex functions have been applied in the previous calculations (they are similar to those for real functions of a real variable). Now, by using the analysis for complex functions of one real variable,

E(X) = φ^(1)(0)/i = [e^{λ(e^{it}−1)}λe^{it}i]_{t=0}/i = λi/i = λ

E(X²) = φ^(2)(0)/i² = [e^{λ(e^{it}−1)}(λe^{it}i)² + e^{λ(e^{it}−1)}λe^{it}i²]_{t=0}/i² = [e^{λ(e^{it}−1)}λe^{it}i²(λe^{it}+1)]_{t=0}/i² = λi²(λ+1)/i² = λ² + λ

Mean and variance

μ = E(X) = λ

σ² = Var(X) = E(X²) − E(X)² = λ² + λ − λ² = λ
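The closed forms E(X) = λ and E(X²) = λ² + λ can be sanity-checked by truncating the (rapidly convergent) series; a Python sketch for the illustrative value λ = 2.7 from the first exercise:

```python
from math import exp, factorial

# Poisson(lam): truncated-series check of the first two raw moments
lam = 2.7
m1 = sum(x * lam**x / factorial(x) * exp(-lam) for x in range(60))
m2 = sum(x**2 * lam**x / factorial(x) * exp(-lam) for x in range(60))
print(round(m1, 8), round(m2, 8))  # lam = 2.7 and lam**2 + lam = 9.99
```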

Advanced theory: Additional way 1: In finding ways, I found the following one. A series can be differentiated and integrated term by term inside its circle of convergence. The limit calculated at the beginning was the same for any λ, so the radius of convergence for λ is infinite when the series is looked at as a function of λ. The expression of the mean suggests the following definition for g(λ):

E(X) = ∑_{x=0}^{+∞} x·(λ^x/x!)e^{−λ} = e^{−λ}g(λ) → g(λ) = ∑_{x=0}^{+∞} x·λ^x/x!

and it follows, since g is a well-behaved function of λ, that

g'(λ) = ∑_{x=1}^{+∞} x·(xλ^{x−1}/x!) = ∑_{x=1}^{+∞} (1 + x − 1)·λ^{x−1}/(x−1)! = ∑_{x=1}^{+∞} λ^{x−1}/(x−1)! + ∑_{x=1}^{+∞} (x−1)·λ^{x−1}/(x−1)! = e^λ + g(λ)

Now, we solve the first-order ordinary differential equation g ' (λ)−g(λ)=eλ .

Homogeneous equation:

g'(λ) − g(λ) = 0 → dg/dλ = g → (1/g)dg = dλ → log(g) = λ + k → g_h(λ) = e^{λ+k} = ce^λ

Particular solution: We apply, for example, the method of variation of parameters or constants. Substituting in the equation g(λ) = c(λ)e^λ and g'(λ) = c'(λ)e^λ + c(λ)e^λ,

c'(λ)e^λ + c(λ)e^λ − c(λ)e^λ = e^λ → c'(λ) = 1 → c(λ) = λ → g_p(λ) = λe^λ

General solution: g(λ) = g_h(λ) + g_p(λ) = ce^λ + λe^λ = (c + λ)e^λ

Any g(λ) given by the previous expression verifies the differential equation, so an additional condition is necessary to determine the value of c. The initial definition implies that g(0) = 0, so c = 0. Finally, the mean is

E(X) = e^{−λ}g(λ) = e^{−λ}λe^λ = λ

(The same can be done to calculate some infinite series.) For the second moment, we define

E(X²) = ∑_{x=0}^{+∞} x²·(λ^x/x!)e^{−λ} = e^{−λ}g(λ) → g(λ) = ∑_{x=0}^{+∞} x²·λ^x/x!

and it follows, since g is a well-behaved function of λ, that

g'(λ) = ∑_{x=1}^{+∞} x²·(xλ^{x−1}/x!) = ∑_{x=1}^{+∞} (1 + x − 1)²·λ^{x−1}/(x−1)! = ∑_{x=1}^{+∞} [1 + (x−1)² + 2(x−1)]·λ^{x−1}/(x−1)!

= ∑_{x=1}^{+∞} λ^{x−1}/(x−1)! + ∑_{x=1}^{+∞} (x−1)²·λ^{x−1}/(x−1)! + 2∑_{x=1}^{+∞} (x−1)·λ^{x−1}/(x−1)! = e^λ + g(λ) + 2e^λλ

(The expression of the expectation of X has been used in the last term.) Thus, the function we are looking for verifies the first-order ordinary differential equation g'(λ) − g(λ) = e^λ(1 + 2λ).

Homogeneous equation: This equation is the same, so g_h(λ) = e^{λ+k} = ce^λ

Particular solution: By applying the same method,

c'(λ)e^λ + c(λ)e^λ − c(λ)e^λ = e^λ(1 + 2λ) → c'(λ) = 1 + 2λ → c(λ) = λ + λ² → g_p(λ) = (λ + λ²)e^λ

General solution: g(λ) = g_h(λ) + g_p(λ) = ce^λ + (λ + λ²)e^λ = (c + λ + λ²)e^λ

Any g(λ) given by the previous expression verifies the differential equation, so an additional condition is necessary to determine the value of c. The definition above implies that g(0) = 0, so c = 0. Finally, the second moment is

E(X²) = e^{−λ}g(λ) = e^{−λ}(λ + λ²)e^λ = λ + λ²

Remark: Working with the whole series of μ(λ) or σ²(λ) as functions of λ is more difficult than working with the previous functions g(λ), since the variable λ would appear twice instead of once. Additional way 2: Another way consists in using a relation involving the Stirling polynomials (see, e.g., § 2.69 of Análisis combinatorio: problemas y ejercicios. Ríbnikov et al. Mir)

∑_{j=0}^{+∞} j^n·x^j/j! = e^x P_n(x),  with P₀(x) = 1, P₁(x) = x, P₂(x) = x(1+x), ..., P_{n+1}(x) = x ∑_{j=0}^{n} C(n,j) P_j(x)

In this case,

E(X) = e^{−λ} ∑_{x=0}^{+∞} x·λ^x/x! = e^{−λ}·e^λ P₁(λ) = λ.

E(X²) = e^{−λ} ∑_{x=0}^{+∞} x²·λ^x/x! = e^{−λ}·e^λ P₂(λ) = λ² + λ.

(5) The exponential distribution

By applying the definitions

E(X) = ∫₀^{+∞} xλe^{−λx} dx = [−xe^{−λx}]₀^{+∞} − ∫₀^{+∞} (−e^{−λx}) dx = [−xe^{−λx}]₀^{+∞} − (1/λ)[e^{−λx}]₀^{+∞} = 0 − (1/λ)(0 − 1) = 1/λ

Where the formula ∫u(x)·v'(x)dx = u(x)·v(x) − ∫u'(x)·v(x)dx of integration by parts has been applied with x and λe^{−λx} as initial functions (since these two functions are of “different type”):

• u=x → u '=1

• v '=λ e−λ x → v=∫λ e−λ x dx=−e−λ x

For the second-order moment,

E(X²) = ∫₀^{+∞} x²λe^{−λx} dx = [−x²e^{−λx}]₀^{+∞} − ∫₀^{+∞} (−2xe^{−λx}) dx = 0 + 2λ^{−1} ∫₀^{+∞} xλe^{−λx} dx = 2λ^{−1}μ = 2/λ²

Where the formula ∫u(x)·v'(x)dx = u(x)·v(x) − ∫u'(x)·v(x)dx of integration by parts has been applied with x² and λe^{−λx} as initial functions (since these two functions are of “different type”):

• u=x2 → u '=2 x

• v '=λ e−λ x → v=∫λ e−λ x dx=−e−λ x

That the function e^x changes faster than x^k, for any k, has been used too in calculating both integrals. On the other hand, for the exponential distribution λ > 0, so the previous integrals always converge.

By using the moment generating function

M(t) = E(e^{tX}) = ∫₀^{+∞} e^{tx}λe^{−λx} dx = λ∫₀^{+∞} e^{x(t−λ)} dx = (λ/(t−λ))[e^{x(t−λ)}]₀^{+∞} = λ/(λ−t)

This function exists for real t such that t − λ < 0 (otherwise, the integral does not converge). Because of the mathematical real analysis,

E(X) = M^(1)(0) = [−λ(−1)/(λ−t)²]_{t=0} = λ/λ² = 1/λ

E(X²) = M^(2)(0) = [−2λ(λ−t)(−1)/(λ−t)⁴]_{t=0} = [2λ/(λ−t)³]_{t=0} = 2/λ²

By using the characteristic function

φ(t) = E(e^{itX}) = ∫₀^{+∞} e^{itx}λe^{−λx} dx = λ∫₀^{+∞} e^{x(it−λ)} dx = λ lim_{M→∞} ∫_{{z=γ, 0≤γ≤M}} e^{z(it−λ)} dz

= λ lim_{M→∞} [e^{z(it−λ)}/(it−λ)]_{{z=γ, 0≤γ≤M}} = (λ/(it−λ)) lim_{M→∞} [e^{M(it−λ)} − 1] = (λ/(it−λ)) lim_{M→∞} [e^{−Mλ}e^{iMt} − 1] = λ/(λ−it)

This function exists for any real t such that it − λ ≠ 0 (dividing by zero is not allowed). In the previous calculation, the fact that the complex integrand is differentiable has been used to calculate the (line) complex integral by using an antiderivative and the equivalent of Barrow's rule. Now, the definitions and rules of the analysis for complex functions of a real variable must be considered to do

E(X) = φ^(1)(0)/i = (1/i)[−λ(−i)/(λ−it)²]_{t=0} = (1/i)·λi/λ² = 1/λ

E(X²) = φ^(2)(0)/i² = (1/i²)[−λi·2(λ−it)(−i)/(λ−it)⁴]_{t=0} = (1/i²)[2λi²/(λ−it)³]_{t=0} = 2λ/λ³ = 2/λ²

Mean and variance

μ = E(X) = 1/λ

σ² = Var(X) = E(X²) − E(X)² = 2/λ² − (1/λ)² = 1/λ²
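The integrals above have closed forms, so a direct numerical quadrature makes a good sanity check; a Python sketch using a simple composite Simpson rule (λ = 0.5 is an illustrative value; the upper limit 50 stands in for +∞, where the integrand is negligible):

```python
from math import exp

def simpson(f, a, b, n=20000):
    # composite Simpson rule on [a, b] with n (even) subintervals
    h = (b - a) / n
    s = f(a) + f(b) + sum((4 if i % 2 else 2) * f(a + i * h) for i in range(1, n))
    return s * h / 3

lam = 0.5
m1 = simpson(lambda x: x * lam * exp(-lam * x), 0, 50)      # E(X)
m2 = simpson(lambda x: x**2 * lam * exp(-lam * x), 0, 50)   # E(X^2)
print(round(m1, 6), round(m2, 6))  # approximately 1/lam = 2 and 2/lam^2 = 8
```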


(6) The normal distribution

By applying the definitions

E (X )=∫−∞

+∞

x1

√2πσ2e−(x−μ)

2

2σ2

dx=∫−∞

+∞

(t+μ)1

√2πσ2e−

t 2

2σ2

dt

=∫−∞

+∞

t1

√2πσ2e−

t2

2σ 2

dt +μ∫−∞

+∞ 1

√2πσ2e−

t2

2σ 2

dt =0+μ⋅1=μ

Where the change t = x − μ → x = t + μ → dx = dt has been applied. In the second line, the first integral is zero because the integrand is an odd function and the range of integration is symmetric, while the second integral is one because f(x) is a density function.

E(X²) = ∫_{−∞}^{+∞} x² (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} dx = ∫_{−∞}^{+∞} (t+μ)² (1/√(2πσ²)) e^{−t²/(2σ²)} dt = ∫_{−∞}^{+∞} (t² + μ² + 2μt)(1/√(2πσ²)) e^{−t²/(2σ²)} dt

= ∫_{−∞}^{+∞} t² (1/√(2πσ²)) e^{−t²/(2σ²)} dt + μ² ∫_{−∞}^{+∞} (1/√(2πσ²)) e^{−t²/(2σ²)} dt + 2μ ∫_{−∞}^{+∞} t (1/√(2πσ²)) e^{−t²/(2σ²)} dt

= (1/√(2πσ²)) ∫_{−∞}^{+∞} t² e^{−t²/(2σ²)} dt + μ²·1 + 2μ·0 = (1/√(2πσ²))·σ²√(2πσ²) + μ² = σ² + μ²

where the first integral has been calculated as follows.

∫_{−∞}^{+∞} t² e^{−t²/(2σ²)} dt = ∫_{−∞}^{+∞} t·t e^{−t²/(2σ²)} dt = [−tσ² e^{−t²/(2σ²)}]_{−∞}^{+∞} + σ² ∫_{−∞}^{+∞} e^{−t²/(2σ²)} dt = (0 − 0) + σ² ∫_{−∞}^{+∞} e^{−(t/√(2σ²))²} dt = σ²√(2σ²) ∫_{−∞}^{+∞} e^{−u²} du = σ²√(2πσ²)

Firstly, we have applied integration by parts

• u = t → u' = 1

• v' = t e^{−t²/(2σ²)} → v = ∫ t e^{−t²/(2σ²)} dt = −σ² e^{−t²/(2σ²)}

(Again, the function e^x changes faster than x^k, for any k.) Then, we have applied the change t/√(2σ²) = u → t = u√(2σ²) → dt = du√(2σ²) and the well-known result ∫_{−∞}^{+∞} e^{−x²} dx = √π (see the appendix of Mathematics). On the other hand, these integrals always converge.

By using the moment generating function

M(t) = E(e^{tX}) = ∫_{−∞}^{+∞} e^{tx} · (1/√(2πσ²)) · e^{−(x−μ)²/(2σ²)} dx = (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{tx − (x−μ)²/(2σ²)} dx = e^{(1/2)t(2μ+σ²t)}

since

∫_{−∞}^{+∞} e^{xt} e^{−(x−μ)²/(2σ²)} dx = ∫_{−∞}^{+∞} e^{−(1/(2σ²))[−2σ²tx + x² + μ² − 2μx]} dx = ∫_{−∞}^{+∞} e^{−(1/(2σ²)){x² − 2x[σ²t+μ] + μ²}} dx

= ∫_{−∞}^{+∞} e^{−(1/(2σ²)){(x−[σ²t+μ])² − [σ²t+μ]² + μ²}} dx = e^{−(1/(2σ²)){μ² − [σ²t+μ]²}} ∫_{−∞}^{+∞} e^{−((x−[σ²t+μ])/√(2σ²))²} dx

= e^{−(1/(2σ²))(μ−[σ²t+μ])(μ+[σ²t+μ])} ∫_{−∞}^{+∞} e^{−u²} √(2σ²) du = e^{−(1/(2σ²))[−σ²t][2μ+σ²t]} √(2πσ²)

= e^{(1/2)t[2μ+σ²t]} √(2πσ²)

where we have applied the change

(x − [σ²t+μ])/√(2σ²) = u → x = u√(2σ²) + [σ²t+μ] → dx = du·√(2σ²)

The integrand suggested completing the square in the exponent. This approach is indicated in Probability and Random Processes, by Grimmett and Stirzaker (Oxford University Press), for the standard normal distribution; we have used the same idea for the general normal distribution. This function exists for any real t. Now, by real analysis,

E(X) = M⁽¹⁾(0) = [e^{(1/2)t(2μ+σ²t)} · (1/2)(2μ + σ²·2t)]_{t=0} = [e^{(1/2)t(2μ+σ²t)} (μ + σ²t)]_{t=0} = μ

E(X²) = M⁽²⁾(0) = [e^{(1/2)t(2μ+σ²t)} [(μ + σ²t)² + σ²]]_{t=0} = μ² + σ²
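The closed form M(t) = e^{(1/2)t(2μ+σ²t)} = e^{μt + σ²t²/2} can be confirmed numerically. In the sketch below (my own check, not part of the text; μ, σ, t and the integration grid are arbitrary choices), the integral E(e^{tX}) is approximated directly with a midpoint rule and compared with the closed form.

```python
import math

# Approximate M(t) = E(e^{tX}) for X ~ N(mu, sigma^2) by midpoint integration
# over [mu - w*sigma, mu + w*sigma]; w is chosen wide enough that the
# truncated tails are negligible for the parameters used here.
def mgf_numeric(t, mu, sigma, half_width=12.0, steps=200_000):
    a = mu - half_width * sigma
    dx = 2 * half_width * sigma / steps
    norm = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * dx
        total += math.exp(t * x) * norm * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) * dx
    return total

mu, sigma, t = 1.0, 2.0, 0.3
closed_form = math.exp(0.5 * t * (2 * mu + sigma ** 2 * t))   # e^{mu*t + sigma^2 t^2/2}
print(mgf_numeric(t, mu, sigma), closed_form)
```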

By using the characteristic function

φ(t) = E(e^{itX}) = ∫_{−∞}^{+∞} e^{itx} · (1/√(2πσ²)) · e^{−(x−μ)²/(2σ²)} dx = (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{itx − (x−μ)²/(2σ²)} dx = ⋯ = e^{(1/2)it(2μ+σ²it)}

This function exists for any real t. In this case, using the previous calculations with it in place of t leads to the correct result, but the whole derivation does not carry over: in complex analysis we can also make a square appear in the exponent, as well as move coefficients outside of the integral (these operations are not trivial generalizations of the analogous ones in real analysis, and it is necessary to take into account the definitions, properties and results of complex analysis), but the integrals must be solved in the proper way. (For this section, I have consulted Variable compleja y aplicaciones, Churchill, R.V., and J.W. Brown, McGraw-Hill, 5th ed., and Teoría de las funciones analíticas, Markushevich, A., Mir, 1st ed., 2nd reprint.) When the following limit exists, the integral can be solved as follows:

∫_{−∞}^{+∞} e^{itx − (x−μ)²/(2σ²)} dx = lim_{M→∞} ∫_{−M}^{+M} e^{itx − (x−μ)²/(2σ²)} dx

Now, by completing a square in the exponent, as for the previous generating functions,

∫_{−M}^{+M} e^{itx − (x−μ)²/(2σ²)} dx = ⋯ = e^{(1/2)it[2μ+σ²it]} ∫_{−M}^{+M} e^{−(1/(2σ²))(x−μ−iσ²t)²} dx

Because of the rules of complex analysis, these calculations are similar (but based on new definitions and properties) to those of the previous sections. What is quite different is the way of solving the integral. Now we cannot find an antiderivative of the integrand, as we did for the exponential distribution, and therefore we must think of calculating the integral by considering a contour containing the points {x − μ − iσ²t, −M ≤ x ≤ +M}. The integral of a complex function is null along any closed contour within the domain in which the function is differentiable. We consider the contour:

C(γ) = C_I(γ) ∪ C_II(γ) ∪ C_III(γ) ∪ C_IV(γ)

C_I(γ) = {z = γ − μ − itσ², −M ≤ γ ≤ +M}
C_II(γ) = {z = M − μ + i(γ − tσ²), 0 ≤ γ ≤ tσ²}
C_III(γ) = {z = −(γ − μ), −M ≤ γ ≤ +M}
C_IV(γ) = {z = −M − μ − iγ, 0 ≤ γ ≤ tσ²}


Then,

0 = ∫_C f(z) dz = ∫_{C_I} f(z) dz + ∫_{C_II} f(z) dz + ∫_{C_III} f(z) dz + ∫_{C_IV} f(z) dz

so, for f(z) = e^{−z²/(2σ²)},

∫_{−M}^{+M} e^{−(1/(2σ²))(x−μ−iσ²t)²} dx = −∫_{C_II} e^{−z²/(2σ²)} dz − ∫_{C_III} e^{−z²/(2σ²)} dz − ∫_{C_IV} e^{−z²/(2σ²)} dz

= −∫₀^{tσ²} e^{−(1/(2σ²))[M−μ+i(γ−tσ²)]²} dγ − ∫_{−M}^{+M} e^{−(1/(2σ²))[−(γ−μ)]²} dγ − ∫₀^{tσ²} e^{−(1/(2σ²))[−M−μ−iγ]²} dγ

= −∫₀^{tσ²} e^{−(1/(2σ²))[(M−μ)² − (γ−tσ²)² + i2(M−μ)(γ−tσ²)]} dγ − ∫_{−M}^{+M} e^{−(1/(2σ²))(γ−μ)²} dγ − ∫₀^{tσ²} e^{−(1/(2σ²))[(M+μ)² − γ² + i2(M+μ)γ]} dγ

We are interested in the limit as M increases. For the first integral,

|∫₀^{tσ²} e^{−(1/(2σ²))(M−μ)²} e^{(1/(2σ²))(γ−tσ²)²} e^{−(1/(2σ²))i2(M−μ)(γ−tσ²)} dγ| ≤ ∫₀^{tσ²} |e^{−(1/(2σ²))(M−μ)²} e^{(1/(2σ²))(γ−tσ²)²} e^{−(1/(2σ²))i2(M−μ)(γ−tσ²)}| dγ

= e^{−(1/(2σ²))(M−μ)²} ∫₀^{tσ²} e^{(1/(2σ²))(γ−tσ²)²} dγ →_{M→∞} 0

since |e^{ic}| = |cos(c) + i sin(c)| = 1, ∀c ∈ ℝ, and the last integral is finite (the integrand is a continuous function and the interval of integration is compact) and does not depend on M. For the second integral,

∫_{−M}^{+M} e^{−(1/(2σ²))(γ−μ)²} dγ = ∫_{−M}^{+M} e^{−((γ−μ)/√(2σ²))²} dγ = ∫_{(−M−μ)/√(2σ²)}^{(+M−μ)/√(2σ²)} e^{−u²} √(2σ²) du →_{M→∞} √(2σ²) ∫_{−∞}^{+∞} e^{−u²} du = √(2πσ²)

where the change

(γ−μ)/√(2σ²) = u → γ = u√(2σ²) + μ → dγ = du·√(2σ²), with (−M−μ)/√(2σ²) ≤ (γ−μ)/√(2σ²) ≤ (+M−μ)/√(2σ²),

has been applied. Finally, for the third integral,

|∫₀^{tσ²} e^{−(1/(2σ²))(M+μ)²} e^{(1/(2σ²))γ²} e^{−(1/(2σ²))i2(M+μ)γ} dγ| ≤ e^{−(1/(2σ²))(M+μ)²} ∫₀^{tσ²} e^{(1/(2σ²))γ²} dγ →_{M→∞} 0

Again, the last integral is finite and does not depend on M. In short,

φ(t) = (1/√(2πσ²)) ∫_{−∞}^{+∞} e^{itx − (x−μ)²/(2σ²)} dx = (1/√(2πσ²)) lim_{M→∞} ∫_{−M}^{+M} e^{itx − (x−μ)²/(2σ²)} dx

= (1/√(2πσ²)) e^{(1/2)it[2μ+σ²it]} lim_{M→∞} ∫_{−M}^{+M} e^{−(1/(2σ²))(x−μ−iσ²t)²} dx = (1/√(2πσ²)) e^{(1/2)it[2μ+σ²it]} √(2πσ²) = e^{(1/2)it[2μ+σ²it]}

This function exists for any real t. (The reader can notice that the correct way is slightly longer.) Now,

E(X) = φ⁽¹⁾(0)/i = [e^{(1/2)it[2μ+σ²it]} · (1/2)i(2μ + iσ²·2t)]_{t=0} / i = [e^{(1/2)it[2μ+σ²it]} · i(μ + iσ²t)]_{t=0} / i = iμ/i = μ

E(X²) = φ⁽²⁾(0)/i² = [e^{(1/2)it[2μ+σ²it]} [i²(μ + iσ²t)² + i(iσ²)]]_{t=0} / i² = (i²μ² + i²σ²)/i² = μ² + σ²
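The closed form φ(t) = e^{(1/2)it(2μ+σ²it)} = e^{iμt − σ²t²/2} can also be confirmed by integrating e^{itx} f(x) numerically with complex arithmetic. This sketch is my own check, not part of the text; μ, σ, t and the grid are arbitrary choices.

```python
import cmath
import math

# Approximate phi(t) = E(e^{itX}) for X ~ N(mu, sigma^2) by a midpoint rule
# with complex values; the truncation window is wide enough that the
# discarded tails are negligible for the parameters used here.
def cf_numeric(t, mu, sigma, half_width=10.0, steps=200_000):
    a = mu - half_width * sigma
    dx = 2 * half_width * sigma / steps
    norm = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    total = 0 + 0j
    for i in range(steps):
        x = a + (i + 0.5) * dx
        total += cmath.exp(1j * t * x) * norm * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) * dx
    return total

mu, sigma, t = 1.0, 2.0, 0.7
closed_form = cmath.exp(1j * mu * t - 0.5 * (sigma * t) ** 2)
print(cf_numeric(t, mu, sigma), closed_form)
```

Note that no contour integration is needed here: the numerical sum works directly on the real line, which is precisely why it is a check rather than a proof.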

Mean and variance

μ = E(X) = μ

σ² = Var(X) = E(X²) − E(X)² = σ² + μ² − μ² = σ²

Conclusion: To calculate the moments of a probability distribution, different methods can be considered, some of them considerably more difficult than others. The characteristic function is a complex function of a real variable, which requires theoretical justifications from complex analysis that we must be aware of.

[Ap] Mathematics

Remark 1m: The exponential function e^x changes faster than any monomial x^k, for any k.

Remark 2m: In complex analysis, there are frequently definitions and properties analogous to those of real analysis. Nevertheless, one must take care before applying them.

Remark 3m: Theoretically, quantities like proportions (sometimes expressed in per cent), rates, statistics, etc., are dimensionless. To interpret a numerical quantity, it is necessary to know the framework in which it is being used. For example, 0.98% and 0.98 %² are different. The second must be interpreted as √(0.98 %²) ≈ 0.99 %. Thus, to track how these quantities are transformed, the use of a symbol may be useful.

Remark 4m: In working with expressions (equations, inequations, sums, limits, integrals, etc.), special attention must be paid when 0 or ∞ appears. For example, even if two limits (series, integrals, etc.) do not exist, their summation (difference, quotient, product, etc.) may exist:

lim_{n→∞} n³ = ∞ and lim_{n→∞} n⁴ = ∞, but lim_{n→∞} n³/n⁴ = 0; or ∫₁^∞ (1/x) dx does not exist while ∫₁^∞ (1/x)·(1/x) dx does.

On the other hand, many paradoxes (e.g. Zeno's) are based on some wrong step (here, cancelling the common factor 0 or ∞):

0 = 0 ↔ 0·2 = 0·3 ↔ 2 = 3 and ∞ = ∞ ↔ ∞·2 = ∞·3 ↔ 2 = 3

Readers of advanced sections may want to check some theoretical details related to the following items (the very basic theory is not itemized).

Some Reminders

Real Analysis

For real functions of one or several real variables.

● Binomial Theorem. (x+y)ⁿ = Σ_{j=0}^{n} C(n,j) x^j y^{n−j} or, equivalently, (x+y)ⁿ = Σ_{j=0}^{n} C(n,j) x^{n−j} y^j

● Limits: infinitesimal and infinite quantities.

● Integration: methods (integration by substitution, integration by parts, etc.), Fubini's theorem, line integral.

● Series: convergence, criteria of convergence, radius of convergence, differentiability and integrability, Taylor series, representation of the exponential function, power series. Concretely, when the criterion of the quotient is applied to study the convergence, the radius of convergence is defined as:

lim_{m→∞} |a_{m+1}/a_m| = lim_{m→∞} |c_{m+1} x^{m+1}| / |c_m x^m| = |x| lim_{m→∞} |c_{m+1}|/|c_m| < 1 → |x| < lim_{m→∞} |c_m|/|c_{m+1}| = r

(Similarly for the criterion of the root.)


● Geometric Series. For 0 < b < 1, Σ_{j=0}^{+∞} a·b^j = a/(1−b) < ∞
(See, for example, http://en.wikipedia.org/wiki/Geometric_sequence)

● Arithmetico-Geometric Series.
(See, for example, http://en.wikipedia.org/wiki/Arithmetico-geometric_sequence)

● Ordinary differential equations.
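The geometric series reminder can be illustrated with a few partial sums. The snippet below is an illustration of mine (the values of a and b are arbitrary choices), showing how quickly the partial sums approach a/(1−b).

```python
# Partial sums of the geometric series a*b^j converge to a/(1-b) when 0<b<1.
def geometric_partial_sum(a, b, n_terms):
    return sum(a * b ** j for j in range(n_terms))

a, b = 3.0, 0.5
limit = a / (1 - b)                       # closed form: 6.0
approx = geometric_partial_sum(a, b, 60)  # 60 terms already agree to ~2^-60
print(approx, limit)
```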

Complex Analysis

For complex functions of one real variable.

● Limits: definitions and basic properties.

● Differentiation: definitions and basic properties.

● Integration: definitions and basic properties, antiderivatives and Barrow's rule.

For complex functions of one complex variable.

● Elementary functions: the complex exponential function.

● Limits: definitions and basic properties, infinitesimal and infinite quantities.

● Differentiation: definitions and basic properties, holomorphic (or analytic) functions.

● Integration: definitions and basic properties, antiderivatives and Barrow's rule, basic theorems (of an analytic function on a closed contour), integration by parts.

● Series: convergence and absolute convergence, criteria of convergence, radius of convergence, differentiability and integrability, Taylor series, representation of the exponential function.

Limits

Frequently, we need to deal with limits of sequences and functions. For sequences, any variable or index (say n) and the quantity of interest (say Q) can take values in a countable set of discrete positive values, even in multidimensional situations: the countable product of countable sets is a countable set. Calculations are easier when there is some monotony, since "the small steps determine the whole way," or some symmetry. For example, a summation or a product increases when any of its terms increases, or both do, while a difference or a quotient may increase or decrease depending on which term increases by one unit, since the two terms do not affect the total expression in the same direction.

Techniques

In calculating limits, we first try to mentally substitute the value of the variable in the sequence or function. This is frequently enough to solve the limit, although we can do some formal calculations afterwards (especially if we are not totally sure about the value). When the previous substitution leads to one of the following cases

∞−∞, ∞·0, ∞/∞, 0/0, 1^∞, ∞⁰ and 0⁰

we talk about indeterminate forms (we have not written the possible variations of signs or positions, e.g. 0·∞, −∞+∞, or −0/0). The value depends on the particular case, since one term can be "faster than the other" in tending to a value. There are different techniques to cope with these limits and to transform some indeterminate forms into others. (Notice that limits like 0−0 are not indeterminate forms, since |a−b| ≤ |a| + |b|.)

Limits in Statistics

Since sample sizes are positive integer numbers, in Statistics we frequently have to deal with limits of sequences.

One-Variable Limits: The variable n takes values in ℕ. For this variable, there is a unique natural way for n to tend to infinity: by increasing it one unit at a time. There is a total order in the set ℕ, which is countable. In Statistics, we are usually interested only in nondecreasing sequences of values of n, which can be seen as a sequence of schemes where more and more data are added.

Two-Variable Limits: A pair of values (nX, nY) can be seen as a point in ℕ×ℕ. There are infinitely many ways for nX and nY to tend to infinity by increasing either of them, or both, one unit at a time. There is not a total order in the product space ℕ×ℕ, though it is still a countable set. Again, in Statistics we are usually interested only in nondecreasing sequences of pairs (nX, nY), which can be seen as sequences of schemes where more and more data are added.

In this document, we have to work with easy limits or indeterminate forms like ∞/∞ involving polynomials. For the latter type of limit, we look at the terms with the highest exponents and multiply and divide the quotient by the proper monomial so as to identify the negligible terms, which formally can be seen as the use of infinites. We will also mention other techniques.

Technique Based on Paths

One-Variable Limits: Any possible sequence of values for the sample size, say n(k), can be seen as a subsequence of the most complete set ℕ of possible values, n(k) = k. We are especially interested in nondecreasing sequences of values, n(k) ≤ n(k+1).

The evaluation of any one-dimensional quantity at a subsequence, Q(n(k)), can be seen as a subsequence of Q(k). If this sequence converges, any such subsequence must converge. The opposite is not true, since we can find a nonconvergent Q(k) with a convergent subsequence Q(n(k)). The following result can be found in the literature.

Theorem. For a real function f of a real variable x, defined on ℝ̄ = ℝ ∪ {∞}, if a is an accumulation point the following conditions are equivalent:

(i) lim_{x→a} f(x) = L

(ii) For any sequence (in the domain) such that lim_{k→∞} x(k) = a, it holds that lim_{k→∞} f(x(k)) = L

A sequence is a particular case of a real function of a real variable, and ∞ is an accumulation point in ℝ̄.

Two-Variable Limits: Any possible sequence of values (nX(k), nY(k)) can be seen as a path s(k) in the most complete set ℕ×ℕ of possible values (k₁, k₂). Again, we are especially interested in nondecreasing sequences of values, nX(k) ≤ nX(k+1) and nY(k) ≤ nY(k+1).

The evaluation of any one-dimensional quantity at a path, Q(nX(k), nY(k)), can be seen as a subset of Q(k₁, k₂). The convergence of Q(nX(k), nY(k)) may depend on the path s(k). Nevertheless, for those cases where the subset Q(k₁, k₂) can be ordered to form a one-index convergent sequence, say Q(k), any subsequence Q(nX(k), nY(k)) must converge. The opposite is not true, since we can find a nonconvergent Q(k) with a convergent subsequence Q(nX(k), nY(k)). Notice that the set ℕ×ℕ is countable and hence can be "linearized", in the sense of being described by using one index only, and then the theorem above can be applied. The idea consists in proving the existence of the limit (part (i) of the theorem) by using the monotony and the properties of sequences, and calculating it by using a particularly appropriate sequence x(k) (part (ii) of the theorem).

I wrote this way to prove that the limit of Q = (nX+nY)/(nX·nY) does not depend on the path considered, and that the unique limit can be found by considering an especially appropriate path. It is possible to think of an underlying two-dimensional induction principle: when a statement that depends on the position (nX, nY) is still true when either of these variables increases by one unit, then the statement is true restricted to any one-step path (nX(k), nY(k)).

The previous nondecreasing s(k) are the only paths of interest in Statistics. Mathematically, on the contrary, for any path s(k) such that the sample sizes tend to infinity, the previous description in terms of steps could be used to prove that the leftward or downward steps must always be "compensated and outnumbered by far."

Finally, any sizes can be used for the steps of a sequence (nX(k), nY(k)), since it is always possible to complete them so as to obtain a path (mX(k), mY(k)) in terms of one-sized steps. Thus, if a limit is different for two of those sequences, the limits are also different for these paths.

Exercise 1m (*)

Prove that

(a) ∫_{−∞}^{+∞} e^{−x²} dx = √π  (b) ∫_{−∞}^{+∞} e^{−ax²} dx = √(π/a), a ∈ ℝ⁺  (c) ∫₀^{+∞} e^{−x²} dx = √π/2

Discussion: The integrand is a continuous function. We remember that e^{−x²} has no elementary antiderivative, but it is still possible to calculate definite integrals over some domains. As regards the limits of integration, the domain is infinite and we must deal with improper integrals.

(a) Finiteness: Firstly, we prove that the integral is finite, so as not to be working with the equality of two infinite quantities (something "really dangerous").

∫_{−∞}^{+∞} e^{−x²} dx = ∫_{|x|<1} e^{−x²} dx + ∫_{|x|≥1} e^{−x²} dx ≤ ∫_{−1}^{+1} 1 dx + 2∫_{+1}^{∞} e^{−x} dx = 2 + 2[−e^{−x}]_{x=1}^{∞} = 2 + 2e^{−1} < ∞

since

• If 0 ≤ |x| < 1 then 0 ≤ x² < 1 and e⁰ ≤ e^{x²} < e¹, and hence 1 = e⁰ ≥ e^{−x²} > e^{−1}.

• For an even function, the integral between −k and +k is twice the integral between 0 and +k.

• If x ≥ 1 then x² ≥ x and e^{x²} ≥ e^x, and hence e^{−x²} ≤ e^{−x}.

• For any two pairs of quantities, if a₁ ≤ b₁ and a₂ ≤ b₂ then a₁+a₂ ≤ b₁+b₂.

Temporary calculations in a two-dimensional space: Fubini's theorem of integration for improper integrals can be applied to do

I² = I·I = ∫_{−∞}^{+∞} e^{−x²} dx · ∫_{−∞}^{+∞} e^{−y²} dy = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} e^{−(x²+y²)} dx dy = ∫₀^{+∞} ∫₀^{2π} e^{−[ρ²cos(θ)² + ρ²sin(θ)²]} ρ dθ dρ

= ∫₀^{+∞} ∫₀^{2π} e^{−ρ²} ρ dθ dρ = 2π ∫₀^{+∞} e^{−ρ²} ρ dρ = 2π [e^{−ρ²}/(−2)]₀^{+∞} = π [e^{−ρ²}]_{∞}^{0} = π

where the Jacobian of the change of variables {x = ρcos(θ), y = ρsin(θ)} is

|J| = | ∂x/∂ρ  ∂x/∂θ ; ∂y/∂ρ  ∂y/∂θ | = | cos(θ)  −ρsin(θ) ; sin(θ)  ρcos(θ) | = ρcos(θ)² + ρsin(θ)² = ρ

Come back to a onedimensional space: Finally,

∫−∞

+∞

e−x2

dx=I=√I 2=√π
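The value √π can be confirmed numerically. The sketch below is my own check (the truncation window and step count are arbitrary choices); it approximates the integral of e^{−x²} over the real line with a midpoint rule.

```python
import math

# Midpoint-rule approximation of the Gaussian integral over [-w, +w];
# the tail beyond |x| = 10 contributes less than e^{-100}, i.e. nothing
# at double precision.
def gauss_integral(half_width=10.0, steps=200_000):
    dx = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * dx
        total += math.exp(-x * x) * dx
    return total

print(gauss_integral(), math.sqrt(math.pi))
```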

(b) Now, to prove that

∫_{−∞}^{+∞} e^{−ax²} dx = √(π/a)

we apply the change

√a·x = u → x = u/√a → dx = du/√a

which leads to

∫_{−∞}^{+∞} e^{−ax²} dx = ∫_{−∞}^{+∞} e^{−(√a·x)²} dx = (1/√a) ∫_{−∞}^{+∞} e^{−u²} du = (1/√a)·√π = √(π/a)

(c) On the other hand, since f(x) = e^{−x²} = e^{−(−x)²} = f(−x) is an even function,

∫₀^{+∞} e^{−x²} dx = (1/2) ∫_{−∞}^{+∞} e^{−x²} dx = √π/2

An alternative proof uses the gamma function Γ(p) = ∫₀^{+∞} e^{−x} x^{p−1} dx and the fact that Γ(1/2) = √π. By applying the change of variable x² = t, for t ≥ 0, which implies that x = √t and hence dx = dt/(2√t),

∫₀^{+∞} e^{−x²} dx = (1/2) ∫₀^{+∞} e^{−t} t^{−1/2} dt = (1/2) Γ(1/2) = √π/2.
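The fact Γ(1/2) = √π used in the alternative proof can be checked directly with the standard library (this check is mine, not part of the text).

```python
import math

# math.gamma evaluates the gamma function; Gamma(1/2) = sqrt(pi), so the
# half-line integral of e^{-x^2} equals (1/2) * Gamma(1/2) = sqrt(pi)/2.
half_line_integral = 0.5 * math.gamma(0.5)
print(math.gamma(0.5), math.sqrt(math.pi))
print(half_line_integral, math.sqrt(math.pi) / 2)
```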

Conclusion: To be allowed to apply the version of Fubini's theorem for improper integrals, the finiteness of the first integral has been proved beforehand. The integral of part (a) is used to calculate the others, respectively by applying a change of variables and by considering the even character of the integrand.

About the proof based on multiple integration: Proof by Siméon Denis Poisson (1781–1840), according to El omnipresente número π, Zhúkov, A.V., URSS. I had found this proof in many books, including the previous reference (for the integral in part (b) with a = 1/2). I have written the bound of the integral. About the proof based on the gamma function: I found this proof in Problemas de oposiciones: Matemáticas (Vol. 6), De Diego et al., Editorial Deimos. In this textbook, the integral in part (c) is solved using both approaches.


Exercise 2m

Study the following limits of sequences of one variable

(1) lim_{n→∞} (a_k n^k + a_{k−1} n^{k−1} + ⋯ + a₁n + a₀), where the a_j are constants

(2) lim_{n→∞} 1/(n+c), where c is a constant

(3) lim_{n→∞} (an+b)/(cn+d), where a, b, c and d are constants

(4) lim_{n→∞} (an^{k₁} + b(n))/(cn^{k₂} + d(n)), where a and c are constants and b(n) and d(n) are polynomials whose degrees are smaller than k₁ and k₂, respectively

(5) lim_{n→∞} (a/n + b/n²)/(c/n³), where a, b and c are constants

Discussion: We have to study several limits. Firstly, we try to substitute the value to which the variable tends in the expression of the quantity in the limit. If we are lucky, the value is found and the formal calculations are done later; if not, techniques to solve the indeterminate forms must be applied.

(1) lim_{n→∞} (a_k n^k + a_{k−1} n^{k−1} + ⋯ + a₁n + a₀), where the a_j are constants

Way 0: Intuitively, the term with the largest exponent leads the growth when n tends to infinity. Then,

lim_{n→∞} (a_k n^k + a_{k−1} n^{k−1} + ⋯ + a₁n + a₀) = { −∞ if a_k < 0;  +∞ if a_k > 0 }

Necessity: if lim_n |a_k n^k + a_{k−1} n^{k−1} + ⋯ + a₁n + a₀| = ∞, then n → ∞. If not, that is, if ∃M > 0 such that n < M < ∞, then

|a_k n^k + ⋯ + a₁n + a₀| ≤ |a_k|n^k + ⋯ + |a₁|n + |a₀| < |a_k|M^k + ⋯ + |a₁|M + |a₀| < ∞

and the limit could not be infinite.

(2) lim_{n→∞} 1/(n+c), where c is a constant

Way 0: Intuitively, the denominator tends to infinity while the numerator does not. (For huge n, the value of c is negligible.) Then, the limit is zero.

Way 1: Formally, we divide the numerator and the denominator (all their terms) by n:

lim_{n→∞} 1/(n+c) = lim_{n→∞} [(n^{−1}/n^{−1}) · 1/(n+c)] = lim_{n→∞} (1/n)/(1 + c/n) = 0

Way 2: By using infinites of the same order, we can substitute n + c by n:

lim_{n→∞} 1/(n+c) = lim_{n→∞} 1/n = 0

Necessity: if lim_n 1/(n+c) = 0, then n → ∞. If not, that is, if ∃M > 0 such that n < M < ∞, then 1/(n+c) > 1/(M+c) > 0 and the limit could not be zero.

(3) lim_{n→∞} (an+b)/(cn+d), where a, b, c and d are constants

Way 0: (This limit includes the previous one.) The quotient is an indeterminate form. Intuitively, the numerator increases like an and the denominator like cn. (The terms b and d are negligible for huge n.) Then, the quotient tends to a/c.

Way 1: Formally, we divide the numerator and the denominator (all their terms) by n:

lim_{n→∞} (an+b)/(cn+d) = lim_{n→∞} [(n^{−1}/n^{−1}) · (an+b)/(cn+d)] = lim_{n→∞} (a + b/n)/(c + d/n) = a/c

Way 2: By using infinites,

lim_{n→∞} (an+b)/(cn+d) = lim_{n→∞} (an)/(cn) = lim_{n→∞} a/c = a/c

Necessity: if lim_n (an+b)/(cn+d) = a/c, then n → ∞. If not, that is, if ∃M > 0 such that n < M < ∞, then

|(an+b)/(cn+d) − a/c| = |(acn + bc − acn − ad)/(c(cn+d))| ≥ |bc − ad| / (|c|(|c|M + |d|)) > 0

and the limit could not be a/c... unless the original quotient was always equal to this value. Notice that when the previous numerator is zero,

ad = bc ↔ a/c = λ = b/d ↔ {a = λc, b = λd} ↔ (an+b)/(cn+d) = λ(cn+d)/(cn+d) = λ = a/c ↔ (an+b)/(cn+d) − a/c = 0

that is, in this case the function is really a constant. In the initial statement, the condition |a b; c d| ≠ 0 (a nonzero determinant) could have been added for the polynomials an+b and cn+d to be independent.

(4) lim_{n→∞} (an^{k₁} + b(n))/(cn^{k₂} + d(n)), where a and c are constants and b(n) and d(n) are polynomials whose degrees are smaller than k₁ and k₂, respectively

Way 0: (This limit includes the two previous ones.) The quotient is an indeterminate form. Intuitively, the numerator increases like an^{k₁} and the denominator like cn^{k₂}, while b(n) and d(n) are negligible. Thus,

lim_{n→∞} (an^{k₁} + b(n))/(cn^{k₂} + d(n)) = { 0 if k₁ < k₂;  a/c if k₁ = k₂;  −∞ if k₁ > k₂ and a/c < 0;  +∞ if k₁ > k₂ and a/c > 0 }

Way 1: Formally, we divide the numerator and the denominator (all their terms) by the power of n with the highest degree among all the terms in the quotient (if there were products, we should imagine how the monomials would look). For example, for the case k₁ < k₂,

lim_{n→∞} (an^{k₁} + b(n))/(cn^{k₂} + d(n)) = lim_{n→∞} [(n^{−k₂}/n^{−k₂}) · (an^{k₁} + b(n))/(cn^{k₂} + d(n))] = lim_{n→∞} (an^{k₁−k₂} + b(n)/n^{k₂})/(c + d(n)/n^{k₂}) = 0

(Similarly for the other cases.)

Way 2: By using infinites, since b(n) and d(n) are negligible for huge n,

lim_{n→∞} (an^{k₁} + b(n))/(cn^{k₂} + d(n)) = lim_{n→∞} (an^{k₁})/(cn^{k₂}) = lim_{n→∞} (a/c) n^{k₁−k₂} = { 0 if k₁ < k₂;  a/c if k₁ = k₂;  −∞ if k₁ > k₂ and a/c < 0;  +∞ if k₁ > k₂ and a/c > 0 }

(5) lim_{n→∞} (a/n + b/n²)/(c/n³), where a, b and c are constants

Way 0: The quotient is an indeterminate form. Intuitively, the numerator decreases like a/n (the slowest term) and the denominator like c/n³, so the denominator becomes smaller and smaller with respect to the numerator; as a consequence, the limit is −∞ or +∞ depending on whether a/c is negative or positive, respectively.

Way 1: Formally, it is always possible to multiply or divide the numerator and the denominator (all their monomials, if they are summations, or any factor, if they are products) by the power of n with the appropriate exponent. Then we can do

lim_{n→∞} (a/n + b/n²)/(c/n³) = lim_{n→∞} [(n³/n³) · (a/n + b/n²)/(c/n³)] = lim_{n→∞} (an² + bn)/c = { −∞ if a/c < 0;  +∞ if a/c > 0 }

Way 2: By using infinitesimals,

lim_{n→∞} (a/n + b/n²)/(c/n³) = lim_{n→∞} (a/n)/(c/n³) = lim_{n→∞} (a/c)(n³/n) = lim_{n→∞} (a/c) n² = { −∞ if a/c < 0;  +∞ if a/c > 0 }

Conclusion: We have studied the proposed limits. Some of them were almost trivial, while others involved indeterminate forms like 0/0 or ∞/∞. All the cases were quotients of polynomials, so the limits of the former form have been transformed into limits of the latter form. To solve these cases, the technique of multiplying and dividing by the same quantity has sufficed (there are other techniques, e.g. L'Hôpital's rule).
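The case analysis of limit (4) can also be observed numerically. The sketch below is my own illustration (the coefficients of the polynomials are arbitrary choices of mine): for large n, the quotient behaves like (a/c)·n^{k₁−k₂}, so it vanishes, stabilizes at a/c, or diverges according to the degrees.

```python
# Quotient of two polynomials, a*n^k1 + 3*n^(k1-1) over c*n^k2 + 5*n^(k2-1);
# the lower-degree terms (degrees k1-1 and k2-1) are the negligible b(n), d(n).
def q(n, k1, k2, a=2.0, c=4.0):
    return (a * n ** k1 + 3 * n ** (k1 - 1)) / (c * n ** k2 + 5 * n ** (k2 - 1))

n = 10 ** 6
print(q(n, 2, 3))   # k1 < k2: close to 0
print(q(n, 3, 3))   # k1 = k2: close to a/c = 0.5
print(q(n, 4, 3))   # k1 > k2, a/c > 0: very large
```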

Additional examples

lim_{n→∞} 1/(n−1) = 0, or lim_{n→∞} 1/(n−1) = lim_{n→∞} 1/n = 0

lim_{n→∞} (2/n − 1/n²) = 0

lim_{n→∞} (2n − n²) = lim_{n→∞} [n(2−n)] = −∞, or lim_{n→∞} (2n − n²) = lim_{n→∞} (−n²) = −∞

lim_{n→∞} (n−1)/n = lim_{n→∞} (1 − 1/n)/1 = 1, or lim_{n→∞} (n−1)/n = lim_{n→∞} n/n = 1

lim_{n→∞} n/(n−2) = lim_{n→∞} 1/(1 − 2/n) = 1, or lim_{n→∞} n/(n−2) = lim_{n→∞} n/n = 1

lim_{n→∞} (n−1)/(n−3) = lim_{n→∞} (1 − 1/n)/(1 − 3/n) = 1, or lim_{n→∞} (n−1)/(n−3) = lim_{n→∞} n/n = 1

Exercise 3m (*)

Study the following limits of sequences of two variables

(1) lim_{nX→∞, nY→∞} (nX + nY) and lim_{nX→∞, nY→∞} (nX − nY)

(2) lim_{nX→∞, nY→∞} (nX·nY) and lim_{nX→∞, nY→∞} nX/nY

(3) lim_{nX→∞, nY→∞} (nX·nY)/nX and lim_{nX→∞, nY→∞} nX/(nX·nY)

(4) lim_{nX→∞, nY→∞} (nX+a)(nY+b)/(nX+c) and lim_{nX→∞, nY→∞} (nX+a)/[(nX+b)(nY+c)], where a, b and c are constants

(5) lim_{nX→∞, nY→∞} [(1/nX)(1/nY)]/(1/nX) and lim_{nX→∞, nY→∞} (1/nX)/[(1/nX)(1/nY)]

(6) lim_{nX→∞, nY→∞} (nX+nY)/nX and lim_{nX→∞, nY→∞} nX/(nX+nY)

(7) lim_{nX→∞, nY→∞} [1/(nX+a)]·[1/(nY+b)]/[1/(nX+c)] and lim_{nX→∞, nY→∞} [1/(nX+a)]/{[1/(nX+b)]·[1/(nY+c)]}

(8) lim_{nX→∞, nY→∞} (1/nX + 1/nY)/(1/nX) and lim_{nX→∞, nY→∞} (1/nX)/(1/nX + 1/nY)

(9) lim_{nX→∞, nY→∞} (nX+nY)/(nX·nY) and lim_{nX→∞, nY→∞} (nX·nY)/(nX+nY)

(10) lim_{nX→∞, nY→∞} (nX−nY)/(nX·nY) and lim_{nX→∞, nY→∞} (nX·nY)/(nX−nY)

Discussion: We have to study several limits of two-variable sequences. Firstly, we try to substitute the values to which the variables tend in the expression of the quantity in the limit. If we are lucky, the value is found and the formal calculations are done later; if not, techniques to solve the indeterminate forms must be applied. These limits may be considerably more difficult than those for one variable, since we need to prove that the value does not depend on the particular way in which the sample sizes tend to infinity (if the limit exists or is infinite), or find two ways such that different values are obtained (the limit does not exist).

(1) lim_{nX→∞, nY→∞} (nX + nY) and lim_{nX→∞, nY→∞} (nX − nY)

Way 0: Intuitively, the first limit is infinite while the second does not exist, since it depends on which variable increases faster.

Way 1: For the first limit to be infinite, it is necessary and sufficient that one variable tends to infinity, say nX:

lim_{nX→∞, nY→∞} (nX + nY) > lim_{nX→∞} nX = ∞

For the necessity, if nX < M < +∞ and nY < M < +∞, then

lim_{nX→∞, nY→∞} (nX + nY) < 2M < ∞

and the limit could not be infinite. To see that lim_{nX→∞, nY→∞} (nX − nY) does not exist, it is enough to see that different values are obtained for different paths, s₁(k) = (k², k) and s₂(k) = (k, k):

lim_{s₁(k)} (nX − nY) = lim_{k→∞} (k² − k) = +∞ and lim_{s₂(k)} (nX − nY) = lim_{k→∞} (k − k) = 0

(2) lim_{nX→∞, nY→∞} (nX·nY) and lim_{nX→∞, nY→∞} nX/nY

Way 0: Intuitively, the first limit is infinite while the second does not exist, since it depends on which variable increases faster.

Way 1: For the first limit to be infinite, it is necessary and sufficient that one variable tends to infinity, say nX:

lim_{nX→∞, nY→∞} (nX·nY) > lim_{nX→∞} nX = ∞

For the necessity, if nX < M < +∞ and nY < M < +∞, then

lim_{nX→∞, nY→∞} (nX·nY) < M² < ∞

and the limit could not be infinite. To see that lim_{nX→∞, nY→∞} nX/nY does not exist, it is enough to see that different values are obtained for different paths, s₁(k) = (k², k) and s₂(k) = (k, k):

lim_{s₁(k)} nX/nY = lim_{k→∞} k²/k = ∞ and lim_{s₂(k)} nX/nY = lim_{k→∞} k/k = 1
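The path-dependence argument above can be made concrete numerically. This sketch is my own illustration (the path functions mirror s₁ and s₂ from the text; the values of k are arbitrary): along s₁(k) = (k², k) the quotient nX/nY diverges, while along s₂(k) = (k, k) it is constantly 1.

```python
# Evaluate the two-variable quantity nX/nY along a path k -> (nX(k), nY(k)).
def along_path(path, k):
    nx, ny = path(k)
    return nx / ny

s1 = lambda k: (k * k, k)   # nX grows faster than nY
s2 = lambda k: (k, k)       # nX and nY grow together

for k in (10, 100, 1000):
    print(k, along_path(s1, k), along_path(s2, k))
```

Since the two paths give different limits, the two-variable limit of nX/nY does not exist, exactly as argued above.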

(3) lim_{nX→∞, nY→∞} (nX·nY)/nX and lim_{nX→∞, nY→∞} nX/(nX·nY)

Way 0: Even if the expression can be simplified, we use this case to show that a product of increasing terms increases faster than any of its factors, and that the new rate is the product of the two rates (the exponents are added). The quotient is an indeterminate form. The first limit seems infinite and the second zero.

Way 1: Formally, we simplify the quotients:

lim_{nX→∞, nY→∞} (nX·nY)/nX = lim_{nY→∞} nY = ∞ and lim_{nX→∞, nY→∞} nX/(nX·nY) = lim_{nY→∞} 1/nY = 0

A product of increasing terms that are bigger than one increases faster than any of its factors. The second limit can also be seen as the inverse of the first. The sufficiency and the necessity in these limits are determined by the behaviour of nY: the first limit is infinite and the second is zero if and only if nY tends to infinity.

(4) limn X→∞

nY→∞

(nX+a)(nY+b)nX+c

and limn X→∞

nY→∞

nX+a

(nX+b)(nY+c )where a, b and c are constants

Way 0: The quotient in an indeterminate form. Intuitively, the product of increasing terms increases faster thanany of its terms, and the new rate is the product of the two rates (the exponents are added). The constants arenegligible when they are added to or substracted from a power. The first limit seems infinite and the secondzero.

Way 1: Formally, we multiply the numerator and the denominator (all their monomials, if they are summation,or any element, if they are products) by the product of the powers of nX and nY with the highest exponents

201 Solved Exercises and Problems of Statistical Inference

Page 205: Solved Exercises and Problems of Statistical Inferencecasado-d.org/edu/ExercisesProblemsStatisticalInference.pdf · Solved Exercises and Problems of Statistical Inference ... Mathematics

limn X→∞

nY →∞

(nX+a)(nY+b)nX+c

=limnX→∞

nY→∞[nX

−1nY−1

nX−1nY

−1

(nX+a)(nY+b)nX+c ]=limn X→∞

nY →∞

(1+ anX

)(1+ bnY)

1nY

+c

nXnY

=∞

Way 2: By using infinites,

lim_{nX→∞, nY→∞} (nX+a)(nY+b)/(nX+c) = lim_{nX→∞, nY→∞} (nX·nY)/nX = lim_{nY→∞} nY = ∞

The second limit can also be seen as the inverse of the first, by changing the letters of the constants, so we do not repeat the calculations. The sufficiency and the necessity in these limits are determined by nY: the first limit is infinite and the second is zero if and only if nY tends to infinity.

(5)  lim_{nX→∞, nY→∞} [(1/nX)·(1/nY)] / (1/nX)  and  lim_{nX→∞, nY→∞} (1/nX) / [(1/nX)·(1/nY)]

Way 0: Even if the expression can be simplified, we use this case to show that the product of decreasing terms decreases faster than any of its terms, and the new rate is the product of the two rates (the exponents are added). The quotient is in an indeterminate form. The first limit seems zero and the second infinite.

Way 1: Formally, we simplify the quotient

lim_{nX→∞, nY→∞} [(1/nX)·(1/nY)] / (1/nX) = lim_{nY→∞} 1/nY = 0   and   lim_{nX→∞, nY→∞} (1/nX) / [(1/nX)·(1/nY)] = lim_{nY→∞} nY = ∞

A product of decreasing terms that are smaller than one decreases faster than any of its terms. The second limit can also be seen as the inverse of the first. The sufficiency and the necessity in these limits are determined by the behaviour of nY: the first limit is zero and the second is infinite if and only if nY tends to infinity.

(6)  lim_{nX→∞, nY→∞} (nX+nY)/nX  and  lim_{nX→∞, nY→∞} nX/(nX+nY)

The quotient is an indeterminate form. Since we can write

lim_{nX→∞, nY→∞} (nX+nY)/nX = lim_{nX→∞, nY→∞} (1 + nY/nX) = ?   and   lim_{nX→∞, nY→∞} nX/(nX+nY) = lim_{nX→∞, nY→∞} 1/(1 + nY/nX) = ?

and we have seen that the limits of the new quotients do not exist, it seems that none of the limits exists. Formally, we could consider the same paths as we considered there. The second limit can also be seen as the inverse of the first.
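A quick numeric illustration of this path dependence (a sketch added for illustration, in Python rather than the R used elsewhere in this document; the paths chosen are arbitrary): the quotient of case (6) takes different limiting values along different paths, so the double limit cannot exist.

```python
# Sketch: the quotient (nX + nY)/nX from case (6) along two paths.
def q(nx, ny):
    return (nx + ny) / nx

# Along nX = nY = k the quotient is constantly 2 ...
path_equal = [q(k, k) for k in (10, 100, 1000)]
# ... while along nX = k, nY = k**2 it grows without bound.
path_square = [q(k, k**2) for k in (10, 100, 1000)]

assert all(abs(v - 2) < 1e-12 for v in path_equal)
assert path_square[-1] > 1000
print(path_equal, path_square)
```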

(7)  lim_{nX→∞, nY→∞} [1/(nX+a)]·[1/(nY+b)] / [1/(nX+c)]  and  lim_{nX→∞, nY→∞} [1/(nX+a)] / {[1/(nX+b)]·[1/(nY+c)]}


Way 0: The constants are negligible and these limits are like the previous ones, namely: the first limit seems zero and the second infinite.

Way 1: Formally, we multiply the numerator and the denominator (all their monomials, if they are summations, or any factor, if they are products) by the product of the powers of nX and nY with the highest exponents

lim_{nX→∞, nY→∞} [1/(nX+a)]·[1/(nY+b)] / [1/(nX+c)] = lim_{nX→∞, nY→∞} (nX+c) / [(nX+a)(nY+b)] = 0

Way 2: By using equivalent infinities,

lim_{nX→∞, nY→∞} [1/(nX+a)]·[1/(nY+b)] / [1/(nX+c)] = lim_{nX→∞, nY→∞} [(1/nX)·(1/nY)] / (1/nX) = lim_{nY→∞} 1/nY = 0

The second limit can also be seen as the inverse of the first, by changing the letters of the constants, so we do not repeat the calculations. As regards the sufficiency and the necessity in these limits, they are determined by the behaviour of nY: the first limit is zero and the second is infinite if and only if nY tends to infinity.

(8)  lim_{nX→∞, nY→∞} (1/nX + 1/nY) / (1/nX)  and  lim_{nX→∞, nY→∞} (1/nX) / (1/nX + 1/nY)

Way 0: The quotient is in an indeterminate form. Intuitively, any sum of decreasing terms decreases like the slowest one while the other becomes negligible. Thus, the first limit would be one if the fastest is nY, infinite if the fastest is nX; and, if both are equal, the limits are two and one over two, respectively. In short, it seems this limit does not exist.

Way 1: Formally, we can do

lim_{nX→∞, nY→∞} (1/nX + 1/nY) / (1/nX) = lim_{nX→∞, nY→∞} (1 + nX/nY) = ?

lim_{nX→∞, nY→∞} (1/nX) / (1/nX + 1/nY) = lim_{nX→∞, nY→∞} [ (nX·nY)/(nX·nY) · (1/nX) / (1/nX + 1/nY) ] = lim_{nX→∞, nY→∞} nY/(nY + nX) = ?

The second limit can also be seen as the inverse of the first.

(9)  lim_{nX→∞, nY→∞} (nX+nY)/(nX·nY)  and  lim_{nX→∞, nY→∞} (nX·nY)/(nX+nY)

The limit appears in the variance of the estimators of σX²/σY². We solve it in two simple ways, although other ways are considered as an “intellectual exercise.”


Way 0: Intuitively, the product changes faster than the summation. Then, the first limit seems zero and thesecond infinite.

Way 1: Formally, we can do

lim_{nX→∞, nY→∞} (nX+nY)/(nX·nY) = lim_{nX→∞, nY→∞} nX/(nX·nY) + lim_{nX→∞, nY→∞} nY/(nX·nY) = lim_{nY→∞} 1/nY + lim_{nX→∞} 1/nX = 0

lim_{nX→∞, nY→∞} (nX·nY)/(nX+nY) = lim_{nX→∞, nY→∞} 1 / [(nX+nY)/(nX·nY)] = ∞

It is sufficient and necessary that both variables tend to infinity. For the necessity, if nX < M < +∞, then

(nX+nY)/(nX·nY) = 1/nY + 1/nX > 1/M > 0

and the limit could not be zero.

Way 2: Firstly, let us suppose, without loss of generality, that nX ≤ nY. Then

0 ≤ lim_{nX→∞, nY→∞} [(nX+nY)/(nX·nY)] ≤ lim_{nX→∞, nY→∞} [2·nY/(nX·nY)] = lim_{nX→∞} 2/nX = 0

(nY has been dropped from the numerator and the denominator; it is not that an iterated limit is being calculated). Nonetheless, this solution does not consider those paths (for the sample sizes) that cross the bisector line, that is, when none of the sizes is uniformly behind the other. To complete the proof it is enough to use again the symmetry of the expression with respect to the two variables (it is the same if we switch them): for any sequence of values for (nX, nY) crossing the bisector line, an equivalent sequence—in the sense that the sequence takes the same values—either above or below the bisector line can be considered by looking at the bisector line as a mirror or a barrier.

Way 3: Polar coordinates can also be used to study that limit. For any sequence s(k)=(nX(k),nY(k)),

{ nX(k) = ρ(k)·cos[α(k)],  nY(k) = ρ(k)·sin[α(k)] },  with 0 < ρ(k) < ∞ and 0 < α(k) < π/2

{ ρ(k) = √(nX(k)² + nY(k)²),  α(k) = arctan(nY(k)/nX(k)) }

A mathematical characterization of a sequence s(k) corresponding to sample sizes that tend to infinity can be ρ(k) → ∞ in such a way that even when cos[α(k)] → 0 or sin[α(k)] → 0 the products nX(k) = ρ(k)·cos[α(k)] and nY(k) = ρ(k)·sin[α(k)] still tend to infinity. Then, the limit is calculated as follows

lim_{k→∞} ρ(k)·{cos[α(k)] + sin[α(k)]} / {ρ(k)²·cos[α(k)]·sin[α(k)]} ≤ lim_{k→∞} 2 / {ρ(k)·cos[α(k)]·sin[α(k)]} = 0

The only cases that could cause trouble would be those for which either the cosine or the sine tends to zero (the other tends to one). Nevertheless, the characterization above shows that the denominator would still tend to infinity. Finally, as regards the necessity, let us suppose, without loss of generality, that nX ≤ M < ∞. Then, since ρ(k) → ∞ it must be that cos[α(k)] → 0 in such a way that nX(k) = ρ(k)·cos[α(k)] ≤ M. As a consequence,

lim_{k→∞} ρ(k)·{cos[α(k)] + sin[α(k)]} / {ρ(k)²·cos[α(k)]·sin[α(k)]} ≥ lim_{k→∞} {cos[α(k)] + sin[α(k)]} / {M·sin[α(k)]} = (0+1)/M = 1/M > 0


Way 4: Intuitively:

(a) The mean square error—and the sequence in the limit—should monotonically decrease with the sample sizes.
(b) We are working with nonnegative quantities—there is a lower bound.
(c) It is a well-known result that a nonincreasing, bounded sequence always converges.
(d) The limit of a sequence, when it exists, is unique. As a consequence, it can be calculated by using any subsequence—concretely, an appropriate simple one. (The opposite is not true: that one subsequence converges does not imply that the whole sequence converges.)

First, when nX increases by one unit the sequence decreases:

(nX+nY)/(nX·nY) >? (nX+1+nY)/[(nX+1)·nY]  →  nX²·nY + nX·nY + nX·nY² + nY² >? nX²·nY + nX·nY + nX·nY²  →  nY² >? 0  →  Yes

Since the expression of the sequence is symmetric, the same inequality is true when nY increases by one unit. Finally, the case when both sizes increase by one unit can always be decomposed into two of the previous steps, while the quantity

Q(nX, nY) = (nX+nY)/(nX·nY)

depends only on the position, not on the way to arrive at it; thus, the sequence decreases in this case too. Second, Q can take values in a discrete set that can sequentially be constructed and ordered to form a sequence that is strictly decreasing and bounded, say Q(k). (The set ℕ×ℕ is countable.) The symmetry implies that the increase of Q can take only two values—not three—when any sample size or both increase by one unit. In short, Q(k) converges, though we need not build it. Third, any path s(k) such that the sample sizes are nondecreasing and tend to infinity can be written in terms of one-unit rightward and upward steps, with an infinite amount of each type. For each path s(k), the quantity

Qs(k) = [nX(k) + nY(k)] / [nX(k)·nY(k)]

can be seen as a subsequence of Q(k). Finally, the limit of Q is unique and the case nX = n = nY indicates that it is zero:

lim_{k→∞} [n(k) + n(k)] / [n(k)·n(k)] = lim_{k→∞} 2/n(k) = 0

For the necessity for both sample sizes to tend to infinity, let us suppose, without loss of generality, that nX ≤ M < ∞. There would be a subsequence that cannot tend to zero:

lim_{k→∞} [nX(k) + nY(k)] / [nX(k)·nY(k)] ≥ lim_{k→∞} [nX(k) + nY(k)] / [M·nY(k)] ≥ 1/M > 0

whatever the behaviour of nX(k). The previous nondecreasing s(k) are the only paths of interest in Statistics.

(10)  lim_{nX→∞, nY→∞} (nX−nY)/(nX·nY)  and  lim_{nX→∞, nY→∞} (nX·nY)/(nX−nY)

Way 0: Intuitively, the limit of the difference does not exist, since it takes different values that depend on the path; but the difference—or the summation, in the previous section—is so much smaller than the product that the first limit seems zero while the second seems infinite. Formally, we can do calculations as for the previous limit, for example

lim_{nX→∞, nY→∞} (nX−nY)/(nX·nY) = lim_{nX→∞, nY→∞} nX/(nX·nY) − lim_{nX→∞, nY→∞} nY/(nX·nY) = lim_{nY→∞} 1/nY − lim_{nX→∞} 1/nX = 0 − 0 = 0

or, alternatively, use the bound:


| lim_{nX→∞, nY→∞} (nX−nY)/(nX·nY) | ≤ lim_{nX→∞, nY→∞} |nX−nY|/(nX·nY) ≤ lim_{nX→∞, nY→∞} (nX+nY)/(nX·nY) = 0

Conclusion: We have studied the limits proposed. Some of them were almost trivial, while others involved indeterminate forms like 0/0 or ∞/∞. All the cases were quotients of polynomials, so the limits of the former form have been transformed into limits of the latter form. To solve these cases, the technique of multiplying and dividing by the same quantity has sufficed (there are other techniques, e.g. L'Hôpital's rule). Other techniques have been applied too.
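The behaviour of limit (9) can also be checked numerically (a sketch added for illustration, in Python rather than the R used elsewhere in this document; the paths are arbitrary choices): the quotient tends to zero along paths where both sizes grow, and stays bounded away from zero when one size is bounded.

```python
# Sketch: the quotient (nX + nY)/(nX * nY) from case (9) along several paths.
def q(nx, ny):
    return (nx + ny) / (nx * ny)

# Diagonal path nX = nY = k: the quotient is 2/k -> 0, and it decreases.
diagonal = [q(k, k) for k in (10, 100, 1000, 10000)]
assert all(d2 < d1 for d1, d2 in zip(diagonal, diagonal[1:]))
assert diagonal[-1] < 1e-3

# Unbalanced path nX = k, nY = k**2: still tends to zero.
unbalanced = [q(k, k**2) for k in (10, 100, 1000)]
assert unbalanced[-1] < 1e-2

# Bounded path nX = M = 5: the quotient stays above 1/M = 0.2, so it cannot
# tend to zero -- the necessity of both sizes tending to infinity.
bounded = [q(5, k) for k in (10, 100, 1000, 10000)]
assert all(b > 0.2 for b in bounded)
print(diagonal[-1], bounded[-1])
```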

Additional Examples: Several limits have been solved in the exercises—look for limit in the final index.

Exercise 4m (*)

For two positive integers nX and nY, find the (discrete) frontier and the two regions determined by the equality

2·(nX+nY) = (nX−nY)²

Discussion: Both sides of the expression are symmetric with respect to the variables, meaning that they are the same if the two variables are switched. This implies that the frontier we are looking for is symmetric with respect to the bisector line. The square suggests a parabolic curve, while

2·(nX+nY) = (nX−nY)²  ↔  2·(1 + nX·nY) = (nX−1)² + (nY−1)²

suggests a sort of transformation of a conic curve.

Intuitively, in the region around the bisector line the difference of the variables is small and therefore the right-hand side of the original equality is smaller than the left-hand side; obviously, the other region is at the other side of the (discrete) frontier.

Purely computational approach: In a previous exercise we wrote some “force-based” lines for the computer to plot the points in the frontier. Here we use the same code to plot the inner region (see the figures below)

N = 100
vectorNx = vector(mode="numeric", length=0)
vectorNy = vector(mode="numeric", length=0)
for (nx in 1:N) {
  for (ny in 1:N) {
    if (2*(nx+ny) >= (nx-ny)^2) {
      vectorNx = c(vectorNx, nx)
      vectorNy = c(vectorNy, ny)
    }
  }
}
plot(vectorNx, vectorNy, xlim = c(0,N+1), ylim = c(0,N+1), xlab='nx', ylab='ny', main=paste('Regions'), type='p')

Algebraical-computational approach: Before using the computer, we can do some algebraical work

nX² + nY² − 2·nX·nY = 2·nX + 2·nY  ↔  nY² − 2·(nX+1)·nY + nX·(nX−2) = 0

nY = { 2·(nX+1) ± √[4·(nX+1)² − 4·nX·(nX−2)] } / 2 = { 2·(nX+1) ± 2·√(nX² + 2·nX + 1 − nX² + 2·nX) } / 2 = (nX+1) ± √(4·nX+1)
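The two branches just obtained can be verified numerically (a sketch added for illustration, in Python rather than the R used elsewhere in this document):

```python
# Sketch: check that nY = (nX + 1) +/- sqrt(4 nX + 1) really solves
# 2 (nX + nY) = (nX - nY)^2, for a range of nX values.
import math

for nx in range(1, 200):
    root = math.sqrt(4 * nx + 1)
    for ny in ((nx + 1) + root, (nx + 1) - root):
        lhs = 2 * (nx + ny)
        rhs = (nx - ny) ** 2
        assert abs(lhs - rhs) < 1e-6, (nx, ny)
print("both branches satisfy the equality")
```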


The following code plots the two branches of the frontier (see the figures above)

N = 100
vectorNx = seq(1,N)
vectorNyPos = (vectorNx+1)+sqrt(4*vectorNx+1)
vectorNyNeg = (vectorNx+1)-sqrt(4*vectorNx+1)
integerSolutions = (vectorNyPos/round(vectorNyPos) == 1)
yL = c(0, max(vectorNyPos[integerSolutions], vectorNyNeg[integerSolutions]))
plot(vectorNx[integerSolutions], vectorNyPos[integerSolutions], xlim = c(0,N+1), ylim = yL, xlab='nx', ylab='ny', main=paste('Frontier'), type='p')
points(vectorNx[integerSolutions], vectorNyNeg[integerSolutions])

Algebraical, analytical and geometrical approach: The change of variables

C1(nX, nY) = (nX−nY, nX+nY) = (u, v)

is a linear transformation. The new frontier can be written as the parabolic curve v = u²/2. The computer allows plotting this frontier in the U-V plane.

N = 50
vectorU = seq(-50, +50)
vectorV = 0.5*vectorU^2
plot(vectorU, vectorV, xlim = c(-N-1,+N+1), ylim = c(0,max(vectorV)), xlab='u', ylab='v', main=paste('Frontier'), type='p')

How should the change of variables be interpreted? If we write

(u, v)ᵀ = [ 1  −1 ; 1  1 ]·(nX, nY)ᵀ

the previous matrix reminds us of a rotation in the plane (although movements have orthonormal matrices and the previous one is only orthogonal). Let us have a look at how a triangle—a rigid polygon—is transformed,


P1 = (1, 2)  →  C1(P1) = (−1, 3)
P2 = (1, 1)  →  C1(P2) = (0, 2)
P3 = (2, 1)  →  C1(P3) = (1, 3)

To confirm that C1 is a rotation plus a dilatation (homothetic transformation), or vice versa, we consider the distances between points, the linearity, and a rotation of the axes. First, if

A = (a1, a2) → Ã = (a1−a2, a1+a2)        B = (b1, b2) → B̃ = (b1−b2, b1+b2)

then

d_{u,v}(Ã, B̃) = √{ [(b1−b2)−(a1−a2)]² + [(b1+b2)−(a1+a2)]² } = √{ [(b1−a1)−(b2−a2)]² + [(b1−a1)+(b2−a2)]² }
= √{ 2·(b1−a1)² + 2·(b2−a2)² } = √2·√{ (b1−a1)² + (b2−a2)² } = √2·d_{nX,nY}(A, B)

This means that the previous change of variable is not an isometry; therefore it cannot be considered a movement in the plane, technically. Nonetheless, the previous lines show that the linear transformation

C2(nX, nY) = (1/√2)·(nX−nY, nX+nY) = (u, v),

respects the distances, so it is an isometry whose matrix is orthonormal—that is, it is a movement. Now, the frontier is

2·(nX+nY) = (nX−nY)²  ↔  (1/√2)·(nX+nY) = (1/√2)·[(1/√2)·(nX−nY)]²  ↔  v = (1/√2)·u²

C2 can be written as

(u, v)ᵀ = (1/√2)·[ 1  −1 ; 1  1 ]·(nX, nY)ᵀ

which is the expression of a rotation in the plane (see the literature on Linear Algebra). Second, the linearity implies that both C1 and C2 transform lines into lines. The expression

A + λ·AB = (a1, a2) + λ·(b1−a1, b2−a2) = (λ·b1 + (1−λ)·a1, λ·b2 + (1−λ)·a2)

determines the line containing A and B if λ ∈ ℝ and the segment from A to B if λ ∈ [0,1]. It is transformed as follows

C1(λ·b1 + (1−λ)·a1, λ·b2 + (1−λ)·a2) = (λ·b1 + (1−λ)·a1 − λ·b2 − (1−λ)·a2, λ·b1 + (1−λ)·a1 + λ·b2 + (1−λ)·a2)
= (λ·(b1−b2) + (1−λ)·(a1−a2), λ·(b1+b2) + (1−λ)·(a1+a2)) = λ·C1(b1, b2) + (1−λ)·C1(a1, a2)

(similarly for C2). This expression determines the line containing C1(A) and C1(B) if λ ∈ ℝ and the segment from C1(A) to C1(B) if λ ∈ [0,1]. Third, as regards the rotation of axes, the following figure and formulas are general

{ e1 = cos α·ẽ1 + sin α·ẽ2 ;  e2 = −sin α·ẽ1 + cos α·ẽ2 }   (Rotation sinistrorsum)

{ e1 = cos α·ẽ1 − sin α·ẽ2 ;  e2 = sin α·ẽ1 + cos α·ẽ2 }   (Rotation dextrorsum)


When the axes are rotated in one direction, it can be thought that the points are rotated in the opposite one. Now, C2 can be written as a 45º dextrorsum rotation of the axes

{ e1 = cos(π/4)·ẽ1 − sin(π/4)·ẽ2 ;  e2 = sin(π/4)·ẽ1 + cos(π/4)·ẽ2 }

(e1, e2)ᵀ = [ cos(π/4)  −sin(π/4) ; sin(π/4)  cos(π/4) ]·(ẽ1, ẽ2)ᵀ = [ 1/√2  −1/√2 ; 1/√2  1/√2 ]·(ẽ1, ẽ2)ᵀ = (1/√2)·[ 1  −1 ; 1  1 ]·(ẽ1, ẽ2)ᵀ

Any point P=(x , y ) is transformed through

(1/√2)·[ 1  −1 ; 1  1 ]·(x, y)ᵀ = (1/√2)·(x−y, x+y)ᵀ = (u, v)ᵀ.

The matrix M = (1/√2)·[ 1  −1 ; 1  1 ] is orthogonal, which means that M·Mᵀ = I = Mᵀ·M and implies that M⁻¹ = Mᵀ. Then,

(1/√2)·[ 1  1 ; −1  1 ]·(e1, e2)ᵀ = (ẽ1, ẽ2)ᵀ.
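The two key facts used above can be confirmed numerically (a sketch added for illustration, in Python rather than the R used elsewhere in this document): the matrix M = (1/√2)·[1 −1; 1 1] is orthonormal, while C1, whose matrix lacks the 1/√2 factor, scales every distance by √2.

```python
# Sketch: check M M^T = I (so M^-1 = M^T) and the sqrt(2) distance scaling of C1.
import math

s = 1 / math.sqrt(2)
M = [[s, -s], [s, s]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

I = matmul(M, transpose(M))
assert all(abs(I[i][j] - (1 if i == j else 0)) < 1e-12
           for i in range(2) for j in range(2))

# C1(x, y) = (x - y, x + y) multiplies distances by sqrt(2):
a, b = (1.0, 2.0), (4.0, -1.0)
d_before = math.hypot(b[0] - a[0], b[1] - a[1])
c1 = lambda p: (p[0] - p[1], p[0] + p[1])
ca, cb = c1(a), c1(b)
d_after = math.hypot(cb[0] - ca[0], cb[1] - ca[1])
assert abs(d_after - math.sqrt(2) * d_before) < 1e-12
print("M is orthonormal; C1 scales distances by sqrt(2)")
```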

Conclusion: We have applied different approaches to study the frontier and the two regions determined by the given equality. Fortunately, nowadays the computer allows us to do this work even without any deeper theoretical study—change of variable, transformation, et cetera.


References

Remark 1r: When an exercise is based on another from a book, the reference has been included below the statement; some statements may have been taken from official exams. I have written the entire solutions. The slides mentioned in the prologue contain references on theory. For some specific theoretical details, some literature is referred to in the proper section of this document.

[1] The R Project for Statistical Computing, http://www.r-project.org/

[2] Wikipedia, http://en.wikipedia.org/


Tables of Statistics

Basic Measures

μ = E(X) = ∑_Ω xi·f(xi)   (Discrete)        μ = E(X) = ∫_Ω x·f(x) dx   (Continuous)

σ² = Var(X) = E([X−μ]²) = ⋯ = E(X²) − μ²

Basic Estimators

X̄ = (1/n)·∑_{i=1}^n Xi        s² = (1/n)·∑_{i=1}^n (Xi − X̄)² = ⋯ = (1/n)·∑_{i=1}^n Xi² − X̄²

S² = [1/(n−1)]·∑_{i=1}^n (Xi − X̄)²        n·s² = (n−1)·S²

Sp² = (nX·sX² + nY·sY²)/(nX + nY − 2) = [(nX−1)·SX² + (nY−1)·SY²]/(nX + nY − 2)

V² = (1/n)·∑_{i=1}^n (Xi − μ)²        η̂ = (∑_{i=1}^n Xi)/n        η̂p = (nX·η̂X + nY·η̂Y)/(nX + nY)

1 population

    Parameter     Estimator
    μ             X̄
    σ²            V²                    (μ known)
    σ²            s² or S²              (μ unknown)
    η             η̂

2 populations

    Parameter     Estimator
    μX − μY       X̄ − Ȳ
    σX²/σY²       VX²/VY²               (μX, μY known)
    σX²/σY²       sX²/sY² or SX²/SY²    (μX, μY unknown)
    ηX − ηY       η̂X − η̂Y


Basic Statistics

1 normal population, any n

    Parameter            Statistic
    μ  (σ² known)        T(X; μ) = (X̄ − μ)/√(σ²/n) ∼ N(0,1)
                         ∑_{i=1}^n Xi ∼ N(nμ, nσ²)        X̄ ∼ N(μ, σ²/n)
    μ  (σ² unknown)      T(X; μ) = (X̄ − μ)/√(S²/n) ∼ t_{n−1}
    σ²  (μ known)        T(X; σ) = n·V²/σ² ∼ χ²_n
    σ²  (μ unknown)      T(X; σ) = n·s²/σ² = (n−1)·S²/σ² ∼ χ²_{n−1}

2 independent normal populations, any nX and nY

    Parameters                 Statistic
    μX − μY                    T(X, Y; μX, μY) = [(X̄ − Ȳ) − (μX − μY)] / √(σX²/nX + σY²/nY) ∼ N(0,1)
    (σX², σY² known)           (X̄ − Ȳ) ∼ N(μX − μY, √(σX²/nX + σY²/nY))
    μX − μY                    T(X, Y; μX, μY) = [(X̄ − Ȳ) − (μX − μY)] / √(SX²/nX + SY²/nY) ∼ t_k,
    (σX², σY² unknown)         where k is the closest integer to (SX²/nX + SY²/nY)² / [(SX²/nX)²/(nX−1) + (SY²/nY)²/(nY−1)]
    σX²/σY²                    T(X, Y; σX, σY) = [(1/nX)·(nX·VX²/σX²)] / [(1/nY)·(nY·VY²/σY²)] = (VX²/σX²)/(VY²/σY²) = VX²·σY²/(VY²·σX²) ∼ F_{nX, nY}
    (μX, μY known)


    σX²/σY²                    T(X, Y; σX, σY) = {[1/(nX−1)]·[(nX−1)·SX²/σX²]} / {[1/(nY−1)]·[(nY−1)·SY²/σY²]} = (SX²/σX²)/(SY²/σY²) = SX²·σY²/(SY²·σX²) ∼ F_{nX−1, nY−1}
    (μX, μY unknown)

1 population, large n

    Parameter    Statistic
    μ            T(X; μ) = (X̄ − μ)/√(?/n) →d N(0,1),  where ? is substituted by σ², S² or s²
                 ∑_{i=1}^n Xi →d N(nμ, n·?)        X̄ →d N(μ, ?/n)
    η            T(X; η) = (η̂ − η)/√(?(1−?)/n) →d N(0,1),  where ? is substituted by η or η̂
                 η̂ →d N(η, √(?(1−?)/n))

2 independent populations, large nX and nY

    Parameters   Statistic
    μX − μY      T(X, Y; μX, μY) = [(X̄ − Ȳ) − (μX − μY)] / √(?X/nX + ?Y/nY) →d N(0,1),
                 where for each population ? is substituted by σ², S² or s²
                 (X̄ − Ȳ) →d N(μX − μY, √(?X/nX + ?Y/nY))
    ηX − ηY      T(X, Y; ηX, ηY) = [(η̂X − η̂Y) − (ηX − ηY)] / √(?X(1−?X)/nX + ?Y(1−?Y)/nY) →d N(0,1),
                 where for each population ? is substituted by η or η̂

Remark 1T: For normal populations, the rules that govern the addition and subtraction imply that:

X̄ ∼ N(μx, σx²/nx),  Ȳ ∼ N(μy, σy²/ny),  and hence  X̄ ∓ Ȳ ∼ N(μx ∓ μy, σx²/nx + σy²/ny).

The tables include results combining the rules with a standardization or studentization. We are usually interested in comparing the means of the two populations, for which the difference is considered; nevertheless, the addition can also be considered with


[(X̄ ∓ Ȳ) − (μX ∓ μY)] / √(σX²/nX + σY²/nY) ∼ N(0,1).

On the other hand, since the quality of estimators—e.g. measured through the mean square error—increases with the sample size, when the parameters of two populations are supposed to be equal the samples should be merged to estimate the parameter jointly (especially for small nx and ny). Then, under the hypothesis σx = σy the pooled sample quasivariance should be used through the statistic:

T(X, Y; μX, μY) = [(X̄ − Ȳ) − (μX − μY)] / √(Sp²/nX + Sp²/nY) ∼ t_{nX+nY−2}
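The two expressions for Sp² given in the Basic Estimators table agree, which can be checked with a small computation (a sketch added for illustration, in Python rather than the R used elsewhere in this document; the data are invented):

```python
# Sketch: pooled sample quasivariance computed from quasivariances S^2 and
# from biased sample variances s^2 -- both expressions give the same value.
def quasivariance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

X = [5.1, 4.8, 5.6, 5.0, 4.9]   # invented sample from population X
Y = [6.2, 5.9, 6.4, 6.1]        # invented sample from population Y
nx, ny = len(X), len(Y)

sp2_from_S = ((nx - 1) * quasivariance(X) + (ny - 1) * quasivariance(Y)) / (nx + ny - 2)
sp2_from_s = (nx * variance(X) + ny * variance(Y)) / (nx + ny - 2)
assert abs(sp2_from_S - sp2_from_s) < 1e-12
print(sp2_from_S)
```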

Remark 2T: For any populations with finite mean and variance, one version of the Central Limit Theorem implies that

X̄ →d N(μx, σX²/nx),  Ȳ →d N(μy, σY²/ny),  and hence  X̄ ∓ Ȳ →d N(μx ∓ μy, σX²/nx + σY²/ny),

where the rules that govern the convergence (in distribution) of the addition—and subtraction—of sequences of random variables (see a text on Probability Theory) and the rules that govern the addition and subtraction of normally distributed variables are applied. We are usually interested in comparing the means of the two populations, for which the difference is considered; nevertheless, the addition can also be considered with

[(X̄ ∓ Ȳ) − (μX ∓ μY)] / √(?x/nx + ?y/ny) →d N(0,1)

and, for a Bernoulli population,

[(η̂X ∓ η̂Y) − (ηX ∓ ηY)] / √(?X(1−?X)/nX + ?Y(1−?Y)/nY) →d N(0,1).

Besides, variances can be estimated when they are unknown. By applying theorems in section 2.2 of Approximation Theorems of Mathematical Statistics, by R.J. Serfling, John Wiley & Sons, and sections 7.2 and 7.3 of Probability and Random Processes, by G. Grimmett and D. Stirzaker, Oxford University Press,

(X̄ − μ)/√(S²/n) = [1/√(S²/σ²)] · (X̄ − μ)/√(σ²/n) →d 1·N(0,1) = N(0,1)

and

(η̂ − η)/√(η̂(1−η̂)/n) = [1/√(η̂(1−η̂)/(η(1−η)))] · (η̂ − η)/√(η(1−η)/n) →d N(0,1).

Similarly for two populations. From the first convergence it is deduced that t_{n−1} →d N(0,1). On the other hand, when the parameters of two populations are supposed to be equal the samples should be merged to estimate the parameter jointly (especially for medium nx and ny). Then, under the hypothesis σx = σy the pooled sample quasivariance should be used—although in some cases its effect is negligible—through the statistic:

T(X, Y; μX, μY) = [(X̄ − Ȳ) − (μX − μY)] / √(Sp²/nX + Sp²/nY) →d N(0,1)

For a Bernoulli population, under the hypothesis ηx = ηy the pooled sample proportion should be used—although in some cases the effect is negligible—in the denominator of the statistic:

T(X, Y; ηX, ηY) = [(η̂X − η̂Y) − (ηX − ηY)] / √(η̂p(1−η̂p)/nX + η̂p(1−η̂p)/nY) →d N(0,1).
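Under H0: ηX = ηY the pooled-proportion statistic can be computed as follows (a sketch added for illustration, in Python rather than the R used elsewhere in this document; the counts are invented):

```python
# Sketch: pooled sample proportion and the corresponding statistic for
# H0: eta_X = eta_Y with two Bernoulli samples.
import math

x_successes, nx = 45, 100   # invented counts
y_successes, ny = 30, 90    # invented counts
eta_x, eta_y = x_successes / nx, y_successes / ny
# Pooled sample proportion = total successes / total sample size:
eta_p = (nx * eta_x + ny * eta_y) / (nx + ny)

t = (eta_x - eta_y) / math.sqrt(eta_p * (1 - eta_p) * (1 / nx + 1 / ny))
print(round(eta_p, 4), round(t, 3))
```

The value of t is then compared with a quantile of the standard normal distribution.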

Remark 3T: In the last tables, the best information available should be used in place of the symbol ?.

Remark 4T: The Bernoulli population is a particular case for which μ = η and σ² = η·(1−η), so X̄ = η̂. When the variance σ² is directly estimated without estimating η, σ̂² is used in place of the product ?(1−?).

Remark 5T: Once an interval for the variance is obtained, P(a1 < σ² < a2), since the positive square root is a strictly increasing function (and therefore it preserves the order between two values) an interval for the standard deviation is given by P(√a1 < σ < √a2). (Notice that, for a reasonable initial interval, 0 < a1.) Similarly for the quotient of two variances σX²/σY².


Statistics Based on Λ

1 population, any n

    Parameters            Statistic
    θ (1 dimension)       Λ = L(X; θ̂0) / L(X; θ̂1)
    θ (r dimensions)      Λ = L(X; θ̂0) / L(X; θ̂)
                          Asymptotically, −2·ln(Λ) →d χ²_r

Analysis of Variance (ANOVA)

P independent normal populations

One-Factor Fixed-Effects

    Sample Quantities          Statistic
    Between-Group Measures:    SSG = ∑_{p=1}^P np·(X̄p − X̄)²        MSG = SSG/(P−1)
    Within-Group Measures:     SSW = ∑_{p=1}^P SSp,  where SSp = ∑_{i=1}^{np} (Xp,i − X̄p)²        MSW = SSW/(n−P)
                               T0 = MSG/MSW ∼ F_{P−1, n−P}
    Total Measures:            SST = ∑_{p=1}^P ∑_{i=1}^{np} (Xp,i − X̄)² = SSW + SSG

Nonparametric Hypothesis Tests

Chi-Square Tests

    Goodness-of-Fit
    Data: X1, ..., Xn; K classes; 1 model F0
    Null Hypothesis H0: The sample comes from the model F0
    Statistic and Expected Absolute Frequency:
    T0(X) = ∑_{i=1}^K (Ni − ei)²/ei →d χ²_{K−(1+s)} = χ²_{K−1−s}
    where s parameters are estimated and ei = n·p̂i = n·P_θ̂(ith class), or, if no parameter is estimated, s = 0 and ei = n·pi = n·P_θ(ith class)


    Homogeneity
    Data: {X11, ..., X1n1; X21, ..., X2n2; ⋮; XL1, ..., XLnL}; K classes; L samples
    Null Hypothesis H0: The samples come from the same model
    Statistic: T0(X) = ∑_{i=1}^L ∑_{j=1}^K (Nij − eij)²/eij →d χ²_{KL−(L+K−1)} = χ²_{(K−1)(L−1)}
    where eij = ni·p̂ij = ni·p̂j = ni·N·j/n

    Independence
    Data: (X1, Y1), ..., (Xn, Yn); K·L classes; 2 variables
    Null Hypothesis H0: The bivariate sample comes from two independent models
    Statistic: T0(X, Y) = ∑_{i=1}^L ∑_{j=1}^K (Nij − eij)²/eij →d χ²_{KL−(L−1+K−1+1)} = χ²_{(K−1)(L−1)}
    where eij = n·p̂ij = n·p̂i·p̂j = n·(Ni·/n)·(N·j/n)

Remark 6T: Although because of different theoretical reasons, for the practical estimation of eij the same mnemonic rule can be used in both homogeneity and independence tests: for each position, multiply the absolute frequencies of the row and the column and divide by the total number of elements n.
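The mnemonic rule of Remark 6T in numbers (a sketch added for illustration, in Python rather than the R used elsewhere in this document; the observed frequencies are invented):

```python
# Sketch: expected frequencies e_ij = (row total)_i * (column total)_j / n
# for a 2x3 table of observed absolute frequencies, and the chi-square statistic.
N = [[10, 20, 30],
     [20, 10, 10]]
row_totals = [sum(row) for row in N]             # [60, 40]
col_totals = [sum(col) for col in zip(*N)]       # [30, 30, 40]
n = sum(row_totals)                              # 100

e = [[r * c / n for c in col_totals] for r in row_totals]
t0 = sum((N[i][j] - e[i][j]) ** 2 / e[i][j]
         for i in range(len(N)) for j in range(len(N[0])))
print(e)
print(round(t0, 3))
```

The value of t0 is compared with a quantile of the chi-square distribution with (K−1)(L−1) = 2 degrees of freedom.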

Kolmogorov–Smirnov Tests

    Goodness-of-Fit
    Data: X1, ..., Xn (1 sample); 1 model F0
    Null Hypothesis H0: The sample comes from the model F0
    Statistic: T0(X) = max_x |Fn(x) − F0(x)|
    where F0(x) = P(X ≤ x) and Fn(x) = (1/n)·Number{Xi ≤ x}

    Homogeneity
    Data: {X1, ..., XnX; Y1, ..., YnY} (2 samples)
    Null Hypothesis H0: The samples come from the same model
    Statistic: T0(X, Y) = max_t |FnX(t) − FnY(t)|
    where FnX(t) = (1/nX)·Number{Xi ≤ t} and FnY(t) = (1/nY)·Number{Yi ≤ t}

Other Tests

    Runs Test (of Randomness)
    Data: X1, ..., Xn; 1 dichotomous property; Nyes elements with it and Nno = n − Nyes elements without it
    Null Hypothesis H0: The sample is simple and random (it has been selected by applying simple random sampling)
    Statistic: let R be the number of runs. T0(X) = R if Nyes < 20 and Nno < 20, using the specific table. Or, for Nyes ≥ 20 and Nno ≥ 20,
    T̃0(X) = [T0(X) − μ]/√σ² →d N(0,1)
    with μ = 2·n1·n2/(n1+n2) + 1 and σ² = 2·n1·n2·(2·n1·n2 − n1 − n2) / [(n1+n2)²·(n1+n2−1)],
    and using the table of the standard normal distribution
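Counting the runs themselves is a one-line computation (a sketch added for illustration, in Python rather than the R used elsewhere in this document; the sequence is invented):

```python
# Sketch: the number of runs R of a dichotomous sequence is one plus the
# number of positions where consecutive elements differ.
x = list("AABBBABAABBA")   # invented dichotomous sequence
runs = 1 + sum(1 for a, b in zip(x, x[1:]) if a != b)
n_yes = x.count("A")
n_no = len(x) - n_yes
print(runs, n_yes, n_no)
```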


    Signs Test (of Position)
    Data: X1, ..., Xn; 1 model F0; 1 position measure Q (e.g. the median)
    Null Hypothesis H0: The population measure Q takes the value q0
    Statistic: T0(X) = Number{Xi − q0 > 0} if n < 20, using the specific table or the table of the Binomial(n, p), where p depends on Q (e.g. 1/2 for the median). Or, for n ≥ 20,
    T̃0(X) = [T0(X) − μ]/√σ² →d N(0,1)
    with μ = n·p and σ² = n·p·(1−p), and using the table of the standard normal distribution

    Wilcoxon Signed-Rank Test (of Position)
    Data: X1, ..., Xn; 1 model F0; 1 position measure Q (e.g. the median)
    Null Hypothesis H0: The population measure Q takes the value q0
    Statistic: T0(X) = ∑_{Xi−q0>0} Ri if n < 20, where Ri are the positions in the increasing sequence of |Xi − q0|, using the specific table. Or, for n ≥ 20,
    T̃0(X) = [T0(X) − μ]/√σ² →d N(0,1)
    with μ = n(n+1)/4 and σ² = n(n+1)(2n+1)/24, and using the table of the standard normal distribution
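The signed-rank statistic and its large-n normalization can be computed as follows (a sketch added for illustration, in Python rather than the R used elsewhere in this document; the data are invented, and tied absolute differences simply receive consecutive ranks in input order, a simplification of the usual midrank treatment):

```python
# Sketch: Wilcoxon signed-rank statistic T0 for a hypothesized median q0,
# plus the normalization with mu = n(n+1)/4 and sigma^2 = n(n+1)(2n+1)/24.
import math

X = [47, 53, 59, 41, 56, 60, 49, 58]   # invented sample
q0 = 50
diffs = [x - q0 for x in X]
# Rank |Xi - q0| increasingly (1 = smallest); stable sort keeps tie order.
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0] * len(diffs)
for pos, i in enumerate(order, start=1):
    ranks[i] = pos
t0 = sum(r for d, r in zip(diffs, ranks) if d > 0)

n = len(X)
mu = n * (n + 1) / 4                      # 18
sigma2 = n * (n + 1) * (2 * n + 1) / 24   # 51
z = (t0 - mu) / math.sqrt(sigma2)
print(t0, round(z, 3))
```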

Remark 7s: In the statistics, the parameter of interest is the unknown for confidence intervals while it is supposed to be known for hypothesis tests.

Remark 8s: Usually the estimators involved in the statistic T (like s, S...) and the quantiles (like a...) also depend on the sample size n, although the notation is simplified.

Remark 9s: For big sample sizes, when the Central Limit Theorem can be applied to T or its standardization, quantiles or probabilities that are not tabulated can be approximated: given a, p is directly calculated; and given p, a is calculated from the quantile z of the standard normal distribution:

p = P(T ≤ a) = P( Z ≤ [a − E(T)]/√Var(T) )        z = [a − E(T)]/√Var(T)        a = E(T) + z·√Var(T)

This is used in the asymptotic approximations proposed in the tests of the last table.
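Both directions of Remark 9s can be illustrated numerically (a sketch added for illustration, in Python rather than the R used elsewhere in this document; the moments E(T) and Var(T) are invented values, and a simple bisection stands in for a quantile table):

```python
# Sketch: approximate p from a, and recover a from p, through the standard
# normal distribution, using a = E(T) + z * sqrt(Var(T)).
import math

def phi(z):
    # Standard normal CDF through the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

ET, VarT = 50.0, 25.0                  # invented moments of the statistic T
a = 58.0
p = phi((a - ET) / math.sqrt(VarT))    # p = P(T <= a)

# Inverse direction: given p, find z with phi(z) = p by bisection, then a.
lo, hi = -10.0, 10.0
for _ in range(80):
    mid = (lo + hi) / 2
    if phi(mid) < p:
        lo = mid
    else:
        hi = mid
z = (lo + hi) / 2
assert abs(ET + z * math.sqrt(VarT) - a) < 1e-6
print(round(p, 4))
```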

Remark 10s: To consider the approximations, sample sizes bigger than 20 have been proposed in the last table, although it is possible to find other cutoff values in the literature (like 8, 10 or 30); in practice, there is no severe change at any value.

Remark 11s: The goodness-of-fit chi-square test can also be used to test position measures: by considering two classes with probabilities (p,1–p).

Remark 12s: To test the symmetry of a distribution, the position tests can be used.

Remark 13s: Although different types of test can be applied to evaluate the same hypotheses H0 and H1 with the same α (type I error), their quality is usually different, and β (type II error) should be taken into account. A global comparison can be done by using their power functions.


Probability Tables

Standard Normal        p = P(Z ≤ z) = ∫_{−∞}^{z} (1/√(2π))·e^{−z²/2} dz   for z ∈ (−∞, +∞) = ℝ

(Taken from: Kokoska, S., and C. Nevison. Statistical Tables and Formulae. Springer-Verlag, 1989.)


t        p = P(X > x) = ∫_x^{+∞} f(x) dx   for x ∈ (−∞, +∞)

(Taken from: Newbold, P., W. Carlson and B. Thorne. Statistics for Business and Economics. Pearson-Prentice Hall.)


χ²    p = P(X > x) = ∫ₓ^(+∞) f(x) dx,   for x ∈ [0, +∞)

(Taken from: Newbold, P., W. Carlson and B. Thorne. Statistics for Business and Economics. Pearson-Prentice Hall.)


F    p = P(X > x) = ∫ₓ^(+∞) f(x) dx,   for x ∈ [0, +∞)

(Taken from: Newbold, P., W. Carlson and B. Thorne. Statistics for Business and Economics. Pearson-Prentice Hall.)


My notes:


Index
(These references include only the most important concepts involved in each exercise.)

algebra, 4m
analysis
    complex, 3pt
    real, 1m, 2m, 3m, 4m
analysis of variance, 1ht-av
ANOVA → analysis of variance
asymptoticness, 3pe-p, 1ci-m, 2ci-m, 3ci-m, 4ci-m, 2ci
    (see also 'consistency')
basic estimators, 12pe-p, 13pe-p
    (see also 'sample mean', 'sample variance', 'sample quasivariance', 'sample proportion')
bind, 1m
bound, 5pe-p, 1m
    (see also Cramér-Rao's lower bound)
Bernoulli distribution, 1pe-m, 3pe-p, 12pe-p, 14pe-p, 3ci-m, 4ci-m, 6ht-T, 1ht-Λ, 1ht, 3pe-ci-ht, 3pt
    (see also 'binomial distribution')
binomial distribution, 1pe-m, 1pt, 3pt
characteristic function, 3pt
Chebyshev's inequality, 1ci-s, 1ci, 2ci, 3ci, 4ci
chi-square distribution, 7pe-p, 1pt
chi-square tests
    goodness-of-fit, 2ht-np, 3ht-np, 1ht
    homogeneity, 3ht-np
    independence, 1ht-np, 3ht-np
cook → statistical cook
critical region, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 6ht-T, 1ht-Λ, 1ht-av, 1ht-np, 2ht-np, 3ht-np, 1ht, 3pe-ci-ht, 4pe-ci-ht
critical values → critical region
completion, 2pe-p, 4pe-p, 5pe-p
    standardization, 1pe-p, 3pe-p, 4pe-p, 2pt
complex analysis, 3pt
confidence intervals, 1ci-m, 2ci-m, 3ci-m, 4ci-m, 1ci-s, 1ci, 2ci, 3ci, 4ci, 1pe-ci-ht, 2pe-ci-ht, 3pe-ci-ht
consistency, 6pe-p, 7pe-p, 9pe-p, 10pe-p, 12pe-p, 13pe-p, 14pe-p, 1pe, 2pe, 3pe
convergence → rate of convergence
coordinates
    rectangular, 4m
    polar, 1m, 3m
Cramér-Rao's lower bound, 9pe-p
density function → probability function
    (see the continuous probability distributions)
differential equation, 3pt
efficiency, 9pe-p, 10pe-p, 3pe
    (see also 'relative efficiency')
exponential distribution, 3pe, 1ht-Λ, 3pt
    two-parameter (or translated), 6pe-m
exponential function, 1m
factorization theorem, 11pe-p, 3pe
F distribution, 1pt
frontier, 4m
Fubini's theorem, 1m
generating functions
    → probability generating function
    → moments generating function
    → characteristic function


geometric distribution, 2pe-m, 11pe-p, 3pt
geometry, 4m
goodness-of-fit → chi-square tests
homogeneity → chi-square tests
hypothesis tests, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 6ht-T, 1ht-Λ, 1ht-av, 1ht-np, 2ht-np, 3ht-np, 1ht, 3pe-ci-ht, 4pe-ci-ht
independence → chi-square tests
indeterminate form, 2m, 3m
inference theory, 1it-spd
integral equation, 3pt
integral
    improper, 3pt, 1m
    multiple, 1m
integration
    directly, 5pe-m, 7pe-m, 3pt
    by parts, 6pe-m, 3pt
    by substitution, 3pt, 1m
joint distribution, 1it-spd
likelihood function, 11pe-p, 3pe
likelihood ratio tests, 1ht-Λ, 4pe-ci-ht
limits, 2m, 3m
linear algebra, 4m
margin of error, 1ci-m, 2ci-m, 1ci-s, 1ci, 2ci, 3ci, 4ci
mass function → probability function
    (see the discrete probability distributions)
maximum likelihood method, 1pe-m, 2pe-m, 3pe-m, 4pe-m, 5pe-m, 6pe-m, 7pe-m, 1pe, 2pe, 3pe, 4pe-ci-ht
mean square error, 6pe-p, 7pe-p, 8pe-p, 9pe-p, 12pe-p, 13pe-p, 14pe-p, 1pe, 2pe
method of the moments, 1pe-m, 2pe-m, 3pe-m, 4pe-m, 5pe-m, 6pe-m, 7pe-m, 1pe, 2pe, 3pe, 4pe-ci-ht
method of the pivot, 1ci-m, 2ci-m, 3ci-m, 4ci-m, 1ci-s, 1ci, 2ci, 3ci, 4ci, 1pe-ci-ht, 2pe-ci-ht, 3pe-ci-ht
minimum sample size, 1ci-s, 1ci, 2ci, 3ci, 4ci
moment generating function, 3pt
moment
    (see 'population moment' and 'sample moment')
movement, 4m
Neyman-Pearson's lemma, 1ht-Λ, 4pe-ci-ht
normal distribution, 4pe-m, 1pe-p, 2pe-p, 4pe-p, 5pe-p, 14pe-p, 1ci-m, 2ci-m, 1ci-s, 1ci, 2ci, 3ci, 4ci, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 1ht-Λ, 1ht-av, 1pe-ci-ht, 2pe-ci-ht, 1pt, 2pt, 3pt
normality, 12pe-p, 13pe-p
point estimations, 1pe-m, 2pe-m, 3pe-m, 4pe-m, 5pe-m, 6pe-m, 1pe-p, 2pe-p, 3pe-p, 4pe-p, 5pe-p, 6pe-p, 7pe-p, 8pe-p, 9pe-p, 10pe-p, 11pe-p, 12pe-p, 13pe-p, 14pe-p, 1pe, 2pe, 3pe, 1pe-ci-ht, 2pe-ci-ht, 4pe-ci-ht
Poisson distribution, 3pe-m, 1ht-Λ, 1pt, 3pt
polar coordinates, 1m, 3m, 12pe-p
pooled sample proportion → sample proportion
pooled sample variance → sample variance
population mean, 12pe-p, 1ht-T
population moment
    raw or crude, 3pt
population proportion, 12pe-p, 6ht-T, 3pe-ci-ht
population standard deviation → population variance
population variance, 12pe-p, 13pe-p, 2ht-T, 3ht-T, 4ht-T, 5ht-T
position signs test, 1ht
power function, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 6ht-T, 1ht, 3pe-ci-ht
probability, 1pe-p, 2pe-p, 3pe-p, 4pe-p, 5pe-p, 2pe-ci-ht, 1pt, 2pt, 3pt
probability function, 1it-spd, 10pe-p, 1pt
probability generating function, 3pt
probability tables, 1pt
plug-in principle, 1pe-m, 2pe-m, 3pe-m, 5pe-m, 6pe-m, 7pe-m, 3pe, 4pe-ci-ht
p-value, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 6ht-T, 1ht-av, 1ht-np, 2ht-np, 3ht-np, 1ht, 3pe-ci-ht


quantile, 4pe-p, 1pt
Rayleigh distribution, 2pe
rate of convergence, 6pe-p, 12pe-p, 13pe-p, 14pe-p
relative efficiency, 8pe-p
    (see also 'efficiency')
rotation, 4m
sample mean, 1it-spd, 1pe-p, 4pe-p, 9pe-p, 10pe-p, 3pe, 2pt
    trimmed, 6pe-p
sample moment
    (see 'method of the moments')
sample proportion, 3pe-p
    pooled, 14pe-p, 4ci-m
sample quasivariance, 2pe-p, 4pe-p
sample variance
    pooled, 14pe-p, 1pe-ci-ht, 2pe-ci-ht
sample size
    minimum → minimum sample size
sample standard deviation → sample variance
sampling distribution, 1it-spd
sequence, 2m, 3m, 12pe-p, 13pe-p, 14pe-p
    (see 'rate of convergence')
series, 3pt
statistical cook, 4ht-T
standard power function density, 4pe-ci-ht
sufficiency, 11pe-p, 3pe
table of frequencies, 1ht-np, 2ht-np, 3ht-np, 1ht
t distribution, 1pe-ci-ht, 1pt
total sum, 5pe-p, 2pt
type I error, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 6ht-T, 1ht-av, 1ht-np, 2ht-np, 3ht-np, 1ht, 3pe-ci-ht
type II error, 1ht-T, 2ht-T, 3ht-T, 4ht-T, 5ht-T, 6ht-T, 1ht-av, 1ht, 3pe-ci-ht
unbiasedness, 10pe-p
    (see also 'consistency')
uniform distribution
    continuous, 5pe-m, 10pe-p, 1pt
    discrete, 1pt


My notes: