____________________________________________________________________________ IUT de Saint-Etienne – Département TC –J.F.Ferraris – Math – S2 – Stat2Var – Lessons – Rev2020 – page 1 / 14
SALES AND MARKETING Department
MATHEMATICS
2nd Semester
________ Bivariate statistics ________
LESSONS
Online document: http://jff-dut-tc.weebly.com section DUT Maths S2.
TABLE OF CONTENTS

1 Introduction, vocabulary
1-1 Aims
1-2 Formatting
1-3 Scatter plot
2 Chi-square independence testing
2-1 The special case of Chi-square independence testing
2-2 Methodology
2-3 Independence in a 2x2 table
2-4 Some clarification on Chi-square law
3 Fitting: Mayer’s method and moving means
3-1 Moving means
3-2 Purpose of linear fitting
3-3 Mayer’s method
4 Linear fitting: least square method
4-1 Parameters of a bivariate series
4-2 Least square method
4-3 Linear correlation coefficient
5 Non-linear fitting: variable change
6 Statistical prediction
6-1 Point estimate
6-2 Confidence interval
LESSONS
1 Introduction, vocabulary
1.1 Aims
Two characters will be studied simultaneously on each individual in a population of size n, creating two variables (lists of values) X and Y.
Aims:
* highlight a relationship between both characters: their correlation;
* model this correlation by a mathematical function: regression;
* use this model for forecasting purposes: prediction, with an associated confidence level;
* test the hypothesis that X and Y are not related.
If a cause-and-effect relationship is to be studied, X will represent the cause and will be called the
explanatory variable, and Y will represent the effect and will be called the explained variable.
1.2 Formatting
From one individual (no. i), an observation will be written down as an ordered pair of values (xi ; yi).
There are two possible ways to display the data series, depending on the situation:
* bivariate data series given in lists
e.g.: relationship between the quantity of spread fertilizer and the harvested production
fertilizer harvest
plot no. X (kg.ha-1) Y (q.ha-1)
1 150 46
2 80 37
3 120 46
4 220 51
5 100 43
e.g. of a time series: annual advertising expense of a company
X : year 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Y : expense 41 60 55 66 87 61 90 95 82 120 125 118
* bivariate data series + frequencies: contingency table
e.g.: relationship between age and visual acuity, data collected from 200 people
X: age           20    40    50    60
Y: acuity  3/10    1     5    10    20
           6/10    8    12    25    18
           9/10   55    26    14     6
1.3 Scatter plot
Every statistical series with two variables can be graphically represented by a point cloud, each variable taking place on its own axis.
* series in lists: a pair (xi ; yi) corresponds to one individual and to one point.
second example in the previous page:
* series with contingency: a pair (xi ; yi) mostly corresponds to more than one individual (freq ≥ 1) and to
an object whose size is an increasing function of the associated frequency.
third example in the previous page:
[scatter plot of the time series: expense (k€) vs. year (year 1 = 2006)]
[scatter plot of the contingency table: acuity vs. age, point size increasing with frequency]
2 Chi-square independence testing
A statistical test consists in deciding whether a hypothesis, made on the population from the results obtained
on a sample, can or cannot be rejected. This hypothesis is named "null hypothesis", H0.
If the decision leads to a rejection of H0, this is done with a certain risk of error, the probability of which is
called "significance level", or sometimes "risk threshold", of the test, and noted α. (It is also called p-value of the
test).
2.1 The special case of Chi-square independence testing
A study crosses two quantitative or qualitative variables (in the example of the next tutorial: sex and relationship to tobacco) whose interdependence within a population is to be estimated, based solely on the frequency distribution obtained from a sample of respondents.
In the case of independence (H0), the theoretical answers are supposed to be distributed by keeping the
subtotals found from the sample (e.g.: a certain number of men and a certain number of women were
interviewed, possibly different numbers) and in proportion to these subtotals.
It involves calculating the deviation shown by the observed distribution compared to this theoretical one, a deviation noted "χ²calc" (pronounced "calculated Chi-square"), and then deciding whether this deviation is abnormally large or not. It can be proven that a population in which two variables are independent usually gives samples with a slight deviation (due to the random nature of the sample selection), but rarely a large one.
2.2 Methodology
n observations are conducted: n individuals are evaluated on two variables X and Y. The variable X shows r different values, and Y shows k different values.
The null hypothesis H0 is, by convention: the variables are independent.
The test compares reality to what perfect independence would have shown.
We can reject H0 in case the set of observations is "too far" from the theoretical distribution.
1. Calculation of the observed χ²
* table of observations on n individuals:

            Y1        Y2        …    Yk        total X
X1          obs11     obs12     …    obs1k     total X1
X2          obs21     obs22     …    obs2k     total X2
…           …         …         …    …         …
Xr          obsr1     obsr2     …    obsrk     total Xr
total Y     total Y1  total Y2  …    total Yk  n
* table of the theoretical distribution (independence)
This second table is built from the first, taking back every subtotal, then calculating each frequency
in proportion to these subtotals and to the general total n.
* calculation of χ²calc (global difference between obs and th): χ²calc = Σtable (obs − th)² / th
2. Rejection area
The χ² variable expresses the infinity of the possible χ² values that could be obtained from any
possible sample, under the null hypothesis. This variable is distributed in probability, by a law of the
same name, settled by its number of degrees of freedom (dof): dof = (r - 1)(k - 1)
To each possible χ² value (in [0 ; +∞[) corresponds a probability "α" that a sample would exceed it.
In an exercise, in case α is given, we can read the value of the corresponding χ²lim in the table.
3. Comparison and decision
If χ²calc > χ²lim , then we are allowed to reject H0 (the independence), with a risk α to be wrong.
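The three steps above can be sketched in Python (the 2×3 table of observed counts below is illustrative, not from the lessons; the final comparison against χ²lim still requires a table lookup for the chosen α):

```python
# Hypothetical 2x3 table of observed counts (illustrative data).
obs = [[30, 20, 10],
       [20, 30, 40]]

n = sum(sum(row) for row in obs)
row_tot = [sum(row) for row in obs]
col_tot = [sum(obs[i][j] for i in range(len(obs))) for j in range(len(obs[0]))]

# Theoretical table under H0: each cell in proportion to its subtotals.
th = [[row_tot[i] * col_tot[j] / n for j in range(len(obs[0]))]
      for i in range(len(obs))]

# Global deviation: chi2_calc = sum over the table of (obs - th)^2 / th
chi2_calc = sum((obs[i][j] - th[i][j]) ** 2 / th[i][j]
                for i in range(len(obs)) for j in range(len(obs[0])))

dof = (len(obs) - 1) * (len(obs[0]) - 1)   # (r - 1)(k - 1)
```
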
2.3 Independence in a 2x2 table
(from: ENFA - Bulletin du GRES n°9 – février 2000)
Let's have a look at the tools that are available to conduct a two-character independence test for a 2 x 2 table
(two qualitative variables each with two modalities - for example: male/female for one and smoking/non-smoking
for the other).
Let us take the example of YATES (1934) quoted in [M.G. KENDALL and A. STUART The advanced theory of statistics
Griffin 1960]. We consider a sample of 42 children, of whom 20 were breastfed and 22 bottle-fed. The arrangement
of the teeth of these children was observed.
                    Normal dentition   Poorly implanted dentition   margin frequencies
Breastfed (S)       4                  16                           20
Bottle-fed (B)      1                  21                           22
margin frequencies  5                  37                           42
The question is whether this sample alone can establish a link in the population between the way a baby is fed
and the quality of his or her dentition. This issue is addressed by an independence test.
2.3.1 Independence Chi-square test
The null hypothesis is "there is independence between the two characters" (mode of feeding and tooth
implantation).
The methodology of this test consists first of all in calculating the distance between the observed sample and
the average sample that would be taken from a population checking the null hypothesis. In order for the two
tables to be comparable, the marginal numbers (also called margins, i.e. subtotals) must be identical (i.e. the
numbers in bold and italics in the table are fixed).
The table of "theoretical" frequencies (in fact, those of the average sample mentioned above) is:
                    Normal dentition   Poorly implanted dentition   margin frequencies
Breastfed (S)       2.38095238         17.6190476                   20
Bottle-fed (B)      2.61904762         19.3809524                   22
margin frequencies  5                  37                           42
After comparison with the observed sample, this gives the following partial chi-squares (one per cell) and their total:

1.10095238   0.14877735
1.0008658    0.13525214

total: 2.38584767
This calculated chi-square value (2.386), for 1 dof, corresponds to a significance level higher than 10%.
The chi-square law tells us more precisely that a chi-square of 2.386 corresponds to a p-value of 12.24% (in
other words: in a population where our two variables are independent, there is a 12.24% chance that a sample
with the same subtotals is as different or more different from the average sample).
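The computation above can be reproduced in Python; for 1 dof no statistics library is needed for the p-value, since P(χ²(1) > x) = P(|Z| > √x) = erfc(√(x/2)):

```python
import math

# The Yates table, computed step by step (values match the text).
obs = [[4, 16], [1, 21]]
n = 42
row = [20, 22]           # margins: breastfed / bottle-fed
col = [5, 37]            # margins: normal / poorly implanted dentition

th = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]
chi2 = sum((obs[i][j] - th[i][j]) ** 2 / th[i][j]
           for i in range(2) for j in range(2))   # ~2.3858 for 1 dof

# For 1 dof: P(chi2(1) > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))          # ~0.1224
```
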
But here this would pose a problem, because the theoretical numbers are "too small", in the sense that,
according to textbooks, the Chi-2 test is only applicable if the theoretical numbers are all greater than or equal
to 5 (by the way, one may ask the question: why 5?).
This result of 12.24% is derived from the continuous Chi-square law, which is only an approximation of the
reality that is discrete here (for example, the "breastfed/normal dentition" frequency can only be 0, 1, 2, 3, 4
or 5, which is a "too discrete" situation to be effectively followed "closely" by a continuous law).
Section 2.3.2 below solves the problem.
2.3.2 The Exact Approach: The Fisher Exact Test
[R. A. FISHER Les méthodes expérimentales PUF 1947]
If the margin frequencies are fixed, then there are six different possible tables (first row / second row):

0 20 | 1 19 | 2 18 | 3 17 | 4 16 | 5 15
5 17 | 4 18 | 3 19 | 2 20 | 1 21 | 0 22
The question that arises then is to calculate, under the hypothesis of independence of the two characters, the
probability of appearance of each of the tables. It should be noted that, since the marginal numbers are fixed,
in order to fill in a table it is sufficient to know the number in the first row and first column.
The independence hypothesis can be interpreted as follows: of the 42 children, 20 are breastfed and 22 are
bottle-fed. If the mode of feeding has no influence on the dentition, then the 5 children with normal dentition
are distributed according to the proportions of the two modes of feeding.
Let's randomly choose 20 babies from 42 and call the "normal dentition" event a success. The number of
successes is described by the hypergeometric distribution H(42, 5, 20).
The probability of k successes (k between 0 and 5) is C(5,k) × C(37,20−k) / C(42,20).
The calculation, for each of these 6 values, leads to the following results. To sum up:

value in first row, first column   0        1        2        3        4        5
probability                        0.0310   0.1719   0.3440   0.3096   0.1253   0.0182
Let's go back to the first data table of the sample. If the null hypothesis is true, then the probability of obtaining
such a table (k = 4) or a table more distant from a proportionality table (k = 5) is 0.1435. The null hypothesis
can therefore only be rejected at a risk threshold greater than or equal to 14.35% (compared to the 12.24%
given by the Chi-square law), which is too high compared to the risk thresholds conventionally used (generally:
5% maximum).
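A minimal sketch of this exact computation, using Python's math.comb for the binomial coefficients:

```python
from math import comb

# Hypergeometric H(42, 5, 20): number of "normal dentition" children
# among the 20 breastfed ones, margins fixed.
def p(k):
    return comb(5, k) * comb(37, 20 - k) / comb(42, 20)

probs = [p(k) for k in range(6)]   # matches 0.0310, 0.1719, 0.3440, 0.3096, 0.1253, 0.0182
p_value = p(4) + p(5)              # observed table (k = 4) or more extreme: ~0.1435
```
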
To be more complete, we can say that for a 5% risk, we have the following decision rule:

value in first row, first column   0                             1, 2, 3, 4                        5
decision                           rejection of the hypothesis   non-rejection of the hypothesis   rejection of the hypothesis
P. DAGNELIE in [Theoretical and Applied Statistics Volume II De Boeck 1998] states that: "Despite these
objections, like many authors, we still recommend the use of this test for small samples". The objections relate
to the very strong assumption that the margins are fixed.
"The processing of frequencies by a χ² is a useful approximation in practice because of the relative simplicity of
the calculations. The exact treatment, more time-consuming, but necessary in case of doubt, shows the true
nature of the inferences suggested by the method of χ²."
In detail, the probability to obtain each table (first row / second row) is:
* table 0 20 / 5 17: probability 0.0310
* table 1 19 / 4 18: probability 0.1719
* table 2 18 / 3 19: probability 0.3440
* table 3 17 / 2 20: probability 0.3096
* table 4 16 / 1 21: probability 0.1253
* table 5 15 / 0 22: probability 0.0182
2.3.3 Additional comments
1°) A quote from M.J. Moroney in [Understanding the Statistics Marabout 1970]:
"A simple mathematical distribution can be perfectly well chosen because of its simplicity, whereas it fits the
facts less well than a more complex distribution, provided it fits our purpose well enough. A man going on a
trip may prefer to take a sketch with him rather than a headquarters map, because a sketch that is accurate
enough and simpler to follow better suits his needs."
The statistic of χ² is not the best fit for the previous independence test. Let us recall that the distribution of χ²
is continuous whereas the calculated chi-square can only take a finite number of values, but it is very simple
to use and sufficient in the sense of the author of the quotation.
2°) Instead of the term independence, some authors prefer the term association. The term association should
be understood in the sense: "is having bad teeth more associated with bottle-fed children than with breastfed
children? ”. In order to measure the degree of association of two characteristics each having two modalities,
various coefficients have been proposed, such as the YULE association coefficient and the FORBES-MARGALEF
association coefficient.
Let's take a look at the formal table:

                              Presence of the character A   Absence of the character A
Presence of the character B   a                             c
Absence of the character B    b                             d
• The coefficient of association in the sense of YULE (1900) is noted Q and by definition: Q = (ad − bc) / (ad + bc).
Note that this formula makes the numerator show the quantity ad − bc, the difference between the cross-products of the formal table, which cancels out if it is a proportion table, i.e. when there is independence of the two characters.
Moreover, Q is between –1 and 1.
If Q = 1, then bc = 0. If, for example, b = 0, it means that if the character A is present, then B too (associated
characters).
If Q = –1, then ad = 0. If, for example, a = 0, it means that the presence of the character A leads to the absence
of B (dissociated characters).
• The FORBES coefficient is defined by a(a + b + c + d) / ((a + b)(a + c)).
Its definition is based on a frequencist approach and on the idea that if two non-zero reals are equal, then their quotient is equal to 1. The probability (inferred from the observations) that an individual has both character A and character B is equal to a / (a + b + c + d). If the two characters are independent (in the sense of probabilities), then the probability that an individual has both character A and character B is equal to the product of their probabilities; this probability (inferred from the observations) is equal to (a + b)(a + c) / (a + b + c + d)².
Therefore, if the two characters are independent, the quotient of these two observed probabilities must be close to 1; this quotient is equal to a(a + b + c + d) / ((a + b)(a + c)).
By comparing the two previous probabilities, you can reconcile the observed numbers with the theoretical ones (they are equal to these probabilities up to the factor (a + b + c + d)!).
3°) In R.A. FISHER's book and in books intended for commercial studies (e.g. [Y. FOURNIS Les études de marché
Dunod 1995]), there is another way of calculating the χ² observed.
Let us take back the latest table and name n1, n2, m1, m2 the margin frequencies.
Thus, the value of the observed χ² is equal to (ad − bc)² n / (n1 n2 m1 m2), a formula that is easy to implement and automate.
Note again the presence of the term ad − bc in the numerator, as in the YULE association coefficient.
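Both the YULE coefficient and the shortcut chi-square formula can be checked on the YATES table; here a, b, c, d follow the usual row-by-row layout a b / c d (so a = 4, b = 16, c = 1, d = 21), which leaves the cross-product difference ad − bc unchanged:

```python
# Yates data: breastfed (4, 16), bottle-fed (1, 21).
a, b, c, d = 4, 16, 1, 21
n = a + b + c + d

Q = (a * d - b * c) / (a * d + b * c)   # Yule association coefficient

n1, n2 = a + b, c + d                   # row margins (20, 22)
m1, m2 = a + c, b + d                   # column margins (5, 37)
chi2 = (a * d - b * c) ** 2 * n / (n1 * n2 * m1 * m2)
# chi2 matches the 2.38584767 computed cell by cell in section 2.3.1
```
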
2.4 Some clarification on Chi-2 law
2.4.1 Definition
A Chi-2 law with d degrees of freedom is the continuous distribution of a variable, often noted K, defined as
the sum of the squares of d independent random variables Ui of the standard normal law:
If Ui ∼ N(0, 1), independent, then K = Σ(i=1..d) Ui² ∼ χ²(d)

(Like the exponential law and others, this law belongs to the group of "gamma" laws – Γ – which we will not talk about here; let us simply mention that the law χ²(d) is this way the law Γ(d/2, 1/2).)
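This definition can be illustrated by a small Monte-Carlo simulation (an illustrative sketch, not from the text): summing d squared standard normal draws should give a sample mean close to d and a sample variance close to 2d.

```python
import random

random.seed(0)                     # fixed seed for reproducibility
d, trials = 5, 20000

# K = sum of d squared N(0, 1) draws, repeated many times.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(d)) for _ in range(trials)]

mean = sum(samples) / trials                            # expected: d
var = sum((s - mean) ** 2 for s in samples) / trials    # expected: 2d
```
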
2.4.2 Parameters of the law χ²(d)
Mean: d        Standard deviation: √(2d)        Mode: d − 2, if d ≥ 2
The median depends on d in a more complex way:

d                  1      2      3      4      5      d ≥ 6
median (approx.)   0.45   1.39   2.37   3.36   4.35   d − 0.66
2.4.3 Patterns of probability densities
* If d = 1 (blue), the density is decreasing on ]0 ; +∞[ and tends to infinity at zero.
* If d = 2 (green), it is also strictly decreasing but equals 0.5 at zero. (The law χ²(2) is actually the exponential law with a parameter (intensity) of 0.5.)
* If d ≥ 3 (yellow: 3, red: 5, brown: 8), the density is first increasing then decreasing and reaches its peak at the abscissa d − 2 (the mode).
On the occasion of an independence χ² test, let us not forget that we rely on the law of the same name, which is continuous, to evaluate a discrete situation (we generally test numbers of quotations or numbers of successes, therefore integers). This law can, in these cases, only give an approximation of the probabilities we are interested in.
2.4.4 Links with other laws (further study)
* The central limit theorem gives a good approximation of the law χ²(d) by the normal law N(d, √(2d)) when d is "big enough" (criterion, here: d > 100).
* Definition of a Student law: if U ∼ N(0, 1) and K ∼ χ²(d), then T = U / √(K/d) ∼ t(d).
* Definition of a Fisher law: if K1 ∼ χ²(d1) and K2 ∼ χ²(d2), then F = (K1/d1) / (K2/d2) ∼ F(d1, d2).
3 Fitting: Mayer’s method and moving means
3.1 Moving means
The moving means are most frequently used in the case of time series, the variable X represents time and
the variable Y a value that changes over time.
When the Y values show large oscillations through time, an overall upward or downward trend is hard to
detect. The moving means are there to provide an answer, by smoothing these oscillations.
Methodology:
* group successive values of Y into sets, always of the same size (for example: take values three by three, or four by four, etc.); this size is chosen according to the periodicity of seasonal phenomena. When this periodicity is even, the moving average is calculated with one more value, the two extreme observations being weighted by half;
* The next set consists of the previous one, in which the first value of Y is removed and the next one is
joined (sliding sets);
* The average value of Y is calculated in each set (providing a list of moving means), same for the
average value of X (providing an average location in time for each set);
* The corresponding points are plotted (graphically represented).
e.g.:
X (trimesters) 1 2 3 4 5 6 7 8
Y (thousands of tourists) 58 22 13 36 60 19 14 33
Let’s create the list of the moving means of order 4:
X 3 4 5 6
Y 32.5 32.375 32.125 31.875
This new list of values (doubled by its graph) suggests a downward trend.
note:
* the first moving mean is the mean of the values n° 1 (coef 1/2), 2, 3, 4 and 5 (coef 1/2).
Here: (1/2+2+3+4+5/2)/4 = 3 for x and (58/2+22+13+36+60/2)/4 = 32.5 for y
* the second moving mean is the mean of the values n° 2 (coef 1/2), 3, 4, 5 and 6 (coef 1/2).
Here: (2/2+3+4+5+6/2)/4 = 4 for x and (22/2+13+36+60+19/2)/4 = 32.375 for y
* and so on…
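The computation above can be sketched in Python (order-4 moving means with half-weighted end values, applied to both X and Y):

```python
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [58, 22, 13, 36, 60, 19, 14, 33]

def moving_means(vals, order=4):
    """Even-order moving means: windows of order+1 values, ends weighted 1/2."""
    out = []
    for i in range(len(vals) - order):
        w = vals[i:i + order + 1]                              # 5 values
        out.append((w[0] / 2 + sum(w[1:-1]) + w[-1] / 2) / order)
    return out

mx = moving_means(x)   # [3.0, 4.0, 5.0, 6.0]
my = moving_means(y)   # [32.5, 32.375, 32.125, 31.875]
```
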
3.2 Purpose of linear fitting
A point cloud may show a link between both variables if its points are apparently not gathered at random. In some cases, this cloud's shape may be elongated, relatively thin, with a quite straight "directional axis" showing a tendency... Can we find an axis, a straight line, that "follows" the whole cloud at best?
Let’s say this line has already been drawn: (D): y′ = ax + b.
To a given value xi are associated the value yi (ordinate of the point Mi in the cloud) and the value y′i = axi + b (on the line).
definition: we name residue the number ei = yi − y′i
The residue of a point Mi is then positive if this point is above the line and negative in the opposite situation.
Hence, we aim to find the line that best "minimises" the residues, the line that passes through the cloud as close as possible to the points. This way, we perform a linear fitting, or linear regression. Once done, this object is called the fitting line, trend line or regression line of the series.
3.3 Mayer’s method
Some residues are positive, the others are negative. Mayer's assumption is that the "best" line is the one that leads to a zero sum of residues (the negative residues offset the positive ones).
definition: we name Mayer’s principle the goal: Σ(i=1..n) ei = 0
mathematical analysis:
Σ ei = Σ (yi − axi − b) = Σ yi − a Σ xi − nb
This sum is zero iff (1/n) Σ yi − a (1/n) Σ xi − b = 0, iff ȳ − ax̄ − b = 0.
That is to say: to obtain a cancellation of the global residue, it is necessary and sufficient that the straight line contains the mean point of the cloud, G(x̄, ȳ). This property isn't sufficient in itself to make a Mayer's line unique, since the only requirement is to pass through one given point. There are an infinite number of straight lines giving a zero sum of residues!
Mayer’s method:
* Divide the cloud into two subclouds:
Both subclouds must contain the same number of points: n/2 if n is even, or (n+1)/2 on one side and
(n-1)/2 on the other side if n is odd. The abscissas x in the first subcloud must all be less than the
abscissas x in the second one;
* Calculate the coordinates of G1 and G2, mean points (midpoints) of both subclouds;
* Determine (if asked) the expression of the line (G1G2), Mayer’s line that will be chosen; draw it
note: It’s been proved that the mean point of the whole cloud, G, belongs to the line (G1G2) in any
case, and then that the latter meets Mayer’s principle.
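Mayer's method can be sketched on the fertilizer example of section 1.2 (a hypothetical mini-script: the cloud is sorted by x, then split 3/2 since n = 5 is odd):

```python
# Fertilizer example: (fertilizer kg/ha, harvest q/ha), sorted by x.
pts = sorted([(150, 46), (80, 37), (120, 46), (220, 51), (100, 43)])
half = (len(pts) + 1) // 2                 # (n+1)/2 points in the first subcloud
sub1, sub2 = pts[:half], pts[half:]

def mean_point(sub):
    return (sum(p[0] for p in sub) / len(sub), sum(p[1] for p in sub) / len(sub))

g1, g2 = mean_point(sub1), mean_point(sub2)
a = (g2[1] - g1[1]) / (g2[0] - g1[0])      # slope of (G1G2)
b = g1[1] - a * g1[0]                      # intercept

# Mayer's principle check: the whole cloud's mean point G lies on (G1G2).
gx, gy = mean_point(pts)
```
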
4 Linear fitting: least square method
4.1 Parameters of a bivariate series
4.1.1
The means of X and of Y are:
x̄ = (1/n) Σ(i=1..n) xi and ȳ = (1/n) Σ(i=1..n) yi, without contingency (data series in lists – see p.3, examples 1 and 2);
x̄ = (1/n) Σ(i=1..r) ni xi and ȳ = (1/n) Σ(j=1..k) nj yj, with contingency (frequencies gathered into a crossed table – p.3, ex. 3).
The special point G(x̄, ȳ) is named mean point or midpoint of the cloud.
4.1.2
The variance of X and the one of Y are easily accessible (manual calculations) by Koenig’s theorem:
V(X) = (1/n) Σ(i=1..n) xi² − x̄² and V(Y) = (1/n) Σ(i=1..n) yi² − ȳ², without contingency;
V(X) = (1/n) Σ(i=1..r) ni xi² − x̄² and V(Y) = (1/n) Σ(j=1..k) nj yj² − ȳ², with contingency.
The standard deviations are still the square roots of the variances.
4.1.3
We name covariance of the pair (X, Y) the number: Cov(X,Y) = (1/n) Σ(i=1..n) (xi − x̄)(yi − ȳ).
This is a "common variance" between both variables, which is necessary to analyze their correlation.
Koenig’s theorem gives an easier way to calculate the covariance:
Cov(X,Y) = (1/n) Σ(i=1..n) xi yi − x̄·ȳ (without contingency) and Cov(X,Y) = (1/n) Σ(i=1..r) Σ(j=1..k) nij xi yj − x̄·ȳ (with contingency).
4.1.4
Using the calculator:
The means and standard deviations are given directly, in Stat mode.
Unfortunately, the calculator gives neither the variances nor the covariance.
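Since the calculator gives neither the variances nor the covariance, they are easy to program; a sketch on the fertilizer series of section 1.2, using Koenig's theorem:

```python
# Fertilizer series (no contingency): X in kg/ha, Y in q/ha.
xs = [150, 80, 120, 220, 100]
ys = [46, 37, 46, 51, 43]
n = len(xs)

mx = sum(xs) / n                                  # mean of X
my = sum(ys) / n                                  # mean of Y
vx = sum(x * x for x in xs) / n - mx ** 2         # V(X) by Koenig
vy = sum(y * y for y in ys) / n - my ** 2         # V(Y) by Koenig
cov = sum(x * y for x, y in zip(xs, ys)) / n - mx * my   # Cov(X,Y) by Koenig
```
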
4.2 Least square method
The idea of this method is to square each residue, then to add these squares, and finally to say that the
"best" line is the one that minimizes this sum (obtain the smallest possible sum, considering the infinite
number of possible lines).
definition: We name least square principle the one that consists in finding a line leading to:
Σ(i=1..n) ei² is minimum within the cloud (Gauss)
mathematical analysis: we set P(a, b) = Σ (yi − axi − b)², a bivariate polynomial.
There are two different ways to expand it:
P(a, b) = Σ (yi − axi − b)² = n b² − 2b Σ (yi − axi) + Σ (yi − axi)²   (1)
2nd degree trinomial with respect to b;
P(a, b) = Σ (yi − b − axi)² = a² Σ xi² − 2a Σ xi (yi − b) + Σ (yi − b)²   (2)
2nd degree trinomial with respect to a.
In this context, we can continue like this:
* consider a as a constant and b as a variable. P(a,b) in form (1) is minimum when its derivative (with respect to b) is zero (its leading coefficient, n, is positive), which leads to b = ȳ − a·x̄
* consider this latest value of b, and a as a variable. P(a,b) in form (2) is minimum when its derivative (with respect to a) is zero, which leads to
a = ((1/n) Σ xi yi − x̄·ȳ) / ((1/n) Σ xi² − x̄²) = Cov(X,Y) / V(X)
Calculus amateurs can try to find these results!
notes:
* such a value of b implies that the regression line passes through the mean point of the cloud, G; that is to say: it meets Mayer’s principle.
* This method leads to a unique line and is the one most commonly employed.
least square method:
* Calculate the coefficients a = Cov(X,Y) / V(X) and b = ȳ − a·x̄ (you can get them on your calculator!)
* Write the expression of the Y on X regression line DY/X : y′ = ax + b
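A sketch of the method on the fertilizer series of section 1.2, in plain Python (no statistics library):

```python
# Fertilizer series: X in kg/ha, Y in q/ha.
xs = [150, 80, 120, 220, 100]
ys = [46, 37, 46, 51, 43]
n = len(xs)

mx, my = sum(xs) / n, sum(ys) / n
vx = sum(x * x for x in xs) / n - mx ** 2                 # V(X), Koenig
cov = sum(x * y for x, y in zip(xs, ys)) / n - mx * my    # Cov(X,Y), Koenig

a = cov / vx            # slope: Cov(X,Y) / V(X), ~0.0854
b = my - a * mx         # intercept: the line passes through G(mx, my)
```
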
4.3 Linear correlation coefficient
A scatterplot shows a more or less strong link between two variables X and Y, sometimes displaying an elongated and almost straight cloud: in this case, a linear model is relevant. The purpose of the linear correlation coefficient is to evaluate the strength of a linear link, by a number.
linear correlation coefficient between X and Y: r = Cov(X,Y) / (σ(X) · σ(Y))
It’s been stated that, whatever the data series, −1 ≤ r ≤ 1.
(The capital R or the Greek letter ρ, "rho", is sometimes used for this coefficient.)
On the calculator:
A calculator generally writes it r… if it mentions it! (it depends on the type of calculator).
Therefore, we will calculate it by ourselves (which implies calculating the covariance first...).
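Calculating r "by ourselves" on the fertilizer series of section 1.2, covariance first:

```python
import math

# Fertilizer series: X in kg/ha, Y in q/ha.
xs = [150, 80, 120, 220, 100]
ys = [46, 37, 46, 51, 43]
n = len(xs)

mx, my = sum(xs) / n, sum(ys) / n
cov = sum(x * y for x, y in zip(xs, ys)) / n - mx * my    # Cov(X,Y), Koenig
sx = math.sqrt(sum(x * x for x in xs) / n - mx ** 2)      # sigma(X)
sy = math.sqrt(sum(y * y for y in ys) / n - my ** 2)      # sigma(Y)

r = cov / (sx * sy)     # linear correlation coefficient, ~0.909 here
```
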
Interpretation of its value:
The stronger the linear correlation (cloud looking like a straight line), the closer |r| is to 1.
CAUTION: THE RECIPROCAL IS NOT NECESSARILY TRUE!
A coefficient close to 1 can be obtained with a point cloud along a slightly curved axis, in a situation for
which the linear fit would not be relevant!
"positive correlation" : r is positive when Y overall increases with X
"negative correlation" : r is negative when Y overall decreases as X increases
0 ≤ |r| ≤ 0.5: weak linear correlation, inappropriate linear model.
0.5 ≤ |r| ≤ 0.75: moderate linear correlation, linear model still inappropriate.
0.75 ≤ |r| ≤ 0.95: tolerable linear correlation, the linear model may not be the best one.
0.95 ≤ |r| ≤ 1: strong linear correlation, the linear model is perhaps the best one.
Comments:
* are X and Y really linked?
If r is close to 1 (or −1), the points are close to being collinear (though they might follow a curve!). Nevertheless, that doesn't always mean that X and Y are concretely related. E.g.: in France, from 1974 to 1981, the wedding rate decreased while the GDP (French: PIB) increased, so that the scatter plot using both data sets is quasi-linear (fourth graph below). The linear correlation is mathematically very strong, but facts and studies show there is no cause-and-effect relationship between both variables! (After 1981, the following points are not at all collinear with the previous ones any more.)
* linear correlation
r only shows a linear link. A correlation between X and Y may be very strong, but not in a linear way (curved). In that case, r is far from 1 and −1, and the study has to be expanded (see section 5). But if |r| is far from 1, there is still a chance that a linear fit would be better than any other to model the point cloud – see the first two examples on the following page.
E.g.s (four scatter plots):
* income (€) vs. duration in a company: r = 0.8449
* success rate vs. % of disadvantaged SPC: r = −0.7457
* unit margin (€/u) vs. quantity (thousands): r = 0.6438
* wedding rate through time: r = −0.9875
Once again, beware of the relevance of a linear fit: the fact of knowing r, a and b is not enough to give us the
right to represent a bivariate series with a straight line!
R. Tomassone, E. Lesquoy and C. Miller, in their remarkable book "La régression, nouveaux regards sur une
ancienne méthode statistique" (Masson, 1983), present (p.21) the five series on the following page.
It turns out that all five have, up to the third decimal place, the same linear correlation coefficient and the same
least squares regression line coefficients (slightly more deviations for b); yet the five point clouds are very
different!
(for info, next page: 0.785 < r < 0.786; 0.808 < a < 0.809; 0.519 < b < 0.524)
X1  Y1      X2  Y2      X3  Y3      X4  Y4      X5      Y5
7   5.535   7   0.113   7   7.399   7   3.864   13.715  5.654
8   9.942   8   3.77    8   8.546   8   4.942   13.715  7.072
9   4.249   9   7.426   9   8.468   9   7.504   13.715  8.491
10  8.656   10  8.792   10  9.616   10  8.581   13.715  9.909
12  10.737  12  12.688  12  10.685  12  12.221  13.715  9.909
13  15.144  13  12.889  13  10.607  13  8.842   13.715  9.909
14  13.939  14  14.253  14  10.529  14  9.919   13.715  11.327
14  9.45    14  16.545  14  11.754  14  15.86   13.715  11.327
15  7.124   15  15.62   15  11.676  15  13.967  13.715  12.746
17  13.693  17  17.206  17  12.745  17  19.092  13.715  12.746
18  18.1    18  16.281  18  13.893  18  17.198  13.715  12.746
19  11.285  19  17.647  19  12.59   19  12.334  13.715  14.164
19  21.365  19  14.21   19  15.04   19  19.761  13.715  15.582
20  15.692  20  15.577  20  13.737  20  16.382  13.715  15.582
21  18.977  21  14.652  21  14.884  21  18.945  13.715  17.001
23  17.69   23  13.947  23  29.431  23  12.187  33.281  27.435
(scatter plots of series 1 to 5 omitted)
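The claim above can be checked numerically. A minimal sketch with numpy, using series 1 from the table (the other four series verify the same way):

```python
import numpy as np

# Series 1 from the table above (Tomassone, Lesquoy and Miller, 1983)
x1 = np.array([7, 8, 9, 10, 12, 13, 14, 14, 15, 17, 18, 19, 19, 20, 21, 23])
y1 = np.array([5.535, 9.942, 4.249, 8.656, 10.737, 15.144, 13.939, 9.45,
               7.124, 13.693, 18.1, 11.285, 21.365, 15.692, 18.977, 17.69])

# Linear correlation coefficient r
r = np.corrcoef(x1, y1)[0, 1]

# Least squares regression line y' = a x + b
a, b = np.polyfit(x1, y1, 1)

print(f"r = {r:.3f}, a = {a:.3f}, b = {b:.3f}")
```

Running the same three lines on series 2 to 5 gives almost identical values of r, a and b, even though only some of the clouds are reasonably described by a straight line.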
5 Non-linear fitting: variable change
A variable change may be performed when the points seem to follow a particular curve.
The function to consider will always be specified by the instructions of the exercise. It may be:
* a logarithm or exponential function
* a polynomial function
* a trigonometric function
* One of the variables X or Y (or both!) is replaced by a new one, noted T for instance,
following a given formula that allows its calculation from the former.
e.g.:
X 2 3 5 8
Y 9 13 28 70
As Y seems to vary like X squared (plus a constant), we can define the variable change T = X².
We then build the following table, in which T replaces X:
T 4 9 25 64
Y 9 13 28 70
* We perform a linear regression on the pairs (T, Y), keeping the values paired in their original order.
e.g.:
Here, the question is to determine the expression of the fitting line, y′ = at + b. If we are told to use
the least squares method, the coefficients a and b are given by the calculator: y′ = 1.02526 t + 3.856
* Finally, we deduce the expression of a curve fitting the non-linear relationship between X and Y,
simply by writing the variable change again; we may draw this curve if asked to.
e.g.:
Since y′ = 1.02526 t + 3.856, we get: y′ = 1.02526 x² + 3.856 (expression of a parabola)
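The whole procedure can be sketched in a few lines of numpy, reproducing the example above (T = X², then ordinary least squares on (T, Y)):

```python
import numpy as np

# Data from the example above
x = np.array([2, 3, 5, 8])
y = np.array([9, 13, 28, 70])

# Variable change: T = X^2, then linear regression on the pairs (T, Y)
t = x ** 2
a, b = np.polyfit(t, y, 1)
print(f"y' = {a:.5f} t + {b:.3f}")   # the lesson's line: y' = 1.02526 t + 3.856

# Writing the variable change back gives the parabola y' = a x^2 + b
def predict(x_new):
    return a * x_new ** 2 + b
```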
6 Statistical prediction
6.1 Point estimate
The fitting straight line (obtained with or without a variable change) makes it possible, through its
expression, to estimate a value of the variable Y for a chosen unexplored value of the variable X
(generally greater than those collected in the original series). In particular, if X represents time, it is
possible to make a forecast into the future.
e.g.: suppose the fitting line is y = 0.85x + 22.
a. Point estimate of y with x0 = 10: y′0 = 0.85×10 + 22 = 30.5.
b. Point estimate of x with y0 = 39: x′0 = (39 − 22)/0.85 = 20.
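The two point estimates above amount to evaluating the line and its inverse; a minimal sketch:

```python
# Fitted line y = 0.85 x + 22 from the example above
a, b = 0.85, 22.0

def estimate_y(x0):
    """Point estimate of y for a chosen value x0 (forward prediction)."""
    return a * x0 + b

def estimate_x(y0):
    """Point estimate of x for a chosen value y0 (inverse prediction)."""
    return (y0 - b) / a

print(estimate_y(10))   # 30.5
print(estimate_x(39))   # 20.0
```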
6.2 Confidence interval
We ought to step back and reconsider the point estimate: depending on the noise (dispersion) of the point
cloud, it is more or less trustworthy, giving a more or less precise prediction.
The new idea here is to give the estimate as a range (interval) around the point estimate, rather than a
single value, and to associate a probability (confidence level) that the unknown true value lies
inside such a range.
Rates method (uses a linear model, estimates y from x):
1. For each value xi of the initial data set:
* calculate the value y'i given by the expression of the regression line
* calculate the rate zi = yi / y'i
* calculate the mean z̄ and standard deviation σZ of the variable Z
2. Z is considered to be normally distributed. Consequently:
95 % of Z values lie inside the interval [ z̄ − 1.96 σZ ; z̄ + 1.96 σZ ]
99 % of Z values lie inside the interval [ z̄ − 2.58 σZ ; z̄ + 2.58 σZ ]
3. Calculate the point estimate y'0 associated with the new given value x0, using the fitting line.
We can then predict the unexplored possible value y0 by an interval, as follows:
There is a 95% chance that y0 lies in [ y'0 (z̄ − 1.96 σZ) ; y'0 (z̄ + 1.96 σZ) ]
There is a 99% chance that y0 lies in [ y'0 (z̄ − 2.58 σZ) ; y'0 (z̄ + 2.58 σZ) ]
comments:
* this method is reliable only for r > 0 (positive correlation)
* the probability (95%, 99%, etc.) is called the confidence level of the prediction.
Its complement (5%, 1%, etc.) is called the significance level.
* The width of such an interval reflects the uncertainty of the answer. It increases when:
. the confidence level increases,
. |r| decreases,
. the distance between x0 and the abscissas xi of the point cloud increases.
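The three steps of the rates method can be sketched as follows; the data here are hypothetical, chosen only to show the mechanics (positive correlation, as the method requires):

```python
import numpy as np

# Hypothetical bivariate series (illustration only)
x = np.array([2, 4, 6, 8, 10, 12])
y = np.array([24.1, 25.9, 26.8, 29.2, 30.1, 32.4])

# Step 1: regression line, fitted values y'_i, rates z_i = y_i / y'_i
a, b = np.polyfit(x, y, 1)
y_fit = a * x + b
z = y / y_fit
z_mean, z_std = z.mean(), z.std()      # mean and (population) std of Z

# Steps 2-3: point estimate for a new x0, then 95% prediction interval
x0 = 15
y0_point = a * x0 + b
low = y0_point * (z_mean - 1.96 * z_std)
high = y0_point * (z_mean + 1.96 * z_std)
print(f"point estimate: {y0_point:.2f}, 95% interval: [{low:.2f}; {high:.2f}]")
```

For the 99% level, replace 1.96 by 2.58, as in the formulas above.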
IUT TC MATHEMATICS FORM FOR BIVARIATE STATISTICS
χ² law table
The table gives the values χ²lim such that p(χ² > χ²lim) = α.
(table of values omitted)