Statistics: first lesson

Upload: abhijit-kar-gupta

Post on 24-Nov-2014

DESCRIPTION

This is the first of the lecture notes on basic statistics. Any beginner can consult this document.

TRANSCRIPT

Page 1: Statistics: first lesson

Dr. Abhijit Kar Gupta, [email protected]

What is Statistics?

Statistics is a systematic presentation of data out of which we may conclude something meaningful.

A mere collection of raw data is meaningless unless we are able to calculate some quantities from it. It becomes interesting only when patterns emerge from the data that are representative of some event or measurement.

After we collect a set of data, the first thing we like to do is to obtain its central tendency.

Central Tendency

The central tendency of a data set is obtained by calculating the mean, median and mode.

Mean:

There are various kinds of mean: (i) arithmetic, (ii) geometric, (iii) harmonic. We usually calculate the arithmetic mean, which we commonly call the mean or average.

Suppose we have a set of $N$ data points: $x_1, x_2, \dots, x_N$.

Arithmetic mean (A.M.):

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1)$$

The arithmetic mean or average is a measure of the ‘middle’ of the data set.

Now suppose $x_1$ appears $f_1$ times, $x_2$ appears $f_2$ times, and so on in the data set. Here $f_1, f_2, \dots$ are called the frequencies. The arithmetic mean in this case is

$$\bar{x} = \frac{1}{N}\sum_{i} f_i x_i, \qquad (2)$$

where $N = \sum_{i} f_i$.

Formula (2) is called the weighted mean.

Note: In formula (2), if we put $f_i = 1$ for all $i$, we get back formula (1).

The above formula (2) can also be written as

$$\bar{x} = \sum_{i} \frac{f_i}{N}\, x_i,$$

where $f_i/N$ is the relative frequency of each $x_i$ (each data point).

Example #1

The ages of the father, mother, son and daughter in a family are 60 years, 55 years, 25 years and 20 years respectively. What is the average age of the family members?

Ans. Average age $= \frac{60 + 55 + 25 + 20}{4} = \frac{160}{4} = 40$ years.

Example #2

In the game of ‘Ludo’ (dice throwing), you obtain ‘1’ two times, ‘2’ five times, ‘3’ two times, ‘4’ six times, ‘5’ four times and ‘6’ only once from random throws of a die. What is the average value you get?

Ans. Average value $= \frac{1 \times 2 + 2 \times 5 + 3 \times 2 + 4 \times 6 + 5 \times 4 + 6 \times 1}{2 + 5 + 2 + 6 + 4 + 1} = \frac{68}{20} = 3.4$.
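The two examples above can be checked with a short Python sketch. The function names `arithmetic_mean` and `weighted_mean` are illustrative choices, not part of the original notes; they follow formula (1) and formula (2) respectively.

```python
def arithmetic_mean(values):
    """Formula (1): sum of the data points divided by their number."""
    return sum(values) / len(values)


def weighted_mean(values, frequencies):
    """Formula (2): sum of f_i * x_i divided by the total frequency N."""
    total_frequency = sum(frequencies)
    weighted_sum = sum(f * x for x, f in zip(values, frequencies))
    return weighted_sum / total_frequency


# Example #1: ages of the four family members.
mean_age = arithmetic_mean([60, 55, 25, 20])

# Example #2: die faces and how many times each face was obtained.
faces = [1, 2, 3, 4, 5, 6]
counts = [2, 5, 2, 6, 4, 1]
mean_face = weighted_mean(faces, counts)

print(mean_age, mean_face)
```

Setting every frequency to 1 in `weighted_mean` gives the same result as `arithmetic_mean`, which mirrors the note after formula (2).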

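Geometric mean (G.M.):

G.M. $= (x_1 x_2 \cdots x_N)^{1/N}$

Harmonic mean (H.M.):

H.M. $= \dfrac{N}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_N}}$

The arithmetic mean (A.M.) is the usual choice for any set of numbers unless the numbers have some special relationship among them. For example, if we are to find the mean of the following set of numbers: 2, 4, 8, 16, 32, it is useful to calculate the geometric mean (G.M.).

G.M. $= (2 \times 4 \times 8 \times 16 \times 32)^{1/5} = (2^{15})^{1/5} = 2^{3} = 8$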

Note: The numbers 2, 4, 8…are in geometric progression.

If we are asked to find out the mean of the following numbers, , it would be interesting

to find out the harmonic mean (H.M.):

H.M.= .

Note: Here the numbers are in harmonic progression.
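As a rough sketch, the geometric and harmonic means can be computed as follows. The function names are illustrative. The geometric-mean input is the set 2, 4, 8, 16, 32 from the text; the harmonic-mean input (1, 1/2, 1/3, 1/4) is a hypothetical harmonic progression chosen only for illustration and is not taken from the original notes.

```python
import math


def geometric_mean(values):
    """N-th root of the product of N positive numbers."""
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    """N divided by the sum of the reciprocals of N non-zero numbers."""
    return len(values) / sum(1 / x for x in values)


# Numbers in geometric progression (from the text).
gm = geometric_mean([2, 4, 8, 16, 32])

# Hypothetical numbers in harmonic progression (an assumption, not from the notes).
hm = harmonic_mean([1, 1 / 2, 1 / 3, 1 / 4])

print(gm, hm)
```

Because floating-point roots are inexact, the printed geometric mean may differ from 8 in the last digits.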

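Useful method of mean calculation:

In practical calculations, when we are to obtain the arithmetic mean (A.M.) of a set of big numbers, we follow a shortcut method:

Step I: We assume a mean by just looking at the numbers. Let this be $A$.

(This is our choice and we do this as per our convenience.)

Step II: Next, we calculate the deviation of each data point from this assumed mean: $d_i = x_i - A$.

Now, the calculated mean of the deviations is

$$\bar{d} = \frac{1}{N}\sum_{i} d_i = \frac{1}{N}\sum_{i} (x_i - A) = \bar{x} - A.$$

The actual arithmetic mean is therefore

$$\bar{x} = A + \bar{d}.$$

Similarly, for data with frequencies,

$$\bar{d} = \frac{1}{N}\sum_{i} f_i d_i = \frac{1}{N}\sum_{i} f_i (x_i - A) = \bar{x} - A, \qquad N = \sum_{i} f_i.$$

Here also, we get the same formula as above, $\bar{x} = A + \bar{d}$.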

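Example #1

Consider the following table. We are to calculate the mean rainfall over seven days in the monsoon season.

Day      Rainfall $x_i$ (mm)    Deviation $d_i = x_i - A$ (mm)
1        250                     50
2        240                     40
3        190                    -10
4        254                     54
5        225                     25
6        232                     32
7        170                    -30
Total    1561                   161

Here the assumed mean is $A = 200$ mm and $N = 7$.

$\bar{d} = \frac{161}{7} = 23$ mm. The actual mean, $\bar{x} = A + \bar{d} = 200 + 23 = 223$ mm.

Also, verify by direct calculation, $\bar{x} = \frac{1561}{7} = 223$ mm.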
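The rainfall calculation can be reproduced with a minimal Python sketch of the assumed-mean method. The function name `mean_by_assumed_mean` is an illustrative choice, and the assumed mean of 200 mm is the value implied by the deviations in the table.

```python
def mean_by_assumed_mean(values, assumed_mean):
    """Shortcut method: assumed mean plus the mean of the deviations."""
    deviations = [x - assumed_mean for x in values]
    mean_deviation = sum(deviations) / len(values)
    return assumed_mean + mean_deviation


rainfall_mm = [250, 240, 190, 254, 225, 232, 170]

shortcut_mean = mean_by_assumed_mean(rainfall_mm, 200)
direct_mean = sum(rainfall_mm) / len(rainfall_mm)

print(shortcut_mean, direct_mean)
```

Any other choice of assumed mean gives the same final answer; a value close to the data only keeps the deviations small and the hand arithmetic easy.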

Example#2

Calculate the mean of the following data with the help of the assumed mean method.

Class interval    $x_i$    $f_i$    $d_i = x_i - A$    $f_i d_i$
10-20             45       4         5                  20
20-30             35       5        -5                 -25
30-40             48       3         8                  24
40-50             43       2         3                   6
50-60             40       1         0                   0
60-70             37       1        -3                  -3
70-80             39       4        -1                  -4
Total                      20                           18

Here the assumed mean is $A = 40$ and the number of data points is $N = \sum_i f_i = 20$.

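Mean of deviation, $\bar{d} = \frac{1}{N}\sum_{i} f_i d_i = \frac{18}{20} = 0.9$

The actual mean, $\bar{x} = A + \bar{d} = 40 + 0.9 = 40.9$

We can also check this from direct calculation, $\bar{x} = \frac{1}{N}\sum_{i} f_i x_i = \frac{818}{20} = 40.9$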
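The same check for data with frequencies can be sketched in Python. The function name is illustrative, and the values and frequencies are read from the table above with the assumed mean taken as 40.

```python
def weighted_mean_by_assumed_mean(values, frequencies, assumed_mean):
    """Assumed-mean method for data with frequencies."""
    total_frequency = sum(frequencies)
    weighted_deviation_sum = sum(
        f * (x - assumed_mean) for x, f in zip(values, frequencies)
    )
    return assumed_mean + weighted_deviation_sum / total_frequency


values = [45, 35, 48, 43, 40, 37, 39]
frequencies = [4, 5, 3, 2, 1, 1, 4]

shortcut_mean = weighted_mean_by_assumed_mean(values, frequencies, 40)
direct_mean = sum(f * x for x, f in zip(values, frequencies)) / sum(frequencies)

print(shortcut_mean, direct_mean)
```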

Median:

The median is the value in the middle when the data set is arranged in ascending or descending order.

Example #1

9, 12, 6, 1, 11

After ordering, 1, 6, 9, 11, 12

Median = 9.

If the data set has an even number of entries, the median is the mean of the two data points in the middle after ordering.

Example #2

9, 12, 6, 1, 11, 13

After ordering, 1, 6, 9, 11, 12, 13

Median $= \frac{9 + 11}{2} = 10$
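Both cases (odd and even number of entries) can be covered by one short Python sketch. The function name `median` is an illustrative choice.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


median_odd = median([9, 12, 6, 1, 11])        # five entries
median_even = median([9, 12, 6, 1, 11, 13])   # six entries

print(median_odd, median_even)
```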

Mode:

The mode is the data value with the maximum frequency, that is, the value that occurs the greatest number of times in the data set.

Example:

0, 2, 5, 9, 3, 2, 6, 2, 3, 5, 4, 2, 1

In the above data set the number 2 occurs the maximum number of times (four times). Mode = 2.
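Counting frequencies by hand is error-prone for longer lists, so a small Python sketch may help. The function name `mode_with_frequency` is illustrative; it relies on `collections.Counter` from the standard library.

```python
from collections import Counter


def mode_with_frequency(values):
    """Return the most frequent value together with its frequency."""
    counts = Counter(values)
    value, frequency = counts.most_common(1)[0]
    return value, frequency


data = [0, 2, 5, 9, 3, 2, 6, 2, 3, 5, 4, 2, 1]
mode_value, mode_frequency = mode_with_frequency(data)

print(mode_value, mode_frequency)
```

If several values share the maximum frequency, this sketch reports only one of them, so a data set with more than one mode needs extra handling.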

Usually, a data set follows the following approximate empirical formula:

Mean − Mode = 3 (Mean − Median)

Measures of Position

In statistical data analysis, we often want to measure the position of a data point relative to other values in the set. For example, we may want to know the rank or position of a student relative to others in a certain examination.

These measures are computed on rank-ordered data, where the elements in the data set are arranged in ascending order (from the smallest to the largest).

The following are the most common measures of position of the rank-ordered data:

Percentiles:

A percentile is the value of a variable below which a certain percentage of observations fall. For example, the 90th percentile is the value (or score) below which 90% of the data are to be found.

Suppose we have $N$ values. How is the percentile calculated?

1. First, the data are rank-ordered (arranged in ascending order).

2. To calculate the $P$-th percentile, we calculate the corresponding rank.

3. Round off this rank to the nearest integer and then take the value corresponding to that rank.

Example:

Given the numbers 2, 5, 4, 9, 8, 1

Rank ordered set: 1, 2, 4, 5, 8, 9

The rank of the 60th percentile is 4 (rounded off to the nearest integer).

The 60th percentile is 5 (the 4th member in the ordered list).

Note: The 100th percentile is defined to be the largest value in the given data set.
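The three steps can be sketched in Python under one loudly stated assumption: the rank formula itself did not survive in this transcript, so the code below assumes the common nearest-rank rule, rank = P/100 × N + 1/2, rounded to the nearest integer. This assumed rule reproduces the worked example (rank 4, value 5), but it may not be the exact formula of the original notes. The function name `percentile_nearest_rank` is also illustrative.

```python
import math


def percentile_nearest_rank(values, p):
    """P-th percentile by a nearest-rank rule.

    ASSUMPTION: rank = P/100 * N + 1/2, rounded to the nearest integer.
    The original rank formula is missing from the source text.
    """
    ordered = sorted(values)               # step 1: rank-order the data
    n = len(ordered)
    raw_rank = p * n / 100 + 0.5           # step 2: assumed rank formula
    rank = math.floor(raw_rank + 0.5)      # step 3: round half up
    rank = max(1, min(n, rank))            # keep the rank inside 1..N
    return ordered[rank - 1]               # ranks start at 1, list indices at 0


data = [2, 5, 4, 9, 8, 1]
p60 = percentile_nearest_rank(data, 60)
p100 = percentile_nearest_rank(data, 100)

print(p60, p100)
```

Clamping the rank to N is what makes the 100th percentile equal to the largest value, in line with the note above.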

Quartiles:

A quartile is one of the three points that divide a rank-ordered data set into four equal groups.

First quartile ($Q_1$): Cuts off the lowest 25% of data ⇨ 25th percentile

Second quartile ($Q_2$): Divides the data set in half ⇨ 50th percentile

Third quartile ($Q_3$): Cuts off the lowest 75% (or highest 25%) of data ⇨ 75th percentile

Interquartile range = upper quartile − lower quartile $= Q_3 - Q_1$

Note: The 50th percentile = Median
