statistics notes 2005

Statistics for Spatial Analysis Slides are based on Notes of Shri. S.K. Mittal

Page 1: Statistics Notes 2005

Statistics for Spatial Analysis

Slides are based on Notes of Shri. S.K. Mittal

STATISTICS

The word 'Statistics' is derived from the Latin word 'Status', the Italian word 'Statista' and the German word 'Statistik', each of which means a 'political state' or a 'government'. At present, the word statistics is used in two different but inter-related ways: (i) as a plural noun, and (ii) as a singular noun.

As a plural noun. When used as a plural noun, the word 'statistics' means statistical data. Prof. Horace Secrist defines statistics in this sense as follows:

"By statistics we mean aggregates of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a pre-determined purpose and placed in relation to each other."

From the definition of statistics, we observe the following characteristics:

CHARACTERISTICS OF STATISTICAL DATA

1. Aggregates of facts
2. Numerically expressed
3. Affected by multiplicity of causes
4. Estimated according to reasonable standards of accuracy
5. Collected in a systematic manner
6. Collected for a pre-determined purpose
7. Placed in relation to each other

As a singular noun. As a singular noun, statistics refers to the science which deals with the methods of collecting, classifying, presenting, comparing and interpreting numerical data.

In this sense, statistics is also known as 'statistical methods'. The important statistical methods are as follows:

STATISTICAL METHODS

1. Collection of data
2. Classification of data
3. Presentation of data
4. Analysis of data
5. Interpretation of data
6. Forecasting of data

We conclude the following:

When used in the sense of data, 'statistics' are numerical statements of facts, capable of further analysis and interpretation. When used in the sense of a science, statistics is concerned with the principles and methods used in the collection, presentation, analysis and interpretation of numerical data in any sphere of enquiry.

Statistical methods are growing in popularity and are widely used in every branch of knowledge. But they cannot be applied to all kinds of phenomena and cannot answer all our doubts; they suffer from various limitations.

LIMITATIONS OF STATISTICS

1. Does not deal with individual facts
2. Ignores the qualitative aspects
3. Is not an end in itself
4. Can be misused
5. Good understanding is required

COLLECTION OF DATA

A sound statistical investigation is based on a systematic collection of data. Data are generally classified in two groups, viz.

(a) internal data and
(b) external data.

Internal data come from the internal records of a business firm: its operations, its production and purchase records, and its accounting system. They are generally associated with the organizational and functional activities of the firm. Internal data can be either insufficient or inappropriate for the problem under investigation, in which case external data are needed to make decisions. External data are collected and published by agencies external to the enterprise. They can be collected from either a primary or a secondary source.

Primary and Secondary Data

Primary data are data collected by the investigator himself for the first time. In India there are various agencies which collect primary data; the National Sample Survey is one of them.

Secondary data are data that have already been collected by a source other than the present investigator.

We may collect data ourselves, and somebody else may then decide to make use of them. The same data will be primary data for us but secondary data for those who make use of them.

Similarly, in order to compare the cost of living in Delhi and Bombay, we may decide to make use of the data published in 'The Economic Times'. Here we will be making use of secondary data.

DISTINCTION BETWEEN PRIMARY AND SECONDARY DATA

S.No. | Basis | Primary Data | Secondary Data
1. | Originality | It is original, because the investigator himself collects the data. | It is not original. The investigator makes use of data collected by other agencies.
2. | Collection | It involves large expenses in terms of time, energy and money. | It is a relatively less costly method.
3. | Suitability | If the data have been collected in a systematic manner, they will suit the enquiry. | It may or may not suit the objects of enquiry.
4. | Precautions | No extra precautions need be taken in making use of these data. | It should be used with care.

Methods of collecting primary data

I. Direct Personal Investigation
II. Indirect Oral Investigation
III. Information through correspondents
IV. The Questionnaire Method

Sources of secondary data

The chief sources of secondary data can be classified into two groups, viz.
(a) Published and
(b) Unpublished

Precautions in the use of secondary data

1. Whether the data are reliable. To judge the reliability of data, the integrity and experience of the collecting organization, the purpose, the method of collection, the degree of accuracy and test-checking must be ascertained.
2. Whether the data are suitable for the purpose.
3. Whether the data are adequate.

Tabular form of data

105   93   97  101  115  149  135  120  130  140
110   93  109  113   98  111  100  102  107  103
 90  142  111  108  102  109  107  119  113   96
120  135   91  110  117  104  105  120  114   92
110  120  102   92  114   99  112  107   99  100
115  115   90  136  110  106  123  109  114  109
117  114   98  106  110  104  134  109  127  113
119  113  116  124  123  110  136  132  116  108
121  112  141  109  116  109  141  117  134   98
 92  110  109  122  109   97   93  107  104  108
 87   89  121  111  110  103  114  113  150  156
104  117  114  110  121  107  106  114  142  114
120  112  116  109  111  113  114   98  113  112
121   99  109  123  111  116  104   99  109  117
109  109  110   97  105  102  109  101   97  103

Grouped Frequency Distribution Table

Class Interval     f
87-91              5
92-96              8
97-101            15
102-106           18
107-111           38
112-116           28
117-121           16
122-126            5
127-131            2
132-136            7
137-141            3
142-146            2
147-151            2
152-156            1

N = 150

Note that in computations involving a classified distribution, the midpoint is used as a substitute for each score in the interval. For this reason, we recommend choosing an odd number for the interval width i whenever possible. Nothing is sacred about this suggestion; it just makes the midpoint a whole number of units, thus simplifying computation.
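The grouping step can be sketched in Python as follows. This is only an illustration: it assumes class intervals of width 5 starting at 87, as in the table, and the helper name `class_interval` and the ten sample scores (the first ten values of the raw data) are choices made for the example.

```python
from collections import Counter

def class_interval(score, start=87, width=5):
    """Return the (lower, upper) stated limits of the interval holding score."""
    k = (score - start) // width
    lower = start + k * width
    return lower, lower + width - 1

# Illustrative sample: the first ten scores of the raw data.
scores = [105, 93, 97, 101, 115, 149, 135, 120, 130, 140]

frequency = Counter(class_interval(s) for s in scores)
for (lower, upper), f in sorted(frequency.items()):
    midpoint = (lower + upper) / 2  # a whole number because the width is odd
    print(f"{lower}-{upper}  f={f}  midpoint={midpoint:g}")
```

With an odd width such as 5, every midpoint (89, 94, 99, ...) is a whole number, which is the simplification the note refers to.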


The Cumulative Distribution

Arranging data into a cumulative distribution is really helpful. It allows us to obtain the number (or the proportion) of cases in a distribution below or above each class interval (or boundary).

Cumulative Distribution Table

Class interval     f    Cum   Cum %
87-91              5      5      3
92-96              8     13      9
97-101            15     28     19
102-106           18     46     31
107-111           38     84     56
112-116           28    112     75
117-121           16    128     85
122-126            5    133     89
127-131            2    135     90
132-136            7    142     95
137-141            3    145     97
142-146            2    147     98
147-151            2    149     99
152-156            1    150    100

N = 150
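The Cum and Cum % columns can be reproduced with a few lines of Python. This is a minimal sketch using the frequencies from the table; rounding the percentages to whole numbers is an assumption that happens to match the printed column.

```python
from itertools import accumulate

# Frequencies from the grouped distribution, lowest interval first.
f = [5, 8, 15, 18, 38, 28, 16, 5, 2, 7, 3, 2, 2, 1]

n = sum(f)
cum = list(accumulate(f))                    # running totals
cum_pct = [round(100 * c / n) for c in cum]  # rounded to whole percent

print(n)
print(cum)
print(cum_pct)
```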

Graphic Techniques

There are always some people who would rather not read tables, and who understand information better when it is presented in pictorial form. Our prehistoric ancestors undoubtedly knew this when they made the first cave drawings. Similarly, the Egyptians, Greeks and Romans used drawings and sculptures to convey information about their respective societies. Thus, art has been used to carry information throughout the ages, and it is also valuable to us in describing information.

Graphs, the pictorial forms that follow, are not meant to substitute for tabular construction. Rather, they are visual aids that help us to describe and think about the shape of the distribution. In fact, you cannot plan or construct a graph until you have prepared the corresponding table. The graphic forms shown here correspond to both qualitative and quantitative distributions.

The Histogram

The histogram is the graphic equivalent of the grouped distribution for interval-level data. It consists of a set of adjacent bars whose heights are proportional to either the absolute frequencies or the proportions of cases in each interval of the variable.

The most noticeable feature of the histogram is its structural simplicity. Bars are understood more easily than numbers. The histogram shows the relative concentration of data in each interval as well as the shape of the distribution.

The Polygon

It is easy to convert a histogram into the much-used polygon. All we need to do is connect the midpoints of the tops of the bars with straight lines.

Polygons are particularly useful when we wish to compare two or more distributions on the same graph. Unlike histograms, they do not blur each other's outlines.

The Ogive

When a graph is used to present a cumulative percentage distribution, it is called an ogive. The ogive is constructed on a pair of perpendicular axes, just like the polygon.

The horizontal axis represents the upper true limits of the class intervals, and the vertical axis indicates the cumulative percentage of observations. A dot is placed directly above the upper true limit of each class interval, at the height that indicates the percentage of cases falling below that limit. After all the intervals have been plotted with their corresponding percentages, the dots are joined by straight lines.
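The points of an ogive can be computed before any plotting is done. The sketch below uses the class intervals and frequencies of the grouped distribution and assumes the scores are recorded to the nearest whole number, so that the upper true limit of the interval 87-91 is 91.5; the variable names are illustrative.

```python
from itertools import accumulate

lower_limits = list(range(87, 153, 5))  # 87, 92, ..., 152
f = [5, 8, 15, 18, 38, 28, 16, 5, 2, 7, 3, 2, 2, 1]
width = 5

n = sum(f)
# Assumed convention: upper true limit = upper stated limit + 0.5.
upper_true = [lo + width - 1 + 0.5 for lo in lower_limits]
cum_pct = [100 * c / n for c in accumulate(f)]

# One (x, y) dot per class interval; joining them in order gives the ogive.
ogive_points = list(zip(upper_true, cum_pct))
for x, y in ogive_points:
    print(f"{x:6.1f}  {y:6.2f}")
```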

MEASUREMENTS

MEASURES OF CENTRAL TENDENCY

A measure of central tendency is a single figure that represents the whole of a distribution. Individual observations in a distribution generally show a tendency to concentrate at certain values, usually somewhere in the centre of the distribution.

Thus, we talk of the average per capita income of India, the average size of holdings in India, the average productivity of labour in India, the average cost of production of cloth, the average life of an Indian, etc.

Three important measures of central tendency are the mean, the median and the mode.

Arithmetic Mean

The arithmetic mean, or simply the 'mean', is the most commonly used of all averages. For example, we frequently talk of average monthly income, average monthly expenditure, average marks secured by students, average petrol consumption of a car or scooter in a day, average productivity per farm, average bonus paid, etc.

The arithmetic mean is defined as the sum of the values of a group of items divided by the number of items:

\[ \bar{X} = \frac{\sum X}{N} \]

where \(\bar{X}\) is the arithmetic mean, \(\sum X\) is the sum of the values and N is the number of items.
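As a small worked example of the definition, the Python sketch below divides the sum of the values by their number. The function name `arithmetic_mean` and the four income figures are made up for illustration.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of items."""
    return sum(values) / len(values)

# Hypothetical monthly incomes (in thousands of rupees).
incomes = [10, 20, 30, 40]
mean_income = arithmetic_mean(incomes)
print(mean_income)  # (10 + 20 + 30 + 40) / 4 = 25.0
```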

Median

The effect of an extreme value can be avoided if we take a measure of central position in a given series. This positional measure is called the median.

The median is the value which divides the series into two equal parts. Thus the number of items less than the median and the number of items more than the median are equal.

To get the median value, we arrange the items in order of size and make use of the following formula:

M = size of the ((N+1)/2)-th item

where M stands for the median and N for the number of items in the series.
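The positional formula can be sketched in Python as below. One point is an assumption not spelled out in the notes: when N is even, (N+1)/2 falls halfway between two items, and the sketch follows the usual convention of averaging those two items.

```python
def median(values):
    """Size of the ((N+1)/2)-th item of the ordered series."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position of the median
    if position.is_integer():
        return ordered[int(position) - 1]
    # Even N (assumed convention): average the two neighbouring items.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median([7, 1, 5, 3, 9]))  # 3rd item of 1, 3, 5, 7, 9
print(median([4, 1, 3, 2]))     # halfway between the 2nd and 3rd items
```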

The arithmetic mean is a good measure of central tendency when we are interested in finding the average value of a variate, e.g. average revenue, average cost or average productivity. The median is a good measure when the items are spread more to one side of the distribution. The median is also useful where the items cannot be measured in definite units, e.g. qualities like intelligence or health.

Mode

A third important measure of central tendency is the mode, denoted by Z. The mode is the most common value found in a series.

For example, the daily wages of labourers employed in Defence Colony are Rs. 80, 85, 86, 86, 86, 87, 89, 90. The modal wage is Rs. 86, because it occurs most frequently.
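The wage example can be checked with a short Python sketch that counts how often each value occurs. Returning a list of modes is a design choice of this illustration, since a series may have more than one most frequent value.

```python
from collections import Counter

wages = [80, 85, 86, 86, 86, 87, 89, 90]

counts = Counter(wages)
highest = max(counts.values())
modes = [value for value, count in counts.items() if count == highest]

print(modes, highest)  # Rs. 86 occurs three times
```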

Relationship between Mean, Median and Mode

Mean, median and mode each have a distinct role in statistical analysis. In no case can they be substituted for one another. In a moderately asymmetrical distribution, the following approximate relationship holds:

Mode = 3 Median - 2 Mean
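The relationship can be tried on the daily wage figures used for the mode. The sketch below relies on Python's standard `statistics` module; the result is only an estimate, since the relationship is approximate.

```python
from statistics import mean, median

wages = [80, 85, 86, 86, 86, 87, 89, 90]

# Mode = 3 Median - 2 Mean
estimated_mode = 3 * median(wages) - 2 * mean(wages)
print(estimated_mode)  # 3 * 86 - 2 * 86.125 = 85.75
```

The estimate of 85.75 is close to the observed modal wage of Rs. 86.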

Comparative Evaluation of Characteristics of Mean, Median and Mode

S.No. | Characteristic | Mean | Median | Mode
1 | It is rigidly defined | Yes | Yes | No
2 | It is situated in the centre of the distribution | No | Yes | Yes
3 | It is easily understandable | Yes | Yes | Yes
4 | Its calculation is easy | Yes | Yes | No
5 | It is based on all the observations | Yes | No | No
6 | It is capable of further mathematical treatment | Yes | No | No
7 | It is affected by the choice of sample | Yes | No | No
8 | It is affected by extreme values | Yes | No | No
9 | It can be represented graphically | No | Yes | Yes

MEAN DEVIATION

Mean deviation shows the scatter around an average. It is like measuring the scatter of the population of a city. Some people live close to the centre of the city and others at varying distances. Their average distance from the centre indicates how scattered or dispersed they are.

Mean deviation is defined as the average of the absolute deviations of the values from a measure of central tendency, which can be either the arithmetic mean or the median. Here we take the mean:

\[ \text{M.D.} = \frac{\sum |dx|}{N} \]

where \(dx = X - \bar{X}\) is the deviation of an item from the mean, taken without regard to sign.

Coefficient of Mean Deviation = M.D. / Mean (or M.D. / Median when the deviations are taken from the median)
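A minimal Python sketch of the calculation, taking deviations from the mean and ignoring their signs. The data are hypothetical and symmetric, so the mean and the median coincide (both are 6) and the coefficient is the same whichever of the two is used as the divisor.

```python
def mean_deviation(values):
    """Average of the absolute deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

data = [2, 4, 6, 8, 10]             # hypothetical values, mean = median = 6
md = mean_deviation(data)           # |dx| = 4, 2, 0, 2, 4, so 12 / 5
coefficient = md / (sum(data) / len(data))
print(md, coefficient)
```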

STANDARD DEVIATION

Standard deviation is another, related measure of variation. In mean deviation we take the sum of the deviations after ignoring their plus and minus signs. In standard deviation we achieve the same effect in another way: we square all the deviations, and the squared deviations are never negative.

Standard deviation is the square root of the arithmetic mean of the squared deviations. It is generally denoted by \(\sigma\) (read 'sigma'):

\[ \sigma = \sqrt{\frac{\sum dx^2}{N}} \]
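The definition translates directly into Python. This sketch divides by N, as in the definition above, and uses a small hypothetical series chosen so that the arithmetic is easy to follow.

```python
from math import sqrt

def standard_deviation(values):
    """Square root of the mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / n)

# Hypothetical series: mean 5, squared deviations sum to 32, and 32 / 8 = 4.
data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = standard_deviation(data)
print(sigma)
```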

CORRELATION

Measures of central tendency, dispersion and skewness describe the nature of a distribution relating to a single variable. One may also be interested in studying the relationship between two or more variables, e.g. income and consumption, price and demand, or quantity of input and output; productivity and wages also depend upon each other.

Two variables may be positively related or negatively related:

Price and supply are positively correlated.
Price and demand are negatively correlated.
Price index and dearness allowance are positively correlated.
Strikes and rate of production are negatively correlated.

Methods of Measuring Correlation

It is not sufficient to know that a correlation exists between two variables; it is also necessary to quantify the extent of the correlation. For the first purpose we make use of scatter diagrams, and for the second we need the value of the coefficient of correlation.

Scatter Diagram: A simple picture of the correlation between two variables is obtained by the use of a scatter diagram.

Values of the independent variable are measured on the X-axis of a graph, and values of the dependent variable on the Y-axis. Each pair of values is then plotted as a dot. When every dot representing a pair of figures has been plotted, we get a scatter diagram.

Coefficient of Correlation

The mathematical technique which describes the covariance in ratio terms is known as the coefficient of correlation. It was initially conceived by the statistician Karl Pearson. Karl Pearson's coefficient of correlation (also known as the product-moment coefficient), generally denoted by 'r', is expressed as follows:

\[ r = \frac{\sum dx\,dy}{N \sigma_x \sigma_y} \]

where
\(\sum dx\,dy\) is the sum of the products of the deviations of the respective observations in the x and y series,
N is the number of items,
\(\sigma_x\) is the standard deviation of the x series, and
\(\sigma_y\) is the standard deviation of the y series.
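The formula can be followed term by term in a short Python sketch. The function name `pearson_r` and the sample series are illustrative; the standard deviations divide by N, matching the definition of standard deviation used in these notes.

```python
from math import sqrt

def pearson_r(x, y):
    """Sum of dx*dy divided by N * sigma_x * sigma_y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sigma_x = sqrt(sum(d * d for d in dx) / n)
    sigma_y = sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma_x * sigma_y)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))
```

For a series that rises in exact proportion, such as x = 1, 2, 3, 4 and y = 2, 4, 6, 8, the same function gives r = +1.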

The value of r determines the degree of correlation between two variables.

'r' always lies between minus one and plus one.

Value of r | Degree of correlation between two variables
-1 | Perfectly negative
+1 | Perfectly positive
0 | No relation
0.10 to 0.25 | Low degree of correlation
0.30 to 0.55 | Moderate correlation
0.60 to 0.99 | High correlation

If the sign before r is minus, the correlation is negative, and if the sign is plus, the correlation is positive.

RANK CORRELATION

Prof. Charles Spearman conceived another coefficient of correlation. This coefficient is expressed as R and is based on the ranking of the various values of the two variables:

\[ R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]

where D is the difference between the two ranks of each item and N is the number of items.
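A minimal Python sketch of the rank formula. It assumes there are no tied values (ties need a correction that is not covered here), ranks the largest value as 1, and uses hypothetical marks of five students in two subjects.

```python
def ranks(values):
    """Rank 1 for the largest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(x, y):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical marks of five students in two subjects.
subject_a = [86, 60, 72, 90, 55]
subject_b = [80, 65, 70, 75, 50]
R = spearman_r(subject_a, subject_b)
print(R)  # rank differences 1, 0, 0, -1, 0, so R = 1 - 12/120
```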

REGRESSION

The term regression was first used by Sir Francis Galton in his studies of 'Inheritance of Stature'. He, along with his friend Karl Pearson, studied the heights of 1,078 sons along with the heights of their fathers. It was found that tall fathers tend to have tall sons and short fathers tend to have short sons, but the average height of the sons of tall fathers was less than the average height of their fathers, and the average height of the sons of short fathers was more than the average height of their fathers. Galton named this tendency 'regression'.

Regression is used to explain the value of one variable with respect to the value of another variable. It explains the functional relationship between the two variables.

The relationship is explained with the help of regression lines.

Regression Lines

The line which shows the functional relationship between the two variables is known as the 'line of best fit'. Since there are two variables, X and Y, there are two regression lines.

The regression line of X on Y gives the value of X when the value of Y is given, whereas the regression line of Y on X gives the value of Y when the value of X is given.

Regression Lines and the Coefficient of Correlation

The regression lines help in estimating the nature and the type of correlation between the two variables. If the two lines of regression coincide, the correlation is perfect. If the two lines intersect at right angles, there is no correlation at all. The slope of the lines determines the nature of the correlation: if the slope of the lines is positive the correlation is positive, and vice versa. The degree of correlation can be ascertained from the angle formed by the two lines.

Regression Equations

The regression equations express the functional relationship between the two variables. As there are two regression lines, there are two regression equations.

i) Regression equation of X on Y: In this equation the probable values of X are estimated with the help of the independent variable Y. Plotting these values on graph paper, we get the line known as the regression line of X on Y.

ii) Regression equation of Y on X: This equation is used to estimate the values of Y when the values of X are given; here the values of Y are dependent on the values of X. The line showing this relationship is known as the regression line of Y on X.

Method of Least Squares

This method is the most useful technique of estimation. It gives the best linear unbiased estimate. The values of the two unknown constants are determined with the help of two normal equations.

Regression equation of X on Y: X = a + bY
Regression equation of Y on X: Y = a + bX

The values of the constants 'a' and 'b' can be estimated from the following normal equations.

Regression of X on Y:

\[ \sum X = Na + b \sum Y \]
\[ \sum XY = a \sum Y + b \sum Y^2 \]

Regression of Y on X:

\[ \sum Y = Na + b \sum X \]
\[ \sum XY = a \sum X + b \sum X^2 \]
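The two normal equations for the regression of Y on X can be solved for a and b in closed form, as the Python sketch below does. The function name `fit_line` and the sample data are illustrative; swapping the two arguments gives the regression of X on Y.

```python
def fit_line(x, y):
    """Solve the normal equations of y = a + b*x for (a, b).

    sum(y)  = N*a      + b*sum(x)
    sum(xy) = a*sum(x) + b*sum(x^2)
    """
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2X.
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]

a_yx, b_yx = fit_line(X, Y)  # regression of Y on X
a_xy, b_xy = fit_line(Y, X)  # regression of X on Y
print(a_yx, b_yx, a_xy, b_xy)
```

Because these points lie exactly on one line, the two regression lines coincide (X = -0.5 + 0.5Y is the same line as Y = 1 + 2X), which is the case of perfect correlation.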

MULTIVARIATES

When you study a single variable, the case is called a univariate case. When there are two variables, it is called a bivariate case. When there are more than two variables, it is called a multivariate case.

Suppose there are n variables. Then the parameters which we studied earlier, such as the mean, variance, covariance and correlation, become

a mean vector,
a variance-covariance matrix, and
a correlation matrix.
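The three multivariate quantities can be sketched in plain Python for a small data set. The function names and the data (four observations on three variables) are hypothetical, and the covariances divide by N to stay consistent with the standard deviation formula used earlier.

```python
from math import sqrt

def mean_vector(rows):
    """One mean per variable (column)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Variances on the diagonal, covariances elsewhere (divisor N)."""
    n = len(rows)
    means = mean_vector(rows)
    cols = list(zip(*rows))
    k = len(cols)
    return [[sum((a - means[i]) * (b - means[j])
                 for a, b in zip(cols[i], cols[j])) / n
             for j in range(k)] for i in range(k)]

def correlation_matrix(rows):
    """Each covariance divided by the two standard deviations."""
    cov = covariance_matrix(rows)
    sd = [sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical data: each row is one observation of three variables.
rows = [(1, 2, 9), (2, 4, 7), (3, 6, 5), (4, 8, 3)]

means = mean_vector(rows)
cov = covariance_matrix(rows)
corr = correlation_matrix(rows)
print(means)
print(cov)
print(corr)
```

In this made-up data the second variable is exactly twice the first and the third falls as the first rises, so the correlation matrix contains only +1 and -1.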