t7 data analysis

Analyzing and interpreting data

By Rama Krishna Kompella

Myths– Complex analysis and big words impress people.

– Analysis comes at the end when there is data to analyze.

– Qualitative analysis is easier than quantitative analysis

– Data have their own meaning

– Stating limitations weakens the evaluation

– Computer analysis is always easier and better

Things aren’t always what we think!

Six blind men go to observe an elephant. One feels the side and thinks the elephant is like a wall. One feels the tusk and thinks the elephant is a like a spear. One touches the squirming trunk and thinks the elephant is like a snake. One feels the knee and thinks the elephant is like a tree. One touches the ear, and thinks the elephant is like a fan. One grasps the tail and thinks it is like a rope. They argue long and loud and though each was partly in the right, all were in the wrong.

For a detailed version of this fable see: http://www.wordinfo.info/words/index/info/view_unit/1/?letter=B&spage=3

Blind men and an elephant

- Indian fable

Data analysis and interpretation• Think about analysis EARLY• Start with a plan• Code, enter, clean• Analyze• Interpret• Reflect

− What did we learn?− What conclusions can we draw?− What are our recommendations?− What are the limitations of our analysis?

Why do I need an analysis plan?• To make sure the questions and your data

collection instrument will get the information you want

• Think about your “report” when you are designing your data collection instruments

Do you want to report…• the number of people who answered each

question?• how many people answered a, b, c, d?• the percentage of respondents who

answered a, b, c, d?• the average number or score?• the mid-point among a range of answers?• a change in score between two points in

time?• how people compared?• quotes and people’s own words

Common descriptive statistics• Count (frequencies)• Percentage• Mean• Mode• Median• Range• Standard deviation• Variance• Ranking

Key components of a data analysis plan

• Purpose of the evaluation• Questions• What you hope to learn from the question• Analysis technique • How data will be presented

Steps in Processing of Data

• Preparing of raw data• Editing

– Field editing– Office editing

• Coding– Establishment of appropriate category– Mutually exclusive

• Tabulation – Sorting and counting– Summarizing of data

Types of Tabulation

• Simple or one-way tabulation– Question with only one response (adds up to 100)– Multiple response to a question ( doesn’t add up

to 100)• Cross tabulation or two-way tabulation

Classification of Data

• Number of groups• Width of the class interval• Exclusive categories• Exhaustive categories• Avoid extremes

Frequency Distribution Tables

Lower Limit

Upper Limit

Frequency Distribution Tables

Measures of Central Tendency

• Measure of central tendency, of a data set is a measure of the "middle" value of the data set

• The mean, median and mode are all valid measures of central tendency

• But, under different conditions, some measures of central tendency become more appropriate to use than others

Mean

• The mean (or average) is the most popular and well known measure of central tendency

• It can be used with both discrete and continuous data, although its use is most often with continuous data

Median & Mode

• The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data.

• The mode is the most frequent score in our data set. On a histogram it represents the highest bar in a bar chart or histogram. You can, therefore, sometimes consider the mode as being the most popular option.

Skewed Distributions

Choosing appropriate measure

Type of Variable Best measure of central tendency

Nominal Mode

Ordinal Median

Interval/Ratio (not skewed) Mean

Interval/Ratio (skewed) Median

How to represent the results

• Graphics should be used whenever practical• Generally used graphics to depict the results

are:– Bar charts– Line charts– Pie / round charts

Measures of Dispersion

Measures of dispersion (or variability or spread) indicate the extent to which the observed

values are “spread out” around that center — how “far apart” observed values typically are

from each other and therefore from some average value (in particular, the mean).

21

Measures of Dispersion

• There are three main measures of dispersion:– The range– The semi-interquartile range (SIR)– Variance / standard deviation

22

The Range

• The range is defined as the difference between the largest score in the set of data and the smallest score in the set of data, XL - XS

• What is the range of the following data:4 8 1 6 6 2 9 3 6 9

• The largest score (XL) is 9; the smallest score (XS) is 1; the range is XL - XS = 9 - 1 = 8

23

When To Use the Range

• The range is used when– you have ordinal data or– you are presenting your results to people with little

or no knowledge of statistics• The range is rarely used in scientific work as it is

fairly insensitive– It depends on only two scores in the set of data, XL

and XS– Two very different sets of data can have the same

range:1 1 1 1 9 vs 1 3 5 7 9

24

The Semi-Interquartile Range

• The semi-interquartile range (or SIR) is defined as the difference of the first and third quartiles divided by two– The first quartile is the 25th percentile– The third quartile is the 75th percentile

• SIR = (Q3 - Q1) / 2

25

SIR Example• What is the SIR for the

data to the right?• 25 % of the scores are

below 5– 5 is the first quartile

• 25 % of the scores are above 25– 25 is the third quartile

• SIR = (Q3 - Q1) / 2 = (25 - 5) / 2 = 10

246

5 = 25th %tile

81012142030

25 = 75th %tile

60

26

When To Use the SIR

• The SIR is often used with skewed data as it is insensitive to the extreme scores

Mean Deviation

The key concept for describing normal distributionsand making predictions from them is calleddeviation from the mean. We could just calculate the average distance between each

observation and the mean.• We must take the absolute value of the distance,

otherwise they would just cancel out to zero!Formula:

| |iX X

n

Mean Deviation: An Example

1. Compute X (Average)2. Compute X – X and take the

Absolute Value to get Absolute Deviations

3. Sum the Absolute Deviations

4. Divide the sum of the absolute deviations by N

X – Xi Abs. Dev.

7 – 6 1

7 – 10 3

7 – 5 2

7 – 4 3

7 – 9 2

7 – 8 1

Data: X = {6, 10, 5, 4, 9, 8} X = 42 / 6 = 7

Total: 12 12 / 6 = 2

What Does it Mean?• On Average, each observation is two units away

from the mean.

Is it Really that Easy?• No!• Absolute values are difficult to manipulate algebraically• Absolute values cause enormous problems for calculus (Discontinuity)• We need something else…

30

Variance

• Variance is defined as the average of the square deviations:

N

X 2

2

31

What Does the Variance Formula Mean?

• First, it says to subtract the mean from each of the scores– This difference is called a deviate or a deviation

score– The deviate tells us how far a given score is from

the typical, or average, score– Thus, the deviate is a measure of dispersion for a

given score

32


• Why can’t we simply take the average of the deviates? That is, why isn’t variance defined as:

N

X2

This is not the formula for variance!

33


• One of the definitions of the mean was that it always made the sum of the scores minus the mean equal to 0

• Thus, the average of the deviates must be 0 since the sum of the deviates must equal 0

• To avoid this problem, statisticians square the deviate score prior to averaging them– Squaring the deviate score makes all the squared

scores positive

34

Computational Formula

• When calculating variance, it is often easier to use a computational formula which is algebraically equivalent to the definitional formula:

NN

N XX

X 2

2

2

2

2 is the population variance, X is a score, is the population mean, and N is the number of scores

35

Computational Formula Example

X X2 X- (X-)2

9 81 2 4

8 64 1 1

6 36 -1 1

5 25 -2 4

8 64 1 1

6 36 -1 1 = 42 = 306 = 0 = 12

36

Computational Formula Example

26

126

2943066

6306

NN

42

XX

2

2

2

2

26

12N

X2

2

37

Variance of a Sample• Because the sample mean is not a perfect

estimate of the population mean, the formula for the variance of a sample is slightly different from the formula for the variance of a population:

1N

XXs

2

2

s2 is the sample variance, X is a score, X is the sample mean, and N is the number of scores

Homework

• The following are test scores from a class of 20 students:

• 96 95 93 89 83 83 81 77 77 77 71 71 70 68 68 65 57 55 48 42

• Find out the measures of central tendency and dispersion

• What do you observe from the values of measures of central tendency?

Q & As

t7 data analysis

Education

data analysis

skewed data

interpreting data

summarizing of data

processing of data

following data

ordinal data

quantitative analysis