t7 data analysis

39
Analyzing and interpreting data By Rama Krishna Kompella

Upload: kompellark

Post on 13-May-2015

1.205 views

Category:

Education


2 download

DESCRIPTION

Introduction to Data Analysis and descriptive measures

TRANSCRIPT

Page 1: T7 data analysis

Analyzing and interpreting data

By Rama Krishna Kompella

Page 2: T7 data analysis

Myths– Complex analysis and big words impress people.

– Analysis comes at the end when there is data to analyze.

– Qualitative analysis is easier than quantitative analysis

– Data have their own meaning

– Stating limitations weakens the evaluation

– Computer analysis is always easier and better

Page 3: T7 data analysis

Things aren’t always what we think!

Six blind men go to observe an elephant. One feels the side and thinks the elephant is like a wall. One feels the tusk and thinks the elephant is a like a spear. One touches the squirming trunk and thinks the elephant is like a snake. One feels the knee and thinks the elephant is like a tree. One touches the ear, and thinks the elephant is like a fan. One grasps the tail and thinks it is like a rope. They argue long and loud and though each was partly in the right, all were in the wrong.

For a detailed version of this fable see: http://www.wordinfo.info/words/index/info/view_unit/1/?letter=B&spage=3

Blind men and an elephant

- Indian fable

Page 4: T7 data analysis

Data analysis and interpretation• Think about analysis EARLY• Start with a plan• Code, enter, clean• Analyze• Interpret• Reflect

− What did we learn?− What conclusions can we draw?− What are our recommendations?− What are the limitations of our analysis?

Page 5: T7 data analysis

Why do I need an analysis plan?• To make sure the questions and your data

collection instrument will get the information you want

• Think about your “report” when you are designing your data collection instruments

Page 6: T7 data analysis

Do you want to report…• the number of people who answered each

question?• how many people answered a, b, c, d?• the percentage of respondents who

answered a, b, c, d?• the average number or score?• the mid-point among a range of answers?• a change in score between two points in

time?• how people compared?• quotes and people’s own words

Page 7: T7 data analysis

Common descriptive statistics• Count (frequencies)• Percentage• Mean• Mode• Median• Range• Standard deviation• Variance• Ranking

Page 8: T7 data analysis

Key components of a data analysis plan

• Purpose of the evaluation• Questions• What you hope to learn from the question• Analysis technique • How data will be presented

Page 9: T7 data analysis

Steps in Processing of Data

• Preparing of raw data• Editing

– Field editing– Office editing

• Coding– Establishment of appropriate category– Mutually exclusive

• Tabulation – Sorting and counting– Summarizing of data

Page 10: T7 data analysis

Types of Tabulation

• Simple or one-way tabulation– Question with only one response (adds up to 100)– Multiple response to a question ( doesn’t add up

to 100)• Cross tabulation or two-way tabulation

Page 11: T7 data analysis

Classification of Data

• Number of groups• Width of the class interval• Exclusive categories• Exhaustive categories• Avoid extremes

Page 12: T7 data analysis

Frequency Distribution Tables

Lower Limit

Upper Limit

Page 13: T7 data analysis

Frequency Distribution Tables

Page 14: T7 data analysis

Measures of Central Tendency

• Measure of central tendency, of a data set is a measure of the "middle" value of the data set

• The mean, median and mode are all valid measures of central tendency

• But, under different conditions, some measures of central tendency become more appropriate to use than others

Page 15: T7 data analysis

Mean

• The mean (or average) is the most popular and well known measure of central tendency

• It can be used with both discrete and continuous data, although its use is most often with continuous data

Page 16: T7 data analysis

Median & Mode

• The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data.

• The mode is the most frequent score in our data set. On a histogram it represents the highest bar in a bar chart or histogram. You can, therefore, sometimes consider the mode as being the most popular option.

Page 17: T7 data analysis

Skewed Distributions

Page 18: T7 data analysis

Choosing appropriate measure

Type of Variable Best measure of central tendency

Nominal Mode

Ordinal Median

Interval/Ratio (not skewed) Mean

Interval/Ratio (skewed) Median

Page 19: T7 data analysis

How to represent the results

• Graphics should be used whenever practical• Generally used graphics to depict the results

are:– Bar charts– Line charts– Pie / round charts

Page 20: T7 data analysis

Measures of Dispersion

Measures of dispersion (or variability or spread) indicate the extent to which the observed

values are “spread out” around that center — how “far apart” observed values typically are

from each other and therefore from some average value (in particular, the mean).

Page 21: T7 data analysis

21

Measures of Dispersion

• There are three main measures of dispersion:– The range– The semi-interquartile range (SIR)– Variance / standard deviation

Page 22: T7 data analysis

22

The Range

• The range is defined as the difference between the largest score in the set of data and the smallest score in the set of data, XL - XS

• What is the range of the following data:4 8 1 6 6 2 9 3 6 9

• The largest score (XL) is 9; the smallest score (XS) is 1; the range is XL - XS = 9 - 1 = 8

Page 23: T7 data analysis

23

When To Use the Range

• The range is used when– you have ordinal data or– you are presenting your results to people with little

or no knowledge of statistics• The range is rarely used in scientific work as it is

fairly insensitive– It depends on only two scores in the set of data, XL

and XS– Two very different sets of data can have the same

range:1 1 1 1 9 vs 1 3 5 7 9

Page 24: T7 data analysis

24

The Semi-Interquartile Range

• The semi-interquartile range (or SIR) is defined as the difference of the first and third quartiles divided by two– The first quartile is the 25th percentile– The third quartile is the 75th percentile

• SIR = (Q3 - Q1) / 2

Page 25: T7 data analysis

25

SIR Example• What is the SIR for the

data to the right?• 25 % of the scores are

below 5– 5 is the first quartile

• 25 % of the scores are above 25– 25 is the third quartile

• SIR = (Q3 - Q1) / 2 = (25 - 5) / 2 = 10

246

5 = 25th %tile

81012142030

25 = 75th %tile

60

Page 26: T7 data analysis

26

When To Use the SIR

• The SIR is often used with skewed data as it is insensitive to the extreme scores

Page 27: T7 data analysis

Mean Deviation

The key concept for describing normal distributionsand making predictions from them is calleddeviation from the mean. We could just calculate the average distance between each

observation and the mean.• We must take the absolute value of the distance,

otherwise they would just cancel out to zero!Formula:

| |iX X

n

Page 28: T7 data analysis

Mean Deviation: An Example

1. Compute X (Average)2. Compute X – X and take the

Absolute Value to get Absolute Deviations

3. Sum the Absolute Deviations

4. Divide the sum of the absolute deviations by N

X – Xi Abs. Dev.

7 – 6 1

7 – 10 3

7 – 5 2

7 – 4 3

7 – 9 2

7 – 8 1

Data: X = {6, 10, 5, 4, 9, 8} X = 42 / 6 = 7

Total: 12 12 / 6 = 2

Page 29: T7 data analysis

What Does it Mean?• On Average, each observation is two units away

from the mean.

Is it Really that Easy?• No!• Absolute values are difficult to manipulate algebraically• Absolute values cause enormous problems for calculus (Discontinuity)• We need something else…

Page 30: T7 data analysis

30

Variance

• Variance is defined as the average of the square deviations:

N

X 2

2

Page 31: T7 data analysis

31

What Does the Variance Formula Mean?

• First, it says to subtract the mean from each of the scores– This difference is called a deviate or a deviation

score– The deviate tells us how far a given score is from

the typical, or average, score– Thus, the deviate is a measure of dispersion for a

given score

Page 32: T7 data analysis

32

What Does the Variance Formula Mean?

• Why can’t we simply take the average of the deviates? That is, why isn’t variance defined as:

N

X2

This is not the formula for variance!

Page 33: T7 data analysis

33

What Does the Variance Formula Mean?

• One of the definitions of the mean was that it always made the sum of the scores minus the mean equal to 0

• Thus, the average of the deviates must be 0 since the sum of the deviates must equal 0

• To avoid this problem, statisticians square the deviate score prior to averaging them– Squaring the deviate score makes all the squared

scores positive

Page 34: T7 data analysis

34

Computational Formula

• When calculating variance, it is often easier to use a computational formula which is algebraically equivalent to the definitional formula:

NN

N XX

X 2

2

2

2

2 is the population variance, X is a score, is the population mean, and N is the number of scores

Page 35: T7 data analysis

35

Computational Formula Example

X X2 X- (X-)2

9 81 2 4

8 64 1 1

6 36 -1 1

5 25 -2 4

8 64 1 1

6 36 -1 1 = 42 = 306 = 0 = 12

Page 36: T7 data analysis

36

Computational Formula Example

26

126

2943066

6306

NN

42

XX

2

2

2

2

26

12N

X2

2

Page 37: T7 data analysis

37

Variance of a Sample• Because the sample mean is not a perfect

estimate of the population mean, the formula for the variance of a sample is slightly different from the formula for the variance of a population:

1N

XXs

2

2

s2 is the sample variance, X is a score, X is the sample mean, and N is the number of scores

Page 38: T7 data analysis

Homework

• The following are test scores from a class of 20 students:

• 96 95 93 89 83 83 81 77 77 77 71 71 70 68 68 65 57 55 48 42

• Find out the measures of central tendency and dispersion

• What do you observe from the values of measures of central tendency?

Page 39: T7 data analysis

Q & As