t7 data analysis
DESCRIPTION
Introduction to Data Analysis and descriptive measuresTRANSCRIPT
Analyzing and interpreting data
By Rama Krishna Kompella
Myths– Complex analysis and big words impress people.
– Analysis comes at the end when there is data to analyze.
– Qualitative analysis is easier than quantitative analysis
– Data have their own meaning
– Stating limitations weakens the evaluation
– Computer analysis is always easier and better
Things aren’t always what we think!
Six blind men go to observe an elephant. One feels the side and thinks the elephant is like a wall. One feels the tusk and thinks the elephant is a like a spear. One touches the squirming trunk and thinks the elephant is like a snake. One feels the knee and thinks the elephant is like a tree. One touches the ear, and thinks the elephant is like a fan. One grasps the tail and thinks it is like a rope. They argue long and loud and though each was partly in the right, all were in the wrong.
For a detailed version of this fable see: http://www.wordinfo.info/words/index/info/view_unit/1/?letter=B&spage=3
Blind men and an elephant
- Indian fable
Data analysis and interpretation• Think about analysis EARLY• Start with a plan• Code, enter, clean• Analyze• Interpret• Reflect
− What did we learn?− What conclusions can we draw?− What are our recommendations?− What are the limitations of our analysis?
Why do I need an analysis plan?• To make sure the questions and your data
collection instrument will get the information you want
• Think about your “report” when you are designing your data collection instruments
Do you want to report…• the number of people who answered each
question?• how many people answered a, b, c, d?• the percentage of respondents who
answered a, b, c, d?• the average number or score?• the mid-point among a range of answers?• a change in score between two points in
time?• how people compared?• quotes and people’s own words
Common descriptive statistics• Count (frequencies)• Percentage• Mean• Mode• Median• Range• Standard deviation• Variance• Ranking
Key components of a data analysis plan
• Purpose of the evaluation• Questions• What you hope to learn from the question• Analysis technique • How data will be presented
Steps in Processing of Data
• Preparing of raw data• Editing
– Field editing– Office editing
• Coding– Establishment of appropriate category– Mutually exclusive
• Tabulation – Sorting and counting– Summarizing of data
Types of Tabulation
• Simple or one-way tabulation– Question with only one response (adds up to 100)– Multiple response to a question ( doesn’t add up
to 100)• Cross tabulation or two-way tabulation
Classification of Data
• Number of groups• Width of the class interval• Exclusive categories• Exhaustive categories• Avoid extremes
Frequency Distribution Tables
Lower Limit
Upper Limit
Frequency Distribution Tables
Measures of Central Tendency
• Measure of central tendency, of a data set is a measure of the "middle" value of the data set
• The mean, median and mode are all valid measures of central tendency
• But, under different conditions, some measures of central tendency become more appropriate to use than others
Mean
• The mean (or average) is the most popular and well known measure of central tendency
• It can be used with both discrete and continuous data, although its use is most often with continuous data
Median & Mode
• The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data.
• The mode is the most frequent score in our data set. On a histogram it represents the highest bar in a bar chart or histogram. You can, therefore, sometimes consider the mode as being the most popular option.
Skewed Distributions
Choosing appropriate measure
Type of Variable Best measure of central tendency
Nominal Mode
Ordinal Median
Interval/Ratio (not skewed) Mean
Interval/Ratio (skewed) Median
How to represent the results
• Graphics should be used whenever practical• Generally used graphics to depict the results
are:– Bar charts– Line charts– Pie / round charts
Measures of Dispersion
Measures of dispersion (or variability or spread) indicate the extent to which the observed
values are “spread out” around that center — how “far apart” observed values typically are
from each other and therefore from some average value (in particular, the mean).
21
Measures of Dispersion
• There are three main measures of dispersion:– The range– The semi-interquartile range (SIR)– Variance / standard deviation
22
The Range
• The range is defined as the difference between the largest score in the set of data and the smallest score in the set of data, XL - XS
• What is the range of the following data:4 8 1 6 6 2 9 3 6 9
• The largest score (XL) is 9; the smallest score (XS) is 1; the range is XL - XS = 9 - 1 = 8
23
When To Use the Range
• The range is used when– you have ordinal data or– you are presenting your results to people with little
or no knowledge of statistics• The range is rarely used in scientific work as it is
fairly insensitive– It depends on only two scores in the set of data, XL
and XS– Two very different sets of data can have the same
range:1 1 1 1 9 vs 1 3 5 7 9
24
The Semi-Interquartile Range
• The semi-interquartile range (or SIR) is defined as the difference of the first and third quartiles divided by two– The first quartile is the 25th percentile– The third quartile is the 75th percentile
• SIR = (Q3 - Q1) / 2
25
SIR Example• What is the SIR for the
data to the right?• 25 % of the scores are
below 5– 5 is the first quartile
• 25 % of the scores are above 25– 25 is the third quartile
• SIR = (Q3 - Q1) / 2 = (25 - 5) / 2 = 10
246
5 = 25th %tile
81012142030
25 = 75th %tile
60
26
When To Use the SIR
• The SIR is often used with skewed data as it is insensitive to the extreme scores
Mean Deviation
The key concept for describing normal distributionsand making predictions from them is calleddeviation from the mean. We could just calculate the average distance between each
observation and the mean.• We must take the absolute value of the distance,
otherwise they would just cancel out to zero!Formula:
| |iX X
n
Mean Deviation: An Example
1. Compute X (Average)2. Compute X – X and take the
Absolute Value to get Absolute Deviations
3. Sum the Absolute Deviations
4. Divide the sum of the absolute deviations by N
X – Xi Abs. Dev.
7 – 6 1
7 – 10 3
7 – 5 2
7 – 4 3
7 – 9 2
7 – 8 1
Data: X = {6, 10, 5, 4, 9, 8} X = 42 / 6 = 7
Total: 12 12 / 6 = 2
What Does it Mean?• On Average, each observation is two units away
from the mean.
Is it Really that Easy?• No!• Absolute values are difficult to manipulate algebraically• Absolute values cause enormous problems for calculus (Discontinuity)• We need something else…
30
Variance
• Variance is defined as the average of the square deviations:
N
X 2
2
31
What Does the Variance Formula Mean?
• First, it says to subtract the mean from each of the scores– This difference is called a deviate or a deviation
score– The deviate tells us how far a given score is from
the typical, or average, score– Thus, the deviate is a measure of dispersion for a
given score
32
What Does the Variance Formula Mean?
• Why can’t we simply take the average of the deviates? That is, why isn’t variance defined as:
N
X2
This is not the formula for variance!
33
What Does the Variance Formula Mean?
• One of the definitions of the mean was that it always made the sum of the scores minus the mean equal to 0
• Thus, the average of the deviates must be 0 since the sum of the deviates must equal 0
• To avoid this problem, statisticians square the deviate score prior to averaging them– Squaring the deviate score makes all the squared
scores positive
34
Computational Formula
• When calculating variance, it is often easier to use a computational formula which is algebraically equivalent to the definitional formula:
NN
N XX
X 2
2
2
2
2 is the population variance, X is a score, is the population mean, and N is the number of scores
35
Computational Formula Example
X X2 X- (X-)2
9 81 2 4
8 64 1 1
6 36 -1 1
5 25 -2 4
8 64 1 1
6 36 -1 1 = 42 = 306 = 0 = 12
36
Computational Formula Example
26
126
2943066
6306
NN
42
XX
2
2
2
2
26
12N
X2
2
37
Variance of a Sample• Because the sample mean is not a perfect
estimate of the population mean, the formula for the variance of a sample is slightly different from the formula for the variance of a population:
1N
XXs
2
2
s2 is the sample variance, X is a score, X is the sample mean, and N is the number of scores
Homework
• The following are test scores from a class of 20 students:
• 96 95 93 89 83 83 81 77 77 77 71 71 70 68 68 65 57 55 48 42
• Find out the measures of central tendency and dispersion
• What do you observe from the values of measures of central tendency?
Q & As