© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved.
Chapter 12
Describing Data
Doing Exploratory Data Analysis
Use EXPLORATORY DATA ANALYSIS (EDA) to search for patterns in your data
Before conducting any inferential statistical test, use EDA to ensure that your data meet the requirements and assumptions of the test you plan to use (e.g., that scores are normally distributed)
Steps involved in the EDA:
1. Organize and summarize your data on a data coding sheet
2. If desired, organize data for computer entry
3. Graph data (bar graph, histogram, line graph, or scatterplot) so that you can visually inspect distributions
This will help you choose the appropriate statistics
4. Display frequency distributions on a histogram, and create a STEMPLOT
5. Examine your graphs for normality or skewness in your distributions
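A stemplot (step 4) can be produced with a few lines of code. The sketch below is only one possible approach: it assumes non-negative integer scores, uses the tens digit as the stem and the ones digit as the leaf, and the function name stemplot and the scores are invented for illustration.

```python
from collections import defaultdict


def stemplot(scores):
    """Return a text stem-and-leaf display for non-negative integer scores.

    Stems are the tens digits and leaves are the ones digits.
    """
    leaves_by_stem = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)
        leaves_by_stem[stem].append(leaf)

    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)


# Hypothetical exam scores, invented for illustration.
scores = [67, 72, 75, 78, 81, 83, 83, 85, 88, 90, 94, 101]
print(stemplot(scores))
```

Read sideways, the rows of leaves give the same impression of shape as a histogram while keeping every individual score visible.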
Graphing Your Data
Bar Graph
Presents data as bars extending from the axis representing the independent variable
Length of each bar determined by value of the dependent variable
Width of each bar has no meaning
Can be used to represent data from single-factor and two-factor designs
Best if independent variable is categorical
Line Graph
Data represented by a series of points connected by a line
Most appropriate for quantitative independent variables
Used to display functional relationships
Line graphs can show different shapes
Positively accelerated: Curve starts flat and becomes progressively steeper as it moves along x-axis
Negatively accelerated: Curve is steep at first and then “levels off” as it moves along x-axis
Once the curve levels off it is said to be asymptotic
A line graph can vary in complexity
A monotonic function represents a uniformly increasing or decreasing function
A nonmonotonic function has reversals in direction
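Whether a set of plotted points is monotonic can be checked by looking for reversals in direction. The sketch below is an illustration only: the function name is_monotonic and the data are hypothetical, and flat stretches (such as an asymptote) are treated as compatible with monotonicity, which is an assumption rather than something the slides specify.

```python
def is_monotonic(values):
    """Return True if successive values never reverse direction.

    Flat stretches (equal neighbors) are allowed, so a curve that
    levels off still counts as monotonic under this definition.
    """
    pairs = list(zip(values, values[1:]))
    never_decreases = all(a <= b for a, b in pairs)
    never_increases = all(a >= b for a, b in pairs)
    return never_decreases or never_increases


# Hypothetical mean scores at successive levels of an independent variable.
levels_off = [10, 18, 23, 26, 27, 27]   # negatively accelerated, then flat
inverted_u = [2, 5, 8, 6, 3]            # rises, then falls

print(is_monotonic(levels_off))
print(is_monotonic(inverted_u))
```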
Scatterplot
Used to represent data from two dependent variables
The value of one dependent variable is represented on the x-axis and the value of the other on the y-axis
Pie Chart
Used to represent proportions or percentages
The Frequency Distribution
Represents a set of mutually exclusive categories into which actual values are classified
Can take the form of a table or a graph
Graphically, a frequency distribution is shown on a histogram
A histogram is a bar graph on which the bars touch
The y-axis represents a frequency count of the number of observations falling into a category
Categories are represented on the x-axis
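A frequency distribution in table form, and a rough text version of its histogram, can be sketched as follows. This assumes integer response categories; the function names frequency_table and text_histogram and the ratings are hypothetical.

```python
from collections import Counter


def frequency_table(scores):
    """Map each response category to the number of observations in it."""
    counts = Counter(scores)
    # List every category between the extremes, including empty ones.
    return {category: counts.get(category, 0)
            for category in range(min(scores), max(scores) + 1)}


def text_histogram(table):
    """Render the table as rows of asterisks, one row per category."""
    return "\n".join(f"{category:>2} {'*' * count} ({count})"
                     for category, count in table.items())


# Hypothetical ratings on a 1-to-7 scale, invented for illustration.
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]
table = frequency_table(ratings)
print(text_histogram(table))
```

Because the categories are mutually exclusive, the frequencies sum to the number of observations.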
Histogram Showing a Normal Distribution
[Figure: histogram; x-axis: Response Categories (1 to 7); y-axis: Frequency (0 to 5)]
Histogram Showing a Positive Skew
[Figure: histogram; x-axis: Response Category (1 to 9); y-axis: Frequency (0 to 6)]
Histogram Showing a Negative Skew
[Figure: histogram; x-axis: Response Category (1 to 9); y-axis: Frequency (0 to 6)]
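The direction of skew seen in a histogram can also be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the cubed standard deviation), which is one common index and not necessarily the one the chapter has in mind. The function name skewness and all three data sets are invented for illustration.

```python
from statistics import mean


def skewness(scores):
    """Moment coefficient of skewness (population form).

    Positive values indicate a longer right tail, negative values a
    longer left tail, and values near zero a roughly symmetric shape.
    Undefined when every score is identical (zero variance).
    """
    n = len(scores)
    center = mean(scores)
    second_moment = sum((x - center) ** 2 for x in scores) / n
    third_moment = sum((x - center) ** 3 for x in scores) / n
    return third_moment / second_moment ** 1.5


# Hypothetical response data, invented for illustration.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
positive_skew = [1, 1, 1, 2, 2, 3, 4, 7, 9]
negative_skew = [10 - x for x in positive_skew]  # mirror image

for label, data in [("symmetric", symmetric),
                    ("positive skew", positive_skew),
                    ("negative skew", negative_skew)]:
    print(f"{label}: {skewness(data):.2f}")
```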
A Bimodal Distribution
[Figure: histogram; x-axis: Grade category (55 to 95 in steps of 5); y-axis: Frequency (0 to 20)]
Measures of Center: Characteristics and Applications
Mode
Most frequent score in a distribution
Simplest measure of center
Scores other than the most frequent not considered
Limited application and value
Median
Central score in an ordered distribution
More information taken into account than with the mode
Relatively insensitive to outliers
Used primarily when the mean cannot be used
Mean
Average of all scores in a distribution
Value dependent on each score in a distribution
Most widely used and informative measure of center
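The three measures of center, and their differing sensitivity to an outlier, can be illustrated with Python's standard statistics module. The helper name center_summary and the scores are hypothetical. Note that statistics.mode returns the first mode encountered when there are ties (Python 3.8 and later).

```python
from statistics import mean, median, mode


def center_summary(scores):
    """Return the mode, median, and mean of a list of scores."""
    return {
        "mode": mode(scores),      # most frequent score
        "median": median(scores),  # central score after ordering
        "mean": mean(scores),      # arithmetic average of all scores
    }


# Hypothetical scores, invented for illustration.
scores = [2, 3, 3, 4, 5, 6, 7]
with_outlier = scores + [40]

print(center_summary(scores))
print(center_summary(with_outlier))
```

Adding the single extreme score leaves the mode unchanged and moves the median only from 4 to 4.5, while the mean roughly doubles, because the mean depends on the value of every score.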
Measures of Center: Applications
Mode
Used if data are measured along a nominal scale
Median
Used if data are measured along an ordinal scale
Used if interval data do not meet requirements for using the mean
Mean
Used if data are measured along an interval or ratio scale
Most sensitive measure of center
Used if scores are normally distributed
Measures of Spread: Characteristics
Range
Subtract the lowest from the highest score in a distribution of scores
Simplest and least informative measure of spread
Scores between extremes are not taken into account
Very sensitive to extreme scores
Semi-Interquartile Range
Less sensitive than the range to extreme scores
Used when you want a simple, rough estimate of spread
Variance
Average squared deviation of scores from the mean
Standard Deviation
Square root of the variance
Most widely used measure of spread
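The four measures of spread can be computed as sketched below. Several choices here are assumptions: the variance divides by N, matching the "average squared deviation" wording (many texts divide by N - 1 when estimating from a sample); the semi-interquartile range is taken as half the distance between the first and third quartiles; and quartiles use the "inclusive" method of statistics.quantiles, one of several conventions. The name spread_summary and the scores are hypothetical.

```python
from math import sqrt
from statistics import quantiles


def spread_summary(scores):
    """Return the range, semi-interquartile range, variance, and SD."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    mean = sum(scores) / len(scores)
    # Average squared deviation from the mean (divides by N).
    variance = sum((x - mean) ** 2 for x in scores) / len(scores)
    return {
        "range": max(scores) - min(scores),
        "semi_interquartile_range": (q3 - q1) / 2,
        "variance": variance,
        "standard_deviation": sqrt(variance),
    }


# Hypothetical scores, invented for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(spread_summary(scores))
```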
Measures of Spread: Applications
The range and standard deviation are sensitive to extreme scores
When extreme scores are present, the semi-interquartile range is best
When your distribution of scores is skewed, the standard deviation does not provide a good index of spread
With a skewed distribution, use the semi-interquartile range
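The effect of one extreme score on each measure can be seen in a small comparison. This is a sketch with invented data; the helper name semi_interquartile_range is hypothetical, the quartile method ("inclusive") is an assumed convention, and pstdev is the population standard deviation from Python's statistics module.

```python
from statistics import pstdev, quantiles


def semi_interquartile_range(scores):
    """Half the distance between the first and third quartiles."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return (q3 - q1) / 2


# Hypothetical scores; the second set adds a single extreme score.
typical = [2, 4, 4, 4, 5, 5, 7, 9]
with_extreme = typical + [60]

for label, data in [("typical", typical), ("with extreme", with_extreme)]:
    print(f"{label}: range={max(data) - min(data)}, "
          f"SD={pstdev(data):.2f}, "
          f"SIQR={semi_interquartile_range(data):.2f}")
```

In this example the range and standard deviation increase several-fold when the extreme score is added, while the semi-interquartile range changes far less.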
The Five Number Summary and Box Plots
Five Number Summary
Convenient way to represent a distribution with a few numbers
Statistics included
Minimum score
First quartile
Median (second quartile)
Third quartile
Maximum score
Example of a Five Number Summary
Maximum              132
Third Quartile (Q3)  110
Median (Q2)          101
First Quartile (Q1)   90
Minimum               67
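A five number summary can be computed as sketched below. The scores are hypothetical: they were chosen so that the result matches the example values above and are not the data behind the original slide. Quartile conventions differ between textbooks and software; the "inclusive" method of statistics.quantiles is assumed here, and the name five_number_summary is invented.

```python
from statistics import quantiles


def five_number_summary(scores):
    """Return the minimum, quartiles, and maximum of a list of scores."""
    q1, med, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "minimum": min(scores),
        "q1": q1,
        "median": med,
        "q3": q3,
        "maximum": max(scores),
    }


# Hypothetical IQ scores chosen to reproduce the example summary.
iq_scores = [67, 85, 90, 95, 101, 104, 110, 118, 132]
print(five_number_summary(iq_scores))
```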
Boxplot
Graphic representation of the five number summary
First and third quartiles define the ends of the box
A line in the box represents the median
Vertical “whiskers” extending above and below the box represent the maximum and minimum scores (respectively)
Data from multiple treatments are represented by side-by-side boxplots
Example of a Boxplot
[Figure: boxplot; y-axis: IQ (0 to 150)]
The Pearson Product–Moment Correlation (r)
Most widely used measure of correlation
Value of r can range from +1 through 0 to –1
Magnitude of r tells you the degree of LINEAR relationship between variables
Sign of r tells you the direction (positive or negative) of the relationship between variables
Presence of outliers can affect the sign and magnitude of r
Variability of scores within a distribution affects the value of r
Used when scores are normally distributed
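The points above can be illustrated by computing r directly from its definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name pearson_r and all data are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical paired scores, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(f"r = {pearson_r(xs, ys):.2f}")

# One outlying pair added to the same data.
print(f"r with outlier = {pearson_r(xs + [10], ys + [0]):.2f}")

# A perfect but curvilinear (U-shaped) relationship.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
print(f"r for U-shaped data = {pearson_r(curve_x, curve_y):.2f}")
```

In this example the single outlying pair reverses the sign of r, and the U-shaped data yield an r of zero even though Y is completely determined by X, because r indexes only the linear component of a relationship.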
Measures of Association
Pearson Product-Moment Correlation
Index of linear relationship between two continuously measured variables
Point-Biserial Correlation
Index of correlation between two variables, one of which is measured on a dichotomous nominal scale and the other on at least an interval scale
Spearman Rank-Order Correlation
Index of correlation between two variables measured along an ordinal scale
Phi Coefficient
Index of correlation between two variables, both measured along a dichotomous nominal scale
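One way to see how these indices relate: the point-biserial, Spearman, and phi coefficients can each be obtained by applying the Pearson formula to suitably coded data (0/1 codes for a two-category variable, ranks for ordinal data). The sketch below assumes that coding; the function names and all data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


def average_ranks(values):
    """Rank values from 1 (smallest); ties share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for value in set(values):
        first_position = ordered.index(value) + 1
        tie_count = ordered.count(value)
        rank_of[value] = first_position + (tie_count - 1) / 2
    return [rank_of[value] for value in values]


def spearman_rho(xs, ys):
    """Spearman rank-order correlation: Pearson r computed on ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical data, invented for illustration.
# Point-biserial: group membership coded 0/1, paired with interval scores.
group = [0, 0, 0, 1, 1, 1]
score = [2, 3, 4, 6, 7, 8]
print(f"point-biserial = {pearson_r(group, score):.2f}")

# Spearman: a monotonic but nonlinear relationship.
dose = [1, 2, 3, 4, 5]
response = [1, 4, 9, 16, 25]
print(f"Spearman rho = {spearman_rho(dose, response):.2f}")

# Phi: two variables, each coded 0/1.
passed_a = [1, 1, 1, 0, 0, 0, 1, 0]
passed_b = [1, 1, 0, 0, 0, 1, 1, 0]
print(f"phi = {pearson_r(passed_a, passed_b):.2f}")
```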
Linear Regression and Prediction
Used to find the straight line that best fits the data plotted on a scatterplot
The best fitting straight line is known as the least squares regression line
The regression line is defined mathematically:
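$\hat{Y} = a + bX$
where $\hat{Y}$ is the predicted value of Y, a is the Y-intercept, and b is the slope (the regression weight)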
The regression weight (b) is based on raw scores and is difficult to interpret
The standardized regression weight (beta weight) is based on standard scores and is easier to interpret
You can predict a value of Y from a value of X once the regression equation has been calculated
The difference between a predicted and an observed value of Y is a residual (error of prediction)
The standard error of estimate summarizes the typical size of these differences
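The sketch below fits a least squares line, predicts Y from X, and computes a standard error of estimate and a standardized weight. Assumptions: the slope is the sum of cross-products divided by the sum of squared X deviations, the intercept is the mean of Y minus the slope times the mean of X, and the standard error of estimate divides the sum of squared residuals by N - 2 (some texts divide by N). All function names and data are hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + b * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    b = sum_xy / sum_xx          # slope (raw-score regression weight)
    a = mean_y - b * mean_x      # Y-intercept
    return a, b


def predict(a, b, x):
    """Predicted value of Y for a given value of X."""
    return a + b * x


def standard_error_of_estimate(xs, ys, a, b):
    """Typical size of the residuals (observed Y minus predicted Y)."""
    residuals = [y - predict(a, b, x) for x, y in zip(xs, ys)]
    return sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))


def standardized_weight(xs, ys, b):
    """Beta weight: the slope re-expressed in standard-score units."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return b * sqrt(sum_xx / sum_yy)


# Hypothetical paired scores, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"Y-hat = {a:.2f} + {b:.2f} * X")
print(f"predicted Y at X = 6: {predict(a, b, 6):.2f}")
print(f"standard error of estimate: {standard_error_of_estimate(xs, ys, a, b):.2f}")
print(f"beta weight: {standardized_weight(xs, ys, b):.2f}")
```

With a single predictor, the beta weight equals the Pearson r between X and Y, which is part of why it is easier to interpret than the raw-score weight.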