© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved.

Chapter 12

Describing Data


Doing Exploratory Data Analysis

Use EXPLORATORY DATA ANALYSIS (EDA) to search for patterns in your data

Before conducting any inferential test, use EDA to ensure that your data meet the requirements and assumptions of the test you are planning to use (e.g., that scores are normally distributed)


Steps involved in the EDA:

1. Organize and summarize your data on a data coding sheet

2. If desired, organize data for computer entry

3. Graph data (bar graph, histogram, line graph, or scatterplot) so that you can visually inspect distributions

This will help you choose the appropriate statistics

4. Display frequency distributions on a histogram, and create a STEMPLOT

5. Examine your graphs for normality or skewness in your distributions
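Step 4 can be sketched in code. The example below is a minimal illustration, not a prescribed procedure: it assumes non-negative whole-number scores, uses the tens digit as the stem and the ones digit as the leaf, and the scores themselves are invented for illustration. The function name stemplot is hypothetical.

```python
from collections import defaultdict

def stemplot(scores):
    """Return the lines of a text stemplot (stem = tens digit, leaf = ones digit)."""
    leaves = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)
        leaves[stem].append(str(leaf))
    lines = []
    # Walk every stem from lowest to highest so gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(leaves[stem])}")
    return lines

# Hypothetical exam scores, invented for illustration only.
scores = [67, 72, 75, 78, 81, 83, 83, 86, 90, 94]
for line in stemplot(scores):
    print(line)
```

Reading the leaves from top to bottom gives a quick view of the shape of the distribution while keeping every individual score visible.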


Graphing Your Data

Bar Graph

Presents data as bars extending from the axis representing the independent variable

Length of each bar determined by value of the dependent variable

Width of each bar has no meaning

Can be used to represent data from single-factor and two-factor designs

Best if independent variable is categorical


Line Graph

Data represented by a series of points connected by a line

Most appropriate for quantitative independent variables

Used to display functional relationships

Line graphs can show different shapes

Positively accelerated: Curve starts flat and becomes progressively steeper as it moves along x-axis

Negatively accelerated: Curve is steep at first and then “levels off” as it moves along x-axis

Once the curve levels off it is said to be asymptotic


A line graph can vary in complexity

A monotonic function represents a uniformly increasing or decreasing function

A nonmonotonic function has reversals in direction
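The monotonic versus nonmonotonic distinction can be checked on the plotted points. This is a rough sketch under one assumption not stated on the slide: flat steps are allowed, so a curve that levels off still counts as monotonic. The function name is_monotonic is hypothetical.

```python
def is_monotonic(ys):
    """True if successive values never reverse direction (flat steps allowed)."""
    steps = [later - earlier for earlier, later in zip(ys, ys[1:])]
    never_decreases = all(step >= 0 for step in steps)
    never_increases = all(step <= 0 for step in steps)
    return never_decreases or never_increases

# Hypothetical y-values read off a line graph from left to right.
print(is_monotonic([1, 2, 4, 8]))   # rises throughout
print(is_monotonic([9, 5, 5, 1]))   # falls, with one flat step
print(is_monotonic([1, 3, 2]))      # reverses direction
```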

Scatterplot

Used to represent data from two dependent variables

The value of one dependent variable is represented on the x-axis and the value of the other on the y-axis

Pie Chart

Used to represent proportions or percentages


The Frequency Distribution

Represents a set of mutually exclusive categories into which actual values are classified

Can take the form of a table or a graph

Graphically, a frequency distribution is shown on a histogram

A bar graph on which the bars touch

The y-axis represents a frequency count of the number of observations falling into a category

Categories represented on the x-axis
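A frequency distribution in table form, and a rough text version of its histogram, can be sketched as follows. The responses are invented for illustration, and the function names frequency_table and text_histogram are hypothetical.

```python
from collections import Counter

def frequency_table(scores):
    """Return (category, frequency) pairs ordered by category."""
    counts = Counter(scores)
    return [(category, counts[category]) for category in sorted(counts)]

def text_histogram(scores):
    """Return one line per category, with one '#' per observation."""
    return [f"{category}: {'#' * count}" for category, count in frequency_table(scores)]

# Hypothetical responses on a 1-to-7 scale, invented for illustration only.
responses = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]
for line in text_histogram(responses):
    print(line)
```

Each response falls into exactly one category, which is what makes the categories mutually exclusive.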


Histogram Showing a Normal Distribution

[Figure: histogram. x-axis: Response Categories (1 to 7); y-axis: Frequency (0 to 5)]

Histogram Showing a Positive Skew

[Figure: histogram. x-axis: Response Category (1 to 9); y-axis: Frequency (0 to 6)]


Histogram Showing a Negative Skew

[Figure: histogram. x-axis: Response Category (1 to 9); y-axis: Frequency (0 to 6)]


A Bimodal Distribution

[Figure: histogram. x-axis: Grade category (55 to 95, in steps of 5); y-axis: Frequency (0 to 20)]


Measures of Center: Characteristics and Applications

Mode

Most frequent score in a distribution

Simplest measure of center

Scores other than the most frequent not considered

Limited application and value

Median

Central score in an ordered distribution

More information taken into account than with the mode

Relatively insensitive to outliers

Used primarily when the mean cannot be used

Mean

Average of all scores in a distribution

Value dependent on each score in a distribution

Most widely used and informative measure of center


Measures of Center: Applications

Mode

Used if data are measured along a nominal scale

Median

Used if data are measured along an ordinal scale (the median requires scores that can be ordered, so it does not apply to nominal data)

Used if interval data do not meet requirements for using the mean


Mean

Used if data are measured along an interval or ratio scale

Most sensitive measure of center

Used if scores are normally distributed
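The three measures of center can be compared on one small data set. This sketch uses Python's standard statistics module; the scores are invented for illustration and include one deliberate outlier (40).

```python
import statistics

# Hypothetical scores with one extreme value, invented for illustration only.
scores = [2, 3, 3, 4, 5, 6, 40]

center = {
    "mode": statistics.mode(scores),      # most frequent score
    "median": statistics.median(scores),  # central score of the ordered distribution
    "mean": statistics.mean(scores),      # sum of all scores divided by their number
}
print(center)
```

The single outlier pulls the mean up to 9, while the median stays at 4. This is why the median is preferred when the scores do not meet the requirements for using the mean.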


Measures of Spread: Characteristics

Range

Subtract the lowest from the highest score in a distribution of scores

Simplest and least informative measure of spread

Scores between extremes are not taken into account

Very sensitive to extreme scores

Semi-Interquartile Range

Less sensitive than the range to extreme scores

Used when you want a simple, rough estimate of spread


Variance

Average squared deviation of scores from the mean

Standard Deviation

Square root of the variance

Most widely used measure of spread
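The measures of spread above can be computed as follows. This is a sketch under stated assumptions: the semi-interquartile range is taken as half the distance between the first and third quartiles, the quartiles use the "inclusive" method of statistics.quantiles (other quartile conventions give slightly different values), and the variance divides by the number of scores, matching the "average squared deviation" wording. A sample estimate would divide by one less than the number of scores. The function name describe_spread and the scores are hypothetical.

```python
import statistics

def describe_spread(scores):
    """Return range, semi-interquartile range, variance, and standard deviation."""
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "range": max(scores) - min(scores),
        "semi_interquartile_range": (q3 - q1) / 2,
        # Population forms: average squared deviation from the mean, and its square root.
        "variance": statistics.pvariance(scores),
        "standard_deviation": statistics.pstdev(scores),
    }

# Hypothetical scores, invented for illustration only.
spread = describe_spread([2, 4, 4, 4, 5, 5, 7, 9])
print(spread)
```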


Measures of Spread: Applications

The range and standard deviation are sensitive to extreme scores

In such cases the semi-interquartile range is best

When your distribution of scores is skewed, the standard deviation does not provide a good index of spread

With a skewed distribution, use the semi-interquartile range


The Five Number Summary and Box Plots

Five Number Summary

Convenient way to represent a distribution with a few numbers

Statistics included

Minimum score

The first quartile

The median (second quartile)

Third quartile

Maximum score


Example of a Five Number Summary

Maximum               132

Third Quartile (Q3)   110

Median (Q2)           101

First Quartile (Q1)    90

Minimum                67
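A five number summary can be computed with Python's standard statistics module. This is a sketch, not the slide's own calculation: the raw scores below are hypothetical, chosen only so that the result matches the example values above, and the quartiles use the "inclusive" method (other quartile conventions can give slightly different results). The function name five_number_summary is hypothetical.

```python
import statistics

def five_number_summary(scores):
    """Return minimum, Q1, median (Q2), Q3, and maximum of the scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "minimum": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": max(scores),
    }

# Hypothetical scores, invented so the summary matches the example table.
scores = [67, 85, 90, 95, 101, 104, 110, 120, 132]
summary = five_number_summary(scores)
print(summary)
```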


Boxplot

Graphic representation of the five number summary

First and third quartile define the ends of the box

A line in the box represents the median

Vertical “whiskers” extending above and below the box represent the maximum and minimum scores (respectively)

Data from multiple treatments are represented by side-by-side boxplots


Example of a Boxplot

[Figure: boxplot. y-axis: IQ (0 to 150)]


The Pearson Product–Moment Correlation (r)

Most widely used measure of correlation

Value of r can range from +1 through 0 to –1

Magnitude of r tells you the degree of LINEAR relationship between variables

Sign of r tells you the direction (positive or negative) of the relationship between variables


Presence of outliers affects the sign and magnitude of r

Variability of scores within a distribution affects the value of r

Used when scores are normally distributed
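The points about r can be illustrated with a short calculation. This sketch uses the deviation-score form of the Pearson formula; the function name pearson_r and the data are hypothetical, invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data: a perfect linear relationship and a perfect curved one.
linear_r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
curved_r = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(linear_r, curved_r)
```

In the second data set Y is completely determined by X (Y is X squared), yet r is zero. This is the sense in which r indexes only the LINEAR relationship.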


Measures of Association

Pearson Product-Moment Correlation

Index of linear relationship between two continuously measured variables

Point-Biserial Correlation

Index of correlation between two variables, one of which is measured on a nominal scale and the other on at least an interval scale

Spearman Rank-Order Correlation

Index of correlation between two variables measured along an ordinal scale

Phi Coefficient

Index of correlation between two variables measured along a nominal scale
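Two of these indices are simple enough to sketch directly. The example assumes the common textbook formulas, which are not given on the slide: Spearman's rho computed from squared rank differences (valid only when there are no tied scores), and phi computed from the four cell counts of a 2 x 2 table. The function names and data are hypothetical.

```python
import math

def spearman_rho(xs, ys):
    """Spearman rank-order correlation; assumes no tied scores in xs or ys."""
    n = len(xs)
    rank_x = {value: rank for rank, value in enumerate(sorted(xs), start=1)}
    rank_y = {value: rank for rank, value in enumerate(sorted(ys), start=1)}
    d_squared = sum((rank_x[x] - rank_y[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 frequency table with rows (a, b) and (c, d)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical data, invented for illustration only.
print(spearman_rho([10, 20, 30, 40], [1, 4, 9, 16]))  # identical rank orders
print(phi_coefficient(10, 0, 0, 10))                  # all cases on one diagonal
print(phi_coefficient(5, 5, 5, 5))                    # no association
```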


Linear Regression and Prediction

Used to find the straight line that best fits the data plotted on a scatterplot

The best fitting straight line is known as the least squares regression line

The regression line is defined mathematically:

$\hat{Y} = a + bX$


The regression weight (b) is based on raw scores and is difficult to interpret

The standardized regression weight (beta weight) is based on standard scores and is easier to interpret

You can predict a value of Y from a value of X once the regression equation has been calculated

The difference between a predicted and an observed value of Y is a residual; the standard error of estimate summarizes the typical size of these differences
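Fitting and using a regression line can be sketched as follows. The code assumes the usual least squares formulas for the slope b and intercept a, and computes the standard error of estimate with n - 2 in the denominator, which is one common convention (some texts divide by n). The function names and data are hypothetical, invented for illustration.

```python
import math

def least_squares_line(xs, ys):
    """Return intercept a and slope b of the least squares line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    b = sum_xy / sum_xx
    a = mean_y - b * mean_x
    return a, b

def standard_error_of_estimate(xs, ys, a, b):
    """Square root of the summed squared residuals divided by n - 2."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))

# Hypothetical paired scores, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
see = standard_error_of_estimate(xs, ys, a, b)
print(a, b)        # intercept and regression weight
print(a + b * 6)   # predicted Y for X = 6
print(see)
```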
