Data Entry Data Management Basic Descriptive Statistics Jamie Lynn Marincic Leanne Hicks Survey, Statistics, and Psychometrics Core Facility (SSP) July 19-20, 2007 S P S S Statistical Package for the Social Sciences


Page 1:

Data Entry
Data Management
Basic Descriptive Statistics

Jamie Lynn Marincic
Leanne Hicks

Survey, Statistics, and Psychometrics Core Facility (SSP)
July 19-20, 2007

SPSS
Statistical Package for the Social Sciences

Page 2:

Outline

Thinking about Data: variable types, levels of measurement, coding survey data

Data Entry: entering raw data and importing data

Frequencies: taking a quick look at your data

Data Management: computing and recoding variables

Measures of Central Tendency: mean, median, mode

Analyzing Subsets of your Data

Page 3:

Thinking about Data

Page 4:

Variable Types

Variables which record a response as a set of categories are termed categorical or qualitative.

e.g., ethnic group, religion, marital status, gender, birth order

Variables which record a response that has a numeric meaning are termed numerical or quantitative.

e.g., scores on tests of intelligence, pounds, seconds, dollars, age

Furthermore, numerical or quantitative variables are either continuous or discrete.

Numerical or quantitative data are discrete when only distinct, countable values are possible (typically whole numbers). Fraction or decimal values are usually not meaningful (e.g., ½ a person, ½ a defect).

Numerical or quantitative data are continuous when they can be measured on a continuum or a scale. Fraction or decimal values are meaningful (e.g., ½ a dollar (i.e., $.50), ½ an inch).

Page 5:

Levels of Measurement

Nominal scales involve the simple classification of subjects into categories. These scales lack an inherent order.

e.g., ethnic group, religion, marital status, gender

Ordinal scales involve the simple classification of subjects into categories that have an inherent order. These scales do not have either equal intervals or a true zero point.

e.g., birth order

Interval scales have equal intervals but are measured from an arbitrary point.

e.g., scores on tests of intelligence, achievement, personality

Ratio scales have equal intervals with a true zero point, a point at which there is none of whatever the scale is measuring.

e.g., pounds, seconds, size of group, dollars, age

Page 6:

Synthesis

Variable Type: Qualitative (Categorical). Level of Measurement: Nominal or Ordinal.

Variable Type: Quantitative (Numerical), either Discrete or Continuous. Level of Measurement: Interval or Ratio.

Page 7:

Coding Survey Data

Question assessing respondent religious denomination…

1. Do you consider yourself to be Protestant, Catholic, Jewish, Muslim, something else, or do you consider yourself to have no religious affiliation?

__ Protestant
__ Catholic
__ Jewish
__ Muslim
__ Other ___________________________
__ No religious affiliation

Variable Type: qualitative/categorical

Level of Measurement: nominal

Question assessing respondent dis/agreement…

2. Attending this presentation was worthwhile.

__ Strongly Agree
__ Agree
__ Neither Agree nor Disagree
__ Disagree
__ Strongly Disagree

Variable Type: quantitative/numerical and discrete

Level of Measurement: ordinal treated as interval

Page 8:

Coding Survey Data

When creating response options, consider the construct you are measuring (e.g., knowledge, dis/agreement).

Is it possible for a respondent to lack the construct entirely (e.g., knowledge)?

Does the construct have an inherent opposite? (i.e., will the scale be unipolar or bipolar?)

The knowledge scale is unipolar. (0, 1, 2, 3, 4, …)

The agree/disagree scale is bipolar. (…, -3, -2, -1, 0, 1, 2, 3, …)

Based on this information, we can create meaningful numeric codes for our data.

__ Very knowledgeable
__ Somewhat knowledgeable
__ Not very knowledgeable
__ Not at all knowledgeable

3 Very knowledgeable
2 Somewhat knowledgeable
1 Not very knowledgeable
0 Not at all knowledgeable

__ Strongly Agree
__ Agree
__ Neither Agree nor Disagree
__ Disagree
__ Strongly Disagree

2 Strongly Agree
1 Agree
0 Neither Agree nor Disagree
-1 Disagree
-2 Strongly Disagree
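The two coding schemes above can be written down as simple lookup tables. The following is a minimal Python sketch, not SPSS syntax; the dictionary and function names are hypothetical and chosen only for illustration, while the numeric codes are the ones shown on the slide.

```python
# Numeric codes taken from the slide; all names here are illustrative assumptions.
KNOWLEDGE_CODES = {  # unipolar scale: 0 means none of the construct
    "Not at all knowledgeable": 0,
    "Not very knowledgeable": 1,
    "Somewhat knowledgeable": 2,
    "Very knowledgeable": 3,
}

AGREEMENT_CODES = {  # bipolar scale: centered on a neutral 0
    "Strongly Disagree": -2,
    "Disagree": -1,
    "Neither Agree nor Disagree": 0,
    "Agree": 1,
    "Strongly Agree": 2,
}


def code_responses(responses, codes):
    """Translate response labels into numeric codes; unknown labels become None."""
    return [codes.get(label) for label in responses]


coded = code_responses(["Agree", "Strongly Disagree", "No answer"], AGREEMENT_CODES)
print(coded)  # [1, -2, None]
```

Mapping unrecognized labels to None keeps entry errors visible instead of silently assigning them a valid code.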

Page 9:

Data Entry

Page 10:

Opening SPSS

Find the program under Start > Programs

or

Click on the desktop icon

Page 11:

Opening SPSS

If you will be working with SPSS a lot, it might be worth your time to flip through this tutorial.

‘Help’ buttons placed throughout the program will take you to the appropriate section of the tutorial.

We will begin by learning how to enter our own data.

Page 12:

Data Entry: Variable View

Create variables in Variable View

Page 13:

Data Entry: Name and Type

Name: Meaningful variable name

Use ‘_’ if spaces desired

Type: Variable type

Default is ‘Numeric’, ‘String’ also common

Can also specify width of variable (maximum number of characters) if ‘String’ or specify number of decimal places displayed if ‘Numeric’

Notice the ‘Help’ button.

Page 14:

Data Entry: Label and Values

Label: ‘String’ label for variable name

e.g., variable name ‘income’ might be labeled ‘annual income for household’

Values: Numeric value assignments for categorical data

e.g., gender coded as 0/1 where male=0 and female=1

Commonly use the label of value ‘1’ as name for dichotomous variables (i.e., two-category nominal variables)

e.g., if females coded as ‘1’, then name variable ‘female’

Page 15:

Data Entry: Missing

Missing: Coding of missing data

System-missing values are values automatically recognized as missing by SPSS (i.e., blank/empty fields or cells).

User-missing values are numeric values that need to be defined as missing for SPSS (e.g., ‘7’: N/A, ‘8’: Don’t Know, ‘-99’: Missing)
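To see why user-missing codes must be declared, consider what happens if they are treated as real values. The sketch below is plain Python rather than SPSS; the function name, the use of None for a blank cell, and the toy responses are assumptions made for illustration, and the codes 7, 8, and -99 follow the slide's examples.

```python
# Hypothetical user-missing codes, following the examples on the slide.
USER_MISSING = {7, 8, -99}


def valid_values(values, user_missing=USER_MISSING):
    """Drop blank cells (None, standing in for system-missing) and user-missing codes."""
    return [v for v in values if v is not None and v not in user_missing]


# Toy responses on a 1-3 scale, with one blank cell and three user-missing codes.
raw = [1, 2, 7, None, 3, -99, 2, 8]

valid = valid_values(raw)
print(valid)                    # [1, 2, 3, 2]
print(sum(valid) / len(valid))  # 2.0

# Mean if the codes 7, 8, and -99 were mistakenly treated as real answers.
entered = [v for v in raw if v is not None]
naive_mean = sum(entered) / len(entered)
print(round(naive_mean, 2))     # -10.86
```

The naive mean falls far outside the 1-3 response range, which is the kind of distortion that defining user-missing values prevents.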

Page 16:

Data Entry: Measure

Measure: Level of measurement of variable

i.e., nominal, ordinal, scale (interval/ratio)

Influences the analyses you conduct

___________________________________

Note: You can copy and paste similar variable attributes.

Questions so far?

Page 17:

Data Entry: Data View

Enter data in Data View

Columns are variables/Rows are observations (i.e., respondents)

For un-named variables, SPSS uses the default sequence VAR00001, VAR00002, etc.

Right click to insert/delete an observation or a variable

Notice that the toolbar and menu are the same in both windows

Page 18:

STOP!!

Once you have created a complete data set, save one version to never be modified and create a second version with which you will work.

Page 19:

Importing Data from Excel

File > Open > Data

Select file from appropriate location

Be sure to select ‘Files of type: All files (*.*)’

Page 20:

Importing Data: Opening File Options

Select ‘Read variable names’ if variable names appear in the first row of the Excel spreadsheet

Indicate the desired range of the spreadsheet to be imported

Click ‘OK’

Once the data are imported, be sure to save your new data set.

Page 21:

Frequencies

Page 22:

Frequencies

From either Variable View or Data View screen…

Analyze > Descriptive Statistics > Frequencies

Select desired variable(s) and move to Variable(s) box by clicking on the arrow

‘Display frequency tables’ should be checked

Click ‘Paste’ to save your command to the syntax window

Page 23:

Frequencies: Syntax

This is the syntax to obtain frequencies for the variable ‘age’.

FREQUENCIES VARIABLES=age /ORDER= ANALYSIS .

Why use syntax?

_____________________________________________

Allows you to save your work (i.e., the analyses you perform).

Makes it easy to reproduce common analyses with different variables or combinations of variables. Simply copy and paste syntax and replace necessary variables.

Notice that each command ends with a ‘.’.

Page 24:

Frequencies: Syntax Comments

You can also make comments in your syntax to remind yourself of what you were doing and when.

Begin a comment with an ‘*’ and end a comment with a ‘.’.

Syntax files (*.sps) are independent from data files (*.sav) so they must be saved separately.

i.e., you can run the same syntax file with different data sets

Page 25:

Frequencies: Run

Highlight desired syntax and push blue triangle to ‘run’ the syntax and obtain output.

You can also click ‘Run’ from the drop-down menu and choose to run the entire syntax file (‘All’), a selection of the syntax file (‘Selection’), the current syntax (i.e., the block of syntax in which your cursor rests) (‘Current’), or the syntax appearing from the point of your cursor until the end of the syntax file (‘To End’).

Page 26:

Frequencies: Output

Statistics

age
N   Valid     28628
    Missing     136

There are N=28,628 valid observations (i.e., non-missing observations) and 136 missing cases for a total of 28,764 cases.

For example, there are 1131 observations of age 27 which account for 3.9% of the total number of cases (28,764) and which account for 4.0% of the total number of non-missing cases (28,628). Finally, 22.1% of the observations are age 27 or younger.
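The three percentage columns of a frequency table differ only in their denominators. The following Python sketch shows one way they can be computed; the function name and the toy ages are made up for illustration, and only the last two lines use figures quoted on the slide.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, frequency, percent, valid percent, cumulative percent).

    None marks a missing observation. Percent uses all cases as the denominator;
    valid percent and cumulative percent use only the non-missing cases.
    """
    valid = [v for v in values if v is not None]
    n_total, n_valid = len(values), len(valid)
    counts = Counter(valid)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((
            value,
            counts[value],
            round(100 * counts[value] / n_total, 1),
            round(100 * counts[value] / n_valid, 1),
            round(100 * running / n_valid, 1),
        ))
    return rows


# Toy data (made up): nine valid ages and one missing value.
ages = [25, 27, 27, 30, 30, 30, 34, 41, 41, None]
table = frequency_table(ages)
for row in table:
    print(row)

# The slide's figures for age 27: 1131 cases out of 28,764 total and 28,628 valid.
print(round(100 * 1131 / 28764, 1))  # 3.9
print(round(100 * 1131 / 28628, 1))  # 4.0
```

In the toy data, age 27 appears twice, so its row reads 20.0% of all ten cases, 22.2% of the nine valid cases, and a cumulative 33.3%.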

Page 27:

Frequencies: Results Coach

If you are ever unsure of what certain output means, right click on the desired output and select ‘Results Coach’. You will be directed to the relevant section of the tutorial.

Page 28:

Data Management

Page 29:

Recoding Variables

It is often useful to categorize continuous variables to get a more meaningful picture of your data.

For example, suppose we want to code respondent age into the following eight categories:

24 or younger

25-39

40-44

45-49

50-54

55-59

60-64

65 or older

We can simply recode the current continuous age variable into a new categorical age variable.

Page 30:

Recoding Variables

Transform > Recode > Into Different Variables

Select desired variable(s) and move to Variable(s) box by clicking on the arrow

Provide name and label for new Output Variable

Click ‘Change’ to apply these new variable attributes

Page 31:

Recoding Variables: Old and New Values

Enter Old Value and desired New Value

Click ‘Add’

Once complete, click ‘Continue’

Page 32:

Recoding Variables: Syntax

Click ‘Paste’ to convert command into syntax

RECODE age (Lowest thru 24=1) (25 thru 39=2) (40 thru 44=3) (45 thru 49=4) (50 thru 54=5) (55 thru 59=6) (60 thru 64=7) (65 thru Highest=8) INTO age_cat .
VARIABLE LABELS age_cat 'categorical age'.
EXECUTE .

Run your syntax

_____________________________________________

We have recreated the variable agecat8 as age_cat.

Let’s run frequencies of both variables to compare.
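For readers who want to see the recoding logic spelled out, here is a rough Python equivalent of the RECODE ranges. It is a sketch under two assumptions: ages are whole numbers, and a missing age is represented by None. The names LOWER_BOUNDS and recode_age are hypothetical.

```python
import bisect

# Lower bounds of categories 2 through 8, matching the RECODE ranges above.
LOWER_BOUNDS = [25, 40, 45, 50, 55, 60, 65]


def recode_age(age):
    """Return the age category (1-8) for a whole-number age, or None if age is missing."""
    if age is None:
        return None
    # Count how many lower bounds the age has reached, then shift to 1-based categories.
    return bisect.bisect_right(LOWER_BOUNDS, age) + 1


ages = [18, 24, 25, 39, 40, 64, 65, 90, None]
print([recode_age(a) for a in ages])  # [1, 1, 2, 2, 3, 7, 8, 8, None]
```

One difference worth noting: this sketch would place a fractional age such as 24.5 in category 1, whereas the RECODE ranges listed above leave gaps between whole numbers (for example between 24 and 25), so it should only be used with integer ages.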

Page 33:

Recoding Variables: Comparison

FREQUENCIES VARIABLES=agecat8 age_cat /ORDER= ANALYSIS .

There is a slight discrepancy. SPSS has 3 more missing cases than we do; however, our number of missing cases matches that of the original continuous variable. This suggests that in the recoding of the original variable, SPSS missed 3 cases in the first two categories.

agecat8 age group

                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid     24 or less        3106      10.8            10.9                 10.9
          25-39            16562      57.6            57.9                 68.7
          40-44             3888      13.5            13.6                 82.3
          45-49             2458       8.5             8.6                 90.9
          50-54             1584       5.5             5.5                 96.4
          55-59              662       2.3             2.3                 98.7
          60-64              255        .9              .9                 99.6
          65+                110        .4              .4                100.0
          Total            28625      99.5           100.0
Missing   System             139        .5
Total                      28764     100.0

age_cat categorical age

                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1                 3107      10.8            10.9                 10.9
          2                16564      57.6            57.9                 68.7
          3                 3888      13.5            13.6                 82.3
          4                 2458       8.5             8.6                 90.9
          5                 1584       5.5             5.5                 96.4
          6                  662       2.3             2.3                 98.7
          7                  255        .9              .9                 99.6
          8                  110        .4              .4                100.0
          Total            28628      99.5           100.0
Missing   System             136        .5
Total                      28764     100.0
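The size and location of the discrepancy between agecat8 and age_cat can be checked with a few lines of arithmetic. This Python sketch simply re-enters the frequency columns from the two output tables; the variable names are illustrative.

```python
# Frequencies for the eight age categories, copied from the two output tables.
agecat8 = [3106, 16562, 3888, 2458, 1584, 662, 255, 110]
age_cat = [3107, 16564, 3888, 2458, 1584, 662, 255, 110]

# Per-category difference (our recode minus the existing variable).
diffs = [new - old for old, new in zip(agecat8, age_cat)]
print(diffs)                        # [1, 2, 0, 0, 0, 0, 0, 0]

# Valid totals and the overall gap.
print(sum(agecat8), sum(age_cat))   # 28625 28628
print(sum(age_cat) - sum(agecat8))  # 3
```

The gap of 3 valid cases matches the difference in missing cases (139 versus 136) and falls entirely in the first two categories.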

Page 34:

Computing Variables

Suppose we want to calculate each runner’s average miles per hour. We know the number of hours it took them to complete the marathon and we know a marathon is 26.2 miles. Therefore, we can compute mph by dividing 26.2 by time in hours.

Transform > Compute

Enter desired formula and click ‘Paste’

COMPUTE mph = 26.2 / hours .
EXECUTE .

Run your syntax
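The same calculation can be sketched outside SPSS. In the Python example below, the function name and the choice to return None for a missing or non-positive time are assumptions made for illustration; the 26.2-mile distance comes from the slide.

```python
MARATHON_MILES = 26.2  # marathon distance used on the slide


def miles_per_hour(hours):
    """Average speed over the marathon; None if the time is missing or not positive."""
    if hours is None or hours <= 0:
        return None
    return MARATHON_MILES / hours


print(round(miles_per_hour(4.0), 2))  # 6.55
print(miles_per_hour(None))           # None
```

Returning None for unusable times keeps bad inputs from producing a misleading speed or a division-by-zero error.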

Page 35:

Measures of Central Tendency

Page 36:

Mean, Median, and Mode: Definitions

The arithmetic mean (or simply the mean) of a list of numbers is the sum of all the members of the list divided by the number of items in the list. We commonly call this the average.

The mode of a list of numbers is the number which occurs the most frequently. A variable with only one mode is called unimodal. If the same maximum frequency occurs at two or more values, the variable is called bi- or multi-modal.

The median of a list of numbers is the number dividing the higher half of the list from the lower half. If there are an even number of observations, the median is not unique, so one often takes the mean of the two middle values.
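These definitions can be tried out on a small list using Python's standard statistics module. The numbers below are made up for illustration; multimode requires Python 3.8 or later.

```python
import statistics

# Toy data (made up): six values, so the median averages the two middle values.
values = [2, 3, 3, 5, 7, 10]

print(statistics.mean(values))       # 5   (30 divided by 6)
print(statistics.median(values))     # 4.0 (mean of the middle values 3 and 5)
print(statistics.multimode(values))  # [3] (unimodal)

# A bimodal example: 1 and 2 both occur twice.
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```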

Page 37:

Mean, Median, and Mode: Computation

Proceed as if computing frequencies for a particular variable

Analyze > Descriptive Statistics > Frequencies

Click ‘Statistics’

Page 38:

Mean, Median, and Mode: Computation

Select ‘Mean’, ‘Median’, and ‘Mode’ from ‘Central Tendency’ box

Click ‘Continue’

Then ‘Paste’

Run your syntax

Page 39:

Mean, Median, and Mode: Output

FREQUENCIES VARIABLES=age /STATISTICS=MEAN MEDIAN MODE /ORDER= ANALYSIS .

Statistics

age
N   Valid     28628
    Missing     136
Mean          35.50
Median        34.00
Mode             30

_____________________________________________

The mean allows us to say…

The average runner in our sample is 35.5 years old.

The median allows us to say…

About half of the runners in our sample are 34 or younger and about half are 34 or older.

The mode allows us to say…

Runners of age 30 form the largest group.

Page 40:

Caution: Mean

The mean is easily influenced by outliers. That is, observations with unusually high or low values will pull the mean in their direction. For example, consider the mean salary of the following five employees.

Employee A: $45,000

Employee B: $50,000

Employee C: $60,000

Employee D: $70,000

Employee E: $1,000,000

The mean salary is $245,000… a gross misrepresentation of an ‘average’ employee’s salary. Here, the median ($60,000) would be a better indication of typical salary.
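The salary example is easy to reproduce. The short Python sketch below uses the five salaries from the slide; the final line, which drops Employee E, is an added illustration of how much a single outlier moves the mean.

```python
import statistics

# Salaries of Employees A through E, from the slide.
salaries = [45_000, 50_000, 60_000, 70_000, 1_000_000]

print(statistics.mean(salaries))    # 245000
print(statistics.median(salaries))  # 60000

# Added illustration: the mean of the four salaries without Employee E.
print(statistics.mean(salaries[:-1]))  # 56250
```

Removing one observation changes the mean from $245,000 to $56,250, while the median of all five salaries stays at $60,000.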

Page 41:

Caution: Median and Mode

When your mean differs noticeably from your median, it is an indication that your data are skewed or contain outliers. In this case, it is usually best to interpret the median rather than the mean.

It only makes sense to talk about the median value when your data can be meaningfully ordered from smallest to largest. Therefore, calculating the median value of an unordered (nominal) categorical variable is not appropriate.

The mode is not a very good summary measure for a variable that can have many values, since several values can be tied for “largest frequency” and the frequency need not represent a large percentage of the cases.

Recall the marathon example. The mode age is 30; however, only 4.5% of the runners fall in this group.

Page 42:

Analyzing Subsets of your Data

Page 43:

Analyzing Subsets of your Data

Sometimes you might only be interested in analyzing a subset of observations. For example, what is the average completion time of males? What is the average completion time of females?

Data > Select Cases

If condition is satisfied

CAUTION: Choose ‘Filter out unselected cases’ or ‘Copy selected cases to a new dataset’.

NEVER DELETE ANYTHING!!!

Page 44:

Analyzing Subsets of your Data

Click ‘Continue’

Then ‘Paste’

Run your syntax

COMPUTE filter_$=(sex = 'M').
VARIABLE LABEL filter_$ "sex = 'M' (FILTER)".
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .
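The idea behind filtering, as opposed to deleting, can be illustrated in a few lines of Python. The runner records, field names, and function name below are made up for this sketch; only the 'M' code for males comes from the syntax above.

```python
import statistics

# Toy records (made up); 'sex' uses 'M' as in the filter syntax, 'F' is assumed.
runners = [
    {"sex": "M", "hours": 3.5},
    {"sex": "F", "hours": 4.5},
    {"sex": "M", "hours": 4.0},
    {"sex": "F", "hours": 4.0},
    {"sex": "M", "hours": 4.5},
]


def mean_hours(records, sex):
    """Mean completion time for one group; the full list of records is left untouched."""
    selected = [r["hours"] for r in records if r["sex"] == sex]
    return statistics.mean(selected)


print(mean_hours(runners, "M"))  # 4.0
print(mean_hours(runners, "F"))  # 4.25
print(len(runners))              # 5 (nothing was deleted)
```

Because the selection is made on the fly, the same data can be analyzed for males, for females, or for everyone without ever removing a case.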

Page 45:

Analyzing Subsets of your Data

Note the newly created filter variable and the slashes through unselected cases.

Page 46:

Analyzing Subsets of your Data

What is the average completion time of males?

Run Frequencies as usual…

Statistics

hours completion time in hours
N   Valid     17337
    Missing       0
Mean         4.1461
Median       4.0639
Mode           3.96

The average male completed the marathon in 4.15 hours.

Page 47:

Questions?

Thank you!!