Sociology 5811: Lecture 2: Datasets and Simple Descriptive Statistics
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission
Announcements
• 1. Lab meets Monday 1:25, Blegen 440
• 2. Course lecture notes at:
– http://www.soc.umn.edu/~schofer
– Click on “Soc 5811”, go to “Course Files”
From Measurement to Datasets
• Suppose we:
– 1. Choose a unit of analysis
– 2. Choose a measurement strategy
– 3. Take measurements on relevant cases
• Result: We end up with sets of measurements on a group of cases
• Q: What next?
• A: Data is often organized in a spreadsheet:
– Rows contain all measurements on each case
– Columns reflect sets of measurements or “variables”
Datasets: Example
• Suppose we measured 5 people regarding views on gun control and gun ownership:
Person    Views on Gun Control    # Guns owned
1         Favor                   0
2         Oppose                  3
3         Favor                   0
4         Favor                   1
5         Oppose                  1
Rows contain all info on each person (a case)
Columns contain all measurements on a particular topic (a variable)
From Measurement to Datasets
• Issue: To facilitate data analysis, it is best to enter data as numbers, rather than text
– Often called “coding” data
• Less good option: Use text words “Favor” and “Oppose” in our gun control dataset
• Better option: Convert “Favor” and “Oppose” to numeric values
– Example: 1 = favor, 0 = oppose
• Advantage: more computation options
• Disadvantage: Data is harder to interpret by eye
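The recoding step can be sketched in a few lines of Python. This is only an illustration, not part of the course's SPSS workflow; the names `CODES`, `views_text`, and `views_coded` are invented here, and the coding scheme (1 = favor, 0 = oppose) is the one from the bullet above.

```python
# Coding scheme from the slide: 1 = favor, 0 = oppose.
CODES = {"Favor": 1, "Oppose": 0}

# Views of the five people in the example dataset, in case order.
views_text = ["Favor", "Oppose", "Favor", "Favor", "Oppose"]

# Recode each text response to its numeric value.
views_coded = [CODES[v] for v in views_text]
print(views_coded)  # [1, 0, 1, 1, 0]

# With numeric codes, simple computations become possible,
# e.g. the proportion of cases that favor gun control.
proportion_favor = sum(views_coded) / len(views_coded)
print(proportion_favor)  # 0.6
```

Keeping the code book (`CODES`) next to the data is one way to offset the disadvantage noted above: the numbers alone no longer say what they mean.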
Datasets: Recoded Example
• In this dataset, “Favor” was recoded to 1, “Oppose” to zero.
Person    Views on Gun Control    # Guns owned
1         1                       0
2         0                       3
3         1                       0
4         1                       1
5         0                       1
Note that it is harder to visually determine the meaning of the variable. You have to remember what the numbers mean…
Review
• Measurement: The task of gathering information that characterizes or represents a social phenomenon
• Q: What is “Unit of Analysis”?
– Answer: The type of thing about which we are collecting information
• Q: What are 3 measurement scales? Examples?
– Nominal
– Ordinal
– Interval / Ratio
Review: Measurement Problems
• Problems that arose in survey given last class:
• Question 10: What transportation do you generally use to get to class?
– Answer: Both “car” and “public transportation”
• Question 9: How many miles away do you live?
– Answer: 4 blocks
• Question 6 (Liberal or conservative, from 1-10)
– Answer: “3 or 4”
• How many CDs do you own?
– Answer: “Over 100”
Today’s Class: Describing Information
• Tools for describing a single variable:
• List, Frequency lists, charts, histograms
• Characterization of “Typical” cases
– Ex: Mean (“average”), Mode, Median
• Characterization of Variation
– Ex: Min, Max, Variance, etc.
Listing Variables
• Lists: Values of a variable for all cases
• Looking at the “raw data”
• Report command in SPSS
– Or just look at data in the SPSS data editor
• Advantages:
– Easy
– Gives a rich description: you can see every case
• Disadvantages:
– Cumbersome for large datasets
– If data involves complex coding, you may not be able to interpret it visually
Frequency Lists
• Frequency Lists: Tables that show how many cases take on a particular value
– Also called “frequencies”, “frequency distributions”
• Examples:
– Congressional vote. How many “Yes” vs “No”?
– Social class: How many = low, middle, upper?
– Age: How many = 1 year old, 2 years, … 100 years?
• Relevant SPSS Command: Frequencies
Example from SPSS
• Note: Men coded as 1, Women coded as 2

GENDER             Freq.    %        Valid %    Cuml. %
1.00               6        33.3     35.3       35.3
2.00               11       61.1     64.7       100.0
Total              17       94.4     100.0
Missing: System    1        5.6
Total              18       100.0
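To make the columns of the SPSS table concrete, here is a rough Python sketch that rebuilds them from raw codes. It assumes a hypothetical list `gender` that matches the table's counts (6 cases coded 1, 11 coded 2, 1 missing, represented here as `None`); the function name `frequency_table` is invented for this illustration and is not an SPSS feature.

```python
from collections import Counter

# Hypothetical raw data matching the table's counts:
# 6 cases coded 1 (men), 11 coded 2 (women), 1 missing (None).
gender = [1] * 6 + [2] * 11 + [None]

def frequency_table(values):
    """Return rows of (value, freq, percent, valid percent, cumulative percent)."""
    total = len(values)                          # all cases, including missing
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        freq = counts[value]
        percent = 100 * freq / total             # share of all cases
        valid_percent = 100 * freq / len(valid)  # share of non-missing cases
        cumulative += valid_percent
        rows.append((value, freq, round(percent, 1),
                     round(valid_percent, 1), round(cumulative, 1)))
    return rows

table = frequency_table(gender)
for row in table:
    print(row)
```

The "%" column divides by all 18 cases, while "Valid %" divides by the 17 non-missing cases, which is why the two columns differ.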
Frequency Lists
• Advantages:
– Useful for large datasets
– Fairly rich description of data, once you get used to reading them…
• Disadvantages:
– Unlike a list, you can’t see which case is which or compare with other variables
– Best for nominal and some ordinal variables only
– Not useful if all values are unique, such as: rank orderings, many continuous variables
Visual Representations: Bar Charts
• “Bar Chart”
– Essentially a visual representation of a frequency list
– Height of each bar represents the number of cases
– For nominal & some ordinal variables only
– Again, rank orders and continuous measures don’t work
• “Pie Chart”
– Similar, but divides up a circle to show frequency
• All accessible within the Frequencies menu
– Just click the Chart button
– Or, look under the Graphs menu
SPSS Bar Chart
[Figure: bar chart of GENDER; x-axis categories Missing, 1.00, 2.00; y-axis Count, 0 to 12]
A Similar Approach: Pie Chart
[Figure: pie chart of GENDER with slices for 1.00, 2.00, and Missing]
Graphing Continuous Measures
• Issue: Continuous variables have an infinite possible number of unique values.
– Cases rarely have the exact same value
– Bar chart would have many bars of height 1
– What would you do about zeros?
• Solution: use “grouped data”
– Sets of similar values must be “grouped”
– Lumped together by constant intervals
– Note: Information is destroyed in the process
• Result: A “Histogram”
– Height of bar represents number of cases within a given range of values
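The grouping step can be sketched as follows. This is a minimal illustration, not how SPSS builds its histograms; the ages are invented, and the function name `group_counts` and its `width` and `start` parameters are assumptions made for this example.

```python
from collections import Counter

def group_counts(values, width, start):
    """Count cases per interval [start + k*width, start + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = int((v - start) // width)   # index of the interval containing v
        lower = start + k * width
        counts[(lower, lower + width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical ages, invented for illustration (not the lecture's data).
ages = [18, 19, 21, 23, 24, 26, 29, 30, 31, 31, 34, 38, 41, 47]

grouped = group_counts(ages, width=5, start=17.5)
for (lower, upper), n in grouped.items():
    print(f"{lower}-{upper}: {n}")
```

After grouping, only the count per interval remains: the four cases in the 27.5-32.5 interval can no longer be told apart, which is the sense in which information is destroyed.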
Histogram: Age (5-year interval)
[Figure: histogram of AGE OF RESPONDENT in 5-year intervals; x-axis bars labeled 20.0 through 90.0; y-axis Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533]
This doesn’t mean that 200 cases are exactly 30 years old… Rather, 200 cases fall in the 5-year interval around age 30 (from 27.5 to 32.5)
Histograms: Interval Width
• Previous example: People were grouped by age, within 5-year intervals
– Bars represented ages 17.5-22.5, 22.5-27.5 and so on
• It is also possible to group people within 1-year intervals or 50-year intervals
– Small interval = more bars in the histogram
– Wide interval = fewer bars in the histogram
• WARNING: Histograms look very different depending on how wide you set the intervals
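A quick way to see the effect of interval width is to count the bars that the same data produce at several widths. The sketch below uses invented ages and a hypothetical helper `histogram`; for simplicity it assumes intervals start at multiples of the width, which differs from the 17.5-22.5 boundaries in the SPSS example.

```python
import math

# Hypothetical ages, invented for illustration.
ages = [18, 19, 21, 23, 24, 26, 29, 30, 31, 31,
        34, 38, 41, 47, 52, 60, 63, 71, 85]

def histogram(values, width):
    """Return {interval lower bound: count}; intervals start at multiples of width."""
    counts = {}
    for v in values:
        lower = math.floor(v / width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

for width in (1, 5, 20):
    bars = histogram(ages, width)
    print(f"width={width}: {len(bars)} non-empty bars, "
          f"tallest bar = {max(bars.values())}")
```

With these made-up values, 1-year intervals give 18 short bars, 5-year intervals give 11, and 20-year intervals give only 5, with more than half of the cases in a single bar.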
Histogram: Age (1-year interval)
[Figure: histogram of AGE OF RESPONDENT in 1-year intervals; x-axis from 20 to 80; y-axis Count, 10 to 40]
Histogram: Age (20-year interval)
[Figure: histogram of AGE OF RESPONDENT in 20-year intervals; x-axis from 20 to 80; y-axis Count, 0 to 600]
Histograms: Interval Width
• Changing the number of “bars” in the histogram alters the appearance of the graph
– Wide intervals/few bars results in greater simplification of data
• Suggestion
– 1. Try different intervals (in SPSS, go to “interactive histogram”)
– 2. Don’t over-interpret a crude histogram
• Another example: National Wealth
– Unit of analysis = country
– Variable = GDP per capita, a measure of wealth
Histogram: Wide Intervals
[Figure: histogram of National Wealth 1990 (Penn 56 RGDPCH 1990), wide intervals; x-axis bars labeled 1250.0 through 18750.0; y-axis Frequency, 0 to 100. Std. Dev = 4915.68, Mean = 4810.4, N = 152]
Histogram: Narrow Intervals
[Figure: histogram of National Wealth 1990 (Penn 56 RGDPCH 1990), narrow intervals; x-axis from 0.0 to 20000.0; y-axis Frequency, 0 to 50. Std. Dev = 4915.68, Mean = 4810.4, N = 152]
Histograms
• Advantages:
• 1. Useful even for continuous measures
• 2. Preserves information on distribution of variable
– Both peaks and zeros are apparent
• Disadvantages:
• 1. Interval width can be a problem
– Too wide results in loss of information
– Too narrow results in too many bars, making the graph unreadable
Interpreting Histograms: Age
• Try to interpret: What is this sample like?
[Figure: histogram of AGE OF RESPONDENT; x-axis bars labeled 40.0 through 90.0; y-axis Frequency, 0 to 400. Std. Dev = 10.74, Mean = 56.3, N = 1533]
Interpreting Histograms: Age
• Try to interpret this histogram:
[Figure: histogram of AGE OF RESPONDENT; x-axis bars labeled 10.0 through 60.0; y-axis Frequency, 0 to 400. Std. Dev = 10.74, Mean = 26.3, N = 1533]
Interpreting Histograms: Age
• Try to interpret this histogram:
[Figure: histogram of AGE OF RESPONDENT; x-axis bars labeled 20.0 through 100.0; y-axis Frequency, 0 to 70. Std. Dev = 23.06, Mean = 59.9, N = 1537]
Measures of “Central Tendency”
• Often, it is important to assess the “typical” values of a variable
• Examples:
– We may wish to know how much money the typical family earns
– We may wish to know the age of the typical person in our dataset
• Solution: Conduct calculations to determine what values are “typical”
• However, this isn’t as easy as it sounds
– Consider some examples…
What is the “Center”?
[Figure: histogram of National Wealth 1990 (Penn 56 RGDPCH 1990); x-axis from 0.0 to 20000.0; y-axis Frequency, 0 to 50. Std. Dev = 4915.68, Mean = 4810.4, N = 152]
What is the “Center”?
[Figure: bar chart of GENDER; x-axis categories Missing, 1.00, 2.00; y-axis Count, 0 to 12]
The “Mode”
• The Mode = the value representing the largest number of cases -- called the “Modal” value
• Useful for Nominal, Ordinal variables
• Only useful for Continuous variables if you have grouped data into a histogram
• Otherwise, all values may very likely be unique
• Issue: Mode is not very helpful (even misleading) in certain circumstances
• Ex: If there are many peaks, or a single unusual one
• Ex: If the variable is distributed quite evenly.
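As a rough illustration of the mode and of the evenly distributed case, here is a short Python sketch. The helper name `modes` is invented for this example; the gender codes follow the frequency table earlier in the lecture (6 cases coded 1, 11 coded 2), and the second list is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, n in counts.items() if n == top)

# Gender codes from the frequency table: 6 cases coded 1, 11 coded 2.
gender = [1] * 6 + [2] * 11
print(modes(gender))   # [2]

# Hypothetical, evenly distributed variable: every value ties,
# so the "mode" does not single out a typical case.
even = [1, 2, 3, 4, 5]
print(modes(even))     # [1, 2, 3, 4, 5]
```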
Mode: Example
[Figure: bar chart of GENDER; x-axis categories Missing, 1.00, 2.00; y-axis Count, 0 to 12]
Here, the mode is 2 (which corresponds to “female”)
Mode: Example
[Figure: histogram of AGE OF RESPONDENT in 5-year intervals; x-axis bars labeled 20.0 through 90.0; y-axis Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533]
Here, the mode is 30 (though it might be different if the histogram had a different interval width)
Mode: Example
• In this case, the mode (45) is not helpful
[Figure: histogram of AGE OF RESPONDENT; x-axis bars labeled 20.0 through 100.0; y-axis Frequency, 0 to 70. Std. Dev = 23.06, Mean = 59.9, N = 1537]
Median
• The Median = the value of the “middle case”
– Equal number of cases fall higher or lower
• Can be used for ordinal, continuous variables
• Advantages:
– 1. Not influenced by unusual peaks
– 2. Useful even in very even distributions
• Disadvantages:
– 1. Not useful for data spread in two distinct “clumps.”
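A minimal sketch of finding the median in Python follows. The function is written for illustration; averaging the two middle values when N is even is a common convention and an assumption here, since the slide does not say how to handle that case. The "two clumps" data are invented.

```python
def median(values):
    """Value of the middle case; averages the two middle cases when N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Guns owned in the five-person example dataset.
guns = [0, 3, 0, 1, 1]
print(median(guns))     # 1

# Hypothetical ages in two distinct clumps: the median falls
# between the clumps, where there are no cases at all.
clumped = [20, 21, 22, 70, 71, 72]
print(median(clumped))  # 46.0
```

The second example shows the disadvantage listed above: a median of 46 describes nobody in a sample made up of people in their early 20s and early 70s.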
Median Example
[Figure: histogram of AGE OF RESPONDENT in 5-year intervals; x-axis bars labeled 20.0 through 90.0; y-axis Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533]
The median case is 42 years old. Half are older, half are younger!
Mean – “Average”
• The most well-known way of assessing the “middle”
• Calculated by adding values of all cases, then dividing by the total number of cases
• Advantages:
– Useful for continuous measures
– Not overly influenced by any single peak
• Disadvantages:
– Can be influenced by extreme values.
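The sensitivity of the mean to extreme values can be seen with a small made-up example. The incomes below are hypothetical, and `statistics.median` from the Python standard library is used only for comparison.

```python
from statistics import median

# Hypothetical family incomes, invented for illustration.
incomes = [30000, 40000, 50000, 60000]
mean_before = sum(incomes) / len(incomes)
median_before = median(incomes)
print(mean_before, median_before)   # 45000.0 45000.0

# Add one extreme case and recompute.
with_extreme = incomes + [1000000]
mean_after = sum(with_extreme) / len(with_extreme)
median_after = median(with_extreme)
print(mean_after, median_after)     # 236000.0 50000
```

One extreme case pulls the mean from 45,000 to 236,000, while the median only moves to 50,000.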
Calculating the Mean: Variables
• Each column of a dataset is considered a variable
• We’ll refer to a column generically as “Y”
Person    # Guns owned (the variable “Y”)
1         0
2         3
3         0
4         1
5         1
Note: The total number of cases in the dataset is referred to as “N”. Here, N=5.
Equation of Mean: Notation
• Each case can be identified by a subscript
• Y_i represents the “ith” case of variable Y
• i goes from 1 to N
• Y_1 = value of Y for first case in spreadsheet
• Y_2 = value for second case, etc.
• Y_N = value for last case

Person    # Guns owned (Y)
1         Y_1 = 0
2         Y_2 = 3
3         Y_3 = 0
4         Y_4 = 1
5         Y_5 = 1
Calculating the Mean
• Equation:
\[ \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i \]
• 1. Mean of variable Y represented by Y with a line on top, called “Y-bar”
• 2. Equals sign means equals: “is calculated by the following…”
• 3. N refers to the total number of cases for which there is data
• Summation (Σ): will be explained next…
Equation of Mean: Summation
• Sigma (Σ): Summation
– Indicates that you should add up a series of numbers
\[ \sum_{i=1}^{N} Y_i \]
• The thing on the right (Y_i) is the item to be added repeatedly
• The things on top and bottom tell you how many times to add up Y-sub-i… AND what numbers to substitute for i.
Equation of Mean: Summation
• 1. Start with bottom: i = 1.
– The first number to add is Y-sub-1
• 2. Then, allow i to increase by 1
– The second number to add is Y-sub-2 (i = 2), then Y-sub-3 (i = 3)
• 3. Keep adding numbers until i = N
– In this case N=5, so stop at 5
\[ \sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5 \]
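The three steps of the summation can be written as a short loop. This sketch uses the guns-owned values from the five-person example dataset; the variable names `Y`, `N`, and `total` simply mirror the notation and are otherwise arbitrary.

```python
# Guns owned by the five people in the example dataset (Y_1 ... Y_5).
Y = [0, 3, 0, 1, 1]
N = len(Y)

total = 0
for i in range(1, N + 1):   # i starts at 1 and increases by 1 until i = N
    total += Y[i - 1]       # Python lists are 0-indexed, so Y_i is Y[i - 1]

print(total)  # 5
```

In everyday Python the built-in `sum(Y)` does the same thing; the explicit loop is shown only because it matches the sigma notation step by step.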
Equation for the Mean: Example
• Variable: Number of CDs… How many CDs does a person own?

Case    Num CDs
1       20       Y_1
2       40       Y_2
3       0        Y_3
4       70       Y_4

\[ \sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + \dots + Y_N \]
Equation of the Mean: Example
\[ \sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 = 20 + 40 + 0 + 70 = 130 \]
\[ \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i = \frac{1}{4}(130) = 32.5 \]
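The same calculation can be checked in Python. The function name `mean` is chosen for this sketch; it follows the formula directly (sum of all cases divided by N) and uses the four CD counts from the example.

```python
def mean(values):
    """Y-bar: the sum of all cases divided by the number of cases N."""
    n = len(values)
    return sum(values) / n

# Number of CDs owned by the four cases in the example.
cds = [20, 40, 0, 70]

print(sum(cds))    # 130
print(mean(cds))   # 32.5
```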