Download - Sociology 5811: Lecture 2: Datasets and Simple Descriptive Statistics Copyright © 2005 by Evan Schofer Do not copy or distribute without permission

Sociology 5811: Lecture 2: Datasets and Simple Descriptive Statistics

Copyright © 2005 by Evan Schofer

Do not copy or distribute without permission

Announcements

• 1. Lab meets Monday 1:25, Blegen 440

• 2. Course lecture notes at:
– http://www.soc.umn.edu/~schofer
– Click on “Soc 5811”, go to “Course Files”

From Measurement to Datasets

• Suppose we:
– 1. Choose a unit of analysis
– 2. Choose a measurement strategy
– 3. Take measurements on relevant cases

• Result: We end up with sets of measurements on a group of cases

• Q: What next?

• A: Data is often organized in a spreadsheet:
– Rows contain all measurements on each case
– Columns reflect sets of measurements or “variables”

Datasets: Example

• Suppose we measured 5 people regarding views on gun control and gun ownership:

Person   Views on Gun Control   # Guns owned
1        Favor                  0
2        Oppose                 3
3        Favor                  0
4        Favor                  1
5        Oppose                 1

Rows contain all info on each person (a case).

Columns contain all measurements on a particular topic (a variable).

From Measurement to Datasets

• Issue: To facilitate data analysis, it is best to enter data as numbers, rather than text
– Often called “coding” data

• Less good option: Use text words “Favor” and “Oppose” in our gun control dataset

• Better option: Convert “Favor” and “Oppose” to numeric values
– Example: 1 = favor, 0 = oppose

• Advantage: more computation options

• Disadvantage: Data is harder to interpret by eye

Datasets: Recoded Example

• In this dataset, “Favor” was recoded to 1, “Oppose” to zero.

Person   Views on Gun Control   # Guns owned
1        1                      0
2        0                      3
3        1                      0
4        1                      1
5        0                      1

Note that it is harder to visually determine the meaning of the variable. You have to remember what the numbers mean…
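The recoding step can be sketched in a few lines of Python. This is only an illustration: the lecture itself uses SPSS, and the names `CODEBOOK`, `recode`, and `views_coded` are hypothetical. The data are the five cases from the table above.

```python
# The five-person gun control dataset from the slide, typed in by hand.
views = ["Favor", "Oppose", "Favor", "Favor", "Oppose"]
guns_owned = [0, 3, 0, 1, 1]

# Codebook: mapping from text labels to numeric codes (1 = favor, 0 = oppose).
# Keeping it in one place is what lets you "remember what the numbers mean".
CODEBOOK = {"Favor": 1, "Oppose": 0}


def recode(values, codebook):
    """Replace each text label with its numeric code."""
    return [codebook[v] for v in values]


views_coded = recode(views, CODEBOOK)
print(views_coded)  # [1, 0, 1, 1, 0]

# One of the extra "computation options": with 0/1 codes, the share of
# people who favor gun control is simply the mean of the column.
share_favor = sum(views_coded) / len(views_coded)
print(share_favor)  # 0.6
```

Storing the codebook next to the data addresses the disadvantage noted above: the numeric column is harder to read by eye, but the mapping is never lost.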

Review

• Measurement: The task of gathering information that characterizes or represents a social phenomenon

• Q: What is “Unit of Analysis”?
– Answer: The type of thing which we are collecting information about

• Q: What are 3 measurement scales? Examples?
– Nominal
– Ordinal
– Interval / Ratio

Review: Measurement Problems

• Problems that arose in survey given last class:

• Question 10: What transportation do you generally use to get to class?
– Answer: Both “car” and “public transportation”

• Question 9: How many miles away do you live?
– Answer: 4 blocks

• Question 6 (Liberal or conservative, from 1-10)
– Answer: “3 or 4”

• How many CDs do you own?
– Answer: “Over 100”

Today’s Class: Describing Information

• Tools for describing a single variable:

• List, Frequency lists, charts, histograms

• Characterization of “Typical” cases– Ex: Mean (“average”), Mode, Median

• Characterization of Variation– Ex: Min, Max, Variance, etc.

Listing Variables

• Lists: Values of a variable for all cases

• Looking at the “raw data”

• Report command in SPSS
– Or just look at data in the SPSS data editor

• Advantages:
– Easy
– Gives a rich description: you can see every case

• Disadvantages:
– Cumbersome for large datasets
– If data involves complex coding, you may not be able to interpret it visually

Frequency Lists

• Frequency Lists: Tables that show how many cases take on a particular value
– Also called “frequencies”, “frequency distributions”

• Examples:
– Congressional vote. How many “Yes” vs “No”?
– Social class: How many = low, middle, upper?
– Age: How many = 1 year old, 2 years, … 100 years?

• Relevant SPSS Command: Frequencies

Example from SPSS

• Note: Men coded as 1, Women coded as 2

GENDER            Freq.   %       Valid %   Cuml. %
1.00              6       33.3    35.3      35.3
2.00              11      61.1    64.7      100.0
Total             17      94.4    100.0
Missing: System   1       5.6
Total             18      100.0
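To show where the three percentage columns come from, here is a rough Python sketch that rebuilds the table above. It is not SPSS output: the function name `frequency_table` and the use of `None` for the missing case are assumptions made for illustration. The raw data are reconstructed from the counts on the slide (6 cases coded 1, 11 coded 2, 1 missing).

```python
from collections import Counter

# Raw data reconstructed from the slide's counts; None marks the missing case.
gender = [1] * 6 + [2] * 11 + [None]


def frequency_table(values):
    """Return rows of (value, count, percent, valid percent, cumulative percent).

    "Percent" divides by all cases, including missing ones.
    "Valid percent" divides by non-missing cases only.
    "Cumulative percent" is a running total of the valid percents.
    """
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        count = counts[value]
        percent = 100 * count / total
        valid_percent = 100 * count / len(valid)
        cumulative += valid_percent
        rows.append((value, count, round(percent, 1),
                     round(valid_percent, 1), round(cumulative, 1)))
    return rows


table = frequency_table(gender)
for row in table:
    print(row)
# (1, 6, 33.3, 35.3, 35.3)
# (2, 11, 61.1, 64.7, 100.0)
```

The two percentage columns differ only in the denominator: 18 cases in total versus 17 cases with a valid answer.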

Frequency Lists

• Advantages:
– Useful for large datasets
– Fairly rich description of data, once you get used to reading them…

• Disadvantages:
– Unlike a list, you can’t see which case is which or compare with other variables
– Best for nominal and some ordinal variables only
– Not useful if all values are unique, such as: rank orderings, many continuous variables

Visual Representations: Bar Charts

• “Bar Chart”
– Essentially a visual representation of a frequency list
– Height of each bar represents the number of cases
– For nominal & some ordinal variables only

• Again, rank orders and continuous measures don’t work

• “Pie Chart”
– Similar, but divides up a circle to show frequency

• All accessible within the Frequencies menu
– Just click the Chart button
– Or, look under the Graphs menu

SPSS Bar Chart

[Figure: bar chart of GENDER. Categories: Missing, 1.00, 2.00. Vertical axis: Count, 0 to 12.]

A Similar Approach: Pie Chart

[Figure: pie chart of GENDER. Slices: Missing, 1.00, 2.00.]

Graphing Continuous Measures

• Issue: Continuous variables have an infinite number of possible unique values.
– Cases rarely have the exact same value
– Bar chart would have many bars of height 1
– What would you do about zeros?

• Solution: use “grouped data”
– Sets of similar values must be “grouped”
– Lumped together by constant intervals
– Note: Information is destroyed in the process

• Result: A “Histogram”
– Height of bar represents number of cases within a given range of values

Histogram: Age (5-year interval)

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: 20.0 to 90.0 in 5-year intervals. Vertical axis: Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533.]

This doesn’t mean that 200 cases are exactly 30 years old… Rather, 200 cases fall in the 5-year interval around age 30 (from 27.5 to 32.5)

Histograms: Interval Width

• Previous example: People were grouped by age, within 5-year intervals
– Bars represented ages 17.5-22.5, 22.5-27.5 and so on

• It is also possible to group people within 1-year intervals, or 50-year intervals
– Small interval = more bars in the histogram
– Wide interval = fewer bars in the histogram

• WARNING: Histograms look very different depending on how wide you set the intervals
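The effect of interval width can be sketched in Python. The ages below are made up for illustration (they are not the survey data on the slides), and `group_into_intervals` is a hypothetical helper. It assumes intervals start at 17.5, as in the previous example, and that each interval includes its lower edge.

```python
from collections import Counter


def group_into_intervals(values, width, start):
    """Count cases per constant-width interval [lower, lower + width).

    Returns a dict mapping each interval's lower edge to its number of cases.
    Intervals with no cases are omitted.
    """
    counts = Counter()
    for v in values:
        index = int((v - start) // width)   # which interval the value falls in
        lower = start + index * width       # lower edge of that interval
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Hypothetical ages, invented for this sketch.
ages = [18, 19, 23, 24, 28, 29, 30, 31, 31, 44, 58, 61]

# Narrow (5-year) intervals: five bars, with a visible peak around age 30.
print(group_into_intervals(ages, width=5, start=17.5))
# {17.5: 2, 22.5: 2, 27.5: 5, 42.5: 1, 57.5: 2}

# Wide (25-year) intervals: two bars, and the peak is no longer visible.
print(group_into_intervals(ages, width=25, start=17.5))
# {17.5: 9, 42.5: 3}
```

The same twelve cases produce two quite different pictures, which is the point of the warning: grouping destroys information, and wider intervals destroy more of it.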


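[Figure: two histograms of AGE OF RESPONDENT drawn with different interval widths. Horizontal axis in both: 20 to 80. Vertical axis: Count, running up to about 40 in the first and up to about 600 in the second.]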

Histograms: Interval Width

• Changing the number of “bars” in the histogram alters the appearance of the graph
– Wide intervals/few bars result in greater simplification of data

• Suggestion
– 1. Try different intervals (in SPSS, go to “interactive histogram”)
– 2. Don’t over-interpret a crude histogram

• Another example: National Wealth
– Unit of analysis = country
– Variable = GDP per capita, a measure of wealth

Histogram: Wide Intervals

[Figure: histogram of Penn 56 RGDPCH 1990 (“National Wealth 1990”) with wide intervals. Horizontal axis labels: 1250.0, 4750.0, 8250.0, 11750.0, 15250.0, 18750.0. Vertical axis: 0 to 100. Std. Dev = 4915.68, Mean = 4810.4, N = 152.]

Histogram: Narrow Intervals

[Figure: histogram of Penn 56 RGDPCH 1990 with narrow intervals. Horizontal axis: 0.0 to 20000.0 in steps of 2000. Vertical axis: Frequency, 0 to 50. Std. Dev = 4915.68, Mean = 4810.4, N = 152.]


Histograms

• Advantages:

• 1. Useful for even continuous measures

• 2. Preserves information on distribution of variable

• Both peaks and zeros are apparent

• Disadvantages:

• 1. Interval width can be a problem
– Too wide results in loss of information
– Too narrow results in too many bars: unreadable

Interpreting Histograms: Age

• Try to interpret: What is this sample like?

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: 40.0 to 90.0 in 5-year intervals. Vertical axis: Frequency, 0 to 400. Std. Dev = 10.74, Mean = 56.3, N = 1533.]

Interpreting Histograms: Age

• Try to interpret this histogram:

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: 10.0 to 60.0 in 5-year intervals. Vertical axis: Frequency, 0 to 400. Std. Dev = 10.74, Mean = 26.3, N = 1533.]

Interpreting Histograms: Age

• Try to interpret this histogram:

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: 20.0 to 100.0 in 5-year intervals. Vertical axis: Frequency, 0 to 70. Std. Dev = 23.06, Mean = 59.9, N = 1537.]

Measures of “Central Tendency”

• Often, it is important to assess the “typical” values of a variable

• Examples:
– We may wish to know how much money the typical family earns
– We may wish to know the age of the typical person in our dataset

• Solution: Conduct calculations to determine what values are “typical”

• However, this isn’t as easy as it sounds
– Consider some examples…

What is the “Center”?

[Figure: histogram of Penn 56 RGDPCH 1990. Horizontal axis: 0.0 to 20000.0 in steps of 2000. Vertical axis: Frequency, 0 to 50. Std. Dev = 4915.68, Mean = 4810.4, N = 152.]


What is the “Center”?

[Figure: bar chart of GENDER. Categories: Missing, 1.00, 2.00. Vertical axis: Count, 0 to 12.]

The “Mode”

• The Mode = the value representing the largest number of cases -- called the “Modal” value

• Useful for Nominal, Ordinal variables

• Only useful for Continuous variables if you have grouped data into a histogram

• Otherwise, all values may very likely be unique

• Issue: Mode is not very helpful (even misleading) in certain circumstances

• Ex: If there are many peaks, or a single unusual one

• Ex: If the variable is distributed quite evenly.
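A small Python sketch of the mode, including the "many peaks" problem. The function name `modes` is hypothetical, and returning every tied value (rather than just one) is a design choice made here, not something the lecture prescribes. The data are the gender counts and the guns-owned column from earlier in the lecture.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the largest number of cases, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Gender from the SPSS example: 6 cases coded 1 (men), 11 coded 2 (women).
gender = [1] * 6 + [2] * 11
print(modes(gender))  # [2]

# Guns owned by the five people in the first dataset: 0 and 1 are tied,
# so a single "modal value" would be misleading here.
guns_owned = [0, 3, 0, 1, 1]
print(modes(guns_owned))  # [0, 1]
```

Returning a list makes ties visible instead of silently picking one peak.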

Mode: Example

[Figure: bar chart of GENDER. Categories: Missing, 1.00, 2.00. Vertical axis: Count, 0 to 12.]

Here, the mode is 2 (which corresponds to “female”)

Mode: Example

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: 20.0 to 90.0 in 5-year intervals. Vertical axis: Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533.]

Here, the mode is 30 (though it might be different if the histogram had a different interval width)

Mode: Example

• In this case, the mode (45) is not helpful

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: 20.0 to 100.0 in 5-year intervals. Vertical axis: Frequency, 0 to 70. Std. Dev = 23.06, Mean = 59.9, N = 1537.]

Median

• The Median = the value of the “middle case”
– Equal number of cases fall higher or lower

• Can be used for ordinal, continuous variables

• Advantages:
– 1. Not influenced by unusual peaks
– 2. Useful even in very even distributions

• Disadvantages:
– 1. Not useful for data spread in two distinct “clumps.”
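Finding the "middle case" can be sketched in Python as follows. One assumption goes beyond the slide: when the number of cases is even there is no single middle case, so this sketch follows the common convention of averaging the two middle values. The function name `median` is chosen here for illustration.

```python
def median(values):
    """Return the value of the middle case after sorting.

    Assumption: with an even number of cases, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd N: one middle case
    return (ordered[mid - 1] + ordered[mid]) / 2   # even N: two middle cases


# Guns owned by the five people in the first dataset.
# Sorted: 0, 0, 1, 1, 3, so the middle (third) case has value 1.
print(median([0, 3, 0, 1, 1]))  # 1

# Four hypothetical CD counts. Sorted: 0, 20, 40, 70; the middle pair is 20 and 40.
print(median([20, 40, 0, 70]))  # 30.0
```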

Median Example

[Figure: histogram of AGE OF RESPONDENT. Horizontal axis: 20.0 to 90.0 in 5-year intervals. Vertical axis: Frequency, 0 to 300. Std. Dev = 17.81, Mean = 45.4, N = 1533.]

The median case is 42 years old. Half are older, half are younger!

Mean – “Average”

• The most well-known way of assessing the “middle”

• Calculated by adding values of all cases, then dividing by the total number of cases

• Advantages:
– Useful for continuous measures
– Not overly influenced by any single peak

• Disadvantages:
– Can be influenced by extreme values.
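The sensitivity of the mean to extreme values can be illustrated with Python's standard `statistics` module. The CD counts are hypothetical, and the fifth case (a collector with 1000 CDs) is invented to serve as an extreme value.

```python
import statistics

# Hypothetical CD counts for four people.
cds = [20, 40, 0, 70]
print(statistics.mean(cds))    # 32.5
print(statistics.median(cds))  # 30.0

# Add one invented extreme case: a collector who owns 1000 CDs.
cds_with_outlier = cds + [1000]
print(statistics.mean(cds_with_outlier))    # 226
print(statistics.median(cds_with_outlier))  # 40
```

One extreme case moves the mean from 32.5 to 226, while the median only moves from 30 to 40. That contrast is why the median is sometimes a better summary of the "typical" case when a few values are very large.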

Calculating the Mean: Variables

• Each column of a dataset is considered a variable

• We’ll refer to a column generically as “Y”

Person   # Guns owned (the variable “Y”)
1        0
2        3
3        0
4        1
5        1

Note: The total number of cases in the dataset is referred to as “N”. Here, N=5.

Equation of Mean: Notation

• Each case can be identified by a subscript
– $Y_i$ represents the “ith” case of variable Y
– i goes from 1 to N
– $Y_1$ = value of Y for first case in spreadsheet
– $Y_2$ = value for second case, etc.
– $Y_N$ = value for last case

Person   # Guns owned (Y)
1        $Y_1 = 0$
2        $Y_2 = 3$
3        $Y_3 = 0$
4        $Y_4 = 1$
5        $Y_5 = 1$

Calculating the Mean

• Equation: $\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$

• 1. Mean of variable Y is represented by Y with a line on top, called “Y-bar”

• 2. Equals sign means equals: “is calculated by the following…”

• 3. N refers to the total number of cases for which there is data

• 4. Summation (Σ) will be explained next…

Equation of Mean: Summation

• Sigma (Σ): Summation
– Indicates that you should add up a series of numbers

$\sum_{i=1}^{N} Y_i$

The thing on the right ($Y_i$) is the item to be added repeatedly.

The things on top and bottom tell you how many times to add up Y-sub-i… AND what numbers to substitute for i.

Equation of Mean: Summation

$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5$

• 1. Start with bottom: i = 1.
– The first number to add is Y-sub-1

• 2. Then, allow i to increase by 1
– The second number to add is Y-sub-2 (i = 2), then Y-sub-3 (i = 3)

• 3. Keep adding numbers until i = N
– In this case N=5, so stop at 5

Equation for the Mean: Example

• Variable: Number of CD’s… How many CD’s does a person own?

Case   Num CD’s
1      20        $Y_1$
2      40        $Y_2$
3      0         $Y_3$
4      70        $Y_4$

$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + \dots + Y_N$

Equation of the Mean: Example

$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 = 20 + 40 + 0 + 70 = 130$

$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i = \frac{1}{4}(130) = 32.5$
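The summation can also be written as an explicit loop in Python, keeping the 1-to-N indexing of the notation. This is only a sketch: the function name `mean` is chosen here, and Python lists are indexed from 0, so case i is stored at position i - 1.

```python
def mean(y):
    """Compute Y-bar = (1/N) * (Y_1 + Y_2 + ... + Y_N) with an explicit loop."""
    n = len(y)                    # N: total number of cases
    total = 0
    for i in range(1, n + 1):     # i goes from 1 to N
        total += y[i - 1]         # add Y_i (stored at list position i - 1)
    return total / n              # divide the sum by N


# Number of CDs owned by the four cases in the example.
cds = [20, 40, 0, 70]
print(sum(cds))   # 130
print(mean(cds))  # 32.5
```

The loop starts at i = 1, increases i by 1 each step, and stops after i = N, the same three steps used to read the summation sign.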
