
inspection

Functions

Measurement

Tables

Graphs

References

Data Analytics for Social Science

Data inspection and visualisation

Johan A. Elkink

School of Politics & International Relations

University College Dublin

31 January 2017

Data analysis process

R at the cross-section of social science analysis and data science

R can be extended using packages (video)

Data can be found in many formats, or entered using Excel (video)
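As a minimal sketch of getting data into R: base R's read.csv() reads comma-separated text into a data frame. The tiny dataset below is made up for illustration and is read from a string so the example is self-contained; with a real file one would pass its path instead.

```r
# Made-up data for illustration only (not from the course materials).
csv_text <- "name,age,party
Aoife,34,Labour
Brian,51,Fine Gael
Ciara,27,Other"

# read.csv() returns a data frame: one row per case, one column per variable.
voters <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(voters)        # inspect the structure: column names and types
nrow(voters)       # number of cases
mean(voters$age)   # mean of one variable
```

Add-on packages are installed once with install.packages("ggplot2") and then loaded in each session with library(ggplot2).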

Outline

1 Working with functions in R

2 Variables and measurement

3 Producing tables

4 Producing graphs

Functions

Using functions

The code below creates a new function called “myMean”, which takes one parameter “x” as input and uses the existing functions “sum” and “length” to calculate and return the mean of “x”.

# Define the function: the mean is the sum divided by the number of values
myMean <- function(x) {
  sum(x) / length(x)
}

# Create a vector and apply the function to it
a <- c(1, 4, 3, 5)
myMean(a)

The code then creates a vector “a” with 4 values, a = [1 4 3 5], and uses the function “myMean” to calculate its mean.

Defining functions
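A function definition has three parts: a name, a list of parameters, and a body. A minimal sketch, using a hypothetical function myMean2 that is not part of the slides:

```r
# name   <- function(parameters) { body }
myMean2 <- function(x) {
  total <- sum(x)      # objects created inside the body are local to the function
  n <- length(x)
  return(total / n)    # return() is optional: the last value is returned anyway
}

myMean2(c(1, 4, 3, 5))
```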

Calling functions

a <- c(1, 4, 3, 5)

This code calls function “c” (concatenate), passes as input to the function the numbers 1, 4, 3 and 5, and saves the output of the function to object “a”.

myMean(a)

This code calls function “myMean”, passes as input to the function the object “a”, and does not save the output, so R will just print the output on the screen.
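The difference between saving and printing the output can be sketched as follows. The object name “m” is only illustrative.

```r
myMean <- function(x) {
  sum(x) / length(x)
}

a <- c(1, 4, 3, 5)

m <- myMean(a)       # output saved to object "m"; nothing is printed
m                    # typing the name of the object prints it
myMean(x = a)        # the input can also be passed by parameter name
round(myMean(a))     # the output of one call can be the input of another
```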

Measurement

“Measurement is the process of determining and recording which of the possible traits of a variable an individual case exhibits or possesses.”

“A case is an entity that displays or possesses the traits of a given variable.”

“A population is the set of all cases of interest. A sample is a subset of the population.”

(Argyrous, 1997, 3–4)

Levels of measurement

Categorical   Nominal    categories
              Ordinal    ... in particular order
Scale         Interval   ... with meaningful distance
              Ratio      ... with meaningful zero

“A discrete variable is measured by a unit that cannot be subdivided. It has a countable number of values.”

“A continuous variable is measured by units that can be subdivided infinitely. It can take any value in a line interval.”

(Argyrous, 1997, 11)

Example (nominal): UN membership (country), committee membership (MP)

Example (ordinal): education (voter, in level), Likert scale

Example (interval): left-right orientation (party, voters)

Example (ratio): geographical distance, education (voter, in years), GDP per capita (country).
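In R, the level of measurement is usually reflected in how a variable is stored. A rough sketch with made-up values: factors for nominal variables, ordered factors for ordinal variables, and numeric vectors for scale variables.

```r
# Nominal: unordered categories are stored as a factor.
party <- factor(c("Labour", "Other", "Labour", "Fine Gael"))

# Ordinal: ordered categories are stored as an ordered factor
# (made-up Likert-type responses).
likert_levels <- c("disagree", "neutral", "agree")
opinion <- factor(c("agree", "neutral", "agree", "disagree"),
                  levels = likert_levels, ordered = TRUE)

# Scale (interval or ratio): numeric vector, e.g. education in years.
educ_years <- c(12, 16, 18, 10)

levels(party)            # the categories
opinion < "agree"        # order comparisons work for ordered factors
opinion[1] > opinion[2]
mean(educ_years)         # arithmetic makes sense for scale variables
```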

Levels of measurement: Why?

The level of measurement of a variable determines:

• What plots are appropriate to visualise the data.

• What tables can reasonably be produced.

• What statistics can be computed to summarize the data.

• What statistical analysis can be performed with the data.
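A sketch of how the level of measurement limits which summary statistics make sense, using made-up data. The pairing of mode, median and mean with the three levels is the standard textbook convention rather than something stated on the slide.

```r
# Made-up data for illustration.
party   <- factor(c("Labour", "Other", "Labour", "Fine Gael", "Labour"))
opinion <- factor(c("agree", "neutral", "agree", "disagree", "neutral"),
                  levels = c("disagree", "neutral", "agree"), ordered = TRUE)
age     <- c(23, 35, 41, 58, 63)

# Nominal: counts per category, and the most frequent category (the mode).
table(party)
names(which.max(table(party)))

# Ordinal: the middle category is meaningful because the categories are ordered.
# Here it is taken from the median of the integer codes (odd number of cases).
levels(opinion)[median(as.integer(opinion))]

# Scale: means and standard deviations are meaningful.
mean(age)
sd(age)

# The mean of a nominal variable is not meaningful; R returns NA with a warning.
suppressWarnings(mean(party))
```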

Tables

Frequency table: A table of one variable, showing the number or proportion of cases in each category.

For categorical variables only.

Numerical equivalent of the barplot or pie chart.

Cross table: A table of two or more variables, showing the number or proportion of cases in each combination of categories. Also called a contingency table or pivot table.

For categorical variables only.

Similar to barplots by category, or stacked barplots.
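Both kinds of table can be produced with base R's table() and prop.table(). A minimal sketch with made-up data; the variable names are illustrative.

```r
# Made-up data for illustration.
party  <- c("Labour", "Other", "Labour", "Fine Gael", "Labour", "Other")
dublin <- c("Dublin", "Not Dublin", "Dublin", "Not Dublin", "Not Dublin", "Dublin")

# Frequency table: number of cases per category of one variable.
freq <- table(party)
freq
prop.table(freq)                # the same table as proportions

# Cross table: number of cases per combination of two variables.
cross <- table(party, dublin)
cross
prop.table(cross, margin = 1)   # row proportions: each row sums to 1
```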

Example: cross table

slides by Ivan Katchanovski, 2008

Main types of graphs

univariate     Categorical            pie charts
                                      barplots
               Scale                  histogram
                                      density plot
                                      boxplot
multivariate   Scale by scale         scatterplot
               Scale by categorical   boxplots

Categorical variables

For categorical variables, it is often useful to look at the number of cases or the proportion of cases in a particular category.

Barplots and pie charts are useful for this.

[Figure: barplot of count by party, and pie chart of party (Fianna Fail, Fine Gael, Labour, Other, Sinn Fein)]

Barplot

[Figure: barplot of count by party (Fianna Fail, Fine Gael, Labour, Other, Sinn Fein)]

library(ggplot2)  # ggplot() and geom_bar() come from the ggplot2 package
ggplot(ieVoters, aes(x = party)) + geom_bar()

Grammar of Graphics

“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars)” (Wickham, 2015, 5).

slides by Roger Peng

Pie chart

[Figure: pie chart of party (Fianna Fail, Fine Gael, Labour, Other, Sinn Fein)]

pie(table(ieVoters$party))

Histogram

For continuous (or scale) variables, we often want to get an idea of the distribution of values.

Histograms are useful to get an impression.

• bin the data using equal-distance cut-off points

• then produce a barplot of the number in each bin.

[Figure: histogram of age (x-axis: age, y-axis: count)]
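The two steps of a histogram can be sketched by hand in base R. The ages and the cut-off points below are made up for illustration.

```r
# Made-up ages for illustration.
age <- c(19, 23, 24, 31, 35, 38, 42, 47, 55, 61, 68, 74)

# Step 1: bin the data using equal-distance cut-off points.
breaks <- seq(10, 80, by = 10)
bins <- cut(age, breaks = breaks)   # intervals (10,20], (20,30], ...
counts <- table(bins)
counts

# Step 2: a barplot of these counts is the histogram.
# hist() does both steps; plot = FALSE returns the counts without drawing.
h <- hist(age, breaks = breaks, plot = FALSE)
h$counts
```

Calling barplot(counts) or hist(age, breaks = breaks) draws the figure.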

[Figure: influence scores (y-axis: Score, x-axis: Month) for Putin, Medvedev and Voloshin, with a histogram (y-axis: Frequency) of the scores for each of the three]

(Baturo and Elkink, 2014, 2015)

Density plot

A density plot is a smoothed version of a histogram, based on a non-parametric estimation of the shape of the distribution.

[Figure: density plot of age (x-axis: age, y-axis: density)]
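In base R, density() computes such an estimate using kernel density estimation, one common non-parametric method. A minimal sketch with made-up ages; the slides do not say which settings were used for the figure.

```r
# Made-up ages for illustration.
age <- c(19, 23, 24, 31, 35, 38, 42, 47, 55, 61, 68, 74)

# density() computes a kernel density estimate (Gaussian kernel by default).
d <- density(age)

d$bw          # the bandwidth controls how smooth the curve is
range(d$x)    # grid of values at which the density is evaluated

# The area under the estimated curve is approximately 1, as for any density.
area <- sum(d$y) * diff(d$x[1:2])
area
```

Calling plot(d) draws the curve.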

Density plot

[Figure: histogram of age on the density scale with a density curve overlaid (x-axis: age, y-axis: density)]

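library(ggplot2)

# Histogram on the density scale, with a density curve drawn on top.
# after_stat(density) replaces the older ..density.. notation (ggplot2 >= 3.3.0).
ggplot(ieVoters, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.5,
                 colour = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "red")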

Distributions

For graphs of distributions (histogram, density plot, boxplot, etc.) you want to get an impression of:

• the shape of the distribution;

• the center and spread of the distribution;

• the presence of outliers.

(Moore, 2003, 12)
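Center, spread and outliers can also be checked numerically. A sketch with made-up values that include one extreme case; the 1.5 × IQR rule used to flag outliers is a common rule of thumb and an assumption here, not something taken from the slides.

```r
# Made-up values for illustration, with one extreme case.
x <- c(21, 24, 25, 27, 28, 30, 31, 33, 95)

# Center: the mean is pulled towards the extreme value, the median is not.
mean(x)
median(x)

# Spread: the standard deviation is sensitive to it too, the IQR much less so.
sd(x)
IQR(x)

# Rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
q <- quantile(x, c(0.25, 0.75))
outliers <- x[x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)]
outliers
```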

Boxplot

Another way of looking at the distribution of a continuous variable is to find out where the lowest 25% are located, where the lowest 50% are located, and where the top 25% are located.

A plot that shows this is the boxplot.

[Figure: boxplots for Medvedev, Putin and Voloshin (y-axis from 3 to 10)]
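The numbers behind a boxplot can be inspected directly. A minimal sketch with made-up scores, not the data shown in the figure.

```r
# Made-up scores for illustration.
score <- c(3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 10)

# Quartiles: 25% of cases lie below the first, 50% below the median,
# and 25% above the third.
quantile(score, probs = c(0.25, 0.5, 0.75))

# boxplot.stats() returns the five numbers a boxplot draws: lower whisker,
# lower hinge, median, upper hinge, upper whisker. The hinges are
# approximately the first and third quartiles.
b <- boxplot.stats(score)
b$stats
b$out      # points that would be drawn individually as outliers
```

Calling boxplot(score) draws the plot.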

Time plot

When data is measured over time, another useful plot is a time plot, to see trends over time.

[Figure: time plot of the proportion of democracies (y-axis, 0 to 1) by year (x-axis, 1800 to 2000)]

Polity IV (Marshall and Jaggers, 2002)
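A series like this can be computed by taking the mean of a 0/1 indicator within each year. A sketch with made-up country-year data (not the Polity IV data); the column names are hypothetical.

```r
# Made-up country-year data for illustration.
d <- data.frame(
  year      = c(1990, 1990, 1990, 2000, 2000, 2000, 2010, 2010, 2010, 2010),
  democracy = c(0, 1, 0, 1, 1, 0, 1, 1, 1, 0)
)

# Proportion of democracies per year: the mean of a 0/1 variable.
prop <- tapply(d$democracy, d$year, mean)
prop

# Years as numbers, ready for the x-axis of a time plot.
years <- as.numeric(names(prop))
years
```

Calling plot(years, prop, type = "l") then draws the proportions as a line over time.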

Basic principles

• Show comparisons

• Show causality, mechanisms

• Show multivariate data patterns

• Integrate multiple modes of evidence

• Describe the data, sources, measurement scales, etc.

• In the end it stands or falls by the quality and relevance of the data

slides by Roger Peng, based on Tufte (2006)

Comparisons

[Figure: two scatterplots of econIMF (y-axis) against age (x-axis)]

The plot on the right, with separate lines for Dublin (red) and others (green), tells more of a story.
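The same comparison can be made numerically by fitting one line per group. This sketch assumes straight-line fits (the slide does not say how its lines were produced) and uses made-up data; the variable names follow the axis labels and legend in the figures.

```r
# Made-up data for illustration.
d <- data.frame(
  age     = c(20, 30, 40, 50, 60, 70, 20, 30, 40, 50, 60, 70),
  econIMF = c(3, 4, 5, 6, 7, 8, 8, 7, 7, 6, 5, 5),
  dublin  = rep(c("Dublin", "Not Dublin"), each = 6)
)

# One line for everyone hides the difference between the groups ...
coef(lm(econIMF ~ age, data = d))

# ... while one line per group shows slopes in opposite directions.
fits <- lapply(split(d, d$dublin),
               function(g) coef(lm(econIMF ~ age, data = g)))
fits
```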

Multivariate patterns

[Figure: scatterplots of econBanks (y-axis) against age (x-axis), one panel per party (Fianna Fail, Fine Gael, Labour, Other, Sinn Fein), with points marked as Dublin or Not Dublin]

Example: knowledge and voting

[Figure: probability of NO vote (y-axis, 0 to 1) against low objective knowledge of the Lisbon treaty (x-axis, 0 to 8), shown separately for Satisfied and Dissatisfied]

(?)

References

Argyrous, George. 1997. Statistics for social research. Basingstoke: MacMillan.

Baturo, Alexander and Johan A. Elkink. 2014. “Office or Officeholder? Regime Deinstitutionalisation and Sources of Individual Political Influence.” Journal of Politics 76(3).

Baturo, Alexander and Johan A. Elkink. 2015. “Dynamics of Regime Personalization and Patron-Client Networks in Russia, 1999–2014.” Post-Soviet Affairs 32(1):75–98.

Marshall, M.G. and K. Jaggers. 2002. “Polity IV project: political regime characteristics and transitions, 1800-2002.” URL: http://www.bsos.umd.edu/cidcm/polity/

Moore, David S. 2003. The basic practice of statistics. 3rd ed. New York: W.H. Freeman.

Tufte, Edward R. 2006. Beautiful Evidence. Graphics Press.

Wickham, Hadley. 2015. ggplot2: Elegant Graphics for Data Analysis. Springer. URL: http://ms.mcmaster.ca/~bolker/misc/ggplot2-book.pdf