
Introduction to Data Science
Section 4
Data Matters 2015

Sponsored by the Odum Institute, RENCI, and NCDS

Thomas M. Carsey
carsey@unc.edu


Data Science Nuts and Bolts


Data Collection

• Data exist all around us
  – Government statistics
  – Prices on products
  – Surveys (polls, the Census, business surveys, etc.)
  – Weather reports
  – Stock prices

• Potential data is ubiquitous
  – Every action, attitude, behavior, opinion, physical attribute, etc. that you could imagine being measured.


Methods of Data Collection

• Traditional Methods:
  – Observe and record
  – Interview, Survey
  – Experiment

• Newer methods employ these techniques, but also include:
  – Remote observation (e.g. sensors, satellites)
  – Computer-assisted interviewing
  – Biological and physiological measurement
  – Web scraping, digital path tracing
  – Crowd sourcing


Measurement is the Key

• Regardless of how you collect data, you must consider measurement.

• Measurement links an observable indicator, scale, or other metric to a concept of interest.

• There is always some slippage in measurement.
• Basic types and concerns:
  – Nominal, Ordinal, Interval, Ratio
  – Dimensions, error, validity, reliability
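As a quick illustration of the four measurement levels, the sketch below encodes invented variables in pandas; all column names and values are hypothetical.

```python
# A minimal sketch of nominal, ordinal, interval, and ratio variables
# using pandas. The data are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "party":  ["Dem", "Rep", "Ind", "Dem"],      # nominal: labels, no order
    "educ":   ["HS", "BA", "MA", "BA"],          # ordinal: ordered categories
    "temp_f": [68.0, 74.5, 71.2, 69.9],          # interval: equal spacing, no true zero
    "income": [42000, 58000, 75000, 51000],      # ratio: true zero, ratios meaningful
})

# Declare the ordinal variable's ordering so analyses can respect it.
df["educ"] = pd.Categorical(df["educ"], categories=["HS", "BA", "MA"], ordered=True)

print(df.dtypes)
print(df["educ"].cat.codes.tolist())  # ordered codes: HS=0, BA=1, MA=2
```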


Measurement Validity

• Validity refers to how well the measure captures the concept.
  – Construct Validity
    • How well does the scale measure the construct it was intended to measure? (Correlations can be potential measures.)
  – Content Validity
    • Does the measure include everything it should and nothing that it should not? This is subjective (no statistical test here).
  – Criterion Validity
    • How well does the measure compare to other measures and/or predictors?


Measurement Reliability

• Reliability refers to whether a measure is consistent and stable.
  – Can the measure be confirmed by further measurement or observations?
  – If you measure the same thing with the same measurement tool, would you get the same score?
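A minimal simulated sketch of test-retest reliability, assuming invented data: the same underlying trait measured twice with independent error.

```python
# Test-retest reliability on simulated data. All numbers are invented.
import numpy as np

rng = np.random.default_rng(42)
trait = rng.normal(0, 1, 500)            # the true, unobserved quantity
wave1 = trait + rng.normal(0, 0.5, 500)  # first measurement, with error
wave2 = trait + rng.normal(0, 0.5, 500)  # second measurement, same tool

# The test-retest correlation estimates reliability: how consistently
# the tool scores the same cases across repeated measurements.
r = np.corrcoef(wave1, wave2)[0, 1]
print(f"test-retest correlation: {r:.2f}")  # ~0.80 at this noise level
```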


Why Measurement Matters

• If the measurement of the outcome you care about has random error, your ability to model and predict it will decrease.

• If the measurement of predictors of the outcome has random error, you will get biased estimates of how those predictors are related to the outcome you care about.

• If either outcomes or predictors have systematic measurement error, you might get relationships right, but you’ll be wrong on levels.
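The second bullet can be made concrete with a short simulation. With a single predictor, random measurement error biases the estimated slope toward zero (attenuation); all numbers below are invented.

```python
# Attenuation bias: random error in a predictor shrinks its estimated
# effect toward zero. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x_true = rng.normal(0, 1, n)
y = 2.0 * x_true + rng.normal(0, 1, n)     # true slope is 2.0
x_noisy = x_true + rng.normal(0, 1, n)     # predictor measured with error

def slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"slope with clean predictor: {slope(x_true, y):.2f}")   # ~2.0
print(f"slope with noisy predictor: {slope(x_noisy, y):.2f}")  # ~1.0, biased toward 0
```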


Storing Collected Data

• Once you collect data, you need to store it.
• Flat “spreadsheet”-like files
• Relational databases
• Audio, Video, Text?
• Numeric or non-numeric?
• Plan for adding more observations, more variables, or merging with other data sources
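A minimal sketch of the first two storage options using only Python's standard library; the file, table, and column names are hypothetical.

```python
# Flat-file vs. relational storage, sketched with invented data.
import csv
import sqlite3

rows = [("A001", 34, "NC"), ("A002", 51, "VA")]

# Flat "spreadsheet"-like storage: one CSV file.
with open("respondents.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "age", "state"])
    writer.writerows(rows)

# Relational storage: a table with a declared schema, ready for more
# observations, more variables (ALTER TABLE), or merges (JOIN).
con = sqlite3.connect("study.db")
con.execute("CREATE TABLE IF NOT EXISTS respondents "
            "(id TEXT PRIMARY KEY, age INTEGER, state TEXT)")
con.executemany("INSERT OR REPLACE INTO respondents VALUES (?, ?, ?)", rows)
con.commit()
con.close()
```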


Data Analysis

• We analyze data to extract meaning from it.
• Virtually all data analysis focuses on data reduction.
• Data reduction comes in the form of:
  – Descriptive statistics
  – Measures of association
  – Graphical visualizations

• The objective is to abstract from all of the data some feature or set of features that captures evidence of the process you are studying
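A small simulated example of data reduction: a thousand invented observations collapsed into a few statistics of the kinds listed above.

```python
# Data reduction on invented data: many raw values summarized by a
# measure of central tendency, dispersion, and association.
import numpy as np

rng = np.random.default_rng(1)
hours = rng.normal(6, 2, 1000)                   # e.g., hours studied
score = 50 + 4 * hours + rng.normal(0, 8, 1000)  # e.g., exam score

print(f"mean score: {score.mean():.1f}")         # central tendency
print(f"std dev:    {score.std(ddof=1):.1f}")    # dispersion
print(f"corr(hours, score): {np.corrcoef(hours, score)[0, 1]:.2f}")  # association
```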


Why Data Reduction?

• Data reduction lets us see critical features or patterns in the data.

• Which features are important depends on the question we are asking
  – Road maps, topographical maps, precinct maps, etc.
• Much of data reduction in data science falls under the heading of statistics.


Some Definitions

• Data is what we observe and measure in the world around us

• Statistics are calculations we produce that provide a quantitative summary of some attribute(s) of the data.

• Cases/Observations are the objects in the world for which we have data.

• Variables are the attributes of cases (or other features related to the cases) for which we have data.


Quantitative vs. Qualitative

• Much of the “tension” between these two approaches is misguided.

• Both are Data
• Both are or can be:
  – Empirical
  – Scientific
  – Systematic
  – Wrong
  – Limited


Qual and Quant (cont.)

• It is not as simple as Quant = numbers and Qual = words.
  – Much of quantitative data is merely categorization of underlying concepts
    • Countries are labeled “Democratic” or not
    • Kids are labeled “Gifted” or not
    • Couples are labeled “Committed” or “In Love” or not
    • Baseball players commit “Errors” or not
    • Different types of chocolate are “Good” or not
  – Increasing quantitative analysis of text


So what are Statistics?

• Quantities we calculate to summarize data
  – Central tendency
  – Dispersion
  – Distributional characteristics
  – Associations and partial associations/correlation

• Statistics are exact representations of data, but serve only as estimates of population characteristics. Those estimates always come with uncertainty.
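The last point can be illustrated with a bootstrap sketch on simulated data: the sample mean is an exact summary of the sample, but only an uncertain estimate of the population mean.

```python
# Bootstrap interval for a sample mean. Data are invented.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(100, 15, 200)        # one sample from a population

# Resample the sample many times to see how much the mean could vary.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.1f}")
print(f"95% bootstrap interval for the population mean: [{lo:.1f}, {hi:.1f}]")
```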


Goals of Statistical Analysis

• Description offers an account or summary, but not an explanation of why something is the way it is.
• Causality offers a statement about influence.
  – The “fundamental problem of causal inference”: we never observe the counterfactual.
  – A causal statement is NOT necessarily a theoretical statement: theory demands an explanation for why something happens.

• Inference involves extrapolating from what you find in your data to those cases for which you do not have data.
  – It will always be probabilistic.

• We can have both Descriptive and Causal inference.


Missing Data

• We have talked about representativeness of data, but mostly from a sampling perspective.

• We can also have missing data.
  – If data is missing at random, we can cope.
  – If data is missing, but not at random, we have major problems with representativeness.


Responses to Missing Data

• Casewise deletion: This is by far the most common response and is also now understood to be the worst thing you can do.

• Single Imputation: Better, but this ignores uncertainty in imputed data.

• Multiple Imputation: Impute several times and average across results. Addresses limitations of previous two.
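A minimal sketch of the three responses on invented data. The "multiple imputation" here is a toy stochastic version for illustration, not a full algorithm such as MICE.

```python
# Three responses to missing data, on simulated values missing at random.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(10, 2, 500)
miss = rng.random(500) < 0.3           # 30% missing, at random here
x[miss] = np.nan
obs = x[~miss]

# 1. Casewise deletion: drop incomplete cases (common, but the worst option).
print(f"deletion mean:    {obs.mean():.2f}  (n={obs.size})")

# 2. Single imputation: fill each hole once; ignores imputation uncertainty.
single = np.where(miss, obs.mean(), x)
print(f"single-imputed:   {single.mean():.2f}")

# 3. Multiple imputation: fill several times with draws that carry
#    uncertainty, analyze each completed dataset, then pool the results.
means = []
for _ in range(20):
    filled = x.copy()
    filled[miss] = rng.normal(obs.mean(), obs.std(ddof=1), miss.sum())
    means.append(filled.mean())
print(f"multiply-imputed: {np.mean(means):.2f}  (pooled over 20 imputations)")
```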


Missing Data – Summing Up

• The problem with missing data is representativeness of the remaining sample.

• A representative sample of 100 or 1000 is often much better than a non-representative sample of 1 million.

• Imputation is still an imperfect solution because it assumes we can learn (statistically) about missing data from the data we have.


Claims about Causality

• Imagine we have a single treatment variable coded as Treated or Not Treated (1 or 0, respectively).

• We have an outcome, Y, we care about.
• We observe Y1 for the treated and Y0 for the untreated.
• Before the study, every case had a potential value for Y1 AND Y0.
• However, we only observe Y1 OR Y0 for each case.
• We NEVER observe the counterfactual.
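A small simulation of this setup. Both potential outcomes exist below only because the data are invented; in a real study one of them is always the missing counterfactual.

```python
# The potential-outcomes setup with invented data: each case has a Y1
# and a Y0, but observation reveals only one of them.
import numpy as np

rng = np.random.default_rng(5)
n = 5
y0 = rng.normal(50, 10, n).round(1)   # potential outcome if untreated
y1 = (y0 + 5).round(1)                # potential outcome if treated
t = rng.integers(0, 2, n)             # who actually received treatment

for i in range(n):
    observed = y1[i] if t[i] else y0[i]
    counterfactual = y0[i] if t[i] else y1[i]
    print(f"case {i}: treated={bool(t[i])}, observed Y={observed}, "
          f"counterfactual Y={counterfactual} (never seen)")
```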


Causality and Experiments

• Experiments allow us to estimate an Average Treatment Effect (ATE) through random assignment of Treatment.

• We still can’t talk about individual causality, but random assignment lets us make ATE claims.
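A minimal simulation of that claim, with invented data: under coin-flip assignment, the simple difference in group means recovers the ATE even though no individual effect is ever observed.

```python
# Random assignment recovers the ATE. All numbers are invented.
import numpy as np

rng = np.random.default_rng(11)
n = 100_000
y0 = rng.normal(50, 10, n)
y1 = y0 + rng.normal(5, 2, n)        # heterogeneous effects, mean 5
treated = rng.random(n) < 0.5        # coin-flip assignment
y_obs = np.where(treated, y1, y0)    # we only ever see one outcome per case

ate_hat = y_obs[treated].mean() - y_obs[~treated].mean()
print(f"true ATE: {(y1 - y0).mean():.2f}, randomized estimate: {ate_hat:.2f}")
```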


Causality in Observational Studies

• In observational studies, treatment is not randomly assigned.

• This means that the Treated and Non-Treated groups could be different for two reasons:
  – One is treated and one is not treated
  – Any other factor(s) that predict treatment

• Thus, treatment might be correlated with the outcome of interest, but not its cause.
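A short simulation of this problem, with an invented confounder: when a factor drives both treatment uptake and the outcome, the naive comparison is wrong even though the true effect is zero.

```python
# Confounding in an observational comparison. Data are invented.
import numpy as np

rng = np.random.default_rng(13)
n = 100_000
health = rng.normal(0, 1, n)                              # confounder
treated = rng.random(n) < 1 / (1 + np.exp(-2 * health))   # healthier cases opt in
y = 50 + 5 * health + rng.normal(0, 1, n)                 # treatment has NO effect on y

naive = y[treated].mean() - y[~treated].mean()
print(f"naive treated-vs-untreated gap: {naive:.2f} (true effect is 0)")
```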


Causality in Observational Studies

• The typical response is to “control” for other potential confounding factors statistically.
  – This forces us to have those measures and to know the functional form of the confounding effects.
• Sometimes we can use instrumental variables
  – Often works poorly in practice
• Matching
  – Or matching followed by regression/controls


Matching

• For each treated observation, find an untreated observation that is otherwise identical (similar) to the treated observation.

• Matching is done using independent variables only – NOT the outcome of interest.

• There are many methods of matching.
• You almost always lose data – some observations don’t match any others very well.
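A minimal nearest-neighbor matching sketch on simulated data with one confounder. This is only one of the many matching methods mentioned above, and the numbers are invented; note the matching uses the independent variable only, never the outcome.

```python
# Nearest-neighbor matching on a single confounder, with invented data.
import numpy as np

rng = np.random.default_rng(17)
n = 2000
x = rng.normal(0, 1, n)                                  # confounder
treated = rng.random(n) < 1 / (1 + np.exp(-2 * x))
y = 50 + 5 * x + 2 * treated + rng.normal(0, 1, n)       # true effect = 2

x_t, y_t = x[treated], y[treated]
x_c, y_c = x[~treated], y[~treated]

# Match each treated case to its closest control on x (with replacement),
# then compare outcomes within matched pairs only.
idx = np.abs(x_t[:, None] - x_c[None, :]).argmin(axis=1)
matched_effect = (y_t - y_c[idx]).mean()
naive_effect = y_t.mean() - y_c.mean()
print(f"naive difference:   {naive_effect:.2f}")
print(f"matched difference: {matched_effect:.2f} (true effect is 2)")
```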


Better Null Hypothesis

• We are taught to specify a Null hypothesis, then test to see whether we should reject it.
• We frequently articulate a Null of “nothing”
  – No relationship
  – No structure
  – No pattern
• Such Nulls are often:
  – Silly
  – Far too easy to reject
  – Likely to make us misinterpret structure as meaningful


Example 1: Zipf’s Law
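The original slide's figure is not reproduced in this transcript. As a hedged stand-in, the sketch below checks Zipf's law (word frequency roughly proportional to 1/rank) on any plain-text file you supply; the file name is hypothetical.

```python
# Rank-frequency check of Zipf's law on a text file of your choosing.
from collections import Counter
import re

# "sample.txt" is a hypothetical input file; substitute any large text.
with open("sample.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

for rank, (word, n) in enumerate(Counter(words).most_common(10), start=1):
    # Under Zipf's law, n * rank stays roughly constant across ranks.
    print(f"{rank:2d}. {word:<12} freq={n:<6} freq*rank={n * rank}")
```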


Example 2: Superstitious Learning


Example 3: Residential Segregation

• Thomas Schelling (1978) noted that having even slight preferences for living in areas with some people similar to you produced highly segregated neighborhoods.
  – He used coins on graph paper to illustrate the process.
  – We can use an agent-based model simulation
    • NetLogo Social Science Segregation
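A minimal one-dimensional sketch of Schelling's dynamic, assuming a ring of agents and a mild 30% in-group preference; the NetLogo model referenced above is two-dimensional and richer.

```python
# A toy 1-D Schelling model: unhappy agents relocate by swapping spots.
import numpy as np

rng = np.random.default_rng(19)
n = 200
grid = rng.integers(0, 2, n)              # two agent types on a ring

def share_alike(g, i, radius=2):
    """Fraction of i's 2*radius neighbors that share i's type."""
    idx = [(i + d) % g.size for d in range(-radius, radius + 1) if d != 0]
    return (g[idx] == g[i]).mean()

for _ in range(20000):
    i = rng.integers(0, n)
    if share_alike(grid, i) < 0.3:        # unhappy: under 30% like neighbors
        j = rng.integers(0, n)
        grid[i], grid[j] = grid[j], grid[i]   # move to a random spot

# The average like-neighbor share typically ends well above the ~0.5
# expected under random mixing: mild preferences, strong segregation.
print(f"mean share alike: {np.mean([share_alike(grid, k) for k in range(n)]):.2f}")
```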


Example 4: Traffic Grid

• How can lowering the speed limit reduce the time spent waiting at traffic lights?
  – Another agent-based model
    • NetLogo Social Science Traffic Grid


One More: Complexity in Regulating Ecosystems

• Another agent-based model
  – NetLogo Biology Wolf Sheep Predation


A Better Null Hypothesis

• A better Null hypothesis offers a plausible prediction of what would happen in a world where all but your causal variable of interest were present.

• In other words, posit a simple model and generate expectations about observable patterns.
• Then modify the model, make updated predictions, and evaluate whether those predictions match observable data.
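A minimal sketch of that workflow on invented data: posit a simple null model (a fair coin), simulate the patterns it produces, and only then judge whether an observed "pattern" (here, a longest run of heads) is actually surprising.

```python
# Simulate a plausible null model before calling structure meaningful.
import numpy as np

rng = np.random.default_rng(23)

def longest_run(flips):
    """Length of the longest consecutive run of 1s (heads)."""
    best = cur = 0
    for f in flips:
        cur = cur + 1 if f else 0
        best = max(best, cur)
    return best

observed = longest_run(rng.integers(0, 2, 100))   # stand-in for real data

# Generate the null distribution: what "no pattern" actually looks like.
null_runs = [longest_run(rng.integers(0, 2, 100)) for _ in range(2000)]
p = np.mean([r >= observed for r in null_runs])
print(f"observed longest run: {observed}, p-value under fair-coin null: {p:.2f}")
```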