
Introduction to Data Science
Section 4
Data Matters 2015

Sponsored by the Odum Institute, RENCI, and NCDS

Thomas M. Carsey
carsey@unc.edu


Data Science Nuts and Bolts


Data Collection

• Data exist all around us
  – Government statistics
  – Prices on products
  – Surveys (polls, the Census, business surveys, etc.)
  – Weather reports
  – Stock prices

• Potential data is ubiquitous
  – Every action, attitude, behavior, opinion, physical attribute, etc. that you could imagine being measured.


Methods of Data Collection

• Traditional Methods:
  – Observe and record
  – Interview, Survey
  – Experiment

• Newer methods employ these techniques, but also include:
  – Remote observation (e.g. sensors, satellites)
  – Computer-assisted interviewing
  – Biological and physiological measurement
  – Web scraping, digital path tracing
  – Crowd sourcing


Measurement is the Key

• Regardless of how you collect data, you must consider measurement.

• Measurement links an observable indicator, scale, or other metric to a concept of interest.

• There is always some slippage in measurement.
• Basic types and concerns:
  – Nominal, Ordinal, Interval, Ratio
  – Dimensions, error, validity, reliability
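As a quick illustration of the four measurement levels, the sketch below encodes invented variables in pandas; all column names and values are hypothetical.

```python
# A minimal sketch of nominal, ordinal, interval, and ratio variables
# using pandas. The data are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "party":  ["Dem", "Rep", "Ind", "Dem"],      # nominal: labels, no order
    "educ":   ["HS", "BA", "MA", "BA"],          # ordinal: ordered categories
    "temp_f": [68.0, 74.5, 71.2, 69.9],          # interval: equal spacing, no true zero
    "income": [42000, 58000, 75000, 51000],      # ratio: true zero, ratios meaningful
})

# Declare the ordinal variable's ordering so analyses can respect it.
df["educ"] = pd.Categorical(df["educ"], categories=["HS", "BA", "MA"], ordered=True)

print(df.dtypes)
print(df["educ"].cat.codes.tolist())  # ordered codes: HS=0, BA=1, MA=2
```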


Measurement Validity

• Validity refers to how well the measure captures the concept.
  – Construct Validity
    • How well does the scale measure the construct it was intended to measure? (Correlations can be potential measures.)
  – Content Validity
    • Does the measure include everything it should and nothing that it should not? This is subjective (no statistical test here).
  – Criterion Validity
    • How well does the measure compare to other measures and/or predictors?


Measurement Reliability

• Reliability refers to whether a measure is consistent and stable.
  – Can the measure be confirmed by further measurement or observations?
  – If you measure the same thing with the same measurement tool, would you get the same score?
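A minimal simulated sketch of test-retest reliability, assuming invented data: the same underlying trait measured twice with independent error.

```python
# Test-retest reliability on simulated data. All numbers are invented.
import numpy as np

rng = np.random.default_rng(42)
trait = rng.normal(0, 1, 500)            # the true, unobserved quantity
wave1 = trait + rng.normal(0, 0.5, 500)  # first measurement, with error
wave2 = trait + rng.normal(0, 0.5, 500)  # second measurement, same tool

# The test-retest correlation estimates reliability: how consistently
# the tool scores the same cases across repeated measurements.
r = np.corrcoef(wave1, wave2)[0, 1]
print(f"test-retest correlation: {r:.2f}")  # ~0.80 at this noise level
```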


Why Measurement Matters

• If the measurement of the outcome you care about has random error, your ability to model and predict it will decrease.

• If the measurement of predictors of the outcome has random error, you will get biased estimates of how those predictors are related to the outcome you care about.

• If either outcomes or predictors have systematic measurement error, you might get relationships right, but you’ll be wrong on levels.
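The second bullet can be made concrete with a short simulation. With a single predictor, random measurement error biases the estimated slope toward zero (attenuation); all numbers below are invented.

```python
# Attenuation bias: random error in a predictor shrinks its estimated
# effect toward zero. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x_true = rng.normal(0, 1, n)
y = 2.0 * x_true + rng.normal(0, 1, n)     # true slope is 2.0
x_noisy = x_true + rng.normal(0, 1, n)     # predictor measured with error

def slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"slope with clean predictor: {slope(x_true, y):.2f}")   # ~2.0
print(f"slope with noisy predictor: {slope(x_noisy, y):.2f}")  # ~1.0, biased toward 0
```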


Storing Collected Data

• Once you collect data, you need to store it.
• Flat “spreadsheet”-like files
• Relational databases
• Audio, Video, Text?
• Numeric or non-numeric?
• Plan for adding more observations, more variables, or merging with other data sources
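A minimal sketch of the first two storage options using only Python's standard library; the file, table, and column names are hypothetical.

```python
# Flat-file vs. relational storage, sketched with invented data.
import csv
import sqlite3

rows = [("A001", 34, "NC"), ("A002", 51, "VA")]

# Flat "spreadsheet"-like storage: one CSV file.
with open("respondents.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "age", "state"])
    writer.writerows(rows)

# Relational storage: a table with a declared schema, ready for more
# observations, more variables (ALTER TABLE), or merges (JOIN).
con = sqlite3.connect("study.db")
con.execute("CREATE TABLE IF NOT EXISTS respondents "
            "(id TEXT PRIMARY KEY, age INTEGER, state TEXT)")
con.executemany("INSERT OR REPLACE INTO respondents VALUES (?, ?, ?)", rows)
con.commit()
con.close()
```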


Data Analysis

• We analyze data to extract meaning from it.
• Virtually all data analysis focuses on data reduction.
• Data reduction comes in the form of:
  – Descriptive statistics
  – Measures of association
  – Graphical visualizations

• The objective is to abstract from all of the data some feature or set of features that captures evidence of the process you are studying
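A small simulated example of data reduction: a thousand invented observations collapsed into a few statistics of the kinds listed above.

```python
# Data reduction on invented data: many raw values summarized by a
# measure of central tendency, dispersion, and association.
import numpy as np

rng = np.random.default_rng(1)
hours = rng.normal(6, 2, 1000)                   # e.g., hours studied
score = 50 + 4 * hours + rng.normal(0, 8, 1000)  # e.g., exam score

print(f"mean score: {score.mean():.1f}")         # central tendency
print(f"std dev:    {score.std(ddof=1):.1f}")    # dispersion
print(f"corr(hours, score): {np.corrcoef(hours, score)[0, 1]:.2f}")  # association
```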


Why Data Reduction?

• Data reduction lets us see critical features or patterns in the data.

• Which features are important depends on the question we are asking
  – Road maps, topographical maps, precinct maps, etc.
• Much of data reduction in data science falls under the heading of statistics.


Some Definitions

• Data is what we observe and measure in the world around us

• Statistics are calculations we produce that provide a quantitative summary of some attribute(s) of the data.

• Cases/Observations are the objects in the world for which we have data.

• Variables are the attributes of cases (or other features related to the cases) for which we have data.


Quantitative vs. Qualitative

• Much of the “tension” between these two approaches is misguided.

• Both are Data
• Both are or can be:
  – Empirical
  – Scientific
  – Systematic
  – Wrong
  – Limited


Qual and Quant (cont.)

• It is not as simple as Quant = numbers and Qual = words.
  – Much of quantitative data is merely categorization of underlying concepts
    • Countries are labeled “Democratic” or not
    • Kids are labeled “Gifted” or not
    • Couples are labeled “Committed” or “In Love” or not
    • Baseball players commit “Errors” or not
    • Different types of chocolate are “Good” or not
  – Increasing quantitative analysis of text


So what are Statistics?

• Quantities we calculate to summarize data
  – Central tendency
  – Dispersion
  – Distributional characteristics
  – Associations and partial associations/correlation

• Statistics are exact representations of data, but serve only as estimates of population characteristics. Those estimates always come with uncertainty.
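The last point can be illustrated with a bootstrap sketch on simulated data: the sample mean is an exact summary of the sample, but only an uncertain estimate of the population mean.

```python
# Bootstrap interval for a sample mean. Data are invented.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(100, 15, 200)        # one sample from a population

# Resample the sample many times to see how much the mean could vary.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.1f}")
print(f"95% bootstrap interval for the population mean: [{lo:.1f}, {hi:.1f}]")
```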


Goals of Statistical Analysis

• Description offers an account or summary, but not an explanation of why something is the way it is.
• Causality offers a statement about influence.
  – The “fundamental problem of causal inference”: we never observe the counterfactual.
  – A causal statement is NOT necessarily a theoretical statement: theory demands an explanation for why something happens.

• Inference involves extrapolating from what you find in your data to those cases for which you do not have data.
  – It will always be probabilistic.

• We can have both Descriptive and Causal inference.


Missing Data

• We have talked about representativeness of data, but mostly from a sampling perspective.

• We can also have missing data.
  – If data is missing at random, we can cope.
  – If data is missing, but not at random, we have major problems with representativeness.


Responses to Missing Data

• Casewise deletion: This is by far the most common response and is also now understood to be the worst thing you can do.

• Single Imputation: Better, but this ignores uncertainty in imputed data.

• Multiple Imputation: Impute several times and average across results. Addresses limitations of previous two.
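A minimal sketch of the three responses on invented data. The "multiple imputation" here is a toy stochastic version for illustration, not a full algorithm such as MICE.

```python
# Three responses to missing data, on simulated values missing at random.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(10, 2, 500)
miss = rng.random(500) < 0.3           # 30% missing, at random here
x[miss] = np.nan
obs = x[~miss]

# 1. Casewise deletion: drop incomplete cases (common, but the worst option).
print(f"deletion mean:    {obs.mean():.2f}  (n={obs.size})")

# 2. Single imputation: fill each hole once; ignores imputation uncertainty.
single = np.where(miss, obs.mean(), x)
print(f"single-imputed:   {single.mean():.2f}")

# 3. Multiple imputation: fill several times with draws that carry
#    uncertainty, analyze each completed dataset, then pool the results.
means = []
for _ in range(20):
    filled = x.copy()
    filled[miss] = rng.normal(obs.mean(), obs.std(ddof=1), miss.sum())
    means.append(filled.mean())
print(f"multiply-imputed: {np.mean(means):.2f}  (pooled over 20 imputations)")
```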


Missing Data – Summing Up

• The problem with missing data is representativeness of the remaining sample.

• A representative sample of 100 or 1000 is often much better than a non-representative sample of 1 million.

• Imputation is still an imperfect solution because it assumes we can learn (statistically) about missing data from the data we have.


Claims about Causality

• Imagine we have a single treatment variable coded as Treated or Not Treated (1 or 0, respectively).

• We have an outcome, Y, we care about.
• We observe Y1 for the treated and Y0 for the untreated.
• Before the study, every case had a potential value for Y1 AND Y0.
• However, we only observe Y1 OR Y0 for each case.
• We NEVER observe the counterfactual.
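A small simulation of this setup. Both potential outcomes exist below only because the data are invented; in a real study one of them is always the missing counterfactual.

```python
# The potential-outcomes setup with invented data: each case has a Y1
# and a Y0, but observation reveals only one of them.
import numpy as np

rng = np.random.default_rng(5)
n = 5
y0 = rng.normal(50, 10, n).round(1)   # potential outcome if untreated
y1 = (y0 + 5).round(1)                # potential outcome if treated
t = rng.integers(0, 2, n)             # who actually received treatment

for i in range(n):
    observed = y1[i] if t[i] else y0[i]
    counterfactual = y0[i] if t[i] else y1[i]
    print(f"case {i}: treated={bool(t[i])}, observed Y={observed}, "
          f"counterfactual Y={counterfactual} (never seen)")
```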


Causality and Experiments

• Experiments allow us to estimate an Average Treatment Effect (ATE) through random assignment of Treatment.

• We still can’t talk about individual causality, but random assignment lets us make ATE claims.
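A minimal simulation of that claim, with invented data: under coin-flip assignment, the simple difference in group means recovers the ATE even though no individual effect is ever observed.

```python
# Random assignment recovers the ATE. All numbers are invented.
import numpy as np

rng = np.random.default_rng(11)
n = 100_000
y0 = rng.normal(50, 10, n)
y1 = y0 + rng.normal(5, 2, n)        # heterogeneous effects, mean 5
treated = rng.random(n) < 0.5        # coin-flip assignment
y_obs = np.where(treated, y1, y0)    # we only ever see one outcome per case

ate_hat = y_obs[treated].mean() - y_obs[~treated].mean()
print(f"true ATE: {(y1 - y0).mean():.2f}, randomized estimate: {ate_hat:.2f}")
```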


Causality in Observational Studies

• In observational studies, treatment is not randomly assigned.

• This means that the Treated and Non-Treated groups could be different for two reasons:
  – One is treated and one is not treated
  – Any other factor(s) that predict treatment

• Thus, treatment might be correlated with the outcome of interest, but not its cause.
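A short simulation of this problem, with an invented confounder: when a factor drives both treatment uptake and the outcome, the naive comparison is wrong even though the true effect is zero.

```python
# Confounding in an observational comparison. Data are invented.
import numpy as np

rng = np.random.default_rng(13)
n = 100_000
health = rng.normal(0, 1, n)                              # confounder
treated = rng.random(n) < 1 / (1 + np.exp(-2 * health))   # healthier cases opt in
y = 50 + 5 * health + rng.normal(0, 1, n)                 # treatment has NO effect on y

naive = y[treated].mean() - y[~treated].mean()
print(f"naive treated-vs-untreated gap: {naive:.2f} (true effect is 0)")
```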


Causality in Observational Studies

• The typical response is to “control” for other potential confounding factors statistically.
  – This forces us to have those measures and to know the functional form of the confounding effects.
• Sometimes we can use instrumental variables
  – Often works poorly in practice
• Matching
  – Or matching followed by regression/controls


Matching

• For each treated observation, find an untreated observation that is otherwise identical (similar) to the treated observation.

• Matching is done using independent variables only – NOT the outcome of interest.

• There are many methods of matching.
• You almost always lose data – some observations don’t match any others very well.
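A minimal nearest-neighbor matching sketch on simulated data with one confounder. This is only one of the many matching methods mentioned above, and the numbers are invented; note the matching uses the independent variable only, never the outcome.

```python
# Nearest-neighbor matching on a single confounder, with invented data.
import numpy as np

rng = np.random.default_rng(17)
n = 2000
x = rng.normal(0, 1, n)                                  # confounder
treated = rng.random(n) < 1 / (1 + np.exp(-2 * x))
y = 50 + 5 * x + 2 * treated + rng.normal(0, 1, n)       # true effect = 2

x_t, y_t = x[treated], y[treated]
x_c, y_c = x[~treated], y[~treated]

# Match each treated case to its closest control on x (with replacement),
# then compare outcomes within matched pairs only.
idx = np.abs(x_t[:, None] - x_c[None, :]).argmin(axis=1)
matched_effect = (y_t - y_c[idx]).mean()
naive_effect = y_t.mean() - y_c.mean()
print(f"naive difference:   {naive_effect:.2f}")
print(f"matched difference: {matched_effect:.2f} (true effect is 2)")
```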


Better Null Hypothesis

• We are taught to specify a Null hypothesis, then test to see whether we should reject it.
• We frequently articulate a Null of “nothing”
  – No relationship
  – No structure
  – No pattern
• Such Nulls are often:
  – Silly
  – Far too easy to reject
  – Likely to make us misinterpret structure as meaningful


Example 1: Zipf’s Law
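The original slide's figure is not reproduced in this transcript. As a hedged stand-in, the sketch below checks Zipf's law (word frequency roughly proportional to 1/rank) on any plain-text file you supply; the file name is hypothetical.

```python
# Rank-frequency check of Zipf's law on a text file of your choosing.
from collections import Counter
import re

# "sample.txt" is a hypothetical input file; substitute any large text.
with open("sample.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

for rank, (word, n) in enumerate(Counter(words).most_common(10), start=1):
    # Under Zipf's law, n * rank stays roughly constant across ranks.
    print(f"{rank:2d}. {word:<12} freq={n:<6} freq*rank={n * rank}")
```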


Example 2: Superstitious Learning


Example 3: Residential Segregation

• Thomas Schelling (1978) noted that having even slight preferences for living in areas with some people similar to you produced highly segregated neighborhoods.
  – He used coins on graph paper to illustrate the process.
  – We can use an agent-based model simulation
    • NetLogo Social Science Segregation
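A minimal one-dimensional sketch of Schelling's dynamic, assuming a ring of agents and a mild 30% in-group preference; the NetLogo model referenced above is two-dimensional and richer.

```python
# A toy 1-D Schelling model: unhappy agents relocate by swapping spots.
import numpy as np

rng = np.random.default_rng(19)
n = 200
grid = rng.integers(0, 2, n)              # two agent types on a ring

def share_alike(g, i, radius=2):
    """Fraction of i's 2*radius neighbors that share i's type."""
    idx = [(i + d) % g.size for d in range(-radius, radius + 1) if d != 0]
    return (g[idx] == g[i]).mean()

for _ in range(20000):
    i = rng.integers(0, n)
    if share_alike(grid, i) < 0.3:        # unhappy: under 30% like neighbors
        j = rng.integers(0, n)
        grid[i], grid[j] = grid[j], grid[i]   # move to a random spot

# The average like-neighbor share typically ends well above the ~0.5
# expected under random mixing: mild preferences, strong segregation.
print(f"mean share alike: {np.mean([share_alike(grid, k) for k in range(n)]):.2f}")
```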


Example 4: Traffic Grid

• How can lowering the speed limit reduce the time spent waiting at traffic lights?
  – Another agent-based model
    • NetLogo Social Science Traffic Grid


One More: Complexity in Regulating Ecosystems

• Another agent-based model
  – NetLogo Biology Wolf Sheep Predation


A Better Null Hypothesis

• A better Null hypothesis offers a plausible prediction of what would happen in a world where all but your causal variable of interest were present.

• In other words, posit a simple model and generate expectations about observable patterns.
• Then modify the model, make updated predictions, and evaluate whether those predictions match observable data.
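A minimal sketch of that workflow on invented data: posit a simple null model (a fair coin), simulate the patterns it produces, and only then judge whether an observed "pattern" (here, a longest run of heads) is actually surprising.

```python
# Simulate a plausible null model before calling structure meaningful.
import numpy as np

rng = np.random.default_rng(23)

def longest_run(flips):
    """Length of the longest consecutive run of 1s (heads)."""
    best = cur = 0
    for f in flips:
        cur = cur + 1 if f else 0
        best = max(best, cur)
    return best

observed = longest_run(rng.integers(0, 2, 100))   # stand-in for real data

# Generate the null distribution: what "no pattern" actually looks like.
null_runs = [longest_run(rng.integers(0, 2, 100)) for _ in range(2000)]
p = np.mean([r >= observed for r in null_runs])
print(f"observed longest run: {observed}, p-value under fair-coin null: {p:.2f}")
```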