Introduction to Data Science, Section 2
Data Matters 2015

Sponsored by the Odum Institute, RENCI, and NCDS

Thomas M. Carsey
[email protected]



2

The Data Lifecycle

3

Data Science is More than Analysis

• Data analysis gets most of the attention in data science.

• In that sense, many people struggle to distinguish data science from applied statistics.

• Analysis is obviously important, but statistical analysis skills are only useful if the data can be collected and put in a usable form.

• Data Science is much broader than just data analysis.

4

The Data Lifecycle

• Data science considers data at every stage of what is called the data lifecycle.

• This lifecycle generally refers to everything from collecting data to analyzing it to sharing it so others can re-analyze it.
  – In fact, it includes the planning process that should be in place before any other work begins.

• New visions of this process in particular focus on integrating every action that creates, analyzes, or otherwise touches data.

• These same new visions treat the process as dynamic: data archives are not just digital shoe boxes under the bed.

• There are many representations of this lifecycle.

5

6

7

8

9

Lessons from the Lifecycle

• Data Science is more than just data analysis.

• Effective data science requires:
  – Planning
  – Vision
  – Storage
  – Interoperability of systems
  – A team approach
  – Adaptability and scalability

10

What is Missing?

• Most definitions of data science underplay or leave out discussions of:
  – Substantive theory
  – Metadata
  – Privacy and ethics
  – Greater consideration for missing data, representativeness, and uncertainty
  – More thinking about the proper null hypothesis
  – Leadership on leveraging data science for the public good

11

Substantive Theory

12

The Data Generating Process (DGP)

• Most of the time we don’t care about the data itself.

• Most of the time we are trying to learn something about an underlying process that produces the data: a DGP.

• Technically trained folks might be good at uncovering patterns in data, but you need substantive expertise to:
  – Know where to look in the first place
  – Know what to look for
  – Know what your findings might actually mean

13

What is the DGP?

• Good analysis starts with a question you want to answer.
  – Blind data mining can only get you so far, and really, there is no such thing as completely blind mining.

• Answering that question requires laying out expectations of what you will find and explanations for those expectations.

• Those expectations and explanations rest on assumptions.

• If your data collection, data management, and data analysis are not compatible with those assumptions, you risk producing meaningless or misleading answers.

14

The DGP (cont.)

• Think of the world you are interested in as governed by dynamic processes.

• Those processes produce observable bits of information about themselves: data.

• We can use data science to:
  – Collect, catalog, and organize those bits of information
  – Discover patterns in data and fit models to that data
  – Make predictions outside of our data
  – Inform explanations of both those patterns and those predictions

• Real discovery is NOT about modeling patterns in observable data. It is about understanding the processes that produced that data.
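The distinction between the data and the process that produced it can be made concrete with a small simulation. This is a minimal sketch, not part of the original slides: the linear DGP, its parameter values, and the function names `simulate_dgp` and `fit_line` are all assumptions chosen for illustration.

```python
import random


def simulate_dgp(n, intercept=2.0, slope=0.5, noise_sd=1.0, seed=0):
    """Draw n (x, y) pairs from a hypothetical DGP: y = intercept + slope * x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# The analyst only ever sees xs and ys; the parameters of the DGP are hidden.
xs, ys = simulate_dgp(n=5000)
est_intercept, est_slope = fit_line(xs, ys)
print(f"estimated intercept={est_intercept:.2f}, slope={est_slope:.2f}")
```

The fit recovers the parameters here only because the model's assumed form matches the simulated process. With real data the DGP is unknown, which is where substantive theory comes in.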

15

Theories and DGPs

• Theories provide explanations for the processes we care about.

• They answer the question: why does something work the way it does?

• Theories make predictions about what we should see in data.

• We use data to test the predictions, but we never completely test a theory.

16

Why do we need theory?

• Can’t we just find “truth” in the data if we have enough of it? Especially if we have all of it?

• No!
  – More data does not mean more representative data.
  – Every method of analysis makes some assumptions, so we are better off if we make them explicit.
  – Patterns without understanding are at best uninformative and at worst deeply misleading.
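The first point, that more data does not mean more representative data, can be illustrated with a simulation. This is a sketch under stated assumptions: the population, the trait values, and the selection rule (units with higher values are more likely to be observed) are all hypothetical.

```python
import random

rng = random.Random(42)

# Hypothetical population: a trait that averages about 50.
population = [rng.gauss(50, 10) for _ in range(200_000)]


def mean(values):
    return sum(values) / len(values)


true_mean = mean(population)

# Small simple random sample: every unit is equally likely to be selected.
small_random = rng.sample(population, 500)

# Large biased sample: the chance of being observed rises with the value itself
# (an assumed selection rule, for illustration only).
large_biased = [v for v in population if rng.random() < min(1.0, max(0.0, v / 100))]

random_error = abs(mean(small_random) - true_mean)
biased_error = abs(mean(large_biased) - true_mean)

print(f"random sample: n={len(small_random)}, error={random_error:.2f}")
print(f"biased sample: n={len(large_biased)}, error={biased_error:.2f}")
```

Under this selection rule the biased sample is far larger, yet its mean stays off target no matter how many units are added, because the error comes from how units enter the data rather than from how many there are.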

17

Robert Matthews (Aston University), 2000. “Storks Deliver Babies (p = 0.008).” Teaching Statistics, Volume 22, Number 2, Summer 2000.
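A pattern like the stork example can arise whenever a third factor drives both measures. The sketch below does not use the article's data; it simulates hypothetical countries in which land area (an assumed confounder) drives both stork counts and births, while storks never enter the birth equation. All numbers and function names are illustrative.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """Residuals of ys after a least-squares regression on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(7)

# Hypothetical countries: area drives both variables independently.
area = [rng.uniform(1, 100) for _ in range(2000)]
storks = [3.0 * a + rng.gauss(0, 30) for a in area]
births = [8.0 * a + rng.gauss(0, 80) for a in area]

raw_r = pearson(storks, births)

# Partial correlation: correlate what is left of each variable once area is accounted for.
partial_r = pearson(residuals(storks, area), residuals(births, area))

print(f"raw correlation: {raw_r:.2f}")
print(f"correlation controlling for area: {partial_r:.2f}")
```

Knowing to control for area is not something the data supplies on its own; it comes from an understanding of the process that generated the numbers.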

18

New Behaviors Require New Theories

• The Target example illustrated how existing theories about habit formation informed Target's data mining efforts.

• However, whole new behaviors exist that are creating a lot of the data that data scientists want to analyze:
  – Online shopping
  – Cell phone usage
  – Crowdsourced recommendation systems
  – Facebook, Google searching, etc.
  – Online mobilization of social protests

• We need new theories for these new behaviors.

19

Metadata

20

What is Metadata?

• Metadata is data about data. It is frequently ignored or misunderstood.

• Metadata is required to give data meaning.

• It includes:
  – Variable names and labels, value labels, information on who collected the data, when, by what methods, in what locations, for what purpose, etc.

• Metadata is essential to use data effectively, to reuse data, to share data, and to integrate data.

• Data without metadata is worthless.
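To see how metadata gives data meaning, compare raw coded records with the same records after variable labels and value labels are applied. This is a minimal sketch: the variable names, codes, labels, and provenance entries are all hypothetical, and `apply_codebook` is an illustrative helper rather than a function from any library.

```python
# Raw records as they might arrive without documentation (hypothetical values).
raw_rows = [
    {"v1": 1, "v2": 34, "v3": 2},
    {"v1": 2, "v2": 51, "v3": 9},
]

# Metadata: variable labels and value labels (hypothetical codebook).
codebook = {
    "v1": {"label": "Sex", "values": {1: "Male", 2: "Female"}},
    "v2": {"label": "Age in years"},
    "v3": {
        "label": "Party identification",
        "values": {1: "Democrat", 2: "Republican", 3: "Independent", 9: "Missing"},
    },
}

# Metadata: who collected the data, when, and how (hypothetical provenance).
provenance = {
    "collected_by": "Example Survey Lab",
    "collected": "2015",
    "method": "telephone survey",
}


def apply_codebook(row, codebook):
    """Return a copy of row with variable names and coded values replaced by labels."""
    labelled = {}
    for name, value in row.items():
        entry = codebook[name]
        # Variables without value labels (such as age) keep their raw value.
        labelled[entry["label"]] = entry.get("values", {}).get(value, value)
    return labelled


labelled_rows = [apply_codebook(row, codebook) for row in raw_rows]
print(labelled_rows[0])
```

Without the codebook, nothing in the raw rows says that `9` in `v3` marks a missing answer rather than a ninth category. That is the kind of meaning only metadata can carry.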

21

The Value of Metadata

• Data by itself is just a bunch of 0’s and 1’s.

• Metadata:
  – Provides meaning
  – Allows for cataloging
  – Facilitates search and discovery
  – Enables linking data sets

22

Types of Metadata

• NISO defines three types:
  – Structural: describes how the components of the data are organized (columns, rows, chapters, etc.).
  – Descriptive: provides titles, authors, keywords, subjects, etc. that facilitate attribution and search/discovery.
  – Administrative: technical information on how the file was created, software used, formats for storage, etc.
    • Includes rights and preservation metadata

23

Metadata Standards

• There are emerging standards for metadata:
  – The American National Standards Institute
  – The International Organization for Standardization

• Dublin Core: 15 classic metadata terms.
  – Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
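The 15 Dublin Core elements can be used as a simple checklist for a descriptive record. The sketch below is illustrative only: the record describes this slide deck using the title, author, and year shown on its first slide, the `language` value is an assumption, and `check_record` is a hypothetical helper, not part of any Dublin Core tool.

```python
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def check_record(record):
    """Return (elements not yet filled in, keys that are not Dublin Core elements)."""
    missing = [e for e in DUBLIN_CORE_ELEMENTS if e not in record]
    unknown = sorted(k for k in record if k not in DUBLIN_CORE_ELEMENTS)
    return missing, unknown


record = {
    "title": "Introduction to Data Science, Section 2",
    "creator": "Thomas M. Carsey",
    "date": "2015",
    "language": "en",  # assumed for illustration
}

missing, unknown = check_record(record)
print(f"{len(record)} of {len(DUBLIN_CORE_ELEMENTS)} elements present")
print("not yet filled in:", ", ".join(missing))
```

All Dublin Core elements are optional, so the `missing` list is a prompt for what could still be documented rather than a validation failure.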

24

Privacy and Ethics

We will do this at the end.