
Page 1: Introduction to Data Science Section 1 Data Matters 2015 Sponsored by the Odum Institute, RENCI, and NCDS Thomas M. Carsey carsey@unc.edu 1


Introduction to Data Science
Section 1
Data Matters 2015

Sponsored by the Odum Institute, RENCI, and NCDS

Thomas M. Carsey
carsey@unc.edu


Course Materials

• I used many sources in preparing for this course:
  – Practical Data Science with R, by Zumel and Mount
    http://www.manning.com/zumel/
  – Data Mining with R: Learning with Case Studies, by Torgo
    http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR/
  – An Introduction to Data Science, Version 3, by Stanton
    http://jsresearch.net/
  – Monte Carlo Simulation and Resampling Methods for Social Science, by Carsey and Harden
    http://www.sagepub.com/books/Book241131/reviews?course=Course14&subject=J00&sortBy=defaultPubDate%20desc&fs=1#tabview=title
  – Machine Learning with R, by Lantz
    http://www.packtpub.com/machine-learning-with-r/book


Additional Materials

• A Simple Introduction to Data Science, by Burlingame and Nielsen
  http://newstreetcommunications.com/businesstechnical/a_simple_introduction_to_data_science
• Ethics of Big Data, by Davis
  http://shop.oreilly.com/product/0636920021872.do
• Privacy and Big Data, by Craig and Ludloff
  http://shop.oreilly.com/product/0636920020103.do
• Doing Data Science: Straight Talk from the Frontline, by O’Neil and Schutt
  http://shop.oreilly.com/product/0636920028529.do


Learning R

• Lots of places to learn more about R
  – All of the sources on the first slide have R code available
  – Comprehensive R Archive Network (CRAN)
    http://cran.r-project.org/manuals.html
  – Springer Textbooks Use R! Series
    http://www.springer.com/series/6991
  – Online search tool Rseek
    http://www.rseek.org/
  – The RStudio site
    http://www.rstudio.com/
  – The Odum Institute’s online course
    http://www.odum.unc.edu/odum/contentSubpage.jsp?nodeid=670


What is Data Science?


What is Data Science?

• What words come to mind when you think of Data Science?

• What experience do you have with Data Science?

• Why are you taking an Introduction to Data Science Class?


The Data Science Revolution

• Data science is exploding in importance and in the attention it receives.

• It’s hard to sort through the substance and the hype.

• There is real value in data science, but you should have a purpose or goal in mind first.


The Roots of Data Science

• Simple observation, and the recording of those observations, dates back to the most ancient civilizations
  – The Greeks were the first Western civilization to adopt observation and measurement
    • Some call Aristotle the first empirical scientist
  – Muslim scholars between the 10th and 14th centuries developed experimentation (Ibn al-Haytham)
  – Roger Bacon (c. 1214-c. 1292) promoted inductive reasoning (inference)
  – Descartes (1596-1650) shifted focus to deductive reasoning.


What is Data Science?

• “How Companies Learn Your Secrets” NYT, by Charles Duhigg, February 16, 2012

• http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=1&_r=2&hp&


What did Target Do?

• Mining of data on shopping patterns
  – Specific products purchased
  – Combination of products purchased
  – Combined with demographic and other data
• Psychology and neuroscience
  – Habits:
    • Cue-routine-reward
    • When are habits open to change?
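
Mining combinations of products purchased can be sketched in a few lines of Python. This is only a minimal illustration, not Target's actual method: the baskets and product names are invented, and `count_pairs` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set is one customer's purchase.
baskets = [
    {"lotion", "vitamins", "cotton balls"},
    {"lotion", "vitamins"},
    {"bread", "milk"},
    {"lotion", "cotton balls", "milk"},
]

def count_pairs(baskets):
    """Count how often each pair of products appears in the same basket."""
    pairs = Counter()
    for basket in baskets:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(basket), 2):
            pairs[pair] += 1
    return pairs

pair_counts = count_pairs(baskets)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Pairs that co-occur unusually often are candidates for a pattern worth explaining. Deciding what the pattern means still requires the demographic data and behavioral theory listed above.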


Lessons from Target

• Yes, Data Science is about mining data
• There are deeper theoretical issues involved in understanding what you find
• Left out of that long article are most of the critical steps that precede the analysis
• In short, Data Science > data mining


Definition of Data Science

• There are many, but most say data science is:
  – Broad – broader than any one existing discipline
  – Interdisciplinary: Computer Science, Statistics, Information Science, databases, mathematics
    • Also substantive domains (environmental science, sociology, public health, etc.)
  – Applied focus on extracting knowledge from data to inform decision making.
  – Focuses on the skills needed to collect, manage, store, distribute, analyze, visualize, and reuse data.
• There are many visual representations of Data Science


Some definitions link computational, statistical, and substantive expertise


Other definitions focus more on technical skills alone


Still other definitions are so broad as to include nearly everything


There are many “Word Cloud” representations of Data Science as well


Definition of Data Science

• The field is immature, cluttered by hype, unfocused.

• But, key features should include:
  – Data across its lifecycle
  – Interdisciplinary skills
  – Substantive knowledge


Defining Some Terms


MapReduce and Hadoop

– Designed to process large operations quickly
– Distributes the problem across multiple servers
– The Map part filters and sorts data into bins or queues based on some shared characteristic
– The Reduce part then executes some operation on each bin of data.
– Results are then reassembled
– It is like parallel processing, but distributed across servers rather than just processors
– Scalable and fault tolerant
– Hadoop is an open-source implementation
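
The Map and Reduce steps can be sketched on a single machine as a word count. This is a toy illustration under stated assumptions: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up here and are not Hadoop's API, and nothing is actually distributed across servers.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, value) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1

def shuffle(pairs):
    """Sort the emitted values into bins by their shared key."""
    bins = defaultdict(list)
    for key, value in pairs:
        bins[key].append(value)
    return bins

def reduce_phase(bins):
    """Reduce: run one operation (here, a sum) on each bin of data."""
    return {key: sum(values) for key, values in bins.items()}

# Hypothetical input records.
records = ["big data big hype", "data science is more than data mining"]
word_counts = reduce_phase(shuffle(map_phase(records)))
print(word_counts)
```

In a real cluster, each server would map its own share of the records and reduce its own bins, which is where the scalability comes from.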


More on MapReduce

• Pig
  – Software platform used for creating MapReduce programs that run on Hadoop.
• Hive
  – A data warehouse infrastructure built on top of Hadoop. Used to query, summarize, or analyze data.


Database Management

• SQL – Structured Query Language
  – A programming language designed for management of relational databases
• MySQL – An open-source system for managing relational databases with SQL (used by Wikipedia, Google, Facebook, Twitter, Flickr, YouTube)
• NoSQL – (Not Only SQL)
  – Used for databases where the data is in some form other than tabular relations like those used in relational databases
  – Cassandra (Apache)
    • Distributed database management with no single point of failure
    • Scalable with no down time
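
A short example shows what querying a relational table with SQL looks like. It is a sketch only: it uses SQLite through Python's built-in `sqlite3` module as a stand-in for MySQL, and the `purchases` table and its rows are invented.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, price REAL)")

# Hypothetical rows: one row per product purchased.
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("ann", "lotion", 4.0), ("ann", "vitamins", 9.5), ("bob", "milk", 2.5)],
)

# Summarize the table: items bought and total spent per customer.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(price) FROM purchases "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
conn.close()
```

The same `SELECT ... GROUP BY` statement would work with little or no change in MySQL and most other relational systems, which is much of the appeal of SQL.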


Cloud Computing

• Standard client-server model where computing operations don’t happen on the local (desktop) machine.

• What’s new? Virtualization. You are not connecting to a specific server.
  – Servers are virtual
  – One server can run multiple virtual machines
  – One virtual machine can use multiple servers

• This makes the “machine” scalable, moveable, configurable.

• Allows selling software, platforms, and even computing infrastructure as a “service”


Data Mining/Machine Learning

• Machine learning uses computer algorithms to get a machine to learn and adapt to new information.

• Data Mining more explicitly focuses on discovering patterns or structure in a given set of data.

• Often used as synonyms by non-experts without much loss of information.
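
To make "learn and adapt to new information" concrete, here is a deliberately tiny sketch. The class `RunningMeanClassifier` and the data are hypothetical and far simpler than anything used in practice. It predicts whichever label has the closest average value, and its predictions change as new labeled cases arrive.

```python
class RunningMeanClassifier:
    """Toy classifier: predicts the label whose running mean is closest to x."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def learn(self, x, label):
        """Update the stored mean for `label` with one new observation."""
        self.sums[label] = self.sums.get(label, 0.0) + x
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, x):
        """Return the label whose mean is nearest to x."""
        means = {k: self.sums[k] / self.counts[k] for k in self.sums}
        return min(means, key=lambda k: abs(means[k] - x))

model = RunningMeanClassifier()
for x, label in [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]:
    model.learn(x, label)

before = model.predict(6.0)

# New information arrives: several "low" cases with larger values.
for x in (6.0, 7.0, 7.0, 8.0):
    model.learn(x, "low")

after = model.predict(6.0)
print(before, after)
```

The prediction for the same input changes once the model has seen the new cases, which is the sense of "learning" in the first bullet.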


Web Scraping

• This is a process of collecting information from websites and then organizing it for some sort of analysis.

• Scraping is just about getting the data; the analysis comes later.
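
A minimal sketch of the "getting the data" step, using only Python's standard library. The page content is a hypothetical string; a real scraper would first download the page, and `LinkScraper` is an illustrative name rather than an established tool.

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collect (link text, URL) pairs from the anchor tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # URL of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Hypothetical page content standing in for a downloaded web page.
page = """
<ul>
  <li><a href="http://cran.r-project.org/manuals.html">CRAN manuals</a></li>
  <li><a href="http://www.rseek.org/">Rseek</a></li>
</ul>
"""

scraper = LinkScraper()
scraper.feed(page)
print(scraper.links)
```

The result is a plain list of pairs, organized and ready for whatever analysis comes later.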


Programming Tools

– R – Statistical programming (object-oriented scripting language)
– Python – A scripting language that supports object-oriented, structured, and functional programming
– SQL – Querying and managing relational databases
– SAS – General purpose data analysis software
– Julia – Faster than R and more scalable than Python
– Kafka and Storm – Used for real-time streaming analysis
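
As a small illustration of the three Python styles named above, the same calculation (the sum of the squares of the even numbers) can be written three ways. The function and class names are made up for this sketch.

```python
from functools import reduce

values = [1, 2, 3, 4, 5, 6]

# Structured style: an explicit loop and accumulator.
def sum_even_squares_loop(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x * x
    return total

# Functional style: compose filter, map, and reduce.
def sum_even_squares_functional(xs):
    evens = filter(lambda x: x % 2 == 0, xs)
    squares = map(lambda x: x * x, evens)
    return reduce(lambda acc, x: acc + x, squares, 0)

# Object-oriented style: data and behavior bundled in a class.
class Sample:
    def __init__(self, xs):
        self.xs = list(xs)

    def sum_even_squares(self):
        return sum(x * x for x in self.xs if x % 2 == 0)

results = (
    sum_even_squares_loop(values),
    sum_even_squares_functional(values),
    Sample(values).sum_even_squares(),
)
print(results)
```

All three give the same answer; which style reads best depends on the task and the team.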