Demystifying Data Science

Alyson Wilson, Ph.D., PStat

Department of Statistics, Laboratory for Analytic Sciences, North Carolina State University

agwilso2@ncsu.edu

March 22, 2018

A. Wilson (NCSU Statistics) Demystifying Data Science March 22, 2018 1 / 33

Objectives

• Lots of “Data” Definitions

• Classes of Algorithms (with descriptions)

It’s All About Data

• Big data

• Data engineering

• Data science

• Data analytics

• Data mining

Data Engineering

Data engineers design, build, and manage the infrastructure to support data collection, storage, and analysis.

One of the key functions is managing extract, transform, load (ETL).

• Extract data from a variety of sources.

• Transform to the proper format or structure to support querying and analysis.

• Load the data into the final database, for example, a data store, data mart, or data warehouse.
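
The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text, the column names, and the `people` table are hypothetical, and an in-memory SQLite database stands in for a real data mart or warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; real sources would be files, APIs, or other databases.
RAW_CSV = """name,height_in,weight_lb
alice,66,135
bob,70,165
carol,,155
"""

def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows, convert types, tidy names."""
    cleaned = []
    for row in rows:
        if not row["height_in"] or not row["weight_lb"]:
            continue  # skip records with a missing measurement
        cleaned.append(
            (row["name"].title(), int(row["height_in"]), int(row["weight_lb"]))
        )
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people "
        "(name TEXT, height_in INTEGER, weight_lb INTEGER)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a data mart or warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT name, height_in, weight_lb FROM people ORDER BY name"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions makes it easy to swap the source or the destination without touching the cleaning logic.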

Is “Big Data” the same as “Data Science”?

Five Data Science Skills

• Technical
    • Algorithmic/computational/predictive and data/statistical/inferential methods
    • Mathematics (particularly modeling and linear algebra)
    • Obtaining, wrangling, curating, managing and processing data, exploring data

• Communication

• Collaboration

• Tools

• Subject Matter Expertise

Adapted from Michael Rappa, NCSU Institute for Advanced Analytics, and Curriculum Guidelines for Undergraduate Programs in Data Science, American Statistical Association

Data Product

A data product is the production output from a statistical analysis. Data products automate complex analysis tasks or use technology to expand the utility of a data-informed model, algorithm or inference.

• interactive analytics (e.g., R Shiny)

• packages of analysis tools (e.g., an R package)

• interactive graphics

The idea is to use technology to tell a story about data to a broad audience.

Data Science is a Team Sport

A Few Words about Data Wrangling (Munging)

Data is messy. Much (but not all) of what falls in data wrangling is what we have traditionally called data cleaning.

• Data scraping is the use of a program to extract data from human-readable sources. Think web sites, online datasets, etc.

• There is an emerging standard for data wrangling embodied in the R package dplyr.
    • select(): take a subset of columns/features/variables
    • filter(): take a subset of rows/observations
    • mutate(): add or modify existing columns
    • arrange(): sort rows
    • summarize(): aggregate data across rows

• See also tidyr, an R package which describes a way to think about storing and formatting data. http://vita.had.co.nz/papers/tidy-data.html
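
To make the five verbs concrete, here is a rough Python analogue of each one, written with plain lists of dictionaries rather than dplyr itself. The records and the `group` and `lb_per_in` columns are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical records: one dictionary per row/observation.
rows = [
    {"name": "a", "group": "x", "height": 66, "weight": 135},
    {"name": "b", "group": "y", "height": 70, "weight": 165},
    {"name": "c", "group": "x", "height": 70, "weight": 155},
    {"name": "d", "group": "y", "height": 72, "weight": 200},
]

# select(): keep a subset of columns
selected = [{k: r[k] for k in ("group", "height", "weight")} for r in rows]

# filter(): keep a subset of rows
tall = [r for r in selected if r["height"] >= 70]

# mutate(): add a column computed from existing ones
with_ratio = [{**r, "lb_per_in": r["weight"] / r["height"]} for r in tall]

# arrange(): sort rows (here, heaviest first)
arranged = sorted(with_ratio, key=lambda r: r["weight"], reverse=True)

# summarize(): aggregate across rows (here, mean weight per group)
weights = defaultdict(list)
for r in selected:
    weights[r["group"]].append(r["weight"])
mean_weight = {g: sum(w) / len(w) for g, w in weights.items()}

print(arranged)
print(mean_weight)
```

Each step takes a table and returns a table, which is what lets the dplyr verbs be chained into a pipeline.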

Data Analytics

Data analytics are essentially statistics with a lower-case “s”. Analytics are computations that one makes with data to answer questions. They are often described by data type, by application area, or by method class.

• Geospatial analytics

• Text analytics

• Network analytics

• Forecasting = “time series analytics”

• Business analytics

• Visualization

• Neural nets

• Deep learning

Data Mining

“Statistics at scale and speed” (and simplicity)

D. Pregibon (1999). 2001: A statistical odyssey. Invited talk at The Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, NY.

More Definitions

• Machine learning: focused on prediction, based on known properties learned from the training data.

• Data mining: focused on the discovery of (previously) unknown properties in the data.

Data mining + machine learning are currently being rebranded as artificial intelligence.

Algorithms

• Supervised learning: Providing an algorithm with labeled records, in which an output variable of interest is known, so that the algorithm learns how to predict this value for new records where the output is unknown.

• Unsupervised learning: Providing an algorithm with unlabeled records, where the goal is to draw inferences from the input data alone.

Prediction and Classification

• Prediction: supervised learning when the response is a continuous variable

• Classification: supervised learning when the response is a categorical variable

The goal is to predict the value of the response using the explanatory variables.

Data

Gender   Height   Weight
F        66       135
M        70       165
F        70       155
M        72       200
F        62       140

Response? Explanatory?

Predictive Analytics

Predictive analytics is a general term encompassing classification and prediction (and sometimes association analysis).

Common Algorithms:

• K-Nearest Neighbors

• Linear Regression

• Logistic Regression

• Classification Trees

• Regression Trees

• Neural Networks

Explanatory v Predictive Models

• The typical explanatory model is for the “small data case” (classical statistics); a typical predictive model is for the “large data case” (data mining).

• A good explanatory model fits the data closely; a good predictive model predicts new cases accurately.

• In explanatory models, the whole dataset is used for estimating the coefficients and picking the “best model.” Performance measures assess how well the model fits the data.

• In predictive models, the training data is used to estimate the model, and the validation data set is used to assess performance (more in a minute). Performance measures assess predictive accuracy.

We are focused on predictive models.

Assessing Performance

When we are choosing a predictive analytic (method, model, algorithm), we typically divide (partition) our data into three parts.

• Training Partition: Usually the largest, used to build the model(s).

• Validation (Test) Partition: Assess performance of each model.

• Test (Holdout, Evaluation) Partition: Assess the performance of the chosen model with new data.
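
A three-way partition can be sketched as a shuffle followed by two cuts. The 60/20/20 fractions and the fixed seed below are assumptions chosen for illustration; the slides do not prescribe particular proportions.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle records, then split into training, validation, and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before cutting matters when the records arrive in a meaningful order (for example, sorted by date or by class).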

K-Nearest Neighbors

• Nonparametric technique: no assumption about the form of the relationship between the response and explanatory variables

• Can be used either for classification or prediction

• Idea: classify/predict a new record by finding “similar” records in the training data. These “neighbors” are used to derive a classification/prediction by voting (classification) or averaging (prediction)

K-Nearest Neighbors

Algorithm:

• Compute the distance from each record to each other record, using only the explanatory variables

• Using the k nearest neighbors, classify each record into the category that has the most of the k neighbors

• OR, using the k nearest neighbors, predict the response of each record as the average response of the k nearest neighbors

Implementation:

• Choose k

• Normalize data

• Choose distance metric
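
The algorithm and the three implementation choices can be sketched together. The version below is a minimal illustration using the five-record gender/height/weight table from the “Data” slide; min-max normalization, Euclidean distance, and k = 3 are assumptions made for the example, not choices stated in the slides.

```python
import math
from collections import Counter

# (height, weight) and gender from the five-record table on the "Data" slide.
train_X = [(66, 135), (70, 165), (70, 155), (72, 200), (62, 140)]
train_gender = ["F", "M", "F", "M", "F"]

def normalize(X, new):
    """Rescale each explanatory variable to [0, 1] using the training min and max."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]

    def scale(row):
        return tuple((v - l) / (h - l) for v, l, h in zip(row, lo, hi))

    return [scale(r) for r in X], scale(new)

def k_nearest(X, new, k):
    """Indices of the k training records closest to `new` (Euclidean distance)."""
    Xn, new_n = normalize(X, new)
    order = sorted(range(len(Xn)), key=lambda i: math.dist(Xn[i], new_n))
    return order[:k]

def knn_classify(X, labels, new, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(labels[i] for i in k_nearest(X, new, k))
    return votes.most_common(1)[0][0]

def knn_predict(X, y, new, k=3):
    """Prediction: average response of the k nearest neighbors."""
    return sum(y[i] for i in k_nearest(X, new, k)) / k

# Classify gender from height and weight for a hypothetical new record.
print(knn_classify(train_X, train_gender, (71, 180)))

# Predict weight from height alone for a hypothetical new record.
heights = [(h,) for h, _ in train_X]
weights = [w for _, w in train_X]
print(knn_predict(heights, weights, (71,)))
```

Normalizing first keeps weight (range 135 to 200) from dominating height (range 62 to 72) in the distance calculation.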

K-Nearest Neighbors

What changes if the response is continuous?

L. da F Costa, P. Boas, F. Silva, F. Rodrigues (2010). A pattern recognition approach to complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2010(11), P11015.

Assess Performance (Classification)

Classification/Confusion Matrix

                    Predicted Class = 1    Predicted Class = 0
Actual Class = 1    $n_{11}$               $n_{10}$
Actual Class = 0    $n_{01}$               $n_{00}$

Overall error rate = Estimated misclassification rate = $\dfrac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}}$

Overall accuracy = 1 - overall error rate
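
As a small worked example, the four counts and the two rates can be computed directly from paired lists of actual and predicted classes. The eight labels below are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count n11, n10, n01, n00; the first index is the actual class."""
    counts = {"n11": 0, "n10": 0, "n01": 0, "n00": 0}
    for a, p in zip(actual, predicted):
        counts[f"n{a}{p}"] += 1
    return counts

def error_rate(c):
    """Misclassified records (n10 + n01) divided by all records."""
    return (c["n10"] + c["n01"]) / (c["n11"] + c["n10"] + c["n01"] + c["n00"])

# Hypothetical actual and predicted classes for eight records.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
err = error_rate(counts)
acc = 1 - err
print(counts, err, acc)
```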

Assess Performance (Prediction)

Root mean squared error:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$

where

$y_i$ = observed response variable

$\hat{y}_i$ = predicted response variable

$n$ = sample size
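
The formula translates almost line for line into code. The observed and predicted values below are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Square root of the average squared difference between observed and predicted."""
    n = len(observed)
    squared_errors = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return math.sqrt(squared_errors / n)

# Hypothetical observed and predicted responses.
observed = [135, 165, 155, 200]
predicted = [140, 160, 150, 190]
print(rmse(observed, predicted))
```

Because the errors are squared before averaging, one large miss (here, 10) contributes more than several small ones (here, three misses of 5).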

Prediction: Linear Regression

This is perhaps the most popular model for making predictions.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

The response variable ($Y$) is equal to a weighted sum of the explanatory variables ($X$) plus noise ($\varepsilon$).
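
For the special case of a single explanatory variable, the least-squares weights have a simple closed form. The sketch below fits weight on height using the five records from the “Data” slide; treating height as the only explanatory variable is an assumption made to keep the example short, and models with several variables would normally be fit with a statistics library.

```python
def fit_simple_linear(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Height (explanatory) and weight (response) from the "Data" slide.
height = [66, 70, 70, 72, 62]
weight = [135, 165, 155, 200, 140]

b0, b1 = fit_simple_linear(height, weight)
predicted_weight = b0 + b1 * 68  # prediction for a hypothetical 68-inch person
print(b0, b1, predicted_weight)
```

The fitted line always passes through the point of means, so the prediction at the average height (68) equals the average weight (159).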

Classification: Logistic Regression

In linear regression, we model the response as a function of the explanatory variables. In logistic regression, the response variable is binary, and we model the probability that the response = 1 ($p$) as a function of the explanatory variables.

$$\log\frac{p}{1 - p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
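
Solving the log-odds equation for p gives p = 1 / (1 + exp(-(β0 + β1X1 + ... + βnXn))), which is how a fitted model turns explanatory variables into a probability. The sketch below shows that conversion only; the coefficients are hypothetical, and estimating them from data is not shown.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse of logit: maps any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

def predict_probability(beta0, betas, xs):
    """P(response = 1) for one record, given coefficients and explanatory values."""
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))
    return inv_logit(eta)

# Hypothetical coefficients for two explanatory variables.
beta0, betas = -1.0, [0.5, 0.25]
p = predict_probability(beta0, betas, [2.0, 4.0])
predicted_class = 1 if p >= 0.5 else 0  # a 0.5 cutoff is a common default
print(p, predicted_class)
```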

Prediction: Regression Tree

Explode   Size   Age   Manufacturer
1         25     5     A
1         30     5     A
1         35     10    A
1         40     10    A
0         40     10    A
0         35     10    B
0         40     10    B
1         50     10    B
0         60     15    B
0         55     15    B
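
A tree is grown by repeatedly choosing the split that makes the resulting groups as homogeneous as possible. As a rough sketch of one such step, the code below searches for the best threshold on Size for the table above, scoring each candidate by the within-group sum of squared errors. That criterion is a common choice for regression trees and is assumed here; the slides do not state which criterion was used.

```python
# Explode (response) and Size (explanatory) columns from the table above.
explode = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
size = [25, 30, 35, 40, 40, 35, 40, 50, 60, 55]

def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Try midpoints between consecutive distinct x values; keep the lowest total SSE."""
    levels = sorted(set(x))
    best = None
    for lo, hi in zip(levels, levels[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:  # ties keep the earlier threshold
            best = (threshold, total)
    return best

threshold, total_sse = best_split(size, explode)
print(threshold, total_sse)
```

A full tree would run the same search over every explanatory variable, split on the winner, and then repeat within each resulting group.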

Classification: Classification Tree

Clustering

Cluster analysis is an unsupervised learning technique.

Unsupervised learning techniques are often not ends in themselves, but are methods for finding relationships and patterns that might be used for subsequent predictive analysis.

K-Means Clustering

Algorithm

• Pick a number of clusters (k)

• Assign each record to one of the k clusters

• Calculate the centroid (vector mean) for each cluster

• At each step, each record is reassigned to the cluster with the “closest” centroid

• Recompute the centroids

• Stop when moving any more records increases cluster dispersion
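
The steps above can be sketched as a short loop. Two simplifications are assumed: the loop stops when no record changes cluster (a common stand-in for the dispersion criterion in the last step), and no cluster is allowed to become empty. The six points and the initial assignment are hypothetical.

```python
import math

def centroid(points):
    """Vector mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(points, assignment, max_iter=100):
    """Alternate centroid updates and reassignment, starting from an initial assignment.

    Assumes every cluster keeps at least one record throughout.
    """
    k = max(assignment) + 1
    centroids = []
    for _ in range(max_iter):
        centroids = [
            centroid([p for p, a in zip(points, assignment) if a == c])
            for c in range(k)
        ]
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no record moved, so the clusters are stable
        assignment = new_assignment
    return assignment, centroids

# Hypothetical two-dimensional records and an arbitrary initial assignment (k = 2).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, [0, 1, 0, 1, 0, 1])
print(labels, centers)
```

The result can depend on the initial assignment, so in practice the algorithm is often run from several random starts.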

Association Analysis

Examples

• We have information on what items were purchased by each consumer at Harris Teeter. We would like to use this information to generate coupons.

• We are an online merchant. We see what the customer is purchasing and recommend another item (and potentially offer it at a discount).

Details

• Also called affinity analysis or market basket analysis

• Goal is to identify item clusterings in transaction-type databases (“what goes with what”)

• The classic algorithm is the Apriori algorithm.
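
“What goes with what” is usually measured with support (the share of transactions containing an item set) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both and finds frequent pairs in the spirit of Apriori, considering only pairs built from items that are frequent on their own. The baskets and the 0.5 support threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of both item sets together, relative to the antecedent alone."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def frequent_pairs(transactions, min_support):
    """Pairs meeting min_support, built only from individually frequent items."""
    items = sorted(set().union(*transactions))
    frequent_items = [i for i in items if support({i}, transactions) >= min_support]
    pairs = {}
    for a, b in combinations(frequent_items, 2):
        s = support({a, b}, transactions)
        if s >= min_support:
            pairs[(a, b)] = s
    return pairs

pairs = frequent_pairs(transactions, min_support=0.5)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(pairs, rule_confidence)
```

Restricting candidate pairs to frequent single items is what keeps the search manageable: a pair cannot be frequent if either of its items is not.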

My Go-To Books for Teaching Data Science

• G. Shmueli, P. Bruce, I. Yahav, N. Patel, K. Lichtendahl (2018). Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. John Wiley & Sons.

• B. Baumer, D. Kaplan, N. Horton (2017). Modern Data Science with R. CRC Press.

• D. Nolan, D. Temple Lang (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. Chapman & Hall.
