Data Profiling with R

Upload: michelle-kolbe

Post on 08-Jan-2017

TRANSCRIPT

Page 1: Data Profiling with R

Want to follow along with this session using R?

Download the script and data from the session scheduler. Also download R and RStudio.

It’s easy to follow along!

Page 2: Data Profiling with R

© 2016 RED PILL Analytics

Page 3: Data Profiling with R

www.RedPillAnalytics.com @RedPillA © 2016 RED PILL Analytics

Using R for Data Profiling

Michelle Kolbe
medium.com/@datacheesehead @mekolbe linkedin.com/in/michellekolbe

Page 4: Data Profiling with R

Do you have a data quality problem?

Page 5: Data Profiling with R

What to Check for?

• Accuracy
• Consistency
• Completeness
• Uniqueness
• Distribution
• Range

Page 6: Data Profiling with R

Why Profile Your Data?

Page 7: Data Profiling with R

Benefits

• Trust in data
• Find problems in advance
• Shorten development time on projects
• Improve understanding of data & business knowledge

Page 8: Data Profiling with R

Why R?

Page 9: Data Profiling with R

Why R?

• Free!
• Easy to use
• Flexible
• Powerful analytics
• Great community!

Page 10: Data Profiling with R

Getting Started in R

Page 11: Data Profiling with R

What is R?

• A programming environment
• Fairly simple to use & understand
• Allows a user to manipulate & analyze data
• Open source
• Real power comes from the packages you can install from a LARGE community
• Easy to learn with a programming background
• Con: Memory management & speed vs C++ or Python

Page 12: Data Profiling with R

Tools for R

• First download R from r-project.org
• Then download RStudio, the best R IDE

Page 13: Data Profiling with R

R Basics

• Case sensitive
• <- assigns to a variable
• # begins a comment
• ??<keyword> will search R documentation for help
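The basics on this slide can be tried in a few lines. This is only a small sketch; the object names `yards` and `Yards` are made up for illustration and are not from the session script.

```r
# Assign a value with <- (everything after # is a comment)
yards <- 218
Yards <- 5177   # a different object, because R is case sensitive

# Both objects exist side by side
print(yards)
print(Yards)

# Search the installed documentation for a keyword.
# Left commented out because it opens the help viewer:
# ??histogram
```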

Page 14: Data Profiling with R

Using Packages

• First install
  install.packages("<package name>")
• Once installed, load the package
  library("<package name>")
• Note that every time you open R you’ll need to load the packages you’ll be using
• You’ll see your packages that are installed and loaded in RStudio
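One way to combine the install and load steps is sketched below. `load_package` is a hypothetical helper written for this example, not part of R, and the `tools` package is used only because it ships with R.

```r
# Install a package only when it is missing, then attach it.
# `load_package` is a hypothetical helper, not a base R function.
load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

load_package("tools")                 # ships with R, so nothing is downloaded
print("package:tools" %in% search())  # is the package attached?
```

The same call would work for a CRAN package such as ggplot2, at the cost of a download on first use.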

Page 15: Data Profiling with R

Connecting to Data in R

• Data should be read into R and stored in an object
• Easiest with CSV
• Can download datasets from a URL or read them from a drive
  d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
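Reading from a local file works the same way as reading from a URL. The sketch below writes a tiny CSV to a temporary file first so it runs anywhere; the column names and values are invented for illustration and are not the session's dataset.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Position,Age,Yards",
             "Player A,QB,28,4100",
             "Player B,RB,24,950",
             "Player C,WR,31,1200"), csv_path)

# Read it into a data frame; keep text columns as character, not factor.
d <- read.csv(csv_path, stringsAsFactors = FALSE)

str(d)    # column names and types
head(d)   # first rows
```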

Page 16: Data Profiling with R

Connecting to Oracle

• RODBC
• Load package in R
  library(RODBC)
• View available data sources
  odbcDataSources()
• Can read tables and send SQL queries ("Oracle Sample" is the ODBC connection name)
  con <- odbcConnect("Oracle Sample", uid="system", pwd="oracle")
  d <- sqlQuery(con, "select sysdate from dual")

Page 17: Data Profiling with R

Connecting to Oracle

• RJDBC
• Load package
  library(RJDBC)
• Create connection driver
  jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="lib/ojdbc6.jar")
• Open connection
  jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//database.hostname.com:port/service_name_or_sid", "username", "password")
• Query
  dbGetQuery(jdbcConnection, "select sysdate from dual")
• Close connection
  dbDisconnect(jdbcConnection)

Page 18: Data Profiling with R

ROracle

• Open source but maintained by Oracle
• Faster: 79 times faster than RJDBC and 2.5 times faster than RODBC
• Provides scalability and stability

Page 19: Data Profiling with R

Variables

• Can store data in variables using <- or =
• Do not need to define a variable first
• RStudio shows your variables on the right

Page 20: Data Profiling with R

Using R Studio

Pages 21-38: Data Profiling with R (no text transcribed for these slides)
Page 39: Data Profiling with R

Our Data Set to Profile

Page 40: Data Profiling with R

First, Load the Data into R

Page 41: Data Profiling with R

Summarize the Data

• summary is an R function that shows basic details about each column in your dataset
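The slide's own code did not survive in the transcript, so here is a minimal sketch of `summary` on a made-up data frame. The `players` data and its columns are assumptions for illustration only.

```r
# Hypothetical sample data; not the dataset used in the session.
players <- data.frame(
  Position = c("QB", "QB", "RB", "RB", "WR", "WR"),
  Age      = c(28, 35, 24, 26, 31, 23),
  Yards    = c(4100, 5177, 950, 30, 1200, 0),
  stringsAsFactors = TRUE
)

# Factor columns are summarized as counts per level;
# numeric columns as min, quartiles, median, mean and max.
player_summary <- summary(players)
print(player_summary)
```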

Page 42: Data Profiling with R

Summarize the Data

Page 43: Data Profiling with R

Filter the Dataset

• Use function nesting to get a subset of data in the summary
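Function nesting here means passing the result of one call straight into another. A minimal sketch, assuming a made-up `players` data frame and using base R's `subset`; the session may have filtered differently.

```r
# Hypothetical sample data; not the dataset used in the session.
players <- data.frame(
  Position = c("QB", "QB", "RB", "RB", "WR", "WR"),
  Yards    = c(4100, 5177, 950, 30, 1200, 0),
  stringsAsFactors = FALSE
)

# subset() filters the rows, summary() profiles what is left.
qb_summary <- summary(subset(players, Position == "QB"))
print(qb_summary)

# The same idea with bracket indexing on a single column.
print(summary(players[players$Yards > 0, "Yards"]))
```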

Page 44: Data Profiling with R

Bad Data?

• If the mean is 218 for Yards, is it possible to have a max of 5177, or is this bad data?

Page 45: Data Profiling with R

Group Data by Position

• Here we are grouping with the by function and getting the mean of 4 columns
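A sketch of grouping with `by` follows. The data frame and the four column names (`Age`, `Yards`, `Games`, `Touchdowns`) are assumptions; the slide does not say which four columns were used. `aggregate` is shown as an alternative that returns a plain data frame.

```r
# Hypothetical sample data; not the dataset used in the session.
players <- data.frame(
  Position   = c("QB", "QB", "RB", "RB", "WR", "WR"),
  Age        = c(28, 35, 24, 26, 31, 23),
  Yards      = c(4100, 5177, 950, 30, 1200, 0),
  Games      = c(16, 16, 12, 3, 15, 1),
  Touchdowns = c(30, 38, 7, 0, 9, 0),
  stringsAsFactors = FALSE
)

metric_cols <- c("Age", "Yards", "Games", "Touchdowns")

# by() splits the rows by Position and applies colMeans to each piece.
means_by_position <- by(players[, metric_cols], players$Position, colMeans)
print(means_by_position)

# aggregate() gives the same means as one row per position.
position_means <- aggregate(cbind(Age, Yards, Games, Touchdowns) ~ Position,
                            data = players, FUN = mean)
print(position_means)
```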

Page 46: Data Profiling with R

Visualizing Data

Page 47: Data Profiling with R

Grammar of Graphics Package

• ggplot2 provides many graphing and charting capabilities with R
• Based on Grammar of Graphics by Leland Wilkinson

Page 48: Data Profiling with R

Bar Chart

• Let’s view our distribution by Age. Since this is basically discrete data, we’ll use a Bar Chart.
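The chart itself is not in the transcript. A bar chart of counts per age could look like the sketch below, assuming ggplot2 is installed and using made-up ages.

```r
library(ggplot2)

# Hypothetical ages; not the dataset used in the session.
players <- data.frame(Age = c(23, 24, 24, 26, 28, 28, 28, 31, 35))

# geom_bar() counts the rows for each distinct x value.
# Wrapping Age in factor() treats each age as its own category.
age_bars <- ggplot(players, aes(x = factor(Age))) +
  geom_bar() +
  labs(x = "Age", y = "Number of players")

print(age_bars)
```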

Page 49: Data Profiling with R

Histogram

• Our data imported into R with Factors for some metrics
• Change to Int by converting to a matrix then back to data frame
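The conversion code from the slide is not in the transcript. One common alternative, shown here as a sketch rather than the session's method, is to go through character first. Converting a factor directly returns its internal level codes instead of the original numbers, which is an easy mistake to miss.

```r
# A numeric column that was imported as a factor (hypothetical values).
yards_factor <- factor(c("4100", "950", "30", "950"))

# Pitfall: this returns the internal level codes, not the yardage.
level_codes <- as.integer(yards_factor)
print(level_codes)

# Convert to character first to recover the real numbers.
yards_int <- as.integer(as.character(yards_factor))
print(yards_int)
```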

Page 50: Data Profiling with R

Histogram

Page 51: Data Profiling with R

Histogram

Page 52: Data Profiling with R

Histogram with Some Data Cleanup

• Removed low values

Page 53: Data Profiling with R

Distribution

• Density charts are thought to be superior to histograms because you do not need to be concerned with bins
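To compare the two chart types, here is a sketch that draws a histogram and a density chart from the same made-up values, assuming ggplot2 is installed. Note that a density chart trades the bin width for a smoothing bandwidth, which `geom_density` exposes through its `adjust` argument.

```r
library(ggplot2)

# Hypothetical yardage values with a cluster of zeros; not the session data.
set.seed(1)
players <- data.frame(Yards = c(rep(0, 20), round(rexp(200, rate = 1 / 300))))

# Histogram: the picture depends on the chosen bin width.
yards_hist <- ggplot(players, aes(x = Yards)) +
  geom_histogram(binwidth = 100)

# Density chart: no bins, but adjust controls how smooth the curve is.
yards_density <- ggplot(players, aes(x = Yards)) +
  geom_density(adjust = 1)

print(yards_hist)
print(yards_density)
```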

Page 54: Data Profiling with R

Distribution with 0 value data back in

Page 55: Data Profiling with R

Quick Clean Up

• rm removes a variable or dataset

Page 56: Data Profiling with R

Group the Chart by a Dimension

• We can add a “facet wrap” to group our charts by a dimension
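In ggplot2 the facet wrap is added with `facet_wrap`. A minimal sketch with made-up data, assuming ggplot2 is installed and that Position is the dimension of interest:

```r
library(ggplot2)

# Hypothetical sample data; not the dataset used in the session.
players <- data.frame(
  Position = rep(c("QB", "RB", "WR"), each = 4),
  Yards    = c(4100, 5177, 3200, 2900, 950, 30, 1100, 640, 1200, 0, 870, 430)
)

# facet_wrap() draws one small chart per value of Position.
yards_by_position <- ggplot(players, aes(x = Yards)) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~ Position)

print(yards_by_position)
```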

Page 57: Data Profiling with R

Distributions for Categorical Data

• Can get a count of how many records exist for each value in a table format

Page 58: Data Profiling with R

Distribution for 2 Data Points

• Can change this to a 2-way cross tab distribution
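Base R's `table` covers both the one-way count and the two-way cross tab. The sketch below uses made-up `Position` and `Team` columns; the session may have used different columns or functions.

```r
# Hypothetical sample data; not the dataset used in the session.
players <- data.frame(
  Position = c("QB", "QB", "RB", "RB", "WR", "WR"),
  Team     = c("GB", "CHI", "GB", "GB", "CHI", "MIN"),
  stringsAsFactors = FALSE
)

# One-way distribution: number of records per position.
position_counts <- table(players$Position)
print(position_counts)

# Two-way cross tab: positions in rows, teams in columns.
position_by_team <- table(players$Position, players$Team)
print(position_by_team)
```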

Page 59: Data Profiling with R

Boxplot

Page 60: Data Profiling with R

Scatterplot

Page 61: Data Profiling with R


Page 62: Data Profiling with R

Scatterplot with Regression
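Only the slide title survives here. One way to draw a scatterplot with a fitted regression line in ggplot2 is sketched below, assuming ggplot2 is installed. The `Games` and `Yards` columns and their values are invented for illustration.

```r
library(ggplot2)

# Hypothetical sample data; not the dataset used in the session.
players <- data.frame(
  Games = c(4, 8, 10, 12, 14, 16),
  Yards = c(310, 620, 700, 1010, 1150, 1290)
)

# geom_point() draws the scatterplot;
# geom_smooth(method = "lm") overlays a straight-line fit.
scatter <- ggplot(players, aes(x = Games, y = Yards)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(scatter)

# The same line fitted directly, to inspect the coefficients.
fit <- lm(Yards ~ Games, data = players)
print(coef(fit))
```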

Page 63: Data Profiling with R

Line Chart

Page 64: Data Profiling with R

Add a Bar Chart to the Line

Page 65: Data Profiling with R

Stacked Bars are Rarely Helpful

Page 66: Data Profiling with R

What about Text fields?

Page 67: Data Profiling with R

Word Cloud

Page 68: Data Profiling with R

Missing Data

Page 69: Data Profiling with R

Null vs NA in R

• R treats NA like other languages consider NULL

| | NULL | NA |
|---|---|---|
| Definition | Null object, a reserved word | Logical constant of length 1 containing a missing value indicator |
| Behavior in Vector | Not allowed. Won’t save within vector. | Exists and represents missing value. |
| Behavior in List (such as Data Frame) | Can exist if not assigned but created with it. | Exists and represents missing value. |
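The behaviors in the table can be checked directly in the console. This is a small sketch in base R; the object names are made up for illustration.

```r
# In a vector, NULL is dropped while NA keeps its slot.
with_null <- c(1, NULL, 3)
with_na   <- c(1, NA, 3)
print(length(with_null))
print(length(with_na))

# NA marks a missing value and propagates unless it is removed.
print(is.na(with_na))
print(mean(with_na))
print(mean(with_na, na.rm = TRUE))

# In a list, a NULL element can exist when the list is created with it...
record <- list(name = "Player A", team = NULL)
print(length(record))

# ...but assigning NULL to an element afterwards removes that element.
record$name <- NULL
print(length(record))
```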

Page 70: Data Profiling with R

Nulls on Import

• Our dataset had nulls in it when we pulled it into R. How were they assigned?
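The slide's answer is not in the transcript, but the default `read.csv` behavior is easy to demonstrate: a blank field in a numeric column becomes NA, while a blank field in a text column stays an empty string unless `na.strings` says otherwise. The file contents below are made up for illustration.

```r
# Write a small CSV with one blank text field and one blank numeric field.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Team,Yards",
             "Player A,GB,4100",
             "Player B,,950",
             "Player C,CHI,"), csv_path)

# Default import: blank number -> NA, blank text -> "".
d <- read.csv(csv_path, stringsAsFactors = FALSE)
print(is.na(d$Yards))
print(d$Team == "")

# Treat empty strings as missing at import time.
d2 <- read.csv(csv_path, stringsAsFactors = FALSE, na.strings = c("", "NA"))
print(is.na(d2$Team))
```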

Page 71: Data Profiling with R

Finding Missing Data
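Two base R calls cover most of the search for missing values: `is.na` to count them per column and `complete.cases` to pull out the affected rows. A sketch on made-up data, since the slide's code is not in the transcript:

```r
# Hypothetical sample with gaps in two columns.
d <- data.frame(
  Name  = c("Player A", "Player B", "Player C", "Player D"),
  Team  = c("GB", NA, "CHI", "MIN"),
  Yards = c(4100, 950, NA, NA),
  stringsAsFactors = FALSE
)

# Missing values per column.
missing_per_column <- colSums(is.na(d))
print(missing_per_column)

# Rows that contain at least one missing value.
incomplete_rows <- d[!complete.cases(d), ]
print(incomplete_rows)
```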

Page 72: Data Profiling with R

But look what else we found in Jeff’s records!

Page 73: Data Profiling with R

Make Missing Data Consistent in R
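The transcript does not show which placeholder values turned up in the data. Assuming they were text markers such as empty strings, "N/A" or "null", one way to make them consistent is to map all of them to NA. The data and the marker list below are assumptions for illustration.

```r
# Hypothetical sample with several different "missing" markers.
d <- data.frame(
  Name    = c("Player A", "Player B", "Player C", "Player D"),
  Team    = c("GB", "", "N/A", "CHI"),
  College = c("null", "Iowa", " ", "Ohio State"),
  stringsAsFactors = FALSE
)

# Markers to treat as missing (an assumed list; adjust to your data).
missing_tokens <- c("", "N/A", "null")

# Replace the markers with NA in every character column.
is_text <- vapply(d, is.character, logical(1))
d[is_text] <- lapply(d[is_text], function(col) {
  col[trimws(col) %in% missing_tokens] <- NA
  col
})

# Check the whole dataset: missing values per column.
print(colSums(is.na(d)))
```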

Page 74: Data Profiling with R

Check the whole dataset now

Page 75: Data Profiling with R

What to do about missing & bad data?

Page 76: Data Profiling with R

Handling Bad Data in ETL

• Reject
• Clean & Fill In
• Load As Is

Page 77: Data Profiling with R

Using Data Quality Package

Page 78: Data Profiling with R

DataQualityR Package

Page 79: Data Profiling with R

Numerical Results

Page 80: Data Profiling with R

Categorical Results

Page 81: Data Profiling with R

In Summary

• R gives you a quick and easy way to learn about your data before investing time into ETL
• Open source means no investment into tools
• R isn’t scary or all statistical and stuff!

Page 82: Data Profiling with R


Page 83: Data Profiling with R


Page 84: Data Profiling with R

Using R for Data Profiling
Session #1805

Michelle Kolbe
medium.com/@datacheesehead @mekolbe linkedin.com/in/michellekolbe

Fill out a session survey in the mobile app!!