Post on 06-Jun-2020

Fast & Iterative Data Wrangling with Grammar & Visualization

Kan Nishida @kanaugust Co-Founder / CEO, Exploratory

Much more data is available than you think (or hope)

if you know how to access it

and how to wrangle it

Yesterday

[Diagram, built up over several slides: Apps, ETL, BI Model, and Report / Dashboard. The stages are Data Transformation, Business Data Modeling, and Data Analysis / Reporting. The roles shown are DBAs, ETL Developers, BI Developers, Report Developers, and Business Analysts. The final builds add a Requirements label.]

A few weeks to 12 months

Today

Exploratory Data Analysis

[Diagram: a single Data Analyst cycles through Data Access, Tidy, Manipulation, and Visualization.]

Tidy Data

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

Tidy Data

Name          Year  Count
Panera Bread  2010     10
Panera Bread  2011      2
Panera Bread  2012      2
Panera Bread  2013     14
Taco Bell     2010      5
Taco Bell     2011     15
Taco Bell     2012     20
Taco Bell     2013      8

Messy Data

Name          2010  2011  2012  2013
Panera Bread    10     2     2    14
Taco Bell        5    15    20     8
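As a minimal sketch of how the messy (wide) table above could be turned into the tidy one, the following uses `gather` from tidyr, one of the verbs the talk lists. The variable names `messy` and `tidy` are illustrative assumptions, not names from the talk.

```r
library(tidyr)

# The "messy" (wide) table from the slide: one column per year.
messy <- data.frame(
  Name   = c("Panera Bread", "Taco Bell"),
  `2010` = c(10, 5),
  `2011` = c(2, 15),
  `2012` = c(2, 20),
  `2013` = c(14, 8),
  check.names = FALSE  # keep the year column names exactly as written
)

# Gather every column except Name into Year / Count pairs.
# convert = TRUE turns the gathered year labels into integers.
tidy <- gather(messy, key = "Year", value = "Count", -Name, convert = TRUE)

# Sort for display so each restaurant's years appear together.
tidy[order(tidy$Name, tidy$Year), ]
```

Each year column becomes a value of the `Year` variable, so every row now holds exactly one observation.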

Why Tidy Data

1. Makes it easier to Filter, Calculate, Group, and Aggregate

2. Makes it easier to Visualize

3. Makes it easier to Build Models
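The three points above can be illustrated with a short sketch on the tidy restaurant table. It assumes dplyr for the filter, group, and aggregate steps and base R's `lm` for the model; the names `tidy`, `recent_totals`, and `fit` are illustrative only, and the model is there to show the data shape, not to suggest a meaningful analysis.

```r
library(dplyr)

# The tidy table from the slide, rebuilt here so the example is self-contained.
tidy <- data.frame(
  Name  = rep(c("Panera Bread", "Taco Bell"), each = 4),
  Year  = rep(2010:2013, times = 2),
  Count = c(10, 2, 2, 14, 5, 15, 20, 8)
)

# Filter, group, and aggregate: total count per restaurant from 2012 onward.
recent_totals <- tidy %>%
  filter(Year >= 2012) %>%
  group_by(Name) %>%
  summarize(Total = sum(Count))

# Build a model: variables are already columns, so the formula reads directly.
fit <- lm(Count ~ Year + Name, data = tidy)

recent_totals
```

With the wide layout, the same filter would need a different set of columns for every year range, which is the practical argument for tidying first.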

Hadleyverse

Hadley Wickham: author of tools (computational and cognitive) that make data science easier, faster, and more fun with R

dplyr, tidyr, readr, readxl, haven, httr, rvest, xml2, lubridate, stringr, ggplot2, ggvis, devtools, testthat, roxygen2, etc.

Exploratory Data Analysis with Hadleyverse

Data Access: readr, readxl, haven, xml2, httr, rvest
Tidy: tidyr, reshape2
Manipulation: dplyr (with lubridate and stringr for dates and strings)
Visualization: ggplot2

Data Wrangling

Grammar for Data Wrangling

[Diagram: the Data Analyst's cycle of Data Access, Tidy, Manipulation, and Visualization, with tidyr and dplyr labeled as the grammar for data wrangling.]

Grammar for Data Wrangling (tidyr, dplyr)

Reshape: gather, separate, unite, nest, unnest, arrange, rename
Subset Rows: filter, distinct, sample_n, sample_frac, slice, top_n
Subset Columns: select
Combine: left_join, right_join, full_join, inner_join, semi_join, anti_join, union, intersect, setdiff, bind_rows, bind_cols
Group: group_by
Create Variable: mutate
Summarize: summarize
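To show how these verbs compose, here is a minimal sketch that chains one verb from most of the categories above. The input tables `raw` and `stores`, the fused `store_year` column, and the `Segment` labels are hypothetical and invented for illustration; only the restaurant names and counts echo the earlier slide.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: restaurant name and year fused into one column.
raw <- data.frame(
  store_year = c("Panera Bread_2012", "Panera Bread_2013",
                 "Taco Bell_2012", "Taco Bell_2013"),
  Count = c(2, 14, 20, 8)
)

# Hypothetical lookup table used to demonstrate a join.
stores <- data.frame(
  Name    = c("Panera Bread", "Taco Bell"),
  Segment = c("A", "B")
)

result <- raw %>%
  separate(store_year, into = c("Name", "Year"),
           sep = "_", convert = TRUE) %>%          # Reshape
  filter(Year >= 2012) %>%                          # Subset rows
  select(Name, Year, Count) %>%                     # Subset columns
  group_by(Name) %>%                                # Group
  mutate(Share = Count / sum(Count)) %>%            # Create variable per group
  ungroup() %>%
  left_join(stores, by = "Name") %>%                # Combine
  arrange(Name, Year)

# Summarize: one row per segment.
segment_totals <- result %>%
  group_by(Segment) %>%
  summarize(Total = sum(Count))

result
segment_totals
```

Because every verb takes a data frame and returns a data frame, steps can be added, removed, or reordered one at a time, which is what makes the wrangling iterative.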


Data Access

Relational Database: Oracle, MSFT, DB2, MySQL, Postgres, Teradata, Amazon Redshift, Vertica, Netezza
Non Relational Database: MongoDB, CouchDB, Redis, Hadoop, Spark, neo4J
Cloud Apps: Google Analytics, GitHub, MailChimp, Stripe, Twitter, Flurry, Google Spreadsheet, Salesforce, Segment, Mixpanel
Files: CSV, Delimited, Excel, Log Files, JSON Files, XML Files, PDF Files, Web Pages, Stats Files
Also shown: Google BigQuery, Presto, Hive, Apache Arrow

Hadleyverse packages for Data Access: readr, readxl, haven, xml2, httr, rvest

Need to access diverse data sets

Other packages: jsonlite, rjson, RJSONIO, mongolite, RMongo, rmongodb, RGoogleAnalytics, RGA, GAR
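As a rough sketch of what accessing diverse data sets looks like in practice, the following reads the same kind of records from CSV text with readr and from JSON text with jsonlite, then stacks them with dplyr. The inline strings stand in for real files or API responses and are assumptions for illustration; wrapping literal text in `I()` and the `show_col_types` argument assume readr 2.0 or later.

```r
library(readr)
library(jsonlite)
library(dplyr)

# Inline CSV text standing in for a file on disk.
csv_text <- "Name,Year,Count\nPanera Bread,2010,10\nPanera Bread,2011,2\n"
from_csv <- read_csv(I(csv_text), show_col_types = FALSE)

# Inline JSON text standing in for an API response.
json_text <- '[{"Name": "Taco Bell", "Year": 2010, "Count": 5}]'
from_json <- fromJSON(json_text)

# Both sources arrive as data frames with the same columns,
# so they can be stacked and handed to the same wrangling steps.
combined <- bind_rows(from_csv, from_json)

combined
```

Each source needs its own package, but once the data is in a data frame the rest of the workflow does not depend on where it came from.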

Data Visualization

Exploratory Data Analysis

• Access any type of data easily and quickly
• Wrangle data iteratively and flexibly using a well-defined grammar
• Visualize often and understand data quickly
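To make the "visualize often" point concrete, here is a minimal ggplot2 sketch that plots the tidy restaurant table from the earlier slide. The variable names `tidy` and `p` and the plot title are illustrative assumptions.

```r
library(ggplot2)

# The tidy table from the earlier slide, rebuilt so the example stands alone.
tidy <- data.frame(
  Name  = rep(c("Panera Bread", "Taco Bell"), each = 4),
  Year  = rep(2010:2013, times = 2),
  Count = c(10, 2, 2, 14, 5, 15, 20, 8)
)

# One line per restaurant: tidy columns map directly onto aesthetics.
p <- ggplot(tidy, aes(x = Year, y = Count, colour = Name)) +
  geom_line() +
  geom_point() +
  labs(title = "Count by year")

p
```

Because `Year`, `Count`, and `Name` are each a column, switching to a different chart type is usually a one-line change, which keeps the wrangle-then-look loop short.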

Exploratory: Fast and Iterative Data Wrangling with Grammar & Visualization

Access to any type of data easily

• Text, Log, JSON, XML
• Cloud apps - Google Analytics, Github, etc.
• Web Page Scraping
• REST API
• Custom R Script

Data Wrangling with Grammar

• Command Line Interface
• Context Aware Syntax Suggestion
• Data Analysis Pipeline Management

Visualize and understand quickly

• Summary View
• Chart View
• Histogram, Boxplot, Scatterplot, Map, Bar, Line, Area
