Fast & Iterative Data Wrangling with Grammar & Visualization
Kan Nishida @kanaugust Co-Founder / CEO, Exploratory
Much more data is available than you think (or hope)
if you know how to access it
and how to wrangle it
Yesterday
[Diagram: Apps; ETL; BI Model; Report / Dashboard]
Stages: Data Transformation, Business Data Modeling, Data Analysis, Reporting
Roles: DBA, ETL Developers, BI Developers, Report Developers, Business Analysts
Requirements
A few weeks to 12 months
Today
Exploratory Data Analysis
[Diagram: Data Analyst; Data Access, Tidy, Manipulation, Visualization]
Tidy Data
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Tidy Data
Name          Year  Count
Panera Bread  2010  10
Panera Bread  2011  2
Panera Bread  2012  2
Panera Bread  2013  14
Taco Bell     2010  5
Taco Bell     2011  15
Taco Bell     2012  20
Taco Bell     2013  8

Messy Data
Name          2010  2011  2012  2013
Panera Bread  10    2     2     14
Taco Bell     5     15    20    8
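As a rough sketch of how the messy table above could be turned into the tidy one, the R snippet below uses gather() from tidyr, one of the Reshape functions named later in the deck. The data frame names messy and tidy are chosen here for illustration. Newer tidyr versions also offer pivot_longer() for the same job.

```r
library(tidyr)

# The "messy" (wide) table from the slide: one column per year.
messy <- data.frame(
  Name = c("Panera Bread", "Taco Bell"),
  `2010` = c(10, 5),
  `2011` = c(2, 15),
  `2012` = c(2, 20),
  `2013` = c(14, 8),
  check.names = FALSE  # keep the year column names as they are
)

# gather() collects the year columns into key/value pairs:
# Year holds the former column names, Count holds the cell values.
tidy <- gather(messy, key = "Year", value = "Count", -Name)

# The former column names are strings, so convert Year to a number.
tidy$Year <- as.integer(tidy$Year)

print(tidy)
```

The result has one row per Name and Year, so each variable is a column and each observation is a row. The rows come out grouped by year rather than by name, which does not affect tidiness.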
Why Tidy Data
1. Makes it easier to Filter, Calculate, Group, and Aggregate
2. Makes it easier to Visualize
3. Makes it easier to Build Models
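To illustrate the first point, here is a minimal sketch with dplyr on the tidy table from the earlier slide. The cut-off year (2011) and the output column names Total and Average are assumptions made for this example, not something the slides specify.

```r
library(dplyr)

# The tidy table from the slide, rebuilt by hand.
tidy <- data.frame(
  Name = rep(c("Panera Bread", "Taco Bell"), each = 4),
  Year = rep(2010:2013, times = 2),
  Count = c(10, 2, 2, 14, 5, 15, 20, 8)
)

# Filter rows, group by Name, then aggregate: one verb per step.
summary_by_name <- tidy %>%
  filter(Year >= 2011) %>%
  group_by(Name) %>%
  summarize(Total = sum(Count), Average = mean(Count))

print(summary_by_name)
```

With the messy layout, the same question would require naming each year column explicitly; with the tidy layout, Year is just another variable to filter on.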
Hadleyverse
Hadley Wickham
Author of tools (computational and cognitive) that make data science easier, faster, and more fun with R
dplyr, tidyr, readr, readxl, haven, httr, rvest, xml2, lubridate, stringr, ggplot2, ggvis, devtools, testthat, roxygen2, etc.
Exploratory Data Analysis with Hadleyverse
[Diagram: Data Analyst; Data Access (readr, readxl, haven, xml2, httr, rvest), Tidy (tidyr, reshape2), Manipulation (dplyr, lubridate, stringr), Visualization (ggplot2)]
Data Wrangling
Grammar for Data Wrangling
[Diagram: Data Analyst; Data Access, Tidy (tidyr), Manipulation (dplyr), Visualization]
Grammar for Data Wrangling (tidyr, dplyr)
Reshape: gather, separate, unite, nest, unnest, arrange, rename
Combine: left_join, right_join, full_join, inner_join, semi_join, anti_join, union, intersect, setdiff, bind_rows, bind_cols
Subset Rows: filter, distinct, sample_n, sample_frac, slice, top_n
Subset Columns: select
Group: group_by
Create Variable: mutate
Summarize: summarize
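The verbs above are meant to be chained, one step at a time. The sketch below strings several of them together with dplyr. The categories lookup table and the Share column are hypothetical, invented only to give left_join and mutate something to work on.

```r
library(dplyr)

# The tidy counts table from the earlier slide.
counts <- data.frame(
  Name = rep(c("Panera Bread", "Taco Bell"), each = 4),
  Year = rep(2010:2013, times = 2),
  Count = c(10, 2, 2, 14, 5, 15, 20, 8)
)

# Hypothetical lookup table, invented for this example only.
categories <- data.frame(
  Name = c("Panera Bread", "Taco Bell"),
  Category = c("Bakery cafe", "Fast food")
)

result <- counts %>%
  left_join(categories, by = "Name") %>%   # Combine
  group_by(Name) %>%                       # Group
  mutate(Share = Count / sum(Count)) %>%   # Create Variable (per group)
  ungroup() %>%
  arrange(desc(Share)) %>%                 # Reshape (reorder rows)
  select(Name, Category, Year, Share)      # Subset Columns

print(result)
```

Because each verb takes a data frame and returns a data frame, any step can be added, removed, or reordered while iterating on the analysis.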
Data Access
[Diagram: Data Analyst; Data Access, Tidy, Manipulation, Visualization]
Relational Database: Oracle, MSFT, DB2, MySQL, Postgres, Teradata, Amazon Redshift, Vertica, Netezza
Non Relational Database: MongoDB, CouchDB, Redis, Hadoop, Spark
Cloud Apps: Google Analytics, GitHub, MailChimp, Stripe, Twitter, Flurry, Google Spreadsheet, Salesforce, Segment, Mixpanel
Files: CSV, Delimited, Excel, Log Files, JSON Files, XML Files, PDF Files, Web Pages, Stats Files
Also shown: Google BigQuery, Neo4j, Presto, Hive, Apache Arrow
Hadleyverse
[Diagram: Data Analyst; Data Access (readr, readxl, haven, xml2, httr, rvest), Tidy, Manipulation, Visualization]
Need to access diverse data sets
[Diagram: Data Analyst; Data Access, Tidy, Manipulation, Visualization]
readr, readxl, haven, xml2, httr, rvest
jsonlite, rjson, RJSONIO, mongolite, RMongo, rmongodb, RGoogleAnalytics, RGA, GAR
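As a small illustration of pulling different formats into one data frame, the sketch below reads CSV text with readr and JSON text with jsonlite. The inline strings are hypothetical stand-ins for a real file or API response, and the snippet assumes readr 2.0 or later (for I() and show_col_types).

```r
library(readr)
library(jsonlite)

# Inline CSV text stands in for a file path or URL (hypothetical data).
csv_text <- "Name,Year,Count\nTaco Bell,2010,5\nTaco Bell,2011,15\n"
from_csv <- read_csv(I(csv_text), show_col_types = FALSE)

# Inline JSON text stands in for a REST API response (hypothetical data).
json_text <- '[{"Name":"Taco Bell","Year":2012,"Count":20},
               {"Name":"Taco Bell","Year":2013,"Count":8}]'
from_json <- fromJSON(json_text)

# Both sources arrive as data frames with the same columns,
# so they can be stacked and handed to the wrangling steps.
combined <- rbind(as.data.frame(from_csv), from_json)

print(combined)
```

Once every source lands in the same tabular shape, the same tidy and manipulation verbs apply regardless of where the data came from.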
Visualization
Data Visualization
[Diagram: Data Analyst; Data Access, Tidy, Manipulation, Visualization]
Access to any type of data easily and quickly
Wrangle with data iteratively and flexibly using well-defined grammar
Visualize often and understand data quickly
Exploratory Data Analysis
Exploratory: Fast and Iterative Data Wrangling with Grammar & Visualization
Access to any type of data easily
• Text, Log, JSON, XML
• Cloud apps - Google Analytics, GitHub, etc.
• Web Page Scraping
• REST API
• Custom R Script
Data Wrangling with Grammar
• Command Line Interface
• Context Aware Syntax Suggestion
• Data Analysis Pipeline Management
Visualize and understand quickly
• Summary View
• Chart View
• Histogram
• Boxplot
• Scatterplot
• Map
• Bar
• Line
• Area