refinepro.com - @RefinePro – [email protected] 1
Enable domain experts to explore,
normalize and enrich their data via a
self-service data preparation platform
Garbage In – Garbage Out
Data Processing Pipeline
60 to 80% of data analysis time is spent on
cleaning, transforming and integrating data
Data Processing Pipeline
Analytics need clean data to be reliable
Legacy data need to be migrated to a new system
Data must be reconciled against a master data set
Data projects need access to reliable data quickly
Data Integration Challenge
• Messy and inaccurate data.
• Individuals and business units have unique data needs.
• New predictive and enrichment services are made available using an API-first approach.
• Current tools are challenged by the speed of those new requirements.
Data Quality & Integration Is Time Consuming
• Duplicate values & typos
• Multi-value cells
• Data in the wrong field
• Missing / partial values
• Encoding errors
• Change format (text, number, date)
• Flat to relational data set
• Schema alignment
• Transpose rows and columns
• Join data sets
• Enrichment from other sources (MDM, API calls)
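
To make two of these items concrete (multi-value cells and format changes), here is a minimal Python sketch. The column names, the ";" separator and the sample rows are assumptions made up for illustration; they do not come from the deck.

```python
def split_multi_value(rows, column, sep=";"):
    """Turn each row with a multi-value cell into one row per value."""
    out = []
    for row in rows:
        values = [v.strip() for v in row[column].split(sep) if v.strip()]
        # Keep the row (with an empty cell) when there is nothing to split.
        for value in values or [""]:
            out.append({**row, column: value})
    return out


def to_number(text):
    """Convert text such as ' 1,200 ' to a float; return None if it cannot be parsed."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical sample data.
rows = [
    {"name": "Acme", "tags": "retail; wholesale", "revenue": "1,200"},
    {"name": "Globex", "tags": "energy", "revenue": "n/a"},
]

expanded = split_multi_value(rows, "tags")
for row in expanded:
    row["revenue"] = to_number(row["revenue"])
```

Unparseable values become None instead of raising, so they can be reviewed later as missing values rather than stopping the whole pass.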
What is clean data?
Individual and business unit data needs are not consistent
How do you define a clean address?
• Which fields should it contain?
• What format should it follow?
• What geographical scope should it support?
• Which data integrity rules should it enforce (e.g. postal code vs city)?
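
As an illustration of the last question, a data integrity rule such as "postal code vs city" could be sketched in Python as below. The Canadian postal code pattern, the two-entry lookup table and the field names are assumptions chosen for the example; a real rule would rely on a complete postal reference data set.

```python
import re

# Illustrative lookup only (assumption): postal code prefix -> expected city.
PREFIX_TO_CITY = {"M5V": "Toronto", "H2X": "Montreal"}

# Canadian postal code shape: letter-digit-letter, optional space, digit-letter-digit.
POSTAL_PATTERN = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")


def check_address(record):
    """Return a list of rule violations for one address record."""
    problems = []
    postal = record.get("postal_code", "").strip().upper()
    city = record.get("city", "").strip()
    if not POSTAL_PATTERN.match(postal):
        problems.append("postal code format")
    else:
        expected = PREFIX_TO_CITY.get(postal[:3])
        # Only flag a mismatch when the prefix is known to the lookup.
        if expected is not None and expected.lower() != city.lower():
            problems.append("postal code does not match city")
    return problems


ok = check_address({"postal_code": "M5V 2T6", "city": "Toronto"})
mismatch = check_address({"postal_code": "m5v2t6", "city": "Montreal"})
bad_format = check_address({"postal_code": "12345", "city": "Toronto"})
```

Each business unit can keep its own rule set this way, which is one reading of "data needs are not consistent".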
Data services have an API-first approach
• New economy of machine learning, predictive and data enrichment services:
  • Geocoding and address cleaning
  • Name recognition and extraction
  • Churn prediction
  • …
Those services come with an API-first approach that requires technical skills
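
The technical skill involved is mostly glue code like the following Python sketch: call a service once per distinct value, cache the answer, and tolerate failures. The function names are hypothetical, and `fake_geocoder` is a stand-in with placeholder coordinates rather than any real provider's API.

```python
def enrich_rows(rows, column, call_service, new_column):
    """Add new_column to every row by calling call_service on row[column]."""
    cache = {}
    for row in rows:
        key = row[column]
        if key not in cache:
            try:
                cache[key] = call_service(key)
            except Exception:
                # A failed call leaves the cell empty instead of stopping the run.
                cache[key] = None
        row[new_column] = cache[key]
    return rows


def fake_geocoder(address):
    """Stand-in for a remote geocoding API (assumption, placeholder data)."""
    known = {"100 Queen St W, Toronto": (43.65, -79.38)}
    if address not in known:
        raise LookupError(address)
    return known[address]


calls = []


def counting_geocoder(address):
    """Wrap the fake service to show how many calls are really made."""
    calls.append(address)
    return fake_geocoder(address)


rows = [
    {"address": "100 Queen St W, Toronto"},
    {"address": "Unknown Rd"},
    {"address": "100 Queen St W, Toronto"},
]
enrich_rows(rows, "address", counting_geocoder, "lat_lng")
```

Caching by value matters with paid or rate-limited services: three rows here cost only two calls.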
Today's data environment challenges traditional technologies
[Figure: User Base. Axes: Understand the Data (Business Skills) and Know How To Transform Data (Technical Skills). Labels shown: DBA, ETL, Data Science, Spreadsheet User, Data Visualization / Interpretation.]
Excel doesn't scale or automate well
IT can't keep pace with the volume of requests
New process & tools to the rescue!
Agile Process
• Lets you adjust and adapt to a changing environment by working in iterations
• Lets you stop at the right level of quality
Tools: Self Service Data Preparation
• Empowers the domain experts
• Allows faster iteration through the process
Agile - Incremental Data Processing
[Figure: cycle with four stages: Data Discovery & Profiling, Data Transformation, Data Consumption, Track / Measure]
Agile - Incremental Data Processing
Data Discovery & Profiling
• Place data in context
• Test data service
• Is it useful?
• What can I do with it?
Data Transformation
• Define strategy
• Perform data preparation
Data Consumption
• Analytics
• Migration
• Reconciliation
Track / Measure
• Check data integrity
• Find quality gaps
• Learn from your experience
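
For the discovery and profiling stage, a first pass often amounts to counting what is in each column. The Python sketch below is one possible minimal profile; the column name and sample rows are assumptions for illustration.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: row count, missing cells, distinct values, top values."""
    values = [row.get(column) for row in rows]
    filled = [v for v in values if v not in (None, "")]
    counts = Counter(filled)
    return {
        "rows": len(values),
        "missing": len(values) - len(filled),
        "distinct": len(counts),
        "top": counts.most_common(3),
    }


# Hypothetical sample data with a case variant and two empty cells.
rows = [
    {"city": "Toronto"},
    {"city": "Toronto"},
    {"city": "toronto"},
    {"city": ""},
    {"city": None},
    {"city": "Ottawa"},
]
profile = profile_column(rows, "city")
```

Re-running the same profile after each transformation is one simple way to track and measure quality gaps across iterations.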
Self Service Data Preparation Bridges The Skill Gap
[Figure: User Base. Axes: Understand the Data (Business Skills) and Know How To Transform Data (Technical Skills). Labels shown: DBA, ETL, Data Science, Spreadsheet User, Data Visualization / Interpretation, OpenRefine.]
Excel doesn't scale or automate well
IT can't keep pace with the volume of requests
OpenRefine Functionality
• XLS, CSV, JSON, XML input & output support
• Point & click
• Cluster & deduplication
• Filter & sort
• Transpose
• Custom query language
• Enrich data via APIs
• Join, merge & reconcile
• Split to rows and columns
• Undo / redo
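
To give a feel for what "cluster & deduplication" does, here is a minimal Python sketch in the spirit of key-collision clustering: values that reduce to the same normalized key are proposed as one group. It is an approximation written for this transcript, not OpenRefine's own implementation, and the sample names are made up.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that trivial variants share the same key."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop accents (combining marks) left behind by the NFKD decomposition.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, then sort and de-duplicate the tokens.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


names = ["Acme Inc.", "acme inc", "Inc. Acme", "Globex", "Café Müller", "Cafe Muller"]
clusters = cluster(names)
```

The domain expert still decides which spelling wins in each group; the code only proposes candidates.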
OpenRefine
Community developed for 5 years
Gridworks > Google Refine > OpenRefine
5,000+ monthly downloads
Runs on a local machine
Widely used by data journalists, libraries, and Semantic Web, open data and bioscience experts.
RefinePro helps teams and organizations scale OpenRefine
• Training
• Cloud & on-premise hosting
• Integration & custom development
Demo
1. Toronto Build Permit data set:
• Explore what data we have available
• Geocode the address
2. Salesforce dump:
• Remove duplicate names
• Add Facebook and Twitter profiles via FullContact
OpenRefine in the Data Quality & Integration Pipeline
[Figure: stages 1, 2, 3 plotted against "Frequency - number of use cases". Labels: Discovery, Profiling, Preparation, Sense Making, Data Exploration, Data Wrangling, "Is the data useful?", "What can I do with it?"]
OpenRefine in the Data Quality & Integration Pipeline
[Figure: stages 1, 2, 3 plotted against "Frequency - number of use cases". Labels: Discovery, Profiling, Preparation, Sense Making, Data Exploration, Data Wrangling, "Is the data useful?", "What can I do with it?". Added labels: Personal ETL & Analysis, Prototype, One-time migration.]
OpenRefine in the Data Quality & Integration Pipeline
[Figure: stages 1, 2, 3 plotted against "Frequency - number of use cases". Labels: Discovery, Profiling, Preparation, Sense Making, Data Exploration, Data Wrangling, "Is the data useful?", "What can I do with it?", Personal ETL & Analysis, Prototype, One-time migration. Added labels: Big Data, Real-Time Processing, Enterprise ETL.]
OpenRefine in the Data Quality & Integration Pipeline
[Figure: stages 1, 2, 3 plotted against "Frequency - number of use cases". Axes: Understand the Data (Business Skills) and Know How To Transform Data (Technical Skills). Labels: Discovery, Profiling, Preparation, Data Wrangling.]
Enable domain experts to explore,
normalize and enrich their data via a
self-service data preparation platform
OpenRefine Eco-System
OpenRefine Eco-System
Reconciliation services sit outside of Refine and
enable users to align and enrich data against
domain-specific master data
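
The idea of reconciling against master data can be sketched in a few lines of Python. This is only a local illustration using string similarity from the standard library; it does not reproduce the protocol that Refine reconciliation services speak, and the master records, identifiers and 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def reconcile(value, master, threshold=0.8):
    """Return (record_id, score) of the best master match, or (None, score) if too weak."""
    best_id, best_score = None, 0.0
    candidate = value.strip().lower()
    for record_id, name in master.items():
        score = SequenceMatcher(None, candidate, name.lower()).ratio()
        if score > best_score:
            best_id, best_score = record_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Hypothetical master data set: identifier -> canonical name.
master = {"M1": "University of Toronto", "M2": "McGill University"}

typo_id, typo_score = reconcile("Universty of Toronto", master)
exact_id, exact_score = reconcile("  UNIVERSITY OF TORONTO ", master)
missing_id, missing_score = reconcile("Acme Corp", master)
```

Values below the threshold stay unmatched so that a domain expert can review them instead of accepting a weak guess.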
OpenRefine Eco-System
Extensions add functionality to the Refine core.
The API processing plugin enables seamless data processing with API-based services.
The batch processing library enables lightweight ETL processes.
OpenRefine Eco-System
New distributions focus on
- domain-specific integration
- exploring new functionality
- hosted versions of Refine