enable domain experts explore, normalize and enrich their data via a self service data ... -...

refinepro.com - @RefinePro – [email protected] 1 Enable domain experts explore, normalize and enrich their data via a self service data preparation platform


Post on 21-May-2020


Page 1: Enable domain experts explore, normalize and enrich their data via a self service data ... - presentation.pdf · 2018-08-09 · Enable domain experts explore, normalize and enrich

refinepro.com - @RefinePro – [email protected] 1

Enable domain experts to explore, normalize and enrich their data via a self-service data preparation platform


Garbage In – Garbage Out


Data Processing Pipeline


Data Processing Pipeline

60 to 80% of data analysis time is spent on cleaning, transforming and integrating data.


• Analytics need clean data to be reliable
• Legacy data need to be migrated to a new system
• Data must be reconciled against a master data set
• Data projects need access to reliable data quickly


Data Integration Challenge

• Data is messy and inaccurate.
• Individuals and business units have unique data needs.
• New predictive and enrichment services are made available using an API-first approach.
• Current tools struggle to keep up with the speed of those new requirements.


Data Quality & Integration Is Time Consuming

• Duplicate values & typos
• Multi-value cells
• Data in the wrong field
• Missing / partial values
• Encoding errors
• Change format (text, number, date)
• Flat to relational data set
• Schema alignment
• Transpose rows and columns
• Join data sets
• Enrichment from other sources (MDM, API calls)
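Two of the issues listed on this slide, multi-value cells and inconsistent date formats, can be illustrated with a short Python sketch. The column names, the ";" separator and the list of date formats are assumptions made for the example and do not come from the presentation.

```python
from datetime import datetime


def split_multi_value(rows, column, separator=";"):
    """Turn one row with a multi-value cell into one row per value."""
    result = []
    for row in rows:
        values = [v.strip() for v in row[column].split(separator) if v.strip()]
        # Keep the row even when the cell is empty so no record is lost.
        for value in values or [""]:
            new_row = dict(row)
            new_row[column] = value
            result.append(new_row)
    return result


def parse_date(text, formats=("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")):
    """Try each assumed format; return an ISO date string, or None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Hypothetical sample record with a multi-value "tags" cell and a day-first date.
rows = [{"name": "Acme", "tags": "retail; wholesale", "since": "03/02/2015"}]
clean = split_multi_value(rows, "tags")
for row in clean:
    row["since"] = parse_date(row["since"])
```

Values that match none of the formats come back as None, which makes the remaining gaps easy to filter and review by hand.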


What is clean data?

Individual and business unit data needs are not consistent.

How do you define a clean address?
• Which fields should it contain?
• What format should it follow?
• What geographical scope should it support?
• Which data integrity rules should it enforce (e.g. postal code vs. city)?
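One way to make the definition of a clean address explicit is to write the rules down as checks. The sketch below is only an illustration: the required fields, the simplified Canadian postal code pattern and the tiny prefix-to-city table are assumptions, and each team would substitute its own rules.

```python
import re

# Simplified Canadian postal code shape (letter-digit-letter, digit-letter-digit).
# Assumption: the real standard also excludes some letters; this sketch does not.
POSTAL_CODE = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")

# Illustrative lookup only: first letter of the postal code -> expected city.
CITY_BY_PREFIX = {"M": "toronto", "H": "montreal"}


def check_address(record, required=("street", "city", "postal_code")):
    """Return the list of rule violations found in one address record."""
    problems = [f"missing {f}" for f in required if not (record.get(f) or "").strip()]
    code = (record.get("postal_code") or "").strip().upper()
    if code and not POSTAL_CODE.match(code):
        problems.append("postal code format")
    elif code:
        expected = CITY_BY_PREFIX.get(code[0])
        city = (record.get("city") or "").strip().lower()
        if expected and city and expected != city:
            problems.append("postal code does not match city")
    return problems


# Hypothetical records used to exercise the rules.
good = {"street": "100 Queen St W", "city": "Toronto", "postal_code": "M5H 2N2"}
bad = {"street": "100 Queen St W", "city": "Ottawa", "postal_code": "M5H 2N2"}
print(check_address(good), check_address(bad))
```

Keeping the rules in one small function makes it easy for two business units to compare, and disagree about, what "clean" means for them.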


Data Services Have an API-First Approach

• New economy of machine learning, predictive and data enrichment services:
  • Geocoding and address cleaning
  • Name recognition and extraction
  • Churn prediction
  • …

Those services come with an API-first approach that requires technical skills.
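To give a sense of the technical skills involved, here is a minimal Python sketch of enriching rows through an API. The endpoint, its `q` parameter and the `lat`/`lng` response fields are hypothetical; a real geocoding service will define its own URL, authentication and response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_geocoder(base_url):
    """Build a lookup function for a hypothetical JSON geocoding endpoint."""
    def lookup(address):
        url = f"{base_url}?{urlencode({'q': address})}"
        with urlopen(url, timeout=10) as response:
            return json.load(response)
    return lookup


def enrich(rows, lookup, source="address"):
    """Add lat/lng columns to each row, calling the service once per distinct value."""
    cache = {}
    enriched = []
    for row in rows:
        key = row[source]
        if key not in cache:
            try:
                cache[key] = lookup(key)
            except Exception:
                # A failed call leaves the row un-enriched instead of stopping the run.
                cache[key] = {}
        result = cache[key]
        enriched.append({**row, "lat": result.get("lat"), "lng": result.get("lng")})
    return enriched


def fake_lookup(address):
    """Stand-in for the remote service so the sketch runs offline (made-up values)."""
    return {"lat": 43.65, "lng": -79.38}


rows = enrich([{"address": "100 Queen St W, Toronto"}], fake_lookup)
```

Passing the lookup function in as a parameter keeps the enrichment loop independent of any one provider; `http_geocoder("https://example.com/geocode")` could replace `fake_lookup` once a real endpoint is known.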


Today's data environment challenges traditional technologies

[Diagram: User Base. Axes: Understand the Data (Business Skills) and Know How To Transform Data (Technical Skills). Roles shown: DBA, ETL, Data Science, Spreadsheet User, Data Visualization / Interpretation.]

• Excel doesn't scale or automate well
• IT can't keep pace with the volume of requests


New process & tools to the rescue!

Agile Process
• Lets you adjust and adapt to a changing environment by working in iterations
• Lets you stop at the right level of quality

Tools: Self Service Data Preparation
• Empower the domain experts
• Allow faster iteration through the process


Agile - Incremental Data Processing

[Cycle diagram with four steps: Data Discovery & Profiling, Data Transformation, Data Consumption, Track / Measure]


Agile - Incremental Data Processing

Data Discovery & Profiling
• Place data in context
• Test data service
• Is it useful?
• What can I do with it?

Data Transformation
• Define strategy
• Perform data preparation

Data Consumption
• Analytics
• Migration
• Reconciliation

Track / Measure
• Check data integrity
• Find quality gaps
• Learn from your experience
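The profiling and tracking steps on this slide (check data integrity, find quality gaps) often start with a simple column summary. The following Python sketch shows one possible shape for such a summary; the sample rows and the chosen metrics are assumptions for illustration.

```python
from collections import Counter


def profile(rows):
    """Summarize each column: blank count, distinct values, most common value."""
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        # Treat missing keys, None and whitespace-only cells as blank.
        values = [(row.get(col) or "").strip() for row in rows]
        filled = [v for v in values if v]
        counts = Counter(filled)
        report[col] = {
            "blank": len(values) - len(filled),
            "distinct": len(counts),
            "top": counts.most_common(1)[0][0] if counts else None,
        }
    return report


# Hypothetical rows with one blank cell in each column.
sample = [
    {"city": "Toronto", "permit": "A1"},
    {"city": "Toronto", "permit": ""},
    {"city": "", "permit": "B2"},
]
report = profile(sample)
```

Running the same summary after every iteration gives a rough measure of whether the data is getting closer to the required level of quality.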


Self Service Data Preparation Bridges The Skill Gap

[Diagram: User Base. Axes: Understand the Data (Business Skills) and Know How To Transform Data (Technical Skills). Roles shown: DBA, ETL, Data Science, Spreadsheet User, Data Visualization / Interpretation, with OpenRefine added.]

• Excel doesn't scale or automate well
• IT can't keep pace with the volume of requests


OpenRefine Functionality

• XLS, CSV, JSON, XML input & output support
• Point & click
• Cluster & deduplication
• Filter & sort
• Transpose
• Custom query language
• Enrich data via APIs
• Join, merge & reconcile
• Split to rows and columns
• Undo / redo
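The clustering and deduplication feature can be approximated with a key-collision approach: values that reduce to the same normalized key are grouped together. The Python sketch below is a simplified version of that idea, modeled loosely on OpenRefine's fingerprint keying; it is not OpenRefine's actual implementation, and the sample names are made up.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: lowercase, ASCII, no punctuation, sorted unique tokens."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical input: two spellings of the same person and one unrelated name.
names = ["Smith, John", "John Smith", "Jane Doe"]
clusters = cluster(names)
```

Each returned group is a candidate for merging; a domain expert still decides which spelling to keep.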


OpenRefine

• Community developed for 5 years
• Gridworks > Google Refine > OpenRefine
• 5,000+ monthly downloads
• Runs on a local machine
• Widely used among data journalists, libraries, and semantic web, open data and bioscience experts


RefinePro helps teams and organizations scale OpenRefine

• Training
• Cloud & on-premise hosting
• Integration & custom development


Demo

1. Toronto Build Permit Data Set:
   • Explore what data we have available
   • Geocode the address
2. Salesforce dump:
   • Remove duplicate names
   • Add Facebook and Twitter profiles via FullContact


OpenRefine in the Data Quality & Integration Pipeline

[Chart. Axis: Frequency - number of use cases. Stages: 1, 2, 3. Labels: Profiling, Preparation, Discovery, Data Wrangling. Use cases shown: Sense Making / Data Exploration ("Is the data useful? What can I do with it?").]


OpenRefine in the Data Quality & Integration Pipeline

[Chart. Axis: Frequency - number of use cases. Stages: 1, 2, 3. Labels: Profiling, Preparation, Discovery, Data Wrangling. Use cases shown: Sense Making / Data Exploration ("Is the data useful? What can I do with it?"); Personal ETL & Analysis; Prototype; One-time migration.]


OpenRefine in the Data Quality & Integration Pipeline

[Chart. Axis: Frequency - number of use cases. Stages: 1, 2, 3. Labels: Profiling, Preparation, Discovery, Data Wrangling. Use cases shown: Sense Making / Data Exploration ("Is the data useful? What can I do with it?"); Personal ETL & Analysis; Prototype; One-time migration; Big Data; Real-Time Processing; Enterprise ETL.]


OpenRefine in the Data Quality & Integration Pipeline

[Chart. Axis: Frequency - number of use cases. Stages: 1, 2, 3. Labels: Profiling, Preparation, Discovery, Data Wrangling. Skill axes: Understand the Data (Business Skills) and Know How To Transform Data (Technical Skills).]


Enable domain experts to explore, normalize and enrich their data via a self-service data preparation platform


OpenRefine Eco-System


OpenRefine Eco-System

Reconciliation services sit outside of Refine and enable users to align and enrich data against domain-specific master data.
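The idea of aligning values against master data can be sketched in a few lines of Python. This is not the OpenRefine reconciliation service protocol; it is a simplified stand-in that uses string similarity from the standard library, and the master list and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def reconcile(value, master, threshold=0.8):
    """Return (best_match, score); best_match is None when the score is below threshold."""
    best, best_score = None, 0.0
    for candidate in master:
        score = SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    matched = best if best_score >= threshold else None
    return matched, round(best_score, 2)


# Hypothetical master data set of city names.
master_cities = ["Toronto", "Ottawa"]
exact = reconcile("toronto", master_cities)
unknown = reconcile("Vancouver", master_cities)
```

Values that return None are the ones a domain expert would review manually or send to a richer reconciliation service.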


OpenRefine Eco-System

• Extensions add functionality to the Refine core.
• API processing plugins enable seamless data processing with API-based services.
• Batch processing libraries enable lightweight ETL processes.


OpenRefine Eco-System

New distributions focus on:
- domain-specific integration
- exploring new functionality
- hosted versions of Refine