Post on 21-May-2020

refinepro.com - @RefinePro – martin@refinepro.com 1

Enable domain experts to explore, normalize and enrich their data via a self-service data preparation platform

Garbage In – Garbage Out

Data Processing Pipeline

60 to 80% of data analysis time is spent on cleaning, transforming and integrating data

Data Processing Pipeline

Analytics need clean data to be reliable

Legacy data needs to be migrated to a new system

Data must be reconciled against a master data set

Data projects need access to reliable data quickly

• Messy and inaccurate data.

• Individuals and business units have unique data needs.

• New predictive and enrichment services are made available using an API-first approach.

• Current tools are challenged by the speed of those new requirements.

Data Integration Challenge

• Duplicate values & typos

• Multi-value cells

• Data in the wrong field

• Missing / partial values

• Encoding errors

• Change format (text, number, date)

• Flat to relational data set

• Schema alignment

• Transpose rows and columns

• Join data sets

• Enrichment from other sources (MDM, API calls)

Data Quality & Integration Is Time Consuming
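
To make two of the issues listed above concrete (multi-value cells and duplicates that differ only by case or spacing), here is a minimal Python sketch. The sample data, the column names and the `;` separator are assumptions made for illustration; they do not come from the slides.

```python
import csv
import io

def split_multi_value(rows, column, separator=";"):
    """Yield one output row per value found in a multi-value cell."""
    for row in rows:
        values = [v.strip() for v in row[column].split(separator) if v.strip()]
        # Keep rows whose cell is empty so missing values stay visible.
        for value in values or [""]:
            yield {**row, column: value}

def drop_duplicates(rows):
    """Remove rows that are identical after trimming and case-folding."""
    seen = set()
    for row in rows:
        key = tuple(str(v).strip().casefold() for v in row.values())
        if key not in seen:
            seen.add(key)
            yield row

# Hypothetical sample data: one multi-value cell, one near-duplicate, one missing value.
SAMPLE = """name,phones
Ann Lee,555-0100; 555-0101
ann lee,555-0100
Bob Roy,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = list(drop_duplicates(split_multi_value(rows, "phones")))
for row in cleaned:
    print(row)
```

Even this small case needs decisions (which separator, whether case matters, what to do with empty cells) that a domain expert is usually best placed to make.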

• Which fields should it contain?

• What format should it follow?

• What geographical scope should it support?

• Which data integrity rules should it enforce (e.g. postal code vs. city)?

Individual and business unit data needs are not consistent

What is clean data?

How do you define a clean address?
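
As one possible reading of the "postal code vs. city" rule, the sketch below checks a Canadian-style postal code format and compares its first letter with the city. The two-entry lookup table and the field names `postal_code` and `city` are illustrative assumptions; a real rule set would come from a postal reference data set and would differ from one business unit to the next.

```python
import re

# Illustrative lookup only, not a complete or authoritative mapping.
CITY_BY_PREFIX = {"M": "Toronto", "H": "Montreal"}

# Approximate Canadian postal code pattern (A1A 1A1), space optional.
POSTAL_CODE = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")

def check_address(record):
    """Return a list of rule violations for one address record."""
    problems = []
    code = record.get("postal_code", "").strip().upper()
    city = record.get("city", "").strip()
    if not POSTAL_CODE.match(code):
        problems.append("postal code does not match the A1A 1A1 format")
    else:
        expected = CITY_BY_PREFIX.get(code[0])
        if expected and expected.casefold() != city.casefold():
            problems.append(f"postal code {code} suggests {expected}, not {city}")
    return problems

print(check_address({"city": "Toronto", "postal_code": "M5V 2T6"}))
print(check_address({"city": "Toronto", "postal_code": "H2X 1Y4"}))
```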

• New economy of machine learning, predictive and data enrichment services:

• Geocoding and address cleaning

• Name recognition and extraction

• Churn prediction

• …

Those services come with an API-first approach that requires technical skills

Data Services Have an API-First Approach
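
To show the kind of technical skill an API-first service asks for, here is a minimal Python sketch of a geocoding call. The endpoint `https://api.example.com/geocode`, the `q` and `key` parameters and the response shape are all hypothetical; a real service documents its own. The `fetch` argument exists only so the example can run without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and response shape, used only for illustration.
GEOCODER_URL = "https://api.example.com/geocode"

def build_request_url(address, api_key):
    """Encode the address and key as query-string parameters."""
    return f"{GEOCODER_URL}?{urlencode({'q': address, 'key': api_key})}"

def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")

def geocode(address, api_key, fetch=http_get):
    """Return (lat, lng) for an address, or None when the service finds nothing."""
    payload = json.loads(fetch(build_request_url(address, api_key)))
    results = payload.get("results", [])
    if not results:
        return None
    return results[0]["lat"], results[0]["lng"]

def fake_fetch(url):
    """Stand-in for the network call, returning a canned response."""
    return json.dumps({"results": [{"lat": 43.65, "lng": -79.38}]})

print(geocode("100 Queen St W, Toronto", "demo-key", fetch=fake_fetch))
```

URL encoding, authentication keys, JSON parsing and empty results are routine for a developer but are a real barrier for a spreadsheet user.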

DBA

ETL

Data Science

Spreadsheet User

Data Visualization / Interpretation

User Base

Understand the Data

(Business Skills)

Know How To Transform Data (Technical Skills)

Today's data environment challenges traditional technologies

Excel doesn't scale or automate well

IT can't keep pace with the volume of requests

Agile Process

• Lets you adjust and adapt to a changing environment by working in iterations

• Stop at the right level of quality

Tools: Self Service Data Preparation

• Empower the domain experts

• Let them iterate faster through the process

New process & tools to the rescue!

Data Discovery & Profiling

Track / Measure

Data Consumption

Data Transformation

Agile - Incremental Data Processing

Data Discovery & Profiling

Place data in context

Test data service

Is it useful?

What Can I do with it?

Track / Measure

Check data integrity

Find quality gaps

Learn from your experience

Agile - Incremental Data Processing

Data Consumption

Analytics

Migration

Reconciliation

Data Transformation

Define strategy

Perform data preparation

Self Service Data Preparation Bridges The Skill Gap

DBA

ETL

Data Science

Spreadsheet User

Data Visualization / Interpretation

OpenRefine

Excel doesn't scale or automate well

IT can't keep pace with the volume of requests

User Base

Understand the Data

(Business Skills)

Know How To Transform Data (Technical Skills)

OpenRefine Functionality

XLS, CSV, JSON, XML Input & Output Support

Point & Click

Cluster & Deduplication

Filter & Sort

Transpose

Custom Query Language

Enrich data via APIs

Join, Merge & Reconcile

Split to rows and columns

Undo / Redo
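
To illustrate the idea behind "Cluster & Deduplication", here is a simplified Python sketch of key-collision clustering, modelled on the fingerprint keying approach described in OpenRefine's clustering documentation. It is an approximation written for this example, not OpenRefine's own code, and the sample names are made up.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key so that near-identical strings collide."""
    text = value.strip().lower()
    # Fold accented characters to their ASCII base letters.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then sort and de-duplicate the remaining tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

names = ["Acme Inc.", "acme inc", "Inc Acme", "Café Noir", "Cafe Noir", "Globex"]
for group in cluster(names):
    print(group)
```

Each printed group is a candidate merge; in a point-and-click tool the user reviews the groups and picks the value to keep.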

OpenRefine

Community developed for 5 years

Gridworks > Google Refine > OpenRefine

5,000+ monthly downloads

Runs on a local machine

Widely used among data journalism, library, semantic web, open data and bioscience experts.

Training

Cloud & on-premise hosting

Integration & Custom Development

RefinePro helps teams and organizations scale OpenRefine

Demo

1. Toronto Building Permit Data Set:

Explore what data we have available

Geocode the address

2. Salesforce dump:

Remove duplicate names

Add Facebook and Twitter profiles via FullContact

OpenRefine in the Data Quality & Integration Pipeline

[Chart: frequency (number of use cases) across stages 1, 2 and 3. Stage labels: Discovery, Profiling, Preparation, Data Wrangling. Callout: Sense Making, Data Exploration. Is the data useful? What can I do with it?]

OpenRefine in the Data Quality & Integration Pipeline

[Chart: frequency (number of use cases) across stages 1, 2 and 3. Stage labels: Discovery, Profiling, Preparation, Data Wrangling. Callouts: Sense Making, Data Exploration (Is the data useful? What can I do with it?); Personal ETL & Analysis, Prototype, One-time migration.]

OpenRefine in the Data Quality & Integration Pipeline

[Chart: frequency (number of use cases) across stages 1, 2 and 3. Stage labels: Discovery, Profiling, Preparation, Data Wrangling. Callouts: Sense Making, Data Exploration (Is the data useful? What can I do with it?); Personal ETL & Analysis, Prototype, One-time migration; Big Data, Real-Time Processing, Enterprise ETL.]

OpenRefine in the Data Quality & Integration Pipeline

[Chart: frequency (number of use cases) across stages 1, 2 and 3. Stage labels: Discovery, Profiling, Preparation, Data Wrangling. Axis labels: Understand the Data (Business Skills); Know How To Transform Data (Technical Skills).]

Enable domain experts to explore, normalize and enrich their data via a self-service data preparation platform

OpenRefine Eco-System

OpenRefine Eco-System

Reconciliation services sit outside of Refine and enable users to align and enrich data against domain-specific master data
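
As a rough illustration of what a reconciliation client does, the Python sketch below builds a batch of queries and picks the best candidate for each name. The `q0`, `q1` keys and the `query`, `limit`, `type`, `result`, `score` and `match` fields follow the general shape of the Reconciliation Service API as I understand it; the canned response, the ids and the score threshold of 80 are assumptions made for this example.

```python
import json

def build_queries(names, entity_type=None, limit=3):
    """Serialize one query per name, keyed q0, q1, ... for a batch request."""
    queries = {}
    for index, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if entity_type:
            query["type"] = entity_type
        queries[f"q{index}"] = query
    return json.dumps(queries)

def best_matches(names, response, min_score=80):
    """Map each name to its top candidate id, or None when nothing is good enough."""
    matched = {}
    for index, name in enumerate(names):
        candidates = response.get(f"q{index}", {}).get("result", [])
        top = max(candidates, key=lambda c: c["score"], default=None)
        accept = top is not None and (top.get("match") or top["score"] >= min_score)
        matched[name] = top["id"] if accept else None
    return matched

names = ["Toronto", "Trono"]
# Canned response standing in for a real service call (hypothetical ids and scores).
response = {
    "q0": {"result": [{"id": "city/1", "name": "Toronto", "score": 100, "match": True}]},
    "q1": {"result": [{"id": "city/1", "name": "Toronto", "score": 42, "match": False}]},
}
print(build_queries(names, entity_type="City"))
print(best_matches(names, response))
```

Names left as `None` are the ones a domain expert would review by hand against the master data.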

OpenRefine Eco-System

Extensions add functionality to Refine core.

API processing plugins enable seamless data processing with API-based services

Batch processing libraries enable lightweight ETL processes

OpenRefine Eco-System

New distributions focus on

- domain-specific integration

- exploring new functionality

- hosted versions of Refine
