2015 - Extract SF - Data Quality

It's Time to Start Caring About Data Quality: Data Quality at Scale

Upload: ignacio-elola-villar

Post on 11-Apr-2017

Page 1: 2015 - Extract SF - Data Quality

It's Time to Start Caring About Data Quality

Data Quality at Scale

Ignacio Elola

Everyone is talking about how useful data is

data can save your business

data can save your life

but...

all that is only true if you have the right data

data tend to be dirty and unstructured

especially web data!

Let’s start simple

I’ve created an extractor

I’ve passed a bunch of queries (bulk)

and got a dataset

How can you QA this data?

eyeballing

by eyeballing, we can find anomalies without having domain expertise

Quick summary:

- create extractors
- combine extractors
- schedule data extraction

What if we need to scale up?

if you have:

- more than ~3 data sources
- more than ~2 extractors per data source
- a big volume of queries
- pre- or post-processing

you will need:

- people to create and maintain extractors
- a process to clean and validate data

Data Quality

think about it pre and post data extraction!

tips and tricks to increase data quality

XPaths

//div[@id="priceBlock"]/table/tbody/tr/td[b/@class="priceLarge"]/b

better than

//*[@id="priceBlock"]/table/tbody/tr[2]/td[2]/b[1]
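
Why the first XPath holds up better can be shown with a small sketch. The two page fragments below are made up for illustration, and the expressions are simplified equivalents of the ones above, because Python's built-in ElementTree supports only a subset of XPath (a full XPath 1.0 engine should accept the originals unchanged). When an extra row appears, the attribute-based path still finds the price, while the positional path silently returns the wrong cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragments; V2 has an extra promo row inserted at the top.
PAGE_V1 = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>List price</td><td><b>$25.00</b></td></tr>
<tr><td>Our price</td><td><b class="priceLarge">$19.99</b></td></tr>
</tbody></table></div></body></html>
"""
PAGE_V2 = """
<html><body><div id="priceBlock"><table><tbody>
<tr><td>Promo</td><td><b>Free shipping</b></td></tr>
<tr><td>List price</td><td><b>$25.00</b></td></tr>
<tr><td>Our price</td><td><b class="priceLarge">$19.99</b></td></tr>
</tbody></table></div></body></html>
"""

# Simplified equivalents of the two expressions (ElementTree XPath subset).
BY_ATTRIBUTE = ".//div[@id='priceBlock']/table/tbody/tr/td/b[@class='priceLarge']"
BY_POSITION = ".//*[@id='priceBlock']/table/tbody/tr[2]/td[2]/b[1]"


def extract(page, xpath):
    """Return the text of the first node matching xpath, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


v1_attr = extract(PAGE_V1, BY_ATTRIBUTE)
v2_attr = extract(PAGE_V2, BY_ATTRIBUTE)
v1_pos = extract(PAGE_V1, BY_POSITION)
v2_pos = extract(PAGE_V2, BY_POSITION)
print(v1_attr, v2_attr, v1_pos, v2_pos)
```

The positional path does not fail loudly on the second page; it returns a plausible-looking but wrong value, which is the harder kind of error to catch later.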

Regex
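
The slides for this tip are images, so the following is only a guess at the kind of use intended: a regular expression that pulls a clean number out of a messy extracted string. The pattern, the `parse_price` helper and the sample strings are all assumptions made for illustration.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def parse_price(raw):
    """Return the first number found in raw as a float, or None if there is none."""
    match = NUMBER_RE.search(raw)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


p1 = parse_price("Our price: $1,299.99 (incl. tax)")
p2 = parse_price("$19.99")
p3 = parse_price("Call for price")
print(p1, p2, p3)
```

Returning `None` instead of raising keeps unparseable values visible, so they can be counted as a quality problem rather than dropped.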

Required column
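
A minimal sketch of what a required-column rule could look like when applied after extraction, assuming rows arrive as dictionaries. The function name, the column names and the sample rows are hypothetical and are not taken from the slides or from any particular tool.

```python
def is_blank(value):
    """True for None and for values that are empty once whitespace is stripped."""
    return value is None or str(value).strip() == ""


def split_by_required(rows, required):
    """Separate rows that fill every required column from rows that do not."""
    valid, rejected = [], []
    for row in rows:
        missing = [col for col in required if is_blank(row.get(col))]
        if missing:
            rejected.append(row)
        else:
            valid.append(row)
    return valid, rejected


rows = [
    {"name": "Widget", "price": "$19.99"},
    {"name": "Gadget", "price": ""},   # price extracted but empty
    {"name": "Gizmo"},                 # price column missing entirely
]
valid, rejected = split_by_required(rows, required=["name", "price"])
print(len(valid), len(rejected))
```

Keeping the rejected rows, rather than discarding them, makes it possible to inspect why an extractor failed on those pages.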

measuring data quality

completeness
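
The slide does not define the metric, so this sketch assumes one common reading: completeness is the fraction of rows that have a non-blank value in a given column. The helper names and sample data are illustrative only.

```python
def is_blank(value):
    """True for None and for values that are empty once whitespace is stripped."""
    return value is None or str(value).strip() == ""


def completeness(rows, columns):
    """Per column, the fraction of rows holding a non-blank value."""
    total = len(rows)
    result = {}
    for col in columns:
        filled = sum(1 for row in rows if not is_blank(row.get(col)))
        result[col] = filled / total if total else 0.0
    return result


rows = [
    {"name": "Widget", "price": "$19.99"},
    {"name": "Gadget", "price": ""},
    {"name": "Gizmo", "price": "$5.00"},
    {"name": "Doohickey", "price": "$7.50"},
]
scores = completeness(rows, ["name", "price"])
print(scores)
```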

coverage
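
Coverage is also left undefined on the slide. The sketch below assumes it means the share of submitted queries that came back with at least one row; other definitions (for example, share of a known catalogue that was captured) are equally plausible. The URLs and names are placeholders.

```python
def coverage(queries, rows_by_query):
    """Fraction of queries that returned at least one row."""
    if not queries:
        return 0.0
    answered = sum(1 for q in queries if rows_by_query.get(q))
    return answered / len(queries)


queries = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.com/d",
]
rows_by_query = {
    "http://example.com/a": [{"name": "Widget"}],
    "http://example.com/b": [],                     # query ran, nothing extracted
    "http://example.com/c": [{"name": "Gizmo"}],
    # "d" never produced a result at all
}
cov = coverage(queries, rows_by_query)
print(cov)
```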

post extraction data quality improvements?

how we do it

Smart automation

anomaly detection
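
The talk does not say which detection method is used. As one simple possibility, the sketch below flags numeric outliers with a modified z-score based on the median absolute deviation (MAD); the 3.5 threshold is a conventional default and the price list is invented.

```python
from statistics import median


def find_anomalies(values, threshold=3.5):
    """Return values whose MAD-based modified z-score exceeds threshold."""
    values = list(values)
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No measurable spread to compare against, so flag nothing.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


prices = [19.99, 21.50, 20.25, 18.75, 1999.0, 20.00]
anomalies = find_anomalies(prices)
print(anomalies)
```

A median-based score is used here because a single extreme value (such as a price with a misplaced decimal point) distorts the mean and standard deviation far more than the median.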

variance, variability, noise

normalization
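
What gets normalized is not spelled out on the slide. A minimal sketch for text fields, assuming the goal is to make equivalent strings compare equal: unify Unicode forms, collapse whitespace and ignore case. The function name and samples are illustrative.

```python
import re
import unicodedata


def normalize_text(value):
    """Unify Unicode forms, collapse runs of whitespace, and casefold."""
    text = unicodedata.normalize("NFKC", value)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()


a = normalize_text("  Acme\u00a0 Widget\n")   # contains a non-breaking space
b = normalize_text("ACME Widget")
print(a == b)
```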

confidence score
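
How the score is computed is not described in the deck. One simple way to build such a score, shown here purely as an assumption, is a weighted average of individual check results that each lie between 0 and 1. The check names and weights are hypothetical.

```python
def confidence_score(checks, weights):
    """Weighted average of check results; checks missing from the input count as 0."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(w * checks.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


checks = {"completeness": 0.75, "coverage": 0.5, "no_anomalies": 1.0}
weights = {"completeness": 2, "coverage": 1, "no_anomalies": 1}
score = confidence_score(checks, weights)
print(score)
```

Treating a missing check as 0 is a deliberately pessimistic choice: a dataset that was never measured should not look trustworthy by default.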

Human input

Transparency

summary

Data Quality is essential

think about it from the very beginning

develop a process to measure data quality before scaling up

if you don’t want to reinvent the wheel - contact us!

Thank you

[email protected]