data profiling revisited - tu berlin · data profiling revisited big data workshop @tu berlin...

39
Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Upload: others

Post on 04-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013

Felix Naumann

Page 2: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Profiling in Spreadsheets

Felix Naumann | Data Profiling | Big Data Workshop 2013

3

Page 3: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Felix Naumann | Data Profiling | Big Data Workshop 2013

4

Page 4: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Felix Naumann | Data Profiling | Big Data Workshop 2013

5

Page 5: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Felix Naumann | Data Profiling | Big Data Workshop 2013

6

Page 6: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Felix Naumann | Data Profiling | Big Data Workshop 2013

7

Page 7: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Felix Naumann | Data Profiling | Big Data Workshop 2013

8

Page 8: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Felix Naumann | Data Profiling | Big Data Workshop 2013

9

Page 9: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Many interesting questions remain

■ What are possible keys?

□ Phone

□ firstname, lastname, street

■ Are there any functional dependencies?

□ zip -> city

□ race -> voting behavior

■ Which columns correlate?

□ county and first name

□ DoB and last name

■ What are frequent patterns in a column?

□ ddddd

□ dd aaaa St

Page 10: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Definition Data Profiling

■ Data profiling is the process of examining the data available in an

existing data source [...] and collecting statistics and information

about that data.

Wikipedia 09/2013

■ Data profiling refers to the activity of creating small but

informative summaries of a database.

Ted Johnson, Encyclopedia of Database Systems

■ A fixed set of data profiling tasks / results

Felix Naumann | Data Profiling | Big Data Workshop 2013

11

Page 11: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

„Big“ Data Profiling or How big is „Big“?

Measuring the „Vs“

■ Volume

□ Row counts, etc.

■ Velocity

□ Temporal profiling

■ Variability

□ How difficult to integrate and analyse

■ Veracity

□ How good is it?

Felix Naumann | Data Profiling | Big Data Workshop 2013

12

Page 12: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Challenges of (Big) Data Profiling

Felix Naumann | Data Profiling | Big Data Workshop 2013

13

■ Computational complexity

□ Number of rows

□ Number of columns (and column combinations)

■ Large solution space

■ New data types (beyond strings and numbers)

■ New data models (beyond relational): RDF, XML, etc.

■ New requirements

□ User-oriented

□ Interactive

□ Streaming data

Page 13: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Agenda

14

■ Use cases for data profiling

■ Data profiling tasks

■ Data profiling tools

■ Advanced profiling

■ Profiling heterogeneous data

Felix Naumann | Data Profiling | Big Data Workshop 2013

Page 14: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Use Cases for Profiling

■ Query optimization

□ Counts and histograms

■ Data cleansing

□ Patterns and violations

■ Data integration

□ Cross-DB inclusion dependencies

■ Scientific data management

□ Handle new datasets

■ Data inspection, analytics, and mining

□ Profiling as preparation to decide on models and questions

■ Database reverse engineering

■ Data profiling as preparation for any other data management task

Felix Naumann | Data Profiling | Big Data Workshop 2013

15

Page 15: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Agenda

16

■ Use cases for data profiling

■ Data profiling tasks

■ Data profiling tools

■ Advanced profiling

■ Profiling heterogeneous data

Felix Naumann | Data Profiling | Big Data Workshop 2013

Page 16: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Classification of Traditional Profiling Tasks

Felix Naumann | Data Profiling | Big Data Workshop 2013

17

Data

pro

filing

Single column

Cardinalities

Patterns and data types

Value distributions

Multiple columns

Uniqueness

Key discovery

Conditional

Partial

Inclusion dependencies

Foreign key discovery

Conditional

Partial

Functional dependencies

Conditional

Partial

Page 17: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Single-column vs. multi-column

■ Single column profiling

□ Most basic form of data profiling

□ Assumption: All values are of same type

□ Assumption: All values have some common properties

◊ That are to be discovered

□ Often part of the basic statistics gathered by DBMS

□ Complexity: Number of values/rows

■ Multicolumn profiling

□ Discover joint properties

□ Discover dependencies

□ Complexity: Number of columns and number of values

Felix Naumann | Data Profiling | Big Data Workshop 2013

18

Page 18: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Cardinalities and distributions

■ Number of values / distinct values

■ Number of NULLs

■ MIN and MAX value

■ Distributions and histograms

Felix Naumann | Data Profiling | Big Data Workshop 2013

19

0

2

4

6

8

10

12

14

16

543,73,332,72,321,71,31

Grade distribution

Page 19: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Uniqueness and keys

■ Unique column

□ Only unique values

■ Unique column combination

□ Only unique value combinations

□ Minimality: No subset is unique

■ Uniques: {A, AB, AC, BC, ABC}

■ Minimal uniques: {A, BC}

■ (Maximal) Non-uniques: {B, C} Felix Naumann | Data Profiling | Big Data Workshop 2013

20

A B C

a 1 x

b 2 x

c 2 y

Page 20: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

TPCH – Uniques and Non-Uniques

Felix Naumann | Data Profiling | Big Data Workshop 2013

21 unique non-unique

8 columns

9 columns

10 columns

Page 21: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Scalable profiling

■ Scalability in number of rows

■ Scalability in number of columns

□ “Small” table with 100 columns:

2100 – 1 = 1,267,650,600,228,229,401,496,703,205,375

= 1.3 nonillion column combinations

◊ Impossible to check or even enumerate

■ Solutions

□ Scale up: More RAM, faster CPUs

◊ Expensive

□ Scale in: More cores

◊ More complex (threads)

□ Scale out: More machines

◊ Communication overhead

□ Intelligent enumeration and aggressive pruning

Felix Naumann | Data Profiling | Big Data Workshop 2013

22

Page 22: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Uniques and non-uniques in voter data

■ A unique: voter_reg_num, zip_code, race_code

■ A non-unique: voter_reg_num, status_cd, voter_status_desc, reason_cd, voter_status_reason_desc, absent_ind, name_prefx_cd, name_sufx_cd, half_code, street_dir, street_type_cd, street_sufx_cd, unit_designator, unit_num, state_cd, mail_addr2, mail_addr3, mail_addr4, mail_state, area_cd, phone_num, full_phone_number, drivers_lic, race_code, race_desc, ethnic_code, ethnic_desc, party_cd, party_desc, sex_code, sex, birth_place, precinct_abbrv, precinct_desc, municipality_abbrv, municipality_desc, ward_abbrv, ward_desc, cong_dist_abbrv, cong_dist_desc, super_court_abbrv, super_court_desc, judic_dist_abbrv, judic_dist_desc, nc_senate_abbrv, nc_senate_desc, nc_house_abbrv, nc_house_desc, county_commiss_abbrv, county_commiss_desc, township_abbrv, township_desc, school_dist_abbrv, school_dist_desc, fire_dist_abbrv, fire_dist_desc, water_dist_abbrv, water_dist_desc, sewer_dist_abbrv, sewer_dist_desc, sanit_dist_abbrv, sanit_dist_desc, rescue_dist_abbrv, rescue_dist_desc, munic_dist_abbrv, munic_dist_desc, dist_1_abbrv, dist_1_desc, dist_2_abbrv, dist_2_desc, confidential_ind, age, vtd_abbrv, vtd_desc

Felix Naumann | Data Profiling | Big Data Workshop 2013

23

Page 23: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Agenda

24

■ Use cases for data profiling

■ Data profiling tasks

■ Data profiling tools

■ Advanced profiling

■ Profiling heterogeneous data

Felix Naumann | Data Profiling | Big Data Workshop 2013

Page 24: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Tools have very long feature lists

Felix Naumann | Data Profiling | Big Data Workshop 2013

25

■ Num rows

■ Min value length

■ Median value length

■ Max value length

■ Avg value length

■ Precision of numeric values

■ Scale of numeric values

■ Quartiles

■ Basic data types

■ Num distinct values ("cardinality")

■ Percentage null values

■ Data class and data type

■ Uniqueness and constancy

■ Single-column frequency histogram

■ Multi-column frequency histogram

■ Pattern discovery (Aa9)

■ Soundex frequencies

■ Benford Law Frequency

■ Single column primary key discovery

■ Multi-column primary key discovery

■ Single column IND discovery

■ Inclusion percentage

■ Single-column FK discovery

■ Multi-column IND discovery

■ Multi-column FK discovery

■ Value overlap (cross domain analysis)

■ Single-column FD discovery

■ Multi-column FD discovery

■ Text profiling

Page 25: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

An Aside: Benford Law Frequency (“first digit law”)

■ Statement about the distribution of first digits d in (many)

naturally occurring numbers:

□ 𝑃 𝑑 = 𝑙𝑜𝑔10 𝑑 + 1 − 𝑙𝑜𝑔10 𝑑 = 𝑙𝑜𝑔10 1 + 1𝑑

□ Holds if log(x) is uniformly distributed

Felix Naumann | Data Profiling | Big Data Workshop 2013

26

0

20

40

1 2 3 4 5 6 7 8 9

Page 26: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Examples for Benford‘s Law

■ Surface areas of 335 rivers

■ Sizes of 3259 US populations

■ 104 physical constants

■ 1800 molecular weights

■ 5000 entries from a mathematical handbook

■ 308 numbers contained in an issue of Reader's Digest

■ Street addresses of the first 342 persons listed in American Men of Science

Felix Naumann | Data Profiling | Big Data Workshop 2013

27

Heights of the 60 tallest structures

Page 27: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Felix Naumann | Data Profiling | Big Data Workshop 2013

28

Screenshots from Talend

Page 28: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Screenshots from IBM Information Analyzer

Felix Naumann | Data Profiling | Big Data Workshop 2013

29

Page 29: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Typical shortcomings of tools (and research methods)

■ Usability

□ Complex to configure

□ Results complex to view and interpret

■ Scalability

□ Main-memory based

□ SQL based

■ Efficiency

□ Coffee, Lunch, Overnight

■ Functionality

□ Restricted to simplest tasks

□ Restricted to individual columns or small column sets

◊ “Realistic” key candidates vs. further use-cases

□ „Checking“ vs. „discovery“

Felix Naumann | Data Profiling | Big Data Workshop 2013

30

Page 30: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Agenda

31

■ Use cases for data profiling

■ Data profiling tasks

■ Data profiling tools

■ Advanced profiling

■ Profiling heterogeneous data

Felix Naumann | Data Profiling | Big Data Workshop 2013

Page 31: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Online profiling

■ Profiling is long procedure

□ Boring for developers

□ Expensive for machines (I/O and CPU)

■ Challenge: Display intermediate results

□ …of improving/converging accuracy

□ Allows early abort of profiling run

■ Gear algorithms toward that goal

□ Allow intermediate output

□ Enable early output: “progressive” profiling

Felix Naumann | Data Profiling | Big Data Workshop 2013

32

Page 32: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Incremental profiling

■ Data is dynamic

□ Insert (batch or tuple-based)

□ Updates

□ Deletes

■ Problem: Keep profiling results up-to-date…

□ … without re-profiling the entire data set.

□ Easy examples: SUM, MIN, MAX, COUNT, AVG

□ Difficult examples: MEDIAN, uniqueness, dependencies

Felix Naumann | Data Profiling | Big Data Workshop 2013

33

Page 33: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Agenda

Felix Naumann | Data Profiling | Big Data Workshop 2013

34

■ Use cases for data profiling

■ Data profiling tasks

■ Data profiling tools

■ Advanced profiling

■ Profiling heterogeneous data

Page 34: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Profiling for Integration

■ Profile multiple sources

■ Create measures to estimate integration (and cleansing) effort

□ Schema and data overlap

□ Severity of heterogeneity

■ Schema matching/mapping

□ What constitutes the “difficulty”

of matching/mapping?

■ Duplicate detection

□ Estimate data overlap

□ Estimate fusion effort

Felix Naumann | Data Profiling | Big Data Workshop 2013

35

Page 35: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Topic discovery and clustering

■ What is a dataset about?

□ Domain(s), topics, entity types

■ Single-topic datasets

□ GraceDB, IMDB, Sports databases, etc.

■ Multi-topic datasets

□ General purpose knowledge bases: YAGO, DBpedia

□ 154 million tables crawled from the Web (2008)

Felix Naumann | Data Profiling | Big Data Workshop 2013

36

Page 36: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Profiling new types of data

■ Traditional data profiling: Single table or multiple tables

■ More and more data in other models

□ XML / nested relational / JSON

□ RDF triples

□ Textual data: Blogs, Tweets, News

□ Multimedia data

■ Different models offer new dimensions to profile

□ XML: Nestedness, measures at different nesting levels

□ RDF: Graph structure, in- and outdegrees

□ Multimedia: Color, video-length, volume, etc.

□ Text: Sentiment, sentence structure, complexity, and other

linguistic measures

Felix Naumann | Data Profiling | Big Data Workshop 2013

37

Page 37: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Average sentence length

Felix Naumann | Data Profiling | Big Data Workshop 2013

38

„Literature Fingerprinting: A New Method for Visual Literary Analysis” by Daniel A. Keim and Daniela Oelke

Page 38: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Hapax Legomena

Felix Naumann | Data Profiling | Big Data Workshop 2013

39

„Literature Fingerprinting: A New Method for Visual Literary Analysis” by Daniel A. Keim and Daniela Oelke

Page 39: Data Profiling Revisited - TU Berlin · Data Profiling Revisited Big Data Workshop @TU Berlin 20.9.2013 Felix Naumann

Summary

40

■ Use cases for data profiling

■ Data profiling tasks

■ Data profiling tools

■ Advanced profiling

■ Profiling heterogeneous data

Felix Naumann | Data Profiling | Big Data Workshop 2013