Advanced Data Profiling Introduction · 2017-10-16

Prof. Dr. Felix Naumann, Thorsten Papenbrock, Tobias Bleifuß, Hazar Harmouch and Lan Jiang

WS2017/18

Image credit: NASA/Ames/JPL-Caltech

Definition Data Profiling

■ Data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data.

■ Wikipedia 09/2013

■ Data profiling refers to the activity of creating small but informative summaries of a database.

■ Ted Johnson, Encyclopedia of Database Systems

■ A fixed set of data profiling tasks / results

Advanced Data Profiling Introduction WS2017/18

Slide 2

Data Profiling: The meaning of “knowing data”

Name     Type         Equatorial diameter  Mass    Orbital radius  Orbital period  Rotation period  Confirmed moons  Rings  Atmosphere
Mercury  Terrestrial  0.382                0.06    0.47            0.24            58.64            0                no     minimal
Venus    Terrestrial  0.949                0.82    0.72            0.62            −243.02          0                no     CO2, N2
Earth    Terrestrial  1.000                1.00    1.00            1.00            1.00             1                no     N2, O2, Ar
Mars     Terrestrial  0.532                0.11    1.52            1.88            1.03             2                no     CO2, N2, Ar
Jupiter  Giant        11.209               317.80  5.20            11.86           0.41             67               yes    H2, He
Saturn   Giant        9.449                95.2    9.54            29.46           0.43             62               yes    H2, He
Uranus   Giant        4.007                14.6    19.22           84.01           −0.72            27               yes    H2, He
Neptune  Giant        3.883                17.2    30.06           164.8           0.67             14               yes    H2, He

Annotations on the table:

■ format

■ size

■ data types: CHAR(32), CHAR(16), FLOAT, INTEGER, BOOLEAN, VARCHAR

■ range: min = 0.382, max = 11.209

■ aggregation: sum = 173, avg = 21.625

■ distribution (histogram)

Slide 3
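The single-column statistics annotated on this slide can be reproduced in a few lines of Python. This is a minimal sketch and not part of the lecture material: the `planets` list and the helper name `profile_numeric` are illustrative assumptions, and only the name, equatorial diameter and confirmed moons columns are transcribed from the table.

```python
from collections import Counter

# (name, equatorial diameter, confirmed moons), transcribed from the planet table.
planets = [
    ("Mercury", 0.382, 0), ("Venus", 0.949, 0), ("Earth", 1.000, 1),
    ("Mars", 0.532, 2), ("Jupiter", 11.209, 67), ("Saturn", 9.449, 62),
    ("Uranus", 4.007, 27), ("Neptune", 3.883, 14),
]


def profile_numeric(values):
    """Return basic single-column statistics: range and aggregations."""
    total = sum(values)
    return {"min": min(values), "max": max(values),
            "sum": total, "avg": total / len(values)}


diameter = profile_numeric([row[1] for row in planets])
moons = profile_numeric([row[2] for row in planets])

# Value distribution: how many planets share each moon count.
moon_distribution = Counter(row[2] for row in planets)

print(diameter["min"], diameter["max"])  # range of the diameter column
print(moons["sum"], moons["avg"])        # aggregations of the moons column
```

The range on the slide matches the equatorial diameter column, and the sum and average match the confirmed moons column.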

Data Profiling: The meaning of “knowing data”

Name     Type         Equatorial diameter  Mass    Orbital radius  Orbital period  Rotation period  Confirmed moons  Rings  Atmosphere
Mercury  Terrestrial  0.382                0.06    0.47            0.24            58.64            0                no     minimal
Venus    Terrestrial  0.949                0.82    0.72            0.62            −243.02          0                no     CO2, N2
Earth    Terrestrial  1.000                1.00    1.00            1.00            1.00             1                no     N2, O2, Ar
Mars     Terrestrial  0.532                0.11    1.52            1.88            1.03             2                no     CO2, N2, Ar
Jupiter  Giant        11.209               317.80  5.20            11.86           0.41             67               yes    H2, He
Saturn   Giant        9.449                95.2    9.54            29.46           0.43             62               yes    H2, He
Uranus   Giant        4.007                14.6    19.22           84.01           −0.72            27               yes    H2, He
Neptune  Giant        3.883                17.2    30.06           164.8           0.67             14               yes    H2, He

Annotations on the table:

■ keys

■ unique

■ relationships: Mass ~ Confirmed moons

■ rules: Equatorial diameter × Mass > 0

■ intra table dependencies: Atmosphere → Rings

■ inter table dependencies: Moon.Planet ⊆ Planet.Name

Slide 4
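Two of the multi-column properties on this slide, uniqueness and the dependency Atmosphere → Rings, can be checked directly against the planet table. The following is a rough sketch under stated assumptions: the helper names `is_unique` and `holds_fd` are made up for illustration, and only four columns of the table are transcribed.

```python
# (name, type, rings, atmosphere), transcribed from the planet table.
planets = [
    ("Mercury", "Terrestrial", "no", "minimal"),
    ("Venus", "Terrestrial", "no", "CO2, N2"),
    ("Earth", "Terrestrial", "no", "N2, O2, Ar"),
    ("Mars", "Terrestrial", "no", "CO2, N2, Ar"),
    ("Jupiter", "Giant", "yes", "H2, He"),
    ("Saturn", "Giant", "yes", "H2, He"),
    ("Uranus", "Giant", "yes", "H2, He"),
    ("Neptune", "Giant", "yes", "H2, He"),
]
NAME, TYPE, RINGS, ATMOSPHERE = range(4)


def is_unique(rows, col):
    """A column is unique (a key candidate) if no value occurs twice."""
    values = [row[col] for row in rows]
    return len(set(values)) == len(values)


def holds_fd(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs on the given rows."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs value seen for this lhs value;
        # any later, different rhs value violates the dependency.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


print(is_unique(planets, NAME))             # True
print(is_unique(planets, TYPE))             # False
print(holds_fd(planets, ATMOSPHERE, RINGS))  # True
print(holds_fd(planets, RINGS, ATMOSPHERE))  # False
```

The dependency holds only in one direction: every atmosphere value determines a single rings value, but planets without rings have several different atmospheres.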

Classification of Traditional Profiling Tasks

Data profiling

■ Single column

□ Cardinalities

□ Patterns and data types

□ Value distributions

■ Multiple columns

□ Uniqueness

– Key discovery

– Conditional

– Partial

□ Inclusion dependencies

– Foreign key discovery

– Conditional

– Partial

□ Functional dependencies

– Conditional

– Partial

Slide 5

Data profiling vs. data mining

■ Data profiling gathers technical metadata to support data management

■ Data mining and data analytics discover non-obvious results to support business management

■ Data profiling results: information about columns and column sets

■ Data mining results: information about rows or row sets

□ clustering, summarization, association rules, …

■ Rahm and Do, 2000

□ Profiling: Individual attributes

□ Mining: Multiple attributes

Slide 6

Inclusion Dependencies: Definition

■ INDs (typically) involve more than one relation.

■ Let D be a relational schema and let I be an instance of D.

■ R[A1, …, An] denotes the projection of I on attributes A1, …, An of relation R: R[A1, …, An] = π_{A1, …, An}(R)

■ IND R[A1, …, An] ⊆ S[B1, …, Bn], where R, S are (possibly identical) relations of D.

□ Projections on R and S must have the same number of attributes.

□ An instance I of D satisfies an IND if I(R)[A1, …, An] ⊆ I(S)[B1, …, Bn]

□ IND is maximal if R[XA] ⊆ S[YB] is invalid for any A ∈ R, B ∈ S

□ Values of R: “dependent values”

□ Values of S: “referenced values”

■ Task: Find all maximal, non-trivial INDs

□ Typical assumptions: No repeating attributes, disjoint LHS and RHS

Slide 7

Example

■ Each Title in Showings should appear as a Title in Movies

□ Showings[Title] ⊆ Movies[Title]

■ Also known as “referential integrity”

□ Referenced attributes need not be a key (or unique)

□ Foreign key: helps prune candidates

Slide 8
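An IND check boils down to a subset test between two projections. The sketch below illustrates this for the Showings/Movies example. The relation and attribute names come from the slide, but the rows, the `Year` and `Cinema` attributes, and the helpers `project` and `satisfies_ind` are assumptions made up for illustration.

```python
def project(rows, attributes):
    """Return the set of value tuples of `rows` restricted to `attributes`."""
    return {tuple(row[a] for a in attributes) for row in rows}


def satisfies_ind(dependent, dep_attrs, referenced, ref_attrs):
    """Check dependent[dep_attrs] ⊆ referenced[ref_attrs] on the given rows."""
    if len(dep_attrs) != len(ref_attrs):
        raise ValueError("both sides of an IND need the same number of attributes")
    return project(dependent, dep_attrs) <= project(referenced, ref_attrs)


# Hypothetical sample rows; Title is deliberately not unique in Movies.
movies = [
    {"Title": "Alien", "Year": 1979},
    {"Title": "Solaris", "Year": 1972},
    {"Title": "Solaris", "Year": 2002},
    {"Title": "Stalker", "Year": 1979},
]
showings = [
    {"Title": "Alien", "Cinema": "Thalia"},
    {"Title": "Solaris", "Cinema": "Thalia"},
]

print(satisfies_ind(showings, ["Title"], movies, ["Title"]))  # True
print(satisfies_ind(movies, ["Title"], showings, ["Title"]))  # False
```

The first IND holds although `Title` repeats in `movies`, which mirrors the remark that referenced attributes need not be unique.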

Inference rules for INDs

■ Reflexivity: R[X] ⊆ R[X]

■ Projection:

□ R[A1, …, An] ⊆ S[B1, …, Bn] => R[Ai1, …, Aim] ⊆ S[Bi1, …, Bim] for each sequence i1, …, im of integers in {1, …, n}

■ Transitivity:

□ R[X] ⊆ S[Y] and S[Y] ⊆ T[Z] => R[X] ⊆ T[Z]

□ Example: “transitive foreign keys” for 1:1 relationships

Slide 9
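The projection and transitivity rules can be applied mechanically to already known INDs. The following sketch is only an illustration under stated assumptions: INDs are modelled as plain Python pairs, and the helper names `project_ind` and `transitive_closure` are hypothetical.

```python
def project_ind(lhs, rhs, positions):
    """Projection rule: pick the same positions on both sides of a valid IND."""
    return [lhs[i] for i in positions], [rhs[i] for i in positions]


def transitive_closure(inds):
    """Derive all INDs implied by transitivity (reflexive ones are omitted).

    `inds` is a set of (dependent, referenced) pairs.
    """
    closure = set(inds)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a ⊆ b and b ⊆ d imply a ⊆ d
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


known = {("R[X]", "S[Y]"), ("S[Y]", "T[Z]")}
print(transitive_closure(known) - known)  # {('R[X]', 'T[Z]')}
print(project_ind(["A1", "A2", "A3"], ["B1", "B2", "B3"], [2, 0]))
```

A discovery algorithm could use such inferred INDs to skip checks whose outcome is already implied.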

IND types

■ Unary INDs

□ INDs on single attributes: R[A] ⊆ S[B]

■ n-ary INDs

□ INDs on multiple attributes: R[X] ⊆ S[Y]

□ |X| = |Y|

■ Partial INDs

□ IND R[A] ⊆ S[B] is satisfied for x% of all tuples in R

□ IND R[A] ⊆ S[B] is satisfied for all but x tuples in R

■ Approximate INDs

□ IND R[A] ⊆ S[B] is satisfied with probability p.

□ Based on sampling or other heuristics

Slide 10

Examples

■ Unary: R[C] ⊆ S[F]

■ N-ary: R[B,C] ⊆ S[G,F]

■ Partial: R[A] ⊆ S[F] for 75% of the tuples in R

■ Approximate: R[BA] ⊆ S[GH]

R
A  B  C
1  x  1
2  x  1
3  y  2
5  z  4

S
F  G  H
1  x  1
2  y  3
3  z  4
4  z  4

Slide 11
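The examples on this slide can be verified by measuring, for each candidate, the fraction of dependent tuples whose projected values occur on the referenced side. This is a minimal sketch: the tables R and S are transcribed from the slide, while the helper name `coverage` and the column-index constants are assumptions for illustration.

```python
R = [  # columns A, B, C
    (1, "x", 1), (2, "x", 1), (3, "y", 2), (5, "z", 4),
]
S = [  # columns F, G, H
    (1, "x", 1), (2, "y", 3), (3, "z", 4), (4, "z", 4),
]
A, B, C = 0, 1, 2
F, G, H = 0, 1, 2


def coverage(dependent, dep_cols, referenced, ref_cols):
    """Fraction of dependent tuples whose projection occurs in the referenced projection."""
    ref_values = {tuple(row[c] for c in ref_cols) for row in referenced}
    hits = sum(tuple(row[c] for c in dep_cols) in ref_values for row in dependent)
    return hits / len(dependent)


print(coverage(R, [C], S, [F]))        # 1.0  -> the unary IND holds
print(coverage(R, [B, C], S, [G, F]))  # 1.0  -> the n-ary IND holds
print(coverage(R, [A], S, [F]))        # 0.75 -> partial IND, 3 of 4 tuples
print(coverage(R, [B, A], S, [G, H]))  # 0.5
```

The last call only reports the tuple-level coverage of the candidate that the slide labels “Approximate”. It does not model the probabilistic, sampling-based notion of approximate INDs.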

Motivation for IND discovery

■ General insight into data

■ Detect unknown foreign keys

■ Example

□ PDB: Protein Data Bank

□ OpenMMS provides relational schema

– Parses protein and nucleic acid macromolecular structure data from the standard mmCIF format.

□ 175 tables with primary key constraints

□ 2705 attributes

□ But: Not a single foreign key constraint!

Slide 12

Motivation for IND discovery

■ Ensembl – genome database

□ Shipped as MySQL dump files

□ More than 200 tables

□ Not a single foreign key constraint!

■ Web tables: No schema, no constraints, but many connections

■ Why are FKs missing?

□ Lack of support for checking foreign key constraints in the host system

– Example: Oracle did not support FKs up to v6

□ Fear that checking such constraints would impede database performance

□ Lack of database knowledge within the development team

□ Dirty data prevents setting the constraint

Slide 13

Unary IND detection complexity

Name     Type         Equatorial diameter  Mass    Orbital radius  Orbital period  Rotation period  Confirmed moons  Rings  Atmosphere
Mercury  Terrestrial  0.382                0.06    0.47            0.24            58.64            0                no     minimal
Venus    Terrestrial  0.949                0.82    0.72            0.62            −243.02          0                no     CO2, N2
Earth    Terrestrial  1.000                1.00    1.00            1.00            1.00             1                no     N2, O2, Ar
Mars     Terrestrial  0.532                0.11    1.52            1.88            1.03             2                no     CO2, N2, Ar
Jupiter  Giant        11.209               317.8   5.20            11.86           0.41             67               yes    H2, He
Saturn   Giant        9.449                95.2    9.54            29.46           0.43             62               yes    H2, He
Uranus   Giant        4.007                14.6    19.22           84.01           −0.72            27               yes    H2, He
Neptune  Giant        3.883                17.2    30.06           164.8           0.67             14               yes    H2, He

■ Name ⊆ Type ?
■ Name ⊆ Equatorial_diameter ?
■ Name ⊆ Mass ?
■ Name ⊆ Orbital_radius ?
■ Name ⊆ Orbital_period ?
■ Name ⊆ Rotation_period ?
■ Name ⊆ Confirmed_moons ?
■ Name ⊆ Rings ?
■ Name ⊆ Atmosphere ?

■ Type ⊆ Name ?
■ Type ⊆ Equatorial_diameter ?
■ Type ⊆ Mass ?
■ Type ⊆ Orbital_radius ?
■ Type ⊆ Orbital_period ?
■ Type ⊆ Rotation_period ?
■ Type ⊆ Confirmed_moons ?
■ Type ⊆ Rings ?
■ Type ⊆ Atmosphere ?

■ Mass ⊆ Name ?
■ Mass ⊆ Type ?
■ Mass ⊆ Equatorial_diameter ?
■ …

Complexity: O(n² − n) for n attributes

Example: 10 attributes ≈ 90 checks; 1,000 attributes ≈ 999,000 checks

Slide 14
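The candidate list on this slide is the set of all ordered pairs of distinct attributes, which is where the n² − n comes from. The sketch below enumerates them for the planet table. The helper names `unary_candidates` and `count_unary_checks` are illustrative assumptions.

```python
from itertools import permutations


def unary_candidates(attributes):
    """All ordered pairs (dependent, referenced) of distinct attributes."""
    return list(permutations(attributes, 2))


def count_unary_checks(n):
    """Number of unary IND candidates for n attributes: n^2 - n."""
    return n * n - n


columns = ["Name", "Type", "Equatorial_diameter", "Mass", "Orbital_radius",
           "Orbital_period", "Rotation_period", "Confirmed_moons", "Rings",
           "Atmosphere"]

candidates = unary_candidates(columns)
print(len(candidates))            # 90
print(candidates[0])              # ('Name', 'Type')
print(count_unary_checks(1000))   # 999000
```

Each pair still requires a comparison of the two columns' values, so the number of candidates is only a lower bound on the actual work.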

N-ary IND detection complexity

[Figure: lattice of all attribute combinations over the attributes A, B, C, D, E (levels A … E, AB … DE, ABC … CDE, ABCD … BCDE, ABCDE), with sides labelled X and Y]

■ Test combination with all other combinations of same size!

■ No n-ary INDs here! Why?

■ X ⊆ Y : X ∩ Y = ∅

■ IND candidates in level k: $\binom{n}{k} \cdot \binom{n-k}{k} \cdot k!$

□ $\binom{n}{k}$: nodes

□ $\binom{n-k}{k}$: other, non-overlapping nodes

□ $k!$: all permutations

Slide 15

N-ary IND detection complexity

Number of IND candidates per level k (rows) and number of attributes n (columns); "-" marks cells with k > n.

k \ n  1  2  3   4   5    6     7     8      9      10      11       12        13         14         15
total  0  2  6  24  80  330  1302  5936  26784  133650  669350  3609672  19674096  113525594  664400310
7      -  -  -   -   -    -     0     0      0       0       0        0         0   17297280  259459200
6      -  -  -   -   -    0     0     0      0       0       0   665280   8648640   60540480  302702400
5      -  -  -   -   0    0     0     0      0   30240  332640  1995840   8648640   30270240   90810720
4      -  -  -   0   0    0     0  1680  15120   75600  277200   831600   2162160    5045040   10810800
3      -  -  0   0   0  120   840  3360  10080   25200   55440   110880    205920     360360     600600
2      -  0  0  12  60  180   420   840   1512    2520    3960     5940      8580      12012      16380
1      0  2  6  12  20   30    42    56     72      90     110      132       156        182        210

(The rows for k = 8 … 15 contain only zeros.)

Slide 16
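The numbers in this table follow from the level-wise candidate formula, nodes times other non-overlapping nodes times all permutations. As a hedged sketch, the function names `candidates_at_level` and `total_candidates` below are assumptions for illustration; the formula itself is the one given for IND candidates in level k.

```python
from math import comb, factorial


def candidates_at_level(n, k):
    """IND candidates in level k for n attributes: C(n, k) * C(n - k, k) * k!."""
    if 2 * k > n:
        # No disjoint combination of the same size is left for the other side.
        return 0
    return comb(n, k) * comb(n - k, k) * factorial(k)


def total_candidates(n):
    """Sum of the candidates over all levels 1 .. n."""
    return sum(candidates_at_level(n, k) for k in range(1, n + 1))


print(candidates_at_level(10, 1))  # 90
print(candidates_at_level(15, 7))  # 259459200
print(total_candidates(15))        # 664400310
```

The guard for 2k > n reflects the disjointness assumption X ∩ Y = ∅: once a side uses more than half of the attributes, no candidate remains.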

Algorithms

■ Bell&Brockhausen: Siegfried Bell and Peter Brockhausen. “Discovery of Data Dependencies in Relational Databases.” Statistics, Machine Learning and Knowledge Discovery in Databases, ML–Net Familiarization Workshop, 53–58, 1995.

■ Zigzag: Fabien De Marchi and Jean-Marc Petit. “Zigzag: A New Algorithm for Mining Large Inclusion Dependencies in Databases.” In Proceedings of the International Conference on Data Mining (ICDM), 27–34, 2003.

■ FIND2: Andreas Koeller and Elke A. Rundensteiner. “Discovery of High-Dimensional Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 683–685, 2003.

Slide 17

Algorithms

■ SPIDER: Jana Bauckmann, Ulf Leser, Felix Naumann, and Veronique Tietz. “Efficiently Detecting Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 1448–1450, 2007.

■ deMarchi/MIND: Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. “Unary and N-Ary Inclusion Dependency Discovery in Relational Databases.” Journal of Intelligent Information Systems 32 (1): 53–73, 2009.

■ BINDER: Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann. “Divide & Conquer-Based Inclusion Dependency Discovery.” In Proceedings of the VLDB Endowment, 8:774–785, 2015.

Slide 18

Algorithms

■ S-INDD: Nuhad Shaabani and Christoph Meinel. “Scalable Inclusion Dependency Discovery.” In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 425–440, 2015.

■ SINDY: Sebastian Kruse, Thorsten Papenbrock, and Felix Naumann. “Scaling Out the Discovery of Inclusion Dependencies.” In Proceedings of the Conference Database Systems for Business, Technology and Web (BTW), 445–454, 2015.

■ MIND2: Nuhad Shaabani and Christoph Meinel. “Detecting Maximum Inclusion Dependencies without Candidate Generation.” In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), 118–133, 2016.

Slide 19

Metanome: An extensible architecture

[Figure: Metanome architecture diagram]

§ Algorithm execution

§ Result & Resource management

§ Algorithm configuration

§ Result & Resource presentation

Diagram labels: Configuration, Resource Links, Results

Algorithms (jar): SPIDER, DUCC, BINDER, DFD, SWAN

Data sources: txt, tsv, xml, csv, DB2, MySQL

Slide 20

Organisation

■ Group allocation

■ Study your algorithm(s)

■ Present your algorithms

■ Implement your algorithms

■ Present your implementations

■ 6 participants, 3 teams of 2 students

■ Prepare experiments to evaluate and compare your algorithms

■ Swap algorithms (around Christmas)

■ Run your experiments for all algorithms

■ Improve implementations of algorithms

■ Present improvements

■ Implementation freeze

■ Final paper writing

■ Write algorithm descriptions for paper

■ Describe your results

Slide 21

Grading

10%  Active participation in meetings and discussions
15%  Initial presentation of your algorithm(s)
20%  Implementation of your algorithm(s) using the Metanome interface
15%  Presentation of your implementation
10%  Implementation of improvements to another team’s algorithm(s)
10%  Presentation of your improvements
20%  Final paper-style submission

Slide 22

Further Procedure

■ To apply for this seminar (binding):

□ Send an email to [email protected]

□ Deadline: 24.10.2017 23:59

□ In case of too many applications, we will choose randomly

■ Meeting next week: Data Profiling / Efficient Java Code

■ 30.10.17: Group allocation

Slide 23