advanced data profiling introduction · 2017-10-16 · profiling introduction ws2017/18 data...

Advanced Data Profiling: Introduction

Prof. Dr. Felix Naumann, Thorsten Papenbrock, Tobias Bleifuß, Hazar Harmouch and Lan Jiang

WS2017/18

Image credit: NASA/Ames/JPL-Caltech

Definition Data Profiling

■ “Data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data.”
□ Wikipedia 09/2013
■ “Data profiling refers to the activity of creating small but informative summaries of a database.”
□ Ted Johnson, Encyclopedia of Database Systems
■ A fixed set of data profiling tasks / results

Slide 2

Data Profiling: The meaning of “knowing data”

Name     Type         Equatorial diameter  Mass    Orbital radius  Orbital period  Rotation period  Confirmed moons  Rings  Atmosphere
Mercury  Terrestrial  0.382                0.06    0.47            0.24            58.64            0                no     minimal
Venus    Terrestrial  0.949                0.82    0.72            0.62            −243.02          0                no     CO2, N2
Earth    Terrestrial  1.000                1.00    1.00            1.00            1.00             1                no     N2, O2, Ar
Mars     Terrestrial  0.532                0.11    1.52            1.88            1.03             2                no     CO2, N2, Ar
Jupiter  Giant        11.209               317.80  5.20            11.86           0.41             67               yes    H2, He
Saturn   Giant        9.449                95.2    9.54            29.46           0.43             62               yes    H2, He
Uranus   Giant        4.007                14.6    19.22           84.01           −0.72            27               yes    H2, He
Neptune  Giant        3.883                17.2    30.06           164.8           0.67             14               yes    H2, He

■ format
■ size
■ data types: CHAR(32), CHAR(16), FLOAT, BOOLEAN, VARCHAR, FLOAT, FLOAT, FLOAT, FLOAT, INTEGER
■ range (Equatorial diameter): min = 0.382, max = 11.209
■ aggregation (Confirmed moons): sum = 173, avg = 21.625
■ distribution
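The range and aggregation figures annotated on the slide can be reproduced with a few lines of Python. This is a minimal sketch, not part of the lecture material: `profile_numeric` is a hypothetical helper name, and the two value lists are copied from the planet table.

```python
from statistics import mean

# Two columns of the planet table (values copied from the slide).
equatorial_diameter = [0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883]
confirmed_moons = [0, 0, 1, 2, 67, 62, 27, 14]


def profile_numeric(values):
    """Collect basic single-column statistics (hypothetical helper)."""
    return {
        "min": min(values),            # lower end of the value range
        "max": max(values),            # upper end of the value range
        "sum": sum(values),            # aggregation
        "avg": mean(values),           # aggregation
        "distinct": len(set(values)),  # cardinality of the column
    }


diameter_profile = profile_numeric(equatorial_diameter)
moons_profile = profile_numeric(confirmed_moons)
```

The range on the slide matches the Equatorial diameter column, while sum and average match the Confirmed moons column.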

Slide 3

Data Profiling: The meaning of “knowing data”

For the same planet table:

■ keys, unique
■ relationships: Mass ~ Confirmed moons
■ rules: Equatorial diameter × Mass > 0
■ intra-table dependencies: Atmosphere → Rings
■ inter-table dependencies: Moon.Planet ⊆ Planet.Name
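Two of the multi-column properties named here, uniqueness and the dependency Atmosphere → Rings, can be checked directly on the planet table. The following is a minimal sketch under the assumption that the table is held in memory as a list of dictionaries; `is_unique` and `fd_holds` are hypothetical helper names.

```python
# Four columns of the planet table (values copied from the slide).
planets = [
    {"Name": "Mercury", "Type": "Terrestrial", "Rings": "no", "Atmosphere": "minimal"},
    {"Name": "Venus", "Type": "Terrestrial", "Rings": "no", "Atmosphere": "CO2, N2"},
    {"Name": "Earth", "Type": "Terrestrial", "Rings": "no", "Atmosphere": "N2, O2, Ar"},
    {"Name": "Mars", "Type": "Terrestrial", "Rings": "no", "Atmosphere": "CO2, N2, Ar"},
    {"Name": "Jupiter", "Type": "Giant", "Rings": "yes", "Atmosphere": "H2, He"},
    {"Name": "Saturn", "Type": "Giant", "Rings": "yes", "Atmosphere": "H2, He"},
    {"Name": "Uranus", "Type": "Giant", "Rings": "yes", "Atmosphere": "H2, He"},
    {"Name": "Neptune", "Type": "Giant", "Rings": "yes", "Atmosphere": "H2, He"},
]


def is_unique(rows, column):
    """A column is unique if no value occurs twice (key candidate)."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def fd_holds(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs on single columns."""
    seen = {}  # maps each lhs value to the first rhs value seen with it
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False  # same lhs value, different rhs value
    return True


name_is_key_candidate = is_unique(planets, "Name")
atmosphere_determines_rings = fd_holds(planets, "Atmosphere", "Rings")
```

On eight rows such checks are trivial; the point of the seminar algorithms is to make them feasible when the number of column combinations explodes.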

Slide 4

Classification of Traditional Profiling Tasks


Data profiling
■ Single column
□ Cardinalities
□ Patterns and data types
□ Value distributions
■ Multiple columns
□ Uniqueness
– Key discovery
– Conditional
– Partial
□ Inclusion dependencies
– Foreign key discovery
– Conditional
– Partial
□ Functional dependencies
– Conditional
– Partial

Slide 5

Data profiling vs. data mining

■ Data profiling gathers technical metadata to support data management
■ Data mining and data analytics discover non-obvious results to support business management
■ Data profiling results: information about columns and column sets
■ Data mining results: information about rows or row sets
□ clustering, summarization, association rules, …
■ Rahm and Do, 2000
□ Profiling: Individual attributes
□ Mining: Multiple attributes

Slide 6

Inclusion Dependencies: Definition

■ INDs (typically) involve more than one relation.
■ Let D be a relational schema and let I be an instance of D.
■ R[A1, …, An] denotes the projection of I on attributes A1, …, An of relation R: R[A1, …, An] = π_{A1, …, An}(R)
■ IND R[A1, …, An] ⊆ S[B1, …, Bn], where R, S are (possibly identical) relations of D.
□ The projections on R and S must have the same number of attributes.
□ An instance I of D satisfies an IND if I(R)[A1, …, An] ⊆ I(S)[B1, …, Bn]
□ An IND R[X] ⊆ S[Y] is maximal if R[XA] ⊆ S[YB] is invalid for any A ∈ R, B ∈ S
□ Values of R: “dependent values”
□ Values of S: “referenced values”
■ Task: Find all maximal, non-trivial INDs
□ Typical assumptions: No repeating attributes, disjoint LHS and RHS
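The definition translates almost literally into code: project both relations and test set containment. The sketch below assumes relations are lists of dictionaries; `project`, `satisfies_ind`, and the sample rows are illustrative assumptions, not taken from the slides.

```python
def project(rows, attributes):
    """Projection R[A1, ..., An] as a set of value tuples."""
    return {tuple(row[a] for a in attributes) for row in rows}


def satisfies_ind(dependent, lhs, referenced, rhs):
    """Check whether dependent[lhs] is included in referenced[rhs]."""
    if len(lhs) != len(rhs):
        raise ValueError("both sides of an IND need the same number of attributes")
    return project(dependent, lhs) <= project(referenced, rhs)


# Hypothetical sample data for illustration only.
movies = [{"Title": "Alien"}, {"Title": "Brazil"}, {"Title": "Casablanca"}]
showings = [
    {"Title": "Alien", "Room": 1},
    {"Title": "Brazil", "Room": 2},
    {"Title": "Alien", "Room": 3},
]

showings_in_movies = satisfies_ind(showings, ["Title"], movies, ["Title"])
movies_in_showings = satisfies_ind(movies, ["Title"], showings, ["Title"])
```

Using sets for the projections mirrors the set semantics of the definition: duplicates among the dependent values do not matter.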


Slide 7

Example

■ Each Title in Showings should appear as a Title in Movies
□ Showings[Title] ⊆ Movies[Title]
■ Also known as “referential integrity”
□ Referenced attributes need not be a key (or unique)
□ Foreign key: helps prune candidates


Slide 8

Inference rules for INDs

■ Reflexivity: R[X] ⊆ R[X]
■ Projection:
□ R[A1, …, An] ⊆ S[B1, …, Bn] => R[Ai1, …, Aim] ⊆ S[Bi1, …, Bim] for each sequence i1, …, im of distinct integers in {1, …, n}
■ Transitivity:
□ R[X] ⊆ S[Y] and S[Y] ⊆ T[Z] => R[X] ⊆ T[Z]
□ Example: “transitive foreign keys” for 1:1 relationships
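The projection and transitivity rules can be applied mechanically to a set of known INDs. The sketch below is one possible encoding, assuming unary INDs are stored as pairs of attribute names; `project_ind` and `transitive_closure` are hypothetical helpers.

```python
def project_ind(lhs, rhs, indices):
    """Projection rule: keep the attribute pairs at the given positions."""
    return tuple(lhs[i] for i in indices), tuple(rhs[i] for i in indices)


def transitive_closure(inds):
    """Transitivity rule on unary INDs given as (dependent, referenced) pairs.

    Reflexive results such as (a, a) are skipped because they are trivial.
    """
    closure = set(inds)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Illustrative attribute names, not taken from the slides.
known = {("R.a", "S.b"), ("S.b", "T.c")}
derived = transitive_closure(known)
unary = project_ind(("B", "C"), ("G", "F"), [1])
```

Discovery algorithms can use such rules in the other direction as well: if a projected IND is invalid, the larger IND it was derived from cannot hold either.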


Slide 9

IND types

■ Unary INDs
□ INDs on single attributes: R[A] ⊆ S[B]
■ n-ary INDs
□ INDs on multiple attributes: R[X] ⊆ S[Y]
□ |X| = |Y|
■ Partial INDs
□ IND R[A] ⊆ S[B] is satisfied for x% of all tuples in R
□ IND R[A] ⊆ S[B] is satisfied for all but x tuples in R
■ Approximate INDs
□ IND R[A] ⊆ S[B] is satisfied with probability p.
□ Based on sampling or other heuristics


Slide 10

Examples

■ Unary: R[C] ⊆ S[F]
■ N-ary: R[B,C] ⊆ S[G,F]
■ Partial: R[A] ⊆ S[F] for 75% of the tuples in R
■ Approximate: R[B,A] ⊆ S[G,H]

R
A  B  C
1  x  1
2  x  1
3  y  2
5  z  4

S
F  G  H
1  x  1
2  y  3
3  z  4
4  z  4
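The example INDs can be verified by measuring, for each candidate, the fraction of tuples in R whose projected values also occur in S. This is a minimal sketch of that tuple-based measure; the function name `coverage` is an assumption, and the rows are the two example relations R and S.

```python
# Example relations R(A, B, C) and S(F, G, H) from the slide.
R = [
    {"A": 1, "B": "x", "C": 1},
    {"A": 2, "B": "x", "C": 1},
    {"A": 3, "B": "y", "C": 2},
    {"A": 5, "B": "z", "C": 4},
]
S = [
    {"F": 1, "G": "x", "H": 1},
    {"F": 2, "G": "y", "H": 3},
    {"F": 3, "G": "z", "H": 4},
    {"F": 4, "G": "z", "H": 4},
]


def coverage(dependent, lhs, referenced, rhs):
    """Fraction of dependent tuples whose lhs values occur in referenced[rhs]."""
    referenced_values = {tuple(row[a] for a in rhs) for row in referenced}
    hits = sum(
        1 for row in dependent
        if tuple(row[a] for a in lhs) in referenced_values
    )
    return hits / len(dependent)


unary = coverage(R, ["C"], S, ["F"])            # exact IND: coverage 1.0
nary = coverage(R, ["B", "C"], S, ["G", "F"])   # exact IND: coverage 1.0
partial = coverage(R, ["A"], S, ["F"])          # 3 of 4 tuples are covered
other = coverage(R, ["B", "A"], S, ["G", "H"])  # only some tuples are covered
```

A coverage of 1.0 corresponds to an exact IND; values below 1.0 quantify how far a partial IND is from holding.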


Slide 11

Motivation for IND discovery

■ General insight into data
■ Detect unknown foreign keys
■ Example
□ PDB: Protein Data Bank
□ OpenMMS provides relational schema
– Parses protein and nucleic acid macromolecular structure data from the standard mmCIF format.
□ 175 tables with primary key constraints
□ 2705 attributes
□ But: Not a single foreign key constraint!


Slide 12

Motivation for IND discovery

■ Ensembl: genome database
□ Shipped as MySQL dump files
□ More than 200 tables
□ Not a single foreign key constraint!
■ Web tables: No schema, no constraints, but many connections
■ Why are FKs missing?
□ Lack of support for checking foreign key constraints in the host system
– Example: Oracle did not support FKs up to v6
□ Fear that checking such constraints would impede database performance
□ Lack of database knowledge within the development team
□ Dirty data prevents setting the constraint


Slide 13

Unary IND detection complexity



For the planet table with attributes Name, Type, Equatorial_diameter, Mass, Orbital_radius, Orbital_period, Rotation_period, Confirmed_moons, Rings, Atmosphere:

■ Name ⊆ Type? Name ⊆ Equatorial_diameter? Name ⊆ Mass? Name ⊆ Orbital_radius? Name ⊆ Orbital_period? Name ⊆ Rotation_period? Name ⊆ Confirmed_moons? Name ⊆ Rings? Name ⊆ Atmosphere?
■ Type ⊆ Name? Type ⊆ Equatorial_diameter? Type ⊆ Mass? Type ⊆ Orbital_radius? Type ⊆ Orbital_period? Type ⊆ Rotation_period? Type ⊆ Confirmed_moons? Type ⊆ Rings? Type ⊆ Atmosphere?
■ Mass ⊆ Name? Mass ⊆ Type? Mass ⊆ Equatorial_diameter? …

Complexity: O(n² − n) for n attributes

Example: 10 attributes ~ 90 checks; 1,000 attributes ~ 999,000 checks
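The candidate count follows from pairing every attribute with every other attribute in both directions. The sketch below enumerates these ordered pairs and checks each one naively by set containment; `unary_candidates` and `discover_unary_inds` are hypothetical names, and the sample columns are chosen for illustration.

```python
from itertools import permutations


def unary_candidates(attributes):
    """All ordered pairs (A, B) with A != B, i.e. n * (n - 1) candidates."""
    return list(permutations(attributes, 2))


def discover_unary_inds(columns):
    """Naive discovery: test every candidate A ⊆ B on the distinct values."""
    distinct = {name: set(values) for name, values in columns.items()}
    return [
        (a, b)
        for a, b in unary_candidates(list(columns))
        if distinct[a] <= distinct[b]
    ]


# Illustrative columns (name -> values).
columns = {
    "R.A": [1, 2, 3, 5],
    "R.C": [1, 1, 2, 4],
    "S.F": [1, 2, 3, 4],
}
valid = discover_unary_inds(columns)
checks_for_10 = len(unary_candidates(range(10)))
checks_for_1000 = len(unary_candidates(range(1000)))
```

This brute-force approach needs one containment test per candidate, which is exactly the quadratic growth the slide points out.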

Slide 14

N-ary IND detection complexity

Attribute lattice for X and Y over the attributes A, B, C, D, E:

Level 1: A B C D E
Level 2: AB AC AD AE BC BD BE CD CE DE
Level 3: ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
Level 4: ABCD ABCE ABDE ACDE BCDE
Level 5: ABCDE

■ Test each combination with all other combinations of the same size!
■ No n-ary INDs here! Why?
□ X ⊆ Y: X ∩ Y = ∅
■ IND candidates in level k: \binom{n}{k} \cdot \binom{n-k}{k} \cdot k!
□ \binom{n}{k}: nodes
□ \binom{n-k}{k}: other, non-overlapping nodes
□ k!: all permutations


Slide 15

N-ary IND detection complexity

Number of IND candidates by number of attributes n (rows) and level k (columns). Levels k = 8 to 15 contain only zeros for n ≤ 15 and are omitted.

 n  k=1    k=2     k=3       k=4       k=5        k=6        k=7      total
 1    0      0       0         0         0          0          0          0
 2    2      0       0         0         0          0          0          2
 3    6      0       0         0         0          0          0          6
 4   12     12       0         0         0          0          0         24
 5   20     60       0         0         0          0          0         80
 6   30    180     120         0         0          0          0        330
 7   42    420     840         0         0          0          0       1302
 8   56    840    3360      1680         0          0          0       5936
 9   72   1512   10080     15120         0          0          0      26784
10   90   2520   25200     75600     30240          0          0     133650
11  110   3960   55440    277200    332640          0          0     669350
12  132   5940  110880    831600   1995840     665280          0    3609672
13  156   8580  205920   2162160   8648640    8648640          0   19674096
14  182  12012  360360   5045040  30270240   60540480   17297280  113525594
15  210  16380  600600  10810800  90810720  302702400  259459200  664400310
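The counts in this table follow from the level formula for n-ary IND candidates: choose k attributes for X, choose k of the remaining n − k attributes for Y, and consider all k! orderings. A minimal sketch that recomputes them (the function names are assumptions):

```python
from math import comb, factorial


def candidates_in_level(n, k):
    """IND candidates with k attributes per side over n attributes.

    comb(n, k)      nodes X in level k
    comb(n - k, k)  other, non-overlapping nodes Y
    factorial(k)    all permutations of the attribute pairing
    """
    return comb(n, k) * comb(n - k, k) * factorial(k)


def total_candidates(n):
    """Sum over all levels; levels with 2k > n contribute zero."""
    return sum(candidates_in_level(n, k) for k in range(1, n + 1))


level_7_of_15 = candidates_in_level(15, 7)
total_for_15 = total_candidates(15)
```

`math.comb` returns 0 when k exceeds the number of remaining attributes, which is why the levels above n/2 are empty.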


Slide 16

Algorithms

■ Bell & Brockhausen: Siegfried Bell and Peter Brockhausen. “Discovery of Data Dependencies in Relational Databases.” Statistics, Machine Learning and Knowledge Discovery in Databases, ML–Net Familiarization Workshop, 53–58, 1995.
■ Zigzag: Fabien De Marchi and Jean-Marc Petit. “Zigzag: A New Algorithm for Mining Large Inclusion Dependencies in Databases.” In Proceedings of the International Conference on Data Mining (ICDM), 27–34, 2003.
■ FIND2: Andreas Koeller and Elke A. Rundensteiner. “Discovery of High-Dimensional Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 683–685, 2003.


Slide 17

Algorithms

■ SPIDER: Jana Bauckmann, Ulf Leser, Felix Naumann, and Veronique Tietz. “Efficiently Detecting Inclusion Dependencies.” In Proceedings of the International Conference on Data Engineering (ICDE), 1448–1450, 2007.
■ deMarchi/MIND: Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. “Unary and N-Ary Inclusion Dependency Discovery in Relational Databases.” Journal of Intelligent Information Systems 32 (1): 53–73, 2009.
■ BINDER: Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix Naumann. “Divide & Conquer-Based Inclusion Dependency Discovery.” In Proceedings of the VLDB Endowment, 8:774–785, 2015.


Slide 18

Algorithms

■ S-INDD: Nuhad Shaabani and Christoph Meinel. “Scalable Inclusion Dependency Discovery.” In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 425–440, 2015.
■ SINDY: Sebastian Kruse, Thorsten Papenbrock, and Felix Naumann. “Scaling Out the Discovery of Inclusion Dependencies.” In Proceedings of the Conference on Database Systems for Business, Technology and Web (BTW), 445–454, 2015.
■ MIND2: Nuhad Shaabani and Christoph Meinel. “Detecting Maximum Inclusion Dependencies without Candidate Generation.” In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), 118–133, 2016.


Slide 19

Metanome: An extensible architecture


Slide 20

§ Algorithm execution
§ Result & Resource management
§ Algorithm configuration
§ Result & Resource presentation

■ Configuration, Resource Links, Results
■ Algorithms (jar): SPIDER, DUCC, BINDER, DFD, SWAN
■ Resources: txt, tsv, xml, csv, DB2, MySQL

Organisation


Slide 21

■ 6 participants, 3 teams of 2 students
■ Group allocation
■ Study your algorithm(s)
■ Present your algorithms
■ Implement your algorithms
■ Present your implementations
■ Prepare experiments to evaluate and compare your algorithms
■ Swap algorithms (around Christmas)
■ Run your experiments for all algorithms
■ Improve implementations of algorithms
■ Present improvements
■ Implementation freeze
■ Final paper writing
□ Write algorithm descriptions for paper
□ Describe your results

Grading

■ Active participation in meetings and discussions
■ Initial presentation of your algorithm(s)
■ Implementation of your algorithm(s) using the Metanome interface
■ Presentation of your implementation
■ Implementation of improvements to another team’s algorithm(s)
■ Presentation of your improvements
■ Final paper-style submission


Slide 22

Grading weights (as listed on the slide): 10%, 15%, 20%, 15%, 10%, 10%, 20%

Further Procedure

■ To apply for this seminar (binding):
□ Send an email to [email protected]
□ Deadline: 24.10.2017 23:59
□ If there are too many applications, we will choose randomly
■ Meeting next week: Data Profiling / Efficient Java Code
■ 30.10.17: Group allocation


Slide 23
