Data Cleansing Exercise: Duplicate Detection
Thorsten Papenbrock
PhD Candidate
Hasso-Plattner-Institute
Advanced Profiling
Three important metadata
Chart 2
Thorsten Papenbrock, PhD Candidate, 17th November, 2014
Data Profiling with Metanome
Normalization criterion: Atmosphere → Rings
Foreign key candidates: Domicile ⊆ Name
Key candidates: |Name|

Name     Type         Equatorial diameter  Mass
Mercury  Terrestrial  0.382                0.06
Venus    Terrestrial  0.949                0.82
Earth    Terrestrial  1.000                1.00
Mars     Terrestrial  0.532                0.11
Jupiter  Giant        11.209               317.8
Saturn   Giant        9.449                95.2
Uranus   Giant        4.007                14.6
...      ...          ...                  ...

Name     Type
Mercury  Terrestrial
Venus    Terrestrial
Earth    Terrestrial
Mars     Terrestrial
Jupiter  Giant
Saturn   Giant
Uranus   Giant
...      ...

Sign         Domicile
Aries        Mars
Taurus       Venus
Gemini       Mercury
Cancer       Moon
Leo          Sun
Virgo        Mercury
Libra        Venus
Scorpio      Pluto
Sagittarius  Jupiter
Capricorn    Saturn
Aquarius     Uranus
...          ...

Name     Atmosphere   Rings
Mercury  minimal      no
Venus    CO2, N2      no
Earth    N2, O2, Ar   no
Mars     CO2, N2, Ar  no
Jupiter  H2, He       yes
Saturn   H2, He       yes
Uranus   H2, He       yes
...      ...          ...
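The three kinds of metadata on this slide can be checked directly on the sample rows. The Python sketch below is illustrative only: the helper names `holds_fd`, `is_unique`, and `missing_references` are assumptions made for this example and are not Metanome APIs. It only sees the rows visible on the slide; the original tables are truncated with "...", so results for the excerpt need not match the full tables.

```python
# Rows visible on the slide; the original tables continue with "...".
planets = [
    {"Name": "Mercury", "Atmosphere": "minimal", "Rings": "no"},
    {"Name": "Venus", "Atmosphere": "CO2, N2", "Rings": "no"},
    {"Name": "Earth", "Atmosphere": "N2, O2, Ar", "Rings": "no"},
    {"Name": "Mars", "Atmosphere": "CO2, N2, Ar", "Rings": "no"},
    {"Name": "Jupiter", "Atmosphere": "H2, He", "Rings": "yes"},
    {"Name": "Saturn", "Atmosphere": "H2, He", "Rings": "yes"},
    {"Name": "Uranus", "Atmosphere": "H2, He", "Rings": "yes"},
]
domiciles = ["Mars", "Venus", "Mercury", "Moon", "Sun", "Mercury",
             "Venus", "Pluto", "Jupiter", "Saturn", "Uranus"]


def holds_fd(rows, lhs, rhs):
    """Functional dependency lhs -> rhs: equal lhs values imply equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


def is_unique(rows, columns):
    """Key candidate (unique column combination): no value combination repeats."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) == len(keys)


def missing_references(dependent, referenced):
    """Inclusion dependency check: the dependency holds iff nothing is missing."""
    return set(dependent) - set(referenced)


print(holds_fd(planets, ["Atmosphere"], ["Rings"]))   # True
print(holds_fd(planets, ["Rings"], ["Atmosphere"]))   # False
print(is_unique(planets, ["Name"]))                   # True
# On the truncated excerpt, some Domicile values have no visible Name row.
print(sorted(missing_references(domiciles, [p["Name"] for p in planets])))
# ['Moon', 'Pluto', 'Sun']
```

The last call reports values that are not among the seven visible planet rows; whether they appear in the rows hidden behind "..." cannot be determined from the slide.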
Exercise 3
Discovery of functional dependencies
Chart 3
■ All teams have passed the exercise:
□ 34 submissions
□ No duplicate algorithm names!
□ Still a few incorrect results (even after correction round)
□ No import errors in Metanome (apart from execution errors)
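For orientation, the task of this exercise, discovering the minimal functional dependencies of a table, can be sketched as a brute-force search. This is a minimal illustration under stated assumptions, not any team's submission and not TANE, FUN, or fdep: rows are plain tuples, columns are addressed by index, and the function names are hypothetical.

```python
from itertools import combinations


def holds_fd(rows, lhs, rhs):
    """Check lhs -> rhs, where lhs is a tuple of column indexes and rhs one index."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def discover_minimal_fds(rows, num_columns):
    """Return all minimal, non-trivial FDs as (lhs_indexes, rhs_index) pairs.

    Left-hand sides are enumerated by increasing size, so the first valid
    lhs found for a given rhs is minimal; its supersets are skipped.
    An empty lhs means the rhs column is constant.
    """
    fds = []
    for rhs in range(num_columns):
        others = [c for c in range(num_columns) if c != rhs]
        minimal_lhs = []
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                if any(set(m) <= set(lhs) for m in minimal_lhs):
                    continue  # a subset already determines rhs
                if holds_fd(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
                    fds.append((lhs, rhs))
    return fds


rows = [
    (1, "a", "x"),
    (1, "a", "y"),
    (2, "b", "x"),
    (3, "b", "y"),
]
print(discover_minimal_fds(rows, 3))
# [((1, 2), 0), ((0,), 1)]
```

The search visits every subset of columns per right-hand side, so it is only practical for very narrow tables.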
Exercise 3
Short presentations – Part 1
Chart 4
Functional Dependencies
Exercise 3
Our evaluation
Chart 5
■ DELL Optiplex 9010
□ CPU: Intel i5 3.2 GHz
□ RAM: 8 GB (2 GB for Metanome JVM)
□ OS: Debian 64-bit
□ JVM: Java 1.8
Exercise 3
Correctness for abalone.csv
Chart 6
aiwendil
AlexoFredFunctionals
dennis_marius.fd
DJ_FD
dpdc-cnms-fd
DreamteamFd
FanctionalDepundancy
fastTane
fd_finke_dullweber
FD_grundke_wiese
FD_JungRohloff
FD_Kirsten_Zwerg
FD_schaeffer_zoellner
FD_SPIRO
FdBotheJoerkeReissaus
FDFMJR
FdPerchykSchmidt
FrohnOttoFuncDep-LAME-TANE
FuncDep
FunctionalDependencyDetector
FunctionalDerpendency
GottaCatchAllFD
HorLehTane
klinger_marten_fd
LucieKerstinFD
MMFUncDep
MyFd
PCFD
PutePute
RT_FD
SBMMFD
smart_data_cat-FD
Tsun12Fd
YuckFunc
Exercise 3
Correctness for abalone.csv
Chart 7
aiwendil
AlexoFredFunctionals
dennis_marius.fd
DJ_FD
dpdc-cnms-fd
DreamteamFd
FanctionalDepundancy
fastTane
fd_finke_dullweber
FD_grundke_wiese
FD_JungRohloff
FD_Kirsten_Zwerg
FD_schaeffer_zoellner
FD_SPIRO
FdBotheJoerkeReissaus
FDFMJR
FdPerchykSchmidt
FrohnOttoFuncDep-LAME-TANE
FuncDep
FunctionalDependencyDetector
FunctionalDerpendency incorrect
GottaCatchAllFD
HorLehTane
klinger_marten_fd
LucieKerstinFD
MMFUncDep
MyFd
PCFD
PutePute
RT_FD
SBMMFD
smart_data_cat-FD
Tsun12Fd
YuckFunc
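The slides do not say how a result was judged "incorrect". One plausible approach, shown here purely as an assumption, is to compare the set of functional dependencies an algorithm reports against a trusted reference result. The function names and the column names A, B, C are hypothetical.

```python
def compare_fd_results(found, reference):
    """Compare two FD lists given as (lhs_columns, rhs_column) pairs.

    The order of columns inside a left-hand side is ignored.
    """
    found_set = {(frozenset(lhs), rhs) for lhs, rhs in found}
    reference_set = {(frozenset(lhs), rhs) for lhs, rhs in reference}
    return {
        "missing": reference_set - found_set,      # FDs the algorithm did not report
        "unexpected": found_set - reference_set,   # FDs reported but not in the reference
    }


def is_correct(found, reference):
    """A result counts as correct only if nothing is missing and nothing is extra."""
    diff = compare_fd_results(found, reference)
    return not diff["missing"] and not diff["unexpected"]


reference = [(("A", "B"), "C"), (("C",), "A")]
found = [(("B", "A"), "C")]

print(is_correct(found, reference))        # False
print(compare_fd_results(found, reference)["missing"])
```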
[Bar chart for abalone.csv: Runtime [ms] per algorithm, y-axis from 0 to 1,400,000; bar values not recoverable.
Algorithm labels in extraction order: fd_finke_dullweber, FD_schaeffer_zo…, FUN, fastTane, Tsun12Fd, FD_grundke_wiese, MMFUncDep, aiwendil, HorLehTane, Tane, klinger_marten_fd, DreamteamFd, FuncDep, RT_FD, PutePute, YuckFunc, FanctionalDepun…, PCFD, AlexoFredFunctio…, smart_data_cat-FD, FrohnOttoFuncDe…, FDFMJR, GottaCatchAllFD, FdPerchykSchmidt, dpdc-cnms-fd, dennis_marius.fd, fdep, SBMMFD, LucieKerstinFD, MyFd, DJ_FD, FD_SPIRO, FD_JungRohloff, FdBotheJoerkeRe…, FD_Kirsten_Zwerg, FunctionalDepen…]
Exercise 3
Runtime for abalone.csv
Chart 8
■ Columns: 9
■ Rows: 4,177
■ FDs: 137
[Bar chart for abalone.csv: Runtime [ms] per algorithm, y-axis from 0 to 10,000; bar values not recoverable.
Algorithm labels in extraction order: fd_finke_dullweber, FD_schaeffer_zo…, FUN, fastTane, Tsun12Fd, FD_grundke_wiese, MMFUncDep, aiwendil, HorLehTane, Tane, klinger_marten_fd, DreamteamFd, FuncDep, RT_FD, PutePute, YuckFunc, FanctionalDepun…, PCFD, AlexoFredFunctio…, smart_data_cat-FD, FrohnOttoFuncDe…, FDFMJR, GottaCatchAllFD, FdPerchykSchmidt, dpdc-cnms-fd, dennis_marius.fd, fdep, SBMMFD, LucieKerstinFD, MyFd, DJ_FD, FD_SPIRO, FD_JungRohloff, FdBotheJoerkeRe…, FD_Kirsten_Zwerg, FunctionalDepen…]
Exercise 3
Runtime for abalone.csv (<10s)
Chart 9
■ Columns: 9
■ Rows: 4,177
■ FDs: 137
[Bar chart for abalone.csv: Runtime [ms] per algorithm, y-axis from 0 to 1,000; bar values not recoverable.
Algorithm labels in extraction order: fd_finke_dullweber, FD_schaeffer_zo…, FUN, fastTane, Tsun12Fd, FD_grundke_wiese, MMFUncDep, aiwendil, HorLehTane, Tane, klinger_marten_fd, DreamteamFd, FuncDep, RT_FD, PutePute, YuckFunc, FanctionalDepun…, PCFD, AlexoFredFunctio…, smart_data_cat-FD, FrohnOttoFuncDe…, FDFMJR, GottaCatchAllFD, FdPerchykSchmidt, dpdc-cnms-fd, dennis_marius.fd, fdep, SBMMFD, LucieKerstinFD, MyFd, DJ_FD, FD_SPIRO, FD_JungRohloff, FdBotheJoerkeRe…, FD_Kirsten_Zwerg, FunctionalDepen…]
Exercise 3
Runtime for abalone.csv (<1s)
Chart 10
■ Columns: 9
■ Rows: 4,177
■ FDs: 137
Exercise 3
Correctness for bridges.csv
Chart 12
aiwendil incorrect
AlexoFredFunctionals
dennis_marius.fd
DJ_FD
dpdc-cnms-fd SerializationError
DreamteamFd
FanctionalDepundancy
fastTane
fd_finke_dullweber
FD_grundke_wiese
FD_JungRohloff
FD_Kirsten_Zwerg
FD_schaeffer_zoellner
FD_SPIRO
FdBotheJoerkeReissaus
FDFMJR
FdPerchykSchmidt SerializationError
FrohnOttoFuncDep-LAME-TANE
FuncDep
FunctionalDependencyDetector
FunctionalDerpendency incorrect
GottaCatchAllFD
HorLehTane
klinger_marten_fd
LucieKerstinFD
MMFUncDep
MyFd
PCFD
PutePute
RT_FD
SBMMFD
smart_data_cat-FD
Tsun12Fd
YuckFunc
Exercise 3
Runtime for bridges.csv
Chart 13
[Bar chart for bridges.csv: Runtime [ms] per algorithm, y-axis from 0 to 35,000; bar values not recoverable.
Algorithm labels in extraction order: fdep, FuncDep, Tsun12Fd, fd_finke_dullweber, FD_SPIRO, MMFUncDep, FD_schaeffer_zo…, FUN, HorLehTane, FDFMJR, FanctionalDepun…, fastTane, RT_FD, smart_data_cat-FD, PutePute, Tane, klinger_marten_fd, DreamteamFd, PCFD, FD_grundke_wiese, FdBotheJoerkeRe…, AlexoFredFunctio…, FrohnOttoFuncDe…, YuckFunc, MyFd, FD_JungRohloff, FD_Kirsten_Zwerg, GottaCatchAllFD, DJ_FD, LucieKerstinFD, dennis_marius.fd, SBMMFD, FunctionalDepen…]
■ Columns: 13
■ Rows: 108
■ FDs: 142
Exercise 3
Runtime for bridges.csv (<1s)
Chart 14
[Bar chart for bridges.csv: Runtime [ms] per algorithm, y-axis from 0 to 1,000; bar values not recoverable.
Algorithm labels in extraction order: fdep, FuncDep, Tsun12Fd, fd_finke_dullweber, FD_SPIRO, MMFUncDep, FD_schaeffer_zo…, FUN, HorLehTane, FDFMJR, FanctionalDepun…, fastTane, RT_FD, smart_data_cat-FD, PutePute, Tane, klinger_marten_fd, DreamteamFd, PCFD, FD_grundke_wiese, FdBotheJoerkeRe…, AlexoFredFunctio…, FrohnOttoFuncDe…, YuckFunc, MyFd, FD_JungRohloff, FD_Kirsten_Zwerg, GottaCatchAllFD, DJ_FD, LucieKerstinFD, dennis_marius.fd, SBMMFD, FunctionalDepen…]
■ Columns: 13
■ Rows: 108
■ FDs: 142
Exercise 3
Correctness for hepatitis.csv
Chart 16
aiwendil incorrect
AlexoFredFunctionals
dennis_marius.fd
DJ_FD
dpdc-cnms-fd SerializationError
DreamteamFd
FanctionalDepundancy
fastTane SerializationError
fd_finke_dullweber
FD_grundke_wiese
FD_JungRohloff
FD_Kirsten_Zwerg
FD_schaeffer_zoellner
FD_SPIRO
FdBotheJoerkeReissaus
FDFMJR
FdPerchykSchmidt SerializationError
FrohnOttoFuncDep-LAME-TANE
FuncDep
FunctionalDependencyDetector
FunctionalDerpendency incorrect
GottaCatchAllFD
HorLehTane
klinger_marten_fd
LucieKerstinFD
MMFUncDep
MyFd
PCFD
PutePute
RT_FD
SBMMFD
smart_data_cat-FD
Tsun12Fd
YuckFunc
Exercise 3
Correctness for hepatitis.csv
Chart 17
aiwendil incorrect
AlexoFredFunctionals > 1h ?
dennis_marius.fd > 1h ?
DJ_FD > 1h ?
dpdc-cnms-fd SerializationError
DreamteamFd
FanctionalDepundancy
fastTane SerializationError
fd_finke_dullweber
FD_grundke_wiese > 1h
FD_JungRohloff > 1h ?
FD_Kirsten_Zwerg > 1h ?
FD_schaeffer_zoellner
FD_SPIRO
FdBotheJoerkeReissaus > 1h ?
FDFMJR > 1h
FdPerchykSchmidt SerializationError
FrohnOttoFuncDep-LAME-TANE > 1h ?
FuncDep
FunctionalDependencyDetector > 1h ?
FunctionalDerpendency incorrect
GottaCatchAllFD > 1h ?
HorLehTane
klinger_marten_fd
LucieKerstinFD > 1h ?
MMFUncDep
MyFd > 1h ?
PCFD > 1h
PutePute
RT_FD
SBMMFD > 1h ?
smart_data_cat-FD
Tsun12Fd
YuckFunc > 1h
[Bar chart for hepatitis.csv: Runtime [ms], y-axis from 0 to 1,400,000; bar labels and values not recoverable.]
Exercise 3
Runtime for hepatitis.csv
Chart 18
■ Columns: 20
■ Rows: 155
■ FDs: 8,250
[Bar chart for hepatitis.csv: Runtime [ms], y-axis from 0 to 10,000; bar labels and values not recoverable.]
Exercise 3
Runtime for hepatitis.csv (<10s)
Chart 19
■ Columns: 20
■ Rows: 155
■ FDs: 8,250
Exercise 3
Correctness for fd-reduced-15.csv
Chart 21
aiwendil incorrect
AlexoFredFunctionals > 1h ?
dennis_marius.fd > 1h ?
DJ_FD > 1h ?
dpdc-cnms-fd SerializationError
DreamteamFd
FanctionalDepundancy
fastTane SerializationError
fd_finke_dullweber incorrect
FD_grundke_wiese > 1h
FD_JungRohloff > 1h ?
FD_Kirsten_Zwerg > 1h ?
FD_schaeffer_zoellner
FD_SPIRO > 30min
FdBotheJoerkeReissaus > 1h ?
FDFMJR > 1h
FdPerchykSchmidt SerializationError
FrohnOttoFuncDep-LAME-TANE > 1h ?
FuncDep
FunctionalDependencyDetector > 1h ?
FunctionalDerpendency incorrect
GottaCatchAllFD > 1h ?
HorLehTane > 30min
klinger_marten_fd
LucieKerstinFD > 1h ?
MMFUncDep
MyFd > 1h ?
PCFD > 1h
PutePute
RT_FD > 30min
SBMMFD > 1h ?
smart_data_cat-FD
Tsun12Fd
YuckFunc > 1h
Exercise 3
Runtime for fd-reduced-15.csv
Chart 22
■ Columns: 30
■ Rows: 250,000
■ FDs: 89,571
[Bar chart for fd-reduced-15.csv: Runtime [sec], y-axis from 0 to 800; bar labels and values not recoverable.]
Exercise 3
Runtime for fd-reduced-15.csv (<30sec)
Chart 23
■ Columns: 30
■ Rows: 250,000
■ FDs: 89,571
[Bar chart for fd-reduced-15.csv: Runtime [sec], y-axis from 0 to 30; bar labels and values not recoverable.]
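The column counts reported for the four datasets (9, 13, 20, and 30) suggest one reason why many submissions hit the time limit on the wider tables. For a table with n columns, each column can be the right-hand side of a dependency whose left-hand side is any subset of the remaining n - 1 columns, which gives n · 2^(n-1) candidate dependencies. The sketch below only computes that bound; the function name is hypothetical, and actual runtimes also depend on the number of rows and on how well an algorithm prunes.

```python
def candidate_fd_count(num_columns):
    """Upper bound on non-trivial FD candidates with a single right-hand side."""
    return num_columns * 2 ** (num_columns - 1)


# Column counts as stated on the dataset slides.
datasets = {
    "abalone.csv": 9,
    "bridges.csv": 13,
    "hepatitis.csv": 20,
    "fd-reduced-15.csv": 30,
}

for name, columns in datasets.items():
    print(f"{name}: {columns} columns, {candidate_fd_count(columns):,} candidates")
# abalone.csv: 9 columns, 2,304 candidates
# bridges.csv: 13 columns, 53,248 candidates
# hepatitis.csv: 20 columns, 10,485,760 candidates
# fd-reduced-15.csv: 30 columns, 16,106,127,360 candidates
```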
Exercise 3
Correctness for plista1k.csv
Chart 25
aiwendil incorrect
AlexoFredFunctionals > 1h ?
dennis_marius.fd > 1h ?
DJ_FD > 1h ?
dpdc-cnms-fd SerializationError
DreamteamFd > 1h ?
FanctionalDepundancy > 1h ?
fastTane SerializationError
fd_finke_dullweber incorrect
FD_grundke_wiese > 1h
FD_JungRohloff > 1h ?
FD_Kirsten_Zwerg > 1h ?
FD_schaeffer_zoellner OutOfMemory
FD_SPIRO > 30min
FdBotheJoerkeReissaus > 1h ?
FDFMJR > 1h
FdPerchykSchmidt SerializationError
FrohnOttoFuncDep-LAME-TANE > 1h ?
FuncDep
FunctionalDependencyDetector > 1h ?
FunctionalDerpendency incorrect
GottaCatchAllFD > 1h ?
HorLehTane > 30min
klinger_marten_fd OutOfMemory
LucieKerstinFD > 1h ?
MMFUncDep OutOfMemory
MyFd > 1h ?
PCFD > 1h
PutePute ArrayIndexOutOfBounds
RT_FD > 30min
SBMMFD > 1h ?
smart_data_cat-FD > 1h
Tsun12Fd > 1h
YuckFunc > 1h
Exercise 3
Correctness for plista1k.csv
Chart 26
Algorithm  Runtime [ms]
FuncDep    7,043
fdep       18,492
TANE       OutOfMemory
FUN        OutOfMemory
Exercise 3
Short presentations – Part 2
Chart 27
Functional Dependencies
Data Cleansing
Duplicate Detection
Chart 28
Exercise 4
Duplicate Detection
Chart 29