

Lazy Big Data Integration

Prof. Dr. Andreas Thor
Hochschule für Telekommunikation Leipzig (HfTL)

Martin-Luther-Universität Halle-Wittenberg
16.12.2016


2

Agenda

• Data Integration
  – Data analytics for domain-specific questions
  – Use cases: Bibliometrics & Life Sciences
• Big Data Integration
  – Techniques for efficient big data management
  – Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
• Lazy Big Data Integration
  – (Efficient and) effective goal-oriented data integration
  – Integrated analytical approach for big data analytics


3

Use Case: Bibliometrics

• Does the peer review process actually work? Does it select the "best" papers?
• Data from the reviewing process (e.g., EasyChair)
  – Bibliographic information (title, authors, …) of submitted papers
  – Review score(s) incl. editorial decision
• Data from bibliographic data sources (e.g., Google Scholar)
  – Accepted papers, and rejected papers that were published elsewhere
  – Number of citations
• Determine the covariance between review score(s) and #citations (sketched below)
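The covariance computation itself is simple once scores and citation counts have been matched; a minimal sketch, with all numbers invented for illustration (the talk reports no data):

```python
# Sketch: sample covariance between review scores and citation counts.
from statistics import mean

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

review_scores = [4.5, 3.0, 5.0, 2.5, 4.0]  # averaged reviewer scores (hypothetical)
citations = [120, 15, 300, 4, 80]          # matched citation counts (hypothetical)

print(covariance(review_scores, citations))
```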


4

Data Integration

• Combining data residing at different sources and providing the user with a unified view of this data
  – Added value by linking & merging data
  – Queries that can only be answered using multiple sources
• Schema Matching
  – Finding mappings of corresponding attributes
• Entity Matching
  – Finding equivalent data objects (illustrated below)
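A minimal illustration of entity matching on bibliographic records; the similarity function and the 0.9 threshold are illustrative choices, not the matcher used in the talk:

```python
# Sketch: entity matching as thresholded string similarity on titles.
from difflib import SequenceMatcher

def title_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

submitted = {"id": "bibtex-42", "title": "Lazy Big Data Integration"}
candidate = {"id": "gs-7", "title": "Lazy big data integration"}

if title_sim(submitted["title"], candidate["title"]) > 0.9:
    print("match:", submitted["id"], "<->", candidate["id"])
```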


5

Data Quality

• Can/should we use Google Scholar citations for ranking …
  – papers
  – researchers
  – institutions
  – etc.
• Convergent validity of citation analyses?
  – Comparison of analysis results for the overlap of the sources

              Google Scholar             Web of Science
Coverage      Huge                       Limited
Data quality  Medium (fully automatic)   High (manually curated)
Costs         Free                       Expensive


6

Use Case: Life Sciences

• Gene Annotation Graph
  – Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms
• The links form a graph that captures meaningful biological knowledge
• Making sense of this graph is important
• Prediction of new annotations – hypotheses for wet-lab experiments ("Is this annotation likely to be added in the future?")


7

Graph Summarization + Link Prediction

• Graph summary = signature + corrections
• Signature: graph pattern/structure
  – Super nodes = a partitioning of the nodes
  – Super edges = edges between super nodes = all edges between the nodes of the super nodes
• Corrections: edges e between individual nodes
  – Additions: e ∈ G but e ∉ signature
  – Deletions: e ∉ G but e ∈ signature
• p(PO_20030, CIB5) ≈ 0.96
  – High prediction score because it is the "only missing piece" of a "perfect 4×6 pattern"

[Figure: annotation graph over the PO terms PO_20030, PO_9006, PO_37, PO_20038 and the genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1, shown as being equal to its summary: a signature with two super nodes plus correction edges.]
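The 0.96 score can be read as the edge density of the super edge: 23 of the 24 possible edges of the "perfect 4×6 pattern" are present, and 23/24 ≈ 0.96. A sketch under that interpretation (an assumption consistent with the slide, not a formula stated on it):

```python
# Sketch: density-based link prediction from a graph summary.
from itertools import product

po_terms = ["PO_20030", "PO_9006", "PO_37", "PO_20038"]   # super node 1
genes = ["HY5", "PHOT1", "CIB5", "CRY2", "COP1", "CRY1"]  # super node 2

# All 4 x 6 = 24 annotation edges are present except (PO_20030, CIB5).
edges = set(product(po_terms, genes)) - {("PO_20030", "CIB5")}

density = len(edges) / (len(po_terms) * len(genes))
print(round(density, 2))  # -> 0.96: the "only missing piece" scores highly
```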


8

(Big) Data Analytics Pipeline

Data Acquisition → Data Extraction/Cleaning → Schema & Entity Matching → Data Fusion/Aggregation → Data Analytics/Visualization

(Use cases driving the pipeline: Bibliometrics, Life Sciences)


9

Agenda

• Data Integration
  – Data analytics for domain-specific questions
  – Use cases: Bibliometrics & Life Sciences
• Big Data Integration
  – Techniques for efficient big data management
  – Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
• Lazy Big Data Integration
  – (Efficient and) effective goal-oriented data integration
  – Integrated analytical approach for big data analytics


10

(Big) Data Analytics Pipeline

Data Acquisition → Data Extraction/Cleaning → Schema & Entity Matching → Data Fusion/Aggregation → Data Analytics/Visualization

(Use cases: Bibliometrics, Life Sciences)

Contributions along the pipeline: Query Strategies, Product Code Extraction, Load Balancing, NoSQL Sharding, Dedoop (Deduplication with Hadoop)


11

How to speed up entity matching?

• Entity matching is expensive (due to pairwise comparisons)
• Blocking to reduce the search space (see the sketch below)
  – Group similar entities into blocks based on a blocking key
  – Restrict matching to entities from the same block
• Parallelization
  – Split the match computation into sub-tasks that are executed in parallel
  – Exploit cloud infrastructures and frameworks like MapReduce
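A minimal sketch of blocking-based matching; the blocking key and the matcher are illustrative placeholders:

```python
# Sketch: blocking restricts comparisons to entities that share a key.
from collections import defaultdict
from itertools import combinations

def blocking_key(entity):
    return entity["title"][:3].lower()  # e.g., a title prefix

def match(a, b):
    return a["title"].lower() == b["title"].lower()  # placeholder matcher

def blocked_matching(entities):
    blocks = defaultdict(list)
    for e in entities:                  # build blocks
        blocks[blocking_key(e)].append(e)
    return [(a, b)                      # compare only within each block
            for block in blocks.values()
            for a, b in combinations(block, 2)
            if match(a, b)]
```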


12

Blocking + MapReduce: Naïve

• Data skew leads to unbalanced workload

[Figure: naïve MapReduce dataflow. Map performs blocking; partitioning is "BlockKey mod #ReduceTasks"; reduce performs matching. Input: entities A–O with blocking keys w, x, y, z, split across two map tasks. One reducer receives both block w and block z and computes 16 pairs (A-B, A-H, A-I, B-H, B-I, H-I, F-G, F-M, F-N, F-O, G-M, G-N, G-O, M-N, M-O, N-O), while the reducers for blocks x and y compute only 3 (C-D, C-E, D-E) and 1 (K-L).]
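Simulated in a few lines (not Dedoop's actual implementation), the naïve job and its skew look like this:

```python
# Sketch: the naïve approach as a simulated MapReduce job. Map emits
# (blocking key, entity); partitioning is "BlockKey mod #ReduceTasks"
# (here via ord()); each reduce call compares all entities of its blocks.
from collections import defaultdict
from itertools import combinations

entities = {"A": "w", "B": "w", "C": "x", "D": "x", "E": "x", "F": "z",
            "G": "z", "H": "w", "I": "w", "K": "y", "L": "y", "M": "z",
            "N": "z", "O": "z"}
num_reduce_tasks = 3

blocks = defaultdict(list)
for entity, key in entities.items():   # map phase: blocking
    blocks[key].append(entity)

load = defaultdict(list)
for key, members in blocks.items():    # shuffle + reduce: matching
    reducer = ord(key) % num_reduce_tasks
    load[reducer] += combinations(sorted(members), 2)

for reducer in sorted(load):           # blocks w and z pile up on one reducer
    print(f"reduce{reducer}: {len(load[reducer])} pairs")
```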


13

Load Balancing for MR-based EM

• Analysis to identify the blocking key distribution
• Global load balancing algorithms assign a similar number of pairs to each reduce task

[Figure: two chained MapReduce jobs. MR Job 1 (Analysis): blocking and counting occurrences over the input partitions Π1, Π2 produce the block distribution matrix. MR Job 2 (Matching): load balancing derives match partitions M1–M3, and the reducers perform the similarity computation that yields the match result.]

Entity (Π1)   A  B  C  D  E  F  G
Blocking key  w  w  x  x  x  z  z

Entity (Π2)   H  I  K  L  M  N  O
Blocking key  w  w  y  y  z  z  z

Block distribution matrix (#E = number of entities, #P = number of pairs):

Block      Π1   Π2   #E   #P
w (Φ1)      2    2    4    6
y (Φ2)      0    2    2    1
x (Φ3)      3    0    3    3
z (Φ4)      2    3    5   10
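The analysis job boils down to counting entities per block and input partition; a sketch that reproduces the matrix above:

```python
# Sketch: MR Job 1 as counting entities per (block key, partition).
from collections import Counter
from math import comb

partitions = {
    1: {"A": "w", "B": "w", "C": "x", "D": "x", "E": "x", "F": "z", "G": "z"},
    2: {"H": "w", "I": "w", "K": "y", "L": "y", "M": "z", "N": "z", "O": "z"},
}

counts = Counter()                          # (block key, partition) -> #entities
for part, entities in partitions.items():
    for key in entities.values():
        counts[key, part] += 1

for key in sorted({k for k, _ in counts}):  # one matrix row per block
    row = [counts[key, p] for p in sorted(partitions)]
    n = sum(row)
    print(key, row, "#E =", n, "#P =", comb(n, 2))
```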


14

BlockSplit

• Large blocks are split into m sub-blocks
  – according to the m input partitions
  – a block is large if #P_Block > #P_Overall / #Reducers
• Two types of match tasks
  – Single (small blocks and sub-blocks)
  – Two sub-blocks (cross comparison)
• Greedy load balancing (see the sketch after the tables)
  – Sort match tasks by their number of pairs in descending order
  – Assign each match task to the reducer with the lowest number of pairs
• Example
  – r = 3 reduce tasks; split Φ4 into m = 2 sub-blocks
  – Φ4's match tasks: Φ4.1, Φ4.2, and Φ4.1×2

Block distribution matrix:

Block      Π1   Π2   #E   #P
w (Φ1)      2    2    4    6
y (Φ2)      0    2    2    1
x (Φ3)      3    0    3    3
z (Φ4)      2    3    5   10

Match task   #P   Reducer
Φ1            6    1
Φ4.1×2        6    2
Φ3            3    3
Φ4.2          3    3
Φ2            1    1
Φ4.1          1    2
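The greedy assignment that produces this table can be sketched with a heap of reducer loads (task names transliterated from Φ to Phi):

```python
# Sketch: greedily assign match tasks to the least-loaded reducer.
import heapq

tasks = [("Phi1", 6), ("Phi4.1x2", 6), ("Phi3", 3),  # sorted by #pairs,
         ("Phi4.2", 3), ("Phi2", 1), ("Phi4.1", 1)]  # descending

reducers = [(0, r) for r in (1, 2, 3)]               # (current #pairs, reducer id)
heapq.heapify(reducers)

for task, pairs in tasks:
    load, r = heapq.heappop(reducers)                # lowest number of pairs
    print(f"{task} -> reducer {r}")
    heapq.heappush(reducers, (load + pairs, r))
```

Running it reproduces the assignment above, ending with loads of 7, 7, and 6 pairs on reducers 1–3.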


15

BlockSplit: MR-Dataflow

[Figure: BlockSplit MapReduce dataflow over the same input. Map emits composite keys of the form ReducerIndex.BlockIndex.MatchTask (e.g., 1.1.* for block Φ1 on reducer 1, 2.4.12 for the cross task Φ4.1×2 on reducer 2, 3.4.2 for sub-block Φ4.2 on reducer 3); partitioning is by reducer index. Resulting pairs — reducer 1: A-B, A-H, A-I, B-H, B-I, H-I, K-L; reducer 2: F-M, F-N, F-O, G-M, G-N, G-O, F-G; reducer 3: C-D, C-E, D-E, M-N, M-O, N-O.]

Techniques
• Map key = reducer index + match task
• Replicate the entities of sub-blocks
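A sketch of the key construction, with the format assumed from the figure:

```python
# Sketch: composite map output keys for BlockSplit, read off the figure
# as ReducerIndex.BlockIndex.MatchTask. '*' marks an unsplit block, one
# digit a single sub-block, two digits a cross task over two sub-blocks.
def map_key(reducer: int, block: int, task: str = "*") -> str:
    return f"{reducer}.{block}.{task}"

print(map_key(1, 1))        # 1.1.*  -> whole block Phi1 on reducer 1
print(map_key(2, 4, "12"))  # 2.4.12 -> cross task Phi4.1x2 on reducer 2
print(map_key(3, 4, "2"))   # 3.4.2  -> sub-block Phi4.2 on reducer 3
```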


16

Evaluation: Robustness + Scalability

[Evaluation plots over data-skew configurations ranging from "all entities in a single block" to "uniform distribution".]

• Robust against data skew
• Scalable


17

Agenda

• Data Integration
  – Data analytics for domain-specific questions
  – Use cases: Bibliometrics & Life Sciences
• Big Data Integration
  – Techniques for efficient big data management
  – Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
• Lazy Big Data Integration
  – (Efficient and) effective goal-oriented data integration
  – Integrated analytical approach for big data analytics


18

(Big) Data Analytics Pipeline

Data Acquisition → Data Extraction/Cleaning → Schema & Entity Matching → Data Fusion/Aggregation → Data Analytics/Visualization

(The first four stages form the data integration part; analytics/visualization forms the interpretation part.)


19

Citation Analysis Pipeline

• For a given set of BibTeX entries
  – Find matching Google Scholar entries
  – Determine aggregated citation counts
• Analytical questions for a researcher
  – Complete publication list + #citations
  – Top-5 publications
  – h-index, average number of citations (sketched below)
• Analytical questions for comparing
  – Institutions
  – Research fields
  – …
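Once the matched publication list exists, such statistics are cheap to compute; for example, the h-index (with hypothetical counts):

```python
# Sketch: h-index over a researcher's aggregated citation counts.
def h_index(citations):
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

print(h_index([120, 80, 40, 12, 7, 3, 1]))  # -> 5 (illustrative counts)
```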


20

„Lazy Machine“: Effectiveness

• Do the right thing! Do only the things that are needed!
  – Prioritization/filtering of the data objects to be processed
• Example: top-5 publications of a researcher (see the sketch below)
  – Entity matching for highly cited Google Scholar entries
  – Cut off data that does not contribute to the analytical result (anymore)
  – "does not" → "is not likely to"
• Pipeline stages
  – Data Acquisition: query strategies
  – Data Extraction: on-demand
  – Data Matching: relevant entities only
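A sketch of the cutoff idea for the top-5 question: consume Google Scholar candidates in descending citation order and stop as soon as no unseen candidate can still change the top 5. The candidate iterator and the matcher are assumed interfaces, not APIs from the talk:

```python
# Sketch: lazy top-k matching with a citation-count cutoff.
# gs_candidates: Google Scholar entries, sorted by citations descending.
# is_match: matcher against the researcher's BibTeX entries (assumed).
def lazy_top_k(gs_candidates, is_match, k=5):
    top = []
    for entry in gs_candidates:
        if len(top) == k and entry["citations"] <= top[-1]["citations"]:
            break  # no later candidate can still enter the top k
        if is_match(entry):
            top = sorted(top + [entry],
                         key=lambda e: -e["citations"])[:k]
    return top
```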


21

„Lazy User“: Data Quality

• Automatic data integration does not deliver 100% data quality
  – Data acquisition might miss relevant data
  – Matching is imperfect (precision, recall)
  – …
• The pipeline and the integrated result should effectively point the user to the "weak points"
• Examples (see the sketch below)
  – Which (non-)matching pairs have the greatest effect on the analytical result?
  – Outlier detection → which pipeline stage caused the effect?
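One way to find such weak points (an illustrative approach, not one prescribed on the slide) is leave-one-out re-evaluation: recompute the result without each uncertain match and rank the matches by the change they cause:

```python
# Sketch: leave-one-out impact of uncertain match decisions on a result
# (here the h-index); the re-evaluation strategy is an assumption.
def h_index(citations):
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def rank_by_impact(matched, uncertain):
    base = h_index([e["citations"] for e in matched])
    deltas = []
    for u in uncertain:  # drop one uncertain match at a time
        rest = [e["citations"] for e in matched if e is not u]
        deltas.append((u["id"], abs(base - h_index(rest))))
    return sorted(deltas, key=lambda t: -t[1])
```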


22

Lazy Big Data Integration

• Integrated approach covering both
  – the data integration workflow
  – the analytical query
• Current work based on Gradoop (Graph Analytics on Hadoop)
  – Graph model + operators for analytical pipelines
  – Efficient execution in a distributed environment
• Next steps
  – Operators for complex analytical queries/statistics (e.g., h-index)
  – Data provenance model for measuring the impact of data objects on specific results


23

Summary

• Data Integration
  – Data analytics for domain-specific questions
  – Use cases: Bibliometrics & Life Sciences
• Big Data Integration
  – Techniques for efficient big data management
  – Exploiting cloud infrastructures (MapReduce, NoSQL data stores)
• Lazy Big Data Integration
  – (Efficient and) effective goal-oriented data integration
  – Integrated analytical approach for big data analytics