Record Linkage in a Distributed Environment

Huang Yipeng
Wing group meeting, 11 March 2011

Introduction: Record Linkage

Determining if pairs of personal records refer to the same entity

E.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee>
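As a rough illustration of what "refer to the same entity" can mean in practice, the sketch below averages per-field string similarities and applies a threshold. The field names, the records and the 0.85 threshold are all hypothetical; the talk does not say which similarity measure it uses, and `difflib.SequenceMatcher` from the Python standard library is only a stand-in.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_entity(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Average the similarities of the shared fields and apply a threshold.

    The threshold is an arbitrary illustrative value, not one from the talk.
    """
    fields = rec_a.keys() & rec_b.keys()
    score = sum(field_similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
    return score >= threshold


# Hypothetical records: the same given name can belong to different people.
presenter = {"name": "Yipeng", "surname": "Huang", "affiliation": "Wing group"}
namesake = {"name": "Yipeng", "surname": "Lee", "affiliation": "unknown"}
typo_copy = {"name": "Yipneg", "surname": "Huang", "affiliation": "Wing group"}

print(same_entity(presenter, typo_copy))  # a misspelt copy of the same person
print(same_entity(presenter, namesake))   # a different person with the same name
```

Comparing on the name alone would wrongly merge the two Yipengs, which is why the sketch looks at several fields.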

Introduction: The Distributed Environment

Why?
◦ Dealing with large data
◦ Limitation of blocking

Advantages
◦ Parallel computation
◦ Data source flexibility
◦ Complementary to blocking methods

[Figure: example records with the names Amanda (six times), Beverley and Katherine, annotated O(nC2)]
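The O(nC2) annotation refers to the number of unordered record pairs, nC2 = n(n-1)/2. A minimal sketch of that arithmetic follows, using the eight names listed on this slide (six Amanda, one Beverley, one Katherine). The function names are made up for illustration, and blocking by exact name is an assumption about what the figure shows.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered record pairs among n records: nC2 = n(n-1)/2."""
    return comb(n, 2)


def blocked_comparisons(block_sizes: list[int]) -> int:
    """Comparisons needed when only records within the same block are compared."""
    return sum(pairwise_comparisons(size) for size in block_sizes)


# Eight records in total: six "Amanda", one "Beverley", one "Katherine".
all_pairs = pairwise_comparisons(8)
by_name = blocked_comparisons([6, 1, 1])

print(all_pairs, by_name)
```

Blocking reduces the total, but the one large block still carries all of the remaining work. That is the limitation of blocking that distribution is meant to complement.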

Introduction: The Distributed Environment

MapReduce
◦ Distributed environment for large data sets

Hadoop
◦ Open source implementation
◦ Convenient model for scaling Record Linkage
◦ Protects users from system level concerns

Introduction

Research Problem
Disconnect between generic parallel framework and specific Record Linkage problem

The goal
Tailor Hadoop for Record Linkage tasks

Outline
◦ Introduction
◦ Related Work
◦ Methodology
◦ Evaluation
◦ Conclusion

Related Work

Record Linkage Literature
◦ Blocking techniques

Parallel Record Linkage Literature
◦ P-Febrl (P Christen 2003)
◦ P-Swoosh (H Kawai 2006)
◦ Parallel Linkage (H Kim 2007)

Hadoop Literature
◦ Evaluation Metrics
◦ Pairwise comparisons (T Elsayed 2008)

Methodology: MapReduce Workflow

[Figure: MapReduce workflow, showing the Partitioner]

Methodology: Implementation

Map
Purpose:
◦ Parallelism
◦ Data manipulation
◦ Blocking
Reads lines of input and outputs <key, value> pairs.

Reduce
Purpose:
◦ Parallelism
◦ Record Linkage ops
Records with the same <key> go to the same Reduce().
Output: linkage results
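The Map and Reduce roles described above can be imitated in memory with a few lines of Python. This is a sketch only: it does not use Hadoop, the comma-separated input format and the choice of the first field as blocking key are assumptions, and the reduce step just lists the candidate pairs instead of running a real record comparison.

```python
from collections import defaultdict
from itertools import combinations


def map_record(line: str) -> tuple[str, str]:
    """Map: read one input line and emit a <key, value> pair.

    The first comma-separated field is used as the blocking key (assumed).
    """
    key = line.split(",")[0].strip().lower()
    return key, line


def reduce_block(key: str, records: list[str]) -> list[tuple[str, str]]:
    """Reduce: all records with the same key arrive together.

    A real job would compare each pair; here we only enumerate them.
    """
    return list(combinations(records, 2))


def run_job(lines: list[str]) -> dict[str, list[tuple[str, str]]]:
    """Stand-in for the shuffle: group map output by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    return {key: reduce_block(key, recs) for key, recs in groups.items()}


lines = ["amanda, smith", "Amanda, smyth", "beverley, jones", "amanda, smith"]
candidate_pairs = run_job(lines)
for key, pairs in candidate_pairs.items():
    print(key, len(pairs))
```

Because the key doubles as the blocking key, the size of each reduce call is fixed by how many records share that key.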

Methodology: Hash Partitioner

◦ Default implementation
◦ Hash(Key) mod N
◦ Good for uniform data but not for skewed distributions

[Figure: bar chart of the reduce task list for Job x, by node, annotated "5416986 comparisons" and "210 comparisons"]

Name       Distribution   Comparisons
joshua     5000           12497500
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465
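The effect of Hash(Key) mod N on a skewed key distribution can be reproduced with the table from this slide. The sketch below is not Hadoop code: CRC32 stands in for the key hash so that the result is deterministic, and two nodes are assumed.

```python
import zlib
from math import comb


def hash_partition(key: str, num_nodes: int) -> int:
    """Hash(Key) mod N, with CRC32 as a stand-in deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


# Block sizes (the "Distribution" column) from the table on this slide.
block_sizes = {
    "joshua": 5000, "emiily": 48, "jack": 35,
    "thomas": 33, "lachlan": 32, "benjamin": 31,
}


def comparisons_per_node(sizes: dict[str, int], num_nodes: int) -> list[int]:
    """Sum the nC2 comparison workload of every block sent to each node."""
    load = [0] * num_nodes
    for name, size in sizes.items():
        load[hash_partition(name, num_nodes)] += comb(size, 2)
    return load


load = comparisons_per_node(block_sizes, num_nodes=2)
print(load)
```

Whichever node receives the joshua block ends up with at least 12497500 comparisons, while the other node has at most 3212. A hash of the key cannot see how large a block is.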

Methodology: Record Linkage Partitioner

Goal:
◦ Have all nodes finish the reduce phase at the same time
◦ Attain a better runtime while retaining the same level of accuracy

Methodology: Domain principles

◦ Counting pairwise comparisons gives a more accurate picture of the true computational workload
◦ The distribution of names tends to follow a power law distribution in many countries (D Zanette 2001), (S Miyazima 2000)
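A small worked example shows why counting pairwise comparisons is the better workload measure. Two nodes can hold the same number of records and still have very different workloads. The block sizes below are hypothetical.

```python
from math import comb


def record_count(block_sizes: list[int]) -> int:
    """Workload measured naively, as the number of records on a node."""
    return sum(block_sizes)


def comparison_workload(block_sizes: list[int]) -> int:
    """Workload measured as pairwise comparisons, nC2 per block."""
    return sum(comb(size, 2) for size in block_sizes)


# Hypothetical nodes, each holding 100 records.
node_a = [100]       # one block of 100 records sharing a common name
node_b = [10] * 10   # ten blocks of 10 records each

print(record_count(node_a), comparison_workload(node_a))
print(record_count(node_b), comparison_workload(node_b))
```

Under a power law distribution of names, a few very large blocks like `node_a` are exactly what one should expect.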

Methodology: Record Linkage Workflow

Round 1: Range partition based on comparison workload
Round 2: Merge lost comparisons from Round 1
Round 3: Remove cross duplicates

Methodology: Round 1

[Figure: Input, Map Phase, Distribution]

1. Calculate the average comparison workload over N nodes.
2. Check if a record will exceed the average. If yes, divide it by the minimum number of nodes needed to drop below the average.
3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.
4. Recurse until the comparison load can be evenly distributed among nodes.
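The four steps above can be sketched as follows. This is one possible reading, not the author's implementation. It assumes that "record" in step 2 means a block of records sharing a key and that an oversized block is split into near-equal sub-blocks. It also uses a greedy least-loaded assignment in place of the range partition mentioned in the workflow. All function names are hypothetical.

```python
from math import comb


def split_sizes(size: int, parts: int) -> list[int]:
    """Split `size` records into `parts` sub-blocks whose sizes differ by at most one."""
    base, extra = divmod(size, parts)
    return [base + 1] * extra + [base] * (parts - extra)


def plan_round1(block_sizes: dict[str, int], num_nodes: int):
    """Return (assignment, per-node load, lost comparisons)."""
    blocks = list(block_sizes.items())
    original = sum(comb(size, 2) for _, size in blocks)
    while True:
        # Step 1: average comparison workload over N nodes.
        avg = sum(comb(size, 2) for _, size in blocks) / num_nodes
        next_blocks, changed = [], False
        for label, size in blocks:
            if comb(size, 2) > avg:
                # Step 2: smallest number of parts that drops each part below avg.
                parts = 2
                while comb(-(-size // parts), 2) > avg:  # -(-a // b) is ceil(a / b)
                    parts += 1
                for i, sub in enumerate(split_sizes(size, parts)):
                    next_blocks.append((f"{label}/{i}", sub))
                changed = True
            else:
                next_blocks.append((label, size))
        blocks = next_blocks
        # Step 4: repeat, because splitting lowers the total and hence the average.
        if not changed:
            break
    # Step 3: assign blocks, largest first, to the currently least-loaded node.
    load, assignment = [0] * num_nodes, {}
    for label, size in sorted(blocks, key=lambda block: -block[1]):
        node = load.index(min(load))
        load[node] += comb(size, 2)
        assignment[label] = node
    lost = original - sum(load)  # cross sub-block comparisons left for Round 2
    return assignment, load, lost


sizes = {"joshua": 5000, "emiily": 48, "jack": 35,
         "thomas": 33, "lachlan": 32, "benjamin": 31}
assignment, load, lost = plan_round1(sizes, num_nodes=2)
print(load, lost)
```

With the table from the Hash Partitioner slide and two nodes, the joshua block is split into two halves of 2500 records. The 2500 × 2500 comparisons between the halves are the "lost comparisons" that a later round has to recover.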

Methodology: Round 2

[Figure: List X split into sub-lists A and B. In the comparison matrix, A against A and B against B are handled in Round 1 (R1); B against A is handled in Round 2 (R2).]

Methodology: Round 2

[Figure: cross-comparison jobs. With sub-lists A and B: Job 1 compares B with A. With sub-lists A, B and C: Job 1 compares B with A, Job 2 compares C with A, Job 3 compares C with B.]

1. Only acts on lost comparisons
2. Because input is indistinct, a 3rd round of deduplication may be needed.
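The Round 2 jobs can be sketched as one job per unordered pair of sub-lists, each comparing every record of one sub-list with every record of the other. The sub-list contents and the function name below are hypothetical, and the job order is simply the order in which pairs are enumerated.

```python
from itertools import combinations, product
from math import comb


def round2_jobs(sub_lists: dict[str, list[str]]):
    """Build one cross-comparison job for each unordered pair of sub-lists."""
    jobs = []
    for (name_a, recs_a), (name_b, recs_b) in combinations(sub_lists.items(), 2):
        jobs.append(((name_a, name_b), list(product(recs_a, recs_b))))
    return jobs


# Hypothetical block of six records that Round 1 split into three sub-lists.
sub_lists = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
jobs = round2_jobs(sub_lists)

within = sum(comb(len(recs), 2) for recs in sub_lists.values())  # done in Round 1
across = sum(len(pairs) for _, pairs in jobs)                    # done in Round 2
print([names for names, _ in jobs], within, across)
```

Round 1 and Round 2 together cover every pair in the original block exactly once, so splitting a block changes where the work is done, not how much of it there is.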

Evaluation: Performance Metrics

Performance evaluation in absolute runtime, speedup & scaleup on a shared cluster.
◦ “It’s what users care about”
◦ Representative of real operations
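The talk does not spell out how speedup and scaleup are computed. The sketch below uses the usual definitions from the parallel database literature, which is an assumption about what is meant here. The timings are hypothetical and are not measurements from this work.

```python
def speedup(time_one_node: float, time_n_nodes: float) -> float:
    """Same data, more nodes. The ideal value equals the number of nodes."""
    return time_one_node / time_n_nodes


def scaleup(time_base: float, time_scaled: float) -> float:
    """Data and nodes grown by the same factor. The ideal value is 1.0.

    time_base:   runtime for the base data set on one node
    time_scaled: runtime for n times the data on n nodes
    """
    return time_base / time_scaled


# Hypothetical timings in seconds, for illustration only.
print(speedup(400.0, 110.0))  # 4 nodes on the same data
print(scaleup(400.0, 440.0))  # 4 nodes on 4 times the data
```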

Methodology: Input Records

10 million records, 0.9 million original, 0.1 million duplicate, up to 9 duplicates per record, 1 modification per field, 1 modification per record, duplicates follow Poisson distribution.

<rec-359705-org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , 19090518, 38, 07 34366927, 6174819, 9>

Methodology: Data sets

Synthetic data produced with the Febrl data generator
◦ Artificially skewed distribution

[Figure: chart of Comparisons per name]

Name       Distribution   Comparisons
joshua     50             1225
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465

Evaluation: Utilization

[Figure: bar chart of utilization for Node 1 and Node 2, split into Idle and Computation]

[Figure: bar chart of utilization for Node 1, Node 2 and Node 3, split into Idle and Computation]

[Figure: the same chart for Node 1, Node 2 and Node 3, with portions labelled A, B and C]

[Figure: the same chart for Node 1, Node 2 and Node 3, split into Idle, Redistributed Computation and Original Computation, with portions labelled A, B and C]

Round 2

[Figure: cross-comparison matrix over sub-lists A, B and C, with jobs J1 to J6 and one cell marked "?"]

Node Utilization 50-100%

Evaluation: Results so far

Setup                                     Default Workflow   RL Workflow
2 nodes, 5000 records, 2433 duplicates    71.5 secs          75 secs
2 nodes, 7000 records, 4814 duplicates    >10 mins           196.8 secs

Evaluation: Results so far

RL Workflow runtime
◦ Similar to Hash-based runtime on small datasets
◦ Better as the size of the dataset grows

Conclusion

Parallelism is a step in the right direction for record linkage
◦ Complementary to existing approaches

Hadoop can be tailored for Record Linkage tasks
◦ The “Record Linkage” Partitioner / Workflow is just one example of possible improvements