
Group Linkage

Byung-Won On (Penn State Univ.)
Nick Koudas (Univ. of Toronto)
Dongwon Lee (Penn State Univ.)
Divesh Srivastava (AT&T Labs Research)

ICDE 2007

Outline
  Introduction
  Matching
  Bipartite Graph
  Group Linkage
    Bipartite matching
    Pre-processing step to speed up
      Greedy matching
      Heuristic measure
  Experiment & Result
  Conclusion

Introduction
  Poor quality data in databases
    Transcription errors
    Lack of standards for recording
    Poor database design
  How to identify whether two entities are approximately the same?
    Group linkage problem
  Ex: J. Ullman; J.D. Ullman; Ullman, Jeffrey

Group Linkage Problem
  [Figure: example of linking author groups across ACM and DBLP; labels: Lily Hsueh, K.L. Hsueh, Davy Jones, Peter Pan, Paper1 to Paper5]
  Group: each author
  Records: a list of citations per author

Matching
Matching: A matching in a graph G is a set of non-loop edges with no shared endpoints.

Maximum matching: A matching that contains the largest possible number of edges.

Bipartite Graph
Bipartite graph: A graph G is bipartite if its vertex set V is the union of two disjoint independent sets, called the partite sets of G.
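The two definitions above (maximum matching, bipartite graph) can be combined in a small sketch. The code below finds a maximum matching in a bipartite graph with the standard augmenting-path method. It is only an illustration: the function name, the adjacency-list encoding, and the tiny example graph are assumptions, not part of the slides.

```python
def maximum_bipartite_matching(adj, n_right):
    """Return (size, pairs) of a maximum matching in a bipartite graph.

    adj[u] lists the right-side vertices adjacent to left vertex u
    (hypothetical encoding chosen for this sketch).
    """
    match_right = [-1] * n_right  # match_right[v] = left vertex matched to v

    def try_augment(u, seen):
        # Try to match u, re-routing earlier matches along an augmenting path.
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            size += 1
    pairs = sorted((u, v) for v, u in enumerate(match_right) if u != -1)
    return size, pairs


# Left vertices 0..2, right vertices 0..2 (made-up example graph).
adj = [[0, 1], [0], [1, 2]]
size, pairs = maximum_bipartite_matching(adj, 3)
print(size, pairs)
```

No two returned pairs share an endpoint, which is exactly the matching condition stated above.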

[Figure: bipartite matching]

Group Linkage (1)
Jaccard similarity measure between two sets s1 and s2:
  sim(s1, s2) = |s1 ∩ s2| / |s1 ∪ s2|

Under this measure, records from the two groups can be matched only when they are identical.
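A minimal sketch of the Jaccard measure applied to two groups of records, to show why exact matching is limiting. The two "Register Allocation" titles come from the slides; the other titles and the empty-set convention are assumptions made for illustration.

```python
def jaccard(s1, s2):
    """Jaccard similarity |s1 ∩ s2| / |s1 ∪ s2| between two sets."""
    s1, s2 = set(s1), set(s2)
    if not s1 and not s2:
        return 0.0  # assumed convention for two empty groups
    return len(s1 & s2) / len(s1 | s2)


# Each group is the set of citation titles of one author.
acm = {
    "Register Allocation & Spilling via graph coloring",
    "hypothetical paper B",
    "hypothetical paper C",
}
dblp = {
    "Register Allocation and Spilling via graph coloring",
    "hypothetical paper B",
    "hypothetical paper C",
}
print(jaccard(acm, dblp))
```

The two near-identical titles differ by one token ("&" versus "and"), so they count as different records and the group similarity drops to 2/4 even though the groups plausibly describe the same author.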

Group Linkage (2)

Notation        Description
D               Relation of multi-attribute records
g1, g2, ...     Groups of records in D
r1, r2, ...     Records in D
sim(ri, rj)     Arbitrary record-level similarity function
θ               Group-level similarity threshold
ρ               Record-level similarity threshold
M               Maximum weight bipartite matching
BM              Bipartite matching based group linkage

Group Linkage (3)
[Figure: bipartite graph between the records of g1 (r11 to r14) and g2 (r21 to r25)]

Group similarity: each matched pair (ri, rj) must satisfy sim(ri, rj) ≥ ρ, and the sum of the matched similarities is normalized.
Similar records (example): K.L. Hsueh and Lily Hsueh
  "Register Allocation & Spilling via graph coloring"
  "Register Allocation and Spilling via graph coloring"

Bipartite Matching
Record-level similarity measure
  [5] S. Chaudhuri, V. Ganti, and R. Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. In IEEE ICDE, 2006.
Maximum weight bipartite matching (BM)
  [10] S. Guha, N. Koudas, A. Marathe, and D. Srivastava. Merging the Results of Approximate Match Operations. In VLDB, pages 636-647, 2004.
Applying this strategy to every pair of groups is infeasible, so a pre-processing step is used:
  Greedy matching
  Heuristic measure

Greedy Matching (1)
S1: For each record ri ∈ g1, find a record rj ∈ g2 with the highest record-level similarity among those with sim(ri, rj) ≥ ρ.
S2: Same as S1, with the roles of g1 and g2 swapped.
[Figure: greedy edges between the records of g1 (r11 to r14) and g2 (r21 to r25)]
The result may not be a matching!

Greedy Matching (2)
Upper and lower bounds to BM_{sim,ρ}

[Figure: bipartite graph between the records of g1 (r11 to r14) and g2 (r21 to r25)]

Greedy Matching (2)
BM_{sim,ρ} is bounded: LB_{sim,ρ} ≤ BM_{sim,ρ} ≤ UB_{sim,ρ}

Only when LB_{sim,ρ} < θ ≤ UB_{sim,ρ} would the more expensive BM computation be needed.
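The filtering idea can be sketched as follows. The slide formulas did not survive extraction, so the forms used here are assumptions: ρ is the record-level threshold, θ the group-level threshold, BM is the weight of a maximum weight matching divided by (|g1| + |g2| - |M|), the upper bound uses the pairs in S1 ∪ S2, and the lower bound uses the pairs in S1 ∩ S2, each normalized the same way. The brute-force matching, all function names, and the similarity table are hypothetical and only suitable for tiny groups.

```python
def edges_above(g1, g2, sim, rho):
    """Weights of all record pairs (i, j) whose similarity is at least rho."""
    edges = {}
    for i, a in enumerate(g1):
        for j, b in enumerate(g2):
            s = sim(a, b)
            if s >= rho:
                edges[(i, j)] = s
    return edges


def max_weight_matching(w, m1, used=frozenset(), i=0):
    """Brute-force (total weight, size) of a maximum weight matching."""
    if i == m1:
        return (0.0, 0)
    best = max_weight_matching(w, m1, used, i + 1)  # leave record i unmatched
    for (a, j), s in w.items():
        if a == i and j not in used:
            wt, k = max_weight_matching(w, m1, used | {j}, i + 1)
            best = max(best, (wt + s, k + 1))
    return best


def bm(g1, g2, sim, rho):
    """Assumed form of the BM measure: matched weight / (m1 + m2 - |M|)."""
    weight, size = max_weight_matching(edges_above(g1, g2, sim, rho), len(g1))
    return weight / (len(g1) + len(g2) - size)


def greedy_bounds(g1, g2, sim, rho):
    """Return (LB, UB) from the greedy edge sets S1 and S2 (assumed forms)."""
    w = edges_above(g1, g2, sim, rho)
    s1, s2 = set(), set()
    for i in range(len(g1)):  # S1: best partner in g2 for each record of g1
        row = [(s, j) for (a, j), s in w.items() if a == i]
        if row:
            s1.add((i, max(row)[1]))
    for j in range(len(g2)):  # S2: best partner in g1 for each record of g2
        col = [(s, i) for (i, b), s in w.items() if b == j]
        if col:
            s2.add((max(col)[1], j))

    def score(pairs):
        denom = len(g1) + len(g2) - len(pairs)
        # A zero denominator can occur with ties; infinity is still a valid bound.
        return sum(w[p] for p in pairs) / denom if denom > 0 else float("inf")

    return score(s1 & s2), score(s1 | s2)


def is_linked(g1, g2, sim, rho, theta):
    """Use the cheap bounds first; fall back to BM only when they disagree."""
    lb, ub = greedy_bounds(g1, g2, sim, rho)
    if ub < theta:
        return False
    if lb >= theta:
        return True
    return bm(g1, g2, sim, rho) >= theta


# Made-up similarity table for two small groups.
TABLE = {("a1", "b1"): 0.9, ("a1", "b2"): 0.4,
         ("a2", "b1"): 0.8, ("a2", "b2"): 0.6,
         ("a3", "b1"): 0.1, ("a3", "b2"): 0.2}
g1, g2 = ["a1", "a2", "a3"], ["b1", "b2"]
sim = lambda a, b: TABLE[(a, b)]
lb, ub = greedy_bounds(g1, g2, sim, rho=0.5)
exact = bm(g1, g2, sim, rho=0.5)
print(lb, exact, ub)
```

In this example the upper bound exceeds 1, which is harmless: it is only used to discard pairs of groups whose bound falls below θ.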

Heuristic Measure
In practice, pairs of groups with a high value of BM_{sim,ρ} will share at least one pair of records with a high record-level similarity.
Simpler and faster measure: MAX_{sim,ρ}
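A minimal sketch of such a heuristic, assuming it is simply the highest record-level similarity between the two groups, reported only when it reaches the record-level threshold (called rho here). The function names, the token-based similarity, and the unrelated title are assumptions for illustration.

```python
def token_jaccard(a, b):
    """Hypothetical record-level similarity: Jaccard over lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def max_sim(g1, g2, sim, rho):
    """Highest pairwise similarity between two groups, or 0.0 if below rho."""
    best = max((sim(a, b) for a in g1 for b in g2), default=0.0)
    return best if best >= rho else 0.0


g1 = ["Register Allocation & Spilling via graph coloring",
      "A hypothetical unrelated title"]
g2 = ["Register Allocation and Spilling via graph coloring"]
print(max_sim(g1, g2, token_jaccard, rho=0.5))
```

This needs a single pass over the record pairs and no matching step, which is what makes it cheaper than a matching-based measure.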

Implementation
Implemented UB_{sim,ρ}, LB_{sim,ρ}, and MAX_{sim,ρ} in SQL. (We only discuss UB.)
Notation:
  Group                   Author
  Record in a group       Citations of an author
  Group linkage problem   Linkage between authors
  Key to link             Author names

Experiment
Real data sets: data sets from the ACM and DBLP citation digital libraries.
  R1: uniform data sets
    R1a: average # of citations: left=41, right=25
    R1b: average # of citations: left=40, right=55
  R2: skewed data sets
    R2DB: average # of citations: left=30, right=9
    R2AI: average # of citations: left=31, right=10
    R2Net: average # of citations: left=22, right=6

Experiment
Synthetic data sets:
  S1a and S1b: same as R1a, but dummy authors are injected into the right data set
    S1a: # of citations 1/3
    S1b: # of citations 3
  S2: dummy authors with varying levels of errors are generated with the dbgen tool and inserted into the right data set.

Experiment
Evaluation metrics
  Average recall: if a2 is included in the top-k answer window for a1, then recall is 1, and 0 otherwise.
Compared methods
  A(k1)|B(k2): a two-step method
    Step 1: A, window size k1
    Step 2: B, window size k2
Setup: Microsoft SQL Server 2000 on a Pentium III 3 GHz / 512 MB machine
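The metric and the two-step scheme can be sketched as below. This is an illustration under assumptions: the function names, the score tables, and the reading of A(k1)|B(k2) as "rank with A and keep k1 candidates, then re-rank those with B and keep k2" are not taken from the slides beyond the step descriptions above.

```python
def top_k(query, candidates, score, k):
    """The k candidates with the highest score for this query."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:k]


def two_step(query, candidates, score_a, k1, score_b, k2):
    """A(k1)|B(k2): filter with measure A, then re-rank the window with B."""
    window = top_k(query, candidates, score_a, k1)
    return top_k(query, window, score_b, k2)


def average_recall(truth, retrieve):
    """Mean over queries of 1 if the true match is in the answer window, else 0."""
    hits = [1 if truth[q] in retrieve(q) else 0 for q in truth]
    return sum(hits) / len(hits)


# Made-up scores: CHEAP plays the role of A, EXACT the role of B.
CHEAP = {("q1", "x"): 0.5, ("q1", "y"): 0.9, ("q1", "z"): 0.1,
         ("q2", "x"): 0.8, ("q2", "y"): 0.7, ("q2", "z"): 0.6}
EXACT = {("q1", "x"): 0.9, ("q1", "y"): 0.3, ("q1", "z"): 0.1,
         ("q2", "x"): 0.2, ("q2", "y"): 0.9, ("q2", "z"): 0.8}
cheap = lambda q, c: CHEAP[(q, c)]
exact = lambda q, c: EXACT[(q, c)]
candidates = ["x", "y", "z"]
truth = {"q1": "x", "q2": "z"}  # true match for each query author

narrow = average_recall(
    truth, lambda q: two_step(q, candidates, cheap, 2, exact, 1))
wide = average_recall(
    truth, lambda q: two_step(q, candidates, cheap, 3, exact, 2))
print(narrow, wide)
```

With the narrow windows the first step discards the true match for q2, so recall is lower; widening both windows recovers it at the cost of more work in the second step.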

Results
Uniform data set: R1 (real data set)

Results
S1 and S2 synthetic data sets

JA incorrectly selects dummy authors.
JA and BM are directly applied to S2.
BM outperforms JA by 16-17%.

Results
R2 real data set

Pre-processing using: UB, MAX
UB outperforms MAX in recall.

Results
Record-level similarity measure: cosine similarity with TF/IDF weighting.

Running time against R2 (in sec)


Results
Window size

Conclusion
  Proposed a bipartite matching based group similarity measure (BM) to solve the group linkage problem.
  Proved that upper and lower bounds of BM can be used for speed-up.
  BM is a more robust group similarity measure than the others.
