
Unique Entity Estimation with Application to the Syrian Conflict

Beidi Chen *, Anshumali Shrivastava *, Rebecca C. Steorts †

* Department of Computer Science, Rice University, † Department of Statistical Science, Duke University

Goal: Unique Entity Estimation (UEE)

Given a dataset of M records, estimate the number of distinct identities (records) n < M.

Challenges
- Data Duplication: records are aggregated from various sources
- Data Noise: Human Error / Bias
- Computational Challenges

LOCALITY SENSITIVE HASHING (LSH)

Properties
- A family of functions with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones
- A favorite technique for approximate near-neighbor search

Figure: LSH

Popular LSH

Minwise Hashing and Resemblance Similarity
- The resemblance similarity between two given sets $x, y \subseteq \Omega = \{1, 2, \dots, D\}$ is defined as

$$R = \frac{|x \cap y|}{|x \cup y|} = \frac{a}{f_1 + f_2 - a}, \tag{1}$$

  where $f_1 = |x|$, $f_2 = |y|$, and $a = |x \cap y|$
- Minwise hashing → the LSH for resemblance similarity
- Apply a random permutation $\pi : \Omega \to \Omega$

$$h_\pi^{\min}(x) = \min(\pi(x)). \tag{2}$$

$$\Pr\left(h_\pi^{\min}(x) = h_\pi^{\min}(y)\right) = \frac{|x \cap y|}{|x \cup y|} = R. \tag{3}$$
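The collision property in Equation (3) can be checked empirically. The following is a minimal Python sketch, not the authors' implementation: it draws explicit random permutations (practical systems usually substitute hash functions), indexes the universe from 0 instead of 1, and uses illustrative names (`resemblance`, `minhash_signature`, `estimate_resemblance`) and toy sets that are assumptions made for this example.

```python
import random


def resemblance(x, y):
    """Exact resemblance R = |x ∩ y| / |x ∪ y| from Equation (1)."""
    return len(x & y) / len(x | y)


def minhash_signature(s, permutations):
    """One minwise hash per permutation: the smallest permuted value in s."""
    return [min(perm[e] for e in s) for perm in permutations]


def estimate_resemblance(x, y, universe_size, num_perm=2000, seed=0):
    """Fraction of permutations under which the two minwise hashes collide."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(num_perm):
        perm = list(range(universe_size))
        rng.shuffle(perm)  # perm[e] is the image of element e
        permutations.append(perm)
    sig_x = minhash_signature(x, permutations)
    sig_y = minhash_signature(y, permutations)
    matches = sum(a == b for a, b in zip(sig_x, sig_y))
    return matches / num_perm


# Toy sets (hypothetical): f1 = 60, f2 = 70, a = 30, so R = 30 / 100.
x = set(range(0, 60))
y = set(range(30, 100))
exact = resemblance(x, y)
approx = estimate_resemblance(x, y, universe_size=100)
print(exact, approx)
```

With enough permutations the collision frequency concentrates around R, which is what makes minwise hashing usable as an LSH for resemblance.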

Motivation

Estimate Syrian Conflict Casualties across datasets from different sources.

Method

Existing Approaches
- Simply compare every two records
  Problem: O(M^2); 4.5 × 10^10 pairs, 6 days (Syrian dataset)
- Put similar records into blocks or bins
  Problem: Sacrifices accuracy for efficiency
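The quadratic cost of all-pairs comparison can be made concrete with a short calculation. This sketch assumes the Syria data set size of 296,245 records reported in the Experiments table; `num_pairs` is an illustrative helper name, not something from the original work.

```python
def num_pairs(m):
    """Number of unordered record pairs among m records: m * (m - 1) / 2."""
    return m * (m - 1) // 2


M_SYRIA = 296_245  # size of the Syria data set as listed in the Experiments table
pairs = num_pairs(M_SYRIA)
print(f"{pairs:,} pairs (about {pairs:.2e})")
```

The result is on the order of 10^10 pairs, in line with the figure quoted for the Syrian dataset.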

Our Approach

Figure: Flow

Estimator

... component, say $C^*_i$ (clique), will be sub-sampled and can appear as some possibly smaller connected components in $G'$. For example, a singleton set in $G^*$ will remain the same in $G'$. An isolated edge, on the other hand, can appear as an edge in $G'$ with probability $p$ and as two singleton vertices in $G'$ with probability $1 - p$. A triangle can decompose into three possibilities with probability shown in Figure 2. Each of these possibilities provides a linear equation connecting $n'_i$ to $n^*_i$. These equations up to cliques of size three are

$$E[n'_3] = n^*_3 \cdot p^2 \cdot (3 - 2p) \tag{2}$$

$$E[n'_2] = n^*_2 \cdot p + n^*_3 \cdot \left(3 \cdot (1 - p)^2 \cdot p\right) \tag{3}$$

$$E[n'_1] = n^*_1 + n^*_2 \cdot \left(2 \cdot (1 - p)\right) + n^*_3 \cdot \left(3 \cdot (1 - p)^2\right). \tag{4}$$

Since we observe $n'_i$, we can solve for the estimator of each $n^*_i$ and compute the number of connected components by summing up all $n^*_i$.
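The coefficients in Equations 2–4 of this section can be reproduced by brute force. The sketch below assumes that each edge of a true clique is kept independently with probability p, which is consistent with the edge and triangle probabilities stated above; the function names and the choice p = 0.8 are illustrative. It enumerates every subset of retained edges and accumulates the expected number of observed components of each size.

```python
from collections import Counter
from itertools import product


def component_sizes(num_vertices, edges):
    """Map component size -> number of components, using a small union-find."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    size_of_root = Counter(find(v) for v in range(num_vertices))
    return Counter(size_of_root.values())


def expected_component_counts(clique_size, p):
    """Expected observed component counts when each clique edge survives with prob p."""
    clique_edges = [(u, v) for u in range(clique_size)
                    for v in range(u + 1, clique_size)]
    expected = Counter()
    for keep in product([0, 1], repeat=len(clique_edges)):
        prob = 1.0
        for k in keep:
            prob *= p if k else (1 - p)
        kept = [e for e, k in zip(clique_edges, keep) if k]
        for size, count in component_sizes(clique_size, kept).items():
            expected[size] += prob * count
    return expected


p = 0.8  # hypothetical sampling probability
edge = expected_component_counts(2, p)
triangle = expected_component_counts(3, p)
print(dict(edge))
print(dict(triangle))
```

For a triangle the enumeration gives p^2 (3 - 2p) components of size three, 3 (1 - p)^2 p of size two and 3 (1 - p)^2 singletons, matching the n*_3 coefficients in the three equations.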

Fig 2: A general example illustrating the transformation and probabilities of connected components from $G^*$ to $G'$.

Unfortunately, this process quickly becomes combinatorial, and in fact is at least #P-hard (Provan and Ball, 1983) to compute for cliques of larger sizes. A large clique of size k can appear as many separate connected components, and the possibilities of smaller size components it can break into are exponential (Aleksandrov, 1956). Fortunately, we can safely ignore large connected components without significant loss in estimation for two reasons. First, in practical entity resolution tasks, when M is large and contains at least one string-valued feature, it is observed that most entities are replicated no more than three or four times. Second, a large clique can only induce large errors if it is broken into many connected components due to undersampling. According to Erdős and Rényi (1960), it will almost surely stay connected if p is high, which is the case with our sampling method.

imsart-aoas ver. 2013/03/06 file: aoas_revision_2017_09_17_FINAL_rcs.tex date: September 17, 2017

Assumption: As argued above, we safely assume that the cliques of sizes equal to or larger than 4 in the original graph would retain their structures, i.e., $\forall i \geq 4$, $n^*_i = n'_i$. With this assumption, we can write down the formula for estimating $n^*_1, n^*_2, n^*_3$ by solving Equations 2–4 as

$$n^*_3 = \frac{E[n'_3]}{p^2 \cdot (3 - 2p)}, \qquad n^*_2 = \frac{E[n'_2] - n^*_3 \cdot \left(3 \cdot (1 - p)^2 \cdot p\right)}{p} \tag{5}$$

$$n^*_1 = E[n'_1] - n^*_2 \cdot \left(2 \cdot (1 - p)\right) - n^*_3 \cdot \left(3 \cdot (1 - p)^2\right) \tag{6}$$

It directly follows that our estimator, which we call the Locality Sensitive Hashing Estimator (LSHE), for the number of connected components is given by

$$\mathrm{LSHE} = n'_1 + n'_2 \cdot \frac{2p - 1}{p} + n'_3 \cdot \frac{1 - 6 \cdot (1 - p)^2 \cdot p}{p^2 \cdot (3 - 2p)} + \sum_{i=4}^{M} n'_i. \tag{7}$$
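Equation 7 is a weighted sum of observed component counts, which makes it easy to sketch in code. The Python below is an illustration under stated assumptions rather than the authors' implementation: `lshe` takes a dictionary from component size to observed count, `expected_observed` is a hypothetical helper that applies Equations 2–4 as a forward model, and the true counts are invented for the round-trip check.

```python
def lshe(observed, p):
    """Equation 7: estimate the number of components from observed counts n'_i.

    observed maps component size -> number of observed components of that size.
    p is the probability that a true duplicate pair is sampled, with 0 < p <= 1.
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    n1 = observed.get(1, 0)
    n2 = observed.get(2, 0)
    n3 = observed.get(3, 0)
    # Components of size >= 4 are assumed to survive sampling intact.
    large = sum(count for size, count in observed.items() if size >= 4)
    w2 = (2 * p - 1) / p
    w3 = (1 - 6 * (1 - p) ** 2 * p) / (p ** 2 * (3 - 2 * p))
    return n1 + n2 * w2 + n3 * w3 + large


def expected_observed(true_counts, p):
    """Equations 2-4 as a forward model: expected n'_i given true n*_i."""
    s1, s2, s3 = (true_counts.get(i, 0) for i in (1, 2, 3))
    out = {size: c for size, c in true_counts.items() if size >= 4}
    out[3] = s3 * p ** 2 * (3 - 2 * p)
    out[2] = s2 * p + s3 * 3 * (1 - p) ** 2 * p
    out[1] = s1 + s2 * 2 * (1 - p) + s3 * 3 * (1 - p) ** 2
    return out


# Hypothetical true component counts: 605 entities in total.
true_counts = {1: 500, 2: 80, 3: 20, 4: 5}
p = 0.8
estimate = lshe(expected_observed(true_counts, p), p)
print(estimate)
```

Feeding the expected observed counts back through the estimator recovers the true total, which is the unbiasedness property in miniature. In practice the observed counts from the sampled graph stand in for the expectations.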



Analysis


3.4. Optimality Properties of LSHE. We now prove two properties of our unique entity estimator, namely, that it is unbiased and that it has provably lower variance than random sampling approaches.

Theorem 1. Assuming $\forall i \geq 4$, $n^*_i = n'_i$, we have

$$E[\mathrm{LSHE}] = n \quad \text{(unbiased)} \tag{8}$$

$$V[\mathrm{LSHE}] = n^*_3 \cdot \frac{(p - 1)^2 \cdot (3p^2 - p + 1)}{p^2 \cdot (3 - 2p)} + n^*_2 \cdot \frac{1 - p}{p} \tag{9}$$

The above estimator is unbiased and the variance is given by Equation 9. Theorem 2 proves that the variance of our estimator is monotonically decreasing.

Theorem 2. $V[\mathrm{LSHE}]$ is monotonically decreasing when $p$ increases in range $(0, 1]$.

The proof of Theorem 2 follows directly from the following lemma.

Lemma 1. The first-order derivative of $V[\mathrm{LSHE}]$ is negative when $p \in (0, 1]$.

Note that when $p = 1$, $V[\mathrm{LSHE}] = 0$, which means the observed graph $G'$ is exactly the same as $G^*$. For detailed proofs of unbiasedness and of this lemma, see Appendix B.
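The variance formula in Equation 9 can be sanity-checked numerically. The sketch below is a check under assumptions, not a proof: it treats the three outcomes of a sub-sampled triangle (stays connected, one edge plus a singleton, three singletons) as a discrete random variable weighted by the LSHE coefficients, compares its variance with the closed form, and then evaluates the total variance on a grid of p values. The counts n*_2 = 80 and n*_3 = 20 are hypothetical.

```python
def triangle_variance_closed_form(p):
    """Per-triangle term of Equation 9."""
    return (p - 1) ** 2 * (3 * p ** 2 - p + 1) / (p ** 2 * (3 - 2 * p))


def edge_variance_closed_form(p):
    """Per-edge term of Equation 9."""
    return (1 - p) / p


def triangle_variance_enumerated(p):
    """Variance of one triangle's LSHE contribution, from its three outcomes."""
    w2 = (2 * p - 1) / p                                      # weight of n'_2
    w3 = (1 - 6 * (1 - p) ** 2 * p) / (p ** 2 * (3 - 2 * p))  # weight of n'_3
    outcomes = [
        (p ** 2 * (3 - 2 * p), w3),        # triangle stays connected
        (3 * p * (1 - p) ** 2, 1 + w2),    # one edge plus one singleton
        ((1 - p) ** 3, 3.0),               # three singletons
    ]
    mean = sum(prob * value for prob, value in outcomes)
    return sum(prob * (value - mean) ** 2 for prob, value in outcomes)


def lshe_variance(n2, n3, p):
    """Equation 9 for n2 true pairs and n3 true triangles."""
    return n3 * triangle_variance_closed_form(p) + n2 * edge_variance_closed_form(p)


grid = [i / 20 for i in range(1, 21)]  # p = 0.05, 0.10, ..., 1.00
values = [lshe_variance(80, 20, p) for p in grid]
for p, v in zip(grid, values):
    print(f"p={p:.2f}  V[LSHE]={v:.4f}")
```

On this grid the printed variance shrinks as p grows and reaches zero at p = 1, in agreement with Theorem 2.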


Variance is monotonically decreasing when p increases

Experiments

- Datasets

DBname | Domain | Size | # Matching Pairs | # Attributes | # Entities
Restaurants | Restaurant Guide | 864 | 112 | 4 | 752
CD | Music CDs | 9,763 | 299 | 106 | 9,508
Voter | Registration Info | 324,074 | 70,359 | 6 | 255,447
Syria | Death Records | 296,245 | N/A | 7 | N/A

Table 1: We present five important features of the four data sets. Domain reflects the variety of the data type we used in the experiments. Size is the total number of records. # Matching Pairs shows how many pairs of records point to the same entity in each data set. # Attributes represents the dimensionality of an individual record. # Entities is the number of unique records.

... sets come from the Violation Documentation Centre (VDC), Syrian Center for Statistics and Research (CSR-SY), Syrian Network for Human Rights (SNHR), and Syria Shuhada website (SS). Each database lists a different number of recorded victims killed in the Syrian conflict, along with available identifying information including full Arabic name, date of death, death location, gender, among others.⁵

The above datasets cover a wide spectrum of different varieties observed in practice. For each data set, we report summary information in Table 1.

4.1. Evaluation Settings. In this section, we outline our evaluation settings. We denote Algorithm 1 as the LSH Estimator (LSHE). We make comparisons to the non-adaptive variant of our estimator (PRSE), where the sampling used is plain random (instead of the adaptive sampler). This baseline uses the exact same procedure as our proposed LSHE, except that the sampling is done uniformly. A comparison with PRSE quantifies the advantages of the proposed adaptive sampling over random sampling. In addition, we implemented the two other known sampling methods for connected component estimation, proposed in Frank (1978) and Chazelle, Rubinfeld and Trevisan (2005). For convenience, we denote them as Random Sub-Graph based Estimator (RSGE) and BFS on Random Vertex based Estimator (BFSE), respectively. Since the algorithms are based on sampling (adaptive or random), to ensure fairness, we fix a budget m as the number of pairs of vertices considered by the algorithm. Note that any query for an edge is a part of the budget. If the fixed budget is exhausted, then we stop the sampling process and use the corresponding estimate, using all the information ...

⁵ These databases include documented identifiable victims and not those who are missing in the conflict. Hence, any estimate reported only refers to the data at hand.


- Setting

... singleton nodes, which leads to poor accuracy for BFSE. Thus, it is expected that random sampling will perform poorly. Unfortunately, there is no other baseline for unbiased estimation of the number of unique entities.

From Figure 4, observe that the RE for the proposed estimator LSHE is approximately one to two orders of magnitude lower than that of the other considered methods, where the y-axis is on the log scale. Our proposed estimator LSHE consistently leads to significantly lower RE (lower error rates) than the other three estimators. This is not surprising given the analysis shown in Section 3.5. The variance of random sampling based methodologies will be significantly higher because a randomly sampled pair has a probability close to zero of being a duplicate pair.

Taking a closer look at LSHE, we notice that we are able to efficiently generate samples with very high values of p (see Table 2). In addition, we can clearly see that LSHE achieves high accuracy with very few samples. For example, for the CD data set, with a sample size less than 0.05% of the total possible pairs of records of the entire data set, LSHE achieves 0.0006 RE. Similarly, for the Voter data set, with a sample size less than 0.012% of the total possible pairs of records of the entire data set, LSHE achieves 0.003 RE.

As mentioned earlier, we also evaluated the effect of using SVM prediction as a proxy for actual labels with our LSHE. The dotted plots show those results. We remark on the results for LSHE + SVM in Section 5.

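Data set | Size (% of total) | p | K | L
Restaurant | 1.0 | 0.42 | 1 | 4
Restaurant | 2.5 | 0.54 | 1 | 8
Restaurant | 5.0 | 0.65 | 1 | 12
Restaurant | 10 | 0.82 | 1 | 20
CD | 0.005 | 0.72 | 1 | 5
CD | 0.01 | 0.74 | 1 | 6
CD | 0.02 | 0.82 | 1 | 8
CD | 0.04 | 0.92 | 1 | 14
Voter | 0.002 | 0.62 | 4 | 25
Voter | 0.006 | 0.72 | 4 | 32
Voter | 0.009 | 0.76 | 4 | 35
Voter | 0.013 | 0.82 | 4 | 40

Table 2: We illustrate part of the sample sizes (in % of total) for different sets of samples generated by Min-Wise Hashing and their corresponding p in all three data sets.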
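To give a sense of scale for the sample sizes in Table 2, the percentages can be converted into absolute numbers of record pairs. This sketch assumes that "total" means all unordered pairs of records, M(M - 1)/2, with M taken from Table 1; the helper names are illustrative.

```python
def total_pairs(m):
    """All unordered record pairs among m records."""
    return m * (m - 1) // 2


def budget_pairs(m, percent):
    """Number of pairs corresponding to a sample size given in % of total."""
    return round(total_pairs(m) * percent / 100)


# Data set sizes from Table 1 and the largest sample size per data set in Table 2.
sizes = {"Restaurant": 864, "CD": 9_763, "Voter": 324_074}
largest_sample_percent = {"Restaurant": 10, "CD": 0.04, "Voter": 0.013}

budgets = {}
for name, m in sizes.items():
    budgets[name] = budget_pairs(m, largest_sample_percent[name])
    print(f"{name}: {budgets[name]:,} of {total_pairs(m):,} pairs")
```

Even the largest Voter sample covers only a few million of the roughly 5 × 10^10 possible pairs.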

5. Estimation of Casualties in Syrian Conflict. In this section, we describe how we estimate the number of documented identifiable deaths for the Syrian data set. As noted before, we do not have ground truth labels for all record pairs, but the data set was partially labelled with 217,788 record pairs. Furthermore, given doubts about the accuracy of the partially labelled record pairs, we propose an alternative method of labelling the sampler pairs, which is also needed by our proposed estimation algorithm. More ...


- Results

Figure: RE (Relative Error, log scale) versus Sample Size (in % of total), in three panels: Estimation on Restaurant (LSHE, PRSE, RSGE, BFSE, LSHE+SVM), Estimation on CD (LSHE, RSGE, BFSE, LSHE+SVM), and Estimation on Voter (LSHE, RSGE, BFSE, LSHE+SVM).

Conclusions
- Estimated number of casualties in Syrian Conflict: 190,369 ± 207 (done in 119 seconds)
- It closely matches the 2014 HRDAG estimate of 190,000, obtained by manual hand-matching

[email protected]
