Unique Entity Estimation with Application to the Syrian Conflict
Beidi Chen*, Anshumali Shrivastava*, Rebecca C. Steorts†
*Department of Computer Science, Rice University; †Department of Statistical Science, Duke University
Goal: Unique Entity Estimation (UEE)
Given a dataset of M records, estimate the number of distinct identities (records) n < M.
Challenges
- Data Duplications: Aggregated from various sources
- Data Noise: Human Error / Bias
- Computational Challenges
LOCALITY SENSITIVE HASHING (LSH)
Properties
- A family of functions with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones
- A favorite technique for approximate near-neighbor search
Figure: LSH
Popular LSH
Minwise Hashing and Resemblance Similarity
- The resemblance similarity between two given sets $x, y \subseteq \Omega = \{1, 2, \ldots, D\}$ is defined as
  \[ R = \frac{|x \cap y|}{|x \cup y|} = \frac{a}{f_1 + f_2 - a}, \tag{1} \]
  where $f_1 = |x|$, $f_2 = |y|$, and $a = |x \cap y|$
- Minwise hashing → the LSH for resemblance similarity
- Apply a random permutation $\pi : \Omega \to \Omega$
  \[ h^{\min}_{\pi}(x) = \min(\pi(x)). \tag{2} \]
  \[ \Pr\big(h^{\min}_{\pi}(x) = h^{\min}_{\pi}(y)\big) = \frac{|x \cap y|}{|x \cup y|} = R. \tag{3} \]
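The collision property in Equation (3) can be checked empirically. The following is a minimal sketch, not the authors' implementation: it draws random permutations of the universe, computes the minwise hash of two example sets, and compares the observed collision rate with the resemblance R. The example sets, the universe size D = 20, the 0-based element indexing, and all function names are illustrative assumptions.

```python
import random


def minhash(s, perm):
    """Minwise hash of set s: the smallest rank that perm assigns to any element."""
    return min(perm[e] for e in s)


def resemblance(x, y):
    """Resemblance R = |x ∩ y| / |x ∪ y| (Equation 1)."""
    return len(x & y) / len(x | y)


def estimate_collision_rate(x, y, D, trials=20000, seed=0):
    """Fraction of random permutations under which x and y get the same minwise hash."""
    rng = random.Random(seed)
    perm = list(range(D))  # perm[e] is the rank of element e (0-based universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(perm)  # a fresh uniformly random permutation
        hits += minhash(x, perm) == minhash(y, perm)
    return hits / trials


# Hypothetical example sets over a universe of size D = 20.
x = {0, 1, 2, 3, 4, 5}
y = {3, 4, 5, 6, 7, 8, 9}
D = 20
R = resemblance(x, y)                    # |x ∩ y| = 3, |x ∪ y| = 10
est = estimate_collision_rate(x, y, D)   # should be close to R
print(f"R = {R:.3f}, empirical collision rate = {est:.3f}")
```

Practical implementations usually replace explicit permutations with hash functions; the explicit shuffle is used here only because it mirrors Equation (2) directly.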
Motivation
Estimate Syrian Conflict Casualties across datasets from different sources.
Method
Existing Approaches
- Simply compare every two records
  Problem: O(M^2), i.e., 4.5 × 10^10 pairs, 6 days (Syrian dataset)
- Put similar records into blocks or bins
  Problem: Sacrifices accuracy for efficiency
Our Approach
Figure : Flow
Estimator
... component, say $C_i^*$ (clique), will be sub-sampled and can appear as some possibly smaller connected components in $G'$. For example, a singleton set in $G^*$ will remain the same in $G'$. An isolated edge, on the other hand, can appear as an edge in $G'$ with probability $p$ and as two singleton vertices in $G'$ with probability $1 - p$. A triangle can decompose into three possibilities with probability shown in Figure 2. Each of these possibilities provides a linear equation connecting $n'_i$ to $n_i^*$, the numbers of connected components of size $i$ in the observed graph $G'$ and the original graph $G^*$, respectively. These equations up to cliques of size three are
\[ \mathbb{E}[n'_3] = n_3^* \cdot p^2 \cdot (3 - 2p) \tag{2} \]
\[ \mathbb{E}[n'_2] = n_2^* \cdot p + n_3^* \cdot \big(3 \cdot (1 - p)^2 \cdot p\big) \tag{3} \]
\[ \mathbb{E}[n'_1] = n_1^* + n_2^* \cdot \big(2 \cdot (1 - p)\big) + n_3^* \cdot \big(3 \cdot (1 - p)^2\big). \tag{4} \]
Since we observe $n'_i$, we can solve for the estimator of each $n_i^*$ and compute the number of connected components by summing the estimates of all $n_i^*$.
Fig 2: A general example illustrating the transformation and probabilities of connected components from $G^*$ to $G'$.
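The coefficients in Equations 2 to 4 can be reproduced by brute-force enumeration. The sketch below assumes that each edge of a clique is kept independently with probability p, which is consistent with the edge and triangle cases described above but is stated here as an assumption. It enumerates every subset of kept edges and accumulates the expected number of observed components of each size. The function names and the value p = 0.7 are illustrative.

```python
from itertools import product


def component_sizes(n_vertices, edges):
    """Sizes of the connected components, computed with a small union-find."""
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    sizes = {}
    for v in range(n_vertices):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values())


def expected_counts(clique_size, p):
    """Expected number of observed components of each size for ONE original clique,
    assuming every edge is kept independently with probability p."""
    edges = [(u, v) for u in range(clique_size) for v in range(u + 1, clique_size)]
    expected = {}
    for keep in product([0, 1], repeat=len(edges)):
        prob = 1.0
        for k in keep:
            prob *= p if k else 1 - p
        kept = [e for e, k in zip(edges, keep) if k]
        for size in component_sizes(clique_size, kept):
            expected[size] = expected.get(size, 0.0) + prob
    return expected


p = 0.7  # illustrative sampling probability
edge = expected_counts(2, p)  # contribution of one original isolated edge
tri = expected_counts(3, p)   # contribution of one original triangle
print("edge:", edge)
print("triangle:", tri)
```

For a triangle, the size-3 entry should equal p^2(3 - 2p), the size-2 entry 3(1 - p)^2 p, and the size-1 entry 3(1 - p)^2, matching the multipliers of $n_3^*$ in Equations 2, 3, and 4.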
Unfortunately, this process quickly becomes combinatorial and, in fact, is at least #P-hard (Provan and Ball, 1983) to compute for cliques of larger sizes. A large clique of size k can appear as many separate connected components, and the number of ways it can break into smaller components is exponential (Aleksandrov, 1956). Fortunately, we can safely ignore large connected components without significant loss in estimation, for two reasons. First, in practical entity resolution tasks, when the data set is large (large M) and contains at least one string-valued feature, it is observed that most entities are replicated no more than three or four times. Second, a large clique can only induce large errors if it is broken into many connected components due to undersampling. According to Erdős and Rényi (1960), it will almost surely stay connected if p is high, which is the case with our sampling method.
imsart-aoas ver. 2013/03/06 file: aoas_revision_2017_09_17_FINAL_rcs.tex date: September 17, 2017
Assumption: As argued above, we safely assume that the cliques of sizes equal to or larger than 4 in the original graph would retain their structures, i.e., $\forall i \ge 4$, $n_i^* = n'_i$. With this assumption, we can write down the formula for estimating $n_1^*, n_2^*, n_3^*$ by solving Equations 2–4 as,
\[ n_3^* = \frac{\mathbb{E}[n'_3]}{p^2 \cdot (3 - 2p)}, \qquad n_2^* = \frac{\mathbb{E}[n'_2] - n_3^* \cdot \big(3 \cdot (1 - p)^2 \cdot p\big)}{p} \tag{5} \]
\[ n_1^* = \mathbb{E}[n'_1] - n_2^* \cdot \big(2 \cdot (1 - p)\big) - n_3^* \cdot \big(3 \cdot (1 - p)^2\big) \tag{6} \]
It directly follows that our estimator, which we call the Locality Sensitive Hashing Estimator (LSHE), for the number of connected components is given by
\[ \mathrm{LSHE} = n'_1 + n'_2 \cdot \frac{2p - 1}{p} + n'_3 \cdot \frac{1 - 6 \cdot (1 - p)^2 \cdot p}{p^2 \cdot (3 - 2p)} + \sum_{i=4}^{M} n'_i. \tag{7} \]
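As a sketch of how Equation 7 might be evaluated in practice, the code below computes LSHE from observed component counts, plugging the observed $n'_i$ in place of their expectations. It also recomputes the same value by back-substitution through Equations 5 and 6 as a consistency check. The function names, the dictionary input format, and the example counts are hypothetical and not taken from the paper.

```python
def lshe(observed, p):
    """Closed-form estimate from Equation 7.

    observed: dict mapping component size i to the observed count n'_i.
    p: sampling probability, 0 < p <= 1.
    """
    n1 = observed.get(1, 0)
    n2 = observed.get(2, 0)
    n3 = observed.get(3, 0)
    # Components of size >= 4 are assumed to keep their structure.
    large = sum(count for size, count in observed.items() if size >= 4)
    a = p**2 * (3 - 2 * p)
    return n1 + n2 * (2 * p - 1) / p + n3 * (1 - 6 * (1 - p) ** 2 * p) / a + large


def lshe_stepwise(observed, p):
    """Same estimate via Equations 5 and 6, using observed counts for E[n'_i]."""
    n3_hat = observed.get(3, 0) / (p**2 * (3 - 2 * p))
    n2_hat = (observed.get(2, 0) - n3_hat * 3 * (1 - p) ** 2 * p) / p
    n1_hat = observed.get(1, 0) - n2_hat * 2 * (1 - p) - n3_hat * 3 * (1 - p) ** 2
    large = sum(count for size, count in observed.items() if size >= 4)
    return n1_hat + n2_hat + n3_hat + large


# Hypothetical observed counts, for illustration only.
counts = {1: 700, 2: 40, 3: 5, 4: 2}
p = 0.8
print(lshe(counts, p), lshe_stepwise(counts, p))
```

When p = 1 every weight in Equation 7 equals 1, so the estimate reduces to the plain count of observed components.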
Analysis
3.4. Optimality Properties of LSHE. We now prove two properties of our unique entity estimator, namely, that it is unbiased and that it has provably lower variance than random sampling approaches.

Theorem 1. Assuming $\forall i \ge 4$, $n_i^* = n'_i$, we have
\[ \mathbb{E}[\mathrm{LSHE}] = n \quad \text{(unbiased)} \tag{8} \]
\[ \mathbb{V}[\mathrm{LSHE}] = n_3^* \cdot \frac{(p - 1)^2 \cdot (3p^2 - p + 1)}{p^2 \cdot (3 - 2p)} + n_2^* \cdot \frac{1 - p}{p} \tag{9} \]
The above estimator is unbiased and the variance is given by Equation 9. Theorem 2 proves the variance of our estimator is monotonically decreasing.

Theorem 2. $\mathbb{V}[\mathrm{LSHE}]$ is monotonically decreasing when $p$ increases in range $(0, 1]$.

The proof of Theorem 2 directly follows from the following Lemma 1.

Lemma 1. The first-order derivative of $\mathbb{V}[\mathrm{LSHE}]$ is negative when $p \in (0, 1]$.

Note that when $p = 1$, $\mathbb{V}[\mathrm{LSHE}] = 0$, which means the observed graph $G'$ is exactly the same as $G^*$. For detailed proofs of unbiasedness and Lemma 1, see Appendix B.
Variance is monotonically decreasing when p increases
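The unbiasedness and variance claims can be sanity-checked numerically for a single original edge and a single original triangle. This is a rough sketch under two assumptions that are not spelled out on the poster: edges are kept independently with probability p, and different cliques are sub-sampled independently, so per-clique variances add. Each outcome is weighted by the coefficient its observed components receive in the LSHE formula. All names and the grid of p values are illustrative.

```python
def variance_formula(n2, n3, p):
    """Right-hand side of Equation 9."""
    tri = (p - 1) ** 2 * (3 * p**2 - p + 1) / (p**2 * (3 - 2 * p))
    return n3 * tri + n2 * (1 - p) / p


def moments(outcomes):
    """Mean and variance of a discrete distribution given as (probability, value) pairs."""
    mean = sum(pr * v for pr, v in outcomes)
    var = sum(pr * v * v for pr, v in outcomes) - mean**2
    return mean, var


def edge_outcomes(p):
    """LSHE contribution of one original edge: it survives or splits into two singletons."""
    return [(p, (2 * p - 1) / p), (1 - p, 2.0)]


def triangle_outcomes(p):
    """LSHE contribution of one original triangle under independent edge sampling."""
    a = p**2 * (3 - 2 * p)
    return [
        (a, (1 - 6 * (1 - p) ** 2 * p) / a),          # stays one size-3 component
        (3 * p * (1 - p) ** 2, (2 * p - 1) / p + 1),  # one edge plus one singleton
        ((1 - p) ** 3, 3.0),                          # three singletons
    ]


grid = [i / 20 for i in range(1, 21)]  # p = 0.05, 0.10, ..., 1.00
checks = []
for p in grid:
    e_mean, e_var = moments(edge_outcomes(p))
    t_mean, t_var = moments(triangle_outcomes(p))
    checks.append((p, e_mean, t_mean, e_var, t_var))

# Variance of LSHE for one original edge plus one original triangle, over the grid.
variances = [variance_formula(n2=1, n3=1, p=p) for p in grid]
print(variances[0], variances[-1])
```

Each original edge or triangle should contribute exactly 1 in expectation, which is the unbiasedness statement of Theorem 1 restricted to cliques of size at most three, and the list of variances should decrease along the grid and reach 0 at p = 1.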
Experiments
- Datasets
DBname      | Domain            | Size    | # Matching Pairs | # Attributes | # Entities
Restaurants | Restaurant Guide  | 864     | 112              | 4            | 752
CD          | Music CDs         | 9,763   | 299              | 106          | 9,508
Voter       | Registration Info | 324,074 | 70,359           | 6            | 255,447
Syria       | Death Records     | 296,245 | N/A              | 7            | N/A
Table 1: We present five important features of the four data sets. Domain reflects the variety of the data type we used in the experiments. Size is the total number of records. # Matching Pairs shows how many pairs of records point to the same entity in each data set. # Attributes represents the dimensionality of an individual record. # Entities is the number of unique records.
... sets come from the Violation Documentation Centre (VDC), Syrian Center for Statistics and Research (CSR-SY), Syrian Network for Human Rights (SNHR), and Syria Shuhada website (SS). Each database lists a different number of recorded victims killed in the Syrian conflict, along with available identifying information including full Arabic name, date of death, death location, and gender, among others. [5]

The above datasets cover a wide spectrum of different varieties observed in practice. For each data set, we report summary information in Table 1.

4.1. Evaluation Settings. In this section, we outline our evaluation settings. We denote Algorithm 1 as the LSH Estimator (LSHE). We make comparisons to the non-adaptive variant of our estimator (PRSE), where the sampling used is plain random (instead of the adaptive sampler). This baseline uses the exact same procedure as our proposed LSHE, except that the sampling is done uniformly. A comparison with PRSE quantifies the advantages of the proposed adaptive sampling over random sampling. In addition, we implemented the two other known sampling methods for connected component estimation, proposed in Frank (1978) and Chazelle, Rubinfeld and Trevisan (2005). For convenience, we denote them as the Random Sub-Graph based Estimator (RSGE) and the BFS on Random Vertex based Estimator (BFSE), respectively. Since the algorithms are based on sampling (adaptive or random), to ensure fairness, we fix a budget m as the number of pairs of vertices considered by the algorithm. Note that any query for an edge is a part of the budget. If the fixed budget is exhausted, then we stop the sampling process and use the corresponding estimate, using all the information ...

[5] These databases include documented identifiable victims and not those who are missing in the conflict. Hence, any estimate reported only refers to the data at hand.
- Setting
... singleton nodes, which leads to poor accuracy for BFSE. Thus, it is expected that random sampling will perform poorly. Unfortunately, there is no other baseline for unbiased estimation of the number of unique entities.

From Figure 4, observe that the RE for the proposed estimator LSHE is approximately one to two orders of magnitude lower than that of the other considered methods, where the y-axis is on the log scale. Our proposed estimator LSHE consistently leads to significantly lower RE (lower error rates) than the other three estimators. This is not surprising given the analysis shown in Section 3.5. The variance of random sampling based methodologies will be significantly higher because a randomly sampled pair has close to zero probability of being a duplicate pair.

Taking a closer look at LSHE, we notice that we are able to efficiently generate samples with very high values of p (see Table 2). In addition, we can clearly see that LSHE achieves high accuracy with very few samples. For example, for the CD data set, with a sample size less than 0.05% of the total possible pairs of records of the entire data set, LSHE achieves 0.0006 RE. Similarly, for the Voter data set, with a sample size less than 0.012% of the total possible pairs of records of the entire data set, LSHE achieves 0.003 RE.

As mentioned earlier, we also evaluated the effect of using SVM prediction as a proxy for actual labels with our LSHE. The dotted plots show those results. We remark on the results for LSHE + SVM in the next section (Section 5).
     | Restaurant             | CD                      | Voter
Size | 1.0  2.5  5.0  10      | 0.005  0.01  0.02  0.04 | 0.002  0.006  0.009  0.013
p    | 0.42  0.54  0.65  0.82 | 0.72  0.74  0.82  0.92  | 0.62  0.72  0.76  0.82
K    | 1  1  1  1             | 1  1  1  1              | 4  4  4  4
L    | 4  8  12  20           | 5  6  8  14             | 25  32  35  40
Table 2: We illustrate some of the sample sizes (in % of total) for different sets of samples generated by Min-Wise Hashing and their corresponding p in all three data sets.
5. Estimation of Casualties in Syrian Conflict. In this section, we describe how we estimate the number of documented identifiable deaths for the Syrian data set. As noted before, we do not have ground truth labels for all record pairs, but the data set was partially labelled with 217,788 record pairs. Furthermore, given doubts about the accuracy of the partially labelled record pairs, we propose an alternative method of labelling the sampled pairs, which is also needed by our proposed estimation algorithm. More ...
- Results
Figure: Relative error (RE, log scale) versus sample size (in % of total), in three panels: Estimation on Restaurant (LSHE, PRSE, RSGE, BFSE, LSHE+SVM), Estimation on CD (LSHE, RSGE, BFSE, LSHE+SVM), and Estimation on Voter (LSHE, RSGE, BFSE, LSHE+SVM).
Conclusions
- Estimated number of casualties in the Syrian Conflict: 190,369 ± 207 (done in 119 seconds)
- It closely matches the 2014 HRDAG estimate of 190,000, obtained by manual hand-matching