Presentation of ‘Blocking Methods Applied to Casualty Records from the Syrian Conflict’
Margaret J. Foster
September 27, 2016
High-level overview:
Goal: Identify a better method to block the Syria casualty data
Main finding: Locality-Sensitive Hashing (LSH) is more accurate: it reduces splitting paired records across blocks from 50% [1] to less than 1% of the time.
However: LSH has low precision [2]
[1] Note: later, the discussion says 20-60% of the time for “traditional” blocking methods
[2] precision = 1 - False Positive Rate
Context: still looking at blocking the Syrian casualty data.
Reminder of what this looks like (more or less):
[Figure not preserved in transcript; only an arrow (⇒) survives.]
Data
- 300,000 total death records
- Four databases, with different collection characteristics
- “Very” sparse in duplicates [3]
- Sparse in usable features for matching
[3] Magnitude?
Blocking Strategies
Approach | Does | Drawback
“Traditional Blocking” | Blocks records that agree on all fields of comparison (FoC) | Assumes the FoC are error-free
Transitive locality-sensitive hashing (TLSH) | Community detection to associate records | ...
k-means locality-sensitive hashing (KLSH) | Group by similarity in a vector-space representation |
k-nearest neighbors clustering | Cluster by distance between records | Computationally expensive, cluster quality
Canopies | Overlapping clusters | Doesn’t work; expensive
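To make the first row concrete, here is a minimal sketch of exact-agreement blocking. The function name, field names, and toy records are hypothetical and invented for illustration; they are not taken from the paper or the HRDAG data.

```python
from collections import defaultdict


def traditional_blocks(records, fields):
    """Group record ids whose values agree exactly on every field in `fields`."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        # The blocking key is the tuple of raw field values; any typo changes the key.
        key = tuple(rec[f] for f in fields)
        blocks[key].append(rec_id)
    return dict(blocks)


# Hypothetical toy records (invented for illustration).
records = {
    "a": {"name": "ahmad ali", "sex": "M", "governorate": "Homs"},
    "b": {"name": "ahmad ali", "sex": "M", "governorate": "Homs"},
    "c": {"name": "ahmed ali", "sex": "M", "governorate": "Homs"},  # one-letter variant
}
blocks = traditional_blocks(records, ["name", "sex", "governorate"])
```

Record "c" lands in its own block because of a single differing letter, which is the drawback the table names: the method only works if the fields of comparison are error-free.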
Blocking Tools II
Approach | Does | Drawback(s)
Conjunction rules | Combination of similarity relationships that must exist | Over/underfitting data, scalability, application-specificity
Conjunction rules + Soundex | Conjunctions used with string data [4] | Only as good as the distance measure
Hashing ensembles | Cluster based on hashing strategies | Aggressive clustering
[4] Via Arabic Edit Distance Algorithm, a weighted measure of similarity
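A conjunction rule can be sketched as a boolean AND of similarity clauses. This is an illustration under stated assumptions: the specific clauses, the threshold, and the toy records are hypothetical, and plain (unweighted) Levenshtein distance stands in for the weighted Arabic Edit Distance Algorithm mentioned in the footnote.

```python
def levenshtein(s, t):
    """Unweighted edit distance (a stand-in, not the Arabic Edit Distance Algorithm)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def conjunction_rule(r1, r2, max_name_distance=1):
    """Hypothetical rule: same sex AND same governorate AND names within a small edit distance."""
    return (
        r1["sex"] == r2["sex"]
        and r1["governorate"] == r2["governorate"]
        and levenshtein(r1["name"], r2["name"]) <= max_name_distance
    )


# Hypothetical toy records (invented for illustration).
rec_a = {"name": "ahmad ali", "sex": "M", "governorate": "Homs"}
rec_b = {"name": "ahmed ali", "sex": "M", "governorate": "Homs"}
rec_c = {"name": "ahmed ali", "sex": "M", "governorate": "Idlib"}
```

Every clause must hold for a pair to be kept, so the rule is only as good as its weakest clause and its distance measure, which matches the drawbacks listed in the table.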
Hashing Ensembles
Table: Hashing Techniques
Method | Does
KLSH | Uses bag-of-shingles; similarity measured by inner product
Minhashing | Applies random permutation π on record set (S), stores only minimum value
Densified One Permutation Hashing (DOPH) | Reduces number of passes through data needed to create buckets
Weighted Densified One Permutation Hashing | Adds in weights to the hash components, to recapture important information in a sparse environment
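The minhashing row can be illustrated with a minimal sketch. Assumptions, none of which come from the slides: records are split into character shingles, random linear hash functions stand in for the random permutations π, CRC32 maps each shingle to an integer, and all function names are hypothetical.

```python
import random
import zlib


def shingles(text, k=2):
    """Set of overlapping character k-grams; a text shorter than k is its own shingle."""
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(shingle_set, n_hashes=64, seed=0):
    """Keep only the minimum hashed value per hash function (one per simulated permutation)."""
    prime = (1 << 61) - 1  # Mersenne prime used as the modulus
    rng = random.Random(seed)  # same seed gives the same hash functions for every record
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n_hashes)]
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % prime for x in ids) for a, b in coeffs]


def estimated_jaccard(sig1, sig2):
    """Fraction of signature positions that agree; estimates Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


sig_a = minhash_signature(shingles("ahmad ali"))
sig_c = minhash_signature(shingles("ahmed ali"))
print(estimated_jaccard(sig_a, sig_c))
```

Records whose signatures agree on enough positions would be placed in the same block; how signatures are bucketed is a separate design choice that this sketch does not cover.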
Evaluation
Evaluation strategy: traditional and hash-based blocking methods applied to HRDAG’s data of Syrian names.
Features are:
- Full Arabic name
- Date of death
- Governorate of death
- Sex
Evaluated on:
$\mathrm{FNR} = \frac{FN}{TP + FN}$ and $\mathrm{FPR} = \frac{FP}{TP + FP}$
recall = 1 - FNR, precision = 1 - FPR
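Reduction ratio (RR) is the fraction of all possible record pairs that blocking removes from comparison: 1 minus (candidate pairs after blocking) over (all record pairs).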
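The evaluation measures can be computed from a set of blocks and a set of known matching pairs. The sketch below follows the FNR and FPR formulas as given on this slide; for RR it assumes the common definition (1 minus candidate pairs over all possible pairs). Function names and the toy input are hypothetical.

```python
from itertools import combinations


def candidate_pairs(blocks):
    """All unordered within-block pairs, i.e. the pairs that would still be compared."""
    pairs = set()
    for block in blocks:
        for a, b in combinations(sorted(block), 2):
            pairs.add((a, b))
    return pairs


def evaluate_blocking(blocks, true_pairs, n_records):
    cand = candidate_pairs(blocks)
    truth = {tuple(sorted(p)) for p in true_pairs}
    tp = len(cand & truth)   # true matches placed in a common block
    fp = len(cand - truth)   # non-matches placed in a common block
    fn = len(truth - cand)   # true matches split across blocks
    fnr = fn / (tp + fn) if tp + fn else 0.0
    fpr = fp / (tp + fp) if tp + fp else 0.0  # as defined on the slide
    total_pairs = n_records * (n_records - 1) // 2
    rr = 1 - len(cand) / total_pairs if total_pairs else 0.0  # assumed standard definition
    return {"FNR": fnr, "FPR": fpr, "recall": 1 - fnr, "precision": 1 - fpr, "RR": rr}


# Hypothetical toy input: five records, two true matching pairs.
result = evaluate_blocking(
    blocks=[["a", "b", "c"], ["d", "e"]],
    true_pairs=[("a", "b"), ("c", "d")],
    n_records=5,
)
```

In this toy input one true pair is split across blocks, so recall is 0.5, while three of the four candidate pairs are non-matches, so precision is 0.25.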
Overview of results
Approach | Outcome
Traditional blocking | Recall and RR never above 0.3
KLSH (shingle = 1) | Recall approx. 0.6
KLSH (“reasonable” block size) | Recall about 0.4
Conjunctions | For best rules: recall about 0.8; RR at ∼0.93-0.99
Minhashing | Recall and RR are both close to one, regardless of shingle
Unweighted DOPH (shingle size = 3) | Recall and RR close to 0.99
Weighted DOPH (shingle size = 2-3) | Recall and RR close to 0.99
But: precision near 0
Questions:
- Is “traditional blocking” a fair comparison?
- What, exactly, is “soft” transitivity? Not defined here. [5]
[5] It is an uncommon term: there are 8 Google Scholar papers that use this phrase and one paper on arXiv (this one).