Presentation of 'Blocking Methods Applied to Casualty Records from the Syrian Conflict'

Margaret J. Foster

September 27, 2016



High-level overview:

Goal: Identify a better method to block the Syria casualty data.

Main finding: Locality-Sensitive Hashing (LSH) is more accurate: it reduces splitting paired records across blocks from 50%¹ to less than 1% of the time.

However: LSH has low precision.²

¹ Note: later the discussion says 20-60% of the time for "traditional" blocking methods.

² precision = 1 - False Positive Rate


Context: still looking at blocking the Syrian casualty data.

Reminder of what this looks like (more or less):


Data

- 300,000 total death records

- Four databases, with different collection characteristics

- "Very" sparse in duplicates³

- Sparse in usable features for matching

³ Magnitude?


Blocking Strategies

| Approach | Does | Drawback |
|---|---|---|
| "Traditional blocking" | Blocks records that agree on all fields of comparison (FoC) | Assumes the FoC are error-free |
| Transitive locality-sensitive hashing (TLSH) | Community detection to associate records | ... |
| k-means locality-sensitive hashing (KLSH) | Group by similarity in a vector-space representation | |
| k-nearest neighbors clustering | Cluster by distance between records | Computationally expensive, cluster quality |
| Canopies | Overlapping clusters | Doesn't work; expensive |
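To make the "traditional blocking" row concrete, here is a minimal sketch, assuming records are plain dictionaries and that a block is simply the set of records sharing identical values on every field of comparison. The function name `traditional_blocks` and the toy records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def traditional_blocks(records, fields):
    """Group record ids by exact agreement on every field in `fields`."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        # The blocking key is the tuple of all fields of comparison.
        key = tuple(rec[f] for f in fields)
        blocks[key].append(rec_id)
    return dict(blocks)

# Hypothetical toy data: r1 and r2 stand for the same person, but the
# name is transliterated differently in r2.
records = {
    "r1": {"name": "ahmad ali", "date": "2013-05-01", "gov": "Homs", "sex": "M"},
    "r2": {"name": "ahmed ali", "date": "2013-05-01", "gov": "Homs", "sex": "M"},
    "r3": {"name": "ahmad ali", "date": "2013-05-01", "gov": "Homs", "sex": "M"},
}
blocks = traditional_blocks(records, ["name", "date", "gov", "sex"])
for key, members in blocks.items():
    print(key, members)
```

In this toy case a single-character spelling difference sends r2 to its own block, which illustrates the drawback in the table: exact-agreement blocking only works if the fields are error-free.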


Blocking Tools II

| Approach | Does | Drawback(s) |
|---|---|---|
| Conjunction rules | Combination of similarity relationships that must exist | Over/underfitting data, scalability, application-specificity |
| Conjunction rules + Soundex | Conjunctions used with string data⁴ | Only as good as the distance measure |
| Hashing ensembles | Cluster based on hashing strategies | Aggressive clustering |

⁴ Via Arabic Edit Distance Algorithm, a weighted measure of similarity


Hashing Ensembles

Table: Hashing Techniques

| Method | Does |
|---|---|
| KLSH | Uses bag-of-shingles; similarity measured by inner product |
| Minhashing | Applies random permutation π on record set (S), stores only minimum value |
| Densified One Permutation Hashing (DOPH) | Reduces number of passes through data needed to create buckets |
| Weighted Densified One Permutation Hashing | Adds in weights to the hash components, to recapture important information in a sparse environment |
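As a rough illustration of the shingling and minimum-value idea in the table, the sketch below builds character shingles and a minhash signature. It is an assumption-laden toy, not the paper's implementation: random affine hash functions stand in for the random permutation π, `zlib.crc32` is used only to turn shingles into integers, and all function names are hypothetical.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the hash modulus

def shingles(text, k=2):
    """Set of character k-grams of `text` (the whole string if it is shorter than k)."""
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def make_hashes(n, seed=0):
    """n random affine maps x -> (a*x + b) mod PRIME, standing in for permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(shingle_set, hashes):
    """For each hash function, keep only the minimum hashed shingle value."""
    codes = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * c + b) % PRIME for c in codes) for a, b in hashes]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)

hashes = make_hashes(100)
s1, s2 = shingles("ahmad ali"), shingles("ahmed ali")
sig1 = minhash_signature(s1, hashes)
sig2 = minhash_signature(s2, hashes)
print("exact:", jaccard(s1, s2), "estimated:", estimated_jaccard(sig1, sig2))
```

Records whose signatures agree in many positions are likely to be similar, so signature values (or bands of them) can serve as block keys. One-permutation variants such as DOPH aim to get a comparable signature with fewer passes over the data.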


Evaluation

Evaluation strategy: Traditional and hash-based blocking methods applied to HRDAG's data of Syrian names.

Features are:

- Full Arabic name

- Date of death

- Governorate of death

- Sex

Evaluated on:

$$\mathrm{FNR} = \frac{FN}{TP + FN} \quad\text{and}\quad \mathrm{FPR} = \frac{FP}{TP + FP}$$

$$\text{recall} = 1 - \mathrm{FNR}, \qquad \text{precision} = 1 - \mathrm{FPR}$$

Reduction ratio (RR) is matched pairs over non-matched records
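The metrics above can be sketched for a toy blocking as follows. FNR and FPR follow the slide's formulas. For the reduction ratio, the code assumes the common definition of one minus the share of all possible record pairs that remain as candidates after blocking; this is an assumption and may differ from the paper's exact definition. The function names and example data are hypothetical.

```python
from itertools import combinations

def candidate_pairs(blocks):
    """All unordered record pairs that share at least one block."""
    pairs = set()
    for members in blocks:
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs

def evaluate(blocks, true_matches, n_records):
    """Recall, precision and reduction ratio of a blocking (needs n_records >= 2)."""
    cand = candidate_pairs(blocks)
    truth = {frozenset(p) for p in true_matches}
    tp = len(cand & truth)   # true matches placed in a common block
    fn = len(truth - cand)   # true matches split across blocks
    fp = len(cand - truth)   # non-matching pairs placed in a common block
    fnr = fn / (tp + fn) if tp + fn else 0.0
    fpr = fp / (tp + fp) if tp + fp else 0.0  # FPR as defined on the slide
    total_pairs = n_records * (n_records - 1) // 2
    rr = 1 - len(cand) / total_pairs  # assumed definition of reduction ratio
    return {"recall": 1 - fnr, "precision": 1 - fpr, "reduction_ratio": rr}

# Hypothetical example: five records, two blocks, two true duplicate pairs.
blocks = [["r1", "r2", "r3"], ["r4", "r5"]]
true_matches = [("r1", "r2"), ("r3", "r4")]
result = evaluate(blocks, true_matches, n_records=5)
print(result)
```

Here one of the two true pairs is split across blocks, so recall is 0.5, and three of the four candidate pairs are not true matches, so precision is 0.25. That is the same trade-off the deck reports for LSH, only in the opposite direction: very high recall at the cost of low precision.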


Overview of results

| Approach | Outcome |
|---|---|
| Traditional blocking | Recall and RR never above 0.3 |
| KLSH (shingle = 1) | Recall approx. 0.6 |
| KLSH ("reasonable" block size) | Recall about 0.4 |
| Conjunctions | For best rules: recall about 0.8; RR at ~0.93-0.99 |
| Minhashing | Recall and RR are both close to one, regardless of shingle |
| Unweighted DOPH | Recall and RR close to 0.99 (shingle size = 3) |
| Weighted DOPH | Recall and RR close to 0.99 (shingle size = 2-3). But: precision near 0 |


Questions:

- Is "traditional blocking" a fair comparison?

- What, exactly, is "soft" transitivity? Not defined here.⁵

⁵ It is an uncommon term: there are 8 Google Scholar papers that use this phrase and one paper on arXiv (this one).
