Linguistic Considerations of Identity Resolution (2008)
TRANSCRIPT
Government Users Conference, "Navigating the Human Terrain," College Park, MD, May 20-21, 2008
Linguistic Considerations of Identity Resolution
David Murgatroyd, Software Architect, Basis Technology
Outline
Introduction
Linguistic Challenges
  Variation (Intentional & Unintentional)
  Composition
  Frequency
  Under-specification
  Multilinguality
Integration Challenges
  Inputs & Outputs
  Properties
Evaluation Challenges
  Corpora: Find or Build?
  Metrics: Adopt or Create?
Conclusion
Introduction: An Exercise
Jim Killeen
Kileen, J. D.
Jaime Kilin
جمس كلين
Is there a >50% chance these refer to the same person? If… US citizens; on a ferry to Spain; in a documentary?
What is Identity Resolution?
Identity Resolution (aka Entity Resolution): determining whether two or more given references refer to the same entity.
Different from name matching: it is about the identity of entities, not the similarity of names.
See also: Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of the LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.
What sorts of references?
Non-linguistic reference examples:
  Numerical identifiers
    — SSN
    — Some portions of an address (street number, ZIP code)
  Visual identifiers (e.g., pictures, symbols)
  Biometrics (e.g., DNA, iris, signature, voice)
Linguistic reference examples:
  Nouns or pronouns in documents (e.g., "the CEO of Basis")
  Names of associated/related entities
    — Locations (e.g., street or city name)
    — Organizations
    — Individuals
  Name of the entity <- we're going to focus on this one
Let’s focus on names of people
Common and familiar
Often a fairly identifying piece of personal information
Demonstrate typical challenges of resolution with linguistic data
Variation (Intentional)
Variation may be intentional. References may draw on a large set of names:
— Formality (e.g., nicknames)
— Transparency (e.g., aliases)
— Location (e.g., toponyms)
— Life status
    Vocation (e.g., titles)
    Marital status (e.g., marriage/divorce/widowhood)
    Parenthood (e.g., patronymics)
    Faith (e.g., christening, pilgrimage)
    Death (e.g., posthumous names)
— Dialect (e.g., adolescent girls preferring "Jenni" over "Jenny")
— Style of text (e.g., "Sollermun" for "Solomon" in Huck Finn)
Example: Jim Killeen
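One common way to handle formality variation is to fold known variants to a canonical form before comparison. A minimal sketch, assuming a hypothetical variant table (the entries below are illustrative, not a real resource):

```python
# Hypothetical table of given-name variants -> canonical form.
NICKNAMES = {
    "jim": "james", "jimmy": "james",
    "jaime": "james",   # Spanish form of the same given name
    "bill": "william", "liz": "elizabeth",
}

def canonical_given_name(name: str) -> str:
    """Map a given name to a canonical form if a variant is known;
    otherwise fall back to the lowercased name itself."""
    key = name.lower()
    return NICKNAMES.get(key, key)

# "Jim", "Jaime", and "James" now compare equal at the canonical level.
assert canonical_given_name("Jim") == canonical_given_name("James") == "james"
```

Note that such tables only cover the formality and (some) dialect cases; aliases chosen for opacity defeat lookup by design.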
Variation (Unintentional)
Variation may be unintentional, arising from:
  Typos
    — E.g., "Killeen" vs. "Kileen"
  Guessing spelling based on pronunciation
    — E.g., "Caliin"
  Ambiguities inherent in the encoding (e.g., Unicode):
    — Characters with the same glyph
        E.g., Latin and Cyrillic small "i"
    — Characters with similar glyphs
        E.g., Latin "K" and Greenlandic "ĸ"
    — Characters with composed/combined forms
        E.g., ņ (n with cedilla) vs. n + combining cedilla
Example: Kileen, J. D.
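The composed/combined ambiguity above is mechanically fixable with Unicode normalization; the same-glyph, cross-script ambiguity is not. A sketch using Python's standard `unicodedata` module:

```python
import unicodedata

composed = "\u0146"     # ņ: LATIN SMALL LETTER N WITH CEDILLA
combining = "n\u0327"   # "n" followed by COMBINING CEDILLA

# The two strings render identically but compare unequal byte-for-byte.
assert composed != combining

# Unicode normalization (NFC here) folds them to one canonical form.
assert unicodedata.normalize("NFC", combining) == composed

# Same-glyph characters across scripts are NOT unified by normalization:
latin_i = "i"
cyrillic_i = "\u0456"   # Cyrillic і, visually identical to Latin i
assert unicodedata.normalize("NFC", cyrillic_i) != latin_i
```

Handling cross-script look-alikes requires a separate confusables mapping, not normalization.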
Composition
Names have differing orders:
  Given vs. surname: "Killeen, Jim" vs. "Jim Killeen"
  Varies by culture
Name references may be partial: "Jim" vs. "Jim Killeen"
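The surname/given-name reordering can be sketched as a normalization step before comparison (a sketch only; real systems need culture-aware handling of name order):

```python
def normalize_order(name: str) -> tuple[str, ...]:
    """Reduce 'Surname, Given' and 'Given Surname' to one comparable
    tuple of lowercased tokens in given-name-first order."""
    if "," in name:
        surname, given = [p.strip() for p in name.split(",", 1)]
        parts = given.split() + [surname]
    else:
        parts = name.split()
    return tuple(p.lower() for p in parts)

assert normalize_order("Killeen, Jim") == normalize_order("Jim Killeen")
# Partial references still differ: comparing "Jim" to "Jim Killeen"
# needs a containment check rather than tuple equality.
```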
Under-specification
Name components may be abbreviated:
  Initials (e.g., "J. D.")
  Abbreviations (e.g., "Jas.")
Name references may have incomplete…
  orthography (e.g., Semitic languages)
  segmentation (e.g., Asian languages)
  phonology (e.g., ideographic languages)
Examples: Kileen, J. D.; جمس كلين
Frequency
Any person can make up a name (an open class)
A few are common; most are very uncommon (a Zipfian distribution)
Lessons:
  Valuable to know common names
  Valuable to have a strategy for unknown names
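Both lessons can be captured by weighting a name match by its information content: agreement on a rare name is stronger evidence than agreement on a common one, and unknown names get a smoothed count. A sketch with hypothetical corpus counts:

```python
import math

# Hypothetical counts; a real system would use large name-frequency lists.
NAME_COUNTS = {"james": 5_000_000, "killeen": 12_000}
TOTAL = 300_000_000

def match_weight(name: str, unseen_count: float = 0.5) -> float:
    """Information content (in bits) of agreeing on this name.
    Unseen names get a smoothed count: the 'strategy for unknown names'."""
    count = NAME_COUNTS.get(name.lower(), unseen_count)
    return -math.log2(count / TOTAL)

# A match on the rare surname carries more evidence than the common given name,
# and an unseen name carries the most of all.
assert match_weight("Murgatroyd") > match_weight("Killeen") > match_weight("James")
```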
Multilinguality
Names may appear in many languages-of-use.
This leads to variation at many linguistic levels.
Orthographic:
  transliteration confronts skew in:
  — orthographic-to-phonetic mappings of the source and target languages-of-use
  — sound systems between the languages
Example: James Klein <-> جمس كلين
Multilinguality (cont’d)
Syntactic:
  different languages-of-use may imply different name word order
Semantic:
  name words which communicate meaning (e.g., titles) may vary
  (e.g., "Jr." for "الصغر", which means "the younger")
Pragmatic:
  different languages-of-use may use different names based on the audience
  (e.g., "Mr. Laden" vs. "المير", which means "the prince")
Inputs & Outputs
Input options include:
  Pair-wise: simple integration, but no shared effort
  Set-based: harder integration, but able to optimize
Output options include:
  Feature-based: with weights/tuning
  Probability-based:
    — more principled combination
    — NOTE: similarity is not probability
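The "similarity is not probability" note deserves emphasis: a raw similarity score must be calibrated against labeled decisions before it can be read as a match probability. A minimal sketch of logistic (Platt-style) calibration; the parameter values are illustrative, not fitted:

```python
import math

def calibrated_probability(similarity: float,
                           a: float = 8.0, b: float = -5.0) -> float:
    """Map a raw similarity in [0, 1] to a match probability via a
    logistic curve. In practice a and b are fit to labeled resolution
    decisions; the defaults here are placeholders."""
    return 1.0 / (1.0 + math.exp(-(a * similarity + b)))

# A similarity of 0.9 is not automatically a 90% chance of co-reference;
# the fitted curve decides how scores translate to probabilities.
```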
Integration Properties
Certain properties help make efficient implementations:
  Reflexivity:
    — Resolve(a,a) is always true
    — NOTE: does not imply Resolve(a,a') where a~a'
  Commutativity:
    — Resolve(a,b) <=> Resolve(b,a)
  Transitivity:
    — Resolve(a,b) & Resolve(b,c) => Resolve(a,c)
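When Resolve is treated as reflexive, commutative, and transitive, it is an equivalence relation, and pairwise decisions can be folded into entities with a union-find (disjoint-set) structure in near-linear time. A sketch:

```python
class UnionFind:
    """Disjoint-set forest: maintains the transitive closure of
    pairwise Resolve decisions as clusters of references."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Record Resolve(a, b): merge the two clusters."""
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
uf.union("Jim Killeen", "Kileen, J. D.")
uf.union("Kileen, J. D.", "Jaime Kilin")
# Transitivity: Resolve(a,b) & Resolve(b,c) => Resolve(a,c)
assert uf.find("Jim Killeen") == uf.find("Jaime Kilin")
```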
Corpora: Find or Build?
Requirements:
  Annotated for ground truth
  Represent linguistic challenges
  Scalable/practical
Options:
  Adapt public "database" corpora:
    — Wikipedia: annotated: yes; representative: somewhat; scalable: yes
    — Citation DBs: annotated: no; representative: somewhat; scalable: yes
Corpora: Find or Build? (cont’d)
Adapt public "document" corpora:
  — Co-reference documents:
      annotated: yes; representative: less, as often a single doc/language-of-use; scalable: yes
Create corpora by hand:
  — From scratch: "parrot sessions" (auditory or visual)
      annotated: yes; representative: largely; scalable: no
  — From un-annotated databases:
      annotated: no; representative: yes; scalable/practical: no (databases may be private)
  — Synthesize from a generative model:
      annotated: yes; representative: no (tied to the generating model); scalable: yes
Metrics
Back to our initial example
[Diagram: the mentions from the exercise (Jim Killeen; Kileen, J. D.; Jaime Kilin; جمس كلين; Jim; J. Diw Killeen; …) grouped three ways: the reference clustering, System A's output, and System B's output.]
Metrics: Adopt or Create?
How to quantify the quality of the system's resolutions vs. the reference?
Goals:
  Discriminative: separates good vs. bad systems for users' needs
  Interpretable: the number aligns with intuition
Considerations:
  Assume transitive closure (TC) of output?
  Apply weights to try to be more discriminative?
Common concepts:
  Precision: % of stuff in the answer that's right
  Recall: % of the right stuff in the answer
  F-score: harmonic mean of these = 2*P*R/(P+R)
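Applied pair-wise to clusterings, these concepts give one of the candidate metrics discussed next. A sketch (mention names and clusterings are illustrative):

```python
from itertools import combinations

def coreferent_pairs(clusters):
    """All unordered mention pairs placed in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def pairwise_prf(system, reference):
    """Precision, recall, and F-score over co-referent pairs (links)."""
    sys_pairs = coreferent_pairs(system)
    ref_pairs = coreferent_pairs(reference)
    correct = len(sys_pairs & ref_pairs)
    p = correct / len(sys_pairs) if sys_pairs else 0.0
    r = correct / len(ref_pairs) if ref_pairs else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

reference = [{"Jim Killeen", "Kileen, J. D.", "Jaime Kilin"}, {"Jim"}]
system = [{"Jim Killeen", "Kileen, J. D."}, {"Jaime Kilin", "Jim"}]
p, r, f = pairwise_prf(system, reference)   # p = 0.5, r = 1/3, f = 0.4
```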
Candidate Metrics
Pair-wise % correct: over all N*(N-1)/2 node pairs
Pair-wise P&R: based on links drawn
Edit-distance: # of links to add/subtract to correct
Metrics used in document co-reference resolution:
  MUC-6: entity-based P&R on missing links from the graph
  B-CUBED: average per-reference P&R of links
  CEAF (Constrained Entity-Alignment F): entities aligned using some similarity measure; P&R are the % of the possible similarity level achieved
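Of these, B-CUBED (Bagga & Baldwin, 1998) is simple to state in code: per-mention precision and recall of cluster overlap, averaged over all mentions. A sketch with illustrative clusterings:

```python
def b_cubed(system, reference):
    """B-CUBED precision and recall: for each mention, score the overlap
    between its system cluster and its reference cluster, then average."""
    sys_of = {m: c for c in system for m in c}   # mention -> system cluster
    ref_of = {m: c for c in reference for m in c}  # mention -> reference cluster
    mentions = list(ref_of)
    p = sum(len(sys_of[m] & ref_of[m]) / len(sys_of[m]) for m in mentions)
    r = sum(len(sys_of[m] & ref_of[m]) / len(ref_of[m]) for m in mentions)
    return p / len(mentions), r / len(mentions)

reference = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]
p, r = b_cubed(system, reference)   # p = 0.75, r = 2/3
```

Splitting a reference entity hurts recall; merging distinct entities hurts precision, which matches the intuition behind the metric.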
Comparing Metrics
[Table: metric scores for Systems A and B on the example, with and without transitive closure (TC), under % correct, pairwise F, edit distance, MUC-6 (TC), B-CUBED (TC), and CEAF (TC); "My preference" marks one of the metrics.]
Conclusion
Identity resolution systems face linguistic challenges
They need to be carefully integrated to meet these challenges
Evaluation corpora should reflect these challenges
Evaluation metrics should align with qualitative judgements
Bibliography
Bagga, A., Baldwin, B. (1998). Algorithms for scoring coreference chains. In Proceedings of the First International Conference on Language Resources and Evaluation Workshop on Linguistic Coreference.
Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, Vol. 64, No. 328, pp. 1183--1210.
Luo, X. (2005). On coreference resolution performance metrics. In Proc. of HLT-EMNLP, pp 25--32.
Menestrina, D., Benjelloun, O., Garcia-Molina, H. (2006). Generic entity resolution with data confidences. In First International VLDB Workshop on Clean Databases. Seoul, Korea.
Murgatroyd, D. Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.
Spock Team (2008). The Spock Challenge. http://challenge.spock.com/ (Retrieved February 5.)
Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L. (1995). A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC6). Morgan Kaufmann, pp. 45--52.