Geographic Text Search
Corporate Proprietary, Copyright 1999-2003, MetaCarta, Inc.
Analysis of geographic references
András Kornai, Beth Sundheim
HLT/NAACL03 workshop 31 May 2003
Thanks to
Program committee: Doug Appelt, Merrick Lex Berman, Sean Boisen, Quintin Congdon, Jim Cowie, Doug Jones, Linda Hill, George Wilson
Conference support: Ed Hovy, James Allen, Steven Abney, Dragomir Radev, Ali Hakim, Dekang Lin
Sponsors: TIDES, AQUAINT
Program
• 19 papers submitted, 12 accepted
• 2 invited speakers
• 2 discussion periods
• Authors asked to email presentation to [email protected] by end of day
Changes
• Afternoon invited speaker: Jerry Hobbs (ISI) replaces Randy Flynn (NIMA)
• Paper presentation ordering: Li et al. swapped with Manov et al. (9:30am vs. 12:10pm)
• Additional workshop event: Linda Hill (UCSB) poster during breaks
Workshop goals
• Exchange information on work in the analysis and grounding of place names and other forms of geographic reference
• Informally assess the state of the art in handling various aspects of the problem
• Identify ways to follow up on workshop as a community
External resources
• Diversity across projects: ADL, Tipster, NIMA/USGS, UN-LOCODE, TGN, GB Historical GIS, web, …
• Integrated resources: KIM KB (Manov et al.), named entity word list in InfoXtract, extended multi-gazetteer MetaCarta db, …
• Net result – how happy are we with current resources and integration solutions?
  - With coverage of named places, richness of information, utility for NLP analysis as well as for grounding references?
  - With using a named entity finder as an analysis preprocessor?
Entity finding in text
• Some systems (for now) entirely manual
• Semi-automated (with human review)
• Fully automated
  - FS template matching
  - (Weighted) rule-based
  - HMM-based
  - Confidence-based
Disambiguation
• What do we mean?
  - Discrimination between names of places and other types of names
  - Disambiguation of place reference by location of place
  - Disambiguation of place reference by type of place
• How well do current techniques work, and what hard problems remain?
  - Relative difficulty given texts about U.S., detailed location references, historical texts
  - Relation to general word sense disambiguation problem
  - Use of non-local descriptive references, coreference, …
  - Co-occurrence of names with non-spatial clue terms (“San Francisco” and “earthquake”)
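The clue-term idea in the last bullet can be sketched as a simple overlap count. This is only an illustration under stated assumptions: the `Candidate` class, the entry ids, and the clue-term sets below are hypothetical toy values, not data from any gazetteer or system discussed at the workshop.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A hypothetical gazetteer entry that a place name could refer to."""
    entry_id: str
    clue_terms: set = field(default_factory=set)


def tokenize(text):
    """Lowercase the text and split it into a set of alphabetic word tokens."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return set(cleaned.split())


def rank_by_clues(context, candidates):
    """Order candidates by how many of their clue terms occur in the context."""
    words = tokenize(context)
    scored = [(len(c.clue_terms & words), c.entry_id) for c in candidates]
    # Highest overlap first; ties are broken alphabetically by entry id.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy candidates with made-up clue terms (assumptions for illustration).
candidates = [
    Candidate("san-francisco-california", {"earthquake", "bay", "fault"}),
    Candidate("san-francisco-other", {"river", "harvest"}),
]
ranking = rank_by_clues(
    "The earthquake damaged buildings across San Francisco.", candidates
)
print(ranking)
```

A raw overlap count ignores term weighting and cannot separate candidates that share no clue terms with the context, so a real system would need a fallback for ties.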
Disambiguation (2)
Observations from Nov. ’02 name annotation round:
• For 80% of all name instances, evidence from local context was enough to determine which gazetteer entry was the corresponding one in over 75% of cases
  - This augurs well for successful automation
• No gazetteer linkage could be made for 20% of all name instances – either the name did not appear in the gazetteer at all (majority), or it appeared there in the wrong sense
  - This lack of gazetteer coverage presents a significant challenge
Failure modes (1)
• Lack of complete match on name
  - St. Petersburg – no variant in gazetteer with “St[.]”
• Multiple acceptable entries
  - [the] Crimea – one for “regions”, one for “capes”
• Transliteration differences
  - Sheremetyevo -> Sheremet’yevo
  - Belarus -> Byelarus
• Mismatch on feature type
  - Simferopol, Vladikavkaz – “capital” in doc, but not in gazetteer
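Some of the name-matching failures above can be reduced by normalizing names before lookup. The sketch below is a minimal illustration, not the method used in the annotation round: `normalize_name` and its rules are assumptions, and expanding "St." to "saint" is itself a guess that would be wrong where "St." means "Street".

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce a place name to a loose matching key (illustrative rules only)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Drop a leading article, as in "[the] Crimea".
    text = re.sub(r"^the\s+", "", text)
    # Expand "St." or "St" to "saint" so both spellings share one key.
    text = re.sub(r"\bst\b\.?", "saint", text)
    # Remove straight and curly apostrophes used in some transliterations.
    text = re.sub("['\u2019]", "", text)
    # Collapse remaining punctuation and whitespace runs to single spaces.
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    return text


print(normalize_name("St. Petersburg"))
print(normalize_name("Sheremet\u2019yevo") == normalize_name("Sheremetyevo"))
print(normalize_name("Belarus") == normalize_name("Byelarus"))
```

The last comparison is false: spelling-level differences such as Belarus and Byelarus are not fixed by character cleanup and would need an explicit variant list or approximate matching.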
Failure modes (2)
• Many matching entries, but no clear winner
  - Prigorodny – 16 hits on Prigorod (many in Russia)
• No entry for general places
  - Asia – no entry in gazetteer
• Variant name missing from entry
  - America – no match in gazetteer (i.e., not a listed variant)
• Name in doc matches wrong entry in gaz
  - The Heavenly Ski Resort – exactly matches entry with BUILDING feature, but correct entry is under Heavenly Valley Ski Area (with LOCALE feature in USGS GNIS and “sports facilities” feature in ADL gaz)
Foreign language
• Example: TIDES surprise language exercise
  - Challenge: Develop resources and NLP tools for a foreign language in a month (June)
  - Can’t expect to find an existing placename gazetteer for this language
  - This language is likely to have a non-western script; ease of transliteration unpredictable
Community
• Offerings from SPAWAR Systems Center:
  - Annotated corpora available to those with licenses for source texts, along with annotation protocol
  - “Modernized” (with respect to diacritics) Tipster gazetteer available upon request
• Call for papers:
  - Special issue of TALIP journal on temporal and spatial information processing (Editors: Mani, Pustejovsky, Sundheim)
  - Submissions due December 1 – think about it!
Tagging
• Finding the entity in text
• Disambiguation
• Type assignment
• Grounding
  - Linking to unique gazetteer entry
  - Assigning coordinates
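One way to picture the tagging steps is as a record that is filled in stage by stage. The sketch below is an assumption for illustration only: the `PlaceTag` class, its field names, the stage labels, and the placeholder place name, id, and coordinates are all hypothetical and do not reflect any particular system or annotation scheme.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PlaceTag:
    """One candidate place reference, filled in as tagging proceeds."""
    text: str                                          # entity finding
    start: int
    end: int
    is_place: Optional[bool] = None                    # disambiguation
    feature_type: Optional[str] = None                 # type assignment
    gazetteer_id: Optional[str] = None                 # grounding: entry link
    coordinates: Optional[Tuple[float, float]] = None  # grounding: lat/lon

    def stage(self):
        """Return the name of the last tagging step completed for this span."""
        if self.coordinates is not None:
            return "grounded"
        if self.gazetteer_id is not None:
            return "linked"
        if self.feature_type is not None:
            return "typed"
        if self.is_place is not None:
            return "disambiguated"
        return "found"


sentence = "Flights to Exampleville resumed."
start = sentence.index("Exampleville")
tag = PlaceTag("Exampleville", start, start + len("Exampleville"))
stages = [tag.stage()]

tag.is_place = True                  # decided it names a place
stages.append(tag.stage())
tag.feature_type = "populated place"
stages.append(tag.stage())
tag.gazetteer_id = "toy-gaz:0001"    # placeholder id, not a real entry
stages.append(tag.stage())
tag.coordinates = (0.0, 0.0)         # placeholder coordinates
stages.append(tag.stage())
print(stages)
```

Keeping each step as a separate optional field makes it visible when a reference was found and typed but could not be linked to a gazetteer entry.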
Annotation standards
• Example: Automatic Content Extraction (ACE)
  - XML-based
  - Levels: mentions (instances), entities, inter-entity relations
  - Types of mentions: names, nominals (descriptive references), pronouns
  - Entity categories wrt places: LOCATION, FACILITY, GEOPOLITICAL ENTITY (GPE)
  - Each category has defined subtypes (new)
  - Scheme allows for metonymic usage and fuzzy meaning
  - Software tools to support manual annotation, output format transformation, annotation lookup and review
  - Entity and relation schemes could/should be elaborated further over time
Volume and pressure
Conclusions
• Procedural input sought from participants: shall we summarize at the end?
• Who is “we”: Organizers? Session chairs? Committee members? Panel?