![Page 1: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/1.jpg)
Extracting Geographical Gazetteers from the Internet
Olga Uryupina, 30.05.03
![Page 2: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/2.jpg)
Overview
• Named Entity Recognition & Gazetteers
• Data
• Initial Algorithm
• Bootstrapping approach
• Evaluation
• ToDo
![Page 3: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/3.jpg)
NE Recognition
National Gallery of Scotland – The nucleus of the Gallery was formed by the Royal Institution’s collection, later expanded by bequests and purchasing. Playfair designed (1850-57) the imposing classical building to house the works.
![Page 4: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/4.jpg)
State-of-the-art systems
Standard approaches usually combine
• Rules
• Statistics
• Gazetteers

Classes distinguished:
• Person
• Organisation
• Location
![Page 5: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/5.jpg)
NE Recognition – with and without gazetteers
(Mikheev, Moens, and Grover, 1999) ran their system in different modes
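| Class | Full gazetteer: Recall | Full gazetteer: Precision | No gazetteer: Recall | No gazetteer: Precision |
|---|---|---|---|---|
| organisation | 90% | 93% | 86% | 85% |
| person | 96% | 98% | 90% | 95% |
| location | 95% | 94% | 46% | 59% |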
![Page 6: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/6.jpg)
Fine-grained NER
Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.
![Page 10: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/10.jpg)
Manually created gazetteers
Available resources:
• Word lists from the Web
• Atlases & maps
• Digital gazetteers (e.g. Alexandria Digital Library)
![Page 11: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/11.jpg)
Manually created gazetteers – drawbacks
• Only positive data (no way to find out whether Mainau island does not exist or is simply not listed)
• Difficult to adjust when new classes are required
• Not available for most languages: e.g. Aquisgrana (the Italian name of Aachen)
![Page 12: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/12.jpg)
Task
We can get rid of manually compiled gazetteers by using the Internet.
Task: subclassify locations using Internet hit counts (obtained from the AltaVista search engine).
Offline vs. Online processing
![Page 13: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/13.jpg)
Data
Manually created gazetteer (1260 items)
Classes:
• COUNTRY: Pitcairn
• REGION: Bavaria/Bayern
• RIVER: Oder
• ISLAND: Savai‘i
• MOUNTAIN: Ohmberge
• CITY: Nancy

Washington: 11×CITY, 1×MOUNTAIN, 2×ISLAND, (31+1+1)×REGION
![Page 14: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/14.jpg)
Data
Gazetteer example:
• Toronto: CITY
• Totonicapan: CITY, REGION
• Trinidad: CITY, RIVER, ISLAND
![Page 15: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/15.jpg)
Data
For each class we sampled 100 items from the gazetteer. As the lists overlap, this results in 520 different items (TRAINING data). The rest was used for TESTING.
• CITY: ...
• REGION: ...
• COUNTRY: ...
• RIVER: ..., Victoria, ...
• ISLAND: ..., Victoria, ...
• MOUNTAIN: ..., Victoria, ...

TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
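The sampling step can be sketched roughly as follows. This is only an illustration, not the original code: the function names, the toy gazetteer and the fixed random seed are assumptions. What comes from the slides is the per-class sampling, the union of the samples as TRAINING, the rest as TESTING, and the labelling of each training item against its full gazetteer entry.

```python
import random

CLASSES = ["CITY", "REGION", "COUNTRY", "RIVER", "ISLAND", "MOUNTAIN"]


def split_gazetteer(gazetteer, per_class=100, seed=0):
    """Sample up to `per_class` names per class; the union is TRAINING, the rest TESTING."""
    rng = random.Random(seed)  # fixed seed is an assumption, used only for reproducibility
    training = set()
    for cls in CLASSES:
        members = sorted(name for name, classes in gazetteer.items() if cls in classes)
        training.update(rng.sample(members, min(per_class, len(members))))
    testing = set(gazetteer) - training
    return training, testing


def labels(name, gazetteer):
    """One +/- label per class, read off the full gazetteer entry of `name`."""
    return {cls: cls in gazetteer[name] for cls in CLASSES}


# Toy gazetteer (hypothetical subset; the entries follow the examples on the slides).
toy = {
    "Toronto": {"CITY"},
    "Totonicapan": {"CITY", "REGION"},
    "Trinidad": {"CITY", "RIVER", "ISLAND"},
    "Victoria": {"CITY", "REGION", "RIVER", "ISLAND", "MOUNTAIN"},
    "Pitcairn": {"COUNTRY"},
}
training, testing = split_gazetteer(toy, per_class=1)
victoria = labels("Victoria", toy)
```

Because the class lists overlap, the union is smaller than the number of classes times the sample size, which is why 6 × 100 samples yield only 520 distinct training items.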
![Page 16: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/16.jpg)
Initial system
For each class a set of keywords was created.
ISLAND: island, islands, archipelago
![Page 17: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/17.jpg)
Initial system
For each item X to be classified, queries of the form “X KEYWORD” and “KEYWORD of X” are sent to the AltaVista search engine.

| Query | Count | Count / #X |
|---|---|---|
| Newfoundland | 622385 | |
| Newfoundland island | 501 | 0.00080 |
| island of Newfoundland | 3505 | 0.00563 |
| Newfoundland islands | 7 | 0.00001 |
| islands of Newfoundland | 83 | 0.00013 |
| Newfoundland archipelago | 1 | 0.00000 |
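A rough sketch of how the queries and the normalised counts could be produced is given below. The function names are hypothetical and the search-engine call is left out entirely; the counts 501, 3505, 7, 83 and 1 are the slide's figures for Newfoundland as read here, and dividing by the count of the bare item (622385) reproduces the ratios shown.

```python
def build_queries(item, keywords):
    """Build the two query forms, "X KEYWORD" and "KEYWORD of X", for every keyword."""
    queries = []
    for kw in keywords:
        queries.append(f"{item} {kw}")
        queries.append(f"{kw} of {item}")
    return queries


def relative_counts(item_count, query_counts):
    """Normalise each query count by the count of the bare item (#X)."""
    if item_count == 0:
        return {q: 0.0 for q in query_counts}
    return {q: n / item_count for q, n in query_counts.items()}


queries = build_queries("Newfoundland", ["island", "islands", "archipelago"])

# Hit counts taken from the slide; fetching them from a search engine is not shown.
counts = {
    "Newfoundland island": 501,
    "island of Newfoundland": 3505,
    "Newfoundland islands": 7,
    "islands of Newfoundland": 83,
    "Newfoundland archipelago": 1,
}
features = relative_counts(622385, counts)
```

Normalising by #X keeps frequent names from dominating: a raw count of 501 means little without knowing that the name itself occurs more than 600,000 times.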
![Page 18: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/18.jpg)
Initial system
Machine learners use the counts to induce classifications. Learners tested for this task:
• C4.5
• TiMBL
• Ripper
![Page 19: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/19.jpg)
Initial system – drawbacks
Still needs manually created resources:
• Set of patterns
• Initial gazetteer (TRAINING)

Only online (slow) processing – the system can only classify items provided by the user, but cannot extract new names itself
![Page 20: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/20.jpg)
Bootstrapping
Riloff & Jones, 1999 – Bootstrapping for IE task
ITEMS ⇄ PATTERNS
![Page 21: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/21.jpg)
Bootstrapping
Main problem – noise: the pattern set can get infected
Remedies:
• Vaccine (external algorithm for evaluating patterns)
• Stop lists
• Human experts
![Page 22: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/22.jpg)
Bootstrapping loop (diagram):
Initial gazetteer → Extraction items → Collecting patterns → Discarding most general patterns → Learning classifiers → Extraction patterns → Collecting items → Discarding common names → Classifying items (with the learned high-precision classifier) → Extraction items (next loop)
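The loop in the diagram can be read as the skeleton below. It is a sketch under several assumptions: every step is passed in as a hypothetical callable (none of these names come from the slides), one item set stands in for the per-class sets, and accepted items are merged with the earlier ones rather than replacing them.

```python
def bootstrap(seed_items, collect_patterns, rescore, learn, collect_items, discard, loops=2):
    """Run the six steps of the diagram `loops` times and return the grown item set."""
    items = set(seed_items)
    for _ in range(loops):
        patterns = collect_patterns(items)                # step 1: collect patterns
        best = rescore(patterns)                          # step 2: discard most general patterns
        extraction_patterns, accept = learn(best, items)  # step 3: learn classifiers
        candidates = collect_items(extraction_patterns)   # step 4: collect items
        candidates = discard(candidates)                  # step 5: discard common names
        items |= {c for c in candidates if accept(c)}     # step 6: high-precision classifier
    return items


# Toy stand-ins so the skeleton runs without a search engine (all hypothetical).
result = bootstrap(
    seed_items={"Achill"},
    collect_patterns=lambda items: {"X island": len(items)},
    rescore=lambda patterns: patterns,
    learn=lambda best, items: (list(best), lambda name: name != "About"),
    collect_items=lambda patterns: ["Akutan", "About", "island"],
    discard=lambda candidates: [c for c in candidates if c[:1].isupper()],
)
```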
![Page 25: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/25.jpg)
Collecting patterns (step 1)
• Go to AltaVista
• ask for an item
• download first n pages
• match with a simple regexp → patterns
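The matching step might look like the sketch below. The slides only say "a simple regexp", so the tokenisation, the context window of up to two words and the function name are assumptions; downloading the pages is not shown, and only single-word items are handled.

```python
import re
from collections import Counter


def collect_patterns(text, item, max_words=2):
    """Count left, right and two-sided word contexts of `item`, written with X as placeholder."""
    tokens = re.findall(r"\w+", text.lower())
    target = item.lower()
    patterns = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for k in range(1, max_words + 1):
            if i - k >= 0:  # k words to the left, e.g. "island of X"
                patterns[" ".join(tokens[i - k:i]) + " X"] += 1
            if i + k < len(tokens):  # k words to the right, e.g. "X island"
                patterns["X " + " ".join(tokens[i + 1:i + 1 + k])] += 1
        if 0 < i < len(tokens) - 1:  # one word on each side, e.g. "of X islands"
            patterns[f"{tokens[i - 1]} X {tokens[i + 1]}"] += 1
    return patterns


# Hypothetical page text for illustration.
page = "The island of Achill is off the west coast. Achill island is large."
found = collect_patterns(page, "Achill")
```

Summing such counters over all downloaded pages and all items of a class gives per-class pattern counts that can then be ranked.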
![Page 26: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/26.jpg)
Example – step 1
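10 best patterns for ISLAND:

| Pattern | Score |
|---|---|
| of X | 70 |
| the X | 60 |
| X and | 58 |
| X the | 55 |
| to X | 53 |
| in X | 52 |
| and X | 47 |
| X is | 45 |
| X in | 45 |
| on X | 45 |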
![Page 28: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/28.jpg)
Rescoring (step 2)
Goal: discard patterns that are too general

$$s'(p, c_i) = s(p, c_i) - \sum_{j \neq i} s(p, c_j)\, a(c_j, c_i)$$

• $s(p, c)$ – score of pattern p for class c
• $\sum_{j \neq i} s(p, c_j)\, a(c_j, c_i)$ – penalty for appearing in more than one class
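A small sketch of the rescoring follows, reading the formula as the original score minus the weighted scores of the same pattern in the other classes. The weight a(c_j, c_i) is taken to be a constant 1.0 here, which is purely an assumption, and the scores in the example are made up.

```python
def rescore(scores, weight):
    """scores: {pattern: {class: score}}; weight(cj, ci) plays the role of a(c_j, c_i)."""
    rescored = {}
    for pattern, by_class in scores.items():
        rescored[pattern] = {
            ci: s - sum(sj * weight(cj, ci) for cj, sj in by_class.items() if cj != ci)
            for ci, s in by_class.items()
        }
    return rescored


# Hypothetical scores: "of X" is frequent for every class, "X island" only for ISLAND.
scores = {
    "of X": {"ISLAND": 70, "CITY": 65},
    "X island": {"ISLAND": 17, "CITY": 0},
}
new_scores = rescore(scores, weight=lambda cj, ci: 1.0)  # constant weight is an assumption
```

With these numbers the general pattern "of X" drops from 70 to 5 for ISLAND while "X island" keeps its 17, so the class-specific pattern now ranks first.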
![Page 29: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/29.jpg)
Example – step 2
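10 best patterns for ISLAND:

| Pattern | Score |
|---|---|
| X island | 17 |
| island of X | 9 |
| X islands | 8 |
| island X | 7 |
| islands X | 7 |
| insel X | 7 |
| the island X | 6 |
| X elects | 5 |
| of X islands | 5 |
| zealand X | 4 |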
![Page 31: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/31.jpg)
Learning classifiers (step 3)
20 best patterns are used to train Ripper (as in the initial system)
Produced classifiers:
• high-recall
• high-accuracy
• high-precision
![Page 32: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/32.jpg)
Example – step 3
• High-recall classifier for ISLAND:
  if #(“X island”)/#X >= 0.003879, classify X as +ISLAND
  if #(“and X islands”)/#X >= 0.000002, classify X as +ISLAND
  if #(“insel X”)/#X >= 0.017099, classify X as +ISLAND
  otherwise classify X as –ISLAND
• Extraction patterns: “X island”, “and X islands”, “insel X”
![Page 33: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/33.jpg)
One more example – step 3
• High-accuracy classifier for ISLAND:
  if #(“X island”)/#X >= 0.000636, classify X as +ISLAND
  if #(“and X islands”)/#X >= 0.000002 and #(“X sea”)/#X >= 0.000013 and #(“X geography”) < 13, classify X as +ISLAND
  if #(“X islands”)/#X >= 0.000056 and #(“pacific islands X”)/#X >= 0.000006, classify X as +ISLAND
  otherwise classify X as –ISLAND
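The two ISLAND rule sets can be written directly as code. The thresholds are those on the slides; the function names and the layout of the `counts` dictionary (pattern to hit count) are assumptions, and the counts in the usage lines are illustrative except for the 501 hits of “Newfoundland island”.

```python
def ratio(counts, pattern, n):
    """#(pattern) / #X, with missing patterns counted as 0."""
    return counts.get(pattern, 0) / n if n else 0.0


def high_recall_island(counts, n):
    """High-recall rules: any single condition is enough for +ISLAND."""
    return (
        ratio(counts, "X island", n) >= 0.003879
        or ratio(counts, "and X islands", n) >= 0.000002
        or ratio(counts, "insel X", n) >= 0.017099
    )


def high_accuracy_island(counts, n):
    """High-accuracy rules: a disjunction of conjunctions."""
    def r(pattern):
        return ratio(counts, pattern, n)

    return (
        r("X island") >= 0.000636
        or (
            r("and X islands") >= 0.000002
            and r("X sea") >= 0.000013
            and counts.get("X geography", 0) < 13  # absolute count, as written on the slide
        )
        or (r("X islands") >= 0.000056 and r("pacific islands X") >= 0.000006)
    )


newfoundland = {"X island": 501}  # other pattern counts unknown, treated as 0
accurate = high_accuracy_island(newfoundland, 622385)
recalled = high_recall_island(newfoundland, 622385)
german = high_recall_island({"insel X": 20000}, 1000000)  # made-up counts
```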
![Page 36: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/36.jpg)
Collecting and discarding items (steps 4&5)
The same procedure as in step 1: go to AltaVista, ask for the extraction patterns (cf. step 3), ...
Discarding: common names (words beginning with lower-case letters) and stop words (not strictly necessary, but this saves time)
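The discarding step amounts to a simple filter, sketched here with a hypothetical stop list and made-up candidates. Capitalised non-names such as "About" pass this filter and are left to the classifier in step 6.

```python
def discard_common(candidates, stop_words):
    """Keep candidates that start with an upper-case letter and are not stop words."""
    stop = {w.lower() for w in stop_words}
    return [c for c in candidates if c[:1].isupper() and c.lower() not in stop]


# Hypothetical stop list and candidate list.
kept = discard_common(["island", "The", "Achill", "About", "Akutan"], stop_words=["the", "and"])
```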
![Page 37: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/37.jpg)
Example – steps 4 and 5
Extracted islands (alphabetically): About, Abyss, Achill, Active, Adatara, Akutan, Alaska, Alaskan, Albarella, All, Amelia, American
![Page 39: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/39.jpg)
Classifying (step 6)
The high-precision classifier (cf. step 3) is run on the collected items:
• rejected items are discarded
• accepted items are used for extraction at the next loop
![Page 40: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/40.jpg)
Example – step 6
Extracted islands (alphabetically): Achill, Akutan, Albarella, Amelia, Andaman, Ascension, Bainbridge, Baltrum, Beaver, Big, Block, Bouvet
![Page 41: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/41.jpg)
Evaluation
Classifiers:
• initial system
• bootstrapping from the seed gazetteer
• bootstrapping from positive examples only

Item lists:
• bootstrapping from the seed gazetteer
![Page 42: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/42.jpg)
Initial system – evaluation
| Class | Accuracy |
|---|---|
| CITY | 74.3% |
| ISLAND | 95.8% |
| RIVER | 88.8% |
| MOUNTAIN | 88.7% |
| COUNTRY | 98.8% |
| REGION | 82.3% |
| average | 88.1% |
![Page 43: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/43.jpg)
Bootstrapping – evaluation
| Class | Initial system | After the 1st loop | After the 2nd loop |
|---|---|---|---|
| CITY | 74.3% | 51.2% | 62.0% |
| ISLAND | 95.8% | 91.4% | 96.4% |
| RIVER | 88.8% | 91.5% | 89.6% |
| MOUNTAIN | 88.7% | 89.1% | 88.8% |
| COUNTRY | 98.8% | 99.2% | 99.6% |
| REGION | 82.3% | 80.4% | 82.6% |
| average | 88.1% | 83.8% | 86.5% |
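The "average" row appears to be the unweighted mean over the six classes. A quick check under that assumption, with hypothetical variable names and the accuracies copied from the table:

```python
from statistics import mean

# Per-class accuracies (%) for the three systems in the table.
accuracy = {
    "initial": {"CITY": 74.3, "ISLAND": 95.8, "RIVER": 88.8,
                "MOUNTAIN": 88.7, "COUNTRY": 98.8, "REGION": 82.3},
    "loop1": {"CITY": 51.2, "ISLAND": 91.4, "RIVER": 91.5,
              "MOUNTAIN": 89.1, "COUNTRY": 99.2, "REGION": 80.4},
    "loop2": {"CITY": 62.0, "ISLAND": 96.4, "RIVER": 89.6,
              "MOUNTAIN": 88.8, "COUNTRY": 99.6, "REGION": 82.6},
}

# Unweighted mean over classes, rounded to one decimal as in the table.
averages = {name: round(mean(vals.values()), 1) for name, vals in accuracy.items()}
```

Since every class counts equally, the large drop for CITY alone accounts for most of the lower averages after bootstrapping.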
![Page 44: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/44.jpg)
Comparing the performance
• RIVER, MOUNTAIN, COUNTRY – the new system is better!
• ISLAND – the new system improved and became better after the 2nd loop.
• REGION – infected category (“departments of X”); however, the system is improving.
• CITY – very heterogeneous class (homonymy); 1st loop – “streets of X”, 2nd loop – “km from X”, “ort X”.
![Page 45: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/45.jpg)
Comparing the systems
Bootstrapping (vs. the initial system):
(+) patterns learned automatically
(+) word lists produced
(-) cheap seed gazetteer

Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify them properly
![Page 46: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/46.jpg)
Learning from positives
• CITY: ...
• REGION: ...
• COUNTRY: ...
• RIVER: ..., Victoria, ...
• ISLAND: ..., Victoria, ...
• MOUNTAIN: ..., Victoria, ...

Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
Now: => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
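The "Now" labelling can be sketched as below: an item is positive for a class only if it occurs in the positive list sampled for that class, and negative otherwise. The function names and the toy lists are assumptions; the expected labels for Victoria are those on the slide.

```python
CLASSES = ["CITY", "REGION", "RIVER", "ISLAND", "MOUNTAIN", "COUNTRY"]


def labels_from_positives(name, positive_lists):
    """+class only if `name` is in that class's positive list; everything else is negative."""
    return {cls: name in positive_lists.get(cls, ()) for cls in CLASSES}


def show(name, labels):
    """Render labels in the slide's notation, e.g. 'Victoria [-CITY, ..., +RIVER, ...]'."""
    marks = ", ".join(("+" if labels[cls] else "-") + cls for cls in CLASSES)
    return f"{name} [{marks}]"


# Toy positive lists (hypothetical): Victoria was sampled for three classes only.
positives = {
    "RIVER": {"Oder", "Victoria"},
    "ISLAND": {"Trinidad", "Victoria"},
    "MOUNTAIN": {"Victoria"},
    "CITY": {"Toronto"},
}
line = show("Victoria", labels_from_positives("Victoria", positives))
```

The cost of this cheaper setup is label noise: Victoria is in fact also a city and a region, yet it enters training as a negative example for both.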
![Page 47: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/47.jpg)
Initial system – evaluation
| Class | Precompiled gazetteer | Positives only |
|---|---|---|
| CITY | 74.3% | 50.3% |
| ISLAND | 95.8% | 94.1% |
| RIVER | 88.8% | 91.0% |
| MOUNTAIN | 88.7% | 89.3% |
| COUNTRY | 98.8% | 99.6% |
| REGION | 82.3% | 86.9% |
| average | 88.1% | 85.2% |
![Page 48: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/48.jpg)
Bootstrapping with positives only – evaluation
| Class | 1st loop | 2nd loop |
|---|---|---|
| CITY | 39.3% | 44.1% |
| ISLAND | 94.5% | 95.8% |
| RIVER | 91.2% | 91.1% |
| MOUNTAIN | 90.1% | 91.2% |
| COUNTRY | 98.7% | 99.6% |
| REGION | 86.5% | 81.6% |
| average | 83.4% | 83.9% |
![Page 49: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/49.jpg)
New items
New ISLANDs:
• true islands: 121 (90.3%)
  – found in the atlases: 93
  – not found: 28
• descriptions: 5 (3.7%)
• parts of names: 3 (2.2%)
• mistakes: 5 (3.7%)

all: 134
![Page 50: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/50.jpg)
Conclusion
Advantages of our approach:
• very little manually collected data required (seed gazetteer)
• no sophisticated engineering – patterns produced automatically
• on-line classifiers provide negative information and are applicable to any entity
• new items (off-line gazetteer) collected automatically
![Page 51: Extracting Geographical Gazetteers from the Internet Olga Uryupina 30.05.03](https://reader036.vdocument.in/reader036/viewer/2022062401/5a4d1b0f7f8b9ab05998e222/html5/thumbnails/51.jpg)
ToDo
• new classes -> hierarchy
• multi-word expressions
• more elaborate learning from positive examples
• determine locations (where is X?)