Ontology-Aware Information Extraction
http://gate.ac.uk/
Hamish Cunningham, Kalina Bontcheva
Department of Computer Science, University of Sheffield
OntoWeb 4, SIG 5, 2002
2(12)
GATE, a General Architecture for Text Engineering
GATE is…
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995). Download at http://gate.ac.uk/download
Comes with…
• Some free components, and wrappers for other people's components
• Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.
3(12)
Applications; languages
GATE has been used for a variety of applications, including:
• MUMIS: automatic creation of semantic indexes for multimedia programme material
• MUSE: a multi-genre IE system
• EMILLE: a 70 million word corpus of Indic languages
• Metadata for Medline (at Merck)
• Creation of metadata for Semantic Web Services; documentation using NLG
• HSE: summarisation of health and safety information from company reports
• OldBaileyIE: NE recognition on 17th century Old Bailey Court reports.
• AKT: language technology in knowledge management
• AMITIES: call centre automation
• Digital libraries / e-philology for ancient languages researchers
• Various Medical Informatics and database technology projects
• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian planned for next year)
4(12)
Some users…
At the time of writing, a representative fraction of GATE users includes:
• Longman Pearson publishing, UK;
• BT Exact Technologies, UK;
• Merck KGaA, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN Technologies, US;
• Sirma AI Ltd., Bulgaria;
• Resco AB, Sweden/Finland/Germany;
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts;
• Master Foods NV: extraction of commodities events from news;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU universities;
• the Perseus Digital Library project, Tufts University, US.
5(12)
Scientific method and HLT
• How do we really know that this stuff works?!
• Open source systems make experimental repeatability easier and therefore cut down on site-specific skew effects.
• GATE's IE tools have competed in MUC, TREC (QA), ACE, and DUC. TIDES Surprise Language exercise next year.
• GATE includes markup and automated evaluation tools: easier quantitative evaluation.
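To illustrate what automated quantitative evaluation involves, the sketch below compares system annotations against a hand-annotated key and computes precision, recall and F-measure. It is a minimal Python sketch under assumed conventions (annotations as `(start, end, type)` tuples, strict matching only). It is not GATE's evaluation tool; the function name `score` and the sample data are hypothetical.

```python
def score(key, response):
    """Strict-match precision, recall and F1 over (start, end, type) tuples."""
    key_set, response_set = set(key), set(response)
    correct = len(key_set & response_set)  # same span and same type
    precision = correct / len(response_set) if response_set else 0.0
    recall = correct / len(key_set) if key_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical data: hand-annotated key vs. system output.
key = [(0, 7, "Person"), (20, 29, "Location"), (35, 40, "Organization")]
response = [(0, 7, "Person"), (20, 29, "Organization")]

precision, recall, f1 = score(key, response)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here the second system annotation has the right span but the wrong type, so under strict matching it counts as an error for both precision and recall.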
6(12)
Collaboration opportunities
• Interoperation, integration, not re-invention: collaboration not competition
• Take the code, do what you like with it, perhaps contribute something back
• Involve us in your 6th Framework projects
• Join KITShare: a network of excellence in Knowledge and Interface Tool Sharing.
7(12)
The Holy Grail
• Problem: gap between many current IE tools and SemWeb needs
8(12)
What is needed?
• Content, not Information Extraction
– Identify the ontological reference, not just the class
– Maintain referential integrity (coreference)
• Ontology-aware IE tools
– Use instances already in the ontology
– React to changes in the ontology
• Support experienced users in changing the IE tools
9(12)
GATE and Content Extraction
ANNIE: an open-source IE system in GATE, providing the modules needed for content extraction
• Pre-processing
• Named entity recognition
• Coreference resolution
– ANNIE handles proper names, pronouns, and nominals
• Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results
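The idea of a pattern-action rule can be sketched as follows: each rule pairs a pattern to match in the text with an action that builds an annotation from the match. This is a plain Python analogue using regular expressions, not GATE's own rule language; the names `RULES` and `apply_rules`, and the annotation layout, are assumptions made for illustration.

```python
import re

# Each rule pairs a pattern (what to match) with an action (what to produce).
# The action receives the regex match and returns an annotation dict.
RULES = [
    (re.compile(r"\b(?:Mr|Mrs|Dr)\. ([A-Z][a-z]+)"),
     lambda m: {"type": "Person", "start": m.start(), "end": m.end(),
                "surname": m.group(1)}),
]


def apply_rules(text, rules=RULES):
    """Run every rule over the text and collect annotations in text order."""
    annotations = []
    for pattern, action in rules:
        for match in pattern.finditer(text):
            annotations.append(action(match))
    return sorted(annotations, key=lambda a: a["start"])


text = "Dr. Smith met Mr. Brown in Sheffield."
annotations = apply_rules(text)
for ann in annotations:
    print(ann["type"], text[ann["start"]:ann["end"]])
```

Customisation in this sketch amounts to appending another `(pattern, action)` pair to `RULES`; the matching loop itself stays unchanged.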
10(12)
Populating Ontologies with ANNIE
11(12)
Ontologies as explicit IE resources
• Reuse, not reinvention:
– Protégé for ontology maintenance
– Sesame/KAON for storage and reasoning
• Ontology-aware gazetteers
– Provide the ontological class of each entry
– Use instances from the ontology for IE
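A rough sketch of an ontology-aware gazetteer: the lookup list is built from instances in the ontology, and every hit carries both the instance identifier and its ontological class. This is illustrative Python only, not GATE's gazetteer component; the ontology fragment, the `http://example.org/` URIs and the function names are all hypothetical.

```python
# Hypothetical ontology fragment: instance URI -> (class, known surface forms).
ONTOLOGY = {
    "http://example.org/onto#Sheffield": ("City", ["Sheffield"]),
    "http://example.org/onto#UnivSheffield": ("University",
                                              ["University of Sheffield"]),
}


def build_gazetteer(ontology):
    """Map each surface form to (instance URI, ontological class)."""
    return {form: (uri, cls)
            for uri, (cls, forms) in ontology.items()
            for form in forms}


def lookup(text, gazetteer):
    """Return non-overlapping hits, preferring longer entries."""
    hits, taken = [], set()
    for form in sorted(gazetteer, key=len, reverse=True):
        start = text.find(form)
        while start != -1:
            span = range(start, start + len(form))
            if not taken.intersection(span):
                uri, cls = gazetteer[form]
                hits.append({"start": start, "end": start + len(form),
                             "class": cls, "instance": uri})
                taken.update(span)
            start = text.find(form, start + 1)
    return sorted(hits, key=lambda h: h["start"])


text = "The University of Sheffield is in Sheffield."
hits = lookup(text, build_gazetteer(ONTOLOGY))
for hit in hits:
    print(text[hit["start"]:hit["end"]], hit["class"], hit["instance"])
```

Because the gazetteer is derived from the ontology rather than from a separate flat list, rebuilding it after an ontology update picks up new instances without touching the lookup code.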
12(12)
Ontology-aware IE
• The IE tools can use available formal knowledge and reasoning
• Ontology-based anaphora resolution
– G. Bush, G. Brown, the president
• The correct ontological classes are assigned to the recognised entities
• Changes in the ontology are available to the IE tools
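The anaphora example can be sketched as follows: a nominal such as "the president" is resolved to the most recent antecedent whose ontological class is compatible with it. This is a toy Python sketch, not ANNIE's coreference module; the class hierarchy, the class assigned to each name and the function names are assumptions made purely for illustration.

```python
# Toy ontology (assumed for illustration): subclass links and instance classes.
SUBCLASS_OF = {"President": "Politician", "Minister": "Politician",
               "Politician": "Person"}
INSTANCE_OF = {"G. Bush": "President", "G. Brown": "Minister"}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def resolve_nominal(nominal_class, antecedents):
    """Pick the most recent antecedent whose class fits the nominal."""
    for name in reversed(antecedents):
        if is_a(INSTANCE_OF.get(name), nominal_class):
            return name
    return None


mentions = ["G. Bush", "G. Brown"]  # in textual order
resolved = resolve_nominal("President", mentions)  # "the president"
print(resolved)
```

The resolver reads the ontology at call time, so in this sketch an edit to `INSTANCE_OF` or `SUBCLASS_OF` changes the next resolution without any change to the IE code.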