8/10/2019 Iesl03 Multiling IEff
GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction
Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham
Department of Computer Science, University of Sheffield
http://gate.ac.uk/
Structure of the talk:
  A brief introduction to GATE
  Multilingual infrastructure in GATE
  Simple multilingual IE components
GATE is...
  An architecture: a macro-level organisational picture for LE software systems.
  A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
  A development environment: for language engineers, computational linguists et al., a graphical development environment.
GATE comes with...
  Some free components...
  ...and wrappers for other people's components
  Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
  Free software (LGPL). Download at http://gate.ac.uk/download/
Architectural principles
  Non-prescriptive, theory neutral (strength and weakness)
  Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
  (Almost) everything is a component, and component sets are user-extendable
  (Almost) all operations are available both from API and GUI
Component-based development
  CREOLE (Collection of REusable Objects for Language Engineering):
    Java Beans: an OO way of chunking software
    GATE components: modified Java Beans with XML configuration
    The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
    Three types: Language Resources, Processing Resources, Visual Resources
  Why bother?
    Allows the system to load arbitrary language processing components
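The idea of loading arbitrary components from a declarative description can be sketched in a few lines. This is only an illustration in Python, not GATE's Java API: the XML layout, the class names and the `load_components` helper are all invented for the example and do not follow the real CREOLE configuration format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration. This is NOT the real CREOLE XML schema,
# just a stand-in for "a few lines of XML per component".
CONFIG = """
<components>
  <component name="Upper-caser" class="UpperCaser"/>
  <component name="Reverser" class="Reverser"/>
</components>
"""

class UpperCaser:
    """Toy component with a bean-like execute method."""
    def execute(self, text):
        return text.upper()

class Reverser:
    """Second toy component, to show that the loader is generic."""
    def execute(self, text):
        return text[::-1]

# Classes the loader may instantiate, keyed by class name.
AVAILABLE_CLASSES = {cls.__name__: cls for cls in (UpperCaser, Reverser)}

def load_components(xml_text):
    """Return {component name: instance} for every entry in the config."""
    registry = {}
    for node in ET.fromstring(xml_text.strip()).findall("component"):
        component_class = AVAILABLE_CLASSES[node.get("class")]
        registry[node.get("name")] = component_class()
    return registry

registry = load_components(CONFIG)
print(registry["Upper-caser"].execute("gate"))  # GATE
```

The host system only knows the `execute` convention and the configuration; adding a component means adding a class and one configuration entry.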
Language Resources (LRs)
  LRs are documents, ontologies, corpora, lexicons, ...
  LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
  Documents / corpora:
    Diverse document formats: text, HTML, XML, email, RTF, SGML
    Optional format-preserving markup analyse / save
    Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES
  Coping with diverse character encodings:
    New internationalised versions of JVM support >100 different encodings.
    Other encodings: developing system for user-entry of mapping tables (remove programming from the process)
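A minimal sketch of the standoff model may help here: annotations live apart from the text and point into it by character offsets. The `Annotation` class, the `covered_text` helper and the example sentence are assumptions made for illustration; they are not GATE's own classes.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Standoff annotation: (start, end, type, features), end exclusive."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)

# The text itself is never modified; annotations only reference offsets.
text = "GATE was developed in Sheffield."
annotations = [
    Annotation(0, 4, "Token", {"kind": "word", "orth": "allCaps"}),
    Annotation(22, 31, "Location", {"source": "gazetteer"}),
]

def covered_text(document_text, annotation):
    """Return the span of text an annotation points at."""
    return document_text[annotation.start:annotation.end]

for annotation in annotations:
    print(annotation.type, repr(covered_text(text, annotation)))
```

Because the markup is kept outside the text, several overlapping annotation layers can coexist and the original document can be saved unchanged.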
Processing Resources (PRs)
  Algorithmic components, known as PRs: beans with execute methods.
  All PRs can handle Unicode data by default.
  Clear distinction between code and data (simple repurposing).
  20-30 freebies with GATE
  Controllers: execute a set of PRs
    SerialController: sequential run of arbitrary PR set
    SerialAnalyserController: analyser PRs over corpus
    Conditional controllers: execution depends on features
    Parallel controller?
  PRs + Controller = Applications
  Application parameterisation state can be saved and restored, and used for embedding / batching
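The controller idea above can be sketched as follows. This is a Python illustration under stated assumptions, not GATE's Java classes: the dictionary-based document, the two toy PRs and the predicate-based condition are all hypothetical.

```python
class Tokeniser:
    """Toy PR: stores whitespace-separated tokens on the document."""
    def execute(self, doc):
        doc["tokens"] = doc["text"].split()

class TokenCounter:
    """Toy PR: records the token count as a document feature."""
    def execute(self, doc):
        doc["features"]["token_count"] = len(doc.get("tokens", []))

class SerialController:
    """Runs every PR in order on one document."""
    def __init__(self, prs):
        self.prs = list(prs)

    def execute(self, doc):
        for pr in self.prs:
            pr.execute(doc)

class ConditionalController:
    """Runs each PR only if its predicate accepts the document features."""
    def __init__(self, conditional_prs):
        self.conditional_prs = list(conditional_prs)  # (pr, predicate) pairs

    def execute(self, doc):
        for pr, should_run in self.conditional_prs:
            if should_run(doc["features"]):
                pr.execute(doc)

# PRs + controller = application; here the counter runs for English only.
app = ConditionalController([
    (Tokeniser(), lambda features: True),
    (TokenCounter(), lambda features: features.get("lang") == "en"),
])

doc = {"text": "GATE handles Unicode", "features": {"lang": "en"}}
app.execute(doc)
print(doc["features"])  # {'lang': 'en', 'token_count': 3}
```

Running the same application over a corpus is then just a loop over documents, which is roughly the role the slide gives to SerialAnalyserController.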
VRs (2): Coreference
VRs (3): Syntax
Displaying Multilingual Data
  GATE uses the standard (and imperfect) Java rendering engine to display text.
Editing Multilingual Data
  GATE Unicode Kit (GUK): complements Java's facilities
    Support for defining Input Methods (IMs)
    Currently 30 IMs for 17 languages
    Pluggable in other applications (e.g. JEdit, EUDICO)
    Can use virtual keyboard or standard layouts over QWERTY
    IMs defined in plain text files
  GUK comes with a standalone Unicode editor
Processing Multilingual Data
  All processing, visualisation and editing tools use GUK
Multilingual IE Components
  The ANNIE system: a reusable and easily extendable set of components
The Unicode Tokeniser
  A very portable component for multiple languages:
    splits text into typed tokens
    based on an FSM dynamically constructed from rules
    rules are based on character categories defined by Unicode, e.g.:
      UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
    output generally localised by a later module (e.g. "don't" becomes "do" + "n't")
    23 rules seem able to handle Indo-European languages without changes
  The English tokeniser: Unicode tokeniser + pattern grammar FST
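The category-driven approach can be sketched with Python's standard `unicodedata` module. Only the upperInitial rule comes from the slide; the other rules, the one-letter category codes and the longest-match strategy are assumptions made for this illustration, and regular expressions stand in for the compiled FSM.

```python
import re
import unicodedata

# One-letter codes for the Unicode general categories used by the rules.
CATEGORY_CODES = {"Lu": "U", "Ll": "L", "Pd": "D", "Nd": "N", "Zs": "S"}

# Each rule is a pattern over category codes plus the token features it yields.
RULES = [
    # UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)*  (from the slide)
    (re.compile(r"U[LD]*"), {"kind": "word", "orth": "upperInitial"}),
    # The remaining rules are illustrative guesses, not the real rule set.
    (re.compile(r"UU+"), {"kind": "word", "orth": "allCaps"}),
    (re.compile(r"L[LD]*"), {"kind": "word", "orth": "lowercase"}),
    (re.compile(r"N+"), {"kind": "number"}),
    (re.compile(r"S+"), {"kind": "space"}),
    (re.compile(r"."), {"kind": "punctuation"}),  # fallback: any one character
]

def tokenise(text):
    """Split text into (string, features) pairs using the longest matching rule."""
    codes = "".join(
        CATEGORY_CODES.get(unicodedata.category(ch), "O") for ch in text
    )
    tokens, pos = [], 0
    while pos < len(text):
        best_end, best_features = pos, None
        for pattern, features in RULES:
            match = pattern.match(codes, pos)
            if match and match.end() > best_end:
                best_end, best_features = match.end(), features
        tokens.append((text[pos:best_end], best_features))
        pos = best_end
    return tokens

for token, features in tokenise("София well-known 2003"):
    print(repr(token), features)
```

Because the rules mention only character categories, the Cyrillic word is typed as an upperInitial word by the same rule that would handle a Latin-script name.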
POS tagging in new languages
  TIDES Surprise Language: Hepple tagger, but substituted Cebuano/Hindi lexicon for English
  Used empty ruleset since no training data available
  Used default heuristics (e.g. return NNP for capitalised words)
  Very experimental, but reasonable results: 67% correctness for Hindi and 75% for Cebuano
  Adaptation time per language: 2 days
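The lexicon-plus-heuristics setup can be illustrated roughly as below. This is not the Hepple tagger: the function, the toy English-like lexicon and the `NN` fallback for uncapitalised unknown words are assumptions; the slide only states that capitalised words default to NNP.

```python
def tag_tokens(tokens, lexicon, fallback_tag="NN"):
    """Tag by lexicon lookup, falling back to simple default heuristics.

    With an empty ruleset there is no contextual correction step, so the
    lexicon and the defaults decide every tag.
    """
    tagged = []
    for token in tokens:
        if token in lexicon:
            tag = lexicon[token]
        elif token.lower() in lexicon:
            tag = lexicon[token.lower()]
        elif token[:1].isupper():
            tag = "NNP"          # heuristic named on the slide
        else:
            tag = fallback_tag   # assumed default for other unknown words
        tagged.append((token, tag))
    return tagged

# Hypothetical toy lexicon; porting swaps this data, not the code.
TOY_LEXICON = {"the": "DT", "visited": "VBD"}

print(tag_tokens(["Maria", "visited", "the", "market"], TOY_LEXICON))
# [('Maria', 'NNP'), ('visited', 'VBD'), ('the', 'DT'), ('market', 'NN')]
```

Replacing `TOY_LEXICON` with a lexicon for another language is the only change needed, which mirrors the code/data separation the talk emphasises.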
Porting NE grammars
  Most English JAPE rules based on POS tags and gazetteer lookup
  Grammars can be reused for languages with similar word order, orthography etc.
  No time to make detailed study of Cebuano, but very similar in structure to English
  Most of the rules left as for English, but some adjustments, especially to handle dates
  Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages
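Using gazetteers for both languages side by side can be sketched as a simple merged lookup. The entries below are toy examples chosen for illustration, not taken from the actual TIDES resources, and the single-token lookup is far simpler than a real gazetteer or JAPE grammar.

```python
# Toy gazetteers mapping a lower-cased entry to an entity type.
ENGLISH_GAZETTEER = {"philippines": "location", "march": "date"}
CEBUANO_GAZETTEER = {"pilipinas": "location", "marso": "date"}

def merge_gazetteers(*gazetteers):
    """Combine several gazetteers; later ones win on duplicate entries."""
    merged = {}
    for gazetteer in gazetteers:
        merged.update(gazetteer)
    return merged

def lookup(tokens, gazetteer):
    """Return (token, entity type) for every token found in the gazetteer."""
    return [
        (token, gazetteer[token.lower()])
        for token in tokens
        if token.lower() in gazetteer
    ]

combined = merge_gazetteers(ENGLISH_GAZETTEER, CEBUANO_GAZETTEER)
print(lookup(["Pilipinas", "Philippines", "Marso", "unknown"], combined))
# [('Pilipinas', 'location'), ('Philippines', 'location'), ('Marso', 'date')]
```

Rules that test only the entity type then fire regardless of which language's list supplied the match.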
TIDES Evaluation Results
              Cebuano            English Baseline
Entity       P     R     F       P     R     F
Person      71    65    68      36    36    36
Org         75    71    73      31    47    38
Location    73    78    76      65     7    12
Date        83   100    92      42    58    49
Total       76    79    77.5    45    41.7  43
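The slide does not say how F was computed. Assuming the usual balanced F-measure, F = 2PR / (P + R), the Person and Total rows can be reproduced as below; a few other rows differ by about a point, which would be consistent with P and R having been rounded before display.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Cebuano rows from the table, values in percent.
print(round(f_measure(71, 65)))      # Person: 68
print(round(f_measure(76, 79), 1))   # Total: 77.5
```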
Conclusion
  GATE: a Unicode-based NLP infrastructure, particularly suitable for multilingual adaptation of IE systems
  Requires little involvement of native speakers and very little annotated data for a basic job
  Future work
    Improving multilingual support, e.g., morphology support, automatic language and encoding identification
    Learning gazetteer lists from annotated corpora