GATE, a General Architecture for Text Engineering

http://gate.ac.uk/

Hamish Cunningham, Kalina Bontcheva

Department of Computer Science, University of Sheffield

University of Leeds, February 20th 2003

Structure of the talk:
• Introduction: Software Architecture and GATE
• Examples: KT and HLT; indexing football
• Ragbag of features and colourful pictures
• Demo

Motivation for Software Infrastructure for Language Engineering

• Need for scalable, reusable, and portable HLT solutions

• Support for large data, in multiple media, languages, formats, and locations

• Lowering the cost of creation of new language processing components

• Promoting quantitative evaluation metrics via tools and a level playing field

Motivation (II): software lifecycle in collaborative research

Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.

Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.

Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.

Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").

Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).

GATE, a General Architecture for Text Engineering

• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Some free components... and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/

Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)
• (Almost) everything is a component, and component sets are user-extendable

Component-based development
• An OO way of chunking software: Java Beans
• GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

GATE Language Resources

GATE LRs are documents, ontologies, corpora, lexicons, ...

Documents / corpora:
• GATE documents loaded from local files or the web...
• Diverse document formats: text, HTML, XML, email, RTF, SGML.

Processing Resources

Algorithmic components known as PRs – beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE
• e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
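The idea of PRs as components with an execute method, kept separate from their data, can be sketched as follows. This is a minimal illustration in Python, not GATE's Java API: the class names (Annotation, Document, Tokeniser, Gazetteer) and the run_pipeline helper are hypothetical, and the whitespace tokeniser is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text with a feature map."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Hypothetical stand-in for a language resource: text plus annotations."""
    text: str
    annotations: list = field(default_factory=list)


class Tokeniser:
    """Processing resource: adds one Token annotation per whitespace-separated word."""

    def execute(self, doc):
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            end = start + len(word)
            doc.annotations.append(Annotation("Token", start, end, {"string": word}))
            pos = end


class Gazetteer:
    """Processing resource: adds a Lookup for each Token found in a word list.

    The word list is data passed in at construction time, so the same code
    can be repurposed by swapping the list.
    """

    def __init__(self, entries):
        self.entries = entries  # maps word -> kind

    def execute(self, doc):
        tokens = [a for a in doc.annotations if a.type == "Token"]
        for tok in tokens:
            kind = self.entries.get(tok.features["string"])
            if kind is not None:
                doc.annotations.append(
                    Annotation("Lookup", tok.start, tok.end, {"kind": kind})
                )


def run_pipeline(resources, doc):
    """Run each processing resource over the document in order."""
    for pr in resources:
        pr.execute(doc)
    return doc


doc = run_pipeline(
    [Tokeniser(), Gazetteer({"Ltd.": "companyDesignator"})],
    Document("Sirma AI Ltd. builds tools"),
)
```

Each component only reads and writes annotations on the shared document, which is what lets components be reordered or replaced independently.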

Visual Resources

Applications

GATE has been used for a variety of applications, including:

• MUMIS: automatic creation of semantic indexes for multimedia programme material

• MUSE: a multi-genre IE system

• EMILLE: a 70 million word corpus of Indic languages

• Metadata for Medline (at Merck)

• ACE: participation in the Automatic Content Extraction programme

• HSE: summarisation of health and safety information from company reports

• OldBaileyIE: NE recognition on 17th century Old Bailey Court reports.

• AKT: language technology in knowledge management

• AMITIES: call centre automation

• Various Medical Informatics and database technology projects

• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year)

Some users…

At time of writing a representative fraction of GATE users includes:
• Longman Pearson publishing, UK;
• Merck KGaA, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN;
• Sirma AI Ltd., Bulgaria;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities;
• the Perseus Digital Library project, Tufts University, US.

Example 1: the Knowledge Economy and Human Language

Gartner, December 2002:

• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications

• through 2012 more than 95% of human-to-computer information input will involve textual language

A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language

The challenge: to reconcile these two opposing tendencies

Closing the Language Loop (1)

[Diagram labels: Human Language; Formal Knowledge (ontologies and instance bases); (A)IE; OIE; CLIE; Controlled Language; (M)NLG; Semantic Web; Semantic Grid; Semantic Web Services]

Closing the Language Loop (2)

• Information Extraction (IE): from NL to formal data
• Adaptive IE: learning by example
• Ontology-based IE: annotate to user-supplied ontology
• Controlled-Language IE: simplify the interface
• (Multilingual) Natural Language Generation: documentation

Cross-cutting issues:
• Content Extraction vs. Information Extraction
• Scaling and robustness - cf. MUSE project
• Hybrid learning and knowledge-based systems

Building IE Components in GATE (1)

The ANNIE system – a reusable and easily extendable set of components

Building IE Components in GATE (2)

JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = { kind = company, rule = "Company1" }
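To make the Company1 rule concrete, here is a small Python emulation of what it matches: one or more upper-initial Tokens followed directly by a company-designator Lookup. This is only a sketch and not how JAPE is implemented. It assumes a flat sequence of annotations represented as dictionaries, reports spans as list indices rather than character offsets, and the function name match_company1 is hypothetical.

```python
def is_upper_initial_token(ann):
    """True for a Token whose orthography feature is 'upperInitial'."""
    return ann["type"] == "Token" and ann.get("orthography") == "upperInitial"


def is_company_designator(ann):
    """True for a Lookup whose kind feature is 'companyDesignator'."""
    return ann["type"] == "Lookup" and ann.get("kind") == "companyDesignator"


def match_company1(annotations):
    """Emulate the left-hand side of Company1 over a flat annotation sequence.

    Returns one NamedEntity dict per match; 'start' is inclusive and 'end'
    is exclusive, both being indices into the input list.
    """
    matches = []
    i, n = 0, len(annotations)
    while i < n:
        # Consume the longest run of upper-initial Tokens starting at i.
        j = i
        while j < n and is_upper_initial_token(annotations[j]):
            j += 1
        # The run must be non-empty and followed directly by the designator.
        if j > i and j < n and is_company_designator(annotations[j]):
            matches.append({
                "type": "NamedEntity",
                "start": i,
                "end": j + 1,
                "features": {"kind": "company", "rule": "Company1"},
            })
            i = j + 1
        else:
            i += 1
    return matches


# Illustrative input for the phrase "shares in Canon Europe Ltd. rose".
sequence = [
    {"type": "Token", "string": "shares", "orthography": "lowercase"},
    {"type": "Token", "string": "in", "orthography": "lowercase"},
    {"type": "Token", "string": "Canon", "orthography": "upperInitial"},
    {"type": "Token", "string": "Europe", "orthography": "upperInitial"},
    {"type": "Lookup", "string": "Ltd.", "kind": "companyDesignator"},
    {"type": "Token", "string": "rose", "orthography": "lowercase"},
]
entities = match_company1(sequence)
```

Here the single match covers "Canon Europe Ltd." (indices 2 to 4), mirroring the span that the rule binds to companyMatch.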

Populating Ontologies with IE

Protégé and Ontology Management

Example 2: the MUMIS project

• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, information extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA

The Whole Picture

[Diagram labels: Sources (EN, DE, NL); IE; Speech Signals; ASR; Transcriptions; Formal Text; Merging; Final Annotations; Multimedia Data Base; Video & Audio Signal; User Interface; Query; Results; Ontology & Lexicon]

User Interface

Play

Developing MUMIS Components with GATE

[Diagram labels: GATE; Format Handlers (HTML docs, RTF docs, XML docs); ANNIE (POS tagger, Named entity, Coreference, ...); Custom application 1 (Named entity, Event extraction, ...); Document content; Document metadata; Document format data; Linguistic data; File storage; Relational Database (Oracle/PostgreSQL)]

Ragbag 1: Performance Evaluation

• At document level – annotation diff

• At corpus level – corpus benchmark tool – tracking system’s performance over time
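Document-level evaluation of this kind compares a system's annotations against a human-annotated key. The sketch below shows one common way to score such a comparison, using precision, recall and F1 under strict matching (same type and same offsets). It is an assumption-laden illustration rather than GATE's annotation diff tool: the annotation_diff function and the tuple representation are hypothetical, and partial overlaps are simply counted as errors.

```python
def annotation_diff(key, response):
    """Compare response annotations against key annotations.

    Both arguments are iterables of (type, start, end) tuples. A response
    annotation counts as correct only if an identical tuple is in the key.
    """
    key_set, response_set = set(key), set(response)
    correct = key_set & response_set
    precision = len(correct) / len(response_set) if response_set else 0.0
    recall = len(correct) / len(key_set) if key_set else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "correct": sorted(correct),
        "missing": sorted(key_set - response_set),    # in key, not found
        "spurious": sorted(response_set - key_set),   # found, not in key
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative data: the system gets one boundary wrong and adds one extra.
key = [("Person", 0, 16), ("Organization", 20, 28), ("Location", 40, 46)]
response = [
    ("Person", 0, 16),
    ("Organization", 20, 30),
    ("Location", 40, 46),
    ("Date", 50, 54),
]
result = annotation_diff(key, response)
```

Running the same comparison over every document in a corpus and storing the scores per release is one simple way to track a system's performance over time.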

Ragbag 2: Regression Testing – Corpus Benchmark Tool

Ragbag 3: Information Retrieval

Based on the Lucene IR engine

Ragbag 4: Editing Multilingual Data

GATE Unicode Kit (GUK)

Java provides no special support for text input (this may change)

• Support for defining additional Input Methods (IMs)

• Currently 30 IMs for 17 languages

• Pluggable in other applications

Ragbag 5: Processing Multilingual Data

All the visualisation and editing tools for ML LRs use enhanced Java facilities.

Ragbag 6: Dialogue Systems

• GATE is being used in the Amities project for automating call centres
• Creation of dialogue processing server components to run in the Galaxy Communicator architecture
• Easy adaptation of the portable IE components to work on noisy ASR output
• Robustness and speed of GATE components vital for real-time dialogue systems

Conclusion

GATE: an infrastructure that lowers the overhead of creating & embedding robust NLP components

Further information: http://gate.ac.uk/

• Online demos, tutorials and documentation
• Software downloads
• Talks and papers