logic programming for natural language processing

Logic Programming for Natural Language Processing Menyoung Lee TJHSST Computer Systems Lab Mentor: Matt Parker Analytic Services, Inc.


Post on 06-Jan-2016



Page 1: Logic Programming for Natural Language Processing

Logic Programming for Natural Language Processing

Menyoung Lee
TJHSST Computer Systems Lab

Mentor: Matt Parker
Analytic Services, Inc.

Page 2: Logic Programming for Natural Language Processing

Purpose

To link together
  - Recent developments in natural language processing (NLP): Information Extraction (IE)
  - Classical logic programming: Prolog
New Paradigm: bifurcated process
  - An IE application which will produce structured output from a corpus of free, unstructured text.
  - Transformation of extracted information into a Prolog knowledge-base (sets of fact-triples)
Documents: biographies

Page 3: Logic Programming for Natural Language Processing

Why NLP?

Language is the cornerstone of intelligence
  - The Turing Test: the ability to converse like a human
Understanding and generating texts in a natural language, e.g. English
Many specific NLP tasks
  - Chatterbots, e.g. Eliza
  - Machine Translation
  - Information Retrieval (IR), e.g. Google
  - Information Extraction!!
SciFi Dreams: universal translation, computers you can talk to, etc.

Page 4: Logic Programming for Natural Language Processing

Information Extraction (IE)

Most generally, the transformation of
  - Information contained in free, unstructured text in a natural language into
  - A prescribed, structured format.
More specifically, the identification of
  - Instances of certain object classes
  - Their attributes
  - Relationships between object instances
Always restricted to a particular domain
  - In order to have a reasonably sized and sufficiently expressive ontology

Page 5: Logic Programming for Natural Language Processing

Why IE?

An expert must read many documents
Advent of the Internet & Information Age
  - Explosion of the sheer volume of textual information, readily available in electronic form
  - New opportunity: lots and lots of available information to exploit
  - Formidable challenge: impossible for an expert to read and analyze that much text.
A pragmatic approach:
  - Full text understanding is out of reach
  - Automate just some of the tasks, i.e. the identification of objects, attributes, and relations

Page 6: Logic Programming for Natural Language Processing

IE - Details

Five Tasks in IE
  - Named Entity Recognition (NE)
  - Coreference Resolution (CO)
  - Template Element Construction (TE)
  - Template Relation Construction (TR)
  - Scenario Template Production (ST)
Metrics for Evaluation
  - Precision: $P = \frac{\text{correct answers produced}}{\text{total answers produced}}$
  - Recall: $R = \frac{\text{correct answers produced}}{\text{total correct answers}}$
  - F-measure (borrowed from IR): $F = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
  - More intuitive reformulation: $F^{-1} = \frac{\beta^2}{\beta^2 + 1} R^{-1} + \frac{1}{\beta^2 + 1} P^{-1}$
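The two forms of the F-measure can be checked against each other numerically. The following is a minimal Python sketch, not code from the project; the function names `f_measure` and `f_measure_harmonic` are made up for illustration.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F = (beta^2 + 1) P R / (beta^2 P + R); returns 0.0 if P or R is 0."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


def f_measure_harmonic(precision: float, recall: float, beta: float = 1.0) -> float:
    """Same quantity via the reciprocal form: a weighted harmonic mean."""
    b2 = beta * beta
    # 1/F = beta^2/(beta^2+1) * 1/R  +  1/(beta^2+1) * 1/P
    inverse = (b2 / (b2 + 1.0)) / recall + (1.0 / (b2 + 1.0)) / precision
    return 1.0 / inverse


if __name__ == "__main__":
    p, r = 1.0, 0.9474  # precision and recall reported for Fermat.html
    for beta in (1.0, 2.0):
        print(beta, f_measure(p, r, beta), f_measure_harmonic(p, r, beta))
```

For the Fermat.html figures reported later (P = 1, R = 0.9474), this function returns about 0.973 with beta = 1 and about 0.957 with beta = 2. Larger beta weights recall more heavily.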

Page 7: Logic Programming for Natural Language Processing

Annotations

Annotations identify objects in text
Annotation graph: a directed, acyclic graph (DAG)
  - Nodes
    - position in the text
  - Edges
    - The literal text
    - Annotations
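One way to picture the annotation graph is as a list of labelled edges between character offsets. This is a minimal Python sketch under that assumption; the class names `Edge` and `AnnotationGraph` are hypothetical and are not GATE's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    start: int  # node: character offset where the span begins
    end: int    # node: character offset where the span ends (exclusive)
    label: str  # annotation type carried by this edge


@dataclass
class AnnotationGraph:
    text: str
    edges: list = field(default_factory=list)

    def annotate(self, start: int, end: int, label: str) -> None:
        # Every edge points forward in the text, so the graph stays acyclic.
        if not (0 <= start < end <= len(self.text)):
            raise ValueError("span must run forward inside the text")
        self.edges.append(Edge(start, end, label))

    def nodes(self) -> list:
        """Sorted offsets that appear as an endpoint of some edge."""
        return sorted({e.start for e in self.edges} | {e.end for e in self.edges})

    def spans(self, label: str) -> list:
        """Literal text covered by each edge with the given label."""
        return [self.text[e.start:e.end] for e in self.edges if e.label == label]


if __name__ == "__main__":
    graph = AnnotationGraph("Galois wrote to Cauchy.")
    graph.annotate(0, 6, "Mathematician")
    graph.annotate(16, 22, "Mathematician")
    print(graph.nodes(), graph.spans("Mathematician"))
```

Because the text is stored once and edges only reference offsets, several annotations can overlap the same span without copying it.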

Page 8: Logic Programming for Natural Language Processing

Frames

Frame: representation of an object, consisting of slots, which contain values
  - Typical Prolog fact: Frame(Slot, Value).
  - We propose to synthesize it with the idea of annotations: Doc(Annot, Text).
  - Main idea: represent the document directly as an object: compromise between text and knowledge
Several Advantages
  - A corpus of multiple related documents
  - Direct link between information and its source
  - Opens the door for the application of Prolog's logic.
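To make the Doc(Annot, Text) shape concrete, here is a small Python sketch that renders (annotation, text) pairs as Prolog fact text. It is an illustration only: the helper names `quote_atom` and `frame_facts` are hypothetical, and the quoting assumes ISO-style quoted atoms (single quotes doubled, backslashes escaped).

```python
def quote_atom(text: str) -> str:
    """Render text as a quoted Prolog atom (assumes ISO-style escapes)."""
    escaped = text.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def frame_facts(doc_name: str, annotations):
    """Yield one fact Doc(Annot, Text). per distinct (annotation, text) pair."""
    seen = set()
    for annot, text in annotations:
        if (annot, text) in seen:
            continue  # a repeated mention adds no new fact
        seen.add((annot, text))
        yield f"{quote_atom(doc_name)}({quote_atom(annot)}, {quote_atom(text)})."


if __name__ == "__main__":
    pairs = [
        ("Mathematician", "Abel"),
        ("Mathematician", "Evariste Galois"),
        ("Mathematician", "Abel"),
    ]
    for fact in frame_facts("Galois.html.xml", pairs):
        print(fact)
```

Quoting matters here because names such as Galois.html.xml and Mathematician would otherwise be read as something other than plain atoms (an uppercase initial denotes a variable in Prolog).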

Page 9: Logic Programming for Natural Language Processing

Design

The IE application
  - Input: corpus of free, unstructured text
  - Output: the annotated documents, represented as annotation graphs
  - How: use GATE (language: JAPE)
The Prolog application
  - Input: the annotated document
  - Output: a frame, i.e. a set of Prolog facts.
  - How: use XSB (language: Prolog)

Page 10: Logic Programming for Natural Language Processing

General Architecture for Text Engineering (GATE)

A comprehensive architecture for development of NLP applications
  - Documents treated as an annotation graph
Java Annotation Patterns Engine (JAPE)
  - Its own language for writing grammars that identify instances of object classes to annotate
A Nearly New Information Extraction (ANNIE) system
  - An already implemented rudimentary IE system, that can be extended through addition of
    - JAPE grammars for annotating
    - Machine-learning models for annotating

Page 11: Logic Programming for Natural Language Processing

GATE

Page 12: Logic Programming for Natural Language Processing

Procedures

Obtain the corpus – Python script
Write the JAPE grammars
  - annotations 'Mathematician', 'Father'.
Train a model
  - annotation 'Protagonist'
Write the Prolog application to
  - Parse GATE's XML output into a structure
  - Construct the annotation graph from it
  - Process the annotations into a document frame
  - Output the document frame
Test by posing queries
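The parsing step can be sketched as follows. The project does this in Prolog; this Python version only illustrates the idea. The XML layout in `SAMPLE` is an assumed, simplified version of a GATE-style saved document (text interleaved with `Node` elements, annotations pointing at node ids), and the sentence in it is invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified GATE-style document; the sentence is made up.
SAMPLE = """<GateDocument>
<TextWithNodes><Node id="0"/>Galois<Node id="6"/> wrote to <Node id="16"/>Cauchy<Node id="22"/>.</TextWithNodes>
<AnnotationSet>
<Annotation Id="1" Type="Mathematician" StartNode="0" EndNode="6"/>
<Annotation Id="2" Type="Mathematician" StartNode="16" EndNode="22"/>
</AnnotationSet>
</GateDocument>"""


def parse_annotations(xml_text: str) -> list:
    """Return (annotation type, covered text) pairs in document order."""
    root = ET.fromstring(xml_text)
    body = root.find("TextWithNodes")

    # Rebuild the plain text and record the character offset of each node.
    pieces = [body.text or ""]
    position = len(pieces[0])
    offset_of = {}
    for node in body.findall("Node"):
        offset_of[node.get("id")] = position
        tail = node.tail or ""  # text between this node and the next one
        pieces.append(tail)
        position += len(tail)
    text = "".join(pieces)

    # Each annotation is an edge between two nodes, labelled with its type.
    pairs = []
    for annot in root.iter("Annotation"):
        start = offset_of[annot.get("StartNode")]
        end = offset_of[annot.get("EndNode")]
        pairs.append((annot.get("Type"), text[start:end]))
    return pairs


if __name__ == "__main__":
    print(parse_annotations(SAMPLE))
```

The resulting pairs are exactly the (Annot, Text) arguments needed for the document frame, so the remaining steps reduce to writing them out as facts.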

Page 13: Logic Programming for Natural Language Processing

IE Result: Fermat.html

Precision: 1. (why so high?)
  - use of a gazetteer list
  - aggressive pruning by context
Recall: 0.9474
  - paid for aggressive pruning, missed some
F-measure (β = 2): 0.973

Page 14: Logic Programming for Natural Language Processing

Prolog Result

Correctly constructs facts. Sample session:

| ?- 'Galois.html.xml'('Mathematician', X).
X = Abel;
X = Cauchy;
X = Evariste Galois;
X = Fourier;
X = Galois;
X = Gauss;
X = Gergonne;
X = Jacobi;
X = Lagrange;
X = Legendre;
X = Libri;
X = Liouville;
X = Poisson;
X = Vernier

Page 15: Logic Programming for Natural Language Processing

Results

The Prolog layer is universal, cross-domain
  - The IE application may produce any annotation, not restricted to one subject area
Bifurcation: success
  - Opens door to logic and rules, esp. for cross-document relations

| ?- 'Galois.html.xml'('Mathematician', X), 'Cauchy.html.xml'('Protagonist', X).

X = Cauchy;

no
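The cross-document query above is a conjunction with a shared variable, which amounts to intersecting two sets of values. A rough Python analogue follows, assuming a hypothetical in-memory set of triples in place of the Prolog knowledge base; `FACTS` holds only a small subset of the facts shown in the sample sessions.

```python
# Hypothetical stand-in for the knowledge base: one triple per fact
# Doc(Annot, Text). Only a few facts from the sample sessions are listed.
FACTS = {
    ("Galois.html.xml", "Mathematician", "Abel"),
    ("Galois.html.xml", "Mathematician", "Cauchy"),
    ("Galois.html.xml", "Mathematician", "Gauss"),
    ("Cauchy.html.xml", "Protagonist", "Cauchy"),
}


def values(facts, doc: str, annot: str) -> set:
    """All X such that Doc(Annot, X) is in the fact set."""
    return {text for d, a, text in facts if d == doc and a == annot}


def shared_values(facts, first, second) -> set:
    """Values of X satisfying both goals, like a two-goal conjunction on X."""
    (doc1, annot1), (doc2, annot2) = first, second
    return values(facts, doc1, annot1) & values(facts, doc2, annot2)


if __name__ == "__main__":
    print(shared_values(
        FACTS,
        ("Galois.html.xml", "Mathematician"),
        ("Cauchy.html.xml", "Protagonist"),
    ))
```

Prolog reaches the same answer by unification and backtracking rather than by building both sets first, which is what lets the same facts serve arbitrary rules later.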

Page 16: Logic Programming for Natural Language Processing

Conclusion

With the recent advancements in computing power, logic programming is finally feasible for practical use
  - I ran my Prolog application on the server robustus, giving it 2 GB of memory
However, computing power continues to be a limitation (GATE crashed every day)
Where do we go from here?
  - More expressive document frame
  - Context analysis (through proximity, etc.)
  - Better IE applications through statistical processing