Izmit, June 12, 2014

Natural Language Technologies at the Faculty of Mathematics and Computer Science of the Adam Mickiewicz University in Poznań

Adam Mickiewicz University in Poznań

Dept of Computer Linguistics and Artificial Intelligence

[email protected]

Zygmunt Vetulani


Natural language technologies have been developed at UAM in Poznań for many years. Some of these activities started in the 1970s and led to significant achievements, e.g. in speech synthesis (text-to-speech; M. Steffen-Batóg).

Systematic NL-related activities at the Faculty of Mathematics and Computer Science are more recent. They started after Vetulani's research visit to the Artificial Intelligence Group headed by Alain Colmerauer at the University Aix-Marseille II (1984). (Individual research started in the 1980s.)

Recent work

Building on the know-how and technologies obtained so far, in 2006 we started a large project, POLINT-112-SMS, which integrates several NL technologies. The project was funded by the Polish Government from 2006 to 2010 within a larger programme, "Text processing technologies for public security purposes" (Grant MNiSzW nr R00 028 02), and is being continued (Zygmunt Vetulani). The main focus now is the development of PolNet (a Polish wordnet).


Recent research

POLINT-112-SMS Team

Department of Computer Linguistics and Artificial Intelligence

PL-61614 Poznań, ul. Umultowska 87

tel. +48-61-8295380, fax +48-61-8295315

Head of Department: Prof. dr hab. Zygmunt Vetulani

http://main/amu.edu.pl/[email protected]

Department members and close collaborators of the Polint-112-SMS project in 2010


POLINT-112-SMS is intended to collect and process information reported to the system by human operators in natural language (text). The objective of the processing is to produce, in real time, summary reports describing a dynamically evolving situation.

Applications of various kinds may be built on the basis of such a system:

- monitoring natural disasters (earthquakes, floods, forest fires, volcanic eruptions)

- monitoring the crowd at mass events (e.g. "high-risk" football matches)

- supporting human explorers (groups) in difficult, unknown environments

- various possible military use cases

Processing tasks:

- visualisation

- decision support


The main technology of the project is language understanding.

As a case study we proposed the monitoring of a soccer stadium during a match attended by a large number of supporters. Such situations usually generate a number of risks. High-risk matches are typically covered by video and human monitoring (often at 100% coverage).

The main functionality of POLINT-112-SMS is to cooperate interactively with humans and to assist in decision making in critical situations that require immediate action.

Decisions must be taken on the basis of an analysis of the current situation. The system compiles a representation of the current situation on-line from information elements and processes it to obtain the elements that support decisions.

Still, the stadium security experts we consulted consider that video monitoring is usually insufficient and that on-site human supervision is necessary. There remains the problem of how to ensure communication and how to fully interpret the messages sent by the informers.


The blue arrows represent pieces of information sent directly to the CCM, while the red arrows stand for messages exchanged between the informers and the computer. The blue messages may also be seen by the computer. The system is interactive: it may take control of the dialogue and address questions and messages to the informers. It is also interactive with respect to the CCM: it informs the CCM and may receive questions from it.

The current situation is monitored on-line by humans. The observers (informers) are expected to supply information.


The POLINT-112-SMS system is a proposed technological solution.

As its communication mode it offers short text messages (similar to ordinary SMS messages) sent from mobile phones.

This communication mode avoids the use of voice, which is of little use in a noisy environment and which may expose the police informers (something to be avoided where possible). Messages are sent by the informers to the machine through the SMS gate and are then processed by the system. The name POLINT-112-SMS refers to:

- the family of versions of the POLINT system developed so far,

- the 112 emergency number (services such as 112 may be supported by systems such as POLINT-112-SMS),

- the SMS technology, which is cheap and popular.


Messages are exchanged in natural (human) language (Polish). This means that the system must be endowed with linguistic competence as well as communicative competence.

The prototype has been tested by public security experts in both simulated and real-life situations of a football match at the city stadium in Poznań. Test messages (SMS) were exchanged using public cellular phones.


POLINT-112-SMS system architecture

In red: modules using natural language technologies.


a) SMS gate (capturing text)

b) Natural Language Processing module
   - understanding
   - generation

c) Dialogue Maintenance Module

d) Situation Analysis Module
   - disambiguation
   - reasoning
   - information search / query answering

e) Temporal analysis module

f) Knowledge processing module

g) Ontology (PolNet)

h) Knowledge Bases
   - about events
   - about acts of communication

i) CCM terminal (admin)
   - visualisation
   - administration
   - capturing and displaying text


The POLINT-112-SMS system is a product of man-machine communication technology, an AI technology that uses, in particular, various (lower-level) natural language technologies.

Man-machine communication requires the implementation of appropriate man-machine interfaces. In POLINT-112-SMS we use NL text interfaces dedicated to two kinds of users: information suppliers (informers) and target beneficiaries (CCM staff).

The informers' messages, queries and answers enter the system from mobile phones through the SMS gate. The other input-output device is the terminal at the CCM. It receives and outputs text and displays the images received from the visualisation submodule. It can also display the past dialogue in the form of structured text.


The POLINT-112-SMS system is a product of man-machine communication technology (which is an AI technology).

Among the lower-level language technologies involved, the highest-level one is understanding. The NLP and Dialogue Maintenance modules both contribute to understanding.

The understanding software takes an element of the text and interprets it, i.e. it computes a representation of that element, which is then submitted to further processing.

Typically, the procedure of understanding a question produces a formal object which triggers a procedure responsible for finding the answer.
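The step from a question to a formal object to an answer-finding procedure can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Query` class, the single toy question pattern and the invented fact base are hypothetical and do not reproduce the representation used in POLINT-112-SMS (which works on Polish).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Query:
    """Hypothetical formal object produced by understanding a question."""
    predicate: str  # what is asked about, e.g. "location"
    subject: str    # the entity the question concerns


# Invented toy fact base: (predicate, subject) -> value.
FACTS = {
    ("location", "fan group"): "sector 5",
    ("location", "patrol"): "gate 2",
}


def understand(question: str) -> Optional[Query]:
    """Interpret a question: map its text to a formal representation."""
    match = re.fullmatch(r"where is the (.+)\?", question.strip().lower())
    if match is None:
        return None  # outside this toy grammar
    return Query(predicate="location", subject=match.group(1))


def find_answer(query: Query) -> str:
    """Answer-finding procedure triggered by the formal object."""
    return FACTS.get((query.predicate, query.subject), "unknown")


query = understand("Where is the fan group?")
answer = find_answer(query) if query else "not understood"
```

The point of the separation is that `find_answer` never sees the original text: it works only on the formal object, so the understanding step can be replaced without touching the answer-finding step.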


The technologies that contribute to comprehension are parsing and discourse analysis.

Parsing in POLINT-112-SMS is executed by PROLOG (which may be regarded as an expert system shell): its interpreter can be used as a parser for a suitably formalized grammar (e.g. a context-free grammar (CFG) or a definite clause grammar (DCG)).

The main drawback: PROLOG may be inefficient.

Our solution: heuristic parsing, in which the main module (expensive when backtracking) is preceded by a cheap pre-analysis that simplifies the input and generates heuristics whose role is to control the execution of parsing (reducing nondeterminism). By "heuristics" we mean a procedure which guides the parser to make correct choices.
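The effect of a cheap pre-analysis on a backtracking parser can be sketched in Python. This is only an illustration under stated assumptions: the toy English grammar and lexicon are invented, the parser is a plain recogniser rather than the PROLOG/DCG parser of POLINT-112-SMS, and the heuristic shown (discarding rules that need a word category absent from the input) is one hypothetical kind of heuristic, not the project's actual one.

```python
# Toy grammar: nonterminal -> alternative right-hand sides (invented example).
GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("det", "noun"), ("noun",), ("det", "adj", "noun")],
    "VP": [("verb", "NP"), ("verb",), ("verb", "NP", "PP")],
    "PP": [("prep", "NP")],
}
LEXICON = {"the": "det", "fans": "noun", "gate": "noun",
           "block": "verb", "north": "adj", "at": "prep"}


def pre_analysis(tags):
    """Cheap pass: keep only rules whose terminal categories occur in the input."""
    present = set(tags)
    return {(lhs, rule) for lhs, rules in GRAMMAR.items() for rule in rules
            if all(sym in GRAMMAR or sym in present for sym in rule)}


def expand(symbols, tags, pos, allowed, stats):
    """Yield every end position reachable by matching `symbols` from `pos`."""
    if not symbols:
        yield pos
        return
    for mid in parse(symbols[0], tags, pos, allowed, stats):
        yield from expand(symbols[1:], tags, mid, allowed, stats)


def parse(symbol, tags, pos, allowed, stats):
    """Backtracking recogniser; `allowed` (if given) restricts the rule choices."""
    if symbol not in GRAMMAR:  # terminal category
        if pos < len(tags) and tags[pos] == symbol:
            yield pos + 1
        return
    for rule in GRAMMAR[symbol]:
        if allowed is not None and (symbol, rule) not in allowed:
            continue  # choice ruled out by the heuristic
        stats["tried"] += 1
        yield from expand(rule, tags, pos, allowed, stats)


def recognise(tokens, use_heuristics):
    """Return (accepted, number of rule expansions tried) over the full search."""
    tags = [LEXICON[tok] for tok in tokens]
    allowed = pre_analysis(tags) if use_heuristics else None
    stats = {"tried": 0}
    ends = list(parse("S", tags, 0, allowed, stats))
    return len(tags) in ends, stats["tried"]


sentence = "the fans block the gate".split()
plain = recognise(sentence, use_heuristics=False)
guided = recognise(sentence, use_heuristics=True)
```

Both runs accept the sentence, but the guided run tries fewer rule expansions because alternatives that cannot succeed are never entered.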


Correct and efficient parsing requires several lower-level technologies to perform:

segmentation (into sentences and words), lemmatisation, spell checking, simplification, disambiguation, named-entity recognition.

In many cases, complete understanding is not possible on the basis of syntactic analysis alone (some context has to be taken into consideration).
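Three of the lower-level steps listed above (segmentation, spell checking, lemmatisation) can be chained as in the following minimal Python sketch. The English toy lexicon is invented for illustration, and the use of `difflib.get_close_matches` for spelling correction is an assumption of this sketch; the source does not say how these steps are implemented in POLINT-112-SMS.

```python
import re
from difflib import get_close_matches

# Invented toy lexicon mapping word forms to lemmas (the real system handles Polish).
LEMMAS = {"fans": "fan", "are": "be", "throwing": "throw",
          "bottles": "bottle", "police": "police", "arrived": "arrive"}
VOCABULARY = list(LEMMAS)


def segment(text):
    """Split text into sentences, then each sentence into lower-case word tokens."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [re.findall(r"\w+", s.lower()) for s in sentences]


def spell_check(token):
    """Replace an unknown token by the closest known form, if one is close enough."""
    if token in LEMMAS:
        return token
    close = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return close[0] if close else token


def lemmatise(token):
    """Map a word form to its lemma; unknown forms are returned unchanged."""
    return LEMMAS.get(token, token)


def preprocess(text):
    """Segmentation, then spell checking and lemmatisation of every token."""
    return [[lemmatise(spell_check(tok)) for tok in sent] for sent in segment(text)]


result = preprocess("Fans are throwng bottles. Police arrived!")
```

The misspelt "throwng" is corrected to "throwing" before lemmatisation, so the parser downstream would only ever see known lemmas.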


The role of discourse analysis is to produce a description of the organization of the discourse. It makes it possible to disambiguate those elements which remain ambiguous at the end of syntactic (compositional) analysis. In particular, knowledge of the discourse structure is necessary for anaphora resolution and therefore contributes to discourse understanding.

In POLINT-112-SMS, discourse analysis is performed by the dialogue maintenance module (detection of co-references, anaphora resolution).
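A very simple form of anaphora resolution can be sketched in Python as follows. The strategy shown (pick the most recent earlier mention that agrees in grammatical number) is a textbook simplification chosen for this sketch, not the method of the dialogue maintenance module; the `Mention` class and the example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical record of an entity mentioned earlier in the dialogue."""
    text: str
    number: str  # "sg" or "pl"


# Grammatical number of a few English pronouns (toy data).
PRONOUNS = {"he": "sg", "she": "sg", "it": "sg", "they": "pl"}


def resolve(pronoun, history):
    """Return the most recent earlier mention agreeing in number, or None."""
    number = PRONOUNS[pronoun]
    for mention in reversed(history):
        if mention.number == number:
            return mention
    return None


# Message 1: "The fans are approaching the gate."
history = [Mention("the fans", "pl"), Mention("the gate", "sg")]

# Message 2: "They are carrying flares."
antecedent = resolve("they", history)
```

Without the discourse history the pronoun in the second message cannot be interpreted at all, which is why syntactic analysis of a single message is not sufficient.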

 


Parsing, as well as generation, depends on basic resources: grammars and dictionaries.

The POLINT grammars were integrated and directly applied in the project.

These grammars were developed for successive versions of the POLINT question-answering systems produced since the 1990s. They are formally equivalent to definite clause grammars (DCGs), which translate directly into PROLOG. They were then adapted so that they can be controlled by heuristics in order to minimize the nondeterminism of parsing. As a result, the heuristics make parsing run in practically linear time.

POLINT dictionaries are of the lexicon-grammar kind.


An ontology: PolNet

Understanding, but also reasoning, may benefit from lexical databases of the wordnet type.

Reasoning is essential for POLINT-112-SMS because we expect it to have the characteristics of an expert system serving as a decision aid to a human agent. For that purpose a precise representation of the real situation must be generated (and visualized). This is a knowledge engineering task.

Ontologies, which make it possible to systematize knowledge about individuals and classes (sets) of individuals using attributes, relations and associations, are useful knowledge engineering tools for these purposes (cf. Linnaeus).


(Wikipédia, "Réseau sémantique")


Formal ontologies are considered a means of "formalization of conceptualization" (Gruber). Ontologies may support reasoning thanks to their mathematical structure, e.g. hierarchies of concepts, which make it possible to implement default reasoning and inheritance. We have chosen WordNet-like ontologies.

The term "WordNet" refers to the lexical database created at Princeton University (1985, George A. Miller), also known as the Princeton WordNet (PWN).
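How a hierarchy of concepts supports inheritance and default reasoning can be shown with a small Python sketch. The mini-ontology below is an invented textbook example (it is not taken from PolNet or the Princeton WordNet), and the dictionary layout is an assumption of this sketch.

```python
# Hypothetical mini-ontology: each concept names its parent and its own properties.
ONTOLOGY = {
    "entity":  {"parent": None,     "props": {}},
    "bird":    {"parent": "entity", "props": {"can_fly": True, "has_feathers": True}},
    "penguin": {"parent": "bird",   "props": {"can_fly": False}},  # overrides the default
    "sparrow": {"parent": "bird",   "props": {}},
}


def lookup(concept, prop):
    """Walk up the hierarchy; the most specific value found wins.

    Returns None when no concept on the path defines the property.
    """
    while concept is not None:
        node = ONTOLOGY[concept]
        if prop in node["props"]:
            return node["props"][prop]
        concept = node["parent"]
    return None


sparrow_flies = lookup("sparrow", "can_fly")    # inherited default from "bird"
penguin_flies = lookup("penguin", "can_fly")    # local exception to the default
```

A property stated once at a general concept holds for all more specific concepts unless one of them states an exception, which is the essence of default reasoning over a hierarchy.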


The main idea is simple: to gather synonyms into equivalence classes (which represent concepts) and to consider the relations holding between these classes.

In practice, this idea is difficult to implement because of word polysemy. One has to consider disambiguated words (more precisely, word + word-sense pairs) instead of words. The equivalence classes of synonymous disambiguated words are called synsets.
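The construction of synsets as equivalence classes of (word, sense) pairs can be sketched in Python. The synonymy judgements and sense numbers below are invented for illustration and are not PolNet or Princeton WordNet entries; grouping with a small union-find structure is a design choice of this sketch.

```python
from collections import defaultdict

# Invented toy data: pairs of (word, sense_number) items judged synonymous.
SYNONYM_PAIRS = [
    (("car", 1), ("automobile", 1)),
    (("automobile", 1), ("motorcar", 1)),
    (("car", 2), ("railcar", 1)),
]


def build_synsets(pairs):
    """Group word senses into equivalence classes using a small union-find."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for left, right in pairs:
        parent[find(left)] = find(right)

    classes = defaultdict(set)
    for sense in list(parent):
        classes[find(sense)].add(sense)
    return {frozenset(members) for members in classes.values()}


synsets = build_synsets(SYNONYM_PAIRS)
```

Because the items are word senses rather than bare words, the polysemous word "car" ends up in two different synsets. With bare words, the two classes would have been merged into one.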


Another major issue with wordnets is that the language of a given community reflects a conceptualization specific to that community. Therefore a wordnet created for that language will not necessarily be isomorphic to the Princeton WordNet.

For example, Polish does not make a distinction similar to the French distinction between the concepts of fleuve and rivière (this explains a typical mistake of Polish students of French, who tend to say "Seine est une rivière" instead of "Seine est un fleuve").


The problems mentioned above were at the origin of our decision to create our wordnet from scratch.

This decision resulted in PolNet (the full name of the project is "PolNet-Polish Wordnet").

PolNet (version v1.0) is distributed free of charge for non-commercial use under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.


The PolNet synsets are linked by relations. The main relations are hyponymy and hypernymy for noun synsets and semantic roles for verb synsets.
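The two kinds of relations can be illustrated together in a short Python sketch: hypernymy links between noun synsets, and semantic roles of a verb synset that constrain which noun synsets may fill them. The identifiers, roles and data are invented for illustration and are not PolNet content; the storage format is an assumption of this sketch.

```python
# Invented mini-wordnet: noun synset -> its direct hypernym.
HYPERNYM = {
    "supporter": "person",
    "policeman": "person",
    "person": "entity",
    "bottle": "object",
    "object": "entity",
}

# Invented verb synset with semantic roles and the noun class each role expects.
VERB_ROLES = {
    "throw": {"agent": "person", "object": "object"},
}


def hypernyms(synset):
    """Return the chain of hypernyms from `synset` up to the root."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain


def is_a(synset, ancestor):
    """True if `synset` is `ancestor` or one of its direct or indirect hyponyms."""
    return synset == ancestor or ancestor in hypernyms(synset)


def fits_roles(verb, fillers):
    """Check that every role filler belongs to the class the verb expects."""
    expected = VERB_ROLES[verb]
    return all(is_a(fillers[role], cls) for role, cls in expected.items())


plausible = fits_roles("throw", {"agent": "supporter", "object": "bottle"})
implausible = fits_roles("throw", {"agent": "bottle", "object": "supporter"})
```

A check of this kind is one way in which a wordnet can help understanding: a reading in which a bottle throws a supporter is rejected because "bottle" is not a hyponym of "person".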

The concepts to be represented in PolNet were selected on the basis of word frequencies observed in the National Corpus of Polish (IPI PAN; Przepiórkowski) and in small experimental corpora collected within the project.

Word meanings were identified manually on the basis of traditional dictionaries of Polish. Synsets were also created manually, by lexicographers assisted by specialised software, namely the DEBVisDic system developed at Masaryk University in Brno, Czech Republic (Pala, Rambousek).


The computer-aided manual processing supported by DEBVisDic allowed us to reach a quality unattainable in wordnets built entirely or mainly by automatic or statistical means. This quality, however, comes at the high cost of work and verification by human experts.

PolNet was evaluated at three levels: on-line manual evaluation at coding time, evaluation with the help of a software tool (WQuery, Kubis), and evaluation within the POLINT-112-SMS application.


Initially, PolNet (v.0....) contained only noun synsets composed of simple words. Now:

Nouns: about 11,700 synsets for 20,300 meanings (for 12,000 common nouns).

Verbs: in 2011 the verbal part of PolNet consisted of about 1,500 synsets (for 900 verbs). The verbal part is under development.

Work on including compound nouns (in particular verb-noun collocations) is also at an advanced stage.

For reference, we draw the reader's attention to the fact that the basic vocabulary sufficient for ordinary, everyday conversation has been estimated at 1,000-2,000 words. (According to Ogden /1930/, the size of Basic English is about 850 words.)



THANKS !
