discourse annotation for arabic 2

Post on 11-Jun-2015


1

Survey on Discourse Annotation for Arabic

A. Algarni, H. Alharbi and N. Almutairy
Supervisor: Dr. A. Alsaif

April 23, 2013

Kingdom of Saudi Arabia
Ministry of Higher Education

Imam Mohammed Ibn Saud Islamic University
College of Computer and Information Sciences

CS465 - Natural Language Processing

المملكة العربية السعودية
وزارة التعليم العالي

جامعة الإمام محمد بن سعود الإسلامية
كلية علوم الحاسب ونظم المعلومات

عال 465 – معالجة اللغات الطبيعية

Outline

• Introduction
• The Leeds Arabic Discourse Treebank
• Discourse Connective Recognition
• Discourse Relation Recognition
• Semantic-Based Segmentation
• Discourse Segmentation Based on Rhetorical Methods
• A Comprehensive Taxonomy of Arabic Discourse Coherence Relations

2

3

Introduction

Linguistic annotation covers any descriptive or analytic notations applied to raw language data.

Annotated discourse corpora facilitate theoretical studies and contribute to the development of NLP applications.

4

Applications

• Information extraction
• Question answering
• Summarization
• Machine translation, generation

5

Discourse Relations and Discourse Connectives

Discourse Relation: the way in which two arguments (text segments) are logically connected. Examples: Temporal, Comparison, Causal, Expansion, etc.

Discourse Connective (DC): a lexical marker used to link two abstract objects in a text.

Abstract Object (AO): abstract objects in discourse are things like propositions, events, facts and opinions.

Argument (Arg): a text span expressing an abstract object and linked by a DC.
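The definitions above can be pictured as a small record type. The following is only an illustrative sketch: the class name, the field names and the relation label are assumptions made for this example, and the sentence is the English gloss of the example used later in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscourseRelation:
    """One annotated relation: two arguments linked by a connective.

    Hypothetical structure for illustration; not the treebank's file format.
    """
    arg1: str        # text span expressing the first abstract object
    connective: str  # the discourse connective (DC)
    arg2: str        # text span expressing the second abstract object
    relation: str    # relation label, e.g. Temporal, Causal, Expansion


example = DiscourseRelation(
    arg1="Children might be tired",
    connective="and",
    arg2="feel sleepy",
    relation="Expansion",  # assumed label, chosen only for illustration
)

print(f"[{example.arg1}]Arg1 [{example.connective}]DC [{example.arg2}]Arg2")
```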

6

The Leeds Arabic Discourse Treebank

• The first effort towards producing an Arabic Discourse Treebank was introduced in 2011 by A. Alsaif and K. Markert.
• They collected a large set of Arabic discourse connectives using text analysis and corpus-based techniques.
• The final list contains 107 discourse connectives.

7

Types of Discourse connectives

8

Types of Relations

9

Types of Relations Cont..

COMPARISON.Similarity:

10

Arabic Discourse Annotation Tool (ADA) and Annotation Process

11

Annotation Methodology

1. Measuring whether annotators agree on the binary decision of whether an item constitutes a discourse connective in context.

2. Measuring whether annotators agree on which discourse relation an identified connective expresses, given that annotators can assign a set of relations to a single connective.

12

Results

Agreement on task 1 (connective identification) is highly reliable (N=23331): percentage agreement of 0.95 and kappa of 0.88.

Agreement on task 2 (relation assignment) is relatively low (N=5586): percentage agreement of 0.66, kappa of 0.57, and alpha of 0.58.
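As a reminder of how percentage agreement and kappa relate, here is a minimal sketch of Cohen's kappa for two annotators. The label lists are toy data invented for this example, not treebank annotations, and the function name is an assumption.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected (chance) agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy binary decisions: is the candidate a discourse connective (DC) or not?
annotator_a = ["DC", "DC", "DC", "not", "not", "not", "not", "not"]
annotator_b = ["DC", "DC", "not", "not", "not", "not", "not", "DC"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

With this toy data the observed agreement is 0.75 but kappa is much lower, which is why the slides report both figures.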

13

Discourse Connective Recognition

The task is to distinguish between discourse and non-discourse usage of a connective.

Example: once, while.

A. Alsaif and K. Markert (2011) introduced a connective identifier for Arabic based on syntactic features.

14

Discourse Connective Recognition by A. Alsaif and K. Markert (2011)

Features:
• Surface features (SConn)
• Lexical features of surrounding words (Lex)

Example:

[ان الاطفال ممكن ان يصابوا بالإرهاق]Arg1 [و]DC [ان يشعروا بالنعاس]Arg2 اذا لم يناموا جيدا.

[Children might be tired]Arg1 [and]DC [feel sleepy]Arg2 during school time if they did not sleep well

15

Discourse Connective Recognition by A. Alsaif and K. Markert (2011) (cont.)

Features:
• Part-of-speech features (POS)
• Syntactic category of related phrases (Syn), e.g. المدرسة كبيرة وجميلة / the school is very large and beautiful
• Al-Masdar feature
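To make the idea of surface and lexical features concrete, the sketch below builds a feature dictionary for one candidate connective. It is not the authors' feature set: the function name, the feature names and the choice of a one-word context window are all assumptions, and the tokens are the English gloss of the earlier example.

```python
def connective_features(tokens, index):
    """Return illustrative features for the candidate at tokens[index].

    Hypothetical features: the connective string (Conn), one surface
    feature (sentence-initial position) and the neighbouring words (Lex).
    """
    prev_word = tokens[index - 1] if index > 0 else "<s>"
    next_word = tokens[index + 1] if index + 1 < len(tokens) else "</s>"
    return {
        "conn": tokens[index],
        "sconn_sentence_initial": index == 0,
        "lex_prev": prev_word,
        "lex_next": next_word,
    }


tokens = ["children", "might", "be", "tired", "and", "feel", "sleepy"]
features = connective_features(tokens, tokens.index("and"))
print(features)
```

A classifier would then be trained on such dictionaries, labelled as discourse or non-discourse usage.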

16

Results

Discourse Connective Recognition by A. Alsaif and K. Markert (2011) (cont.)

Model     Features                         Accuracy  Kappa
Baseline  not Conn                         68.9      0
M1        Conn only                        75.7      0.48

Tokenization by white space + auto tagger:
M2        Conn+SConn+Lex                   85.6      0.62
M3        Conn+SConn+Lex+POS               87.6      0.69
M4        Conn+SConn+Lex+POS+Masdar        88.5      0.70

ATB-based features:
M5        Conn+SConn+Lex                   86.2      0.65
M6        Conn+SConn+Lex+Syn/POS           91.2      0.79
M7        Conn+SConn+Lex+Syn/POS+Masdar    92.4      0.82

M8        Conn+SConn+Syn                   91.2      0.79
M9        SConn+Lex+Syn+Masdar             91.2      0.79

17

Discourse Relation Recognition

The task is to identify the type of the relation.

A. Alsaif and K. Markert (2011) introduced the first algorithms to automatically identify relations for Arabic.

18

Discourse Relation Recognition by A. Alsaif and K. Markert (2011)

Features:
• Connective features
• Words and POS of arguments
• Masdar
• Tense and negation
• Length, distance and order features
• Argument parent
• Production rules

19

Results

Model     Features                                          Accuracy  Kappa

All connectives (6039)
Baseline  CONJUNCTION                                       52.5      0
M1        Conn only (1)                                     77.2      0.60
M2        Conn + Conn f + Arg f (37)                        78.7      0.66
M3        Conn + Conn f + Arg f + Production rules (1237)   78.3      0.65

Excluding wa at BOP (3813)
Baseline  CONJUNCTION                                       35        0
M1        Conn only (1)                                     74.3      0.65
M2        Conn + Conn f + Arg f (37)                        77.0      0.69
M3        Conn + Conn f + Arg f + Production rules (1237)   76.7      0.69

20

Results

Model     Features                     Accuracy  Kappa

All connectives (6039)
Baseline  EXPANSION                    62.4      0
M1        Conn only (1)                88.7      0.78
M2        Conn + Conn f + Arg f (37)   88.7      0.78

Excluding wa at BOP (3813)
Baseline  EXPANSION                    41.8      0
M1        Conn only (1)                82.7      0.74
M2        Conn + Conn f + Arg f (37)   83.5      0.75
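The baseline rows name a single relation and have a kappa of 0, which is what one would expect if the baseline always predicts the most frequent relation (an assumption; the slides do not spell this out). A minimal sketch of such a majority-class baseline, on invented toy labels:

```python
from collections import Counter


def majority_baseline(gold_labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(gold_labels).most_common(1)[0]
    return label, count / len(gold_labels)


# Toy gold labels, invented for illustration only.
gold = ["CONJUNCTION"] * 5 + ["CAUSAL"] * 3 + ["TEMPORAL"] * 2

label, accuracy = majority_baseline(gold)
print(label, accuracy)
```

Because a constant predictor agrees with the gold labels exactly as often as chance would predict, its kappa is 0 regardless of its accuracy.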

21

Semantic-Based Segmentation of Arabic Texts

Corpus Analysis

Definition: Let L be a list of candidate segment connectors. Each element c in L is classified, based on its effect on text segmentation, as either active or passive.

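Examples:

1. [تعتزم إدارة الجامعة إنشاء قسم جديد في الكلية] [هنالك بعض التقارير التي تؤكد إنشاء هذا القسم]
(The university administration intends to establish a new department in the college. There are some reports confirming the establishment of this department.)

2. [تعتزم إدارة الجامعة إنشاء قسم جديد في الكلية] و [هنالك بعض التقارير التي تؤكد إنشاء هذا القسم] و لكن [لم يحدد موعدا لذلك]
(The university administration intends to establish a new department in the college, and there are some reports confirming the establishment of this department, but no date has been set for it.)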

22

Segmentation Process

1. Identifying the connectors that indicate complete segments.
2. Locating the active connectors.
3. Resolving the case where adjacent active connectors exist.
4. Setting the segment boundaries.
5. Creating the final list of segments.
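The steps above can be sketched roughly as follows. This is a simplified illustration, not the published algorithm: it assumes whitespace-separated tokens, a given set of active connectors, and one particular way of resolving adjacent active connectors (keeping them together at the start of the next segment). The transliterated connectors "wa" and "lakin" and the placeholder words are hypothetical.

```python
def segment(tokens, active_connectors):
    """Split tokens into segments, opening a new segment at each active connector.

    Adjacent active connectors are treated as one boundary and are kept
    together at the head of the segment they introduce.
    """
    segments, current = [], []
    previous_was_active = False
    for token in tokens:
        is_active = token in active_connectors
        # Close the running segment only at the first connector of a run.
        if is_active and current and not previous_was_active:
            segments.append(current)
            current = []
        current.append(token)
        previous_was_active = is_active
    if current:
        segments.append(current)
    return segments


tokens = ["w1", "w2", "w3", "wa", "lakin", "w4", "w5"]
result = segment(tokens, active_connectors={"wa", "lakin"})
print(result)
```

Passive connectors are simply absent from `active_connectors`, so they never open a new segment.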

23

Discussion

To evaluate the segmentation process, the authors collected ten essays, each between 500 and 700 words long.

After running the segmentation process, they gave the output to judges, who evaluated it in terms of two factors: correct hits and incorrect hits.

24

Discussion (cont.)

Essay  Correct hit  Incorrect hit
1      33           0
2      15           1
3      25           0
4      23           1
5      20           0
6      29           1
7      26           1
8      33           2
9      26           0
10     22           0
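The per-essay counts can be totalled with a few lines of Python. The share of correct hits computed here is derived from the table and is not a figure reported on the slide.

```python
# (essay, correct hits, incorrect hits), copied from the table above.
hits = [
    (1, 33, 0), (2, 15, 1), (3, 25, 0), (4, 23, 1), (5, 20, 0),
    (6, 29, 1), (7, 26, 1), (8, 33, 2), (9, 26, 0), (10, 22, 0),
]

total_correct = sum(correct for _, correct, _ in hits)
total_incorrect = sum(incorrect for _, _, incorrect in hits)
correct_share = total_correct / (total_correct + total_incorrect)

print(total_correct, total_incorrect)  # 252 6
print(f"{correct_share:.1%}")
```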

25

Arabic Discourse Segmentation Based on Rhetorical Methods

This method depends on the meaning of the connector "و" in Arabic.

There are six types of "و", classified into two classes, "Fasl" and "Wasl":

• "Fasl": a segmenting place.
• "Wasl": non-segmenting; it connects the text.

26

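Types of Connector "و"

• و القسم (oath), Fasl: الأساتذة يعلمون التلاميذ العلم. والله انهم ليقدمون عملا عظيما. (Teachers teach pupils knowledge. By God, they do great work.)
• و رُبّ (rubba), Fasl: الشباب ليسوا وحدهم الذين يعانون. ورُبّ سائل يقول: لماذا ركزتم على الشباب من بين طبقات المجتمع؟ (Young people are not the only ones who suffer. And one might ask: why did you focus on young people among the classes of society?)
• و الاستئناف (resumption), Fasl: يعاني المراهقون من بعض المشكلات النفسية. والمجتمع عامة به سلبيات أخرى كثيرة. (Adolescents suffer from some psychological problems. And society in general has many other negatives.)
• و الحال (circumstantial), Wasl: دخل المدرس الفصل وهو يبتسم. (The teacher entered the classroom smiling.)
• و المعية (accompaniment), Wasl: جلس الحبيبان وضوء القمر. (The two lovers sat in the moonlight.)
• و العطف (conjunction), Wasl: سافر محمد وخالد. (Mohammed and Khaled travelled.)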
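Once the type of a given "و" is known, deciding whether it marks a segment boundary is a table lookup. The sketch below encodes the six types and their classes from the slide; the transliterated keys and the function name are assumptions made for this example, and the hard part (classifying the type in context) is not shown.

```python
# Type of "و" -> class, as listed above. Keys are hypothetical transliterations.
WAW_CLASS = {
    "qasam": "Fasl",      # و القسم (oath)
    "rubba": "Fasl",      # و رُبّ
    "istinaf": "Fasl",    # و الاستئناف (resumption)
    "hal": "Wasl",        # و الحال (circumstantial)
    "maiyya": "Wasl",     # و المعية (accompaniment)
    "atf": "Wasl",        # و العطف (conjunction)
}


def is_segment_boundary(waw_type):
    """A "Fasl" waw is a segmenting place; a "Wasl" waw connects the text."""
    return WAW_CLASS[waw_type] == "Fasl"


print(is_segment_boundary("istinaf"))  # True
print(is_segment_boundary("atf"))      # False
```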

27

The Arabic sentence Segmentation System

28

Feature Extraction

• The following are the features of "و المعية": X3 = noun and X7 = accusative mark.

29

Experiment and Results

They used 1200 instances for training and 293 instances for testing. On the test set, 290 instances were classified correctly and 3 incorrectly.

Results:
• Recall: 94.68%
• Precision: 96.82%
• Accuracy: 98.98%
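The reported accuracy follows directly from the test counts, as this short check shows. Recall and precision cannot be reproduced the same way, because the slide does not give the per-class counts they depend on.

```python
correct, incorrect = 290, 3

# Accuracy = correctly classified instances / all test instances.
accuracy = correct / (correct + incorrect)

print(f"{accuracy:.2%}")  # 98.98%
```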

30

A Comprehensive Taxonomy of Arabic Discourse Coherence Relations

Coherence relations are classified into two types: explicit relations and implicit relations.

• Explicit relation: "I am very happy because I got excellent marks in exams."
• Implicit relation: "I am very happy. I got excellent marks in exams."

31

The procedure of creating an Arabic Taxonomy of Coherence Relations

32

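Examples of Implicit Arabic relations

• Impossible condition / الشرط المستحيل: (ولا يدخلون الجنة حتى يلج الجمل في سم الخياط) (nor will they enter Paradise until the camel passes through the eye of the needle)

• Cascaded questioning / الاستفهام المكرر: (أفرأيتم ما تحرثون؟ أأنتم تزرعونه أم نحن الزارعون؟) (Have you considered what you sow? Is it you who make it grow, or are We the grower?)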

33

Results

They obtained a set of 47 Arabic coherence relations.

Coherence relations                              Result
From English coherence relations                 31
Additional Arabic explicit coherence relations   12
Arabic implicit relations                        4

34

Conclusion

Discourse annotation is a very fertile field with many NLP applications. For Arabic, there are still challenges due to the lack of annotated corpora and studies.

35

Thank You
