discourse annotation for arabic 2
Post on 11-Jun-2015
1
Survey on Discourse Annotation for Arabic
A. Algarni, H. Alharbi and N. Almutairy
Supervisor: Dr. A. Alsaif
April 23, 2013
Kingdom of Saudi Arabia
Ministry of Higher Education
Imam Mohammed Ibn Saud Islamic University
College of Computer and Information Sciences
CS465 - Natural Language Processing
Outline
Introduction
The Leeds Arabic Discourse Treebank
Discourse Connective Recognition
Discourse Relation Recognition
Semantic-Based Segmentation
Discourse Segmentation Based on Rhetorical Methods
A Comprehensive Taxonomy of Arabic Discourse Coherence Relations
2
3
Introduction
Linguistic annotation covers any descriptive or analytic notations applied to raw language data.
Annotated discourse corpora support theoretical studies and contribute to the development of NLP applications.
4
Applications
Information extraction
Question-answering
Summarization
Machine translation, generation.
5
Discourse Relations and Discourse Connectives
Discourse Relation: The way in which two arguments (text segments) are logically connected, e.g. Temporal, Comparison, Causal, Expansion, etc.
Discourse Connective (DC): A lexical marker used to link two abstract objects in a text.
Abstract Object (AO): Abstract objects in discourse are things like propositions, events, facts and opinions.
Argument (Arg): A text span expressing an abstract object and linked by a DC.
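The definitions above can be pictured as a small record type. The following is only a minimal sketch: the class name, field names and the `render` helper are hypothetical, and the relation label given to the example is an illustrative assumption, not an annotation taken from the treebank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscourseRelation:
    """One annotated relation: a connective (DC) linking two arguments."""

    arg1: str        # text span expressing the first abstract object
    connective: str  # the discourse connective (DC)
    arg2: str        # text span expressing the second abstract object
    relation: str    # e.g. "Temporal", "Comparison", "Causal", "Expansion"

    def render(self) -> str:
        # Bracketed notation: [..]Arg1 [..]DC [..]Arg2
        return f"[{self.arg1}]Arg1 [{self.connective}]DC [{self.arg2}]Arg2"


# The relation label below is assumed for illustration only.
example = DiscourseRelation(
    arg1="Children might be tired",
    connective="and",
    arg2="feel sleepy",
    relation="Expansion",
)
print(example.render())
```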
6
The Leeds Arabic Discourse Treebank
• The first effort towards producing an Arabic Discourse Treebank was introduced in 2011 by A. Alsaif and K. Markert.
• They collected a large set of Arabic discourse connectives using text analysis and corpus-based techniques.
• The final list contains 107 discourse connectives.
7
Types of Discourse connectives
8
Types of Relations
9
Types of Relations Cont..
COMPARISON.Similarity:
10
Arabic Discourse Annotation Tool (ADA) and Annotation Process
11
Annotation Methodology
1. Measuring whether annotators agree on the binary decision of whether an item constitutes a discourse connective in context.
2. Measuring whether annotators agree on which discourse relation an identified connective expresses, given that annotators can assign a set of relations to a connective.
12
Results
Agreement in task 1 (connective identification) is highly reliable (N=23331): percentage agreement of 0.95, kappa of 0.88.
Agreement in task 2 (relation assignment) is relatively low (N=5586): percentage agreement of 0.66, kappa of 0.57, and alpha of 0.58.
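As a rough illustration of how percentage agreement and kappa relate, here is a minimal sketch for two annotators making the binary task-1 decision. It assumes Cohen's kappa (the slides do not say which kappa variant was used), the function names are hypothetical, and the toy labels are invented for the example. The alpha measure reported for task 2 is not covered.

```python
from collections import Counter


def percentage_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percentage_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    p_chance = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if p_chance == 1.0:
        # Both annotators used one identical label everywhere; kappa is
        # undefined there, and this sketch simply returns 1.0.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy data (invented): "DC" = discourse connective, "not" = non-discourse use.
annotator_1 = ["DC", "DC", "not", "not", "DC", "not", "not", "not", "DC", "not"]
annotator_2 = ["DC", "DC", "not", "not", "not", "not", "not", "not", "DC", "not"]
print(percentage_agreement(annotator_1, annotator_2))
print(round(cohen_kappa(annotator_1, annotator_2), 2))
```

Kappa comes out lower than raw agreement because part of the agreement is expected by chance alone, which is why the slide reports both figures.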
13
Discourse Connective Recognition
The task is to distinguish between discourse and non-discourse usage of a connective.
Example: once, while.
A. Alsaif and K. Markert (2011) introduced a connective identifier for Arabic based on syntactic features.
14
Discourse Connective Recognition by A. Alsaif and K. Markert (2011)
Features:
Surface Features (SConn)
Lexical features of surrounding words (Lex)
Example:
[الاطفال ممكن ان يصابوا بالإرهاق]Arg1 [و]DC [ان يشعروا بالنعاس]Arg2 اذا لم يناموا جيدا.
[Children might be tired]Arg1 [and]DC [feel sleepy]Arg2 during school time if they did not sleep well
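To make the feature groups concrete, the following minimal sketch builds a feature dictionary for one candidate connective. The slides name the groups (SConn, Lex) but do not list the individual features, so everything inside the function is an assumption: the position features stand in for SConn, a window of neighbouring words stands in for Lex, and the function and key names are hypothetical.

```python
def connective_features(tokens, index, window=2):
    """Build a feature dict for the candidate connective at tokens[index]."""
    features = {
        "conn": tokens[index],
        # Assumed surface features (the slides do not enumerate SConn).
        "sconn_position": index,
        "sconn_sentence_initial": index == 0,
    }
    # Assumed lexical features: the words around the candidate.
    for offset in range(1, window + 1):
        left = index - offset
        right = index + offset
        features[f"lex_prev_{offset}"] = tokens[left] if left >= 0 else "<s>"
        features[f"lex_next_{offset}"] = (
            tokens[right] if right < len(tokens) else "</s>"
        )
    return features


# English gloss of the slide's example, used here only for readability.
tokens = "Children might be tired and feel sleepy".split()
print(connective_features(tokens, tokens.index("and")))
```

A dictionary of this kind could then be passed to any standard classifier that decides between discourse and non-discourse usage.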
15
Discourse Connective Recognition by A. Alsaif and K. Markert (2011) Cont…
Features:
Part of Speech features (POS)
Syntactic category of related phrases (Syn) (e.g. المدرسة كبيرة وجميلة / the school is very large and beautiful)
Al-Masdar feature.
16
Discourse Connective Recognition by A. Alsaif and K. Markert (2011) Cont…
Results
Model     Features                        Accuracy (%)  Kappa
Baseline  not Conn                        68.9          0
M1        Conn only                       75.7          0.48
Tokenization by white space + auto tagger
M2        Conn+SConn+Lex                  85.6          0.62
M3        Conn+SConn+Lex+POS              87.6          0.69
M4        Conn+SConn+Lex+POS+Masdar       88.5          0.70
ATB-based features
M5        Conn+SConn+Lex                  86.2          0.65
M6        Conn+SConn+Lex+Syn/POS          91.2          0.79
M7        Conn+SConn+Lex+Syn/POS+Masdar   92.4          0.82
M8        Conn+SConn+Syn                  91.2          0.79
M9        SConn+Lex+Syn+Masdar            91.2          0.79
17
Discourse Relation Recognition
The task is to identify the type of the relation.
A. Alsaif and K. Markert (2011) introduced the first algorithms to automatically identify relations for Arabic.
18
Discourse Relation Recognition by A. Alsaif and K. Markert (2011)
Features:
Connective features
Words and POS of arguments
Masdar
Tense and Negation
Length, Distance and Order features
Argument Parent
Production Rules
19
Results
Model     Features                                         Accuracy (%)  Kappa
All connectives (6039)
Baseline  CONJUNCTION                                      52.5          0
M1        Conn only (1)                                    77.2          0.60
M2        Conn + Conn f + Arg f (37)                       78.7          0.66
M3        Conn + Conn f + Arg f + Production rules (1237)  78.3          0.65
Excluding wa at BOP (3813)
Baseline  CONJUNCTION                                      35            0
M1        Conn only (1)                                    74.3          0.65
M2        Conn + Conn f + Arg f (37)                       77.0          0.69
M3        Conn + Conn f + Arg f + Production rules (1237)  76.7          0.69
20
Results
Model     Features                     Accuracy (%)  Kappa
All connectives (6039)
Baseline  EXPANSION                    62.4          0
M1        Conn only (1)                88.7          0.78
M2        Conn + Conn f + Arg f (37)   88.7          0.78
Excluding wa at BOP (3813)
Baseline  EXPANSION                    41.8          0
M1        Conn only (1)                82.7          0.74
M2        Conn + Conn f + Arg f (37)   83.5          0.75
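The Baseline rows presumably assign one fixed label (CONJUNCTION or EXPANSION) to every instance. Assuming that label is simply the most frequent one, the sketch below shows how such a baseline is scored; the function name and the toy label list are invented for illustration. A constant prediction agrees with the gold labels exactly as often as chance would predict, which is why its kappa is 0 even when its accuracy is above 50%.

```python
from collections import Counter


def majority_baseline(gold_labels):
    """Return the most frequent gold label and the accuracy of always predicting it."""
    if not gold_labels:
        raise ValueError("gold_labels must not be empty")
    label, count = Counter(gold_labels).most_common(1)[0]
    return label, count / len(gold_labels)


# Toy gold labels (invented), not the treebank distribution.
gold = ["CONJUNCTION"] * 5 + ["TEMPORAL"] * 3 + ["CAUSAL"] * 2
label, accuracy = majority_baseline(gold)
print(label, accuracy)
```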
21
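Semantic-Based Segmentation of Arabic Texts
Corpus Analysis
Definition: Let L be a list of candidate segment connectors. Each element c in L is classified, based on its effect on the text segmentation, as either active or passive.
Examples:
1. تعتزم إدارة الجامعة إنشاء قسم جديد في الكلية. هنالك بعض التقارير التي تؤكد إنشاء هذا القسم.
(The university administration intends to establish a new department in the college. There are some reports confirming the establishment of this department.)
2. تعتزم إدارة الجامعة إنشاء قسم جديد في الكلية وهنالك بعض التقارير التي تؤكد إنشاء هذا القسم ولكن لم يحدد موعدا لذلك.
(The university administration intends to establish a new department in the college, and there are some reports confirming the establishment of this department, but no date has been set for it.)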
22
Segmentation Process
Identifying the connectors that indicate complete segments.
Locating the active connectors.
Resolving the case where adjacent active connectors exist.
Setting the segment boundaries.
Creating the final list of segments.
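The steps above can be sketched roughly as follows. This is not the authors' algorithm: the set of active connectors is supplied by the caller, the rule for adjacent active connectors (treat a run of them as one boundary) is an assumption, and the function name and English toy tokens are hypothetical.

```python
def segment(tokens, active_connectors):
    """Split tokens into segments, opening a new segment at each active connector."""
    segments = []
    current = []
    for i, token in enumerate(tokens):
        is_active = token in active_connectors
        follows_active = i > 0 and tokens[i - 1] in active_connectors
        # Assumed resolution rule: a run of adjacent active connectors
        # produces a single boundary, placed before the first of them.
        if is_active and not follows_active and current:
            segments.append(current)
            current = []
        current.append(token)
    if current:
        segments.append(current)
    return [" ".join(seg) for seg in segments]


# Toy input: "but" is treated as active and "and" as passive (assumed).
tokens = "the university plans a new department and reports confirm it but no date is set".split()
for seg in segment(tokens, {"but"}):
    print(seg)
```

Passive connectors stay inside a segment, so only the set passed as `active_connectors` influences where boundaries fall.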
23
Discussion
To evaluate the segmentation process, they collected ten essays.
Each essay ranges between 500 and 700 words.
After running the segmentation process, they gave the output to judges, who evaluated it in terms of two factors: correct hits and incorrect hits.
24
Discussion Cont..
Essay  Correct hit  Incorrect hit
1      33           0
2      15           1
3      25           0
4      23           1
5      20           0
6      29           1
7      26           1
8      33           2
9      26           0
10     22           0
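The table can be summarised with a little arithmetic. The sketch below derives totals and a share of correct hits from the per-essay counts; these summary figures are computed here and are not reported on the slides, and the variable and function names are hypothetical. Over the ten essays this gives 252 correct and 6 incorrect hits, roughly 97.7% correct.

```python
# Per-essay counts copied from the table: essay -> (correct hits, incorrect hits).
hits = {
    1: (33, 0), 2: (15, 1), 3: (25, 0), 4: (23, 1), 5: (20, 0),
    6: (29, 1), 7: (26, 1), 8: (33, 2), 9: (26, 0), 10: (22, 0),
}


def correct_share(correct, incorrect):
    """Share of hits judged correct."""
    return correct / (correct + incorrect)


total_correct = sum(correct for correct, _ in hits.values())
total_incorrect = sum(incorrect for _, incorrect in hits.values())
overall = correct_share(total_correct, total_incorrect)

print(total_correct, total_incorrect, round(overall, 3))
for essay, (correct, incorrect) in hits.items():
    print(essay, round(correct_share(correct, incorrect), 3))
```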
25
Arabic Discourse Segmentation Based on Rhetorical Methods
This method depends on the meaning of the connector "و" (wa) in the Arabic language.
There are six types of "و", classified into two classes, "Fasl" and "Wasl":
"Fasl": a segmenting place.
"Wasl": not a segmenting place; it connects the text.
26
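Types of Connector "و"
Class  Type  Example
Fasl  واو القسم (wa of oath)  الأساتذة يعلمون التلاميذ العلم والله انهم ليقدمون عملا عظيما. (The teachers teach the pupils knowledge; by God, they do great work.)
Fasl  واو رُبّ (wa of rubba)  الشباب ليسوا وحدهم الذين يعانون ورب سائل يقول: لماذا ركزتم على الشباب من بين طبقات المجتمع؟ (Young people are not the only ones who suffer, and someone may well ask: why did you focus on the youth among the classes of society?)
Fasl  واو الاستئناف (wa of resumption)  يعاني المراهقون من بعض المشكلات النفسية. والمجتمع عامة به سلبيات أخرى كثيرة (Adolescents suffer from some psychological problems. And society in general has many other negative aspects.)
Wasl  واو الحال (circumstantial wa)  دخل المدرس الفصل وهو يبتسم. (The teacher entered the classroom smiling.)
Wasl  واو المعية (wa of accompaniment)  جلس الحبيبان وضوء القمر. (The two lovers sat with the moonlight.)
Wasl  واو العطف (wa of conjunction)  سافر خالد ومحمد (Khaled and Mohammed traveled.)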
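Once the type of a "و" is known, the segmentation decision is a simple lookup. A minimal sketch follows; the English type keys and the function name are hypothetical labels chosen for this example, and deciding the type itself is the hard part, which this sketch does not attempt.

```python
# Class of each type of the connector "و", as listed in the table.
# The English keys are labels chosen for this sketch.
WA_CLASS = {
    "oath": "Fasl",
    "rubba": "Fasl",
    "resumption": "Fasl",
    "circumstantial": "Wasl",
    "accompaniment": "Wasl",
    "conjunction": "Wasl",
}


def is_segment_boundary(wa_type):
    """True if this type of 'و' marks a segmenting place (Fasl)."""
    return WA_CLASS[wa_type] == "Fasl"


print(is_segment_boundary("oath"))         # Fasl type
print(is_segment_boundary("conjunction"))  # Wasl type
```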
27
The Arabic sentence Segmentation System
28
Feature Extraction
• The following are the features of the wa of accompaniment (واو المعية): X3 = noun and X7 = accusative mark.
29
Experiment and Results
They used 1200 instances for training and 293 instances for testing.
Of the 293 test instances, 290 were classified correctly and 3 incorrectly.
Results:
Recall: 94.68%
Precision: 96.82%
Accuracy: 98.98%
30
A Comprehensive Taxonomy of Arabic Discourse Coherence Relations
Coherence relations are classified into two types: explicit relations and implicit relations.
Coherence relations  Example
Explicit relations   I am very happy because I got excellent marks in exams.
Implicit relations   I am very happy. I got excellent marks in exams.
31
The procedure of creating an Arabic Taxonomy of Coherence Relations
32
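Examples of Implicit Arabic relations
"Impossible condition / الشرط المستحيل": (ولا يدخلون الجنة حتى يلج الجمل في سم الخياط) "Nor will they enter Paradise until the camel passes through the eye of the needle."
"Cascaded questioning / الاستفهام المكرر": (أفرأيتم ما تحرثون؟ أأنتم تزرعونه أم نحن الزارعون؟) "Have you considered what you sow? Is it you who make it grow, or are We the growers?"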
33
Results
They got a set of 47 Arabic coherence relations.
Coherence relations                             Result
From English coherence relations                31
Additional Arabic explicit coherence relations  12
Arabic implicit relations                       4
34
Conclusion
Discourse annotation is a very fertile field with many NLP applications. For Arabic, challenges remain due to the lack of annotated corpora and studies.
35
Thank You