
REDEFINING URDU MORPHOLOGY AND GRAMMAR FOR THE DEVELOPMENT OF AN INTEGRATED SENTIMENT

ANALYSIS FRAMEWORK

AFRAZ ZAHRA SYED 2007-PHD-CS-07

SUPERVISED BY DR. MUHAMMAD ASLAM

(2013)

Department of Computer Science and Engineering University of Engineering and Technology

Lahore, Pakistan


Redefining Urdu Morphology and Grammar for the Development of an Integrated Sentiment Analysis Framework

Dissertation

Submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science

(2013)

AFRAZ ZAHRA SYED 2007-PHD-CS-07

SUPERVISED BY DR. MUHAMMAD ASLAM

Department of Computer Science and Engineering University of Engineering and Technology

Lahore, Pakistan


Redefining Urdu Morphology and Grammar for the Development of an Integrated Sentiment Analysis Framework

A dissertation submitted in partial fulfillment of the requirements for the

degree of Doctor of Philosophy in Computer Science

By

Afraz Zahra Syed (2007-PhD-CS-07)

Approved on: ______________________________ Internal Examiner: __________________________

Dr. Muhammad Aslam Assistant Professor, Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan.

External Examiner: __________________________

Dr. Farooq Ahmad Associate Professor, Faculty of Information Technology, University of Central Punjab, Lahore, Pakistan.

_______________________________ Chairman, Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan.

_______________________________ Dean, Faculty of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan.


This thesis has been evaluated by the following examiners:

External Examiners

a) From Abroad

1) Dr. Escalada Imaz, Gonzalo, Scientific Researcher, Research Council (CSIC) at the Artificial Intelligence Research Institute (CSIC-IIIA), Barcelona, Spain.

2) Dr. Muhammad Adeel Talib Functional Architect, Genix Ventures Pty Ltd, Melbourne, Australia

3) Dr. Muhammad Tahir Abbas Khan Associate Professor

Ritsumeikan Asia Pacific University, 1-1 Jumonjibaru, Beppu-shi, Oita 874-8577, Japan

b) From within the Country

Dr. Farooq Ahmad, Associate Professor, Faculty of Information Technology, University of Central Punjab, 1 - Khayaban-e-Jinnah Road, Johar Town, Lahore, Pakistan

Internal Examiner

Dr. Muhammad Aslam (Assistant Professor), Department of Computer Science and Engineering, University of Engineering and Technology, G. T. Road, Lahore, Pakistan


ABSTRACT

The rise of social networking sites and blogs has stimulated a bull market in personal opinion: consumer recommendations, product reviews, ratings, and other types of online expression. For computational linguistics researchers, this fast-growing heap of information has opened an exciting research frontier, referred to as sentiment analysis (SA). For English, this area has been under study for the last decade, but other major languages, like Urdu, are totally overlooked by the research community. Urdu is a morphologically rich and resource-poor language. Its distinctive features, such as complex morphology, flexible grammar rules, context-sensitive orthography and free word order, make Urdu language processing a challenging problem domain. For the same reasons, sentiment analysis approaches and techniques developed for other well-explored languages are not workable for Urdu text.

This dissertation presents a grammatically motivated sentiment classification framework to handle these distinctive features of the Urdu language. The main research contributions are: to highlight the linguistic (orthography, grammar, morphology, etc.) as well as technical (parsing algorithm, lexicon, corpus, etc.) aspects of this multidimensional research problem; to explore Urdu morphological operations, grammar and orthographic rules; and to redefine these operations and rules with respect to the requirements of a sentiment analysis framework. The orthographical, morphological, grammatical and, finally, conceptual details of the language are our target concerns. Additionally, our approach can help in the sentiment analysis of other languages, such as Arabic, Persian, Hindi and Punjabi.

The proposed framework emphasizes the identification of SentiUnits, rather than the subjective words, in the given text. SentiUnits are the sentiment-carrying expressions that reveal the inherent sentiments of a sentence for a specific target. The targets are the noun phrases about which an opinion is expressed. The system extracts SentiUnits and target expressions through shallow-parsing-based chunking. A dependency parsing algorithm creates associations between these extracted expressions. The framework uses a sentiment-annotated-lexicon-based approach. Each entry of the lexicon is marked with its orientation (positive or negative) and its intensity (force of orientation) score. An experimental evaluation of the system, with a sentiment-annotated lexicon of Urdu words and two corpora of reviews as test beds, shows encouraging results in terms of accuracy, precision, recall and F-measure.


ACKNOWLEDGEMENTS

I believe the research work presented in this dissertation, from conception to completion, is a blessing from my Allah, who answered my parents’ prayers and blessed me with the strength. I also want to express my deepest gratitude to several individuals:

First and foremost, my utmost gratitude to Dr. Muhammad Aslam, my supervisor,

whose support and encouragement, I will never forget.

Dr. Ana Maria Martinez-Enriquez, who guided me in writing good research papers

through her thoughtful comments and suggestions.

Dr. Muhammad Ali Maud, Chairman of the Department of Computer Science and

Engineering, for his kind concern and consideration regarding my academic

requirements.

My respectable teachers during the PhD course work for their guidance and

invaluable intellect.

I am grateful to my colleagues and staff in the Computer Science and Engineering Department.

Mr. Waqaar, who assisted me in the implementation and testing phase.

Lastly, I would like to thank my family for all their love and encouragement. My

parents, for being the excellent models of success and brilliance, who raised me with

a love of science and supported me in all my pursuits. My loving, supportive,

encouraging, and patient husband, Hasan, whose sincere support during all stages of

this Ph.D. gave me the feeling that I always had him on my side. Most of all, my children Irtaza, Fatima and Ibrahim, for their patience in tolerating my long study hours.


Dedicated to my Parents, Husband and Children

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION 1

1.1. Research Motivation 2

1.2. Research contribution 4

1.3. The Problem of Sentiment Analysis 5

1.3.1. Targets of the appraisal 7

1.3.2. Sources of the appraisal 8

1.3.3. Appraisal expressions 8

1.3.4. Orientation 9

1.4. Sentiment annotated lexicon 9

1.5. Problem statement 10

1.6. System Evolution 11

1.7. Dissertation Outline 12

CHAPTER 2: STATE OF THE ART RESEARCH 14

2.1. Features of the given text 15

2.2. Techniques 17

2.3. Sentiment-annotated-lexicon construction 18

2.4. Generalization among domains 21

2.5. Processing Morphologically Rich Languages 22

2.6. Sentiment analysis and Urdu language processing 23

2.6.1. Word segmentation 24

2.6.2. Phrase Chunking 24

2.6.3. Stemming of complex morphology 25

2.6.4. Resources for Urdu language processing 25

2.6.5. Miscellaneous works 26

2.7. Adjective based sentiment analysis techniques 28

2.8. Term level vs. Phrase level polarity 30

2.8.1. Term-level-polarity based approaches 30

2.8.2. Phrase-level-polarity-based approaches 31

2.9. Negation Handling in sentiment analysis 32

CHAPTER 3: DISTINCTIVE FEATURES OF THE URDU LANGUAGE 34

3.1 Orthography 35

3.1.1. Character set 35

3.1.2. Word order 36

3.1.3. Bidirectional script 36

3.1.4. Ligatures 36

3.2. Parts of Speech 37

3.3. Vocabulary 38

3.4. Morphology 39

3.4.1. Inflection and derivation 40

3.4.2. Compounding 40

3.4.3. Reduplication 41

3.4.4. Compound verbs and verb phrases 41

3.5. Challenging features of the Urdu language 42

3.5.1. Corpus construction 42

3.5.2. Complex stemming 42

3.5.3. Intricate lexicon 42

3.5.4. Word boundary identification 43

3.5.5. Diacritics omission 44

3.5.6. Code switching 44

3.5.7. Independent case marking 44

3.5.8. Free word order 46

CHAPTER 4: SENTIUNITS: THE APPRAISAL EXPRESSIONS 47

4.1. Adjectives 48

4.1.1. Morphological structure of adjectives 50

4.1.2. Classes of adjective 53

4.2. Modifiers 56

4.3. Orientation 58

4.4. Intensity 58

4.5. Polarity 58

4.6. Negations 58

4.6.1. Negation in Urdu language 59

4.7. SentiUnit extraction model 61

4.8. The appraisal targets 62

4.8.1. Cases of noun phrases 63

4.8.2. Possession markers in noun phrases 63

4.8.3. Effect of complex noun phrases in Urdu text 64

CHAPTER 5: IMPLEMENTATION: CLASSIFICATION MODEL AND LEXICON STRUCTURE 66

5.1. PREPROCESSOR 68

5.1.1. Diacritic omission 68

5.1.2. Word boundary identification 68

5.2. EXTRACTOR 71

5.3. ASSOCIATOR 72

5.3.1. Working of the ASSOCIATOR 72

5.3.2. Algorithm 73

5.4. CLASSIFIER 74

5.4.1. Working of the CLASSIFIER 74

5.4.2. Algorithm 75

5.5. Computation of SentiUnit polarity: Effect of polarity shifters 76

5.5.1. Computing overall review polarity Rp from SUp 78

5.6. Sentiment Annotated Lexicon 79

5.6.1. Definitions of the specific terms 80

5.6.2. Sentiment annotated lexicon of Urdu words 82

5.7. System integration 85

CHAPTER 6: EXPERIMENTATION AND RESULTS 88

6.1. Lexicon Coverage 88

6.2. Corpus 89

6.3. Case Studies 90

6.3.1. CASE 1: Part of speech tagging 90

6.3.2. CASE 2: Extraction of targets and SentiUnits 90

6.3.3. CASE 3: Case marking and complex noun phrases 93

6.3.4. CASE 4: Polarity annotations 95

6.3.5. CASE 5: Associating targets with SentiUnits 97

6.4. Results 98

6.4.1. Model A 99

6.4.2. Model B 100

6.4.3. Effect of Negation 101

6.5. Example illustrations 103

CHAPTER 7: CONCLUSIONS AND FUTURE DIRECTIONS 107

REFERENCES 111

LIST OF TABLES

1.1 Summary of the given review in terms of sentiment analysis 6

2.1 Features used and their respective contributions 16

2.2 Techniques used by different contributions. 18

2.3 Lexicon construction research. 20

2.4 Corpuses and lexicons for Urdu language. 21

2.5 Urdu language processing. 27

2.6 Research contributions related to adjective based sentiment analysis. 29

2.7 Term-level polarity vs. phrase-level polarity approaches. 31

2.8 Negation handling for sentiment analysis. 32

3.1 Brief overview of Urdu language 34

3.2 Different shapes of a single alphabet ج (jeem). 37

3.3 Examples of Urdu words from multiple languages. 38

3.4. Examples of morphological processes in Urdu. 41

3.5 Inflection of multiple words from root word “علم” in the Urdu language. 43

3.6 Examples of affixes, case markers and postpositions. 45

3.7 Free word order property of the Urdu text. 46

4.1 Examples of opinionated sentences from Urdu with different SentiUnits. 47

4.2 Examples of unmarked adjectives. 51

4.3 Adjective marking with gender and number 52

4.4 Marking of adjectives for cases 52

4.5 Adjective agrees with the nearest noun in a sequence. 53

4.6 Adjective with partial and full reduplication. 53

4.7 Descriptive adjectives in Urdu. 54

4.8 Attributive adjectives directly modify the nouns. 54

4.9 Predicative adjectives describe the features of the nouns. 54

4.10 Examples of possessive adjectives. 55

4.11 Examples of demonstrative adjectives. 55

4.12 Inflection of demonstrative adjectives. 56

4.13 Examples of reflexive possessive adjectives. 56

4.14 Adjective modifiers. 57

4.15 Examples of sentential negation from Urdu text. 60

4.16 Examples of constituent negation from Urdu text. 60

4.17 Possession markers in Urdu. 64

5.1 Examples of lexicon entries. 85

6.1. Summary of lexicon entries. 89

6.2 Corpora for evaluation. 89

6.3 Parsing of example 1 into targets and SentiUnits. 91

6.4 Parsing of example 2 into targets and SentiUnits. 91

6.5 POS tagging and phrase chunking of the given review. 92

6.6 Experimental results in terms of P, R, F and A for model A. 99

6.7 Comparison of accuracy from both corpora C1 and C2 for model A. 99

6.8 Experimental results in terms of P, R, F and A for model B. 100

6.9 Comparison of accuracy from both corpora C1 and C2 for model B. 100

6.10 Effect of negation in terms of P, R and F. 101

LIST OF FIGURES

3.1 Character set of Urdu. 35

3.2 Diacritics in Urdu with letter “ب”. 36

4.1 Types of adjectives in Urdu. 50

4.2 SentiUnit extraction and polarity computation. 61

4.3 Cases of noun phrases with core case markers 63

5.1 System model representing modules and their interactions. 67

5.2 Preprocessing of the input sentence by the PREPROCESSOR module 70

5.3 Processing of the input sentence by EXTRACTOR module 71

5.4 The dependency parsing of the given sentence 72

5.5 Linking SentiUnits with candidate targets by ASSOCIATOR module 73

5.6 Sentiment classification of a review as positive or negative. 77

5.7 Computation of the overall polarity of the Urdu text based review. 78

5.8 Structure of the sentiment annotated lexicon 84

5.9 Integration of the lexicon of Urdu words with the sentiment classifier 86

6.1 Example extraction of the SentiUnits. 93

6.2 Linking the sentiment expressions with candidate targets. 97

6.3 Example of a positive review. 103

6.4 Result of the analysis. 104

6.5 Example of a negative review. 105

6.6 Result of the analysis. 106

LIST OF PUBLICATIONS

1. Syed AZ, Muhammad A, Martínez-Enríquez AM (2012) Handling the Effect of

Polarity Shifters for a Morphologically Rich Language. In: An International

Interdisciplinary Journal in English, Japanese and Chinese. (Submitted)

2. Syed AZ, Muhammad A, Martínez-Enríquez AM (2012) Associating Targets with

SentiUnits: A Step Forward in Sentiment Analysis of Urdu Text. In: Artificial

Intelligence Review.

3. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) Sentiment Analysis of

Urdu Language: Handling Phrase-Level Negation. In: Proceedings of the

10th Mexican international conference of artificial intelligence, pp 382–393

4. Syed, AZ, Muhammad A, Martinez-Enriquez, AM (2011) Adjectival Phrases as

the Sentiment Carriers in the Urdu Text. Journal of American Science 7(3), 644–

652

5. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) Sentiment-Annotated

Lexicon Construction for an Urdu Text Based Sentiment Analyzer. In: Pakistan

Journal of Science (2011), ISSN: 0030-9877

6. Syed AZ, Muhammad A, Martínez-Enríquez AM (2010) Lexicon based sentiment

analysis of Urdu text using SentiUnits. In: Proceedings of the 9th Mexican

international conference of artificial intelligence, Pachuca, Mexico, pp 32–43

Chapter 1| Introduction 1

______________________________________________________________________________________ Redefining Urdu Morphology & Grammar for the Development of an Integrated Sentiment Analysis Framework.

CHAPTER 1

INTRODUCTION

The information in the world can be generally categorized into two key types: facts and

opinions.

The facts are objective expressions describing events, entities and their characteristic

properties.

The opinions are typically subjective expressions or appraisal expressions that

describe personal or individual sentiments or appraisals about events, entities and

their characteristic properties.

The notion of opinion is very broad. For this dissertation, our focus is on the opinions

generated by individuals, particularly the appraisal expressions which are given in the

form of web reviews.

These appraisal expressions, in the form of opinions and subjective texts, are very important. As John Locke (1632-1704) very rightly said, "Man is by nature a social animal". It means that man always seeks suggestions, opinions, and views from other people in society for his survival and for proper decisions in every walk of life.

There are many areas of textual natural language processing, such as information extraction, summarization, information retrieval, text clustering, web search, and text categorization or classification. Until recently, little research had been done on the analysis of subjective texts for their inherent sentiments. One of the main reasons for this lack of study on the automatic analysis of subjective text is that there was little opinionated text available before the World Wide Web. People used to seek opinions from their friends and relatives before making any decision. Likewise, organizations used to conduct polls, surveys and focus groups whenever they wanted to find the opinions or sentiments of their clients. But in this modern era of computers and technology, we are living in virtual communities and societies. Now, internet forums, blogs, consumer reports, product reviews, and other types of discussion groups have opened new horizons for the human mind. That is why, from casting a vote to buying the latest gadget, people search for opinions and reviews from other people on the internet. They can now give their reviews about products at business sites and convey their views on almost anything in web forums, blogs and discussion groups, which are collectively called user-generated content.

This is not only true for individuals but also true for organizations and companies. For an

organization or company, it may no longer be compulsory to organize focus groups,

conduct surveys, or employ external consultants in order to get client or consumer

opinions regarding its products and those of its competitors. Now, the user-generated

content on the Web can easily provide such information.

However, finding the right sources of subjective texts and monitoring them on the Web is still a difficult task, because there is a very large number of diverse sources, and each source may also have a massive volume of such text. In many cases, the opinions are hidden in lengthy forum posts and blogs. It is hard and very time consuming for a human reader to find relevant sources, extract the related sentences, read them, analyze them, and classify them into a usable form. Thus, automated opinion or sentiment discovery and summarization systems are needed. This need has given rise to an exciting and fairly new area in text analysis, which is referred to by many names, such as sentiment analysis, opinion mining, subjectivity analysis, and appraisal extraction (Pang and Lee, 2008). In this dissertation we use the term sentiment analysis.

1.1. Research Motivation

There are two factors which motivated us to dig deep in this research direction:

Factor I: Rapid Proliferation of Information. Web 2.0 has emerged as a platform for dynamic information exchange and the propagation of personal views. Now, more and more people around the globe express their feelings through blogs, give voice to their views on governmental and political affairs through news reviews, and record their likes and dislikes in the form of product reviews. This proliferation of information has affected the lives of internet users both positively and negatively.



On one side, people use internet forums, blogs, consumer reports, product reviews, and different types of discussion groups to make everyday decisions. This text helps them in almost all aspects of life, from medical care to business proposals and from home education to professional training.

On the other hand, the negative aspect of this opinion sharing, which takes the form of revolutionary or extremist propaganda, cannot be ignored. According to Glaser et al. (2002), extremist groups use the Internet to endorse hatred and aggression. The Internet has turned into a ubiquitous, anonymous, economical, and rapid means of communication for such groups (Crilley, 2001). Now, people discuss every type of emotional behavior in web discussions and openly post their opinions. But this information can mislead the general public in their beliefs and thoughts; children and youth are particularly vulnerable.

Therefore, the analysis of user-generated web content is not only useful for commercial purposes; it is even more urgently needed to discourage such misinformation, particularly in the main languages of the world.

Consequently, research on opinion mining and sentiment analysis for some Indo-European languages, like English, is flourishing and has produced a number of successful contributions, e.g., (Turney 2002), (Pang et al. 2002), (Riloff et al. 2003), (Riloff and Wiebe 2003), (Tan et al. 2009) and (Bloom and Argamon 2010). These works have used multiple approaches and techniques to handle this flourishing area more effectively, and most of them perform the task of sentiment analysis very successfully. There are now at least 20-30 companies that offer sentiment analysis services in the USA alone (Liu, 2010).

Factor II: Urdu as a Morphologically Rich Language. Although sentiment analysis is a well-explored field for the English language, it is not yet decided whether and how equivalent success could be attained for Morphologically Rich Languages (MRLs) (Abdul-Mageed and Korayem 2010). MRLs are defined as languages in which considerable information about the syntactic units and their relations is expressed at the word level, i.e., the structures of the words are complex, and morphological operations like inflection and derivation are more frequent (Tsarfaty et al. 2010). Due to this word-level complexity, MRLs are more challenging for computational linguistics (CL) applications. This can result in intricate lexicons, complex stemming, erroneous word segmentation, ambiguity in part-of-speech tagging, etc. Urdu is a case in point.

Challenges in Urdu Language Processing: Given that Urdu is a major language with about 100 million speakers, there is great potential in performing sentiment analysis on Urdu text. As the Urdu language is morphologically rich, its constituent words and phrases tend to be more complex, due to recurrent derivations and inflections. Besides the morphological complexity, variability in grammar rules and vocabulary is usual in Urdu text and is considered acceptable. The main reason for this phenomenon is that Urdu is influenced by many other languages, e.g., Hindi, Persian, Arabic, Sanskrit and English, not only in vocabulary but also in morphology and grammar. The loanwords from a particular language follow their own grammar rules. Hence, the Urdu language is distinctive in its features and linguistic aspects. Moreover, it is altogether different from the languages that are well studied in the field of sentiment analysis and other computational linguistics applications. Computational linguistics researchers therefore require a comprehensive understanding of its linguistic as well as computational aspects. Certain challenges that the Urdu language poses for researchers are listed below and are explained in detail in Chapter 3.

Optional use of diacritics causes misleading part-of-speech tagging.

Cursive script results in wrong word boundary identification.

Frequent inflection and derivation result in complex stemming.

Word-level complexity makes lexicons more complex.

Flexibility in vocabulary and grammar makes it difficult to define spelling and grammar rules.

The free word order property causes misidentification of part-of-speech tags.

1.2. Research contribution

On the basis of the two major factors of the research motivation, we state here the main contributions:



Sentiment analysis is a challenging computational linguistics or natural language processing problem. Due to its remarkable significance for practical applications, there has been overwhelming growth in both academic research and commercial applications in industry. Unfortunately, to date there is no significant contribution that addresses the problem of sentiment analysis for the Urdu language. Our contribution is the first in this field.

This research performs a deep analysis and survey of the idiosyncratic characteristics of the Urdu language, the challenges posed by these characteristics and their possible effects on the language processing research performed so far, which makes it a worthy reference for new researchers.

The grammatically motivated model uses shallow-parsing-based chunking and very successfully handles the challenging characteristics of the target language (as shown by the results in Chapter 6).

1.3. The Problem of Sentiment Analysis

Although the Urdu language is the object of investigation in this research work, to explain the problem of sentiment analysis we discuss a review written in English, so that it is understandable for non-native readers of this work.

Definition 1: Liu (2010) defines sentiment analysis as the automatic or computational analysis of opinions, emotions and sentiments expressed in user-generated content on the Web.

Definition 2: Sentiment classification is the classification of the given text as positive or negative according to its inherent sentiments.

Example: To establish the problem, we take a laptop review segment; all the sentences

are numbered for later referencing:

“(1) Last month I bought a laptop. (2) It is such a fine manufacture. (3) Its

processing speed is really amazing. (4) The operating system is fantastic too. (5)

Although the battery life is not long, that is acceptable for me. (6) However, mother

is not happy with me as I did not tell her before I bought it. (7) She also thinks the

laptop is too costly, and wants me to replace it with a cheaper one. … ”



The given text conveys a consumer’s (“I”) opinion about a product (the laptop). The consumer is the source of the appraisal and the product is its main target. But when we look at the individual sentences, the targets of appraisal are different features of the main target, and even the sources differ. There are quite a few opinions in this review. Sentences (2), (3) and (4) express positive orientations of the inherent sentiments, while sentences (5), (6) and (7) express negative emotions.

All the appraisals have some targets, which mainly address the central or main target, i.e., the laptop. This main target is addressed indirectly through its features; for instance, in sentences (3), (4), and (5) the target features are “processing speed”, “operating system” and “battery life”, respectively. The expression in sentence (7) is about the cost of the laptop, but the opinion/emotion in sentence (6) is about the consumer “me”, not the product. This is a key point. In a review, the writer may be interested in opinions on various targets, but not on all (e.g., unlikely on “me”). The source of the appraisals in sentences (2), (3), (4) and (5) is the consumer himself, but in sentences (6) and (7) it is “mother”. Table 1.1 summarizes this discussion:

Sentence   Target             Source   Appraisal Expression   Orientation
(1)        None               None     None                   Objective
(2)        laptop             I        such a fine            Positive
(3)        processing speed   I        really amazing         Positive
(4)        operating system   I        fantastic              Positive
(5)        battery life       I        not long               Negative
(6)        me                 Mother   not happy              Negative
(7)        laptop             Mother   too costly             Negative

Table 1.1 Summary of the given review in terms of sentiment analysis.

With this case in mind, we now formally define the sentiment analysis. We start with the

sentiment target.



1.3.1. Targets of the appraisal

In the literature on sentiment analysis, the terms object, target and entity are used to represent the entity that has been commented on. We use the term target only. The targets of the appraisals can be anything; in general, these can be products, services, individuals or personalities, organizations or businesses, events or happenings, discussion topics, etc.

A target can have a set of features f; these features represent the components or parts of the target as well as its attributes or properties (Liu, 2010). For example, the laptop’s features under consideration are its “manufacture quality”, “processing speed”, “operating system”, “battery life” and “cost”. Among these features, “manufacture quality”, “processing speed” and “cost” are attributes, whereas “operating system” and “battery life” are its components.

Definition 3: In the given review a target is an entity about which the positive or negative

sentiments are expressed by the reviewer. This can be a person, product, topic, event, or

organization.

Example: In the above example, the target of the overall review is the laptop; at the sentence level its features act as targets, but the opinions expressed about these features indirectly address the main target. In sentence (6), the appraisal diverts the analyzer to another target, “me”. From this example we make the following two assumptions about the targets:

Assumptions:

1. The effect of positive or negative appraisal for all the features of a main target in a

given review is combined to make the final effect.

2. All the other targets are discarded along with their appraisal orientations.
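The two assumptions can be illustrated with a minimal Python sketch over the rows of Table 1.1. The unit scores (+1 for positive, -1 for negative), the "neutral" tie case and the function name are illustrative assumptions only; they are not the scoring scheme of the framework, which is described in Chapter 5.

```python
# Rows of Table 1.1 as (sentence number, target, orientation).
APPRAISALS = [
    (2, "laptop", "positive"),
    (3, "processing speed", "positive"),
    (4, "operating system", "positive"),
    (5, "battery life", "negative"),
    (6, "me", "negative"),
    (7, "laptop", "negative"),
]

# The main target together with the features that indirectly address it.
MAIN_TARGET_AND_FEATURES = {
    "laptop", "processing speed", "operating system", "battery life",
}

# Assumed unit scores; the real lexicon carries graded polarities.
SCORE = {"positive": 1, "negative": -1}


def review_orientation(appraisals, relevant_targets):
    """Combine the appraisals of the main target and its features
    (Assumption 1) and discard appraisals of other targets (Assumption 2)."""
    total = sum(
        SCORE[orientation]
        for _, target, orientation in appraisals
        if target in relevant_targets
    )
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"  # tie case, added here only for completeness


print(review_orientation(APPRAISALS, MAIN_TARGET_AND_FEATURES))  # prints "positive"
```

Sentence (6) is dropped because its target, “me”, is not a feature of the laptop; the remaining three positive and two negative appraisals combine to a positive overall orientation under these assumed scores.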

Noun phrases as Targets: The targets of the appraisal are basically the non-overlapping noun phrases in the given review. Noun phrases are units of one or more linked words, with a noun as the head word and all other words as its dependents. Hence, the algorithm extracts targets with nouns as head words. For this purpose it uses shallow-parsing-based chunking. See Chapter 4 and Chapter 5 for more explanation.
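As a rough illustration of chunking noun phrases around a noun head, the following Python sketch groups runs of determiners, adjectives and nouns in an already POS-tagged English sentence. The tag names (DET, ADJ, NOUN), the function name and the rule that the last noun of a run is the head are assumptions made for this example; the actual chunker for Urdu is described in Chapters 4 and 5.

```python
NP_TAGS = {"DET", "ADJ", "NOUN"}  # assumed tags allowed inside a noun phrase


def chunk_noun_phrases(tagged_tokens):
    """Return non-overlapping (phrase, head) pairs from (word, tag) tokens.

    A chunk is a maximal run of NP_TAGS tokens, trimmed so that it ends
    with a noun, which is taken as the head word.
    """
    chunks, current = [], []
    # A sentinel token flushes the last open chunk at the end of the sentence.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        # Drop trailing determiners/adjectives that have no noun head.
        while current and current[-1][1] != "NOUN":
            current.pop()
        if current:
            phrase = " ".join(w for w, _ in current)
            chunks.append((phrase, current[-1][0]))
        current = []
    return chunks


sentence = [("The", "DET"), ("operating", "NOUN"), ("system", "NOUN"),
            ("is", "VERB"), ("fantastic", "ADJ"), ("too", "ADV")]
print(chunk_noun_phrases(sentence))  # [('The operating system', 'system')]
```

The adjective “fantastic” is not returned as a target because its run contains no noun head; it would instead be a candidate appraisal expression.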



1.3.2. Sources of the appraisal

The source of an appraisal is also called the opinion source or opinion holder. In the case of product reviews and blogs, the source of appraisal is usually the reviewer or the author of the post. In this case, the presence of other sources very rarely affects the final classification. But in more complicated texts, like news articles, these sources are explicitly stated as a person or an organization that holds a particular opinion. For example, the source of appraisal in the sentence “The president has disapproved the political situation in the country” is “The president”.

Definition 4: In the given review, the source of an appraisal is the person or organization that expresses the opinion.

Example: For sentences (3), (4) and (5), the source of appraisal is the reviewer, the main source. In sentences (6) and (7), a secondary source, “mother”, is introduced.

Assumptions:

In the given review, only the main source is responsible for generating appraisal

expressions and opinions given by other sources are discarded.

1.3.3. Appraisal expressions

The appraisal expressions or opinions or subjective expressions are mostly based on

adjectives. Some adjectives are general and can be used to modify the features of the

targets of appraisal.

Definition 5: An appraisal expression modifies a feature with a positive or negative view,

attitude, emotion or appraisal from a source of the appraisal.

Example: The expression “really amazing” modifies the feature “processing

speed” of the laptop.

Assumptions:

1. The appraisal expressions are always adjective based, these can be single word based

(only adjective) or multiple words based (adjectival phrase).

2. In the given text only those appraisal expressions are considered which make an

association with a specific target.



3. For a given review, only the appraisal expressions generated by the main source are

considered.

Appraisal Expressions as SentiUnits: In our approach (presented in next Chapters), we

label the appraisal expressions as the SentiUnits. For extraction of the SentiUnits the

algorithm first identifies the subjective words according to their orientation scores

(positive or negative). Then, it attaches the polarity shifters (words which shift the

polarity or orientation of the inherent sentiment; for more detail see Chapter 4),

conjunctions, postpositions and modifiers to extract the appraisal expressions from the

opinionated sentences. The shallow parsing based chunking is applied for the extraction

of the SentiUnits, with adjectives as the head words. The overall polarity of a sentence in

a given review can be determined by computing the polarity of these expressions. These

concepts are explained in detail in Chapters 4 and 5.
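The extraction steps described above can be illustrated with a minimal Python sketch. It is not the system's EXTRACTOR: the toy English lexicon, the shifter and modifier lists, the weights and the function names are all hypothetical, and a simple leftward scan from each adjective stands in for shallow-parsing-based chunking.

```python
# Hypothetical toy resources; the actual system relies on an Urdu lexicon (Chapter 5).
LEXICON = {"amazing": 1.0, "fantastic": 1.0, "long": 0.5, "happy": 0.5}
SHIFTERS = {"not", "never"}                 # polarity shifters flip the orientation
MODIFIERS = {"really": 1.5, "very": 1.5}    # modifiers scale the score (assumed weights)
ATTACHABLE = SHIFTERS | set(MODIFIERS)


def extract_sentiunits(tokens):
    """Return (phrase, polarity) pairs, one per subjective head word."""
    units = []
    for i, word in enumerate(tokens):
        if word not in LEXICON:
            continue
        score = LEXICON[word]
        start = i
        # Walk left while the preceding words are shifters or modifiers.
        while start > 0 and tokens[start - 1] in ATTACHABLE:
            start -= 1
            previous = tokens[start]
            if previous in SHIFTERS:
                score = -score
            else:
                score *= MODIFIERS[previous]
        units.append((" ".join(tokens[start:i + 1]), score))
    return units


def sentence_polarity(tokens):
    """Sentence polarity as the sum of the polarities of its SentiUnits."""
    return sum(score for _, score in extract_sentiunits(tokens))
```

For instance, `extract_sentiunits("the processing speed is really amazing".split())` yields one unit, “really amazing”, with a positive score, while “not long” receives a negative one.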

1.3.4. Orientation

The sentiment classification of the review starts from the word level. Each word is

classified as subjective or objective; further, each subjective word is identified as positive

or negative. This positivity or negativity of the word or phrase is called its orientation.

The words with positive orientation exhibit positive sentiments or a supportive opinion.

This orientation can have a certain force or strength, called its intensity. For example, the

words “good” and “better” both have positive orientation, while the intensity of the latter

one is greater.

Definition 6: The positivity or negativity of the appraisal expression for a specific target

or its feature is called its orientation.

Example: In the given example, appraisal expressions with positive orientation are “such

a fine”, “really amazing” and “fantastic”, while “not long” and “not happy” exhibit

negative orientation.

1.4. Sentiment annotated lexicon

Our approach for sentiment analysis is lexicon based (entries are annotated with

orientation scores, represented as polarities). A typical model of a sentiment analyzer



incorporates two components: (i) the classification algorithm, which analyzes and

classifies the given opinionated text according to inherent sentiments of the reviewer, and

(ii) the lexicon or lexicons annotated with the prior polarities of the lexical entries

(words/ phrases), usually as positive or negative. These prior polarity annotated lexicons

are also called sentiment-annotated lexicons (Pang and Lee, 2008).

Model: At the highest level, our lexicon model categorizes all the lexical entries into

objective and subjective terms. Objective terms have no orientation or intensity and hence are not

marked with the prior polarity scores. Therefore, they have no effect on the

overall decision of the classification. In contrast, subjective terms are the carriers of

the sentiments and are marked with polarity scores. Their occurrence can affect or even

altogether alter the final classification decision. With respect to orientation and intensity,

the subjective terms are further categorized into three types (this model is explained in

detail in Chapter 5):

1. Absolute subjective terms with orientation only.

2. Subjective terms with intensity only.

3. Subjective terms with both values of orientation and intensity.
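The categories above can be pictured as a small data structure. The following Python sketch is only an illustration: the class name, the field names, the numeric encoding and the English placeholder words are assumptions, not the lexicon format presented in Chapter 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    """One lexical entry; a missing value (None) means the annotation is absent."""
    word: str
    orientation: Optional[int] = None    # assumed encoding: +1 positive, -1 negative
    intensity: Optional[float] = None    # assumed encoding: strength of the sentiment

    @property
    def is_subjective(self) -> bool:
        return self.orientation is not None or self.intensity is not None

    @property
    def category(self) -> str:
        if not self.is_subjective:
            return "objective"
        if self.intensity is None:
            return "orientation only"
        if self.orientation is None:
            return "intensity only"
        return "orientation and intensity"
```

Under this encoding an objective entry carries neither annotation, so it contributes nothing to the classification decision.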

So far we have explored the problem of sentiment analysis through examples

and basic definitions. Here, we define the research problem.

1.5. Problem statement

Let us denote the review under consideration, written in Urdu, as $R$. $R$ consists of a single sentence

or of multiple sentences, among which some are subjective sentences (which

contain appraisal expressions, their targets and sources), forming the set $S_s = \{S_{s1}, S_{s2}, S_{s3}, \ldots, S_{sk}\}$,

and others are objective (without appraisal expressions, their targets and sources), forming the set

$S_o = \{S_{o1}, S_{o2}, S_{o3}, \ldots, S_{ol}\}$, such that

$R = \{S_{s1}, S_{s2}, S_{s3}, \ldots, S_{sk}\} \cup \{S_{o1}, S_{o2}, S_{o3}, \ldots, S_{ol}\}$,

where

$k = 1, 2, 3, \ldots, n$;

$l = 1, 2, 3, \ldots, m$;

$n$ and $m$ are finite numbers.



AEXTRACTOR, and ASSOCIATOR modules of the system (presented in Chapter 5).

The final polarity of the review, $P_R$, is calculated as a sum of all sentence polarities by the

CLASSIFIER:

$P_R = \sum_{i=1}^{N} P_{si}$,

where

$i = 1, 2, 3, \ldots, N$;

$N$ is a finite number.
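As a minimal illustration of this sum, the following Python sketch adds up given sentence polarities. The function names are hypothetical, and the decision rule (the sign of the total, with a zero total reported as neutral) is an assumption made for the example rather than the CLASSIFIER's documented behavior.

```python
def review_polarity(sentence_polarities):
    """Polarity of the review as the sum of its sentence polarities."""
    return sum(sentence_polarities)


def classify_review(sentence_polarities):
    # Assumption: the sign of the total decides the class; zero is reported as neutral.
    total = review_polarity(sentence_polarities)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```

Objective sentences can be passed with polarity 0, so they leave the total unchanged.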

Research goal: Hence, the goal of this research is to develop an integrated sentiment

analysis model for Urdu text. To achieve this goal, we formulate the following

objectives:

To design and develop a sentiment-annotated Urdu lexicon, which includes

information about the subjectivity of an entry in addition to its orthographic,

phonological, syntactic, and morphological aspects. Unfortunately, no such

lexicon has been developed or made available to date. Hence, from conception to modeling and

then implementation, we have to cope with this challenging task as a prerequisite for

the final system model (Syed et al. 2010), (Syed et al. 2012).

To construct an appropriate classification model (capable of handling the context

sensitive orthography, morphological operations and grammatical rules of the Urdu

language) for the processing and classification of the text in accordance with the

inherent sentiments. The algorithms applied to other languages like English (Pang

and Lee 2008), (Wiebe et al. 2004), (Bloom and Argamon 2010), Chinese (Jang and

Shin 2010), or Arabic (Abbasi et al. 2008) cannot be applied directly to Urdu, due to

its morphological complexity and the issues discussed in Chapter 3.

To evaluate that model using the lexicon and compare the results. For the

experimentation, we use a sentiment-annotated lexicon of Urdu words and two corpora

of reviews about movies and electronic appliances as test beds.

1.6. System Evolution

The sentiment-annotated lexicon based classifier presented in our paper (Syed et al.

2010) focuses on:



Task 1: The extraction of the SentiUnits

Task 2: Computation of the polarity scores of the sentences according to the extracted

SentiUnits

Task 3: Classification of the review according to these polarity scores

This approach is good for handling sentences with single targets. In other words, it

can only handle simple opinions in which all the opinionated expressions are associated

with one object or target. The presence of multiple targets, as in comparative sentences

where two different targets are compared, may lead to a misclassification error, e.g., “It is

hard to rank 300 among the outstanding movies like Brave Heart, or Ben-Hur.” In this

case, the analyzer may misclassify the comment, because the expression “outstanding” is

positive and is by default associated with the movie “300”, which is presented for review.

This happens because the analyzer does not establish an expression-to-target link. The positive

expression “outstanding” should be linked with the movies “Brave Heart” and “Ben-Hur”

instead of the reviewed movie “300”.

To handle this kind of misclassification in complex sentences like comparatives, we

extend this model to introduce the concept of SentiUnit-to-target associations. In this

approach, we emphasize the exact identification of the SentiUnits as well as their

targets. To minimize the misclassification rate, these targets are associated with the

SentiUnits. For this purpose, we incorporate a new module called the ASSOCIATOR. The

EXTRACTOR module uses shallow parsing based chunking to extract the SentiUnits and

the targets (Syed et al. 2010). The ASSOCIATOR module uses a dependency parsing

based algorithm to associate each SentiUnit with its respective target (Syed et al. 2012).
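The effect of the SentiUnit-to-target link can be sketched in a few lines of Python. This is not the ASSOCIATOR algorithm: the function names are hypothetical, and the dependency relation is written by hand here instead of being produced by a dependency parser.

```python
def associate(sentiunits, modifies, default_target):
    """Map each SentiUnit to the target it modifies.

    modifies: dict from a SentiUnit to the target it depends on, as a
    dependency parser might report it (hand-written in this sketch).
    Units without a known link fall back to the reviewed item.
    """
    return {unit: modifies.get(unit, default_target) for unit in sentiunits}


def target_polarity(links, scores, target):
    """Sum the scores of the SentiUnits linked to one target."""
    return sum(scores[unit] for unit, linked in links.items() if linked == target)


# The comparative example from the text, with an assumed score for "outstanding".
links = associate(
    ["outstanding"],
    {"outstanding": "Brave Heart, Ben-Hur"},
    default_target="300",
)
scores = {"outstanding": 1.0}
```

With the link in place, “outstanding” adds to the polarity of the compared movies and no longer counts in favor of the reviewed movie “300”. Without it, the default association would credit “300”.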

After implementation of the final version, we evaluate the system on the corpus of

reviews about movies and electronic appliances. We use four classification performance

metrics, i.e., precision, recall, and F-measure, in addition to accuracy. In comparison with

the previous versions, the results are considerably improved, with an accuracy of 82.5%,

particularly for sentences with multiple targets.
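For reference, the four metrics can be computed from gold and predicted labels as in the following Python sketch. The function name is hypothetical, the F-measure is taken to be the balanced F1 score (an assumption), and any labels passed to it are illustrative inputs, not the experimental data of Chapter 6.

```python
def evaluate(gold, predicted, positive="positive"):
    """Return (accuracy, precision, recall, F-measure) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)   # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)   # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)   # false negatives
    correct = sum(g == p for g, p in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure
```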

1.7. Dissertation Outline

The chapter-wise division of this dissertation is given below:



Chapter 2 gives a comprehensive overview of the state of the art research in the field of

sentiment analysis and Urdu language processing. It discusses features, approaches, and

techniques used for the development of the sentiment analyzer at different levels for

different languages.

A complete overview of the Urdu language, which is the main object of this research, is

given in Chapter 3. As Urdu is an entirely different language from well explored

languages like English, we explain its characteristic features, such as orthography,

morphology, syntax and grammar, in more detail to make the following chapters easier to

understand.

Chapter 4 describes the concept of the SentiUnits, or appraisal expressions. Some

examples and their description augment the explanation of the structure of the SentiUnits.

The overall system’s implementation, modules and their diagrams are given in Chapter 5.

This chapter also explores the construction, integration and model of the sentiment

annotated lexicon of Urdu words.

Chapter 6 presents experimentation and results. For performance evaluation of the

sentiment analysis system, the experiments are performed on real corpora of user

reviews. For this purpose, review corpora are collected and sentiment annotated

lexicons are developed.

Finally, Chapter 7 concludes our research contribution with some discussion points and

indications of the future endeavors.

Chapter review:

In this Chapter we defined some basic terminologies used in the task of sentiment

analysis. Using these terms we formulated our problem statement, stated the objectives,

goals and main contributions of the research.



CHAPTER 2

STATE OF THE ART RESEARCH

The field of sentiment analysis is the center of attention for the researchers from

information retrieval, data mining, computational linguistics, and many other related

areas. There is a rapid growth of interest and the foregoing efforts have covered a broad

range of the tasks, for example, polarity classification (Pang et al. 2002), (Turney 2002),

opinion identification (Pang and Lee 2004), and opinion source assignment (Breck et al.

2007), (Choi and Cardie, 2008). Additionally, these contributions have attempted the

problem at different granularity levels. For instance, the contribution in (Pang et al. 2002)

attempts the sentiment classification task at the document level. (Pang and Lee 2004)

explores sentence level classification, while (Turney 2002) and (Choi and Cardie, 2008)

emphasize phrases. The literature survey given in this chapter covers major aspects of

SA research and also gives a detailed overview of the contributions to the language

processing of morphologically rich languages, with Urdu as a special focus.

To present a precise literature survey for SA and Urdu language processing, we focus

on the following major aspects:

1. Features of the given text

2. Techniques

3. Sentiment annotated lexicon construction

4. Generalization among domains

5. Processing of Morphologically Rich Languages

6. Urdu Language Processing

7. Adjective based SA techniques

8. Term level vs. phrase level polarity

9. Negation Handling in SA



2.1. Features of the given text

Researchers have focused on a number of features of the given text for achieving better

classification results. These features are encoded into feature vectors for the proper

application of machine learning algorithms (Pang and Lee 2008). Thus, feature selection

is a critical task and can affect the results to a great extent. Syntactic, semantic, linking

based, term based, topic oriented and part of speech based features are frequently used in

the literature. In the following, we discuss four categories: part of speech (POS) based,

term based, syntactic, and topic oriented.

The POS based information, particularly, of adjectives, can help a lot in sentiment

analysis. That is why the earliest work in this domain uses adjectives as subjectivity

indicators (Hatzivassiloglou and McKeown 1997). After that, (Hatzivassiloglou and

Wiebe 2000; Mullen and Collier 2004), and (Whitelaw et al. 2005) present their

approaches to handle adjectives using multiple techniques. (Turney 2002) argues that

adverbs are also carriers of sentiments in a sentence and should be considered in

combination with adjectives. The sentences are divided into pre-structured grammatical

patterns, which include adjectives and adverbs as the core words. (Riloff et al. 2003)

attempts a relatively new idea and proposes the analysis of nouns in the text. It

emphasizes the concept of subjective nouns and computes the orientation for the

phrases in the sentence which contain them.

Many works are available in which term based features are considered. For example, the

position of the term in a sentence is put forward as a feature by (Kim and Hovy 2006).

This work locates the specific terms, and then, according to their position, it computes

subjectivity orientation. Another work, (Wiebe et al. 2004), applies the concept of hapax

legomena, i.e., words occurring only once in a given corpus, for feature selection.

It proposes that the words that appear only once in the corpus are more subjective

than the others. In addition to this feature, it uses a relatively complex syntactic feature,

i.e., collocations of the words in a sentence. If some words or terms co-occur more

frequently than usual, then these are considered as collocations. According to (Yang et



al. 2006) the terms which are rare and are not entered in a pre-existing dictionary tend to be

more subjective, because the reviewers use them to emphasize their opinion.
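The notion of hapax legomena is easy to make concrete. The following Python sketch, with a hypothetical function name and whitespace tokenization as a simplifying assumption, collects the words that occur exactly once in a small corpus. It only identifies the candidate terms and does not reproduce the feature selection of the cited work.

```python
from collections import Counter


def hapax_legomena(sentences):
    """Return the set of words that occur exactly once in the corpus."""
    counts = Counter(
        word for sentence in sentences for word in sentence.lower().split()
    )
    return {word for word, n in counts.items() if n == 1}
```

For the two sentences “the plot is dull” and “the cast is superb”, the words “the” and “is” occur twice and are excluded, while the remaining four words are hapax legomena.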

Table 2.1

Features used and their respective contributions.

Type | Focused features | Contributions
Term based | Term presence and position | Pang et al. (2002)
Term based | Bigrams and trigrams | Dave et al. (2003)
Term based | Hapax legomena | Wiebe et al. (2004)
Term based | Rare terms for emphasis | Yang et al. (2006)
Term based | Term position | Kim and Hovy (2006)
Term based | Term frequency | Abdul-Mageed, Korayem (2010)
Term based | Contrastive distance in terms | Kennedy, Inkpen (2006); Snyder, Barzilay (2007)
Syntax based | Collocations | Riloff, Wiebe (2003); Wiebe et al. (2004)
Syntax based | Appraisal expressions | Whitelaw et al. (2005)
Syntax based | Valence shifters | Kennedy, Inkpen (2006)
Syntax based | Noun adjective dependency | Bloom, Argamon (2010)
POS based | Adjectives | Hatzivassiloglou, McKeown (1997); Hatzivassiloglou, Wiebe (2000); Mullen, Collier (2004); Whitelaw et al. (2005)
POS based | Adjective and adverb | Turney (2002)
POS based | Subjective noun | Riloff et al. (2003)
Topic | Reference to the topic | Mullen, Collier (2004)



(Pang et al. 2002) reports better performance using “presence of term” as a binary-valued

feature vector, whose entries merely specify whether a term occurs (1) or not (0). In contrast, in a

term frequency feature vector, entry values increase with the occurrence frequency of the

corresponding term (Abdul-Mageed and Korayem 2010). Bigrams and trigrams are used

by (Dave et al. 2003). (Kennedy and Inkpen 2006) and (Snyder and Barzilay 2007)

consider contrastive distance between terms as an automatically computed feature.

(Whitelaw et al. 2005) uses the concept of appraisal theory and extracts appraisal

expressions with the help of a sentiment lexicon. (Mullen and Collier 2004) observes that

the sentences which contain a reference to the topic can be considered more important.

For this purpose, it specifies words and phrases which can be extracted as

indicators of the reference. The features discussed above and the related contributions, with

some further examples, are summarized in Table 2.1.
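The difference between a term-presence vector and a term-frequency vector can be shown with a short Python sketch. The function names and the tiny vocabulary are hypothetical and serve only to illustrate the two encodings discussed above.

```python
from collections import Counter


def presence_vector(tokens, vocabulary):
    """Binary vector: 1 if the vocabulary term occurs in the text, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def frequency_vector(tokens, vocabulary):
    """Count vector: number of occurrences of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

For the vocabulary `["good", "bad", "plot"]` and the text “good good plot”, the presence vector is `[1, 0, 1]` and the frequency vector is `[2, 0, 1]`.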

2.2. Techniques

There are a number of techniques used for sentiment analysis, e.g., unsupervised

bootstrapping, sentiment lexicons and support vector machines (see Table 2.2). In the

unsupervised bootstrapping approach, a primary or initial classifier is applied on the text to

generate labeled data as the output. After that, a supervised learning algorithm may be

applied on this data. The initial classifier can have various implementation possibilities,

according to the language complexity and depth of the required analysis. An example of

such an initial high-precision classifier to learn extraction patterns for subjective terms is

proposed by (Riloff and Wiebe 2003). (Kaji and Kitsuregawa 2007) uses this method for

the automatic construction of a corpus based on HTML documents, in which polarity

labels are assigned to the entries.
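The two stages of the bootstrapping approach can be sketched as follows in Python. Everything in this block is an assumption made for illustration: the seed cue words, the abstaining rule used as the high-precision initial classifier, and the word-count vote, which is a deliberately crude stand-in for a real supervised learner.

```python
from collections import Counter

POSITIVE_CUES = {"excellent", "superb"}   # hypothetical seed words
NEGATIVE_CUES = {"terrible", "awful"}


def seed_label(tokens):
    """Initial classifier: label only when the cues point one way, else abstain."""
    pos = sum(t in POSITIVE_CUES for t in tokens)
    neg = sum(t in NEGATIVE_CUES for t in tokens)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None


def bootstrap(unlabeled):
    """Label the confident sentences, then collect word counts per class."""
    counts = {"positive": Counter(), "negative": Counter()}
    for sentence in unlabeled:
        tokens = sentence.lower().split()
        label = seed_label(tokens)
        if label is not None:
            counts[label].update(tokens)
    return counts


def classify(sentence, counts):
    """Second stage: vote with the word counts learned from the labeled data."""
    tokens = sentence.lower().split()
    pos = sum(counts["positive"][t] for t in tokens)
    neg = sum(counts["negative"][t] for t in tokens)
    if pos == neg:
        return None
    return "positive" if pos > neg else "negative"
```

The second stage can label sentences that contain none of the seed cues, because it has learned from the words that co-occurred with them.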

(Hatzivassiloglou and Wiebe 2000; Turney 2002; Yu and Hatzivassiloglou 2003; Riloff

et al. 2003) and (Higashinaka et al. 2007) employ sentiment-annotated lexicon induction

technique. As a first step, an unsupervised approach is applied for the generation of a

sentiment-annotated lexicon. Then using this as a resource, the given text is classified as

positive or negative.



(Hu and Liu 2004) and (Andreevskaia and Bergler 2006) use Princeton WordNet for

the extraction of sentiment tags. There is also a trend in the research community to extend

existing lexicons, e.g., SentiWordNet is an extension of WordNet.

Table 2.2.

Techniques used by different contributions.

Technique used | Contributions
Unsupervised bootstrapping | Riloff, Wiebe (2003); Kaji, Kitsuregawa (2007)
Sentiment annotated lexicon | Hatzivassiloglou, Wiebe (2000); Turney (2002); Yu, Hatzivassiloglou (2003); Riloff et al. (2003); Higashinaka et al. (2007)
Support vector machines (SVM) | Pang, Lee (2002); Dave et al. (2003); Pang, Lee (2004); Kennedy, Inkpen (2006)
WordNet based | Hu, Liu (2004); Andreevskaia, Bergler (2006)

2.3. Sentiment-annotated-lexicon construction

As we are using the lexicon based approach for the development of the sentiment

analyzer, we discuss here some contributions from this aspect of the research. Lexicon

construction with an appropriate coverage is a challenging task. From the definition of grammar

rules to their appropriate implementation, it requires much expertise and proficiency

about the target language as well as the computer algorithms. For the task of sentiment

analysis the entries of these lexicons are annotated with the orientation scores in addition



to their morphological, grammatical and phonological information. This sentiment

annotation task can either be done manually, with the agreement of judges who

decide on the orientation scores of the given words, or it can be done

automatically, using computer algorithms such as machine learning approaches.

Manual annotation provides higher accuracy but is more time consuming.

The languages which are more popular on the internet have rich and easily available

electronic linguistic resources. An example is English, for which almost all types

of corpora are available from almost all domains, i.e., from product reviews to news

discussions. That is why the sentiment analysis research community has moved to

algorithms and approaches which can help in the automatic generation of lexicons as

an alternative to manual annotation and tagging. Examples include (Annett and Kondrak

2008; Higashinaka et al. 2007; Andreevskaia and Bergler 2006; Hu and Liu 2004; Yu and

Hatzivassiloglou 2003; Riloff et al. 2003; Turney 2002) and (Hatzivassiloglou and Wiebe

2000). These methods are fast and can rapidly develop domain dependent lexicons.

Going back to the history of sentiment annotated lexicon construction, General Inquirer

(Stone et al. 1966) is a popular resource for sentiment analysis of the English language and is

manually compiled. A pioneering attempt at automatic acquisition of a sentiment

annotated lexicon is (Hatzivassiloglou and McKeown 1997). This work develops a

sentiment-annotated lexicon with an emphasis on adjectives. The authors apply a shallow parsing

algorithm and develop a log-linear statistical model. This model predicts whether

any two adjectives have the same orientation. After that, automatic acquisition of the polarity

values of words and phrases itself appeared as an active line of research. Diverse

techniques have been proposed and implemented for learning the word polarities. These

include corpus-based approaches like (Hatzivassiloglou and McKeown 1997), statistical

approaches based on measures of word association, as proposed in (Turney and Littman

2003), and the use of lexical relationships (Kamps et al. 2004).

Some efforts have tried to use or extend the existing lexicons, e.g., SentiWordNet is an

extension of WordNet. In SentiWordNet the polarity marks are annotated with the

existing structure of the gloss. (Annett and Kondrak 2008), (Andreevskaia and Bergler



2006) and (Hu and Liu 2004) utilize WordNet or its extensions for the sentiment analysis.

Moreover, (Hatzivassiloglou and Wiebe 2000; Turney 2002; Yu and Hatzivassiloglou

2003; Riloff et al. 2003) and (Higashinaka et al. 2007) have tried to develop algorithms

and techniques for automatic lexicon construction using unsupervised learning methods.

All these discussed contributions are summarized in Table 2.3.

Most of these efforts use pre-developed linguistic resources like corpora for the

development and extraction of the required lexicons. But Urdu is a resource-poor language,

and hence the task of lexicon construction becomes more difficult and time consuming.

To our knowledge, no such lexicon exists for Urdu text. However, there are a few

efforts that have tried to construct corpora and simple lexicons for other NLP

applications.

Table 2.3.

Lexicon construction research.

Research focus | Contributions
Manually compiled | Stone et al. (1966)
Corpus based | Hatzivassiloglou and McKeown (1997); Turney and Littman (2003); Kamps et al. (2004)
Extension of existing lexicons | Annett and Kondrak (2008); Andreevskaia and Bergler (2006); Hu and Liu (2004)
Unsupervised learning methods | Hatzivassiloglou and Wiebe (2000); Turney (2002); Yu and Hatzivassiloglou (2003); Riloff et al. (2003); Higashinaka et al. (2007)



Preliminary work is presented by the EMILLE (Enabling Minority Language

Engineering) project in the form of a multi-lingual corpus for the South Asian languages.

A parallel corpus for Hindi, Urdu, English, Bengali, Punjabi and Gujarati

contains about 200,000 words (Baker et al. 2003). Their independent corpus of Urdu text

has 1,640,000 words annotated with POS tags (Hardie 2003).

Another effort is presented in (Ijaz and Hussain 2007). They use a corpus to automatically

develop an Urdu lexicon. Their corpus is based on cleaned text from news websites,

containing about 18 million words. The work (Muaz et al. 2009) gives a brief analysis of

the parts of speech of the Urdu language and develops a POS tagged corpus, whereas another

effort (Mukund et al. 2010) generates a semantic role labeled corpus for Urdu text using

cross lingual projections. (Humanyoun et al. 2007) presents the

automatic extraction of an Urdu lexicon from a corpus. Table 2.4 shows

the corpora and lexicons developed for the Urdu language for different NLP applications.

Table 2.4.

Corpuses and lexicons for Urdu language.


POS tagged corpora | Hardie (2003); Muaz et al. (2009)
Corpus based lexicon construction | Ijaz and Hussain (2007); Humanyoun et al. (2007)
Semantic role labeled corpus | Mukund et al. (2010)

2.4. Generalization among domains

The generalization of sentiment analysis solutions among multiple domains is still an

open issue. The term domain adaptation was coined by the SA community (Tan et al. 2009)

to refer to the development of a generalized solution which can be applied to all the

potential target domains. Most of the contributions for opinion mining are highly domain



specific (Pang and Lee 2004). (Tan et al. 2009) handles the domain adaptation issue using

the frequently co-occurring entropy (FCE) method. It emphasizes a smooth transformation

from a domain d1 to another domain d2 through a set of generic features F, representing

d1 and d2. It evaluates the model for six domains and finally concludes that FCE is not the

best option. Another feature related to multiple domains is their complexity level.

Sentiment analysis of reviews related to products and movies is considered the easiest

in the literature (Pang and Lee 2004), and these reviews serve as a test bed for most of the

approaches. On the contrary, political speeches and discussions are perhaps the most

complex to handle. (Bansal et al. 2008) identifies an issue and evaluates whether a

speech is in favor of it or in opposition to it.

2.5. Processing Morphologically Rich Languages

Morphologically rich languages (MRLs) are a challenging domain for NLP

researchers. Still, there are a number of noteworthy contributions. For example, a

stemming model for classical Arabic in the Holy Quran is presented by (Thabet 2004). This

work uses a stop-word list and makes lists of words from every surah. Both lists are

compared, and when some words in the created list do not exist in the stop-word list,

the algorithm removes their prefixes. The accuracy of the algorithm is 99.6% for prefix

stemming and 97% for postfix stemming. (Paik and Parui 2008) presents a general

analysis of the languages spoken in India, particularly Marathi, Hindi, and Bengali. In

this work, the lexical entries are grouped into similarity classes by matching their

prefixes. This match is done with respect to a predefined length. Another stemmer

for the Hindi language is proposed by (Kumar and Siddiqui 2008), which computes n-grams

of the words with the given length. The algorithm treats these n-grams as the postfixes

and extracts the possible stems with postfixes. Finally, the combination of postfix and

stem with maximum probability is picked, with a reported accuracy of 89.9%. A Telugu

language based stemmer in (Akram et al. 2009) presents statistical techniques and

suggests that this MRL requires deeper linguistic analysis for improved results.
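The idea of building similarity classes by matching prefixes of a predefined length can be sketched in Python as follows. The function name and the romanized example words are illustrative assumptions, and the sketch covers only the grouping step, not the full stemmer of the cited work.

```python
from collections import defaultdict


def prefix_classes(words, prefix_length):
    """Group words whose first prefix_length characters match."""
    classes = defaultdict(list)
    for word in words:
        classes[word[:prefix_length]].append(word)
    return dict(classes)
```

With a prefix length of 4, the romanized forms “likhna”, “likhta” and “likhi” fall into one class and “parhna” and “parhta” into another. A word shorter than the chosen length forms a class under its own spelling.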



Orthographically and grammatically, the Urdu and Persian languages have many similarities.

This is because a large part of their vocabulary matches. (Sharifloo and Shamsfard

2008) present a rule-based bottom up algorithm for stemming of Persian text. The

algorithm first extracts the core substring of the words, and compares them with already

defined cores using some grammar rules. This matching of the strings is done by the

already defined morpheme clusters. Moreover, the accuracy is enhanced to about 90.1%

by applying an anti-rule-procedure.

There are some noteworthy contributions for handling sentiment analysis in MRLs,

for example, (Abdul-Mageed and Korayem 2010) and (Abbasi et al. 2008) for Arabic,

and (Jang and Shin 2010) for the Chinese language. The work presented in (Abdul-Mageed

and Korayem 2010) is for sentiment analysis of Arabic text. In this work, the

main focus is on the Arabic text related issues for the development of a practical analyzer

with acceptable performance. It analyzes news text by automatic classification at the

sentence level. It applies a support vector machines classifier. Another related work is

(Abbasi et al. 2008). It performs sentiment analysis of Arabic and English web forums.

Its emphasis is on the extremist opinion propagation. For handling Arabic language’s

characteristics, it proposes specific feature extraction components. It develops Entropy

Weighted Genetic Algorithm (EWGA), a hybridized genetic algorithm that incorporates

the information gain heuristic for feature selection, i.e., stylistic and syntactic features.

This algorithm improves the system performance by selecting better key features.

2.6. Sentiment analysis and Urdu language processing

Due to the idiosyncratic linguistic features of the Urdu language and an exclusive set of

morphological and grammatical rules, computer based processing of Urdu is not a very

well explored dimension. Our own contributions address the subjectivity or sentiment

analysis of Urdu text (Syed et al. 2010; Syed et al. 2011; Syed et al. 2012).

In this section, we present a brief survey of the major NLP contributions for the Urdu

language which are useful for sentiment analysis:



2.6.1. Word segmentation

For all computational linguistics applications, the first task is to segment the sentence into

words, i.e., to accurately identify the word boundaries. Due to the cursive and

context sensitive orthography of the Urdu language, this word segmentation is not as

trivial as for English or French, where the word boundaries are identified through white

spaces. (Durrani and Hussain 2010) identify this segmentation issue as a major problem

in the accurate processing of the text and give a detailed discussion about the discovery

of the inherent causes. This discussion is concluded by providing a word segmentation

model. (Lehal 2010) and (Lehal 2009) also provide an algorithm in this regard.

2.6.2. Phrase Chunking

A variety of phrases exist in Urdu including verb phrases, noun phrases, and adjectival

phrases. Identifying these phrases in a sentence is very helpful for various applications in

NLP, like, information retrieval or extraction, parsing, sentiment analysis, machine

translation, and question answering. The procedure which directly tags these phrases is

called phrase chunking or simply chunking. For Urdu phrase chunking, a very prominent

contribution is (Ali and Hussain 2010), which describes the structure of Urdu verb

phrases, and applies a series of experiments to automatically label them. It uses a

manually tagged corpus of 100,000 Urdu words with verb phrase chunk tags. The

reported results of this effort give 98.44% accuracy. It uses a hybrid approach with an

extended tag set.

As Urdu and Hindi are very similar morphologically, we discuss two contributions for

phrase chunking of Hindi text (Singh et al. 2005; Dalal et al. 2006). In the former, an

HMM based chunk tagger is presented for the Hindi language. The chunk tagging is divided

into two sub tasks: the identification of the chunk boundaries and then the labeling of the

chunks according to their types. In (Dalal et al. 2006), the Hindi tagger uses a statistical

approach based on maximum entropy. Various features are used simultaneously for the

prediction of the word tags. The proposed feature set is broadly classified into

dictionary features, context-based features, word features, and corpus based features. A corpus



of more than 35,000 words/phrases is used for testing and training, reporting an accuracy

of 87.4%.

2.6.3. Stemming of complex morphology

Urdu is rich in both derivational and inflectional morphology. For example, the verbs

inflect to agree with case, number, respect, and gender. The verbs are also inflected for

mood (e.g., imperative, infinitive), tense (e.g., present, past), and habitual aspect. (Akram et al.

2009) states that the verbs alone in Urdu have sixty inflected variations. Moreover, the

adjectives also show agreement for case, number, and gender. (Syed et al. 2012)

describes this phenomenon in detail. The intense inflectional and derivational behavior of

Urdu makes the stemming of Urdu text quite a challenging process, because

stemming becomes harder to devise as the character encoding, morphology, and script of

the language become more intricate. For example, the Italian language has more inflections,

so its stemming is more complex than that of English.

Arabic is also an MRL, so the stemming task becomes even harder. (Riaz 2010) suggests

that the Arabic and Farsi stemming processes cannot be used for Urdu because, due to its

inflections, they produce erroneous results. Besides, dictionary/lexicon based error correcting schemes

used by other stemmers cannot be applied to Urdu because of the dearth of machine-readable

resources. An Urdu stemmer (Akram et al. 2009) focuses on a rule based

approach, which removes the prefix and the postfix before adding a letter or letters to

generate the surface form from the stem. Exception lists are created and used to complete

the first two steps of the algorithm. If the lookup is successful then the stripping process

is bypassed. (Riaz 2010) describes the challenges related to the Urdu stemming and

proposes a rule-based model with a few rules implemented to stimulate the intricacies.
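
The rule-based procedure described above (exception lookup, affix stripping, then adding letters back) can be illustrated with a minimal Python sketch. The affix lists, the exception list, and the add-back table below are small hypothetical examples chosen for illustration; they are not the rules of (Akram et al. 2009) or (Riaz 2010), and the single exception list is a simplification.

```python
# Hypothetical rule sets for illustration only (not taken from any published stemmer).
EXCEPTIONS = {"الفاظ": "لفظ"}      # irregular (Arabic broken) plural -> stem
PREFIXES = ("بے", "نا")            # bay-, na-
SUFFIXES = ("وں", "یں", "ے")       # -on, -en, -ay
ADD_BACK = {"پود": "ا"}            # letters restored after stripping

def stem(word):
    """Return a stem for `word` using lookup, stripping, and add-back rules."""
    # A successful exception lookup bypasses the stripping process.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # Strip at most one prefix, keeping at least two letters of the word.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 1:
            word = word[len(prefix):]
            break
    # Strip at most one postfix, with the same length guard.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            word = word[: -len(suffix)]
            break
    # Add a letter or letters back where the bare stem is not a valid form.
    return word + ADD_BACK.get(word, "")
```

With these toy rules, “پھولوں” (phoolon, flowers) is reduced to “پھول” (phool, flower), while “الفاظ” (alfaaz, words) is resolved through the exception list because no stripping rule can recover “لفظ” (lafz, word).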

2.6.4. Resources for Urdu language processing

For Indo-Aryan languages like Urdu, only a few lexical resources are available

and accessible for performing research. For example, for the Hindi language a lexical

resource like the English WordNet is presented as Hindi Wordnet (Bhattacharyya et al. 2008;

Bhattacharyya 2010). The methodology and architecture of this resource are based on the

English WordNet (Fellbaum 1998). Urdu WordNet (Ahmed and Hautli 2010) is

developed by using the same approach. As far as corpus construction is concerned, the

Enabling Minority Language Engineering project is a considerable attempt. It is

working on a multi-lingual corpus for the South Asian languages. An independent

parts-of-speech-tagged corpus for Urdu text is developed with about 1,640,000 words

(Hardie 2003). Another parts-of-speech-tagged corpus is presented by (Muaz and

Hussain 2009). (Humanyoun et al. 2007) presents the

automatic extraction of an Urdu lexicon from a corpus. Also, (Ijaz and Hussain 2007)

presents the development of an Urdu lexicon from a given corpus. The corpus is based

on cleaned text from Urdu news websites, having nearly 18 million words.

(Hautli and Butt 2011) describes a computational semantic analyzer that is part of the parallel

grammar (ParGram) project and is based on the syntactic analysis done for the Urdu grammar

component of ParGram. In addition to the semantic construction some peripheral

lexical resources like a preliminary Urdu WordNet and a VerbNet are developed and

integrated with the main model. Such resources help to generate a more comprehensive

representation of lexical knowledge, e.g., hyponyms for words and their thematic roles.

2.6.5. Miscellaneous works

There are some other noteworthy contributions in Urdu NLP. For example,

(Mukund et al. 2010) employ cross lingual projections in the PropBank paradigm for the

automatic induction of the semantic role annotations for the Urdu text. These annotations

are done on the basis of the word alignments. An Urdu-English parallel corpus is used by

the projection model to utilize syntactic as well as lexical information. The reported

accuracy of the annotations is 92% on short sentences.



Table 2.5.

Urdu language processing.

Word Segmentation    Durrani and Hussain (2010)
                     Lehal (2010)
                     Lehal (2009)
Phrase Chunking      Ali and Hussain (2010)
                     Singh et al. (2005)
                     Dalal et al. (2006)
Stemming             Akram et al. (2009)
                     Riaz (2010)
Resources            Bhattacharyya et al. (2008)
                     Bhattacharyya (2010)
                     Fellbaum (1998)
                     Ahmed and Hautli (2010)
                     Hardie (2003)
                     Hautli and Butt (2011)
                     Muaz and Hussain (2009)
                     Humanyoun et al. (2007)
                     Ijaz and Hussain (2007)
Analysis             Hautli and Butt (2011)
                     Mukund et al. (2010)
                     Rizvi and Hussain (2005)
                     Mukund and Ghosh (2011)
                     Mukund and Srihari (2010)



(Rizvi and Hussain 2005) describes a computational investigation of different Urdu parts of

speech. Their work is largely theoretical; hence it can be used to define and

implement the rules for many language processing applications.

(Mukund and Ghosh 2011) describes the automatic extraction of the opinion holder

words and phrases from the given Urdu texts. This work refers to the opinion holders and

their targets together as the opinion entities. It works in two steps: it generates the required word

sequences related to the opinion entities, and then disambiguates these extracted sequences as

the holders or targets of the opinions. Morphological operations like inflections are

used to correctly identify sequence boundaries for the verbs and nouns. Another work,

on the classification of objective and subjective sentences, is attempted by

(Mukund and Srihari 2010), which employs a vector space model.

2.7. Adjective based sentiment analysis techniques

As already mentioned in Section 2.1, the part of speech based features of the given text,

particularly of adjectives, can help a lot in sentiment analysis. Here, we focus on

the adjective based approaches used by the NLP community. One of the earliest works in

this domain (Hatzivassiloglou & McKeown, 1997) uses adjectives as subjectivity

indicators. They employ a log-linear regression model for identification and validation of

the positive or negative semantic orientation of the conjoined adjectives. A clustering

algorithm divides the adjectives into groups with respect to orientations, and labels them

as positive or negative. Before that, (Hatzivassiloglou & McKeown, 1993) present an

approach for automatic recognition of adjectival scales. This approach groups or clusters the

adjectives carrying the same semantics, but it was not developed from the perspective of sentiment

analysis. (Bruce & Wiebe, 2000) recognize subjectivity within the text by manual

tagging. They take a case study of sentence level categorization and categorize clauses

from the “Wall Street Journal” as objective or subjective. Each clause is given a final

classification on the basis of an agreed decision by four judges.

(Hatzivassiloglou & Wiebe, 2000) analyze two main features of adjectives for

subjectivity prediction, i.e., gradability and semantic orientation. They obtain the

gradability values using an automatic extraction method. (Turney, 2002) suggests

that adverbs are also carriers of sentiments in a sentence and should be considered in

combination with adjectives. In this work, the sentences are divided into pre-structured

grammatical patterns, which include adjectives and adverbs as the core words. (Riloff et

al., 2003) emphasize the identification of the subjective nouns, which are modified by

adjectives. They compute the orientation of the phrases in the sentence that

contain them. (Riloff & Wiebe, 2003) use an unsupervised learning method for automatic

extraction and learning of the patterns for subjective expressions in the given text.

Table 2.6.

Research contributions related to adjective based sentiment analysis.

Adjectives               Hatzivassiloglou and McKeown (1993)
                         Hatzivassiloglou and McKeown (1997)
                         Bruce and Wiebe (2000)
                         Hatzivassiloglou and Wiebe (2000)
Adjectives and adverbs   Turney (2002)
Subjective nouns         Riloff et al. (2003)
Subjective expressions   Riloff and Wiebe (2003)
Appraisal expressions    Whitelaw et al. (2005)
                         Bloom and Argamon (2010)


(Whitelaw et al., 2005) propose the use of appraisal theory for sentiment analysis. They

work on appraisal expressions extraction. These appraisal expressions are the sentiment

oriented phrases which contain adjectives as head words. (Bloom & Argamon, 2010)

extend this model and propose an approach for automatic learning of these appraisal

expressions. Research contributions related to adjective based sentiment analysis are

shown in Table 2.6.



2.8. Term level vs. Phrase level polarity

According to a survey of the psychological aspects of natural language [55], only 4% of

the total words used in the written texts carry sentimental or affective content. This means that

the sentiment of a sentence cannot be determined by analyzing only this 4% of the

content, and hence in addition to considering the subjective terms we need to explore

more words and phrases which jointly make up the final sentiment of the given text. In

this regard the existing works for automatic sentiment classification principally fall into

two categories, i.e., word-level classification and phrase-level classification. The word-

level classification incorporates the polarity orientation of the words called prior

polarities. The phrase-level classification utilizes these prior polarities to calculate the

phrase level polarities.
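
The relation between the two levels can be sketched in a few lines of Python. The prior-polarity lexicon and the additive combination rule below are assumptions made only for illustration; the surveyed approaches use richer lexicons and combination schemes.

```python
# Hypothetical word-level (prior) polarities; values are illustrative only.
PRIOR = {"good": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0}

def phrase_polarity(words):
    """Phrase-level polarity as the sum of the prior polarities of its words.

    Words missing from the lexicon are treated as neutral (0.0).
    """
    return sum(PRIOR.get(word.lower(), 0.0) for word in words)
```

Under this assumed rule the phrase "a good but terrible film" receives a negative phrase-level polarity, although it contains a positive term.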

2.8.1. Term-level-polarity based approaches

In the early contributions, the approaches concentrated mainly on determining the prior

polarities of the constituent terms only. The first effort at automatic sentiment analysis

(Hatzivassiloglou & McKeown, 1997) considered adjectives as the polar terms and

presented a scheme based on the conjunctions between different adjectives in a big

corpus. It incorporated a shallow parsing algorithm for text chunking and constructed a

log-linear statistical model, which predicts whether any two conjoined adjectives have the same orientation.

Before that, (Hatzivassiloglou & McKeown, 1993) focused on the automatic recognition

of the adjectival scales. It grouped or clustered the adjectives carrying the same semantics,

but this work was not done with a view to sentiment analysis.

After this contribution (Hatzivassiloglou & McKeown, 1997), two lines of research appeared in

the sentiment analysis research community. Firstly, researchers focused more on the adjectives

as the sentiment carriers. For example, (Hatzivassiloglou & Wiebe, 2000) evaluates two

key features of adjectives as subjectivity indicators, i.e., semantic orientation and

gradability. (Bruce & Wiebe, 2000) recognizes bias within the text by manual tagging.



2.8.2. Phrase-level-polarity-based approaches

Then, the focus of research moved from word level prior polarity to the phrase level, or

expression level, polarity analysis. (Riloff et al., 2003) stresses the subjective nouns

modified by adjectives and computes the orientation of the phrases containing these

nouns, whereas (Turney, 2002) argues that adverbs are also carriers of

sentiments in an opinion and should be considered along with the adjectives. In this work,

the sentences are converted into pre-structured grammatical patterns with adjectives and

adverbs as the core terms. (Riloff & Wiebe, 2003) emphasizes the subjective

expressions in the given texts. Another work, (Whitelaw et al., 2005), proposes the

concept of appraisal expressions based on the appraisal theory. (Bloom & Argamon,

2010) extends this idea by proposing an approach for automatic learning of these

appraisal expressions. Table 2.7 gives the summary of the above discussed contributions.

Table 2.7.

Term-level polarity vs. phrase-level polarity approaches.

Term-level-polarity based approaches     Bruce and Wiebe (2000)
                                         Hatzivassiloglou and Wiebe (2000)
Phrase-level-polarity based approaches   Turney (2002)
                                         Riloff and Wiebe (2003)
                                         Whitelaw et al. (2005)
                                         Bloom and Argamon (2010)
                                         Syed et al. (2010)



2.9. Negation Handling in sentiment analysis

Negation handling in sentiment analysis as an independent task is not yet

a well solved issue, even for English text (Jia and Meng 2009;

Wiegand et al. 2010). This is because of the context sensitive use of the

negation particles. The first computational model for the treatment of

negation is presented in (Polanyi and Zaenen 2004). It models negation via

contextual valence shifting: the polarity of a subjective expression is

reversed by the use of a negation marker. The work in (Kennedy and

Inkpen 2005) also proposes an approach for contextual valence shifting.

In addition to dealing with the simple negation particles, this work

defines a simple scope for negation, i.e., if the negation particle

immediately precedes a subjective expression then its polarity is flipped.

As an extension of this work, a parser is added for scope computation in

(Kennedy and Inkpen 2006).
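
The simple scope rule described above, in which a negation particle flips the polarity of the subjective expression it immediately precedes, can be sketched as follows. The particle list and the prior polarities are assumed toy values, and the function is an illustration of the rule rather than the implementation of (Kennedy and Inkpen 2005).

```python
NEGATIONS = {"not", "never", "no"}            # assumed list of negation particles
PRIOR = {"good": 1, "bad": -1, "happy": 1}    # hypothetical prior polarities

def polarities_with_negation(tokens):
    """Return (token, polarity) pairs for the polar tokens in `tokens`.

    The polarity of a polar token is flipped only when the token directly
    before it is a negation particle (a one-token negation scope).
    """
    result = []
    for i, token in enumerate(tokens):
        if token in PRIOR:
            polarity = PRIOR[token]
            if i > 0 and tokens[i - 1] in NEGATIONS:
                polarity = -polarity
            result.append((token, polarity))
    return result
```

The one-token scope also shows the limitation that motivates parser-based scope computation: in "not a good film" the particle does not immediately precede "good", so the polarity is left unchanged.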

Table 2.8.

Negation handling for sentiment analysis.

Contextual valence shifting    Polanyi and Zaenen (2004)
Scope of negation              Jia and Meng (2009)
                               Kennedy and Inkpen (2005)
                               Kennedy and Inkpen (2006)
Supervised machine learning    Wilson et al. (2005)
Compositional semantics        Moilanen and Pulman (2008)



The work in (Wilson et al. 2005) uses a supervised machine learning

method. It selects features such as negation features, shifter features,

and polarity modification features for advanced negation modeling. A

technique to compute the polarity of complex noun phrases and headlines

using compositional semantics is presented in (Moilanen and Pulman

2008). The research in (Jia and Meng 2009) investigates the effect of

different scope models of negation. It achieves the scope detection

through heuristic rules focused on polar expressions and on static and

dynamic delimiters.

All of the above contributions treat negation markers as independent lexical units, which can

affect entire words, phrases or sentences. But there are many cases in which the

negation comes within the word structure, e.g., “بے فایده” (bay fayeeda, useless). Only

a few works address this type of negation (Moilanen and Pulman 2008).
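
As an illustration of word-internal negation, the following Python sketch flips the prior polarity of a base word when it carries a negative prefix. The prefix list and the polarity lexicon are hypothetical, and the rule is a simplification offered only to make the phenomenon concrete; it is not a method taken from the cited works.

```python
NEG_PREFIXES = ("بے", "نا")           # bay-, na- (assumed, not exhaustive)
PRIOR = {"فایده": 1, "ممکن": 1}        # hypothetical prior polarities of base words

def word_polarity(word):
    """Polarity of a word, flipped if a negative prefix precedes a known base."""
    compact = word.replace(" ", "")   # the prefix may be written with a space
    if compact in PRIOR:
        return PRIOR[compact]
    for prefix in NEG_PREFIXES:
        base = compact[len(prefix):]
        if compact.startswith(prefix) and base in PRIOR:
            return -PRIOR[base]
    return 0                          # unknown words are treated as neutral
```

Here “بے فایده” (bay fayeeda, useless) is assigned the opposite polarity of “فایده” (fayeeda, benefit), although no independent negation particle is present.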

Chapter review:

This chapter describes the state of the art research in sentiment analysis and Urdu

language processing. The literature survey is divided into the following sections: features of

the given text, techniques, sentiment annotated lexicon construction, generalization

among domains, processing of morphologically rich languages, Urdu language

processing, adjective based SA techniques, term level vs. phrase level polarity and

negation handling in SA.


CHAPTER 3

DISTINCTIVE FEATURES OF THE URDU LANGUAGE

Prior to reporting our research contributions, there are some background issues that must

be presented and discussed. Firstly, we describe the Urdu language itself, which is the

main entity of this investigation. Urdu is introduced briefly to provide background for the

discussion in later chapters. As this language is not widely studied, this chapter

contains more detail than would be necessary if a more recognizable language, such as

English or French, were being studied.

Language family:         Indo-European
Influencing languages:   Firstly: Persian, Arabic, and Turkish
                         Secondly: Sanskrit, English
Script:                  Persio-Arabic
Writing style:           Nastalique
Major dialect:           Hindi
Regions:                 National language of Pakistan; widely spoken in
                         Afghanistan, Bahrain, Bangladesh, Botswana, Fiji,
                         Germany, Guyana, India, Malawi, Mauritius, Nepal,
                         Norway, Oman, Qatar, Saudi Arabia, South Africa,
                         Thailand, UAE, United Kingdom and Zambia

Table 3.1. Brief overview of Urdu language

Urdu is an Indo-European language. Persian, Arabic, Turkish, and English have great

influence on the Urdu vocabulary, whereas, the grammar is more inclined towards

Sanskrit. Some major dialects of Urdu are Hindi, Dakhini, Pinjari, Rekhta, and Modern



Vernacular Urdu. Moreover, Urdu is the national language of Pakistan and is widely

spoken in India, Afghanistan, Bangladesh, Bahrain, Oman, Saudi Arabia, South Africa,

and the United Kingdom. Some salient features of the Urdu language are given in Table 3.1.

The distinctiveness of a language is recognized by its inherent characteristics, which are

its orthography, vocabulary, parts of speech, grammar and morphology. We present here

a precise overview of these characteristics of the Urdu language:

3.1 Orthography

The orthography of a language specifies a standardized method for using a specific script

or writing system as a set of symbols (alphabets), i.e., graphemes and diacritics, and the rules

about how to write these symbols. It refers to the relationships between the graphemes

and phonemes for generating word spellings. It also covers the diacritics,

capitalization, hyphenation, word boundaries, punctuation marks and emphasis. The

orthography of the Urdu language reflects Arabic and Persian influences.

Figure 3.1 Character set of Urdu.

3.1.1. Character set

The character set of Urdu is an extended version of the Arabic character set used for

Persian. It has sounds which are not present in Arabic or Persian, including alveolar

consonants, aspirated stops and long vowels. There are 58 letters in Urdu, as given in

Figure 3.1 (Hardie 2003).



The Arabic script employs letters to represent consonants and diacritics to indicate the

vowels. In Urdu both long and short vowels exist. Diacritics are used on the consonants

to specify the short vowels, whereas the long vowels are indicated by the combined effect

of a consonant with a diacritic and an additional letter. These diacritics are optional and

usually not written, but they exist implicitly and the native speaker understands their

pronunciation. From Figure 3.2 it is clear that a consonant can carry two

diacritics and these can be written above or below the consonant.

Figure 3.2 Diacritics in Urdu with letter “ب”.

3.1.2. Word order

Generally the basic word order of the Urdu clause is given as subject object verb (SOV).

Variation in this word order is common, particularly the reordering of nominal

constituents, especially for thematic purposes. This is the reason that (Butt, 1995) argues

that Urdu is a free order or a non-configurational language.

3.1.3. Bidirectional script

Like Arabic, Urdu script is bidirectional: the words are written from right to left

and numbers are written from left to right.
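
This bidirectional behavior is reflected in the Unicode character properties, which can be inspected with Python's standard `unicodedata` module. The helper name `bidi_classes` is hypothetical; the sketch only reports the bidirectional class of each character and does not implement text layout.

```python
import unicodedata

def bidi_classes(text):
    """Pair every non-space character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text if not ch.isspace()]

# Arabic-script letters report the class "AL" (right-to-left Arabic letter),
# while European digits report "EN" (European number, laid out left to right).
classes = dict(bidi_classes("ب 7"))
```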

3.1.4. Ligatures

Urdu uses Persio-Arabic script, which is cursive and context-sensitive with respect to the

shapes of the alphabets. It means that the “حروف” (haroof, alphabets) have multiple

glyphs and shapes and are categorized as joiners and non-joiners. The joiner alphabets

join together into units, called the ligatures (Durrani and Hussain 2010). One word can

have either single or multiple ligatures. During writing, all characters join together until a

non-joiner appears. A new ligature starts after the non-joiner. The process is repeated

until the word ends. If more than one ligature is present in a word then it seems

that the word has a space within it, but this space is not there. Consider the example of

“جانور” (janwar, animal); this word has three ligatures, which are written without space,

whereas the word “ہمت” (himat, courage) has only one ligature. There is also a

possibility of separation of the ligatures, even in the absence of a non-joiner, for

example, “کبھی کبھی” (kabhi kabhi, sometimes) and “بے جان” (bay jaan, lifeless). This

phenomenon is very common in compounding and reduplication of the words.
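
The ligature formation process described above (characters join until a non-joiner appears, after which a new ligature starts) can be sketched in Python. The set of non-joiners below is an assumed list of letters that do not connect to the following letter; it is given for illustration and may not be exhaustive.

```python
# Assumed non-joiners: alif, alif madda, dal, ddal, zal, ray, rray, zay, zhay, wao, bari yay.
NON_JOINERS = set("\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2")

def ligatures(word):
    """Split a space-free word into ligatures, closing one after each non-joiner."""
    parts, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINERS:      # a non-joiner ends the current ligature
            parts.append(current)
            current = ""
    if current:                    # trailing joiners form the last ligature
        parts.append(current)
    return parts
```

For “جانور” (janwar, animal) the function returns three ligatures, and for “ہمت” (himat, courage) a single one, matching the examples in the text.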

An Urdu character exhibits multiple shapes according to its position in the ligature, i.e.,

in the initial, medial, or final position, or it remains unconnected. For example, consider

the alphabet “ج” (jeem). It can be joined in initial position as “جا”, in medial position as

“بجا” and at final position as “حج”, see Table 3.2.

Remark                            Shape adjustment

Joined in the initial position    ج + ا = جا

Joined in the medial position     ب + ج + ا = بجا

Joined at the final position      ح + ج = حج

In a word with a non-joiner       آ + ج = آج

Table 3.2. Different shapes of a single alphabet “ج” (jeem).

Due to this context sensitive orthography and difference in the behaviors of joiners and

non-joiners, the word boundary identification becomes a major task. The space is not

always an indicator of the word boundary.

3.2. Parts of Speech

According to the given literature (Ijaz and Hussain, 2007) and (Muaz and Hussain, 2009)

there are eleven unique parts of speech of the Urdu language:

1. Noun

2. Verb

3. Adjective

4. Adverb

5. Pronoun



6. Post Positions

7. Numerals

8. Auxiliaries

9. Conjunctions

10. Haroof

11. Case markers

Among these, nine (from 1-9 in the above list) are similar to the English parts of speech

in their semantics (though their morphological and grammar rules are clearly distinct),

while the “حروف” (haroof) and case markers are different. The “haroof” are the words

which have no independent meaning. To become meaningful they are used with other

words (Schmidt, 2000), for example, “اے” (ay), “او” (o), “واه” (wah), and “نا” (na), etc.

3.3. Vocabulary

The absorption power of Urdu is quite exceptional. In addition to Arabic, Persian, and

Turkish influences, Urdu has kept on including vocabulary from English, Sanskrit and

Hindi. This potential enhances the magnificence of the language. Table 3.3 gives some

examples of the Urdu words taken from English, Persian, Sanskrit, Arabic and Turkish,

along with their use in the sentences.

Language Borrowed words Example of Urdu Sentences

English ٹیلی فون (telephone, Telephone) ٹیلی فون خراب ہے (telephone khrab hay, Telephone is out of order.)

Persian فردوس (firdos, heaven) سوات فردوس نظیر ہے (sawat firdos nazeer hay, Sawat is like heaven.)

Sanskrit آشا (aasha, wish) میری آشا پوری ہوگئ (meri aasha puri ho gayee, My wish came true.)

Turkish خاتون (khatoon, lady) وه ایک نفیس خاتون ہیں (woh aik nafees khatoon hain, She is a fine lady.)

Arabic جنت (janat, heaven) گھر جنت ہے (ghar janat hay, Home is heaven.)

Table 3.3 Examples of Urdu words from multiple languages.



3.4. Morphology

Morphology can be defined as the study of the structure of the word. For example, it

describes how “الفاظ” (alfaaz, words) is inflected from the word “لفظ” (lafz, word). The

definition of morphology leads to the concept of the morpheme, the smallest unit of meaning

or smallest recurring unit. The relation of morphology to the morpheme is the same as that of

syntax to the words. Morphemes express concepts like “بادل” (badal, cloud), “پنکھا”

(pankha, fan), or relationships like “مند” (mand) in “دولت مند” (dolat mand, rich) and “بے”

(bay) in “بے جان” (bayjaan, lifeless). Morphemes can also express syntactic features, for

example number (singular, plural), e.g., “پودا” (poda, plant), “پودے” (poday, plants), and

gender (male, female), e.g., “گیا” (gya, went, inflected for masculine), “گئ” (gayee, went,

inflected for feminine).

The term morph represents morphemes as parts of a word, e.g., in the word “پر” (pur,

feather) the morpheme “پر” (pur) is realized as the morph “پر” (pur) to form the word

“پر” (pur, feather). In “پروں” (puron, feathers), the morpheme “پر” (pur) and the

PLURAL morpheme are realized as “پر” + “وں” (pur+oon) respectively to form the word

“پروں” (puron, feathers).

The term allomorph represents the different forms of a morpheme, e.g., the PLURAL

morpheme in Urdu has several allomorphs. The plural of “پودا” (poda, plant) is “پودے” (poday,

plants), and the plural of “پھول” (phool, flower) is “پھولوں” (phoolon, flowers). The morphemes are

further categorized as free morphemes (which can form words by themselves), e.g., “بارش”

(barish, rain), “آسمان” (aasman, sky), and bound morphemes (which must be combined with

other words to form words), e.g., “با” (ba) in “با عزت” (baizat, respectable). Words can

consist of free morphemes only, bound morphemes only, or free and bound morphemes

jointly.

As far as Urdu morphology is concerned, it lies in the category of morphologically rich

languages (MRLs) like Arabic, Persian, Chinese, Turkish, Finnish, and Korean. The

MRLs pose considerable challenges for natural language processing, machine

translation and speech processing (Abdul-Mageed and Korayem, 2010). These languages

are distinctive due to highly productive and frequent morphological processes at the word

level, e.g., compounding, reduplication, inflection, agglutination and derivation, etc. Due



to these morphological operations the same root words can generate multiple word forms.

This makes the stemming process quite challenging.

Also, the lexicons of MRLs tend to be more complex. The dependencies and

relationships between different parts of speech are frequent. This increases the level of

intricacy, which results in inflection or derivation gaps, because various forms of the

same underlying base-form can easily be misidentified as unrelated entries, with negative

effects on the overall alignment of words and hence on the processing accuracy.

Some frequent morphological processes for Urdu are discussed below:

3.4.1. Inflection and derivation

Inflectional operations deal with the variety of forms of the same word. The changes

indicate grammatical features, e.g., “جانا” (jana, to go) from “جا” (ja, go). The difficult

aspect of these inflections is their diversity. For example, to make a plural in English,

s, es or ies is added according to predefined grammatical rules. Exceptions exist,

but are rare.

On the contrary, in the Urdu language, the Arabic loan words are made plural according to

Arabic grammar, whereas the Persian loan words follow the Persian grammar, and so on.

For example, the plural of “لفظ” (lafz, word) is “الفاظ” (alfaaz, words) and the plural of “پودا” (poda,

plant) is “پودے” (poday, plants). The two are inflected differently to make the plural.

Derivational operations deal with the production of new words with different meanings.

The new words are produced by adding affixes. Often the produced words have a

changed part of speech, e.g., “خوش” (khush, happy) and “خوش بخت” (khushbakht, lucky).

3.4.2. Compounding

The compounding process results in new words which are made by a combination of

two already existing words M and N. Some examples of compound words in Urdu are:

MN formation: M and N are independent in meaning and syntax but they are only

written together to make a new word.



For example, M = “موم” (mom, wax), N = “بتی” (bati, light), make the word

MN = “موم بتی” (mombati, candle).

M-O-N formation: M and N are independent words, but are related in meaning or

context. Their syntax remains the same with an additional alphabet “و” (O). This

alphabet “و” (O), means “and”.

For example, M = “ملک” (mulk, country), N = “ملت” (milat, nation), make the

compound word, M-O-N = “ملک و ملت” (mulk-o-milat, country and nation).

3.4.3. Reduplication

Both full and partial reduplication of words are very common in Urdu. For example, the

full reduplication of the word “کبھی” (kabhi, sometime) results in “کبھی کبھی” (kabhi

kabhi, infrequently).
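
Full reduplication can be detected in tokenized text by looking for immediately repeated tokens, as in the following minimal Python sketch. The function name is hypothetical and whitespace tokenization is assumed; partial reduplication, where the repeated form is modified, is not covered.

```python
def full_reduplications(tokens):
    """Return (index, token) for every token that is immediately repeated."""
    return [
        (i, tokens[i])
        for i in range(len(tokens) - 1)
        if tokens[i] == tokens[i + 1]
    ]

# Example sentence: "woh kabhi kabhi aata hay" (he comes sometimes).
found = full_reduplications("وہ کبھی کبھی آتا ہے".split())
```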

3.4.4. Compound verbs and verb phrases

In Urdu, root verbs and intensifying verbs combine together to form compound verbs

(Schmidt 1999). For example, the root verb “پکار” (pukar, call) and intensifying verb “لو”

(lo, take) make a compound verb “پکار لو” (pukarlo, call (right away)). This compound

verb has the same meaning as the root verb but exhibits more strength. Table 3.4 gives

further examples of the discussed morphological processes, i.e., inflection, derivation,

compounding, reduplication and compound verbs.

Operation Word Modified form

Inflection پھول (phool, flower) پھولوں (phool-on, flowers)

Derivation ممکن (mumkin, possible) ناممکن (na-mumkin, impossible)

Compounding جان (jaan, soul), دل(dil, heart) دل و جان (dil-o-jaan, heart and soul)

Partial Reduplication رات (raat, night) راتوں رات (raat-on-raat, in a night)

Compound verbs مار (maar, beat), ڈالو (dalo, put) مار ڈالو (maar dalo, kill)

Table 3.4. Examples of morphological processes in Urdu.



3.5. Challenging features of the Urdu language

So far we have gone through an overview of the Urdu language. Here, we briefly

describe the challenges posed by its distinctive features. These aspects are related to

the tasks involved in sentiment analysis, like corpus collection, lexicon construction, and word

boundary identification:

3.5.1. Corpus construction

Urdu websites are becoming more popular day by day, but they still cannot be used for corpus

construction because such a task needs a large amount of electronic text. It is an

unfortunate fact that most of the Urdu websites use graphic formats, i.e., gif or other image

formats, to display Urdu text [12].

Despite these hurdles there are some significant efforts. For example, a relatively small

corpus (20,000–50,000 words) for Urdu language is available at

http://personal1.stthomas.edu/dmbecker/. In this corpus, the documents appear in a

minimally tagged format.

3.5.2. Complex stemming

As with other MRLs the stemming of Urdu language is complex, because various words

emerge from the same root. For example, the root word “علم” (ilm, knowledge) generates

multiple words with different meanings and forms, as given in Table 3.5.

3.5.3. Intricate lexicon

In most of the NLP applications a lexicon is a main requirement. For sentiment analysis,

this lexicon becomes more complex because it contains sentiments annotated to all

entries in addition to their grammatical and morphological information.

The Urdu language is a blend of the local languages and the languages spoken by the military

troops who invaded the subcontinent in different eras. Therefore, Urdu, previously

known as Rekhta (ریختہ), meaning molded or mixed, has strong linguistic influences

from Arabic, Persian, Turkish, Sanskrit and English, etc. For example, the words, “شمس”

(shams, sun), “بہتر” (behter, better), “ٹیلی ویژن” (televizun, television) and, “پوجا” (pooja,



worship) are Arabic, Persian, English, and Sanskrit loan words, respectively. Due to this

variability, the morphological operations use varying grammar rules. Most of the loan

words follow the grammar rules of their parent language. Generally, the Sanskrit based

adjectives show inflection to agree with the noun they qualify; this property is called

marking with respect to case, gender, or number. For example, the demonstrative adjective “جیسا”

(jaisa, such as) becomes “جیسی” (jaisee, such as) and “جیسے” (jaisay, such as) for gender

and number, respectively. On the other hand, most of the Persian loan words like “تازه”

(tazah, fresh) remain unmarked, because they follow Persian grammar.

Technically, these features result in much more intricate lexicons for natural language

processing applications. There is a much higher out-of-vocabulary rate as compared to

other well-defined grammars. Also, they result in poor or unreliable language model

probability estimation, because there are many combinations of word forms which are

missing or rarely available in the language model training data.

Root word         “علم” (ilm, knowledge)

Inflected words   “عالم” (aalim, knowledgeable),

                  “عالمہ” (aalimah, female knowledgeable),

                  “معلم” (moalim, educator),

                  “معلمہ” (moalimah, female educator),

                  “معلوم” (maaloom, known),

                  “معلومات” (maaloomaat, information)

Table 3.5 Inflection of multiple words from root word “علم” in the Urdu language.

3.5.4. Word boundary identification

Urdu employs Persio-Arabic script, with Nastalique writing style. Its orthography is

context sensitive and the “حروف” (haroof, alphabets) exhibit different shapes according to

their positions in a word. For example, “گا”, “جگ”, “مگن”, and “جاگ” show four different

shapes of the same alphabet “گ” (gaaf). Urdu alphabets are categorized as joiners and

non-joiners (Lehal, 2010). The joining alphabets make ligatures. One word can have one

or more than one ligature due to the existence of a non-joiner within a word. For example,

“جگنو” (jugnu, firefly) is a single ligature word but the word “جاگو” (jago, wakeup) has

two ligatures. If the ending letter of a word is a joiner then it tends to join with the first

letter of the next word, resulting in a misidentification of the word boundaries. For

example, “کل رات” (kal raat, tomorrow night) are two different words and are written

with a space, but if this space is omitted by mistake then the last letter of the first word, a joiner,

will join with the first letter of the second word and it will become “کلرات” (kalraat).

Hence, the spaces are not always true indicators of the word boundaries as they are in English

text.

3.5.5. Diacritics omission

As in Arabic, diacritics are used in Urdu to mark vowels. But their use is not standardized

and is author dependent. Hence, they are removed as a preprocessing step of any NLP

application. This is an accepted practice adopted by the Urdu language research

community (Durrani and Hussain, 2010).
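
One simple way to carry out this preprocessing step is to drop all Unicode combining marks, as in the Python sketch below. It assumes that the diacritics are encoded as separate non-spacing marks (Unicode category "Mn"); no normalization is applied, so precomposed letters such as “آ” are left untouched. The function name is hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    """Remove non-spacing combining marks (category 'Mn') from `text`.

    This covers the Arabic-script short-vowel marks such as zabar (U+064E),
    zer (U+0650), pesh (U+064F) and the sukun/jazm (U+0652).
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "ilm" written with zer on ain and jazm on laam, then with the marks removed.
plain = strip_diacritics("\u0639\u0650\u0644\u0652\u0645")
```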

3.5.6. Code switching

Another interesting feature of Urdu is code switching. In linguistics, code switching means using multiple languages concurrently. This phenomenon is very common in Urdu writing. For example, “mobile off کر دو” (mobile off kar do, turn off the mobile) mixes English and Urdu words in a single sentence. This property makes it difficult to determine the correct lexical category or part of speech of the switched words.
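As a first step towards handling such text, the switched tokens can be located by their script. The sketch below is only an illustration under simple assumptions: a token containing any character from the Arabic Unicode block (U+0600 to U+06FF) is treated as Urdu script, and a token containing ASCII letters is treated as Latin script. The function names and labels are hypothetical.

```python
from typing import List, Tuple


def script_of(token: str) -> str:
    """Label a token by script using coarse Unicode range checks."""
    if any("\u0600" <= ch <= "\u06FF" for ch in token):
        return "arabic-script"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "latin"
    return "other"


def tag_scripts(sentence: str) -> List[Tuple[str, str]]:
    """Split on whitespace and attach a script label to each token."""
    return [(tok, script_of(tok)) for tok in sentence.split()]


# "mobile off kar do" with the last two words in Urdu script.
for token, script in tag_scripts("mobile off \u06A9\u0631 \u062F\u0648"):
    print(token, script)
```

Script detection alone does not resolve the part of speech of a switched word; it only marks where a second lexicon or tagger would be needed.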

3.5.7. Independent case marking

Case markers are relational morphemes, lexical units or words that mark the grammatical function of the words with which they are used. In Urdu, the case markers are syntactically attached to the words but are lexically independent; that is, they receive independent POS tags (Rizvi et al., 2005). They affect the structure of the sentence and can cause grammatical ambiguities; for instance, the free word order property of Urdu text is due to case markers. For example, both phrases “رنگوں کے نام” (rangoon kay naam, colors’ names) and “نام رنگوں کے” (naam rangoon kay, colors’ names) are correct and have the same meaning, but they differ in word order, which the case marker “کے” (kay) makes possible. Some more examples of the use of case markers are “ایران کا بادشاہ” (Iran ka badshah, king of Persia) and “شیشے کی بوتل” (sheeshay ki bottle, glass bottle).

Moreover, Urdu text contains two types of affixes, (a) morphemes and (b) words or

lexical units. Morphemes are lexically attached with the nouns through morphological

operations. For example, to make plural “پودے” (poday, plants) of the word “پودا” (poda,

plant) plural postfix “ے” (ay) is applied as shown in Table 3.6.

The words or lexical units, in contrast, are independent units. These are further categorized as case markers, pure postpositions and possession or genitive markers. The case markers are further divided into core case markers and oblique case markers. Case markers assign a grammatical function to the words they mark and are, in general, morphologically attached to those words at the lexical level. In Urdu, however, they are syntactically attached and lexically independent.

Category | Examples
1. Morphemes | “پودا” to “پودے”: the plural postfix “ے” (ay) is applied
2. Words or lexical units | “نا” (na), “نی” (ni), “نے” (nay), “سے” (say)
2.1. Case markers | “سے” (say), “کو” (ko)
2.1.1. Core case markers | “میں نے کہا” (mein nay kaha, I said)
2.1.2. Oblique case markers | “باہر نکالنا” (bahir nikal-na, put out)
2.2. Pure postpositions | “کمرے میں جا” (kamray mein ja, go to the room)
2.3. Possession or genitive markers | “آپ کا نام” (aap ka naam, your name)

Table 3.6. Examples of affixes, case markers and postpositions.

As an example of core case markers consider the sentence, “میں نے کہا” (mein nay kaha, I

said), in which the case marker “نے” (nay) is used. Similarly, in the sentence “آپ کا نام”

(aap ka naam, your name), the possession marker “کا” (ka) is used. Table 3.6 gives some

more examples.
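Because these markers are lexically independent, they can be tagged as separate tokens. The following sketch shows one possible lookup-based tagging; the tag names ("CM", "GEN", "WORD") and the function `tag_markers` are hypothetical and not part of the original framework. The word “میں” is deliberately left out of the lookup, since without diacritics it is ambiguous between the postposition (mein, in) and the pronoun (mein, I) and cannot be resolved by a lookup alone.

```python
from typing import List, Tuple

# Marker forms taken from the examples in this section, written as escapes.
NAY = "\u0646\u06D2"   # nay (ergative case marker)
KO = "\u06A9\u0648"    # ko
SAY = "\u0633\u06D2"   # say
KA = "\u06A9\u0627"    # ka (genitive, masculine singular)
KEE = "\u06A9\u06CC"   # kee (genitive, feminine)
KAY = "\u06A9\u06D2"   # kay (genitive, masculine plural/oblique)

# Hypothetical tag inventory for illustration only.
MARKER_TAGS = {NAY: "CM", KO: "CM", SAY: "CM", KA: "GEN", KEE: "GEN", KAY: "GEN"}


def tag_markers(tokens: List[str]) -> List[Tuple[str, str]]:
    """Give case and genitive markers their own tag; others become WORD."""
    return [(tok, MARKER_TAGS.get(tok, "WORD")) for tok in tokens]


aap, naam = "\u0622\u067E", "\u0646\u0627\u0645"  # aap (you), naam (name)
print(tag_markers([aap, KA, naam]))
```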



1. “نہیں تیرا نشیمن قصر سلطانی کے گنبد پر” (naheen tera nasheman qasr-e sultani kay gunbad par)

2. “تیرا نشیمن قصر سلطانی کے گنبد پر نہیں” (tera nasheman qasr-e sultani kay gunbad par naheen)

3. “تیرا نشیمن گنبد قصر سلطانی پر نہیں” (tera nasheman gunbad-e qasr-e sultani par naheen)

Translation: Your home is not on the tower of the king’s palace.

Table 3.7. Free word order property of the Urdu text.

3.5.8. Free word order

As already mentioned in Section 3.1.2, Urdu generally follows subject-object-verb (SOV) word order. But variation in this word order is frequent and well accepted, which is why some researchers suggest that Urdu should be considered a free word order language (Butt, 1995).

An example sentence is given in Table 3.7, which is written in three different ways with the same semantics.

The free word order property of Urdu text is due to:

a. The case markers, which can identify constituents in multiple ways (Rizvi and Hussain 2005). These are lexically independent and are considered an independent lexical category.

b. Diacritics used as possession markers. As these are optional, proper identification of the meaning becomes very difficult.

Chapter review:

This chapter described the linguistic characteristics of the Urdu language. From this description we conclude that Urdu is unique in a number of aspects related to its orthography, morphology, grammar and vocabulary. Its distinctive linguistic features make it a challenging domain for the sentiment analysis community. Hence, these features require updated or altogether different algorithms and approaches to analyze the sentiment orientation of Urdu text.



CHAPTER 4

SENTIUNITS: THE APPRAISAL EXPRESSIONS

In an opinionated sentence, not all terms are subjective. Indeed, the sentiment of a sentence depends only on some specific words or phrases. Consider the examples “This book is very good.” and “The movie is boring.” The underlined expressions (here “very good” and “boring”) consist of one or more words and carry the sentiment information of the whole sentence. We label them SentiUnits. We can judge these units alone as the representatives of the whole sentence’s sentiment. They are in fact the appraisal expressions defined and discussed in Chapter 2. The SentiUnits can be defined as the core grammatical structures that express the opinion, i.e., the sentiment carrier expressions in a sentence (Syed et al. 2010). To understand the structure of the SentiUnits, consider the following examples from Urdu text in Table 4.1.

No. | Urdu sentence | Transliteration | Translation
1 | یہ ایک عمدہ کتاب ہے | Yeh aik umdah kitab hay | This is a fine book.
2 | یہ عمدہ اور معلوماتی کتاب ہے | Yeh umdah aur malumati kitab hay | This is a fine and informative book.
3 | یہ سب سے عمدہ کتاب ہے | Yeh sab se umdah kitab hay | This is the finest book.
4 | یہ کتاب اتنی بری نہیں | Yeh kitab itni buri naheen | This book is not very bad.

Table 4.1. Examples of opinionated sentences from Urdu with different SentiUnits.

In Table 4.1, the underlined expressions are responsible for the subjectivity orientation. All other words are neutral and have no effect on the classification. On a closer look at these examples, we can observe that the SentiUnits are built around adjectives (as head words). They can consist of a single adjective, as in sentence 1, or of multiple words, as in sentences 2, 3 and 4. Sentences 1, 2 and 3 have adjectives with positive orientation, whereas sentence 4 contains a negative word that becomes positive through the use of negation. In this case, negation acts as a polarity shifter. Moreover, the intensity of the expressions is determined by the modifiers, which can be absolute, comparative, or superlative, just as in English text. Sentence 3 is an example of the superlative degree of appraisal.

Hence, these expressions can be distinguished by six attributes: adjectives as the head words, their modifiers, their orientation (positive or negative), the intensity of this orientation, a polarity mark assigned to each word to show the intensity value, and finally the negation (Syed et al. 2010). We consider two types of SentiUnits:

Single Adjective Phrases consist of an adjective head and possible modifiers, e.g. “بہت خوش” (bohat khush, very happy), “زیادہ بہادر” (zyada bhadur, more brave).

Multiple Adjective Phrases consist of more than one adjective with a delimiter or a conjunction in between, e.g. “بہت چالاک اور طاقتور” (bohat chalak aur taqatwar, very clever and strong).
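The six attributes and the two phrase types can be captured in a small data structure. The sketch below is one possible representation; the class name, field names and default values are assumptions made for illustration and do not reproduce the implementation used in this work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SentiUnit:
    """Hypothetical container for the six SentiUnit attributes."""
    adjectives: List[str]                                # head word(s)
    modifiers: List[str] = field(default_factory=list)   # e.g. bohat (very)
    orientation: Optional[str] = None                    # "positive" or "negative"
    intensity: str = "absolute"                          # or "comparative", "superlative"
    polarity: float = 0.0                                # polarity mark from the lexicon
    negated: bool = False                                # explicit negation attached

    @property
    def kind(self) -> str:
        """Single vs. multiple adjective phrase, by number of heads."""
        return "single" if len(self.adjectives) == 1 else "multiple"


unit = SentiUnit(adjectives=["khush"], modifiers=["bohat"], orientation="positive")
print(unit.kind)
```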

As mentioned above, a SentiUnit can be described by the following attributes:

1. Adjectives (as head words)

2. Modifiers

3. Orientation

4. Intensity

5. Polarity

6. Negation

These attributes are described briefly in the following paragraphs:

4.1. Adjectives

An adjective is a fundamental part of speech (POS) that expresses an attribute of a noun (place, thing, or person). In the sentence structure, adjectives generally appear in two ways: either they are directly linked with the noun within the noun phrase, or they associate with the noun through some other part of speech, e.g., a verb. In both cases they describe the characteristic features of the noun they qualify. This suggests that any opinion, sentiment or judgment about a noun can be determined by analyzing its adjectives. Because of this characteristic, the first effort at automatic sentiment analysis (SA) of English text employed adjectives as the main feature of the given text (Hatzivassiloglou & McKeown, 1993). Adjectives have since remained the center of attention in the sentiment analysis community (Turney, 2002), (Riloff et al., 2003), (Riloff & Wiebe, 2003) and (Bloom & Argamon, 2010).

As with all parts of speech, the use, type, and structure of adjectives differ from language to language. Urdu is morphologically rich, and hence its adjectives and adjectival phrases tend to be more complex because of frequent inflections and derivations. In addition to the morphological complexity, variability in vocabulary and grammar rules is regular in Urdu text and is considered normal. This is because the language is strongly influenced by many other languages, such as Persian, Arabic, Sanskrit and English. For example, the adjective “تازہ” (tazah, fresh) remains unmarked because it is a Persian loan word and follows Persian grammar, whereas most of the Sanskrit-based adjectives inflect to agree with the noun they qualify. For example, the demonstrative adjective “جیسا” (jaisa, such as) becomes “جیسی” (jaisee, such as) and “جیسے” (jaisay, such as) for gender and number, respectively. Moreover, the use of postpositions as independent lexemes involves more specific patterns and rules.

These aspects suggest that Urdu has distinct characteristics and features. We therefore present in this section a comprehensive overview of the structures of the adjectival phrases in Urdu text with respect to the task of sentiment analysis. For Urdu-based NLP research, this is the very first such effort, presented in (Syed et al 2011 b). So far, syntactic and morphological aspects of the language have been studied for verbs, nouns and other parts of speech, but we find no contribution that investigates Urdu adjectival phrases separately.

The given analysis covers almost all aspects of adjectives and adjectival phrases. We

describe their morphological structures, as marked and unmarked through the types of the

agreement with the noun they qualify. This agreement is more frequent for gender,

number, and case. Also, we discuss their structure when used with a sequence of nouns

and for the formations of reduplications. Moreover, we define and illustrate with

examples different adjective classes, i.e., descriptive, predicative, attributive, possessive,

demonstrative, and reflexive possessive adjective. For each class we describe the



morphological structure of the adjectives and their inflected forms. We take most

commonly used adjectives as examples and clearly describe their modifications.

Figure 4.1 Types of adjectives in Urdu.

In linguistics, for understanding the parts of speech (POS) of a language we need to

recognize their morphological structures and the processes through which these structures

are made. Another significant aspect is to look at their different forms or classes.

Therefore, we explore in this Section these two features of Urdu adjectival phrases. We

first describe their morphological structures and then the classes.

4.1.1. Morphological structure of adjectives

The morphological structure of Urdu adjectives is complex and exhibits frequent inflections and derivations in agreement with the noun they qualify.

Conceptually, adjectives in Urdu can be divided into two types (Schmidt 1999). The first type describes quantity and quality, e.g. “کم” (kam, less), “بدترین” (budtareen, worst), “زیادہ” (ziyada, more). The second type distinguishes one person from another, e.g. “حسین” (haseen, pretty), “فطین” (fateen, intelligent).



Morphologically, adjectives are categorized as marked and unmarked (Schmidt 1999). Marked adjectives are those that can be inflected for number and gender, e.g., (a) “اچھا کام” (acha kaam, good work), (b) “اچھے کام” (achay kaam, good works) and (c) “اچھی بات” (achi baat, good thing). In (a), (b), and (c), “اچھا” (acha, good) is inflected for masculine, plural and feminine, respectively. Unmarked adjectives are usually Persian loan words, e.g., “تازہ” (tazah, fresh), and adjectives derived from nouns, e.g., “دفتری” (daftary, official), derived from “دفتر” (daftar, office). Attributive adjectives are very frequent and they precede the noun they qualify, e.g., the adjective “مزیدار” (mazedaar, tasty) precedes the noun “مزہ” (maza, taste). Arabic and Persian loan adjectives are used predicatively, e.g., “معلوم ہونا” (maloom hona, to be known).

These adjectives appear in the form of phrases. The postpositions “سے” (say), “سی” (si), “سا” (sa), “والا” (wala), “والی” (wali) and “والے” (walay) are very frequently used with nouns to make adjectives, e.g., “پھول سی” (phool si, like a flower) from “پھول” (phool, flower). This discussion is summarized in Figure 4.1.

The unmarked adjectives do not show any inflection according to the nouns they qualify. In other words, they do not take suffixes to show agreement with nouns. Most of the Persian loan adjectives remain unmarked. Table 4.2 shows the unmarked adjectives “دلچسپ” (dilchasp, interesting) and “بہتر” (behtur, better) with (a) the masculine-singular noun “کام” (kaam, task), (b) the feminine-singular noun “کہانی” (khani, story) and (c) the feminine-plural noun “کہانیاں” (khanian, stories).

With masculine-singular noun | With feminine-singular noun | With plural noun
“دلچسپ کام” (dilchasp kaam, interesting task) | “دلچسپ کہانی” (dilchasp khani, interesting story) | “دلچسپ کہانیاں” (dilchasp khanian, interesting stories)
“بہتر کام” (behtur kaam, better task) | “بہتر کہانی” (behtur khani, better story) | “بہتر کہانیاں” (behtur khanian, better stories)

Table 4.2 Examples of unmarked adjectives

a) Adjective marking: agreement in gender and number: Adjectives are marked through suffixes for gender, masculine (m) and feminine (f), and for number, singular (s) and plural (p). For example, the masculine adjective “اچھا” (acha, good) is inflected for gender as “اچھی” (achi, good) and for number as “اچھے” (achay, good). These suffixes are attached to agree with the noun or nouns that the adjective qualifies. There are thus three suffixes: singular-masculine (a), singular-feminine (ee) and plural-masculine (ay). The single feminine suffix (ee) is used for both singular and plural.

Some examples of marked adjectives are given in Table 4.3. In this table we consider three nouns: (a) masculine-singular “بچہ” (bacha, kid), (b) feminine-singular “کار” (car, car) and (c) masculine-plural “دن” (din, days). These nouns cause inflection in the respective adjectives “اچھا” (acha, good), “لمبا” (lamba, long), and “برا” (bura, bad).

Adjective (m, s) | Inflected for gender (f) | Inflected for number (m, p)
“اچھا بچہ” (acha bacha, good kid) | “اچھی کار” (achi car, good car) | “اچھے دن” (achay din, good days)
“لمبا بچہ” (lamba bacha, tall kid) | “لمبی کار” (lambee car, long car) | “لمبے دن” (lambay din, long days)
“برا بچہ” (bura bacha, bad kid) | “بری کار” (buree car, bad car) | “برے دن” (buray din, bad days)

Table 4.3 Adjective marking with gender and number.
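The three suffixes described above can be sketched as a simple string rule. The code below is an illustration only: it assumes that a marked adjective is recognizable by its final alif, which is a simplification, since a real system would need a lexicon to separate marked from unmarked adjectives. The function name and the gender/number codes are hypothetical.

```python
ALIF = "\u0627"      # masculine-singular suffix (a)
CHOTI_YE = "\u06CC"  # feminine suffix (ee), singular and plural
BARI_YE = "\u06D2"   # masculine-plural suffix (ay)


def inflect_marked(adjective: str, gender: str, number: str) -> str:
    """Inflect a masculine-singular adjective for gender ('m'/'f') and
    number ('s'/'p'). Adjectives not ending in alif are returned as-is
    (assumed unmarked)."""
    if not adjective.endswith(ALIF):
        return adjective
    stem = adjective[:-1]
    if gender == "f":
        return stem + CHOTI_YE
    if number == "p":
        return stem + BARI_YE
    return adjective


acha = "\u0627\u0686\u06BE\u0627"  # acha (good)
print(inflect_marked(acha, "f", "s"), inflect_marked(acha, "m", "p"))
```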

b) Agreement in case: Urdu nouns have three cases; oblique, nominative and vocative.

The adjectives that qualify an oblique noun also become oblique.

The masculine-singular suffixes (a) and (an) are replaced by, (ay) and (ayn), respectively.

The feminine adjectives remain the same as shown in Table 4.4.

Case | Masculine | Feminine
Nominative | “چھوٹا” (chota, little), “ساتواں” (satwan, seventh) | “چھوٹی” (chotee, little), “ساتویں” (satween, seventh)
Oblique | “چھوٹے” (chotay, little), “ساتویں” (satwayn, seventh) | (same as nominative)
Vocative | “چھوٹے” (chotay, little), “ساتویں” (satwayn, seventh) | (same as nominative)



Table 4.4 Marking of adjectives for cases.
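The oblique rule for masculine-singular adjectives can be sketched in the same way. The code is an illustration under the assumption that the suffixes (a) and (an) correspond to a final alif and a final alif followed by noon ghunna; the function name is hypothetical, and feminine forms are assumed to be passed through unchanged by the caller.

```python
ALIF = "\u0627"
NOON_GHUNNA = "\u06BA"
CHOTI_YE = "\u06CC"
BARI_YE = "\u06D2"


def oblique_masculine(adjective: str) -> str:
    """Map masculine-singular (a) to (ay) and (an) to (ayn)."""
    if adjective.endswith(ALIF + NOON_GHUNNA):
        return adjective[:-2] + CHOTI_YE + NOON_GHUNNA
    if adjective.endswith(ALIF):
        return adjective[:-1] + BARI_YE
    return adjective


chota = "\u0686\u06BE\u0648\u0679\u0627"         # chota (little)
satwan = "\u0633\u0627\u062A\u0648\u0627\u06BA"  # satwan (seventh)
print(oblique_masculine(chota), oblique_masculine(satwan))
```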

c) Adjectives with noun sequences: Sometimes adjectives appear in a sentence with more

than one noun or multiple nouns making a sequence. In this case the nouns may differ in

gender and number.



The adjective agrees with the noun, which is nearest to it. Examples are given in Table

4.5, in which, “بڑا” (bara, big) inflects for “پلنگ” (palang, bed) and “چھوٹی” (choti,

younger) inflects for “خالہ” (khala, aunt).

Adjective | With the sequence of nouns
“بڑا” (bara, big) | “بڑا پلنگ اور الماریاں” (bara palang aur almarian, big bed and cupboards)
“چھوٹی” (choti, younger) | “چھوٹی خالہ، ماموں اور بچے” (choti khala, mamoon aur bachay, younger aunt, uncle and kids)

Table 4.5 Adjective agrees with the nearest noun in a sequence.

d) Reduplication of Adjectives: Urdu adjectives reduplicate either fully or partially. In full reduplication the whole word is repeated as it is, whereas in partial reduplication some syllables of the word are repeated with a different spelling. Examples of full and partial reduplication are given in Table 4.6.

Partial Reduplication | Full Reduplication
“ڈھیلا ڈھالا لباس” (dheela dhala libas, loose dress) | “بڑے بڑے کام” (baray baray kaam, great tasks)
“چھوٹی موٹی بات” (choti moti baat, minute matter) | “چھوٹی چھوٹی باتیں” (choti choti batain, minute matters)

Table 4.6 Adjective with partial and full reduplication

4.1.2. Classes of adjective

Urdu adjectives can be categorized as descriptive, predicative, attributive, possessive,

demonstrative, and reflexive possessive adjective, explained in following paragraphs:

Descriptive Adjectives: These are the most frequent and important type of adjectives.

They describe attributes of the noun they qualify in terms of its size, dimensions, sound,

color, shade, shape, quality, personal trait, or time, etc.

Some examples of descriptive adjectives in Urdu are given in Table 4.7, where “چھوٹا” (chota, little) and “لمبا” (lamba, long) describe the size of a noun, and “پیلا” (peela, yellow) and “سرخ” (surkh, red) express the color.

Category | Examples
Size | “چھوٹا” (chota, little), “لمبا” (lamba, long)
Color | “پیلا” (peela, yellow), “سرخ” (surkh, red)
Shape | “مربع” (muraba, square), “تکونا” (tikona, triangular)
Personal trait | “اداس” (udaas, sad), “مجبور” (majboor, helpless)
Qualities | “مہربان” (mehrbaan, kind), “اچھا” (acha, good)

Table 4.7 Descriptive adjectives in Urdu

Attributive Adjectives: If descriptive adjectives directly precede a nominal head as modifiers, they are called attributive adjectives, because they attributively modify or restrict the meaning of the noun. For example, the adjective “پیلا” (peela, yellow) modifies the noun “غبارہ” (ghubara, balloon) to make it “پیلا غبارہ” (peela ghubara, yellow balloon). In this way the attributive adjective becomes part of the noun phrase. Some more examples are given in Table 4.8.

Nouns | Modified attributively
“غبارہ” (ghubara, balloon) | “پیلا غبارہ” (peela ghubara, yellow balloon)
“چڑیا” (chiria, sparrow) | “اداس چڑیا” (udaas chiria, sad sparrow)
“بادشاہ” (badshah, king) | “مہربان بادشاہ” (mehrbaan badshah, kind king)

Table 4.8 Attributive adjectives directly modify the nouns

Predicative Adjectives: When the adjectives are used predicatively, they bring in new

information about the noun instead of modifying it.

Nouns | With predicative adjectives
“غبارہ” (ghubara, balloon) | “غبارہ پیلا ہے” (ghubara peela hay, the balloon is yellow)
“چڑیا” (chiria, sparrow) | “چڑیا اداس تھی” (chiria udaas thee, the sparrow was sad)
“بادشاہ” (badshah, king) | “وہ بادشاہ مہربان تھا” (woh badshah mehrbaan tha, that king was kind)

Table 4.9 Predicative adjectives describe the features of the nouns

Predicative adjectives are not components of the noun phrase; they are the complements of a copulative function, which links them to the noun. Take the first example from Table 4.9, “غبارہ پیلا ہے” (ghubara peela hay, the balloon is yellow). In this case, the adjective “پیلا” (peela, yellow) identifies the color of the noun “غبارہ” (ghubara, balloon). Only a specific feature of the noun is described, and both parts of speech, i.e., adjective and noun, keep their individual roles. Some more examples are given in Table 4.9.

Possessive Adjective: Possessive adjectives are used to indicate possession. This possession relation is realized in two ways: either the adjectives precede the head noun as modifiers in noun phrases, like the attributive adjectives, or they are preceded by a suitable form of the genitive postposition “کا” (ka, of), “کی” (kee, of), or “کے” (kay, of). These genitive postpositions are lexically independent like “of” in English, but they agree in number and gender with the object noun. Consider the first example from Table 4.10, “ارتضی کا پیلا غبارہ” (Irtaza ka peela ghubara, Irtaza’s yellow balloon). In this example the genitive postposition “کا” (ka, of) is used with a singular masculine noun phrase, i.e., “پیلا غبارہ” (peela ghubara, yellow balloon). In the second example, “میری” (meri, my) is a possessive adjective that is used for the first person and is here inflected for gender. The third example also contains the genitive postposition “کا” (ka, of) with a singular masculine noun.

Examples
“ارتضی کا پیلا غبارہ” (Irtaza ka peela ghubara, Irtaza’s yellow balloon)
“میری اداس چڑیا” (meri udaas chiria, my sad sparrow)
“ایران کا مہربان بادشاہ” (Iran ka mehrbaan badshah, kind king of Persia)

Table 4.10 Examples of possessive adjectives

Demonstrative Adjective: The demonstrative pronouns act as the adjectives to indicate or

demonstrate the specific inherent features of noun/ nouns of a particular type.

Adjectives | Examples
“ایسا” (aisa, like this) | “ایسا لباس” (aisa libas, the dress like this)
“ویسا” (waisa, like that) | “ویسا لباس” (waisa libas, the dress like that)
“جیسا” (jaisa, such as) | “جیسا لباس” (jaisa libas, such dress)
“کیسا” (kaisa, how) | “کیسا لباس؟” (kaisa libas, what kind of dress)

Table 4.11 Examples of demonstrative adjectives



As shown in Table 4.11, the Urdu demonstrative pronouns are different for near “ایسا”

(aisa, like this), far “ویسا” (waisa, like that), relative “جیسا” (jaisa, such as) and

interrogative “کیسا” (kaisa, how) demonstrations. These demonstrative adjectives inflect

to agree with the noun for gender and number. These inflections are shown in Table 4.12.

Adjectives | Inflected for gender | Inflected for number
“ایسا” (aisa, like this) | “ایسی” (aisee, like this) | “ایسے” (aisay, like this)
“ویسا” (waisa, like that) | “ویسی” (waisee, like that) | “ویسے” (waisay, like that)
“جیسا” (jaisa, such as) | “جیسی” (jaisee, such as) | “جیسے” (jaisay, such as)
“کیسا” (kaisa, how) | “کیسی” (kaisee, how) | “کیسے” (kaisay, how)

Table 4.12 Inflection of demonstrative adjectives

Reflexive possessive adjective: The reflexive possessive adjectives are very frequently

used in agreement with the noun they qualify, i.e., they inflect for gender, number and

case. For example, “اپنا” (apna, own), “اسکا” (uska, someone else’s) and “اسکا” (iska,

someone else’s) are used to indicate one’s own, someone else’s far, and someone else’s

near. The examples of the reflexive possessive adjective “اپنا” (apna, own) are given in

Table 4.13, it is inflected for gender as “اپنی چابی” (apni chabee, one’s own key) and for

number as “اپنے لوگ” (apnay loag, one’s own people).

Nouns | With reflexive possessive adjectives
“گھر” (ghar, house) | “اپنا گھر” (apna ghar, one’s own house)
“چابی” (chabee, key) | “اپنی چابی” (apni chabee, one’s own key)
“لوگ” (loag, people) | “اپنے لوگ” (apnay loag, one’s own people)

Table 4.13 Examples of reflexive possessive adjectives

4.2. Modifiers

The modifiers intensify the orientation of an adjective. They can be absolute, comparative or superlative. Modifiers made by postpositions are very frequent in Urdu writing. For example, the absolute adjective “مہنگا” (mehnga, expensive) is modified by the postposition “سے” (say) to make it comparative: “اس سے مہنگا” (is say mehnga, more expensive). Likewise, the postposition “سب سے” (sab say) results in a superlative expression: “سب سے مہنگا” (sab say mehnga, most expensive). Some Persian loan words are also commonly used in inflected forms. For example, “کم” (kam, less) is absolute and is inflected to make the comparative “کمتر” (kamtar, lesser) and the superlative “کمترین” (kamtareen, least). Detailed examples of modifiers are given in Table 4.14.

Modifier | Made by postpositions | Persian loan words
Absolute | “مہنگا” (mehnga, expensive) | “کم” (kam, less)
Comparative | (a) “سے”: “اس سے مہنگا” (is say mehnga, more expensive); (b) “سے زیادہ”: “اس سے زیادہ مہنگا” (is say ziyadah mehnga) | “کم” + “تر” = “کمتر” (kam + tar = kamtar, lesser)
Superlative | (a) “سب سے”: “سب سے مہنگا” (sab say mehnga, most expensive); (b) “سب میں”: “سب میں مہنگا” (sab main mehnga); (c) “سب سے زیادہ”: “سب سے زیادہ مہنگا” (sab say ziyadah mehnga) | “کم” + “ترین” = “کمترین” (kam + tareen = kamtareen, least)

Table 4.14 Adjective modifiers.

These examples are further elaborated below for the noun “لباس” (libaas, dress):

a). Absolute

“یہ لباس مہنگا ہے۔”
Yeh libaas mehnga hay.
This dress is expensive.

b). Comparative

Either “say” or “say zyadah” can be used for a comparison between two objects.

“یہ لباس اس سے مہنگا ہے۔”
Yeh libaas us say mehnga hay.
This dress is more expensive than that.

or

“یہ لباس اس سے زیادہ مہنگا ہے۔”
Yeh libaas us say zyadah mehnga hay.
This dress is more expensive than that.

c). Superlative

For superlatives, “sab say”, “sab main”, or “sab say zyadah” are used.

“یہ لباس سب سے زیادہ مہنگا ہے۔”
Yeh libaas sab say zyadah mehnga hay.
This dress is the most expensive.

or

“یہ لباس سب سے مہنگا ہے۔”
Yeh libaas sab say mehnga hay.
This dress is the most expensive.

or

“یہ لباس سب میں مہنگا ہے۔”
Yeh libaas sab main mehnga hay.
This dress is the most expensive.

Also, as in Persian grammar, the modifiers “een” or “treen” are used for the same purpose, e.g. “بہترین” (behtareen, best), “بدترین” (badtareen, worst), “کمترین” (kamtareen, lowest).

“یہ لباس بہترین ہے۔”
Yeh libaas behtareen hay.
This dress is the best.
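The patterns above can be sketched as a rule that assigns a degree to an adjective from the tokens that precede it and from its suffix. This is an illustration only, not the rule set of the proposed framework: the function name, the return labels and the order of the checks are assumptions, and the rule does not distinguish comparative “say” from other uses of the same postposition.

```python
from typing import List

SAB = "\u0633\u0628"                  # sab (all)
SAY = "\u0633\u06D2"                  # say
MAIN = "\u0645\u06CC\u06BA"           # main / mein (in)
ZYADAH = "\u0632\u06CC\u0627\u062F\u06C1"  # zyadah (more)
TAR = "\u062A\u0631"                  # Persian comparative suffix
TAREEN = TAR + "\u06CC\u0646"         # Persian superlative suffix


def degree(tokens: List[str], adj_index: int) -> str:
    """Return 'absolute', 'comparative' or 'superlative' for the
    adjective at tokens[adj_index]."""
    adjective = tokens[adj_index]
    before = list(tokens[:adj_index])
    if before and before[-1] == ZYADAH:   # optional "zyadah"
        before.pop()
    if len(before) >= 2 and before[-2] == SAB and before[-1] in (SAY, MAIN):
        return "superlative"
    if before and before[-1] == SAY:
        return "comparative"
    if adjective.endswith(TAREEN):
        return "superlative"
    if adjective.endswith(TAR):
        return "comparative"
    return "absolute"


mehnga = "\u0645\u06C1\u0646\u06AF\u0627"  # mehnga (expensive)
print(degree(["yeh", "libaas", SAB, SAY, ZYADAH, mehnga, "hay"], 5))
```

The neutral tokens in the usage example are written in transliteration only to keep the sketch short; any token that is not one of the listed markers is treated the same way.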

4.3. Orientation

Orientation describes the positivity or negativity of an expression, e.g. “اچھا” (acha, good) has positive orientation.

4.4. Intensity

This is the intensity of the orientation, e.g. “بہتر” (behtar, better) and “بہترین” (behtareen, best).

4.5. Polarity

A polarity mark is attached to each word in the lexicon to show the orientation.

4.6. Negations

Negation is one of the most frequent linguistic structures that change word, phrase, or sentence polarity. Negation is not limited to negation markers or particles such as not, never, or no; various other constructions also serve to negate the inherent sentiment of a comment. Moreover, the presence of negation influences the contextual polarity of words, but this does not mean that all sentiment-bearing words are inverted.

Different forms of negation are discussed in the literature. Here, we give the three main forms. Negation can be morphological, i.e., attached as a prefix or suffix to form a single lexical unit; e.g., the prefix “بے” (bay) in “بے پرواہ” (bayparwah, careless) negates the word “پرواہ” (parwah, care). It can also be implicit, as in “یہ گھڑی تمہارے معیار سے کم ہے” (yeh gharee tumharay mayaar say kam hay, this watch is below your standard/level). This comment conveys a negative opinion even though it contains no negation particle. To our knowledge, no research work is available to handle this type of negation, because it cannot be handled automatically.

Lastly, negation can be explicit through the use of negation particles, e.g., “یہ گھڑی تمہارے معیار کے مطابق نہیں” (yeh gharee tumharay mayaar kay mutabiq naheen, this watch is not according to your standard/level). In this comment, the negative effect is conveyed by the negation particle “نہیں” (naheen, not), which can be detected automatically. Most efforts toward the automatic treatment of negation in sentiment analysis focus on this last type, in which the negation appears explicitly.

4.6.1. Negation in Urdu language

In our work, we focus on negation that appears explicitly in the given text through negation particles. In Urdu, both sentential and constituent negation exist. Some prominent negation particles are “مت” (mat, don’t), “نا” (na, no), “نہیں” (naheen, not), “بنا” (bina, without), and “بغیر” (baghair, without).

Sentential Negation:

The negative particles “نہیں” (naheen, not), “مت” (mat, don’t) and “نا” (na, no) are used to

express sentential negation. The particle “نہیں” (naheen, not) appears before the main

verb, which may or may not be followed by an auxiliary verb. In imperative

constructions, the particles “مت” (mat, don’t) and “نا” (na, no) are used in the preverbal



position. Table 4.15 gives the use of these negation particles before the main verbs; “جاتا”

(jata, goes) and “پڑھو” (parho, read).

Examples
“وہ سکول نہیں جاتا ہے” (woh school naheen jata hay, He doesn’t go to the school.)
“کتاب مت پڑھو” (kitaab mat parho, Don’t read the book.)
“کتاب نا پڑھو” (kitaab na parho, Don’t read the book.)

Table 4.15 Examples of sentential negation from Urdu text.

Constituent Negation:

Constituent negation is used to negate one or more particular constituents of a sentence. Usually the negative particle comes after the negated constituent. Some common constituent negation particles are “نہیں” (naheen, not), “مت” (mat, don’t), “نا” (na, no), “علاوہ” (ilaawa, except), “سوا” (siva, except) and “بنا” (bina, without). In Table 4.16, the negation particles “نہیں” (naheen, not), “مت” (mat, don’t), “نا” (na, no) and “سوا” (siva, except) are used after the negated constituent.

Examples
“کیمرہ کالا نہیں نیلا ہے” (camera kala naheen neela hay, camera is blue, not black)
“انگور مت/نا خریدو انار خریدو” (angoor mat/na khareedo anar khareedo, don’t buy grapes, buy pomegranate)
“موبایل کے رنگ کے سوا سب اچھا ہے” (mobile kay rang kay siwa sab acha hay, everything is fine with the mobile except its color)

Table 4.16 Examples of constituent negation from Urdu text.

Use of multiple negation particles:

Sometimes double negation markers are used for emphasis. For example, in the sentence “وہ سکول نہیں نا گیا” (woh school naheen na gya, he did not go to school), the two negation particles “نہیں” (naheen, not) and “نا” (na, no) are used together to give stress or emphasis.

Negation in coordinate structures:



In coordinate structures, the negation particle does not move to the coordinate point unless the identical element is deleted from the second negative conjunct. But in situations like ‘neither … nor’, it appears in the initial position. For example, “نا گھر نیا ہے نا ہوادار” (na ghar nya hay, na hawa daar, The house is neither new nor ventilated).

Hence, in Urdu negation particles exist at both levels, i.e., sentential and constituent, as in English, but their use in the sentence structure is different.

4.7. SentiUnit extraction model

As already discussed, our approach to automatic sentiment classification is based on subjective phrases or appraisal expressions, so we emphasize the accurate identification of the SentiUnits.

Figure 4.2 SentiUnit extraction and polarity computation.



The model is grammatically motivated and works at the level of the grammatical structure of the sentences. It uses a sentiment-annotated lexicon based approach to identify such expressions in corpora of Urdu text reviews (see Figure 4.2). The adjectives, their modifiers and polarity shifters such as the explicit negation particles “نہیں” (naheen, not), “مت” (mat, don’t), “نا” (na, no), etc., are handled within these expressions.

For a given Urdu language based review, the SentiUnit extraction and polarity

computation takes place in three phases.

a. Firstly, the normalized text is passed to the parts-of-speech (POS) tagger, which

assigns POS tags to all the terms. Along with this tagging the word polarities are also

annotated to the subjective words. This polarity annotation takes place with the help

of the sentiment annotated lexicon of the Urdu text.

b. These annotated subjective terms (adjectives) are considered as the headwords for the

next phase in which shallow parsing is applied for phrase chunking and the adjectival

phrases are chunked out. Now, these chunks are converted into SentiUnits by

attaching the negation, modifiers, conjunctions, etc.

c. In the last phase, the identified SentiUnits are analyzed for polarity computation. The polarity of the subjective terms is combined with the effect of the negation, if it exists in the SentiUnit. Hence, the overall sentiment or impact of the SentiUnit is a combination of its constituents.

4.8. The appraisal targets

So far we have discussed adjectives and adjectival phrases as the SentiUnits, which express an attribute of a noun (place, thing, or person). The noun is the fundamental part of speech (POS) about which the opinion is expressed. If the link between the noun and the adjective is not identified correctly, the opinion is likely to be misinterpreted or misclassified. We call these nouns or noun phrases the targets of the appraisal. Commonly, the adjective-to-noun association appears in the sentence structure in two ways.

The adjectives are directly linked with the noun within the noun phrase

They associate with the noun through some other part of speech, e.g., verb.



In both cases they describe the characteristic features of the noun they qualify. The

following section describes the characteristics and structure of the noun phrases in the

Urdu language.

4.8.1. Cases of noun phrases

The core case markers change the case of the NP into four different types, i.e.,

nominative, ergative, dative and accusative, summarized in Figure 4.3.

Figure 4.3 Cases of noun phrases with core case markers

4.8.2. Possession markers in noun phrases

In Urdu the genitive markers or postpositions are used as the possession markers (PM).

There are three possession markers in Urdu, as shown in Table 4.17.

In the literature, possession markers are considered different from case markers due to the following features:

In a noun phrase the possession markers come between two nominals. For example, in “فلم کا نام” (film ka naam, name of the movie), the possession marker “کا” (ka) comes between the nominals “فلم” and “نام”.

Cases of NP with core case markers (content of Figure 4.3):

a. Nominative: There is no case marker with the NP; the noun is in the nominative case

b. Ergative: NP marked with the case marker “نے” (ne) in an actor role

c. Dative: NP marked with “کو” (ko) in an indirect object or receiver role

d. Accusative: NP marked with “کو” (ko) in a direct object role



A possession marker indicates that in a noun phrase the first nominal is the possessor

or holder of the second nominal.

The second nominal in the noun phrase determines the form of the possession marker: the first nominal is in the oblique form and the marker agrees with the second nominal in number and gender. For example, in the noun phrase “فلم کا نام” (film ka naam, name of the movie), the possession marker “کا” (ka) agrees with the second noun “نام”, which is singular masculine.

As the possession markers are not restricted by a verbal predicate, they do not directly mark a grammatical function.

PM | Gender, number | Example
کا (ka) | Masculine, singular | فلم کا نام (film ka naam, name of the film)
کی (kee) | Feminine, singular | فلم کی کہانی (film kee kahani, story of the film)
کی (kee) | Feminine, plural | فلم کی کہانیاں (film kee kahanian, stories of the film)
کے (kay) | Masculine, plural | فلم کے کردار (film kay kirdar, characters of the film)

Table 4.17 Possession markers in Urdu.
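The agreement rule summarized in Table 4.17 can be illustrated with a small lookup. This is only a sketch: the names `POSSESSION_MARKERS`, `possession_marker` and `genitive_phrase` are hypothetical, and the oblique inflection of the first nominal is not modeled.

```python
# Possession markers from Table 4.17, keyed by the (gender, number) of the
# SECOND nominal (the possessed noun), with which the marker agrees.
POSSESSION_MARKERS = {
    ("masculine", "singular"): "کا",  # ka
    ("masculine", "plural"): "کے",    # kay
    ("feminine", "singular"): "کی",   # kee
    ("feminine", "plural"): "کی",     # kee
}


def possession_marker(gender: str, number: str) -> str:
    """Return the possession marker that agrees with the possessed noun."""
    return POSSESSION_MARKERS[(gender, number)]


def genitive_phrase(possessor: str, possessed: str, gender: str, number: str) -> str:
    """Build 'possessor + PM + possessed' (the possessor is used as given)."""
    return f"{possessor} {possession_marker(gender, number)} {possessed}"


# film ka naam: the marker agrees with the masculine singular noun "naam".
print(genitive_phrase("فلم", "نام", "masculine", "singular"))
```

The gender and number of the second nominal are passed in explicitly here; in a full system they would come from a morphological analyzer or the lexicon.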

4.8.3. Effect of complex noun phrases in Urdu text

The EXTRACTOR module of the system (explained in Chapter 5) identifies the targets through shallow parsing based chunking. These targets are the non-overlapping noun phrases “اسمی ترکیب” (ismi tarkeeb) present in the text. Noun phrases are units of one or more linked words with a noun as the head word and all other words as dependents. Urdu noun phrases vary in structure and level of complexity. A noun phrase can even include other phrases as its components, e.g., adjectival and genitive phrases. In addition to the internal complexity of the noun phrase, its position in the sentence is not always the same. This is due to the free word order property of Urdu text (Rizvi and Hussain 2005). Hence, the chunker for Urdu noun phrases must be capable of handling both aspects simultaneously.

Example:

The following sentence contains a complex noun phrase.



“ارتضی کا کھلونا روبوٹ شاندار ہے”

Irtaza ka khilona robot shandar hay

The toy robot of Irtaza is wonderful

Description:

In this sentence a complex noun phrase is used which is based on three nouns “ارتضی”

(Irtaza, proper noun), “کھلونا” (khilona, toy) and “روبوٹ” (robot, robot) with a possession

marker “کا” (ka, of).

NP = ارتضی کا کھلونا روبوٹ

Irtaza ka khilona robot

The SentiUnit in the sentence is based on a single adjective with positive orientation, i.e., “شاندار” (shandar, wonderful).

Chapter review:

Chapter 4 has described SentiUnits in detail as the sentiment carrier expressions. A general model for identifying a subjective sentence or opinion with identifiable appraisal expressions is based on three units, i.e., the source of the appraisal, the appraisal expression, and the target of the appraisal. This model is defined in detail in the next chapter, where the source of appraisal in a given review is the reviewer and the target is the entity about which the appraisal is made. Our approach to sentiment analysis of the Urdu language is grammatically motivated and incorporates a sentiment-annotated lexicon for the identification of the sentiment carrier expressions, or appraisal expressions, in a sentence.



CHAPTER 5

IMPLEMENTATION: CLASSIFICATION MODEL AND

LEXICON STRUCTURE

In this chapter, we present our sentiment classification model, which handles a morphologically rich language, Urdu. Our model is grammatically motivated and employs a sentiment-annotated lexicon based classification approach for the identification of the sentiment carrier expressions in a sentence, called the SentiUnits. The sentence subjectivity is based on these expressions and all the other terms are considered neutral. Therefore, the subjective polarity of a sentence is computed from the polarities of its constituent SentiUnits. We logically partition a single opinionated sentence into three units:

1. Source of appraisal

2. SentiUnit (the appraisal expression)

3. Target of this appraisal.

First, we extract the SentiUnits and the targets, and then these targets are associated with

the respective SentiUnits.

We break up the task of sentiment analysis into four modules, as shown in Figure 5.1.

The PREPROCESSOR module identifies the word boundaries and segments the sentence into meaningful words or lexical units. The output of the PREPROCESSOR goes to the EXTRACTOR module as input. The EXTRACTOR extracts the sentiment expressions and the noun phrases as the SentiUnits and the targets, respectively. Then, the ASSOCIATOR module links the candidate targets to each extracted SentiUnit.

Finally, the CLASSIFIER identifies the polarities of the SentiUnits in each sentence and calculates the overall sentiment of the review as the sum of the sentence polarities.



Figure 5.1 System model representing modules and their interactions (Syed et al 2012).

5.1. PREPROCESSOR

In general, for natural language processing applications, the preprocessing phase deals with the removal of punctuation marks and other unnecessary symbols and the stripping of HTML tags. In addition to these tasks, our PREPROCESSOR module has to handle the diacritics and word boundary identification issues, which are specific to the Urdu language.

5.1.1. Diacritic omission

Similar to the other Arabic script based languages (Persian, Turkish, Sindhi, and Punjabi), the Urdu script consists of two classes of symbols: letters and diacritics. Just like the letters, the diacritics are useful for the readability and understanding of the script. They not only represent the vowels but also affect the meanings of the words. In writing, however, these symbols are optional, and it is observed that some authors use some diacritics regularly and ignore the others entirely. Overuse of a particular kind is also very common. Hence, their use is highly author dependent. This underuse, overuse, and sometimes absence of diacritics adds to the morphological as well as lexical ambiguity of the language. For example, POS tagging of diacritic-bearing words can generate incorrect results due to ambiguous meaning. This issue is considered an unresolved critical problem in linguistics research and hence, as a regular practice, the diacritics are removed as a part of the preprocessing phase (Durrani and Hussain 2010).
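Diacritic omission can be sketched with a single regular expression. The set of code points treated as diacritics below (U+064B to U+065F plus U+0670) is an assumption made for illustration and is not taken from the cited work; the function name `strip_diacritics` is likewise hypothetical.

```python
import re

# Assumed set of Arabic-script combining marks to drop:
# U+064B..U+065F (tanween, zabar, zer, pesh, shadd, jazm, ...) and
# U+0670 (superscript alef). This range is an illustrative choice.
_DIACRITICS = re.compile(r"[\u064B-\u065F\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove optional diacritic marks, leaving the base letters untouched."""
    return _DIACRITICS.sub("", text)


# "kitab" written with a zer under the first letter loses only the mark.
print(strip_diacritics("\u06A9\u0650\u062A\u0627\u0628"))
```

Because only combining marks are matched, letters, digits and white space pass through unchanged, so the step can be applied before tokenization.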

5.1.2. Word boundary identification

In almost all natural language processing applications, word segmentation or word boundary identification through tokenization is the first obligatory task. Tokenization is easy to implement for languages in which word boundaries are identified through punctuation marks or white spaces, e.g., Spanish, English, and French (Lehal 2010). In such languages, the input sentence is considered as a sequence of letters, which determines a sequence of words, i.e., < w1, w2, w3 ... wi > → < l1, l2, l3 ... lj >. Each sentence is segmented into lexical words with the help of the word boundaries. However, this process becomes complicated if white spaces or other word delimiters are rarely or never used as word boundaries.

As already mentioned in Chapter 4, Urdu orthography is context sensitive. The “حروف” (haroof, alphabets) are divided into two categories, joiners and non-joiners. The joiners take multiple glyphs and shapes according to the context, which causes word boundary identification issues. The work in (Durrani and Hussain 2010) divides the word segmentation of Urdu text into two sub-problems, i.e., space insertion and space omission.

i) Space-insertion

Many words in Urdu are made up of more than one ligature (usually two). Semantically and syntactically these ligatures are part of a single word. If the last letter of the first ligature in a word is a joiner, it tends to join with the first letter of the second ligature. To avoid this joining, the writer inserts a space.

This causes space insertion errors, e.g., “خوش باش” (khush bash, happy) is a single word with two ligatures, L1 = “خوش” and L2 = “باش”. The last letter of L1, “ش”, is a joiner, which tends to join with the first letter of L2, “ب”; to avoid this joining, a space is inserted while typing the word. On omitting this space we get “خوشباش”, which is not a correct word, so the writer cannot avoid the space.

ii) Space-omission

There are many words which end with non-joiner letters. As the non-joiner letters keep a constant shape, writers usually do not insert a space before the next word to mark the word boundary. This does not affect the readability of the words, but for computational tasks boundary identification becomes an issue because both words are written continuously without a space. For example, the phrase “شیراوربکری” (shair aur bakri, lion and goat) is written without spaces, and “شیر اور بکری” (shair aur bakri, lion and goat) with spaces. We rewrite the phrase with the symbol “|” to indicate the word boundaries: “شیر | اور | بکری” (shair aur bakri, lion and goat).
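To make the space-omission problem concrete, the sketch below recovers the boundaries of the example phrase with greedy longest-match lookup against a toy lexicon. This is a deliberately simple stand-in technique, not the segmentation model of (Durrani and Hussain 2010); the function name and the three-word lexicon are assumptions for illustration.

```python
def segment_longest_match(text, lexicon):
    """Split an unspaced string into words by greedy longest match.

    Returns a list of words, or None if some position matches no entry.
    """
    max_len = max(len(word) for word in lexicon)
    words, pos = [], 0
    while pos < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in lexicon:
                words.append(candidate)
                pos += length
                break
        else:
            return None  # no lexicon entry starts at this position
    return words


# Toy lexicon for the example phrase (shair aur bakri, lion and goat).
EXAMPLE_WORDS = ["شیر", "اور", "بکری"]
TOY_LEXICON = set(EXAMPLE_WORDS)
unspaced = "".join(EXAMPLE_WORDS)  # the phrase written without spaces

segmented = segment_longest_match(unspaced, TOY_LEXICON)
print(" | ".join(segmented) if segmented else "no segmentation found")
```

Greedy matching can pick a wrong boundary when one lexicon word is a prefix of another, which is one reason a practical segmenter needs more than dictionary lookup.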

For the Urdu language, most researchers handle word segmentation as a part of a larger task, e.g., a morphological analyzer, POS tagger, or translator. A few contributions deal with this issue as an independent task, for example, (Durrani and Hussain 2010), (Lehal 2010) and (Lehal 2009). In particular, (Durrani and Hussain 2010) presents a detailed literature survey to identify the inherent causes and then proposes a word segmentation model.

Based on the above discussion and previous work, we propose to perform the PREPROCESSOR task in four steps, as shown in Figure 5.2. First, normalization is performed on the given text to remove symbols and tags. Second, diacritic omission is performed to avoid ambiguity. Third, the sentence is tokenized as a sequence of orthographic words OW = ow1, ow2, … own, where the words ow1, ow2, ... are not necessarily grammatical or meaningful words but are only orthographically separated from each other.

Finally, this sequence becomes the input to the segmentation module. The result of segmentation is a sequence of meaningful and grammatically correct words ready for further processing.
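The four steps can be chained as in the following sketch. All function names are hypothetical, the regular expressions are illustrative assumptions, and the final segmentation step is left as a pass-through placeholder because its internals are not specified here.

```python
import re

_TAG = re.compile(r"<[^>]+>")  # HTML-like tags
# Punctuation and symbols, deliberately sparing the assumed diacritic range
# so that step 1 does not split words before step 2 removes the marks.
_SYMBOL = re.compile(r"[^\w\s\u064B-\u065F\u0670]")
_DIACRITIC = re.compile(r"[\u064B-\u065F\u0670]")  # assumed diacritic range


def normalize(text):
    """Step 1: replace tags and punctuation/symbols with spaces."""
    return _SYMBOL.sub(" ", _TAG.sub(" ", text))


def omit_diacritics(text):
    """Step 2: drop the optional diacritic marks."""
    return _DIACRITIC.sub("", text)


def tokenize(text):
    """Step 3: split on white space into orthographic words ow1..own."""
    return text.split()


def segment(orthographic_words):
    """Step 4 placeholder: a real segmenter would repair space-insertion and
    space-omission errors here; this stub returns its input unchanged."""
    return list(orthographic_words)


def preprocess(text):
    """Run the four PREPROCESSOR steps in order."""
    return segment(tokenize(omit_diacritics(normalize(text))))


# "<p>kitab achi hay.</p>" with a zer on the first letter and an Urdu full stop.
print(preprocess("<p>\u06A9\u0650\u062A\u0627\u0628 \u0627\u0686\u06BE\u06CC \u06C1\u06D2\u06D4</p>"))
```

Keeping each step as a separate function mirrors the order in Figure 5.2 and lets the placeholder segmenter be replaced without touching the other steps.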

Figure 5.2 Preprocessing of the input sentence by the PREPROCESSOR module (Syed et al 2012).



5.2. EXTRACTOR

The EXTRACTOR module identifies and extracts the SentiUnits and the targets. Two subtasks are performed:

Extracting SentiUnits with adjectives as head words

Extracting targets with nouns as head words

The EXTRACTOR module uses shallow parsing based text chunking. This method identifies the beginnings and ends of grammatical phrases without parsing the full phrase structure. Hence, the EXTRACTOR shallow-parses each sentence in the given review to find adjective or noun phrases and then works out their attributes (modifiers, orientation, intensity, etc.), modeling the behavior of the modifiers and the negations within the phrase.

Figure 5.3 Processing of the input sentence by EXTRACTOR module (Syed et al 2012).

For extracting SentiUnits, the parser starts with a lexicon of nominal and adjectival head words, which defines initial values for orientation, whether positive or negative. In addition to positive or negative orientation, head words exhibit the intensity of the orientation. The parser searches for occurrences of these head words in the sentence, and upon finding them it moves rightward to attach modifiers, because the modifiers appear on the right side of the adjectives in Urdu. It then searches for the polarity shifters or negations and finally delimits the whole subjective expression. Likewise, the parser identifies candidate targets with the help of the lexicon: it finds all the target groups that match words specified in the lexicon. These steps are given in Figure 5.3.
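The head-word search described above can be sketched over a list of tokens held in reading order. Since Urdu is written right to left, a modifier that appears to the right of an adjective on the page comes immediately before it in the token list. The sketch also assumes that an explicit negation particle directly follows the head word, as in “اچھا نہیں” (acha naheen, not good). The tiny word sets and all names are hypothetical.

```python
BOHAT = "بہت"     # bohat, very (modifier)
ACHA = "اچھا"     # acha, good (adjectival head word)
NAHEEN = "نہیں"   # naheen, not (negation particle)

# Toy lexicon fragments; a real lexicon would be far larger.
HEAD_WORDS = {ACHA: +1}   # head word -> prior orientation
MODIFIERS = {BOHAT}
NEGATIONS = {NAHEEN}


def extract_sentiunits(tokens):
    """Return one SentiUnit (as a dict) per adjectival head word found."""
    units = []
    for i, token in enumerate(tokens):
        if token not in HEAD_WORDS:
            continue
        unit = {"head": token, "modifier": None, "negated": False}
        # The modifier sits to the right on the page, i.e. just before the head.
        if i > 0 and tokens[i - 1] in MODIFIERS:
            unit["modifier"] = tokens[i - 1]
        # The negation particle is assumed to follow the head directly.
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATIONS:
            unit["negated"] = True
        units.append(unit)
    return units


print(extract_sentiunits([BOHAT, ACHA, NAHEEN]))
```

Only immediate neighbours are inspected here; a shallow parser would work on POS-tagged chunks rather than raw adjacency.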

5.3. ASSOCIATOR

Figure 5.4 The dependency parsing of the given sentence (Syed et al 2012).

The extracted SentiUnits and targets are associated with each other by the ASSOCIATOR. We apply dependency parsing for this purpose. Figure 5.4 shows the dependency parsing of the sentence:

“لڑکا کمپیوٹر اور الیکٹرونکس کی چیزیں بیچتا ہے”

larka computer aur electronics kee cheezain baichta hay.

The boy sells computer and electronic products.

5.3.1. Working of the ASSOCIATOR



First, the nominal group that is the lexical representation of the target is identified, and then the values of the attributes describing that target are computed. The ASSOCIATOR finds the target phrase by following paths through a dependency parse of the sentence. The result of the dependency parse is a ranked list of paths or linkage specifications. These specifications are ranked to specify the order in which the links should be traversed. For each SentiUnit, the system looks for paths through the dependency tree that connect any word in the SentiUnit to the next or final expected word according to the specification of that particular link. Once a word is identified in the proper syntactic place, shallow parsing is applied, moving rightward, to find a noun phrase that ends in the identified word. These steps are shown in Figure 5.5.

Figure 5.5 Linking SentiUnits with candidate targets by ASSOCIATOR module (Syed et al 2012).

5.3.2. Algorithm

Hence, the steps performed by ASSOCIATOR are:

Input: Shallow parsed sentence with extracted SentiUnits and targets.

Processing: Apply dependency parse and then,

1. Search all the linkages such that:

a. The linkage is in the linkage specifications

b. The linkage connects to a chunked SentiUnit

c. The linkage need not connect to a chunked target

2. For each chunked SentiUnit:

a. If there exists any linkage to a chunked target, then

b. Remove the unconnected linkages

3. Select the linkage according to the priority of the linkage specifications.

Output: One linkage per SentiUnit.
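One possible reading of these steps is sketched below. The `Linkage` record, the function `associate` and the way the dependency parse is represented (a flat list of typed links) are assumptions for illustration; the text does not fix a data format or the names of the linkage specifications.

```python
from collections import namedtuple

# kind: linkage type from the dependency parse; target may be None
Linkage = namedtuple("Linkage", "kind sentiunit target")


def associate(linkages, sentiunits, targets, spec_ranking):
    """Pick at most one linkage per SentiUnit.

    spec_ranking lists the allowed linkage kinds, highest priority first.
    """
    rank = {kind: i for i, kind in enumerate(spec_ranking)}
    # Step 1: keep linkages named in the specifications that touch a SentiUnit.
    candidates = [link for link in linkages
                  if link.kind in rank and link.sentiunit in sentiunits]
    result = {}
    for unit in sentiunits:
        own = [link for link in candidates if link.sentiunit == unit]
        # Step 2: if any linkage reaches a chunked target, drop the others.
        connected = [link for link in own if link.target in targets]
        if connected:
            own = connected
        # Step 3: choose the highest-priority remaining linkage.
        if own:
            result[unit] = min(own, key=lambda link: rank[link.kind])
    return result


links = [
    Linkage("B", "su1", None),
    Linkage("A", "su1", "t9"),   # t9 is not a chunked target
    Linkage("B", "su1", "t1"),
]
print(associate(links, ["su1"], {"t1"}, ["A", "B"]))
```

In this example the lower-priority linkage wins because it is the only one that reaches a chunked target, which is the effect of step 2.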

5.4. CLASSIFIER

The CLASSIFIER starts by calculating the intensity of orientation of the SentiUnits, comparing each tagged word with the polarity values assigned in the lexicon entries. For example, the expression “بہت اچھی کتاب” (bohat achi kitab, very good book) is more intense than “اچھی کتاب” (achi kitab, good book) due to the modifier “بہت” (bohat, very), and both are positive expressions. In this expression, the SentiUnit “بہت اچھی” (bohat achi, very good) is associated with the target “کتاب” (kitab, book).

The CLASSIFIER looks for other associations identified by the ASSOCIATOR and then calculates the polarity value of each association for a particular target, e.g., “کتاب” (kitab, book) in this case. If “بہت اچھی” (bohat achi, very good) is the only expression in the sentence showing sentiment about the target, then the sentence polarity is equal to the polarity of this expression; otherwise the other expressions are also evaluated. The polarity is calculated as the sum of the positive and negative expressions, which carry positive and negative values respectively.

5.4.1. Working of the CLASSIFIER

According to the problem statement, the given review R may consist of a single sentence or of multiple sentences, among which some are subjective sentences in the set Ss = {Ss1, Ss2, Ss3, … Ssk} and the others are objective sentences in the set So = {So1, So2, So3, … Sol}, such that,

R = {Ss1, Ss2, Ss3, … Ssk} ∪ {So1, So2, So3, … Sol},

where,

k = 1, 2, 3 … n; l = 1, 2, 3 … m; n and m are finite numbers.

The final polarity of the review, PR, is calculated as the sum of all sentence polarities computed by the CLASSIFIER module. If Psi represents the polarity of the i-th sentence, then,

PR = ∑ Psi ,

where,

i = 1, 2, 3 … N; N is a finite number.

5.4.2. Algorithm:

Hence, the CLASSIFIER module is divided into two steps, as given next.

Step 1: Compute sentence polarity

Input: Dependency parsed sentence with SentiUnit-to-target associations.

Processing: Start with any one SentiUnit of a particular target

a. COMPARE each word in the SentiUnit with the lexicon to find its orientation and polarity value

b. COMPUTE the SentiUnit polarity by adding the polarities of the words according to the intensity values

c. LOOK FOR another SentiUnit for the same target

d. Sentence polarity = SUMMATION of all SentiUnits’ polarities for a particular target

Step 2: Compute total polarity of review

a. REPEAT Step 1 for all sentences

b. ADD all polarity values to calculate PR

c. COMPARE with threshold

Case a: If PR > threshold, then R is positive.

Case b: If PR < threshold, then R is negative.

Output: Classification of review as positive or negative.
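The two steps can be sketched as follows, assuming the SentiUnit polarities of each sentence have already been computed. The sketch simplifies Step 1 by summing over all SentiUnits of a sentence without grouping them by target. The default threshold of 0 and the "undecided" label for PR equal to the threshold are assumptions, since that case is not covered by the algorithm above.

```python
def sentence_polarity(sentiunit_polarities):
    """Step 1: sum the SentiUnit polarities found in one sentence."""
    return sum(sentiunit_polarities)


def classify_review(sentences, threshold=0):
    """Step 2: add sentence polarities into PR and compare with the threshold.

    `sentences` is a list of lists of SentiUnit polarities; an objective
    sentence is an empty list and therefore contributes 0.
    """
    pr = sum(sentence_polarity(units) for units in sentences)
    if pr > threshold:
        return pr, "positive"
    if pr < threshold:
        return pr, "negative"
    return pr, "undecided"  # PR == threshold: assumed label, not in the text


# Three sentences: one mixed, one objective, one positive.
print(classify_review([[2, -1], [], [1]]))
```

Representing objective sentences as empty lists keeps them in the review while ensuring that they have no effect on PR.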



5.5. Computation of SentiUnit polarity: Effect of polarity shifters

For the purpose of sentiment classification, the classifier is integrated with the lexicon of

annotated words (discussed in Section 5.6). In such a lexicon, a polarity mark is

annotated with each lexical entry to show its orientation and intensity. This is called the

prior polarity of the subjective words and phrases. The overall orientation of a sentence

is calculated by recognizing the prior polarities of the constituent subjective terms. This

idea works well in some simple sentences, particularly, if the polarity shifters are not

present. The polarity shifters are the words and phrases, which can change the prior

polarities of the words in a sentence.

Example:

Consider the sentence:

“میرا کیمرا کم قیمت ہے لیکن اسکی بیٹری دیرپا نہیں”

mera camera kam-qeemat hay laykin iski battery derpa naheen

My camera is inexpensive but its battery is not long lasting

Description:

In this sentence, the word “دیرپا” (derpa, long lasting) has positive prior polarity, but due to the polarity shifter “نہیں” (naheen, not), its overall contribution to the sentence’s sentiment becomes negative. Another polarity shifter in the above expression is the word “لیکن” (laykin, but), which alters the positive prior polarity of the word “کم قیمت” (kam-qeemat, inexpensive). This overall polarity of the appraisal expression is called the SentiUnit polarity. Therefore, our approach to sentiment classification rests on two types of polarity scores:

Prior polarity: Polarity marks annotated with the lexicon entries.

SentiUnit polarity: The overall polarity of the appraisal expression on which the final

polarity of the sentence depends
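The relation between the two scores can be illustrated for the negation case. In this sketch the prior polarity is scaled by an intensity factor and its sign is flipped when a negation particle is present. Both the multiplication and the sign flip are modeling assumptions made for illustration, and conjunction shifters such as “لیکن” (laykin, but) are not modeled.

```python
def sentiunit_polarity(prior_polarity, intensity=1, negated=False):
    """Combine a head word's prior polarity with its modifier and negation.

    prior_polarity: +1 or -1 from the lexicon
    intensity:      scaling factor contributed by a modifier (assumed)
    negated:        True if an explicit negation particle is attached
    """
    polarity = prior_polarity * intensity
    return -polarity if negated else polarity


# "derpa naheen" (not long lasting): positive prior polarity, negated.
print(sentiunit_polarity(+1, negated=True))
# "kam-qeemat" (inexpensive): positive prior polarity, no shifter attached.
print(sentiunit_polarity(+1))
```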

At the highest level, our lexicon model categorizes all the lexical entries into objective terms and subjective terms. Objective terms have no orientation or intensity and hence are not marked with prior polarity scores. Therefore, they have no effect on the overall classification decision. On the contrary, subjective terms are the carriers of the sentiments and are marked with polarity scores. Their occurrence can affect or even altogether alter the final classification decision.

Figure 5.6 Sentiment classification of a review as positive or negative.

The algorithm identifies the subjective words according to the prior polarities annotated in the lexicon. Then, it attaches the polarity shifters, conjunctions, postpositions and modifiers to extract the appraisal expressions in the opinionated sentences. These expressions are labeled as the SentiUnits. Shallow parsing based chunking is applied for the extraction of the SentiUnits, with adjectives as the head words. The overall polarity of a sentence in a given review can be determined by computing the polarity of these expressions. Let us denote the term’s prior polarity by Tp, the SentiUnit’s polarity by SUp, the sentence polarity by Sp, and the overall review polarity by Rp, as shown in Figures 5.6 and 5.7.

5.5.1. Computing overall review polarity Rp from SUp

Figure 5.7 shows the overall process of the review polarity calculation.

Figure 5.7 Computation of the overall polarity of the Urdu text based review (Syed et al 2011 a)



When the system is given a review for classification, it sets the review polarity Rp and the sentence count SCount to zero. Then, it takes the sentences one by one. The analysis begins with text normalization, resulting in word segmentation. These words are passed to the SentiUnit extraction and polarity computation module, which gives polarity annotated SentiUnits. Next, the sentence polarity Sp is computed using the polarities of its constituent SentiUnits. The total Rp is the sum of all known sentence polarities Sp. Then, Rp is compared with the threshold value. If Rp is greater than the threshold, the review is positive; if it is less than the threshold, the review is negative.

5.6. Sentiment Annotated Lexicon

Natural language processing applications use electronic, machine readable versions of lexicons. Processing at the lexical level requires such a lexicon, and the particular approach adopted by the system decides whether a lexicon will be employed, as well as the extent, nature and level of the information that is encoded in that lexicon.

Lexicons may be relatively simple, with only the words and their lexical category (part of

speech), or may be increasingly intricate and include information about the semantic

classes of the word, its arguments, the semantic limitations on these arguments,

definitions of the sense or senses in the semantic representation employed in a certain

system, and it can even hold each sense of a single word for word sense disambiguation.

A usual model of a sentiment analyzer with a sentiment-annotated lexicon incorporates

two components:

(i) The classification model, which analyzes and classifies the given opinionated text

according to inherent sentiments of the reviewer (given in previous sections), and

(ii) The lexicon or lexicons annotated with the prior polarities of the lexical entries

(words/ phrases), usually as positive or negative.

These prior polarity annotated lexicons are also called sentiment-annotated lexicons (Pang and Lee, 2008). They can be manually compiled, like the General Inquirer (Stone et al. 1966), a prominent resource used in English sentiment analysis research and applications. Alternatively, such lexicons can be generated automatically. A considerable body of research on sentiment-annotated lexicon construction has emerged within a few years, for example, (Annett and Kondrak, 2008), (Higashinaka et al., 2007), (Andreevskaia and Bergler, 2006), (Hu and Lui, 2005), (Yu and Hatzivassiloglou, 2003), (Riloff et al., 2003), (Turney, 2002), and (Hatzivassiloglou and Wiebe, 2000). These contributions have proposed a variety of approaches to lexicon development, lexicon structures and the relationships between the entries.

For English text, corpora of reviews, movies and other kinds of information are readily available on product, movie, news, and discussion websites. These readily available texts have two benefits: firstly, they can be used as test beds to analyze the performance of any kind of sentiment analyzer, and secondly, they are very helpful in the automatic generation of domain specific prior polarity lexicons.

On the contrary, Urdu is a resource-poor language (Muscand and Ghosh, 2010). Most of the data available in the Urdu language is in image formats [Hussain] or is not suitable for sentiment analysis, because the generation of a prior polarity lexicon requires opinionated texts, like reviews.

Therefore, the task of developing a domain specific prior polarity lexicon for Urdu text poses many challenges. To our knowledge, no prior polarity lexicon of Urdu words is accessible, although some contributions have tried to develop simple lexicons suitable for other language processing applications of Urdu text, for example, (Ijaz and Hussain, 2007), (Humanyoun et al., 2007), (Muaz and Hussain, 2009) and (Muscand and Ghosh, 2010).

5.6.1. Definitions of the specific terms

Before presenting our model of the lexicon of Urdu words, we consider it necessary to define certain terms, like lexicon, lexeme, lemma, etc.

Definition 1:

In linguistics, a lexicon is defined as the set of all the morphemes of a particular language. More specifically, it can be a collection of terms used in a particular profession, subject, or style, i.e., a vocabulary, e.g., the lexicon of Greek mythology. More formally, it is a language's inventory of lexemes.

Definition 2:

A lexeme is a conceptual unit of the morphological analysis, which corresponds to a set

of the forms taken by a single word. Generally, a lexeme belongs to a specific syntactic

class and has a definite semantic value. In case of inflecting languages (such as Arabic,

Urdu, Turkish etc), it has a related inflectional paradigm, so, a lexeme in many languages

will have many different forms.

Example:

As an example, consider the lexeme WALK from the English language; this lexeme has different forms, i.e., walk, walks, walked and walking.

The grammar rules of a language govern the forms of the lexemes, which include,

compound tense rules and subject-verb agreement. For example, walks is the present third

person singular form of the lexeme WALK, whereas, walked is its past form.

Definition 3:

The morphology (defined and discussed in Chapter 3) is also based on the notion of the

lexeme, which, further describes many other terms. For example, in terms of lexemes the

morphological operations (already defined in Chapter 3) can be stated as; inflectional

rules relate a lexeme to its forms and derivational rules relate a lexeme to another

lexeme.

Definition 4:

In dictionaries, conventionally, a lexeme is presented as the lemma, which is a canonical

form of a lexeme and is used as the headword. Other forms of the lexeme that are not

common conjugations of the word are often listed later in the lexical entry.

Definition 5:

A lexical entry is a single word or chain of words that forms a basic element of a lexicon. Examples of single word lexical entries are lion, computer and finger, whereas traffic signal, life style, bits and pieces, and take care of are examples of chains of words.

Much like a lexeme, a lexical entry generally expresses a distinct meaning but is not limited to a single word.

5.6.2. Sentiment annotated lexicon of Urdu words

Our approach distinguishes clearly between the subjective and objective entries in the lexicon. We take two attributes of a subjective entry, i.e., its orientation (either positive or negative) and its intensity (the force of the orientation).

After developing the lexicon, we integrate it with the sentiment classifier. The classifier preprocesses the given text and then applies shallow parsing based chunking. It uses the lexicon to look up all the words and phrases present in the text. As a result, all the subjective terms in the given text become annotated. On the basis of the polarities of the individual words, the sentence polarity and then the total review polarity are calculated. We evaluate the overall system using a corpus of movie reviews in the Urdu language: the classification algorithm is applied to the review corpus, and each subjective word in a review is compared with the lexicon entries for the computation of the polarity scores.

i) Construction Steps

We divide the lexicon construction task into the following steps:

Categorize the words as either subjective or objective. We have identified these two categories of lexicon entries. When the classification algorithm is applied, the classifier simply ignores the objective terms, so its performance depends entirely on the subjective words. For example, “موم” (mome, wax) is an objective word and “عمده” (umdah, fine) is a subjective word.

Categorize these words according to morphological rules, which work at the word level. This categorization helps in identifying the subjective terms in the given text. An example is the rules for marking an adjective in agreement with the noun it qualifies.

Identify their grammatical rules, which describe the possible structures of a sentence and the position of the parts of speech with respect to each other. For example, the use of modifiers with adjectives or of auxiliaries with verbs.

Discover relationships between different lexicon entries. These relationships can define synonyms, antonyms, cross references, etc.

Decide and annotate the polarities and then the intensities of the entries. In this task, the entries are first categorized as positive or negative and then their intensity scores are attached to them. Some entries have only orientations, some have only intensities (like modifiers), and some have both values.

ii) Lexicon Structure

The model is designed to distinguish between the objective and subjective terms in a given review. Objective terms carry neutral sentiment and have no effect on the final classification decision, whereas subjective terms are considered the carriers of the sentiments and their presence can alter the final classification. Keeping this distinction in view, our lexicon entries are categorized as objective and subjective terms.

Before going into the details we define some terms according to our approach:

Orientation: Orientation describes either the positivity or the negativity of a lexicon entry. For most entries the orientation is predefined during the lexicon construction phase. In a given text, however, it can be altered by a polarity shifter in the sentence, e.g., the word “اچھا” (acha, good) has a positive orientation, but with the polarity shifter “نہیں” (naheen, not) it becomes a negative expression, i.e., “اچھا نہیں” (acha naheen, not good). Moreover, the orientation of some words (though few in number) is highly domain specific or depends upon the context in which they are used. These two issues are beyond the scope of our research.

Intensity: This is the intensity of the orientation of a lexicon entry. It describes the force of the positivity or negativity of a term. Usually, the modifiers, e.g., “بہت” (bohat, very), describe the intensity of an expression. As in other languages, in Urdu there are three degrees of intensity: absolute (only positive or negative orientation), comparative (two distinct entities are compared with each other) and superlative (one of all entities has the highest orientation).

Polarity: The polarity mark is annotated to each lexicon entry to show its orientation and

intensity.

iii) Classification of the terms

The objective terms are saved without any polarity mark, but the subjective terms are further categorized on the basis of orientation and intensity into three types, as shown in Figure 5.8.

a) Absolute subjective terms with orientation only T (O): For such terms, there are only

two possible values, i.e., “+1”, for absolute positive, and “-1”, for absolute negative.

Examples: Absolute Urdu adjectives come in this category, e.g., the adjectives “خوبصورت” (khoobsurat, beautiful) and “بہادر” (bahadur, brave) both have positive orientation and are marked with prior polarity = +1, whereas “گھٹیا” (ghatya, cheap) has prior polarity = -1, due to its negative orientation.



Figure 5.8 Structure of the sentiment annotated lexicon (Syed et al 2011 c).

b) Subjective terms with intensity only T (I): For such terms the prior polarity is assigned

with respect to the possible intensity values, showing the degrees of the polarity, i.e., 1

for absolute, 2 for comparative and 3 for superlative.

Examples: The adjective modifiers are basically the terms with intensity only, e.g., both the modifiers “بہت” (bohat, much) and “زیاده” (zyadah, more) have prior polarity = 2, and “سب سے زیاده” (sab say zyadah, most) has prior polarity = 3.

c) Subjective terms with both values of orientation and intensity T (I, O): In this case the

prior polarity is calculated by multiplying the orientation score (+1 or -1) with the

intensity score (1, 2 or 3).

Examples: In the Urdu language, very few terms come in this category. For example, the words “بہتر” (behter, better), “بہترین” (behtareen, best) and “بدتر” (badtar, worse) have prior polarities = +2, +3 and -2, respectively. These are usually Persian loan words.
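The three term types above can be sketched as a small data structure. This is only an illustrative sketch: the class name `LexiconEntry`, its field names and the `ENTRIES` list are assumptions made for this example and are not the storage schema of the actual lexicon. The polarity values are the ones quoted in the examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    """Hypothetical lexicon entry; field names are illustrative only."""
    word: str
    orientation: Optional[int] = None  # +1 or -1; None if the term has no orientation
    intensity: Optional[int] = None    # 1 absolute, 2 comparative, 3 superlative

    def prior_polarity(self) -> Optional[int]:
        if self.orientation is None and self.intensity is None:
            return None                            # objective term: no polarity mark
        if self.intensity is None:
            return self.orientation                # T(O): orientation only
        if self.orientation is None:
            return self.intensity                  # T(I): intensity only
        return self.orientation * self.intensity   # T(I, O): product of both


ENTRIES = [
    LexiconEntry("خوبصورت", orientation=+1),           # khoobsurat, beautiful
    LexiconEntry("گھٹیا", orientation=-1),              # ghatya, cheap
    LexiconEntry("بہت", intensity=2),                   # bohat, much
    LexiconEntry("بہتر", orientation=+1, intensity=2),  # behter, better
    LexiconEntry("بدتر", orientation=-1, intensity=2),  # badtar, worse
    LexiconEntry("موم"),                                # mome, wax (objective)
]

for entry in ENTRIES:
    print(entry.word, entry.prior_polarity())
```

Keeping orientation and intensity as separate optional fields mirrors the distinction between T(O), T(I) and T(I, O); an objective term simply has neither value.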

iv) Lexicon Entries

Some examples of lexicon entries from all the three categories, i.e., T(O), T(I) and T(I,O)

are given in Table 5.1. For example, the word “کامیاب” (kamyaab, successful), has

positive orientation but no intensity. Similarly, “زیاده” (zyada, more) and “بہت” (bohat,

very) both have intensity and no orientation. Whereas, “بہتر” (behtar, better) and “بہترین”

(behtareen, best) both have positive orientation with intensities of a comparative and

superlative degrees, respectively.

Examples of T (O) | Examples of T (I) | Examples of T (O, I)
“کامیاب” (kamyaab, successful) | “بہت” (bohat, very) | “بہترین” (behtareen, best)
“خوبصورت” (khoobsurat, beautiful) | “زیاده” (zyada, more) | “بدترین” (badtareen, worst)
“بہادر” (bahadur, brave) | “شدید” (shadeed, extremely) | “بہتر” (behtar, better)

Table 5.1 Examples of lexicon entries.



5.14. System integration

The annotated lexicon of Urdu words is integrated with the sentiment classifier as shown in Figure 5.9. First, the given text, in the form of a review, is taken from the website. The sentiment classifier component of the system preprocesses this review and segments it into sentences and then into words. These words are then tagged with their respective parts of speech. The tagged words are compared with the lexicon entries for sentiment orientations and intensities. This comparison yields polarity-marked (polarity-annotated) words and phrases.

Figure 5.9 Integration of the lexicon of Urdu words with the sentiment classifier (Syed et al 2010).

On the basis of the polarities of individual words, the polarity of each sentence, and then of the whole review, is calculated. We evaluate the overall system using a corpus of movie reviews in the Urdu language; the experimentation is given in Chapter 6. The classification algorithm is applied to the review corpus. Each subjective word in the review is compared with the lexicon entries for the computation of the polarity scores.
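The aggregation from word-level polarities to a review-level decision can be sketched as follows. The text does not spell out the aggregation rule, so this sketch assumes a plain sum at each level and a sign test for the final label; the function names are hypothetical.

```python
def sentence_polarity(unit_polarities):
    """Assumed rule: a sentence score is the sum of its annotated polarities."""
    return sum(unit_polarities)


def review_polarity(sentences):
    """Assumed rule: a review score is the sum of its sentence scores."""
    return sum(sentence_polarity(units) for units in sentences)


def label(score):
    """Map a numeric score to an orientation label by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# A review of three sentences, each carrying one negative polarity mark.
review = [[-1], [-1], [-1]]
print(label(review_polarity(review)))
```

Other aggregation rules (for example, majority voting over sentence labels) would fit the same interface; the sum is used here only because it is the simplest choice.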

Chapter review:

This Chapter has presented our approach in detail, as well as the modules of the system:

PREPROCESSOR, EXTRACTOR, ASSOCIATOR, and CLASSIFIER (see Figure 6.1).

Each module is described by a separate detailed model. In the next chapter, we evaluate our model through experimentation.

As a pioneering effort, in this research we describe the structure, construction and

evaluation of a manually tagged sentiment-annotated Urdu words based lexicon as a

component of a sentiment analysis model developed for Urdu text. The lexicon contains


information about the subjectivity of an entry in addition to its orthographic, phonological, syntactic, and morphological aspects. Our approach distinguishes clearly between the subjective and objective entries in the lexicon. We take two attributes of a subjective entry, i.e., its orientation (either positive or negative) and its intensity (the force of the orientation).

After developing the lexicon, we integrate it with the sentiment classifier. The classifier preprocesses the given text and then applies shallow parsing based chunking. It uses the lexicon to look up all the words and phrases present in the text. As a result, all the subjective terms in the given text become annotated. The classifier then calculates the sentiment orientation of the sentences and then of the overall review.



CHAPTER 6

EXPERIMENTATION AND RESULTS

The evaluation of sentiment classifiers is typically conducted experimentally rather than analytically. The reason is that analytical evaluation requires a formal specification of the problem, with respect to which correctness and completeness are defined, and it does not emphasize practical effectiveness and performance.

On the other hand, the experimental evaluation of a classifier usually measures its effectiveness in terms of its ability to take accurate classification decisions. Hence, we have performed a series of experiments. These experiments compare two versions of the system, Model A and Model B, and analyze the effect of polarity shifters and negations. The results are given in Section 6.4. The lexicon and corpora are discussed in Sections 6.1 and 6.2, respectively. Moreover, Section 6.3 gives case studies to illustrate the processing of the major components or modules of the system. Some example illustrations are given in Section 6.5.

6.1. Lexicon Coverage

The current version of the lexicon contains 1,368 adjectives, which are marked according to orientation and intensity. These adjectives are picked from all the classes given in Chapter 4. As already mentioned, Urdu adjectives are marked for case through inflection and derivation; therefore, all inflected forms are treated as a single entry. Moreover, there are 67 modifiers, covering both the comparative and superlative intensity levels. The 1,920 nominal head words are selected from the domains of movies and electronic appliances. A summary of the existing version of the lexicon is given in Table 6.1.



Modifiers | Adjectival head words | Nominal head words
67 | 1,368 | 1,920

Table 6.1. Summary of lexicon entries.

6.2. Corpus

Owing to the lack of a publicly accessible corpus of Urdu-language reviews, we collect two corpora of reviews to evaluate the efficacy of the employed model. The first corpus, C1, is a collection of 700 movie reviews, of which 385 are positive and 315 are negative. The average document length in this corpus is 264 words. To obtain varied reviews, 40 different movies with different (already known) popularity scores and categories (comedy, drama, historical, etc.) are given for review.

The second test bed, C2, is a corpus of reviews of electronic appliances. This corpus comprises a total of 650 reviews, of which 322 are positive and 328 are negative. The base collection has reviews of three types of appliance: refrigerators (237), air-conditioners (250), and televisions (163). The average review length is 196 words. To achieve diversity, 9 different brands of electronic appliances are given for review.

For both corpora, the reviews within the threshold boundary or with neutral scores are removed. Hence, the data set contains only positive and negative reviews, as shown in Table 6.2.

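Domain | Total number | Average length | Orientation | Number
Movies (C1) | 700 | 264 words | Positive | 385
Movies (C1) | 700 | 264 words | Negative | 315
Electronic appliances (C2) | 650 | 196 words | Positive | 322
Electronic appliances (C2) | 650 | 196 words | Negative | 328

Table 6.2. Corpora for evaluation.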

6.3. Case Studies

Before proceeding to the results, we consider different case studies from Urdu text and show how the system's processes, such as POS tagging, extraction, association and polarity annotation, are performed.

6.3.1. CASE 1 Parts of speech tagging

Consider an example of an Urdu verse for POS tagging.

میں اکیلا ہی چلا تھا جانب منزل مگر لوگ ساتھ آتے گۓ اور کارواں بنتا گیا۔

(main akela hi chala tha jaanib-e-manzil magar, log saath aate gaye aur kaaravaan bantaa gayaa; I had started all alone towards the destination, but people kept joining and it became a caravan.)

The parts of speech tagging results in the following allocations:

<PP> میں
<ADJ> اکیلا
<ADV> ہی
<VB> چلا
<TA> تھا
<ADJ> جانب
<N> منزل
<SC> مگر
<NN> لوگ
<NN> ساتھ
<VB> آتے
<VB> گۓ
<CJC> اور
<NN> کارواں
<VB> بنتا
<VB> گیا
<SM> ۔

6.3.2. CASE 2 Extraction of targets and SentiUnits

Following examples describe the execution of the EXTRACTOR module.

Example 1:

”ارتضی کا روبوٹ بڑا شاندار ہے“

Irtaza ka robot bara shaandaar hay

Irtaza’s robot is very fabulous.

In this sentence, both the SentiUnit and the target are complex, i.e., they are composed of more than one word. The SentiUnit “بڑا شاندار” (bara shaandaar, very fabulous) is made by an adjective head word and a positive modifier. The target of the comment, “ارتضی کا روبوٹ” (Irtaza ka robot, Irtaza’s robot), is based on three words: two nouns with a possession marker in between, as shown in Table 6.3.



Remark | Parse (POS tags) | Parse (chunks) | Text
Sentence with complex SentiUnit (SU) and target (NP) | [N PM N] [ADJ ADJ] AUX | NP SU AUX | ارتضی کا روبوٹ بڑا شاندار ہے
Noun phrase (NP) with possession marker (PM) | N PM N | NP (Target) | ارتضی کا روبوٹ
SentiUnit made by two adjectives (ADJ) | ADJ ADJ | SU (SentiUnit) | بڑا شاندار

Table 6.3. Parsing of example 1 into targets and SentiUnits.

Example 2:

”ارتضی اورفاطمہ کا کمره ہوادارنہیں“

Irtaza aur Fatima ka kamrah hawadar naheen

Irtaza and Fatima’s room is not airy

Again, both the SentiUnit and the target are complex. The SentiUnit “ہوادار نہیں” (hawadar naheen, not airy) contains an adjective head and a negation word. The target of the comment is even more complex, i.e., “ارتضی اور فاطمہ کا کمره” (Irtaza aur Fatima ka kamrah, Irtaza and Fatima’s room) is made by five words: three nouns, a possession marker and a conjunction. The sentence parse is given in Table 6.4.

Remark | Parse (POS tags) | Parse (chunks) | Text
Sentence with complex SentiUnit and target | [N CJC N PM N] [ADJ NEG] | NP SU | ارتضی اور فاطمہ کا کمره ہوادار نہیں
Noun phrase with conjunction (CJC) and possession marker (PM) | N CJC N PM N | NP (Target) | ارتضی اور فاطمہ کا کمره
SentiUnit with negation (NEG) | ADJ NEG | SU (SentiUnit) | ہوادار نہیں

Table 6.4. Parsing of example 2 into targets and SentiUnits.
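The tag patterns of Tables 6.3 and 6.4 can be illustrated with a small pattern-based chunker. This is a minimal sketch, not the EXTRACTOR module itself: it assumes the input is already POS tagged, it uses transliterated words for readability, and it encodes only the two patterns seen in these examples (a noun phrase as N followed by any number of "PM N" or "CJC N" continuations, and a SentiUnit as one or more ADJ followed by an optional NEG). The function name `chunk` is hypothetical.

```python
def chunk(tagged):
    """Group (word, tag) pairs into NP and SU chunks; other tokens pass through."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        word, tag = tagged[i]
        if tag == "N":
            j = i + 1
            # Extend the noun phrase over "PM N" or "CJC N" continuations.
            while j + 1 < n and tagged[j][1] in ("PM", "CJC") and tagged[j + 1][1] == "N":
                j += 2
            chunks.append(("NP", [w for w, _ in tagged[i:j]]))
            i = j
        elif tag == "ADJ":
            j = i + 1
            while j < n and tagged[j][1] == "ADJ":
                j += 1
            if j < n and tagged[j][1] == "NEG":
                j += 1  # the negation particle belongs to the SentiUnit
            chunks.append(("SU", [w for w, _ in tagged[i:j]]))
            i = j
        else:
            chunks.append((tag, [word]))
            i += 1
    return chunks


example_1 = [("Irtaza", "N"), ("ka", "PM"), ("robot", "N"),
             ("bara", "ADJ"), ("shaandaar", "ADJ"), ("hay", "AUX")]
example_2 = [("Irtaza", "N"), ("aur", "CJC"), ("Fatima", "N"), ("ka", "PM"),
             ("kamrah", "N"), ("hawadar", "ADJ"), ("naheen", "NEG")]

print(chunk(example_1))
print(chunk(example_2))
```

For Example 1 the sketch yields one NP of three words, one SU of two adjectives, and the auxiliary on its own; for Example 2 it yields one NP of five words and one SU made of the adjective and the negation word.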

Example 3:

Here is a short review from the Urdu-language movie review corpus.

“فلم کی کہانی بورنگ ہے۔ ہیرو کی اداکاری اور شکل اچھی نہیں۔ نا ہی ہدایت کاری قابل ستایش ہے۔”

film ki khani boring hay, hero ki adakari aur shakal achi naheen, na hi hidaayetkari qabil-e-staayesh hay.

The story of the film is boring. Hero’s acting and looks are not good. Nor is the direction appreciable.

Description:

The review is based on three comments. In the first comment no negation particle is used; the SentiUnit is based on an adjective, “بورنگ” (boring, boring), with negative polarity, resulting in an overall negative impact. In the second comment, the SentiUnit is made by a positive adjective, “اچھی” (achi, good), and a negation mark, “نہیں” (naheen, not), which reverses the effect of the adjective to make an overall negative impact. The SentiUnit in the third comment is again based on a positive expression, “قابل ستایش” (qabil-e-staayesh, appreciable), and a negation mark, “نا” (na, nor), which appears at the beginning of the comment; hence, it conveys a negative opinion.

So, the overall polarity of the review is negative. The POS tagging and phrase chunking of the review are given in Table 6.5. For each comment, the table gives the POS tags and the resulting noun phrases (NP) and SentiUnits (SU).

Comments | POS tags | Phrases & SU
فلم کی کہانی بورنگ ہے | [N + PM + N] [ADJ] | [NP] [SU]
ہیرو کی اداکاری اور شکل اچھی نہیں | [N + PM + N] [CJC] [N] [ADJ + NEG] | [NP] [CJC] [NP] [SU]
نا ہی ہدایت کاری قابل ستایش ہے | [NEG] [N] [ADJ] | [NP] [SU]

Table 6.5. POS tagging and phrase chunking of the given review.

Example 4: Let us take an example execution of a single sentence. Figure 6.1 shows the execution steps in detail:

“گاڑی کا یہ ماڈل خوب صورت نہیں -”

Gari ka yeh model khoobsurat naheen hay

This model of the car is not beautiful.

Figure 6.1.Example extraction of the SentiUnits.

6.3.3. CASE 3 Case marking and complex noun phrases

Consider an example of a sentence with a complex noun phrase, in which the same sentence, with the same meaning, is written in three different versions.

Version1:

نہیں تیرا نشیمن کثرسلطانی کے گنبد پر

naheen tera nasheman kasr-e sultani kay gunbad par

Your home is not on the tower of the king’s palace

Description:

In this sentence there are three noun phrases. One of them is complex, i.e., “کثرسلطانی” (kasr-e sultani, king’s palace). In the English translation an apostrophe is used in place of “of”. In Urdu, however, no indication is visible, because the diacritic mark is optional and mostly omitted. Only native Urdu readers can infer the right pronunciation and meaning. This phenomenon is called compounding, which is very common in Urdu texts (discussed in Chapter 3).

<NEG>نہیں
NP 1: <PP>تیرا <NN>نشیمن, NP 1 is based on one noun and one adjective.
NP 2: <NN>کثر <ADJ>سلطانی, NP 2 is a compound of two nouns joined through the diacritic.
کے
NP 3: <NN>گنبد, NP 3 is simple with a single noun.
پر

[Content of Figure 6.1 (Example 4). Stages: Preprocessing, Shallow Parsing, Classification. Steps in order:
1. گاڑی کا یہ ماڈل خوب صورت نہیں -
2. گاڑی کا یہ ماڈل خوب صورت نہیں
3. گاڑی | کا | یہ | ماڈل | خوب صورت | نہیں
4. The same tokens tagged, in sentence order: <N> <POSS> <DT> <N> <ADJ> <NOT>
5. گاڑی کا | یہ ماڈل | خوب صورت | نہیں, tagged in sentence order: <NP> <NP> <ADJ> <NOT>
6. گاڑی کا | یہ ماڈل | خوب صورت نہیں, tagged in sentence order: <NP> <NP> <SentiUnit>
7. <NP> <NP> <Negative orientation>. Result: This is a negative comment.]

Version2:

Let us consider another version of the same sentence:

تیرا نشیمن کثرسلطانی کے گنبد پر نہیں

tera nasheman kasr-e sultani kay gunbad par naheen

Description:

In this sentence only the word order is changed; the composition of the noun phrases remains the same.

NP 1: <PP>تیرا <NN>نشیمن
NP 2: <NN>کثر <ADJ>سلطانی
کے
NP 3: <NN>گنبد, NP 3 is simple with a single noun.
پر
<NEG>نہیں

Version3:

Another version of the sentence is

تیرا نشیمن گنبد کثرسلطانی پر نہیں

tera nasheman gunbad-e kasr-e sultani par naheen

Description:

In this case the word order is changed and “کے” (kay, of) is replaced by the diacritic, making “گنبد” (gunbad, tower) an additional word in the noun phrase, i.e., “گنبد کثرسلطانی” (gunbad-e kasr-e sultani, king’s palace’s tower). Therefore, the sentence contains two noun phrases.

NP 1: <PP>تیرا <NN>نشیمن
NP 2: <NN>گنبد <NN>کثر <ADJ>سلطانی
پر
<NEG>نہیں

6.3.4. CASE 4 Polarity annotations

Example 1:

”میری کتاب عمده ہے“

meri kitab umdah hay

My book is good.

Description:

The SentiUnit is made by an adjective as the subjective term with orientation only.

Hence,

SUp = Tp    (1)

where

Tp = +1

Putting this value in equation 1, we get

SUp = +1

Thus, the SentiUnit polarity is “+1”.



Example 2:

”میری کتاب عمده نہیں ہے“

meri kitab umdah naheen hay

My book is not good.

Description:

The SentiUnit is made by an adjective as the subjective term with orientation only and a negation term as the polarity shifter.

Hence,

SUp = (Tp) (Neg)    (2)

where

Tp = +1 and Neg = -1

Putting these values in equation 2, we get

SUp = (+1) (-1)

= -1

Thus, the SentiUnit polarity is negative and is “-1”.

Example 3:

”وه سب سے زیاده سخی ہے“

woh sab say zyadah sakhee hay

He is the most generous of all.

Description:

The SentiUnit is made by four lexical units: an adjective as the subjective term with orientation only, and a superlative modifier made by three words. The adjective polarity shifts to the superlative degree due to the intensity of the modifier.

Hence,

SUp = (Tp1) (Tp2)    (3)

where

Tp1 = +1 and Tp2 = 3

From equation 3, we get

SUp = (+1) (3)

= +3

This results in a positive SentiUnit polarity of intensity “3”.
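Equations 1 to 3 share one pattern: the SentiUnit polarity SUp is the product of the prior polarities of its terms, multiplied by -1 when a negation term is present. The following sketch reproduces the three examples under that reading; the function name `senti_unit_polarity` and its parameters are hypothetical.

```python
from math import prod

NEGATION = -1  # the value that equation 2 assigns to a negation term


def senti_unit_polarity(term_polarities, negated=False):
    """Multiply the prior polarities of the terms, then apply negation if present."""
    if not term_polarities:
        raise ValueError("a SentiUnit needs at least one subjective term")
    polarity = prod(term_polarities)
    return polarity * NEGATION if negated else polarity


print(senti_unit_polarity([+1]))                # Example 1: umdah
print(senti_unit_polarity([+1], negated=True))  # Example 2: umdah naheen
print(senti_unit_polarity([+1, 3]))             # Example 3: sab say zyadah sakhee
```

The three calls print 1, -1 and 3, matching the values derived in Examples 1, 2 and 3.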

6.3.5. CASE 5 Associating targets with SentiUnits

The ASSOCIATOR module associates the SentiUnits with the targets. For example, take

the linkage specification shown below:

We apply it to the following sentence

”بانگ درا ایک اچھی کتاب ہے“

Bang-e-Dara aik achi kitab hay.

Bang-e-Dara is a good book.

The chunker finds “اچھی” (achi, good) as a sentiment expression. The ASSOCIATOR module then searches for the target noun phrase, which is “بانگ درا”, the name of the book, as shown in Figure 6.2.

Figure 6.2.Linking the sentiment expressions with candidate targets.
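As a rough illustration of the association step, the sketch below links each SentiUnit to the closest noun phrase that precedes it. This heuristic is an assumption chosen only because it reproduces the example above; it is not the linkage specification used by the ASSOCIATOR module. The function name `associate` and the chunk labels are hypothetical.

```python
def associate(chunks):
    """Link each SentiUnit (SU) to the nearest preceding noun phrase (NP)."""
    links = []
    last_np = None
    for label, words in chunks:
        if label == "NP":
            last_np = words
        elif label == "SU":
            links.append((last_np, words))  # target is None if no NP came before
    return links


# "Bang-e-Dara aik achi kitab hay" after chunking (transliterated for readability).
sentence = [("NP", ["Bang-e-Dara"]), ("OTHER", ["aik"]), ("SU", ["achi"]),
            ("NP", ["kitab"]), ("AUX", ["hay"])]

print(associate(sentence))
```

Here the sentiment expression "achi" is linked to "Bang-e-Dara" rather than to the following noun "kitab", which agrees with the target identified in the example.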



6.4. Results

For evaluating the effectiveness and efficiency of a text classifier, accuracy alone is not a sufficient performance metric. Therefore, in addition to the accuracy A, we use three other metrics: the precision P, the recall R and the F-measure F. These metrics provide much greater insight into the performance characteristics of a classifier.

Definition 1: For a sentiment classifier the accuracy A can be defined as the measure of

how close the document classification suggested by the classifier is, to the actual

sentiments present in the review.

Definition 2: The precision P measures the exactness of a classifier. A higher P means fewer false positives and vice versa. In terms of true positives tp, false positives fp, true negatives tn and false negatives fn, P can be defined as:

P = tp / (tp + fp)

Definition 3: The recall R measures the sensitivity or completeness of the classifier. A higher R means fewer false negatives and vice versa. In terms of tp, fp, tn and fn, R can be defined as:

R = tp / (tp + fn)

Definition 4: The F-measure F combines precision and recall as the harmonic mean of both values, as defined below:

F = 2PR / (P + R)
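The definitions above translate directly into code. In the sketch below the function names are illustrative, and the accuracy formula (tp + tn) / (tp + fp + tn + fn) is the standard one, added here as an assumption because Definition 1 describes accuracy only in words.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)


def accuracy(tp, fp, tn, fn):
    # Standard definition, assumed here: share of all decisions that are correct.
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts for illustration only.
tp, fp, tn, fn = 8, 2, 8, 2
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r), accuracy(tp, fp, tn, fn))

# Precision and recall reported for positive C1 reviews under Model A.
print(round(f_measure(0.737, 0.681), 3))
```

With P = 0.737 and R = 0.681 the F-measure rounds to 0.708, which corresponds to the value 70.8 reported for positive C1 reviews in Table 6.6.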

A series of four experiments in two sets, with two models of the system, has been performed. Model A is the earlier version of the system with the EXTRACTOR module only (Syed et al. 2010), and Model B is the final version, to which the ASSOCIATOR module is attached (Syed et al. 2012). This setup allows the efficacy and usability of the extended version to be compared directly. Both models are applied to both corpora, C1 and C2, separately.

6.4.1. Model A

Table 6.6 and Table 6.7 show the results of the experiments performed by model A on

both corpora C1 and C2. Table 6.6 shows the detailed results with P, R, F and A values

separately computed for positive as well as negative reviews.

Orientation | Corpus | Precision | Recall | F-measure | Accuracy
Positive | C1 | 0.737 | 0.681 | 70.8 | 74%
Positive | C2 | 0.795 | 0.737 | 76.5 | 79%
Negative | C1 | 0.698 | 0.654 | 67.5 | 66%
Negative | C2 | 0.785 | 0.767 | 77.6 | 77%

Table 6.6. Experimental results in terms of P, R, F and A for model A (Syed et al. 2012)

Table 6.7 shows a comparative summary of the results from both corpora. The accuracy

of C1 is 70% and variation in positive and negative reviews is 8%. Whereas the accuracy

of C2 is 78% and variation in positive and negative reviews is 2%. The total accuracy of

model A is 74%.

Corpus | Accuracy (Pos) | Accuracy (Neg) | Variation | Corpus accuracy
C1 | 74% | 66% | 8% | 70%
C2 | 79% | 77% | 2% | 78%
Total accuracy: 74%

Table 6.7. Comparison of accuracy from both corpora C1 and C2 for model A.
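The summary figures for Model A are consistent with simple unweighted arithmetic: the corpus accuracy matches the mean of the positive and negative accuracies, the variation matches their absolute difference, and the total matches the mean of the two corpus accuracies. The sketch below checks this; treating the summary as unweighted means is an assumption inferred from the numbers, and the function name `summarize` is hypothetical.

```python
def summarize(pos_acc, neg_acc):
    """Return (corpus accuracy, variation) from per-orientation accuracies in percent."""
    return (pos_acc + neg_acc) / 2, abs(pos_acc - neg_acc)


c1_accuracy, c1_variation = summarize(74, 66)
c2_accuracy, c2_variation = summarize(79, 77)
total_accuracy = (c1_accuracy + c2_accuracy) / 2

print(c1_accuracy, c1_variation)  # C1
print(c2_accuracy, c2_variation)  # C2
print(total_accuracy)             # Model A overall
```

Running it gives 70.0 and 8 for C1, 78.0 and 2 for C2, and 74.0 overall, in agreement with the values quoted for Model A.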



6.4.2. Model B

For the next two experiments we include the ASSOCIATOR module and test both corpora. The results are shown in Table 6.8 and Table 6.9. Table 6.8 shows the experimental results in terms of P, R, F, and A for model B applied to C1 and C2, for positive and negative reviews separately.

Orientation | Corpus | Precision | Recall | F-measure | Accuracy
Positive | C1 | 0.822 | 0.795 | 80.8 | 80%
Positive | C2 | 0.897 | 0.877 | 88.7 | 88%
Negative | C1 | 0.795 | 0.777 | 78.6 | 77%
Negative | C2 | 0.865 | 0.832 | 84.8 | 84%

Table 6.8. Experimental results in terms of P, R, F and A for model B (Syed et al. 2012).

Results from Table 6.8 are compared and summarized in Table 6.9. The accuracy of C1 is

improved to 78.5%, and the variation in positive and negative reviews is decreased to

3%. Likewise, the accuracy of C2 is increased to 86.5%. In this case the variation in the

accuracy of positive and negative reviews is also increased to 3%. The total accuracy of

model B is 82.5%.

Corpus | Accuracy (Pos) | Accuracy (Neg) | Variation | Corpus accuracy
C1 | 80% | 77% | 3% | 78.5%
C2 | 88% | 85% | 3% | 86.5%
Total accuracy: 82.5%

Table 6.9. Comparison of accuracy from both corpora C1 and C2 for model B.



Observations:

From the above results it is clear that the classification accuracy is highly domain specific. The movie reviews in C1 are more challenging to classify than those of electronic appliances in C2. The reason is that movie reviews contain more allegory, which results in more divergence, not only in syntactic or semantic structure, but also in appraisal type. Discussion of the movie plot and its characters, whether good or evil, is a very frequent phenomenon. This discussion produces a number of appraisal targets, which can further lead to the selection of the wrong linkage. On the other hand, all positive or negative comments about the parts of an electronic appliance are indirectly related to the same target.

Moreover, the classification accuracy also depends upon the orientation of the review. The results also show that negative reviews are more prone to be misclassified than positive ones.

6.4.3. Effect of Negation

On the basis of the above discussion it is clear that negation markers affect the results of the analyzer to a great extent; therefore, we carry out an experiment to analyze the behavior of negation. For this purpose, we divide the dataset into three different sets of data. During the test-bed normalization process, we remove the neutral comments from all three sets.

Test Data Sets Precision Recall F-Measure

Set 1 0.864 0.837 0.850

Set 2 0.590 0.779 0.677

Set 3 0.510 0.615 0.558

Table 6.10 Effect of negation in terms of P, R and F (Syed et al. 2011).



Set 1: Set 1 includes the sentences in which both implicit and explicit negation are absent. The polarity of these sentences depends only on the subjective terms and other polarity shifters.

Set 2: Set 2 contains those sentences in which only explicit negation particles are used and implicit negation is absent.

Set 3: To compile Set 3, we add implicit negation sentences to Set 2. In this set both implicit and explicit negation are present in addition to polar terms.

Table 6.10 gives the results from the three sets of data in terms of precision, recall, and F-measure. From these values the total performance accuracy is about 77%. Set 1, in which only polar terms are present, gives the best classification results. The results from Set 2 are lower, as it contains only the sentences with negation particles. From this result, it is inferred that the negation particles can cause a relatively high rate of misclassification. Nevertheless, the average accuracy from Set 1 and Set 2 is quite satisfactory. The results from Set 3 show that implicit negation still needs an improved treatment.

Observations:

Apart from the results, we have the following noteworthy observations about negation:

On average, two to three negation particles appear in a single review, and the use of negation is author dependent; some authors tend to use more negation particles than others. “نہیں” (naheen, not) is the most used particle. In comparative sentences, the negation particle “نا” (na, no) is used with multiple targets of the appraisal.

Sentential negation is rarely misclassified as compared to constituent negation.

Morphological negation is handled automatically, because most of the words inflected by the lexical negation marks, e.g., “بے” (bay), “با” (ba), etc., are already present in the lexicon and are annotated with their respective polarities; e.g., “بے فایده” (bayfayeeda, useless) is a lexical entry with a negative polarity.



6.5. Example illustrations

Example 1: Positive Review

Consider the following review about a laptop:

پچھلے مہینے میں نے ایک لیپ ٹاپ خریدا ہے۔ یہ ایک شاندار چیز ہے۔ اس کی پروسیسنگ کی رفتار بہت حیرت انگیز ہے۔ آپریٹنگ سسٹم بہترین ہے۔ اگرچہ بیٹری دیرپا نہیں ہے، جو میرے لئے مفید ہے۔

Translation: Last month I bought a laptop. It is a wonderful thing. Its processing speed is

very amazing. The operating system is the best. Though, its battery is not long lasting.

But this is good for me.

Figure 6.3 Example of a positive review.



Result: This is a positive review, as the result of the analysis in Figure 6.4 shows.

Figure 6.4 Result of the analysis.



Example 2: Negative Review

Now, consider another review related to a movie:

بہت عرصے کے بعد ایک فلم دیکھنے کو ملی ہے۔ فلم کا ٹاپک پرانا ہے۔ یہ فلم انتہاي فالتو اور ناقابل برداشت ہے۔ فلم میں ہیرو کی ایکٹنگ بہترین ہے۔ فلم دیکھ کر زیاده مزه نہیں آیا۔

Translation: After a long time, I got a film to watch. The film’s topic is old. This film is utter rubbish and intolerable. Hero’s acting is the best in the movie. It was not much fun to watch the movie.

Figure 6.5 Example of a negative review.



Result:

Figure 6.6 Result of the analysis.



CHAPTER 7

CONCLUSIONS AND FUTURE DIRECTIONS

This dissertation has investigated the automatic sentiment analysis of a morphologically rich and resource-poor language: Urdu. The grammatically motivated approach is apposite for handling the complex morphology and variable vocabulary of the target language. We have applied the core natural language processing tasks of word segmentation, POS tagging and phrase chunking. Our system has evolved from a simple phrase chunking model presented in (Syed et al, 2010) to a more flexible and mature approach given in (Syed et al, 2012). The results from both versions are presented and compared in Chapter 6.

Conclusions from linguistic aspects

As the first effort towards the sentiment analysis of the Urdu language, we have come to several conclusions regarding the characteristic features of this language and the challenges it poses for automatic processing. For example, Urdu is context sensitive, and hence its word segmentation is itself a major issue. Because of this feature, word boundary identification is not as straightforward as it is for the English language.

Another considerable problem is the complex morphology of Urdu text, which results in an intricate lexicon structure. Our lexicon is based on adjectives, their modifiers, polarity shifters and negations. The extended version of the lexicon also contains nouns. We have considered adjectives of all the types discussed in Chapter 4, to make our system more inclusive. The compilation of the lexicon suggests that the handling of Urdu nouns is much more complicated than that of adjectives because of the separate case markers, which are used as possession markers. Our algorithm handles these possession markers as separate lexical units.



Urdu adjectival phrases are morphologically complex. In Section 4.1, we have discussed both marked and unmarked adjectives, which are borrowed from many languages, such as Persian, Arabic, Hindi, Sanskrit, and English. This diversity results in flexibility and variety in the morphological and grammatical rules. For example, adjectives that are Persian loans follow Persian grammar and usually remain unmarked; likewise, Sanskrit-based adjectives show inflections for gender and number, etc.

Almost all types of adjectives (descriptive, attributive, predicative, demonstrative, etc.) show agreement in case, gender and number with the noun they qualify.

Similarly, some other linguistic phenomena are specific to the Urdu language, e.g., frequent reduplication (partial as well as full), compounding, and frequent inflections and derivations.

Conclusions from technical aspects

A sentiment-annotated lexicon turns out to be more intricate than other NLP lexicons. There are two reasons for this intricacy:

Each lexicon entry carries its polarity information in addition to its orthographic, phonological, syntactic, and morphological features. This polarity information is usually represented as positive, negative or neutral. For example, SentiWordNet (Andreevskaia and Bergler, 2006) uses triplets [positive, negative, objective], with minimum value 0.0 and maximum 1.0.

Most of the words exhibit multiple orientations depending upon their use and domain. For example, in the sentence “This damage is everlasting”, everlasting is a positive word, but the comment’s overall orientation is negative. Also, unpredictable is a positive word when used about a movie’s plot, and becomes negative for the performance of a microwave oven.

Moreover, the above-mentioned linguistic aspects of the Urdu language result in much more complex lexicons. There is a much higher out-of-vocabulary rate than for languages with well-defined grammars. It also results in poor or unreliable language model probability estimation, because many combinations of word forms are missing or rare in the language model training data.



It is observed that the domain of the test bed affects the classification accuracy. The results for one domain differ from those for the other. Moreover, the orientation of the text to be analysed affects the accuracy to a great extent: negative reviews are more prone to be misclassified than positive ones.

For this reason, our approach handles phrase-level negation as part of the SentiUnits, which contain adjectives as the core terms and include the negation particles as their logical constituents. Hence, the total effect of the negation is dealt with along with the effect of the subjective words. This approach is well suited to the free word order property of the Urdu language. It also handles the variant grammatical structures of Urdu sentences successfully, as indicated by the experimental results, with an overall accuracy of 77%.

Although the shallow parsing based approach is appropriate for handling simple opinions, it results in misclassifications when applied to complex sentences with multiple targets. Therefore, the approach presented in Model B, which uses dependency parsing after the shallow parsing, is much more reliable.

Directions for future Endeavors

The classification accuracy is highly domain specific: the results for the
domain of electronic appliances are more accurate than those for movies. The
problem of domain independence is still an open issue in the sentiment analysis research
community, even for the English language. Therefore, our primary future work is to increase
the model's knowledge of the Urdu language by including more adjectives and other parts of
speech. The lexicon can be extended on the same model by introducing new rules
for handling adjectives and adjectival phrases, adverbs and adverbial phrases, verbs and
verb phrases, etc. We intend to extend the lexicon to the extent that it makes our
model domain independent.

Another future direction is to adapt our model to the Hindi language, which is
morphologically similar to Urdu but orthographically different. Because the
segmentation and diacritic issues are absent in Hindi, we believe that the updated model
can perform well for Hindi too. The model can also be applied to other morphologically
rich languages, such as Punjabi, Persian, and Sindhi, which have the same orthography and
very similar grammar rules.

Most of the research on the English language relies only on the extraction of
adjectives or adjectival phrases. Very few contributions have
considered adverbs or adverbial phrases. In future, we intend to extend our model by
adding adverbial phrases in combination with adjectival phrases to handle more
diversified opinions. In this way, both aspects of the target product, i.e., its functions
and its attributes, can be handled. The main strength of this model is its flexibility:
because classification is performed at the phrase level, new rules and new phrases can be
added to the core model easily, without major alterations to the algorithm.

References 111


REFERENCES

1. Abbasi A, Chen H, Salem A (2008) Sentiment analysis in multiple languages: feature

selection for opinion classification in web forums. ACM Trans Inf Syst, pp 1–34

2. Abdul-Mageed M, Korayem M (2010) Automatic identification of subjectivity in

morphologically rich languages: case of Arabic. In: Proceedings of the 1st workshop on

computational approaches to subjectivity & sentiment analysis (WASSA), Lisbon pp 2–6

3. Ahmed T, Hautli A (2010) An Experiment for a basic lexical resource for Urdu on the

basis of Hindi WordNet. In: Proceedings of CLT 2010.

4. Akram Q, Naseer A, Hussain S (2009) Assas-Band, an Affix-Exception-List based Urdu

stemmer. In: Proceedings of 7th workshop on Asian Language Resources, pp 40-47.

5. Ali W, Hussain S (2010) A hybrid approach to Urdu verb phrase chunking. In:

Proceedings of 8th workshop on Asian Language Resources, pp 137-143.

6. Andreevskaia A, Bergler S (2006) Mining WordNet for fuzzy sentiment: sentiment tag
extraction from WordNet glosses. In: Proceedings of the 11th conference of the European
chapter of the association for computational linguistics, EACL-2006, Trento, pp 209–216

7. Annet M, Kondrak G (2008) A comparison of sentiment analysis techniques: polarizing

movie blogs. In: Proceedings of Canadian AI, pp 25–35

8. Anwar W, Wang X, Wang XL (2006) A survey of automatic Urdu language processing.

In: Proceedings of 5th international conference on Machine Learning and Cybernetics.

9. Baker P, Hardie A, McEnery T, Jayaram BD (2003) Corpus data for South Asian

language processing. In: Proceedings of the EACL workshop on South Asian languages,

Budapest

10. Bansal M, Cardie C, Lee L (2008) The power of negative thinking: exploring label

disagreement in the min cut classification framework, Manchester. In: Proceedings of

COLING pp 13–16

11. Bhattacharyya P (2010) IndoWordNet. In: Proceedings of the Seventh conference on
International Language Resources and Evaluation (LREC’10).

12. Bhattacharyya P, Pande P, Lupu L (2008) Hindi WordNet. Linguistic Data Consortium,

Philadelphia.

13. Bloom K, Argamon S (2010) Unsupervised extraction of appraisal expressions. In:
Farzindar A, Kešelj V (eds) Canadian AI 2010. LNCS (LNAI), vol 6085. Springer,
Heidelberg, pp 290–294

14. Breck E, Choi Y, Cardie C (2007) Identifying expressions of opinion in context. In:

Proceedings of IJCAI’07. Menlo Park, CA, pp 2683–2688

15. Choi Y, Cardie C (2008) Learning with compositional semantics as structural inference

for subsentential sentiment analysis. In: Proceedings of the conference on empirical

methods in natural language processing, Honolulu, HI, pp 793–801

16. Crilley K (2001) Information warfare: new battle fields, terrorists, propaganda, and the

Internet. ASLIB Proc 53(7): 250–264

17. Dalal A, Nagaraj K, Sawant U, Shelke S (2006) Hindi part of speech tagging and
chunking: A maximum entropy approach. In: Proceedings of NLPAI Machine Learning
Contest.

18. Dave K, Lawrence S, Pennock DM (2003) Mining the peanut gallery: opinion extraction

and semantic classification of product reviews. In: Proceedings of the twelfth

international world wide web conference (WWW 2003), Budapest, pp 519–528

19. Durrani N, Hussain S (2010) Urdu word segmentation. In: Proceedings of 11th annual

conference of the North American chapter of the association for computational

linguistics, Los Angeles

20. Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT Press,
Cambridge

21. Glaser J, Dixit J, Green DP (2002) Studying hate crime with the Internet: What makes
racists advocate racial violence? J Soc Issues 58(1):177–193

22. Hardie A (2003) Developing a tagset for automated part-of-speech tagging in Urdu. In:

Proceedings of the conference of the corpus linguistics, Lancaster

23. Hatzivassiloglou V, McKeown KR (1993). Towards the automatic identification of

adjectival scales: Clustering adjectives according to meaning. In: Proceedings of the 31st

Annual Meeting of the ACL, pp 172-182

24. Hatzivassiloglou V, McKeown KR (1997) Predicting the semantic orientation of

adjectives. In: Proceedings of ACL’97. Stroudsburg, PA, pp 174–181

25. Hatzivassiloglou V, Wiebe JM (2000) Effects of adjective orientation and gradability on
sentence subjectivity. In: Proceedings of the 18th international conference on
computational linguistics, New Brunswick, NJ

26. Hautli A, Butt M (2011) Towards a computational semantic analyzer for Urdu. In:

Proceedings of the 9th workshop on Asian Language Resources, pp 71-78.

27. Higashinaka R, Prasad R, Walker MA (2006) Learning to generate naturalistic utterances

using reviews in spoken dialogue systems. In: Proceedings of the 21st international

conference on computational linguistics and 44th annual meeting of the ACL, Sydney, pp

265–272

28. Hu M, Liu B (2004) Mining and summarizing customer reviews. In: Proceedings of
SIGKDD’04, pp 168–177

29. Humayoun M, Hammarström H, Ranta A (2007) Urdu morphology, orthography and

lexicon extraction. In: Proceedings of the 2nd workshop on computational approaches to

Arabic script-based languages. Stanford, USA, pp 59–66

30. Ijaz M, Hussain S (2007) Corpus based Urdu lexicon development. In: Proceedings of the

conference on language technology, University of Peshawar, Pakistan

31. Jang H, Shin H (2010) Language-specific sentiment analysis in morphologically rich

languages. In: Proceedings of the COLING, Poster Volume, Beijing, pp 498–506

32. Jia L, Yu C, Meng W (2009) The effect of negation on sentiment analysis and retrieval
effectiveness. ACM

33. Kaji N, Kitsuregawa M (2007) Building lexicon for sentiment analysis from massive

collection of html documents. In: Proceedings of EMNLP’07, pp 1075–1083

34. Kamps J, Marx M, Mokken RJ, de Rijke M (2004) Using WordNet to measure semantic
orientation of adjectives. In: Proceedings of LREC’04, pp 1115–1118

35. Kennedy A, Inkpen D (2006) Sentiment classification of movie and product reviews

using contextual valence shifters. Computational Intelligence 22(2):110–125

36. Kennedy A, Inkpen D (2005) Sentiment classification of movie reviews using
contextual valence shifters. In: Proceedings of FINEXIN

37. Khan SA, Anwar W, Bajwa UI (2011) Challenges in developing a rule based Urdu
stemmer. In: Proceedings of 2nd workshop on south and southeast Asian Natural
Language Processing, pp 46-51.

38. Kim S-M, Hovy E (2006) Automatic identification of pro and con reasons in online

reviews. In: Proceedings of the COLING, Sydney pp 483–490

39. Kumar A, Siddiqui T (2008) An Unsupervised Hindi Stemmer with Heuristics

Improvements. In: Proceedings of the Second Workshop on Analytics for Noisy

Unstructured Text Data.

40. Lehal GS (2009) A two stage word segmentation system for handling space insertion

problem in Urdu script. In: Proceedings of world academy of science, engineering and

technology, Bangkok pp 321–324

41. Lehal GS (2010) A word segmentation system for handling space omission problem in

Urdu script. In: Proceedings of the 1st workshop on South and Southeast Asian natural

language processing (WSSANLP), the 23rd international conference on computational

linguistics, COLING, Beijing, pp 43–50

42. Moilanen K, Pulman S (2008) The Good, the Bad, and the Unknown. In: Proceedings of
ACL/HLT

43. Muaz A, Ali A, Hussain S (2009) Analysis and development of Urdu POS tagged

corpora. In: Proceedings of the 7th workshop on Asian language resources, ACL-

IJCNLP, Suntec, Singapore, pp 24–31

44. Mukund S, Ghosh D (2011) Using sequence kernels to identify opinion entities in Urdu.

In: Proceedings of the 15th conference on Computational Natural Language Learning, pp

58-67.

45. Mukund S, Ghosh D, Srihari RK (2010) Using cross-lingual projections to generate

semantic role labeled corpus for Urdu—a resource poor language. In: Proceeding of the

23rd international conference on computational linguistics COLING, Beijing pp 797–805

46. Mukund S, Srihari R (2010) An information-extraction system for Urdu: a resource
poor language. ACM Transactions on Asian Language Information Processing 9(4),
Article 15.

47. Mullen T, Collier N (2004) Sentiment analysis using support vector machines with

diverse information sources. In: Proceedings of the conference on empirical methods in

natural language processing, Barcelona, pp 412–418

48. Na J-C, Sui H, Khoo C, Chan S, Zhou Y (2004) Effectiveness of simple linguistic

processing in automatic sentiment classification of product reviews. In: Proceedings of

conference of the international society of knowledge organization (ISKO), pp 49–54

49. Paik J H, Parui S K (2008) A Simple Stemmer for Inflectional Languages. Forum for

Information Retrieval Evaluation.

50. Pang B, Lee L (2004) A sentimental education: sentiment analysis using subjectivity

summarization based on minimum cuts. In: Proceedings of the 42nd meeting of the

association for computational linguistics, Barcelona, pp 271–278

51. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf

Retrieval 2(1–2):1–135

52. Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using

machine learning techniques. In: Proceedings of the conference on empirical methods in

NLP, Philadelphia, PA, pp 79–86

53. Pennebaker JW, Mehl MR, Niederhoffer K (2003) Psychological aspects of natural
language use: Our words, our selves. Annual Review of Psychology 54:547–577

54. Polanyi L, Zaenen A (2004) Context valence shifters. In: Proceedings of the AAAI
Spring Symposium on Exploring Attitude and Affect in Text

55. Riaz K (2010) Rule based named entity recognition in Urdu. In: Proceedings of the 2010

Named entities Workshop, ACL 2010, pp 126-135.

56. Riloff E, Wiebe J (2003) Learning extraction patterns for subjective expressions. In:

Proceedings of the conference on empirical methods in natural language processing

(EMNLP), Sapporo pp 25–32

57. Riloff E, Wiebe J, Wilson T (2003) Learning subjective nouns using extraction pattern

bootstrapping. In: Proceedings of the 7th conference on natural language learning,

Edmonton, pp 25–32

58. Rizvi SMJ, Hussain M (2005) Modeling case marking systems of Urdu-Hindi languages

by using semantic information. In: Proceedings of natural language processing and

knowledge engineering, pp 85–90

59. Schmidt RL (1999) Urdu: an essential grammar. Routledge Publishing, New York

60. Sharifloo A A, Shamsfard M (2008) A Bottom up Approach to Persian Stemming. In:

Proceedings of the 3rd International Joint Conference on Natural Language Processing.

61. Singh A, Bendre S, Sangal R (2005) HMM based chunker for Hindi. In: Proceedings of
IJCNLP-05: 2nd international joint conference on Natural Language Processing.

62. Snyder B, Barzilay R (2007) Multiple aspect ranking using the Good Grief algorithm. In:

Proceedings of the joint human language technology/North American chapter of the ACL

conference, Rochester, NY pp 300–307

63. Stone PJ, Dunphy DC, Smith MS, Ogilvie DM (1966) The general inquirer: a computer

approach to content analysis. MIT Press, Cambridge

64. Syed AZ, Muhammad A, Martínez-Enríquez AM (2012) Associating Targets with

SentiUnits: A Step Forward in Sentiment Analysis of Urdu Text. In: Artificial

Intelligence Review.

65. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (a) Sentiment Analysis of Urdu
Language: Handling Phrase-Level Negation. In: Proceedings of the 10th Mexican
international conference of artificial intelligence, pp 382–393

66. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (b) Adjectival Phrases as the
Sentiment Carriers in the Urdu Text. Journal of American Science 7(3):644–652

67. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (c) Sentiment-Annotated

Lexicon Construction for an Urdu Text Based Sentiment Analyzer. In: Pakistan Journal

of Science (2011), ISSN: 0030-9877

68. Syed AZ, Muhammad A, Martínez-Enríquez AM (2010) Lexicon based sentiment

analysis of Urdu text using SentiUnits. In: Proceedings of the 9th Mexican international

conference of artificial intelligence, Pachuca, Mexico, pp 32–43

69. Tan S, Cheng X, Wang Y, Xu H (2009) Adapting Naive Bayes to domain adaptation for

sentiment analysis. In: Proceedings of the 31st European conference on IR research on

advances in information retrieval, pp 337–349

70. Thabet N (2004) Stemming the Qur’an. In: Proceedings of the Workshop on

Computational Approaches to Arabic Script-based Languages.

71. Tsarfaty R, Seddah D, Goldberg Y, Kübler S, Candito M, Foster J, Versley Y, Rehbein I,

Tounsi L (2010) Statistical parsing of morphologically rich languages (SPMRL) what,

how and whither. In: Proceedings of the NAACL HLT 2010 first workshop on statistical

parsing of morphologically-rich languages, Los Angeles, pp 1–12

72. Turney P (2002) Thumbs up or thumbs down? Semantic orientation applied to

unsupervised classification of reviews. In: Proceedings of 40th meeting of the association

for computational linguistics, Philadelphia, PA, pp 417–424

73. Turney P, Littman M (2003) Measuring praise and criticism: inference of semantic

orientation from association. ACM Trans Inf Syst 21(4):315–346

74. Whitelaw C, Garg N, Argamon S (2005) Using appraisal groups for sentiment analysis.

In: Proceedings of ACM SIGIR conference on information and knowledge management

(CIKM 2005), Bremen, pp 625–631

75. Wiebe J, Wilson T, Bruce R, Bell M, Martin M (2004) Learning subjective language.

Comput Linguist 30(3):277–308

76. Wiegand M, et al (2010) A survey on the role of negation in sentiment analysis. In:
Proceedings of the Workshop on Negation and Speculation in Natural Language
Processing 2010. Association for Computational Linguistics

77. Wilson T, Wiebe J, Hoffmann P (2005) Recognizing contextual polarity in phrase-level
sentiment analysis. In: Proceedings of HLT/EMNLP

78. Yang K, Yu N, Valerio A, Zhang H (2006) WIDIT in TREC 2006 Blog Track. In:
Proceedings of the Text REtrieval Conference (TREC)

79. Yu H, Hatzivassiloglou V (2003) Towards answering opinion questions: separating facts

from opinions and identifying the polarity of opinion sentences. In: Proceedings of

EMNLP’03, pp 129–136
