Sentiment analysis for the Serbian language
DESCRIPTION
Sentiment analysis of the Serbian language, with a stemmer for Serbian.
TRANSCRIPT
Sentiment analysis of sentences in the Serbian language
Nikola Milošević
Why analyze sentiment in Serbian?
● Great industrial need
– Ad websites
– Automated market research
– Customer satisfaction
● NLP tools for Serbian are underdeveloped
– Need for tools and resources
– Almost no tools accessible through an API
Serbian language
● Belongs to the Indo-European language family
● Slavic language
● Highly inflectional
● 3 pronunciation types
● 3 dialect groups
● Write as you speak
● Latin and Cyrillic writing systems
Sentiment analysis workflow
Tokenization and preprocessing
● Process of breaking a stream of text up into words
● Stop-word filtering
● Negation handling
– Adding an NE_ prefix to words after a negation
– Applied to all words up to the next punctuation mark
● Irregular verbs
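The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the stop-word and negation lists are small hypothetical samples, and keeping the negation word itself in the output is an assumption made for the example.

```python
import re

# Hypothetical sample lists for illustration; the slides do not give the real ones.
STOP_WORDS = {"i", "u", "je", "na", "da", "se"}
NEGATIONS = {"ne", "nije", "nisam", "nema"}

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(text):
    """Tokenize, drop stop words, and mark words that follow a negation."""
    result = []
    negated = False
    for token in TOKEN_RE.findall(text.lower()):
        if not token[0].isalnum():
            # Punctuation ends the scope of the current negation.
            negated = False
            continue
        if token in NEGATIONS:
            # Keep the negation word and start prefixing what follows.
            result.append(token)
            negated = True
            continue
        if token in STOP_WORDS:
            continue
        result.append("NE_" + token if negated else token)
    return result


print(preprocess("Film nije dobar, ali muzika je odlična."))
# ['film', 'nije', 'NE_dobar', 'ali', 'muzika', 'odlična']
```

Because the prefix turns "dobar" and "NE_dobar" into different tokens, a bag-of-words classifier can learn opposite sentiment for them.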
Stemming
● Process of reducing inflected words to their stem, base, or root form
● Kešelj and Šipka (2008)
● Hand-crafted, rule-based stemmer
● ~300 rules
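A rule-based stemmer of this general kind can be sketched as a table of suffix rewrite rules applied longest-first. The handful of rules below are hypothetical examples chosen for illustration; they are not taken from the ~300-rule set or from Kešelj and Šipka (2008), and the minimum stem length is likewise an assumption.

```python
# Hypothetical rules (suffix -> replacement), for illustration only.
RULES = {
    "ovima": "",
    "ama": "",
    "ima": "",
    "om": "",
    "e": "",
    "a": "",
    "u": "",
    "i": "",
}

# Assumed guard so that very short words are not stripped to nothing.
MIN_STEM_LENGTH = 3

# Try longer suffixes first so "ovima" wins over "ima" or "a".
SORTED_SUFFIXES = sorted(RULES, key=len, reverse=True)


def stem(word):
    """Apply the first (longest) matching suffix rule, if the result is long enough."""
    for suffix in SORTED_SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + RULES[suffix]
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    return word


for form in ["knjiga", "knjige", "knjigama", "gradovima", "da"]:
    print(form, "->", stem(form))
# knjiga -> knjig
# knjige -> knjig
# knjigama -> knjig
# gradovima -> grad
# da -> da
```

The length guard hints at why small words are hard for this approach: with too few characters left, a rule cannot tell an inflectional ending from part of the stem.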
Sentiment analysis
● Aim: build a binary sentiment analyzer
● General Serbian language
● No annotated corpus for Serbian
● Annotation work (~1000 short texts)
● Supervised machine learning
Naive Bayes
● Algorithm that learns fast
● Bag-of-words approach
● Assumption of conditional independence
● Laplace smoothing
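The points above can be illustrated with a small multinomial Naive Bayes classifier. This is a sketch under stated assumptions, not the project's code: the class name, the "pos"/"neg" labels, and the four toy training texts are invented for the example. Each word contributes independently to the score (conditional independence), word order is ignored (bag of words), and adding alpha to every count is Laplace smoothing.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.doc_counts = Counter()               # documents seen per label
        self.word_counts = defaultdict(Counter)   # word frequencies per label
        self.vocabulary = set()

    def train(self, documents):
        for tokens, label in documents:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def log_scores(self, tokens):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration.
model = NaiveBayes()
model.train([
    (["dobar", "film"], "pos"),
    (["odličan", "film"], "pos"),
    (["loš", "film"], "neg"),
    (["NE_dobar", "film"], "neg"),
])
print(model.classify(["dobar"]))      # pos
print(model.classify(["NE_dobar"]))   # neg
```

Without smoothing, a word never seen with a label would give that label zero probability; the added alpha keeps every likelihood positive, which matters with a training set of only about 1000 short texts.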
Implementation
● Web API with presentation layer
● JSON communication
● Secured page for annotating
● Built using PHP and MySQL
● Web & Android
Results
● Stemmer
– Smallest and most precise stemmer
– 90% correct on news articles
– Problems: small words, irregular inflections, sound changes
● Sentiment analyzer
– 80% correct
– Problems: irony, ambiguity, small training set
Future work
● Stemmer
– Use the Snowball framework
– Build a multi-step stemmer
● Sentiment analyzer
– POS tagging
– Complex negation handling
– SVM algorithm