www.trendminer-project.eu
21st November 2014, Budapest
Social Psychological Analysis of Public Political Comments on
Márton Miháltz
TrendMiner Overview
• What kinds of social and political trends are there in Hungarian comments on political posts on Facebook?
  – Facebook in Hungary: 4.27M registered users = 59.2% of internet users, 43% of total population
• Download all public comments from Hungarian politicians’ and parties’ Facebook pages
• Analysis of comments:
  – Basic NLP (tokenization, PoS, stemming), domain-adapted
  – Entities: political actors (people, organizations)
  – Sentiment
  – Social psychology dimensions: agency/communion, individualism/collectivism, optimism/pessimism, primordial/conceptual thinking
• In cooperation with Narrative Psychology Research Group, Hungarian Academy of Sciences
Data Acquisition
• Get comments via fb Graph API
  – 1.9M comments for 141K fb posts (2013.10.01 – 2014.09.02)
  – From 1344 fb pages
    • Organizations: parties, regional and associated branches
    • People: candidate and elected representatives (MPs), government, party officials
    • Official and fan pages
  – In 3 categories
    • Hungarian parliament 2010-2014
    • Hungarian parliament elections 2014 (6th April)
    • EU parliament elections 2014 (25th May)
    • Sources: valasztas.hu, wikipedia.hu
• Everything in a MySQL database
  – For arbitrary queries (political groups, time etc.)
Data model
• Fb_pages
  – Id, URL, Page title
  – Type: person or organization
  – Affiliated party (3 campaigns)
• Fb_posts, Fb_comments
  – Id, Created_timestamp
  – Message text, Author_user_id
• Comments_annotations
  – Sentence_id, Start_token, End_token index
  – Annotated text, Lemmatized_annotated_text, Annotation_tag
• Fb_comments_scores
  – 16 scores and counts (sentiment, RID, agency, communion, optimism, …)
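The data model above can be sketched as a relational schema. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 instead of MySQL so that it runs stand-alone, and the column types, the page_id/post_id/comment_id link columns, the single affiliated_party column and the reduced set of score columns are all guesses, not the project's actual schema.

```python
import sqlite3

# Assumed schema: types, foreign keys and the two score columns are illustrative.
SCHEMA = """
CREATE TABLE fb_pages (
    id TEXT PRIMARY KEY,
    url TEXT,
    page_title TEXT,
    type TEXT CHECK (type IN ('person', 'organization')),
    affiliated_party TEXT
);
CREATE TABLE fb_posts (
    id TEXT PRIMARY KEY,
    page_id TEXT REFERENCES fb_pages(id),
    created_timestamp TEXT,
    message TEXT,
    author_user_id TEXT
);
CREATE TABLE fb_comments (
    id TEXT PRIMARY KEY,
    post_id TEXT REFERENCES fb_posts(id),
    created_timestamp TEXT,
    message TEXT,
    author_user_id TEXT
);
CREATE TABLE fb_comments_scores (
    comment_id TEXT PRIMARY KEY REFERENCES fb_comments(id),
    sentiment REAL,
    optimism REAL
);
"""


def build_demo_db():
    """Create an empty in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def comments_per_party(conn):
    """Example of an 'arbitrary query': number of comments per affiliated party."""
    rows = conn.execute(
        """
        SELECT p.affiliated_party, COUNT(c.id)
        FROM fb_comments c
        JOIN fb_posts po ON c.post_id = po.id
        JOIN fb_pages p ON po.page_id = p.id
        GROUP BY p.affiliated_party
        """
    ).fetchall()
    return dict(rows)
```

Keeping scores in their own table, keyed by comment id, lets the scoring step be re-run without touching the downloaded text.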
Hungarian Political Ontology
• Extending the TM multilingual political ontology
  – 8 new classes, 3+3 new object/data properties, 1579 new instances (1 Country, 18 Party, 661 Politician, 899 Nomination)
  – Nominated and elected MPs (2010 Hu. Parl., 2014 Hu. Parl., 2014 EU Parl.), nominating parties
  – Names, abbreviated names, nicknames, Facebook page URLs etc.
• Example: Benedek Jávor was a member of the Hungarian Parliament during 2010-2014 (nominated by LMP), and a member of the European Parliament from 2014 (nominated by EGYÜTT-PM).
Processing Pipeline
• Downloading (Fb Graph API Python script)
• Tokenization (huntoken tool)
• PoS-tagging (hunpos tool)
• Morphological analysis (hunmorph tool)
• Stem+analysis disambiguation (Python script)
• Content analysis (Java NooJ)
• Scoring & storage in DB
• Uploading in RDF to TM Integration Server
Domain Adaptation
• Problem: existing NLP tools were developed on a different domain and fail on social media language (Facebook comments)
• Using corpus for survey:
  – 1.25M fb comments (29M tokens)
  – 2.25M unknown tokens (694K types)
  – Frequency list, f > 15 items manually revised
  – Identify common problems
  – Lists of frequent, relevant unknown, new words etc.
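The survey step above (a frequency list of unknown tokens, keeping items with f > 15 for manual revision) could look roughly like the following. This is a minimal sketch, not the project's script: whitespace tokenization, lowercasing and the `known_words` set (standing in for "the morphological analyzer recognizes this token") are assumptions.

```python
from collections import Counter


def frequent_unknown_tokens(comments, known_words, min_freq=15):
    """Return (token, count) pairs for unknown tokens with count > min_freq.

    `known_words` is a hypothetical stand-in for the analyzer's lexicon.
    """
    counts = Counter(
        tok.lower()
        for comment in comments
        for tok in comment.split()  # naive whitespace tokenization (assumption)
        if tok.lower() not in known_words
    )
    # most_common() sorts by descending frequency, as a reviewer would want
    return [(tok, n) for tok, n in counts.most_common() if n > min_freq]
```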
Domain Adaptation: Tokenization
• Huntoken tool
• Frequent problems:
  – Missing spaces around punctuation
    ... end of sentence.Beginning of another ...
  – Repeated punctuation
    first part……. Second part
  – Contracted words (slang)
    asszem = azt hiszem (“I think”)
  – Consonant repetition (interjections, onomatopoeic words etc.)
    e.g. pfffffffff, uffffff, ejjjjjjjj (pff(f*), uff(f*), ej(j*))
  – Large numbers split into digit groups
    125 000
  – Split URLs
  – Split emoticons
    : D
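Some of the tokenization problems listed above can be handled by normalizing the text before it reaches the tokenizer. The following is a sketch under assumptions, not the project's actual adaptation of huntoken: the regular expressions, the choice of "..." as the canonical ellipsis, and the function names are illustrative, and only the contraction and interjection examples from the slide are included.

```python
import re

# Entries taken from the slide examples; a real list would be much longer.
CONTRACTIONS = {"asszem": "azt hiszem"}
INTERJECTIONS = [
    (re.compile(r"pff+"), "pff"),
    (re.compile(r"uff+"), "uff"),
    (re.compile(r"ej+"), "ej"),
]


def normalize_token(token):
    """Expand known contractions and collapse lengthened interjections."""
    lower = token.lower()
    if lower in CONTRACTIONS:
        return CONTRACTIONS[lower]
    for pattern, canonical in INTERJECTIONS:
        if pattern.fullmatch(lower):
            return canonical
    return token


def normalize_comment(text):
    # Collapse runs of dots / ellipsis characters into a single "..."
    text = re.sub(r"[.…]{2,}", "...", text)
    # Collapse repeated "!" or "?" into one
    text = re.sub(r"([!?])\1+", r"\1", text)
    # Insert a missing space between sentence-final punctuation and a capital letter
    text = re.sub(
        r"(?<=[a-záéíóöőúüű])([.!?])(?=[A-ZÁÉÍÓÖŐÚÜŰ])", r"\1 ", text
    )
    return " ".join(normalize_token(t) for t in text.split())
```

Restricting the missing-space rule to a lowercase letter before and an uppercase letter after the punctuation mark keeps it from breaking most URLs and decimal numbers, at the cost of missing sentences that start in lowercase.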
Domain Adaptation: PoS/stemming
• Hunpos tagger + hunmorph analyzer + stemming script
• Frequent problems:
  – Unknown words (no lemma/PoS)
    • Add to hunmorph analyzer’s lexicon
    • Using analogous words (morphological paradigm)
    • Compounds, abbreviations, acronyms, slang words etc.
  – Frequently misspelled word forms
    • Replace with correct forms
  – Wrong capitalization
    e.g. SENTENCES IN ALL CAPS
  – Missing accent characters: disambiguation model needed
    e.g. kor (age), kór (disease), kör (circle)
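To illustrate why missing accents need a disambiguation model, the sketch below generates the accented candidates for an unaccented word form from a lexicon. It covers only candidate generation. Choosing among the candidates (the disambiguation model mentioned above) is not shown. The lexicon contents and function names are assumptions for illustration.

```python
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Remove combining diacritics, e.g. 'kór' -> 'kor', 'kör' -> 'kor'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(lexicon):
    """Map each accent-stripped form to the set of lexicon words it may stand for."""
    index = defaultdict(set)
    for word in lexicon:
        index[strip_accents(word)].add(word)
    return index


def accent_candidates(word, index):
    """Return all lexicon words that match `word` when accents are ignored."""
    return sorted(index.get(strip_accents(word), {word}))
```

For the slide's example, a lexicon containing kor, kór and kör yields all three as candidates for the input "kor", so context is needed to pick one.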
NooJ, Java NooJ, Nooj-cmd
• Java NooJ
  – Open source version of NooJ: define and run finite state machines for querying, annotation etc. (morphology, syntax)
  – NooJ-Cmd extension: all NooJ GUI features => command line options
  – Open source: https://github.com/tkb-/nooj-cmd
• NooJ grammars (FSMs) for annotation:
  – Actors (entities)
  – Emotional valence (sentiment polarity)
  – Regressive imagery dictionary
  – Agency-communion
  – Optimism-pessimism
  – Individualism-collectivism
Development of NooJ Grammars
• In collaboration with social psychology researchers
  – Social Psychology Department, Eötvös Loránd University, Budapest
  – Narrative Psychology Research Group, Hungarian Academy of Sciences
• Development corpus
  – 176K sample fb comments from 570 fb pages (4.9M tokens)
  – NLP annotation
  – Frequency lists (lemmas, lemmas+PoS, lemmas+morphological info etc.)
• Development:
  – f > 100 content words from development corpus (3500 types)
  – 7 independent annotators
  – >= 4 annotators agree: manual revision
  – Compile into NooJ grammar with polarity shifters, items to be excluded etc.
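The agreement step above (items on which at least 4 of the 7 annotators agree go on to manual revision) can be sketched as a simple majority filter. The label names and the input format are hypothetical; the slides do not specify how annotations were stored.

```python
from collections import Counter


def agreed_labels(annotations, min_agree=4):
    """Keep words whose most frequent label was chosen by >= min_agree annotators.

    `annotations` maps each word to the list of labels given by the annotators
    (hypothetical format).
    """
    agreed = {}
    for word, labels in annotations.items():
        if not labels:
            continue
        label, votes = Counter(labels).most_common(1)[0]
        if votes >= min_agree:
            agreed[word] = label
    return agreed
```

With 7 annotators, a threshold of 4 guarantees that at most one label can qualify per word.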
1. Political Actors (NEs)
• Maxent NE tool (huntag): low performance on domain
  – Trained on standard language news texts
  – Miscategorization, false positive NEs, entity boundary recognition problems
• NooJ grammar/lexicon for TrendMiner
  – Person names: family_name (given_name_lemmatized)? | frequent_nicknames …
  – Organization names: Standard_form | abbreviated_forms… | nicknames…
  – Created automatically (names from DB) + manually (nicknames from freq. lists)
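The person-name pattern above, family_name (given_name)? | nicknames, can be approximated with a regular expression built from database records. This is a sketch in plain Python rather than a NooJ grammar, it matches surface text instead of lemmas, and the actor id and function names are made up for illustration.

```python
import re


def build_person_pattern(family_name, given_name, nicknames=()):
    """Compile 'family_name (given_name)? | nickname ...' into a regex."""
    full = rf"{re.escape(family_name)}(?:\s+{re.escape(given_name)})?"
    alternatives = [full] + [re.escape(n) for n in nicknames]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")


def find_actors(text, patterns):
    """Return (offset, matched_text, actor_id) triples in text order.

    `patterns` maps a hypothetical actor id to a compiled pattern.
    """
    hits = []
    for actor_id, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0), actor_id))
    return sorted(hits)
```

Because it works on surface forms, this sketch misses inflected names (Hungarian case suffixes attach directly to the name); matching on lemmatized tokens, as the grammar description suggests, avoids that.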
2. Emotional Valence
• Emotions with positive or negative polarity
• Polarity in context: recognize negation using simple rules
• Nouns, adjectives, verbs, adverbs, emoticons, multi-word expressions
• 500 positive, 420 negative entries
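A minimal sketch of polarity in context follows. The slides only say that negation is recognized "using simple rules"; the specific rule here (a negator immediately before a lexicon entry flips its polarity) and the tiny three-word lexicon are assumptions for illustration, not the project's grammar.

```python
# Illustrative entries only: "jó" = good, "rossz" = bad, "nem" = not
LEXICON = {"jó": 1, "rossz": -1}
NEGATORS = {"nem"}


def valence_counts(tokens, lexicon=LEXICON, negators=NEGATORS):
    """Count positive and negative hits, flipping polarity after a negator."""
    positive = negative = 0
    for i, token in enumerate(tokens):
        polarity = lexicon.get(token.lower())
        if polarity is None:
            continue
        # Assumed rule: a negator directly before the word inverts its polarity
        if i > 0 and tokens[i - 1].lower() in negators:
            polarity = -polarity
        if polarity > 0:
            positive += 1
        else:
            negative += 1
    return positive, negative
```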
3. Regressive Imagery Dictionary
• Martindale (1975, 1990): uncover psychological processes reflected in the text
• 2 basic categories of thinking:
  – Primordial (primary): associative, concrete, and takes little account of reality (fantasy, dreams)
  – Conceptual (secondary): abstract, logical, reality oriented, aimed at problem solving
• 7+29 more subcategories (social behavior, cognition, perceptions, sensations etc.)
• Hungarian version by Pólya and Szász
• 3000+ terms
4. Agency/Communion
• 2 fundamental dimensions of social values:
  – Communion: moral and emotional aspects of an individual’s relations to others (affection, expressiveness, cooperation, social benefit etc.)
  – Agency: efficiency of an individual’s goal-orientated behavior (motivation, competence, control)
• Positive or negative for both dimensions
  – Context dependent (e.g. negation)
• 640 expressions
5. Optimism/Pessimism
• Based on PoS and morphology annotations + time expressions
• 2 measures:
  1. |future_tense_verbs| / (|present_tense_verbs| + |past_tense_verbs|)
  2. |present_tense_verbs| / |past_tense_verbs|
• Both correlate with degree of optimism
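The two optimism measures can be computed directly from per-verb tense tags. In this sketch the tag names ("future", "present", "past") and the choice to return None when a denominator is zero are assumptions; the slides do not say how empty counts are handled.

```python
from collections import Counter


def optimism_measures(verb_tenses):
    """Return (measure_1, measure_2) for a list of tense tags, one per verb.

    measure_1 = |future| / (|present| + |past|)
    measure_2 = |present| / |past|
    A measure is None when its denominator is zero (assumed convention).
    """
    counts = Counter(verb_tenses)
    future, present, past = counts["future"], counts["present"], counts["past"]
    measure_1 = future / (present + past) if (present + past) else None
    measure_2 = present / past if past else None
    return measure_1, measure_2
```

Short comments often contain no past tense verbs, so the zero-denominator case is common and worth handling explicitly before aggregating scores.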
6. Individualism/Collectivism
• Based on PoS and morphology annotations
• 1 measure:
  |personal pronouns| / (|verbs with personal inflection| + |nouns with possessive inflection|)
• Higher score: higher degree of individualism
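As a worked instance of the measure above, the sketch below computes the score from three counts. The parameter names are illustrative, and returning None for a zero denominator is an assumed convention.

```python
def individualism_score(personal_pronouns, inflected_verbs, possessive_nouns):
    """|personal pronouns| / (|verbs with personal inflection| + |nouns with possessive inflection|)."""
    denominator = inflected_verbs + possessive_nouns
    if denominator == 0:
        return None  # undefined for comments without such verbs or nouns (assumption)
    return personal_pronouns / denominator
```

For example, a comment with 2 personal pronouns, 3 personally inflected verbs and 1 possessive noun scores 2 / (3 + 1) = 0.5.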
Dissemination and Exploitation
• Presentations
  – Hungarian NLP Meetup, Sept. 25, 2014, Budapest
  – conText, Nov. 20, 2014, Budapest
• Conference papers, presentations
  – 2 papers at the 11th Conference on Hungarian Computational Linguistics (January 15-16, 2015, Szeged)
• Source code
  – https://github.com/mmihaltz/trendminer-hunlp
  – https://github.com/mmihaltz/trendminer-hutools
  – https://github.com/tkb-/nooj-cmd
• Project website (http://corpus.nytud.hu/trendminer)
  – Download political ontology
  – Download 1.9M Facebook comments corpus (w/ annotations)
  – Project info, papers, presentation slides