Deriving Topics and Opinions from
Microblogs
Feng Jiang
Supervisors: Jixue Liu & Jiuyong Li
Contents
• Background of research
• Significance of research
• Problems and challenges
• Main tasks
• Literature review
• Methodology
• Improvement and innovation
• Experiment Result
Background
• Microblogs: Twitter
Twitter allows users to post short messages (maximum 140 characters), called “tweets”, to communicate with each other.
• Information platform
Microblogs allow people to publish, spread and share information, knowledge and personal viewpoints.
• Easy and convenient publishing
Because authors can post from laptops and smartphones with little effort, they often publish useless posts as well as good articles.
Significance
• Find useful information
Extract hot topics
Extract opinions
• Save plenty of time and energy
Readers do not have to read all the tweets to quickly learn the content.
Quickly find the opinion classification for a hot topic.
• Seek and track important events
• Identify fashion trends
• Find popular products
Problems and challenges
• It is very hard for individuals to manually find interesting and popular things because of the sheer number of posts.
• Existing web and text mining methods cannot be applied directly to extract hot topics and opinions from microblogs because of the unique characteristics of microblogs.
Problems and challenges: mass data
• At the end of 2009, Twitter had 75 million account holders, of which about 20% were active. There were approximately 2.5 million Twitter posts per day.
• While the majority of posts are conversational or not very meaningful, about 3.6% of the posts concern topics of mainstream news.
Problems and challenges
• Semi-structured and unstructured data
There are no restrictions or rules on the content and style of posts on microblogs.
• A great variety of topics and views
An author may discuss popular movies in one paragraph and then give opinions on sports events in the next paragraph of the same article, which makes the topic of a single post unclear.
Main tasks
• Topic extraction
Generate a complete and meaningful sentence that summarises a popular current event (e.g. the 2012 London Olympics) from the relevant blog posts.
• Sentiment analysis
Find who supports the topic and who opposes it from the comments.
Literature review
• M. Chau, et al., "A blog mining framework," IT Professional, vol. 11, pp. 36-41, 2009.
• M. Hutton, et al., "Summarizing microblogs automatically," presented at the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, 2010.
• B. Sharifi, et al., "Experiments in Microblog Summarization," in Social Computing (SocialCom), 2010 IEEE Second International Conference on, 2010, pp. 49-56.
Methodology
• 1 Text pre-processing
Part-of-speech (POS) tagging
Feature filtering
Stop words list: and, or, of
Word stemming: wants, wanted -> want
Synonyms and antonyms
Hypernyms and hyponyms: love -> emotion
TF-IDF: term frequency * inverse document frequency
Vector space model
Similarity analysis
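The pre-processing steps above (stop-word removal, stemming, TF-IDF weighting, vector space model, similarity analysis) can be sketched roughly as follows. This is a minimal illustration, not the implementation used in this work: the stop-word list, the `naive_stem` suffix stripper and the shortened example tweets are all assumptions made for the example, and a real system would use a proper stemmer and POS tagger.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (assumption); the slide only names "and, or, of".
STOP_WORDS = {"and", "or", "of", "the", "a", "in", "to"}


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: tf * idf} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Shortened example tweets (assumption, for illustration only).
tweets = [
    "Germany defeats Aussies beach volleyball pair in three sets",
    "Germany overcomes Aussies beach volleyball pair in August",
    "Australian team lost the men's water polo to Italy",
]
vectors = tfidf_vectors(tweets)
print(cosine(vectors[0], vectors[1]))  # the two volleyball tweets share terms
print(cosine(vectors[0], vectors[2]))  # no shared terms
```

The two volleyball tweets share several stemmed terms, so their similarity is positive, while the water polo tweet shares none with them.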
Methodology
• 2 Detect topics: clustering methods
K-means clustering
SOM clustering
WordNet-based clustering
• 3 Detect opinions
Bayesian classification
SVM (support vector machine)
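To make the Bayesian classification step concrete, here is a small multinomial naive Bayes sketch with add-one (Laplace) smoothing. The `NaiveBayes` class, the "support"/"oppose" labels and the tiny training set are hypothetical and chosen only for illustration; the slides do not specify which Bayesian variant or training data is used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (assumption, not taken from the experiment).
training = [
    ("great win for the team", "support"),
    ("love this brilliant match", "support"),
    ("proud of our players", "support"),
    ("terrible loss for the team", "oppose"),
    ("hate this boring match", "oppose"),
    ("disappointed by our players", "oppose"),
]
classifier = NaiveBayes()
classifier.train(training)
print(classifier.predict("love the team"))
print(classifier.predict("hate this loss"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing keeps unseen words from forcing a probability of zero.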
Improvement and innovation
• Use WordNet to improve clustering, assign weights to words and generate the topic sentence.
• WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.
• For example: suppose the weight of “defeat” is 5 and the weight of “overcome” is 3. They are in the same synset, so the weight of “defeat” becomes 5 + 3 = 8.
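The weight-merging example can be written as a short sketch. The `SYNSET_OF` table below is a hand-written, hypothetical stand-in for a WordNet lookup, and keeping the highest-weighted word as the representative of each synset is an assumption; the slide only states that weights within a synset are added.

```python
from collections import defaultdict

# Hypothetical synset table (assumption); a real system would query WordNet.
SYNSET_OF = {
    "defeat": "syn:defeat-overcome",
    "overcome": "syn:defeat-overcome",
}


def merge_synonym_weights(weights, synset_of):
    """Add up the weights of words that share a synset.

    The heaviest word in each synset is kept as its representative.
    Words without a synset entry form their own group.
    """
    groups = defaultdict(list)
    for word, weight in weights.items():
        key = synset_of.get(word, ("unmapped", word))
        groups[key].append((word, weight))
    merged = {}
    for members in groups.values():
        representative = max(members, key=lambda m: m[1])[0]
        merged[representative] = sum(weight for _, weight in members)
    return merged


weights = {"defeat": 5, "overcome": 3, "volleyball": 4}
merged = merge_synonym_weights(weights, SYNSET_OF)
print(merged)  # {'defeat': 8, 'volleyball': 4}
```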
Improvement and innovation
• Cluster the tweets before detecting hot topics and opinions
• WordNet-based clustering
• Other work only calculates word frequency
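As a rough illustration of clustering tweets before topic detection, the sketch below runs plain k-means on raw word-count vectors. It is not the WordNet-based clustering proposed here: the word-count representation, the squared Euclidean distance and the deterministic farthest-point seeding are all assumptions chosen to keep the example small and reproducible.

```python
from collections import Counter


def to_vector(text):
    """Bag-of-words count vector for one tweet."""
    return Counter(text.lower().split())


def sq_dist(u, v):
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(u) | set(v)
    return sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {k: count / len(vectors) for k, count in total.items()}


def kmeans(vectors, k, max_iter=20):
    """Return a cluster label per vector, using farthest-point seeding."""
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)
        )
        centroids.append(dict(farthest))
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda i: sq_dist(v, centroids[i]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = mean_vector(members)
    return labels


tweets = [
    "Germany defeats Aussies beach volleyball pair Bec Palmer and Louise Bawden in three sets",
    "Germany overcomes Aussies beach volleyball pair Bec Palmer and Louise Bawden in August",
    "Australian team lost the men's water polo to Italy 8-5",
    "They lost the men's water polo to Italy",
]
labels = kmeans([to_vector(t) for t in tweets], k=2)
print(labels)  # [0, 0, 1, 1]
```

The two beach volleyball tweets end up in one cluster and the two water polo tweets in the other. A WordNet-based variant would additionally treat "defeats" and "overcomes" as the same feature.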
Improvement and innovation
• Consider related factors
Word frequency
Posts' occurrence time
Author: a celebrity or a user with many followers
Users’ discrete degree: describes how widely distributed the users who release or forward the posts are
Keywords: some words in Twitter are marked with a hashtag: #Happy Sweetest Day, #beijing, #Alex Cross
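One of these factors, hashtag keywords, is easy to illustrate. The sketch below pulls hashtags out of tweets with a regular expression and counts them; the `count_hashtags` helper and the sample tweets are hypothetical, and the pattern only captures the single word directly after "#", which is an assumption about how hashtags are delimited.

```python
import re
from collections import Counter

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def count_hashtags(tweets):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for tweet in tweets:
        for tag in HASHTAG_PATTERN.findall(tweet):
            counts[tag.lower()] += 1
    return counts


# Sample tweets (assumption, for illustration only).
sample_tweets = [
    "Great atmosphere tonight #beijing",
    "Just saw the trailer #AlexCross",
    "Missing the old stadium #Beijing",
]
hashtag_counts = count_hashtags(sample_tweets)
print(hashtag_counts.most_common(1))  # [('beijing', 2)]
```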
Improvement and innovation
• Grammar analysis
• Noun: not changed.
• Verb: word stemming.
• Adjective and adverb: word stemming, then analysed and processed with WordNet (synonyms and antonyms).
• For example, the hypernym chain of “love”: entity -> abstract entity -> abstraction -> attribute -> state -> feeling -> emotion -> love
• Create a subject set, a verb set and an object set to generate a simple sentence for the topic
Improvement and innovation
• 3-layer tree structure
• The first layer is the subject set, the second layer is the verb set, and the last layer is the object set.
• The basic sentence unit: SUBJECT plus VERB, or SUBJECT plus VERB plus OBJECT.
• Remember that the subject names what the sentence is about, the verb tells what the subject does or is, and the object receives the action of the verb.
• Although many other structures can be added to this basic unit, the pattern of SUBJECT plus VERB (or SUBJECT plus VERB plus OBJECT) can be found in even the longest and most complicated structures.
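A possible shape for the 3-layer subject/verb/object tree is sketched below. The (subject, verb, object) triples are written by hand here, which is an assumption: in practice they would come from POS tagging, stemming and synonym merging. Picking the most frequent branch at each layer is also an assumption about how the topic sentence is selected.

```python
from collections import Counter, defaultdict


def build_tree(triples):
    """Layer 1: subjects, layer 2: verbs, layer 3: object counts."""
    tree = defaultdict(lambda: defaultdict(Counter))
    for subject, verb, obj in triples:
        tree[subject][verb][obj] += 1
    return tree


def subtree_count(node):
    """Total number of triples stored under a node."""
    if isinstance(node, Counter):
        return sum(node.values())
    return sum(subtree_count(child) for child in node.values())


def topic_sentence(tree):
    """Follow the most frequent branch at each layer."""
    subject = max(tree, key=lambda s: subtree_count(tree[s]))
    verbs = tree[subject]
    verb = max(verbs, key=lambda v: subtree_count(verbs[v]))
    obj, _ = verbs[verb].most_common(1)[0]
    return f"{subject} {verb} {obj}"


# Hand-extracted triples (assumption); "overcomes" is already mapped to
# "defeat" as if the two verbs had been merged through their synset.
triples = [
    ("Germany", "defeat", "Aussies"),
    ("Germany", "defeat", "Aussies"),
    ("Australian team", "lose", "water polo"),
    ("They", "lose", "water polo"),
]
tree = build_tree(triples)
print(topic_sentence(tree))  # Germany defeat Aussies
```

The verb appears in its stemmed form because stemming is applied before the tree is built; restoring inflection would be a separate step.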
Experiment
• Input:
Australian Olympic shooters have had a tough morning. They lost - Dina Aspandiyarova finished 14th and Lalita Yauhleuskaya was 40th
Germany defeats Aussies beach volleyball pair Bec Palmer and Louise Bawden in three sets
Germany overcomes Aussies beach volleyball pair Bec Palmer and Louise Bawden in August.
Aussies Palmer and Bawden take it to a deciding set in the beach volleyball against Germany
Australian team lost the men's water polo to Italy 8-5 . The Sharks play Kazakhstan next on Tuesday.
They lost the men's water polo to Italy. They came back last night.
Experiment Result
Questions