
Graph Data Management Lab, School of Computer Science

GDM@FUDAN

http://gdm.fudan.edu.cn


By zerup

Public Opinion Mining on Micro-blogs


What is an opinion?

An opinion is a subjective belief, and is the result of emotion or interpretation of facts.

An opinion may be supported by an argument or a set of facts.

An opinion may be the result of a person's perspective, understanding, particular feelings, beliefs, and desires.

Based on these features, we can build a dictionary of subjective words to pick opinions out of micro-blogs that contain lots of plain facts.



Problem Definition

Given: thousands of micro-blogs on the same topic

Goal: mine the top-k most popular opinions

Solution framework:
• Word partition
• Using extracted keywords to calculate similarity
• Clustering



Why clustering? Why AP?

In information theory, an opinion is a set of encoded symbols.

In human recognition, an opinion is made up of a subjective part, an objective part, and a comment.

In a semantic network, for a given topic, opinions that share common words have similar meanings.

According to the three perspectives above, we can construct a graph with every micro-blog as a data point and mutual information as the edge weight. Then we apply a clustering algorithm to the graph to find the opinion exemplars.



Basic Idea of Affinity Propagation (AP)

Data point i & exemplar k
Responsibility: i->k, denoted r(i,k)
Availability: k->i, denoted a(i,k)



s(i,k)

For i ≠ k:
• For each word w in sentence i, encode w using the words in sentence k or in the dictionary
• s(i,k) is the negative sum of the encoding costs

Example (logarithms are base 2):
• Sentence i has keywords: A B
• Sentence k has keywords: A C D E
• Dictionary has keywords: A B C D E F G H
• s(i,k) = -(log 4 + log 8) = -5
• s(k,i) = -(log 2 + 3*log 8) = -10

s(k,k) is set to the input preference for k to be an exemplar.
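
The encoding-cost similarity in this example can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name encoding_similarity is hypothetical, and the cost model (log2 of the size of sentence k when the word occurs there, otherwise log2 of the dictionary size) is an assumption inferred from the example values.

```python
import math

def encoding_similarity(words_i, words_k, dictionary):
    """Return s(i,k): the negative cost, in bits, of encoding sentence i.

    Assumed cost model: a word found in sentence k costs log2(len(words_k));
    any other word is encoded with the dictionary and costs log2(len(dictionary)).
    """
    cost = 0.0
    for word in words_i:
        if word in words_k:
            cost += math.log2(len(words_k))
        else:
            cost += math.log2(len(dictionary))
    return -cost

sentence_i = ["A", "B"]
sentence_k = ["A", "C", "D", "E"]
dictionary = ["A", "B", "C", "D", "E", "F", "G", "H"]

print(encoding_similarity(sentence_i, sentence_k, dictionary))  # -5.0
print(encoding_similarity(sentence_k, sentence_i, dictionary))  # -10.0
```

The two calls give different values, which shows that this similarity is not symmetric.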



r(i,k)

i = k: r(k,k) = s(k,k) - max{s(k,k')}, where k' ≠ k

Because it depends only on the similarities, r(k,k) is actually fixed.



a(i,k)

i = k



Result of AP

The number of iterations is predefined because the convergence of AP is not assured.

Finally, we get a matrix M = A + R, where M(i,k) = a(i,k) + r(i,k). For each k with M(k,k) > 0, we set k to be an exemplar. For each point i, we set the k that maximizes M(i,k) to be i's exemplar.
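
The procedure can be sketched as a small dense AP loop in plain Python. This is a sketch under stated assumptions, not the code behind these slides: it uses the commonly published update rules, r(i,k) = s(i,k) - max over k' ≠ k of (a(i,k') + s(i,k')), a(i,k) = min(0, r(k,k) + sum of positive r(i',k) for i' not in {i,k}), and a(k,k) = sum of positive r(i',k) for i' ≠ k. It applies the same responsibility rule on the diagonal instead of keeping r(k,k) fixed, and the damping parameter, the function name, and the toy data are all assumptions.

```python
def affinity_propagation(S, iterations=50, damping=0.5):
    """Run AP for a fixed number of iterations on a dense similarity matrix.

    S[i][k] is s(i,k); the diagonal S[k][k] holds the preferences.
    Needs at least two points. Returns (exemplars, assignment).
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i,k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i,k)
    for _ in range(iterations):
        # Responsibility update: how well k suits i compared with its best rival.
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability update: support that k collects from the other points.
        for k in range(n):
            support = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(support) - support[k]  # excludes the self term
            for i in range(n):
                if i == k:
                    update = total
                else:
                    update = min(0.0, R[k][k] + total - support[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * update
    M = [[A[i][k] + R[i][k] for k in range(n)] for i in range(n)]
    exemplars = [k for k in range(n) if M[k][k] > 0]
    assignment = [max(range(n), key=lambda k: M[i][k]) for i in range(n)]
    return exemplars, assignment

# Toy data (hypothetical): two well-separated groups of points on a line.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
S = [[-1.0 if i == k else -(x - y) ** 2 for k, y in enumerate(points)]
     for i, x in enumerate(points)]
exemplars, assignment = affinity_propagation(S)
print(exemplars, assignment)
```

The diagonal value -1.0 plays the role of the preference; changing it changes how many exemplars appear.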



Framework of AP

[Diagram. Stages: input, partition, keyword, cluster, return, output. Data passed between stages: filtered micro-blogs; word & its property; word with real meaning; number of clusters, centers & members. Resources: word dict., semantic network.]



Preprocessing of Micro-blogs

First, we check all the micro-blogs and find that many advertisements share the same embedded URL, http://t.cn/agxZ4i. Using this URL, about 5,000 of the 165,000 micro-blogs are filtered out.

Second, we find that many spam micro-blogs contain lots of unrelated words (so they can appear under many topics) and therefore contain many spaces.

Third, we hope to develop a dictionary of subjective words to pick out opinion sentences.
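
The first two filters can be sketched as follows. This is a minimal sketch, assuming each micro-blog is available as a plain string; the function names and the max_space_ratio threshold are hypothetical, and only the advertisement URL comes from the text above.

```python
AD_URL = "http://t.cn/agxZ4i"  # the shared advertisement URL reported above

def is_probable_spam(text, max_space_ratio=0.2):
    """Flag a micro-blog as spam by the two heuristics described above.

    max_space_ratio is a hypothetical threshold, not a value from the slides.
    """
    if AD_URL in text:
        return True
    if not text:
        return False
    spaces = sum(1 for ch in text if ch.isspace())
    return spaces / len(text) > max_space_ratio

def filter_microblogs(texts, max_space_ratio=0.2):
    """Keep only the micro-blogs that are not flagged as spam."""
    return [t for t in texts if not is_probable_spam(t, max_space_ratio)]

sample_posts = [
    "buy now http://t.cn/agxZ4i",
    "大雾天，减速行车，注意安全。",
    "a b c d e f g h",
]
print(filter_microblogs(sample_posts))
```

The space-ratio threshold would need tuning on real data, since ordinary text in languages that separate words with spaces also has a noticeable share of spaces.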




Word Partition of AP

Accuracy of word partition
• Tool: ICTCLAS
• Limitations: recognizing a person's name, unsuitable granularity

Define a user dictionary and keep it updated
• Store the latest network vocabulary

Which words (w.r.t. their properties) are most related to opinion/sentiment?
• We pick nouns, verbs, and adjectives as meaningful keywords

Convert traditional Chinese into simplified Chinese
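
The keyword selection step can be sketched as a filter over (word, tag) pairs. This is a minimal sketch that does not call ICTCLAS: the input format, the function name, and the tag prefixes n, v, and a (for nouns, verbs, and adjectives, as in PKU-style Chinese part-of-speech tag sets) are assumptions, and the tags in the sample are written by hand for illustration.

```python
# Assumed tag prefixes for nouns, verbs, and adjectives.
KEPT_TAG_PREFIXES = ("n", "v", "a")

def extract_keywords(tagged_words):
    """Keep the words whose part-of-speech tag marks a noun, verb, or adjective."""
    return [word for word, tag in tagged_words
            if tag.startswith(KEPT_TAG_PREFIXES)]

# Hand-tagged sample (hypothetical tags, not real segmenter output).
tagged = [
    ("大雾", "n"), ("中", "f"), ("能见度", "n"), ("很", "d"), ("低", "a"),
    ("请", "v"), ("大家", "r"), ("小心", "a"), ("驾驶", "v"),
]
print(extract_keywords(tagged))
```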



Example of AP

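1. A foggy day, but the temperature is fine and I woke up in a pretty good mood. After working 11 days in a row, I finally get a day off. (大雾天，但是温度还好，早上起来心情还不错，连续上班11 天，终于休息啊。)

2. Strangely, Xiushui is still not cold in November; the sun has beaten the heavy fog and it is Friday again, so I feel inexplicably cheerful... (很奇怪到十一月的修水竟然还不冷太阳战胜了大雾又是个周五心情莫名的愉悦啊。。。)

3. Despite the heavy fog, I am in a good mood. (虽然大雾，心情尚好。)

4. Visibility is very low in the heavy fog; please drive carefully, everyone. (大雾中能见度很低，请大家小心驾驶。)

5. The fog is thick; to stay safe, I had to slow the car down to the minimum. (雾很大，为了保证安全，只能将车速减到最低。)

6. Foggy day: slow down and drive safely. (大雾天，减速行车，注意安全。)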



AP after first iteration

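[Figure: message graph of the example sentences. Each pair of messages i->k & k->i is marked by its signs: +,+ / +,- / -,+ / -,- / unrelated]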



Result of the Example

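1. A foggy day, but the temperature is fine and I woke up in a pretty good mood. After working 11 days in a row, I finally get a day off.

2. Strangely, Xiushui is still not cold in November; the sun has beaten the heavy fog and it is Friday again, so I feel inexplicably cheerful...

3. Despite the heavy fog, I am in a good mood.

4. Visibility is very low in the heavy fog; please drive carefully, everyone.

5. The fog is thick; to stay safe, I had to slow the car down to the minimum.

6. Foggy day: slow down and drive safely.

After three iterations, 3 and 6 become the exemplars.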



Limitations of AP

It is hard to know which value of the 'preference' parameter yields an optimal clustering solution.

Oscillations, if they occur, cannot be eliminated automatically.

AP is difficult to scale up.



Scalability of AP

Because the matrices S, A, and R are stored, read, and written frequently, AP becomes much too slow once the number of micro-blogs exceeds 10,000.

To scale up AP, we consider three approaches:
• Sampling or pruning to reduce the number of micro-blogs
• Keeping a sketch of the matrix that stores only the values above a predefined threshold
• Using a new similarity measurement that makes the similarity symmetric or sparser
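
A back-of-the-envelope calculation gives a feel for the storage involved. This is a sketch assuming dense storage with 8 bytes per value (64-bit floats) for all three matrices; the slides do not state the actual representation, and the function name is hypothetical.

```python
def dense_ap_bytes(n, bytes_per_value=8, matrices=3):
    """Bytes needed to hold S, A, and R as dense n x n matrices."""
    return matrices * n * n * bytes_per_value

for n in (10_000, 165_000):
    print(n, dense_ap_bytes(n) / 1e9, "GB")
```

Under these assumptions, 10,000 micro-blogs need about 2.4 GB and the full set of 165,000 needs about 653 GB, which is why reducing the number of points or storing fewer entries matters.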



Sampling or Pruning

Sampling:
• Randomly sample from the set of micro-blogs
• Sample over the whole period in which the micro-blogs were sent
• Sample over the social network

Pruning according to specific features of micro-blogs:
• Lots of comments on a micro-blog mean it is a central opinion, or at least a central topic
• The more a micro-blog is retweeted, the more people approve of it
• Validation of the overlap between community centers and opinion exemplars
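
Two of these ideas, sampling over the sending period and pruning by engagement, can be sketched as follows. This is a minimal sketch under assumed data shapes: the (timestamp, text) tuples, the dictionary keys "retweets" and "comments", the bucket size, and the thresholds are all hypothetical.

```python
import random
from collections import defaultdict

def sample_over_time(posts, per_bucket, bucket_seconds=3600, seed=0):
    """Draw up to per_bucket posts from every time bucket.

    posts is a list of (timestamp_in_seconds, text) tuples.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for timestamp, text in posts:
        buckets[timestamp // bucket_seconds].append((timestamp, text))
    sample = []
    for key in sorted(buckets):
        group = buckets[key]
        sample.extend(rng.sample(group, min(per_bucket, len(group))))
    return sample

def prune_by_engagement(records, min_retweets=1, min_comments=1):
    """Keep records that were retweeted or commented on often enough."""
    return [r for r in records
            if r["retweets"] >= min_retweets or r["comments"] >= min_comments]

posts = [(i * 600, f"post {i}") for i in range(12)]  # one post every 10 minutes
picked = sample_over_time(posts, per_bucket=2)
print(len(picked))  # 4: two posts from each of the two hourly buckets

records = [
    {"text": "a", "retweets": 0, "comments": 0},
    {"text": "b", "retweets": 5, "comments": 0},
    {"text": "c", "retweets": 0, "comments": 3},
]
print([r["text"] for r in prune_by_engagement(records)])  # ['b', 'c']
```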



Validation of overlap between social clusters and micro-blog clusters

How micro-blog clusters spread over a social network

If they overlap a lot, clustering micro-blogs is much easier, because the social network is relatively fixed.

If they overlap only a little, we can choose several social communities as our focus.

Check whether a micro-blog cluster is coordinated or uncoordinated across the network.

The validation is topic-dependent.



Focus on the matrix

Divide the similarity matrix into a symmetric part and an asymmetric part.

Store only the values larger than a predefined threshold.

Use a decomposition of the matrix M into two vectors a and b; then we only need to store a and b.
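
The first two ideas can be sketched in plain Python. This is a sketch under assumptions: it reads the "asymmetric part" as the antisymmetric remainder (S - S^T)/2, which is one common way to split a matrix, and it stores the thresholded entries in a dictionary keyed by (i, k). The function names and the sample values are hypothetical.

```python
def split_symmetric(S):
    """Split S into (S + S^T)/2 and (S - S^T)/2, so that S is their sum."""
    n = len(S)
    sym = [[(S[i][j] + S[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(S[i][j] - S[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti

def sparsify(S, threshold):
    """Keep only the entries larger than threshold, keyed by (i, k)."""
    return {(i, k): value
            for i, row in enumerate(S)
            for k, value in enumerate(row)
            if value > threshold}

S_small = [[-3.0, -5.0],
           [-10.0, -3.0]]
sym, anti = split_symmetric(S_small)
print(sym)    # [[-3.0, -7.5], [-7.5, -3.0]]
print(anti)   # [[0.0, 2.5], [-2.5, 0.0]]
print(sparsify(S_small, threshold=-6.0))
```

Only the upper triangle of the symmetric part needs to be stored, and entries dropped by the threshold are treated as "no edge".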



A new similarity measurement

Information encoding (-log)

Word frequency (cos<x,y>)
• Tf-idf (unsuitable)
• Bag-of-words

Semantic network (long-term work)
• A big training data set
• A big dictionary and the Jaccard coefficient

Social network
• Based on the earlier validation
• If overlap is frequent, we take the relationship between the senders of two micro-blogs as a factor of similarity
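
Two of the candidate measures, bag-of-words cosine similarity and the Jaccard coefficient, can be sketched on keyword lists. This is a minimal sketch with hypothetical function names, using raw word counts for the bag-of-words vectors.

```python
import math
from collections import Counter

def cosine_similarity(words_x, words_y):
    """Cosine of the angle between two bag-of-words count vectors."""
    x, y = Counter(words_x), Counter(words_y)
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm = (math.sqrt(sum(c * c for c in x.values()))
            * math.sqrt(sum(c * c for c in y.values())))
    return dot / norm if norm else 0.0

def jaccard(words_x, words_y):
    """Size of the shared vocabulary divided by the size of the combined one."""
    a, b = set(words_x), set(words_y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

keywords_i = ["A", "B"]
keywords_k = ["A", "C", "D", "E"]
print(cosine_similarity(keywords_i, keywords_k))
print(jaccard(keywords_i, keywords_k))  # 0.2
```

Unlike the encoding-based s(i,k), both measures are symmetric, so only half of the similarity matrix has to be computed and stored.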



Conclusion

Techniques:
• Word partition
• AP
• Sampling

Challenges and solutions:
• Filtering spam micro-blogs => build a dictionary of subjective words (2)
• Word partition => build a user dictionary and improve the accuracy of partition (4)
• Similarity measurement => a semantic network (3)
• Sampling and pruning => leverage the features of the social network and micro-blogs (1)



The large-scale structure of semantic networks

A Stanford study in Cognitive Science graph-theoretically analyzes three types of semantic networks: word associations, WordNet, and Roget's thesaurus.

Conclusion: they have a small-world structure, characterized by sparse connectivity, short average path-lengths between words, and strong local clustering. In addition, the distributions of the number of connections follow power laws that suggest a hub structure similar to that found in other natural networks, such as the world wide web.



Basic ideas for building such a framework

Because new concepts are preferentially attached to rich concepts, the distribution of connectivity follows a power law.

Concepts that are learned early in life should show higher connectivity.
