
Hotspot Detection Method of Internet Public Opinion based on Key Words Extraction

1Shouhua Zhang, 1, 2Zhenpeng Liu

1College of Mathematics & Computer Science, Hebei University, Baoding 071002, China [email protected]

2Network Center, Hebei University, Baoding 071002, China, [email protected]

Abstract

Traditional methods for tracking Internet public opinion hotspots are based on text clustering, whose clustering speed and search results are unsatisfactory when handling massive numbers of web pages. The monitoring scope of an Internet public opinion system is also limited to the key words listed by its users, so the system cannot monitor unknown emergencies. Based on how public opinion arises and spreads, this paper improves the acquisition strategy for Internet public opinion information, proposes a method that automatically mines hotspot key words and clusters topics by those key words, designs hotspot analytical models for the different characteristics of news, forums and blogs, and designs and implements an Internet public opinion monitoring system. Practical operation suggests that this approach can mine hotspot topics accurately and promptly and can track and monitor emergencies in real time.

Keywords: Internet Public Opinion, Key Word, Hotspot Detection, Hotspot Topics

1. Introduction

In the network era, anyone can post comments on blogs or forums without considering their authenticity or social influence. The participation of Internet users has reached unprecedented levels. Domestic and international events alike can quickly give rise to online public opinion: users express their views and disseminate ideas through the network, and the Internet thus exerts tremendous public opinion pressure, to the point that no department or agency can ignore it.

At present, domestic research on Internet public opinion has made some advances, but several problems remain. The monitoring scope of current Internet public opinion systems is limited by the key words supplied by users. Because those key words are affected by subjective factors such as the user's knowledge, information sources and concerns, the system will not detect unexpected events [1]. Using computers to sort news, find hot topic keywords automatically, and update the word frequencies in the common safety word library in a timely manner makes it possible to track sudden events promptly [2]. The main hotspot tracking algorithms adopt text clustering. When dealing with massive numbers of web pages, it is difficult to cluster the expected hotspots, and clustering causes a large center bias, so the algorithms need to be improved [3].

This research focuses on improving the strategy for collecting Internet public opinion, mining hot topic keywords automatically, putting forward new analytical models, tracking hot topics promptly, and improving efficiency.

2. Hotspot topics mining

Following the distribution of hot topics, the system acquires information from timely mainstream media websites and search engines, which ensures reliable sources, better timeliness, a smaller volume of information and shorter processing time [4]. Key words are representative, concise, timely, information-rich and highly associated with one another, so a minimal amount of information can cover as much as possible of the subject and content of a hot topic. Keyword extraction therefore serves as the clue for tracking hot topics: hot keywords are extracted from the collected web pages and then clustered to mine the hot topics. Clustering hot topics by keywords requires little computation and is highly efficient.

International Journal of Digital Content Technology and its Applications(JDCTA) Volume6,Number16,September 2012 doi:10.4156/jdcta.vol6.issue16.41

2.1. Hotspot key words extracting

The process for extracting hot topic keywords is shown in Figure 1. The steps of vectorization, single-document keyword extraction and hotspot keyword extraction are introduced here; the other steps are introduced in Section 3.

Figure 1. The Process for Extracting Hotspot Key Words

To build the hot topic keyword set, the keyword set of every single document is extracted first. Based on the vector space model (VSM), a document d is converted into a vector of feature terms as in expression (1), where ti is a feature word of the document and wi is the weight of ti [5].

$$d = (t_1, w_1, t_2, w_2, \ldots, t_n, w_n) \qquad (1)$$

The weight is most commonly calculated by TFIDF, shown in formula (2), where w(t, d) is the weight of word t in document d. The term frequency TF(t, d) is the number of times word t occurs in document d. The document frequency DF(t) is the number of documents that contain word t, and |D| is the total number of documents. The main idea of TFIDF is that a higher TF indicates more attention, and a larger IDF means the word is more discriminative and more suitable for classification [6].

$$w(t, d) = TF(t, d) \times \log\frac{|D|}{DF(t)} \qquad (2)$$
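As a rough illustration of formula (2), the following Python sketch computes TFIDF weights for a small tokenized corpus. The function name `tfidf_weights`, the toy documents and the use of the natural logarithm are assumptions made for this example; the paper does not state the base of the logarithm or give an implementation.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Implements w(t, d) = TF(t, d) * log(|D| / DF(t)) with a natural log
    (the log base is an assumption; formula (2) does not specify it).
    """
    n_docs = len(docs)
    # DF(t): number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # TF(t, d): raw count of t in d
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical, already segmented documents
docs = [
    ["railway", "ticket", "booking", "ticket"],
    ["railway", "dairy", "standard"],
    ["dairy", "standard", "ministry"],
]
w = tfidf_weights(docs)
```

Note that a term occurring in every document gets weight 0 under this formula, since log(|D| / DF(t)) is then log(1).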

Besides TF and IDF, each word carries other useful information, such as part of speech, position and length [7]. With respect to part of speech, a named entity carries more information than a non-entity, so the weight of named entities is raised; verbs are taken as the standard, and other parts of speech are weighted lower. Longer words carry more information. The titles of web pages related to public opinion have clear themes, so the weight of words in the title needs to be increased [8][9]. The key word weight is therefore calculated by formula (3).

$$Weight(t, d) = TF(t, d) \times \log\big(|D| / DF(t)\big) \times Weight(POS(t)) \times Weight(Position(t, d)) \times Length(t) \qquad (3)$$

Weight(t, d) is the weight of candidate key word t. Weight(POS(t)) is the weight of t's part of speech, which is 2 for entity words, 1.5 for verbs and 1 for the rest. Weight(Position(t, d)) is the weight of the first occurrence of word t in document d, calculated as the number of words that follow the first occurrence of t divided by the total number of words in the document. Length(t) is the length of t. The key words are obtained by sorting the words in descending order of weight.
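The weighting described for formula (3) can be sketched in Python as below. This is only one reading of the text: the part-of-speech tags (`"entity"`, `"verb"`), the function names, the natural logarithm and the choice to measure Length(t) in characters are assumptions for illustration, not details given in the paper.

```python
import math

# Part-of-speech weights from the text: 2 for entity words, 1.5 for verbs, 1 otherwise.
# The tag names themselves are hypothetical.
POS_WEIGHT = {"entity": 2.0, "verb": 1.5}


def position_weight(term, tokens):
    """Words following the first occurrence of term, divided by document length."""
    first = tokens.index(term)
    return (len(tokens) - first - 1) / len(tokens)


def keyword_weight(term, tokens, pos_tag, n_docs, doc_freq):
    """Product form of formula (3) for one candidate key word in one document."""
    tf = tokens.count(term)                # TF(t, d)
    idf = math.log(n_docs / doc_freq)      # log(|D| / DF(t)), natural log assumed
    pos_w = POS_WEIGHT.get(pos_tag, 1.0)   # Weight(POS(t))
    length = len(term)                     # Length(t), in characters (assumption)
    return tf * idf * pos_w * position_weight(term, tokens) * length


# Hypothetical segmented document
tokens = ["hubei", "county", "promotes", "official", "county"]
weight = keyword_weight("county", tokens, "entity", n_docs=10, doc_freq=2)
```

Under this reading a word whose first occurrence is the last token of the document gets a position weight of 0, so a practical implementation would probably need a small floor value; the paper does not discuss this case.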

Hot keywords should come from the key words of the documents, so the key word sets of all documents are merged into a candidate keyword set from which the hot keywords are extracted. The more frequently a keyword occurs, the more attention it receives; the larger its IDF, the more discriminative the word is and the better it characterizes the theme. On this basis, the weight of a candidate keyword is constructed as formula (4).

$$Weight_t = tf(t) \times \log_2\big(N / tf(t)\big) \qquad (4)$$

Weightt is the weight of candidate keyword t; tf(t) is the number of documents in which candidate keyword t is a keyword; N is the number of documents; d is the number of candidate keywords used for keyword clustering. Sorting the candidates by weight in descending order yields the hot keyword sequence.

2.2. Topic clustering

Taking into account the influence of different sites and the timeliness of hotspots, the web pages are sorted in descending order, with site weight as the first key and release time as the second key.

Initially, each keyword is assumed to represent a hot topic, and clustering then begins. The first key word in the key word set is taken as the clue to the first hotspot topic. Every document whose key word set includes this key word is selected, and the selected documents are clustered. The first document is taken as the first hotspot topic, and then one topic is taken from the other documents. The cosine formula is used to calculate the similarity between this topic and every existing hotspot topic [10]; the cosine similarity is given by formula (5). If all similarity values are less than the threshold P, the topic is regarded as a new hotspot topic. Otherwise it is merged with the hotspot topic to which it is most similar. These steps are repeated for the remaining topics. The remaining documents are then handled in the same way using the second key word in the key word set. The algorithm executes iteratively until all documents have been handled.

$$Sim(d_1, d_2) = \frac{\sum_{k=1}^{n} w_{1k} \, w_{2k}}{\sqrt{\left(\sum_{k=1}^{n} w_{1k}^{2}\right)\left(\sum_{k=1}^{n} w_{2k}^{2}\right)}} \qquad (5)$$

In formula (5), w1k and w2k are the weights of the kth feature of documents d1 and d2, respectively. The topic clustering process is shown in Figure 2.
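The clustering procedure of this section can be sketched as a single-pass algorithm in Python. The sketch below makes several assumptions that the paper does not settle: each topic is represented by the keyword vector of its first document, documents are given as sparse `{term: weight}` dictionaries already sorted by site weight and release time, and the threshold value is arbitrary. All names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors, as in formula (5)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_by_keywords(docs, hot_keywords, threshold):
    """Group document indices into topics, one hot keyword at a time."""
    topics = []       # each topic: {"rep": vector of first document, "members": [indices]}
    assigned = set()  # documents already placed in some topic
    for kw in hot_keywords:
        for i, vec in enumerate(docs):
            if i in assigned or kw not in vec:
                continue
            assigned.add(i)
            sims = [cosine(vec, t["rep"]) for t in topics]
            if sims and max(sims) >= threshold:
                # Merge with the most similar existing hotspot topic
                topics[sims.index(max(sims))]["members"].append(i)
            else:
                # All similarities are below the threshold P: open a new topic
                topics.append({"rep": vec, "members": [i]})
    return [t["members"] for t in topics]


# Hypothetical keyword vectors
docs = [
    {"iran": 2.0, "uk": 1.5, "diplomats": 1.0},
    {"iran": 1.8, "uk": 1.2, "retaliation": 0.9},
    {"railway": 2.0, "bookings": 1.0},
    {"iran": 0.1, "railway": 1.9, "bookings": 1.2},
]
result = cluster_by_keywords(docs, ["iran", "railway"], threshold=0.5)
```

Documents that contain none of the hot keywords are left unassigned in this sketch. Because only documents sharing the current keyword are compared against the existing topics, the number of similarity computations stays small compared with clustering all pages at once.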

2.3. Analytical models of hotspot topics

Hotspot topic analysis models are designed separately for news, forums and blogs according to their different features. There are many news sites at present, but their quality is uneven: the reliability of different websites and the timeliness of their news differ. The number of participants and the number of comments on a news item also reflect its heat. Considering these factors, the hotspot analysis model for news is given in formula (6).

$$HotNews(t) = \sum_{i=1}^{n} Weight(S_i) \times Weight(f(pn_i, cn_i)) \qquad (6)$$

HotNews(t) is the news heat value of topic t, and n is the number of news items on topic t. Weight(Si) is the weight of the website that published news item i. Weight(f(pni, cni)) is the weight based on the number of participants and the number of comments, where pni is the number of participants in news item i and cni is the number of comments on news item i. f(pni, cni) is computed by formula (7).

$$f(pn_i, cn_i) = \alpha \times pn_i + \beta \times cn_i, \quad 0 < \alpha, \beta < 1, \quad \alpha + \beta = 1 \qquad (7)$$

α and β are adjustment coefficients; in general, α = 0.2 and β = 0.8.
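A minimal Python sketch of formulas (6) and (7) follows. It assumes that Weight(f(·)) is the step function listed in Table 1 of Section 4; the site names and site weights are invented for the example, since the paper does not list the values it used.

```python
from bisect import bisect_right

# Upper-exclusive bin edges and the corresponding weights, taken from Table 1
BIN_EDGES = [500, 1000, 2000, 3000, 5000, 7000, 10000, 15000, 20000]
BIN_WEIGHTS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def f(participants, comments, alpha=0.2, beta=0.8):
    """Formula (7): weighted sum of participant and comment counts."""
    return alpha * participants + beta * comments


def engagement_weight(score):
    """Map an f(...) score to its Table 1 weight, e.g. [500, 1000) -> 0.2."""
    return BIN_WEIGHTS[bisect_right(BIN_EDGES, score)]


def hot_news(articles, site_weight):
    """Formula (6): sum over the news items of one topic."""
    return sum(
        site_weight[a["site"]] * engagement_weight(f(a["participants"], a["comments"]))
        for a in articles
    )


# Hypothetical site weights and articles for a single topic
site_weight = {"site_a": 1.0, "site_b": 0.5}
articles = [
    {"site": "site_a", "participants": 1000, "comments": 100},
    {"site": "site_b", "participants": 20000, "comments": 2000},
]
heat = hot_news(articles, site_weight)
```

Because the heat value is a sum over news items, topics covered by more articles on highly weighted sites score higher even when each single article has modest engagement.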

Forums offer useful information beyond that available for news, for example the number of times a topic is reposted. The hotspot analysis model for forums is given in formula (8).

$$HotForumValue(t) = \sum_{i=1}^{n} Weight(S_i) \times Weight(f(bn_i, rn_i)) \times Weight(r_i) \qquad (8)$$

HotForumValue(t) is the forum heat value of topic t. Weight(Si) is the weight of the website that hosts post i. Weight(f(bni, rni)) is the weight based on the number of views and the number of replies of post i, where bni is the number of times post i was viewed and rni is the number of replies to post i; f(bni, rni) is computed as in formula (7). Weight(ri) is the weight of the number of times post i was reposted.

The blog hotspot analysis model is similar to the forum hotspot analysis model.

Figure 2. Topic Clustering Process

3. System design

To study and assess network public opinion, a real-time monitoring and tracking system for network public opinion is designed using the above models. Based on the way Internet public opinion arises and spreads, the system improves the information acquisition strategy and automatically mines key words from the titles of web pages. It then clusters topics with these key words. By reading these topics, public opinion analysts can learn what is happening and what has happened. Furthermore, the system can automatically and continuously track how an event develops, helping analysts grasp the overall picture of the event quickly and comprehensively.

The system has five components: public opinion topic planning, public opinion information acquisition, public opinion information preprocessing, public opinion information analysis, and public opinion processing. The system architecture is shown in Figure 3.

(1) Public opinion topic planning

In public opinion topic planning, the public opinion supervising department chooses, based on its needs, a suitable public opinion topic and the corresponding collection of seed URLs, and then determines the acquisition task. A fixed topic is the basis of public opinion analysis, and it is determined by the key word set applied in the system [11]. Key words are the clue to hot topics, and the quality of their extraction directly determines accuracy. There are two extraction methods, manual and automatic. Automatic extraction is the process in which a program extracts common features from a set of web pages and weights them by frequency. Manual extraction is simple to implement, and because human experience is normally accurate, large deviations can be avoided; its drawbacks include omissions and inaccurate weighting.

The choice of public opinion information sources is crucial to the subsequent public opinion mining. Accurately identifying the original sources of public opinion information yields a more comprehensive picture of public opinion. Following the distribution of hot topics, the system acquires information from timely mainstream media websites and search engines, which ensures reliable sources, better timeliness, a smaller volume of information and shorter processing time.

Figure 3. System Architecture

(2) Public opinion information acquisition

Internet public opinion information acquisition is the process of collecting web pages related to the planned topic. The system automatically acquires pages through their link relationships and extends across the Web by following the links. It collects different information items according to the different features of news, forums and blogs. After preprocessing, the acquired web data is stored in the database to provide a high-quality data source for analysis. A queue of web pages is created and accessed via different protocols, and the pages are finally downloaded for further analysis. The system employs a multithreaded parallel acquisition strategy to improve efficiency.

Concretely, all web pages to be acquired are stored in the queue, and pages are continuously assigned to threads for acquisition. When a thread finishes, it sends a request to the master process to be assigned a new page, until the queue is empty.
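The queue-and-threads strategy can be sketched in Python as follows. This is not the system's actual code (the paper's system was built on .NET): here a shared thread-safe queue stands in for the master process that hands out pages, and `fetch` is a hypothetical callable supplied by the caller, so the sketch runs without network access.

```python
import queue
import threading


def crawl(urls, fetch, n_workers=4):
    """Download every URL in urls with n_workers threads; returns {url: page}."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Asking the queue for the next page plays the role of the
                # request to the master process described in the text
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue is empty, so this thread stops
            page = fetch(url)
            with lock:
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP download (assumption for the example)."""
    return "<html>" + url + "</html>"


pages = crawl(["news/1", "forum/2", "blog/3"], fake_fetch, n_workers=2)
```

A real crawler would also push newly discovered links back onto the queue, which requires a different termination condition than "queue empty"; that part is omitted here.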

(3) Public opinion information preprocessing

Public opinion information preprocessing is the preliminary treatment of the acquired web information, which lays the foundation for later processing. It includes web page analysis, web text segmentation, word filtering, etc. [12]. Source web pages include many advertisements, pictures and links, which carry no value for processing and cost system resources and processing time. Meanwhile, the data comes from different sources and is presented differently, so the system employs an HTML parser and regular expressions to label the title, source, author, release time, text, etc. [13]. Since Chinese word segmentation is well known, no further details are given here [14]. After segmentation, the result is filtered through the stop word list and the filtering rules. The stop word list includes function words such as auxiliaries, prepositions and conjunctions, as well as single characters with no practical meaning [15]. For strings that obviously carry no practical meaning, such as the many numeral-classifier collocations and common meaningless prefixes and suffixes, rules are designed to perform the filtering [16].
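The filtering step after segmentation might look like the Python sketch below. The stop word list and the numeric rule are small English stand-ins invented for illustration; the paper works on Chinese text and does not publish its list or rules.

```python
import re

# Hypothetical stand-in stop word list (function words)
STOP_WORDS = {"the", "of", "and", "to", "in"}

# Hypothetical rule: drop pure numbers and number-plus-unit strings such as "3kg" or "15%"
NUMERIC_RULE = re.compile(r"^\d+(\.\d+)?[a-z%]*$")


def filter_tokens(tokens):
    """Remove stop words, single characters and numeric strings from a token list."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in STOP_WORDS:
            continue  # function word from the stop word list
        if len(low) == 1:
            continue  # single character with no practical meaning
        if NUMERIC_RULE.match(low):
            continue  # numeral or numeral-plus-unit collocation
        kept.append(tok)
    return kept


tokens = ["The", "Ministry", "of", "Railways", "a", "2011", "15%", "bookings"]
filtered = filter_tokens(tokens)
```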

(4) Public opinion information analysis

Public opinion information analysis is the most important part. It includes hot topic mining on the acquired news, comments and other information according to the hot topic analysis models, as well as warning analysis, text clustering, topic merging, etc. The work process of Internet public opinion information analysis is shown in Figure 4.

(5) Public opinion processing

Public opinion processing includes warning, reporting and guiding [17]. First, warning and forecasting are performed on the analysis results according to the indicators, and then the organized public opinion is reported to the corresponding departments to support decision-making.

Figure 4. Work Process of Internet Public Opinion Information Analysis

4. System implementation and analysis

The system was implemented with Microsoft Visual Studio .NET 2005. For ease of deployment and use, the system adopts a B/S (browser/server) structure, with SQL Server 2005 as the DBMS.

Sina, Phoenix, NetEase, Mop, Tianya and other sites were used as information sources to track the reports and posts of December 1, 2011. Five key words were extracted from each article. Clustering uses formula (5), and the weights based on the number of participants and comments are shown in Table 1. The tracking results for hot news are shown in Table 2.

Table 1. Weight design based on the number of participants and comments

| f(bn, rn) | Weight(bn, rn) |
|---|---|
| [0, 500) | 0.1 |
| [500, 1000) | 0.2 |
| [1000, 2000) | 0.3 |
| [2000, 3000) | 0.4 |
| [3000, 5000) | 0.5 |
| [5000, 7000) | 0.6 |
| [7000, 10000) | 0.7 |
| [10000, 15000) | 0.8 |
| [15000, 20000) | 0.9 |
| [20000, ∞) | 1 |

Table 2. The tracking results of hot news

| Hot topics | Number of articles | Participants | Comments | Heat value |
|---|---|---|---|---|
| Hubei, Jiangling, county, Standing Committee, promotion | 5 | 74192 | 2034 | 1.88 |
| Shanxi, Lueyang, suspected, whore, girls | 11 | 54649 | 2085 | 1.61 |
| Iran, UK, expulsion, diplomats, retaliation | 35 | 26360 | 796 | 1.07 |
| Ministry of Railways, standard, bookings, daily, queuing | 6 | 5277 | 1138 | 0.55 |
| Ministry of Health, dairy, standard, enterprise, kidnapping | 8 | 5531 | 381 | 0.55 |

The operating results indicate that the proposed hotspot tracking method for network public opinion is effective and that the hotspot analysis models identify hot topics accurately. The system provides accurate topic planning and extraction of hot information.

5. Conclusion

To address the problems in current Internet public opinion monitoring systems, a fast and effective Internet public opinion monitoring system is designed. The information acquisition strategy is improved based on the characteristics of how Internet public opinion arises and spreads. The acquisition range is restricted to highly timely reports, authoritative websites and search engines, which reduces the amount of information and achieves faster processing, higher reliability and better analysis accuracy. The system automatically mines hot keywords to conduct topic clustering, and analytical models of hotspot topics are designed separately for news, forums and blogs according to their different features. Running tests show that the system is capable of finding hot topics and tracking emergencies in real time.

6. Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Grant No. 60873203), the National Philosophy and Social Science Foundation of China (Grant No. 10BTQ039) and the China Postdoctoral Science Foundation funded project (Grant No. 20070420700).

7. References

[1] DING Jie, XU Jungang, “IPOMS: an Internet Public Opinion Monitoring System”, International Conference on the Applications of Digital Information and Web Technologies, pp.433-437, 2009.

[2] ZHENG Kui, SHU Xueming, YUAN Hongyong, “Hot Spot Information Auto-detection Method of Internet Public Opinion”, Computer Engineering, vol.36, no.3, pp.4-6, 2010.

[3] LIU Hong, LI Xiaojun, “Internet Public Opinion Hotspot Detection Research Based on K-means Algorithm”, International Conference on Swarm Intelligence (ICSI 2010), pp.594-602, 2010.

[4] LU Bei, CHENG Xiao, CHEN Zhiqun, “Overview of the Study of Internet Public Opinion Mining”, Information and Documentation Services, no.2, pp.41-45, 2010.

[5] LI Shendong, LV Xueqiang, LI Yuqin, SHI Shuicai, “Study on Feature Selection Algorithm in Topic Tracking”, IJIPM: International Journal of Information Processing and Management, vol.1, no.1, pp.25-33, 2010.

[6] SHI CongYing, XU Chaojun, YANG Xiaojiang, “Study of TFIDF algorithm”, Journal of Computer Applications, vol.29, no.6, pp.167-180, 2009.

[7] LIAN Jie, LIU Yun, “Web Data Preprocessing and Automatic Abstract for Internet Public Opinion”, Journal of Beijing Jiaotong University, vol.34, no.5, pp.94-99, 2010.

[8] LI Hengxun, ZHANG Huaping, QIN Peng et al, “Keywords Based Hot Topic Detection on Internet”, Fifth China Conference on Information Retrieval (CCIR 2009), pp.134-143, 2009.

[9] Tang Hanqing, Wang Hanjun, “Application of Improved K-Means Algorithm to Analysis of Internet Public Opinion”, Computer Systems & Applications, vol.20, no.3, pp.165-168, 2011.

[10] Swe Swe Nyein, “Mining Contents in Web Page Using Cosine Similarity”, In Proceeding(s) of the 3rd International Conference on Computer Research and Development, pp.472-475, 2011.

[11] PAN Zhenggao, “Research of Analyzing Public opinion on the Web Based on Topic Key Words”, Journal of Suzhou University, vol.25, no.5, pp.28-30, 2010.

[12] Gong Zhe, LI Qi, ZHANG Jianyi, XIN Yang, NIU Xinxin, “An Online Hot Topics Detection Approach Using the Improved Ant Colony Text Clustering Algorithm”, JCIT: Journal of Convergence Information Technology, vol.7, no. 2, pp.243-252, 2012.

[13] Qian Aibing, “A Model for Analyzing Public Opinion under the Web and Its Implementation”, New Technology of Library and Information Service, no.4, pp.49-55, 2008.

[14] LIU Jian, WEI Cheng, “Arithmetic Research on Chinese Segmentation”, Microcomputer Applications, vol.29, no.8, pp.11-16, 2008.

[15] Hua Bolin, “Stop-word Processing Technique in Knowledge Extraction”, New Technology of Library and Information Service, no.8, pp.48-51, 2007.

[16] ZENG Yiling, XU Hongbo, BAI Shuo, “Research on the Extraction and Organization of Key Phrases in Web Texts”, Journal of Chinese Information Processing, vol.22, no.3, pp.64-70, 2008.

[17] LIU Jianyu, “Research on the Web Excavating Techniques in the Network Public Sentiment Warning and Its Application”, Journal of Sichuan Police College, vol.21, no.3, pp.77-81, 2009.
