Online topic detection model design and implementation (xwu/wie/courseslides/ye-topicdetection.pdf)

Upload: vonga

Post on 12-Sep-2018

Page 1: Online topic detection model design and implementationxwu/wie/CourseSlides/Ye-TopicDetection.pdf · • Topic detection stage • Hot topic model • Experiments results • Conclusion

Online topic detection model design and implementation

Page 2

Content

• Introduction

• Topic detection stage

• Hot topic model

• Experimental results

• Conclusion

Page 3

Introduction

• Some topics in news on the Internet have a great impact on real society.

• Messages on BBS also influence our real life in some way.

• It is necessary to build an intelligent system that helps discover hot topics embedded on the web automatically.

Page 4

• Hot topics are considered across news from news websites combined with messages from BBS.

• The algorithm proposed in this research is based on the principle of TF·PDF (Term Frequency · Proportional Document Frequency).

Page 5

• The algorithm has been adapted to assign heavy weight to topics that are discussed in many documents from many sources concurrently.

• Based on the principle of a stock index, a topic index is introduced to show how hot topics develop over time.

Page 6

Two stages:

1. topic detection and clustering

2. hot topic discovery and generation of topic index

Page 8

Topic detection stage

• The objective of topic detection:

– Identify topically related stories without positive or negative training stories.

• Detection is similar to tracking, but no particular training stories are provided for a particular topic.

Page 9

• Topic detection:

– clustering has to be done on the fly

– resulting clusters have to be non-overlapping

• Use the tf·idf weighting scheme and the cosine similarity metric

Page 10

• Document Representation

$$tf = \frac{t}{t + 0.5 + 1.5 \cdot \frac{dl}{avgdl}}$$

$$idf = \frac{\log\left(\frac{N + 0.5}{df}\right)}{\log(N + 1)}$$

Page 11

• t: number of times the feature occurs in the document

• dl: the document's length

• avgdl: average document length in the collection

• N: total number of documents in the collection

• tf: term frequency, the degree to which the term describes the contents of a document

• idf: the logarithm of the inverse document frequency in the collection; intended to discount very common words, as they have little discrimination power

• df: number of documents in the collection in which the feature appears
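The two weighting formulas can be sketched in a few lines of Python. This is only an illustration of the tf and idf components as read from the slides, tf = t / (t + 0.5 + 1.5 · dl / avgdl) and idf = log((N + 0.5) / df) / log(N + 1). The function names are hypothetical, and taking the product of the two as the feature weight is an assumption based on the "tf·idf weighting scheme" wording.

```python
import math


def tf_component(t, dl, avgdl):
    """Term frequency component: t / (t + 0.5 + 1.5 * dl / avgdl)."""
    return t / (t + 0.5 + 1.5 * dl / avgdl)


def idf_component(df, n_docs):
    """Inverse document frequency component: log((N + 0.5) / df) / log(N + 1)."""
    return math.log((n_docs + 0.5) / df) / math.log(n_docs + 1)


def tfidf_weight(t, dl, avgdl, df, n_docs):
    """Assumed feature weight: the product of the two components."""
    return tf_component(t, dl, avgdl) * idf_component(df, n_docs)


# A feature occurring twice in a document of average length
# gets tf = 2 / (2 + 0.5 + 1.5) = 0.5.
print(tf_component(2, dl=100, avgdl=100))
# Rare features receive a larger idf than common ones.
print(idf_component(1, 100), idf_component(50, 100))
```

Note that the idf component is normalized by log(N + 1), so for 1 ≤ df ≤ N it stays between 0 and 1.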

Page 12

• Topic Similarity

$$Sim(D, T) = \frac{\sum_{i \in H} t_i d_i}{\sqrt{\left(\sum_{i \in H} t_i^2\right)\left(\sum_{i \in H} d_i^2\right)}}$$

• H is the feature collection, which includes the features in both new document D and topic document T.

• $d_i$ and $t_i$ reflect the weights assigned to feature i in the document D and topic T, respectively.
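As a minimal sketch of the similarity measure, the following Python function computes the cosine similarity between a document and a topic, each stored as a sparse dictionary from feature to weight. The dictionary representation and the function name are assumptions made for illustration; H is taken to be the union of the features of D and T.

```python
import math


def cosine_similarity(doc_weights, topic_weights):
    """Cosine similarity between sparse feature-weight dictionaries."""
    # H: the features that occur in the document, the topic, or both.
    features = set(doc_weights) | set(topic_weights)
    dot = sum(doc_weights.get(i, 0.0) * topic_weights.get(i, 0.0) for i in features)
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_t = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_d == 0.0 or norm_t == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_d * norm_t)


print(cosine_similarity({"tsunami": 1.0, "asia": 1.0}, {"tsunami": 1.0, "asia": 1.0}))
print(cosine_similarity({"tsunami": 1.0}, {"election": 1.0}))
```

Features missing from one side contribute zero to the numerator, so only shared features raise the similarity.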

Page 13

• Online Clustering Algorithm

– single-link incremental (single pass) clustering

(1) Create a cluster containing the first story $D_1$

(2) For each subsequent story $D_k$, k > 1, in the stream:

(a) use $D_k$ to update collection-wide idf statistics

(b) apply tf·idf weighting to $D_k$

(c) find the most similar story from the past:

$$D^* = \arg\max_{D_j,\ j<k} sim(\vec{D_k}, \vec{D_j})$$

(d) if $sim(\vec{D_k}, \vec{D^*}) < \theta$ then create a new cluster containing just $D_k$, else add $D_k$ to the cluster containing $D^*$.
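The single-pass procedure can be sketched as follows. This is an illustrative Python version under several assumptions that the slides do not state: stories arrive as lists of tokens, the vector of a past story is kept as it was computed on arrival, and the default threshold value is arbitrary. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def single_pass_cluster(stories, theta=0.2):
    """Return one cluster id per story, assigned in arrival order."""
    df = Counter()   # number of stories seen so far that contain each feature
    total_len = 0    # total number of tokens seen so far
    vectors = []     # tf-idf vector of every past story
    labels = []      # cluster id of every past story
    for tokens in stories:
        counts = Counter(tokens)
        # (a) update collection-wide statistics with the new story
        df.update(counts.keys())
        total_len += len(tokens)
        n_docs = len(vectors) + 1
        avgdl = total_len / n_docs
        # (b) apply tf-idf weighting to the new story
        vec = {}
        for f, t in counts.items():
            tf = t / (t + 0.5 + 1.5 * len(tokens) / avgdl)
            idf = math.log((n_docs + 0.5) / df[f]) / math.log(n_docs + 1)
            vec[f] = tf * idf
        # (c) find the most similar past story
        best_j, best_sim = None, -1.0
        for j, past in enumerate(vectors):
            s = cosine(vec, past)
            if s > best_sim:
                best_j, best_sim = j, s
        # (d) start a new cluster, or join the cluster of the best match
        if best_j is None or best_sim < theta:
            labels.append(max(labels, default=-1) + 1)
        else:
            labels.append(labels[best_j])
        vectors.append(vec)
    return labels


stories = [
    ["tsunami", "asia", "quake"],
    ["tsunami", "asia", "relief"],
    ["election", "ukraine", "vote"],
]
print(single_pass_cluster(stories, theta=0.1))
```

Because each story joins exactly one cluster, the resulting clusters are non-overlapping, and because every story is compared only with stories that arrived before it, the clustering runs on the fly.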

Page 14

Second Stage

• Hot Topic Discovery

– Topics discussed by many websites are more important than those discussed by only a few websites.

– Messages on BBS are also taken into consideration.

• Hot topics: topics that appeared in news or were discussed on BBS frequently, in many documents from many major news or BBS websites, within a period.

Page 15

• Based on TF*PDF, DF (document frequency) is adopted in place of TF:

$$\text{Topic weight}(j) = \sum_{s=1}^{K} \left( \left|D_{js}^{(t)}\right| \times \exp\left(\frac{D_{js}^{(t)}}{N_s^{(t)}}\right) \times W_s \times WB \right)$$

$$\left|D_{js}^{(t)}\right| = \frac{D_{js}^{(t)}}{\sqrt{\sum_{c=1}^{C} \left(D_{cs}^{(t)}\right)^2}}$$

Page 16

• $D_{js}^{(t)}$ is the number of documents on topic j on website s within time period t (with C topics in total).

• $N_s^{(t)}$ is the total number of documents on website s.

• $W_s$ is the weight of website s: 1.0 if the website is a news website, 0.1 if the website serves as a BBS message source.

• WB (website and BBS) is 2.0 if a topic appears both in news and on a BBS message board concurrently; otherwise it is 1.0.

Page 17

• There are altogether five major components in the algorithm:

– The first component is the summation of the topic weight gained from each website.

– The second component is $\left|D_{js}^{(t)}\right|$, the normalized topic frequency of a topic on a website.

– The third component is $\exp\left(D_{js}^{(t)} / N_s^{(t)}\right)$, the PDF (proportional document frequency) of a topic on a website. It is the exponential of the ratio of the number of documents on topic j to the total number of documents on website s.

Page 18

– The fourth component is the weight of a website, $W_s$. News websites and BBS websites have different weights.

– The last component is the interplay factor WB. If a topic appears both in news and on BBS, WB is 2.0; if it appears only in news or only on BBS, WB is 1.0.

Page 19

• The total weight of a topic equals the sum of the topic's weights on each website.

• The proposed algorithm gives significance to topics that have many documents on the majority of websites.
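The topic weight can be sketched in Python as below. This is a hedged illustration, not the authors' implementation. It assumes that every document on a website belongs to exactly one topic, so that the site total N_s is the sum of the per-topic counts, and it derives WB from whether a topic has documents on at least one news site and at least one BBS site. The function name and the input layout are hypothetical.

```python
import math


def topic_weights(doc_counts, is_news):
    """
    doc_counts[s][j]: number of documents on topic j on website s in the period.
    is_news[s]: True for a news website (W_s = 1.0), False for a BBS (W_s = 0.1).
    Returns a dictionary mapping each topic to its weight.
    """
    in_news, in_bbs = set(), set()
    for s, counts in doc_counts.items():
        present = {j for j, d in counts.items() if d > 0}
        (in_news if is_news[s] else in_bbs).update(present)

    weights = {}
    for j in in_news | in_bbs:
        # Interplay factor WB: 2.0 if the topic appears in both news and BBS.
        wb = 2.0 if (j in in_news and j in in_bbs) else 1.0
        total = 0.0
        for s, counts in doc_counts.items():
            d_js = counts.get(j, 0)
            if d_js == 0:
                continue
            n_s = sum(counts.values())  # assumed site total N_s
            norm = math.sqrt(sum(d * d for d in counts.values()))
            w_s = 1.0 if is_news[s] else 0.1
            total += (d_js / norm) * math.exp(d_js / n_s) * w_s * wb
        weights[j] = total
    return weights


counts = {"news_a": {"t1": 3, "t2": 4}, "bbs_a": {"t1": 1}}
print(topic_weights(counts, {"news_a": True, "bbs_a": False}))
```

In this toy input, topic t1 appears on both the news site and the BBS, so both of its per-site contributions are doubled by WB, while t2 appears only in news and keeps WB = 1.0.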

Page 20

• Topic Index

– A stock market index is a number that indicates the relative level of prices or value of securities in a market on a particular day compared with a base-day figure.

$$\text{Day 2's index} = \frac{\text{Day 2's portfolio value}}{\text{Base day's portfolio value}} \times \text{Base day's index}$$

Page 21

• Topic Index

– Following the stock index idea, we introduce the topic index.

$$\text{Topic index}(t_x) = \frac{\text{The } x\text{th day's topic weight}}{\text{Base day's topic weight}} \times \text{Base day's index}$$

As the value assigned to the base day index for a stock market is usually 100 or 1000, we similarly assigned 100 as the value of the base day index.
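The index computation is a single ratio, sketched here in Python. The function names are hypothetical, and treating the first day in the series as the base day is an assumption made for this example.

```python
def topic_index(day_weight, base_day_weight, base_index=100.0):
    """Topic index of a day: (day's weight / base day's weight) * base day's index."""
    return day_weight / base_day_weight * base_index


def index_series(weights_by_day, base_index=100.0):
    """
    weights_by_day: list of (day, topic weight) pairs in chronological order.
    The first pair is taken as the base day (an assumption for this sketch).
    """
    base_weight = weights_by_day[0][1]
    return [(day, topic_index(w, base_weight, base_index)) for day, w in weights_by_day]


# Made-up weights for illustration only: the base day maps to 100,
# a weight 1.5 times the base maps to 150, and half the base maps to 50.
print(index_series([(26, 2.0), (27, 3.0), (28, 1.0)]))
```

An index above 100 means the topic carries more weight than on its base day, and an index below 100 means it carries less.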

Page 22

• With the topic index, it is easy to see how much a topic has changed at different times compared with its base day weight.

• The topic index is also helpful for experts to infer the future development of some topics.

Page 23

Experiment

• The online news was taken from the websites www.people.com.cn and www.xinhuanet.com, and the messages (or topics) from BBS on the websites www.tom.com and www.sina.com.cn, concurrently.

Page 24

• The clustering process in the first stage produced 2503 different clusters. On the basis of those clusters, the topics were weighted in the second stage.

Page 25

| No. | Topic | Topic weight |
|-----|-------|--------------|
| 1 | Reports about the tsunami and earthquake that happened in South Asia | 2.5588 |
| 2 | Tsunami disaster situation in Thailand | 1.4099 |
| 3 | Ukraine's general election | 1.1199 |
| 4 | Post-war situation in Iraq | 0.9533 |
| 5 | Donations from around the world for tsunami disaster relief | 0.9308 |
| 6 | Aftermath of the tsunami disaster in Sri Lanka | 0.8227 |
| 7 | Tsunami disaster in Indonesia | 0.8109 |
| 8 | Relationship between America and China | 0.7504 |
| 9 | General election in Palestine | 0.6672 |
| 10 | NPC Standing Committee's discussion on the drafts of several important laws | 0.4504 |

Page 26

Topic index

Columns 25 to 31 give the topic index on each date.

| Topic No. | Base day's value (date) | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|-----------|-------------------------|----|----|----|----|----|----|----|
| 1 | 2.0509 (26) | - | 100 | 108.1 | 98.2 | 106.7 | 103.2 | 88.67 |
| 2 | 0.6518 (26) | - | 100 | 256.8 | 139.8 | 86.97 | 140.4 | 65.35 |
| 3 | 0.4639 (25) | 100 | 60.89 | 331.2 | 191.3 | 240.2 | 48.37 | 91.83 |
| 4 | 0.5206 (25) | 100 | 118.3 | 207.9 | 103.8 | 164.9 | 289.8 | 84.49 |
| 5 | 0.1149 (26) | - | 100 | 202.6 | 314.5 | 585.7 | 627.2 | 483 |
| 6 | 0.3438 (26) | - | 100 | 142 | 389 | 106.1 | 128 | 63.9 |
| 7 | 0.3582 (26) | - | 100 | 125.4 | 262.1 | 68.3 | 91.17 | 106.9 |
| 8 | 1.1610 (25) | 100 | 37.94 | 108.6 | 50.67 | 54.16 | 45.8 | 37.88 |
| 9 | 1.1372 (25) | 100 | 89.2 | 76.3 | 102.8 | 58 | 96.3 | 35.6 |
| 10 | 0.5206 (25) | 100 | 98.63 | 86.3 | 72.45 | 65.8 | 101.2 | 53.7 |

Page 27

• The curve for topic No. 1

[Figure: topic index of topic no. 1 plotted against date (26 to 31); vertical axis from 85 to 110]

Page 28

• The curve for topic No. 5

[Figure: topic index of topic no. 5 plotted against date (26 to 31); vertical axis from 100 to 700]

Page 29

Conclusion

• The proposed model is effective in periodically discovering hot topics on websites.

• Based on the TF*PDF algorithm, it performs well in picking out hot topics by exploiting the idea that a hot topic appears frequently in many documents from multiple website sources.

• The topic index approach is introduced to observe the change and development of heavily weighted topics; it gives a clear daily indication of the extent to which a topic is gaining popularity.

• The model is also useful for experts to discover trends in sensitive topics and to understand shifts in public interest in a timely manner.