T-Scroll: Visualizing Trends in a Time-Series of Documents for Interactive User Exploration

1 T-Scroll: Visualizing Trends in a Time-Series of Documents for Interactive User Exploration Yoshiharu Ishikawa and Mikine Hasegawa Nagoya University, Japan [email protected]


1

T-Scroll: Visualizing Trends in a Time-Series of Documents for Interactive User Exploration

Yoshiharu Ishikawa and Mikine Hasegawa

Nagoya University, Japan [email protected]

2

Outline
- Background and objective
- Related work
- Novelty-based document clustering
- Overview of T-Scroll system
- Evaluation
- Conclusions and future work

3

Background
- Time-series of documents
  - Examples: news articles delivered on the Internet, online academic journals
  - Delivered continually, every day
- Problems
  - A large number of documents: appropriate summarization is required
  - Topics change over time: topic detection/tracking and trend extraction are useful

4

Objectives
- Development and evaluation of T-Scroll (Trend/Topic-Scroll)
  - User interface for visualizing the transition of topics extracted from a time-series of documents
- System features
  - Built on a document clustering system that outputs new clustering results periodically
  - Clusters are displayed along the time axis like a scroll
  - Links are shown between related clusters to represent topic transitions
  - Some useful features for interactive exploratory analysis

5


7

Visualization of a time-series of documents
- A few systems exist for visualizing trends in a time-series of documents
- ThemeRiver (Havre et al., IEEE Trans. Visualization and Computer Graphics, 2002) [4]
  - Visualizes topic streams like a river
  - Focuses on visual impact
  - No features for analysis and browsing
- TimeMine (Swan and Allan, SIGIR'00) [5]
  - Extracts topics from a time-series of documents
  - Displays timelines to represent topics on the screen

8

ThemeRiver

Analysis of the articles related to Cuba (1960 – 1961)

9

TimeMine Swan & Allan (U. of Massachusetts)

10

Analysis of time-dependent clusters
- Mei & Zhai (KDD'05) [6]
  - Statistical approach for discovering major topics from a time-series of documents
  - Probabilistic modeling
- MONIC (Spiliopoulou et al., KDD'06) [7]
  - Detects various types of patterns from cluster transitions
  - Examples: splitting/merging of clusters, cluster size changes
  - Based on the analysis of historical snapshots of clusters


12

Novelty-based document clustering (1)
- Developed by our group (ECDL'01 [8], WWW Journal 2007 [10], etc.)
- Clusters documents incrementally based on their similarity and novelty
- Features
  - Similarity considers novelty: high weights are assigned to recent documents, low weights to old ones
  - Document weights decay as time passes, based on the concept of obsolescence (aging)
  - Old documents whose weights fall below a threshold are deleted
  - Incremental processing: low update cost

13

Novelty-based document clustering (2)
- Periodic clustering is performed on a time-series of documents
[Figure: time axis with marks τ; example clusters "Yeltsin's Death", "New President Sarkozy", "Blair to Resign", and "Other articles". Callout: "Yeltsin's Death" and other documents are obsolete!]

14

Document similarity (1)
- $dw_i$: the weight of document $d_i$ at the current time $\tau$
$$dw_i|_{\tau} = \lambda^{\tau - T_i}$$
- $T_i$: acquisition time of document $d_i$
- $\lambda$ ($0 < \lambda < 1$): forgetting factor; determines the forgetting speed
- The weight of a document decreases exponentially as time passes
- Assumption: each delivered document gradually loses its value as time passes
[Figure: $dw_i$ plotted against time $t$; the weight is 1 at the acquisition time $T_i$ and has decayed to $\lambda^{\tau - T_i}$ at the current time $\tau$]
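A minimal sketch of the forgetting model on this slide, assuming the weight of a document is λ^(τ − T_i) and that documents whose weight falls below a threshold are deleted (as stated on the earlier "Features" slide). The names Doc, doc_weight, and prune_obsolete, the use of days as the time unit, and the sample values are illustrative assumptions, not taken from the T-Scroll implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    acquired_at: float  # T_i, in days (hypothetical unit)


def doc_weight(doc: Doc, now: float, lam: float) -> float:
    """Weight dw_i at time `now`: lam ** (now - T_i), with 0 < lam < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("forgetting factor must satisfy 0 < lam < 1")
    return lam ** (now - doc.acquired_at)


def prune_obsolete(docs, now, lam, threshold):
    """Keep only documents whose weight is still at or above the threshold."""
    return [d for d in docs if doc_weight(d, now, lam) >= threshold]


docs = [Doc("old", 0.0), Doc("recent", 9.0), Doc("new", 10.0)]
kept = prune_obsolete(docs, now=10.0, lam=0.5, threshold=0.01)
print([d.doc_id for d in kept])  # prints ['recent', 'new']
```

With λ = 0.5 per day, a ten-day-old document has weight 0.5^10 (about 0.001) and is dropped, while a one-day-old document still has weight 0.5.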

15

Document similarity (2)
- Similarity score of documents $d_i$ and $d_j$
$$sim(d_i, d_j) = \Pr(d_i, d_j) = \Pr(d_i)\,\Pr(d_j)\,\frac{d_i \cdot d_j}{len_i \, len_j}$$
- Based on the novelty of documents and word occurrence patterns in the documents
- Extension of the tf-idf method
- New documents have a high impact on the clustering result
- Document clustering: k-means method
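A rough sketch of the similarity score, read as sim(d_i, d_j) = Pr(d_i) Pr(d_j) (d_i · d_j) / (len_i len_j). Several details are assumptions made for illustration only: Pr(d_i) is taken to be the document weight divided by the total weight, documents are sparse term vectors stored as dictionaries, and len_i is taken to be the Euclidean length of the vector. The slides do not spell out these definitions.

```python
import math


def doc_probabilities(weights):
    """Pr(d_i): each document weight divided by the total weight (assumed normalization)."""
    total = sum(weights.values())
    return {doc_id: w / total for doc_id, w in weights.items()}


def dot(u, v):
    """Dot product of two sparse term vectors stored as dicts."""
    return sum(x * v.get(term, 0.0) for term, x in u.items())


def similarity(pr_i, vec_i, pr_j, vec_j):
    """Pr(d_i) * Pr(d_j) * (d_i . d_j) / (len_i * len_j)."""
    len_i = math.sqrt(dot(vec_i, vec_i))  # assumed: Euclidean length
    len_j = math.sqrt(dot(vec_j, vec_j))
    if len_i == 0.0 or len_j == 0.0:
        return 0.0
    return pr_i * pr_j * dot(vec_i, vec_j) / (len_i * len_j)


# Three documents with identical wording; "c" is older, so its weight is lower.
weights = {"a": 1.0, "b": 1.0, "c": 0.5}
vec = {"election": 3.0, "vote": 4.0}
pr = doc_probabilities(weights)

sim_ab = similarity(pr["a"], vec, pr["b"], vec)
sim_ac = similarity(pr["a"], vec, pr["c"], vec)
print(sim_ab > sim_ac)  # prints True
```

Even though all three documents have the same words, the pair of recent documents scores higher than a pair that includes the older one, which is how new documents come to dominate the clustering result.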


17

T-Scroll: Idea
- Periodic clustering results are displayed like a scroll
- Links represent related cluster pairs

18

19

System functionalities (1)
- Cluster labels: terms are selected for cluster $C_p$ based on the score
$$score(t_j) = \sum_{d_i \in C_p} \Pr(d_i)\, tf_{ij}$$
  - $\Pr(d_i)$: document weight, $tf_{ij}$: term frequency count
- Cluster sizes: ellipse size roughly corresponds to the number of documents
- Links: a link is shown if the following score is greater than the threshold
$$score(C_i, C_j) = \Pr(d \in C_j \mid d \in C_i) = \frac{|C_i \cap C_j|}{|C_i|}$$
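The two scores on this slide can be sketched as follows, assuming the label score of a term is the sum of Pr(d_i) · tf_ij over the documents in a cluster and the link score is |C_i ∩ C_j| / |C_i|. The function names, the sample data, and the threshold value 0.3 are hypothetical; the slides do not give the threshold or the number of labels shown per cluster.

```python
from collections import Counter


def label_scores(cluster, pr, tf):
    """score(t_j): sum over documents d_i in the cluster of Pr(d_i) * tf_ij."""
    scores = Counter()
    for doc_id in cluster:
        for term, count in tf[doc_id].items():
            scores[term] += pr[doc_id] * count
    return scores


def link_score(c_i, c_j):
    """|C_i & C_j| / |C_i|: the share of C_i's documents that are also in C_j."""
    return len(c_i & c_j) / len(c_i) if c_i else 0.0


pr = {"d1": 0.5, "d2": 0.25, "d3": 0.25}
tf = {
    "d1": {"election": 2, "vote": 1},
    "d2": {"election": 1, "poll": 4},
    "d3": {"storm": 3},
}
earlier, later = {"d1", "d2"}, {"d2", "d3"}

scores = label_scores(earlier, pr, tf)
labels = [term for term, _ in scores.most_common(2)]

LINK_THRESHOLD = 0.3  # hypothetical value
show_link = link_score(earlier, later) > LINK_THRESHOLD
print(labels, show_link)  # prints ['election', 'poll'] True
```

Because the link score divides by |C_i| only, it is asymmetric: score(C_i, C_j) and score(C_j, C_i) differ whenever the two clusters have different sizes.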

20

System functionalities (2)
- Cluster quality: visualized using different colors for the cluster border lines: red (good), purple (bad)
$$quality(C) = |C| \cdot avg\_sim(C)$$
$$avg\_sim(C) = \frac{1}{|C|\,(|C| - 1)} \sum_{d_i, d_j \in C,\; d_i \neq d_j} sim(d_i, d_j)$$
- A high score is achieved if (1) the cluster is large and (2) the documents contained in the cluster are similar
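A small sketch of the quality score, assuming quality(C) = |C| · avg_sim(C), where avg_sim(C) averages sim(d_i, d_j) over all ordered pairs of distinct documents in the cluster. The pairwise similarity values below are made up for illustration, and returning 0 for a single-document cluster is an assumption (the formula is undefined in that case).

```python
from itertools import permutations


def avg_sim(cluster, sim):
    """Mean of sim(d_i, d_j) over all ordered pairs of distinct documents."""
    n = len(cluster)
    if n < 2:
        return 0.0  # assumption: a singleton cluster has no pairs to average
    total = sum(sim(a, b) for a, b in permutations(cluster, 2))
    return total / (n * (n - 1))


def quality(cluster, sim):
    """quality(C) = |C| * avg_sim(C)."""
    return len(cluster) * avg_sim(cluster, sim)


# Made-up symmetric similarities between three documents.
pairs = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.6,
    frozenset(("b", "c")): 0.4,
}


def toy_sim(x, y):
    return pairs[frozenset((x, y))]


q = quality(["a", "b", "c"], toy_sim)
print(round(q, 2))  # prints 1.8
```

Here the average similarity is 0.6 and the cluster has three documents, so the quality is about 1.8; a larger or tighter cluster would score higher.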

21

System functionalities (3)
- Drill-down/roll-up: the user can interactively specify the interval between two consecutive clusterings (e.g., one day, one week)
- Displaying keyword list: the user can browse the keyword list for a specified cluster
- Access to original documents
- Keyword-based emphasis: clusters that contain a user-specified keyword are emphasized

22

Demo

23

System implementation
- T-Scroll module
  - Written in Perl: generates an SVG file
  - The browser displays the generated SVG file
  - The SVG file includes scripts (JavaScript) used for interactive manipulation
- Clustering module
  - Written in Ruby
  - Novelty-based incremental document clustering
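To make the SVG-based display concrete, here is a rough sketch of how clustering results could be turned into an SVG string: one column per clustering time, one ellipse per cluster, and a line per link. The actual T-Scroll module is written in Perl; this Python version, the function render_scroll, the layout constants, the fixed red border, and the sample data are all illustrative assumptions, and no interactive JavaScript is included.

```python
from xml.sax.saxutils import escape


def render_scroll(columns, links, col_width=160, row_height=70):
    """Return an SVG string for the given clusters and links.

    columns: one list per clustering time, each holding (label, doc_count).
    links: pairs ((col, row), (col, row)) naming the clusters to connect.
    """
    def center(col, row):
        return (col * col_width + col_width // 2,
                row * row_height + row_height // 2)

    parts = []
    # Draw links first so that the ellipses are painted on top of them.
    for (c1, r1), (c2, r2) in links:
        (x1, y1), (x2, y2) = center(c1, r1), center(c2, r2)
        parts.append(
            f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="gray"/>')
    for col, clusters in enumerate(columns):
        for row, (label, doc_count) in enumerate(clusters):
            x, y = center(col, row)
            rx = 20 + 2 * doc_count  # ellipse width grows with cluster size
            parts.append(
                f'<ellipse cx="{x}" cy="{y}" rx="{rx}" ry="20" '
                f'fill="white" stroke="red"/>')
            parts.append(
                f'<text x="{x}" y="{y}" text-anchor="middle">'
                f'{escape(label)}</text>')
    width = len(columns) * col_width
    height = max((len(c) for c in columns), default=0) * row_height
    body = "\n".join(parts)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">\n{body}\n</svg>')


columns = [[("election", 12), ("storm", 5)], [("election & vote", 8)]]
links = [((0, 0), (1, 0))]
svg = render_scroll(columns, links)
print(svg.count("<ellipse"), svg.count("<line"))  # prints 3 1
```

Saving the returned string to a .svg file and opening it in a browser shows two columns of labeled ellipses with one connecting line.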

24

System architecture
[Figure: system architecture diagram. Components: RSS Feed Module, Clustering Module, T-Scroll Main Module (Perl), SVG Output Module (Perl), SVG Control Module (JavaScript), and a browser with plug-in. Data: news articles (input), clustering result (output of clustering, input to T-Scroll), SVG file (includes JavaScript). User interactions: command inputs, cluster display, interactive manipulation.]


26

Evaluation
- 10 users
- Data set
  - Japanese news articles collected from news web sites from Sept. 2006 to Feb. 2007
  - 100 articles per day
  - Clustering was performed at six-hour intervals
- Evaluation criteria
  - Overall impressions
  - Evaluation of each function
  - Observability of topics
  - Comparison with ThemeRiver

27

Overall impression
- Each user gave scores from 0 to 5
[Figure: chart of scores (0 to 5) for Usability, Understandability, Usefulness, and Design]

28

Evaluation on each function
[Figure: chart of scores (0 to 5) for each function: Scroll, DocNum, Label, Quality, Keyword, TitleList, Emphasis, Interval]

29

Observability of topics (1)
- Can users observe the major topics in Nov. 2006?
- Five major topics were specified by us; each user scored how clearly he or she could observe each topic
[Figure: chart of scores (0 to 5) for Topic 1 through Topic 5]

30

Observability of topics (2)
- 10 users (different from the former experiments)
- Users reported the topics they observed and their scores, without being given any information
- Topics 1 to 5 are the major topics used in the previous experiment
- Topic 2 (big hurricane) was regarded as a normal weather topic
[Figure: chart of "No. of answers" and "Score" (axis 0 to 8) for Topics 1, 4, 3, 6, 7, 5, 8, 9, 10, and 11]

31

Comparison with ThemeRiver (1)
- A ThemeRiver-like display was manually created for news articles in Dec. 2006
- 11 users (different from the previous experiments)
- Questions to users
  - Overall impressions
  - Observability of topics

32

33

Comparison with ThemeRiver (2)
Overall impression

Category                        No. of replies
T-Scroll is better              2
T-Scroll is slightly better     3
Almost same                     3
ThemeRiver is slightly better   3
ThemeRiver is better            0

34

Comparison with ThemeRiver (2)
Can users observe five major topics that we selected?

Category     No. of replies
Good         0
Possible     3
No good      4
Impossible   4

35

Summary of experiments
- Overall impressions
  - Good, but usability needs improvement
  - Some users commented on the response speed
- System functionalities
  - Several features (quality info, article lists, etc.) are useful in practice
  - Appropriate labels are necessary: label selection should be improved
- Comparison with ThemeRiver
  - ThemeRiver has visual impact, but its display tends to become complicated when there are many topics


37

Conclusions and future work
- Development and evaluation of the T-Scroll system
  - Based on a novelty-based incremental clustering method
  - Scroll-like display for showing changing trends
  - Several features for interactive analysis
- Evaluation
  - Overall impression
  - Observability of topics
  - Comparison with ThemeRiver
- Future work
  - More sophisticated keyword (label) selection
  - Improvement of interactive response speed