
Threshold Setting and Performance Monitoring for Novel Text Mining

Wenyin Tang and Flora S. Tsai

School of Electrical and Electronic Engineering, Nanyang Technological University

E-mail: [email protected], [email protected]

May 2, 2009


Outline

• Introduction
  – Novel Text Mining (NTM) System
  – Performance Evaluation of NTM

• Adaptive Threshold Setting for NTM
  – Motivations
  – Our Method: Gaussian-based Adaptive Threshold Setting (GATS)
  – Experimental Result

• Conclusion


Overview of Novel Text Mining System


Categorise each incoming document or sentence into its relevant topic bin.

Detect novel yet relevant documents or sentences in each topic.

Prepare a clean data matrix which can be easily processed by a computer.

Interact with users: input documents, output novel info, preference setting and feedback.


Novel Text Mining Algorithm

Given a set of relevant documents on a specific topic, e.g. “football games”, NTM retrieves the novel documents by:

– Step 1: rank the documents in the topic “football games” in chronological order.

– Step 2: assign a novelty score to each document by comparing the document with its history documents (the documents that precede it).

– Step 3: predict the document as “novel” if its novelty score is greater than the predefined novelty threshold.

Illustration (documents D1, D2, D3, D4, … in a vector space):

– D1: “novel”, because it is the first document.

– D2: “novel”, because it is dissimilar to D1.

– D3: “novel”, because it is dissimilar to its nearest neighbor D2.

– D4: “non-novel”, because it is very similar to its nearest neighbor D3.
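The three steps and the D1 to D4 illustration can be sketched in a few lines of Python. This is only an illustrative sketch: the slides do not say which novelty metric is used, so the score here is assumed to be one minus the cosine similarity to the nearest earlier document, computed on raw term counts. The function names, the sample texts and the threshold of 0.5 are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_novel(docs, threshold):
    """Return (novelty_score, is_novel) per document.

    `docs` must already be in chronological order (Step 1).
    """
    history = []
    results = []
    for text in docs:
        vec = Counter(text.lower().split())
        # Step 2 (assumed metric): 1 - similarity to the nearest earlier
        # document; the first document has no history and scores 1.0.
        nearest = max((cosine_similarity(vec, h) for h in history), default=0.0)
        score = 1.0 - nearest
        # Step 3: novel if the score exceeds the threshold.
        results.append((score, score > threshold))
        history.append(vec)
    return results


# Hypothetical "football games" documents standing in for D1..D4.
docs = [
    "team a beats team b in the final",
    "coach of team c resigns after defeat",
    "star striker signs record transfer deal",
    "star striker signs a record transfer deal",
]
results = detect_novel(docs, threshold=0.5)
for name, (score, is_novel) in zip(["D1", "D2", "D3", "D4"], results):
    print(name, round(score, 3), "novel" if is_novel else "non-novel")
```

With these sample texts, the last document is nearly identical to the third one, so it is the only one flagged as non-novel, mirroring D4 in the illustration.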


NTM Performance Evaluation

• Given a set of documents D1, D2, …, D10 relevant to some topic, for example:
  – System (S): 8 documents predicted novel
  – Assessor (A): 5 documents judged novel
  – Matched (M): 4 documents marked novel by both

[Figure: D1 to D10, each marked novel or non-novel by the system and by the assessor]

• Precision (P) reflects how likely the documents retrieved by the system are truly novel. P = M/S = 4/8 = 0.5, i.e. 50% of the documents retrieved by the system are truly novel.

• Recall (R) reflects how likely the truly novel documents are to be retrieved by the system. R = M/A = 4/5 = 0.8, i.e. 80% of the truly novel documents are retrieved by the system.

• Fβ score: a function of P and R:

$F_\beta = \dfrac{1}{\beta \frac{1}{P} + (1-\beta) \frac{1}{R}}$
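The worked example above (S = 8, A = 5, M = 4) can be checked with a short script. The Fβ form used here is an assumption: a weighted harmonic mean of P and R with β in (0, 1) weighting the precision term, which is how the garbled formula on the slide appears to read. The function names are hypothetical.

```python
def precision_recall(system_novel: int, assessor_novel: int, matched: int):
    """P = M/S and R = M/A, as defined on the slide."""
    return matched / system_novel, matched / assessor_novel


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of P and R.

    Assumption: beta weights precision and (1 - beta) weights recall.
    """
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (beta / precision + (1.0 - beta) / recall)


p, r = precision_recall(system_novel=8, assessor_novel=5, matched=4)
balanced = f_beta(p, r, beta=0.5)
print(p, r, round(balanced, 4))
```

With β = 0.5 the score reduces to the familiar balanced F-measure 2PR/(P+R), which is 8/13 for this example.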


Threshold Setting vs. Users’ Requirements

Different users may have different performance requirements:

– User 1: “I want to read the most novel information in a short time.” A high-precision NTM system is desired.

– User 2: “I do not want to miss any novel information.” A high-recall NTM system is desired.

– User 3: “I am not sure until I can see the documents.”

The NTM system should set the novelty threshold adaptively, based on the users’ requirements.


Why Adaptive Threshold Setting

Motivations:

1. As an NTM system is a real-time system, there is little or no training information in its initial stages. Therefore, the threshold cannot be predefined with confidence.

2. As an NTM system is an accumulating system, more training information becomes available for threshold setting over time, based on the user’s feedback.

3. Different users may have different definitions of “novelty”:
   – One user: a document with 50% novel info
   – Another user: a document with 90% novel info


Gaussian-based Adaptive Threshold Setting (GATS)

Basic idea:

• GATS is a score distribution-based threshold setting method. It models the score distributions of both novel and non-novel documents (based on the user feedback);

• This parametric model provides the global information of data, from which we can construct an optimization criterion of desired performance to search the best threshold.
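As a rough sketch of the modelling step, each class of feedback scores can be summarised by a Gaussian, i.e. a mean and a standard deviation. The slides do not specify the estimator, so plain sample statistics are assumed here. The feedback scores and the function name are hypothetical.

```python
from statistics import mean, pstdev


def fit_gaussian(scores):
    """Return (mean, std) of a list of novelty scores.

    A small floor on the standard deviation avoids a degenerate
    distribution when all feedback scores coincide (an assumption,
    not something stated in the slides).
    """
    mu = mean(scores)
    sigma = max(pstdev(scores), 1e-6)
    return mu, sigma


# Hypothetical novelty scores of documents the user has labelled so far.
novel_scores = [0.62, 0.71, 0.80, 0.55, 0.90]
non_novel_scores = [0.20, 0.35, 0.15, 0.40]

novel_params = fit_gaussian(novel_scores)
non_novel_params = fit_gaussian(non_novel_scores)
print(novel_params, non_novel_params)
```

As more feedback arrives, the two fits can simply be recomputed, which matches the accumulating nature of the system described in the motivations.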


Novelty Score Distributions

[Figure: Empirical probability distributions of the novelty scores of the novel and non-novel classes, with their Gaussian probability distribution approximations, for TREC 2004 Novelty Track data, topic N54]


Optimization Criterion

The criterion J must satisfy two conditions:

1. It is a function of the threshold θ:

J = f(θ)

2. It is directly related to system performance:

J = Fβ(θ)


Optimization Criterion

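[Figure: score distributions of the novel and non-novel classes, with the threshold θ and the regions S1 and S0 marked]

Let c_1 and c_0 denote the novel and non-novel classes, and n_1 and n_0 the number of documents in each class.

$$S_1(\theta) = n_1 \Pr(x > \theta \mid c_1) = n_1 \int_{x > \theta} p(x \mid c_1)\,dx, \qquad S_0(\theta) = n_0 \Pr(x > \theta \mid c_0)$$

$$P(\theta) = \frac{S_1(\theta)}{S_1(\theta) + S_0(\theta)}, \qquad R(\theta) = \frac{S_1(\theta)}{n_1}$$

$$\theta^* = \arg\max_{\theta} F_\beta(\theta) = \arg\max_{\theta} \frac{1}{\beta \frac{1}{P(\theta)} + (1-\beta) \frac{1}{R(\theta)}}$$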
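The criterion on this slide can be sketched as a small threshold search. Several points are assumptions rather than the authors' implementation: the class distributions are taken as already fitted Gaussians, β is assumed to weight precision in Fβ, novelty scores are assumed to lie in [0, 1], and the maximisation is done by a plain grid search. All names and the example parameters are hypothetical.

```python
import math


def tail_prob(theta: float, mu: float, sigma: float) -> float:
    """Pr(x > theta) for a Gaussian with mean mu and std sigma."""
    return 0.5 * math.erfc((theta - mu) / (sigma * math.sqrt(2.0)))


def f_beta_at(theta, novel, non_novel, beta):
    """Estimated F_beta at threshold theta.

    `novel` and `non_novel` are (count, mean, std) triples for the
    two classes, e.g. estimated from user feedback.
    """
    n1, mu1, sd1 = novel
    n0, mu0, sd0 = non_novel
    s1 = n1 * tail_prob(theta, mu1, sd1)  # expected novel docs above theta
    s0 = n0 * tail_prob(theta, mu0, sd0)  # expected non-novel docs above theta
    if s1 == 0.0:
        return 0.0
    precision = s1 / (s1 + s0)
    recall = s1 / n1
    # Assumed form: beta weights precision, (1 - beta) weights recall.
    return 1.0 / (beta / precision + (1.0 - beta) / recall)


def best_threshold(novel, non_novel, beta, steps=1000):
    """Grid search for theta* over [0, 1] (search method is an assumption)."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: f_beta_at(t, novel, non_novel, beta))


# Hypothetical class statistics: (count, mean, std).
novel = (40, 0.70, 0.12)
non_novel = (60, 0.35, 0.12)

theta_precision = best_threshold(novel, non_novel, beta=0.9)
theta_recall = best_threshold(novel, non_novel, beta=0.1)
print(theta_precision, theta_recall)
```

Under these assumptions a precision-oriented β selects a higher threshold than a recall-oriented β, which is the behaviour the two user types on the earlier slide ask for.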


Flow Chart of NTM with GATS


Experimental Data

Sentence-level data: TREC 2004 Novelty Track data

The news providers of the document set are Xinhua English (XIE), New York Times (NYT), and Associated Press Worldstream (APW). The NIST assessors created 50 topics for this data. Each topic consists of around 25 documents. These documents were ordered chronologically and then segmented into sentences. Each sentence was given an identifier, and the sentences were concatenated to form the target sentence set. In this data, the overall percentage of novel sentences is around 41.4%. The statistics of the data are summarized in Table 1.

Table 1 Statistics of TREC 2004 Novelty Track data

            #Novel          #Non-novel      Sum
Relevant    3454 (41.4%)    4889 (58.6%)    8343


Experimental Data

Document-level data: APWSJ

APWSJ consists of news articles from Associated Press (AP) and Wall Street Journal (WSJ), which cover the same period from 1988 to 1990 [Zhang et al., 2002]. This data has 50 TREC topics, Q101 to Q150. Five topics (Q131, Q142, Q145, Q147, Q150) lack non-novel documents and are excluded from the experiments. The statistics of this data are summarized in Table 2.

Table 2 Statistics of APWSJ data

            #Novel            #Non-novel     Sum
Relevant    10,839 (91.1%)    1057 (8.9%)    11,896


Methods & Parameters

• Baseline:
  – Fixed threshold setting: θ from 0.05 to 0.95 in equal steps of 0.05.

• Our method, GATS:
  – Complete feedback: β from 0.1 to 0.9 in equal steps of 0.1.
  – Partial feedback: β from 0.1 to 0.9 in equal steps of 0.1; percentages of feedback: 10%, 20%, 50% and 80%.
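The parameter settings above amount to three small grids. As a minimal sketch, they could be enumerated as follows; the helper name `inclusive_grid` and the variable names are hypothetical, and the grid is built from integer step counts to avoid floating-point drift.

```python
def inclusive_grid(start: float, stop: float, step: float):
    """Evenly spaced values from start to stop, both ends included."""
    n = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(n + 1)]


# Baseline: fixed thresholds 0.05, 0.10, ..., 0.95.
fixed_thresholds = inclusive_grid(0.05, 0.95, 0.05)

# GATS: beta values 0.1, 0.2, ..., 0.9.
betas = inclusive_grid(0.1, 0.9, 0.1)

# Partial feedback: fraction of documents for which feedback is available.
feedback_fractions = [0.10, 0.20, 0.50, 0.80]

# One GATS run per (beta, feedback fraction) pair in the partial-feedback setting.
partial_runs = [(b, f) for f in feedback_fractions for b in betas]

print(len(fixed_thresholds), len(betas), len(partial_runs))
```

This gives 19 baseline runs, 9 complete-feedback runs and 36 partial-feedback runs per dataset.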


Experimental Result

Sentence-Level NTM on TREC 2004 Data

[Figure: plot with Recall on the x-axis and Precision on the y-axis]


Experimental Result

Document-Level NTM on APWSJ Data

[Figure: plot with Redundancy-Recall on the x-axis and Redundancy-Precision on the y-axis]


Comparison: GATS vs. Fixed Threshold

• Precision-recall tradeoff:
  – A fixed threshold θ cannot directly reflect the tradeoff between precision and recall.
  – The GATS parameter β directly reflects the weights of precision and recall.

• Under various performance requirements, GATS is able to approximate the best fixed threshold.

Table 3 Comparison of Fβ on TREC 2004 Novelty Track data


Experimental Result

Sentence-Level NTM on TREC 2004 Data

PR curves of GATS (tuned for Fβ) with different percentages of the user’s feedback.

[Figure: plot with Recall on the x-axis and Precision on the y-axis]


Experimental Result

Document-Level NTM on APWSJ Data

R-PR curves of GATS with different percentages of the user’s feedback.

[Figure: plot with Redundancy-Recall on the x-axis and Redundancy-Precision on the y-axis]


Conclusion

• A Gaussian-based Adaptive Threshold Setting (GATS) algorithm was proposed for the NTM system.

• GATS is a generic method, which can be tuned according to different performance requirements varying from high-precision to high-recall.

• Experiments on both document-level and sentence-level datasets showed promising performance of GATS for a real-time NTM system.


Q & A