
Page 1: Text Bundling: Statistics-Based Data Reduction

by L. Shih, J.D.M. Rennie, Y. Chang and D.R. Karger

Presented by Steve Vincent, March 4, 2004

Page 2: Text Classification

Domingos discussed the tradeoff between speed and accuracy in the context of very large databases.

The best text classification algorithms are “super-linear”: each additional training point takes more time to train than the previous point.

Page 3: Text Classification

Most highly accurate text classifiers take a disproportionately large time to handle a large number of training examples.

Classifiers become impractical when faced with large data sets such as OHSUMED. The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database. It consists of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991).

Page 4: Data Reduction

Existing approaches: subsampling, bagging, and feature selection.

Page 5: Subsampling

Subsampling retains a random subset of the original training data. It does not preserve the entire set of data; rather, it preserves all statistics on a random subset of the data.

Subsampling is fast and easy to implement.

Reducing to a single point per class via subsampling yields a single sample document, which gives almost no information about the nature of the class.
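A minimal sketch of subsampling as described above, assuming documents are the rows of a NumPy count matrix; the function name and the rate parameter are illustrative, not from the paper:

```python
import numpy as np

def subsample(X, y, rate, rng=None):
    """Retain a random subset of the training data.

    X: (n_docs, n_features) array of word counts
    y: (n_docs,) array of class labels
    rate: fraction of documents to keep, e.g. 0.1
    """
    rng = np.random.default_rng(rng)
    n_keep = max(1, int(rate * len(y)))
    idx = rng.choice(len(y), size=n_keep, replace=False)
    return X[idx], y[idx]
```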

Page 6: Bagging

Bagging partitions the original data set, learns a classifier on each partition, and then labels a test document by majority vote of the classifiers.

This provides fast training, since each classifier trains on a subset of the original data.

Testing is slow, since multiple classifiers are evaluated for each test example.
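A sketch of the bagging variant described here (train on disjoint partitions, label by majority vote). The classifier interface (a callable returning an object with fit/predict) and integer class labels are assumptions of this sketch:

```python
import numpy as np

def bagging_fit(X, y, n_partitions, make_classifier, rng=None):
    """Train one classifier per random partition of the data."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(y))
    classifiers = []
    for part in np.array_split(idx, n_partitions):
        clf = make_classifier()
        clf.fit(X[part], y[part])   # fast: each sees only a subset
        classifiers.append(clf)
    return classifiers

def bagging_predict(classifiers, X):
    """Label each test document by majority vote (slow at test time).

    Assumes predict returns non-negative integer class labels.
    """
    votes = np.stack([clf.predict(X) for clf in classifiers])
    # majority vote over classifiers, per test document
    return np.apply_along_axis(
        lambda v: np.bincount(v).argmax(), 0, votes)
```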

Page 7: Feature Selection

Feature selection retains only the best features of a data set. All classifiers use some type of feature selection: if a classifier sees a feature as irrelevant, it simply ignores that feature.

One type of feature selection ranks features according to |p(f_i|+) - p(f_i|-)|, where p(f_i|c) is the empirical frequency of f_i in class c of the training documents.

There is little empirical evidence comparing the time reduction of feature selection with the resulting loss in accuracy.
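A sketch of this ranking criterion, assuming binary labels where 1 marks the positive class and the empirical frequency p(f_i|c) is taken as the feature's mean value within class c:

```python
import numpy as np

def select_features(X, y, n_keep):
    """Rank features by |p(f_i|+) - p(f_i|-)| and keep the top n_keep."""
    pos = X[y == 1].mean(axis=0)   # empirical frequency of each feature in +
    neg = X[y == 0].mean(axis=0)   # empirical frequency of each feature in -
    score = np.abs(pos - neg)
    keep = np.argsort(score)[::-1][:n_keep]
    return X[:, keep], keep
```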

Page 8: Text Bundling

Text bundling generates a new, smaller training set by averaging together small groups of points.

This preserves certain statistics on all the data, instead of preserving everything about just a subset of the data.

This application uses only one statistic (the mean), but it is possible to use multiple statistics.
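A minimal sketch of the core averaging step just described: group the points, then replace each group with its mean. The grouping strategy is factored out, since later slides describe two choices (random vs. Rocchio-sorted):

```python
import numpy as np

def bundle(X, groups):
    """Replace each group of points with its mean vector.

    X: (n_docs, n_features) count matrix
    groups: list of index arrays partitioning the rows of X
    Returns the reduced (len(groups), n_features) training set.
    """
    return np.stack([X[g].mean(axis=0) for g in groups])
```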

Page 9: Bundling Algorithm

There is a tradeoff between speed and accuracy: the less raw information retained, the faster the classifier will run and the less accurate the results.

Each data reduction technique operates by retaining some information and removing other information.

By carefully selecting our statistics for a domain, we can optimize the information we retain.

Page 10: Bundling Algorithm

Bundling preserves a set of k user-chosen statistics, s = (s_1, ..., s_k), where s_i is a function that maps a set of data to a single value.
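The global constraint referenced on the next slide can be written out explicitly; this is a reconstruction from the surrounding definitions (the slide's own formula did not survive the transcript): a reduced data set $\tilde{D}$ must match the original data set $D$ on every chosen statistic,

$$s_i(D) = s_i(\tilde{D}) \quad \text{for all } i = 1, \dots, k.$$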

Page 11: Global Constraint

There are many possible reduced data sets that can satisfy this constraint. But we don't only want to preserve the global statistics; we also want to preserve additional information about the distribution.

To get a reduced data set that satisfies the global constraint, we could generate several random points and then choose the remaining points so as to preserve the statistics. But this does not retain any information about our data except for the chosen statistics.

Page 12: Local Constraint

We can retain some information besides the statistics by grouping together sets of points and preserving the statistics locally.

Page 13: Local Constraint

The bundling algorithm's local constraint is to maintain the same statistics between subsets of the training data.

The focus on statistics means the bundled data will not have any examples in common with the original data. It ensures that certain global statistics are maintained, while also maintaining a relationship between certain partitions of the data in the original and bundled training sets.

Page 14: Text Bundling

The first step in bundling is to select a statistic or statistics to preserve. For text, the mean statistic of each feature is chosen: Rocchio and multinomial Naïve Bayes perform classification using only the mean statistics of the data.

Page 15: Rocchio Algorithm

The Rocchio classification algorithm selects a decision boundary (a plane) that is perpendicular to the vector connecting the two class centroids.

Let $\{x_{11}, \dots, x_{1l_1}\}$ and $\{x_{21}, \dots, x_{2l_2}\}$ be the sets of training data for the positive and negative classes, and let $c_1 = \frac{1}{l_1} \sum_i x_{1i}$ and $c_2 = \frac{1}{l_2} \sum_i x_{2i}$ be the centroids of the classes. Then

$$\mathrm{RocchioScore}(x) = x \cdot (c_1 - c_2)$$

With a threshold boundary b, an example is labeled according to the sign of the score minus the threshold value:

$$l_{\mathrm{Rocchio}}(x) = \mathrm{sign}(\mathrm{RocchioScore}(x) - b)$$
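A sketch of the score and labeling rule just defined, with binary labels; the default threshold b = 0 is an assumption, since the slides do not say how b is chosen:

```python
import numpy as np

def rocchio_fit(X, y):
    """Return the two class centroids c1 (positive) and c2 (negative)."""
    c1 = X[y == 1].mean(axis=0)
    c2 = X[y == 0].mean(axis=0)
    return c1, c2

def rocchio_score(x, c1, c2):
    """RocchioScore(x) = x . (c1 - c2)"""
    return x @ (c1 - c2)

def rocchio_label(x, c1, c2, b=0.0):
    """l_Rocchio(x) = sign(RocchioScore(x) - b)"""
    return np.sign(rocchio_score(x, c1, c2) - b)
```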

Page 16: Naïve Bayes

A basic naïve Bayes classifier is as follows, where the w_k are the features (words) used, x is the document, and c_j is the class:

$$y(x) = \arg\max_{c_j} \frac{p(c_j) \prod_k p(w_k \mid c_j)}{p(x)}$$

Page 17: Multinomial Naïve Bayes

Multinomial Naïve Bayes has shown improvements over other Naïve Bayes types. The formula raises each word probability to the word's count, where n(w_k, x) is the count of word w_k in document x:

$$y(x) = \arg\max_{c_j} \frac{p(c_j) \prod_k p(w_k \mid c_j)^{n(w_k, x)}}{p(x)}$$
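A sketch of the multinomial scoring rule above, computed in log space (a standard reformulation, not from the slides); the add-alpha smoothing is an assumed detail:

```python
import numpy as np

def mnb_fit(X, y, classes, alpha=1.0):
    """Estimate log p(c_j) and log p(w_k|c_j) with add-alpha smoothing."""
    log_prior, log_like = [], []
    for c in classes:
        Xc = X[y == c]
        log_prior.append(np.log(len(Xc) / len(X)))
        counts = Xc.sum(axis=0) + alpha
        log_like.append(np.log(counts / counts.sum()))
    return np.array(log_prior), np.stack(log_like)

def mnb_predict(X, log_prior, log_like, classes):
    """argmax_j [log p(c_j) + sum_k n(w_k, x) log p(w_k|c_j)];
    p(x) is constant across classes and can be dropped."""
    scores = X @ log_like.T + log_prior
    return np.asarray(classes)[scores.argmax(axis=1)]
```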

Page 18: Text Bundling

Assume there is a set of training documents for each class; apply bundling separately to each class.

Let D = (d_1, ..., d_n) be a set of documents. Use the “bag of words” representation, where each word is a feature and each document is represented as a vector of word counts.

Page 19: Text Bundling

d_i = (d_{i1}, ..., d_{iV}), where the second subscript indexes the words and V is the size of the vocabulary.

Use the mean statistic of each feature as our text statistics, defining the jth statistic as

$$s_j(D) = \frac{1}{n} \sum_{i=1}^{n} d_{ij}$$
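With that definition, the full statistic vector s(D) is just the column-wise mean of the count matrix; a minimal sketch:

```python
import numpy as np

def mean_statistics(X):
    """s(D): the mean of each feature over the n documents (length V)."""
    return X.mean(axis=0)

# Maximal bundling (next slide) reduces D to the single point s(D),
# so the statistic is preserved trivially:
# np.allclose(mean_statistics(X), mean_statistics(X.mean(axis=0)[None, :]))
```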

Page 20: Maximal Bundling

Maximal bundling reduces each class to a single point with the mean statistics; the jth component of the single point is s_j(D).

Using a linear classifier on this “maximal bundling” will result in a decision boundary equivalent to Rocchio's decision boundary.
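A one-step check of that equivalence, under the assumption that the linear classifier's weight vector is the difference of the two bundled points (one per class): the bundled points are exactly the class centroids from slide 15, so

$$w = s(D_+) - s(D_-) = c_1 - c_2, \qquad w \cdot x = x \cdot (c_1 - c_2) = \mathrm{RocchioScore}(x)$$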

Page 21: Bundling Algorithms

Randomized bundling partitions the points randomly. It needs only one pass over the training points, but poorly preserves data point locations in feature space.

Rocchio bundling projects the points onto a vector and partitions points that are near one another in the projected space: use RocchioScore to sort the documents by their score, then bundle consecutive sorted documents. Pre-processing time for Rocchio bundling is O(n log(m)).

Page 22: Procedure Randomized Bundling
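The procedure figure did not survive the transcript; below is a sketch consistent with slide 21's description (random partition in one pass, then average each group), with the number of bundles m as a parameter:

```python
import numpy as np

def randomized_bundling(X, m, rng=None):
    """One pass: shuffle, split into m groups, average each group."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    groups = np.array_split(idx, m)
    return np.stack([X[g].mean(axis=0) for g in groups])
```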

Page 23: Procedure Rocchio Bundling
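Likewise reconstructed from slide 21 (sort by RocchioScore, bundle consecutive documents). Per slide 18, bundling is applied separately to each class, so this sketch takes one class's documents plus the precomputed centroids (e.g. from the earlier rocchio_fit sketch):

```python
import numpy as np

def rocchio_bundling(X_class, c1, c2, m):
    """Bundle one class's documents: sort them by RocchioScore
    (projection onto c1 - c2), then average consecutive groups."""
    scores = X_class @ (c1 - c2)      # RocchioScore per document
    order = np.argsort(scores)        # sorting dominates pre-processing
    groups = np.array_split(order, m)
    return np.stack([X_class[g].mean(axis=0) for g in groups])
```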

Page 24: Data Sets

20 Newsgroups (20 News): a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Industry Sector (IS): a collection of corporate web pages organized into categories based on what a company produces or does. There are 9,619 non-empty documents and 105 categories.

Reuters 21578 (Reuters): a set of 10,000 news stories.

Page 25: Data Sets

            20 News   IS       Reuters
Train Size  12,000    4,797    7,700
Test Size   7,982     4,822    3,019
Features    62,060    55,194   18,621
SVM time    6,464     2,268    408
Accuracy    86.5%     92.8%    88.7%

Page 26: Experiment

Used a Support Vector Machine for classification: the SvmFu implementation, with the penalty for misclassification of training points set at 10.

Pre-processing was coded in C++; Rainbow was used to pre-process the raw documents into feature vectors.

Runs were limited to 8 hours each. Bagging, feature selection, subsampling, random bundling and Rocchio bundling were compared. The experiment was also run on OHSUMED, but results were not obtained for all tests.

Page 27: Slowest Results (time, accuracy)

                   20 News      IS           Reuters
Bagging            4051, .843   1849, .863   346, .886
Feature Selection  5870, .853   2186, .896   507, .884
Subsample          2025, .842   926, .858    173, .859
Random Bundling    2613, .862   1205, .909   390, .863
Rocchio Bundling   2657, .864   1244, .914   404, .882

Page 28: Quickest Results (time, accuracy)

                   20 News      IS           Reuters
Bagging            2795, .812   1590, .173   295, .878
Feature Selection  4601, .649   1738, .407   167, .423
Subsample          22, .261     59, .170     173, .213
Random Bundling    117, .730    177, .9      105, .603
Rocchio Bundling   173, .730    248, .9      129, .603

Page 29: 20 News Results

Page 30: IS Results

Page 31: Reuters Results

Page 32: Future Work

Extend bundling in both a theoretical and an empirical sense. It may be possible to analyze or provide bounds on the loss in accuracy due to bundling.

The authors would also like to construct general methods for bundling sets of statistics, and to extend bundling to other machine learning domains.

Page 33: References

P. Domingos, “When and how to subsample: Report on the KDD-2001 panel”, SIGKDD Explorations.

NIST Test Collections (http://trec.nist.gov/data.html).

D. Mladenic, “Feature subset selection in text-learning”, Proceedings of the Tenth European Conference on Machine Learning.

Xiaodan Zhu, “Junk Email Filtering with Large Margin Perceptrons”, University of Toronto, Department of Computer Science (www.cs.toronto.edu/pub/xzhu/reports_nips2.doc).

Page 34: Questions?