Inductive Learning in Less Than One Sequential Data Scan
Wei Fan, Haixun Wang, and Philip S. Yu, IBM T.J. Watson
Shaw-hwa Lo, Columbia University
Problems
Many inductive algorithms are main-memory-based.
When the dataset is bigger than memory, they will "thrash", and efficiency is very low when thrashing happens.
For algorithms that are not memory-based:
  Do we need to see every piece of data? Probably not.
  Overfitting curve? Not practical.
Basic Idea: One Scan Algorithm
[Figure: the dataset is read sequentially as Batch 1 to Batch 4; the learning algorithm is applied to each batch and produces one model per batch.]
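The one-scan idea can be sketched as follows: read the data sequentially in fixed-size batches and train one base model per batch. This is only an illustration under stated assumptions: `sequential_batches`, `train_majority`, and `one_scan_ensemble` are hypothetical names, the record layout is a toy one, and the majority-class learner stands in for a real inductive learner.

```python
from collections import Counter
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[float, str]        # (feature, label): a toy record layout (assumption)
Model = Callable[[float], str]    # a model maps a feature value to a label


def sequential_batches(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield consecutive batches; each record is read exactly once."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def train_majority(batch: List[Record]) -> Model:
    """Toy stand-in for an inductive learner: always predict the majority label."""
    majority = Counter(label for _, label in batch).most_common(1)[0][0]
    return lambda _x: majority


def one_scan_ensemble(records: Iterable[Record], batch_size: int,
                      train: Callable[[List[Record]], Model]) -> List[Model]:
    """Train one base model per batch in a single sequential pass."""
    return [train(batch) for batch in sequential_batches(records, batch_size)]


records = [(float(i), "fraud" if i % 5 == 0 else "nonfraud") for i in range(10)]
models = one_scan_ensemble(records, batch_size=4, train=train_majority)
```

Only one batch has to fit in memory at a time, which is the point of the batch-wise design.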
Loss and Benefit
Loss function: evaluates performance.
Benefit matrix: the inverse of the loss function.
Traditional 0-1 loss: b[x,x] = 1, b[x,y] = 0 for y different from x.
Cost-sensitive loss, with an overhead of $90 to investigate a fraud (first index is the true label, second is the prediction):
  b[fraud, fraud] = $tranamt - $90
  b[fraud, nonfraud] = $0
  b[nonfraud, fraud] = -$90
  b[nonfraud, nonfraud] = $0
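Probabilistic Modeling
p(ℓ|x) is the probability that x is an instance of class ℓ.
e(ℓ|x) = sum over all classes ℓ' of b[ℓ', ℓ] p(ℓ'|x) is the expected benefit of predicting ℓ.
Optimal decision: predict the class ℓ with the highest e(ℓ|x).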
Example: p(fraud|x) = 0.5 and tranamt = $200.
e(fraud|x) = b[fraud, fraud] p(fraud|x) + b[nonfraud, fraud] p(nonfraud|x) = (200 - 90) x 0.5 + (-90) x 0.5 = $10
e(nonfraud|x) = b[fraud, nonfraud] p(fraud|x) + b[nonfraud, nonfraud] p(nonfraud|x) = 0 x 0.5 + 0 x 0.5 = $0 (always 0)
Predict fraud, since we get $10 back.
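The worked example can be checked with a short sketch. The function names (`benefit`, `expected_benefit`, `optimal_decision`) are hypothetical; the benefit values and the $90 overhead are the ones from the fraud example.

```python
from typing import Dict


def benefit(true_label: str, predicted: str, tranamt: float, overhead: float = 90.0) -> float:
    """b[true_label, predicted] for the fraud example."""
    if predicted == "fraud":
        # Investigating costs the overhead; a real fraud recovers the transaction amount.
        return tranamt - overhead if true_label == "fraud" else -overhead
    return 0.0  # predicting nonfraud never gains or loses anything


def expected_benefit(predicted: str, probs: Dict[str, float], tranamt: float) -> float:
    """e(predicted|x): sum over true labels of b[true, predicted] * p(true|x)."""
    return sum(benefit(true, predicted, tranamt) * p for true, p in probs.items())


def optimal_decision(probs: Dict[str, float], tranamt: float) -> str:
    """Pick the label with the highest expected benefit."""
    return max(probs, key=lambda label: expected_benefit(label, probs, tranamt))


probs = {"fraud": 0.5, "nonfraud": 0.5}
e_fraud = expected_benefit("fraud", probs, 200.0)        # (200 - 90) * 0.5 + (-90) * 0.5
e_nonfraud = expected_benefit("nonfraud", probs, 200.0)  # 0 * 0.5 + 0 * 0.5
decision = optimal_decision(probs, 200.0)
```

With these inputs `e_fraud` is 10.0, `e_nonfraud` is 0.0, and the decision is "fraud", matching the slide.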
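Combining Multiple Models
Individual benefits: each base model k gives its own expected benefit e_k(ℓ|x) for class ℓ.
Averaged benefits: E_K(ℓ|x) = (e_1(ℓ|x) + ... + e_K(ℓ|x)) / K.
Optimal decision: predict the class ℓ with the highest E_K(ℓ|x).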
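A minimal sketch of the combination step, assuming each base model reports its expected benefit per label as a dictionary. The names `averaged_benefits` and `ensemble_decision` and the three sets of numbers are illustrative only.

```python
from typing import Dict, List

Benefits = Dict[str, float]  # label -> expected benefit reported by one model


def averaged_benefits(per_model: List[Benefits]) -> Benefits:
    """Average the expected benefit of each label over all base models."""
    k = len(per_model)
    return {label: sum(m[label] for m in per_model) / k for label in per_model[0]}


def ensemble_decision(per_model: List[Benefits]) -> str:
    """Predict the label whose averaged expected benefit is highest."""
    avg = averaged_benefits(per_model)
    return max(avg, key=avg.get)


# Three hypothetical base models scoring the same transaction.
per_model = [
    {"fraud": 10.0, "nonfraud": 0.0},
    {"fraud": -20.0, "nonfraud": 0.0},
    {"fraud": 25.0, "nonfraud": 0.0},
]
avg = averaged_benefits(per_model)
label = ensemble_decision(per_model)
```

Here one model disagrees with the other two, but the averaged benefit of "fraud" is still positive (5.0), so the ensemble predicts "fraud".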
How about accuracy?
Do we need all K models?
We stop learning if k (< K) models have the same accuracy as K models with confidence p.
We end up scanning the dataset less than once.
Use statistical sampling.
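Less than one scan
[Figure: Batch 1 to Batch 4 are read in order; after the algorithm builds a model from each batch, an "Accurate enough?" test decides whether to stop (Yes) or to train on the next batch (No).]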
Hoeffding’s inequality
A random variable takes values within a range R = a - b.
After n observations, its mean value is y.
What is the error of y, with confidence p, regardless of the distribution?
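A minimal sketch of the error bound, assuming the commonly quoted form of Hoeffding's bound: with confidence p, the observed mean of n values in a range of width R is within ε = sqrt(R² ln(1/(1 - p)) / (2n)) of the true mean. The function name `hoeffding_error` is hypothetical, and the exact form used by the authors may differ.

```python
import math


def hoeffding_error(value_range: float, n: int, p: float) -> float:
    """Error eps such that the observed mean is within eps of the true mean
    with confidence p, for n observations in a range of width value_range."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / (1.0 - p)) / (2.0 * n))


eps_100 = hoeffding_error(1.0, 100, 0.997)
eps_400 = hoeffding_error(1.0, 400, 0.997)
```

The bound shrinks with the square root of n, so four times as many observations halves the error; it makes no assumption about the distribution of the variable.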
When can we stop?
Use k models.
Highest expected benefit: E_k(ℓ1|x), with Hoeffding’s error ε1.
Second highest expected benefit: E_k(ℓ2|x), with Hoeffding’s error ε2.
The majority label is still ℓ1 with confidence p iff E_k(ℓ1|x) - ε1 > E_k(ℓ2|x) + ε2.
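The stopping test can be sketched as a comparison between the two best labels: the leader's lower bound must exceed the runner-up's upper bound. This sketch assumes the common form of Hoeffding's bound, a known range for the benefit values, and the same error term for both labels because both are averaged over the same k models. The name `can_stop` is hypothetical.

```python
import math
from typing import Dict


def hoeffding_error(value_range: float, n: int, p: float) -> float:
    """Assumed common form of the bound: sqrt(R^2 * ln(1 / (1 - p)) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / (1.0 - p)) / (2.0 * n))


def can_stop(avg_benefits: Dict[str, float], value_range: float, k: int, p: float) -> bool:
    """True if the top label stays on top with confidence p after averaging k models."""
    ranked = sorted(avg_benefits.values(), reverse=True)
    eps = hoeffding_error(value_range, k, p)
    # Lower bound of the best label must beat the upper bound of the runner-up.
    return ranked[0] - eps > ranked[1] + eps


avg = {"fraud": 0.9, "nonfraud": 0.1}
enough_with_100 = can_stop(avg, value_range=1.0, k=100, p=0.997)
enough_with_4 = can_stop(avg, value_range=1.0, k=4, p=0.997)
```

With the same averaged benefits, 100 models give a tight enough bound to stop, while 4 models do not.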
Less Than One Scan Algorithm
Iterate the process on every instance from a validation set, until every instance has the same prediction as the full ensemble with confidence p.
Validation Set
If we fail on one example x, we do not need to examine another one, so we can keep only one example in memory at a time.
If the prediction of k base models on x is the same as that of K models, it is very likely that k+1 models will also give the same prediction as K models with the same confidence.
Validation Set
At any time, we only need to keep one data item x from the validation set.
It is read sequentially from the validation set.
The validation set is read only once.
What can be a validation set?
  The training set itself.
  A separate holdout set.
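The overall loop can be sketched as below: validation examples are read one at a time, and a new base model is trained from the next batch only while the current example fails the confidence test. This is an illustration under assumptions, not the authors' exact procedure: the names are hypothetical, the Hoeffding bound is taken in its common form, at least two labels are assumed, and the toy learner always returns the same benefits.

```python
import math
from typing import Callable, Dict, Iterable, List

Benefits = Dict[str, float]
Model = Callable[[object], Benefits]  # maps an example to per-label expected benefits


def hoeffding_error(value_range: float, n: int, p: float) -> float:
    return math.sqrt(value_range ** 2 * math.log(1.0 / (1.0 - p)) / (2.0 * n))


def confident(models: List[Model], x: object, value_range: float, p: float) -> bool:
    """Does the top label on x stay on top with confidence p for the current models?"""
    outputs = [m(x) for m in models]
    avg = [sum(o[label] for o in outputs) / len(outputs) for label in outputs[0]]
    ranked = sorted(avg, reverse=True)
    eps = hoeffding_error(value_range, len(models), p)
    return ranked[0] - eps > ranked[1] + eps


def less_than_one_scan(batches: Iterable[list], train: Callable[[list], Model],
                       validation: Iterable[object], value_range: float, p: float) -> List[Model]:
    batch_iter = iter(batches)
    first = next(batch_iter, None)
    if first is None:
        return []
    models = [train(first)]
    for x in validation:                      # validation set is read once, one item at a time
        while not confident(models, x, value_range, p):
            batch = next(batch_iter, None)    # train on the next unread batch
            if batch is None:
                return models                 # all batches used: a full one scan
            models.append(train(batch))
    return models


# Toy setup: 20 batches, and a learner whose models always favour "pos".
toy_train = lambda batch: (lambda x: {"pos": 0.9, "neg": 0.1})
models = less_than_one_scan([[i] for i in range(20)], toy_train,
                            validation=range(5), value_range=1.0, p=0.9)
```

In this toy run the loop stops after 8 of the 20 batches, because from 8 models onward every validation example passes the confidence test.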
Amount of Data Scan
Training set: at most once.
Validation set: once.
Using the training set as the validation set:
  Once we decide to train a model from a batch, we do not use it for validation again.
How much is used to train models? Less than one scan.
Experiments
Donation Dataset:
Total benefits: donated amount minus the overhead of sending solicitations.
Experiment Setup
Inductive learners: C4.5, RIPPER, NB
Number of base models: {8, 16, 32, 64, 128, 256}; results are reported as their average.
Baseline Results (with C4.5)
Single model: $13292.7
Complete one scan: $14702.9 (the average over {8, 16, 32, 64, 128, 256} base models)
We are actually $1410 higher than the single model.
Less-than-one scan (with C4.5)
Full one scan: $14702
Less-than-one scan: $14828, actually a little higher, by $126.
How much data is scanned with 99.7% confidence? 71%
Other datasets
Credit card fraud detection
Total benefits: recovered fraud amount minus the overhead of investigation
Results
Baseline single model: $733980 (with curtailed probability)
One scan ensemble: $804964
Less than one scan: $804914
Data scan amount: 64%
Smoothing effect.
Related Work
Ensembles:
  Meta-learning (Chan and Stolfo): 2 scans
  Bagging (Breiman) and AdaBoost (Freund and Schapire): multiple scans
Use of Hoeffding’s inequality:
  Aggregate query (Hellerstein et al.)
  Streaming decision tree (Hulten and Domingos): single decision tree, less than one scan
Scalable decision tree:
  SPRINT (Shafer et al.): multiple scans
  BOAT (Gehrke et al.): 2 scans
Conclusion
Both “one scan” and “less than one scan” have accuracy either similar to or higher than the single model.
“Less than one scan” uses approximately 60% to 90% of the data for training without loss of accuracy.