Inductive Learning in Less Than One Sequential Data Scan

Wei Fan, Haixun Wang, and Philip S. Yu (IBM T.J. Watson)
Shaw-hwa Lo (Columbia University)
Problems

- Many inductive algorithms are main-memory-based. When the dataset is bigger than main memory, the system "thrashes", and efficiency drops sharply.
- For algorithms that are not memory-based: do we need to see every piece of data? Probably not. Could we plot an overfitting curve to find out how much is enough? Not practical.
Basic Idea: One Scan Algorithm

[Diagram: the dataset is divided into sequential batches (Batch 1 through Batch 4); the algorithm trains one model per batch, yielding an ensemble of models.]
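A minimal Python sketch of this loop, as a sanity check. The batch iterator and the scikit-learn base learner are stand-ins of mine; the paper's experiments used C4.5, RIPPER, and naive Bayes as base learners.

```python
from sklearn.tree import DecisionTreeClassifier  # stand-in base learner

def one_scan_train(batches):
    """Train one base model per batch; each batch is read exactly once."""
    models = []
    for X, y in batches:                      # sequential, single pass
        models.append(DecisionTreeClassifier().fit(X, y))
    return models                             # the ensemble
```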
Loss and Benefit

- Loss function: used to evaluate performance.
- Benefit matrix: the inverse of a loss function. b[l, t] is the benefit of predicting class t when the true class is l.
- Traditional 0-1 loss: b[x, x] = 1, b[x, y] = 0 for y ≠ x.
- Cost-sensitive loss (credit card fraud), with a $90 overhead to investigate a fraud:
  - b[fraud, fraud] = $tranamt - $90
  - b[fraud, nonfraud] = $0
  - b[nonfraud, fraud] = -$90
  - b[nonfraud, nonfraud] = $0
Probabilistic Modeling

- p(c|x) is the probability that x is an instance of class c.
- e(c|x) = Σ_l b[l, c] · p(l|x) is the expected benefit of predicting class c for x.
- Optimal decision: t(x) = argmax_c e(c|x).
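The following sketch ties the benefit matrix to this decision rule; the function names are mine, and the b[true, predicted] indexing follows the worked example on the next slide.

```python
def benefit(true_label, predicted_label, tranamt, overhead=90.0):
    """Benefit matrix b[l, t] for the credit card fraud example."""
    if predicted_label == "fraud":            # we pay to investigate
        return (tranamt - overhead) if true_label == "fraud" else -overhead
    return 0.0                                # no investigation, no benefit

def expected_benefit(c, probs, tranamt):
    """e(c|x) = sum over true labels l of b[l, c] * p(l|x)."""
    return sum(benefit(l, c, tranamt) * p for l, p in probs.items())

def decide(probs, tranamt):
    """Optimal decision t(x) = argmax_c e(c|x)."""
    return max(probs, key=lambda c: expected_benefit(c, probs, tranamt))
```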
Example

- p(fraud|x) = 0.5 and tranamt = $200.
- e(fraud|x) = b[fraud, fraud]·p(fraud|x) + b[nonfraud, fraud]·p(nonfraud|x) = (200 - 90) × 0.5 + (-90) × 0.5 = $10
- e(nonfraud|x) = b[fraud, nonfraud]·p(fraud|x) + b[nonfraud, nonfraud]·p(nonfraud|x) = 0 × 0.5 + 0 × 0.5 = $0 (always $0)
- Predict fraud, since its expected benefit of $10 is higher.
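Plugging the slide's numbers into the sketch from the previous slide reproduces this arithmetic:

```python
probs = {"fraud": 0.5, "nonfraud": 0.5}                  # p(fraud|x) = 0.5
print(expected_benefit("fraud", probs, tranamt=200))     # 10.0
print(expected_benefit("nonfraud", probs, tranamt=200))  # 0.0
print(decide(probs, tranamt=200))                        # fraud
```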
Combining Multiple Models

- Individual benefits: each model k estimates p_k(c|x), giving e_k(c|x) = Σ_l b[l, c] · p_k(l|x).
- Averaged benefits: e(c|x) = (1/K) Σ_k e_k(c|x).
- Optimal decision: t(x) = argmax_c e(c|x).
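A standalone sketch of the combination step (the names are mine): each model contributes its own expected-benefit estimates, which are averaged per class before taking the argmax.

```python
def combine(per_model_benefits):
    """per_model_benefits: one dict {class: e_k(class|x)} per model k.
    Returns the class with the highest averaged expected benefit."""
    K = len(per_model_benefits)
    avg = {}
    for e_k in per_model_benefits:
        for c, v in e_k.items():
            avg[c] = avg.get(c, 0.0) + v / K  # e(c|x) = (1/K) sum_k e_k(c|x)
    return max(avg, key=avg.get)
```

For instance, combine([{"fraud": 10.0, "nonfraud": 0.0}, {"fraud": -90.0, "nonfraud": 0.0}]) averages the benefits to {"fraud": -40.0, "nonfraud": 0.0} and returns "nonfraud".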
How about accuracy?
Do we need all K models?

- We stop learning if the first k (< K) models already have the same accuracy as all K models, with confidence p.
- This ends up scanning less than the full dataset: less than one scan.
- The tool: statistical sampling (Hoeffding's inequality, next).
Less than one scan

[Diagram: batches are consumed sequentially; after each new model is trained, the algorithm asks "accurate enough?" If No, it trains on the next batch; if Yes, it stops without reading the remaining batches.]
Hoeffding's inequality

- A random variable is bounded in [a, b], with range R = b - a. After n observations, its sample mean is ȳ.
- What is its error bound, with confidence p, regardless of the distribution?
- Hoeffding's answer: with probability at least p = 1 - δ, the true mean lies within ȳ ± ε, where ε = sqrt(R² ln(1/δ) / (2n)).
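A one-line implementation of the bound, in the form common in the streaming literature (e.g., Hulten and Domingos); that this is the exact form used in the paper is my assumption.

```python
import math

def hoeffding_epsilon(R, n, delta):
    """With probability >= 1 - delta, the true mean lies within epsilon of
    the sample mean of n observations with range R, for any distribution."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
```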
When can we stop?

- Using the first k models: let e1 be the highest averaged expected benefit, with Hoeffding error ε1, and e2 the second highest, with Hoeffding error ε2.
- The current winning label remains the winner with confidence p iff e1 - ε1 > e2 + ε2.
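In code, the stopping test might look as follows; a sketch that assumes a single range R for both classes and reuses hoeffding_epsilon from the previous slide.

```python
def can_stop(benefit_history, R, delta):
    """benefit_history: {class: [e_k(class|x) for each of the k models]}.
    True if the leader beats the runner-up even under Hoeffding error."""
    means = {c: sum(v) / len(v) for c, v in benefit_history.items()}
    ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    (top, e1), (_, e2) = ranked[0], ranked[1]
    eps = hoeffding_epsilon(R, len(benefit_history[top]), delta)
    return e1 - eps > e2 + eps                # e1 - ε1 > e2 + ε2
```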
Less Than One Scan Algorithm

- Iterate this test over every instance of a validation set.
- Keep training new models until every validation instance receives the same prediction as the full ensemble of K models would, with confidence p; see the sketch below.
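Putting the pieces together, a high-level sketch of the loop. Here train_one and benefits_on are hypothetical helpers: train_one(X, y) fits one base model, and benefits_on(models, x) returns {class: [e_k(class|x)]} for the models so far.

```python
def less_than_one_scan(batches, validation, R, delta, train_one, benefits_on):
    models = []
    for X, y in batches:
        models.append(train_one(X, y))
        # all() reads validation items one at a time and short-circuits at
        # the first failure, so only one item is in memory at any moment.
        if all(can_stop(benefits_on(models, x), R, delta) for x in validation):
            break                             # remaining batches never read
    return models
```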
Validation Set

- If the test fails on one example x, we do not need to examine any other, so we can keep only one example in memory at a time.
- If the first k base models' prediction on x is the same as all K models', it is very likely that k+1 models' prediction will also be the same, with the same confidence.
Validation Set

- At any time, we only need to keep one data item x from the validation set in memory; it is read sequentially, and the validation set is read only once.
- What can serve as a validation set? The training set itself, or a separate holdout set.
Amount of Data Scanned

- Training set: at most once. Validation set: once.
- When the training set doubles as the validation set: once we decide to train a model on a batch, that batch is not used for validation again.
- How much data is used to train models? Less than one full scan.
Experiments

- Donation dataset.
- Total benefit: the amount of charity donated minus the overhead of sending solicitations.
Experiment Setup

- Inductive learners: C4.5, RIPPER, and naive Bayes (NB).
- Number of base models K ∈ {8, 16, 32, 64, 128, 256}; we report their average.
Baseline Results (with C4.5)

- Single model: $13,292.7. Complete one scan: $14,702.9 (the average over K ∈ {8, 16, 32, 64, 128, 256}).
- The one-scan ensemble is about $1,410 higher than the single model.
Less-than-one scan (with C4.5)

- Full one scan: $14,702. Less-than-one scan: $14,828, actually a little higher, by $126.
- How much data was scanned at 99.7% confidence? 71%.
Other datasets

- Credit card fraud detection.
- Total benefit: the recovered fraud amount minus the overhead of investigation.
Results

- Baseline single model: $733,980 (with curtailed probability).
- One-scan ensemble: $804,964. Less-than-one scan: $804,914.
- Amount of data scanned: 64%.
Smoothing effect
Related Work

- Ensembles:
  - Meta-learning (Chan and Stolfo): 2 scans
  - Bagging (Breiman) and AdaBoost (Freund and Schapire): multiple scans
- Use of Hoeffding's inequality:
  - Aggregate queries (Hellerstein et al.)
  - Streaming decision trees (Hulten and Domingos): a single decision tree, less than one scan
- Scalable decision trees:
  - SPRINT (Shafer et al.): multiple scans
  - BOAT (Gehrke et al.): 2 scans
Conclusion

- Both "one scan" and "less than one scan" achieve accuracy similar to or higher than the single model.
- "Less than one scan" uses approximately 60%-90% of the data for training, without loss of accuracy.