Post on 14-Dec-2015
Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions
Presenter: Prof. Longbin Cao
Wei Fan, Kun Zhang, and Xiaojing Yuan
[Figure: precision-recall plot. x-axis: Recall (0.0 to 1.0); y-axis: Precision (0.0 to 1.0); curves labeled Ma and Mb; additional label: VE]
What is the business problem and broad-based areas?
Problem: ozone pollution day detection
- Ground-level ozone formation is a sophisticated chemical and physical process and is “stochastic” in nature.
- Ozone levels above certain thresholds are harmful to human health and our daily life.
- Two standards are used: 8-hour peak and 1-hour peak.
  - 8-hour average > 80 ppb (parts per billion)
  - 1-hour average > 120 ppb
- Ozone days occur on only 5 to 15 days per year.
Broad area: Environmental Pollution Detection and Protection
Drawbacks of alternative approaches:
- Simulation: consumes high computational power and is customized for a particular location, so solutions are not portable to other places.
- Physical model approach: it is hard to come up with good equations when there are many parameters, and the equations change from place to place.
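As an illustration of the two standards, the sketch below flags a day from its hourly ozone readings. It assumes 24 readings per calendar day in ppb and a rolling average computed within that day; the function names and this windowing convention are hypothetical and are not taken from the slides.

```python
from typing import Sequence

ONE_HOUR_LIMIT_PPB = 120.0    # 1-hour peak standard quoted on the slide
EIGHT_HOUR_LIMIT_PPB = 80.0   # 8-hour peak standard quoted on the slide


def peak_rolling_mean(hourly_ppb: Sequence[float], window: int) -> float:
    """Return the largest mean over any `window` consecutive readings."""
    if len(hourly_ppb) < window:
        raise ValueError("not enough readings for the requested window")
    return max(
        sum(hourly_ppb[i:i + window]) / window
        for i in range(len(hourly_ppb) - window + 1)
    )


def is_ozone_day(hourly_ppb: Sequence[float], standard: str = "8-hour") -> bool:
    """Flag a day whose peak average exceeds the chosen standard."""
    if standard == "8-hour":
        return peak_rolling_mean(hourly_ppb, 8) > EIGHT_HOUR_LIMIT_PPB
    if standard == "1-hour":
        return peak_rolling_mean(hourly_ppb, 1) > ONE_HOUR_LIMIT_PPB
    raise ValueError(f"unknown standard: {standard!r}")
```

A day can exceed one standard without exceeding the other, which is why the two labels are treated as separate prediction targets.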
What are the research challenges that cannot be handled by the state-of-the-art?
- The dataset is sparse, skewed, stochastic, biased and streaming at the same time.
  - High dimensional
  - Very few positives
  - Under similar conditions: sometimes it happens and sometimes it doesn’t
  - P(x) differs between training and testing
  - Training data comes from the past, while predictions are for the future
- The physical model is not well understood and cannot be customized easily from location to location.
What is the main idea of your approach?
- Non-parametric models are easier to use when the “physical or generative mechanism” is unknown.
- Reliable “conditional probabilities” estimation under “skewed, biased, high-dimensional, possibly irrelevant features”.
- Estimate the “decision threshold” to predict on the unknown distribution of the future.
- Random Decision Tree
  - Super fast implementation
  - Formal analysis: bound analysis, MSE reduction, bias and bias reduction, P(y|x) order correctness proof
[Figure: A CV-based procedure for decision threshold selection. Flow: Training Set -> Algorithm -> 10-fold cross-validation (10CV) producing estimated probability values for fold 1, fold 2, ..., fold 10 -> Concatenate -> “Probability-TrueLabel” file -> PrecRec plot (x-axis: Recall, y-axis: Precision, both 0.0 to 1.0; curves Ma and Mb) -> Decision threshold (VE)]

“Probability-TrueLabel” file (excerpt):
Date      P(y=“ozoneday”|x,θ)   Label
7/1/98    0.1316                Normal
7/2/98    0.6245                Ozone
7/3/98    0.5944                Ozone
………
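The threshold selection step can be sketched as follows. The slides do not state which point on the precision-recall curve is chosen, so the rule used here (best precision among thresholds that reach a minimum recall) is an assumption, as are all function names and the `min_recall` parameter. The per-fold probabilities are taken as given; training the model inside each fold is left out.

```python
from typing import List, Sequence, Tuple


def concatenate_folds(
    fold_outputs: Sequence[Sequence[Tuple[float, int]]]
) -> Tuple[List[float], List[int]]:
    """Merge per-fold (probability, true_label) pairs into one list each."""
    probs, labels = [], []
    for fold in fold_outputs:
        for p, y in fold:
            probs.append(p)
            labels.append(y)
    return probs, labels


def precision_recall_curve(
    probs: Sequence[float], labels: Sequence[int]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for every distinct score.

    A day is predicted positive when its probability >= threshold.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    curve = []
    for t in sorted(set(probs)):
        predicted = [p >= t for p in probs]
        tp = sum(1 for hit, y in zip(predicted, labels) if hit and y == 1)
        n_pred = sum(predicted)  # at least 1, because t is one of the scores
        curve.append((t, tp / n_pred, tp / total_pos))
    return curve


def choose_threshold(
    probs: Sequence[float], labels: Sequence[int], min_recall: float = 0.6
) -> float:
    """Assumed rule: best precision among thresholds reaching min_recall."""
    feasible = [
        row for row in precision_recall_curve(probs, labels)
        if row[2] >= min_recall
    ]
    # Ties on precision are broken in favor of the higher threshold.
    return max(feasible, key=lambda row: (row[1], row[0]))[0]


# The three rows shown in the "Probability-TrueLabel" excerpt (1 = Ozone).
folds = [[(0.1316, 0)], [(0.6245, 1)], [(0.5944, 1)]]
probs, labels = concatenate_folds(folds)
threshold = choose_threshold(probs, labels, min_recall=0.6)
```

Because the threshold is picked from held-out fold predictions rather than from training-set scores, it is less tied to the examples the model has already seen.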
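[Figure: Decision threshold when P(x) is different and P(y|x) is non-deterministic. Two panels, Training Distribution and Testing Distribution, each marked with regions 1, 2, and 3 containing positive (+) and negative (-) examples.]

Rows sorted by estimated probability:
Date      P(y=“ozoneday”|x,θ)   Label
7/1/98    0.1316                Normal
7/3/98    0.5944                Ozone
7/2/98    0.6245                Ozone
………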
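One way to read this slide: when P(y|x) is the same in training and testing but P(x) differs, a fixed threshold yields different precision on the two distributions. The sketch below works through that arithmetic. The three regions, their weights, and their positive rates are hypothetical numbers chosen for illustration and do not come from the slides.

```python
from typing import Sequence


def precision_at_threshold(
    region_weights: Sequence[float],
    region_pos_rate: Sequence[float],
    threshold: float,
) -> float:
    """Expected precision when every region with P(y|x) >= threshold is flagged.

    region_weights plays the role of P(x); region_pos_rate is P(y=+|x).
    """
    flagged = [
        (w, p) for w, p in zip(region_weights, region_pos_rate)
        if p >= threshold
    ]
    mass = sum(w for w, _ in flagged)
    if mass == 0:
        raise ValueError("no region is flagged at this threshold")
    return sum(w * p for w, p in flagged) / mass


# Hypothetical setup: identical P(y|x) in three regions, different P(x).
pos_rate = [0.1, 0.5, 0.9]
train_px = [0.6, 0.3, 0.1]
test_px = [0.6, 0.1, 0.3]

train_precision = precision_at_threshold(train_px, pos_rate, threshold=0.4)
test_precision = precision_at_threshold(test_px, pos_rate, threshold=0.4)
```

With these numbers the same threshold of 0.4 gives a precision of about 0.6 on the training distribution and about 0.8 on the testing distribution, even though P(y|x) never changed.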
Random Decision Tree
[Figure: example random decision tree over three features, B1: {0,1}, B2: {0,1}, and B3: continuous. Nodes shown: “B1 == 0?” (B1 chosen randomly), “B2 == 0?” (B2 chosen randomly), “B3 < 0.3?” (B3 chosen randomly, random threshold 0.3), and “B3 < 0.6?” (B3 chosen randomly, random threshold 0.6). Branches are labeled Y and N.]
RDT vs. Random Forest
1. Original data vs. bootstrap
2. Random pick vs. random subset + info gain
3. Probability averaging vs. voting
4. RDT: superfast
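The contrasts listed above can be made concrete with a small sketch: every tree sees the original data, each split feature is picked at random with no information gain, continuous features get a random threshold, and the ensemble averages per-tree probabilities. This is a simplified illustration, not the authors' implementation; the names, the `max_depth` and `n_trees` parameters, and the stopping rules are assumptions.

```python
import random
from collections import Counter


def build_tree(rows, labels, kinds, rng, max_depth, used=frozenset()):
    """Grow one random tree; no split-quality measure is computed."""
    leaf = {"counts": Counter(labels)}  # class counts seen at this node
    # Binary features are used once per path; continuous ones may repeat.
    candidates = [i for i, kind in enumerate(kinds)
                  if kind == "continuous" or i not in used]
    if max_depth == 0 or len(rows) < 2 or not candidates:
        return leaf
    f = rng.choice(candidates)  # feature picked at random
    if kinds[f] == "continuous":
        values = [r[f] for r in rows]
        cut = rng.uniform(min(values), max(values))  # random threshold
    else:
        cut = 0.5  # separates the binary values 0 and 1
        used = used | {f}
    lo = [i for i, r in enumerate(rows) if r[f] < cut]
    hi = [i for i, r in enumerate(rows) if r[f] >= cut]
    if not lo or not hi:
        return leaf

    def grow(idx):
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          kinds, rng, max_depth - 1, used)

    return {"feature": f, "cut": cut, "lo": grow(lo), "hi": grow(hi)}


def tree_proba(node, x, positive=1):
    """Follow x down to a leaf and return the positive-class fraction."""
    while "counts" not in node:
        node = node["lo"] if x[node["feature"]] < node["cut"] else node["hi"]
    total = sum(node["counts"].values())
    return node["counts"][positive] / total if total else 0.0


def fit_rdt(rows, labels, kinds, n_trees=10, max_depth=3, seed=0):
    """Build n_trees random trees on the original data (no bootstrap)."""
    rng = random.Random(seed)
    return [build_tree(rows, labels, kinds, rng, max_depth)
            for _ in range(n_trees)]


def predict_proba(trees, x, positive=1):
    """Average the per-tree probabilities instead of taking a vote."""
    return sum(tree_proba(t, x, positive) for t in trees) / len(trees)
```

Since no split is scored during construction, the cost of building a tree is dominated by partitioning the rows, which is consistent with the speed claim in item 4.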
Optimal Decision Boundary
from Tony Liu’s thesis (supervised by Kai Ming Ting)
What is the main advantage of your approach, and how do you evaluate it?
Fast and reliable.
Compared with:
- State-of-the-art data mining algorithms:
  - Decision tree
  - NB
  - Logistic Regression
  - SVM (linear and RBF kernel)
  - Boosted NB and Decision Tree
  - Bagging
  - Random Forest
- Physical equation-based model
Evaluated in an actual streaming environment on a daily basis.
What impact has been made, in particular in changing the real-world business?
In 4 years of studies on actual data, the proposed data mining approach consistently outperformed the physical model-based method.
Can your approach be widely expanded to other areas, and how easy would it be?
Other known applications of the proposed approach:
- Fraud Detection
- Manufacturing Process Control
- Congestion Prediction
- Marketing
- Social Tagging
The proposed method is general enough and doesn’t need any tuning or re-configuration.