Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions


Post on 14-Dec-2015

Page 1:

Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions

Presenter: Prof. Longbin Cao

Wei Fan, Kun Zhang, and Xiaojing Yuan

[Figure: precision-recall plot; x-axis Recall, y-axis Precision, both from 0.0 to 1.0; labels Ma, Mb, and VE]

Page 2:

What is the business problem and broad-based areas?

Problem: ozone pollution day detection
- Ground ozone level is governed by a sophisticated chemical and physical process and is “stochastic” in nature.
- Ozone level above some threshold is rather harmful to human health and our daily life.
- 8-hour peak and 1-hour peak standards:
  - 8-hour average > 80 ppb (parts per billion)
  - 1-hour average > 120 ppb
- It happens from 5 to 15 days per year.

Broad area: Environmental Pollution Detection and Protection

Drawbacks of alternative approaches
- Simulation: consumes high computational power; customized for a particular location, so solutions are not portable to different places
- Physical model approach: hard to come up with good equations when there are many parameters, and the equations change from place to place
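The two standards listed on this slide can be turned into a simple labeling rule. The sketch below is only an illustration: it assumes one reading per hour and interprets the "8-hour peak" as the largest rolling 8-hour mean within a day, which the slide does not spell out. The function and constant names are hypothetical.

```python
from typing import Sequence

# Thresholds taken from the slide; the names are illustrative.
EIGHT_HOUR_LIMIT_PPB = 80.0
ONE_HOUR_LIMIT_PPB = 120.0


def peak_rolling_mean(hourly_ppb: Sequence[float], window: int) -> float:
    """Largest mean over any `window` consecutive hourly readings."""
    if len(hourly_ppb) < window:
        raise ValueError("need at least `window` readings")
    running = sum(hourly_ppb[:window])
    best = running
    for i in range(window, len(hourly_ppb)):
        # Slide the window one hour forward.
        running += hourly_ppb[i] - hourly_ppb[i - window]
        best = max(best, running)
    return best / window


def is_ozone_day(hourly_ppb: Sequence[float], standard: str = "8-hour") -> bool:
    """Label a day under the assumed reading of the two standards."""
    if standard == "8-hour":
        return peak_rolling_mean(hourly_ppb, 8) > EIGHT_HOUR_LIMIT_PPB
    if standard == "1-hour":
        return max(hourly_ppb) > ONE_HOUR_LIMIT_PPB
    raise ValueError("standard must be '8-hour' or '1-hour'")
```

A day with a short, sharp spike can exceed the 1-hour standard without exceeding the 8-hour one, and a long moderate plateau can do the opposite, which is why the two labels are kept separate here.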

Page 3:

What are the research challenges that cannot be handled by the state-of-the-art?

- Dataset is sparse, skewed, stochastic, biased and streaming at the same time.
  - High dimensional
  - Very few positives
  - Under similar conditions: sometimes it happens and sometimes it doesn’t
  - P(x) differs between training and testing
  - Training data from the past, predicting the future
- Physical model is not well understood and cannot be customized easily from location to location

Page 4:

What is the main idea of your approach?

- Non-parametric models are easier to use when the “physical or generative mechanism” is unknown.
- Reliable “conditional probabilities” estimation under “skewed, biased, high-dimensional, possibly irrelevant features”
- Estimate the “decision threshold” to predict on the unknown distribution of the future
- Random Decision Tree
  - Super fast implementation
- Formal Analysis:
  - Bound analysis
  - MSE reduction
  - Bias and bias reduction
  - P(y|x) order correctness proof

Page 5:

A CV based procedure for decision threshold selection

[Figure: flow diagram. Training Set -> Algorithm -> 10CV -> estimated probability values for fold 1, fold 2, ..., fold 10 -> Concatenate -> “Probability-TrueLabel” file -> PrecRec plot (x-axis Recall, y-axis Precision; curves Ma, Mb) -> Decision threshold VE]

“Probability-TrueLabel” file:

Date      P(y=“ozoneday”|x,θ)   Label
7/1/98    0.1316                Normal
7/2/98    0.6245                Ozone
7/3/98    0.5944                Ozone
………

The same rows listed in ascending order of estimated probability:

Date      P(y=“ozoneday”|x,θ)   Label
7/1/98    0.1316                Normal
7/3/98    0.5944                Ozone
7/2/98    0.6245                Ozone
………

Decision Threshold when P(x) is different and P(y|x) is non-deterministic

[Figure: Training Distribution and Testing Distribution panels, each with regions 1, 2, 3 and + / - examples]
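The cross-validation procedure on this slide can be sketched roughly as follows: collect out-of-fold probability estimates, concatenate them with the true labels, and scan candidate thresholds along the precision-recall curve. The slide does not say which point on the curve defines the threshold VE, so this sketch assumes the F1-maximizing point purely for illustration. The helper names and the `fit` callback interface are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[float, bool]  # (estimated probability, true label)


def out_of_fold_pairs(xs: Sequence, ys: Sequence[bool],
                      fit: Callable, k: int = 10) -> List[Pair]:
    """Run k-fold CV and concatenate (probability, true label) pairs.

    `fit(train_x, train_y)` must return a function mapping x to a probability.
    """
    pairs: List[Pair] = []
    for fold in range(k):
        test_idx = list(range(fold, len(xs), k))  # every k-th example
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        pairs.extend((predict(xs[i]), bool(ys[i])) for i in test_idx)
    return pairs


def precision_recall(pairs: Sequence[Pair], threshold: float) -> Tuple[float, float]:
    """Precision and recall when predicting positive for p >= threshold."""
    tp = sum(1 for p, y in pairs if p >= threshold and y)
    fp = sum(1 for p, y in pairs if p >= threshold and not y)
    fn = sum(1 for p, y in pairs if p < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def select_threshold(pairs: Sequence[Pair]) -> float:
    """Pick the candidate threshold with the best F1 (an assumed criterion)."""
    def f1(threshold: float) -> float:
        precision, recall = precision_recall(pairs, threshold)
        total = precision + recall
        return 2 * precision * recall / total if total else 0.0

    candidates = sorted({p for p, _ in pairs})
    return max(candidates, key=f1)  # ties resolve to the lowest threshold
```

Applied to the three rows shown in the slide's table, `select_threshold([(0.1316, False), (0.6245, True), (0.5944, True)])` returns 0.5944, the lowest probability that still separates the two ozone days from the normal day.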

Page 6:

Random Decision Tree

[Figure: example random decision tree over features B1: {0,1}, B2: {0,1}, B3: continuous. Node tests shown: "B1 == 0" (B1 chosen randomly), "B2 == 0?" (B2 chosen randomly), "B3 < 0.3?" (B3 chosen randomly, random threshold 0.3), and "B3 < 0.6?" (B3 chosen randomly, random threshold 0.6), with Y/N branches.]

RDT vs Random Forest
1. Original Data vs. Bootstrap
2. Random pick vs. Random Subset + info gain
3. Probability Averaging vs. Voting
4. RDT: superfast
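The contrast with Random Forest can be made concrete with a small sketch. It is a simplified illustration, not the authors' implementation: each tree picks its split feature uniformly at random with no information gain, draws a random threshold for continuous features, uses the original data rather than a bootstrap sample, and the ensemble averages leaf probabilities instead of voting. As further assumptions of this sketch, binary features are used at most once per path, thresholds are drawn from the value range seen at the node, and growth stops at a fixed depth. All names are hypothetical.

```python
import random
from typing import List, Sequence


def build_tree(rows: Sequence[tuple], labels: Sequence[bool], kinds: Sequence[str],
               depth: int, rng: random.Random, used: frozenset = frozenset()) -> dict:
    """Grow one random tree; kinds[i] is 'binary' or 'continuous'."""
    leaf = {"pos": sum(labels), "n": len(labels)}  # class counts at this node
    candidates = [i for i, kind in enumerate(kinds)
                  if kind == "continuous" or i not in used]
    if depth == 0 or len(rows) < 2 or not candidates:
        return leaf
    feature = rng.choice(candidates)  # random pick, no information gain
    if kinds[feature] == "continuous":
        values = [row[feature] for row in rows]
        threshold = rng.uniform(min(values), max(values))  # random threshold
    else:
        threshold = 0.5  # value 0 goes left, value 1 goes right
        used = used | {feature}
    left = [i for i, row in enumerate(rows) if row[feature] < threshold]
    right = [i for i, row in enumerate(rows) if row[feature] >= threshold]
    if not left or not right:
        return leaf  # the split separated nothing, so stop here
    node = dict(leaf, feature=feature, threshold=threshold)
    node["left"] = build_tree([rows[i] for i in left], [labels[i] for i in left],
                              kinds, depth - 1, rng, used)
    node["right"] = build_tree([rows[i] for i in right], [labels[i] for i in right],
                               kinds, depth - 1, rng, used)
    return node


def tree_probability(node: dict, row: tuple) -> float:
    """Follow the row to a leaf and return the fraction of positives there."""
    while "feature" in node:
        side = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[side]
    return node["pos"] / node["n"]


def rdt_probability(trees: List[dict], row: tuple) -> float:
    """Average the per-tree probabilities (no voting)."""
    return sum(tree_probability(tree, row) for tree in trees) / len(trees)
```

Because no split quality is ever computed, building a tree costs little more than partitioning the data, which is consistent with the "superfast" claim on the slide. A typical use would be `trees = [build_tree(rows, labels, kinds, depth, rng) for _ in range(n_trees)]` followed by `rdt_probability(trees, row)` for each new row.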

Page 7:

Optimal Decision Boundary

from Tony Liu’s thesis (supervised by Kai Ming Ting)

Page 8:
Page 9:

What is the main advantage of your approach, and how do you evaluate it?

- Fast and Reliable
- Compared with:
  - State-of-the-art data mining algorithms:
    - Decision tree
    - NB
    - Logistic Regression
    - SVM (linear and RBF kernel)
    - Boosted NB and Decision Tree
    - Bagging
    - Random Forest
  - Physical Equation-based Model
- Actual streaming environment on a daily basis

Page 10:

What impact has been made, in particular in changing the real-world business?

From 4-year studies on actual data, the proposed data mining approach consistently outperforms the physical model-based method.

Page 11:

Can your approach be widely expanded to other areas, and how easy would it be?

- Other known applications using the proposed approach:
  - Fraud Detection
  - Manufacturing Process Control
  - Congestion Prediction
  - Marketing
  - Social Tagging
- The proposed method is general enough and doesn’t need any tuning or re-configuration.