welcome to ds504/cs586: big data analyticsyli15/courses/ds504fall16/...• final project...
TRANSCRIPT
DS504/CS586: Big Data Analytics Application I Prof. Yanhua Li
Welcome to
Time:6:00pm–8:50pmRLoca2on:AK232
Fall2016
• 18 critiques & Next Thur we have the last critique. – Already graded 8 of them. – Plan to grade 1-2 more.
• Grading – Projects (40%)
• Project 1 (10%) • Project 2 (30%)
– Finalreportsinthediscussionforum(by11:59pm12/13);– Self-and-peerevalua2onformforproject2(by11:59PM12/13);
– WriPenwork(30%):• Cri2ques+Projectreports(20%)• Quiz(10%,with5%each)
– Oralwork(30%):• Presenta2on
• Final Project Presentation – 22 minutes each group (including Q&A) – Schedule:
• 12/15Thu
5
Next Class: Summary and Discussion
v Review of the semester v Plus the last critique/review
Ar2ficialNeuralNetworks(ANN)
X1 X2 X3 Y1 0 0 01 0 1 11 1 0 11 1 1 10 0 1 00 1 0 00 1 1 10 0 0 0
X1
X2
X3
Y
Black box
Output
Input
OutputYis1ifatleasttwoofthethreeinputsareequalto1.
Ar2ficialNeuralNetworks(ANN)
X1 X2 X3 Y1 0 0 01 0 1 11 1 0 11 1 1 10 0 1 00 1 0 00 1 1 10 0 0 0
S
X1
X2
X3
Y
Black box
0.3
0.3
0.3 t=0.4
Outputnode
Inputnodes
⎩⎨⎧
=
>−++=
otherwise0 trueis if1
)( where
)04.03.03.03.0( 321
zzI
XXXIY
Ar2ficialNeuralNetworks(ANN)
• Modelisanassemblyofinter-connectednodesandweightedlinks
• Outputnodesumsupeachofitsinputvalueaccordingtotheweightsofitslinks
• Compareoutputnodeagainstsomethresholdt
S
X1
X2
X3
Y
Black box
w1
t
Outputnode
Inputnodes
w2
w3
)( tXwIYi
ii −= ∑PerceptronModel
)( tXwsignYi
ii −= ∑
or
GeneralStructureofANN
Activationfunction
g(Si )Si Oi
I1
I2
I3
wi1
wi2
wi3
Oi
Neuron iInput Output
threshold, t
InputLayer
HiddenLayer
OutputLayer
x1 x2 x3 x4 x5
y
TrainingANNmeanslearningtheweightsoftheneurons
Real-worldproblemsarealwaysmessy
• Mul2plemodels
• Keyfeatures
• DataSparsity
• Whatdowedotosolveaclassifica2on/inference/predic2onproblem?– DataCleaning
– Featureselec2on
– Inferencemodel
– Evalua2on
• Anexampleofhowtosolverealworldapplica2onproblem
U-Air:WhenUrbanAirQualityMeetsBigData
Authors:YuZheng,MicrosogResearchAsia
Background • Airquality
– NO2,SO2– Aerosols:PM2.5,PM10
• WhyitmaPers– Healthcare– Pollu2oncontrolanddispersal
• Reality– Buildingameasurementsta2on
isnoteasy– Alimitednumberofsta2ons
(poorcoverage)
Beijingonlyhas22airqualitymonitorsta2onsinitsurbanareas(50kmx40km)
Air quality monitor station
2PM,June17,2013
Challenges• Airqualityvariesbyloca2ons
non-linearly• Affectedbymanyfactors
– Weathers,traffic,landuse…– Subtletomodelwithaclearformula
0 40 80 120 160 200 240 280 320 360 400 440 4800.00
0.05
0.10
0.15
0.20
0.25
0.30
Por
titio
n
Deviation of PM2.5 between S12 and S13
>35%Prop
or2o
n
A) Beijing (8/24/2012 - 3/8/2013)
WedonotreallyknowtheairqualityofalocaGonwithoutamonitoringstaGon!
Challenges • Exis2ngmethodsdonotworkwell
– Linearinterpola2on– Classicaldispersionmodels
• GaussianPlumemodelsandOpera2onalStreetCanyonmodels• Manyparametersdifficulttoobtain:Vehicleemissionrates,street
geometry,theroughnesscoefficientoftheurbansurface…
– Satelliteremotesensing• Sufferfromclouds• Doesnotreflectgroundairquality• Varyinhumidity,temperature,loca2on,andseasons
– Outsourcedcrowdsensingusingportabledevices• Limitedtoafewgasses:CO2andCO• Sensorsfordetec2ngaerosolarenotportable:PM10,PM2.5• Alongperiodofsensingprocess,1-2hours
30,000+USD,10ug/m3 202×85×168(mm)
InferringReal-TimeandFine-GrainedairqualitythroughoutacityusingBigData
Meteorology Traffic POIs RoadnetworksHumanMobility
Historicalairqualitydata Real-2meairqualityreports
Applica2ons • Loca2on-basedairqualityawareness
– Fine-grainedpollu2onalert– Rou2ngbasedonairquality
• Deployingnewmonitoringsta2ons• Asteptowardsiden2fyingtherootcauseofairpollu2on
S2
S1
S5
S3
S7
S6S4
S1
S8
S9
S10
B) Shanghai
Difficul2es • 1.Howtoiden2fyfeaturesfromeachkindofdatasource
• 2.Incorporatemul2pleheterogeneousdatasourcesintoalearningmodel– Spa2ally-relateddata:POIs,roadnetworks– Temporally-relateddata:traffic,meteorology,humanmobility
• 3.Datasparseness(liPletrainingdata)– Limitednumberofsta2ons– Manyplacestoinfer
MethodologyOverview • Par22onacityintodisjointgrids• Extractfeaturesforeachgridfromitsaffec2ngregion
– Meteorologicalfeatures – Trafficfeatures– Humanmobilityfeatures– POIfeatures– Roadnetworkfeatures
• Co-training-basedsemi-supervisedlearningmodelforeachpollutant
– PredicttheAQIlabels– Datasparsity– Twoclassifiers
MeteorologicalFeatures:Fm
• Rainy,Sunny,Cloudy,Foggy• Windspeed• Temperature• Humidity• Barometerpressure
GoodModerate
UnhealthyUnhealthy-S
Very Unhealthy
AQIofPM10AugusttoDec.2012inBeijing
TrafficFeatures:Ft
• Distribu2onofspeedby2me:F(v)• Expecta2onofspeed:E(V)• Standarddevia2onofSpeed:D
Good Moderate
km
Unhealthy-S Very UnhealthyUnhealthy
0≤v<20
20≤v<40
v≥ 40
E(v)
D(v)
GPStrajectoriesgeneratedbyover30,000taxisFromAugusttoDec.2012inBeijing
HumanMobilityFeatures:Fh • Humanmobilityimplies
– Trafficflow– Landuseofaloca2on– Func2onofaregion(likeresiden2alorbusinessareas)
• Features:– Numberofarrivals 𝑓↓𝑎 andleavings𝑓↓𝑙
A) AQI of PM10 B) AQI of NO2
fl fl
fa faGoodModerate
UnhealthyUnhealthy-S
Very Unhealthy
GoodModerate
UnhealthyUnhealthy-S
Very Unhealthy
Numberofarrivalsfaandleavings(departures)fl
Parksvsfactories
Extrac2ngTraffic/HumanMobilityFeatures
• Offlinespa2o-temporalindexing• ta:arrival2me• Traj:trajID• Ii:theindexofthefirstGPSpoint(inthetrajectory)enteringagrid• Io:theindexofthelastGPSpoint(inthetrajectory)enteringthegrid
POIFeatures:Fp• WhyPOI
– Indicatethelanduseandthefunc2onoftheregion– thetrafficpaPernsintheregion
• Features– NumbersofPOIsovercategories– Por2onofvacantplaces– ThechangesinthenumberofPOIs
• Factories,shoppingmalls,• hotelandrealestates• Parks,decora2onandfurnituremarkets
RoadNetworkFeatures:Fr • Whyroadnetworks
– Haveastrongcorrela2onwithtrafficflows– Agoodcomplementaryoftrafficmodeling
• Features:
– Totallengthofhighways𝑓↓ℎ – Totallengthofother(low-level)roadsegments𝑓↓𝑟 – Thenumberofintersec2ons𝑓↓𝑠 inthegrid’saffec2ngregion
fh
fr
fs
Semi-SupervisedLearningModel
• Philosophyofthemodel– Statesofairquality
• Temporaldependencyinaloca2on• Geo-correla2onbetweenloca2ons
– Genera2onofairpollutants• Emissionfromaloca2on• Propaga2onamongloca2ons
– Twosetsoffeatures• Spa2ally-related• Temporally-related
s2s1
s3
s4l
s2s1
s3
s4l
s2s1
s3
s4
ti
t1
t2
lTime
Geospace
A location with AQI labels A location to be inferred Temporal dependencySpatial correlation
POIs: Spatial
Fh Temporal
Road Networks: Fr
Ft FmMeteorologic:Traffic:Human mobility:
FpSpa2alClassifier
TemporalClassifierCo-Training
Co-Training-BasedLearningModel
• Spa2alclassifier– Modelthespa2alcorrela2onbetweenAQIofdifferentloca2ons– Usingspa2ally-relatedfeatures– BasedonaBPneuralnetwork
• Inputgenera2on– Selectnsta2onstopairwith– Performmrounds
∆P1x
∆R1x
c
D1Fp
Frl1 D2
c
d1x
D1
D2
D1
D1
1
1
Fp
Frlk
k
k
Fpx Frx lx
∆Pkx
∆Rkx
cdkx
1
k
x
ANNInput generation
w'11
w'qr
w1
wr
wpq
w11 b1
bq
b'r
b'1
b''
∆P1x
∆R1x
c
D1Fp
Frl1 D2
c
d1x
D1
D2
D1
D1
1
1
Fp
Frlk
k
k
Fpx Frx lx
∆Pkx
∆Rkx
cdkx
1
k
x
ANNInput generation
w'11
w'qr
w1
wr
wpq
w11 b1
bq
b'r
b'1
b''
Co-Training-BasedLearningModel
• Temporalclassifier– Modelthetemporaldependencyoftheairqualityinaloca2on– Usingtemporallyrelatedfeatures– BasedonaLinear-ChainCondi2onalRandomField(CRF)
Yt-1
Fm(t-1) t-1Ft(t-1) Fh(t-1) Fm(t) tFt(t) Fh(t) Fm(t+1) t+1Ft(t+1) Fh(t+1)
Yt Yt-1
LearningProcess
Yt-1
Fm(t-1) t-1Ft(t-1) Fh(t-1) Fm(t) tFt(t) Fh(t) Fm(t+1) t+1Ft(t+1) Fh(t+1)
Yt Yt-1
∆P1x
∆R1x
c
D1Fp
Frl1 D2
c
d1x
D1
D2
D1
D1
1
1
Fp
Frlk
k
k
Fpx Frx lx
∆Pkx
∆Rkx
cdkx
1
k
x
ANNInput generation
w'11
w'qr
w1
wr
wpq
w11 b1
bq
b'r
b'1
b''
Training
Temporally-relatedfeatures
Spa2ally-relatedfeatures
Labeleddata Unlabeleddata
Inference
InferenceProcess
∆P1x
∆R1x
c
D1Fp
Frl1 D2
c
d1x
D1
D2
D1
D1
1
1
Fp
Frlk
k
k
Fpx Frx lx
∆Pkx
∆Rkx
cdkx
1
k
x
ANNInput generation
w'11
w'qr
w1
wr
wpq
w11 b1
bq
b'r
b'1
b''
Temporally-relatedfeatures
Spa2ally-relatedfeatures
<𝑝↓𝑐1 , 𝑝↓𝑐2 ,…,𝑝↓𝑐𝑛 >
<𝑝′↓𝑐1 , 𝑝′↓𝑐2 ,…,𝑝′↓𝑐𝑛 >
× 𝑐= arg↓𝑐↓𝑖 ∈𝒞 𝑀𝑎𝑥( 𝑝↓𝑐𝑖 × 𝑝′↓𝑐𝑖 )
< pc1 , · · · , pcn >
< p0c1 , · · · , p0cn >
c = argci2CMax(P ciSC ⇥ P
ciTC)
Yt-1
Fm(t-1) t-1Ft(t-1) Fh(t-1) Fm(t) tFt(t) Fh(t) Fm(t+1) t+1Ft(t+1) Fh(t+1)
Yt Yt-1
Evalua2on
Datasources Beijing Shanghai Shenzhen Wuhan POI 2012Q1 271,634 321,529 107,061 102,467
2012Q3 272,109 317,829 107,171 104,634
Road #.Segments 162,246 171,191 45,231 38,477 Highways 1,497km 1,963km 256km 1,193km Roads 18,525km 25,530kmKM 6,100km 9,691km #.Intersec. 49,981 70,293 32,112 25,359
AQI #.Sta2on 22 10 9 10 Hours 23,300 8,588 6,489 6,741 Timespans 8/24/2012-3/8/201
3 1/19/2013-3/8/20
13 2/4/2013-3/8/201
3 2/4/2013-3/8/2013
UrbanSize(grids) 5050km(2500) 5050km(2500) 5745km(2565) 4525km(1165)
• Datasets
S1
S2
S4
S5
S8
S5
S2
S1
S7
S5
S3
S3
S6 S7S6
S9S10
S12
S11
S13 S14
S22S15
S16
S16
S17
S18
S19
S20
S21S3
S7
S6
S4
S1
S8
S9
S10
S1
S4
S2
S6
S9
S8
S1
S2
S4
S3S10
S5
S9
S6 S7
S8
A) Beijing B) Shanghai C) Shenzhen D) Wuhan
Evalua2on
• GroundTruth– Removeasta2on– Crossci2es
• Baselines– LinearandGaussianInterpola2ons– ClassicalDispersionModel– DecisionTree(DT):– CRF-ALL– ANN-ALL
Evalua2on
PM10 NO2
Features Precision Recall Precision Recall
Fm 0.572 0.514 0.477 0.454 Ft 0.341 0.36 0.371 0.35 Fh 0.327 0.364 0.411 0.483 Fp+Fr 0.441 0.443 0.307 0.354
Fm+Ft 0.664 0.675 0.634 0.635 Fm+Ft+Fp+Fr 0.731 0.734 0.701 0.691
Fm+Ft+Fp+Fr+Fh 0.773 0.754 0.723 0.704
• Doeseverykindoffeaturecount?
Evalua2on • Overallperformanceoftheco-training
0 20 40 60 80 100 120 140 160
0.65
0.70
0.75
0.80
Pre
cisi
on
Num. of Iterations
SC TC Co-Training
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
NO2PM10
Acc
urac
y
U-Air Linear Guassian Classical DT CRF-ALL ANN-ALL
Accuracy
Evalua2on
GroundTruth
PredicGons
G M S U
G 3789 402 102 0 0.883
Recall
M 602 3614 204 0 0.818 S 41 200 532 50 0.646 U 0 22 70 219 0.704 0.855 0.853 0.586 0.814 0.828
Precision
• ConfusionmatrixofCo-TrainingonPM10
Evalua2on
CiGes PM2.5 PM10 NO2
Prec. Rec. Prec. Rec. Prec. Rec.
Beijing 0.764 0.763 0.762 0.745 0.730 0.749
Shanghai 0.705 0.725 0.702 0.718 0.715 0.706
Shenzhen 0.740 0.737 0.710 0.742 0.732 0.722
Wuhan 0.727 0.723 0.731 0.739 0.744 0.719
• PerformanceofSpa2alclassifier
Evalua2on
Procedures Time(ms) Procedures Time(ms) Feature
extracGon (pergrid)
Ft&Fh 53.2 Inference (pergrid)
SC 21.5 Fp 28.8 TC 13.1 Fr 14.4 Total 131
• Efficiencystudy• Singlegrid131s• InferringtheAQIsforen2reBeijingin5minutes
Conclusion• Inferfine-grainedairqualitywith
– Real-2meandhistoricalairqualityreadingsfromexis2ngsta2ons– Otherdatasources:meteorology,POIs,roadnetwork,humanmobility,and
trafficcondi2on
• Co-Training-basedsemi-supervisedlearningapproach– Dealwithdatasparsitybylearningfromunlabeleddata– Modelthespa2alcorrela2onamongtheairqualityofdifferentloca2ons– Modelthetemporaldependencyoftheairqualityinaloca2on
• Results– 0.82withtrafficdata(co-training)– 0.76ifonlyusingspa2alclassifier
Ques2ons?