Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying Mosquitoes
Antonio Rodriguez (1), Frederic Bartumeus (2,3,4), Ricard Gavalda (1)
(1) Universitat Politecnica de Catalunya, Barcelona (Spain)
(2) Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain)
(3) CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain)
(4) ICREA, Pg Lluís Companys 23, 08010 Barcelona (Spain)
Workshop on Data Science for Social Good, SoGood, September 2016
Antonio Rodriguez, Frederic Bartumeus, Ricard Gavalda (Universitat Politecnica de Catalunya, Barcelona (Spain); Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain); CREAF, Cerdanyola del Valles, 08193 Barcelona (Spain); ICREA, Pg Lluís Companys 23, 08010 Barcelona (Spain)). Machine Learning Assists the Classification of Reports by Citizens on Disease-Carrying Mosquitoes. Workshop on Data Science for Social Good, SoGood, September 2016.
Overview
1 Introduction
2 Methodology
3 Project development
  Exploratory data analysis
  Data cleaning and pre-processing
  Classifier training, evaluation and selection
  Real-time classification system design
4 Discussion
5 Future work
Introduction - Mosquito Alert
Citizen Science Platform
Mobile application
Growing fast
Various mosquito species
Worldwide locations
Introduction - Mobile App
Send breeding site
Send specimen report
Small questionnaire
Geolocated!
Introduction - Mosquito Alert System
Introduction - Classification system
Methodology
1 Exploratory data analysis
2 Data cleaning and pre-processing
3 Classifiers
  training
  evaluation
  selection
4 Real-time classification system design
Exploratory data analysis - Raw files
users 16967 observations of 10 variables
userID
userRegistTimeOriginal
userRegistDatetime
userRegistDate
userRegistMonthNum
userRegistMonthString
userRegistWeekdayString
userRegistWeekdayNum
userSyst
userDaysSystRelease
Exploratory data analysis - Raw files
reports 10618 observations of 23+1 variables
reportVersionID
reportVersionNum
userID
reportID
reportType
reportNote
os
hide
reportCreationDatetime
reportCreationDate
reportVersionDatetime
reportVersionDate
reportCreationMonthNum
reportCreationMonthString
reportCreationWeekdayString
reportCreationWeekdayNum
reportLong
reportLat
missionNum
missionName
tiger q1 response
tiger q2 response
tiger q3 response
class
Questionnaire variables
Questions
  Is it small and black, with white stripes?
  Does it have a white stripe on both head and thorax?
  Does it have white stripes on both abdomen and legs?
Response values
  -1 No
  0 Not sure
  1 Yes
The class variable
-2 The report is definitely not a valid specimen.
-1 The report doesn’t seem to be a valid specimen, but this is not certain.
0 There isn’t enough information to classify the report.
1 The report seems to be a valid specimen, but this is not certain.
2 The report is definitely a valid specimen.
Instance variables
Added
reportNote
reportTimeOfDay
newUser
userNumReports
userAccuracy
userTimeForFirstReport
userTimeSinceLastReport
userMeanTimeBetweenReports
userNumActionAreas
userMobilityIndex
reports1kmLast* (4)
validReports1kmLast* (4)
Preserved
os
reportMonth
reportQ*Answ (3)
class
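The reports1kmLast* variables suggest counting earlier reports that fall within 1 km of a given report over a trailing time window. The following is a minimal sketch of one way to compute such a count, assuming haversine (great-circle) distance and a window expressed in days. The function names, the (lat, lon, created) tuple layout, and the window lengths are hypothetical and are not taken from the slides.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def reports_within_1km(target, others, window_days):
    """Count reports in `others` created during the `window_days` before `target`
    and located at most 1 km from it. Each report is a (lat, lon, created) tuple."""
    lat, lon, created = target
    start = created - timedelta(days=window_days)
    return sum(
        1
        for (olat, olon, ocreated) in others
        if start <= ocreated < created
        and haversine_km(lat, lon, olat, olon) <= 1.0
    )


# Hypothetical example data (coordinates near Barcelona and Girona).
target = (41.3874, 2.1686, datetime(2016, 7, 15))
others = [
    (41.3880, 2.1690, datetime(2016, 7, 12)),  # nearby, 3 days earlier
    (41.3880, 2.1690, datetime(2016, 5, 1)),   # nearby, but over two months earlier
    (41.9794, 2.8214, datetime(2016, 7, 14)),  # recent, but far away
]
last_week = reports_within_1km(target, others, 7)
last_month = reports_within_1km(target, others, 30)
```

A validReports1kmLast* counterpart could reuse the same function on the subset of neighbouring reports already labelled as valid.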
Generated instances
2094 instances from usable reports
Class      2    1    −1   −2
Frequency  47%  46%  2%   5%
Class-imbalanced problem: positive instances are over 7 times as frequent as negative ones.
Studied classifiers
Naive Bayes
k-nearest neighbors
Decision trees
Random Forests
Classifiers - Considerations
Most classifiers have trouble dealing with imbalanced classes
Merged “unsure” (-1,1) classes into “sure” ones (-2,2)
Minority class replicated in the training set
... but testing is still done on the original class proportions
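The considerations above can be sketched as follows: collapse the four usable classes into two, hold out a test set first, and only then replicate the negatives in the training portion so that the test set keeps the original proportions. This is a minimal illustration in plain Python, not the authors' code. The function names, the 30% test fraction, and the unstratified random split are assumptions; the default replication factor of 10 follows the "Replicated (x10)" note on the selected-classifier slide.

```python
import random


def merge_label(cls):
    """Merge the 'unsure' classes (-1, 1) into the 'sure' ones (-2, 2).
    Class 0 (not enough information) is assumed to be filtered out beforehand."""
    if cls == 0:
        raise ValueError("class 0 instances are not usable")
    return 2 if cls > 0 else -2


def split_and_replicate(instances, test_fraction=0.3, factor=10, seed=0):
    """Split (features, cls) pairs into train and test sets, then replicate the
    negative instances `factor` times in the training set only."""
    rng = random.Random(seed)
    labelled = [(x, merge_label(c)) for x, c in instances]
    rng.shuffle(labelled)
    n_test = int(len(labelled) * test_fraction)
    test, train = labelled[:n_test], labelled[n_test:]
    negatives = [item for item in train if item[1] == -2]
    # Each training negative ends up appearing `factor` times in total.
    train = train + negatives * (factor - 1)
    rng.shuffle(train)
    return train, test
```

Splitting before replicating matters: if copies were made first, duplicates of the same negative instance could land in both sets and inflate the test results.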
Classifiers - Selected classifier
                Positive  Negative
Accuracy        0.380 (overall)
Precision       0.983     0.086
Recall          0.344     0.912
F-measure (F1)  0.51      0.157
Table: Evaluation metrics, Naive Bayes
Naive Bayes
Training conditions:
  Aggregated instances
  Replicated (x10) negatives in training
High positive Precision
High negative Recall
Can detect approximately one third of the valid reports with a precision near 98%
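As a consistency check on the table, the F-measure is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall values; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as reported in the Naive Bayes table.
positive_f1 = f1_score(0.983, 0.344)
negative_f1 = f1_score(0.086, 0.912)
print(round(positive_f1, 3), round(negative_f1, 3))
```

Rounded, the results match the reported F1 values of 0.51 and 0.157. The low positive recall is what pulls the positive F1 down despite the high precision.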
ROC curve and variable importance
Variable name             Importance
reportQ2Answ              0.7424
reportQ3Answ              0.7038
reports1kmLastMonth       0.6623
reportQ1Answ              0.6615
userNumReports            0.6405
userNumActionAreas        0.6348
validReports1kmLastMonth  0.6216
userTimeForFirstReport    0.6197
reports1kmLastWeek        0.6158
userAccuracy              0.6085
Table: Variable importance in the NB classifier. Numbers are the values of the model coefficients after standardization.
Real-time classification system design
Two subsystems:
Instance generation system
  Instance creation script
  Environment
Classification system
  Training script
  Classifier
  Classification script
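One possible reading of this design is sketched below, purely as an illustration. It assumes that "Environment" means stored state about previously seen reports (here a plain dict of per-user report counts), and it uses a placeholder majority-vote model where the real system would plug in the selected classifier. All function names, the class name, and the report field names are hypothetical.

```python
from collections import Counter


def create_instance(report, environment):
    """Instance generation: combine a raw report with the stored per-user state."""
    user_reports = environment.get(report["userID"], 0)
    return {
        "reportQ1Answ": report["q1"],
        "reportQ2Answ": report["q2"],
        "reportQ3Answ": report["q3"],
        "newUser": user_reports == 0,
        "userNumReports": user_reports,
    }


class MajorityClassifier:
    """Placeholder model: always predicts the most common training label."""

    def fit(self, instances, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, instance):
        return self.label_


def training_script(instances, labels):
    """Train offline on labelled instances and return the fitted classifier."""
    return MajorityClassifier().fit(instances, labels)


def classification_script(report, environment, classifier):
    """Classify one incoming report, then update the stored state."""
    label = classifier.predict(create_instance(report, environment))
    environment[report["userID"]] = environment.get(report["userID"], 0) + 1
    return label
```

Keeping instance generation separate from classification means the same instance creation code can feed both the offline training script and the real-time classification script.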
Future work
Scalability
  Code modifications
  GIS enabled database
  Approximately the same computational resources
Improvements
  Classifier tuning
  Priority system
  Another classifier: Random Forest