Big Data: Predicting Rent in London by Machine Learning
Manabu Mukohyoshi
Motivation
• Interested in Machine Learning
• Wide range of Machine Learning applications in use
• Data-driven cities: City slicker - Data are slowly changing the way cities operate (The Economist)
Initial ideas
• Predict fires to dispatch ambulances efficiently
• Predict crimes to dispatch police cars efficiently
• Predict energy consumption (gas, electricity, etc.)
• Predict increase of waste using population
• Predict emission of carbon dioxide
• Predict the rise of rents and house prices using economics and population data
• Map Londoners’ health on to the map of London
• Predict happiness by region
• Predict congestion
Number of Fires by Ward
Number of Fires by Borough
Number of Fires by hour
# of Fires
First Arrival Time
First Arrival Time and Fire Stations
Initial ideas
• London Datastore has a variety of data
  – Mostly statistics
  – Not a lot of individual data
• What to learn?
The Idea
• Rent Prediction in London by Machine Learning
  – Can retrieve individual rent data from Zoopla
  – Rent keeps changing and it is hard to know if the rent is right for the place
    • For landlords, it can be a standard to decide rent
    • For tenants, it can be a standard to judge rent
    • For Zoopla, it can attract more customers
Data Source
• Zoopla (about 45,000 examples)
  – Latitude, Longitude, # of bedrooms, # of bathrooms, # of floors, # of receptions, property type, price
• Walkscore
  – Calculates a score for an address based on how walkable it is (close to grocery stores, restaurants, cafes, etc.)
• MapIt
  – Converts Latitude/Longitude to Ward and Borough code
Data Source
• London Datastore
  – Ward profile
    • Mean Age, Population density, % Not Born in UK, General Fertility Rate, Male life expectancy, Female life expectancy, % children in year 6 who are obese, Rate of All Ambulance Incidents per 1,000 population, Employment rate (16-74), Median House Price, Number of properties sold, % Households Social Rented, % Households Private Rented, % dwellings in council tax bands A or B, % dwellings in council tax bands C, D or E, % dwellings in council tax bands F, G or H, Claimant Rate of Income Support, % with no qualifications, % with Level 4 qualifications and above, Crime rate, Deliberate Fires, Cars per household, Average Public Transport Accessibility score, Turnout at Mayoral election - 2012
  – Borough profile
    • Total carbon emissions, Teenage conception rate, Life satisfaction score, Worthwhileness score, Happiness score, Anxiety score
Steps to solve
1. Collect and combine data
2. Preprocess data
3. Try different machine learning algorithms on the collected data
4. Tune the parameters of ML algorithms
5. Evaluate the results and algorithms
Step 1: Collect and Combine Data
1. Download listings data using the Zoopla API
2. Get Walkscore using the API
3. Convert Longitude/Latitude to ward and borough code using self-hosted MapIt
4. Merge the ward and borough profiles downloaded from London Datastore into the listings data
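The merge in item 4 can be sketched as a left join on the area code. This is only an illustration in plain Python, not the code used in the project: the field names (`ward_code`, `mean_age`, `population_density`), the codes, and all values are hypothetical.

```python
# Hypothetical rows; field names and values are illustrative only.
listings = [
    {"listing_id": 1, "ward_code": "W1", "bedrooms": 2, "price_pw": 400.0},
    {"listing_id": 2, "ward_code": "W2", "bedrooms": 1, "price_pw": 300.0},
    {"listing_id": 3, "ward_code": "W9", "bedrooms": 3, "price_pw": 650.0},
]
ward_profiles = {
    "W1": {"mean_age": 34.2, "population_density": 9800.0},
    "W2": {"mean_age": 38.9, "population_density": 5200.0},
}


def merge_profiles(rows, profiles, key):
    """Left join: copy each row and attach the profile fields for its area code.

    Rows whose code has no profile get None for those fields, so a later
    imputation step can fill them in.
    """
    profile_fields = sorted({f for p in profiles.values() for f in p})
    merged = []
    for row in rows:
        profile = profiles.get(row[key], {})
        combined = dict(row)
        for field in profile_fields:
            combined[field] = profile.get(field)
        merged.append(combined)
    return merged


merged = merge_profiles(listings, ward_profiles, key="ward_code")
```

The same function could be called a second time with a borough-level dictionary and a borough code key to attach the borough profile.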
MapIt: UK
Step 2: Preprocess Data
• Scale (bias elimination)
• Encode categorical features
• Impute
  – Replace n/a or space with mean
• Shuffle
• Split into training dataset and test dataset (cross validation)
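The preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the project's scikit-learn pipeline: the helper names, the choice of zero-mean/unit-variance scaling, and the sample values are all hypothetical.

```python
import random
from statistics import mean, pstdev


def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Rescale to zero mean and unit variance (constant columns become zeros)."""
    mu, sigma = mean(column), pstdev(column)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in column]


def one_hot(column):
    """Encode a categorical column as one list of 0/1 indicators per row."""
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]


def shuffle_and_split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


bathrooms = impute_mean([1.0, None, 2.0, 3.0])
bathrooms_scaled = standardize(bathrooms)
property_type = one_hot(["flat", "house", "flat"])
train_rows, test_rows = shuffle_and_split(list(range(10)), test_fraction=0.2, seed=0)
```

In practice the mean and standard deviation are usually computed on the training rows only and then reused on the test rows, so that no information from the test set leaks into training.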
Step 3: Try Different Algorithms
Name                                    Average MSE
1.11.2.1. Random Forests                0.241214063
1.11.4. Gradient Tree Boosting          0.273875445
1.11.1. Bagging meta-estimator          0.296172365
1.11.2.2. Extremely Randomized Trees    0.296710726
1.6.3. KNeighborsRegressor uniform      0.306133182
1.6.3. KNeighborsRegressor distance     0.319488307
1.10. DecisionTreeRegressor             0.336486662
1.10. ExtraTreeRegressor                0.40337387
1.4.2 SVR poly                          0.429585937
1.4.2 NuSVR poly                        0.434766842
1.11.3. AdaBoost                        0.443524744
1.4.2 SVR rbf                           0.476364995
1.1.9.1. Bayesian Ridge Regression      0.567228078
1.1.4. Elastic Net                      0.56727658
1.1.2. Ridge Regression                 0.567611415
1.1.1. Ordinary Least Squares           0.567641956
1.1.11. Stochastic Gradient Descent     0.573168168
1.1.8. Orthogonal Matching Pursuit      0.576630178
1.1.14.3. Theil-Sen estimator           0.5875179
1.4.2 SVR linear                        0.642531415
1.4.2 LinearSVR                         0.667162534
1.1.14.2. RANSAC                        0.705499997
1.1.13. Passive Aggressive Algorithms   0.726516853
1.1.3. Lasso                            0.899948627
1.1.7. LARS Lasso                       0.899948627
1.4.2 SVR sigmoid                       0.937398784
1.8. Cross decomposition PLSRegression  1.662293485
1.6.3. NearestCentroid                  1.701974047
1.8. Cross decomposition PLSCanonical   10.72550448
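An "Average MSE" per algorithm is typically obtained by k-fold cross validation: fit on k-1 folds, score on the held-out fold, and average the k scores. The sketch below shows that loop in plain Python for two toy models. It is an assumption about the procedure rather than the project's code, and the models and data are hypothetical stand-ins for the scikit-learn estimators listed above.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def mean_baseline(train_X, train_y):
    """Model that always predicts the mean training target."""
    mu = mean(train_y)
    return lambda x: mu


def one_nearest_neighbour(train_X, train_y):
    """Model that predicts the target of the closest training point."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X]
        return train_y[dists.index(min(dists))]
    return predict


def average_cv_mse(make_model, X, y, k):
    """Average the per-fold mean squared error over k folds."""
    fold_mse = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = make_model([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_mse.append(mean((model(X[i]) - y[i]) ** 2 for i in test_idx))
    return mean(fold_mse)


# Toy data: the target is exactly 2 * x.
X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
scores = {
    "mean baseline": average_cv_mse(mean_baseline, X, y, k=5),
    "1-nearest neighbour": average_cv_mse(one_nearest_neighbour, X, y, k=5),
}
ranking = sorted(scores, key=scores.get)  # lowest average MSE first
```

The folds here are contiguous, so real data should be shuffled first, as in Step 2.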
Step 4: Tune Parameters of Algs.
• Grid Search
  – Exhaustively search the possible combinations of parameters
  – Takes too much time on my computer
• Random Search
  – Takes less time
  – Result is similar to grid search
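The difference between the two strategies can be sketched in a few lines. This is a plain-Python illustration, not scikit-learn's implementation: `toy_objective` is a hypothetical stand-in for "cross-validated MSE as a function of the parameters", and the parameter names and ranges are chosen only for the example.

```python
import itertools
import random


def grid_search(objective, param_grid):
    """Evaluate every combination; return (best_params, best_score, n_evals)."""
    names = sorted(param_grid)
    best_params, best_score, n_evals = None, float("inf"), 0
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = objective(params)
        n_evals += 1
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evals


def random_search(objective, param_grid, n_iter, seed):
    """Evaluate only n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {n: rng.choice(param_grid[n]) for n in names}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_iter


def toy_objective(params):
    """Hypothetical error surface with its minimum at n_neighbors=7, p=2."""
    return (params["n_neighbors"] - 7) ** 2 / 100 + abs(params["p"] - 2) / 10 + 0.2


param_grid = {"n_neighbors": list(range(1, 31)), "p": [1, 2, 3]}
grid_best = grid_search(toy_objective, param_grid)
random_best = random_search(toy_objective, param_grid, n_iter=20, seed=0)
```

Grid search needs 90 evaluations here and random search only 20. Random search can never beat the grid optimum on the same grid, but it often lands close to it at a fraction of the cost.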
Let’s see tuning parameters…
Support Vector Regression
KNN
Step 5: Evaluate
• Feature Importance
• Final MSE for 4 selected algorithms
• Compare rents with Zoopla Estimate
Feature Importance: Random Forest
Feature Importance: GBR
[Figure: MSE on Cross Validation and new listings data. Series: KNN, GBR, RF, SVR, standard deviation. X-axis: Cross Validation (folds 1 to 10) / Score on new data. Y-axis: MSE (0 to 2).]
Final Result
Algorithm                Final Result (MSE)   MSE from Step 3   Fitting Time (42003 examples)   Predicting Time (3582 examples)
Random Forest            0.108435602          0.241214063       53.12 sec                       2.03 sec
Gradient Tree Boosting   0.117256254          0.273875445       149.18 sec                      0.45 sec
Support Vector Machine   0.143577993          0.429585937       3192.02 sec                     4.54 sec
K-Nearest Neighbors      0.217186025          0.306133182       3.97 sec                        3.82 sec
Actual rent and predicted rent (Random Forest)
[Figure: predicted and actual rent per listing. Series: predicted, actual. X-axis ticks: 4 to 3500. Y-axis: Rent (£), 0 to 5000.]
Compare rents with Zoopla Estimate (1/2)
Zoopla Estimate
Actual Rent
Predicted Rent by Random Forests
£381.5120443 pw = £1653 pcm (pw x 52 / 12 = pcm)
Compare rents with Zoopla Estimate (2/2)
Zoopla Estimate
Actual Rent
Predicted Rent by Random Forests
£1488.237929 pw = £6449 pcm
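The weekly-to-monthly conversion used on these two slides (pw x 52 / 12 = pcm) can be checked with a short sketch. The function name is illustrative, and the inputs are the two predicted weekly rents quoted above.

```python
def weekly_to_monthly(rent_per_week):
    """Convert rent per week (pw) to rent per calendar month (pcm): pw x 52 / 12."""
    return rent_per_week * 52 / 12


# The two Random Forest predictions from the comparison slides, in pounds per week.
predicted_pw = [381.5120443, 1488.237929]
predicted_pcm = [round(weekly_to_monthly(pw)) for pw in predicted_pw]
# predicted_pcm == [1653, 6449]
```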
Conclusion
• Random Forest works best for this problem
• Data quality influences the prediction results more than the parameters of the machine learning algorithms do
• Could not compare all the predicted rents with the Zoopla estimate, but some predictions were closer to the actual rent than the Zoopla estimate
Future Work
• Add more room-specific information, such as room size and age
• Make an app that predicts rent from an address, # of bedrooms, # of bathrooms, # of floors and property type
Challenges
• Collect Data
  – Time consuming
  – Hard to find a good dataset
• Statistics
  – Possible to use machine learning without knowing math/statistics
  – But math/statistics is needed to understand in depth what ML algorithms do and to tune the parameters efficiently
What I learned
• Python
• Scikit-learn / Tableau / Google Maps API / Walkscore API / Coordinate systems (MapIt API)
• How to apply machine learning algorithms
• Collecting a good dataset is more important than the choice of algorithm
References
• Walkscore
  – https://www.walkscore.com
• MapIt
  – http://mapit.poplus.org
• Google Maps API
  – https://developers.google.com/maps/documentation/javascript/
• Scikit-learn
  – http://scikit-learn.org/stable/
• London Datastore
  – http://data.london.gov.uk
• Tableau
  – http://www.tableau.com
References
• Zoopla
  – http://www.zoopla.co.uk
  – Examples from Zoopla
    • http://www.zoopla.co.uk/property/101-greyhound-road/london/n17-6xr/15262720
    • http://www.zoopla.co.uk/to-rent/details/36920785#5yJdKDM4BovT5eu6.97
    • http://www.zoopla.co.uk/property/28-cato-street/london/w1h-5jj/28909969
    • http://www.zoopla.co.uk/to-rent/details/37005409?search_identifier=0f64a06eeb798647935af065dcaf87c4#V6Xmr062sEqY198c.97
References
• Data-driven cities: City slicker - Data are slowly changing the way cities operate (The Economist)
  – http://www.economist.com/news/britain/21629533-data-are-slowly-changing-way-cities-operate-city-slicker
• CS7641.TNL.MATLAB. Supervised Learning Workflow and Algorithms
  – http://wiki.omscs.org/confluence/display/CS7641ML/CS7641.TNL.MATLAB.+Supervised+Learning+Workflow+and+Algorithms
• Coursera: Machine Learning by Andrew Ng
  – https://www.coursera.org/course/ml
• Questions?
MSE
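• MSE (mean squared error) is the average of the squared differences between predicted and actual values.
• Its square root, the RMSE, is roughly the typical distance of a data point from the fitted line (representing predictions made by the model), measured along a vertical line.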
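A minimal sketch of how MSE and RMSE relate, in plain Python. The rents below are made-up numbers for illustration and are not taken from the project's data.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical weekly rents in pounds.
actual = [300.0, 450.0, 500.0]
predicted = [310.0, 430.0, 500.0]
mse = mean_squared_error(actual, predicted)        # (100 + 400 + 0) / 3
rmse = root_mean_squared_error(actual, predicted)  # sqrt(mse)
```

Because the errors are squared before averaging, one large miss raises the MSE far more than several small ones.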
Cross Validation
https://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
Random Forests
http://provectus.com/blog/news/research_paper_for_load_forecast
Gradient Tree Boosting
http://provectus.com/blog/news/research_paper_for_load_forecast
Support Vector Machine
http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
K-Nearest Neighbors
http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/
What is Machine Learning?
• Supervised learning
  – Classification
  – Regression
Fitting/Training: # of bedrooms, lat/long + rent
Predicting: # of bedrooms, lat/long → Predicted Rent
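The fitting and predicting phases can be illustrated with a tiny K-Nearest Neighbors regressor written from scratch. This is a sketch under assumptions: the class is not scikit-learn's `KNeighborsRegressor`, and the listings and rents are invented.

```python
import math


class KNearestNeighborsRegressor:
    """Predict the mean target of the k closest training examples (uniform weights)."""

    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # Fitting a KNN model only stores the training examples.
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            by_distance = sorted(
                zip(self.X_, self.y_), key=lambda pair: math.dist(x, pair[0])
            )
            nearest = [target for _, target in by_distance[: self.k]]
            predictions.append(sum(nearest) / len(nearest))
        return predictions


# Hypothetical listings: (# of bedrooms, latitude, longitude) -> weekly rent in pounds.
train_X = [(1, 51.50, -0.12), (2, 51.51, -0.10), (3, 51.55, -0.20), (1, 51.40, -0.05)]
train_y = [350.0, 480.0, 600.0, 250.0]

# Fitting/Training: features and known rents go in.
model = KNearestNeighborsRegressor(k=2).fit(train_X, train_y)

# Predicting: features only go in, predicted rent comes out.
predicted_rent = model.predict([(2, 51.50, -0.11)])
```

With unscaled features, the bedroom count dominates the distance because latitude and longitude differ only in the second decimal place. That is one reason features are scaled before fitting.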