
Research Article

A Fully Automated Adjustment of Ensemble Methods in Machine Learning for Modeling Complex Real Estate Systems

Jose-Luis Alfaro-Navarro (1), Emilio L. Cano (2), Esteban Alfaro-Cortes (2), Noelia García (1), Matías Gamez (2), and Beatriz Larraz (2)

(1) Faculty of Economics and Business Administration, University of Castilla-La Mancha, Albacete, Spain
(2) Quantitative Methods and Socio-Economic Development Group, Institute for Regional Development (IDR), University of Castilla-La Mancha (UCLM), Albacete, Spain

Correspondence should be addressed to Beatriz Larraz; beatriz.larraz@uclm.es

Received 12 November 2019; Accepted 19 March 2020; Published 14 April 2020

Guest Editor: Marco Locurcio

Copyright © 2020 Jose-Luis Alfaro-Navarro et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The close relationship between collateral value and bank stability has led to a considerable need for rapid and economical appraisal of real estate. The greater availability of information related to housing stock has prompted the use of so-called big data and machine learning in the estimation of property prices. Although this methodology has already been applied to the real estate market to identify which variables influence dwelling prices, its use for estimating the price of properties is less frequent. The application of this methodology has become more sophisticated over time, from applying simple methods to using the so-called ensemble methods, and while the estimation capacity has improved, it has only been applied to specific geographical areas. The main contribution of this article lies in developing an application for the entire Spanish market that fully automatically provides the best model for each municipality. Real estate property prices in 433 municipalities are estimated from a sample of 790,631 dwellings, using different ensemble methods based on decision trees, such as bagging, boosting, and random forest. The results for estimating the price of dwellings show a good performance of the techniques developed in terms of the error measures, with the best results being achieved using bagging and random forest.

1. Introduction

From 2008 onwards, the global economic crisis caused a slowdown in the economy, which resulted in a decrease in the price of real estate properties. The appraised valuation of a property is considered key for any transaction related to the property, and particularly for its sale or for a mortgage application, so it is essential that the price is a true reflection of its value. Banks also need to periodically review the value of their real estate portfolio by updating their appraised valuation; see the Basel II International Banking Agreement [1, 2]. Normally, appraisals for the purpose of a mortgage are carried out by professional appraisers visiting the property. However, the appraisal procedure developed in this way is expensive, both in terms of time and money, and this makes the procedure unsustainable for the valuation of large real estate portfolios. Furthermore, although the physical presence of an appraiser may help to give a more accurate valuation of the property, there is also the possibility of bias from interested parties such as buyers, sellers, or the banks themselves, which may make the valuation more subjective. There is clearly a need for the development of a prediction model which can present unbiased, realistic valuations.

The International Association of Assessing Officers (IAAO) considers mass appraisal as the process of valuing a group of properties using common data, standardized methods, and statistical procedures [3]. These valuation methods have been implemented through models known as Automated Valuation Models (AVM) and have enabled the appraisal of large real estate portfolios without the direct intervention of an appraiser [4, 5]. The development of these estimation procedures has been enhanced by the growth in the quantity and quality of information, related to both real estate prices and property characteristics, accessible to researchers. This level of information allows the application of increasingly sophisticated statistical techniques for the development of estimation procedures of a higher quality and precision. Real estate AVMs allow the valuation of property prices en masse, without the need for the physical presence of an appraiser, by using computer-assisted appraisal systems [6]. In many cases, the presence of an appraiser is only necessary for those valuations considered to be out of the ordinary [7].

Hindawi, Complexity, Volume 2020, Article ID 5287263, 12 pages, https://doi.org/10.1155/2020/5287263

Estimation techniques include parametric regression analysis [8] and nonparametric [9] or machine learning methods such as neural networks [10, 11], decision trees [12, 13], random forests [14, 15], fuzzy logic [16], or ensemble methods [17]. These techniques are used primarily with three objectives in mind: to estimate a real estate property price, to find out the influence of a characteristic of the house on its price, and to create a hedonic price index.

Over recent decades, the most commonly used procedures have been based on hedonic regression [18, 19]. However, these models present certain fundamental problems related to the assumptions of the model: normality of the residuals, homoscedasticity, independence, and the absence of multicollinearity. This situation has led to greater use of pattern recognition techniques, often known as data mining techniques, which include machine learning. These techniques are more flexible about the assumptions related to the distribution of data, they are easier to interpret, and they allow linear and nonlinear relationships to be analysed. In addition, they enable both categorical and continuous variables to be managed [13]. Although these techniques were initially used more as classification methods, in recent years they have been applied to determining the most influential variables on house pricing and to estimating dwelling prices. Perez-Rave et al. [20] provide a two-stage methodology for the analysis of big data regression under a machine learning approach, for both inferential and predictive purposes.

Accurate and efficient prediction of real estate prices has been, and will continue to be, an essential but controversial issue, with an impact on the various actors in the economy, such as buyers, sellers, commission agents, governments, and banks [21, 22]. Nowadays, the big data paradigm offers exciting possibilities for more accurate predictions, and one of the main approaches for dealing with big data is machine learning.

These machine learning methods have been applied to the estimation of real estate prices in very specific locations. Research has been carried out by Jaen [12] in Coral Gables (Florida, USA), Fan et al. [13] in Singapore (Republic of Singapore), Ozsoy and Sahin [23] in Istanbul (Turkey), Del Cacho [14] in Madrid (Spain), Pow et al. [17] in Montreal (Canada), Ceh et al. [24] in Ljubljana (Slovenia), Nguyen [25] in 5 counties of the USA, and Dimopoulos et al. [26] in Nicosia (Cyprus). In contrast, Perez-Rave et al. [20] deal with the estimation of dwelling prices for a whole country, Colombia, using independent variables that identify the city in which each property is located, thus proposing a single model for the entire country (with a sample of 61,826 properties).

The main new element of this article is that it proposes a new methodology to carry out the automated estimation of real estate prices for an entire country (Spain, in this case study), automatically specifying a different model for each municipality, with a sample size of 790,631 real estate properties. The whole country can be considered a complex real estate system due to the great differences that exist between rural and urban areas, and even inside the urban areas. Each municipality's model is trained with the information available in its own training set, so that each one has a model adapted to its characteristics and needs. This study presents a program capable of fitting a different model for each of the 433 municipalities with more than 100 properties for sale, with populations ranging from 1,559 inhabitants in the smallest municipality to 3,223,334 in Madrid, the biggest one. We focus on the application of this automated valuation system based on machine learning methods in estimating real estate prices, and analyse the accuracy of each method using error measures widely recognised in the economic literature. As a rule, aggregated diagnostic indicators (such as the coefficient of determination) are used to evaluate model quality, and there are few contributions in the relevant literature where the quality of the procedure is analysed using a measurement of the estimation error [15]. In this article, we use four measures to analyse the validity of the proposed methods, namely, the mean ratio, the mean absolute percentage error (MAPE), the median absolute percentage error (MdAPE), and the coefficient of dispersion (COD).
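The four measures named above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: it uses the conventional definitions (the sales ratio is the estimated price divided by the actual price, and the COD is the mean absolute deviation of the ratios from their median, expressed as a percentage of that median), and the function names are hypothetical rather than taken from the article.

```python
from statistics import mean, median


def sales_ratios(estimated, actual):
    """Ratio of estimated to actual price for each dwelling."""
    return [e / a for e, a in zip(estimated, actual)]


def mean_ratio(estimated, actual):
    """Mean of the sales ratios; values near 1 suggest no systematic bias."""
    return mean(sales_ratios(estimated, actual))


def mape(estimated, actual):
    """Mean absolute percentage error, in percent."""
    return 100 * mean(abs(e - a) / a for e, a in zip(estimated, actual))


def mdape(estimated, actual):
    """Median absolute percentage error, in percent."""
    return 100 * median(abs(e - a) / a for e, a in zip(estimated, actual))


def cod(estimated, actual):
    """Coefficient of dispersion (assumed conventional definition), in percent."""
    ratios = sales_ratios(estimated, actual)
    med = median(ratios)
    return 100 * mean(abs(r - med) for r in ratios) / med
```

Each function takes two equal-length sequences of positive prices. The median-based measures (MdAPE and COD) are less sensitive to a few badly estimated dwellings than the mean-based ones, which is one reason to report them side by side.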

In this article, the machine learning techniques used in estimating the price of dwellings are based on decision trees. However, in general, it is difficult to build a single tree to make predictions because of incorrect parameter settings, simplicity rules, and tree instability. To overcome these problems and obtain better behaviour when making predictions, ensembles of decision trees have been developed, such as bagging, boosting, and random forest [27]. In bagging, the models are fitted using random independent bootstrap replicates, which are then combined by averaging the outputs in the case of regression [28]. In boosting, the fitted model is a simple linear combination of many trees that are fitted iteratively, reweighting poorly modelled observations at each step [29]. The random forest model, in turn, is built from random vectors of the feature space that are sampled independently [30]. Starting from this base, we automatically fit the best ensemble method, including bagging, boosted regression trees, and random forest, in each municipality, and then make a comparison to analyse their behaviour under different circumstances. In addition, the results obtained with a single decision tree are included in order to analyse and compare the benefit of using ensembles of decision trees.

A further consideration is the particular emphasis that the specialized literature places on the need to include spatial information in hedonic models, given the significant influence that the location of the property can have on its price and, therefore, on its valuation. Ceh et al. [24] highlight the growing interest in recent years in applying spatial statistics to hedonic price modeling, besides coupling geographic information systems and machine learning techniques. In this article, we include geographical coordinates among the explanatory variables to draw attention to the importance of location when valuing a property.

The article is laid out as follows. After the introduction, Section 2 presents a literature review highlighting the main research to date on the valuation of real estate property prices, from the use of regression trees to tree ensemble models. In Section 3, the methodology used is presented, with a description of the main techniques as well as the measures used to evaluate the behaviour of the different models. In Section 4, the empirical application to the whole of Spain is developed, which supports the need for using ensemble methods for Spain. Finally, Section 5 presents the main conclusions and further lines of research.

2. Literature Review

The application of machine learning methods in the field of estimating property prices has attracted interest for some years now. However, the application of decision trees is relatively recent, initially being used as a classification technique and for determining which variables had the greatest influence on the price of housing. Decision trees were then used as a prediction technique, through the so-called regression tree, to obtain dwelling price predictions. One of the first proposals for the application of a regression tree was made by Jaen [12], who used information from 15 variables for 1,229 transactions in the city of Coral Gables (Florida), taken from the multiple listing system (MLS). Jaen [12] tests the effectiveness of using stepwise regression, CART decision trees, and neural networks in estimating the price of housing and in determining the most important variables for this prediction. The best results, measuring the estimation capacity with the mean absolute error (MAE), are achieved with CART, which uses a smaller number of variables: five, versus the nine used in stepwise regression.

Following on from [12], Fan et al. [13] demonstrate the good behaviour of regression trees using the CART algorithm to identify the main determinants and predict the price of housing. This application is developed for the Singapore resale public housing market. However, although the process used to identify the main variables that affect the price of housing is extensive, its estimation is based solely on the average value in a leaf node of the tree, this value being taken as the forecast or regression value. Ozsoy and Sahin [23] develop a CART application in Turkey to determine the most influential characteristics on the price of housing in Istanbul, based on a database taken from the Internet in 2007. The results lead them to conclude that the size of the house and the existence of an elevator, security, central heating, and views are the most influential variables on house prices in Istanbul.

Kok et al. [31] show the main advantages of the application of regression trees to predict property prices, taking into account that these models help overcome the problem of regression models with nonlinear relationships. The advantages highlighted by the authors are as follows: they are simple to understand and interpret, and their statistical significance is easy to calculate; they can handle categorical variables without creating dummy variables; and they consume little computing time, even with large amounts of data. In addition, the authors propose the use of a procedure called stochastic boosting, which allows an unlimited number of variables to be handled with good results, including economic and demographic variables and hyperlocal metrics in the prediction model. The limitations of regression trees are as follows: they can grow without limit until each leaf contains a single observation, which may generate models with poor generalization capacity; they are not robust to changes in the training set; and they usually suffer from overfitting, giving rise to models with little predictive capacity. To solve these limitations, the authors propose the use of tree ensembles such as random forest. Though it is true that these models had been used before in the pioneering works in [14, 15], Breiman [30] produced one of the first papers to highlight the need to improve prediction using ensemble methods.

Following on from these papers, there have been numerous proposals that compare the behaviour of ensemble techniques with classical regression models, concluding that models behave better with machine learning techniques. Likewise, Pow et al. [17] use 25,000 web records on Montreal properties with 130 characteristics, 70 related to the housing itself and 60 sociodemographic. These authors use principal component analysis (PCA) to reduce the dimension and four regression techniques to predict property prices (linear regression, support vector machine, K-nearest neighbors (KNN), and random forest regression), together with an ensemble approach that combines KNN and random forest. From the results, the authors highlight the good behaviour of the ensemble approach, with a mean absolute percentage difference for the asking price of 9.85%. In addition, they show that applying PCA does not improve the prediction error.

Ceh et al. [24] analyse the behaviour of random forest compared to multiple regression to select the most important variables. In the case of multiple regression, a principal component analysis reduces the 36 variables to 10 principal components, and in the case of random forest, a procedure is carried out to determine the 10 most important variables. Interestingly, the date of sale is important for random forest but not for ordinary least squares (OLS). Although the behaviour of random forest in terms of COD and MAPE is better than that of OLS, it should be noted that both overestimate the lowest prices and underestimate the highest. Specifically, in the application developed for the price of apartments in Ljubljana (Slovenia), with 7,497 observations for the 6-year period 2008-2013, the results in terms of MAPE for the test set were 7.27% for RF and 17.48% for multiple OLS, while in terms of COD, the values obtained were 7.28% and 17.12%, respectively. Although the authors state that their model does not take into account the potential price differences over the 6-year period under consideration, this price change could influence their results. In our study, we use a static database from 2018 to avoid this problem.

Nguyen [25] develops an application in five counties in the United States using Zillow Group web data, comparing linear regression models, random forest, and support vector machine. The results lead the author to conclude that both random forest and support vector machine behave better than linear regression in terms of the percentage of houses whose estimated prices fall within a 5% range of their actual sold prices. In addition, the conclusion emphasizes that it is not necessary to change the variables used in each county and that the accuracy of the model is practically the same using a series of common attributes for all of them. Dimopoulos et al. [26] develop an application to compare the behaviour of random forest and linear regression in estimating the prices of residential apartments in Nicosia (Cyprus). The results verify that the best behaviour in predictive terms is that of random forest, with average MAPE values of 25.2%. Shinde and Gawande [32] use data based on 3,000 observations with 80 parameters from a database called Kaggle Inc. to compare the behaviour of logistic regression, support vector regression, lasso regression, and decision tree, and show that the best behaviour, both in terms of accuracy and of error, is achieved with the decision tree. The variables used to estimate the sale price are area in square metres; overall quality, which includes the overall condition and finish of the dwelling; location; the year in which the house was built; number of bedrooms and bathrooms; garage area and number of cars that can fit in the garage; swimming pool area; the year in which the house was sold; and the price at which the house was sold.

In addition to the comparison in the literature of machine learning techniques with classical regression models, there is a wide range of literature that compares different machine learning methods, concluding that no one technique behaves better than all the others, while highlighting the good behaviour of tree ensemble techniques. For example, Kagie and Wezel [33] use Friedman's LSBoost and LADBoost boosting algorithms designed for regression, with three main objectives: to predict dwelling prices in six areas in the Netherlands, to determine the most important characteristics, and to build a price index. To do this, they use transaction data from the year 2004 obtained from the Nederlandse Vereniging van Makelaars (NVM, Dutch Association of Real Estate Brokers) for Groningen, Apeldoorn, Eindhoven, Amsterdam, Rotterdam, and Zeeland, with 83 variables, also including sociodemographic variables, and a number of observations ranging from 2,216 for Zeeland to 8,490 for Amsterdam. The results show that both boosting models improve on the behaviour of linear and nonlinear models in the six areas considered, with improvements in terms of the absolute error of around 25-30% and in relative error of around 33-39%. In addition, they show that the models present lower prediction errors for terraced houses and apartments and higher prediction errors for detached houses, which is consistent considering that the most influential characteristic on dwelling price is the size of the house.

Del Cacho [14] compares different ensemble methods for housing valuation in Madrid, based on a sample of 25,415 observations taken from an online real estate portal. The results show a better behaviour of ensembles of M5 model trees and of bagging with unpruned decision trees, with a mean relative error of 15.25%. Similar results, with median percentage errors of 15.11% and 13.18%, are obtained by Clark and Lomax [36] for the English private rental market using gradient boost [34] and Cubist [35], respectively. Graczyk et al. [37] use six machine learning algorithms, namely, multilayer perceptron (MLP), radial basis function neural network for regression problems (RBF), pruned model tree (M5P), M5Rules (M5R), linear regression model (LRM), and NU-support vector machine (SVM), as base learners for three ensemble methods: additive regression (an implementation of boosting), bagging, and stacking, all in the Waikato Environment for Knowledge Analysis (WEKA). The results show that there are differences between the simple and ensemble methods used, although all of them behave well in terms of MAPE, with values ranging from 19.02% to 15.89%. Bagging results are the most stable, with better results using SVM. However, the best results are obtained using stacking and SVM. The general conclusion of the study is that there is no single algorithm that produces the best results, and therefore, it is necessary to investigate the behaviour of different alternatives.

Antipov and Pokryshevskaya [15] show that random forest behaves best when estimating prices per square metre rather than the total price, due to heteroscedasticity and other real estate data problems. They compare the behaviour of 10 algorithms: multiple regression, CHAID, exhaustive CHAID, CART, k-nearest neighbors (2 modifications), multilayer perceptron neural network (MLP), radial basis function neural network (RBF), boosted trees, and random forest. Each method is evaluated with the metrics usually applied to validate the predictive capacity of automated valuation models, such as the average sales ratio (SR), the coefficient of dispersion (COD), and the mean absolute percentage error (MAPE). All the analysed techniques showed acceptable values for all the metrics, both in the training set and in the test set, with the best results for random forest, with a MAPE of 17.25% and a COD of 16.97%, while using a two-step procedure these are 14.86% and 14.77%, respectively. In addition, this study proposes a classification of variables according to their relevance, highlighting the importance of the type of house and the district in which it is located. It also recommends a segmentation-based diagnostic method that determines segments based on the total area and the district in which the house is located, with any overestimated or underestimated value signalling the need for the intervention of an appraiser. However, the main drawback of this study is that the data are too limited, focusing on 2-bedroom apartments with an area of up to 160 m² and a price below 30 million rubles. Such a limited profile is an unrealistic reflection of most cities.

Lasota et al. [38] propose that, instead of using a single expert machine learning system, a combination of these should be used. They argue that, in this way, the risk of selecting a poor model would be reduced in some cases, and large volumes of data could be analysed efficiently by applying the procedure to small partitions of the data and combining the results. This proposal is compared with the individual methods and with two ensemble machine learning methods, mixture of experts (MoE) and AdaBoost.R2 (AR2), concluding that AdaBoost and this mixture of machine learning procedures show better behaviour, with no significant differences between the methods. In the case of MoE, the algorithms multilayer perceptron, general linear model, and support vector regression are used, while for AR2, multilayer perceptron, general linear model, and a regression tree are used. It is the mixture of machine learning procedures with multilayer perceptron and general linear model that shows the best behaviour, without significant differences between MoE and AR2. However, Lasota et al. [38] used information for the period 1998-2011, with the problem, as highlighted by the authors, of comparability in the data. They also use only four characteristics as explanatory variables, which can give rise to an oversimplified model and could be the reason for the good behaviour of ensemble techniques with simple base techniques such as the general linear model.

Another comparison, in this case of random forest with other machine learning methods, was developed by Yoo et al. [39]. Machine learning is used to determine the variables which have the greatest impact on the price of housing in Onondaga (New York) and to establish a way of estimating dwelling prices. Specifically, OLS regression methods are compared with Cubist and random forest. In terms of determining the most important variables, OLS uses a stepwise selection based on the level of significance, whereas RF and Cubist use boosting or bagging techniques that permit the handling of nonlinear models, as they are nonparametric procedures. For predictability, the behaviour of the two machine learning techniques is better, with RF standing out in terms of root mean squared error (RMSE), with values relative to the average of 25.04% considering a neighbourhood within a radius of 100 metres and 22.47% within a radius of 1 km for the test set. In addition, the model also incorporates environmental variables, which have not previously been included in these types of models. The authors highlight that the application of machine learning methods in the selection of variables allows key variables to be selected without being based on a level of significance. These methods also allow a sufficiently parsimonious set of important variables to be found for good prediction, which means it is not so important that the model contains all relevant variables, as long as the prediction works well. Park and Bae [40] compare C4.5, RIPPER, Naive Bayesian, and AdaBoost in the residential market of Fairfax County, Virginia, concluding that the best behaviour is achieved with RIPPER. However, their study uses these techniques for classification, not regression, classifying properties based on whether the difference between what they call closing (sold) prices and listing (for sale) prices is positive or negative.

Shahhosseini et al. [41] compare the behaviour of several ensemble models for the prediction of dwelling prices using two databases widely cited in the relevant literature: the Boston metropolitan area dataset [42] and the sales database of residential homes in Ames (Iowa) presented in [43]. To demonstrate the validity of the ensemble models, they use multiple learners, including lasso regression, random forest, deep neural networks, extreme gradient boosting (XGBoost), and support vector machines with three kernels (polynomial, RBF, and sigmoid). Based on the results of the median price prediction error for Boston, the best performance in terms of MAPE appears for XGBoost and random forest, with MAPE values of 16.44% and 16.35%, respectively. In the case of Ames housing, lasso and random forest are the models with the best MAPE, with values of 0.66% and 0.77%, respectively. These remarkably low errors are attributable to the quantity and quality of information related to the 80 available variables, as well as the huge sample size (2,930) in relation to the population of Ames (Iowa, USA), 50,781 inhabitants. Therefore, these results lead us to conclude that there is no one model which performs better than the others.

Finally, Neloy et al. [44] develop a model for predicting the rental price of houses in Bangladesh from a website database of 3,505 homes with information on 19 characteristics. To develop the model, the following simple algorithms are selected for prediction: advanced linear regression, neural network, random forest, support vector machine (SVM), and decision tree regressor. In addition, the ensemble learning is stacked with the following algorithms: ensemble AdaBoosting regressor, ensemble gradient boosting regressor, and ensemble XGBoost. Also, ridge regression, lasso regression, and elastic net regression are used to combine the advanced regression techniques. The best results in terms of accuracy are obtained by the ensemble gradient boosting, with 88.75%, and the worst by the ensemble AdaBoosting, with 82.26%. In terms of root mean square error (RMSE), the behaviour is similar, with values of 0.1864 and 0.2340, respectively.

Other uses of the decision tree include the application of the CART algorithm to segment the observations and to improve the estimation ability of the model by applying different models by segment, or even with the assistance of an appraiser if necessary [45]. To do this, a CART algorithm is applied using the percentage error (the absolute value of the estimated value less the real value, divided by the real value) as a dependent variable, together with the sales ratio (estimated value divided by real value), to determine segments of observations. These segments allow the MAPE to go from a general value of 12.688% in the training sample to 9.783% in the best segment, and from 14.859% to 12.364% in the test sample. Perez-Rave et al. [20] propose a methodology that incorporates a variable selection procedure called simple incremental with resampling (MINREM). This procedure is used in combination with a principal component analysis in two cases: 61,826 homes sold in Colombia and the data used in [46] from the 2011 Metropolitan American Housing Survey, with 58,888 observations. In the case of housing in Colombia, the results show a MAPE value of 27% without using interactions and 20.9% using the proposed procedure.

From all these studies, it follows that the analysis of the behaviour of different machine learning techniques to analyse the price of housing has been widely covered in the literature. While the majority of the applications stress the importance of determining the most influential variables on the price of housing, there are few applications which focus on prediction and, above all, there are few studies that use measures such as MAPE or COD to help evaluate the predictive capacity of the models; the majority measure the predictive capacity using the coefficient of determination. In addition, the applications developed focus on specific areas or cities, without trying to cover a wide geographical area (except in the case of Colombia, where the study uses the same model for the whole country). In this study, we cover a wider geographical area by developing, through an automated procedure for estimating models, a model to be applied to each of the Spanish municipalities where information is available. This gives us a total of 433 municipalities.

3. Methodology

As has been stated, the aim of this article is to develop an automatic application that contains, for each municipality, a model capable of accurately estimating the price of housing. Several models are fitted in each municipality from a range of competing machine learning techniques. They are then analysed in order to check whether there is one best method that achieves optimal results in terms of the error measures explained at the end of this section.

The selected models are bagging, boosting, and random forest. All of them are ensemble algorithms, and we use regression trees as base learners. For this reason, the results of the single decision tree model will also be displayed as a reference alongside the results of the more complex models. The ensemble methods usually provide good prediction results, although it is true that they sacrifice, to some extent, the possibility of interpreting the relationships between the predictor variables and the target. In our context, given the large number of models that will be estimated to completely cover the Spanish territory, accurate predictions are more important than easily interpretable models.

The following briefly describes each of these ensemble methods. To begin with, bagging is an ensemble method proposed in [47] that builds on bootstrapping and aggregating. Its main advantage is that it reduces the influence of noisy observations, because each tree is fitted to a random sample drawn with replacement from the original set. Once the trees are fitted to the bootstrap samples, their outputs are averaged. This noise reduction, coupled with the instability often shown by individual predictors, leads bagging to improvements, especially for unstable procedures.
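As an illustration only, the bootstrap-and-average idea can be sketched in a few lines of Python. The models in this study were fitted in R with full regression trees; the single-split "stump" used here as base learner, the function names, the seed, and the default of 50 trees are all assumptions made to keep the sketch short.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor (least squares)."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x values identical: fall back to a constant
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr


def bagging(xs, ys, n_trees=50, seed=0):
    """Fit n_trees stumps on bootstrap samples and average their outputs."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(tree(x) for tree in trees)


# Hypothetical toy data: constructed area (m2) and price (thousand euros).
areas = [50, 60, 70, 80, 90, 100, 110, 120]
prices = [100, 105, 98, 102, 200, 210, 195, 205]
model = bagging(areas, prices)
```

Each bootstrap sample yields a slightly different split, and averaging the resulting predictions smooths out that instability.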

For its part, boosting [48] is an ensemble method capable of converting a weak learner into one with much higher accuracy. Like bagging, boosting applies an iterative learning process. Its distinguishing characteristic is that each iteration is not independent of the previous ones but uses a reweighting system to focus the learning process on the observations that were estimated with higher errors in earlier steps. The algorithm chosen to implement boosting in this article is gradient boosting [34], which adds weak learners, such as decision trees, using a gradient descent procedure to minimize a loss function.
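A minimal Python sketch of this idea for the squared-error loss is given below, where the negative gradient is simply the current residual. It is not the gbm implementation used in the study: the stump base learner, the function names, and the default values are assumptions chosen for illustration.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr


def gradient_boost(xs, ys, n_trees=100, shrinkage=0.1):
    """Gradient boosting for squared error: each stump is fitted to the residuals."""
    f0 = mean(ys)                      # initial constant prediction
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # For the loss (y - f)^2 / 2 the negative gradient is the residual y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step (the learning rate) in the direction of the new stump.
        preds = [p + shrinkage * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + shrinkage * sum(s(x) for s in stumps)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


xs = list(range(1, 11))
ys = [float(x * x) for x in xs]
boosted = gradient_boost(xs, ys)
mse_constant = mse(ys, [mean(ys)] * len(ys))
mse_boosted = mse(ys, [boosted(x) for x in xs])
```

Unlike bagging, the stumps here are not independent: each one depends on the errors left by all the previous ones.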

Random forest, also proposed by Breiman [30], can be seen as a variation of the bagging method with a higher dose of randomness. This added randomness arises because, when constructing the successive trees, the optimal split is not sought among all the available predictor variables but only among a subset randomly chosen at each node. The main advantage of this method is that it incurs a lower risk of overfitting and therefore usually provides more accurate estimates. It should be noted that bagging is a special case of random forest in which the subset of candidate variables contains all the available predictors.

All the models have been fitted in the statistical environment R [49]. Specifically, the R packages rpart [50], gbm [51], and randomForest [52] have been used for fitting individual trees, boosting, and both bagging and random forest, respectively.

Due to the large number of models to be fitted in this complex problem, the parameter tuning in random forest has been optimized for each model through the caret R library [53]. There are three main parameters to be set in random forest. The first two are the number and size of the trees to be grown. The number of trees should not be too small, to ensure that every input row participates in the learning process at least a few times. The size of the trees depends on the minimum size of the terminal nodes: setting this number larger leads to smaller trees and a quicker learning procedure. The third important parameter is the number of predictors randomly sampled as candidates at each split. Bagging has been treated as a particular case of random forest.
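The role of the third parameter, and the sense in which bagging is a particular case of random forest, can be sketched as follows. This is an illustrative Python fragment, not the randomForest code; the function names and the name `mtry` for the subset size are assumptions, and the predictor labels are taken loosely from the variables described in Section 4.

```python
import random


def candidate_predictors(predictors, mtry, rng):
    """Predictors examined at one node: a random subset of size mtry, without replacement."""
    if not 1 <= mtry <= len(predictors):
        raise ValueError("mtry must be between 1 and the number of predictors")
    return rng.sample(predictors, mtry)


def is_bagging(mtry, n_predictors):
    """Bagging corresponds to considering every predictor at every split."""
    return mtry == n_predictors


predictors = ["longitude", "latitude", "area", "bedrooms", "bathrooms", "floor"]
rng = random.Random(0)  # fixed seed, assumed here only for reproducibility

forest_subset = candidate_predictors(predictors, 2, rng)                 # random forest node
bagging_subset = candidate_predictors(predictors, len(predictors), rng)  # bagging node
```

With `mtry` equal to the number of predictors the random subset is the whole set, so every split is searched over all variables, exactly as in bagging.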

Regarding boosting, there are four main parameters to be set in the gbm model. The first is the learning rate (shrinkage), with values 0.001, 0.01, and 0.1, which controls how large the changes are from one iteration to the next, similarly to the learning rate in neural networks. Secondly, the complexity of the tree is controlled by two parameters: the interaction depth (tested among 1, 3, 5, and 10) and the minimum number of observations per node, as in random forest (taking the values 1, 5, 10, and 20). Finally, but also very importantly, there is the number of trees (iterations): an ensemble of 1000 trees is generated and then pruned according to the minimum cross-validation error.
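The search space described above can be enumerated explicitly. The following Python sketch only builds the grid and picks the number of trees with the lowest cross-validation error; the dictionary keys and function names are illustrative assumptions, and no model is fitted.

```python
from itertools import product

SHRINKAGE = [0.001, 0.01, 0.1]      # learning rates from the text
INTERACTION_DEPTH = [1, 3, 5, 10]   # tree complexity
MIN_OBS_IN_NODE = [1, 5, 10, 20]    # minimum observations per node
MAX_TREES = 1000                    # ensemble size before pruning


def gbm_grid():
    """All combinations of the three tuned parameters."""
    return [
        {"shrinkage": s, "interaction_depth": d, "min_obs_in_node": m, "n_trees": MAX_TREES}
        for s, d, m in product(SHRINKAGE, INTERACTION_DEPTH, MIN_OBS_IN_NODE)
    ]


def best_n_trees(cv_errors):
    """cv_errors[k] is the cross-validation error using the first k + 1 trees."""
    return min(range(len(cv_errors)), key=lambda k: cv_errors[k]) + 1


grid = gbm_grid()
```

The grid contains 3 x 4 x 4 = 48 candidate configurations per municipality, each of which is then truncated to its best number of trees.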

The loss function chosen for the optimization of each supervised method is the mean squared error (MSE). To guarantee good generalization ability and avoid overfitted models, 2/3 of the observations in the sample have been randomly assigned to the training set and the other 1/3 to the validation set. Once the best model of each technique (regression tree, bagging, boosting, and random forest) has been chosen in each municipality, the final behaviour of the four models is compared using the following error measures, which assess the goodness of fit and validate the predictive capacity of the models.
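A minimal sketch of the random two-thirds/one-third partition and of an MSE-based choice among candidate models is shown below in Python. The function names, the fixed seed, and the assumption that candidates are compared directly on the held-out set are illustrative; the study itself relied on the tuning facilities of the caret library.

```python
import random


def train_validation_split(rows, train_fraction=2 / 3, seed=0):
    """Randomly assign about train_fraction of the rows to training, the rest to validation."""
    rng = random.Random(seed)  # seed assumed, for reproducibility only
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def select_by_validation_mse(candidates, validation_rows):
    """candidates maps a label to a predict(x) function; rows are (x, y) pairs."""
    actual = [y for _, y in validation_rows]
    scores = {
        name: mse(actual, [predict(x) for x, _ in validation_rows])
        for name, predict in candidates.items()
    }
    return min(scores, key=scores.get), scores


rows = [(x, 2.0 * x) for x in range(300)]  # hypothetical (predictor, price) pairs
train, valid = train_validation_split(rows)
best, scores = select_by_validation_mse(
    {"double": lambda x: 2.0 * x, "zero": lambda x: 0.0}, valid
)
```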

6 Complexity

(a) Mean ratio (average sales ratio): the average of the SR_i, SR being the sales ratio, defined as

$$\mathrm{SR}_i = \frac{\hat{y}_i}{y_i}, \qquad (1)$$

where $y_i$ is the value of property $i$ and $\hat{y}_i$ is its estimated value.

(b) Mean absolute percentage error (MAPE) or relative mean error:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \cdot 100. \qquad (2)$$

The measure is in percentage terms, so it is comparable among different models.

(c) Median absolute percentage error (MdAPE):

$$\mathrm{MdAPE} = \mathrm{Me}\left( \left| \frac{y_i - \hat{y}_i}{y_i} \right| \right) \cdot 100, \qquad (3)$$

where Me() is the median, i.e., the value separating the higher half from the lower half of the absolute percentage errors.

(d) Coefficient of dispersion (COD):

$$\mathrm{COD} = \frac{\frac{1}{n} \sum_{i=1}^{n} \left| \mathrm{SR}_i - \mathrm{Me}(\mathrm{SR}) \right|}{\mathrm{Me}(\mathrm{SR})} \cdot 100, \qquad (4)$$

where Me(SR) is the median of the SR_i. Its interpretation does not depend on the assumption of normality.
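The four error measures in equations (1) to (4) translate directly into code. The Python functions below are a sketch written for this purpose, with hypothetical function names and toy data; they are not taken from the authors' implementation.

```python
from statistics import mean, median


def sales_ratios(actual, estimated):
    """SR_i = estimated value / actual value, equation (1)."""
    return [e / a for a, e in zip(actual, estimated)]


def mean_ratio(actual, estimated):
    """Average sales ratio."""
    return mean(sales_ratios(actual, estimated))


def mape(actual, estimated):
    """Mean absolute percentage error, equation (2)."""
    return 100 * mean([abs((a - e) / a) for a, e in zip(actual, estimated)])


def mdape(actual, estimated):
    """Median absolute percentage error, equation (3)."""
    return 100 * median([abs((a - e) / a) for a, e in zip(actual, estimated)])


def cod(actual, estimated):
    """Coefficient of dispersion, equation (4)."""
    sr = sales_ratios(actual, estimated)
    me = median(sr)
    return 100 * mean([abs(s - me) for s in sr]) / me


# Hypothetical prices (thousand euros) and their estimates.
actual = [100.0, 200.0, 400.0]
estimated = [110.0, 180.0, 400.0]
```

For these three properties the sales ratios are 1.1, 0.9, and 1.0, so the mean ratio is 1, the MAPE and COD are both about 6.67, and the MdAPE is 10.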

In line with the study by Perez-Rave et al. [20], we carry out the estimation and measurement of errors using monetary values, since any transformation of the variable to be estimated (price), such as the logarithmic transformation, can lead to an apparent improvement in results that is merely a statistical artefact. In addition, the estimated price is expressed in monetary terms and does not require any transformation for its interpretation and comparison between locations.

4. Empirical Application

To develop the empirical application, a database is constructed from the information obtained from freely available real estate websites. Data from Internet advertisements allow big data techniques to be applied to the analysis of dwelling prices with greater precision, because the volume of accessible data is large and enriched daily, both ideal characteristics for applying these techniques. In addition, the data are quite varied, and the sale value on the web and the offline sale value are seen to be of similar magnitude. The Internet source also offers information on a variety of property and neighbourhood characteristics that are difficult to find from other sources [20]. However, this source of information has been little used, despite the existence of applications developed with great success both in the real estate sector and in other sectors [54]. These same aspects are highlighted in [55, 56], which point out that web prices offer a valuable opportunity for statistical analysis due to the constant generation of information, their accessibility and availability, and the lack of notable differences compared to offline prices. Real estate applications using web data have been developed by Ozsoy and Sahin [23] in Istanbul, Del Cacho [14] in Madrid, Larraz [57] and Larraz and Poblacion [58] in Spain, Pow et al. [17] in Montreal, Hromada [59] in the Czech Republic, Nguyen [25] in the United States, Clark and Lomax [36] in England, Perez-Rave et al. [20] in Colombia, and Neloy et al. [44] in Bangladesh.

In our study, the database contains information related to the price of the property (flat, non-single-family home) and its reliable geolocation, as well as information that refers specifically to the characteristics of each property. We have access to information on properties for sale in all Spanish municipalities during 2018. The information includes the price of the property and the following 33 variables, which represent the characteristics of each property: a text variable that shows the postal code in which the property is located; three numerical variables that give the constructed surface area, the number of bedrooms, and the number of bathrooms; and 29 attributes that have been categorized into different levels. Among these, the variables considered the most influential on dwelling prices by the implemented methods are location (longitude and latitude coordinates), constructed area, number of bedrooms, number of bathrooms, floor (basement, normal, or attic), the state of conservation (new, with important improvements, adequate for the age, or in need of major improvements), and the presence of air conditioning, heating, lift, garage, terrace, green areas, swimming pool, and storage room. Therefore, these variables are used to estimate the price of each property.

In the first phase, adverts with possible errors are removed, for example, properties with zero unit price. Subsequently, a descriptive analysis is performed to decide what type of variables to work with. Finally, a multivariate analysis of outliers with the available variables is carried out based on the Mahalanobis distance. After this preliminary analysis, we obtain a substantial database whose elements are uniformly distributed throughout the territory. From this database, we work with those municipalities where there are at least 100 sample observations (100 homes for sale), which allows the procedure to choose the best model in each case. Therefore, our study presents an empirical application developed in the 433 municipalities (out of the 8,125 in Spain) that have at least 100 dwellings on sale during the period of research. To be precise, this amounts to information on 790,631 real estate properties in these 433 municipalities, distributed across 48 Spanish provinces (out of the 52 in Spain).
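The last two cleaning steps can be illustrated with a short Python sketch. It is restricted to two variables so that the covariance matrix can be inverted by hand; which variables entered the Mahalanobis analysis and which cut-off was applied are not stated in the text, so the two-variable setting, the function names, and the toy data are all assumptions.

```python
from collections import Counter
from statistics import mean


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (u, v) point from the sample mean."""
    n = len(points)
    mu = mean([u for u, _ in points])
    mv = mean([v for _, v in points])
    # Sample variances and covariance (divisor n - 1).
    suu = sum((u - mu) ** 2 for u, _ in points) / (n - 1)
    svv = sum((v - mv) ** 2 for _, v in points) / (n - 1)
    suv = sum((u - mu) * (v - mv) for u, v in points) / (n - 1)
    det = suu * svv - suv ** 2  # determinant of the 2 x 2 covariance matrix
    distances = []
    for u, v in points:
        du, dv = u - mu, v - mv
        # (du, dv) S^{-1} (du, dv)' with the closed-form inverse of a 2 x 2 matrix.
        distances.append((svv * du * du - 2 * suv * du * dv + suu * dv * dv) / det)
    return distances


def municipalities_with_enough_data(municipality_codes, min_n=100):
    """Codes that appear in at least min_n adverts."""
    counts = Counter(municipality_codes)
    return {code for code, c in counts.items() if c >= min_n}


# Hypothetical (area in m2, price in thousand euros) pairs; the last one is atypical.
adverts = [(50, 100), (60, 125), (70, 135), (80, 165), (90, 180),
           (100, 195), (110, 225), (120, 240), (60, 400)]
d2 = mahalanobis_sq_2d(adverts)
most_atypical = adverts[d2.index(max(d2))]
```

In practice the squared distances would be compared with a cut-off, for instance a high quantile of the chi-squared distribution with as many degrees of freedom as variables, and adverts beyond it would be discarded before counting observations per municipality.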


As stated in Section 3, for the application of the different regression techniques, the data set is divided into a training set and a test set in a proportion of two-thirds and one-third, respectively. The data in the training set are used to fit the best parameter combination in an automated way for each case, while the assessments made on the properties of the test set are used to compare the suitability of the different techniques.

The errors obtained in the valuation of property prices in the 433 municipalities with information available show an average MAPE value of 20.49% with the decision tree technique, 16.54% with bagging, 16.98% with boosting, and 16.69% with random forest, while the average value of the COD was 20.03%, 16.03%, 16.53%, and 16.23%, respectively. Statistical dispersion is depicted through the box and whiskers plots in Figure 1. These results show a good performance of the bagging, boosting, and random forest techniques, given the heterogeneity of the sample and the wide geographic range analysed. However, the decision tree technique overall does not give satisfactory results.

A further line of research could be to what extent the error measures depend on the population size, and even on the sample size, that is, the number of properties available to be used as examples in the estimate. It is worth finding out whether the methods analysed behave better or worse in small, medium, or large cities, or whether the maxim "larger sample size, better estimates" is met. In fact, Table 1 shows that, for the 4 techniques, the linear correlation between the population and sample sizes and the different error measures is practically zero. This may be due either to the quality of the starting information or to the great arbitrariness present in the prices of housing in Spain. Since the quality of the starting information was controlled in the early stages of the analysis, the second option is considered more plausible.

Nor is there any nonlinear correlation between the error measures and population or sample sizes. Just as an example, Figure 2 shows the scatter plots of the two main error measures, MAPE and COD, versus population size for the bagging results. The four techniques present very similar results. No regression structure can be deduced from the graphs. As observed in Figures 2(a) and 2(c), the biggest cities in Spain, Madrid and Barcelona, could be hiding the real correlation. However, after both cities have been eliminated (see Figures 2(b) and 2(d)), the graphs still do not show any relation between the errors and the population size. The coefficients of determination are given in Table 2 for linear, exponential, potential, and logarithmic fits. Note that all of them are almost zero.
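One common way of obtaining such coefficients, sketched below in Python, is to linearize each functional form and square the Pearson correlation of the transformed variables. Whether the values in Table 2 were computed this way is not stated in the text, so the approach, the function names, and the toy data are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared_by_form(x, y):
    """R^2 of four trend forms after linearization; requires x > 0 and y > 0."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    return {
        "linear": pearson_r(x, y) ** 2,         # y = a + b x
        "exponential": pearson_r(x, ly) ** 2,   # ln y = a + b x
        "logarithmic": pearson_r(lx, y) ** 2,   # y = a + b ln x
        "potential": pearson_r(lx, ly) ** 2,    # ln y = a + b ln x
    }


# Hypothetical data following an exact power law, so the potential fit is perfect.
population = [1000, 2000, 4000, 8000, 16000]
error = [3.0 * p ** 0.5 for p in population]
fits = r_squared_by_form(population, error)
```

Values near zero for all four forms, as reported in Table 2, indicate that none of these simple curves explains the errors as a function of population size.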

Because most of the assessments of real estate portfolios will be carried out in the largest cities, we decide to analyse the average results of the municipalities with a population greater than 100,000 inhabitants (see Table 3), 63 in all, and show the results of each of these in more detail. The results obtained for each of the 63 selected municipalities are presented in four tables in the appendix of this study (see Tables S1–S4 in the Supplementary Material for a more detailed analysis), one for each of the techniques used.

Table 3 reports the improvement in all cases when using tree ensemble methods (bagging, boosting, and random forest) compared to individual decision trees, with improvements in the estimation capacity, measured through the MAPE, of around six percentage points. The average MAPE values show the best performance for bagging and random forest, with average values for the set of the 63 most populated municipalities of 15.73% and 15.93%, respectively. Both techniques achieve the same minimum MAPE value (8.58%). In terms of the maximum value, although the value reached by random forest is a point higher than that reached by bagging, both achieve very acceptable values. In terms of COD, a similar situation is observed, with the results of bagging and random forest being the best, at around 15%, followed by those of the boosting technique, which has an average COD of 16.3%. The

Figure 1: MAPE (a) and COD (b), the main error measures of the four techniques (decision tree, bagging, boosting, and random forest), for the 433 municipalities in Spain.

Table 1: Linear correlation coefficients between the MAPE and COD values obtained from the 433 municipalities where the valuations have been calculated and the population sizes (inhab.) and sample sizes (N) of said municipalities.

| Correlation coefficient | Decision tree | Bagging | Boosting | Random forest |
|---|---|---|---|---|
| MAPE vs. inhab. | 0.10 | −0.07 | −0.04 | −0.07 |
| MAPE vs. N | 0.26 | −0.03 | 0.02 | −0.04 |
| COD vs. inhab. | 0.09 | −0.07 | −0.03 | −0.06 |
| COD vs. N | 0.24 | −0.02 | 0.03 | −0.03 |

Note: Own elaboration.


COD of the decision tree technique was alone in reaching a value above the recommended 20%. Figure 3 shows these results graphically, along with the dispersion of the MAPE and COD measures. Note that all the outliers (municipalities with abnormally high or low MAPE and COD) have disappeared. They corresponded to municipalities with fewer than 100,000 inhabitants.

From the analysis of the errors made in the valuation of the properties of the test set for each of the 63 Spanish municipalities with more than 100,000 inhabitants, it is worth highlighting the good behaviour of the mean ratio, which in almost all cases shows values between 0.98 and 1.1 (see Tables S1–S4 in the Supplementary Material). From the value of MdAPE in the case of bagging, the smallest value

Figure 2: Bagging's MAPE and COD versus municipality population, considering all the locations and without Madrid and Barcelona. (a) MAPE vs. inhabitants, bagging. (b) MAPE vs. inhabitants, bagging, without Madrid and Barcelona. (c) COD vs. inhabitants, bagging. (d) COD vs. inhabitants, bagging, without Madrid and Barcelona. Note: Own elaboration.

Table 2: Coefficients of determination of the regression between bagging's MAPE and COD and municipality population, considering all the locations and without Madrid and Barcelona.

| Coefficient of determination (technique: bagging) | Linear | Exponential | Logarithmic | Potential |
|---|---|---|---|---|
| Considering all locations: MAPE vs. inhab. | 0.0053 | 0.0034 | 0.0075 | 0.0040 |
| Considering all locations: COD vs. inhab. | 0.0044 | 0.0026 | 0.0049 | 0.0023 |
| Without Madrid and Barcelona: MAPE vs. inhab. | 0.0252 | 0.0246 | 0.0129 | 0.0148 |
| Without Madrid and Barcelona: COD vs. inhab. | 0.0053 | 0.0028 | 0.0038 | 0.0017 |

Note: Own elaboration.

Table 3: Average, minimum, and maximum results for the MAPE and COD values obtained for the 63 municipalities with more than 100,000 inhabitants.

| Technique | MAPE average | MAPE minimum | MAPE maximum | COD average | COD minimum | COD maximum |
|---|---|---|---|---|---|---|
| Decision tree | 21.92 | 11.68 | 30.40 | 21.22 | 11.65 | 27.95 |
| Bagging | 15.73 | 8.58 | 22.93 | 15.41 | 8.48 | 22.19 |
| Boosting | 16.62 | 10.15 | 23.45 | 16.36 | 10.06 | 23.28 |
| Random forest | 15.93 | 8.58 | 23.92 | 15.56 | 8.48 | 23.08 |


for this measure is 6.59% and the maximum 18.91%, which indicates that, in the best-case municipality, the fifty percent of valuations with the lowest errors have an error below 6.59%, and in the worst-case municipality this error is no greater than 18.91%. For random forest, these values are 6.15% and 19.64%, respectively, reaching a higher value in terms of the maximum and a lower value in terms of the minimum compared to bagging. In addition, given that choosing between bagging and random forest is made more difficult by the similarity in the results obtained, it should be noted that bagging presents a better behaviour in terms of MdAPE, since only 4 of the 63 municipalities analysed present values above 16% and 10 above 15%, while with random forest these counts are 6 and 14, respectively.

5. Conclusions

The need for a rapid and economical appraisal of real estate and the greater availability of up-to-date information accessible through the Internet have led to the application of big data techniques and machine learning to carry out real estate valuation.

At the forefront of these machine learning techniques are tree ensemble methods, in particular bagging, boosting, and random forest. So far, these techniques have been applied in many cases for purposes other than the estimation of property prices, and when they have been applied to real estate valuations, this has been done in a limited way, for very specific geographical areas. In order to advance understanding of the value of tree ensemble techniques on an automated and massive scale, this study shows the results obtained from the application of different techniques for the whole of Spain, with a total of 433 municipalities spread across 48 provinces. The article presents an automated algorithm which selects the best model for each technique in each municipality. Their behaviour in terms of estimation capacity is measured through error measures widely cited in the literature.

The results show that the tree ensembles clearly outperform individual trees, although, of the three methods analysed (bagging, boosting, and random forest), none has a clear advantage over the others. Even so, looking more closely at the behaviour of the bagging and random forest methods, the slightly better results of bagging in terms of MAPE and COD, together with the results in terms of MdAPE, would make us opt for the use of bagging in the case of Spain.

Reviewing the literature available so far, it can be concluded that the results obtained in terms of MAPE are better than those obtained in [26], with a value of 25.2% in Nicosia, or in [46] for the US, with 20.9%. The results are similar to those obtained in [14], with a value that ranges from 19.02% to 15.89% in Madrid, and worse than those obtained for Ljubljana in [24], with an average MAPE of 7.28%. However, it should be borne in mind that these applications focused on specific geographical areas, while the application developed in this study covers the entire Spanish territory. The error measures provided are means of the MAPE and COD of each municipality, with municipalities of very different populations, sample sizes, and socioeconomic characteristics.

From the global analysis of the 433 municipalities as a whole, it can be concluded that the error measures do not depend on the population size or the size of the sample set. This fact suggests the presence of a certain random component in the determination of sales prices, since otherwise the greater the available sample information, the better the results should be.

Finally, it should be noted that this study has other active lines of research that are already being developed, such as the inclusion of a dynamic database that allows the handling of information with different temporal references, or the inclusion of ensemble methods that allow machine learning techniques other than simple trees to be combined.

Data Availability

The database used to support the findings of this study was supplied by COHISPANIA Consultoría y Valoración under license and so cannot be made freely available. Requests for access to these data should be made to http://www.cohispania.com/contacto.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors want to thank Compañía Hispania de Tasaciones y Valoraciones, S.A. for its collaboration. Emilio L. Cano was partially funded by the Spanish "Agencia Estatal de Investigación" via the MTM2017-86875-C3-1-R AEI/FEDER, UE project. The present work was financed through the R&D contract between the University of Castilla-La Mancha and Cohispania, with ref. UCTR180093.

Figure 3: MAPE (a) and COD (b), the main error measures of the four techniques (decision tree, bagging, boosting, and random forest), for the largest 63 municipalities in Spain.

Supplementary Materials

Table S1: main results for decision trees. Table S2: main results for bagging. Table S3: main results for boosting. Table S4: main results for random forest. (Supplementary Materials)

References

[1] Bank for International Settlements, International Convergence of Capital Measurement and Capital Standards, Basel Committee on Banking Supervision, Basel, Switzerland, 2006.

[2] European Council, "Directive 2006/48/EC of the European Parliament and of the Council of 14 June 2006 relating to the taking up and pursuit of the business of credit institutions," Official Journal of the European Union, vol. L177, pp. 1–200, 2006.

[3] J. K. Eckert, Property Appraisal and Assessment Administration, International Association of Assessing Officers, Chicago, IL, USA, 1990.

[4] V. Kontrimas and A. Verikas, "The mass appraisal of the real estate by computational intelligence," Applied Soft Computing, vol. 11, no. 1, pp. 443–448, 2011.

[5] R. Schulz, M. Wersing, and A. Werwatz, "Automated valuation modelling: a specification exercise," Journal of Property Research, vol. 31, no. 2, pp. 131–153, 2014.

[6] O. Kettani and M. Oral, "Designing and implementing a real estate appraisal system: the case of Quebec Province, Canada," Socio-Economic Planning Sciences, vol. 49, pp. 1–9, 2015.

[7] M. Mooya, "Of mice and men," Urban Studies, vol. 48, no. 11, pp. 2265–2281, 2011.

[8] W. McCluskey and S. Anand, "The application of intelligent hybrid techniques for the mass appraisal of residential properties," Journal of Property Investment & Finance, vol. 17, no. 3, pp. 218–239, 1999.

[9] C. M. Filho and O. Bin, "Estimation of hedonic price functions via additive nonparametric regression," Empirical Economics, vol. 30, no. 1, pp. 93–114, 2005.

[10] H. Selim, "Determinants of house prices in Turkey: hedonic regression versus artificial neural network," Expert Systems with Applications, vol. 36, no. 2, pp. 2843–2852, 2009.

[11] N. García, M. Gamez, and E. Alfaro, "ANN+GIS: an automated system for property valuation," Neurocomputing, vol. 71, no. 4–6, pp. 733–742, 2008.

[12] R. D. Jaen, "Data mining: an empirical application in real estate valuation," in FLAIRS Conference, S. M. Haller and G. Simmons, Eds., pp. 314–317, AAAI Press, Palo Alto, CA, USA, 2002.

[13] G.-Z. Fan, S. E. Ong, and H. C. Koh, "Determinants of house price: a decision tree approach," Urban Studies, vol. 43, no. 12, pp. 2301–2315, 2006.

[14] C. Del Cacho, A Comparison of Data Mining Methods for Mass Real Estate Appraisal, University Library of Munich, Munich, Germany, 2010, https://mpra.ub.uni-muenchen.de/id/eprint/27378, MPRA Paper No. 27378.

[15] E. A. Antipov and E. B. Pokryshevskaya, "Mass appraisal of residential apartments: an application of Random forest for valuation and a CART-based approach for model diagnostics," Expert Systems with Applications, vol. 39, no. 2, pp. 1772–1778, 2012.

[16] M. Theriault, F. Des Rosiers, and F. Joerin, "Modelling accessibility to urban services using fuzzy logic," Journal of Property Investment & Finance, vol. 23, no. 1, pp. 22–54, 2005.

[17] N. Pow, E. Janulewicz, and L. Liu, "Applied machine learning project 4: prediction of real estate property prices in Montreal," 2014, http://rl.cs.mcgill.ca/comp598/fall2014/comp598_submission_99.pdf.

[18] O. Bin, "A prediction comparison of housing sales prices by parametric versus semi-parametric regressions," Journal of Housing Economics, vol. 13, no. 1, pp. 68–84, 2004.

[19] Shabana, G. Ali, M. K. Bashir, and H. Ali, "Housing valuation of different towns using the hedonic model: a case of Faisalabad city, Pakistan," Habitat International, vol. 50, pp. 240–249, 2015.

[20] J. I. Perez-Rave, J. C. Correa-Morales, and F. Gonzalez-Echavarría, "A machine learning approach to big data regression analysis of real estate prices for inferential and predictive purposes," Journal of Property Research, vol. 36, no. 1, pp. 59–96, 2019.

[21] J. Bin, S. Tang, Y. Liu et al., "Regression model for appraisal of real estate using recurrent neural network and boosting tree," in Proceedings of the 2017 2nd IEEE International Conference on Computational Intelligence and Applications (ICCIA), IEEE, Beijing, China, pp. 209–213, September 2017.

[22] R. A. Dubin, "Predicting house prices using multiple listings data," The Journal of Real Estate Finance and Economics, vol. 17, no. 1, pp. 35–59, 1998.

[23] O. Ozsoy and H. Sahin, "Housing price determinants in Istanbul, Turkey," International Journal of Housing Markets and Analysis, vol. 2, no. 2, pp. 167–178, 2009.

[24] M. Ceh, M. Kilibarda, A. Lisec, and B. Bajat, "Estimating the performance of random forest versus multiple regression for predicting prices of apartments," International Journal of Geo-Information, vol. 1, p. 168, 2018.

[25] A. Nguyen, "Housing price prediction," 2018, https://pdfs.semanticscholar.org/782d/3fdf15f5ff99d5fb6acafb61ed8e1c60fab8.pdf.

[26] T. Dimopoulos, H. Tyralis, N. P. Bakas, and D. Hadjimitsis, "Accuracy measurement of random forests and linear regression for mass appraisal models that estimate the prices of residential apartments in Nicosia, Cyprus," Advances in Geosciences, vol. 45, pp. 377–382, 2018.

[27] M. Skurichina and R. P. W. Duin, "Bagging, boosting and the random subspace method for linear classifiers," Pattern Analysis & Applications, vol. 5, no. 2, pp. 121–135, 2002.

[28] B. Efron and R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, NY, USA, 1993.

[29] J. Elith, J. R. Leathwick, and T. Hastie, "A working guide to boosted regression trees," Journal of Animal Ecology, vol. 77, no. 4, pp. 802–813, 2008.

[30] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[31] N. Kok, E.-L. Koponen, and C. A. Martínez-Barbosa, "Big data in real estate? From manual appraisal to automated valuation," The Journal of Portfolio Management, vol. 43, no. 6, pp. 202–211, 2017.

[32] N. Shinde and K. Gawande, "Valuation of house price using predictive techniques," International Journal of Advances in Electronics and Computer Science, vol. 5, no. 6, pp. 34–40, 2018.

[33] M. Kagie and M. V. Wezel, "Hedonic price models and indices based on boosting applied to the Dutch housing market," Intelligent Systems in Accounting, Finance & Management, vol. 15, no. 3-4, pp. 85–106, 2007.

[34] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.

[35] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, USA, 1984.

[36] S. D. Clark and N. Lomax, "A mass-market appraisal of the English housing rental market using a diverse range of modelling techniques," Journal of Big Data, vol. 5, no. 1, p. 43, 2018.

[37] M. Graczyk, T. Lasota, B. Trawinski, and K. Trawinski, "Comparison of bagging, boosting and stacking ensembles applied to real estate appraisal," in Intelligent Information and Database Systems, ACIIDS 2010, Lecture Notes in Computer Science, R. Goebel, J. Siekmann, and W. Wahlster, Eds., vol. 5991, pp. 340–350, Springer, Berlin, Germany, 2010.

[38] T. Lasota, B. Londzin, Z. Telec, and B. Trawinski, "Comparison of ensemble approaches: mixture of experts and AdaBoost for a regression problem," in Intelligent Information and Database Systems, ACIIDS 2014, Lecture Notes in Computer Science, N. T. Nguyen, B. Attachoo, B. Trawinski, and K. Somboonviwat, Eds., vol. 8398, Springer, Cham, Switzerland, 2014.

[39] S. Yoo, J. Im, and J. E. Wagner, "Variable selection for hedonic model using machine learning approaches: a case study in Onondaga County, NY," Landscape and Urban Planning, vol. 107, no. 3, pp. 293–306, 2012.

[40] B. Park and J. K. Bae, "Using machine learning algorithms for housing price prediction: the case of Fairfax County, Virginia housing data," Expert Systems with Applications, vol. 42, no. 6, pp. 2928–2934, 2015.

[41] M. Shahhosseini, G. Hu, and H. Pham, "Optimizing ensemble weights for machine learning models: a case study for housing price prediction," Smart Service Systems, Operations Management, and Analytics, Springer, Berlin, Germany, 2019, https://lib.dr.iastate.edu/imse_conf/185.

[42] D. Harrison and D. L. Rubinfeld, "Hedonic housing prices and the demand for clean air," Journal of Environmental Economics and Management, vol. 5, no. 1, pp. 81–102, 1978.

[43] D. De Cock, "Ames, Iowa: alternative to the Boston housing data as an end of semester regression project," Journal of Statistics Education, vol. 19, no. 3, p. 115, 2011.

[44] A. Neloy, M. Sadman Haque, and M. Mahmud Ul Islam, "Ensemble learning based rental apartment price prediction model by categorical features factoring," in Proceedings of the 2019 11th International Conference on Machine Learning and Computing, pp. 350–356, Zhuhai, China, 2019.

[45] E. B. Pokryshevskaya and E. A. Antipov, "Applying a CART-based approach for the diagnostics of mass appraisal models," Economics Bulletin, vol. 31, no. 3, pp. 2521–2528, 2011.

[46] S. Mullainathan and J. Spiess, "Machine learning: an applied econometric approach," Journal of Economic Perspectives, vol. 31, no. 2, pp. 87–106, 2017.

[47] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[48] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, pp. 197–227, 1990.

[49] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2019, https://www.R-project.org/.

[50] T. Therneau and B. Atkinson, Rpart: Recursive Partitioning and Regression Trees, R Package Version 4.1-15, 2019, https://CRAN.R-project.org/package=rpart.

[51] B. Greenwell, B. Boehmke, and J. Cunningham, Gbm: Generalized Boosted Regression Models, R Package Version 2.1.5, 2019, https://CRAN.R-project.org/package=gbm.

[52] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.

[53] M. Kuhn, J. Wing, and S. Weston, Caret: Classification and Regression Training, R Package Version 6.0-84, 2019, https://CRAN.R-project.org/package=caret.

[54] M. E. Beresewicz, "On representativeness of Internet data sources for real estate market in Poland," Austrian Journal of Statistics, vol. 44, no. 2, pp. 45–57, 2015.

[55] A. Cavallo, Scraped Data and Sticky Prices, Social Science Research Network, Rochester, NY, USA, SSRN Scholarly Paper ID 1711999, 2012.

[56] A. Cavallo, "Are online and offline prices similar? Evidence from large multi-channel retailers," American Economic Review, vol. 107, no. 1, pp. 283–303, 2017.

[57] B. Larraz, "An expert system for online residential properties valuation," Review of Economics & Finance, vol. 2, pp. 69–82, 2011.

[58] B. Larraz and J. Poblacion, "An online real estate valuation model for control risk taking: a spatial approach," Investment Analysts Journal, vol. 42, no. 78, pp. 83–96, 2013.

[59] E. Hromada, "Mapping of real estate prices using data mining techniques," Procedia Engineering, vol. 123, pp. 233–240, 2015.

12 Complexity

the quantity and quality of information related to both real estate prices and property characteristics accessible to researchers. This level of information allows the application of increasingly sophisticated statistical techniques for the development of estimation procedures of higher quality and precision. Real estate AVMs allow the valuation of property prices en masse, without the need for the physical presence of an appraiser, by using computer-assisted task appraisal systems [6]. In many cases, the presence of an appraiser is only necessary for those valuations considered to be out of the ordinary [7].

Estimation techniques include parametric regression analysis [8] and nonparametric [9] or machine learning methods such as neural networks [10, 11], decision trees [12, 13], random forests [14, 15], fuzzy logic [16], or ensemble methods [17]. These techniques are used primarily with three objectives in mind: to estimate a real estate property price, to find out the influence of a characteristic of the house on its price, and to create a hedonic price index.

Over recent decades, the most commonly used procedures have been based on hedonic regression [18, 19]. However, these models present certain fundamental problems related to the assumptions of the model: normality of the residuals, homoscedasticity, independence, and the absence of multicollinearity. This situation has led to greater use of pattern recognition techniques, often known as data mining techniques, which include machine learning. These techniques are more flexible about the assumptions related to the distribution of data, they are easier to interpret, and they allow linear and nonlinear relationships to be analysed. In addition, they enable both categorical and continuous variables to be managed [13]. Although these techniques were initially used more as classification methods, in recent years they have been applied to determining the most influential variables on house pricing and to estimating dwelling prices. Perez-Rave et al. [20] provide a two-stage methodology for the analysis of big data regression under a machine learning approach for both inferential and predictive purposes.

Accurate and efficient prediction of real estate prices has been, and will continue to be, an essential but controversial issue with an impact on the various actors in the economy, such as buyers, sellers, commission agents, governments, and banks [21, 22]. Nowadays, the big data paradigm offers exciting possibilities for more accurate predictions, and one of the main approaches for dealing with big data is machine learning.

These machine learning methods have been applied to the valuation of real estate properties in very specific locations. Research has been carried out by Jaen [12] in Coral Gables (Florida, USA), Fan et al. [13] in Singapore (Republic of Singapore), Ozsoy and Sahin [23] in Istanbul (Turkey), Del Cacho [14] in Madrid (Spain), Pow et al. [17] in Montreal (Canada), Ceh et al. [24] in Ljubljana (Slovenia), Nguyen [25] in 5 counties of the USA, and Dimopoulos et al. [26] in Nicosia (Cyprus). In contrast, Perez-Rave et al. [20] deal with the estimation of dwelling prices for a whole country, Colombia, using independent variables identifying the city in which each property is located, thus proposing a single model for the entire country (with a sample of 61,826 properties).

The main new element of this article is that it proposes a new methodology to carry out the automated estimation of real estate prices for an entire country (Spain in this case study), automatically specifying a different model for each municipality, with a sample size of 790,631 real estate properties. The whole country can be considered a complex real estate system due to the great differences that exist between rural and urban areas, and even within urban areas. Each municipality's model is trained with the information available in its own training set, so that each one has a model adapted to its characteristics and needs. This study presents a program capable of fitting a different model for each of the 433 municipalities with more than 100 properties for sale, with populations ranging from 1,559 inhabitants in the smallest municipality to 3,223,334 in Madrid, the biggest one. We focus on the application of this automated valuation system, based on machine learning methods, to estimating real estate prices, and analyse the accuracy of each method using error measures widely recognised in the economic literature. As a rule, aggregated diagnostic indicators (such as the coefficient of determination) are used to evaluate model quality, whereas few contributions in the relevant literature analyse the quality of the procedure using a measurement of the estimation error [15]. In this article, we use four measures to analyse the validity of the proposed methods, namely, the mean ratio, the mean absolute percentage error (MAPE), the median absolute percentage error (MdAPE), and the coefficient of dispersion (COD).
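The four error measures named in this paragraph can be illustrated with a short sketch. It assumes the conventional definitions: the ratio of estimated to observed price for each dwelling, percentage errors relative to the observed price, and a coefficient of dispersion computed around the median ratio. Python is used here only for illustration (the reference list points to R [49]), and the function names and toy prices are hypothetical rather than taken from the study.

```python
from statistics import mean, median


def sales_ratios(actual, predicted):
    """Ratio of estimated price to observed price for each dwelling."""
    return [p / a for a, p in zip(actual, predicted)]


def mean_ratio(actual, predicted):
    """Average of the individual ratios; values near 1 indicate little bias."""
    return mean(sales_ratios(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * mean(abs(a - p) / a for a, p in zip(actual, predicted))


def mdape(actual, predicted):
    """Median absolute percentage error, in percent."""
    return 100 * median(abs(a - p) / a for a, p in zip(actual, predicted))


def cod(actual, predicted):
    """Coefficient of dispersion: mean absolute deviation of the ratios
    from their median, expressed as a percentage of that median."""
    ratios = sales_ratios(actual, predicted)
    med = median(ratios)
    return 100 * mean(abs(r - med) for r in ratios) / med


# Hypothetical observed and estimated prices (illustrative only).
observed = [100.0, 200.0, 400.0]
estimated = [110.0, 180.0, 400.0]
print(mean_ratio(observed, estimated))
print(mape(observed, estimated))
print(mdape(observed, estimated))
print(cod(observed, estimated))
```

The median-based measures (MdAPE and COD) are less sensitive to a few badly estimated dwellings than the mean-based ones, which is one reason to report them side by side.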

In this article, the machine learning techniques used to estimate the price of dwellings are based on decision trees. However, in general, it is difficult to build a single tree that predicts well because of incorrect parameter settings, simplicity rules, and tree instability. To overcome these problems and obtain better predictive behaviour, ensembles of decision trees have been developed, such as bagging, boosting, and random forest methods [27]. In bagging, the models are fitted using random independent bootstrap replicates and are then combined by averaging the output for regression [28]. In boosting, the fitted model is a simple linear combination of many trees that are fitted iteratively, with poorly modelled observations reweighted at each step [29]. The random forest model, however, is constructed from random vectors of the data feature space that are sampled independently [30]. Starting from this base, we automatically design the best ensemble method, including bagging, boosted regression trees, and random forest, in each municipality and then make a comparison to analyse their behaviour under different circumstances. In addition, the results obtained with a single decision tree are included in order to analyse and compare the benefit of using an ensemble of decision trees.
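The bagging step described above (fit trees on independent bootstrap replicates, then average their outputs) can be sketched as follows. This is a simplified stand-in and not the authors' implementation: each tree is reduced to a single-split stump on one hypothetical feature (floor area), the data are invented, and Python is used only for illustration, whereas the reference list points to R packages [50–53]. Boosting and random forest are not covered by this sketch.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single numeric feature.

    Returns a function mapping a feature value to a predicted price.
    """
    best = None
    # Candidate thresholds exclude the largest value so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All feature values are equal: fall back to the overall mean.
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagging(xs, ys, n_trees=25, seed=0):
    """Fit stumps on bootstrap replicates and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        # Bootstrap replicate: n draws with replacement from the training set.
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(tree(x) for tree in trees)


# Hypothetical training data: floor area (m2) and price (thousand euros).
areas = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0]
prices = [100.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0, 200.0]
model = fit_bagging(areas, prices)
print(model(55.0), model(115.0))
```

Averaging over replicates is what addresses the tree instability mentioned in the text: a single stump can change its split when a few observations change, while the average over many replicates moves much less.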

A further consideration is the particular emphasis that the specialized literature places on the need to include spatial information in hedonic models, given the significant influence that the location of a property can have on its price and therefore on its valuation. Ceh et al. [24] highlight the

2 Literature Review









3 Methodology













$$\mathrm{SR}_i = \frac{\hat{y}_i}{y_i}, \tag{1}$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \cdot 100, \tag{2}$$

$$\mathrm{MdAPE} = \mathrm{Me}\left(\left|\frac{y_i - \hat{y}_i}{y_i}\right|\right) \cdot 100, \tag{3}$$

$$\mathrm{COD} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left|\mathrm{SR}_i - \mathrm{Me}(\mathrm{SR})\right|}{\mathrm{Me}(\mathrm{SR})} \cdot 100. \tag{4}$$






















5 Conclusions








Data Availability




Acknowledgments






References






























































