
Linköpings universitet
SE–581 83 Linköping
+46 13 28 10 00, www.liu.se

Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Datateknik
2018 | LIU-IDA/LITH-EX-A–18/017–SE

Predicting inter-frequency measurements in an LTE network using supervised machine learning
– a comparative study of learning algorithms and data processing techniques

Att prediktera inter-frekvensmätningar i ett LTE-nätverk med hjälp av övervakad maskininlärning

Adrian E. Sonnert

Supervisor: Mattias Tiger
Examiner: Fredrik Heintz
External supervisor: Daniel Nilsson & Erik Malmberg (Ericsson)



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Adrian E. Sonnert


Abstract

With increasing demands on network reliability and speed, network suppliers need to make their communication algorithms more efficient. Frequency measurements are a core part of mobile network communications; making them more efficient would improve many network processes such as handovers, load balancing, and carrier aggregation. This study examines the possibility of using supervised learning to predict the signal of inter-frequency measurements by investigating various learning algorithms and pre-processing techniques. We found that random forests have the highest predictive performance on this data set, at 90.7% accuracy. In addition, we have shown that undersampling and varying the discriminator are effective techniques for increasing the performance on the positive class on frequencies where the negative class is prevalent. Finally, we present hybrid algorithms in which the learning algorithm for each model depends on attributes of the training data set. These algorithms are considerably more efficient in terms of memory and run-time without heavily sacrificing predictive performance.


Acknowledgments

I would like to thank Ericsson for giving me the opportunity to perform my master's thesis study in such an interesting area of research. In addition, I would like to thank my supervisors at Ericsson, Daniel Nilsson and Erik Malmberg. I would also like to thank my examiner and supervisor at Linköping University, Fredrik Heintz and Mattias Tiger. Thank you for your time and all the help you have given me.



Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Acronyms

1 Introduction
  1.1 Aim
  1.2 Research questions
  1.3 Delimitations

2 Theory
  2.1 Introduction to LTE/LTE-A networks
  2.2 Introduction to machine learning and supervised learning
  2.3 Cost functions and minimization techniques
  2.4 Data pre-processing
  2.5 Hyperparameter optimization
  2.6 Regularization
  2.7 Learning algorithms
  2.8 Evaluation metrics
  2.9 Comparative studies
  2.10 Memory and run-time

3 Method
  3.1 Platform
  3.2 From network data to features
  3.3 Data statistics
  3.4 Choice of algorithms
  3.5 Initial test suite
  3.6 Result parsing and visualization
  3.7 Tool specifics
  3.8 Hyperparameter optimization
  3.9 Final algorithm configurations
  3.10 Analytic experiments
  3.11 Hybrid algorithms
  3.12 Comparison

4 Results
  4.1 Initial test suite results
  4.2 Optimized learning algorithm results
  4.3 Pre-processing technique results
  4.4 Number of samples
  4.5 Performance on the positive class
  4.6 Improvement from output class distribution
  4.7 Final comparison of systems

5 Discussion
  5.1 Data
  5.2 Algorithm choices
  5.3 Hyperparameter optimization choices
  5.4 Evaluation choices
  5.5 Run-time and memory
  5.6 Classification vs. regression
  5.7 Data characteristics
  5.8 The work in a wider context
  5.9 Future work

6 Conclusion

Bibliography

A Appendix A


List of Figures

2.1 Simplified cutout of an LTE network, showing the eNB, UE, SC, and NC.
2.2 Variance and bias
2.3 Examples of decision boundaries
2.4 PCA with one component on data generated from the Gaussian distribution
2.5 Examples of search space for grid search and random search when considering two parameters
2.6 Artificial neuron
2.7 Artificial Neural Network
2.8 Decision tree for binary classification
2.9 Confusion matrix
2.10 ROC curve with discriminator thresholds 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 from the top right to the bottom left.
3.1 Output class skewness per frequency
4.1 Metrics for logistic regression averaged over models based on the skewness of data
4.2 Metrics for multi-layer perceptron averaged over models based on the skewness of data
4.3 Metrics for random forest averaged over models based on the skewness of data
4.4 Metrics for gradient boosting decision trees averaged over models based on the skewness of data
4.5 Training and validation accuracy score for the chosen algorithms averaged over models based on the skewness of data. The fully drawn line represents the validation score and the dashed line represents the training score.
4.6 Training time for the chosen algorithms based on models with a certain number of training samples available
4.7 Classification time for the chosen algorithms based on models with a certain number of training samples available
4.8 AccSkew for the chosen algorithms averaged over all models based on whether or not the data was preprocessed with scaling
4.9 AccSkew when training on different sets of features and models with varying levels of skewness
4.10 Accuracy when training with the serving cell feature and various fractions of the neighbor cell features
4.11 Accuracy when training with the serving cell feature and various fractions of the neighbor cell features
4.12 AccSkew for the chosen algorithms based on models with a certain number of training samples available
4.13 AccSkew for random forest on 12 models trained on varying sizes of data
4.14 Accuracy for random forest on 12 models trained on varying sizes of data
4.15 F1 score for random forest on 12 models trained on varying sizes of data
4.16 ROC curve for the gold (79.4%) model from Section 4.4 with thresholds in intervals of 0.05.
4.17 Accuracy and F1 score for RF when using various scoring functions
4.18 Accuracy of various systems trained on the same data set
A.1 ROC curve for the 79.5% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.2 ROC curve for the 75.6% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.3 ROC curve for the 89.9% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.4 ROC curve for the 69.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.5 ROC curve for the 95.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.6 ROC curve for the 79.0% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.7 ROC curve for the 57.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.8 ROC curve for the 82.9% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.9 ROC curve for the 65.4% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.10 ROC curve for the 63.6% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.11 ROC curve for the 82.2% skewness model from Section 4.4 with thresholds in intervals of 0.05.


List of Tables

3.1 A sample
3.2 Input
3.3 Output
3.4 Algorithm suite
3.5 Algorithms and Configurations
3.6 Search space grid to distr. transformation
3.7 Logistic Regression hyperparameter search space
3.8 Multi-layer perceptron hyperparameter search space
3.9 RF hyperparameter search space
3.10 Gradient boosting decision tree hyperparameter search space
3.11 Hybrid 1 configuration
3.12 Hybrid 2 configuration
3.13 Hybrid 3 configuration
4.1 Results, Initial run, Scaling
4.2 Results, PCA run, Scaling
4.3 Algorithm training stats


Acronyms

Telecom
ANR: Automatic Neighbor Relations
AP: Access point
c-ID: Physical identity
CGI: Global Cell ID
ECGI: E-UTRAN Global Cell Identity
ECI: E-UTRAN Cell Identity
eNB: Evolved Node B
MT: Mobile terminal
NC: Neighboring cell
PCI: Physical Cell ID
RSRP: Reference Signal Received Power
SC: Serving cell
TC: Target cell
UE: User equipment

Machine learning
ANN: Artificial neural network
GBDT: Gradient boosting decision trees
GP: Gaussian processes
KNN: k-Nearest neighbor
LogReg: Logistic regression
MLP: Multi-layer perceptron
ReLU: Rectified linear unit
RF: Random forest
PCA: Principal component analysis
SGD: Stochastic gradient descent
SPCA: Sparse principal component analysis
SVD: Singular value decomposition
SVM: Support vector machine
TSVD: Truncated singular value decomposition


1 Introduction

With the emergence of the networked society, many of people's daily tasks are performed over wireless connections, and increasing requirements are put on mobile networks in terms of latency, data rates, and reliability. In order to meet these requirements, network providers must constantly adapt to advances in technology and improve their communications algorithms.

To facilitate non-disrupted communications in mobile networks, handovers are performed. These are processes in which a mobile terminal (MT) switches channel during an ongoing session [41]. The most general case for when a handover is performed is when the signal strength from an MT to its access point (AP) falls below a certain threshold and a new AP must be acquired. Another important technique in network communications is load balancing, in which APs serving many MTs are disfavored over APs serving few MTs1. This is especially important in hotspots where many users conglomerate. In these areas, aggregating users toward short-range, high-frequency APs helps ease the load on long-range, low-frequency APs. In addition to techniques for providing reliable and stable services, there exist techniques for boosting the throughput for users in the network. Carrier aggregation is a technique for connecting single MTs to several APs on one or several frequency bands in order to increase bandwidth2.

The techniques mentioned above are examples of techniques that rely on performing frequency measurements, that is, measurements of signal strength from an MT to an AP on a certain frequency. In general, frequency measurements are the basis for every technique in which the signal strength between MTs and APs in a network needs to be known. An improvement in frequency measurement efficiency would therefore be an improvement to network efficiency in general. To give an idea of how frequency measurements are used, we may briefly examine a simplified handover process. When a handover is to be performed, the MT must first acquire information about the APs in its vicinity to know which one is the best candidate for a handover. For this to happen, the MT performs frequency measurements to each nearby AP. When the measurements are completed, the MT performs a handover to the AP with the highest measured signal strength. Alternatively, the MT may switch as soon as an AP with a high enough signal is found. A more thorough description of mobile networks follows in Section 2.1.

1. http://www.3gpp.org/technologies/keywords-acronyms/105-son
2. http://www.3gpp.org/technologies/keywords-acronyms/101-carrier-aggregation-explained


Because of the network cost of performing frequency measurements, it would be advantageous to know whether the signal to the strongest-signal AP on a certain frequency is strong, without having to perform a measurement. Predicting whether or not this would be true at a certain position clearly requires the position of MTs in the world to be known. Extracting the absolute position of MTs in a network is time consuming, unreliable, and may entail both legal and ethical complications. It could be possible to trilaterate the position of an MT given the distance to its host AP and other nearby APs. If the relative position of an MT is established through trilateration, and there exists information regarding which inter-frequencies previous measurements were successful on, it is theoretically possible to estimate a probability of the success of a new measurement.

For this task, machine learning is a suitable approach. Machine learning is a field in which a computer program, rather than being programmed to follow a specified rule set, learns to make predictions of certain outcomes by applying a learning algorithm to some data set [43]. The learning algorithms are dependent upon the data being divided into features: observed properties or phenomena. In machine learning, the data from which the model is learned is called training data, and input vectors are assembled from the features of the data. In supervised learning, each input-output pair represents a training example, as opposed to unsupervised learning, where each example only consists of input.

The main objective of this study is to apply machine learning to the problem of inter-frequency signal prediction. This amounts to the creation of a system that can, given information about signal strength to the host AP and neighboring APs, predict the signal to the highest-signal AP on a certain inter-frequency. Section 3.2 describes in more detail how the network data relates to the algorithm I/O. We investigate which learning algorithm and which data processing techniques are best suited for the task. In order to accomplish this, a comparative study of learning algorithms is performed, and algorithms of interest are analyzed more in-depth. In addition to finding the optimal setup for the problem in general, we investigate solutions for predicting signals on high frequencies in particular, where the signal strengths generally are lower. Finally, we analyze the data with the objective of evaluating the effectiveness of the studied solutions on this problem.

Although there exists research on the use of machine learning in various parts of an LTE network, such as handover management [4], predicting user movement [51][44][11], and other techniques which make use of frequency measurements, there is, to the author's knowledge, no research on using machine learning to improve the frequency measurement algorithms themselves.

1.1 Aim

The goal of the project is to determine which supervised learning algorithm performs best in the given problem context with the available data set. A comparative study of supervised learning algorithms is made, where a few algorithms are selected for deeper study. In addition to learning algorithms, the data is studied and suitable data processing techniques are selected.

1.2 Research questions

The following research questions are defined to guide the research in exploring algorithms, alternatives for pre-processing the data, and how well suited a machine learning solution is for the given data set:

1. From the selected suite of learning algorithms and pre-processing techniques, which combination achieves the highest predictive performance on the given data set?


This can be considered the main research question of the study. At the end of the study we compare our implementation with various models, including an Ericsson prototype system.

2. How does the predictive performance of the implemented system vary when the training size is varied?

In order to easily test our resulting system in the field, a proposition is to make the system non-reliant on an external system. This means that data collection, data storage, training, model storage, and classification are all performed on the node. Because of memory and run-time requirements, as well as the desire to keep the data collection duration minimal, it is desirable to investigate how the number of training samples affects the predictive performance.

3. Which techniques are suitable for the objective of achieving high predictive performance on the positive class on high frequencies?

This is important because, in general, the low frequency APs have large ranges and yield high signal strength connections. This makes it easy for users to aggregate to these APs. From a load balancing standpoint, it is therefore desirable to move users to high frequency APs. Because these APs generally give lower signal strength connections, there is less data for cases when the signal is considered "good enough", and therefore fewer cases to train the models on. From a machine learning perspective, this means that the negative output class is dominant.

4. How much is gained from using the implemented system over using naive selection?

It is important to produce a system which shows the capability of improving efficiency in the real world application. By naive selection we mean always choosing the most common output class. This could also help answer whether or not supervised learning is an ideal technology for this application.

1.3 Delimitations

The study is centered around implementations that should be run in their entirety on LTE eNBs. This entails certain limitations in run-time and memory. In the proposed solution, the data is intended to be acquired and stored in the eNB, and the model is trained and stored in the eNB as well. This makes training speed, classification speed, and memory requirements all factors in what constitutes an efficient solution.


2 Theory

This chapter briefly describes relevant information about an LTE/LTE-A network. It also describes machine learning, and supervised learning in particular, before detailing certain aspects of learning techniques that are of importance for the studied algorithms. Cost functions and minimization, regularization, and hyperparameter optimization techniques are some of the topics described. We detail the workings of the algorithms that were chosen for further study after the initial broad sweep. Finally, techniques for evaluation of machine learning models are described.

2.1 Introduction to LTE/LTE-A networks

In an LTE/LTE-A network, User Equipment (UE) is a conglomerate term for devices in the network that communicate with the core network, such as mobile phones or laptops. These devices connect to the core network via base stations called Evolved Node B (eNB) [39]. Each eNB consists of one or more cells, meaning that one eNB can consist of a single physical antenna (omnidirectional), or several (sectorized multi-cell solution). However, a cell is not defined as the actual antenna but rather the area the antenna services. The eNB or cell which is currently supplying a particular UE's connections is called its serving eNB or serving cell (SC). When a frequency measurement or handover is performed, the cell to which the procedure is being performed is called the target cell (TC) [6].

There are many different types of identifications in an LTE/LTE-A network pertaining to cells, UEs, and eNBs1. It is further complicated by different sources using different terms for the same properties [5][37][19]. There are several ways of identifying cells with different magnitudes of uniqueness. The E-UTRAN Global Cell Identity (ECGI) identifies any cell in the world uniquely, and the E-UTRAN Cell Identity (ECI) identifies any cell in a network uniquely. Finally, the physical identity (c-ID) identifies neighboring cells uniquely in relation to one ECI [37], and ranges from 0 to 503. All of this can be simplified by using the terms Global Cell ID (CGI), which contains both the ECGI and ECI, and Physical Cell ID (PCI), a more common term for the c-ID [5][19]. In summary, the CGI is a cell identified uniquely, and the PCI is a cell identified in relation to a CGI.

1. http://www.rfwireless-world.com/Terminology/LTE-Identifiers.html


Figure 2.1: Simplified cutout of an LTE network, showing the eNB, UE, SC, and NC.

Each eNB maintains a list of neighboring cells (NC) called a neighbor cell relation list [5], or neighbor relation table [19], that includes certain connectivity information such as IP and PCI mappings [5]. UEs moving through the network may detect both PCIs and CGIs, but identifying a cell absolutely is more time consuming. The UEs in the network continuously measure their Reference Signal Received Power (RSRP), the signal strength, from the SC to all NCs in their vicinity that are possible candidates for a handover. When a UE encounters a new cell, it sets up a connection between its serving eNB and the eNB of the unknown cell. The eNBs then share information about the cells in their respective areas, including PCIs and CGIs. This function is called Automatic Neighbor Relations (ANR) [19], and is a core part of the relational mapping workflow in a network. The information generated by ANR usually only concerns neighbors on the same frequency. Inter-frequency measurements can also be performed, but are generally ordered by an eNB in specific circumstances, and are therefore considered more expensive.

2.2 Introduction to machine learning and supervised learning

In machine learning, algorithms are able to learn from data. This means that the performance of an algorithm on a certain task is improved according to a certain performance measure [25]. The task generally refers to the procedure of predicting one or several output variables, given a series of input variables (or features) [27]. These tasks can further be divided into classification and regression, which is the prediction of categorical and continuous variables, respectively [25]. In the case of binary classification, the output is usually true or false and represented by a binary digit: 1 or 0. The measures used to gauge the performance of machine learning algorithms are usually referred to as metrics; the most common one being accuracy, simply the ratio of correct predictions.

In supervised learning, an algorithm is trained by feeding it a training set consisting of feature vectors and corresponding outputs [25]. One feature vector and its corresponding output is often called an example or a sample. During evaluation (or validation), or when the algorithm is queried to make a prediction in general, only the feature vector is given. In order to acquire data for both training and validation, the data available for analysis is usually split up into one training set and one test set, usually by ratios of 9 to 1, or 8 to 2.

If the vector of features is $x \in \mathbb{R}^n$, and the output variable is $y$, supervised learning algorithms often learn to predict $y$ by estimating the probability distribution of $y$ given $x$. A simple example of a machine learning algorithm is Linear Regression, in which the output is a linear function of the input

$$y = w^\top x + b \qquad (2.1)$$

where $w \in \mathbb{R}^n$ is a vector of weights, and $b$ is an intercept (or bias) term [25]. The weights and the bias term are what is referred to as the parameters of the model, the variables that are learned by training. The extra bias term is needed if the decision boundary is to not go through the origin (0, 0). The bias term is therefore added to account for what in machine learning is called bias, the inaccuracy of the model. Another closely linked term is variance, which is instead the impreciseness of the model. Consider an example where a model is trying to, for a single observation, predict the correct outcome several times. If the multiple tries yield predictions that are close to equal, but far from the true outcome, the bias is high and the variance is low. On the other hand, if the predictions are widely spread out around the correct outcome, the bias is low but the variance is high. This is explained visually in Figure 2.2. For machine learning model performance, it is usually difficult to obtain both a low bias and a low variance, which is why the bias-variance tradeoff is a central problem. Bias and variance are closely linked to overfitting and underfitting; more about this in Section 2.6.

Figure 2.2: Variance and bias, illustrated by four cases: high variance/high bias, low variance/high bias, high variance/low bias, and low variance/low bias.

As can be concluded from Equation 2.1, when we are using linear regression for classification we rely on being able to linearly separate different outcome classes. This can be visualized by what is called a decision boundary: data will be assigned to one class or another depending on which side of the boundary the data point falls. Figure 2.3 shows example decision boundaries for two data sets. This demonstrates that the data cannot always be separated so that the model always predicts the correct class. This is a simple example, but it holds for higher dimensions as well. The higher the number of dimensions of the data to analyze, the more complex the model needs to be in order to achieve high predictive performance.


Figure 2.3: Examples of decision boundaries. (a) Linearly separable data; (b) non-linearly separable data.

2.3 Cost functions and minimization techniques

The problem of optimizing a learning algorithm can be described as finding the function, from the set of all possible functions, that best predicts the true responses to a set of examples [50]. In order to find a function as close as possible to this function, the loss is measured. The loss is the discrepancy between the true response and the prediction of the model. We want to find the function that minimizes the expected loss (or risk), defined as

$$R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y) \qquad (2.2)$$

where $f(x, \alpha)$ is the sought function, $L(y, f(x, \alpha))$ is the loss function, and $P(x, y)$ is the unknown probability distribution of the input $x$ and output $y$. Because the probability distribution is unknown, we cannot minimize the expected loss function directly but must compute an approximation, the average loss over the training set [12]. Minimizing a loss function is in machine learning often done by using Stochastic Gradient Descent (SGD), in which one weight update can be described by

$$w_{t+1} = w_t - \gamma_t \nabla_w L(z_t, w_t) \qquad (2.3)$$

where $w_t$ is the weight at time $t$, and $\gamma$ is the learning rate. The learning rate is a constant by which the gradient of the loss is multiplied to hasten or slow down the descent. If the learning rate is too large, the algorithm may not converge, while a too small learning rate might make the convergence too slow for a practical application. There are many modern variations of the SGD algorithm, such as Adam [33] and SAGA [20]. When a maximization of the objective function is used instead of a minimization, the loss function is often called the scoring function instead.
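As an illustration, the update in Equation 2.3 can be written out in a few lines of Python. This is a minimal sketch assuming a linear model and a squared loss on a single sample z_t = (x_t, y_t); it is not the exact update rule of any particular library used in this study.

import numpy as np

def sgd_step(w, x, y, lr):
    # Squared loss L(z, w) = 0.5 * (w.x - y)^2, so grad_w L = (w.x - y) * x.
    grad = (w @ x - y) * x
    return w - lr * grad  # Equation 2.3 with a constant learning rate

# Toy usage: recover y = 2*x1 - x2 from randomly drawn samples.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(1000):
    x = rng.normal(size=2)
    y = 2 * x[0] - x[1]
    w = sgd_step(w, x, y, lr=0.05)
print(w)  # close to [2, -1]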

2.4 Data pre-processing

Before being used as input for a learning algorithm, there are a variety of ways data can be manipulated. There are many reasons why data pre-processing is useful; these include reducing training time, boosting the predictive performance of the resulting model, and the data format being incompatible with the learning algorithm implementation, among others. Dimensionality reduction can be used to reduce the number of features or samples, often while trying to retain some information about what has been discarded. Feature selection can be used to simply remove certain features that possess little or redundant information. Sampling can be used to decide in which manner samples are divided into subsets, or which types of samples to include. Feature scaling can be used to transform the values of the features.

Sampling

A common way of splitting data into train and test subsets is by random sampling, that is to say picking samples at random from the full data set [42]. However, there are some complications with this technique. If the distribution of output classes is heavily skewed, the possibility of having all or most of the samples with one of the classes in only one of the subsets increases. This is problematic because if the algorithm is never trained on samples with a certain output class, there is no way for it to ever predict that class. This can be solved by using stratified sampling, which retains the distribution of a chosen feature or output variable in the subsets.
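As a concrete example, stratified sampling is available through scikit-learn's train_test_split; the toy data and the 9-to-1 split below are only for illustration.

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a skewed output class distribution (about 10% positive samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)

# Stratified 9-to-1 split: both subsets keep roughly the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both close to 0.1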

In addition to the technique of sampling the full data set, it is possible to use sampling techniques that only use part of the full data set, or that expand upon it. Removing samples from the dominant class is called undersampling while generating data of the infrequent class is called oversampling [17]. Both methods have strengths and weaknesses; undersampling may remove important examples while oversampling generates artificial data from the existing data set, making it prone to overfitting.
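A simple form of random undersampling can be sketched as follows; balancing the classes completely is an assumption made for the example, not necessarily the exact scheme used later in this study.

import numpy as np

def undersample(X, y, random_state=0):
    # Randomly drop samples of the majority class until all classes are equally large.
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep_idx)
    return X[keep_idx], y[keep_idx]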

Feature scaling

What in machine learning is referred to as feature scaling, or simply scaling, is a normalization of feature data by transforming different ranges of data into a more common scale. Many machine learning algorithms base feature similarity on the Euclidean distance of feature vectors in the feature space instead of studying the features' characteristics directly [3]. One effect of this is that features with larger value ranges are given larger significance in weight calculations. This is not generally desired, as there is usually no inherent correlation between a feature's total value range and the value of the feature in one example. A concrete way of explaining this could be to consider the weight and height of a person when determining the person's sex. If the heights of the observed people range between 160 cm and 190 cm, and the weights range between 55 kg and 100 kg, a larger relative weight would be assigned to the weight feature. However, there is no support for the fact that a person's weight is a larger determinant than height when predicting a person's sex. In addition to this, simply reducing the values in the data to smaller numerical values has been shown to increase training speed for neural networks [28].

Because of these phenomena, it is considered best practice to scale the features. There exist multiple ways of scaling features; some rely on transforming the values of every feature to fit the same range, for example [0, 1], while others subtract or divide each feature value by a mean of the feature [3]. If $\hat{x}$ is a feature, $x$ a feature component, and $l$ and $u$ represent a lower and upper bound for feature components, respectively, [0, 1]-scaling can be formulated as

$$\hat{x} = \frac{x - l}{u - l} \qquad (2.4)$$

Standard scaling (or standardization), which is the removal of feature variance from each feature component, can be formulated as

$$\hat{x} = \frac{x - \mu}{\sigma} \qquad (2.5)$$

where $\mu$ and $\sigma$ are the mean and deviation of the feature, respectively.
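Both schemes are available in scikit-learn as MinMaxScaler (Equation 2.4) and StandardScaler (Equation 2.5); the height/weight values below are only illustrative.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Height in cm and weight in kg, two features with very different value ranges.
X = np.array([[160.0, 55.0], [175.0, 70.0], [190.0, 100.0]])

X_01 = MinMaxScaler().fit_transform(X)     # Equation 2.4: (x - l) / (u - l)
X_std = StandardScaler().fit_transform(X)  # Equation 2.5: (x - mu) / sigma
print(X_01)
print(X_std)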


Dimensionality reduction

In statistics, multivariate analysis is the practice of analyzing the joint behavior of multiple variables. This is used because it is often inadequate to consider the importance of variables independently. For example, it is possible that a variable that gives a lot of information is redundant if another variable is introduced. This is important in machine learning because we want to minimize the dimensionality of the input while maximizing the information from the data that is fed to the algorithm.

Principal component analysis (PCA) is a technique for reducing the number of variables while retaining a large part of the variance of the data [2]. A principal component is the linear combination that explains the largest possible variance of a set of data. When PCA is performed, principal components are created sequentially, generally until a certain threshold of variance is obtained. Principal components are always orthogonal to one another. This is important because if the components were to be correlated, it would not be possible to determine the amount of variance that is explained by each component, akin to the variables that we had prior to performing PCA. Numerically, PCA is based on singular value decomposition (SVD).

Figure 2.4: PCA with one component on data generated from the Gaussian distribution. (a) Large amount of variance lost; (b) small amount of variance lost.

PCA is most easily interpreted visually. Figure 2.4 shows principal components fit on different sets of data. One may say that the data is projected orthogonally onto the nearest principal component, and the shorter the distance of the projection, the less variance is lost. Apart from classic PCA, there exist similar methods based on SVD, such as Truncated SVD (TSVD) [26] and Sparse PCA (SPCA) [52].
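As a small sketch of how PCA is typically applied, the example below fits a single principal component to correlated Gaussian data, similar in spirit to Figure 2.4; the covariance matrix is an arbitrary choice.

import numpy as np
from sklearn.decomposition import PCA

# Correlated two-dimensional Gaussian data.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 2.0], [2.0, 2.0]], size=500)

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)      # project onto the first principal component
print(pca.explained_variance_ratio_)  # fraction of the variance that is retained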

2.5 Hyperparameter optimization

In machine learning, a model's parameters are what is automatically updated as the algorithm performs training, such as the weights of a neural network. Meanwhile, a model's hyperparameters are what is specified by the user prior to training, such as the number of trees to use in a random forest [8]. For many learning algorithms, there exists a large number of hyperparameters. In addition, there are no real default configurations that "always work"; one rather has to test a number of configurations to see what fits the application and data at hand. In Bayesian machine learning, the term hyperparameter refers to a parameter in the prior distribution of another parameter of the model; for consistency across the various models in this study, we will not use this definition. Note that this is only relevant for Gaussian processes among the learning algorithms we study in this work.


The most common technique for hyperparameter optimization is the combination of manual work and a brute force search technique [10]. Around 10 years ago, grid search was the most common strategy. Grid search relies on manually configuring a hyperparameter search space which is then iterated through until every possible combination has been explored. For large grids, the process is very costly. If one wishes to optimize an algorithm with 5 hyperparameters and simply explore 5 values for each, 3125 iterations need to be performed. Because hardware keeps improving throughout the years, grid search has not completely lost its usefulness.

Figure 2.5: Examples of search space for grid search and random search when considering two parameters. (a) Search space for grid search; (b) search space for random search.

Another method, called random search, is similar to grid search in the way that it searches through a pre-defined search space iteratively [10]. However, instead of performing exhaustive search, random search samples a configuration at random in each iteration. Because of its non-exhaustive behavior, it is possible to use continuous distributions to sample values from rather than using a static grid. Also in regard to the non-exhaustive behavior, it is possible to manually set a stopping condition, for example a set number of iterations to perform. Random search has been shown to outperform grid search [9]. In addition, it performs well even in comparison with modern sequential methods, but can be outperformed by Gaussian Processes and the Tree-structured Parzen Estimator Approach when optimizing complex models such as deep belief networks and convolutional networks [10].

Random search will in most cases explore more choices for each parameter, as illustrated by Figure 2.5. This allows random search to find local optima that grid search would otherwise not find. However, it also makes it difficult to analyze which parameters are more significant. Grid search may therefore be better suited for analyzing the importance of parameters, while random search is better suited for simply finding the best configuration.
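To make the difference concrete, scikit-learn provides both strategies as GridSearchCV and RandomizedSearchCV; the estimator, parameter ranges, and iteration budget below are illustrative assumptions only.

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grid search: exhaustively evaluates every combination in a fixed grid (3 x 3 = 9 configurations).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [4, 8, 16]},
    cv=3,
).fit(X, y)

# Random search: samples the same number of configurations from integer distributions.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 20)},
    n_iter=9,
    cv=3,
    random_state=0,
).fit(X, y)

print(grid.best_params_)
print(rand.best_params_)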

Hyperopt-sklearn is a package for automatic hyperparameter configuration [34]. It lets users perform hyperparameter optimization using built-in parameter distributions, which are engineered to account for both dense and sparse data representations. Hyperopt-sklearn considers quite a low number of hyperparameter choices but has still shown considerable performance in comparison to other techniques [34].


2.6 Regularization

In machine learning, our goal is to minimize our test error, which is the same as maximizing our predictive performance on the test data set. We generally obtain a low test error by obtaining a low train error, a good performance on training data. However, because the train and test data sets contain different data, it is possible to obtain a low train error but a high test error; this phenomenon is called overfitting [25]. Intuitively, overfitting can be explained by the algorithm focusing "too much" on the specific values in the training data, and overlooking general correlations. Overfitting, therefore, often occurs in models that have high variance. In contrast, it is possible that the training error is large, which will also generally induce a high test error. This is called underfitting, and can be explained by the algorithm failing to find correlation between the input and output of the data. In contrast to overfitting, underfitting instead occurs in models with high bias. Our ultimate goal is to minimize the test error, and this can often be accomplished by performing a trade-off between overfitting and underfitting. It is possible that a configuration performs better on the test data than the training data.

In order to reduce overfitting, one uses what is called regularization. This can be explained as reducing how much the algorithm learns from each training sample, hopefully lowering the test error at the expense of the training error [25]. This is done differently depending on the type of algorithm. For example, for algorithms that rely on minimizing a cost function by gradient descent, regularization is induced by adding a penalizing term to the cost function. For algorithms that rely on decision trees, highly complex trees may induce overfitting, which is why breadth and depth are controlled. This is generally done by growing a full tree, and then cutting away select branches.

As previously mentioned, prior to training the data is split into training and test data sets. In order to calculate the training error, one must further split the training data into train and test subsets for use during training. In order to obtain a more accurate training error, this is usually done multiple times using the method k-fold cross-validation. This method splits the data into k subsets, training iteratively k times, each iteration using one of the folds as test data and the remaining sets as training data [25]. After training has been completed, the average error is selected as the training error.
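In scikit-learn, this kind of k-fold cross-validation can be sketched as follows; the algorithm and k = 5 are placeholder choices for the example.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 5-fold cross-validation: the model is trained 5 times, each fold held out once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy over the folds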

2.7 Learning algorithms

The learning algorithms considered in this study are logistic regression, random forest, gradient boosting decision trees, multi-layer perceptron, k-nearest neighbor, Gaussian processes, and support vector machines. The latter three were discarded after the evaluation of the initial test run, which is why we deemed detailed descriptions of these algorithms superfluous. We will first briefly describe these three algorithms and follow with a more detailed description of the others.

Gaussian processes are non-parametric kernel-based learning algorithms that are based on finding a distribution over possible functions from the input data to the output data. The functions considered are indirectly defined by a covariance matrix, which in itself defines how the data points in the input relate. For a more detailed explanation, see Rasmussen et al.'s Gaussian Processes in Machine Learning [47].

We previously described how linear regression separates data points in a two-dimensional space by a line. A support vector machine attempts to separate data points in any dimension with a hyperplane, while at the same time maximizing the distance between each data point and the hyperplane. This is partly done by transforming the input data to a (normally) higher dimensional feature space by a kernel function where the data is hopefully easier to separate. For details, see Shmilovici et al.'s Support Vector Machines [49].

k-Nearest neighbor is an instance-based method that is based on considering groupings of each data point in the input by proximity in the feature space and where the outcome is decided by majority voting of that grouping; the size of each grouping is decided by k. The k-NN holds the entire training data in memory and performs all computations at classification. This algorithm is considered one of the simplest machine learning methods. If any more details are desired, consider Cover et al.'s Nearest neighbor pattern classification [18].

Logistic regression

Logistic regression is a probability model for binary response data [23]. When used for classification it will output the odds of one class over the other. The name derives from the logistic function; the standard logistic function (or sigmoid function) can be expressed as

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2.6)$$

An example of a logistic regression function is

$$p(x) = \frac{e^{w_1 x_1 + w_2 x_2 + b}}{1 + e^{w_1 x_1 + w_2 x_2 + b}} \qquad (2.7)$$

where $w_1$ and $w_2$ are weights, and $b$ is a bias. This may also be expressed as

$$q(x) = \ln\left(\frac{p(x)}{1 - p(x)}\right) = w_1 x_1 + w_2 x_2 + b \qquad (2.8)$$

which, if $p(x)$ is a probability function for $P(Y = 1 \mid X)$, gives us the probability odds of one output class over the other as a function of the input. Classification with logistic regression is done by interpreting the output, for example by

$$f(x) = \begin{cases} 1, & p(x) > 0.5 \\ 0, & p(x) \le 0.5 \end{cases} \qquad (2.9)$$

Training logistic regression models amounts to specifying a likelihood function over the parameters given training data and computing the maximum likelihood estimates of that function using SGD; for a thorough mathematical explanation, see Czepiel et al.'s Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation [1].
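Equations 2.6 through 2.9 translate directly into code; the sketch below uses two features and arbitrary example weights (these are not fitted parameters).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # Equation 2.6

def predict(x1, x2, w1=0.8, w2=-1.2, b=0.1):
    p = sigmoid(w1 * x1 + w2 * x2 + b)  # Equation 2.7: P(Y = 1 | X)
    return 1 if p > 0.5 else 0          # Equation 2.9: threshold the probability

print(predict(1.0, 0.5))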

Artificial neural networks

Artificial neural networks (ANN) are a type of computational system based on emulating the functionality of a biological neural network (such as the human brain) [30]. ANNs are comprised of a number of artificial neurons, which are based on the functionality of biological neurons [38]. The inputs & weights, summation & activation function, and output of an artificial neuron are the equivalent of the dendrites, soma, and axon of a biological neuron, respectively.

An artificial neuron is seen in Figure 2.6, where $x_1$ to $x_n$ are the inputs, and $w_{1j}$ to $w_{nj}$ are the weights. The sum of the weighted inputs is fed to the non-linear activation function, which calculates an output. This output may be used as input to one or several other artificial neurons, which is what constitutes the ANN [38]. The output of an artificial neuron is calculated by

$$y = F\left(\sum_{i=0}^{n} w_i \cdot x_i + b\right) \qquad (2.10)$$

where $F$ is the activation function and $b$ is the bias term. The activation function of a neural network must be non-linear for us to be able to model non-linear data [25]. Common activation functions include the logistic function and the rectified linear unit.
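Equation 2.10 corresponds to a single forward pass through one neuron; below is a small NumPy sketch with an arbitrarily chosen activation function and example weights.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron_output(x, w, b, activation=relu):
    # Equation 2.10: y = F(sum_i w_i * x_i + b)
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_n
w = np.array([0.3, 0.8, -0.5])   # weights w_1..w_n
print(neuron_output(x, w, b=0.1))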


Figure 2.6: Artificial neuron

Figure 2.7: Artificial Neural Network

Figure 2.7 shows an ANN, each node representing one artificial neuron. In this figure, the green nodes represent the input layer, the purple nodes represent a hidden layer, and the orange node represents the output layer. It should be noted that while a network may consist of only one input layer and one output layer, there exists no boundary for how many hidden layers may be incorporated.

ANNs may be divided into two different subcategories, feed-forward and recurrent networks [38]. The difference is that in feed-forward networks, the output of a neuron may never be used in the input of a neuron in the same or a previous layer, while in a recurrent network it may. The ANN in Figure 2.7 therefore represents a feed-forward network, described by an acyclic graph.

In order to train the network (run SGD), one must first acquire the gradient of the loss. This is done by what is called back-propagation, a technique for cheaply computing the gradient of the network from the output node and back using the chain rule [25]. The multi-layer perceptron (MLP) is a feed-forward neural network that consists of one or more hidden layers, often called a "standard" ANN.

Activation function

As stated earlier, the activation function is what calculates the output of an artificial neuron. Because neural networks commonly use gradient minimization techniques, the weights of the hidden units are updated in proportion to the gradient of the error function. One effect of this is that the weights receive small updates if the gradient of the error is small. This is called the vanishing gradient problem and is present in many commonly used activation functions such as the logistic function (or sigmoid) and the hyperbolic tangent [7].


The rectified linear unit (ReLU), defined as

f(x) = \begin{cases} x, & x > 0 \\ 0, & x \leq 0 \end{cases}    (2.11)

solves this by always having the derivative f'(x) = 1 for x > 0. Because the ReLU is 0 for negative inputs, training becomes fast, as not every hidden unit needs to be used for every training sample. However, this also causes a problem for the backpropagation algorithm, as it cannot back-propagate through a 0-neuron. ReLU and its variations are currently the most popular activation units and are considered best practice for most applications [46].

Decision trees

In machine learning, decision trees can be used as predictive models. Nodes represent feature-based tests with one branch for each outcome, and leaves represent the output [45]. Figure 2.8 shows an example of a decision tree, in which x and y represent features, and 1 and 0 represent the output classes.

Figure 2.8: Decision tree for binary classification

Training a decision tree concretely amounts to generating the tree structure that produces the best predictive performance. An impurity measure is used to gauge the usefulness of a node [40]; the lower the impurity score of a node, the better predictions the node yields. One generally starts with only a root node, and then nodes are split into children in a manner which maximizes the impurity decrease. In theory, a tree should be split until no further split of the tree yields a lower impurity, thereby obtaining a "pure" tree. However, many algorithms using decision trees control the number of nodes allowed in each tree, resulting in only a locally optimal purity.

There exist several "tree growing" algorithms that specify exactly how a tree is generated; most of these algorithms are based on what is called recursive partitioning [29]. In order to describe this algorithm, we must define what is called a learning set, L, which is basically the training data set. The algorithm starts with only a root node, containing the entire learning set L. The root node then splits along two branches by a boolean check of one value of one feature, for example weight ≥ 23. L is then divided into two new learning subsets L1 and L2, depending on which branch the individual samples fall into based on the condition. The two daughter nodes may then split into two additional nodes each, further splitting the learning sets. Of course, prior to any splitting, a stopping criterion check may be performed, in which case the algorithm stops and the tree is returned.
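
To make the procedure more tangible, the following is a toy Python sketch of recursive partitioning with Gini impurity. It is only an illustration under simplifying assumptions (binary integer labels, axis-aligned "feature ≥ threshold" splits, a fixed depth limit) and is not the CART/ID3/C4.5 implementation used later in this thesis.

    import numpy as np

    def gini(y):
        # Impurity of a learning (sub)set: 1 - sum_k p_k^2
        if len(y) == 0:
            return 0.0
        p = np.bincount(y) / len(y)
        return 1.0 - np.sum(p ** 2)

    def best_split(X, y):
        # Locally optimal boolean split "feature >= threshold" maximizing impurity decrease
        best, parent = None, gini(y)
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                mask = X[:, f] >= t
                if mask.all() or (~mask).all():
                    continue
                child = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
                gain = parent - child
                if best is None or gain > best[0]:
                    best = (gain, f, t)
        return best  # (impurity decrease, feature index, threshold) or None

    def grow(X, y, max_depth=3):
        # Stop on a pure node, an exhausted depth budget, or no possible split
        split = best_split(X, y)
        if max_depth == 0 or gini(y) == 0.0 or split is None:
            return {"leaf": int(np.bincount(y).argmax())}
        _, f, t = split
        mask = X[:, f] >= t
        return {"feature": f, "threshold": t,
                "left": grow(X[~mask], y[~mask], max_depth - 1),
                "right": grow(X[mask], y[mask], max_depth - 1)}

    X = np.array([[21.0, 1.0], [25.0, 0.0], [30.0, 1.0], [18.0, 0.0]])
    y = np.array([0, 1, 1, 0])
    print(grow(X, y))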

One quickly realizes that this is quite a naive algorithm, visiting each node only once and choosing each split by local optimization. Exactly how the splitting conditions are selected and which stopping conditions exist is decided by each particular implementation. The most famous tree growing algorithms include CART, ID3, and C4.5 [40]. One difference between these methods is whether to use early stopping, or to build a full tree and then prune it, cutting away select branches. Another is whether to always perform a binary split, or to allow more than two daughter nodes per parent node.

Random forests

Multiple classifier systems, or ensembles, are a group of supervised learning algorithms which use multiple (weak) classifiers in order to create one strong classifier [21]. The classifiers are called weak because they are low-complexity versions of the aggregated classifier type.

Bagging is a type of ensemble method in which many weak estimators are trained on various subsets of the learning set and are aggregated in the final classifier by voting [13]. This means that at the time of prediction, each estimator is queried for an outcome; the prediction of the classifier is then the most common choice among the estimators. In bagging, each estimator is trained independently, and it is therefore possible to, for example, parallelize estimator training.

Random forests are a version of bagging in which decision trees are used as estimators [14]. A random forest is trained by training weak decision trees, called stumps. During training, various attributes of the stumps are controlled by what we described in Section 2.7, for example the tree depth. In addition to hyperparameters specific to each decision tree, an important hyperparameter of a random forest is the number of estimators to train. Random forests are resilient to the overfitting complications of decision trees in that each weak estimator is shallow.

Although bagging methods classically use voting for determining the classification, it is possible to use other alternatives. For example, the scikit-learn implementation uses the trees' predictive average instead of having them vote (see http://scikit-learn.org/stable/modules/ensemble.html).
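
As a small illustration of the difference, the sketch below (scikit-learn and NumPy assumed; random stand-in data) contrasts scikit-learn's probability averaging with a classical hard vote over the individual trees.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.rand(200, 10)
    y = np.random.randint(0, 2, size=200)
    forest = RandomForestClassifier(n_estimators=100).fit(X, y)

    # scikit-learn's prediction averages the per-tree class probabilities ...
    averaged = forest.predict(X[:5])

    # ... whereas a classical bagging vote tallies each tree's hard prediction.
    votes = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
    majority = (votes.mean(axis=0) > 0.5).astype(int)
    print(averaged, majority)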

Gradient boosting decision trees

Similarly to bagging, boosting is an ensemble method in which multiple weak estimators are used to create one strong classifier. However, instead of training each estimator independently, boosting is a sequential process in which every new estimator is dependent on the previously trained estimators [48]. Boosting algorithms maintain a set of weights which are updated during training in a way which makes each estimator focus on the examples that the previous estimator did not perform well on.

Gradient boosting is a combination of the terms gradient descent and boosting, and is based on the fact that boosting can be seen as an optimization algorithm over a certain loss function [24]. In this case, the weights described in the above paragraph equal the negative gradient of the loss function. There exist variations of gradient boosting techniques. For example, one variation may use the entire data set for training in each iteration, while another uses some subsample of the full data set (stochastic gradient boosting). Because of the sequential nature of gradient boosting, it is not possible to parallelize the training of each learner. Gradient boosting decision trees is simply gradient boosting in which decision trees are used as estimators.

2.8 Evaluation metrics

Before we define the metrics themselves, it is important to define the concept of a confusion matrix. It is a table of four items that are defined on a combination of a true class and a class predicted by a model. If the true class and the predicted class are both positive, it is defined as a true positive (TP). If the true class is negative and the predicted class is positive, it is defined as a false positive (FP). If the true class and the predicted class are both negative, it is defined as a true negative (TN). Finally, if the true class is positive and the predicted class is negative, it is defined as a false negative (FN) [22]. One way of concretizing this is to view the true class as whether or not a patient has a certain affliction, while the predicted class is the diagnosis of a doctor. Then, if the patient has cancer, but the doctor incorrectly diagnoses the patient as healthy, it would correspond to an FN. Figure 2.9 shows a confusion matrix.

Figure 2.9: Confusion matrix

Many of the well-known metrics for model evaluation in machine learning can be derived from the terms in this matrix. Accuracy, Precision, Recall, and F1 score are defined as [22]:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (2.12)

Precision = \frac{TP}{TP + FP}    (2.13)

Recall = \frac{TP}{TP + FN}    (2.14)

F_1 score = 2 \cdot \frac{precision \cdot recall}{precision + recall}    (2.15)

Accuracy therefore describes how many out of all predictions are correct. Precision describes how many of the predicted positives are true, while recall shows the proportion of actual positives identified. This means that a model that predicts the positive class very seldom may have a very good precision, while a model that frivolously predicts the positive class may have a very high recall. Because of this phenomenon, recall and precision may be good for analyzing individual factors, but quite weak at evaluating how "good" the model is at predicting the positive class in general. We may therefore instead use the F1 score, which is the harmonic mean of precision and recall. That is to say, it measures how good the model is at finding all positive cases without predicting the positive class too often.

In addition to these metrics, we define what we call AccSkew, which is a comparison of the accuracy and the distribution of output classes, more specifically

AccSkew = accuracy - skewness    (2.16)

where skewness is the percentage of the most common outcome in a data set. Concretely, if 10 samples contain 7 outputs '1' and 3 outputs '0', or vice versa, the skewness is 70%. The AccSkew is not an absolute value and may be negative, suggesting that simply guessing the most common class is more accurate than using the classifier in question.
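
A minimal Python sketch of Equations 2.12-2.16, using made-up confusion matrix counts rather than results from this study:

    # Made-up example counts: 45 actual positives, 55 actual negatives
    tp, tn, fp, fn = 40, 45, 10, 5

    accuracy  = (tp + tn) / (tp + tn + fp + fn)        # Equation 2.12
    precision = tp / (tp + fp)                          # Equation 2.13
    recall    = tp / (tp + fn)                          # Equation 2.14
    f1_score  = 2 * precision * recall / (precision + recall)  # Equation 2.15

    # AccSkew (Equation 2.16): accuracy minus the share of the most common outcome
    positives, negatives = tp + fn, tn + fp
    skewness = max(positives, negatives) / (positives + negatives)
    acc_skew = accuracy - skewness
    print(f"acc={accuracy:.2%}, f1={f1_score:.2%}, accskew={acc_skew:+.1%} points")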


ROC analysis

In binary classification, most classifiers predict one class over the other by being over 50% confident in the outcome. For example, in a random forest, if 53% of the trees vote for the positive class, the positive class will be predicted. This is equal to having a 0.5 discriminator threshold, or simply discriminator, and can be likened to relying on the predictions of the classifier. However, instead of relying on the classifier as is, we may sometimes achieve a higher predictive performance by moving the discriminator. For example, we may decide to only predict the positive class if the classifier is over 70% certain, which would be the same as using a 0.7 discriminator. This can be likened to not fully relying on the prediction of the classifier, as we may then predict a class for which the model does not have the highest probability.
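
The following small sketch (scikit-learn assumed, with random stand-in data) shows what moving the discriminator from 0.5 to 0.7 looks like in practice:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.rand(300, 10)
    y = np.random.randint(0, 2, size=300)
    clf = RandomForestClassifier(n_estimators=100).fit(X, y)

    proba_positive = clf.predict_proba(X)[:, 1]          # P(class = 1) per sample
    default_pred = (proba_positive > 0.5).astype(int)    # 0.5 discriminator
    strict_pred = (proba_positive > 0.7).astype(int)     # 0.7 discriminator
    print(default_pred[:10], strict_pred[:10])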

Figure 2.10: ROC curve with discriminator thresholds 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 from the top right to the bottom left.

Receiver Operating Characteristic (ROC) analysis examines a classifier's ability to predict the positive class when the discriminator is varied. ROC analysis is traditionally used in medical diagnostics, and has in modern times become popular for machine learning classifier evaluation [22]. Figure 2.10 shows an example of a ROC curve, where the y-axis represents the rate of TPs, and the x-axis represents the rate of FPs. Each data point signifies a certain discriminator. A perfect score would be a value in the top left, signifying finding every positive case while not raising any false alarms. The diagonal line represents the results of random guessing, so a value below the diagonal line represents a classifier that performs worse than guessing for that particular threshold. For example, the point (0.7, 0.7) represents predicting positives in 70% of cases, getting them wrong half the time and getting them right half the time. Because of these characteristics, ROC analysis is efficient for determining the predictive performance on the positive class in data sets skewed towards the negative class.

In diagnostics, it may be important to have good performance at various threshold levels, in which case the Area under the curve (AUC) may be calculated and compared between systems. It should be used with caution as a general measure of single-threshold model performance, however, as a classifier with great performance at a certain threshold and bad performance at all other thresholds would yield a low AUC. In addition, research has shown that although AUC is insensitive to class skewness, it can mask poor performance [31].


2.9 Comparative studies

There have been several occasions where a suite of machine learning algorithms has been compared on a suite of metrics and problems. StatLog [32] from 1995, the reviews by Kotsiantis et al. [35] and Caruana et al. [16] from 2006, and the study by Caruana et al. [15] from 2008 are some of the most well-known. They all conclude that while some algorithms may be stronger on average over a range of problems, a learning algorithm should be chosen based on the problem and data at hand.

2.10 Memory and run-time

Kotthaus et al. [36] have performed a study in which a range of supervised machine learning algorithms for binary classification were evaluated in terms of run-time and memory usage. They do not study training and classification times separately but find that the total run-times of random forests, support vector machines, and gradient boosting decision trees are comparatively slow, while k-nearest neighbor and logistic regression are comparatively fast. They also find that gradient boosting decision trees and logistic regression have a comparatively low total memory allocation, while random forests and support vector machines have a comparatively high total memory allocation. They conclude, however, that both run-time and memory usage are affected by the algorithm choice, the implementation, and the hardware. It is therefore difficult to draw conclusions about a specific algorithm by studying certain implementations.


3 Method

This chapter describes the data extracted from the LTE/LTE-A network and the pre-processing of that data for use with learning algorithms. Following this, we motivate which algorithms were selected for the initial study. We then describe the experiments that were conducted in order to gauge the initial suite of learning algorithms, including choices for pre-processing and hyperparameter optimization. The chapter continues by describing the algorithms chosen for comprehensive comparison, as well as their configurations. We then describe various analytic experiments performed with the optimized algorithms. Finally, we introduce novel implementations by which the choice of learning algorithm varies depending on data attributes.

3.1 Platform

The algorithms were implemented in Python using scikit-learn¹, with pandas² being used for data handling. Algorithm test runs were performed on 2 Ericsson Red Hat VMs with 8 CPU cores and 32 GB of RAM each. Training and classification speeds were measured simply by using the time function of the Python time package. Hyperparameter optimization was implemented using hyperopt-sklearn³, as well as the implementations of grid search and random search in scikit-learn. Visualization was performed using pandas and matplotlib⁴.
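
As an illustration of how such timings can be taken with the time package, the following is a minimal sketch; the classifier choice and the random stand-in data are only examples, not the thesis configuration.

    import time
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder data standing in for one model's feature matrix and labels
    X = np.random.rand(1000, 80)
    y = np.random.randint(0, 2, size=1000)

    clf = RandomForestClassifier(n_estimators=100)

    start = time.time()
    clf.fit(X, y)
    training_time = time.time() - start

    start = time.time()
    clf.predict(X)
    classification_time = time.time() - start

    print(f"TRT: {training_time:.3f}s, CLT: {classification_time:.3f}s")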

3.2 From network data to features

Our raw data consists of a number of time stamped frequency measurements. These include measurements to the SC and NCs on the UE's frequency, and to inter-frequency cell neighbors. The measurements to the inter-frequency cells are performed in a way that lets us save only the cell on a particular frequency which has the highest signal strength. Using the CGI and the time stamps, we can link together measurements into a sample which provides information regarding whether or not a handover to a particular inter-frequency cell would be successful.

¹ https://scikit-learn.org/stable/
² https://pandas.pydata.org/
³ http://hyperopt.github.io/hyperopt-sklearn/
⁴ https://matplotlib.org/


Table 3.1: A sample

    CGI    1639551231
    SF     2688 MHz
    IF     720 MHz
    SCM    15 RSRP
    NCM1   PCI 302, 24 RSRP
    ...
    NCM8   PCI 12, 17 RSRP
    ICM    23 RSRP

An example of a sample can be seen in Table 3.1, where SF is serving frequency, IF is inter frequency, SCM is serving cell measurement, NCM is neighboring cell measurement, and ICM is inter cell measurement. It is important to note that for this study we are exclusively studying inter-frequency measurements, which is why every sample contains an ICM. We should also point out that the number of NCs available in each sample varies between 1 and 8.

What we want to do is to predict the signal on the inter-frequency, which is why the ICM will be our output. However, the RSRP value of the ICM is first converted into binary using a threshold. This threshold is a network configuration constant and is there to determine what constitutes a good enough signal for a handover. In our case, this threshold is set to 24. This means that every ICM RSRP ≥ 24 is transformed into 1, while every ICM RSRP < 24 is transformed into 0. This also means that we transform the problem from regression to binary classification. The motivation for this is partly that internal preliminary testing showed better results for classification than for regression, and partly that another ongoing master thesis study at Ericsson on the same data set is focusing on regression.

When transforming the sample data to features, we group the data by CGI and IF, which means that we will have one model per cell and inter-frequency. This also means that those variables are constant in each model's data set and would therefore be redundant. The samples contain multiple value data for the NCMs, which is problematic for many algorithms to handle. For this reason, we transform the data from a dense representation to a sparse representation, in which every PCI has its own data column, and every row is a feature vector that represents a particular sample. Tables 3.2 and 3.3 are examples of the input and output of one row in our representation (a small code sketch of this transformation follows the tables). Note that in this instance, PCI0 and PCI144 have the value 0. This may either mean that the RSRP value is 0, or that the data value is missing. This means that any information regarding the difference between missing data and RSRP 0 will be lost to the algorithm. Because each sample contains the SCM and a maximum of 8 NCMs, there can be a maximum of 9 non-zero values in each feature vector. Finally, it should be noted that because the samples available for each CGI contain various numbers of unique PCIs, the length of the feature vector may vary between classification models.

Table 3.2: Input

    SCM     15
    PCI0    0
    ...
    PCI99   24
    ...
    PCI144  0

Table 3.3: Output

    ICM  1
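
The following is a rough Python sketch (pandas assumed) of the transformation described above, with made-up values and illustrative names only; it thresholds the ICM RSRP into a binary class and pivots the NCM list into one column per PCI.

    import pandas as pd

    THRESHOLD = 24  # network configuration constant

    # One raw sample: serving cell measurement, neighbor measurements keyed by
    # PCI, and the inter-frequency cell measurement (made-up values).
    sample = {"SCM": 15, "NCM": {302: 24, 12: 17}, "ICM": 23}
    all_pcis = [0, 12, 99, 144, 302]  # unique PCIs seen for this CGI/IF model

    def to_feature_row(sample, all_pcis):
        row = {"SCM": sample["SCM"]}
        for pci in all_pcis:
            # Missing neighbors and RSRP 0 both become 0, as noted above
            row[f"PCI{pci}"] = sample["NCM"].get(pci, 0)
        return row

    X = pd.DataFrame([to_feature_row(sample, all_pcis)])
    y = pd.Series([1 if sample["ICM"] >= THRESHOLD else 0], name="ICM")
    print(X)
    print(y)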


3.3 Data statistics

The total number of samples, when limiting the available data for each model to 5000 samples, is 1,415,049. The number of classification models, where each model is constituted by one cell and one target cell frequency, equals 428. This gives each model an average of 3283 samples, or around 2955 training samples, if split 9 to 1. The average number of neighbor measurements per sample is 3. As the average number of neighbors per feature vector is 80, the data representation is sparse, meaning that a high number of data matrix elements are empty. The average share of elements in the feature matrices that are non-zero is 4.96%; we will call this value the allotment. We also define a single feature's allotment as its ratio of non-zero values. It is important to note that with our data representation, the maximum possible allotment is quite low. For example, if our feature vector contains 90 features, we may have a possible maximum of 10% allotment, because each sample may contain at maximum the serving cell and an additional 8 neighbors, meaning the rest of the neighbors will be 0 for this particular row in the matrix.
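
A minimal sketch of how the allotment can be computed for a feature matrix (NumPy assumed; the matrix below is a tiny stand-in):

    import numpy as np

    X = np.array([
        [15, 0, 24, 0, 11],
        [12, 0,  0, 9,  0],
    ])

    matrix_allotment = np.count_nonzero(X) / X.size                 # whole matrix
    feature_allotment = np.count_nonzero(X, axis=0) / X.shape[0]    # per feature

    print(f"matrix allotment: {matrix_allotment:.2%}")
    print("per-feature allotment:", feature_allotment)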

Figure 3.1: Output class skewness per frequency

Figure 3.1 shows the distribution of the output classes depending on the target frequency of the model. For the lowest frequency, which contains 19.7% of the data, the skewness towards the '1' class is extreme, at 93.2%. The highest frequency, which contains 44.0% of the data, is instead very skewed towards the '0' class, at 87.0%. Finally, the middle frequency, which contains 36.3% of the data, is 69.1% skewed towards the '0' class. The varying levels of skewness on the different frequencies mean that the average RSRP value of the outcome variable before conversion varies between frequencies. The lowest frequency, for example, has an average RSRP far higher than 24, while the middle frequency has an average RSRP slightly below 24. In total, there are only 33 models in the range of 50%-70% skewness, whereas there are 330 models in the range of 80%-100% skewness. In addition to the skewness, we note that the allotment of the different frequencies is close to equal, at 5.34%, 4.87%, and 4.83%, for low, middle, and high, respectively.

It is important to note that these frequency bands are not representative of every LTE/LTE-A network. There are many more frequencies in use for LTE/LTE-A networks, and the number of bands in use also varies between networks. The configuration of the network is of course decided by the owner of the network, and we therefore do not possess information as to why these exact frequencies were chosen for this particular network.


3.4 Choice of algorithms

There are many supervised learning algorithms, but many of these are variations within a category of algorithms. For instance, k-nearest neighbor and radius nearest neighbor are both implementations of the nearest neighbor category of algorithms. In addition to this, many supervised learning algorithms only work for, or are mostly suited for, either regression or classification. We selected 7 distinctly different learning algorithms spanning the most widely used approaches to binary classification. Table 3.4 shows the chosen algorithms and the abbreviations we will be using from this point on.

Table 3.4: Algorithm suite

    Logistic regression               LogReg
    Support vector machine            SVM
    k-Nearest neighbor                KNN
    Gaussian processes                GP
    Random forest                     RF
    Gradient boosting decision trees  GBDT
    Multi-layer perceptron            MLP

3.5 Initial test suite

The suite of algorithms was implemented in Python using scikit-learn, with hyperopt-sklearn for hyperparameter optimization. Where hyperopt's default parameter distributions were not applicable, a short test run on a custom parameter distribution was performed, before using the best average configuration for all models. In addition, the full suite was also tested with 3 different dimensionality reduction techniques: PCA, TSVD, and SPCA. For all of these runs, [0,1]-scaling was used, and accuracy was used as a scoring function during training. The algorithms included, as well as the choice of optimization, can be seen in Table 3.5, where Cs is a factor for regularization strength, res is the number of restarts (complete re-training from new initial parameters), rbf is the kernel choice (radial basis function), and x, y, z is the number of hidden units in each layer, where x is the input layer, y are the hidden layers, and z is the output layer; #f is the number of features, relu is the choice of activation function, and lbfgs (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is the choice of minimization algorithm.

Table 3.5: Algorithms and Configurations

    Algorithm  Implementation              Hyperparameters
    LogReg     LogisticRegressionCV        Cs=10
    SVM        SVC                         hyperopt default
    KNN        KNeighborsClassifier        hyperopt default
    GP         GaussianProcessClassifier   res=3, rbf(0.25, (1e-5, 5e4))
    RF         RandomForestClassifier      hyperopt default
    GBDT       GradientBoostingClassifier  hyperopt default
    MLP        MLPClassifier               #f, #f/2 (x3), 1, relu, lbfgs

After the initial runs, GBDT, RF, LogReg, and MLP were selected for further investigation. Meanwhile, dimensionality reduction by any of the three techniques was abandoned.


3.6 Result parsing and visualization

During the training runs, the frequency, number of samples, skewness, training time, classification time, Accuracy, Precision, Recall, F1 score, and AccSkew were collected for each type of classifier. These values were stored in .csv files where one row represents the results of one model. In order to be able to analyze the large amounts of saved results, a parsing program was written in Python that extracts the average performances for models with certain model or data attributes. For example, the program can calculate the average Accuracy of every model for which the number of samples is in a certain range. In addition to the parsing program, a program for visualization was written in Python, primarily using the packages pandas and matplotlib.

3.7 Tool specifics

Scikit-learn uses a version of the CART algorithm for growing its trees, which is used by the RF and GBDT algorithms. In scikit-learn, some algorithms give the option of choosing the number of cores to use for training in parallel. For example, this option exists for RF but not for GBDT. However, this is not something specific to scikit-learn; gradient boosting is inherently impossible to parallelize due to its sequential nature. This is worth noting when considering the end results.

3.8 Hyperparameter optimization

When optimizing the algorithms selected for further investigation, a combination of grid search and random search was used. Firstly, a grid was specified and run on 20% of the models, selected so that many different skewness levels were represented. The hyperparameter values which were never selected, or selected very seldom, were removed. If the lowest or highest value was selected often, then a lower or higher value, respectively, was introduced into the grid. This process was repeated 3 times.

The resulting grids' numeric parameters were transformed into uniform continuous distributions from the lowest value to the highest value of each grid. Table 3.6 shows the transformation from a grid to a search space.

Table 3.6: Search space grid to distr. transformation

    Attr. & Val. (Grid)              Attr. & Val. (Distr.)
    penalty: 'l1', 'l2'              penalty: 'l1', 'l2'
    solver: 'saga', 'liblinear'      solver: 'saga', 'liblinear'
    C: 0.1, 1.0, 10.0                C: uniform{0.1, 10.0}
    max_iter: 10, 100, 300           max_iter: uniform{10, 300}
    class_weight: None, 'balanced'   class_weight: None, 'balanced'

The resulting random search distributions are used as the final hyperparameter optimization configuration. Each of the final random search distributions outperformed the default hyperopt-sklearn distributions, as well as the custom hyperparameter configurations.

3.9 Final algorithm configurations

The final implementations of the algorithms use the MaxAbsScaler and RandomizedSearchCV classes of scikit-learn. This represents a [0,1]-scaling and a random search hyperparameter optimization with 3-fold cross-validation for each configuration.
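
A rough sketch of what such a configuration can look like in scikit-learn is given below; the search space loosely mirrors Table 3.9 for the RF, but the exact distributions, n_iter, and the random stand-in data are assumptions for illustration only.

    import numpy as np
    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MaxAbsScaler

    X = np.random.rand(500, 80)
    y = np.random.randint(0, 2, size=500)

    search = RandomizedSearchCV(
        Pipeline([("scale", MaxAbsScaler()),                 # [0,1]-scaling
                  ("clf", RandomForestClassifier(n_jobs=8))]),  # RF trains in parallel
        param_distributions={
            "clf__n_estimators": randint(8, 1256),
            "clf__min_samples_split": randint(2, 20),
            "clf__min_samples_leaf": randint(1, 3),
        },
        n_iter=20, cv=3, scoring="accuracy")                 # 3-fold cross-validation
    search.fit(X, y)
    print(search.best_params_)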


Logistic regression

The hyperparameter search space for the LogReg classifier can be seen in Table 3.7. solver determines which minimization algorithm to use, and C is the inverse of regularization strength; the lower the value, the stronger the regularization. For hyperparameters that are not listed, the default value is chosen.

Table 3.7: Logistic regression hyperparameter search space

    Attribute  Values
    solver     'saga'
    C          uniform{0.1, 10.0}

Multi-layer perceptron

The hyperparameter search space for the MLP can be seen in Table 3.8, where nf is the number of features and each item within each parenthesis of hidden_layer_sizes represents one hidden layer. activation specifies the activation function used, and alpha specifies the regularization strength; the higher the value, the stronger the regularization. learning_rate_init specifies the constant learning rate to use throughout training. For hyperparameters that are not listed, the default value is chosen.

Table 3.8: Multi-layer perceptron hyperparameter search space

    Attribute           Values
    hidden_layer_sizes  (nf*2, nf*3, nf*4), (nf*4, nf*6), (nf*8)
    activation          'relu'
    solver              'adam'
    alpha               uniform{0, 0.01}
    learning_rate_init  uniform{0, 0.1}

It can be noted that the configurations of hidden layers and units only contain layers with an increasing number of units per layer. Equal and descending numbers of units were also tested but were discarded during the multi-stage grid search due to being selected very seldom.

Random forest

The hyperparameter search space for the RF can be seen in Table 3.9. n_estimators determines the number of trees to grow, min_samples_split determines the number of samples required in the learning set of a node in order to split it, and min_samples_leaf determines the number of samples that must be present in the learning set of a leaf node.

Table 3.9: RF hyperparameter search space

    Attribute          Values
    n_estimators       uniform{8, 1256}
    min_samples_split  uniform{2, 20}
    min_samples_leaf   uniform{1, 3}

Gradient boosting decision trees

The hyperparameter search space for the GBDT can be seen in Table 3.10. The first four parameters have the same meaning as for the MLP and RF, and max_depth specifies the maximum allowed tree depth for each estimator.


Table 3.10: Gradient boosting decision tree hyperparameter search space

    Attribute          Values
    n_estimators       uniform{8, 1256}
    learning_rate      uniform{0.001, 1}
    min_samples_split  uniform{2, 20}
    min_samples_leaf   uniform{1, 3}
    max_depth          uniform{1, 10}

It is interesting to note that the configurations that proved effective for RF and GBDT are very similar. Surprisingly, setting a value for max depth proved effective for GBDT but not for RF. Specifically, removing the parameter from GBDT increased training time but did not largely affect predictive performance, while adding the parameter to RF did not largely affect training time but reduced predictive performance.

3.10 Analytic experiments

In order to determine the predictive performance of the optimized algorithms, the Accuracy, F1 score, and AccSkew were analyzed. In addition to comparing these metrics when averaging over all models, we also studied how the results varied when averaging over models based on different levels of skewness in data. The motivation for this is that preliminary testing showed that the AccSkew was generally quite low, meaning that the skewness seemed very important for the Accuracy of the models. We wanted to determine whether this was an effect of the data set generally being very skewed, or of the data containing information which was difficult for the algorithms to interpret.

In order to have time to complete the various subsequent experiments, the number of algorithms to use had to be limited. Because the RF showed the best predictive performance while having a reasonable training time, it alone was used for some of the experiments.

Training- and test score

In order to determine whether or not our models suffer from over- or underfitting, we analyzed the difference in training and test scores over the four optimized algorithms. Since the models are trained with accuracy score as an objective function, we compared it to the accuracy of the trained models on the test set. The performance was analyzed both when averaging over all models and when averaging over models based on different levels of skewness.

Scaling technique

During the initial test run and all subsequent experiments up to the point of our four finalized algorithms, [0,1]-scaling was used. However, it is not certain that [0,1]-scaling, or scaling at all, is the optimal choice for our configurations.

In order to determine whether or not [0,1]-scaling was the strongest technique once the hyperparameter optimization had been finalized, a study was conducted in which the scaling technique was varied but all other factors remained constant. Apart from [0,1]-scaling, standardization and performing no scaling were evaluated. Only the RF was used in this study.

Training- and classification speed

In order to easily test the application in the field, everything should be included in a package ready to run in a node. This means that data collection, storing samples, training, storing a model, and classification should all run in the same package without the need for an external system. Because of this, both the speed of training and the speed of classification are of interest. It is important to note that should a more intricate system be used, where the training is performed in a cloud environment, for example, training speed and the memory requirements of training data storage would no longer be important factors.

In order to determine how the algorithms perform in regards to speed of training and classification, these results were studied based on different training sizes. It is important to note that we do not analyze the speed of the algorithms from a purely theoretical standpoint, but rather compare these specific implementations of the algorithms. Hopefully, we should be able to draw some conclusions that are not implementation-specific, as it is not certain Ericsson would use these specific implementations, should any of the algorithms be implemented for node usage.

Sample size

Studying the number of samples to use when training is interesting for various reasons. For this application in particular, it is important because the system, once installed in the field, must gather samples for a time before it can be used. The quicker the system can be up and running in a node, the better. In general, it is also possible that a lower number of samples does not reduce predictive performance, in which case using a larger number of samples is pointless, increasing memory requirements and training time. For all the experiments presented in this subsection, only the RF was used.

In order to determine how the training size affects performance, we first analyzed how the AccSkew varied when studying RF models for which varying training sizes were available. Since the result of this was deemed inconclusive, and could be dependent on other factors of the models rather than simply the training size, a more intricate study was performed. In this study, the cap of 5000 samples was removed and 12 models were found for which the total number of samples surpassed 8000. This meant that we could train on up to 7000 samples. The training samples in each interval were a random subset of the full data set of each model, drawn stratified to maintain the same skewness between training instances. The models were trained with samples drawn in intervals of 1000, and model training was repeated 3 times for each training size. This was done because of the potential variation in results. The lower the training size, the lower the test size, and the lower the test size, the more arbitrary the evaluation of model performance is. In addition, if we had only trained on, for example, 1000 samples when the full set has 7000 samples, we would risk choosing a sample set which is not representative of the full set. This only pertains to variations in the input, as output class distribution variations are counteracted by stratification. After these tests were run, we analyzed the average AccSkew for the 12 models over the 7 intervals, as well as each model's performance individually.

With the information gathered from this experiment, yet another experiment was conducted in which the training sizes were drawn in intervals of 200, from 200 to 1000. Because of the low training sizes for this study, the test size was increased from 10% to 25%. In addition, model training was repeated 10 times instead of 3.
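
The following is a small sketch (scikit-learn assumed, random stand-in data) of drawing stratified training subsets of increasing size, in the spirit of the experiments described above:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(8000, 80)
    y = np.random.randint(0, 2, size=8000)

    for train_size in range(1000, 8000, 1000):
        # stratify=y keeps the output class skewness equal between draws
        X_sub, _, y_sub, _ = train_test_split(X, y, train_size=train_size, stratify=y)
        # ... train and evaluate the RF on (X_sub, y_sub) here ...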

Feature size

Feature selection is of interest when exploring the information available in the data, and how model performance varies when it is used. It is also always helpful to be able to decrease the amount of data needed to train the model, especially when memory requirements are to be considered.

In order to determine the quality of our feature composition, we trained the RF on various subsets of our feature space; training with all features was compared to training with only the neighbor features, and with only the serving cell feature. The results were studied when averaging over all models, and when averaging over models with even skewness levels.

After this preliminary study, an additional study was conducted in which neighbor features were removed based on the allotment. The RF was trained with the feature count set to 1/2, 1/4, 1/8, and 1/16 of the original amount.

Undersampling

Undersampling can help us understand how the information in the data varies between models when skewness is not a factor. It could also be helpful when trying to increase performance on the positive class in a data set skewed towards the negative class. Finally, reducing the amount of data needed to train the model is always helpful.

Because the data is heavily skewed, an experiment with uniform random undersampling was performed. After undersampling, each training and test set is exactly 50% skewed. Because the skewness levels are equal, there was no need to evaluate the AccSkew; instead, we simply compared the Accuracy with and without undersampling. This study was conducted with the RF.
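
A minimal sketch of such uniform random undersampling to a 50/50 class balance (NumPy assumed; stand-in data):

    import numpy as np

    rng = np.random.default_rng()
    X = np.random.rand(1000, 80)
    y = np.hstack([np.ones(870, dtype=int), np.zeros(130, dtype=int)])  # skewed labels

    minority = min(np.bincount(y))
    keep = np.hstack([
        rng.choice(np.flatnonzero(y == c), size=minority, replace=False)
        for c in (0, 1)
    ])
    X_bal, y_bal = X[keep], y[keep]
    print(np.bincount(y_bal))  # equal counts for both classes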

ROC analysis

ROC analysis is best used when analyzing systems for use in multi-discriminator diagnostics, but can also be used to analyze whether or not predictive performance can be increased for single-discriminator systems. Since most classifiers use a 0.5 discriminator by default, we can analyze the ROC curve and try to distinguish whether other discriminators have a better ROC.

In order to determine whether or not varying the discriminator could increase our predictive performance on the positive class, we performed ROC analysis on the same 12 models that were used in the training size study. Because we were only interested in finding a single optimal threshold, and not the overall diagnostic capability of the models, we chose not to study the AUC but rather to study the individual curves analytically.

Objective function

Accuracy may not be the best metric for evaluating the predictive power on the positive class in particular. Therefore, a high F1 score is perhaps more sought after than a high accuracy for this objective. If this is the case, using accuracy as a scoring function during training may not be optimal.

In this experiment, we train the RF with F1 score and ROC-AUC as scoring functions in addition to accuracy, and evaluate the resulting accuracy and F1 score.

3.11 Hybrid algorithms

The various experiments conducted show that the AccSkew is smaller the larger the skewness, that the training size had little to moderate impact, and that reducing the feature size by allotment does not impact performance at some levels. We have also established that while RF outperforms LogReg at low skewness levels, RF and LogReg have more similar predictive performance the higher the skewness. At a high enough skewness level, no algorithm has a significant AccSkew.

Therefore, hybrid implementations were created in which the algorithm used depends on the level of skewness of the data. For 50-70% skewness, the optimized RF algorithm was used. For 70-90% skewness, the optimized LogReg algorithm was used. Finally, for 90-100% skewness, the most common class is chosen. In addition to these criteria, variations were introduced based on the feature size and training size. For Hybrid 1, the full training size and feature set were used. For Hybrid 2, the number of features was cut in half based on allotment. For Hybrid 3, the number of features was cut by half, and the training size was reduced to a maximum of 2000 samples. The exact configurations of each hybrid can be studied in Tables 3.11, 3.12, and 3.13; a small sketch of the skewness-based selection logic follows the tables.

Table 3.11: Hybrid 1 configuration

    Type        Choice   Factor    Value
    Alg         RF       Skewness  <70%
    Alg         LogReg   Skewness  70-90%
    Alg         N/A      Skewness  90-100%
    Features    All      N/A       N/A
    Train size  Full     N/A       N/A

Table 3.12: Hybrid 2 configuration

    Type        Choice   Factor    Value
    Alg         RF       Skewness  <70%
    Alg         LogReg   Skewness  70-90%
    Alg         N/A      Skewness  90-100%
    Features    1/2      N/A       N/A
    Train size  Full     N/A       N/A

Table 3.13: Hybrid 3 configuration

    Type        Choice       Factor    Value
    Alg         RF           Skewness  <70%
    Alg         LogReg       Skewness  70-90%
    Alg         N/A          Skewness  90-100%
    Features    1/2          N/A       N/A
    Train size  2000 or Max  N/A       N/A
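
A simplified sketch of the skewness-based selection logic is given below; the function name is illustrative, and the plain scikit-learn classifiers stand in for the optimized configurations of Section 3.9.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    def choose_estimator(skewness):
        """skewness: fraction of the most common output class (0.5-1.0)."""
        if skewness < 0.70:
            return RandomForestClassifier()   # stand-in for the optimized RF
        if skewness < 0.90:
            return LogisticRegression()       # stand-in for the optimized LogReg
        return None                           # predict the most common class

    print(type(choose_estimator(0.65)).__name__)   # RandomForestClassifier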

3.12 Comparison

The final procedure was to compare the results of the implemented algorithms with other approaches. We chose to compare the RF and LogReg, as well as our Hybrid 1, Hybrid 2, and Hybrid 3, against an Ericsson prototype system and the approach of another master thesis student who has been studying regression. The results are compared in Accuracy, which is the only metric available from all systems.


4 Results

This chapter shows the results of the various test runs and experiments detailed in Chapter 3. The results of the initial test run are studied first. We then analyze the results of the algorithms that were selected for more comprehensive optimization: RF, GBDT, LogReg, and MLP. Following this, we investigate the results of various experiments that were conducted using the above listed algorithms, such as various sampling techniques and training sizes. Finally, we make a comparison of a suite of systems including the RF and LogReg algorithms, the Hybrid algorithms detailed in Section 3.11, and the Ericsson prototype system.

4.1 Initial test suite results

This section details the results of the initial test suite. As we may recall from Section 3.5, internal testing showed that the skewness may be an important factor for model performance. Because most of the models are quite skewed, we study the results of models with low skewness separately in addition to the average over all models. It should also be noted that preliminary testing showed an increase in predictive performance and a reduction of standard deviation when stratification was employed when sampling the training and test subsets.

The following results are comprised of metric averages when calculating the performance of models based on data attributes. 'Skew50-60' shows the average performance of models for which the skewness is in the range of 50% to 60%. 'Full' comprises all models. The codes for the metrics are as follows:

1. TRT: Training time (s)

2. CLT: Classification time (s)

3. ACC: Accuracy (%)

4. F1: F1 score (%)

5. ACS: AccSkew (p.p., percentage points)

Tables 4.1 and 4.2 show that LogReg was the quickest in both training and classification over all runs and all types of model averages. It also retained a good predictive performance in comparison to the other algorithms for the full model average.


Table 4.1: Results, Initial run, Scaling

    Full       TRT    CLT       ACC      F1       ACS
    KNN        8.9s   1.74e-1s  90.572%  55.970%  2.524 p.p.
    LogReg     1.3s   4.63e-4s  90.557%  52.369%  2.509 p.p.
    GBDT       42s    1.08e-2s  91.136%  57.838%  3.088 p.p.
    RF         35s    2.76e-1s  91.322%  62.126%  3.275 p.p.
    SVC        79s    5.19e-2s  90.554%  54.485%  2.507 p.p.
    GP         135s   1.33e-1s  90.642%  48.590%  2.594 p.p.
    MLP        2.7s   1.06e-3s  90.137%  59.948%  2.087 p.p.

    Skew50-60  TRT    CLT       ACC      F1       ACS
    KNN        8.4s   1.65e-1s  76.615%  75.173%  20.67 p.p.
    LogReg     1.5s   4.45e-4s  75.908%  75.278%  19.95 p.p.
    GBDT       35s    1.34e-2s  78.275%  77.187%  22.32 p.p.
    RF         35s    4.16e-1s  79.063%  78.444%  23.11 p.p.
    SVC        78s    7.81e-2s  77.542%  76.581%  21.59 p.p.
    GP         133s   1.20e-1s  76.492%  75.300%  20.538 p.p.
    MLP        2.3s   9.46e-4s  77.426%  76.981%  21.472 p.p.

Table 4.2: Results, PCA run, Scaling

    Full       TRT    CLT       ACC      F1       ACS
    KNN        5.25s  1.17e-1s  90.399%  55.176%  2.351 p.p.
    LogReg     0.67s  3.67e-4s  90.164%  49.718%  2.114 p.p.
    GBDT       43s    9.90e-3s  90.624%  54.013%  2.576 p.p.
    RF         55s    2.66e-1s  90.781%  56.869%  2.733 p.p.
    SVC        75s    2.20e-2s  90.741%  54.463%  2.694 p.p.
    GP         126s   6.29e-2s  90.706%  48.343%  2.658 p.p.
    MLP        0.74s  5.41e-4s  90.384%  58.460%  2.234 p.p.

    Skew50-60  TRT    CLT       ACC      F1       ACS
    KNN        4.9s   1.06e-1s  77.683%  76.444%  21.729 p.p.
    LogReg     0.43s  3.56e-4s  74.554%  73.287%  18.60 p.p.
    GBDT       59s    1.92e-2s  77.427%  76.065%  21.473 p.p.
    RF         44s    2.55e-1s  79.046%  78.403%  23.092 p.p.
    SVC        129s   3.38e-2s  78.650%  77.925%  22.697 p.p.
    GP         134s   6.05e-2s  78.967%  77.924%  23.013 p.p.
    MLP        0.64s  4.90e-4s  78.340%  78.142%  22.386 p.p.

For the models with a low skewness, however, LogReg failed to uphold the same performance as some of the other algorithms.

RF beats all the other algorithms in predictive performance over all types of model averages, but had a classification time several orders of magnitude higher than some other algorithms. Interestingly, RF has a noticeably higher F1 score than all the other algorithms, especially when averaging over all models. This suggests that RF is better than the other algorithms at finding positives on data skewed towards negatives. GBDT performs only slightly worse than RF, with a quicker classification speed. We must also point out that for RF, 8 cores were used to train estimators in parallel. Because this is not possible for GBDT due to its sequential training, this implementation of GBDT might have an edge over the RF in a system for which parallelization is not possible.

GP and SVC fail to impress, displaying worse predictive performance than RF and GBDT despite higher training times. GP has a very high training time despite not being trained multiple times for hyperparameter optimization. KNN has worse predictive performance than both GBDT and RF, with a slow classification time; it has an advantage over both in training time, however. The MLP, despite having the worst predictive performance out of all algorithms, has very quick training and classification speeds. Recall that the hyperparameters of the MLP were not optimized.

Studying the results of the run with PCA used as pre-processing, some algorithms got a reduced training time, but at the expense of predictive performance or classification time. For other algorithms, the training time was unexpectedly increased. The only algorithm that achieved an improvement on all metrics was GP, although a slight one. In conclusion, PCA did not have a clear positive effect on algorithm performance. In addition to PCA, TSVD and SPCA were tested as pre-processing for the full algorithm suite. Both of these techniques showed much worse performance with similar training times to PCA.

4.2 Optimized learning algorithm results

After the initial test suite had been analyzed, RF, GBDT, LogReg, and MLP were chosen for further optimization. This section details the results of the four optimized algorithms. We analyze the F1 score, Accuracy, and AccSkew when averaging over every model, and when averaging over models for which the data is in a certain range of skewness. In addition, we analyze the training and validation scores, as well as the training and classification times.

Figure 4.1: Metrics for logistic regression averaged over models based on the skewness of data

Figure 4.2: Metrics for multi-layer perceptron averaged over models based on the skewness of data


Figure 4.3: Metrics for random forest averaged over models based on the skewness of data

Figure 4.4: Metrics for gradient boosting decision trees averaged over models based on the skewness of data

Figures 4.1, 4.2, 4.3, and 4.4 show that while the accuracy decreases when the skewness decreases, the AccSkew increases by a larger amount. Also interestingly, the F1 score and the accuracy are closer the lower the skewness. At around an even distribution of output classes, the accuracy and F1 score are close to equal. In addition, the F1 score decreases by a significant amount when analyzing models in the range of 80%-90% skewness as opposed to models in the range of 90%-100%. Likely, this is an effect of most models on the lowest frequency being extremely skewed towards positives while the highest frequency is somewhat less skewed, and instead skewed towards negatives. Recall that in this data set the lowest frequency is the most skewed, and the only frequency skewed towards the positive class.

The algorithms, although varying somewhat on a p.p. basis, perform similarly over all metrics for predictive performance. The data seems to impact performance much more than the choice of algorithm. For data with a high amount of skewness, around 80%-100%, little is gained from using any of the investigated classifiers over simply using the most frequent class.

Training and validation scores

Figure 4.5 shows the algorithms' training and validation accuracies for various skewness levels. The figure shows almost equal training and validation scores across all algorithms. This means that regularization works well and we have little to no overfitting. In general, the accuracies go down as the skewness falls, but are reduced by less than the skewness. The accuracies are around 95% for data in the 90%-100% skewness range and between 75% and 82.5% for data in the 50%-60% skewness range. GBDT and RF perform better than the other two algorithms across all skewness levels.

Figure 4.5: Training and validation accuracy score for the chosen algorithms averaged over models based on the skewness of data. The fully drawn line represents the validation score and the dashed line represents the training score.

Surprisingly, some algorithms get a higher validation score than training score for some levels of skewness. This is theoretically unlikely but may occur in practice for several reasons. First of all, the test set is much smaller than the training set (10% vs 90%), which, if the model generalizes well, can cause the model to be "lucky" on the predictions of the test set. It can also be an effect of our features being so similar to one another that very informative samples from training are present in the test set, or very similar samples with only a few feature variations. Finally, because the cross-validation is only 3-fold, the internal splits can be unluckily drawn or non-representative. However, the differences are small, and the test score is what is of primary interest.

Training and classification time

Figure 4.6 shows that MLP and GBDT seem to have a polynomially increasing training time, whereas RF and LogReg have a linear increase. Figure 4.7 shows that RF has around two orders of magnitude higher classification time than the other algorithms. LogReg is the fastest. Observe that the y-axis of Figure 4.6 is in a linear scale, while the y-axis of Figure 4.7 is in a logarithmic scale.

It is notable that RF is so much slower than GBDT. Both GBDT and RF predict by querying every learned tree. If we analyze the average n_estimators for the chosen classifiers, the result is similar. However, as we may recall from Section 3.9, when optimizing RF and GBDT for predictive performance, setting a max depth was advantageous for GBDT but not for RF. This results in RF having deeper trees in general. This is likely a big factor for speed of classification, but it is difficult to determine how much of the difference is explained by the depth of the trees, and how much is explained by specifics of the algorithm implementations.


Figure 4.6: Training time for the chosen algorithms based on models with a certain number of training samples available

Figure 4.7: Classification time for the chosen algorithms based on models with a certain number of training samples available

Summary

The RF has the best predictive performance. GBDT has worse predictive performance than RF, but can be an option if the positive class is not important and when parallelization is not available. LogReg performs similarly to RF and GBDT at high skewness levels, with faster training and classification. MLP does not seem to have any major advantage over the other three algorithms, with lower predictive performance than the ensembles, slow training, and slower classification than LogReg.


4.3 Pre-processing technique results

We have mentioned previously that none of the SVD-based methods proved helpful for our data set and application. This includes PCA, SPCA, and TSVD. Sampling with stratification was shown in preliminary testing to be a major improvement for both the mean predictive performance and the standard deviation of predictive performance between runs, and has been used throughout the study. Although undersampling is a pre-processing technique, we will detail the results of using this technique in Section 4.5. This is partly because of its reduction of general predictive performance and increase in predictive performance for the positive class, and partly because the technique was specifically considered because of its potential for the positive class.

The remainder of this section details feature scaling and feature reduction techniques.

Scaling

This subsection explores how the results of the algorithms vary when different types of scaling are applied prior to training. Figure 4.8 shows the chosen algorithms’ performance in accskew when the data was preprocessed with [0,1]-scaling and with no scaling.

Figure 4.8: Accskew for the chosen algorithms averaged over all models based on whether or not the data was preprocessed with scaling

The only algorithm that showed a substantial increase in predictive performance with scaled data was the MLP. It did, however, slow down training by around 85%. It is important to note that this might be an indirect effect, as the hyperparameter optimization might have allowed for better predictions by using more complex models, while these types of models did not increase performance for non-scaled data. GBDT and RF showed marginal increases in performance, while LogReg’s training sped up by around 50%.

In addition to [0,1]-scaling, standardized scaling was also performed. It showed results very similar to [0,1]-scaling on all metrics, although slightly worse. The difference was so small that it could have been due to normal run-to-run variation. In general, scaling is so quick that even though there is no significant improvement, it makes sense to keep it for all the final configurations for consistency across algorithms, and to make the results more easily comparable to other approaches. Regarding consistency between algorithms, scaling is helpful for ruling out certain attributes of the data, such as the size of the values in the input vectors, when considering what causes variations in results between models.
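
For reference, the two scalings compared here correspond to min-max scaling to [0,1] and standardization to zero mean and unit variance; a minimal sketch of how they would be applied, assuming scikit-learn and the X_tr, X_te split sketched earlier:

# Sketch: [0,1]-scaling (MinMaxScaler) vs. standardized scaling (StandardScaler).
# The scaler is fit on the training data only and then applied to both parts.
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = MinMaxScaler()          # swap in StandardScaler() for standardized scaling
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)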


Training on a subset of features

This subsection shows how the predictive performance of the RF varies when various subsets of the full feature set are employed. Figure 4.9 shows that, while there is a clear difference in percentage points, the results of training with one feature and training with an average of 80 features only differ by around 2.2p.p. Of course, this also translates to a 2.2p.p. difference in accuracy.

Figure 4.9: Accskew when training on different sets of features and models with varying levels of skewness

For models with low skewness the difference is much larger: a 5.8p.p. difference between using only the serving cell and using the full feature set. However, there is also a clear difference between using the full feature set and discarding only the serving cell feature. All of this suggests that the serving cell feature is more important than each neighbor cell feature individually. This is expected, as the serving cell feature has a non-zero data value in every sample, or 100% allotment.

These results show that the full feature vector may not be needed; we therefore study the neighboring cells further. Figure 4.10 shows the accuracy when training with the serving cell and various fractions of the neighboring cells, selected by allotment. Again, this means that we remove features in order of having the most missing values. Reducing the features by half seems to not reduce the accuracy at all; the slight increase can be attributed to test-run variation. The difference between using all neighboring features and using 1/16 of the features is 1.6p.p. This means that the features remaining after removing 15/16 of them contribute 0.8p.p. For the average case, when we have 80 features, this would amount to using 5 features.

Studying the difference in F1 score alongside the accuracy, Figure 4.11 shows a much steeper descent for F1 score. This suggests that some low-allotment features are needed in order to correctly predict the positive class in a setting that favors the negative class. This means that for the application of simply guessing the correct class and performing at a high accuracy, removing quite a few neighbor features does not seem to make a big impact. However, for predicting the positive class successfully, feature selection should be applied more selectively. Note, however, that reducing the features by half does not reduce the F1 score.
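
Reducing the feature set by allotment, as done here, amounts to keeping the columns with the highest share of non-zero values; the sketch below is our own illustration of that selection, with a hypothetical keep fraction and the serving cell assumed to be column 0.

# Sketch: keep the fraction `keep` of features with the highest allotment
# (share of non-zero values), always retaining the serving cell feature.
# Column 0 as the serving cell is a hypothetical convention for this sketch.
import numpy as np

def reduce_by_allotment(X, keep=0.5, serving_col=0):
    allotment = np.count_nonzero(X, axis=0) / X.shape[0]  # per-feature non-zero share
    order = np.argsort(allotment)[::-1]                   # highest allotment first
    n_keep = max(1, int(keep * X.shape[1]))
    cols = set(order[:n_keep]) | {serving_col}
    cols = np.sort(list(cols))
    return X[:, cols], cols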


Figure 4.10: Accuracy when training with the serving cell feature and various fractions of the neighbor cell features

Figure 4.11: F1 score when training with the serving cell feature and various fractions of the neighbor cell features

4.4 Number of samples

This section shows how the predictive performance varies when analyzing models with different training data sizes. It also explores how the performance varies when sampling different amounts of training data from selected models.

Figure 4.12 shows accskew for the four optimized algorithms based on the number of samples that the models were trained on. The accskew increases for models with a higher number of training samples available. In general, the different algorithms seem to behave in a similar manner. However, the data is limited to a maximum of 4500 training samples. In addition, we compare different models instead of the same models trained on varying data sizes. These results are not substantial enough to draw strong conclusions about the effect of the training size.


Figure 4.12: Accskew for the chosen algorithms based on models with a certain number of training samples available

The rest of this section details the results of the experiment with the 12 RF models with large available data pools, trained with varying sample sizes. Note that the interval scale in Figures 4.13, 4.14, and 4.15 changes from one step every 200 samples to one step every 1000 samples. The legends of the graphs show the skewness of the data for each model average. In addition, the colors of the curves are consistent between graphs, meaning that, for example, the red curve in all graphs corresponds to the same model average. The ticks on the models’ curves show the standard deviation.

Figure 4.13: Accskew for random forest on 12 models trained on varying sizes of data

Figure 4.13 shows the accskew of the different model averages. Firstly, the model averages that have a low skewness seem to be more heavily impacted by the training size. The brown (57.3%) model increases by around 7.5p.p. over the entire interval while the red (95.3%) model increases by less than 2p.p. It is clear from the results that there is no particular threshold that causes the results to spike; rather, there is a small average increase at each interval. There are some results that seem to not depend entirely on the skewness. Consider the marine blue (79.5%) and light purple (79.0%) models, for example. These have a very similar skewness but seem to respond differently to varying training sizes. The light purple model starts at a much lower accskew but eventually overtakes the marine blue model. Therefore, the training size seems to be of varying importance for different models even when the skewness is taken out of the equation.

Figure 4.14: Accuracy for random forest on 12 models trained on varying sizes of data

Figure 4.14 shows the models of the same test run but for accuracy instead of accskew. Although the brown model had a high accskew, it makes the correct classification in less than 75% of cases after having been trained on 7000 samples, while the red model makes the correct classification in around 95% of cases after having been trained on only 200 samples. Although the brown and lime models were similar in accskew, the lime model is much higher in accuracy. It could be argued that the lime model is simply "better", being able to perform at a much higher accuracy than its skewness despite the skewness being high. Of the four models that were close in accskew but clearly better than the rest, the lime model performs the best in accuracy. In addition, the brown and lime models are on the same frequency (2145MHz). The marine blue and light purple models, however, are on different frequencies, at 2630MHz and 2145MHz, respectively.

Figure 4.15 again shows the same models but where the F1 score has been analyzed. All the studied models are skewed toward the negative class. With this fact in mind the results are expected, as the F1 score is low and has an extremely large standard deviation for the models with high skewness; there are few examples of the positive class to train on. For the same models, the standard deviation decreases by a large margin when the training size is increased. In addition, the average increases considerably, from around 20% to around 40% for the red model. Studying the brown and lime models again, the lime model has a better F1 score despite having far fewer instances of the positive class to train on. This again supports our previous observation that the lime model is better than the brown model.

Figure 4.15: F1 score for random forest on 12 models trained on varying sizes of data

There seems to be something inherent in the data for the lime model that makes it easier to learn than, for example, the brown model. Analyzing the input data of the lime and the brown models, the allotment, the ratio of non-zero data, is higher for the brown model, at 8.30% versus 5.30%. If we go one step further, we may analyze the distance from the pre-binarification RSRP values to the threshold 24 in the output data. However, this value is also similar, at 9.76 and 10.08, respectively. The reasoning for this analysis is that if one model has data values that are far from the threshold, it should be easier to distinguish these from samples which should be classified as the opposite class. In summary, something makes the data for the lime model better than that for the brown model; it does not seem to be a factor of frequency, training size, skewness, allotment, or pre-binarification RSRP distance to the threshold. Finally, as the data is scaled to the range [0,1], it cannot be because of the size of the values, for example.
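
The two per-model diagnostics used in this comparison can be computed directly from a model's raw data; the sketch below is our own illustration, where X is the model's input matrix and rsrp holds the pre-binarification output values.

# Sketch: the allotment (share of non-zero input values) and the mean
# distance from the pre-binarification RSRP values to the threshold 24.
import numpy as np

def allotment(X):
    return np.count_nonzero(X) / X.size

def mean_distance_to_threshold(rsrp, threshold=24):
    return float(np.mean(np.abs(np.asarray(rsrp) - threshold)))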

4.5 Performance on the positive class

This section details the results of the experiments conducted with the objective of finding techniques for the improvement of predictive performance on the positive class. This includes ROC analysis, undersampling, and varying the objective function.

ROC analysis

ROC analysis was performed on the 12 RF models we analyzed in Section 4.4. Because we are not interested in the performance of the models over the full range of thresholds but rather in finding which threshold is the strongest, we have not analyzed the AUC.

Figure 4.16 shows the ROC curve of the gold (79.4%) model. In the graph, the top right point corresponds to the threshold 0.00, and each consecutive point down the curve represents an increase in threshold by 0.05. The best threshold for the gold model is therefore 0.3 or 0.4. It should be noted that there is no "strict" optimal threshold, as it varies with the objective of the model, that is, whether the rate of detection or of false alarms is the most important. However, if we consider TPR − FPR, it seems in general that the best threshold is found at or greater than 1 − skewness. This means that for the best result in TPR − FPR, the certainty we require before predicting the positive class should be at least as large as the frequency of the positive class. In all but two cases, the optimal threshold is lower than 0.5. In only one case, 0.5 is the optimal threshold. This shows that we may increase our predictive performance on the positive class by varying our discriminator. For the ROC curves of the rest of the 12 models, see Appendix A.
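
Selecting the threshold that maximizes TPR − FPR can be read directly off the ROC curve; a sketch assuming scikit-learn, a fitted classifier clf with predict_proba, and a held-out set X_val, y_val:

# Sketch: pick the decision threshold that maximizes TPR - FPR on a
# validation set. Assumes a fitted classifier `clf` with predict_proba.
import numpy as np
from sklearn.metrics import roc_curve

scores = clf.predict_proba(X_val)[:, 1]            # certainty of the positive class
fpr, tpr, thresholds = roc_curve(y_val, scores)
best = np.argmax(tpr - fpr)
print("best threshold:", thresholds[best], "TPR - FPR:", tpr[best] - fpr[best])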


Figure 4.16: ROC curve for the gold (79.4%) model from Section 4.4 with thresholds in intervals of 0.05.

Undersampling

For this experiment, uniform random undersampling to 50% skewness was performed on all models prior to training the RF. The results showed a decrease in accuracy from 90.6% to 75.3%, an increase in accskew from 4.08p.p. to 25.3p.p., and an increase in F1 score from 63.4% to 75.3%. This shows that undersampling can be used to substantially increase the performance on the positive class, while reducing the number of training samples. It is not unexpected that giving the algorithm more room to vary its parameters based on the positive class yields a higher performance on the positive class. Recall, however, that we have not actually made any changes to the samples of the positive class but only to the negative class. That we can achieve such an improvement in F1 score without any sort of feature engineering or changes to the positive class data makes undersampling a simple but powerful tool in our case.
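
Uniform random undersampling to a 50/50 class distribution, as used here, only removes samples of the majority class; a minimal sketch (our own illustration, assuming the negative class is labeled 0 and is in the majority):

# Sketch: uniform random undersampling of the majority (negative) class
# until both classes are equally represented. Only negative-class samples
# are removed; positive-class samples are left untouched.
import numpy as np

def undersample_to_balance(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = rng.permutation(np.concatenate([pos, keep_neg]))
    return X[idx], y[idx]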

Comparing individual models, the difference in predictive performance is large. Some models achieved over 90% accuracy while others achieved around 50%. This supports our previous claim that the quality of information in the input data varies between models in a way that is not explained by skewness. In addition, the high and middle frequencies performed much better than the low frequency, with 78.9%, 76.6%, and 63.3% accuracy, respectively. Furthermore, the top 153 models are of the high or middle frequency, and among the bottom 50 models, there exist no models of the highest frequency.

Varying the scoring function

For this experiment, we measured how the accuracy and F1 score of the RF vary when using various scoring functions. By default in our configurations, we use accuracy as the scoring function. Here, we also tried the F1 score and ROC-AUC as scoring functions.

Figure 4.17 shows the results of this experiment. As expected, using accuracy as the scoring function yields the highest accuracy, while using F1 score as the scoring function yields the highest F1 score. Both the F1 score and ROC-AUC scorers give a higher F1 score but a lower accuracy than the accuracy scorer. This suggests that we may increase our performance on the positive class at the cost of general predictive performance. This is in line with undersampling, although the effect of this technique is much less dramatic.
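
Changing the scoring function typically only means changing the scoring argument of the hyperparameter search; the sketch below assumes scikit-learn's RandomizedSearchCV with placeholder parameter ranges and a model's training data X_tr, y_tr, and is our assumption of how such a configuration could look rather than the study's exact setup.

# Sketch: the same random hyperparameter search run with three different
# scoring functions; only the `scoring` argument changes.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_space = {"n_estimators": [50, 100, 200, 400],
               "max_depth": [None, 5, 10, 20]}
for scoring in ["accuracy", "f1", "roc_auc"]:
    search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                                param_space, n_iter=10, scoring=scoring, cv=3,
                                random_state=0)
    search.fit(X_tr, y_tr)
    print(scoring, "best CV score:", search.best_score_)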


Figure 4.17: Accuracy and F1 score for RF when using various scoring functions

4.6 Improvement from output class distribution

The data set suffers from large skewness. Preliminary testing showed that the accuracy depended heavily on the skewness of the data. Although the best average accskew achieved on the entire data set is 4.17p.p. with RF, various experiments have shown that there is good information in various subsets of the data. After undersampling to 50% (Section 4.5), accskew was 25.3p.p. When analyzing individual models (Section 4.4), we found that the data sets of the various models have heavily varied characteristics. Some respond with larger predictive changes than others when the training size is varied, while some achieve a higher accskew than others despite training on data with different skewness levels. We believe the low accskew on the full data set is an effect of the data set being so skewed that there simply is not much room for high values of accskew, rather than the data being non-informative. For example, if the data set is 95% skewed, the maximum possible accskew is 5p.p.

In summary, the best accskew achieved for the full data set is 4.17p.p. When undersampling the data set to 50%, we achieved 25.3p.p. accskew. Percentage-wise, we do not improve the accuracy over the skewness for the full data set by much. However, considering the sheer number of frequency measurements that are performed in a mobile network, 4.17p.p. is a rather large improvement.

4.7 Final comparison of systems

After running the training of hybrid systems 1, 2, and 3, we compared the total run-time, average training speed, average classification speed, and number of data values to store against other stand-alone algorithm implementations. Table 4.3 shows the results. For clarification of terms, the Run-time is the sum of training speed and classification speed for all models, while TrSpeed and ClSpeed are averages over all models. Data values refer to the product of the number of samples and the number of features of the training data set, and therefore both zeros and non-zeros are counted.

Table 4.3: Algorithm training stats

Algorithm   Run-time   TrSpeed   ClSpeed   Data values
RF          3h26m      28.54s    0.174s    101 787 520
RF f/2      3h29m      28.96s    0.185s     50 893 760
LogReg      53m         7.46s    0.001s    101 787 520
Hybrid 1    39m         5.41s    0.014s     50 893 760
Hybrid 2    39m         5.39s    0.016s     50 893 760
Hybrid 3    41m         5.64s    0.019s     16 376 440

Figure 4.18 shows the accuracy of LogReg and RF, the RF with features cut in half, the three hybrid models, an Ericsson prototype system, and a regression model from another Ericsson thesis study. That the last-named model is a regression model means that it predicts a continuous integer and the threshold is only used when calculating the accuracy. Because this model is optimized for regression loss functions and predicts continuous values, it is not a fair comparison. Despite this, it is of interest to see that the performance of the two approaches is not widely different.

Figure 4.18: Accuracy of various systems trained on the same data set

The optimized RF models and the Ericsson prototype system outperform the rest of the models. However, it should be noted that while the hybrid models suffer a 1.1p.p. accuracy deficit compared to the RF models, the average training and classification times are heavily reduced, even if the features of the RF are cut. The hybrid models also, of course, require fewer samples and classifiers to be stored. These two factors make them much more efficient in terms of run-time and memory. It should be noted, however, that because we simply use the most common output class instead of training for many of the models, we consider the training and classification times in these cases to be 0.
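
Conceptually, the hybrid systems dispatch on the skewness of each model's training labels; the sketch below is our own illustration of that idea with hypothetical thresholds, not the exact configuration of hybrids 1, 2, and 3.

# Sketch of the hybrid idea: pick a learner per model based on how skewed
# its training labels are. The thresholds are hypothetical placeholders.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def build_model(y_train):
    skewness = max(np.mean(y_train), 1 - np.mean(y_train))   # majority-class share
    if skewness > 0.95:
        return DummyClassifier(strategy="most_frequent")      # just predict the majority class
    if skewness > 0.80:
        return LogisticRegression(max_iter=1000)
    return RandomForestClassifier(random_state=0)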


5 Discussion

In this chapter we aim to provide a critical view of the data set and the approach to the study’s various objectives. Because it is partly a comparative study, the techniques which were initially chosen for study and the techniques which were pursued further must be motivated. In addition, we want to provide a way to connect the results of the study to a real-world application, and to give some insights as to ways of continued research.

5.1 Data

As mentioned in Section 3.2, the samples are created by connecting measurements on the intra-frequency and the inter-frequency based on eNB and time-stamp. Because of this, there is a possibility that the various frequency measurements of a single sample occur at slightly different positions. Because the idea of the measurements to intra-frequency neighbors is to trilaterate the position of the UE, it is possible that this creates data noise which makes this correlation more difficult for the algorithm to interpret. The same phenomenon could also produce samples that contain measurements from different UEs, in which case data noise could be introduced based on the variance in hardware specification of the different UEs. The rate of error, if it exists, is unknown, although Ericsson experts believe it to be low.

Skewness

As mentioned in Section 3.2, the output class distribution varies a lot between the frequencies. The lowest frequency is extremely skewed towards the positive class. This means that while analyzing some of the performance metrics, the algorithms may score very high. For example, for data with 95% skewness, all algorithms score around 95% accuracy. This is a high accuracy but not very impressive, as not using machine learning at all and simply choosing the most common class every time would yield the same value. This is why it is especially interesting to investigate how much better the predictions of the systems are than the level of skewness, and also why accskew was introduced.
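
As a worked example of the metric (assuming, as in the rest of the text, that accskew is the accuracy minus the skewness, expressed in percentage points):

# Sketch: accskew as accuracy minus skewness, both given in percent.
# On a 95%-skewed data set, always guessing the majority class gives
# 95% accuracy and therefore 0 p.p. accskew; 96% accuracy gives 1 p.p.
def accskew(accuracy_pct, skewness_pct):
    return accuracy_pct - skewness_pct

print(accskew(96.0, 95.0))   # 1.0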

The algorithms score higher in accskew the lower the skewness is. This is not entirely unexpected, as the higher the skewness, the lower the maximum accskew. For example, if the skewness is 60%, the maximum accskew is 40p.p., whereas a 95% skewness gives a maximum of 5p.p. accskew. However, had the data simply contained little information or the models been underfitted, the accskew could still have been very low at a low skewness.

Samples

There is a soft cap of 5000 on the number of samples to use per model, which means that samples are removed from models with more than 5000. This is partly due to a concern for the memory requirements of the node, and partly because of a preliminary belief that this amount would suffice for a general convergence in performance. However, Section 4.4 shows that although training size is not the most important factor for predictive performance, performance continues to grow with training size beyond 5000 samples. A positive effect of having a maximum cap on samples is that it increases consistency across models, making more models train on the same number of samples. On this note, only around a third of the models possess data sets for which the samples amount to 5000. For many models, the available number of samples is far lower, and some are even below 1000. This is a problem for consistency between models, and an obstacle when trying to decipher what the varying level of information between model data sets depends on. For clarification, it should be mentioned that no additional data was made available at various stages or dates of the study; that is to say, this study has been conducted on a static data set.

Features

Supervised learning can be viewed as trying to make a qualified guess given various observations. For example, guessing the age of a person for whom you observe the weight, height, occupation, and political standing. In our case, each feature represents the exact same type of observation: signal strength. The distinction is that each feature represents a location from which the signal originates. This may prompt the question: "Do we in reality only have one feature?". Section 4.3 shows the results when training with various subsets of the full feature set. Although there is some difference in performance when using the various subsets, it is indeed quite small. For instance, training with every feature gives us an accskew of 4.16p.p. averaged over every model. Training with only the serving cell feature gives us an accskew of 1.95p.p. The percentage difference in accskew is large, but this means that removing every feature except one only reduces the accuracy by around 2 percentage points. This may suggest that an abundance of features with the same type of observation gives us diminishing returns in information. However, the allotment of the features varies by quite a lot; removing the half of the features with the lowest allotment roughly doubles the total allotment of the input matrix. Instead of diminishing returns, then, it could simply mean that features cannot give much information because they hold such small amounts of data.

5.2 Algorithm choices

This section details our choices of learning algorithms for the various stages of the study. We first discuss why the 7 initial algorithms were chosen. Following this, we detail the choice of removing 3 algorithms and keeping 4 for optimization. Finally, we discuss the reason for using RF as the algorithm of choice for the analytic experiments.

Initial suite

Although we have already briefly touched on this matter in Section 3.4, we would like to expand on our reasoning for our initial selection of algorithms. We want to encompass every major category of learning algorithms. For example, we want to include kernel-based methods, neural networks, decision tree-based algorithms, ensembles, and instance-based methods.


The reasoning for this is that we are more likely to find an algorithm that fits the data well if we try a set of algorithms which are distinctly different. For example, choosing a suite of various neural networks would make little sense if we have no prior knowledge that neural networks are particularly suitable for the data.

In-depth study

After the initial test run, we needed to discard some algorithms in order to have time to conduct all the experiments we deemed necessary. This subsection details why RF, GBDT, LogReg, and MLP were kept while GP, SVM, and KNN were discarded. Finally, we also discuss why the RF was chosen as the algorithm to use for all the analytic experiments where only one algorithm was used.

GPs generally scale as O(n^3), where n is the number of training samples [47]. They are therefore problematic on moderate- to large-scale problems, and when hardware limitations are a large factor. Our results showed a relatively low predictive performance combined with relatively high training and classification times, which was the reasoning for discarding this algorithm from the study. It should be noted that for GP, hyperparameter optimization was only performed in 3 iterations (3 restarts). It could be argued that this invalidated the choice of removing this algorithm from the study. However, had hyperparameter optimization been performed in a similar manner to the other algorithms, the run-time would simply be infeasible; if not for the application, then for the time available for this study.

SVM performs similarly to GP. In comparison to GBDT, for example, there is no advantage: similar classification speed, worse predictive performance, and slower training. Therefore, this algorithm was discarded from further research. While kernel-based methods (both GP and SVM) have many improvements to be made from tuning and optimizing, such as designing a custom kernel tailored to a specific problem, these types of major investments were not deemed feasible for this study.

Although training was quick, KNN performed relatively poorly in its predictions and in its classification speed. In addition, in comparison to other algorithms, there was little improvement to be made in regard to hyperparameter optimization. KNN is a comparatively simplistic algorithm, with fewer parameters to tune. This was the reasoning for discarding this algorithm from the continued study.

GBDT and RF performed the best in the initial experiments. While RF performs slightly better in predictive performance and training time, GBDT classifies faster by around an order of magnitude. It was therefore decided to pursue both of these algorithms. Although RF performs better than all other algorithms in accuracy, it surpasses them even further in F1 score. We believed this would make RF especially suited for the task of performing well on the positive class.

LogReg is the most basic algorithm in the suite. Despite this, it did not perform much worse than RF and GBDT. In addition, it was the quickest in both training and classification, by a huge margin compared to many of the algorithms. It was interesting to keep this algorithm in the study as a baseline of what could be achieved with very little effort. It should be noted that LogReg performed poorly on data sets with low skewness in comparison to the other non-discarded algorithms.

The MLP, although having the worst predictive performance on average over every model, performed relatively well for data sets with low skewness, which we considered more difficult. In addition, it had quick training and classification, on par with LogReg. It should be noted that its hyperparameter configuration was not empirically optimized during the initial test run. Although we knew optimization would increase training time, we hoped to increase the predictive performance while still classifying quickly. It turned out that we needed to increase the complexity of the architecture by quite a lot to reach the predictive levels of RF and GBDT, and we did not manage to surpass their levels.


Analytic experiments

The reason for using only one of the four optimized algorithms for many of the experiments is simply the time restrictions of the study. This is not an implementation limitation; the implementation allows the user to choose which algorithms should be included in each test run. Instead, it is because of the time it takes for the test runs to complete and the time it takes to analyze the data. RF was chosen for the tests because it had the best accuracy, the best F1 score by quite a margin, and reasonable training time. Using another algorithm for the experiments on the positive class would be nonsensical, and we did not want to vary the algorithm used between studies.

5.3 Hyperparameter optimization choices

There have been multiple stages of the study where hyperparameter optimization choices needed to be made. Firstly, we may consider the optimization of hyperparameters for the initial suite of algorithms. For this stage we chose to perform limited hyperparameter optimization, and we used various optimization techniques depending on the algorithm. The foremost reason for this is that we wanted to gauge the performance of the algorithms without extensive work. Because we discarded several algorithms from the study based on the results of the initial test run, it is possible that we removed algorithms which would have proven superior given better optimization. This is perhaps especially true for some of the removed algorithms, such as GP and SVM, whose performance may vary greatly given modeling knowledge and extensive engineering. However, even without performing hyperparameter optimization in the same manner as for the other algorithms, which would multiply the number of training iterations by 30, the training times of these algorithms were long. This means that if these algorithms were to be pursued, we would not have the resources to perform some of the experiments presented in this study. On another note, the experiments presented suggest that pre-processing and attributes of the data are much larger determinants of predictive performance than the chosen learning algorithms. This was of course not known prior to the study, but strengthens the validity of our choices retroactively. It is possible that one of the discarded algorithms would be able to exploit something major in the data that the pursued algorithms are unable to do, in which case our choices could perhaps be questioned.

For the second round of hyperparameter optimization, when optimizing RF, LogReg, GBDT, and MLP, we chose to perform a combination of grid search and random search. Both grid search and random search require pre-defined parameter search spaces, which requires some level of human interaction. This interaction is generally decided on a case-by-case basis, and is determined by the knowledge and experience of the developer. Concretely, this generally means determining the search space based on "reasonable values", or constructing an ad-hoc procedure for obtaining the search space. In the method we chose, we attempted to exploit the strengths of both grid search and random search. Grid search is slow but thorough, and because of the static structure, it is easier to determine which variables seem to affect the results more than others, in comparison to random search. Because of this, after determining some initial search space based on our knowledge, we ran grid search on a subset of all models to try to determine which variables were important, and whether we seemed to be missing important values. As the second step, we transformed the grid into a continuous search space and ran random search on every model. Grid search would take far too long to run on every model, especially if we deemed it necessary to run several tests. In addition, random search has been shown to be more efficient than grid search for the sake of optimal predictive performance [10].
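
The two-stage procedure described above, a coarse grid search on a subset of the models followed by random search on every model, could look roughly as follows (assuming scikit-learn; all parameter values and the data variables are placeholders, not those used in the study):

# Sketch of the two-stage optimization: a coarse grid search on a few
# representative models to shape the search space, then random search over
# a continuous space for every model. Parameter values are placeholders.
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Stage 1: coarse grid on a subset of models (X_subset, y_subset).
coarse_grid = {"n_estimators": [50, 100, 200], "max_features": [0.2, 0.5, 1.0]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), coarse_grid, cv=3)
grid.fit(X_subset, y_subset)

# Stage 2: random search over a continuous space, repeated for every model.
random_space = {"n_estimators": randint(50, 400),
                "max_features": uniform(0.1, 0.9)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            random_space, n_iter=30, cv=3, random_state=0)
search.fit(X_model, y_model)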


5.4 Evaluation choices

Choosing which metrics to use for model evaluation is a constant issue in machine learning. Similar to choosing which pre-processing techniques and learning algorithms to use, this is often a matter to study per application. Accuracy shows how often a model makes correct classifications, but may mask poor performance in relation to skewness, for example. Precision shows how often a model is correct when predicting the positive class, but may be misleading since the model may simply predict this class rarely, only when it is certain of its choice. ROC-AUC can show how good the system is at diagnosing the positive class at various certainties, and is a measure of the potential for using the system in various manners rather than of performance in practice. The point is that no metric is strictly better than another, and one must determine which metrics to use by deciding what is important for the application in which the system is to be used.

In our case, the primary objective was to analyze how good a classifier is at predicting the correct class in comparison to not using the classifier. We should mention that for this objective, the performance of the model on the two classes was of equal importance. The secondary objective is to analyze how good a classifier is at predicting the positive class on data sets for which the negative class is dominant.

For the primary objective, because the data set is generally dominated by the negative class, precision, recall, and F1 score seemed quite volatile. In addition, these metrics are generally stronger when the performance on the positive class is most valuable. ROC-AUC is stronger at analyzing the theoretical capability of the model rather than its strength in a practical application. By this we mean that it may show how strong the model is at various certainty levels by analyzing various deterministic thresholds. In practice, however, we would only use one deterministic threshold, in which case the capability of the model for other thresholds is useless. Accuracy seemed like the best choice, except for the fact that it does not consider skewness. We therefore introduced our own metric, accskew. We believe that accskew is easy to grasp and gives good information about how much we would gain in practice by implementing a system if the output classes are of equal importance. It should be noted that accskew is not fair from a theoretical standpoint, as the performance depends on the skewness of the data set. Not only is the score dependent on the data set, but the maximum possible score varies with the data set as well.

For the secondary objective, F1 score and ROC seemed like good choices. F1 score is well suited for this objective, since we cannot "cheat" by predicting the positive class very seldom or very frivolously. We previously stated that ROC-AUC perhaps is not the best metric for studying how well a model would work in practice. Studying the curve itself and not looking at the area under it is almost the opposite. The idea here is that using a 0.5 discriminator is akin to "trusting" the model’s decision, whereas moving the discriminator is akin to listening to the model’s choice and then interpreting it for our own gain. In practice we do not care about how good the model’s capability is, but rather how well it can perform at a given task.

Because varying the discriminator can give us a better result in practice than just "trusting" the model, why not do this for metrics other than the ROC space? For example, why not do the same type of study but analyze how the accuracy varies with varying discriminators? This is something we have not done in the study because of time constraints. However, should this type of system be implemented and used in practice, we highly recommend doing it, especially if the performance on both classes is of equal importance, as moving the discriminator may increase the performance on one output class at the cost of performance on the other.
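
The study recommended above, how accuracy and F1 score vary with the discriminator, could be realized with a simple threshold sweep; a sketch assuming scikit-learn, a fitted classifier clf with predict_proba, and a held-out set X_val, y_val:

# Sketch: sweep the decision threshold and record accuracy and F1 score,
# instead of fixing the discriminator at 0.5.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

scores = clf.predict_proba(X_val)[:, 1]
for t in np.arange(0.05, 1.00, 0.05):
    pred = (scores >= t).astype(int)
    print(f"threshold {t:.2f}  accuracy {accuracy_score(y_val, pred):.3f}"
          f"  F1 {f1_score(y_val, pred):.3f}")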


5.5 Run-time and memory

The training and classification times of various algorithms were analyzed in this study. Both theory, Section 2.10, and empirical studies1,2 show that the memory and run-time of learning algorithms depend heavily on the implementation. In addition to this, the analysis is further complicated by the fact that various learning algorithms may be parallelized in different ways. To what extent should this be taken into account? For example, it is possible to fully parallelize the training of an RF's estimators, but how reasonable is it to use a certain number of cores for this task? This means that the training and classification times are not only heavily dependent on the implementation but also on the hardware. To further complicate the matter, these times also depend on the configuration of the model. For algorithms that use decision trees, for example, the training time depends on the number of trees to grow, and both the training and classification times depend on the depth of the trees. Because of all these factors we have let these times be of less importance than the predictive performance. In addition, instead of trying to answer which algorithm is the quickest in general, we have tried to answer which type of algorithm fits a particular circumstance.

1 https://www.kaggle.com/nschneider/gbm-vs-xgboost-vs-lightgbm
2 http://datascience.la/benchmarking-random-forest-implementations/

We have chosen to measure required memory only by how many samples we need to store. It would of course be possible to study the source code of each algorithm implementation and learn how data is stored during training and how the final model is stored. We could then have the system record memory footprints at various stages of training and after training has completed, and compare the storage methods used in the implementations with what is theoretically possible. We felt an analysis of this scope was too large a commitment for something that is not a focal point of this thesis.

5.6 Classification vs. regression

In order to study the problem from a classification standpoint, the output data was transformed into two classes based on an RSRP threshold. This discards information about the relative strength of the output signals. Internal preliminary testing at Ericsson showed better results for classification than regression when evaluating the accuracy. It is not unexpected that a model that is trained for classification performs better at classification than a model that was trained for regression. However, since the purpose is to predict whether or not a signal is strong enough, the comparison is fair in regard to the application.

5.7 Data characteristics

As shown in Chapter 4, the lower the skewness, the more equal the accuracy and the F1 score are. This is somewhat expected, but quite important. As our data is generally skewed towards having a higher occurrence of the negative class, it is reasonable that the F1 score is lower than the accuracy over the full data range. However, as we consider models with more equal distributions of output classes, the algorithm gets equal amounts of examples to train on. If the accuracy and F1 score are similar for data with similar distributions, it tells us that there is no inherent characteristic in the data that makes one class more difficult to predict than the other.

5.8 The work in a wider context

The results of this study show that it is possible to boost the efficiency of inter-frequency measurements by using machine learning. The implication of this is that mobile network communications could become more effective. It should be noted that this study does not weigh the effectiveness gained by the investigated techniques against the effectiveness lost by having to perform new tasks in the node. If the system should prove successful, however, and training and sample storage are moved to an external system, there should be few downsides to implementing this system in regards to strain on the node.

Because the technology provided here does not use UE identifications or geographical positions, there is little risk of harm with regard to personal privacy. Apart from privacy, the ethics of machine learning are usually discussed in terms of automation taking jobs away from humans. In this case, it could actually be the opposite, since it means the development of an addition to a feature that is currently fully automated.

5.9 Future work

We have established that the data is so skewed that we can use much simpler and faster methods than RF, GBDT, MLP, and the like, for prediction, without a heavily reduced predictive capability. However, the large difference in predictive performance between the models after undersampling the data for each model to 50% seems to suggest that the quality of the information in the data available per model varies greatly. We have also shown that this is not explained by training size, allotment, or average distance from the binary labeling threshold. This is therefore a question left open and a good choice for further research. Establishing which combinations of information are effective could, for example, help create powerful dimensionality reduction tailored to the application. It could also be used to construct a solution where models are only trained if the information in the data is deemed of high enough quality.


6 Conclusion

For the application of inter-frequency measurement prediction in an LTE network and the data set provided, a suite of supervised learning algorithms and pre-processing techniques were explored. Decision tree-based ensembles showed the best predictive performance over both classes, where random forest and gradient boosting decision trees displayed the highest accuracy. Averaging over every model, the highest accuracy achieved was 90.73% with random forest. Random forest was the most efficient algorithm, but gradient boosting decision trees could prove more efficient if parallelization is not available.

We showed that the training size on average affects the predictive performance in a linear manner all the way from as few as 200 samples to as many as 7000 samples. Studying individual models showed that various models respond somewhat differently to the training size. In particular, the lower the skewness of the model’s data set, the higher the impact of increasing the training size.

We explored several techniques for the task of achieving good predictive performance on the positive class on data sets dominated by the negative class. We showed that undersampling, varying the discriminator threshold, and varying the objective function all achieve a higher F1 score than our default configuration; undersampling gave the largest effect.

The data set was shown to be highly skewed, largely associated with the frequency. This is an effect of the transformation of the output data from a continuous integer to two classes, in which the network configuration threshold 24 was used. Because high frequencies generally have low RSRP values and low frequencies generally have high RSRP values, the high frequencies are skewed towards the negative class while the low frequencies are skewed towards the positive class. The accuracy of the algorithms was shown to be highly correlated with the skewness of the data, which is why we studied the accuracy in relation to the skewness. Averaging over every model, the best accuracy over skewness achieved was 4.17p.p. Undersampling to 50% skewness showed that the quality of data varies greatly between model data sets, as models achieved from around 50% accuracy to around 90% accuracy. These two phenomena combined mean that the accuracy being close to the skewness of the data is not simply because there exists little correlation in the data, but also because the levels of skewness do not allow for a large margin.

Our results showed that for data sets with low skewness, decision tree-based ensembles performed the best; for data sets with medium skewness, logistic regression performed almost on par with the ensembles; and for very high skewness, no algorithm surpassed the skewness on average. We presented hybrid algorithm implementations in which the learning algorithm used depends on the skewness of the data. We showed that our hybrids dropped 1p.p. in accuracy while increasing training speed by over 5 times and classification speed by over 12 times in comparison to our optimized random forest.

In summary, we have shown that supervised learning can be used to improve inter-frequency measurement efficiency in LTE networks when compared to naive selection. In addition, for the sake of load balancing, we have shown several techniques which improve the performance of predicting signals on high-frequency APs in, for example, hotspots. We know there exist attributes of the input data which highly affect predictive performance. We suggest future research be dedicated to discovering what these are. This could greatly help select in which cells using a machine learning model is effective, and which learning algorithms to use.


Bibliography

[1] Scott A. Czepiel. “Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation”. In: (2002).

[2] Hervé Abdi and Lynne J. Williams. “Principal component analysis”. In: Wiley Interdisciplinary Reviews: Computational Statistics 2.4 (July 2010), pp. 433–459. DOI: 10.1002/wics.101.

[3] Selim Aksoy and Robert M. Haralick. “Feature normalization and likelihood-based similarity measures for image retrieval”. In: Pattern Recognition Letters 22.5 (Apr. 2001), pp. 563–582. ISSN: 01678655. DOI: 10.1016/S0167-8655(00)00112-4.

[4] Nasser M. Alotaibi and Sami S. Alwakeel. “A Neural Network Based Handover Management Strategy for Heterogeneous Networks”. In: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA). IEEE, Dec. 2015, pp. 1210–1214. ISBN: 978-1-5090-0287-0. DOI: 10.1109/ICMLA.2015.65.

[5] M. Amirijoo, P. Frenger, F. Gunnarsson, H. Kallin, J. Moe, and K. Zetterberg. “Neighbor Cell Relation List and Physical Cell Identity Self-Organization in LTE”. In: ICC Workshops - 2008 IEEE International Conference on Communications Workshops. IEEE, May 2008, pp. 37–41. DOI: 10.1109/ICCW.2008.12.

[6] Mohmmad Anas, Francesco D. Calabrese, Per-Erik Ostling, Klaus I. Pedersen, and Preben E. Mogensen. “Performance Analysis of Handover Measurements and Layer 3 Filtering for Utran LTE”. In: 2007 IEEE 18th International Symposium on Personal, Indoor and Mobile Radio Communications. IEEE, 2007, pp. 1–5. ISBN: 978-1-4244-1143-6. DOI: 10.1109/PIMRC.2007.4394671.

[7] Y. Bengio, P. Simard, and P. Frasconi. “Learning long-term dependencies with gradient descent is difficult”. In: IEEE Transactions on Neural Networks 5.2 (Mar. 1994), pp. 157–166. ISSN: 10459227. DOI: 10.1109/72.279181.

[8] Yoshua Bengio. “Practical Recommendations for Gradient-Based Training of Deep Architectures”. In: Springer, Berlin, Heidelberg, 2012, pp. 437–478. DOI: 10.1007/978-3-642-35289-8_26.

[9] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. “Algorithms for hyper-parameter optimization”. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. Neural Information Processing Systems, 2011, pp. 2546–2554. ISBN: 9781618395993.


[10] James Bergstra and Yoshua Bengio. “Random Search for Hyper-Parameter Optimization”. In: Journal of Machine Learning Research 13.Feb (2012), pp. 281–305. ISSN: 1533-7928.

[11] Gitanjali Bhutani. “Application of Machine-Learning Based Prediction Techniques in Wireless Networks”. In: International Journal of Communications, Network and System Sciences 07.05 (May 2014), pp. 131–140. ISSN: 1913-3715. DOI: 10.4236/ijcns.2014.75015.

[12] Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In: COMPSTAT. 2010.

[13] Leo Breiman. “Bagging predictors”. In: Machine Learning 24.2 (Aug. 1996), pp. 123–140. ISSN: 0885-6125. DOI: 10.1007/BF00058655.

[14] Leo Breiman. “Random Forests”. In: Machine Learning 45.1 (2001), pp. 5–32. ISSN: 08856125. DOI: 10.1023/A:1010933404324.

[15] Rich Caruana, Nikos Karampatziakis, and Ainur Yessenalina. “An empirical evaluation of supervised learning in high dimensions”. In: Proceedings of the 25th international conference on Machine learning - ICML ’08. New York, New York, USA: ACM Press, 2008, pp. 96–103. ISBN: 9781605582054. DOI: 10.1145/1390156.1390169.

[16] Rich Caruana and Alexandru Niculescu-Mizil. “An empirical comparison of supervised learning algorithms”. In: Proceedings of the 23rd international conference on Machine learning - ICML ’06. New York, New York, USA: ACM Press, 2006, pp. 161–168. ISBN: 1595933832. DOI: 10.1145/1143844.1143865.

[17] Nitesh V. Chawla. “Data Mining for Imbalanced Datasets: An Overview”. In: Data Mining and Knowledge Discovery Handbook. New York: Springer-Verlag, 2005, pp. 853–867. DOI: 10.1007/0-387-25465-X_40.

[18] T. Cover and P. Hart. “Nearest neighbor pattern classification”. In: IEEE Transactions on Information Theory 13.1 (Jan. 1967), pp. 21–27. ISSN: 0018-9448. DOI: 10.1109/TIT.1967.1053964.

[19] Anders Dahlen, Arne Johansson, Fredrik Gunnarsson, Johan Moe, Thomas Rimhagen, and Harald Kallin. “Evaluations of LTE Automatic Neighbor Relations”. In: 2011 IEEE 73rd Vehicular Technology Conference (VTC Spring). IEEE, May 2011, pp. 1–5. ISBN: 978-1-4244-8332-7. DOI: 10.1109/VETECS.2011.5956600.

[20] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. “SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives”. In: (July 2014). arXiv: 1407.0202.

[21] Thomas G. Dietterich. “Ensemble Methods in Machine Learning”. In: Springer, Berlin, Heidelberg, 2000, pp. 1–15. DOI: 10.1007/3-540-45014-9_1.

[22] Tom Fawcett. “An introduction to ROC analysis”. In: Pattern Recognition Letters 27.8 (June 2006), pp. 861–874. ISSN: 0167-8655. DOI: 10.1016/J.PATREC.2005.10.010.

[23] David Freedman. Statistical models: theory and practice. Cambridge University Press, 2009, p. 442. ISBN: 0521743850.

[24] Jerome H. Friedman. “Stochastic gradient boosting”. In: Computational Statistics & Data Analysis 38.4 (Feb. 2002), pp. 367–378. ISSN: 0167-9473. DOI: 10.1016/S0167-9473(01)00065-2.

[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[26] Per Christian Hansen. “The truncated SVD as a method for regularization”. In: BIT 27.4 (Dec. 1987), pp. 534–553. DOI: 10.1007/BF01937276.


[27] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. “Overview of SupervisedLearning”. In: The Elements of Statistical Learning: Data Mining, Inference, and Predic-tion. New York, NY: Springer New York, 2009, pp. 9–41. ISBN: 978-0-387-84858-7. DOI:10.1007/978-0-387-84858-7_2.

[28] Sergey Ioffe and Christian Szegedy. “Batch normalization: accelerating deep networktraining by reducing internal covariate shift”. In: Proceedings of the 32nd InternationalConference on International Conference on Machine Learning - Volume 37. JMLR.org, 2015,pp. 448–456.

[29] Alan J. Izenman. “Recursive Partitioning and Tree-Based Methods”. In: 1st ed.Springer-Verlag New York, 2008. Chap. 9, pp. 281–314. ISBN: 978-0-387-78188-4. DOI:10.1007/978-0-387-78189-1_9.

[30] A.K. Jain, Jianchang Mao, and K.M. Mohiuddin. “Artificial neural networks: a tutorial”.In: Computer 29.3 (Mar. 1996), pp. 31–44. ISSN: 00189162. DOI: 10.1109/2.485891.

[31] Laszlo A. Jeni, Jeffrey F. Cohn, and Fernando De La Torre. “Facing Imbalanced Data–Recommendations for the Use of Performance Metrics”. In: 2013 Humaine AssociationConference on Affective Computing and Intelligent Interaction. IEEE, Sept. 2013, pp. 245–251. ISBN: 978-0-7695-5048-0. DOI: 10.1109/ACII.2013.47.

[32] R. D. KING, C. FENG, and A. SUTHERLAND. “STATLOG: COMPARISON OF CLAS-SIFICATION ALGORITHMS ON LARGE REAL-WORLD PROBLEMS”. In: Applied Ar-tificial Intelligence 9.3 (May 1995), pp. 289–333. DOI: 10.1080/08839519508945477.

[33] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:(Dec. 2014). arXiv: 1412.6980.

[34] Brent Komer, James Bergstra, and Chris Eliasmith. “Eliasmith C. Hyperopt-sklearn: au-tomatic hyperparameter configuration for scikit-learn”. In: Proceedings of SciPy. 2014.

[35] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. “Machine learning: a review of clas-sification and combining techniques”. In: Artificial Intelligence Review 26.3 (Nov. 2006),pp. 159–190. ISSN: 0269-2821. DOI: 10.1007/s10462-007-9052-3.

[36] Helena Kotthaus, Ingo Korb, Michel Lang, Bernd Bischl, Jörg Rahnenführer, and PeterMarwedel. “Runtime and memory consumption analyses for machine learning R pro-grams”. In: Journal of Statistical Computation and Simulation 85.1 (Jan. 2015), pp. 14–29.ISSN: 0094-9655. DOI: 10.1080/00949655.2014.925192.

[37] Ralf Kreher and Karsten Gaenger. LTE Signaling, Troubleshooting, and Optimization.Chichester, UK: John Wiley & Sons, Ltd, Dec. 2010. ISBN: 9780470977729. DOI: 10.1002/9780470977729.

[38] Andrej Krenker, Janez Bester, and Andrej Kos. “Introduction to the Artificial Neural Networks”. In: Artificial Neural Networks - Methodological Advances and Biomedical Applications. InTech, Apr. 2011. DOI: 10.5772/15751.

[39] Pierre Lescuyer and Thierry Lucidarme. Evolved Packet System (EPS): The LTE and SAE Evolution of 3G UMTS. 2008. ISBN: 0470059761, 9780470059760.

[40] Gilles Louppe. “Understanding Random Forests: From Theory to Practice”. PhD thesis. University of Liège, July 2014. DOI: 10.13140/2.1.1570.5928. arXiv: 1407.7502.

[41] Richard J. Manterfield. Telecommunications Signalling. The Institution of Electrical Engineers, 1999.

[42] R.J. May, H.R. Maier, and G.C. Dandy. “Data splitting for artificial neural networks using SOM-based stratified sampling”. In: Neural Networks 23.2 (Mar. 2010), pp. 283–294. ISSN: 0893-6080. DOI: 10.1016/J.NEUNET.2009.11.009.

[43] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012, p. 414. ISBN: 9780262018258.


[44] Pratap S. Prasad and Prathima Agrawal. “Movement Prediction in Wireless Networks Using Mobility Traces”. In: 2010 7th IEEE Consumer Communications and Networking Conference. IEEE, Jan. 2010, pp. 1–5. ISBN: 978-1-4244-5175-3. DOI: 10.1109/CCNC.2010.5421613.

[45] J. R. Quinlan. “Induction of decision trees”. In: Machine Learning 1.1 (Mar. 1986), pp. 81–106. DOI: 10.1007/BF00116251.

[46] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. “Searching for Activation Functions”. In: (Oct. 2017). arXiv: 1710.05941.

[47] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006, p. 248. ISBN: 026218253X.

[48] Robert E. Schapire. “A brief introduction to boosting”. 1999.

[49] Armin Shmilovici. “Support Vector Machines”. In: Data Mining and Knowledge Discovery Handbook. Boston, MA: Springer US, 2009, pp. 231–247. DOI: 10.1007/978-0-387-09823-4_12.

[50] V.N. Vapnik. “An overview of statistical learning theory”. In: IEEE Transactions on Neural Networks 10.5 (1999), pp. 988–999. ISSN: 10459227. DOI: 10.1109/72.788640.

[51] Weetit Wanalertlak, Ben Lee, Chansu Yu, Myungchul Kim, Seung-Min Park, and Won-Tae Kim. “Behavior-based mobility prediction for seamless handoffs in mobile wireless networks”. In: Wireless Networks 17.3 (Apr. 2011), pp. 645–658. ISSN: 1022-0038. DOI: 10.1007/s11276-010-0303-x.

[52] Hui Zou, Trevor Hastie, and Robert Tibshirani. “Sparse Principal Component Analysis”. In: Journal of Computational and Graphical Statistics 15.2 (June 2006), pp. 265–286. DOI: 10.1198/106186006X113430.


Appendix A

Figure A.1: ROC curve for the 79.5% skewness model from Section 4.4 with thresholds in intervals of 0.05.
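The ROC curves in Figures A.1–A.11 plot true positive rate against false positive rate as the classification threshold is swept in steps of 0.05. As a rough illustration of how such per-threshold points can be produced, the sketch below (Python with NumPy; the function name, toy labels, and scores are assumptions for illustration, not the implementation used in this thesis) computes one ROC point per threshold:

import numpy as np

def roc_points(y_true, y_score, step=0.05):
    """Return (fpr, tpr) arrays for thresholds 0.0, 0.05, ..., 1.0."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.linspace(0.0, 1.0, int(round(1.0 / step)) + 1)
    positives = max(int(y_true.sum()), 1)        # guard against division by zero
    negatives = max(int((~y_true).sum()), 1)
    fpr, tpr = [], []
    for t in thresholds:
        y_pred = y_score >= t                    # predict positive if the score reaches the threshold
        tpr.append(np.sum(y_pred & y_true) / positives)
        fpr.append(np.sum(y_pred & ~y_true) / negatives)
    return np.array(fpr), np.array(tpr)

# Illustrative usage with synthetic labels and scores (hypothetical data):
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=200), 0.0, 1.0)
for f, t in zip(*roc_points(y_true, y_score)):
    print(f"FPR={f:.2f}  TPR={t:.2f}")

Plotting the resulting (FPR, TPR) pairs and connecting them yields an ROC curve of the kind shown in the figures below.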


Figure A.2: ROC curve for the 75.6% skewness model from Section 4.4 with thresholds in intervals of 0.05.

Figure A.3: ROC curve for the 89.9% skewness model from Section 4.4 with thresholds in intervals of 0.05.


Figure A.4: ROC curve for the 69.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.

Figure A.5: ROC curve for the 95.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.


Figure A.6: ROC curve for the 79.0% skewness model from Section 4.4 with thresholds in intervals of 0.05.

Figure A.7: ROC curve for the 57.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.


Figure A.8: ROC curve for the 82.9% skewness model from Section 4.4 with thresholds in intervals of 0.05.

Figure A.9: ROC curve for the 65.4% skewness model from Section 4.4 with thresholds in intervals of 0.05.


Figure A.10: ROC curve for the 63.6% skewness model from Section 4.4 with thresholds in intervals of 0.05.

Figure A.11: ROC curve for the 82.2% skewness model from Section 4.4 with thresholds in intervals of 0.05.
