
Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Datateknik
2018 | LIU-IDA/LITH-EX-A–18/017–SE

Predicting inter-frequency measurements in an LTE network using supervised machine learning
– a comparative study of learning algorithms and data processing techniques

Swedish title: Att prediktera inter-frekvensmätningar i ett LTE-nätverk med hjälp av övervakad maskininlärning

Adrian E. Sonnert

Supervisor: Mattias Tiger
Examiner: Fredrik Heintz
External supervisor: Daniel Nilsson & Erik Malmberg (Ericsson)

Linköpings universitet
SE–581 83 Linköping
+46 13 28 10 00, www.liu.se

Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Adrian E. Sonnert


Abstract

With increasing demands on network reliability and speed, network suppliers need to make their communication algorithms more efficient. Frequency measurements are a core part of mobile network communications; improving their efficiency would improve the efficiency of many network processes such as handovers, load balancing, and carrier aggregation. This study examines the possibility of using supervised learning to predict the signal of inter-frequency measurements by investigating various learning algorithms and pre-processing techniques. We found that random forests have the highest predictive performance on this data set, at 90.7% accuracy. In addition, we have shown that undersampling and varying the discriminator are effective techniques for increasing the performance on the positive class on frequencies where the negative class is prevalent. Finally, we present hybrid algorithms in which the learning algorithm for each model depends on attributes of the training data set. These algorithms are far more efficient in terms of memory and run-time without heavily sacrificing predictive performance.

Acknowledgments

I would like to thank Ericsson for giving me the opportunity to perform my master thesis study on such an interesting area of research. In addition, I would like to thank my supervisors at Ericsson, Daniel Nilsson and Erik Malmberg. I would also like to thank my examiner and supervisor at Linköping University, Fredrik Heintz and Mattias Tiger. Thank you for your time and all the help you have given me.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Acronyms

1 Introduction
  1.1 Aim
  1.2 Research questions
  1.3 Delimitations

2 Theory
  2.1 Introduction to LTE/LTE-A networks
  2.2 Introduction to machine learning and supervised learning
  2.3 Cost functions and minimization techniques
  2.4 Data pre-processing
  2.5 Hyperparameter optimization
  2.6 Regularization
  2.7 Learning algorithms
  2.8 Evaluation metrics
  2.9 Comparative studies
  2.10 Memory and run-time

3 Method
  3.1 Platform
  3.2 From network data to features
  3.3 Data statistics
  3.4 Choice of algorithms
  3.5 Initial test suite
  3.6 Result parsing and visualization
  3.7 Tool specifics
  3.8 Hyperparameter optimization
  3.9 Final algorithm configurations
  3.10 Analytic experiments
  3.11 Hybrid algorithms
  3.12 Comparison

4 Results
  4.1 Initial test suite results
  4.2 Optimized learning algorithm results
  4.3 Pre-processing technique results
  4.4 Number of samples
  4.5 Performance on the positive class
  4.6 Improvement from output class distribution
  4.7 Final comparison of systems

5 Discussion
  5.1 Data
  5.2 Algorithm choices
  5.3 Hyperparameter optimization choices
  5.4 Evaluation choices
  5.5 Run-time and memory
  5.6 Classification vs. regression
  5.7 Data characteristics
  5.8 The work in a wider context
  5.9 Future work

6 Conclusion

Bibliography

A Appendix A

List of Figures

2.1 Simplified cutout of an LTE network, showing the eNB, UE, SC, and NC.
2.2 Variance and bias
2.3 Examples of decision boundaries
2.4 PCA with one component on data generated from the Gaussian distribution
2.5 Examples of search space for grid search and random search when considering two parameters
2.6 Artificial neuron
2.7 Artificial Neural Network
2.8 Decision tree for binary classification
2.9 Confusion matrix
2.10 ROC curve with discriminator thresholds 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 from the top right to the bottom left.
3.1 Output class skewness per frequency
4.1 Metrics for logistic regression averaged over models based on the skewness of data
4.2 Metrics for multi-layer perceptron averaged over models based on the skewness of data
4.3 Metrics for random forest averaged over models based on the skewness of data
4.4 Metrics for gradient boosting decision trees averaged over models based on the skewness of data
4.5 Training and validation accuracy score for the chosen algorithms averaged over models based on the skewness of data. The fully drawn line represents the validation score and the dashed line represents the training score.
4.6 Training time for the chosen algorithms based on models with a certain number of training samples available
4.7 Classification time for the chosen algorithms based on models with a certain number of training samples available
4.8 AccSkew for the chosen algorithms averaged over all models based on whether or not the data was preprocessed with scaling
4.9 AccSkew when training on different sets of features and models with varying levels of skewness
4.10 Accuracy when training with the serving cell feature and various fractions of the neighbor cell features
4.11 Accuracy when training with the serving cell feature and various fractions of the neighbor cell features
4.12 AccSkew for the chosen algorithms based on models with a certain number of training samples available
4.13 AccSkew for random forest on 12 models trained on varying sizes of data
4.14 Accuracy for random forest on 12 models trained on varying sizes of data
4.15 F1 score for random forest on 12 models trained on varying sizes of data
4.16 ROC curve for the gold (79.4%) model from Section 4.4 with thresholds in intervals of 0.05.
4.17 Accuracy and F1 score for RF when using various scoring functions
4.18 Accuracy of various systems trained on the same data set
A.1 ROC curve for the 79.5% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.2 ROC curve for the 75.6% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.3 ROC curve for the 89.9% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.4 ROC curve for the 69.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.5 ROC curve for the 95.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.6 ROC curve for the 79.0% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.7 ROC curve for the 57.3% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.8 ROC curve for the 82.9% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.9 ROC curve for the 65.4% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.10 ROC curve for the 63.6% skewness model from Section 4.4 with thresholds in intervals of 0.05.
A.11 ROC curve for the 82.2% skewness model from Section 4.4 with thresholds in intervals of 0.05.

List of Tables

3.1 A sample
3.2 Input
3.3 Output
3.4 Algorithm suite
3.5 Algorithms and Configurations
3.6 Search space grid to distr. transformation
3.7 Logistic Regression hyperparameter search space
3.8 Multi-layer perceptron hyperparameter search space
3.9 RF hyperparameter search space
3.10 Gradient boosting decision tree hyperparameter search space
3.11 Hybrid 1 configuration
3.12 Hybrid 2 configuration
3.13 Hybrid 3 configuration
4.1 Results, Initial run, Scaling
4.2 Results, PCA run, Scaling
4.3 Algorithm training stats

Acronyms

Telecom
ANR   Automatic Neighbor Relations
AP    Access point
c-ID  Physical identity
CGI   Global Cell ID
ECGI  E-UTRAN Global Cell Identity
ECI   E-UTRAN Cell Identity
eNB   Evolved Node B
MT    Mobile terminal
NC    Neighboring cell
PCI   Physical Cell ID
RSRP  Reference Signal Received Power
SC    Serving cell
TC    Target cell
UE    User equipment

Machine learning
ANN   Artificial neural network
GBDT  Gradient boosting decision trees
GP    Gaussian processes
KNN   k-Nearest neighbor
LogReg  Logistic regression
MLP   Multi-layer perceptron
ReLU  Rectified linear unit
RF    Random forest
PCA   Principal component analysis
SGD   Stochastic gradient descent
SPCA  Sparse principal component analysis
SVD   Singular value decomposition
SVM   Support vector machine
TSVD  Truncated singular value decomposition

1 Introduction

With the emergence of the networked society, many of people's daily tasks are performed over wireless connections, and increasing requirements are put on mobile networks in terms of latency, data rates, and reliability. In order to meet these requirements, network providers must constantly adapt to advances in technology and improve their communications algorithms.

To facilitate non-disrupted communications in mobile networks, handovers are performed. These are processes in which a mobile terminal (MT) switches channel during an ongoing session [41]. The most general case for when a handover is performed is when the signal strength from an MT to its access point (AP) falls below a certain threshold and a new AP must be acquired. Another important technique in network communications is load balancing, in which APs serving many MTs are disfavored over APs serving few MTs¹. This is especially important in hotspots where many users conglomerate. In these areas, aggregating users toward low-range, high-frequency APs helps ease the load on high-range, low-frequency APs. In addition to techniques for providing reliable and stable services, there exist techniques for boosting the throughput for users in the network. Carrier aggregation is a technique for connecting single MTs to several APs on one or several frequency bands in order to increase bandwidth².

The techniques mentioned above are examples of techniques that rely on performing frequency measurements, where a frequency measurement is a measurement of signal strength from an MT to an AP on a certain frequency. In general, frequency measurements are the basis for every technique in which the signal strength between MTs and APs in a network needs to be known. It is easy to see that an improvement in frequency measurement efficiency would be an improvement to network efficiency in general. To give an idea of how frequency measurements are used, we may briefly examine a simplified handover process. When a handover is to be performed, the MT must first acquire information about the APs in its vicinity to know which one is the best candidate for a handover. For this to happen, the MT performs frequency measurements to each nearby AP. When the measurements are completed, the MT performs a handover to the AP with the highest measured signal strength. Alternatively, the MT may switch as soon as an AP with a high enough signal is found. A more thorough description of mobile networks follows in Section 2.1.

¹ http://www.3gpp.org/technologies/keywords-acronyms/105-son
² http://www.3gpp.org/technologies/keywords-acronyms/101-carrier-aggregation-explained

Because of the network cost of performing frequency measurements, it would be advantageous to know whether the signal to the strongest-signal AP on a certain frequency is strong, without having to perform a measurement. Predicting whether this would be true given a certain position clearly requires the position of MTs in the world to be known. Extracting the absolute position of MTs in a network is time consuming, unreliable, and may entail both legal and ethical complications. It could be possible to trilaterate the position of an MT given the distance to its host AP and other nearby APs. If the relative position of an MT is established through trilateration, and there exists information about which inter-frequencies previous measurements were successful on, it is theoretically possible to estimate a probability of the success of a new measurement.

For this task, machine learning is a suitable approach. Machine learning is a field in which a computer program, rather than being programmed to follow a specified rule set, is trained to make predictions of certain outcomes by applying a learning algorithm to some data set [43]. The learning algorithms are dependent upon the data being divided into features: observed properties or phenomena. In machine learning, the data on which the model is trained is called training data, and input vectors are assembled from the features of the data. In supervised learning, each input-output pair represents a training example, as opposed to unsupervised learning, where each example only consists of input.

The main issue of this study is to apply machine learning to the problem of inter-frequency signal prediction. This amounts to the creation of a system that can, given information about signal strength to the host AP and neighboring APs, predict the signal to the highest-signal AP on a certain inter-frequency. More about how the network data relates to the algorithm I/O follows in Section 3.2. We investigate which learning algorithm and which data processing techniques are best suited for the task. In order to accomplish this, a comparative study of learning algorithms is performed, and algorithms of interest are analyzed more in-depth. In addition to finding the optimal setup for the problem in general, we investigate solutions for predicting signals on high frequencies in particular, where the signal strengths generally are lower. In addition, we analyze the data with the objective of evaluating the effectiveness of the studied solutions on this problem.

Although there exists research on the use of machine learning in various parts of an LTE network, such as handover management [4], predicting user movement [51][44][11], and other techniques which make use of frequency measurements, there exists, to the author's knowledge, no research on using machine learning to improve the frequency measurement algorithms themselves.

    1.1 Aim

The goal of the project is to determine which supervised learning algorithm performs best in the given problem context with the available data set. A comparative study of supervised learning algorithms is made, where a few algorithms are selected for deeper study. In addition to learning algorithms, the data is studied and suitable data processing techniques are selected.

1.2 Research questions

The following research questions are defined to guide the research in trying to explore algorithms, alternatives to pre-process the data, and how well suited a machine learning solution is for the given data set:

1. From the selected suite of learning algorithms and pre-processing techniques, which combination achieves the highest predictive performance on the given data set?

This can be considered the main research question of the study. At the end of the study we compare our implementation with various models, including an Ericsson prototype system.

2. How does the predictive performance of the implemented system vary when the training size is varied?

In order to easily test our resulting system in the field, a proposition is to make the system non-reliant on an external system. This means that data collection, data storage, training, model storage, and classification are all performed on the node. Because of memory and run-time requirements, as well as the desire to keep the data collection duration minimal, it is desirable to investigate how the number of training samples affects the predictive performance.

3. Which techniques are suitable for the objective of achieving high predictive performance on the positive class on high frequencies?

This is important because, in general, the low frequency APs have large ranges and yield high signal strength connections. This makes it easy for users to aggregate to these APs. From a load balancing standpoint, it is therefore desirable to move users to high frequency APs. Because these APs generally give lower signal strength connections, there is less data for cases when the signal is considered "good enough", and therefore fewer cases to train the models on. From a machine learning perspective, this means that the negative output class is dominant.

    4. How much is gained from using the implemented system over using naive selection?

It is important to produce a system which shows the capability of improving efficiency in the real world application. By naive selection we mean always choosing the most common output class. This could also help answer whether or not supervised learning is an ideal technology for this application.
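To make the naive-selection baseline concrete, the sketch below compares a majority-class predictor with a trained classifier on a synthetic, skewed data set. It is a minimal illustration assuming scikit-learn; the generated data and the random forest are placeholders, not the data set or configuration used in this study.

    # Majority-class ("naive selection") baseline vs. a trained classifier (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic binary data skewed toward the negative class (placeholder for the network data).
    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    naive = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    print("naive selection accuracy:", naive.score(X_test, y_test))
    print("classifier accuracy:", model.score(X_test, y_test))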

    1.3 Delimitations

The study is centered around implementations that should be run entirely on LTE eNBs. This entails certain limitations in run-time and memory. In the proposed solution, the data is intended to be acquired and stored in the eNB, and the model is trained and stored in the eNB as well. This makes training speed, classification speed, and memory requirements all factors in what constitutes an efficient solution.

2 Theory

This chapter briefly describes relevant information about an LTE/LTE-A network. It also describes machine learning, and supervised learning in particular, before detailing certain aspects of learning techniques that are of importance for the studied algorithms. Cost functions and minimization, regularization, and hyperparameter optimization techniques are some of the topics described. We detail the workings of the algorithms that were chosen for further study after the initial broad sweep. Finally, techniques for evaluation of machine learning models are described.

    2.1 Introduction to LTE/LTE-A networks

In an LTE/LTE-A network, User Equipment (UE) is a conglomerate term for devices in the network that communicate with the core network, such as mobile phones or laptops. These devices connect to the core network via base stations called Evolved Node B (eNB) [39]. Each eNB consists of one or more cells, meaning that one eNB can consist of a single physical antenna (omnidirectional), or several (sectorized multi-cell solution). However, a cell is not defined as the actual antenna but rather the area the antenna services. The eNB or cell which is currently supplying a particular UE's connection is called its serving eNB or serving cell (SC). When a frequency measurement or handover is performed, the cell to which the procedure is being performed is called the target cell (TC) [6].

There are many different types of identifications in an LTE/LTE-A network pertaining to cells, UEs, and eNBs¹. It is further complicated by different sources using different terms for the same properties [5][37][19]. There are several ways of identifying cells with different magnitudes of uniqueness. The E-UTRAN Global Cell Identity (ECGI) identifies any cell in the world uniquely, and the E-UTRAN Cell Identity (ECI) identifies any cell in a network uniquely. Finally, the physical identity (c-ID) identifies neighboring cells uniquely in relation to one ECI [37], and ranges from 0 to 503. All of this can be simplified by using the terms Global Cell ID (CGI), which contains both the ECGI and ECI, and Physical Cell ID (PCI), a more common term for the c-ID [5][19]. In summary, the CGI is a cell identified uniquely, and the PCI is a cell identified in relation to a CGI.

¹ http://www.rfwireless-world.com/Terminology/LTE-Identifiers.html


    Figure 2.1: Simplified cutout of an LTE network, showing the eNB, UE, SC, and NC.

Each eNB maintains a list of neighboring cells (NC) called a neighbor cell relation list [5], or neighbor relation table [19], that includes certain connectivity information such as IP and PCI mappings [5]. UEs moving through the network may detect both PCIs and CGIs, but identifying a cell absolutely is more time consuming. The UEs in the network continuously measure their Reference Signal Received Power (RSRP), the signal strength, from the SC to all NCs in their vicinity that are possible candidates for a handover. When a UE encounters a new cell, it sets up a connection between its serving eNB and the eNB of the unknown cell. The eNBs then share information about the cells in their respective area, including PCIs and CGIs. This function is called Automatic Neighbor Relations (ANR) [19], and is a core part of the work-flow of relational mapping in a network. The information generated by ANR usually only consists of information regarding neighbors on the same frequency. Inter-frequency measurements can also be performed, but are generally ordered by an eNB in specific circumstances, and are therefore considered more expensive.

    2.2 Introduction to machine learning and supervised learning

In machine learning, algorithms are able to learn from data. This means that the performance of an algorithm on a certain task is improved according to a certain performance measure [25]. The task generally refers to the procedure of predicting one or several output variables, given a series of input variables (or features) [27]. These tasks can further be divided into classification and regression, which is the prediction of categorical and continuous variables, respectively [25]. In the case of binary classification, the output is usually true or false and represented by a binary digit: 1 or 0. The measures used to gauge the performance of machine learning algorithms are usually referred to as metrics; the most common one being accuracy, simply the ratio of correct predictions.

In supervised learning, an algorithm is trained by feeding it a training set consisting of feature vectors and corresponding outputs [25]. One feature vector and its corresponding output is often called an example or a sample. During evaluation (or validation), or when the algorithm is queried to make a prediction in general, only the feature vector is given. In order to acquire data for both training and validation, the data available for analysis is usually split up into one training set and one test set, usually by ratios of 9 to 1, or 8 to 2.
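As a concrete illustration of such a split, the sketch below holds out 20% of the samples as a test set (a ratio of 8 to 2); it assumes scikit-learn, and the feature matrix and outputs are randomly generated placeholders.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data: 1000 samples with 10 features each and a binary output.
    X = np.random.rand(1000, 10)
    y = np.random.randint(0, 2, size=1000)

    # Hold out 20% of the samples for testing; use test_size=0.1 for a 9-to-1 split instead.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_test.shape)  # (800, 10) (200, 10)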

If the vector of features is $x \in \mathbb{R}^n$, and the output variable is $y$, supervised learning algorithms often learn to predict $y$ by estimating the probability distribution of $y$ given $x$. A simple example of a machine learning algorithm is Linear Regression, in which the output is a linear function of the input

\[ y = w^\top x + b \tag{2.1} \]

where $w \in \mathbb{R}^n$ is a vector of weights, and $b$ is an intercept (or bias) term [25]. The weights and the bias term are what is referred to as the parameters of the model, the variables that are learned by training. The extra bias term is needed if the decision boundary is to not go through the origin (0, 0). The bias term is therefore added to account for what in machine learning is called bias, the inaccuracy of the model. Another closely linked term is variance, which is instead the impreciseness of the model. Consider an example where a model is trying to, for a single observation, predict the correct outcome several times. If the multiple tries yield predictions that are close to equal, but far from the true outcome, the bias is high and the variance is low. On the other hand, if the predictions are widely spread out around the correct outcome, the bias is low but the variance is high. This is explained visually in Figure 2.2. For machine learning model performance, it is usually difficult to obtain both a low bias and a low variance, which is why the bias-variance tradeoff is a central problem. Bias and variance are closely linked to overfitting and underfitting; more about this in Section 2.6.

Figure 2.2: Variance and bias. The four panels illustrate the combinations high variance/high bias, low variance/high bias, high variance/low bias, and low variance/low bias.

As can be concluded from Equation 2.1, when we use linear regression for classification we rely on being able to linearly separate the different outcome classes. This can be visualized by what is called a decision boundary: data will be assigned to one class or another depending on which side of the boundary the data point falls. Figure 2.3 shows example decision boundaries for two data sets. This demonstrates that the data cannot always be separated so that the model always predicts the correct class. This is a simple example, but it holds for higher dimensions as well. The higher the number of dimensions of the data to analyze, the more complex the model needs to be in order to achieve high predictive performance.

Figure 2.3: Examples of decision boundaries. (a) Linearly separable data; (b) non-linearly separable data.

    2.3 Cost functions and minimization techniques

The problem of optimizing a learning algorithm can be described by finding the function, from the set of all possible functions, that best predicts the true responses to a set of examples [50]. In order to find a function as close as possible to this function, the loss is measured. The loss is the discrepancy between the true response and the prediction of the model. We want to find the function that minimizes the expected loss (or risk), defined as

\[ R(a) = \int L(y, f(x, a)) \, dP(x, y) \tag{2.2} \]

where $f(x, a)$ is the sought function, $L(y, f(x, a))$ is the loss function, and $P(x, y)$ is the unknown probability distribution of the input $x$ and output $y$. Because the probability distribution is unknown, we cannot minimize the expected loss function directly but must compute an approximation, the average loss over the training set [12]. Minimizing a loss function is in machine learning often done by using Stochastic Gradient Descent (SGD), in which one weight update can be described by

\[ w_{t+1} = w_t - \gamma_t \nabla_w L(z_t, w_t) \tag{2.3} \]

where $w_t$ is the weight at time $t$, and $\gamma$ is the learning rate. The learning rate is a constant by which the gradient of the loss is multiplied to hasten or slow down the descent. If the learning rate is too large, the algorithm may not converge, while a too small learning rate might make the convergence too slow for a practical application. There are many modern variations of the SGD algorithm, such as Adam [33] and SAGA [20]. When a maximization of the objective function is used instead of a minimization, the loss function is often called the scoring function instead.
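A minimal NumPy sketch of the weight update in Equation 2.3, applied to linear regression with a squared loss; the data, learning rate, and number of passes over the data are arbitrary placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                                     # placeholder features
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)   # placeholder targets

    w = np.zeros(3)      # weights to be learned
    gamma = 0.01         # learning rate
    for epoch in range(20):
        for i in rng.permutation(len(y)):        # visit the samples in random order
            error = X[i] @ w - y[i]              # residual of the current prediction
            grad = error * X[i]                  # gradient of the squared loss 0.5 * error^2 w.r.t. w
            w = w - gamma * grad                 # one SGD step, as in Equation 2.3
    print(w)                                     # approaches [1.5, -2.0, 0.5]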

    2.4 Data pre-processing

Before being used as input for a learning algorithm, there are a variety of ways data can be manipulated. There are many reasons for why data pre-processing is useful; these include reducing training time, boosting the predictive performance of the resulting model, and the data format being incompatible with the learning algorithm implementation, among others. Dimensionality reduction can be used to reduce the number of features or samples, often while trying to retain some information about what has been discarded. Feature selection can be used to simply remove certain features that possess little or redundant information. Sampling can be used to decide in which manner samples are divided into subsets, or which types of samples to include. Feature scaling can be used to transform the values of the features.

    Sampling

A common way of splitting data into train and test subsets is by random sampling, that is to say picking samples at random from the full data set [42]. However, there are some complications with this technique. If the distribution of output classes is heavily skewed, the possibility of having all or most of the samples with one of the classes in only one of the subsets increases. This is problematic because if the algorithm is never trained on samples with a certain output class, there is no way for it to ever predict that class. This can be solved by using stratified sampling, which retains the distribution of a chosen feature or output variable in the subsets.

In addition to the technique of sampling the full data set, it is possible to use sampling techniques that only use part of the full data set, or that expand upon it. Removing samples from the dominant class is called undersampling while generating data of the infrequent class is called oversampling [17]. Both methods have strengths and weaknesses; undersampling may remove important examples while oversampling generates artificial data from the existing data set, making it prone to overfitting.
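The sketch below performs a stratified train/test split and then randomly undersamples the dominant class in the training set; it assumes scikit-learn and NumPy, and the class ratio is a placeholder.

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 5)
    y = np.array([0] * 800 + [1] * 200)          # skewed output: 80% negative class

    # Stratified sampling keeps the 80/20 class distribution in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Undersampling: keep only as many negative training samples as there are positives.
    pos = np.where(y_train == 1)[0]
    neg = np.random.choice(np.where(y_train == 0)[0], size=len(pos), replace=False)
    keep = np.concatenate([pos, neg])
    X_balanced, y_balanced = X_train[keep], y_train[keep]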

    Feature scaling

What in machine learning is referred to as feature scaling, or simply scaling, is a normalization of feature data by transforming different ranges of data into a more common scale. Many machine learning algorithms base feature similarity on the Euclidean distance of feature vectors in the feature space instead of studying the features' characteristics directly [3]. One effect of this is that features with larger value ranges are given larger significance in weight calculations. This is not generally desired, as there is usually no inherent correlation between a feature's total value range and the value of the feature in one example. A concrete way of explaining this could be to consider weight and height of a person when determining the person's sex. If the heights of the observed people range between 160 cm and 190 cm, and the weights range between 55 kg and 100 kg, a larger relative weight would be assigned to the weight feature. However, there is no support for the fact that a person's weight is a larger determinant than height when predicting a person's sex. In addition to this, simply reducing the values in the data to smaller numerical values has been shown to increase training speed for neural networks [28].

Because of these phenomena, it is considered best practice to scale the features. There exist multiple ways of scaling features; some rely on transforming the values of every feature to fit the same range, for example [0, 1], while others subtract or divide each feature value by a mean of the feature [3]. If $\hat{x}$ is a scaled feature component, $x$ a feature component, and $l$ and $u$ represent a lower and upper bound for feature components, respectively, [0, 1]-scaling can be formulated as

\[ \hat{x} = \frac{x - l}{u - l} \tag{2.4} \]

Standard scaling (or standardization), which is the removal of feature variance from each feature component, can be formulated as

\[ \hat{x} = \frac{x - \mu}{\sigma} \tag{2.5} \]

where $\mu$ and $\sigma$ are the mean and standard deviation of the feature, respectively.
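Equations 2.4 and 2.5 translate directly into NumPy as below; scikit-learn's MinMaxScaler and StandardScaler provide equivalent transformations. The sample matrix reuses the height/weight example and is a placeholder.

    import numpy as np

    X = np.array([[160.0, 55.0],
                  [175.0, 70.0],
                  [190.0, 100.0]])               # placeholder: height (cm) and weight (kg)

    # [0, 1]-scaling (Equation 2.4): subtract the lower bound and divide by the range.
    l, u = X.min(axis=0), X.max(axis=0)
    X_minmax = (X - l) / (u - l)

    # Standardization (Equation 2.5): subtract the mean and divide by the standard deviation.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    X_standard = (X - mu) / sigma
    print(X_minmax, X_standard, sep="\n")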


    Dimensionality reduction

In statistics, multivariate analysis is the practice of analyzing the joint behavior of multiple variables. This is used because it is often inadequate to consider the importance of variables independently. For example, it is possible that a variable that gives a lot of information is redundant if another variable is introduced. This is important in machine learning because we want to minimize the dimensionality of the input while maximizing the information from the data that is fed to the algorithm.

Principal component analysis (PCA) is a technique for reducing the number of variables while retaining a large part of the variance of the data [2]. A principal component is the linear combination that explains the largest possible variance of a set of data. When PCA is performed, principal components are created sequentially, generally until a certain threshold of variance is obtained. Principal components are always orthogonal to one another. This is important because if the components were correlated, it would not be possible to determine the amount of variance that is explained by each component, akin to the variables that we had prior to performing PCA. Numerically, PCA is based on singular value decomposition (SVD).

Figure 2.4: PCA with one component on data generated from the Gaussian distribution. (a) Large amount of variance lost; (b) small amount of variance lost.

PCA is most easily interpreted visually. Figure 2.4 shows principal components fit on different sets of data. One may say that the data is projected orthogonally upon the nearest principal component, and the shorter the distance of the projection, the less variance is lost. Apart from classic PCA, there exist similar methods based on SVD such as Truncated SVD (TSVD) [26] and Sparse PCA (SPCA) [52].
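A short scikit-learn sketch of PCA on correlated two-dimensional Gaussian data, in the spirit of Figure 2.4; the covariance matrix is a placeholder. The number of components can be given directly, or as a variance threshold so that components are added until that fraction of the variance is retained.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=500)  # correlated Gaussian data

    pca = PCA(n_components=1).fit(X)             # keep a single principal component
    X_reduced = pca.transform(X)                 # data projected onto that component
    print(pca.explained_variance_ratio_)         # fraction of the variance retained

    # Alternatively, keep as many components as needed to explain 95% of the variance.
    pca95 = PCA(n_components=0.95).fit(X)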

    2.5 Hyperparameter optimization

    In machine learning, a model’s parameters are what is automatically updated as the algo-rithm performs training, such as the weights of a neural network. Meanwhile, a model’shyperparameters are what is specified by the user prior to training, such as the number oftrees to use in a random forest [8]. For many learning algorithms, there exists a large numberof hyperparameters. In addition, there are no real default configurations that "always work",one rather has to test a number of configurations to see what fits the application and data athand. In Bayesian machine learning, the term hyperparameter refers to a parameter in theprior distribution of another parameter of the model; for consistency across the various mod-els in this study, we will not use this definition. Note that this is only relevant for Gaussianprocesses among the learning algorithm we study in this work.

The most common technique for hyperparameter optimization is the combination of manual work and a brute force search technique [10]. Around 10 years ago, grid search was the most common strategy. Grid search relies on manually configuring a hyperparameter search space which is then iterated through until every possible combination has been explored. For large grids, the process is very costly. If one wishes to optimize an algorithm with 5 hyperparameters and simply explore 5 values for each, 3125 iterations need to be performed. Because hardware keeps improving throughout the years, grid search has not completely lost its function.

Figure 2.5: Examples of search space for grid search and random search when considering two parameters. (a) Search space for grid search; (b) search space for random search.

Another method, called random search, is similar to grid search in the way that it searches through a pre-defined search space iteratively [10]. However, instead of performing exhaustive search, random search samples a configuration at random in each iteration. Because of its non-exhaustive behavior, it is possible to use continuous distributions to sample values from rather than using a static grid. Also in regard to the non-exhaustive behavior, it is possible to manually set a stopping condition, for example a set number of iterations to perform. Random search has been shown to outperform grid search [9]. In addition, it performs well even in comparison with modern sequential methods, but can be outperformed by Gaussian Processes and the Tree-structured Parzen Estimator Approach when optimizing complex models such as deep belief networks and convolutional networks [10].

Random search will in most cases explore more choices for each parameter, as illustrated by Figure 2.5. This allows random search to find local optima that grid search would otherwise not find. However, this also makes it difficult to analyze which parameters are more significant. Grid search may therefore be better suited for analyzing the importance of parameters while random search is better suited for simply finding the best configuration.
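A sketch of both strategies with scikit-learn; the estimator, parameter ranges, and data are placeholders rather than the search spaces used in this study.

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Grid search: every combination of the listed values is evaluated.
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={"n_estimators": [50, 100, 200],
                                    "max_depth": [3, 5, 10]},
                        cv=3).fit(X, y)

    # Random search: a fixed number of configurations sampled from distributions.
    rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                              param_distributions={"n_estimators": randint(50, 300),
                                                   "max_depth": randint(2, 15)},
                              n_iter=20, cv=3, random_state=0).fit(X, y)

    print(grid.best_params_, rand.best_params_)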

Hyperopt-sklearn is a package for automatic hyperparameter configuration [34]. It lets users perform hyperparameter optimization using built-in parameter distributions, which are engineered to account for both dense and sparse data representations. Hyperopt-sklearn considers quite a low number of hyperparameter choices but has still shown considerable performance in comparison to other techniques [34].


    2.6 Regularization

In machine learning, our goal is to minimize our test error, which is the same as maximizing our predictive performance on the test data set. We generally obtain a low test error by obtaining a low train error, a good performance on training data. However, because the train and test data sets contain different data, it is possible to obtain a low train error but a high test error; this phenomenon is called overfitting [25]. Intuitively, overfitting can be explained by the algorithm focusing "too much" on the specific values in the training data, and overlooking general correlations. Overfitting, therefore, often occurs in models that have high variance. In contrast, it is possible that the training error is large, which will also generally induce a high test error. This is called underfitting, and can be explained by the algorithm failing to find correlation between the input and output of the data. In contrast to overfitting, underfitting instead occurs in models with high bias. Our ultimate goal is to minimize the test error, and this can often be accomplished by performing a trade-off between overfitting and underfitting. It is possible that a configuration performs better on the test data than the training data.

In order to reduce overfitting, one uses what is called regularization. This can be explained as reducing the amount the algorithm learns from each training sample, hopefully lowering the test error at the expense of the training error [25]. This is done differently depending on the type of algorithm. For example, for algorithms that rely on minimizing a cost function by gradient descent, regularization is induced by adding a penalizing term to the cost function. For algorithms that rely on decision trees, highly complex trees may induce overfitting, which is why breadth and depth are controlled. This is generally done by growing a full tree, and then cutting away select branches.

As previously mentioned, prior to training the data is split into training and test data sets. In order to calculate the training error, one must further split the training data into train and test subsets for use during training. In order to obtain a more accurate training error, this is usually done multiple times using the method k-fold cross-validation. This method splits the data into k subsets, training iteratively k times, each iteration using one of the folds as test data and the remaining sets as training data [25]. After training has been completed, the average error is selected as the training error.
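A small scikit-learn sketch of k-fold cross-validation with k = 5; the estimator and data are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Five folds: each fold is used once as validation data while the rest is used for training.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())                         # average validation accuracy over the folds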

    2.7 Learning algorithms

The learning algorithms considered in this study are logistic regression, random forest, gradient boosting decision trees, multi-layer perceptron, k-nearest neighbor, Gaussian processes, and support vector machines. The latter three were discarded after the evaluation of the initial test run, which is why we deemed detailed descriptions of these algorithms superfluous. We will first briefly describe these three algorithms and follow with a more detailed description of the others.

Gaussian processes are non-parametric kernel-based learning algorithms that are based on finding a distribution over possible functions from the input data to the output data. The functions considered are indirectly defined by a covariance matrix, which in itself defines how the data points in the input relate. For a more detailed explanation, see Rasmussen et al.'s Gaussian Processes in Machine Learning [47].

We previously described how linear regression separates data points in a two-dimensional space by a line. A support vector machine attempts to separate data points in any dimension with a hyperplane, while at the same time maximizing the distance between each data point and the hyperplane. This is partly done by transforming the input data to a (normally) higher dimensional feature space by a kernel function where the data is hopefully easier to separate. For details, see Shmilovici et al.'s Support Vector Machines [49].

k-Nearest neighbor is an instance-based method that is based on considering groupings of each data point in the input by proximity in the feature space, where the outcome is decided by majority voting of that grouping; the size of each grouping is decided by k. The k-NN holds the entire training data in memory and performs all computations at classification. This algorithm is considered one of the simplest machine learning methods. If any more details are desired, consider Cover et al.'s Nearest neighbor pattern classification [18].

    Logistic regression

Logistic regression is a probability model for binary response data [23]. When used for classification it will output the odds of one class over the other. The name derives from the logistic function; the standard logistic function (or sigmoid function) can be expressed as

\[ f(x) = \frac{1}{1 + e^{-x}} \tag{2.6} \]

An example of a logistic regression function is

\[ p(x) = \frac{e^{w_1 x_1 + w_2 x_2 + b}}{1 + e^{w_1 x_1 + w_2 x_2 + b}} \tag{2.7} \]

where $w_1$ and $w_2$ are weights, and $b$ is a bias. This may also be expressed as

\[ q(x) = \ln\!\left(\frac{p(x)}{1 - p(x)}\right) = w_1 x_1 + w_2 x_2 + b \tag{2.8} \]

which, if $p(x)$ is a probability function for $P(Y = 1 \mid X)$, gives us the probability odds of one output class over the other as a function of the input. Classification with logistic regression is done by interpreting the output, for example by

\[ f(x) = \begin{cases} 1, & q(x) > 0.5 \\ 0, & q(x) \leq 0.5 \end{cases} \tag{2.9} \]
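Equations 2.6 to 2.9 can be written out in a few lines of NumPy; the weights, bias, and input below are arbitrary placeholders used only to show the mapping from log-odds to a class prediction.

    import numpy as np

    def sigmoid(z):                              # the logistic function, Equation 2.6
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([0.8, -1.2])                    # placeholder weights w1, w2
    b = 0.3                                      # placeholder bias

    def q(x):                                    # log-odds, the linear part of Equation 2.8
        return w @ x + b

    def p(x):                                    # probability of the positive class, Equation 2.7
        return sigmoid(q(x))

    def classify(x):                             # thresholding the output, Equation 2.9
        return 1 if q(x) > 0.5 else 0

    x = np.array([1.0, 0.5])
    print(p(x), classify(x))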


    Figure 2.6: Artificial neuron

    Figure 2.7: Artificial Neural Network

Figure 2.7 shows an ANN, each node representing one artificial neuron. In this figure, the green nodes represent the input layer, the purple nodes represent a hidden layer, and the orange node represents the output layer. It should be noted that while a network may consist of only one input layer and one output layer, there exists no boundary for how many hidden layers may be incorporated.

ANNs may be divided into two different subcategories, feed-forward and recurrent networks [38]. The difference is that in feed-forward networks, the output of a neuron may never be used in the input of a neuron in the same or a previous layer, while in a recurrent network it may. The ANN in Figure 2.7 therefore represents a feed-forward network, described by an acyclic graph.

In order to train (run SGD on) the network, one must first acquire the gradient of the loss. This is done by what is called back-propagation, which is a technique for cheaply computing the gradient of the network from the output node and back using the chain rule [25]. The Multi-layer perceptron (MLP) is a feed-forward neural network that consists of one or more hidden layers, often called a "standard" ANN.
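As a brief illustration, scikit-learn's MLPClassifier implements such a feed-forward network trained with back-propagation and SGD-style optimizers; the layer sizes and data below are placeholders, not the configuration studied here.

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Two hidden layers with 32 and 16 neurons; weights are updated from back-propagated gradients.
    mlp = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                        max_iter=500, random_state=0)
    mlp.fit(X, y)
    print(mlp.score(X, y))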

    Activation function

As stated earlier, the activation function is what calculates the output of an artificial neuron. Because neural networks commonly use gradient minimization techniques, the weights of the hidden units are updated in proportion to the gradient of the error function. One effect of this is that the weights receive small updates if the gradient of the error is small. This is called the vanishing gradient problem and is present in many commonly used activation functions such as the logistic function (or sigmoid) and the hyperbolic tangent [7].

The rectified linear unit (ReLU) is defined as

\[ f(x) = \begin{cases} x, & x > 0 \\ 0, & x \leq 0 \end{cases} \]
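A one-line NumPy version of the ReLU above, applied element-wise to a placeholder vector of activations.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)                # x for x > 0, otherwise 0

    print(relu(np.array([-1.5, 0.0, 2.3])))      # [0.  0.  2.3]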

Decision trees

Famous tree growing algorithms include CART, ID3, and C4.5 [40]. One difference between these methods is whether to use early stopping, or to build a full tree and then prune it, cutting away select branches. Another is whether to always perform a binary split, or to allow more than 2 daughter nodes per parent node.

    Random forests

Multiple classifier systems, or ensembles, are a group of supervised learning algorithms which use multiple (weak) classifiers in order to create one strong classifier [21]. The classifiers are called weak because they are low complexity versions of the aggregated classifier type.

Bagging is a type of ensemble method in which many weak estimators are trained on various subsets of the learning set and are aggregated in the final classifier by voting [13]. This means that at time of prediction, each estimator is queried for an outcome; the prediction of the classifier is then the most common choice among the estimators. In bagging, each estimator is trained independently and it is therefore possible to, for example, parallelize estimator training.

Random forests are a version of bagging in which decision trees are used as estimators [14]. A random forest is trained by training weak decision trees, called stumps. During training, various attributes of the stumps are controlled by what we described in Section 2.7, for example the tree depth. In addition to hyperparameters specific to each decision tree, an important hyperparameter of a random forest is the number of estimators to train. Random forests are resilient to the overfitting complications of decision trees in that each weak estimator is shallow.

Although bagging methods classically use voting for determining the classification, it is possible to use other alternatives. For example, the scikit-learn implementation uses the trees' predictive average instead of having them vote².

² http://scikit-learn.org/stable/modules/ensemble.html
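A short scikit-learn sketch of a random forest where the number of estimators and the depth of each tree are controlled; the values and data are placeholders, not the configurations evaluated in this study.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # 200 shallow trees, each fit to a bootstrap sample of the training data.
    rf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
    rf.fit(X, y)
    print(rf.predict_proba(X[:3]))               # class probabilities averaged over the trees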

    Gradient boosting decision trees

Similarly to bagging, boosting is an ensemble method in which multiple weak estimators are used to create one strong classifier. However, instead of training each estimator independently, boosting is a sequential process in which every new estimator is dependent on the previously trained estimators [48]. Boosting algorithms maintain a set of weights which are updated during training in a way which makes each estimator focus on the examples that the previous estimator did not perform well on.

Gradient boosting is a combination of the terms gradient descent and boosting, and is based on the fact that boosting can be seen as an optimization algorithm over a certain loss function [24]. In this case, the weights described in the above paragraph equal the negative gradient of the loss function. There exist variations of gradient boosting techniques. For example, one variation may use the entire data set for training in each iteration, while another uses some subsample of the full data set (stochastic gradient boosting). Because of the sequential nature of gradient boosting, it is not possible to parallelize the training of each learner. Gradient boosting decision trees is simply gradient boosting in which decision trees are used as estimators.
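A corresponding sketch for gradient boosting decision trees with scikit-learn; the learning rate, number of estimators, and subsampling fraction are placeholder values.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Trees are added sequentially, each one fit against the negative gradient of the loss so far.
    # subsample < 1.0 gives the stochastic gradient boosting variant mentioned above.
    gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                      max_depth=3, subsample=0.8, random_state=0)
    gbdt.fit(X, y)
    print(gbdt.score(X, y))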

    2.8 Evaluation metrics

Before we define the metrics themselves, it is important to define the concept of a confusion matrix. It is a table of four items that are defined on a combination of a true class and a class predicted by a model. If the true class and the predicted class are both positive, it is defined as a true positive (TP). If the true class is negative and the predicted class is positive, it is defined as a false positive (FP). If the true class and the predicted class are both negative, it is defined as a true negative (TN). Finally, if the true class is positive and the predicted class is negative, it is defined as a false negative (FN) [22]. One way of concretizing this is to view the true class as whether or not a patient has a certain affliction, while the predicted class is the diagnosis of a doctor. Then, if the patient has cancer, but the doctor incorrectly diagnoses the patient as healthy, it would correspond to a FN. Figure 2.9 shows a confusion matrix.

    Figure 2.9: Confusion matrix

Many of the well-known metrics for model evaluation in machine learning can be derived from the terms in this matrix. Accuracy, Precision, Recall, and F1 score are defined as [22]:

\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{2.12} \\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2.13} \\
\text{Recall} &= \frac{TP}{TP + FN} \tag{2.14} \\
\text{F1 score} &= 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2.15}
\end{align}

Accuracy therefore describes how many out of all predictions are correct. Precision describes how many of the predicted positives are true, while recall shows the proportion of actual positives identified. This means that a model that predicts the positive class very seldom may have a very good precision, while a model that frivolously predicts the positive class may have a very high recall. Because of this phenomenon, recall and precision may be good for analyzing individual factors, but quite weak at evaluating how "good" the model is at predicting the positive class in general. We may therefore instead use the F1 score, which is the harmonic mean of precision and recall. That is to say, it measures how good the model is at finding all positive cases without predicting the positive class too often.

In addition to these metrics, we define what we call AccSkew, which is a comparison of the accuracy and the distribution of output classes, more specifically

\begin{equation}
\text{AccSkew} = \text{accuracy} - \text{skewness} \tag{2.16}
\end{equation}

where skewness is the percentage of the most common outcome in a data set. Concretely, if 10 samples contain 7 outputs '1' and 3 outputs '0', or vice versa, the skewness is 70%. The AccSkew is not an absolute value and may be negative, suggesting that simply guessing the most common class is more accurate than using the classifier in question.
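As a minimal sketch of these metrics, assuming hypothetical label vectors y_true and y_pred (the values below are illustrative), the standard metrics can be computed with scikit-learn and the AccSkew derived as defined above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels for one classification model.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 0, 1, 0])

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Skewness: share of the most common outcome class in the data set (0.7 here).
skewness = max(np.mean(y_true), 1 - np.mean(y_true))
acc_skew = accuracy - skewness   # may be negative

print(accuracy, precision, recall, f1, skewness, acc_skew)
```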



    ROC analysis

In binary classification, most classifiers predict one class over the other by being over 50% confident in the outcome. For example, in a random forest, if 53% of the trees vote for the positive class, the positive class will be predicted. This is equal to having a 0.5 discriminator threshold, or simply discriminator, and can be likened to relying on the predictions of the classifier. However, instead of relying on the classifier as is, we may sometimes achieve a higher predictive performance by moving the discriminator. For example, we may decide to only predict the positive class if the classifier is over 70% certain, which would be the same as using a 0.7 discriminator. This can be likened to not fully relying on the prediction of the classifier, as we may then predict a class for which the model does not have the highest probability.

Figure 2.10: ROC curve with discriminator thresholds 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 from the top right to the bottom left.

Receiver Operating Characteristic (ROC) analysis examines a classifier's ability to predict the positive class when the discriminator is varied. ROC analysis is traditionally used in medical diagnostics, and has in modern times become popular for machine learning classifier evaluation [22]. Figure 2.10 shows an example of a ROC curve, where the y-axis represents the rate of TPs and the x-axis represents the rate of FPs. Each data point signifies a certain discriminator. A perfect score would be a value in the top left, signifying finding every positive case while not raising any false alarms. The diagonal line represents the results of random guessing, so a value below the diagonal line represents a classifier that performs worse than guessing for that particular threshold. For example, the point (0.7, 0.7) represents predicting positives in 70% of cases, getting them wrong half the time and getting them right half the time. Because of its characteristics, ROC analysis is efficient for determining the predictive performance of the positive class in data sets skewed towards the negative class.

In diagnostics, it may be important to have good performance at various threshold levels, in which case the Area under the curve (AUC) may be calculated and compared between systems. It should be used with caution as a general measure of single-threshold model performance, however, as a classifier with great performance at a certain threshold and bad performance at all other thresholds would yield a low AUC. In addition, research has shown that although AUC is insensitive to class skewness, it can mask poor performance [31].
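The following sketch, on synthetic data with illustrative parameters, shows how a ROC curve and AUC can be obtained from a classifier's class probabilities with scikit-learn, and how a non-default discriminator can be applied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, skewed stand-in data; not the thesis data set.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # confidence in the positive class

# One (FPR, TPR) point per discriminator threshold.
fpr, tpr, thresholds = roc_curve(y_te, proba)
print("AUC:", roc_auc_score(y_te, proba))

# Moving the discriminator: predict positive only when confidence exceeds 0.7.
y_pred_07 = (proba > 0.7).astype(int)
```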



    2.9 Comparative studies

There have been several occasions where a suite of machine learning algorithms has been compared on a suite of metrics and problems. StatLog [32] from 1995, as well as the reviews by Kotsiantis et al. [35] and Caruana et al. [16] from 2006, and by Caruana et al. [15] from 2008, are some of the most well known. They all conclude that while some algorithms may be stronger on average on a range of problems, a learning algorithm should be chosen based on the problem and data at hand.

    2.10 Memory and run-time

Kotthaus et al. [36] have performed a study in which a range of supervised machine learning algorithms for binary classification were evaluated in terms of run-time and memory usage. They do not study training and classification times separately, but find that the total run-times of random forests, support vector machines, and gradient boosting decision trees are comparatively long, while k-nearest neighbor and logistic regression are comparatively fast. They also find that gradient boosting decision trees and logistic regression have a comparatively low total memory allocation, while random forests and support vector machines have a comparatively high total memory allocation. They conclude, however, that both run-time and memory usage are affected by the algorithm choice, the implementation, and the hardware. It is therefore difficult to draw conclusions about a specific algorithm by studying certain implementations.


3 Method

This chapter describes the data extracted from the LTE/LTE-A network and the pre-processing of that data for use with learning algorithms. Following this, we motivate which algorithms were selected for the initial study. We then describe the experiments that were conducted in order to gauge the initial suite of learning algorithms, including choices for pre-processing and hyperparameter optimization. The chapter continues by describing the algorithms chosen for comprehensive comparison, as well as their configurations. We then describe various analytic experiments performed with the optimized algorithms. Finally, we introduce novel implementations in which the choice of learning algorithm varies depending on data attributes.

    3.1 Platform

The algorithms were implemented in Python using scikit-learn1, with pandas2 being used for data handling. Algorithm test runs were performed on 2 Ericsson Red Hat VMs with 8 CPU cores and 32 GB of RAM each. Training and classification speeds were measured simply by using the time function of the Python time package. Hyperparameter optimization was implemented using hyperopt-sklearn3, as well as the implementations of grid search and random search in scikit-learn. Visualization was performed using pandas and matplotlib4.

    3.2 From network data to features

Our raw data consists of a number of time-stamped frequency measurements. These include measurements to the SC and NCs on the UE's frequency, and to inter-frequency cell neighbors. The measurements to the inter-frequency cells are performed in a way that lets us save only the cell on a particular frequency which has the highest signal strength. Using the CGI and the time stamps, we can link measurements together into a sample which provides information regarding whether or not a handover to a particular inter-frequency cell would be successful.

1 https://scikit-learn.org/stable/
2 https://pandas.pydata.org/
3 http://hyperopt.github.io/hyperopt-sklearn/
4 https://matplotlib.org/



    Table 3.1: A sample

CGI    1639551231
SF     2688 MHz
IF     720 MHz
SCM    15 RSRP
NCM1   PCI 302, 24 RSRP
...
NCM8   PCI 12, 17 RSRP
ICM    23 RSRP

An example of a sample can be seen in Table 3.1, where SF is serving frequency, IF is inter frequency, SCM is serving cell measurement, NCM is neighboring cell measurement, and ICM is inter cell measurement. It is important to note that for this study we are exclusively studying inter-frequency measurements, which is why every sample contains an ICM. We should also point out that the number of NCs available in each sample varies between 1 and 8.

What we want to do is to predict the signal on the inter-frequency, which is why the ICM will be our output. However, the RSRP value of the ICM is first converted into binary using a threshold. This threshold is a network configuration constant and determines what constitutes a good enough signal for a handover. In our case, this threshold is set to 24. This means that every ICM RSRP ≥ 24 is transformed into 1, while every ICM RSRP < 24 is transformed into 0. This also means that we transform the problem from regression to binary classification. The motivation for this is partly that internal preliminary testing showed better results for classification than for regression, and partly that another ongoing master's thesis study at Ericsson on the same data set is focusing on regression.
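A minimal sketch of this binarization, assuming a hypothetical pandas column name icm_rsrp (not the actual column name in the Ericsson data) and illustrative values, follows.

```python
import pandas as pd

# Hypothetical raw samples; column name and values are illustrative only.
samples = pd.DataFrame({"icm_rsrp": [30, 12, 24, 19, 41]})

THRESHOLD = 24  # network configuration constant for a viable handover

# Binarize the inter-frequency measurement: RSRP >= 24 -> 1, otherwise 0.
samples["icm_label"] = (samples["icm_rsrp"] >= THRESHOLD).astype(int)
print(samples)
```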

When transforming the sample data to features, we group the data by CGI and IF, which means that we will have one model per cell and inter-frequency. This also means that those variables are constant in each model's data set and would therefore be redundant. The samples contain multiple-value data for the NCMs, which is problematic for many algorithms to handle. For this reason, we transform the data from a dense representation to a sparse representation, in which every PCI has its own data column, and every row is a feature vector that represents a particular sample. Tables 3.2 and 3.3 are examples of the input and output of one row in our representation (a minimal code sketch of this transformation follows Table 3.3). Note that in this instance, PCI0 and PCI144 have the value 0. This may either mean that the RSRP value is 0, or that the data value is missing. This means that any information regarding the difference between missing data and RSRP 0 will be lost to the algorithm. Because each sample contains the SCM and a maximum of 8 NCMs, there can be at most 9 non-zero values in each feature vector. Finally, it should be noted that because the samples available for each CGI contain varying numbers of unique PCIs, the length of the feature vector may vary between classification models.

    Table 3.2: Input

SCM      15
PCI0     0
...
PCI99    24
...
PCI144   0

    Table 3.3: Output

    ICM 1
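As referenced above, here is a minimal sketch of the dense-to-sparse transformation using pandas; the column names (sample_id, pci, rsrp) are hypothetical and the values illustrative.

```python
import pandas as pd

# Hypothetical long-format neighbor measurements for one (CGI, IF) model;
# each row is one NCM of one sample. Column names are illustrative.
ncm = pd.DataFrame({
    "sample_id": [0, 0, 1, 1, 1],
    "pci":       [99, 144, 99, 12, 302],
    "rsrp":      [24, 18, 21, 17, 24],
})
scm = pd.Series({0: 15, 1: 13}, name="SCM")

# One column per PCI; missing neighbors become 0, so "missing" and
# "RSRP 0" are no longer distinguishable, as noted above.
features = (ncm.pivot_table(index="sample_id", columns="pci", values="rsrp", fill_value=0)
               .add_prefix("PCI"))
features.insert(0, "SCM", scm)
print(features)
```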



    3.3 Data statistics

The total number of samples, when limiting the available data for each model to 5000 samples, is 1415049. The number of classification models, where each model is constituted by one cell and one target cell frequency, equals 428. This gives each model an average of 3283 samples, or around 2955 training samples, if split 9 to 1. The average number of neighbor measurements per sample is 3. As the average number of neighbor features per feature vector is 80, the data representation is sparse, meaning that a high number of data matrix elements are empty. The average percentage of elements in the feature matrices that are non-zero is 4.96%; we will call this value the allotment. We also define a single feature's allotment as its ratio of non-zero values. It is important to note that with our data representation, the maximum possible allotment is quite low. For example, if our feature vector contains 90 features, we may have a possible maximum of 10% allotment, because each sample may contain at most the serving cell and an additional 8 neighbors, meaning the rest of the neighbors will be 0 for this particular row in the matrix.
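As a small sketch, assuming a toy NumPy feature matrix, the allotment and per-feature allotment defined above can be computed as follows.

```python
import numpy as np

# Hypothetical feature matrix for one model (rows = samples, columns = features).
X = np.array([[15, 0, 24, 0, 0],
              [13, 17, 0, 0, 24]])

# Allotment: ratio of non-zero elements in the whole feature matrix.
allotment = np.count_nonzero(X) / X.size
# Per-feature allotment: ratio of non-zero values in each column.
feature_allotment = np.count_nonzero(X, axis=0) / X.shape[0]
print(allotment, feature_allotment)
```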

    Figure 3.1: Output class skewness per frequency

Figure 3.1 shows the distribution of the output classes depending on the target frequency of the model. For the lowest frequency, which contains 19.7% of the data, the skewness towards the '1' class is extreme, at 93.2%. The highest frequency, which contains 44.0% of the data, is instead very skewed towards the '0' class, at 87.0%. Finally, the middle frequency, which contains 36.3% of the data, is 69.1% skewed towards the '0' class. The varying levels of skewness on the different frequencies mean that the average RSRP value of the outcome variable before conversion varies between frequencies. The lowest frequency, for example, has an average RSRP far higher than 24, while the middle frequency has an average RSRP slightly below 24. In total, there are only 33 models in the range of 50%–70% skewness, whereas there are 330 models in the range of 80%–100% skewness. In addition to the skewness, we note that the allotment of the different frequencies is close to equal, at 5.34%, 4.87%, and 4.83% for low, middle, and high, respectively.

It is important to note that these frequency bands are not representative of every LTE/LTE-A network. There are many more frequencies in use for LTE/LTE-A networks, and the number of bands in use also varies between networks. The configuration of the network is of course decided by the owner of the network, and we therefore do not possess information as to why these exact frequencies were chosen for this particular network.



    3.4 Choice of algorithms

There are many supervised learning algorithms, but many of these are variations within a category of algorithms. For instance, k-nearest neighbor and radius nearest neighbor are both implementations of the nearest neighbor category of algorithms. In addition to this, many supervised learning algorithms only work for, or are mostly suited for, either regression or classification. We selected 7 distinctly different learning algorithms spanning the most widely used approaches to binary classification. Table 3.4 shows the chosen algorithms and the abbreviations we will be using from this point on.

    Table 3.4: Algorithm suite

Logistic regression                LogReg
Support vector machine             SVM
k-Nearest neighbor                 KNN
Gaussian processes                 GP
Random forest                      RF
Gradient boosting decision trees   GBDT
Multi-layer perceptron             MLP

    3.5 Initial test suite

The suite of algorithms was implemented in Python using scikit-learn, with hyperopt-sklearn for hyperparameter optimization. Where hyperopt's default parameter distributions were not applicable, a short test run on a custom parameter distribution was performed, before using the best average configuration for all models. In addition, the full suite was also tested with 3 different dimensionality reduction techniques: PCA, TSVD, and SPCA. For all of these runs, [0,1]-scaling was used. In addition, accuracy was used as a scoring function during training. The algorithms included, as well as the choice of optimization, can be seen in Table 3.5, where Cs is a factor for regularization strength, res is the number of restarts (complete re-training from new initial parameters), rbf is the kernel choice (radial basis function), and x, y, z is the number of hidden units in each layer, where x is the input layer, y are the hidden layers, and z is the output layer; #f is the number of features, relu is the choice of activation function, and lbfgs (limited-memory Broyden-Fletcher-Goldfarb-Shanno) is the choice of minimization algorithm.

    Table 3.5: Algorithms and Configurations

Algorithm   Implementation               Hyperparameters
LogReg      LogisticRegressionCV         Cs=10
SVM         SVC                          hyperopt default
KNN         KNeighborsClassifier         hyperopt default
GP          GaussianProcessClassifier    res=3, rbf(0.25, (1e-5, 5e4))
RF          RandomForestClassifier       hyperopt default
GBDT        GradientBoostingClassifier   hyperopt default
MLP         MLPClassifier                #f, #f/2 (x3), 1, relu, lbfgs

After the initial runs, GBDT, RF, LogReg, and MLP were selected for further investigation. Meanwhile, dimensionality reduction by any of the three techniques was abandoned.



    3.6 Result parsing and visualization

During the training runs, the frequency, number of samples, skewness, training time, classification time, Accuracy, Precision, Recall, F1 score, and AccSkew were collected for each type of classifier. These values were stored in .csv files where one row represents the results of one model. In order to be able to analyze the large amount of saved results, a parsing program was written in Python that extracts the average performances for models with certain model or data attributes. For example, the program can calculate the average Accuracy of every model for which the number of samples is in a certain range. In addition to the parsing program, a program for visualization was written in Python, primarily using the packages pandas and matplotlib.

    3.7 Tool specifics

Scikit-learn uses a version of the CART algorithm for growing its trees, which is used by the RF and GBDT algorithms. In scikit-learn, some algorithms give the option of choosing the number of cores to use for training in parallel. For example, this option exists for RF but not for GBDT. However, this is not something specific to scikit-learn; the boosting stages of gradient boosting are inherently sequential and cannot be trained in parallel. This is worth noting when considering the end results.

    3.8 Hyperparameter optimization

When optimizing the algorithms selected for further investigation, a combination of grid search and random search was used. Firstly, a grid was specified and run on 20% of the models, selected so that many different skewness levels were represented. The hyperparameter values which were never selected, or selected very seldom, were removed. If the lowest or highest value was selected often, then a lower or higher value, respectively, was introduced into the grid. This process was repeated 3 times.

The resulting grids' numeric parameters were transformed into uniform continuous distributions from the lowest value to the highest value of each grid. Table 3.6 shows the transformation from a grid to a search space.

    Table 3.6: Search space grid to distr. transformation

Attr. & Val. (Grid)                 Attr. & Val. (Distr.)
penalty: 'l1', 'l2'                 penalty: 'l1', 'l2'
solver: 'saga', 'liblinear'         solver: 'saga', 'liblinear'
C: 0.1, 1.0, 10.0                   C: uniform{0.1, 10.0}
max_iter: 10, 100, 300              max_iter: uniform{10, 300}
class_weight: None, 'balanced'      class_weight: None, 'balanced'

The resulting random search distributions are used as the final hyperparameter optimization configuration. Each of the final random search distributions outperformed the default hyperopt-sklearn distributions, as well as the custom hyperparameter configurations.

    3.9 Final algorithm configurations

The final implementations of the algorithms use the MaxAbsScaler and RandomizedSearchCV classes of scikit-learn. This represents a [0,1]-scaling and a random search hyperparameter optimization with 3-fold cross-validation for each configuration.
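As a minimal sketch of this setup, the following combines MaxAbsScaler and RandomizedSearchCV in a scikit-learn pipeline; the data and the search distributions are illustrative placeholders (the thesis's actual search spaces are given in Tables 3.7-3.10), and scipy's randint is used here because the tuned tree parameters are integer-valued.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Illustrative stand-in for one model's feature matrix and binary labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipe = Pipeline([
    ("scale", MaxAbsScaler()),                       # [0,1]-scaling for non-negative features
    ("clf", RandomForestClassifier(random_state=0)),
])

# Illustrative search space; randint(low, high) samples integers in [low, high).
param_distributions = {
    "clf__n_estimators": randint(8, 1257),
    "clf__min_samples_split": randint(2, 21),
    "clf__min_samples_leaf": randint(1, 4),
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=20,
                            cv=3, scoring="accuracy", random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```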



    Logistic regression

The hyperparameter search space for the LogReg classifier can be seen in Table 3.7. solver determines which minimization algorithm to use, and C is the inverse of the regularization strength; the lower the value, the stronger the regularization. For hyperparameters that are not listed, the default value is chosen.

Table 3.7: Logistic Regression hyperparameter search space

Attribute   Values
solver      'saga'
C           uniform{0.1, 10.0}

    Multi-layer perceptron

The hyperparameter search space for the MLP can be seen in Table 3.8, where nf is the number of features and each item within each parenthesis of hidden_layer_sizes represents one hidden layer. activation specifies the activation function used, and alpha specifies the regularization strength; the higher the value, the stronger the regularization. learning_rate_init specifies the constant learning rate to use throughout training. For hyperparameters that are not listed, the default value is chosen.

    Table 3.8: Multi-layer perceptron hyperparameter search space

Attribute            Values
hidden_layer_sizes   (nf*2, nf*3, nf*4), (nf*4, nf*6), (nf*8)
activation           'relu'
solver               'adam'
alpha                uniform{0, 0.01}
learning_rate_init   uniform{0, 0.1}

It can be noted that the configurations of hidden layers and units only contain layers with an increasing number of units per layer. Equal and descending numbers of units were also tested but were discarded during the multi-stage grid search due to being selected very seldom.

    Random forest

The hyperparameter search space for the RF can be seen in Table 3.9. n_estimators determines the number of trees to grow, min_samples_split determines the number of samples required in the learning set of a node in order to split it, and min_samples_leaf determines the number of samples that must be present in the learning set of a leaf node.

    Table 3.9: RF hyperparameter search space

Attribute           Values
n_estimators        uniform{8, 1256}
min_samples_split   uniform{2, 20}
min_samples_leaf    uniform{1, 3}

    Gradient boosting decision trees

The hyperparameter search space for the GBDT can be seen in Table 3.10. The first four parameters have the same meaning as for the MLP and RF, and max_depth specifies the maximum allowed tree depth for each estimator.



    Table 3.10: Gradient boosting decision tree hyperparameter search space

Attribute           Values
n_estimators        uniform{8, 1256}
learning_rate       uniform{0.001, 1}
min_samples_split   uniform{2, 20}
min_samples_leaf    uniform{1, 3}
max_depth           uniform{1, 10}

It is interesting to note that the configurations that proved effective for RF and GBDT are very similar. Surprisingly, setting a value for max_depth proved effective for GBDT but not for RF. Specifically, removing the parameter from GBDT increased training time but did not largely affect predictive performance. Adding the parameter to RF did not largely affect training time but reduced predictive performance.

    3.10 Analytic experiments

In order to determine the predictive performance of the optimized algorithms, the Accuracy, F1 score, and AccSkew were analyzed. In addition to comparing these metrics when averaging over all models, we also studied how the results varied when averaging over models based on different levels of skewness in the data. The motivation for this is that preliminary testing showed that the AccSkew was generally quite low, meaning that the skewness seemed very important for the Accuracy of the models. We wanted to determine whether this was an effect of the data set generally being very skewed, or whether the data contained information which was difficult for the algorithms to interpret.

In order to have time to complete the various subsequent experiments, the number of algorithms to use had to be limited. Because the RF showed the best predictive performance while having a reasonable training time, it was used alone for some of the experiments.

    Training- and test score

In order to determine whether or not our models suffer from over- or underfitting, we analyzed the difference in training and test scores over the four optimized algorithms. Since the models are trained with the accuracy score as an objective function, we compared it to the accuracy of the trained models on the test set. The performance was analyzed when studying an average over all models and when averaging over models based on different levels of skewness.

    Scaling technique

During the initial test run and all subsequent experiments up to the point of our four finalized algorithms, [0,1]-scaling was used. However, it is not certain that [0,1]-scaling, or scaling at all, is the optimal choice for our configurations.

In order to determine whether or not [0,1]-scaling was the strongest technique once the hyperparameter optimization had been finalized, a study was conducted in which the scaling technique was varied while all other factors remained constant. Apart from [0,1]-scaling, standardization and no scaling were evaluated. Only the RF was used in this study.

    Training- and classification speed

In order to easily test the application in the field, everything should be included in a package ready to run in a node. This means that data collection, storing samples, training, storing



a model, and classification should all run in the same package without the need for an external system. Because of this, both the speed of training and the speed of classification are of interest. It is important to note that should a more intricate system be used, where the training is performed in a cloud environment, for example, training speed and the memory requirements of training data storage would no longer be important factors.

In order to determine how the algorithms perform in regards to speed of training and classification, these results were studied based on different training sizes. It is important to note that we do not analyze the speed of the algorithms from a purely theoretical standpoint, but rather compare these specific implementations of the algorithms. Hopefully we should be able to draw some conclusions that are not implementation-specific, as it is not certain that Ericsson would use these specific implementations, should any of the algorithms be implemented for node usage.

    Sample size

Studying the number of samples to use when training is interesting for various reasons. For this application in particular, it is important because the system, once installed in the field, must gather samples for a time before it can be used. The quicker the system can be up and running in a node, the better. In general, it is also possible that a lower number of samples does not reduce predictive performance, in which case using a larger number of samples is pointless and only increases memory requirements and training time. For all the experiments presented in this subsection, only the RF was used.

In order to determine how the training size affects performance, we first analyzed how the AccSkew varied when studying RF models for which varying training sizes were available. Since the result of this was deemed inconclusive, and could be dependent on other factors of the models rather than simply the training size, a more intricate study was performed. In this study, the cap of 5000 samples was removed and 12 models were found for which the total number of samples surpassed 8000. This meant that we could train on up to 7000 samples. The training samples in each interval were a random subset of the full data set of each model, drawn stratified to maintain the same skewness between training instances. The models were trained with samples drawn in intervals of 1000, and model training was repeated 3 times for each training size. This was done because of the potential variation in results. The lower the training size, the lower the test size, and the lower the test size, the more arbitrary the evaluation of model performance is. In addition, if we had only trained on, for example, 1000 samples when the full set has 7000 samples, we would risk choosing a sample set which is not representative of the full set. This only pertains to variations in the input, as output class distribution variations are counteracted by stratification. After these tests were run, we analyzed the average AccSkew for the 12 models over the 7 intervals, as well as each model's performance individually.

With the information gathered from this experiment, yet another experiment was conducted in which the training sizes were drawn in intervals of 200, from 200 to 1000. Because of the low training sizes in this study, the test size was increased from 10% to 25%. In addition, model training was repeated 10 times instead of 3.
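A minimal sketch of this kind of stratified subsampling, using scikit-learn's train_test_split on synthetic stand-in data (the sizes and skewness below are illustrative), could look as follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_subset(X, y, n_samples, seed):
    """Draw a random subset of n_samples rows while preserving the class skewness."""
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=n_samples, stratify=y, random_state=seed)
    return X_sub, y_sub

# Hypothetical full data set for one model with more than 8000 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 40))
y = (rng.random(8000) < 0.2).astype(int)   # ~80% skewed towards class 0

# Training sizes drawn in intervals of 1000 up to 7000, each repeated 3 times.
for n in range(1000, 8000, 1000):
    for rep in range(3):
        X_tr, y_tr = stratified_subset(X, y, n, seed=rep)
        # ...train and evaluate the RF on X_tr, y_tr here...
```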

    Feature size

Feature selection is of interest when exploring the information available in the data, and how model performance varies when it is used. It is also always helpful to be able to decrease the amount of data needed to train the model, especially when memory requirements are to be considered.

In order to determine the quality of our feature composition, we trained the RF on various subsets of our feature space; training with all features was compared to training with only the



neighbor features, and with only the serving cell feature. The results were studied when averaging over all models, and when averaging over models grouped by skewness level.

After this preliminary study, an additional study was conducted in which neighbor features were removed based on their allotment. The RF was trained with the number of features set to 1/2, 1/4, 1/8, and 1/16 of the original amount.
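A possible sketch of this reduction, assuming that the columns with the lowest allotment are dropped first (the selection criterion, function name, and data below are assumptions for illustration), is the following.

```python
import numpy as np

def keep_densest_features(X, fraction):
    """Keep the given fraction of feature columns, preferring those with the
    highest allotment (ratio of non-zero values); assumed selection criterion."""
    allotment = np.count_nonzero(X, axis=0) / X.shape[0]
    n_keep = max(1, int(X.shape[1] * fraction))
    keep = np.sort(np.argsort(allotment)[::-1][:n_keep])   # densest columns, original order
    return X[:, keep]

# Illustrative sparse feature matrix with 80 columns.
X = (np.random.default_rng(0).random((1000, 80)) > 0.95) * 20.0
for frac in (1/2, 1/4, 1/8, 1/16):
    print(frac, keep_densest_features(X, frac).shape)
```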

    Undersampling

Undersampling can help us understand how the information in the data varies between models when skewness is not a factor. It could also be helpful when trying to increase the performance on the positive class in a data set skewed towards the negative class. Finally, reducing the amount of data needed to train the model is always helpful.

Because the data is heavily skewed, an experiment with uniform random undersampling was performed. After undersampling, each training and test set is exactly 50% skewed. Because the skewness levels are equal, there was no need to evaluate the AccSkew; instead, we simply compared the Accuracy with and without undersampling. This study was conducted with the RF.
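A minimal sketch of uniform random undersampling to a 50/50 class balance, on illustrative synthetic data, is shown below.

```python
import numpy as np

def undersample_to_balance(X, y, seed=0):
    """Uniform random undersampling of the majority class until both classes
    are equally represented (50% skewness)."""
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == 0)
    n = min(len(idx_pos), len(idx_neg))
    keep = np.concatenate([rng.choice(idx_pos, n, replace=False),
                           rng.choice(idx_neg, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Example: a label vector with roughly 80/20 skew becomes exactly 50/50.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.2).astype(int)
X_b, y_b = undersample_to_balance(X, y)
print(y_b.mean())   # 0.5
```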

    ROC analysis

ROC analysis is best used when analyzing systems for use in multi-discriminator diagnostics, but it can also be used to analyze whether or not predictive performance can be increased for single-discriminator systems. Since most classifiers use a 0.5 discriminator by default, we can analyze the ROC curve and try to distinguish whether other discriminators yield better performance.

In order to determine whether or not varying the discriminator could increase our predictive performance on the positive class, we performed ROC analysis on the same 12 models that were used in the training size study. Because we were only interested in finding a single optimal threshold, and not the overall diagnostic capability of the models, we chose not to study the AUC but rather to analyze the individual curves.

    Objective function

Accuracy may not be the best metric for evaluating the predictive power on the positive class in particular. For this objective, for example, a high F1 score is perhaps more sought after than a high accuracy. If this is the case, using accuracy as a scoring function during training may not be optimal.

In this experiment, we train the RF with the F1 score and ROC-AUC as scoring functions in addition to accuracy, and evaluate the resulting accuracy and F1 score.
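A hedged sketch of varying the scoring function in the random search (the search space and data are illustrative placeholders, not the thesis's configuration) follows.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, skewed stand-in data.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=0)

# The same search, repeated with three different scoring functions.
for scoring in ("accuracy", "f1", "roc_auc"):
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": randint(8, 256)},
        n_iter=10, cv=3, scoring=scoring, random_state=0)
    search.fit(X, y)
    print(scoring, search.best_score_)
```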

    3.11 Hybrid algorithms

The various experiments conducted show that the AccSkew is smaller the larger the skewness, that the training size had little to moderate impact, and that reducing the feature size by allotment does not impact performance at some levels. We have also established that while RF outperforms LogReg at low skewness levels, RF and LogReg have more similar predictive performance the higher the skewness. At a high enough skewness level, no algorithm has a significant AccSkew.

Therefore, hybrid implementations were created in which the algorithm used depends on the level of skewness of the data. For 50–70% skewness, the optimized RF algorithm was used. For 70–90% skewness, the optimized LogReg algorithm was used. Finally, for 90–100% skewness, the most common class is chosen. In addition to these criteria, variations were introduced based on the feature size and training size. For Hybrid 1, the full training size and feature set were used. For Hybrid 2, the number of features was cut in half based on



    allotment. F
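As a rough sketch of the skewness-based selection described above (the function name and the representation of skewness as a fraction are assumptions for illustration; rf and logreg stand for the already-optimized estimators), the hybrid choice could be expressed as follows.

```python
from sklearn.dummy import DummyClassifier

def choose_estimator(skewness, rf, logreg):
    """Pick an estimator based on the output-class skewness of the model's
    data set, given as a fraction (e.g. 0.65 for 65% skewness)."""
    if skewness < 0.70:
        return rf        # optimized RF for 50-70% skewness
    if skewness < 0.90:
        return logreg    # optimized LogReg for 70-90% skewness
    # 90-100% skewness: always predict the most common class.
    return DummyClassifier(strategy="most_frequent")
```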