
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

Machine Learning for Traffic Classification in Industrial Environments

Degree Project in Electrical Engineering, Second Cycle.

FILIP BYRÉN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Abstract

Consumption has increased drastically over the years, and consumers place high demands on product quality, delivery time and personalization options. Factories try to keep up with these demands by replacing human labour with automation devices that produce goods more rapidly and with higher precision. Wireless communication in the factories would help to achieve this goal by enabling mobility, reducing cable reconfiguration and troubleshooting, and increasing the utilization of factory resources.

This report investigates whether beneficial wireless communication can be achieved in a production line, where the evolved Node B scheduler prioritizes important cyclic Real-Time and alarm packets using machine learning based classification models. This prioritization technique would give important factory applications high priority and make sure that important packets get served. We found several useful application classification models for factory environments, but demonstrated that the best model may depend on the factory setup. Therefore, the report also introduces the idea of automated deep learning model construction, which allows the model to improve over time.

Referat

Machine Learning for Classification in Industrial Environments

Consumption and consumers' demands on products have increased drastically in recent years. Consumers require that products can be shipped shortly after ordering, with the option to adapt them to personal preferences. Factories are continuously improved to satisfy these demands. As a result, factories have replaced human labour in production with automated robots that can produce more efficiently and with higher precision.

What the industry is trying to enable in the future is for these robots to communicate wirelessly. If this succeeds, production could become mobile, with robots moved around to increase productivity. Wireless communication would also mean less cable rerouting and troubleshooting, and lower costs.

This report studies whether it is possible to prioritize important Real-Time applications and alarms sent in factories, and to make sure that they are always handled first in wireless networks. The approach is to develop a machine learning model that can classify applications, and to use this model in the component "evolved Node B". The evolved Node B is responsible for granting clients the frequencies used to send data in wireless networks, and the goal is to make sure that the important applications are prioritized.

The result was that several machine learning models could classify the applications, but the best model turned out to depend on which types of applications needed to be classified. The report therefore goes on to discuss a future, hypothetical automated model that adapts to the applications of the production line.

Acknowledgment

I am grateful to Filip Mestanov for providing materials and guidelines during uncertainties in the research, and I thank the company HMS for offering a data capture from their industrial environment. I acknowledge professor Viktoria Fodor from the department of network and system engineering for accepting the offer to be the examiner and supervisor for the thesis. Finally, I appreciate Ericsson AB and Niklas Johansson for constructing the thesis. The research would not have existed without these people and companies. Thank you,

Filip Byrén

Contents

Acknowledgment
List of Figures
List of Tables

1 Introduction
  1.1 Production line
  1.2 Wireless
  1.3 Problem
  1.4 Purpose
  1.5 Goals and limitations
  1.6 Methodology
  1.7 Outline

2 Background
  2.1 Profinet
    2.1.1 Introduction
    2.1.2 History
    2.1.3 Profinet protocol
    2.1.4 Profinet communication
  2.2 Cellular network system
    2.2.1 eNodeB
  2.3 Factory
    2.3.1 Converged Plant-wide Ethernet
    2.3.2 Manufacturing Zone
    2.3.3 Demilitarized Zone
    2.3.4 Enterprise Zone
    2.3.5 Wireless
  2.4 Analytics
    2.4.1 Traffic capture and DataFrame
    2.4.2 Evaluate a classification model
  2.5 Classification using machine learning
    2.5.1 Introduction to machine learning
    2.5.2 Data
    2.5.3 Deep learning
    2.5.4 Optimizer function
    2.5.5 Common types of activation functions
    2.5.6 Convolutional Neural Networks
    2.5.7 Recurrent neural network
    2.5.8 Improving the model
  2.6 Related works
    2.6.1 Internet Traffic Classification Using Feed-forward Neural Network
    2.6.2 Network Traffic Classifier With Convolutional and Recurrent Neural Networks for Internet of Things
    2.6.3 Summary

3 Implementation
  3.1 Goal
  3.2 Captures
  3.3 Feature extraction
  3.4 Empirical observations
  3.5 Data analytics
  3.6 Machine learning models

4 Results and conclusions
  4.1 Computer specification
  4.2 Model evaluation for HMS
  4.3 Model evaluation for DEFCON
  4.4 Benefits of using machine learning for the eNodeB scheduler
    4.4.1 eNodeB scheduler gain
  4.5 Conclusion

5 Discussion
  5.1 Benefits
  5.2 Genetic algorithm
  5.3 Compressed neural networks
  5.4 Creating new auto AI models
  5.5 Problems of using machine learning
  5.6 Concern about AI and the engineers role in the future society

References

Abbreviations

ADAM      Adaptive Moment Estimation
AI        Artificial Intelligence
ARP       Address Resolution Protocol
ART       Acyclic Real-Time
CPwE      Converged Plantwide Ethernet
DNN       Deep Neural Network
eNodeB    Evolved Node B
HTTP      Hypertext Transfer Protocol
IACS      Industrial Automation and Control System
IO        Input Output
IRT       Isochronous Real-Time
IT        Information technology
LLDP      Link Layer Discovery Protocol
LSTM      Long Short Term Memory
ML        Machine Learning
MLP       Multi Layer Perceptrons
MTCD      Machine-Type Communication Device
NRT       Non Real-Time
NTC       Network Traffic Classification
PN-DCP    Profinet Discovery and Configuration Protocol
PN-PTCP   Profinet Precision Transparent Clock Protocol
PNIO      Profinet Input Output
PNIO-AL   Profinet Input Output Alarm
PNIO-CM   Profinet Input Output Context Manager
PNIO-PS   Profinet Input Output Provider Status
QoS       Quality of Service
RAM       Random Access Memory
ReLU      Rectified Linear Unit
RT        Real-Time
SVM       Support Vector Machine
TCP       Transmission Control Protocol
UDP       User Datagram Protocol
UE        User Equipment

List of Figures

2.1 Profinet-IO protocol stack structure, drawn with https://draw.io/.
2.2 The Ethernet header for Profinet RT and IRT with the additional 802.1Q frame added; the figure is from [1].
2.3 The scheduling of resource units in the eNodeB for granting UE/MTCDs access to transmit data; the figure is from [2].
2.4 The framework structure for CPwE; notice that no direct communication occurs between the Enterprise Zone and the Manufacturing Zone. The figure is from [3].
2.5 The concept of a Decision Tree as a classification model. In this case the outputs are C and D; based on the Boolean features A and B, the model follows the correct path and determines the output. The figure is made using https://www.draw.io/.
2.6 The classification concept using SVM classification: if a new data point is on the left side of the line, it is given the same label as the other points on the left side, and vice versa. The figure is from [4].
2.7 The two core ideas of a deep neural network are the artificial neuron and the neural network: (a) an artificial neuron and (b) a deep neural network.
2.8 The typical activation functions used in deep learning; the figure is from [5].
3.1 The time series matrix used as input data for the LSTM and Convolutional models; the figure is generated using https://www.draw.io/.
3.2 The features mixture model for each application in the HMS data-set.
3.3 The features mixture model for each application in the DEFCON data-set.
3.4 The distribution of source interval times for four application types in the HMS data-set. The applications selected were PNIO-AL, PNIO-CM, PN-DCP and PNIO-PS.
3.5 The packet behaviour for each application. Each colour represents one source and how it sends its packets, to study whether there is any cyclic or acyclic behaviour. The result is from the HMS data-set.
3.6 The distribution of source interval times for four application types in the DEFCON data-set. The applications selected were HTTP, PN-DCP, PN-PTCP and PNIO.
3.7 The packet behaviour for each application. Each colour represents one source and how it sends its packets, to study whether there is any cyclic or acyclic behaviour. The result is from the DEFCON data-set.
3.8 The frequency of each application in the HMS data-set, containing 10140 packets.
3.9 The frequency of each application in the DEFCON data-set, containing 1046036 packets.
3.10 The correlation for each feature in the HMS data-set, to determine how each feature and application correlate.
3.11 The correlation for each feature in the DEFCON data-set, to determine how each feature and application correlate.
3.12 The MLP model design for both data-sets.
3.13 The Convolutional model design for both data-sets.
3.14 The LSTM model design for both data-sets.
3.15 The Convolutional LSTM model design for both data-sets.
4.1 Model evaluation scores for the HMS test data-set.
4.2 The classification model delay for the selected models in the HMS data-set.
4.3 The classification delay for the fastest classification models in the HMS data-set.
4.4 Confusion matrix result using the HMS test data-set for the Decision Tree (C4.5).
4.5 Confusion matrix result using the HMS test data-set for the Convolution LSTM.
4.6 Model evaluation scores for the DEFCON test data-set.
4.7 The classification model delay for 1000 packets, for the DEFCON data-set.
4.8 The classification delay for the fastest models in the DEFCON data-set.
4.9 Confusion matrix result using the DEFCON test data-set for the Decision Tree (C4.5).
4.10 Confusion matrix result using the DEFCON test data-set for the Convolution LSTM model.
4.11 HMS traffic streamed to the eNodeB during one second. The eNodeB understands the packets' applications using ideal ML.
4.12 In this example for HMS, the eNodeB views all packets as equal.
4.13 DEFCON traffic streamed to the eNodeB during one second. The eNodeB understands the packets' applications using ideal ML.
4.14 In this example for DEFCON, the eNodeB views all packets as equal.
4.15 The accumulated score for different models, showing the potential gain of using a machine learning classifier to prioritize important industrial applications in the HMS data traffic.
4.16 The potential buffer benefit of using a machine learning model in the eNodeB versus without, for the HMS data traffic.
4.17 The drop rate of each application using machine learning, for the HMS data traffic.
4.18 The drop rate of each application without machine learning, for the HMS data traffic.
4.19 The serve rate increased by 33.3% for the eNodeB with no ML installed, so that as many important applications get served as when using machine learning, for HMS traffic.
4.20 The accumulated score for different models, showing the potential gain of using a machine learning classifier to prioritize important industrial applications in the DEFCON data traffic.
4.21 The buffer load for the DEFCON traffic, with and without machine learning.
4.22 Packet drop rate for the applications, using machine learning, in the DEFCON traffic.
4.23 Packet drop rate for the applications, without machine learning, in the DEFCON traffic.
4.24 The serve rate increased by 50% for the eNodeB with no ML installed, so that as many important applications get served as when using machine learning, for DEFCON traffic.
5.1 Illustrative image showing how each factory eNodeB could be updated and improved over time, using some type of automated machine learning that tries to improve model classification performance and latency. The figure is generated using https://www.draw.io/.
5.2 Idea of how the future factory eNodeB function in the cellular network can improve over time; the figure is generated using https://www.draw.io/.

List of Tables

2.1 The priority level for different traffic class services in the 802.1Q header; the table is from [6].
2.2 Common services used in Profinet communication and their priority on a 1 (low) to 9 (high) scale. This scale is based on my conclusion motivated in the background, (slide 14 [7]) and (section 2-4 [8]).
2.3 The concept of a confusion matrix; here we have four applications, each containing 15 packets in total. The first element in each row of the matrix is the true application for the packet. Each column after the first presents the classification.
3.1 The HMS/DEFCON village data-set split into two sets for training and testing.

Chapter 1

Introduction

AI is not here to take part, it is here to take over.

Conor McGregor, twist of original quote

1.1 Production line

Consumption has intensified radically over the years. "Världskoll", a non-profit company funded by the United Nations, assesses that consumption in Sweden increased by 46% from 1990 to 2015 [9]. Besides this increase, consumers' standards for products have also risen. They require higher consistency, faster delivery, better quality and more personalization options than ever before [10]. Factories try to scale with consumers' demands by removing human labour and deploying automation devices that can produce products more rapidly and with high precision. The issues factories face today are the fluctuating demand for different products, and mass production is not always an option as storage capabilities are limited. Companies try to solve this by keeping close feedback from the market and producing products based on current needs. This means they need production line devices that can adapt their tasks quickly. Today the industry is talking about "Industry 4.0", where it looks into using artificial intelligence, among other things, to create adaptive manufacturing in which robots can learn, for instance, by human demonstration, enabling a fast relearning process [10]. However, the need does not end there: as storage capacity and material are limited, the production line setup also needs to be adaptive, which means that devices need to change places in the production line to produce more efficiently. A way to enable this is to make the production line communication wireless; this would enable mobility, reduce cable reconfiguration and troubleshooting, and allow higher utilization of factory resources.


1.2 Wireless

Several technologies enable wireless communication today; WiFi, Bluetooth and cellular networks are a few examples from the large pool of options. WiFi connects devices locally and gives them access to the Internet through the WiFi router. WiFi focuses on giving high speed to local devices using a shared frequency spectrum for sending data. The first drawback of WiFi is that it does not use a licensed spectrum, so other devices might use the same spectrum for sending data, resulting in interference. The second problem is that the signal strength is very dependent on the environment, where surrounding objects can change the signal strength. Finally, WiFi was not designed for scaling, so using it in a factory production line would be a poor solution. Bluetooth is a device-to-device communication technology and, like WiFi, it uses a shared spectrum. There are several disadvantages of using Bluetooth for factory device communication, for instance the range limitations, the interference with other devices, and the latency. The technology the industry is looking into is cellular networks. This is because cellular networks use licensed spectrum and are designed to be scalable, robust and mobile. Moreover, future standards of cellular networks can offer the low latencies that most production line devices require.

1.3 Problem

The technical difficulty with wireless communication in factories is that, in general, production line communication must be sent periodically with low latency. Those latency requirements are not achievable with today's cellular networks, but will be in the future. An additional problem the industry faces is that the automation devices communicate using industrial Ethernet protocols, which were not designed for wireless communication. This is an issue during traffic peak loads, where the production line communication needs to be prioritized. If the cellular network cannot distinguish important packets and decides not to serve a cyclic production line packet directly, most industrial protocols interpret this as the communication having ended. The outcome is that the production halts. This cannot be allowed to happen, as any production delay translates into major production loss. The report addresses this problem, and the research question is whether it is possible to achieve an effective application classifier model using machine learning, making sure important industrial packets get prioritized in a cellular network.

1.4 Purpose

If the cellular network knew how important the different packets in the traffic flow are, it could prioritize the ones that are most important for the production. The solution is to build a network traffic classifier (NTC) that can find the characteristics of data sent from factories and classify it, so that less packet loss occurs for crucial data, thereby improving the quality of service (QoS) for the factories.
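To illustrate why knowing packet importance helps, the following is a minimal sketch, not the scheduler studied in the thesis. It assumes a single scheduling slot with capacity for only two packets and compares serving in arrival order with serving by importance. The application names, the priority numbers and the functions `serve_fifo` and `serve_by_priority` are hypothetical and chosen only for illustration.

```python
import heapq
from collections import deque

def serve_fifo(packets, capacity):
    """Serve the first `capacity` packets in arrival order; the rest are dropped."""
    queue = deque(packets)
    return [queue.popleft() for _ in range(min(capacity, len(queue)))]

def serve_by_priority(packets, capacity):
    """Serve the `capacity` most important packets; ties keep arrival order."""
    # heapq is a min-heap, so the priority is negated; the arrival index breaks ties.
    heap = [(-prio, idx, name) for idx, (name, prio) in enumerate(packets)]
    heapq.heapify(heap)
    served = []
    for _ in range(min(capacity, len(heap))):
        neg_prio, _, name = heapq.heappop(heap)
        served.append((name, -neg_prio))
    return served

# Hypothetical peak-load slot: (application, priority), higher means more important.
slot = [("HTTP", 1), ("PN-DCP", 3), ("PNIO cyclic", 8), ("HTTP", 1), ("PNIO-AL", 9)]

fifo = serve_fifo(slot, capacity=2)
prio = serve_by_priority(slot, capacity=2)
print("arrival order:", fifo)
print("by priority:  ", prio)
```

In this toy slot, serving in arrival order drops both the cyclic and the alarm packet, while serving by priority keeps them. The classifier's role is to supply the priority that the second function takes for granted.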

1.5 Goals and limitations

The goal of the thesis is to answer the research question: whether it is possible to achieve an effective application classifier model using machine learning, making sure important packets get prioritized in a cellular network. To achieve this, the thesis processes data from factory enterprises and mimics how the cellular network receives the data. Different machine learning models are tested to observe whether applications can be correctly classified using the input the cellular network can extract from the packets during scheduling. The classification performance of the models then indicates whether application prioritization is achievable. The thesis needs to explore the following:

1. Industrial Ethernet Protocol: understand a common industrial protocol used in production lines. This knowledge is useful for finding characteristics in the packets that normally relate to a certain application. The limitation is that there are many industrial protocols, and different protocols may have different characteristics that relate to different applications.

2. The cellular network transmission scheduler: understand which core function in the cellular network gives clients access to frequencies for transmitting data.

3. General factory setup: understand how devices are set up in a factory, their roles, and how they communicate with the office landscape. This knowledge explains how we select data-sets that mirror reality as well as possible.

4. Data extraction: how the packets sent in a factory can be extracted and how they can be used for understanding the behaviour of different applications.

5. Data processing: how the relevant features can be extracted from the data. The research also needs to determine which features the cellular network receives when users transmit data.

6. Data analytics: understand the underlying behaviour of the different packets and see whether it confirms the behaviour we expect.

7. Classification models: generate several classification models and study them to show which one is the most suitable for the data-set.

8. Gain: find a way to show the benefit of using a classification model for industrial environments.
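The data extraction and data processing steps in the list above can be sketched as follows. This is a minimal illustration, not the feature extraction used in the thesis. It assumes capture rows with a hypothetical layout `(timestamp_seconds, source, length_bytes)` and computes, per source, the time between consecutive packets, which is one way to make cyclic behaviour visible. The function name `interarrival_by_source` and the example rows are made up.

```python
from collections import defaultdict

def interarrival_by_source(records):
    """Return {source: [time gaps in seconds between consecutive packets]}.

    `records` is an iterable of (timestamp_seconds, source, length_bytes),
    a hypothetical layout for rows exported from a packet capture.
    """
    last_seen = {}
    gaps = defaultdict(list)
    # Sorting the tuples orders the rows by timestamp first.
    for timestamp, source, _length in sorted(records):
        if source in last_seen:
            gaps[source].append(round(timestamp - last_seen[source], 6))
        last_seen[source] = timestamp
    return dict(gaps)

# Made-up rows: one device sending every 8 ms and one sporadic sender.
capture = [
    (0.000, "io-device-1", 60),
    (0.003, "laptop", 1500),
    (0.008, "io-device-1", 60),
    (0.016, "io-device-1", 60),
    (0.250, "laptop", 400),
]

gaps = interarrival_by_source(capture)
for source, values in gaps.items():
    print(source, values)
```

A source with nearly constant gaps suggests cyclic traffic, while widely varying gaps suggest acyclic traffic. Whether such a feature is available to the cellular network during scheduling is one of the questions the list above raises.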


The benefits of the work are a report showing how industrial production devices communicate, how a machine learning model can classify the application of each packet, how the machine learning model would be deployed in the cellular network, and the gain of using application classification for the production line.

There are many related topics on improving packet priority in a production line using wireless communication. Topics not considered in this research include a redesigned industrial Ethernet protocol for wireless communication, other prioritization techniques without machine learning, and the study of multiple industrial Ethernet protocols.

1.6 Methodology

The research uses quantitative measurement, as large sets of data exist in packet captures. Data exploration forms the basis for reasoning about the applications' behaviour, resulting in empirical observations of whether the data match the hypothesis or the background. If the behaviour of the data does not match the expectation from industrial standards or previous studies, the study will attempt to explain why that might be the case, or will question the provided data. This is an inductive approach, in which the thesis can change direction during the research in order to answer the research question.

The goal of the report is to answer the research question: "if it is possible to achieve an effective application classifier model using machine learning, making sure important packets get prioritized in a cellular network". This means that a conclusion indicating that a sufficient classification model is difficult to achieve is as good a result as finding a sufficient model. This encourages unbiased reasoning, since nothing is gained from, for instance, manipulating the results.

The methods used for building the classifier models come from the machine learning area. The reason for choosing machine learning is the adaptability machine learning tools have to different data characteristics. A machine learning model learns the mapping from input to output without anyone declaring it explicitly. This allows machine learning to classify the output of the data efficiently without prior knowledge of the data-set. The hypothesis for why machine learning is needed is that different industrial applications in the production line will most likely have similar packet behaviour. This makes the applications difficult to separate manually, whereas a machine learning model can instead find the mapping between the input and the output.

Because different machine learning models have different classification strengths, the thesis tries out several of them to see whether a handful can classify the applications. Several classification models therefore need to be tested, from simple machine learning models to more complex deep learning methods, to reflect the complexity of the data-sets as well as possible. The hypothesis behind deploying many different types of machine learning models is that together they will most likely indicate whether a network traffic classification model will perform well in the factories.
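The idea of trying several models on the same data can be sketched as a small comparison harness. This is only an illustration under stated assumptions: the two classifiers below (a majority-class baseline and a nearest-centroid rule) are simple stand-ins chosen to keep the example self-contained, not the models evaluated in the thesis, and the feature values are synthetic.

```python
from collections import Counter, defaultdict

def train_majority(samples):
    """Baseline: always predict the most frequent label in the training set."""
    label = Counter(lbl for _, lbl in samples).most_common(1)[0][0]
    return lambda features: label

def train_nearest_centroid(samples):
    """Predict the label whose mean feature vector is closest (squared distance)."""
    grouped = defaultdict(list)
    for features, lbl in samples:
        grouped[lbl].append(features)
    centroids = {
        lbl: [sum(col) / len(rows) for col in zip(*rows)]
        for lbl, rows in grouped.items()
    }

    def predict(features):
        return min(
            centroids,
            key=lambda lbl: sum((a - b) ** 2 for a, b in zip(features, centroids[lbl])),
        )

    return predict

def accuracy(predict, samples):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(predict(f) == lbl for f, lbl in samples) / len(samples)

# Synthetic (packet length in bytes, inter-arrival time in ms) -> application label.
train = [((60, 8), "PNIO"), ((64, 8), "PNIO"), ((62, 9), "PNIO"),
         ((900, 120), "HTTP"), ((1400, 300), "HTTP")]
test = [((61, 8), "PNIO"), ((1200, 200), "HTTP"),
        ((63, 10), "PNIO"), ((800, 90), "HTTP")]

trainers = [("majority", train_majority), ("nearest centroid", train_nearest_centroid)]
scores = {name: accuracy(trainer(train), test) for name, trainer in trainers}
print(scores)
```

Keeping every model behind the same train-then-predict interface makes it straightforward to add further candidates and compare them on an identical train/test split.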

1.7 Outline

The next chapter is the background chapter; it describes a widely used industrial Ethernet protocol and the role of different devices in a production line, and derives a reasonable priority table for different industrial applications. The chapter then goes through a factory setup and where the network traffic classifier would be useful in the cellular network. Afterwards it explains different data processing techniques as well as data analytics. Finally, different machine learning models are motivated based on machine learning models previously used for similar classification problems.

The implementation chapter explains how the industrial factory data can be converted to useful data types, and presents analytics on the different data-sets. It uses the background knowledge to select relevant features as input for the machine learning models, and empirical evidence from data analytics then confirms the relevance and behaviour of the selected features.

The result chapter compares the performance of the machine learning models and presents the gains of using machine learning for the production line. The discussion is in the final chapter, where the thesis explains possible future work and improvements. That chapter also discusses the general ethical concerns about deploying machine learning and AI.

Chapter 2

Background

2.1 Profinet

2.1.1 Introduction

Historically, factory production lines have been separated from the rest of the enterprise. The only way to get information about production progress was to be present at the production line; the same went for updating machines and handling oversights. The industry aspired to connect the factory activity with the enterprise, using a standard solution through which all automated devices can communicate, allowing someone to oversee the process in the factory without being present. The solution was to introduce industrial Ethernet protocols. The goal of an industrial Ethernet protocol is to enable communication between automation devices, and also with the office landscape, using Ethernet cables. One of the most commonly used industrial Ethernet protocols today is Profinet [11]. Profinet allows different input types with different requirements to communicate using the same protocol. The Profinet protocol provides standard communication over Ethernet for the production line's Input-Output Devices (IO-Devices). Profinet enables connections between the production's IO-Devices and the office landscape, with several different quality of service capabilities. These include, for instance, different latency guarantees based on the needs of the IO-Devices. The protocol's services are Non-Real-Time communication (NRT), Real-Time communication (RT) and Isochronous-Real-Time communication (IRT) [12].

2.1.2 History

The first Profinet protocol arrived in 2000, named Profinet component-based automation (CBA). The goal of this protocol was to use the TCP/IP stack for machine-to-machine communication, and it offers a periodicity of approximately 100 ms [1]. The downside of this protocol version is that it does not offer Real-Time cyclic communication [12], so most production IO-Devices cannot use it.

Cyclic Real-Time data means periodic communication with low jitter, which should always be sent and received within a specific periodic time interval. The factory automation line requires Real-Time cyclic communication, as the IO-Devices perform fast operations where each IO-Device has its own task in the production. If this communication misses one period, or is sent or received too late, the communication is terminated [13] and the production stops. An updated version of Profinet, called Profinet-IO, offers this guaranteed low-latency cyclic communication by adding Real-Time and Isochronous-Real-Time service capabilities. It achieves this by bypassing the UDP/IP layers in the protocol stack, which reduces the latency of the packets. As a result, Real-Time offers a periodic message interval of 5-10 ms [1], while Isochronous-Real-Time offers an even lower periodic interval of 0.25-1 ms, which is useful, for instance, for motion control IO-Devices. Now all the IO-Devices can communicate using a single protocol, and are able to send data to the office landscape.
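The cyclic requirement described above can be made concrete with a small check on arrival times. This is a minimal sketch that follows the text's description (a missed or late period ends the communication); it is not taken from the Profinet specification. The function name `first_missed_cycle`, the 8 ms period and the 0.5 ms tolerance are assumptions chosen for illustration.

```python
def first_missed_cycle(arrivals_ms, period_ms, tolerance_ms):
    """Return the index of the first packet that arrives too late, or None.

    A packet is late when the gap to the previous packet exceeds
    period_ms + tolerance_ms, which also covers a skipped period.
    """
    for i in range(1, len(arrivals_ms)):
        gap = arrivals_ms[i] - arrivals_ms[i - 1]
        if gap > period_ms + tolerance_ms:
            return i
    return None

# Assumed 8 ms cycle, inside the 5-10 ms Real-Time range mentioned in the text.
on_time = [0.0, 8.0, 16.1, 24.0, 32.0]      # small jitter only
one_missed = [0.0, 8.0, 16.0, 32.0, 40.0]   # the packet expected at 24 ms is missing

print(first_missed_cycle(on_time, period_ms=8.0, tolerance_ms=0.5))
print(first_missed_cycle(one_missed, period_ms=8.0, tolerance_ms=0.5))
```

The first trace passes because its jitter stays within the tolerance; the second is flagged at the packet that follows the skipped period, which is the point where, according to the text, the communication would be terminated.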

2.1.3 Profinet protocol

Non-Real-Time Profinet applications use the first version of Profinet-IO, shown as (A) in figure 2.1. It is used for communication that does not require low latency, for instance connection establishment, diagnostics and status reports from the production line to the office landscape. Real-Time production line applications use the second version of Profinet-IO, (B) in figure 2.1, and Isochronous-Real-Time applications use the third version, (C) in figure 2.1.

Based on the application's needs, the suitable Profinet-IO version can be selected for transmitting the packet. As mentioned before, the Real-Time and Isochronous-Real-Time protocol stack is redesigned without the Internet layers, as seen in figure 2.1. The redesign adds the 802.1Q block to the Ethernet frame for (B) and (C) in figure 2.1. The Ethernet frame in figure 2.2 has the EtherType 0x8892, signifying that the packet belongs to a Profinet Real-Time or Isochronous-Real-Time application. The additional 802.1Q tag [6], also shown in figure 2.2, adds the priority to the packet. The priority is on a scale from 0 to 7, where 7 is the highest priority. Table 2.1 shows all the priority levels for 802.1Q services. All Profinet Real-Time and Isochronous-Real-Time applications have a priority of 6, the same as internetwork control services. As a result, all Profinet services using version 2 or version 3 of Profinet-IO have the same priority, even though the services differ in importance.

Priority  Traffic class
0         Background
1         Best effort
2         Excellent effort
3         Critical application
4         Video
5         Voice
6         Internetwork control
7         Control data traffic

Table 2.1: The priority level for different traffic class services in the 802.1Q header. The table is from [6].

(A) Profinet Version 1. (B) Profinet Version 2. (C) Profinet Version 3.
Figure 2.1: Profinet-IO protocol stack structure, drawn with https://draw.io/.
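To make the role of the 802.1Q block concrete, the following minimal Python sketch reads the priority and the EtherType from a tagged Ethernet frame. It assumes the standard 802.1Q layout (tag identifier 0x8100 followed by a 16-bit field whose top three bits hold the priority), which is general knowledge about the standard and not taken from the thesis captures. The helper names `build_tagged_frame` and `parse_vlan_tag` are hypothetical, and the frame is synthetic.

```python
import struct

TPID_8021Q = 0x8100          # tag identifier announcing an 802.1Q tag
ETHERTYPE_PROFINET = 0x8892  # EtherType for Profinet RT/IRT frames


def build_tagged_frame(priority, vlan_id, ethertype, payload=b""):
    """Build a synthetic tagged frame (hypothetical helper, zeroed MAC addresses)."""
    dst = bytes(6)
    src = bytes(6)
    # Tag control field: 3 priority bits, 1 drop-eligible bit, 12 VLAN id bits.
    tci = (priority << 13) | (vlan_id & 0x0FFF)
    return dst + src + struct.pack("!HHH", TPID_8021Q, tci, ethertype) + payload


def parse_vlan_tag(frame):
    """Return (priority, vlan_id, ethertype), or None if the frame is untagged."""
    if len(frame) < 18:
        return None
    tpid, tci, ethertype = struct.unpack("!HHH", frame[12:18])
    if tpid != TPID_8021Q:
        return None
    return tci >> 13, tci & 0x0FFF, ethertype


# A Profinet RT frame carries priority 6, as described in the text.
frame = build_tagged_frame(6, 0, ETHERTYPE_PROFINET, b"\x00" * 4)
priority, vlan_id, ethertype = parse_vlan_tag(frame)
```

Because every RT and IRT frame carries the same priority value, this field alone cannot separate, for example, an alarm from cyclic data, which is the limitation the section points out.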

2.1.4 Profinet communication

There are three types of roles in Profinet communication: IO-Devices, IO-Controllers and IO-Supervisors. The IO-Devices are machines in the automation line, for instance robots, sensors, drivers, actuators and fans. They do not operate by themselves; they need some type of programmable logic controller (PLC) that tells them what to do. A programmable logic controller can for instance tell an IO-Device to change the angle of a robot arm or to perform a particular task on the product. In Profinet the programmable logic controllers are called IO-Controllers. The last role is the IO-Supervisor, whose assignment is to gather status reports and diagnostics from the IO-Devices, for instance how the production is going and whether there is any indication that a machine needs to be repaired or replaced [8].


Figure 2.2: The Ethernet header for Profinet RT and IRT with the additional 802.1Q tag added. The figure is from [1].

The most commonly used services in a Profinet communication:

1. Link Layer Discovery Protocol (LLDP): a device informs the network about its existence and its capabilities, and learns how the network is set up [14].

2. Address Resolution Protocol (ARP): a request is broadcast to check whether a particular IP address is available before it is assigned to the connection. The network then maps this IP address to the connection's unique MAC address [14]. For a Profinet device the MAC address is a Profinet id which is unique for every device and is based on [Vendor_ID, Device_ID] [15].

3. Profinet Discovery and Configuration Protocol (PN-DCP): lets the IO-Supervisor allocate a reference to the connection and give the IO-Controller a specific IP address based on hardware configuration. This stage is done together with ARP [14].

4. Profinet Input Output Context Manager (PNIO-CM): this stage sets up a connection between the IO-Device and the IO-Controller, and notifies the connection establishment of the type of traffic and the latency requirements needed for the communication [14].

5. Profinet Precision Transparent Clock Protocol (PN-PTCP): this application maintains synchronization of the production line [14].


Service   Priority  Type         Concept
LLDP      1         NRT, LLDP    Device existence
ARP       1         NRT, ARP     Address look up
PN-DCP    5         RT, 802.1Q   Address assignment step
PNIO-CM   3         NRT, UDP     Connection establishment
PN-PTCP   5         RT, 802.1Q   Synchronization
PNIO      7         RT, 802.1Q   Cyclic data exchange
PNIO-PS   7         RT, 802.1Q   Cyclic service data unit
PNIO-AL   5, 9      ART, 802.1Q  Acyclic process alarm (low, high)

Table 2.2: Common services used in a Profinet communication and their priority on a scale from 1 (low) to 9 (high). This scale is based on my conclusion motivated in the background, (slide 14 [7]) and (section 2-4 [8]).

6. Profinet Input Output (PNIO): the application in which cyclic Real-Time and Isochronous-Real-Time packets are sent between the IO-Controller and the IO-Device, using Profinet-IO version two or three.

7. Profinet Input Output Provider Status (PNIO-PS): similar to PNIO, but with service status added to the exchange. This is optional for cyclic Real-Time and Isochronous-Real-Time packets between the IO-Controller and the IO-Device [8].

8. Profinet Input Output Alarm (PNIO-AL): sends acyclic Real-Time and Isochronous-Real-Time alarms [14]. An alarm can for instance inform an IO-Controller that an IO-Device is getting warm, so that the IO-Controller sends a data exchange message to a fan IO-Device to increase the fan speed. The alarm can have the value high or low, signifying its importance.

Based on our knowledge about Profinet-IO communication we can assign different importance levels to the application types during the data exchange. This is done in table 2.2.
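As an illustration of how the priorities in table 2.2 could be used once a packet's service is known, the sketch below orders a batch of classified packets so that the most important service is served first. The priority values are copied from table 2.2; the labels `PNIO-AL-LOW` and `PNIO-AL-HIGH` and the function `order_by_priority` are hypothetical names introduced only for this example, not part of Profinet or of the thesis implementation.

```python
# Priorities copied from table 2.2 (1 = low, 9 = high). The two alarm
# labels are hypothetical names for the low and high alarm levels.
SERVICE_PRIORITY = {
    "LLDP": 1,
    "ARP": 1,
    "PN-DCP": 5,
    "PNIO-CM": 3,
    "PN-PTCP": 5,
    "PNIO": 7,
    "PNIO-PS": 7,
    "PNIO-AL-LOW": 5,
    "PNIO-AL-HIGH": 9,
}


def order_by_priority(services):
    """Return the services sorted from highest to lowest priority.

    sorted() is stable, so services with equal priority keep arrival order.
    """
    return sorted(services, key=lambda s: -SERVICE_PRIORITY[s])


arrivals = ["ARP", "PNIO", "PNIO-AL-HIGH", "PNIO-CM", "PNIO-PS"]
served = order_by_priority(arrivals)
```

With these arrivals the high alarm is placed first, followed by the two cyclic services in their arrival order, then connection establishment and finally ARP.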

2.2 Cellular network system

The latest commonly used standard for wireless cellular network systems is LTE (Long Term Evolution) [16]. The function in the LTE architecture that is responsible for the wireless transmission of data to the user devices is the evolved Node B (eNodeB). The eNodeB oversees all radio functions, including scheduling, and provides the User Equipment (UE) or Machine-Type Communication Device (MTCD) [2] with communication to the rest of the system [17].


2.2.1 eNodeB

The eNodeB transports IP user data via the Packet Data Convergence Protocol (PDCP), which encapsulates the data. Encapsulation means that the data is opaque and cannot be processed or opened by other clients. When UEs/MTCDs want to send data they send a scheduling request to the eNodeB [2]; the process can be seen in figure 2.3. The eNodeB schedules the clients based on the three criteria shown in figure 2.3: (1) QoS Requirements, (2) Channel Quality Dynamics and (3) Traffic Dynamics. The first criterion captures the QoS requirement of the data, for instance whether the packet is for uploading, streaming, web surfing, etc. [2]. Different QoS cluster levels exist to map this [2]. However, this turns out to be a poor solution, as packets in the same QoS cluster can differ in, for example, how much delay they tolerate, for instance vehicle sensor data, monitoring and factory Real-Time applications [2]. The second criterion considers how many resources each UE/MTCD needs and how fast it needs to send, and the last one, traffic dynamics, takes the load of the system into account. Based on these three criteria a scheduling request can be granted during a transmission time interval (TTI), where the UE/MTCD is assigned Resource Units (RU) for the current TTI, shown in gray in figure 2.3. The data is then sent on defined frequencies that no other client can use during this time, called a Resource Block (RB) [2], seen in figure 2.3. The goal of the thesis is to investigate whether this scheduler can do internal prioritization for UEs/MTCDs running important factory applications, using a classification model that classifies the different applications. The network classification model creates a new type of QoS requirement system that can prioritize more precisely based on the applications, overcoming the internal prioritization issue with industrial Ethernet protocols.

5G improvements on the LTE network will for instance allow more devices to be connected and make lower latencies possible than before, allowing communication with a latency of 1-10 ms [18]. They will also make it possible for the eNodeB to process the Ethernet type of the UE/MTCDs. This will be introduced in future standards to handle QoS mechanism issues, reduce complexity and reduce extra overhead [19].

Figure 2.3: The scheduling of resource units in the eNodeB for granting UE/MTCDs access to transmit data. The figure is from [2].

2.3 Factory

Cisco and Rockwell Automation have provided a white paper [20] with their vision of how a factory topology can be designed when mixing the office landscape and the production lines, naming it Converged Plantwide Ethernet (CPwE) [3]. It is designed to separate and protect different parts of the network by defining levels, where each level has a particular role in the network.

2.3.1 Converged Plant-wide Ethernet

The goal of Cisco's and Rockwell Automation's CPwE solution is scalability: it should handle both small factory networks (50 devices or fewer) and large ones (10 000+ devices). It achieves scalability and security through specific levels with predefined communication structures between them, allowing it to scale horizontally. The CPwE communication can be seen in figure 2.4. The idea of defining levels is that the enterprise never has direct access to the manufacturing, which reduces the risk of getting malicious software into the production.

2.3.2 Manufacturing Zone

Level 0 to 1

Level 0 is the process level in the CPwE hierarchy, where the underlying production units exist. This is the lowest level in figure 2.4. The functions performed here are for instance welding, painting, 3-D printing and measurements. The devices connected at this level receive their instructions from the controller devices in level 1 [3]. A reader of the Profinet-IO section can quickly conclude that this corresponds to the communication between IO-Devices and IO-Controllers in a Profinet setup.

Level 1 has the basic programmable logic controllers for the manufacturing. They can communicate both with the higher level 2 and with the devices in level 0. As mentioned before, these basic controllers correspond to the IO-Controllers in Profinet-IO.

Level 2 to 3

Depending on the size of the network, level 2 can be merged with level 3. The difference between level 2 and level 3 in CPwE is that level 3 generally has additional services such as file servers and other domain services. The role of the devices in level 2 is to supervise the manufacturing, where the devices in level 0-1 send feedback to the supervisor devices. Here humans can receive this feedback and oversee the production from dedicated control rooms [3]. These devices can be linked to the IO-Supervisor role in Profinet-IO.

Level 3 serves as the final production layer. The task of these devices is to manage file systems, follow production progress and check material assets [3].

Figure 2.4: The framework structure for CPwE. Notice that no direct communication occurs between the Enterprise Zone and the Manufacturing Zone. The figure is from [3].

2.3.3 Demilitarized Zone

The goal of the demilitarized zone (DMZ) is to isolate the enterprise from the manufacturing, which reduces the risk of harmful interference. This is done by only allowing certain types of traffic to pass in either direction. Several firewalls are set up to make sure that only traffic with permission can access the lower levels in the enterprise. A firewall is a system designed to protect its resources by applying rules to its incoming and outgoing traffic [3].


2.3.4 Enterprise Zone

Level 4 to 5

Level 4 is an office landscape level. However, not all employees have access to the information that can be extracted from level 4. What can be extracted at level 4 are summaries from the production line and, with permission, the organization database [3].

Level 5 is the final level in CPwE and is the general office network. External users are connected at this level, and the only way to access lower levels in the CPwE design from level 5 is to get permission through the enterprise's secure applications [3].

2.3.5 Wireless

The CPwE architecture describes a common business setup for a production company, where figure 2.4 shows each level in the architecture. Every line in the figure could theoretically be replaced with a wireless transport medium while keeping the internal architecture, allowing the factory setup to be dynamic and secure using for instance a cellular network. The application classification model would let the eNodeB in the cellular network understand the applications passing through it and make sure that important applications are served first.

2.4 Analytics

Part of building a good classification model is understanding the data. Analytics is essential when one for instance wants to understand the complexity of the data or to confirm a hypothesis with quantitative data analysis. Based on this knowledge one can for example get an idea of how complex the methods for classifying the data need to be, and confirm one's feature selection. Before data analytics can be done, the data needs to be processed. Processing removes irrelevant parts of a data capture, restores damaged parts and converts the data into objects that are useful for data analytics and machine learning.

One common technique for understanding the relationship between different characteristics is to normalize the data. The goal of normalization is to show more clearly how large or small a value is compared with the rest of the capture. The same kind of analysis can be done with standardization, where one is instead interested in the spread. Both techniques come from statistics, and the formulas are the following:

$$\text{Normalization:} \quad X_{\text{normalized}} = \frac{X - x_{\min}}{x_{\max} - x_{\min}} \tag{2.1}$$

$$\text{Standardization:} \quad X_{\text{standardized}} = \frac{X - E[X]}{\sigma} \tag{2.2}$$
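Equations 2.1 and 2.2 can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the packet sizes are made-up values, the population standard deviation is assumed for σ, and the handling of a constant column (returning zeros) is an added convention to avoid dividing by zero.

```python
from statistics import mean, pstdev


def normalize(values):
    """Equation 2.1: rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant column carries no information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Equation 2.2: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical packet sizes in bytes.
packet_sizes = [60, 780, 1500]
normalized_sizes = normalize(packet_sizes)
standardized_sizes = standardize(packet_sizes)
```

Normalization maps the smallest packet to 0 and the largest to 1, while standardization centres the values around 0 so that the result expresses how many standard deviations each packet lies from the mean.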


2.4.1 Traffic capture and DataFrame

The communication between a source address and a destination address can be recorded with a packet capture program. Python [21] is the selected programming language for the thesis due to its simplicity and the large number of machine learning tools available for it. Wireshark [22] is the selected capture program, as its captures can be read from Python using a module called pyshark [23].

In Wireshark one can fetch a summary of each packet's information, for instance the packet size, the Ethernet type and the recorded index of the packet.

These captures need to be extracted, for instance with pyshark, and stored in a form from which Python can extract the information. One structure that can handle this data is a DataFrame [24]. A DataFrame is a two-dimensional data structure whose columns can have different data types, which makes it useful for analytics and machine learning. A DataFrame is comparable to a structured query language (SQL) table. It is stored in Random Access Memory to allow fast search and operations, with the option of storing the data on disk if it exceeds the available Random Access Memory [24].

2.4.2 Evaluate a classification model

To evaluate the performance of each classification model generated in the research, some type of score is necessary. We introduce four scores that are typically used for evaluating classification performance: accuracy, precision, recall and F1 score.

Accuracy measures the number of correctly classified packets divided by the total number of packets. For example, if five packets are classified as the correct application and ten packets are tested in total, the accuracy is 50%. The drawback of using accuracy as the performance measure is that it does not express how precise the classification was for each label. For instance, if the majority of the data belongs to one application type and the model classifies all packets as that application type, the accuracy will be high even though the model is not precise.

Precision measures, for each application, how many of the packets that were classified as that application are correct. For example, if ten of a total of one hundred ARP packets are classified as ARP and no packets from other applications are classified as ARP, the precision is 100% for the ARP label. This ensures that the packets classified as ARP really are ARP packets. The drawback is that it ignores that only 10% of the ARP packets were classified correctly.

Recall handles this issue by instead checking, for every type of application, how many of its packets were classified correctly. For instance, the ARP example would give a recall of 10% for the data that belongs to the ARP application.

The final evaluation score included in this study is the F1 score. The F1 score combines recall and precision into a single value by taking their harmonic mean, using the following equation:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.3}$$

Instead of a numeric score of the model's classification performance, one can get a strong understanding of how the model classifies by studying how it classified each data point. This is also very useful from a priority perspective: if two applications are confused with each other but both are mapped to the same priority level, this is not a big issue, but if the priority difference is large it is a problem.

One way of doing this type of evaluation is to study a confusion matrix. A confusion matrix shows the true application row-wise and the model's classification column-wise. Table 2.3 illustrates this for four classes A-D. Each row represents one class and shows how the data of that class was classified. The first row in table 2.3 is for the class label "Application A", which contains 15 packets in total, of which 10 were classified as "Application A". One of the 15 was misclassified as "Application B", one as "Application C" and three as "Application D". The second row holds the results for "Application B", the third row for "Application C" and the last row for "Application D".

Class \ Classified  Application A  Application B  Application C  Application D
Application A       10             1              1              3
Application B       0              10             2              3
Application C       1              1              10             3
Application D       0              0              0              15

Table 2.3: The table illustrates the concept of a confusion matrix; here we have four applications, each containing 15 packets in total. The first element in each row of the matrix is the true application for the packet. Each column after the first presents the classification.
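The four scores can be read directly off a confusion matrix. The sketch below uses the numbers from table 2.3 (rows are the true application, columns the classification) and is only meant to illustrate the definitions; the function names are chosen for this example, and returning 0 for an undefined ratio is an assumed convention.

```python
LABELS = ["A", "B", "C", "D"]

# Values from table 2.3: CONFUSION[i][j] is the number of packets of
# true application i that were classified as application j.
CONFUSION = [
    [10, 1, 1, 3],
    [0, 10, 2, 3],
    [1, 1, 10, 3],
    [0, 0, 0, 15],
]


def accuracy(matrix):
    """Correctly classified packets divided by all packets."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision(matrix, k):
    """Share of the packets classified as class k that truly belong to k."""
    classified_as_k = sum(row[k] for row in matrix)
    return matrix[k][k] / classified_as_k if classified_as_k else 0.0


def recall(matrix, k):
    """Share of the packets truly belonging to class k that were found."""
    truly_k = sum(matrix[k])
    return matrix[k][k] / truly_k if truly_k else 0.0


def f1_score(matrix, k):
    """Equation 2.3: harmonic mean of precision and recall for class k."""
    p, r = precision(matrix, k), recall(matrix, k)
    return 2 * p * r / (p + r) if (p + r) else 0.0


d = LABELS.index("D")
scores_d = (precision(CONFUSION, d), recall(CONFUSION, d), f1_score(CONFUSION, d))
```

For "Application D" the recall is 100% because all 15 of its packets are found, but the precision is only 15/24 because nine packets from the other applications are also classified as D. The overall accuracy is 45/60.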


2.5 Classification using machine learning

2.5.1 Introduction to machine learning

The traditional structure of a program is to build a model that creates the outcome one is looking for. Machine learning inverts that concept by defining the outcome and then learning the model that maps the input to the outcome. This mapping is achieved by using a large amount of data, on which the machine learning algorithm tries different hypotheses to match the desired outcome.

There are three learning types in machine learning: supervised learning, unsupervised learning and reinforcement learning. In supervised learning the model knows what is correct. For example, one has a set of pictures with and without cars, where every picture is labeled with whether it contains a car. The model then learns the mapping from the pixels to the output car/not_car. In unsupervised learning no labels are provided for the data-set, so the model cannot get feedback on what is correct and needs to figure out the distribution by itself. The final type is reinforcement learning, where the model gets a reward for the action it has taken and its goal is to maximize the total reward. Reinforcement learning is commonly used in games, where the model creates a bot that can solve the game. The report uses the supervised learning approach, where each packet's service type is the output the model classifies.

2.5.2 Data

When a machine learning model learns the mapping between the input and the desired output, one wants to be sure that the mapping is a general solution that will work well when new data is classified. One can test this by splitting the data-set into two or three parts. The largest part is the training data, on which the model trains the mapping. One can then test the performance of the model on the second part, called testing data, which shows how the machine learning model performs on unseen data. The standard ratio between training and testing data is around [80%, 20%], to make sure there is a sufficient amount of data for testing.

The second option is to split the data-set into three parts by introducing a validation set, so that the split becomes [60%, 20%, 20%] of the total data. The motivation for a validation set is to allow the model to change after training without introducing bias. One validates the model performance using the validation set and then changes the setup of the model to perform better on it. After the model setup has been changed, the new model is trained and then tested once again on the validation data. This continues until one is satisfied with the classification performance. The setup of the model is now optimized to perform best on the validation data: even though the model has never been trained on this data, choices such as the complexity of the model have been based on the validation performance. To test the actual performance of the model, one therefore uses the testing data, which the model has never been introduced to before. This result imitates how the model would perform in the future.

A limitation of validation sets is that they require a large amount of data, as 20% less of the data is available for training. If that is believed to leave too little data, one faces a dilemma between bias and too little training data. One solution is to partition the data randomly into training and testing data while keeping the [80%, 20%] ratio. This allows the model to be trained and tested several times with low bias. Each time the model setup is changed, the new setup takes another random partition of the data as training and testing data, so the model is not optimized for any particular testing data. This reduces the bias, but requires a considerable number of iterations to improve the model.
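The two splitting strategies described above can be sketched with the Python standard library. The function name `split_dataset`, the fixed seed and the integer stand-ins for packets are assumptions made for this illustration; the ratios are the ones given in the text.

```python
import random


def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the samples and cut them into training, validation and test parts.

    ratios gives the share of training and validation data; whatever is
    left over becomes the test part.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(ratios[0] * len(shuffled))
    n_val = round(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


packets = list(range(100))  # stand-ins for 100 captured packets

# Three-way split [60%, 20%, 20%].
train, validation, test = split_dataset(packets)

# Two-way split [80%, 20%]; a new seed gives a new random partition.
train2, _, test2 = split_dataset(packets, ratios=(0.8, 0.0, 0.2), seed=1)
```

Calling the function with a different seed after every change of the model setup corresponds to the repeated random partitioning described in the last paragraph.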

Naïve Bayes

Naïve Bayes classification is a supervised learning technique. It is founded on Bayes' theorem and creates a probabilistic classifier for predicting the outcome. The name "naïve" comes from its assumption that the input features are always independent, which is a naive approach. Naïve Bayes first needs a dedicated training set, which is used to build the probability space for the different classification labels based on the input features. For example, say one has two Boolean input features A and B, and that the outcome can be one of two labels, C and D. With the training set, Naïve Bayes can find several useful probabilities:

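$$
\begin{aligned}
P(C) &= \tfrac{1}{3} \\
P(A = \text{True}, B = \text{True} \mid C) &= \tfrac{1}{9} \\
P(A = \text{False}, B = \text{True} \mid C) &= \tfrac{1}{2} \\
P(A = \text{True}, B = \text{False} \mid C) &= 1 \\
P(A = \text{False}, B = \text{False} \mid C) &= 0
\end{aligned}
$$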

When new data is classified, the model uses this probability space to predict its label; the label with the highest probability is the predicted output. Say a data point with features A = True and B = True should be classified. Bayes' theorem gives:

$$P(C \mid A = \text{True}, B = \text{True}) = \frac{P(C)\,P(A = \text{True}, B = \text{True} \mid C)}{P(A = \text{True}, B = \text{True})}$$

$$P(D \mid A = \text{True}, B = \text{True}) = \frac{P(D)\,P(A = \text{True}, B = \text{True} \mid D)}{P(A = \text{True}, B = \text{True})}$$

$$\text{Classification} = \max\big(P(C \mid A = \text{True}, B = \text{True}),\; P(D \mid A = \text{True}, B = \text{True})\big)$$

The denominator is the same for both labels, so it can be dropped when the two are compared:

$$\text{Classification} = \max\big(P(C)\,P(A = \text{True}, B = \text{True} \mid C),\; P(D)\,P(A = \text{True}, B = \text{True} \mid D)\big)$$

$$\text{Classification} = \max\left(\tfrac{1}{27},\; \left(1 - \tfrac{1}{3}\right)\left(1 - \tfrac{1}{9}\right)\right)$$

$$\text{Classification} = D$$

The classification label becomes D. This is how Naïve Bayes handles classification.
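The calculation above can be reproduced with exact fractions. The sketch takes the numbers of the worked example as given, including the example's choice of 1 - 1/3 for P(D) and 1 - 1/9 for P(A = True, B = True | D); it only illustrates the decision rule and is not a trained classifier. The names `PRIOR`, `LIKELIHOOD_TT`, `unnormalized_scores` and `classify` are hypothetical.

```python
from fractions import Fraction

# Numbers taken from the worked example in the text.
PRIOR = {"C": Fraction(1, 3), "D": 1 - Fraction(1, 3)}
# P(A = True, B = True | label), as used in the example.
LIKELIHOOD_TT = {"C": Fraction(1, 9), "D": 1 - Fraction(1, 9)}


def unnormalized_scores(prior, likelihood):
    """P(label) * P(features | label); the shared denominator is dropped."""
    return {label: prior[label] * likelihood[label] for label in prior}


def classify(prior, likelihood):
    """Pick the label with the largest score."""
    scores = unnormalized_scores(prior, likelihood)
    return max(scores, key=scores.get)


scores = unnormalized_scores(PRIOR, LIKELIHOOD_TT)
evidence = sum(scores.values())  # plays the role of P(A = True, B = True)
posterior = {label: score / evidence for label, score in scores.items()}
label = classify(PRIOR, LIKELIHOOD_TT)
```

The scores are 1/27 for C and 16/27 for D, so the data point is classified as D. Dividing by their sum turns them into posterior probabilities without changing which label wins.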

Decision Tree

The goal of a Decision Tree is to decide, based on the input, where the data should travel in the tree. The tree is a structure with several directed paths, and the path is selected based on the input data values. Following these decisions, the data travels through the tree and ends at a leaf. The leaf represents a classification, which is the output for the data. This is best explained with an example. Let us continue with the previous example from the Naïve Bayes section and assume that a tree has already been generated, meaning the algorithm has already been trained. The trained Decision Tree can be seen in figure 2.5. As in the example before, we have a new data point to classify, with the feature values A = True and B = False. Following figure 2.5, we may only take a path whose feature condition matches our feature value. The first path the data selects is A = True, which in figure 2.5 is the left path from the root. It then selects the path where B = False, ending at a leaf that classifies the data as label D.

Figure 2.5: The figure illustrates the concept of a Decision Tree as a classification model. In this case the outputs are C and D, and based on the Boolean features A and B the model follows the matching path and determines the output. The figure is made using https://www.draw.io/.

The Decision Tree classification model is built by an algorithm called C4.5. In this report, C4.5 is used as a synonym for Decision Tree. The tree is built top-down, starting from the root. Each feature in the training set has a discretized value, meaning continuous features are discretized [25]. The C4.5 algorithm creates paths from the root based on the feature values. A path is finished if no features are left or if all data in that path belongs to the same class label [25]. How a path is split into multiple paths is based on a heuristic. The heuristic used in C4.5 tries to maximize the information gain per split. This is done using two entropies, the expected information E(Data) and the information E_feature(Data). The idea is to split a path on the feature that gives the most information about each class, so that the tree needs as few splits as possible. This starts from the expected information, meaning the information of the different labels in the training set Data [25]. For the example above this becomes equation 2.4:

$$E(Data) = -\frac{count(class\,C)}{count(Data)} \log_2\!\left(\frac{count(class\,C)}{count(Data)}\right) - \frac{count(class\,D)}{count(Data)} \log_2\!\left(\frac{count(class\,D)}{count(Data)}\right) \tag{2.4}$$

The information for the two features becomes:

$$E_A(Data) = \frac{count(A = \text{True})}{count(Data)}\, E(Data \mid A = \text{True}) + \frac{count(A = \text{False})}{count(Data)}\, E(Data \mid A = \text{False})$$

$$E_B(Data) = \frac{count(B = \text{True})}{count(Data)}\, E(Data \mid B = \text{True}) + \frac{count(B = \text{False})}{count(Data)}\, E(Data \mid B = \text{False})$$

C4.5 splits on the feature with the largest difference between the expected information and the feature's information [25]:

$$\max\big(E(Data) - E_A(Data),\; E(Data) - E_B(Data)\big) \tag{2.5}$$

This continues until the termination requirements are reached. The testing data can then be run through the generated Decision Tree to test its performance.
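The split selection in equations 2.4 and 2.5 can be sketched as follows. The four training rows are invented for this illustration, and the code implements only the information-gain comparison described above for a single split; it is not a complete C4.5 implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Equation 2.4 for any number of classes: expected information."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def feature_information(rows, feature):
    """E_feature(Data): entropy after splitting, weighted by subset size."""
    total = len(rows)
    info = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row["label"] for row in rows if row[feature] == value]
        info += len(subset) / total * entropy(subset)
    return info


def information_gain(rows, feature):
    """E(Data) - E_feature(Data), the quantity compared in equation 2.5."""
    return entropy([row["label"] for row in rows]) - feature_information(rows, feature)


def best_split(rows, features):
    """Return the feature with the largest information gain."""
    return max(features, key=lambda f: information_gain(rows, f))


# Hypothetical training set: feature A separates the labels, B does not.
ROWS = [
    {"A": True, "B": True, "label": "C"},
    {"A": True, "B": False, "label": "C"},
    {"A": False, "B": True, "label": "D"},
    {"A": False, "B": False, "label": "D"},
]
chosen = best_split(ROWS, ["A", "B"])
```

In this made-up set, splitting on A leaves pure subsets (a gain of one bit), while splitting on B leaves the labels as mixed as before (a gain of zero), so the root would split on A.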

Support vector machine

A support vector machine (SVM) uses a separating hyperplane for classification. It maps the input into a higher dimension, where the separation becomes simpler [4]; this concept is illustrated in figure 2.6. The separation is made by finding the maximum margin between the classes, based on the selected kernel function. The kernel function K that transforms the data into higher dimensions can for example be linear, polynomial or radial [4]. For two vectors x and y these are:

$$\text{Linear:} \quad K(x, y) = x^T y + 1 \tag{2.6}$$

$$\text{Polynomial:} \quad K(x, y) = (x^T y + 1)^p \tag{2.7}$$

$$\text{Radial:} \quad K(x, y) = e^{-\frac{1}{2p^2}|x - y|^2} \tag{2.8}$$

Figure 2.6: The figure illustrates the classification concept of SVM. If a new data point is on the left side of the line it is classified with the same label as the other points on the left side, and vice versa. The figure is from [4].
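The three kernels in equations 2.6 to 2.8 translate directly into code. The sketch below uses plain Python lists as vectors; the function names are chosen for this illustration and p is the kernel parameter from the equations.

```python
from math import exp


def dot(x, y):
    """Inner product x^T y of two equally long vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """Equation 2.6."""
    return dot(x, y) + 1


def polynomial_kernel(x, y, p):
    """Equation 2.7, with polynomial degree p."""
    return (dot(x, y) + 1) ** p


def radial_kernel(x, y, p):
    """Equation 2.8, where p controls the width of the kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-squared_distance / (2 * p ** 2))


x = [1, 2]
y = [3, 4]
k_linear = linear_kernel(x, y)             # 1*3 + 2*4 + 1
k_polynomial = polynomial_kernel(x, y, 2)  # (1*3 + 2*4 + 1) ** 2
k_radial_same = radial_kernel(x, x, 1.0)   # identical points
```

The radial kernel equals 1 for identical points and decays towards 0 as the points move apart, which is why it can be read as a similarity measure.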

After a suitable kernel has been selected for the data-set, one wants the separation line between the classes to be placed so that its distance to both classes is maximized, making sure the line is not biased towards a particular class. This is seen in figure 2.6. It is done by finding the data points of the two classes that lie closest to each other, called support vectors, and placing the separation line between them. The support vectors are the points in figure 2.6 that lie on the dotted lines. To make sure the separation line is the optimal one, α is introduced to guide the direction of the separation line, making the margin between the line and the data points x of the two classes as large as possible. t is the output for the data point x. A support vector machine can only have a binary output, where one class label is represented as -1 and the other as 1. A data-set with several classes uses a Support Vector Classifier instead, which is discussed later. Maximizing the expression below makes the margin between the classes and the separation line as large as possible:

$$\text{Maximize:} \quad \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j t_i t_j K(x_i, x_j) \tag{2.9}$$

$$\text{Constraint:} \quad 0 \le \alpha_i \quad \forall i \tag{2.10}$$

When the maximizing α values are found, the support vectors x^{sup} are the data points whose α is nonzero:

$$\alpha_i \neq 0 \Rightarrow x_i \equiv x^{sup}_i \tag{2.11}$$

When the support vector machine has been generated from the training data, testing data can be used to study the classification performance. This is done with equation 2.12:

$$\text{Classification:} \quad \sum_i \alpha_i t_i K(x_i, x^{sup}_i) \tag{2.12}$$

If the result of equation 2.12 is negative, the data point is classified as the negative class, -1. If the result is positive, it is classified as the class with the number 1. For instance, label C can be -1 and D is then 1.

To handle more than two classes with SVM, one can use a Support Vector Classifier (SVC). SVC uses SVM with a one-vs-rest approach, meaning that several SVMs are used. Say one has three classes to classify: C, D and E. First, one creates an SVM with C in one group, with t equal to negative one, and D and E together in the other group, with t equal to positive one. If a test data point now gets a negative value, it is classified as C. If it gets a positive value, it can be either D or E. The solution is to create one additional SVM where for instance D is in one group and C and E are in the other. If the data point gets a negative value from this new SVM, it is classified as label D, and if it gets a positive value it is classified as label E.
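The one-vs-rest procedure for the three classes C, D and E can be sketched as a short cascade. The two decision functions below are hand-written stand-ins for trained SVMs on a one-dimensional input, chosen only so that the example runs; they are assumptions and not models from the thesis. Treating a value of exactly zero as positive is also an assumption, since the text only covers negative and positive values.

```python
def classify_three_classes(x, svm_c_vs_rest, svm_d_vs_rest):
    """Apply two binary decision functions in sequence, as described above."""
    if svm_c_vs_rest(x) < 0:   # negative side of the first SVM: class C
        return "C"
    if svm_d_vs_rest(x) < 0:   # negative side of the second SVM: class D
        return "D"
    return "E"                 # positive for both SVMs: class E


# Hypothetical stand-ins for trained SVM decision values on a 1-D input:
# C lies below 1, D between 1 and 3, and E above 3.
def svm_c(x):
    return x - 1.0


def svm_d(x):
    return abs(x - 2.0) - 1.0


predictions = [classify_three_classes(x, svm_c, svm_d) for x in (0.0, 2.0, 5.0)]
```

Each additional class requires one more binary SVM in the cascade, so three classes need two SVMs, as in the text.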

2.5.3 Deep learningInstead of doing one complex mapping function as the methods describe above, adeep learning neural network includes several easier approximations that combinedresult in a complex mapping function f(x) ≈ fn(..(f2(f1(x))..) between input x andoutput y, y = f(x). This allows it to approximate difficult functions, but still havea simple structure. These easier approximation functions fn(x) are called neurallayers. Each neural layer is containing several artificial neurons. The idea is inspiredfrom neural science, and that is why the deep learning network is called deep neuralnetworks [26] (DNN). Why it is called ”deep” is because the layers after the inputlayer, are connected to previous layer creating a ”deep” mapping function. How adeep learning neural network works, is that the network is feed the feature inputdata x, into the first layer. The first four dots on the left side on the network in figure2.7 (b), would correspond to the artificial neurons in the first layer. Each artificialneuron has a weight assigned to it. This weight value changes during the training


(a) Concept of an artificial neuron. Notice that the activation function on the right represents what the sum becomes: for example, if the sum is negative the value will be zero, but if it is positive the value will be one. Figure is from [4].

(b) The concept of a deep learning neural network. Notice that the two middle layers only depend on the previous layer; these are called hidden layers. Figure is from [4].

Figure 2.7: The two core ideas of a deep neural network are the artificial neuron and the neural network. These ideas can be seen in (a), an artificial neuron, and (b), a deep neural network.

data phase, to approximate the mapping function better. This will be explained in section 2.5.4. When the input data is fed to the neurons of the first layer, some of the neurons will be activated, based on their weight values and the input values. Activated means that those neurons will influence the neurons of the next layer, while neurons that are not activated will have no influence for this particular data input. Generally, every artificial neuron in the previous layer is connected to the neurons in the next layer. This can be seen in figure 2.7 (a), where the neuron receives signals from the connected neurons of the previous layer. Whether a neuron is activated is decided by an activation function, for instance a unit step as shown in figure 2.7 (a). The activation function should not be linear, as it is what introduces non-linearity to the model, allowing it to approximate non-linear functions. As we have several layers, this non-linearity makes it possible to approximate complex functions [26], which other machine learning models might not be able to do as efficiently. When the final layer in the network receives the signals from the previous neurons, the neuron with the highest signal in the last layer maps the data to an output classification label. For instance, in figure 2.7 (b) we have four neurons in the last layer, which is the right-side layer in the model. Here each neuron corresponds to a classification label, for instance C, D, E and F. The neuron with the highest signal decides the classification label for this particular input feature data x. The deep learning model described above is called a feed-forward network, also known as a multilayer perceptron (MLP).
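The forward pass of such a feed-forward network can be sketched as follows. This is a minimal illustration in plain Python and not the implementation used in this thesis: the layer sizes, weights and biases are hypothetical, the hidden layer uses the unit step from figure 2.7 (a), and the output layer passes its weighted sums through unchanged so that the highest signal can be picked.

```python
def unit_step(z):
    """Unit step activation: the neuron is either activated (1) or not (0)."""
    return 1.0 if z > 0 else 0.0


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every input feeds every neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Send the feature vector x through all layers in order."""
    signal = x
    for weights, biases, activation in layers:
        signal = dense(signal, weights, biases, activation)
    return signal


# Hypothetical weights: one hidden layer with two neurons,
# and an output layer with one neuron per classification label.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], unit_step),
    ([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [0.0, 0.0, -0.5], identity),
]
labels = ["C", "D", "E"]

scores = forward([3.0, 1.0], layers)
predicted = labels[scores.index(max(scores))]  # neuron with the highest signal
print(scores, predicted)
```

For the input [3.0, 1.0] only the first hidden neuron is activated, so the output neuron for label C receives the highest signal.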

2.5.4 Optimizer function

When one trains a deep learning model, one sends a batch of m data inputs x and outputs y through the model f. The model performance is then evaluated using a loss function L. This loss function is the classification error during this batch. The loss value is sent back through the network to change the weights θ of the artificial neurons. The name of this operation is backward propagation. This is done by


taking the partial derivative of the loss function with respect to all weights, which becomes the gradient g [26]:

$$g \leftarrow \frac{1}{m} \nabla_{\theta} \sum_{i=0}^{m} L(f(x_i \mid \theta), y_i)$$

This gradient g is later used to improve the weights of the model, by minimizing the loss function using an optimizer function [26].

Stochastic gradient descent

Imagine a function space where all possible weight values θ for the neurons are on one axis, the neurons are on one axis, and the loss function value is on one axis. Stochastic gradient descent minimizes the loss function by moving the weights in the direction of the negative gradient in this function space, with a step of learning rate distance ϵk. The model weights are improved for K epochs, where an epoch means one batch iteration. To make sure the step will not overshoot and travel past the minimum, the learning rate declines with a decay α over the epoch iterations. Stochastic gradient descent for improving the deep learning model can be expressed as algorithm 1 [26]:

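Algorithm 1: Stochastic gradient descent (SGD)
Data: x, y
Result: Updates the model weights θ
for k in K do
    $g \leftarrow \frac{1}{m} \nabla_{\theta} \sum_{i=k}^{m+k} L(f(x_i \mid \theta), y_i)$
    $\theta \leftarrow \theta - \epsilon_k g$
    $\epsilon_k \leftarrow \epsilon_k - \alpha$
end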
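To make the update rule concrete, algorithm 1 can be sketched for a model with a single weight. This is a minimal sketch under stated assumptions: the model f(x|θ) = θx, the squared-error loss, the data and all hyperparameter values are hypothetical and chosen only so that the example is small; the batch for epoch k is the window of m samples starting at index k, wrapping around the data.

```python
def sgd(xs, ys, theta=0.0, lr=0.02, decay=0.0002, m=2, epochs=60):
    """Fit y = theta * x with the update rule of algorithm 1."""
    n = len(xs)
    for k in range(epochs):
        # Batch of m samples starting at index k (wrapping around the data).
        batch = [(xs[(k + i) % n], ys[(k + i) % n]) for i in range(m)]
        # Gradient of the mean squared error (theta * x - y)^2 w.r.t. theta.
        g = sum(2.0 * (theta * x - y) * x for x, y in batch) / m
        theta = theta - lr * g   # step against the gradient
        lr = lr - decay          # learning rate declines with the decay
    return theta


# Hypothetical data generated from y = 2x, so the weight should approach 2.
theta = sgd([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(theta)
```

With these values the learning rate stays positive for all 60 epochs. A larger decay or more epochs would eventually make it negative, which is why practical implementations usually bound the learning rate from below.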

Momentum

A problem with stochastic gradient descent is that it can get stuck in a local minimum. For instance, if the loss function has large spikes, stochastic gradient descent might just move between two spikes and never travel past them to find the true global minimum. A model optimizer function that handles this issue is momentum.

Momentum is inspired by momentum in classical physics [26]. It decreases the oscillation by storing the previous gradient direction. If the newly updated gradient direction for minimizing the loss function is the same as the previous one, the learning rate "speed" v will accumulate. The "speed" will instead decrease if they are in opposite directions. Using momentum for updating the weights for each


batch would now be [26]:

$$v \leftarrow v - \frac{1}{m} \nabla_{\theta} \sum_{i=k}^{m+k} L(f(x_i \mid \theta), y_i)$$

$$\theta \leftarrow \theta + v$$
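The momentum update can be sketched on a one-dimensional loss. Note the assumptions: the sketch adds a momentum coefficient `alpha` and a learning rate `lr`, as in the formulation commonly found in the literature. Neither appears in the equations above, and setting both to 1 recovers exactly the two updates as written there. The loss L(θ) = θ² and all numeric values are hypothetical.

```python
def momentum_descent(grad, theta, lr=0.1, alpha=0.9, steps=300):
    """Gradient descent where the 'speed' v accumulates past gradients."""
    v = 0.0
    for _ in range(steps):
        # The speed grows when consecutive gradients point the same way
        # and shrinks when they point in opposite directions.
        v = alpha * v - lr * grad(theta)
        theta = theta + v
    return theta


# Hypothetical loss L(theta) = theta^2 with gradient 2 * theta; minimum at 0.
theta = momentum_descent(lambda t: 2.0 * t, theta=5.0)
print(theta)
```

The coefficient `alpha` below 1 acts as friction: without it the accumulated speed on this loss would never die out and the weight would keep oscillating around the minimum.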

Adaptive moment estimation

A concern with momentum is that the accumulated speed can be too high; at that point, it does not slow down enough at the global minimum and moves away from it. A solution to tackle this is to have an adaptive learning rate that can change during training in such a way that it sometimes increases and sometimes decreases, enabling it to slow down rapidly if needed. This handles both the issues with stochastic gradient descent and momentum. The idea is to now have two accumulated moments, called the first moment s and the second moment r [26]. They have decay rates of ρ1 and ρ2 respectively, and the method additionally uses a learning rate ϵ and a stabilisation constant δ. Based on this, Adaptive moment estimation (ADAM) is defined as Algorithm 2 [26]. This is in general the best optimization function

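Algorithm 2: Adaptive moment estimation (ADAM)
Data: x, y
Result: Updates the model weights θ
s = 0, r = 0
for k in K do
    $g \leftarrow \frac{1}{m} \nabla_{\theta} \sum_{i=k}^{m+k} L(f(x_i \mid \theta), y_i)$
    $s \leftarrow \frac{\rho_1 s + (1 - \rho_1) g}{1 - \rho_1^{k}}$
    $r \leftarrow \frac{\rho_2 r + (1 - \rho_2) g \odot g}{1 - \rho_2^{k}}$
    $\theta \leftarrow \theta - \epsilon \frac{s}{\sqrt{r} + \delta}$
end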

for deep learning today [26].
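A scalar sketch of ADAM is given below. It is an illustration under assumptions rather than the exact procedure of this thesis: the bias correction (the division by 1 − ρ^k) is applied to copies `s_hat` and `r_hat` while the raw moments are kept between iterations, which is how the method is usually stated, and the counter k starts at 1 so that the denominator is never zero. The loss L(θ) = θ² and the hyperparameter values are hypothetical.

```python
import math


def adam(grad, theta, lr=0.1, rho1=0.9, rho2=0.999, delta=1e-8, steps=500):
    """Adaptive moment estimation for a single weight."""
    s, r = 0.0, 0.0
    for k in range(1, steps + 1):        # k starts at 1: 1 - rho**k is never zero
        g = grad(theta)
        s = rho1 * s + (1.0 - rho1) * g          # first moment (mean of gradients)
        r = rho2 * r + (1.0 - rho2) * g * g      # second moment (mean of squared gradients)
        s_hat = s / (1.0 - rho1 ** k)            # bias-corrected first moment
        r_hat = r / (1.0 - rho2 ** k)            # bias-corrected second moment
        theta = theta - lr * s_hat / (math.sqrt(r_hat) + delta)
    return theta


# Hypothetical loss L(theta) = theta^2 with gradient 2 * theta; minimum at 0.
theta = adam(lambda t: 2.0 * t, theta=5.0)
print(theta)
```

Dividing by the square root of the second moment is what makes the step size adaptive: while the gradients are large the step stays close to the learning rate, and it shrinks as the gradients get small near the minimum.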

2.5.5 Common types of activation functions

As mentioned before, the goal of the activation function is to introduce non-linearity to the deep learning model. The activation function decides if the output signal from the artificial neuron will be zero or not, for instance the unit step in figure 2.7 (a). There are several commonly used activation functions in deep learning, and the most common types can be found in figure 2.8. Sigmoid, seen in figure 2.8, is one of the first activation functions that was introduced. Sigmoid is inspired by how a neuron in the brain is designed. However, it turned out that this design causes problems for artificial neurons. In a Sigmoid activation function, the maximum


Figure 2.8: This figure illustrates the typical activation functions used in deep learning. The figure is from [5].

activation value one can receive is a value of one. This means that if one has many Sigmoids after one another, the signal is squashed more and more. For instance, say the first signal is five. That gives an activation value of around 0.99. This goes to the neurons in the next layer, where a signal value of 0.99 corresponds to a new activation value of ≈0.73. As this continues, large differences in the input signal are compressed into a narrow range. Because the slope of the Sigmoid is at most 0.25, the gradient also shrinks for every layer it is propagated back through until it is close to vanishing, making it hard to improve the model. This is called the vanishing gradient problem. The second issue with the Sigmoid activation function is that it is not zero-centred. This is an issue because it allows a signal of zero to correspond to a positive activation value (0.5 in figure 2.8), allowing activations that should not take place to influence the classification, making the model less accurate. The solution for this is to use the ReLU function. The ReLU function is designed to be zero for all negative signals, and linear for positive signals. As a result, it handles both of the issues Sigmoid has. However, in this case it is very possible that many neurons will never be activated, as they might always receive negative signals. One can tackle this by introducing another linear function between −∞ and 0 that compensates for this possible issue. This function is called Leaky ReLU.

In the beginning of the deep learning section, we discussed that the neurons in the final layer correspond to classification labels. The neuron with the highest output value for an input then decides the classification label. A way to normalize the classification values of the neurons in the last layer is to use a function called Softmax. Softmax takes the input times the weight for each neuron and normalizes the result relative to the values of the other artificial neurons in the last layer. Softmax shows how strongly the model believes in the classification, as it gives a result in the range [0, 1]. For instance, let us continue with the previous example, where we now got 0.3 for label C and 0.7 for label D. The model classifies the input as label D, which was correct, but the loss function can still use this result to improve the model, as the model was not 100% sure it was D. Softmax is typically used in the last layer of classification models in deep learning.
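The activation functions discussed in this section are short enough to write out directly. The sketch below uses only the Python standard library. The slope of 0.01 for Leaky ReLU is a commonly used value and an assumption here, and the two input signals passed to Softmax are hypothetical values chosen so that the result reproduces the 0.3 and 0.7 of the example with labels C and D.

```python
import math


def sigmoid(z):
    """Squashes any signal into the range (0, 1); a zero signal gives 0.5."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Zero for negative signals, linear for positive signals."""
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    """Like ReLU, but with a small linear slope for negative signals."""
    return z if z > 0 else slope * z


def softmax(signals):
    """Normalizes the last-layer signals so that they sum to one."""
    largest = max(signals)
    exps = [math.exp(z - largest) for z in signals]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical last-layer signals for the labels C and D.
probabilities = softmax([math.log(0.3), math.log(0.7)])
print(probabilities)  # approximately [0.3, 0.7]
```

Subtracting the largest signal before exponentiating does not change the result of Softmax, but it avoids overflow when the signals are large.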


2.5.6 Convolutional Neural Networks

The goal of the convolutional neural network is that instead of taking the data input in the form of a vector, it takes in data in the form of a matrix. This is useful for two cases, time series data and image data [26]. The name convolution comes from the mathematical convolution between two matrices. The idea is that the features around a feature element will have an influence on it, for instance when studying time behaviour. In deep learning, one first selects the size of the matrix used in the convolution, meaning how many features around each element should be included in the convolution. From this, one selects a partition of the input data matrix to create a new matrix called a filter. Let us take a random input matrix x_matrix as an example.

$$x_{matrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

Let us now define that the selected filter matrix has a size of 2x2 and that two filter neurons are used in this convolution layer. Each filter will randomly take a 2x2 matrix from the input data.

$$filter_1, filter_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$$

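The convolution will result in:

$$output_1 = x_{matrix} * filter_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$

$$output_2 = x_{matrix} * filter_2 = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}$$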

Each element of these output matrices is then passed through an activation function. After the desired number of convolutional layers, one can reshape the last layer into vector form instead. This vector-form layer then behaves as the artificial neuron layers described above. A final Softmax layer can now be deployed to do the classification.
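The two output matrices of the example can be checked with a short script. The text does not state which padding convention is used; the sketch below assumes a full mathematical convolution with zero padding, cropped to the 3x3 size of the input, because that convention reproduces both outputs. The function names are introduced here for illustration only.

```python
def convolve_full(x, f):
    """Full 2-D convolution of matrix x with filter f (zero padding)."""
    rows, cols = len(x), len(x[0])
    f_rows, f_cols = len(f), len(f[0])
    out = [[0] * (cols + f_cols - 1) for _ in range(rows + f_rows - 1)]
    for i in range(rows):
        for j in range(cols):
            for a in range(f_rows):
                for b in range(f_cols):
                    # Each filter element adds a shifted, scaled copy of x.
                    out[i + a][j + b] += x[i][j] * f[a][b]
    return out


def crop(matrix, rows, cols):
    """Keep the top-left rows x cols part of the matrix."""
    return [row[:cols] for row in matrix[:rows]]


x_matrix = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
filter1 = [[1, 0], [0, 1]]
filter2 = [[0, 1], [0, 0]]

output1 = crop(convolve_full(x_matrix, filter1), 3, 3)
output2 = crop(convolve_full(x_matrix, filter2), 3, 3)
print(output1)
print(output2)
```

Deep learning libraries often implement the convolution layer as a cross-correlation (without flipping the filter) and with other padding choices, so the numbers can differ from this example depending on the library settings.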

2.5.7 Recurrent neural network

A recurrent neural network is a network where each neuron is connected to itself in a loop. This creates a memory cell for each neuron, with the idea that previous data together with the new input might give good input to a classification model. This is for instance used when classifying objects in video clips, using previous frames


to improve the classification performance on the new frame. The design issue with recurrent neural networks combined with the concepts of DNN is that this looping will make the signal larger or smaller, and never stay the same, as it is looped through the activation function. Making the signal larger or smaller after every time step causes the gradient to increase or decrease drastically, resulting in very large or very small weight updates for each artificial neuron. Both are problems, as the weights eventually become infinite or zero, at which point the network can no longer improve. A solution to this is called Long short-term memory (LSTM). The goal of LSTM is to have gates [26] that determine if the previous data should influence the classification of the new input or not. This allows it to only loop data if needed and to remove data loops that turn out to never help the classification. The gates also make sure that looping the previous input together with the new input does not change the signal, allowing the network to behave just as a typical DNN, but now as a memory-influenced model.
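The growing or shrinking of a looped signal can be illustrated with a deliberately simplified sketch: a single neuron whose output is fed back to itself through one recurrent weight, with the activation function left out. The weight values and the number of time steps are hypothetical.

```python
def loop_signal(signal, recurrent_weight, steps):
    """Feed a neuron's output back into itself for a number of time steps."""
    for _ in range(steps):
        signal = recurrent_weight * signal
    return signal


growing = loop_signal(1.0, 1.1, 100)    # weight above one: the signal explodes
shrinking = loop_signal(1.0, 0.9, 100)  # weight below one: the signal vanishes
print(growing, shrinking)
```

Only a recurrent weight of exactly one keeps the signal unchanged in this simplified setting, which is the behaviour the gates in an LSTM are designed to make possible.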

2.5.8 Improving the model

In this section, we explain how to change the deep learning model setup to improve the classification performance. This is not a trivial task, as there are many reasons why a model misclassifies. Two techniques for improving the model are generalization and increased complexity.

Generalization is used when the model performs better on the training data than on the test data. This indicates an overfitted model that needs to be more generalized. In DNN, one common technique is to use dropout. Dropout means randomly removing neurons in a layer. For instance, a dropout of 0.25 means randomly removing 1/4 of all neurons. The result is that the complexity of that layer decreases, as fewer neurons influence the next layer. The lower complexity can make the model generalize better to unseen data and handle overfitting to the training data.

Another typical case is that the model does not even achieve good results on the training data, which indicates that the complexity of the model is too low. This means the mapping function needs to be more complex for this data set. A solution can be to improve the model by changing the activation function or increasing the number of neurons and layers. There are many techniques for improving a DNN, and achieving good results is generally about good intuition.
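Dropout as described above can be sketched with the Python standard library. This is a minimal sketch and not the implementation of a particular deep learning library: each neuron output is set to zero with probability equal to the dropout rate, and the rescaling of the remaining outputs that many libraries apply is left out. The layer size and the seed are hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; the rest pass unchanged."""
    return [0.0 if rng.random() < rate else a for a in activations]


rng = random.Random(0)               # fixed seed, only to make the sketch repeatable
layer_output = [1.0] * 10000         # hypothetical outputs of one layer
dropped = dropout(layer_output, 0.25, rng)
removed_fraction = dropped.count(0.0) / len(dropped)
print(removed_fraction)              # close to 0.25
```

Dropout is only applied while training; when the model classifies unseen data, all neurons are kept.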

2.6 Related works

The idea of improving quality of service in the scheduler function of the eNodeB using machine learning classification is not a well-explored subject. To find relevant previous work, the thesis started by looking into network traffic classifier (NTC) problems that use machine learning. NTC classifies different services using packet behaviours, which is what an eNodeB machine learning classifier also needs to do. Previous NTC studies will provide us good information regarding model


selection and best practices for packet extraction. One of the reports used machine learning for classifying services, for instance "google", "email" and "youtube". The other report classified applications when the packets are encrypted. Classifying the applications of encrypted packets has similar feature limitations as an eNodeB. The encryption type handled in that previous machine learning NTC work was transport layer security encryption. This means that the layers above the transport layer are encrypted. The report successfully managed to classify the application of each encrypted packet, using deep learning methods with source time series. Transport layer security means, however, that they could extract more information from the packet than a future eNodeB can. They can get information regarding the network and the transport layer, which the eNodeB will not be able to fetch. This means that features such as IP address and TCP window will be discarded as features for our problem.

2.6.1 Internet Traffic Classification Using Feed-forward Neural Network

Paper [27] was one of the first we could find that uses deep learning as a classification solution for NTC. The paper is from 2011, and first discusses the downsides of previous solutions using classical machine learning techniques. The goal of the classifier is to classify what type of service the packets belong to, for instance a Google search. The paper has three large contributions to the NTC problem. The first is to explain the previous findings of using classical machine learning for NTC, and to justify why these are not good estimators for this type of complex traffic problem. The second contribution is realizing that port-based classification is no longer a good solution, as several applications nowadays use random ports to communicate, and that payload-based classification might not be possible due to encryption. The third contribution is the proposed solution of using MLP to create a better classification model. The features extracted for the model were source-IP, destination-IP, source-port, destination-port and protocol type [27]. Let us start by breaking down the different classification issues with classical machine learning. Classical machine learning is in this case, for instance, Naïve Bayes classification, k-nearest neighbour and C4.5.

The core issue they found with Naïve Bayes classification is the assumption that all features are independent of each other. They found that this was not the case, resulting in poor classification performance. K-nearest neighbour is a clustering technique that needs to study the relationship between the clusters of non-trained data and the trained clusters; this algorithm is computationally intensive and can result in several data points getting classified as noise during testing [27]. C4.5, a decision tree algorithm, had the best result among the classical machine learning models. The best classification performance overall was achieved using MLP. Their MLP design was 4 connected layers, with 10-30 neurons in the layers between the input layer and the output layer. They compared the MLP with Naïve Bayes and showed that the accuracy improved massively using MLP. A drawback of the paper is that they assume


one will use IP addresses as input parameters for the network. This can lead to overfitting, as the classification model can base the application type on the IP address.

2.6.2 Network Traffic Classifier With Convolutional and Recurrent Neural Networks for Internet of Things

The work presented in [28] compared several deep learning models for NTC. They concluded that one should not only use accuracy as the evaluation score for different machine learning models. They used several evaluation scores, namely accuracy, F1, precision and recall, to evaluate each model. The features they used for their classification models were source port, destination port, packet direction, bytes in the payload, inter-arrival time and window size [28], which were used to classify the application type of the packet. In addition, they used the idea of storing previous packets sent from each source as input. This allows them to format the input as a time series matrix where each previous packet from the same source is a row in the matrix. This makes it possible to use recurrent neural networks and convolutional neural networks, which improved the classification performance. The paper found that the best model for their traffic data-set was a combination of LSTM and convolutional neural networks [28]. Nevertheless, a drawback of this paper is that they never compared against simpler machine learning models that are not based on deep learning, which might be a better solution for their data-set. An additional problem with the paper is that the data-set is not public. Lastly, they assumed some type of port-based classification. This might introduce bias, as a new data-set might have totally different ports assigned to some packet applications.

2.6.3 Summary

The methods used in the articles can be applied to the problem in this thesis. The second report managed to achieve an accuracy above 95% for classifying the application a packet belongs to. This indicates that it is possible to achieve good results for our problem.

The reports are however not designed for our problem, but we can use their models to see if they translate well to our NTC problem. The first report used Naïve Bayes as a sanity check for indicating whether more complex models perform well or not. The scores used in the second article to measure each model's classification performance were carefully selected, and their time series matrix feature concept will also be used in our case, allowing us to test convolutional and LSTM classification models. Lastly, one thing that we want to discard from the reports is the use of a unique address as an input to the model, as this can make the model biased towards an identifier when doing classification. However, we will extract the MAC address from each packet to ensure that the time series matrix can be built. Nevertheless, the actual MAC address will not be a feature for the models.

Chapter 3

Implementation

3.1 Goal

The first step for the thesis is to find relevant data captures. The captures might need to contain both production line communication using Profinet protocols and communication in the office landscape, as the cellular network can be deployed for several levels in the CPwE design. When data-sets are found, relevant features need to be extracted from the data. The feature selection will be based on the background knowledge of Profinet, eNodeB and related works. Later, to confirm that the data-set input features behave as the background has explained, quantitative measurements will be done on the data-sets to study their behaviour. Empirical observations on the data will be made using analytics, which will show the relevance of the data-sets. If the data-sets behave as expected, machine learning models will be trained and modified to perform as well as possible at classifying the applications. This will be done using the knowledge from sections 2.5 and 2.6. The evaluation of the models in the result chapter will answer our research question, "if it is possible to achieve an effective application classifier model using machine learning, making sure important packets get prioritized in a cellular network".

3.2 Captures

Two data-sets are used in the thesis. The first capture is from HMS. HMS stands for "Hardware meets software" and is a company specializing in making manufacturing devices, for example industrial robots and control systems, communicate over Ethernet. The capture they provided is from their test bench, which uses Profinet communication between an IO-Controller and several IO-Devices. The capture contains 10140 packets and illustrates a possible industrial communication between level 0 and level 1 in the CPwE architecture. The capture uses the majority of the commonly used applications in industrial Ethernet communication discussed in 2.1.4. The goal of this data-set is to illustrate a production line communication and study if classification can be achieved internally in Profinet


communication. The data-set is not public. The second data-set was provided by DEFCON. The original goal of this data-set was to find passwords and important information about the PLCs in a factory plant. This data-set uses a combination of applications, several of which are not used in the production. The goal of this data-set in the thesis is to study if a machine learning classification model can still classify well when the packets are not only from the production, emulating a scenario where an eNodeB needs to handle all the traffic in a company (level 0 - 5). This capture has a size of 1046036 packets, approximately 100 times larger than the first data-set, and uses some of the commonly used Profinet applications.

When searching for captures for the data-sets, we found that there is a limited number of industrial Ethernet communication captures available online, most likely because companies do not want to share their production line data.

3.3 Feature extraction

The captures provided need to be converted to data types that are useful for data analytics and machine learning. As mentioned in section 2.4.1, the DataFrame is a structured table type that is useful for this case. The conversion from the Wireshark capture to a DataFrame was done using a multi-threading approach and the pyshark module. Algorithm 3 performs the conversion from a capture to a DataFrame. The idea of algorithm 3 is to filter the capture, meaning that the pyshark module reads the capture and separates the packets based on their application. Each application capture gets its own thread, as seen in algorithm 3, allowing the data-set to be processed more efficiently. Feature extraction is done for each packet in every application, and the result is stored in a DataFrame. The index of every packet is later used to sort the generated application DataFrames, so they can be merged together in the correct order.

The goal of the classification model is to be able to classify Profinet applications efficiently, so that different production line packets can have different priorities. The feature selection is based on Profinet behaviour and on what the eNodeB can extract. The selection became:

1. Ethernet Type

2. Source Interval Time

3. Packet Size

The motivation for this is that Profinet Real-Time applications have a distinct Ethernet Type of 8892. The Source Interval Time is selected to separate cyclic, non-cyclic and acyclic applications. Finally, the Packet Size is selected because some data, for instance PNIO, will vary in size as different IO-Devices send different amounts of data or measurements, while other Profinet applications most likely have static packet sizes. All of the selected features are possible for the future eNodeB to extract, making them relevant for our problem. From section 2.6.2, the idea


Algorithm 3: Pseudo code for extracting the capture to a DataFrame.
Data: Capture
Result: Return a DataFrame with the selected features and labels
map = {filters, application_labels};
thdMng = new ThreadManager();
for (fltr, application) in map do
    cap_summery_filter = pyshark.read(Capture, filter=_filter(fltr));
    thr = new Thread();
    thdMng.append(thr.start{
        df = new DataFrame()
        for pkt in cap_summery_filter do
            df.append(feature_extraction(pkt), label:application, index:pkt.idx)
        end
    })
end
thdMng.completion_handler{() -> completion:
    df = thdMng.mergeDataFrames();
    df.to_csv("full_data_set.csv", sortIndex=True);
}

of a matrix-based time series input of previous data seems very promising. This allows the study to explore more complex deep learning models such as convolutional and LSTM networks. For that reason, an additional feature extraction will be deployed for those models. Figure 3.1 shows the defined feature extraction input matrix. We

Figure 3.1: The time series matrix used as input data for the LSTM and convolutional models. The figure is generated using https://www.draw.io/.

decided to use a time series of three, meaning that the newest packet, called the latest source packet, the next most recent packet, called the previous source packet, and the packet before that, the penultimate source packet, will be used as feature input for the convolutional and LSTM models. In case a particular source sends its first or second packet, the other positions in the matrix will be zero.
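The construction of the time series matrix can be sketched as follows. This is a minimal sketch under stated assumptions and not the code used in this thesis: the packet tuples, the function name and the row order (latest packet first) are hypothetical, the Source Interval Time of the first packet from a source is set to zero, and the source identifier is only used to group the packets, never as a feature.

```python
from collections import defaultdict, deque

WINDOW = 3  # latest, previous and penultimate source packet


def build_time_series(packets):
    """packets: (timestamp_ms, source, ethernet_type, size) in capture order."""
    history = defaultdict(lambda: deque(maxlen=WINDOW))
    last_seen = {}
    matrices = []
    for timestamp, source, ethernet_type, size in packets:
        # Source Interval Time: time since the same source last sent a packet.
        # Assumption: 0 for the first packet of a source.
        interval = timestamp - last_seen.get(source, timestamp)
        last_seen[source] = timestamp
        # Newest packet goes first; the oldest falls out when the window is full.
        history[source].appendleft((ethernet_type, interval, size))
        rows = list(history[source])
        rows += [(0, 0, 0)] * (WINDOW - len(rows))  # zero rows for missing packets
        matrices.append(rows)
    return matrices


# Hypothetical packets; "aa" and "bb" stand in for source MAC addresses.
packets = [
    (0, "aa", 0x8892, 60),
    (1, "bb", 0x0800, 100),
    (2, "aa", 0x8892, 60),
    (4, "aa", 0x8892, 64),
    (8, "aa", 0x8892, 60),
]
matrices = build_time_series(packets)
print(matrices[4])
```

Each entry of `matrices` is the 3x3 input for one packet, with one row per packet and the columns Ethernet Type, Source Interval Time and Packet Size.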


3.4 Empirical observations

Before we can start defining the machine learning models that will be deployed, the data-sets need to be quantitatively analysed, confirming their relevance to the problem. Based on section 2.2, table 2.1 and the hypothesis in section 1.6, we need to confirm a few behaviours for both data-sets. The behaviours that need to be analysed are the following:

1. All Profinet Real-Time and Isochronous-Real-Time applications have the Ethernet Type 8892.

2. The cyclic and acyclic behaviour of PNIO, PNIO-PS and PNIO-AL.

3. The idea that PNIO varies in packet size, while the others are mostly static.

4. Show that the Real-Time and Isochronous-Real-Time Profinet applications follow the latency requirements of Profinet version 2 or 3.

5. Observe if the hypothesis "that different industrial applications in the production line will have similar packet behaviour, so a more complex classification model is needed" is true.

If we can confirm that points 1-4 are true, this will indicate that the selected data-sets are relevant data for this classification problem. If we can confirm point 5, this will indicate that machine learning is relevant for the problem.

For the first point, a feature space plot has been made for both data-sets. Each application is represented by a number in both plots. The feature spaces for the data-sets are seen in figures 3.2 and 3.3. In figure 3.2 we can see the production line communication for the HMS data-set. It confirms two things: first, that the Ethernet Type behaviour supports our background knowledge, and secondly, our hypothesis regarding the use of machine learning. The applications used are explained in section 2.1.4, and every application seems to behave as we expected. For instance, the non-real-time PNIO-CM communicates using Profinet version 1, which uses the TCP/IP stack, and that is why its Ethernet Type is 0800 (IPv4). The Real-Time Profinet applications PNIO_DCP, PNIO_AL and PNIO_PS all have the Profinet Real-Time Ethernet Type of 8892. As expected, ARP and LLD also behave as they should. The second thing the feature space analysis shows is that the Real-Time Profinet applications PNIO_DCP, PNIO_AL and PNIO_PS are mixed together. This is seen in figure 3.2 where the packet size is around 100 bytes and the source interval time is very short. This confirms our original hypothesis of why a machine learning classification model might be relevant. The same things are confirmed in the feature space plot for our second data-set in figure 3.3. Here user applications are added, communicating using Ethernet Type 0800 (IPv4) and 86dd (IPv6). Those applications will not be explained, as they will be viewed as standard office


Figure 3.2: The features mixture model for each application in the HMS data-set.

Figure 3.3: The features mixture model for each application in the DEFCON data-set.


landscape data. For instance, the HTTP application is used for REST APIs on the Internet, DNS is used for finding the IP address of a website and FTP is used to send files, for instance from a company server. Here all Ethernet Types in the feature space confirm the expected behaviour. The Real-Time Profinet communication is mixed when the packet size is below 200 and the source interval time is very short. We can also confirm that PNIO seems to be the application with the most dynamic packet size. Nevertheless, PN-DCP seems to have two different packet sizes in figure 3.2.

For points 2 and 4, two analyses of the data will be done. The first is to analyse the source interval time distribution for different applications, where the idea is that Profinet Real-Time applications should follow the Real-Time requirement of 5-10 ms or the Isochronous-Real-Time requirement of 0.25-1 ms. The second is to analyse each source interval time against the packet index to confirm cyclic and acyclic behaviours in the data-sets. The source interval time distribution for the HMS data-set can be seen in figure 3.4 and the cyclic behaviours can be seen in figure 3.5. In figure 3.4 we can see that PNIO-AL is sending in an acyclic manner, as it has interval distributions both around 0.1 s and around 1 ms. It seems to follow both the acyclic claims and Profinet version 3. The PNIO-PS distribution follows Profinet version 3, where the majority of the data is around 1 ms. PN-DCP sends with a variety of intervals, which makes sense as PN-DCP is a setup phase for the communication; the same goes for PNIO-CM. We can also confirm in figure 3.5 that the PNIO-AL alarms are acyclic, as each source seems to send in a random manner, and that PNIO-PS has a cyclic behaviour. One can also conclude that the setup parameter phase is cyclic, based on PNIO-CM, although that was nothing we required.

For the DEFCON data-set, the source interval time distribution is shown in figure 3.6. Here we also selected an office landscape application, HTTP, to show that it sends at very random interval times, explaining the random distribution. The same conclusions can be drawn from the results for this data-set as for HMS. The added PN-PTCP sends at random source interval times, which makes sense, as the goal of PN-PTCP is to take care of synchronization. PNIO seems to follow Profinet version 2, with a latency profile of around 8 ms. The cyclic behaviour of each application can be seen in figure 3.7. PNIO seems cyclic, but in some cases the communication changes for a few sources. This can be a cyclic behaviour over a large period that was not captured. However, the communication that does not appear cyclic is the one that is sending with very high latency. Those devices might not need low latencies, and are perhaps IO-Devices that are not used often in the production, like a fan turning on and off. For instance, the green and purple sources are sending in a very cyclic and highly frequent manner, confirming the cyclic behaviour. PN-PTCP has a cyclic behaviour, most likely because it needs to handle different synchronization phases.

The empirical observations on the data-sets appear to support the background knowledge. Because of this, the data-sets are viewed as relevant data for the machine learning classification models.
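The two analyses above can be sketched in a few lines. This is only an illustration: the packet tuples, the source names and the coefficient-of-variation threshold `max_cv` are assumptions made for the example and are not taken from the HMS or DEFCON captures.

```python
from collections import defaultdict
from statistics import mean, pstdev

def source_intervals(packets):
    """Group (timestamp_s, source) pairs by source and return inter-arrival times."""
    by_source = defaultdict(list)
    for timestamp, source in sorted(packets):
        by_source[source].append(timestamp)
    return {
        src: [later - earlier for earlier, later in zip(times, times[1:])]
        for src, times in by_source.items()
    }

def looks_cyclic(intervals, max_cv=0.1):
    """Heuristic (assumption): cyclic traffic has a small coefficient of variation."""
    if len(intervals) < 2:
        return False
    avg = mean(intervals)
    return avg > 0 and pstdev(intervals) / avg <= max_cv

# Hypothetical traffic: one source sending every 1 ms, one sending irregularly.
packets = [(i * 0.001, "io-device") for i in range(50)]
packets += [(t, "alarm") for t in (0.002, 0.0031, 0.0205, 0.0209, 0.047)]

intervals = source_intervals(packets)
cyclic = {src: looks_cyclic(iv) for src, iv in intervals.items()}
```

A histogram of `intervals[src]` corresponds to the distribution plots, and plotting the intervals against the packet index corresponds to the cyclic-behaviour plots.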


Figure 3.4: The distribution of source interval times for four application types in the HMS data-set. The applications selected were PNIO-AL, PNIO-CM, PN-DCP and PNIO-PS.

Figure 3.5: The packet behaviour for each application. Each colour represents one source and how it is sending its packets; this is to study whether there is any cyclic or acyclic behaviour. The result is from the HMS data-set.


Figure 3.6: The distribution of source interval times for four application types in the DEFCON data-set. The applications selected were HTTP, PN-DCP, PN-PTCP and PNIO.

Figure 3.7: The packet behaviour for each application. Each colour represents one source and how it is sending its packets; this is to study whether there is any cyclic or acyclic behaviour. The result is from the DEFCON data-set.


3.5 Data analytics
We have confirmed, using empirical observations on the quantitative data available, that the data-sets appear to behave as Profinet communication. We have also shown that machine learning might be necessary. The next step is to analyse the behaviour of the features and applications, for instance whether one application type dominates the data-set, or how the features correlate.

The histograms for the data-sets can be seen in figures 3.8 and 3.9. In both histograms, most of the packets belong to one particular application. This means that if we only used accuracy as the classification evaluation score and the model decided to classify everything as PNIO-PS, the accuracy for the HMS data-set would be around 62%, and for the DEFCON data-set around 87% when classifying everything as PNIO (see the class shares in table 3.1). The machine learning models should therefore be evaluated using several types of scores. Based on the knowledge from section 2.6.2 we will train the models to perform well on accuracy, precision, recall and f1 score to compensate for this. The correlations between the features and the application label are shown in figure 3.10 for HMS and figure 3.11 for DEFCON. In the HMS data-set many features are correlated with each other. Section 2.6.1 concluded that when features are correlated with each other Naïve Bayes will not perform well, due to its assumption of feature independence. For that reason we will use Naïve Bayes as a sanity check for the rest of the models, expecting the rest to perform better. In the DEFCON data-set the correlation between different features is lower and the correlation between Ethernet type and application is very high, so Naïve Bayes should perform better for this data-set.
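To make the imbalance argument concrete, the sketch below scores a classifier that labels every HMS test packet as PNIO-PS. The class counts are the HMS testing counts from table 3.1; the use of macro averaging (every class weighted equally) is an assumption for the illustration, since the averaging scheme is not stated here.

```python
def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and f1."""
    labels = sorted(set(y_true))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# HMS testing counts from table 3.1.
hms_test_counts = {"PNIO-PS": 1255, "PNIO-CM": 393, "PNIO-AL": 130,
                   "ARP": 127, "PN-DCP": 115, "LLDP": 9}
y_true = [label for label, n in hms_test_counts.items() for _ in range(n)]
y_pred = ["PNIO-PS"] * len(y_true)  # majority-class "model"

accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
```

The accuracy stays above 0.6 while the macro recall drops to 1/6, which is why accuracy alone would hide a model that never detects alarms.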

The final data-analytics decision is that we will not use a validation set. The reason is that some applications are infrequent in both histograms. If we used a validation set, the training data might not contain enough samples to learn the mapping between the input data and every application output. Instead we use a randomly separated training set and testing set [80%, 20%], shown in table 3.1; this approach is discussed in section 2.5.2. This allows the deep learning model parameter setup to be changed and re-trained without affecting the bias much. The deep learning models generally need to be changed a few times before they achieve good results, whereas most classical machine learning models do not require this. To avoid introducing any possible bias in their results, the classical machine learning models will only be trained and tested once.
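A random 80/20 partition of this kind could look as follows. This is a minimal sketch: the function name `random_split` and the integer stand-ins for packets are hypothetical, and a fresh seed per tuning iteration is only one way to realise the re-partitioning described above.

```python
import random

def random_split(samples, train_fraction=0.8, seed=None):
    """Shuffle the samples and cut them into a training and a testing part."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 10140 HMS packets (each integer represents one packet).
samples = list(range(10140))
train, test = random_split(samples, seed=0)

# A new seed gives a new partition, e.g. for the next tuning iteration.
train_2, test_2 = random_split(samples, seed=1)
```

Because the split is purely random, rare classes such as LLDP are not guaranteed a fixed share of either part; a stratified split would be an alternative but is not what the text describes.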

The last note for the data analytics is that the classification delay should be monitored during testing; this allows us to study the classification delay for online classification. The reason is that Profinet cyclic communication requires the data to be sent almost instantly and has a low time tolerance, so the model cannot have a large classification delay.
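One way to monitor the classification delay is to time the classifier over a batch of packets and divide by the batch size. The sketch below assumes a hypothetical rule-based `toy_classifier` in place of a trained model, and compares the result with the lower bound of the 5-10 ms Real-Time interval mentioned earlier; comparing a per-packet delay directly with a cycle time is a simplification.

```python
import time

def mean_delay_per_packet(classify, packets, repeats=5):
    """Best-of-`repeats` mean wall-clock delay in seconds per classified packet."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for packet in packets:
            classify(packet)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(packets))
    return best

def toy_classifier(packet):
    """Hypothetical stand-in for a trained model (0x8892 is the Profinet EtherType)."""
    return "PNIO" if packet["ethertype"] == 0x8892 else "other"

packets = [{"ethertype": 0x8892 if i % 2 else 0x0800} for i in range(1000)]
delay = mean_delay_per_packet(toy_classifier, packets)
within_rt_budget = delay < 0.005  # 5 ms, lower bound of the Real-Time interval
```

For a real model the same loop would wrap its prediction call, ideally on the hardware that would host the classifier.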


Figure 3.8: The frequency for each application in the HMS data-set, containing 10140 packets.

Figure 3.9: The frequency for each application in the DEFCON data-set, containing 1046036 packets.


Figure 3.10: The correlation for each feature in the HMS data-set; this is to determine how each feature and application correlate.

Figure 3.11: The correlation for each feature in the DEFCON data-set; this is to determine how each feature and application correlate.


HMS
            Training data 80%            Testing data 20%
Label       Size     Normalized [0,1]    Size     Normalized [0,1]
PNIO-PS     5063     0.624137            1255     0.618531
PNIO-CM     1532     0.188856            393      0.193691
PNIO-AL     511      0.062993            130      0.064071
ARP         498      0.061391            127      0.062592
PN-DCP      484      0.059665            115      0.062592
LLDP        24       0.002959            9        0.004436

DEFCON
            Training data 80%            Testing data 20%
Label       Size     Normalized [0,1]    Size     Normalized [0,1]
PNIO        729637   0.871908            182337   0.871563
ARP         55372    0.066169            13850    0.066202
ICMP        27544    0.032915            6917     0.033063
PN-DCP      6133     0.007329            1511     0.007223
LLMNR       4116     0.004919            1024     0.004895
NBNS        2977     0.003557            768      0.003671
MDNS        1695     0.002026            390      0.001864
HTTP        1372     0.001640            344      0.001644
PN-PTCP     1200     0.001434            316      0.001510
DHCPv6      898      0.001073            245      0.001171
ICMPv6      861      0.001029            235      0.001123
LLDP        870      0.001040            223      0.001066
DHCP        886      0.001059            213      0.001018
TLSv1       737      0.000881            206      0.000985
DNS         734      0.000877            198      0.000946
IGMPv3      730      0.000872            188      0.000899
MODUS       517      0.000618            121      0.000578
FTP         193      0.000231            38       0.000182
ECHO        60       0.000072            11       0.000053
NTP         54       0.000065            10       0.000048
XDMCP       44       0.000053            9        0.000043
NAT_PMP     22       0.000026            9        0.000043
NFS         25       0.000030            9        0.000043
BJNP        25       0.000030            9        0.000043
TELNET      38       0.000045            8        0.000038
ISAKMP      36       0.000043            7        0.000033
FTP-DATA    35       0.000042            6        0.000029
BOOTP       17       0.000020            5        0.000024

Table 3.1: The HMS/DEFCON village data-sets split into two sets for training and testing.


3.6 Machine learning models
The models selected for the thesis are inspired by the most successful machine learning models from sections 2.6.1 and 2.6.2, where Naive Bayes will be used as a sanity check for the other models' performance. The classical machine learning models that will be tested are:

1. Decision Tree (C4.5).

2. Naïve Bayes.

3. Support vector classifier (SVC).

4. Random Forest.

None of them will change their initial parameter setup after the first training/testing result, so their test results are not biased by tuning. The Support vector classifier will later be discarded, as we noticed that it did not perform well with the initial kernel function selected. It could also not be used on the DEFCON data-set, which contains more than a million packets; transforming all of those packets into a higher dimension for training and testing is not feasible. The result on HMS can most likely be improved with a better kernel function, but we will not explore that in this thesis, which is why the model is denoted SVCparam* (parameter improvement needed). Random Forest consists of several decision trees that together vote on the classification. The idea is that this allows it to handle more complex data than a single decision tree would be able to map.
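The voting idea behind Random Forest can be illustrated with a toy example. The three hand-written "trees" and the packet features `length` and `cyclic` below are purely hypothetical; a real Random Forest learns its trees from bootstrapped training data and random feature subsets.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Return the label that receives the most votes from the individual trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical hand-written decision rules standing in for trained trees.
def tree_a(packet):
    return "PNIO-PS" if packet["length"] <= 64 else "PN-DCP"

def tree_b(packet):
    return "PNIO-PS" if packet["cyclic"] else "PNIO-AL"

def tree_c(packet):
    return "PNIO-AL" if packet["length"] > 64 and not packet["cyclic"] else "PNIO-PS"

trees = [tree_a, tree_b, tree_c]
cyclic_packet = forest_predict(trees, {"length": 60, "cyclic": True})
alarm_packet = forest_predict(trees, {"length": 100, "cyclic": False})
```

In the second case one tree votes PN-DCP but is outvoted by the other two, which is the mechanism that lets the ensemble correct the mistakes of a single tree.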

The deep learning models are also inspired by the results in sections 2.6.1 and 2.6.2. The models tested are:

1. Multilayer perceptron (MLP).

2. Convolutional neural network (CONV).

3. Long Short Term Memory neural network (LSTM).

4. Convolutional Long Short Term Memory neural network (CONV_LSTM).

For the deep learning models, we allow changes that make them perform better, as they require a lot of tuning before achieving good results. The changes can, for instance, be the number of neurons in each layer, the total number of layers and the activation function selected for each layer. To reduce the bias, a random partition of the training and testing set is done for every tuning iteration; this concept is discussed above. The models are improved by following section 2.5.8. The final design for each DNN is shown in figures 3.12, 3.13, 3.14 and 3.15. The deep learning models are built using Keras [29] with a Tensorflow [30] back-end, and the classical machine learning models are built using scikit-learn [31].


Figure 3.12: The MLP model design for both data-sets.

Figure 3.13: The Convolutional model design for both data-sets.


Figure 3.14: The LSTM model design for both data-sets.

Figure 3.15: The Convolutional LSTM model design for both data-sets.

Chapter 4

Results and conclusions

4.1 Computer specification
The computer used for training every model has four graphics cards (Titan P and Titan XP), 128 GB of RAM and 3 TB of SSD storage. As a reference, training the CONV_LSTM for the DEFCON data-set in a virtual Linux machine on a standard local ERICSSON laptop takes approximately 60 times longer than on the "super computer". This is because the powerful GPUs are used for the matrix operations in the DNN, allowing large batches in each epoch to be trained in parallel on the GPU cores. As a result, training the heaviest DNN took around 1 minute and 20 seconds, versus 1 hour and 20 minutes on the local computer.

4.2 Model evaluation for HMS
In the previous sections we have discussed points 1 to 6 of section 1.5. Goal 7 is to find the most suitable model for the data-set. For the HMS data-set we selected 8 different models. The model evaluation scores are shown in figure 4.1, which gives the accuracy, precision, recall and f1 percentage for all models. As explained in chapter 3, SVCparam* seems to start from a badly selected kernel function, resulting in an insufficient separation in the higher dimensions. Naive Bayes was the sanity check for the rest of the models and achieved around 70% for all scores. The rest of the models classify the data-set applications with evaluation scores above 97%. The models that performed best were the ones based on a time series matrix input. This makes sense because in Profinet communication different applications are used in different stages of the communication, following the application order from section 2.1.4. A time series classification model notices this and learns, for instance, that if a source has not already seen a PNIO-CM packet then it cannot send a PNIO-AL or PNIO application. The best model for this data-set is the convolutional LSTM model, most likely because of its complexity and time series understanding. However, the majority of models successfully classify the applications, and one can argue that any of MLP, CONV, LSTM, CONV_LSTM, C4.5 and Random Forest can be used as the machine learning model to classify the packet application. In chapter 3 we also discussed the relevance of the classification delay for answering the research question, "if it is possible to achieve an effective application classifier model using machine learning, making sure important packets get prioritized in a cellular network". Because most of the models achieve good classification evaluation scores, they indicate that a machine learning model can be used to make sure important industrial communications are prioritized, and from that one can reason that the research question might be answered.

Figure 4.1: Models evaluation scores for HMS test data-set.

One also needs to answer whether the classification delay is significant for the different models. Figures 4.2 and 4.3 show the classification delay for each model. The classical machine learning models have a lower classification delay. For instance, the C4.5 model manages to classify 1000 packets in under 0.0001 seconds, an average classification delay of 0.1 microseconds per packet, which is well within the jitter range for RT and IRT Profinet communication. However, for the best classification model, CONV_LSTM, the delay fluctuates and averages up to 0.3 ms per packet. That corresponds to the same interval time as IRT Profinet communication, so it might not be a valid solution for IRT applications, based on the explanation in section 1.3. The classical machine learning models have a lower classification delay because of their simplicity: they use fewer parameters.


Figure 4.2: The classification model delay for the selected models in the HMS data-set.

Figure 4.3: The classification delay for the fastest classification models in the HMS data-set.


Lastly, we need to understand how different models misclassify. This can be studied using a confusion matrix, explained in section 2.4.2, with which we can evaluate whether a model can separate the RT and IRT Profinet applications. There is no point in producing a confusion matrix for every model in the test, as that would occupy too much space and would most likely not give much more insight than doing it for a few selected models. For this reason we perform the confusion matrix analysis only for C4.5 and CONV_LSTM. CONV_LSTM had the best classification performance scores in figure 4.1, and C4.5 had a good trade-off between classification performance and classification delay, seen in figures 4.1 and 4.3. It might be naïve to discard CONV_LSTM because of its high classification delay, since improvements might reduce the classification delay of this model in the future.

The confusion matrix results on the test set can be seen in figure 4.5 for CONV_LSTM and figure 4.4 for C4.5. CONV_LSTM classifies the test data close to perfectly, with 5 PNIO-AL packets misclassified as PN-DCP. C4.5 did well in the classification but misclassified 16 PNIO-AL packets as PN-DCP and 10 PN-DCP packets as PNIO-AL. That is a worse result, but it still classifies the majority of the important data applications correctly.
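As an illustration of how such a matrix is built, the sketch below reconstructs the two affected rows for C4.5. The misclassification counts (16 and 10) come from the text and the class sizes (130 PNIO-AL, 115 PN-DCP) from the HMS testing column of table 3.1; that all remaining packets of these two classes were classified correctly is an assumption made for the example.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["PNIO-AL", "PN-DCP"]

# 130 true PNIO-AL packets followed by 115 true PN-DCP packets.
y_true = ["PNIO-AL"] * 130 + ["PN-DCP"] * 115
# Assumed predictions: 16 PNIO-AL -> PN-DCP and 10 PN-DCP -> PNIO-AL.
y_pred = (["PNIO-AL"] * 114 + ["PN-DCP"] * 16
          + ["PNIO-AL"] * 10 + ["PN-DCP"] * 105)

matrix = confusion_matrix(y_true, y_pred, labels)
alarm_recall = matrix[0][0] / sum(matrix[0])  # share of alarms found
```

Under these assumptions roughly 88% of the alarm packets would still receive the alarm priority.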

The low classification delay and good classification performance of C4.5 seem to indicate that it is possible to achieve beneficial scheduling of important applications for this data-set using machine learning.

Figure 4.4: Confusion matrix result using the HMS test data-set for the Decision Tree (C4.5).


Figure 4.5: Confusion matrix result using the HMS test data-set for the Convolutional LSTM.

4.3 Model evaluation for DEFCON
In the DEFCON data-set, office applications are also used in the network. The added applications might confuse the time-based classification models, as not all communication now behaves the same over time. The correlations between the features are also low for this data-set, so we expect Naive Bayes to perform better. The evaluation scores for the 7 selected models on the DEFCON test set can be seen in figure 4.6. As expected, Naive Bayes performed better on this data-set, most likely because the features are less correlated. All of the other models classify exceptionally well, with Random Forest and C4.5 performing best. The data in this case is a mix of many different types of services and does not behave the same over time, which might explain why the time series models do not perform as well on this data-set.

The model parameter setup is not changed for this data-set, except for the number of outputs, so the classification delay is similar to that for the previous data-set. The classification delay can be seen in figures 4.7 and 4.8.

From the evaluation scores one cannot derive which labels were misclassified, which is essential for knowing whether the RT production applications were misclassified or not. To make the same comparison as before, the confusion matrix for C4.5 is shown in figure 4.9 and for CONV_LSTM in figure 4.10. The C4.5 classification model classifies all production line applications correctly, and no misclassification occurs between the production applications and the office landscape applications. This is a very impressive result. One can argue that it is a perfect machine learning model for the purpose of the thesis, because misclassification between different office applications is not relevant: all of them are treated with the same priority. No misprioritization would occur for this data-set if the eNodeB used the C4.5 classification model. CONV_LSTM has a few misclassifications, where office landscape applications were prioritized as different production line applications, seen in figure 4.10. One PNIO packet was also misclassified as an office application. However, the important misclassifications total 14 packets out of 209207. The evaluation results seem to clearly show that a beneficial machine learning model can be deployed for production line application prioritization for this data-set.

Figure 4.6: Models evaluation scores for DEFCON test data-set.


Figure 4.7: The classification model delay for 1000 packets in the DEFCON data-set.

Figure 4.8: The classification delay for the fastest models in the DEFCON data-set.


Figure 4.9: Confusion matrix result using the DEFCON test data-set for the Decision Tree (C4.5).

Figure 4.10: Confusion matrix result using the DEFCON test data-set for the Convolutional LSTM model.


4.4 Benefits of using machine learning for the eNodeB scheduler

From the evaluation scores and confusion matrices of the machine learning models for both data-sets, we found several models that seem able to classify the packets correctly. This means that the issue that industrial Ethernet protocols are not designed for wireless communication might be resolved. We showed that the machine learning models could classify the different applications and could, for instance, prioritize internal applications in the industrial protocol Profinet. We also showed, using the DEFCON data-set, that this type of classification model might still work when other types of traffic are introduced in the eNodeB. The final goal in section 1.5 is to study the gain for important industrial applications during peak load when using a model that can internally prioritize the different packet applications.

To show the gain of using a machine learning classifier for both data-sets, we sample a stream of packets with different applications entering the eNodeB, where the eNodeB uses priority table 2.1 and the machine learning application classification to prioritize different traffic. The traffic stream is sampled using the empirical observations from the data-sets. The traffic stream for the HMS data-set is shown in figures 4.11 and 4.12. Figure 4.11 shows the case where the eNodeB knows the application type of every packet through machine learning, and figure 4.12 the case where it does not. The figures show how different packets arrive at the eNodeB during 1 second, where every millisecond the eNodeB needs to schedule some packets and leave others unserved. The same is done for the DEFCON data-set, with the machine learning solution shown in figure 4.13 and the case without machine learning in figure 4.14. To show the gain, the application priority levels in table 2.1 are used as scores, where the office landscape applications have a priority of 1. Each served packet adds the packet size times the priority level to an accumulated score, and each packet that is not served subtracts it. We also define a buffer where Non-Real-Time data can be stored and sent at a later time. The eNodeB is designed so that it cannot serve all packets arriving every millisecond, in order to simulate peak load. This will show how the eNodeB using machine learning serves important production line applications so that the production does not terminate, while without machine learning it naively selects random packets to serve.
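The scoring scheme described above can be sketched as a small simulation. Everything concrete in it is an assumption for illustration: the priority values stand in for table 2.1, and the packet sizes, serving capacity and buffer size are made up. Packets left in the buffer at the end are neither rewarded nor penalized, and classifier mistakes are not modelled.

```python
import random

def schedule(slots, capacity, buffer_size, prioritized, seed=0):
    """Accumulated score after serving `slots`, one list of packets per millisecond."""
    rng = random.Random(seed)
    score, buffered = 0, []
    for arrivals in slots:
        queue = buffered + list(arrivals)
        if prioritized:
            queue.sort(key=lambda p: p["priority"], reverse=True)  # ML-aware eNodeB
        else:
            rng.shuffle(queue)  # eNodeB that views all packets as equal
        budget, buffered, used = capacity, [], 0
        for packet in queue:
            value = packet["size"] * packet["priority"]
            if packet["size"] <= budget:
                budget -= packet["size"]
                score += value                      # served
            elif not packet["real_time"] and used + packet["size"] <= buffer_size:
                buffered.append(packet)             # Non-Real-Time: retry later
                used += packet["size"]
            else:
                score -= value                      # dropped
    return score

# Hypothetical arrivals per millisecond; priorities are assumed, not table 2.1.
one_slot = [
    {"app": "PNIO-AL", "size": 40, "priority": 4, "real_time": True},
    {"app": "PNIO-PS", "size": 40, "priority": 3, "real_time": True},
    {"app": "PN-DCP", "size": 40, "priority": 2, "real_time": False},
    {"app": "HTTP", "size": 40, "priority": 1, "real_time": False},
]
slots = [one_slot] * 10  # peak load: 160 bytes arrive, only 80 can be served

ml_score = schedule(slots, capacity=80, buffer_size=80, prioritized=True)
naive_score = schedule(slots, capacity=80, buffer_size=80, prioritized=False)
```

With this toy traffic the prioritized scheduler always serves the alarm and cyclic packets first, so its accumulated score is at least as high as that of the random scheduler.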


Figure 4.11: HMS traffic streamed to the eNodeB during one second. The eNodeB understands the packets' applications using ideal ML.

Figure 4.12: In this example for HMS, the eNodeB views all packets as equal.


Figure 4.13: DEFCON traffic streamed to the eNodeB during one second. The eNodeB understands the packets' applications using ideal ML.

Figure 4.14: In this example for DEFCON, the eNodeB views all packets as equal.


4.4.1 eNodeB scheduler gain
To demonstrate the benefits of machine learning for the eNodeB in the HMS network, the study looked at four cases. In the first case the eNodeB has no prior knowledge and chooses the packet to serve at random (naive approach). The second case uses the trained C4.5 classification model to prioritize packets, the third case uses the trained CONV_LSTM model, and the last case uses perfect classification to show the potential gain of machine learning for the eNodeB. From the confusion matrix results, the misclassifications are known for both C4.5 and CONV_LSTM. Misclassified packets are treated as not served and receive a negative accumulated score for every misclassification. The eNodeB iterates through the 1 second of traffic, and the score of each model is shown in figure 4.15. The score shows that CONV_LSTM is close to the perfect machine learning model, showing its classification strength, where the important packets are served directly. C4.5 gets a lower score as it misclassifies more, but still has a positive final result. In the case without machine learning, however, the eNodeB fails to prioritize important Profinet applications, which explains the score it got. The buffer is also used more efficiently with machine learning, as seen in figure 4.16. Machine learning gives the eNodeB the information that the Real-Time Profinet applications need to be sent instantly or be dropped. As a result the eNodeB drops RT Profinet packets if it cannot serve them directly, lowering the buffer load.

The last empirical observation was to analyze the drop rate for the different applications with and without machine learning. The drop rate for the different applications using machine learning can be seen in figure 4.17. Most of the Profinet alarms are not dropped, as they have the highest priority. PNIO_CM also has a low drop rate. This is because it has the highest priority of the Non-Real-Time applications, so the eNodeB buffers those packets and serves them later when it has resources available. PNIO_PS also has a lower drop rate than, for instance, PN_DCP, showing that the eNodeB now prioritizes important industrial applications.

The application packet drop rate without machine learning, for comparison, can be seen in figure 4.18. Here all RT Profinet applications have the same drop rate, showing that they have no internal priority. The Non-Real-Time applications have a lower drop rate as they can be served after being buffered.

The final gain to assess is how much the eNodeB serving rate needs to increase without a machine learning model in order to serve as many important industrial production line applications as with machine learning. The result can be seen in figure 4.19, where the starting serving rate was 60 Bytes per millisecond. The non-ML eNodeB, with no prior knowledge about the packets, needs to increase the rate to 80 Bytes/ms to achieve the same score as with ML. This corresponds to a 33.3% higher rate.


Figure 4.15: The accumulated score for different models, to show the potential gain of using a machine learning classifier to prioritize important industrial applications in the HMS data traffic.

The same is done for the DEFCON data traffic. Due to the increased traffic the eNodeB can now handle 200 Bytes/ms and the buffer is increased to 80 Bytes. As C4.5 achieves perfect classification for the industrial applications and none of them are misclassified as an office landscape application, the perfect machine learning model case from HMS is removed. The resulting score can be seen in figure 4.20, showing a similar picture to the HMS traffic stream. The buffer gain using machine learning can be seen in figure 4.21, leading to the same conclusion as before. The drop rate for each application when the eNodeB uses machine learning is shown in figure 4.22, and the drop rate when no machine learning is installed in the eNodeB is shown in figure 4.23. Based on the results, one can clearly see that in the DEFCON case the machine learning models also manage to prioritize important applications effectively.

Repeating the serving rate experiment, the eNodeB without machine learning needed to increase the serving rate by 50% to serve as many important application packets. This result is seen in figure 4.24.


Figure 4.16: The potential buffer benefit of using a machine learning model in the eNodeB vs without, for the HMS data traffic.

Figure 4.17: Drop rate for each application, using machine learning, for the HMS data traffic.


Figure 4.18: Drop rate for each application, without machine learning, for the HMS data traffic.

Figure 4.19: The serve rate increased by 33.3% for the eNodeB with no ML installed, to make sure as many important applications get served as when using machine learning, for HMS traffic.


Figure 4.20: The accumulated score for different models, to show the potential gain of using a machine learning classifier to prioritize important industrial applications in the DEFCON data traffic.

Figure 4.21: The buffer load for the DEFCON traffic, with and without machine learning.


Figure 4.22: Packet drop rate for the applications, using machine learning, in the DEFCON traffic.

Figure 4.23: Packet drop rate for the applications, without machine learning, in the DEFCON traffic.


Figure 4.24: The serve rate increased by 50% for the eNodeB with no ML installed, to make sure as many important applications get served as when using machine learning, for DEFCON traffic.

4.5 Conclusion
The goal of the report is to see whether it is possible to create a good network traffic classifier model for an industrial environment using machine learning, where the model can be used as a prioritization tool to make sure crucial traffic is not dropped. We used two traffic captures to test this, and in both cases the majority of ML models achieved good classification of the different applications. This suggests that it is possible to develop a sufficient machine learning model to classify important industrial Ethernet applications, making sure crucial traffic gets served instantly and the production does not terminate. We also tested the gain of using machine learning to show that important applications get served first; this was done by sampling traffic for both data-sets, where the eNodeB could make sure important packets got RUs first. The gain observations showed that an eNodeB without an ML classification model needs 33%-50% more RUs to serve as many important industrial packets during peak load as one that uses an ML model.

However, the models we found do not indicate that we have found a deployable solution. There are several reasons why these research models could not be used in factories. The first reason is that no model was best in both cases, indicating that one needs to do analytics at every factory to find the best model for that particular factory, which is not reasonable, especially if the goal is to allow the factory setup to be changed every day. The second reason is that some models had a very high classification delay. The eNodeB may then take too much time deciding whether a packet should be prioritized, so that the cyclic jitter threshold is exceeded and crucial communication is terminated. The third reason is that the models were only modified slightly from previous studies; with better tuning, another model might have been better.

The conclusion one can draw is that the research indicates that it is possible to build an effective application classifier model using machine learning, making sure important industrial packets get prioritized in a cellular network. Nevertheless, the thesis did not find one model that was best for both data-sets and that can handle any change in the network setup.

To find a deployable solution, we suggest an automated machine learning solution that might be able to adapt to any factory requirement and that uses different neural network compression techniques to reduce the classification delay. This proposed automated machine learning solution might answer whether machine learning will definitely work as a prioritization classification model for packets in an industrial environment.

Chapter 5

Discussion

5.1 Benefits
There seem to be several benefits of using machine learning to classify the application for the eNodeB. It leads to a lower drop rate for important applications and a better used buffer. The problem with this solution at the moment is that it would require several data scientists to find the optimal model for every factory data-set, which is not feasible. This can be solved by using a genetic algorithm, where a data scientist only defines the search space the algorithm should search in, allowing it to find an optimal solution for every factory. The eNodeB can then download the new improved model, resulting in factories that are always improving. Nicolò Vendramin inspired this idea, as he used a similar approach in his ongoing master thesis, using a genetic algorithm to find optimal hyperparameters and network architecture for a DNN. The second issue in the thesis was that some models might have too large a classification delay. A solution we are going to explore is to deploy compressed neural networks, which reduce the size of the classification model while keeping the classification performance. This results in faster classification.

How this auto ML concept could be applied in factories is shown in figure 5.1. The box represents storage, where a capturing program like Wireshark stores all previous traffic in the factory network. This stored traffic is later processed, and the processed data is used to train the newly generated model; this can be done using cloud computing, or locally to reduce the risk of sharing data. The eNodeB then downloads and uses the newly generated model. This would allow the ML classification model to improve over time.

5.2 Genetic algorithm
A genetic algorithm [32] is an algorithm for finding the best parameters based on an optimization criterion. The algorithm is inspired by natural selection in biology. The genetic algorithm finds the best parameters for the optimization problem in 5 stages:


Figure 5.1: Illustrative image showing how each factory eNodeB could be updated and improved over time, using some type of automated machine learning that tries to improve model classification performance and latency. The figure is generated using https://www.draw.io/.

1. First, one generates a population where each individual is assigned a random parameter setup from a defined feature space.

2. The algorithm selects the best individuals from the population, for instance the ones with the highest accuracy. This is called selection.

3. From these individuals the algorithm performs a crossover, where the parameters of two of the selected individuals are merged and used to create a new individual, called a child.

4. The child can mutate a few of the parameters; the idea of mutation is that it allows the algorithm to explore possibly better solutions.

5. This continues, with the aim that each iteration achieves a better score, until a stopping criterion is reached. The stopping criterion can, for instance, be a maximum number of iterations, or all individuals having very close scores for the optimization function.
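The five stages above can be sketched as follows. This is a toy illustration, not the setup of [32]: the search space, the population sizes and in particular the `fitness` function are hypothetical. In a real deployment `fitness` would train the candidate network and return its evaluation score; here it is a cheap stand-in so that the example runs instantly.

```python
import random

# Hypothetical search space defined by the data scientist.
SEARCH_SPACE = {
    "layers": [1, 2, 3, 4],
    "neurons": [16, 32, 64, 128],
    "activation": ["relu", "tanh", "sigmoid"],
}

def fitness(individual):
    """Stand-in for 'train the model and return its score' (assumption)."""
    score = -abs(individual["layers"] - 3)
    score -= abs(individual["neurons"] - 64) / 64
    score += 0.5 if individual["activation"] == "relu" else 0.0
    return score

def random_individual(rng):
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}

def crossover(parent_a, parent_b, rng):
    """Each parameter of the child is inherited from one of the two parents."""
    return {key: rng.choice((parent_a[key], parent_b[key])) for key in SEARCH_SPACE}

def mutate(child, rng, rate=0.2):
    """With probability `rate`, replace a parameter by a random value."""
    return {key: rng.choice(SEARCH_SPACE[key]) if rng.random() < rate else value
            for key, value in child.items()}

def genetic_search(generations=30, population_size=20, elite=5, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]  # stage 1
    for _ in range(generations):                  # stage 5: fixed iteration limit
        parents = sorted(population, key=fitness, reverse=True)[:elite]    # stage 2
        children = [mutate(crossover(*rng.sample(parents, 2), rng), rng)   # stages 3-4
                    for _ in range(population_size - elite)]
        population = parents + children           # keeping parents preserves the best
    return max(population, key=fitness)

best = genetic_search()
```

Because the selected parents are carried over unchanged, the best score in the population can never decrease from one iteration to the next.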

Paper [32] explored this for DNNs, testing a genetic algorithm to build the neural setup of the DNN model. The authors found that within 30 iterations the genetic algorithm achieved a better DNN model than their expert data scientist. This was tested on three different data-sets. The idea can be deployed in a factory to find, in an automated manner, the best machine learning model for the factory based on its current setup.


5.3 Compressed neural networks

Deep neural networks are great general function approximators, but the bottleneck of a DNN is its large computational and memory requirements. As a result, most deep neural networks cannot fit in, for instance, static random access memory (SRAM), forcing them into dynamic RAM (DRAM), where each reference takes longer to fetch [33]. The concept of compressed neural networks comes from Song Han's research together with Nvidia and Google Brain [33]. Han's motivation [33] was the observation that humans prune synapse connections between birth and adulthood. He wanted to see if the same concept could be applied to DNNs [33]. He found that DNN models could be compressed by between 10x and 49x in size while keeping the same performance. For instance, AlexNet [34], with a size of 240MB and an accuracy of 80.27%, could be compressed to 6.9MB (small enough to fit in SRAM) with 80.30% accuracy [33].

To achieve this he made several modifications to the network. The first step was to remove low-influence synapses, meaning connections between neurons. This was achieved by removing all connections with very small weights. He found that a sparsity of 50% - 70% generally preserves the classification performance of the model [33]. The next step applies the same idea to neurons that are rarely activated. With only these two operations he managed to reduce the size of AlexNet by a factor of 13. He later also found that the model performed very well even when the weight parameters were not precise. For example, weights of 1.01, 1.02, 0.98 and 0.99 can all be grouped to a weight of 1 while keeping the same classification performance. This reduces the amount of data referencing needed and thereby also the model parameter size. They later looked into more advanced compression techniques and model structures to reduce the model size even further, but that would not be necessary for our idea, because the genetic algorithm will build a model structure optimized for latency.
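The two operations described here, removing small weights and grouping similar weights, can be illustrated on a flat list of weights. The sketch below is a simplification under stated assumptions: the function names and the example weights are hypothetical, the grouping uses plain rounding to a fixed step as a stand-in for the weight sharing in [33], and no retraining is performed after pruning.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def share_weights(weights, step):
    """Snap each weight to the nearest multiple of `step` so nearby values coincide."""
    return [round(w / step) * step for w in weights]


# Made-up example weights: four near 1 and four near 0.
weights = [1.01, 1.02, 0.98, 0.99, 0.003, -0.004, 0.002, -0.001]

pruned = prune_by_magnitude(weights, sparsity=0.5)
shared = share_weights(pruned, step=0.1)

print(pruned)
print(shared)
print("distinct values to store:", len(set(shared)))
```

After both steps only a small set of distinct values remains, so the model can store short indices into a table of shared values instead of one full-precision number per connection.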

5.4 Creating new auto AI models

By merging genetic algorithms and compressed neural networks, one could automatically build a deep learning model that can outperform an expert data scientist and does not require any human interaction. This automated ML model would allow important factory applications to be prioritized in a wireless production environment. This would allow hardware and robotics engineers to build new factory devices that can perhaps move around and at the same time execute very fast operations on products, enabling a new type of future wireless factory. This new wireless factory can change its setup every day, and the automated ML model will adapt to classify as well as possible for that day's production line setup.

The idea of how both concepts would be merged in a factory plant is that the capture storage from figure 5.1 would capture previous packets and their true application. This is done offline. The capture is then sent to a computer that can, for instance, be in the company cloud. This computer executes the tasks shown in figure 5.2. First, it processes the storage and creates a relevant feature and label data-set, similar to how this thesis did it. Afterwards it splits the data-set into three parts: a training set, a testing set and a validation set [60%, 20%, 20%]. From here it deploys a genetic algorithm. The genetic algorithm can select different parameter setups. Each parameter setup represents a DNN design, where one parameter can for instance be the gradient optimization function; others can be the number of layers in the model, the activation function per layer, the number of neurons in each layer and finally the type of neuron in each layer. It has an optimization criterion based on classification performance and a maximum tolerance for classification delay.

The genetic algorithm then generates a population with random parameter setups, where each individual's parameter setup represents a model setup. Each model is trained using the training set and then compressed using the compressed neural network strategy. From here, the algorithm selects the individuals that meet the required classification delay threshold and have the best classification performance on the validation set. This iterates until, for instance, all of the population has very similar classification performance. The final ML model is thus designed for both fast classification and classification performance. The best generated model is then compared on the test set with the previous model deployed in the eNodeB. If it performs better than the previous model, the new solution is deployed to the eNodeB, as shown in figure 5.1.

Figure 5.2: Idea of how the future factory eNodeB function in the cellular network can improve over time. The figure was generated using https://www.draw.io/.
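The data split, the delay-constrained selection and the final deployment check of this workflow can be sketched as follows. This is a hedged illustration only: the function names, the dictionary fields and all numbers are assumptions invented for the example, and the training and compression steps are left out.

```python
import random


def split_dataset(samples, seed=0):
    """Shuffle and split into 60% training, 20% validation and 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def select_candidate(candidates, max_delay_ms):
    """Among models within the delay budget, pick the best validation accuracy."""
    feasible = [c for c in candidates if c["delay_ms"] <= max_delay_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["val_accuracy"])


def should_deploy(candidate, deployed):
    """Deploy only if the candidate beats the current model on the test set."""
    return candidate is not None and candidate["test_accuracy"] > deployed["test_accuracy"]


train, val, test = split_dataset(range(100))

# Made-up measurements for three trained and compressed models (assumption).
candidates = [
    {"name": "A", "val_accuracy": 0.97, "delay_ms": 4.0, "test_accuracy": 0.96},
    {"name": "B", "val_accuracy": 0.95, "delay_ms": 0.8, "test_accuracy": 0.95},
    {"name": "C", "val_accuracy": 0.91, "delay_ms": 0.3, "test_accuracy": 0.90},
]
deployed = {"name": "current", "test_accuracy": 0.93}

chosen = select_candidate(candidates, max_delay_ms=1.0)
print(chosen["name"], should_deploy(chosen, deployed))
```

In this example model A has the highest accuracy but exceeds the assumed 1 ms delay budget, so model B is selected and replaces the deployed model because it scores higher on the test set.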


5.5 Problems of using machine learning

A problem with using a machine learning model to map the application type to a priority for the eNodeB is that, if someone finds out which characteristics make the model classify a packet as important, they can trick the model into prioritizing packets that should not be important but have been altered to look like important packets. This could then bring down the wireless communication in the factory. The good thing is that the proposed design, using a genetic algorithm and compressed neural networks, could during the packet processing stage find out that those packets are not really important and be retrained to reduce their priority.

Another problem with this proposed general learning approach for building DNNs is that the same concept could be applied maliciously. The same technique can for instance be applied to classifying encrypted applications, discussed in section 2.6. This means the algorithm can be used to identify the application behind different encrypted packets, which together with other analytics might put people's privacy on the Internet at risk.

5.6 Concern about AI and the engineer's role in the future society

When innovative studies are carried out, one needs to look at the ethical problems that can arise from the results. When we continuously deploy new Artificial Intelligence (AI) models like the one in this thesis, we need to discuss the ethical problems. One ethical problem that the AI model proposed in this thesis can introduce is that it would replace every human worker whose task is to tune or modify similar AI models, as this AI would do it better [32]. This is perhaps not an issue in this case, but the same conclusion cannot be drawn for other industries such as warfare, data mining and social media, where the AI might turn out to be too good if the developer has a cruel goal. Examples include using a machine learning model to find possible recruiters for terrorist organizations through social media mining [35], or automating autonomous robot fighters, which could now use extremely low-latency wireless communication using 5G and a modification of my model. Another example is an AI agent that trades extremely well, which most likely only the top 1% of society will have access to, making them even richer and increasing the gap between the poor and the wealthy.

Elon Musk, the CEO of SpaceX and Tesla and chairman of OpenAI, has a great deal of concern about AI because of its rapid improvement. He has asked the United Nations to regulate AI in areas like weapon systems, as he believes AI is a larger danger than nuclear weapons [35]. Other people with high influence who have raised similar concerns about AI are, for example, Stephen Hawking and Bill Gates [35]. There is so much AI can already solve that there should be a lot of ethical discussion before deploying AI models. AI is a double-edged sword, because at present no one gets rewarded when it performs well and no one gets penalized when it performs poorly. The role of engineers in the future society cannot only be to develop new AI products, but also to question their companies' intentions with them. An AI engineer might need to ask herself or himself how their products could be altered for malicious purposes, and based on that decide whether the development should continue. AI is not here to take part, it is here to take over.

References

[1] PROFIBUS and P. International. (2018, Feb) Real-time communication power point. From the one who standardized PROFINET Protocol. [Online]. Available: http://www.profibus.org.pl/index.php?option=com_docman&task=doc_view&gid=28

[2] A. G. Gotsis, A. S. Lioumpas, and A. Alexiou, “M2M scheduling over LTE: Challenges and new perspectives,” IEEE Vehicular Technology Magazine, vol. 7, no. 3, pp. 34–39, Sept 2012.

[3] I. Cisco Systems and I. A. r. r. Rockwell Automation, Converged Plantwide Ethernet (CPwE) Design and Implementation Guide. Customer Order Number: Text Part Number: OL-21226-01 Document Reference Number: ENET-TD001E-EN-P, 2011 January, chapter: Converged Plantwide Ethernet Solution.

[4] G. Salvi. (HT 2017) Power point: Classification with separating hyperplanes, school: Royal institute of technology (kth), course name: machine learning, dd2421. [Online]. Available: https://kth.instructure.com/courses/3180/pages/lectures

[5] A. Nikishaev. (2018-03-2) How to debug neural networks. Figure credit taken from website. [Online]. Available: https://cdn-images-1.medium.com/max/1600/1*DRKBmIlr7JowhSbqL6wngg.png

[6] A. Nasrallah, A. Thyagaturu, Z. Alharbi, C. Wang, X. Shao, M. Reisslein, and H. ElBakoury, “Ultra-Low Latency (ULL) Networks: A Comprehensive Survey Covering the IEEE TSN Standard and Related ULL Research,” ArXiv e-prints, Mar. 2018.

[7] PROFIBUS and P. T. C. S. L. PROFINET International. An Introduction to PROFINET Frame Analysis using. From the one who standardized PROFINET Protocol. [Online]. Available: https://profibusgroup.files.wordpress.com/2013/01/w4-profinet-frame-analysis-peter-thomas.pdf

[8] PROFIBUS and P. International, PROFINET System Description. PROFIBUS Nutzerorganisation e.V. Haid-und-Neu-Straße 7 76131 Karlsruhe Germany, 2009, vol. System Description Version April 2009 Order number 4.132.


[9] Svenska FN-förbundet. (2018, May) KONSUMTION. [Online]. Available: http://varldskoll.se/fokus/konsumtion

[10] C. T. o. Mariya Yao and head of production at Metamaven. (2018, May) Future factories: How AI enables smart manufacturing. [Online]. Available: https://www.topbots.com/future-factories-ai-enables-smart-manufacturing-industrial-automation/?utm_medium=article&utm_source=Medium&utm_campaign=futurefactories

[11] b. C. H. Posted July 2nd. WHY PROFINET? From Profinet US departments reasons. [Online]. Available: https://us.profinet.com/why-profinet/

[12] J. Sivén, “Securing profinet networks,” Helsinki Metropolia University of Applied Sciences, Metropolian, Bachelor of Engineering Information Technology Thesis 11 May 2015, pp. 1–49, 2017.

[13] S. A. 2010, “Automate with the leading industrial ethernet standard and profit now.” 90026 NÜRNBERG GERMANY, vol. HOF/25228 GM.016XX.52.0.06 WS 04108.0, p. 9, 2010.

[14] Profinet and Profibus. (2013, Jan) Profinet Frame Analysis Workshop. [Online]. Available: https://profibusgroup.files.wordpress.com/2013/01/w4-profinet-frame-analysis-handout-peter-thomas.pdf

[15] PROFIBUS and P. International. Ident Numbers. From the protocol website. [Online]. Available: https://www.profibus.com/products/ident-numbers/

[16] A. global initiative. About the 3rd Generation Partnership Project. [Online]. Available: http://www.3gpp.org/about-3gpp

[17] R. Nossenson, “Long-term evolution network architecture,” in 2009 IEEE International Conference on Microwaves, Communications, Antennas and Electronics Systems, Nov 2009, pp. 1–4.

[18] G. P. Fettweis, “5G and the future of IoT,” in ESSCIRC Conference 2016: 42nd European Solid-State Circuits Conference, Sept 2016, pp. 21–24.

[19] F. Mestanov, Researcher at System Technology (Ericsson AB), Private discussion and Skype conversation, May. 21 2018.

[20] Cisco and R. automation, “OEM networking within a converged plantwide ethernet architecture,” White Paper, vol. Document Reference Number: ENET-WP018A-EN-P, 2017.

[21] C. .-. P. S. F. L. S. P. P. P. by Rackspace. python. [Online]. Available: https://pypi.org/project/pyshark/

[22] G. Combs. Go Deep. [Online]. Available: https://www.wireshark.org/


[23] P. S. Foundation. pyshark 0.3.7.11. [Online]. Available: https://www.python.org/

[24] PANDAS. (0.22.0 Documentation) Intro to Data Structures. [Online]. Available: https://pandas.pydata.org/pandas-docs/stable/dsintro.html

[25] J. Li, “Decision trees modified from Jiawei Han, University of Illinois,” University of South Australia, course Data mining COMP 4008, 2017 Week 3.

[26] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, vol. 1, ISBN: 9780262035613. [Online]. Available: http://www.deeplearningbook.org/

[27] W. Zhou, L. Dong, L. Bic, M. Zhou, and L. Chen, “Internet traffic classification using feed-forward neural network,” in 2011 International Conference on Computational Problem-Solving (ICCP), Oct 2011, pp. 641–646.

[28] A. S.-E. Manuel Lopez-Martin, Belen Carro and J. Lloret, “Network traffic classifier with convolutional and recurrent neural networks for internet of things,” IEEE Access, vol. 5, pp. 18 042–18 050, 2017.

[29] F. Chollet. Keras: The python deep learning library. KERAS. [Online]. Available: https://keras.io/

[30] GOOGLE. An open source machine learning framework for everyone. GOOGLE. [Online]. Available: https://www.tensorflow.org/

[31] Scikit-learn. Machine learning in python. Funding provided by INRIA and others. [Online]. Available: http://scikit-learn.org/stable/

[32] T. Shinozaki and S. Watanabe, “Structure discovery of deep neural network based on evolutionary algorithms,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 4979–4983.

[33] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” ArXiv e-prints, Feb. 2016.

[34] G. E. H. Alex Krizhevsky, Ilya Sutskever. ImageNet Classification with Deep Convolutional Neural Networks. [Online]. Available: http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf

[35] A.-R. Sadeghi, “AI industrial complex: The challenge of AI ethics,” IEEE Security & Privacy, vol. 15, no. 5, pp. 3–5, September/October 2017.


TRITA-EECS-EX-2018:268

www.kth.se
