
ISeCure, The ISC Int'l Journal of Information Security

January 2018, Volume 10, Number 1 (pp. 29–43)

http://www.isecure-journal.org

Encrypted Traffic Classification Using Statistical Features

Ehsan Mahdavi 1,∗, Ali Fanian 1, and Homa Hassannejad 1

1 Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran

ARTICLE INFO

Article history:

Received: 13 August 2017

First Revised: 31 December 2017

Last Revised: 4 January 2018

Accepted: 28 January 2018

Published Online: 31 January 2018

Keywords: Internet Traffic Classification, Encrypted Traffic, Graph Theory.

Abstract

Traffic classification plays an important role in many aspects of network management, such as identifying the type of transferred data, detecting malware applications, and applying policies to restrict network access. Early methods in this field used obvious traffic features such as port number and protocol type to classify traffic. However, recent changes in applications make these features unreliable for such tasks. As a remedy, network traffic classification using machine learning techniques is now evolving. In this article, a new semi-supervised learning method is proposed which utilizes clustering algorithms and label propagation techniques. The clustering part is based on graph theory and the minimum spanning tree algorithm. In the next level, some pivot data instances are selected for an expert to vote on their classes, and the identified class labels are then used for similar data instances with no labels. In the last part, the decision tree algorithm is used to construct the classification model. The results show that the proposed method classifies encrypted traffic of network applications precisely and accurately. It also provides desirable results for plain un-encrypted traffic classification, especially for unbalanced streams of data.

© 2018 ISC. All rights reserved.

1 Introduction

∗ Corresponding author. Email addresses: [email protected] (E. Mahdavi), [email protected] (A. Fanian), [email protected] (H. Hassannejad). ISSN: 2008-2045 © 2018 ISC. All rights reserved.

Network traffic classification provides many features for network administrators. Many network management tasks, like bandwidth management and quality of service guarantees, rely on accurate network traffic classification. Accurate classification gives Internet Service Providers (ISPs) reliable means of improving the quality of their services with respect to application-specific requirements. Therefore, the first step to achieve these goals is to identify and classify the traffic of applications and services in the network. On the other hand, the growth of malware and of its techniques for evading Intrusion Detection Systems (IDSs) and bypassing network firewalls gives traffic classification an important role in network security and in countering cyber threats. Current network traffic classification methods can be categorized into the following four major classes:

• Port based classification: traditionally, traffic classification was done using the port numbers defined by IANA for the transport layer. Although this is a quick and simple way to classify the traffic of basic applications, it is becoming more and more ineffective with the growth of new applications such as Peer-To-Peer (P2P) applications. P2P applications tend to bypass firewalls and IDSs by using non-standard port numbers or by masquerading behind the port numbers of other standard applications [1].
• Payload based classification: in this category, the whole payload of each network packet is analyzed to identify the traffic type. The first major challenge for this method is that analyzing the payload of every single packet requires heavy processing. In addition, the contents of encrypted packets are opaque to these methods. These restrictions make them unusable in such cases. Respecting the privacy of users is another important challenge which these methods face.
• Host based classification: in this category, the behavioral patterns of hosts are modeled. This is done with heuristic methods, without using packet contents. For example, these patterns could contain information about connected hosts, the number of different ports in a connection and the transport protocols used in a connection. The accuracy of these methods is strictly tied to the topological position of the assets being watched. In recent research, these methods are not used by themselves but in conjunction with other methods.
• Feature based classification: considering the stated limitations of the above categories, feature based classification methods have received more attention. In this category a set of statistical features is used as the classification criterion. This approach assumes that the network layer traffic of each application has unique statistical features which can be used to distinguish it. Packet length distribution, the idle time between streams, the inter-arrival time of packets and the length of packets are among these features. These methods analyze the values of these features instead of analyzing packet payloads. This brings high efficiency for fast network transactions and also privacy for the users [2–5]. Moreover, as packet payloads are not needed to obtain the features, these methods can also be used to classify encrypted network traffic [3, 4, 6–8].

Many feature based classification methods use machine learning algorithms to determine the traffic labels. In this regard, machine learning algorithms fall into three categories: supervised, un-supervised and semi-supervised [2–4, 9]. Supervised algorithms need pre-classified data instances to build a classification model, which can later be used to classify new instances. Although supervised algorithms are highly accurate, generating fully labeled data is cumbersome and in some cases impossible. Un-supervised algorithms put data instances with similar features in the same groups. These methods do not need pre-labeled data, so they are considered cheap and simple, but their performance is not satisfactory. Semi-supervised algorithms sit somewhere in the middle: they aim to combine the benefits of the other two categories while avoiding their deficiencies.

A new classification method is proposed in this article to classify application layer protocols. The method, which is a semi-supervised algorithm, can be used efficiently for encrypted traffic such as SSH [10]. The proposed method constructs its classification model from datasets in which only 10% of the instances are labeled. In the first step, the method clusters the dataset. After the clusters and the streams in them are labeled, a classification algorithm is trained on the labeled streams, and the resulting classification model can be used to classify new network streams.

Network streams of each Internet protocol form clusters of different shapes. By utilizing a graph based clustering algorithm and applying MST algorithms to the clusters, the proposed method is able to distinguish clusters of different shapes and densities as well as clusters of imbalanced sizes. Experimental results show the quality of the clusters and of the classification model built from them.

The remaining sections of this article are organized as follows. In Section 2 we review the related work in the literature. The proposed method is explained in Section 3. Section 4 assesses the efficiency of the proposed method, and finally in the last section we conclude the paper.

2 Related Work

As mentioned before, port based and payload based methods face important limitations. As a result, recent methods mostly rely on statistical features to identify different applications. The rationale behind these methods is that the traffic of different network applications has unique features which make the applications distinguishable from each other.

Moore and Zuev [5] utilized Bayesian analysis and showed that the naïve Bayes classifier can be efficient with slight modifications. These modifications consisted of two add-ons. The first was kernel density estimation: while the naïve Bayes classifier builds its model using a normal distribution, kernel estimation methods can utilize different kernel functions to build it. The second was a fast correlation technique for selecting optimal features. With these modifications, the Bayes classifier achieved a classification accuracy of 96% for different network traffic.

Utilizing naïve Bayes classifiers is an ongoing line of work. Zhang et al. [11] showed that such classifiers can be used efficiently for encrypted traffic classification with few labeled data in hand. They group correlated flows into Bags of Flows (BoFs), then process these BoFs and aggregate the naïve Bayes predictions within each bag. Extensive experiments on real world data sets demonstrated the performance of their method.

In [12] the efficiency of the C4.5 decision tree algorithm, neural networks, naïve Bayes and naïve Bayes tree classifiers is compared. The results showed that the C4.5 algorithm is faster than the others. Kim et al. [13] compared the results of 9 different learning algorithms. The results showed that Support Vector Machine (SVM) classifiers are more accurate. The total accuracy of SVM classifiers was reported to be 94.2%.

Speaking of supervised learning, deep learning is on the rise. Lotfollahi et al. claim to be the first to propose a deep learning based method for encrypted traffic classification [14]. Their framework, Deep Packet, combines a stacked autoencoder and a convolutional neural network (CNN) for network traffic classification. They also claim that deep learning outperforms other methods in encrypted network traffic classification.

A recent study [15] compared 6 classification models, namely logistic regression, support vector machine, naïve Bayes, k-nearest neighbor, Random Forest (RF) and Gradient Boosting Tree (GBT). It showed that the ensemble methods RF and GBT, with some optimization, can outperform the other classifiers mentioned. Those optimizations were applied to the time-related features of the datasets.

One of the earliest unsupervised learning methods for network traffic classification was proposed by McGregor in [16]. This research concentrated on common applications like HTTP, FTP, SMTP, IMAP, NTP and DNS.

In [17] clustering algorithms such as k-means, Gaussian Mixture Model (GMM), hidden Markov model and spectral clustering were used to cluster TCP connections. The learning phase outputs two different sets. The first set contains a description of the clusters and the second contains the applications distinguished by each cluster. In the classification part, a stream of packet headers is used as input, and information such as source and destination port numbers, source and destination IP headers, connection protocol and the packet lengths of the first P packets of each TCP connection is extracted.

In [18] a new unsupervised learning method is introduced. In the first step of the method, statistical features are used to build the clusters, and the bag of words model is used to describe the contents of each cluster. Bag of words is a common model for describing documents in information retrieval and natural language processing.

In [19] a semi-supervised algorithm is introduced which consists of two steps, clustering and classification. The latter assigns cluster labels. For the clustering, the K-means algorithm is used. The data instances are divided into K clusters with respect to their feature values, so that data instances in the same cluster have similar feature values. K is a natural number which specifies the number of clusters and must be defined before the algorithm runs. A central instance is defined for each cluster, and the other data instances are assigned to clusters according to their distance from these central instances. Each time data are added to a cluster its central instance may be updated, and this continues until the algorithm becomes stable and no changes are made between iterations.

In [20] a semi-supervised method is proposed using a constrained clustering algorithm. Constrained clustering works according to Must-link and Cannot-link constraints. Must-link constraints specify which data instances should be in the same cluster. Cannot-link constraints, on the other hand, forbid some data instances from being in the same cluster. In the network flow context, the correlation between data streams might be used as Must-link relations.

3 The Proposed Method

In this section, a novel semi-supervised method is proposed to classify network traffic. The purpose of the method is to classify encrypted traffic of application layer protocols. Although the performance evaluation section shows that the proposed method can classify different application layer protocols, the focus is on SSH because SSH encrypted datasets are available. The proposed classification method utilizes graph theory clustering, the MST algorithm and the C4.5 decision tree. It constructs a classification model in two major consecutive steps, clustering and classification. In the proposed method the class of network traffic is determined automatically. Figure 1 shows a generalized structure of the proposed method.

The classification model is built in two steps, clustering and classification. In the first step, clustering is done with respect to statistical feature values. Using un-labeled network flows, we try to build high quality clusters. The clustering is done using graph theory and the MST algorithm. Afterwards, the limited number of labeled data instances are used for labeling the clusters. Some clusters might not contain any labeled data instances, so their label is determined by asking an expert. Then, depending on the intra-cluster distances, clusters with the same label might be merged. In the end, labeling the clusters gives every data instance a label, and the labeled instances are used as input to the classification step. The classification model is obtained using C4.5. Finally, this semi-supervised model is able to classify new network streams using their statistical features. In the next subsections each part of the proposed model is explained.

Figure 1. General structure of the proposed method

3.1 Learning Phase

The learning phase in the proposed method is a semi-supervised classification. The input to this phase is a collection of training data instances, of which at most 10% are labeled, and learning is done with respect to the values of their features. If D is the training set, then D = L ∪ U, in which L is the collection of labeled and U is the collection of un-labeled network streams. The learning phase of the proposed scheme can be divided into these 3 blocks:

• Clustering,
• Labeling clusters and merging them,
• Building the classifier model.

The first part of the learning phase, i.e., clustering, utilizes a graph based algorithm. Nodes in this graph represent the network flows, labeled and un-labeled, and the edges carry the distances between these flows. The graph of flows is clustered using graph algorithms such as the MST algorithm. Using the labeled network flows, labels are assigned to the un-labeled flows. Finally, utilizing these labeled data, a supervised learning method is used to build a classifier. The details of each part are given in the following.

3.1.1 Clustering

The final goal of the clustering part is to divide data instances into relevant clusters with respect to the structure of the dataset. This is done by utilizing data similarity measures. For the clustering we assume that there is no initial knowledge about the dataset, such as the number of clusters or the distribution of data instances amongst them. Clustering is a powerful means in the fields of machine learning and data mining.

Differences in the distributions of datasets cause the constructed clusters to have arbitrary shapes, different densities and even un-balanced sizes. However, most clustering methods assume that some initial information, such as the number of clusters, is available, and they do not consider issues related to unbalanced data. The proposed method addresses these issues. In this section a clustering algorithm based on graph theory and MST is used.

As shown in Figure 2, the final clusters are obtained after two levels of clustering. In both levels the network flows are considered as nodes of the graph, and the distances between them form the edges of the trees. Separation of the flows and formation of the clusters is done by eliminating edges of the resulting trees. The elimination of the edges is inspired by the method proposed in [21]. In the second level, if needed, the impure clusters of the first level are decomposed into new smaller clusters using a new method.

As mentioned before, $D = L \cup U$ is a set consisting of labeled network flows $L = \{x_i^{(l)}, i = 1, \dots, n\}$ and un-labeled network flows $U = \{x_i, i = 1, \dots, m\}$. For each network flow, $x_i = (x_{i1}, x_{i2}, \dots, x_{ij}, \dots, x_{id}) \in \mathbb{R}^d$ represents the set of statistical features and $x_{ij}$ is a single feature of the flow $x_i$. The labels of the flows are from the set $B$ ($l \in B$).

First a weighted graph is built using the set of training samples D. As noted before, the network flows (both labeled and unlabeled) are the nodes of the graph, and the distances between them form the edges of the graph.

Figure 2. Clustering scheme in the proposed method.

Definition 1. Weighted Undirected Graph: Graph $G(X) = (V, E)$ is a weighted undirected graph in which $V$ is the set of vertices and $E = \{(x_i, x_j) \mid x_i, x_j \in X, i \neq j\}$ is the set of graph edges. Each edge $e = (x_i, x_j)$ in the graph has the length (weight) $d(x_i, x_j)$.

In the proposed method, $G(X) = (V, E)$ is a complete weighted undirected graph. In $G$, $V = D$ is the set of graph vertices and $E = \{(x_i, x_j) \mid x_i, x_j \in D, i \neq j\}$ is the set of graph edges. Each edge $e = (x_i, x_j)$ of the graph has a length equal to $d_{mix}(x_i, x_j)$, the defined distance between the two network streams. Common distance functions like Euclidean, Mahalanobis or other arbitrary distance functions can be used. In this research, instances are network flows and their features can be either numerical or nominal. So we use a mixed distance function that combines the Euclidean distance with a distance for nominal features. The distance function is shown in Equation 1.

$$d_{mix}(x_i, x_j) = \sqrt{d_n(x_i, x_j) + d_c(x_i, x_j)} \tag{1}$$

$$d_n(x_i, x_j) = \sum_{d_k \in NUM} (x_{i,k} - x_{j,k})^2 \tag{2}$$

$$d_c(x_i, x_j) = \sum_{d_k \in CAT} \delta(x_{i,k}, x_{j,k}) \tag{3}$$

where $\delta(a_1, a_2) = 1$ if $a_1 \neq a_2$ and $\delta(a_1, a_2) = 0$ if $a_1 = a_2$. Here $d_{mix}$ is the distance function that we use in this article, $d_n$ is the (squared) Euclidean distance over the numerical features and $d_c$ is the distance over the nominal feature values. A clustering algorithm divides the dataset $X$ into $K$ clusters $C_1, C_2, \dots, C_K$. The following conditions hold for this division:

$$C_i \neq \emptyset, \qquad C_i \cap C_j = \emptyset \ (i \neq j), \qquad X = C_1 \cup C_2 \cup \dots \cup C_K, \qquad i, j = 1, \dots, K \tag{4}$$
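The mixed distance of Equations 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the square root is taken over the sum of both terms, features are not normalized, and the function name `mixed_distance` and the index arguments are hypothetical rather than taken from the paper.

```python
import math

def mixed_distance(x, y, numeric_idx, nominal_idx):
    """Mixed distance between two flows (Equations 1-3, illustrative only).

    numeric_idx: positions of numerical features (the set NUM).
    nominal_idx: positions of nominal features (the set CAT).
    """
    # Equation 2: sum of squared differences over numerical features.
    d_n = sum((x[k] - y[k]) ** 2 for k in numeric_idx)
    # Equation 3: count of mismatching nominal features (delta = 1 on mismatch).
    d_c = sum(1 for k in nominal_idx if x[k] != y[k])
    # Equation 1 (assumption: the root covers both terms).
    return math.sqrt(d_n + d_c)

# Two made-up flows: two numerical features and one nominal feature.
flow_a = (3.0, 4.0, "tcp")
flow_b = (0.0, 0.0, "udp")
d = mixed_distance(flow_a, flow_b, numeric_idx=(0, 1), nominal_idx=(2,))
```

In practice the numerical features would probably need scaling so that a single large-valued feature does not dominate the nominal mismatches, but the paper does not state how this is handled.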

Since we represent the collection of network streams by means of a graph, clustering them amounts to dividing the graph into K subgraphs, each representing a cluster. To construct the clusters, one needs to define a separation criterion. In a graph of network streams, defining a separation criterion means finding incompatible edges. Incompatible edges are edges which connect irrelevant network flows, and they are the best candidates for elimination during clustering. Topologically, incompatible edges are the inter-cluster edges whose elimination forms the subgraphs (clusters) correctly. A proper separation criterion distinguishes inter-cluster edges from intra-cluster ones and produces correct network traffic classes by eliminating them.

Figure 3. First and second MSTs [22].

The proposed method uses Minimum Spanning Trees in the graph clustering phase. A minimum spanning tree (MST), or minimum weight spanning tree, is a subset of the edges of a connected, edge-weighted (un)directed graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. However, using only the first MST may not produce reliable clusters. Given the nature of a tree, removing a single edge results in two separate subgraphs. Moreover, the 1st MST might not contain enough edges to represent the original graph, which means that it cannot represent the network flows correctly. So we use the 1st and 2nd MSTs together and apply the clustering algorithm to their combination.
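A minimal sketch of building the two trees is given below. The text does not spell out how the 2nd MST is obtained, so the sketch assumes the common two-round construction: T2 is the minimum spanning tree of the graph that remains after the edges of T1 are removed. Kruskal's algorithm is used for convenience; the function names and the toy graph are hypothetical.

```python
def kruskal(n, edges):
    """Minimum spanning tree (or forest) of nodes 0..n-1.

    edges: iterable of (weight, u, v) tuples.
    """
    parent = list(range(n))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def two_round_mst(n, edges):
    """Return (T1, T2); T2 is the MST of the graph without T1's edges (assumption)."""
    t1 = kruskal(n, edges)
    used = set(t1)
    t2 = kruskal(n, [e for e in edges if e not in used])
    return t1, t2

# Complete graph on 4 made-up flows; node 3 lies far from the others.
edges = [(1, 0, 1), (1, 1, 2), (2, 0, 2), (10, 0, 3), (9, 1, 3), (8, 2, 3)]
t1, t2 = two_round_mst(4, edges)
combined = t1 + t2  # edge set of the 1st and 2nd MSTs based graph
```

Because the two trees share no edges under this construction, separating node 3 from the rest requires removing more than one edge of the combined graph, which is the property the clustering step relies on.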

In Figure 3, the first and second MSTs of an arbitrary graph are shown in sections 1 and 2 respectively. Sections 3 and 4 of the figure show the graph of those MSTs combined. Specifically, section 4 shows the subsets that can be separated with respect to their distances. Edge ab is a correct candidate for elimination, which results in a valid separation. This edge can be considered incompatible.

Regarding the edge ab of Figure 3, a criterion fordetermining incompatible edges can be defined asfollows: “Each edge with length significantly greaterthan the average length of other edges connectedto the both ends of that edge, are considered asincompatible and should be eliminated” [21, 22].

This criterion only covers separation of clusters with respect to their distances. Another measure commonly used as a separation criterion in clustering algorithms is cluster density. With reference to Figure 3, it is stated in [22] that the major difference between separating clusters by distance and by density is that the average lengths of the edges connected to the end vertices of incompatible edges behave differently under the two measures. In clusters that are distinguishable by distance, the stated average lengths are commonly homogeneous, while in clusters distinguishable by density they differ significantly.

To take these differences into account, a new criterion can be defined which covers both measures. In the combined graph of the first and second MSTs, at least two edges must be eliminated to obtain a graph cut.

Definition 2. In graph $G(X) = (V, E)$ with $e \in E$, if eliminating edge $e$ yields two subgraphs $G_1(X) = (V_1, E_1)$ and $G_2(X) = (V_2, E_2)$ such that $V = V_1 \cup V_2$, then a graph cut has happened. To identify the inter-cluster edges, the length of each edge is compared with the minimum of the average lengths of the edges connected to the vertices at its two ends. This comparison can be expressed as a ratio.

Graph $G_{mst}(X) = (V, E_{mst})$ is the graph obtained by combining the 1st and 2nd MSTs, in which $e_{ab} \in E_{mst}$ and $a, b \in V$. The weight of edge $e_{ab}$ is defined as follows:

$$w(e_{ab}) = \frac{d(e_{ab}) - \min\left(avg(E_a - \{e_{ab}\}),\ avg(E_b - \{e_{ab}\})\right)}{d(e_{ab})} \tag{5}$$

where $E_a = \{e_{ij} \mid (e_{ij} \in E_{mst}) \wedge (i = a \vee j = a)\}$, $E_b = \{e_{ij} \mid (e_{ij} \in E_{mst}) \wedge (i = b \vee j = b)\}$ and $avg(E) = \frac{1}{|E|}\sum_{e \in E} d(e)$. As mentioned before, $d(e)$ is the defined distance between network flows. To define the weight of an edge, consider the edges connected to both of its ends. Compute the average length of those edges at each end and take the smaller of the two averages. The weight of the edge is then governed by the ratio of this minimum to the edge's own length: the smaller this ratio, the greater the weight, and vice versa. So, the larger the weight of an edge, the more probable it is to be an inter-cluster edge and the better a candidate it is for elimination. Figure 3 shows a problem of cluster separation by distances. According to Figure 3, edge ab is inter-cluster and should be removed. Edge cd is an intra-cluster edge and should remain. The weights of ab and cd are 10 and 5 respectively. Therefore, edge ab is a better candidate to be eliminated.

The weight defined in Equation 5 considers only the local edges connected to both ends of the edge being checked. This is where some exceptions arise. If the edge being checked is an intra-cluster edge far from any inter-cluster edge, but with a weight greater than the average of those local edges, it will be removed. To remedy this exception and other similar problems, the weight of an edge in the 1st and 2nd MSTs based graph is defined as:

$$w(e_{ab}) = \delta \times \frac{d(e_{ab}) - \min\left(avg(E'_a - \{e'_a\}),\ avg(E'_b - \{e'_b\})\right)}{d(e_{ab})} + (1 - \delta) \times d(e_{ab}) \tag{6}$$

where $E'_a = E_a - \{e_{ab}\}$, $e'_a = \arg\max_{e \in E'_a} d(e)$, $E'_b = E_b - \{e_{ab}\}$, $e'_b = \arg\max_{e \in E'_b} d(e)$ and $\delta$ is a penalty factor for the length of the edge, with $0 \leq \delta \leq 1$. The terms $E'_a - \{e'_a\}$ and $E'_b - \{e'_b\}$ omit the longest edge when computing the weight of $e_{ab}$. The penalty factor $\delta$ sets the trade-off between the length of the edge and the ratio $\frac{d(e_{ab}) - \min(avg(E'_a - \{e'_a\}),\ avg(E'_b - \{e'_b\}))}{d(e_{ab})}$.
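The two weight definitions can be illustrated with one small function. It is a sketch, not the authors' implementation: the combined graph is assumed to be stored as a dictionary from `frozenset({u, v})` to edge length, the names `local_average` and `edge_weight` are hypothetical, and two corner cases the text leaves open are resolved by assumption (the longest edge is only dropped when at least one other edge remains, and an end with no other incident edge falls back to the edge's own length).

```python
def local_average(vertex, edge, length, trim_longest):
    """Average length of the edges at `vertex`, excluding `edge` itself."""
    incident = [d for e, d in length.items() if vertex in e and e != edge]
    if trim_longest and len(incident) > 1:
        incident.remove(max(incident))  # Equation 6: omit the longest local edge
    if not incident:
        return length[edge]  # assumption: no neighbours gives a zero ratio term
    return sum(incident) / len(incident)

def edge_weight(edge, length, delta=1.0, trim_longest=True):
    """Weight of an edge of the combined graph.

    delta=1.0 with trim_longest=False corresponds to Equation 5;
    otherwise the function follows Equation 6.
    """
    a, b = tuple(edge)
    d = length[edge]
    local = min(local_average(a, edge, length, trim_longest),
                local_average(b, edge, length, trim_longest))
    return delta * (d - local) / d + (1.0 - delta) * d

# Made-up example: a long edge a-b between two groups of short edges.
length = {
    frozenset({"a", "b"}): 10.0,
    frozenset({"a", "c"}): 1.0,
    frozenset({"a", "d"}): 1.0,
    frozenset({"b", "e"}): 2.0,
    frozenset({"b", "f"}): 2.0,
}
ab = frozenset({"a", "b"})
w_eq5 = edge_weight(ab, length, delta=1.0, trim_longest=False)
w_eq6 = edge_weight(ab, length, delta=0.5)
```

With `delta` below 1 the raw length enters the weight directly, so long edges keep a high rank even when their neighbourhood is also made of long edges.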

The weight defined for the 1st and 2nd MSTs based graph has several useful properties:

• In a graph, the weight of inter-cluster edges is considerably larger than that of the others. This helps to find graph cuts and to form subgraphs (clusters) by removing edges in descending order of weight.
• Inter-cluster edges are distributed equally between T1 (1st MST) and T2 (2nd MST). This feature facilitates determining incompatible edges.
• Apart from the inter-cluster edges, the edges with the largest weights reside in T2.

According to the above properties, a suitable criterion for determining incompatible edges in the 1st and 2nd MSTs based graph is to find the edge with the largest weight whose removal results in a valid graph cut. Edges are considered for elimination in descending order of weight. $Rank(E_{mst})$ is the list of the edges of the 1st and 2nd MSTs based graph, sorted in descending order of weight.

After each edge is removed, the graph is checked for a cut. In the 1st and 2nd MSTs based graph, each cut is obtained by removing at least 2 edges, so each graph cut is associated with a set of removed edges. Let $E_{gcut}$ denote this set. If the following inequality holds, the graph cut is valid and yields 2 subgraphs, each of which is considered a cluster:

$$Ratio(E_{gcut}) = \frac{\min\left(|E_{gcut} \cap T_1|,\ |E_{gcut} \cap T_2|\right)}{|E_{gcut}|} \geq \lambda \tag{7}$$

Otherwise the cut is invalid and the whole graph must remain as one cluster. λ is an empirical threshold. Accordingly, incompatible edges are the edges with the largest weights in $Rank(E_{mst})$ that satisfy Equation 7, which means that removing them makes a valid graph cut. In graph separation with respect to distances, the number of edges removed from T1 is approximately equal to the number removed from T2, so λ = 0.5 would be expected to be a suitable choice. This does not hold for graph separation with respect to cluster densities. Our experiments showed that λ = 0.33 is a suitable choice for the threshold, as it best covers both measures. By eliminating incompatible edges, the 1st and 2nd MSTs based graph is divided into clusters. For each cluster, incompatible edge removal continues until no more incompatible edges can be found, which means that the cluster cannot be divided into smaller ones. The algorithm of the process is as below:

Algorithm 1 Clustering algorithm

Input: G(X) = (V, E), the graph of the training network flows
Output: out, a set of clusters (subgraphs)
1: Find the 1st and 2nd MSTs, T1 and T2
2: Build the 1st and 2nd MSTs based graph Gmst(X)
3: Create tables open = {Gmst(X)} and closed = ∅
4: If open = ∅, set out = closed and end the algorithm
5: If open ≠ ∅, grab G′mst = (V′, E′) from open
6: Calculate the weight of the edges in G′mst = (V′, E′)
7: Create the list Rank(E′)
8: Find the best edge candidate for removal from Rank(E′)
9: If a graph cut happened, go to 11
10: If a graph cut did not happen, go to 8
11: If the graph cut is valid, add subgraphs G′1 and G′2 to the table open and go to 4
12: If the graph cut is not valid, move G′mst = (V′, E′) from open to closed and go to 4
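Steps 8 to 12 of Algorithm 1, together with the validity test of Equation 7, can be sketched as follows. This is a simplified illustration under assumptions that are not stated in the paper: edges are stored as `(u, v)` tuples, T1 and T2 are disjoint edge sets, the edge weights are precomputed and passed in as a dictionary, and $E_{gcut}$ is taken to be every edge removed before the graph first splits. The names `components` and `try_cut` are hypothetical.

```python
def components(nodes, edges):
    """Connected components of an undirected graph, found by depth-first search."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, parts = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        parts.append(comp)
    return parts

def try_cut(nodes, t1, t2, weight, lam=0.33):
    """Remove edges in descending weight order until the graph splits.

    Returns (valid, parts, removed), where valid is the outcome of Equation 7.
    """
    remaining = set(t1) | set(t2)
    removed = []
    for e in sorted(remaining, key=weight.__getitem__, reverse=True):
        remaining.discard(e)
        removed.append(e)
        parts = components(nodes, remaining)
        if len(parts) > 1:  # a graph cut happened
            in_t1 = sum(1 for r in removed if r in t1)
            in_t2 = sum(1 for r in removed if r in t2)
            ratio = min(in_t1, in_t2) / len(removed)
            return ratio >= lam, parts, removed
    return False, [set(nodes)], removed

# Made-up graph: node 3 is attached by one T1 edge and two T2 edges.
nodes = [0, 1, 2, 3]
t1 = {(0, 1), (1, 2), (2, 3)}
t2 = {(0, 2), (1, 3), (0, 3)}
weight = {(2, 3): 0.9, (0, 3): 0.85, (1, 3): 0.8,
          (0, 2): 0.2, (0, 1): 0.1, (1, 2): 0.1}
valid, parts, removed = try_cut(nodes, t1, t2, weight, lam=0.33)
```

Here three edges are removed, one from T1 and two from T2, so the ratio is 1/3. The cut passes with λ = 0.33 but would be rejected with λ = 0.5. A full implementation would recompute the weights for each subgraph and repeat the procedure on both parts, as Algorithm 1 describes.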

Finding imperfect clusters:

At the end of the clustering, network flows might fall within clusters of different sizes and shapes. The proposed method aims to build an accurate classification model, and finding perfect clusters has an important effect on this. Perfect clusters are those in which all or most of the instances belong to the same class of traffic. In this regard, the quality of the produced clusters is measured by the standard deviation of the length of the intra-cluster edges.

Definition 3. If $G' = (V', E')$ is a cluster, i.e. a subgraph (subtree), produced by the clustering phase, $N$ is the number of edges in $G'$ and $d(e_i)$ is the length of edge $e_i$, then the standard deviation of the length of the intra-cluster edges is measured as:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d(e_i) - \bar{d}(e)\right)^2} \tag{8}$$

Here $\bar{d}(e)$ is the average length of the edges and can be calculated as below:

$$\bar{d}(e) = \frac{1}{N}\sum_{i=1}^{N} d(e_i) \tag{9}$$

Figure 4. Primary clusters in second phase of clustering.

Considering a threshold β, if for a cluster σ ≥ βthen the cluster will be checked to see if it can bedivided into smaller clusters.
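As a small illustration, the impurity check built on Equations 8 and 9 might look like the following. The function names and the example value of β are hypothetical; the paper does not give a value for β at this point.

```python
import math

def edge_length_std(lengths):
    """Population standard deviation of intra-cluster edge lengths (Equation 8)."""
    n = len(lengths)
    mean = sum(lengths) / n  # Equation 9
    return math.sqrt(sum((d - mean) ** 2 for d in lengths) / n)

def needs_split(lengths, beta):
    """True when sigma >= beta, i.e. the cluster is checked for further division."""
    return edge_length_std(lengths) >= beta

# Made-up edge lengths of one cluster; beta = 1.5 is an arbitrary example value.
cluster_edges = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sigma = edge_length_std(cluster_edges)
split = needs_split(cluster_edges, beta=1.5)
```

Note that Equation 8 divides by N rather than N - 1, so this is the population form of the standard deviation.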

Procedure of removing incompatible edges

Assume $C = \{c_1, c_2, c_3, \dots, c_n\}$ is the set of clusters produced by the previous steps. The condition $\sigma \geq \beta$ is checked for every produced cluster. If it holds for a cluster, then the cluster is checked to see whether it can be divided into smaller clusters. Assume $c_i$ is a cluster with $\sigma \geq \beta$. Each cluster in $C$ is an MST, so $c_i$ can be represented as a tree $T = (V, E)$, in which $V$ is the set of its vertices and $E$ the set of its edges. In $T$, $deg(v_i)$ is the degree of vertex $v_i \in V$.

In this step the proposed method reduces the impurity of the clusters by dividing them into smaller clusters. For this purpose, the degree of each vertex in T is calculated and the vertices are sorted in descending order of degree. Starting from the vertex with the highest degree, every adjacent vertex of degree 1 is placed in the same cluster as that vertex, while an adjacent vertex of degree greater than 1 starts a cluster of its own (Figure 4).

As shown in Figure 4, each vertex with its adjacentvertices of degree 1 is in the same cluster. For exam-ple, consider c1. Vertex a with degree of 8 and its 7adjacent vertices of degree 1 (cdefgh) are forming asingle cluster. Vertex b is adjacent to a but its degreeis greater than 1. So it creates its own cluster namedc12 and as it hasn’t any adjacent vertices of degree 1,cluster c12 has only one member b.

Consider vertex $v_i \in V$ in the tree and assume $N_i = \{n_{i1}, n_{i2}, \dots, n_{ij}, \dots, n_{im}\}$, $j = 1, \dots, m$, is the set of adjacent vertices of $v_i$. While creating the clusters of phase 2, if $deg(n_{ij}) = 1$ for $n_{ij} \in N_i$, then $n_{ij}$ will be in the same cluster as $v_i$.
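The degree rule for the primitive clusters of phase 2 can be sketched as below. The tree is assumed to be given as an adjacency dictionary, and the function name `primitive_clusters` as well as the tie-breaking by vertex name are hypothetical choices made only for this illustration.

```python
def primitive_clusters(adj):
    """Group each vertex of a tree with its adjacent degree-1 vertices.

    adj: dict mapping each vertex to the set of its neighbours in the tree.
    """
    deg = {v: len(ns) for v, ns in adj.items()}
    assigned = set()
    clusters = []
    # Visit vertices in descending order of degree (ties broken by name).
    for v in sorted(adj, key=lambda v: (-deg[v], v)):
        if v in assigned:
            continue
        leaves = {n for n in adj[v] if deg[n] == 1 and n not in assigned}
        members = {v} | leaves
        assigned |= members
        clusters.append(members)
    return clusters

# Made-up tree: a has leaves c and d; b is adjacent to a and has one leaf e.
tree = {
    "a": {"b", "c", "d"},
    "b": {"a", "e"},
    "c": {"a"},
    "d": {"a"},
    "e": {"b"},
}
clusters = primitive_clusters(tree)
```

Vertex b has degree 2, so it is not absorbed by a and starts its own cluster, mirroring the role of vertex b in Figure 4.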

To obtain an optimal number of clusters, after cre-ating these clusters, if necessary, some of the clustersget merged together. For this purpose, the followingparameters are calculated and if a merge happenedthey should be updated.

• Cluster center $center_i$
• Adjacent vertices $N_i$
• Inter-cluster edges $out_i$
• Intra-cluster vertices $vin_i$
• Cluster standard deviation $sd_i$ (with respect to the length of intra-cluster edges)
• Cluster degree $degc_i$ (maximum degree of the vertices in the cluster)

Figure 5. Final clusters of second phase of clustering.

To start merging, the clusters are sorted by degree and the merge procedure starts with the cluster of smallest degree. As shown in Figure 4, clusters c7 to c15 have the minimum degree value, so any of them can be the starting candidate. Assume c12 is chosen. The distance between the center of c12 (center12) and the centers of its neighboring clusters (center11 and center1) is calculated, and the two clusters with the nearest centers are merged. To check that the merge was useful, the differences between the standard deviation of the merged cluster, sdmerg, and the standard deviations of the two initial clusters, sd1 and sd12, are calculated.

$$sd_{merg} - sd_{12} = a_1 \tag{10}$$

$$sd_{merg} - sd_{1} = a_2 \tag{11}$$

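The merge is successful if the condition in Equation (12) holds; otherwise, the merge is reverted and the edge between the two clusters is eliminated.

$$
\begin{cases}
a_1 \wedge a_2 \geq \theta \\
a_1 \wedge a_2 < \theta
\end{cases}
\tag{12}
$$

The θ in Equation (12) is a threshold, which is studied in Section 4.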
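As a small worked illustration of the merge test, the following Python sketch computes a1 and a2 from lists of edge lengths. It rests on assumptions that the text does not state explicitly: a merge is kept when both differences stay below θ (this reading is consistent with the later observation that smaller thresholds yield more clusters), the merged cluster's standard deviation includes the edge joining the two clusters, and a cluster with fewer than two edges has a standard deviation of 0. The function names are hypothetical.

```python
from statistics import pstdev


def edge_sd(lengths):
    """Standard deviation of edge lengths (0.0 for fewer than two edges)."""
    return pstdev(lengths) if len(lengths) > 1 else 0.0


def merge_is_accepted(edges_c1, edges_c12, joining_edge, theta):
    """Assumed reading of Equations (10)-(12): accept if a1 and a2 are below theta."""
    sd_1 = edge_sd(edges_c1)
    sd_12 = edge_sd(edges_c12)
    sd_merg = edge_sd(edges_c1 + edges_c12 + [joining_edge])
    a1 = sd_merg - sd_12  # Equation (10)
    a2 = sd_merg - sd_1   # Equation (11)
    return a1 < theta and a2 < theta


# Hypothetical edge lengths: c1 has three short edges, c12 is a single vertex.
print(merge_is_accepted([1.0, 1.0, 1.0], [], joining_edge=1.0, theta=0.05))
print(merge_is_accepted([1.0, 1.0, 1.0], [], joining_edge=5.0, theta=0.05))
```

With a joining edge as short as the existing ones, the merged standard deviation does not grow and the merge is kept; a much longer joining edge raises it above θ and the merge is rejected.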

After two clusters are merged, all of the parameters mentioned above should be updated. The merge procedure continues until no inter-cluster edges exist, in other words, until no more clusters can be merged. For example, Figure 5 shows the result of the algorithm for the last example.

As can be seen in Figure 5, the initial 15 clusters are reduced to 6 after applying this algorithm. The resulting clusters are expected to be more accurate than the clusters of the first step.

The merge algorithm is as follows:

Algorithm 2 Merge step

Input: Clusters produced by Algorithm 1 (MST T)
Output: More accurate clusters
1: Build the primitive clusters with respect to vertex degrees and store them in the initial list.
2: Calculate the following parameters for each cluster_i in the initial list: center of the cluster center_i, adjacent clusters N_i, intra-cluster edges in_i, inter-cluster edges out_i, intra-cluster vertices vin_i, standard deviation of the cluster sd_i (with respect to the length of the edges in the cluster), degree of the cluster degc_i (maximum degree of the vertices in the cluster).
3: Sort the clusters based on cluster rank in ascending order and store them in the clusters list.
4: Store all inter-cluster edges in the inter_clusters_edges list.
5: If inter_clusters_edges ≠ ∅, go to 7.
6: If inter_clusters_edges = ∅, end the algorithm.
7: Calculate the distance from the center of the first cluster c_i in the clusters list to the centers of its adjacent clusters N_i.
8: Select the cluster that satisfies c_select = argmin_{j=1,...,m, n_ij ∈ N_i} {d(center_i, center_n_ij)} and store it in c_select.
9: Merge c_i and c_select.
10: If the merge is not appropriate, assume the edge between c_i and c_select as incompatible, remove it and go to 4.
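The loop of Algorithm 2 could be sketched in Python roughly as follows. This is not the authors' implementation; several details are assumptions made only for illustration: a cluster center is taken as the mean of its members' feature vectors, the cluster with the smallest degree is re-selected in every iteration among clusters that still have inter-cluster edges, a merge counts as appropriate when both standard deviation differences stay below θ, and all names (`merge_clusters`, `cluster_of`, `points`) are hypothetical.

```python
import math
from statistics import pstdev


def merge_clusters(points, tree_edges, cluster_of, theta):
    """points: vertex -> coordinates; cluster_of: vertex -> cluster id (updated in place)."""
    degree = {}
    for u, v in tree_edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    active = set(tree_edges)  # tree edges not yet marked incompatible

    def members(c):
        return [v for v, cid in cluster_of.items() if cid == c]

    def center(c):  # assumption: center = mean of the member vectors
        pts = [points[v] for v in members(c)]
        return tuple(sum(axis) / len(pts) for axis in zip(*pts))

    def sd(edges):
        lengths = [math.dist(points[u], points[v]) for u, v in edges]
        return pstdev(lengths) if len(lengths) > 1 else 0.0

    def intra(c):
        return [(u, v) for u, v in active if cluster_of[u] == c == cluster_of[v]]

    while True:
        inter = [(u, v) for u, v in active if cluster_of[u] != cluster_of[v]]
        if not inter:  # steps 5-6: no inter-cluster edges are left
            return cluster_of
        # Steps 3 and 7: take the smallest-degree cluster that can still merge.
        candidates = {cluster_of[x] for edge in inter for x in edge}
        ci = min(candidates,
                 key=lambda c: (max(degree[v] for v in members(c)), str(c)))
        neighbours = {cluster_of[v] if cluster_of[u] == ci else cluster_of[u]
                      for u, v in inter if ci in (cluster_of[u], cluster_of[v])}
        # Step 8: the adjacent cluster with the nearest center.
        cj = min(neighbours,
                 key=lambda c: (math.dist(center(ci), center(c)), str(c)))
        joining = [(u, v) for u, v in inter
                   if {cluster_of[u], cluster_of[v]} == {ci, cj}]
        sd_merged = sd(intra(ci) + intra(cj) + joining)
        if (sd_merged - sd(intra(ci)) < theta
                and sd_merged - sd(intra(cj)) < theta):
            for v in members(cj):  # step 9: keep the merge
                cluster_of[v] = ci
        else:
            active.difference_update(joining)  # step 10: drop the edge


# Hypothetical one-dimensional flows forming a chain a-b-c-d-e-f.
points = {"a": (0.0,), "b": (1.0,), "c": (2.0,),
          "d": (3.0,), "e": (10.0,), "f": (11.0,)}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
start = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3, "f": 3}
print(merge_clusters(points, edges, dict(start), theta=0.5))
```

Each iteration either merges two clusters or removes an inter-cluster edge, so the loop terminates once no inter-cluster edges remain.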

Finally, the clusters obtained from Algorithms 1 and 2 are used as input to the labeling step. The labeling phase is explained in the next section.

3.1.2 Labeling

The resulting clusters of the clustering part are used as input for the classification phase. Classification is a supervised learning method, so it needs labeled network flows, and in this step the flows must be labeled properly. As mentioned before, we assume that at least 10% of the data instances are labeled. The clustering phase has nothing to do with the data labels, so the labeled data instances are probably distributed among the clusters. To label the clusters, the labeled instances in each cluster are used.

Figure 6. General structure of classification phase in proposed method.

Some clusters might not have any labeled instances, but if a cluster has some, the instances should be examined to see whether they are determinative. A determinative instance is one with maximum degree; in other words, the vertices with the maximum number of neighbors are the determinative ones. Determinative instances are also called dense instances. If there is a labeled dense instance in a cluster, its label is propagated to the other instances of that cluster. Otherwise, the cluster label is obtained by asking experts, and that label is then propagated to all cluster members.
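The labeling rule can be illustrated with a short Python sketch. It assumes that clusters are given as sets of vertices, that vertex degrees in the tree are already known, and that the expert is modeled as a callback; `propagate_labels` and `ask_expert` are hypothetical names, and the choice of the first labeled dense instance when several exist is an assumption.

```python
def propagate_labels(clusters, degree, known_labels, ask_expert):
    """Label every instance of every cluster from a dense instance or an expert."""
    labels = {}
    for cluster_id, members in clusters.items():
        top = max(degree[v] for v in members)
        # Dense (determinative) instances: members with maximum degree.
        dense = sorted(v for v in members if degree[v] == top)
        labeled_dense = [v for v in dense if v in known_labels]
        if labeled_dense:
            label = known_labels[labeled_dense[0]]
        else:
            label = ask_expert(cluster_id, dense)  # no labeled dense instance
        for v in members:
            labels[v] = label
    return labels


# Hypothetical input: only instance "a" is both dense and labeled.
clusters = {"c1": {"a", "c", "d"}, "c2": {"b", "e"}}
degree = {"a": 3, "b": 2, "c": 1, "d": 1, "e": 1}
known = {"a": "SSH", "e": "NOTSSH"}
print(propagate_labels(clusters, degree, known, lambda cid, dense: "NOTSSH"))
```

In this example the label of e is not used for cluster c2, because e is not a dense instance, so the expert callback is consulted for that cluster.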

3.1.3 Classification

In the previous steps, the training data instances were clustered, and label propagation produced a set of labeled network flows. These labeled instances can now be used for classification. The goal of the classification phase is to build a classification model from the statistical features of the dataset. The C4.5 supervised algorithm is used to build the classification model. As mentioned before, the C4.5 algorithm showed proper performance in comparison with Bayesian networks, naïve Bayes and SVMs. C4.5 generates a tree structure based on the statistical features of the network flows. This structure can be used later, in the test phase, to classify unseen instances, which are called test flows. The structure of the classification phase is shown in Figure 6.
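To make the tree-building step more concrete, the sketch below computes the gain ratio, the criterion C4.5 uses to rank candidate splits, for a threshold split on one numeric feature. It is only a fragment of the algorithm (no recursion, pruning or missing-value handling), and the feature values and labels are hypothetical, not taken from the datasets used in this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = (entropy(labels)
                 - weights[0] * entropy(left)
                 - weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Hypothetical statistical feature (e.g. a mean packet size) for six flows.
sizes = [40, 48, 52, 1400, 1460, 1500]
classes = ["SSH", "SSH", "SSH", "NOTSSH", "NOTSSH", "NOTSSH"]
for t in (45, 100):
    print(t, gain_ratio(sizes, classes, t))
```

A threshold that separates the two classes cleanly receives a higher gain ratio than one that isolates a single flow, which is why the tree would prefer it at that node.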

3.2 Test Phase

The classification model built by the C4.5 decision tree is now examined. The test data instances are a subset of the initial data set that does not contain the training instances.

One method for evaluating a classification algorithm is holdout. In this method, the initial data set is divided into two disjoint sets, training and test. The division ratio depends on the analyst; commonly, 2/3 of the data set is used for training and the rest for evaluation. Holdout is simple and fast, but it has deficiencies. The training and test sets depend on each other: instances chosen for testing have no chance to take part in training, and vice versa. Random sub-sampling runs holdout repeatedly. Its flaw is that some instances might be chosen for training or testing more frequently than others. Cross validation, which is used in this article for evaluation, controls this frequency and tends to maintain a balance.
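The balance that cross validation maintains can be seen in a minimal Python sketch of k-fold splitting: every instance is used for testing exactly once and for training in all other folds. The number of folds and the seed are assumptions chosen for illustration; the article does not state the value of k that was used.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train, test) index lists so each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical example: 10 instances, 5 folds.
for train, test in k_fold_indices(10, 5):
    print(sorted(test), len(train))
```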

4 Evaluating the Proposed Method

The proposed method is implemented in the Java programming language on a machine running the Windows operating system. To evaluate it, the datasets NLANR (AMP) [23], MAWI [24], DARPA99 [25] and Moore [26] were used. Although the sizes of the SSH and NOTSSH types are imbalanced, SSH network flows are widely present in all of these datasets. For the first 3 datasets, the binary classification problem of SSH versus NOTSSH was evaluated. The Moore dataset contains various classes of traffic, so the proposed method was evaluated there on classifying all types of network flows.

To evaluate the proposed method, the measures Recall or Detection Rate (DR), False Positive Rate (FPR), Precision and F-measure are used. FPR is the rate at which instances of the negative class (NOTSSH) are falsely predicted as positive. DR is the percentage of the instances of a class that are predicted correctly, and Precision states how accurate the classifier's predictions are for a specific class. Finally, F-measure balances a tradeoff between Recall and Precision and can be calculated by the following equation:

$$F\text{-}Measure = \frac{(1 + \beta^2) \times Recall \times Precision}{\beta^2 \times Precision + Recall} \tag{13}$$

β in Equation (13) is a tradeoff parameter between Recall and Precision, which is normally set equal to 1.
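As a worked illustration of these measures, the following Python sketch derives them from the four counts of a binary confusion matrix, using the standard F-beta definition. The counts are hypothetical (1000 positive and 3000 negative flows) and are not results from this paper; `binary_metrics` is an illustrative name.

```python
def binary_metrics(tp, fp, tn, fn, beta=1.0):
    """Recall (DR), FPR, precision and F-measure from confusion-matrix counts."""
    recall = tp / (tp + fn)       # detection rate of the positive class
    fpr = fp / (fp + tn)          # negatives wrongly predicted as positive
    precision = tp / (tp + fp)
    f_measure = ((1 + beta ** 2) * precision * recall
                 / (beta ** 2 * precision + recall))
    return {"recall": recall, "fpr": fpr,
            "precision": precision, "f_measure": f_measure}


# Hypothetical counts: 900 of 1000 SSH flows detected, 30 of 3000 NOTSSH misclassified.
print(binary_metrics(tp=900, fp=30, tn=2970, fn=100))
```

With β = 1 the F-measure reduces to the harmonic mean of Recall and Precision.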

Experiments and results

The first 3 datasets are examined first. Each dataset contains statistical features of some network flows. The SSH label corresponds to SSH network flows, and NOTSSH is used for the other classes of flows.

For cross validation purposes, 5 files, each containing 4000 network flows, are randomly sampled and prepared. Each file consists of 3000 NOTSSH instances and 1000 SSH flows. The following results are average values obtained for these prepared datasets.

Figure 7 illustrates the detection rate of the proposed method for the SSH and NOTSSH classes for the various standard deviations used in the second phase of clustering. The standard deviation in these experiments ranged over [0.02, 0.1]. As can be seen, the detection rate varies with the standard deviation. Smaller values of standard deviation result in more accurate clusters, so it is reasonable for the method to achieve better results for such values. For the MAWI dataset, as can be seen in Figure 7, acceptable detection rates are achieved for standard deviation values less than 0.07, so 0.07 can be considered the threshold for this parameter.

Another effect of changing the standard deviation is variation in the number of clusters. Smaller values of standard deviation result in more clusters, and greater values yield fewer clusters. Figure 8 shows this effect for the MAWI dataset. The minimum number of clusters was obtained for a standard deviation of 0.1 and the maximum number for 0.02.



Figure 7. Detection rate of SSH and NOTSSH for various standard deviations in MAWI dataset

Figure 8. Different cluster numbers for different values of standard deviation in MAWI dataset

Figure 9. Detection rate of SSH and NOTSSH for various cluster numbers in MAWI dataset

Figure 9 shows the trend of the detection rate of SSH and NOTSSH as the number of clusters varies. Suitable values of standard deviation are those that result in fewer clusters and better separation of SSH and NOTSSH. According to Figure 9, an acceptable detection rate for SSH and NOTSSH is obtained for 39 clusters or more. As mentioned before, the standard deviation for the MAWI dataset varied in the range [0.02, 0.07], so the range [0.05, 0.07], which produces fewer clusters with acceptable results, was the best range.

Figure 10 similarly shows the detection rate of the proposed method for the SSH and NOTSSH classes for various standard deviations in the DARPA dataset. The figure shows that for higher standard deviation values, the detection rate increases for NOTSSH and decreases for SSH. This shows that choosing smaller standard deviation values leads to misclassifying more NOTSSH flows as SSH flows.

Figure 10. Detection rate of SSH and NOTSSH for various standard deviations in DARPA dataset

Figure 11. Different cluster numbers for different values of standard deviation in DARPA dataset

Figure 12. Detection rate of SSH and NOTSSH for various cluster numbers in DARPA dataset

Figures 11 and 12 show the cluster numbers for different standard deviation values and the corresponding detection rates, respectively, in the DARPA dataset.

As with the MAWI dataset, the results shown in Figure 10 to Figure 12 indicate that standard deviation values in the range [0.07, 0.09] produce acceptable results with fewer clusters. The proposed method is also evaluated using the AMP dataset, and the results are shown in Figure 13 to Figure 15.

According to Figure 13, smaller values of standard deviation result in better detection rates in the AMP dataset. Smaller values of standard deviation also produce more clusters. According to Figure 15, the smallest numbers of clusters that produce better results are in the range of 47 to 89.

According to Figure 14 and Figure 15, the best range of standard deviation for the noted cluster numbers is [0.05, 0.07]. Considering all of the results above, it can be stated that the range [0.05, 0.07] is suitable for the standard deviation in all 3 datasets.

Figure 13. Detection rate of SSH and NOTSSH for various standard deviations in AMP dataset

Figure 14. Different cluster numbers for different values of standard deviation in AMP dataset

Figure 15. Detection rate of SSH and NOTSSH for various cluster numbers in AMP dataset

To better analyze the performance of the classifiers, specifically the binary ones, Receiver Operating Characteristics (ROC) curves are used. The experiments described so far were all binary classification problems, so we analyze the ROC for them.
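For readers unfamiliar with ROC analysis, the sketch below shows how the points of a ROC curve can be obtained by sweeping a decision threshold over classifier scores, and how the area under the curve follows from the trapezoidal rule. The scores and labels are hypothetical; the article does not describe how scores were derived from the decision tree, so this is an illustration of the general technique only.

```python
def roc_points(scores, is_positive):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    positives = sum(is_positive)
    negatives = len(is_positive) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i, (score, positive) in enumerate(ranked):
        if positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for the positive (SSH) class of four flows.
pts = roc_points([0.9, 0.8, 0.4, 0.3], [True, False, True, False])
print(pts, auc(pts))
```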

In this article, all datasets consist of two parts, labeled and unlabeled instances, with the assumption that at least 10% of the instances are labeled.

After the clustering part, as mentioned before, each cluster is checked to see whether it contains any labeled dense instances. If there is one, its label is propagated to the other instances of the cluster. If not, the cluster label is determined by asking experts. The initial assumption was that the dataset consists of at least 10% labeled data. After label propagation, this number rises. With this in mind, the minimum and maximum numbers of labeled instances for the MAWI, DARPA and AMP datasets are shown in Figure 17.

Figure 16. ROC for datasets DARPA and AMP

Figure 17. Minimum and maximum amount of labeled data in datasets MAWI, AMP, DARPA

Figure 18. Comparing detection rate of SSH network flows

According to Figure 17, at most 14% of the data instances were labeled for the classification step in these experiments.



Figure 19. Comparing false positive rate for SSH

Figure 20. Comparing detection rate of NOTSSH network flows

To give a better view of the performance of the proposed method, it is compared in the following paragraphs with some other well-known methods. Figure 18 and Figure 19 compare the detection rate and false positive rate of the proposed method with those of 6 methods, namely GP [6], C4.5, AdaBoost, RIPPER, SVM and Naïve Bayesian, for SSH network flows. Similarly, Figure 20 and Figure 21 compare the detection rate and false positive rate of the proposed method with the same 6 methods for NOTSSH network flows.

Figure 21. Comparing false positive rate for NOTSSH

Figure 22. Comparing detection rate of the proposed method with a K-means based semi-supervised method [9]

According to these comparisons, the proposed method reached better results in terms of both DR and FPR. That our method achieved this with only 10% of its data instances labeled underlines its better performance. Since the proposed method is semi-supervised, we also compare its performance with another semi-supervised method [9]. Figure 22 compares the detection rate of SSH network flows for the DARPA and MAWI datasets.

As mentioned before, the proposed method consists of two clustering algorithms in the first part. In both levels, the network flows are considered as the nodes of the graph, and the distances between them construct the edges of the trees. The flows are separated and the clusters are formed by eliminating edges of the resulting trees. The elimination of the edges is inspired by the method proposed in [22]. In the following part, if needed, the impure clusters of the first part are decomposed into new, smaller clusters using a new method. The evaluation results for the proposed method and [22] are presented in Figure 23 for SSH network flows and in Figure 24 for NOTSSH network flows.

Figure 23. Comparing SSH detection rate of the proposed method with [22]

Figure 24. Comparing NOTSSH detection rate of the proposed method with [22]

Another useful comparison is shown in Figure 25, which compares the average F-measure for the SSH class in the MAWI dataset. It also shows the effect of the number of labeled samples. In addition to the comparison with C4.5 and Erman's method [9], Figure 25 shows a slight superiority of the proposed method over [11]. As can be seen, the F-measure of the proposed method is noticeably higher than that of [11] for fewer labeled instances and, as with the other methods, increases with more instances.

Figure 25. Average F-Measure of SSH class on MAWI dataset

Figure 26. Detection rate of the method for classes of the Moore dataset on two disjoint random subsets of it

As mentioned before, the proposed method is applied to the Moore dataset to classify different classes of network flows. For this purpose, two distinct, randomly generated subsets of the Moore dataset are prepared, and the results of the proposed method for these subsets are shown in Figure 26.

The maximum number of generated clusters on the first subset was 400. As in the other experiments, 10% of the data instances were initially labeled, and after label propagation only 15.5% of the instances had labels. For the second subset, the maximum number of clusters reached 295, and at most 15% of the data instances were labeled for classification. As shown in Figure 26, the proposed method reached suitable results with less than 16% of its instances labeled.

Figure 27 compares the detection rate of the proposed method with those of the C4.5 and K-means algorithms. According to Figure 27, K-means reached a detection rate of 100% for 4 of the classes but poor results for some other classes. This indicates a high FPR for those 4 classes, which reflects the poor performance of K-means.

In addition, more experiments have been carried out using other common datasets from the UCI Repository [26], such as Seeds, Breast Cancer Wisconsin, Banknote and others. Datasets with various numbers of features, classes and instances, from a few to hundreds of features and thousands of instances, were used, and the method obtained more or less the same results. These experiments indicate an appropriate level of scalability.

Figure 27. Comparing performance of the proposed method with C4.5 and K-means algorithm

5 Conclusion

Internet network traffic classification methods were discussed in this article. The main goal of the paper was to devise a novel method for detecting and classifying the traffic of encrypted applications such as SSH. The proposed method, which is a semi-supervised method based on graph theory, was implemented. The algorithm classifies both encrypted and plain network traffic by means of concepts from graph theory, minimum spanning trees and C4.5 decision trees. The method first clusters network flows based on their statistical features. According to the results, the clusters generated by this method were accurate enough to cover most classes in their related clusters. Using dense data instances, the label of each cluster could be obtained and propagated to the other instances of the same cluster. The experimental results showed that the new method has accurate and suitable performance in classifying different network flows. Considering that the main goal of the paper is to classify encrypted network traffic, the experimental results confirm that the algorithm classifies the instances successfully.

References

[1] A. Madhukar and C. Williamson, “A longitudinal study of P2P traffic classification,” in 14th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006.

[2] A. Callado et al., “A survey on internet traffic identification,” Communications Surveys & Tutorials, IEEE, vol. 11, pp. 37-52, 2009.

[3] T. Nguyen and G. Armitage, “A survey of techniques for internet traffic classification using machine learning,” Communications Surveys & Tutorials, IEEE, vol. 10, pp. 56-76, 2008.

[4] A. Dainotti, A. Pescape and K. C. Claffy, “Issues and future directions in traffic classification,” IEEE Network, vol. 26, no. 1, pp. 35-40, 2012.

[5] A. W. Moore and D. Zuev, “Internet Traffic Classification Using Bayesian Analysis Techniques,” SIGMETRICS Perform. Eval. Rev., vol. 33, no. 1, pp. 50-60, 2005.

[6] R. Alshammari and A. N. Zincir-Heywood, “Can encrypted traffic be identified without port numbers, IP addresses and payload inspection?,” Computer networks, vol. 55, pp. 1326-1350, 2011.

[7] C. Zigang, C. Shoufeng, X. Gang and G. Li, “Progress in Study of Encrypted Traffic Classification,” in Trustworthy Computing and Services: International Conference, Beijing, 2013.

[8] Z. Meng, H. Zhang, B. Zhang and G. Lu, “Encrypted Traffic Classification Based on an Improved Clustering Algorithm,” in Trustworthy Computing and Services: International Conference, Beijing, 2012.

[9] J. Erman, A. Mahanti, M. Arlitt and L. Cohen, “Offline/realtime traffic classification using semi-supervised learning,” Performance Evaluation, vol. 64, no. 9-12, pp. 1194-1213, 2007.

[10] “SSH,” [Online]. Available: http://www.rfc-archive.org/getrfc.php?rfc=4251.

[11] C. Chao, J. Zhang, Y. Xiang, W. Zhou and Y. Xiang, “Internet traffic classification by aggregating correlated naive bayes predictions,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 1, pp. 5-15, 2013.

[12] N. Williams, S. Zander and G. Armitage, “A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification,” SIGCOMM Comput. Commun. Rev., vol. 36, no. 5, pp. 5-16, 2006.

[13] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos and K. Lee, “Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices,” in CoNEXT ’08, New York, 2008.

[14] M. Lotfollahi, R. S. Hossein Zade, M. Jafari Siavoshani and M. Saberian, “Deep Packet: A Novel Approach For Encrypted Traffic Classification Using Deep Learning,” eprint arXiv, vol. 1709.02656, no. 2, 2017.

[15] S. Bagui, X. Fang, K. Ezhil, S. C. Bagui and J. Sheehan, “Comparison of machine-learning algorithms for classification of VPN network using time-related features,” Journal of Cyber Security Technology, vol. 1, no. 2, pp. 108-126, 2017.

[16] A. McGregor, M. Hall, P. Lorier and J. Brunskill, “Flow Clustering Using Machine Learning Techniques,” in Passive and Active Network Measurement: 5th International Workshop, Berlin, Heidelberg, Springer Berlin Heidelberg, 2004, pp. 205-214.

[17] L. Bernaille, R. Teixeira and K. Salamatian, “Early Application Identification,” in Proceedings of the 2006 ACM CoNEXT Conference, New York, NY, USA, 2006.

[18] Z. Jun, X. Yang, Z. Wanlei and W. Yu, “Unsupervised traffic classification using flow statistical properties and IP packet payload,” Journal of Computer and System Sciences, vol. 79, no. 5, pp. 573-585, 2013.

[19] A. Shrivastav and A. Tiwari, “Network traffic classification using semi-supervised approach,” in Machine Learning and Computing (ICMLC), Bangalore, 2010.

[20] Y. Wang, Y. Xiang, J. Zhang and S. Yu, “A novel semi-supervised approach for network traffic clustering,” in 5th International Conference on Network and System Security (NSS), Milan, 2011.

[21] C. T. Zahn, “Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters,” IEEE Transactions on Computers, vol. C-20, no. 1, pp. 68-86, 1971.

[22] C. Zhong et al., “A graph-theoretical clustering method based on two rounds of minimum spanning trees,” Pattern Recognition, vol. 43, pp. 752-766, 2010.

[23] “NLANR,” [Online]. Available: http://pma.nlanr.net.

[24] “MAWI,” [Online]. Available: http://mawi.wide.ad.jp/mawi/.

[25] “DARPA 1999 intrusion detection evaluation data,” [Online]. Available: https://www.ll.mit.edu/ideval/data/.

[26] “BRASIL,” [Online]. Available: https://www.cl.cam.ac.uk/research/srg/netos/projects/brasil/.

Ehsan Mahdavi is a Ph.D. candidate at Isfahan University of Technology. He received his B.S. and M.S. degrees in computer science from Shahid Beheshti University, Tehran, Iran. He then returned to his hometown, Isfahan, and is pursuing his Ph.D. in the field of network security at Isfahan University of Technology.

Ali Fanian received the B.S., M.S. and Ph.D. degrees in computer engineering in 1999, 2001 and 2011, respectively, from Isfahan University of Technology, Isfahan, Iran. Since then, he has worked in the same department as an assistant professor. His research interests are different aspects of computer architecture and network security, especially ad-hoc networks, wireless network security and hardware design.

Homa Hasannejad received her B.S. in computer engineering from the department of electrical & computer engineering at Shiraz University, Iran. She received her M.S. degree in artificial intelligence from the department of electrical & computer engineering at Isfahan University of Technology, Iran. Her research interests include increasing the accuracy of Internet traffic classification using machine learning methods.
