network traffic classification: from theory to practice

39
Network traffic classification: From theory to practice Pere Barlet-Ros Associate Professor at UPC BarcelonaTech Co-founder and Chairman at Polygraph.io Joint work with: Valentín Carela-Español, Tomasz Bujlow and Josep Solé-Pareta

Upload: others

Post on 28-Apr-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Network traffic classification: From theory to practice

Network traffic classification: From theory to practice

Pere Barlet-Ros Associate Professor at UPC BarcelonaTech Co-founder and Chairman at Polygraph.io

Joint work with: Valentín Carela-Español, Tomasz Bujlow and Josep Solé-Pareta

Page 2: Network traffic classification: From theory to practice

Background

• What do we refer to as traffic classification?

– Identifying the application that generated each flow

• What is traffic classification used for?

– Network planning and dimensioning

– Per-application performance evaluation

– Traffic steering / QoS / SLA validation

– Charging and billing

Page 3: Network traffic classification: From theory to practice

State of the Art: Ports • Port-based

– Computationally lightweight – Payloads not needed – Easy to understand and program – Low accuracy and completeness

Page 4: Network traffic classification: From theory to practice

State of the Art: DPI • Deep packet inspection (DPI)

– High accuracy and completeness – Computationally expensive – Needs payload access – Privacy concerns – Cannot work with encrypted traffic

Page 5: Network traffic classification: From theory to practice

State of the Art: ML

• Machine Learning – High accuracy and completeness – Computationally viable – Payloads not needed – Can work with encrypted traffic – Needs retraining

Page 6: Network traffic classification: From theory to practice

Main limitations of ML-TC

• Introduction in real products and operational environments is limited and slow – Current proposals suffer from practical problems

– Actual products rely on simpler methods or DPI

• We identified 3 main real-world problems 1) The deployment problem

2) The maintenance problem

3) The validation problem

Page 7: Network traffic classification: From theory to practice

1) Deployment problem

• Current solutions are difficult to deploy

– Need dedicated hardware appliances / probes

– Need packet-level access (e.g. compute features, …)

• How to address this problem?

– Work with flow level data (e.g. Netflow)

– Support packet sampling (e.g. Sampled Netflow)

Page 8: Network traffic classification: From theory to practice

NetFlow w/o sampling

• Challenge: NetFlow v5 features are very limited

– IPs, ports, protocol, TCP flags, duration, #pkts, …

• State-of-the-art ML technique: C4.5 decision tree

Page 9: Network traffic classification: From theory to practice

Results (NetFlow w/o sampling)

• UPC dataset (publicly available) – 7 x 15 min traces from UPC access link

– Collected at different days and hours

– Labelled with L7-filter (strict version with less FPR)

Page 10: Network traffic classification: From theory to practice

Results (Sampled NetFlow)

• Impact of packet sampling

Page 11: Network traffic classification: From theory to practice

Sources of inaccuracy

1) Error in the estimation of the traffic features

2) Changes in flow size distribution 3) Changes in flow splitting probability

Page 12: Network traffic classification: From theory to practice

Solution (Sampled NetFlow)

Page 13: Network traffic classification: From theory to practice

Deployment problem: Summary

• Current proposals are difficult to deploy

• Proposed a simple but effective technique

– Supports standard NetFlow data

– Supports packet sampling

• Main limitation: Needs to be frequently retrained

V. Carela-Español, P. Barlet-Ros, A. Cabellos-Aparicio, J. Solé-Pareta. Analysis of the impact of sampling on NetFlow traffic classification. Computer Networks, 55(5), 2011.

Page 14: Network traffic classification: From theory to practice

2) Maintenance problem

• Difficult to keep classification model updated

– Traffic changes, application updates, new applications

– Involve significant human intervention

– ML models need to be frequently retrained

• Possible solution to the problem

– Make retraining automatic

– Computationally viable

– Without human intervention

Page 15: Network traffic classification: From theory to practice

Autonomic Traffic Classification

• Lightweight DPI for retraining

– Small traffic sample (e.g. 1/10000 flow sampling)

Page 16: Network traffic classification: From theory to practice

Evaluation

• 14-days trace collected at CESCA

Page 17: Network traffic classification: From theory to practice

Temporal/Spatial obsolescence

• Comparison without autonomic retraining

Page 18: Network traffic classification: From theory to practice

Maintenance problem: Summary

• Exiting classifiers need periodic retrainings

– Temporal obsolescence: Changes in application traffic

– Spatial obsolescence: Different networks

• Autonomic traffic classification system

– Easy to deploy: Works with Sampled NetFlow

– Easy to maintain: Lightweight DPI for self-training

V. Carela-Español, P. Barlet-Ros, O. Mula-Valls, J. Solé-Pareta. An autonomic traffic classification system for network operation and management. Journal of Network and Systems Management, 23(3):401-419, 2015.

Page 19: Network traffic classification: From theory to practice

3) Validation problem

• Current proposals are difficult to validate, compare and reproduce

– Private datasets

– Different ground-truth generators

• Our contribution

– Publication of labeled datasets (with payloads)

– Common benchmark to validate/compare/reproduce

– Validation of common ground-truth generators

Page 20: Network traffic classification: From theory to practice

Proposal

• Reliable labeled dataset with full payloads

– Accurate: VBS (label from the application socket)

– Avoid privacy issues: Realistic artificial traffic

Page 21: Network traffic classification: From theory to practice

Methodology

• Manually generate representative traffic – Create fake accounts (e.g. Gmail, Facebook, Twitter) – Interact with the service simulating human behavior

(e.g. posting, chatting, gaming, watching videos, …)

Page 22: Network traffic classification: From theory to practice

Dataset

• > 750K flows, ~55 GB of data

Page 23: Network traffic classification: From theory to practice

DPI tools compared

Page 24: Network traffic classification: From theory to practice

Application protocols

Page 25: Network traffic classification: From theory to practice

Applications

Page 26: Network traffic classification: From theory to practice

Web services (summary)

• PACE: 16/34 (6 over 80%)

• nDPI: 10/34 (6 over 80%)

• OpenDPI: 2/34

• Libprotoident: 0/34

• L7-filter: 0/44 (high FPR)

• NBAR: 0/34

Page 27: Network traffic classification: From theory to practice

Validation problem: Summary

• Comparison of most popular ground-truth generators – PACE: Best results at all classification levels – Libprotoident: Very good results at application/protocol – nDPI: Good results, web services level, open source – NBAR and L7-filter: Very poor results

• Dataset including payloads is publicly available – http://www.cba.upc.edu/monitoring/traffic-classification (Including also all other datasets presented in these slides) – Common benchmark to validate, compare and reproduce

T. Bujlow, V. Carela-Español, P. Barlet-Ros. Independent comparison of popular DPI tools for traffic classification. Computer Networks, 76:75-89, 2015. V. Carela-Español, T. Bujlow, P. Barlet-Ros. Is our ground-truth for traffic classification reliable? In Proc. of Passive and Active Measurement Conf. (PAM), 2014.

Page 28: Network traffic classification: From theory to practice

Network Polygraph

• Addressed 3 practical problems – The deployment problem (Sampled Netflow)

– The maintenance problem (Autonomic retraining)

– The validation problem (Labeled payload traces)

• We identified interest in the market – We created a UPC spin-off: https://polygraph.io

– Several customers world-wide

P. Barlet-Ros, J. Sanjuàs, V. Carela-Español. Network Polygraph: A cloud-based network visibility service. In ACM SIGCOMM Conf., Industrial Demo, 2015.

Page 29: Network traffic classification: From theory to practice

Why Network Polygraph?

• Other products are expensive and difficult to deploy – Can only be afforded by large operators, ISPs, …

– Large portion of the market are SMEs (>90% in EU)

• Our technology based on Sampled NetFlow only needs a small volume of traffic data – <0.5% of extra bandwidth usage

– Can be provided as a service from the cloud (SaaS)

Page 30: Network traffic classification: From theory to practice

Visibility-to-cost ratio

cost

visibility

Page 31: Network traffic classification: From theory to practice

Website + On-Line Demo

https://polygraph.io

Page 32: Network traffic classification: From theory to practice

traffic volume, breakdown by application

Page 33: Network traffic classification: From theory to practice

HTTP services

Page 34: Network traffic classification: From theory to practice

top talkers (addresses, ports, autonomous systems)

Page 35: Network traffic classification: From theory to practice

subnetwork-level bandwidth hogs

Page 36: Network traffic classification: From theory to practice

traffic geolocation (origins & destinations)

Page 37: Network traffic classification: From theory to practice

anomaly and attack detection with automatic baselining

Page 38: Network traffic classification: From theory to practice

indexed traffic database for forensic analysis

Page 39: Network traffic classification: From theory to practice

Network Polygraph

Talaia Networks, S.L.

K2M – Parc UPC Campus Nord

Jordi Girona, 1-3

Barcelona (08034)

Spain

Telephone: +34 93 405 45 87

[email protected]

https://polygraph.io