4th TMA PhD School - London - Apr 16th, 2014
Alessandro Finamore <[email protected]>
Passive inference: Troubleshooting the Cloud with Tstat
TMA: Traffic Monitoring and Analysis
Active vs. passive inference

Active inference:
- Study cause/effect relationships, i.e., inject some traffic into the network to observe a reaction
- PRO: world-wide scale (e.g., PlanetLab)
- CON: synthetic benchmarks suffer from a lack of generality

Passive inference:
- Study traffic properties just by observing the traffic, without interfering with it
- PRO: study traffic generated by actual Internet users
- CON: limited number of vantage points
The network monitoring playground

- Deploy some vantage points (passive probes)
- Collect some measurements (data)
- Extract analytics (post-processing)

Example questions: what is the performance of YouTube video streaming? What is the performance of a cache?

Challenges? Automation, flexibility/openness
Pushing the paradigm further with mPlane

FP7 European project about the design and implementation of a measurement plane for the Internet
- Large scale: vantage points deployed on a worldwide scale
- Flexible: offers APIs for integrating existing measurement frameworks; not strictly bound to specific "use cases"
- Intelligent: automate/simplify the process of "cooking" raw data; identify anomalies and unexpected events; provide root-cause-analysis capabilities
mPlane consortium
- Marco Mellia, POLITO (coordinator)
- Saverio Nicolini, NEC
- Dina Papagiannaki, Telefonica
- Ernst Biersack, Eurecom
- Brian Trammell, ETH
- Tivadar Szemethy, NetVisor
- Dario Rossi, ENST
- Fabrizio Invernizzi, Telecom Italia
- Guy Leduc, Univ. Liege
- Pietro Michiardi, Eurecom
- Pedro Casas, FTW
- Andrea Fregosi, Fastweb

16 partners: 3 operators, 6 research centers, 5 universities, 2 small enterprises
FP7 IP, 3 years long, 11 Meuro
Pushing the paradigm further with mPlane

- Active probes and passive probes feed data to post-processing under a common control plane
- Active and passive analysis for iterative root-cause analysis
- Integration with existing monitoring frameworks

What else besides mPlane?

- "From global measurements to local management": a Specific Targeted Research Project (STReP), 3 years (2 left), 10 partners, 3.8 Meuros; builds a measurement framework out of probes
- IETF Large-Scale Measurement of Broadband Performance (LMAP): a standardization effort on how to do broadband measurements; defines the components, protocols, rules, etc.; it does not specifically target adding "a brain" to the system; ... it is a sort of "mPlane use case", with strong similarities in the architecture core
The network monitoring trinity

- Raw measurements: how to process network traffic? How to scale to 10 Gbps?
- Repository
- Post-processing

Try not to focus on just one aspect but rather on "mastering the trinity"
Tstat is the passive sniffer developed @POLITO over the last 10 years
http://tstat.polito.it

- Deployed at the border router, observing the traffic between a private network and the rest of the world, and producing traffic stats
- Question: which are the most popular accessed services?
- Question: how are CDNs/datacenters composed?
Tstat is the passive sniffer developed @POLITO over the last 10 years
- Per-flow stats, including several L3/L4 metrics (e.g., #pkts, #bytes, RTT, TTL, etc.)
- Traffic classification: Deep Packet Inspection (DPI) and statistical methods (Skype, obfuscated P2P)
- Different output formats (logs, RRDs, histograms, pcap)
- Runs on off-the-shelf HW: up to 2 Gb/s with a standard NIC
- Currently adopted in real network scenarios (campus and ISP)

http://tstat.polito.it
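Tstat itself is written in C, but the per-flow accounting it performs can be sketched in a few lines. The following is a minimal illustration, not Tstat's actual API: packets are assumed to arrive as (src, dst, sport, dport, proto, size) tuples, and the flow key is direction-agnostic so both halves of a conversation update the same counters.

```python
from collections import defaultdict

def flow_key(src, dst, sport, dport, proto):
    """Direction-agnostic 5-tuple: both directions map to the same flow."""
    a, b = (src, sport), (dst, dport)
    return (proto,) + (a + b if a <= b else b + a)

def account(stats, pkt):
    """Update per-flow counters (#pkts, #bytes) for one packet."""
    src, dst, sport, dport, proto, size = pkt
    s = stats[flow_key(src, dst, sport, dport, proto)]
    s["pkts"] += 1
    s["bytes"] += size

stats = defaultdict(lambda: {"pkts": 0, "bytes": 0})
pkts = [
    ("10.0.0.1", "1.2.3.4", 50000, 80, "tcp", 60),    # client -> server
    ("1.2.3.4", "10.0.0.1", 80, 50000, "tcp", 1500),  # server -> client
]
for p in pkts:
    account(stats, p)
# one bidirectional flow with 2 packets and 1560 bytes
```

Real per-flow metrics such as RTT additionally require keeping per-flow timestamps, but the bookkeeping pattern is the same.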
Research/technology challenge

Challenge: is it possible to build a "full-fledged" passive probe that copes with >10 Gbps?
- Ad-hoc NICs are too expensive (>10 keuro)
- Software solutions built on top of common Intel NICs: ntop DNA, netmap, PFQ

By offering direct access to the NIC (i.e., bypassing the kernel stack), these libraries can count packets at wire speed... but what about doing real processing?

[ACM Queue] Revisiting network I/O APIs: The netmap Framework
[PAM'12] PFQ: a Novel Engine for Multi-Gigabit Packet Capturing With Multi-Core Commodity Hardware
[IMC'10] High Speed Network Traffic Analysis with Commodity Multi-core Systems
Possible system architecture

Read pkts -> dispatch/scheduling -> consumer 1 ... consumer N -> out1 ... outN -> merge

- Under testing: a solution based on libDNA
- One or more processes for reading? Depends...
- If needed, design "mergeable" outputs
- Per-flow packet scheduling is the simplest option, but:
  - what about correlating multiple flows (e.g., DNS/TCP)?
  - what about scheduling per traffic class?
- How to organize the analysis modules' workflow? N identical consumer instances? Within each consumer, a single execution flow?
[Figure: % pkts dropped vs. wire speed [Gbps], Tstat + libDNA on synthetic traffic; margin to improve]
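The per-flow dispatch step above can be sketched as a symmetric hash on the 5-tuple: packets of the same flow always land on the same consumer, so each consumer keeps its own flow state without locking. A minimal sketch under assumed names (consumer count, tuple layout, and the in-memory queues are all illustrative; a real probe would use lock-free rings):

```python
N_CONSUMERS = 4

def consumer_id(src, dst, sport, dport, proto, n=N_CONSUMERS):
    """Symmetric hash: both directions of a flow go to the same consumer."""
    endpoints = frozenset([(src, sport), (dst, dport)])  # order-insensitive
    return hash((proto, endpoints)) % n

queues = [[] for _ in range(N_CONSUMERS)]
pkts = [
    ("10.0.0.1", "1.2.3.4", 50000, 80, "tcp"),
    ("1.2.3.4", "10.0.0.1", 80, 50000, "tcp"),  # reverse direction
]
for p in pkts:
    queues[consumer_id(*p)].append(p)
# both packets of the flow end up in the same queue
```

This also makes the slide's caveat concrete: hashing per flow sends a client's DNS and TCP flows to different consumers; hashing on the client IP instead would keep them together, at the cost of load imbalance.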
Other traffic classification tools?
- WAND (Shane Alcock) - http://research.wand.net.nz
  - Libprotoident: traffic classification using 4 bytes of payload
  - Libtrace: rebuilds TCP/UDP streams, plus other tools for processing pcaps
- ntop (Luca Deri) - http://www.ntop.org/products/ndpi
  - nDPI, a superset of OpenDPI
- l7filter, but it is known to be inaccurate

The literature is full of statistical/behavioral traffic classification methodologies [1,2], but AFAIK no real deployment and no open-source tool released

[1] "A survey of techniques for internet traffic classification using machine learning", IEEE Communications Surveys & Tutorials, 2009
[2] "Reviewing Traffic Classification", LNCS Vol. 7754, 2013

It doesn't matter having a fancy classifier if you do not have proper flow characterization
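Libprotoident's idea, i.e., looking only at the first few payload bytes of a flow, can be illustrated with a toy prefix matcher. The prefixes below are real well-known protocol markers (HTTP methods, the TLS handshake record type, the SSH banner), but the table and function are made up for illustration and are nowhere near a complete signature set:

```python
SIGNATURES = {
    b"GET ": "http",
    b"POST": "http",
    b"HTTP": "http",      # server-side response
    b"\x16\x03": "tls",   # TLS handshake record (type 0x16, version 3.x)
    b"SSH-": "ssh",
}

def classify(first_bytes: bytes) -> str:
    """Match the first payload bytes of a flow against known prefixes."""
    for prefix, proto in SIGNATURES.items():
        if first_bytes.startswith(prefix):
            return proto
    return "unknown"

print(classify(b"GET / HTTP/1.1\r\n"))    # http
print(classify(b"\x16\x03\x01\x00\xc8"))  # tls
```

Inspecting only a handful of bytes per direction keeps per-flow state tiny, which is exactly why this approach scales better than full-payload DPI.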
Measurement frameworks
- RIPE Atlas - http://ripe.atlas.net
  - Worldwide deployment of inexpensive active probes
  - User Defined Measurements (UDM), credit based
  - Ping, traceroute/traceroute6, DNS, HTTP
- Google M-Lab Network Diagnostic Test (NDT) - http://mlab-live.appspot.com/tools/ndt
  - Connectivity and bandwidth speed
  - Publicly available data... but IMO not straightforward to use
Recent research activities

- Raw measurements: how to process network traffic? How to scale to 10 Gbps?
- Repository: how to export/consolidate data continuously? What about BigData?
- Post-processing
(Big)Data export frameworks

An overcrowded scenario
https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation
(Big)Data export frameworks

An overcrowded scenario, but all general-purpose frameworks:
- data center scale
- emphasis on throughput and/or real-time and/or consistency, etc.
- typically designed/optimized for HDFS

log_sync, an "ad-hoc" solution @POLITO:
- designed to manage a few passive probes
- emphasis on throughput and data consistency
Data management @POLITO

ISP/Campus probes (probe1 ... probeN) run log_sync (client) and pre-processing on dual 4-core machines with 3TB disk and 16GB RAM; a gateway runs log_sync (server) towards the NAS and the cluster.

Cluster:
- 11 nodes = 9 data nodes + 2 namenodes
- 416GB RAM = 32GBx9 + 64GBx2
- ~32TB HDFS
- a single 6-core CPU per node = 66 cores (x2 with HT)
- Debian 6 + CDH 4.5.0

NAS: ~40TB (3TB x 12) = 1 year of data
BigData = Hadoop? Almost true, but there are other NoSQL solutions
- MongoDB, REDIS, Cassandra, Spark, Neo4J, etc. (http://nosql-database.org)

How to choose? Not so easy to say, but:
- avoid BigData frameworks if you have just a few GB of data
- sooner or later you are going to do some coding, so pick something that seems "comfortable"

Fun fact: MapReduce is a NoSQL paradigm, but people are used to SQL queries
- hence the rise of Pig, Hive, Impala, Shark, etc., which allow SQL-like queries on top of MapReduce
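The contrast above can be shown in miniature: a SQL-like query such as SELECT fqdn, SUM(bytes) ... GROUP BY fqdn, which Pig or Hive would compile down to MapReduce, is just map (emit key/value pairs), shuffle (group by key), reduce (aggregate). A self-contained single-process sketch with made-up log records (real MapReduce distributes these three phases across machines):

```python
from itertools import groupby
from operator import itemgetter

logs = [
    {"fqdn": "img.akamai.example", "bytes": 1200},
    {"fqdn": "cdn.other.example", "bytes": 300},
    {"fqdn": "img.akamai.example", "bytes": 800},
]

# map: emit (key, value) pairs
mapped = [(r["fqdn"], r["bytes"]) for r in logs]

# shuffle: bring pairs with the same key together
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# reduce: aggregate each group
totals = {k: sum(v for _, v in pairs) for k, pairs in grouped}
print(totals)  # {'cdn.other.example': 300, 'img.akamai.example': 2000}
```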
Recent research activities

- Raw measurements: how to process network traffic? How to scale to 10 Gbps?
- Repository: how to export/consolidate data continuously? What about BigData?
- Post-processing: case study of an Akamai "cache" performance

"DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring", TRAC'14
Monitoring an Akamai cache
- Focusing on a vantage point of ~20k ADSL customers
- 1 week of HTTP logs (May 2012)
- Content served by the Akamai CDN
- The ISP hosts an Akamai "preferred cache" (a specific /25 subnet)
Reasoning about the problem

- Q1: is this affecting specific FQDNs?
- Q2: are the variations due to "faulty" servers?
- Q3: was this triggered by CDN performance issues?
- Etc.

How to automate/simplify this reasoning? DBStream (FTW):
- continuous big data analytics
- flexible processing language
- full SQL processing capabilities
- processing in small batches
- storage for post-mortem analysis
Q1: Is this affecting a specific FQDN?

- Select the top 500 Fully Qualified Domain Names (FQDNs) served by Akamai
- Check if they are served by the preferred /25 subnet
- Repeat every 5 min

Observations: some FQDNs are not served by the preferred cache; others are hosted by the preferred cache except during the anomaly. The two sets have "services" in common, and the results are the same when extending beyond 500 FQDNs.

Answer: NO!
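The Q1 check can be sketched with the stdlib ipaddress module: bucket each HTTP log line into a 5-minute bin and record, per (bin, FQDN), whether the serving IP fell inside the preferred /25. The log layout, FQDN names, and the subnet below are all placeholders, not the actual ISP data:

```python
from ipaddress import ip_address, ip_network
from collections import defaultdict

PREFERRED = ip_network("203.0.113.0/25")  # placeholder for the ISP's /25
BIN = 300  # 5-minute bins, in seconds

logs = [  # (timestamp, fqdn, server_ip) extracted from HTTP logs
    (10, "img.akamai.example", "203.0.113.10"),
    (40, "img.akamai.example", "203.0.113.11"),
    (400, "img.akamai.example", "198.51.100.7"),  # left the preferred cache
]

served_from_preferred = defaultdict(set)
for ts, fqdn, ip in logs:
    hit = ip_address(ip) in PREFERRED
    served_from_preferred[(ts // BIN, fqdn)].add(hit)

for (b, fqdn), hits in sorted(served_from_preferred.items()):
    print(b, fqdn, "preferred" if hits == {True} else "not-only-preferred")
```

A (bin, FQDN) pair that flips from "preferred" to "not-only-preferred" is exactly the signal the slide describes: an FQDN leaving the preferred cache during the anomaly.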
Q2: Are the variations due to "faulty" servers?

- Compute the traffic volume per IP address
- Check the behavior during the disruption
- Repeat every 5 min

Answer: NO!
Q3: Was this triggered by performance issues?

- Compute the distribution of the server query elaboration time, i.e., the time between the TCP ACK of the HTTP GET and the reception of the first byte of the reply
- Focus on the traffic of the /25 preferred subnet
- Compare the quartiles of the server elaboration time every 5 min

[Diagram: the passive probe sits between client and server; after the TCP handshake (SYN, SYN+ACK, ACK), the query processing time spans from the ACK of the GET to the first DATA packet]

Performance decreases right before the anomaly @6pm

Answer: YES!
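The Q3 metric, i.e., quartiles of the server elaboration time per 5-minute bin, is a one-liner with statistics.quantiles. The sample times below are invented; in the real analysis each sample would be the gap between the probe's timestamp of the GET's ACK and of the first reply byte:

```python
from statistics import quantiles
from collections import defaultdict

BIN = 300  # 5-minute bins, in seconds
samples = [  # (timestamp, elaboration_time_ms), invented values
    (10, 12.0), (50, 15.0), (120, 11.0), (200, 14.0),
    (310, 80.0), (350, 95.0), (400, 90.0), (550, 85.0),
]

by_bin = defaultdict(list)
for ts, ms in samples:
    by_bin[ts // BIN].append(ms)

for b in sorted(by_bin):
    q1, q2, q3 = quantiles(by_bin[b], n=4)  # quartiles of the bin
    print(f"bin {b}: Q1={q1} median={q2} Q3={q3}")
```

Tracking quartiles rather than the mean is the point of the slide: a shift of the whole distribution (as in the second bin here) shows up robustly even with a few outliers.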
Reasoning about the problem

- Q1: is this affecting only specific services? NO
- Q2: are the variations due to "faulty" servers? NO
- Q3: was this triggered by CDN performance issues? YES
- What else?
  - Do other vantage points report the same problem? YES!
  - What about extending the time period? The anomaly is present along the whole period we considered; the analysis is being extended to more recent data sets (possibly exposing other effects/anomalies)
  - Routing? TODO: Route Views
  - DNS mapping? TODO: RIPE Atlas + the ISP's active probing infrastructure
  - Other suggestions are welcome
...ok, but what are the final takeaways?
- Try to automate your analysis
- Think about what you measure, and be creative, especially for visualization
- Enlarge your perspective: multiple vantage points, multiple data sources, analysis on large time windows
- Don't be afraid to ask for opinions
Questions? <[email protected]>
TMA: Traffic Monitoring and Analysis