4th TMA PhD School - London - Apr 16th, 2014
Alessandro Finamore <[email protected]>
Passive inference: Troubleshooting the Cloud with Tstat
TMA: Traffic Monitoring and Analysis
Active vs. passive inference

Active inference:
- Study cause/effect relationships, i.e., inject some traffic into the network to observe a reaction
- PRO: world-wide scale (e.g., PlanetLab)
- CON: synthetic benchmarks suffer from a lack of generality

Passive inference:
- Study traffic properties just by observing the traffic, without interfering with it
- PRO: study traffic generated by actual Internet users
- CON: limited number of vantage points
The network monitoring playground

- Deploy some vantage points (passive probes)
- Collect some measurements (data)
- Extract analytics (post-processing)

Example questions: what is the performance of YouTube video streaming? What is the performance of a cache?

Challenges? Automation, flexibility/openness
Pushing the paradigm further with mPlane

FP7 European project about the design and implementation of a measurement plane for the Internet
- Large scale: vantage points deployed on a worldwide scale
- Flexible: offers APIs for integrating existing measurement frameworks; not strictly bound to specific "use cases"
- Intelligent: automate/simplify the process of "cooking" raw data; identify anomalies and unexpected events; provide root-cause-analysis capabilities
mPlane consortium
- Marco Mellia, POLITO (coordinator)
- Saverio Nicolini, NEC
- Dina Papagiannaki, Telefonica
- Ernst Biersack, Eurecom
- Brian Trammell, ETH
- Tivadar Szemethy, NetVisor
- Dario Rossi, ENST
- Fabrizio Invernizzi, Telecom Italia
- Guy Leduc, Univ. Liege
- Pietro Michiardi, Eurecom
- Pedro Casas, FTW
- Andrea Fregosi, Fastweb

16 partners: 3 operators, 6 research centers, 5 universities, 2 small enterprises
FP7 IP, 3 years long, 11 Meuro
Pushing the paradigm further with mPlane

- Active probes and passive probes feed data to post-processing under a common control plane
- Active and passive analysis for iterative root-cause analysis
- Integration with existing monitoring frameworks

What else besides mPlane?

- "From global measurements to local management": a Specific Targeted Research Project (STReP), 3 years (2 left), 10 partners, 3.8 Meuros; builds a measurement framework out of probes
- IETF Large-Scale Measurement of Broadband Performance (LMAP): a standardization effort on how to do broadband measurements; defines the components, protocols, rules, etc.; it does not specifically target adding "a brain" to the system; ... it is a sort of "mPlane use case", with strong similarities in the architecture core
The network monitoring trinity

- Raw measurements: how to process network traffic? How to scale to 10 Gbps?
- Repository
- Post-processing

Try not to focus on just one aspect but rather on "mastering the trinity"
Tstat is the passive sniffer developed @POLITO over the last 10 years
http://tstat.polito.it

- Deployed at the border router, observing the traffic between a private network and the rest of the world, and producing traffic stats
- Question: which are the most popular accessed services?
- Question: how are CDNs/datacenters composed?
Tstat is the passive sniffer developed @POLITO over the last 10 years
- Per-flow stats, including several L3/L4 metrics (e.g., #pkts, #bytes, RTT, TTL, etc.)
- Traffic classification: Deep Packet Inspection (DPI) and statistical methods (Skype, obfuscated P2P)
- Different output formats (logs, RRDs, histograms, pcap)
- Runs on off-the-shelf HW: up to 2 Gb/s with a standard NIC
- Currently adopted in real network scenarios (campus and ISP)

http://tstat.polito.it
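Tstat itself is written in C, but the per-flow accounting it performs can be sketched in a few lines. The following is a minimal illustration, not Tstat's actual API: packets are assumed to arrive as (src, dst, sport, dport, proto, size) tuples, and the flow key is direction-agnostic so both halves of a conversation update the same counters.

```python
from collections import defaultdict

def flow_key(src, dst, sport, dport, proto):
    """Direction-agnostic 5-tuple: both directions map to the same flow."""
    a, b = (src, sport), (dst, dport)
    return (proto,) + (a + b if a <= b else b + a)

def account(stats, pkt):
    """Update per-flow counters (#pkts, #bytes) for one packet."""
    src, dst, sport, dport, proto, size = pkt
    s = stats[flow_key(src, dst, sport, dport, proto)]
    s["pkts"] += 1
    s["bytes"] += size

stats = defaultdict(lambda: {"pkts": 0, "bytes": 0})
pkts = [
    ("10.0.0.1", "1.2.3.4", 50000, 80, "tcp", 60),    # client -> server
    ("1.2.3.4", "10.0.0.1", 80, 50000, "tcp", 1500),  # server -> client
]
for p in pkts:
    account(stats, p)
# one bidirectional flow with 2 packets and 1560 bytes
```

Real per-flow metrics such as RTT additionally require keeping per-flow timestamps, but the bookkeeping pattern is the same.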
Research/technology challenge

Challenge: is it possible to build a "full-fledged" passive probe that copes with >10 Gbps?
- Ad-hoc NICs are too expensive (>10 keuro)
- Software solutions built on top of common Intel NICs: ntop DNA, netmap, PFQ

By offering direct access to the NIC (i.e., bypassing the kernel stack), these libraries can count packets at wire speed... but what about doing real processing?

[ACM Queue] Revisiting network I/O APIs: The netmap Framework
[PAM'12] PFQ: a Novel Engine for Multi-Gigabit Packet Capturing With Multi-Core Commodity Hardware
[IMC'10] High Speed Network Traffic Analysis with Commodity Multi-core Systems
Possible system architecture

Read pkts -> dispatch/scheduling -> consumer 1 ... consumer N -> out1 ... outN -> merge

- Under testing: a solution based on libDNA
- One or more processes for reading? Depends...
- If needed, design "mergeable" outputs
- Per-flow packet scheduling is the simplest option, but:
  - what about correlating multiple flows (e.g., DNS/TCP)?
  - what about scheduling per traffic class?
- How to organize the analysis modules' workflow? N identical consumer instances? Within each consumer, a single execution flow?
[Figure: % pkts dropped vs. wire speed [Gbps], Tstat + libDNA on synthetic traffic; margin to improve]
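The per-flow dispatch step above can be sketched as a symmetric hash on the 5-tuple: packets of the same flow always land on the same consumer, so each consumer keeps its own flow state without locking. A minimal sketch under assumed names (consumer count, tuple layout, and the in-memory queues are all illustrative; a real probe would use lock-free rings):

```python
N_CONSUMERS = 4

def consumer_id(src, dst, sport, dport, proto, n=N_CONSUMERS):
    """Symmetric hash: both directions of a flow go to the same consumer."""
    endpoints = frozenset([(src, sport), (dst, dport)])  # order-insensitive
    return hash((proto, endpoints)) % n

queues = [[] for _ in range(N_CONSUMERS)]
pkts = [
    ("10.0.0.1", "1.2.3.4", 50000, 80, "tcp"),
    ("1.2.3.4", "10.0.0.1", 80, 50000, "tcp"),  # reverse direction
]
for p in pkts:
    queues[consumer_id(*p)].append(p)
# both packets of the flow end up in the same queue
```

This also makes the slide's caveat concrete: hashing per flow sends a client's DNS and TCP flows to different consumers; hashing on the client IP instead would keep them together, at the cost of load imbalance.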
Other traffic classification tools?
- WAND (Shane Alcock) - http://research.wand.net.nz
  - Libprotoident: traffic classification using 4 bytes of payload
  - Libtrace: rebuilds TCP/UDP streams, plus other tools for processing pcaps
- ntop (Luca Deri) - http://www.ntop.org/products/ndpi
  - nDPI, a superset of OpenDPI
- l7filter, but it is known to be inaccurate

The literature is full of statistical/behavioral traffic classification methodologies [1,2], but AFAIK no real deployment and no open-source tool released

[1] "A survey of techniques for internet traffic classification using machine learning", IEEE Communications Surveys & Tutorials, 2009
[2] "Reviewing Traffic Classification", LNCS Vol. 7754, 2013

It doesn't matter having a fancy classifier if you do not have proper flow characterization
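Libprotoident's idea, i.e., looking only at the first few payload bytes of a flow, can be illustrated with a toy prefix matcher. The prefixes below are real well-known protocol markers (HTTP methods, the TLS handshake record type, the SSH banner), but the table and function are made up for illustration and are nowhere near a complete signature set:

```python
SIGNATURES = {
    b"GET ": "http",
    b"POST": "http",
    b"HTTP": "http",      # server-side response
    b"\x16\x03": "tls",   # TLS handshake record (type 0x16, version 3.x)
    b"SSH-": "ssh",
}

def classify(first_bytes: bytes) -> str:
    """Match the first payload bytes of a flow against known prefixes."""
    for prefix, proto in SIGNATURES.items():
        if first_bytes.startswith(prefix):
            return proto
    return "unknown"

print(classify(b"GET / HTTP/1.1\r\n"))    # http
print(classify(b"\x16\x03\x01\x00\xc8"))  # tls
```

Inspecting only a handful of bytes per direction keeps per-flow state tiny, which is exactly why this approach scales better than full-payload DPI.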
Measurement frameworks
- RIPE Atlas - http://ripe.atlas.net
  - Worldwide deployment of inexpensive active probes
  - User Defined Measurements (UDM), credit based
  - Ping, traceroute/traceroute6, DNS, HTTP
- Google M-Lab Network Diagnostic Test (NDT) - http://mlab-live.appspot.com/tools/ndt
  - Connectivity and bandwidth speed
  - Publicly available data... but IMO not straightforward to use
Recent research activities

- Raw measurements: how to process network traffic? How to scale to 10 Gbps?
- Repository: how to export/consolidate data continuously? What about BigData?
- Post-processing
(Big)Data export frameworks

An overcrowded scenario
https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation
(Big)Data export frameworks

An overcrowded scenario, but all general-purpose frameworks:
- data center scale
- emphasis on throughput and/or real-time and/or consistency, etc.
- typically designed/optimized for HDFS

log_sync, an "ad-hoc" solution @POLITO:
- designed to manage a few passive probes
- emphasis on throughput and data consistency
Data management @POLITO

ISP/Campus probes (probe1 ... probeN) run log_sync (client) and pre-processing on dual 4-core machines with 3TB disk and 16GB RAM; a gateway runs log_sync (server) towards the NAS and the cluster.

Cluster:
- 11 nodes = 9 data nodes + 2 namenodes
- 416GB RAM = 32GBx9 + 64GBx2
- ~32TB HDFS
- a single 6-core CPU per node = 66 cores (x2 with HT)
- Debian 6 + CDH 4.5.0

NAS: ~40TB (3TB x 12) = 1 year of data
BigData = Hadoop? Almost true, but there are other NoSQL solutions
- MongoDB, REDIS, Cassandra, Spark, Neo4J, etc. (http://nosql-database.org)

How to choose? Not so easy to say, but:
- avoid BigData frameworks if you have just a few GB of data
- sooner or later you are going to do some coding, so pick something that seems "comfortable"

Fun fact: MapReduce is a NoSQL paradigm, but people are used to SQL queries
- hence the rise of Pig, Hive, Impala, Shark, etc., which allow SQL-like queries on top of MapReduce
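The contrast above can be shown in miniature: a SQL-like query such as SELECT fqdn, SUM(bytes) ... GROUP BY fqdn, which Pig or Hive would compile down to MapReduce, is just map (emit key/value pairs), shuffle (group by key), reduce (aggregate). A self-contained single-process sketch with made-up log records (real MapReduce distributes these three phases across machines):

```python
from itertools import groupby
from operator import itemgetter

logs = [
    {"fqdn": "img.akamai.example", "bytes": 1200},
    {"fqdn": "cdn.other.example", "bytes": 300},
    {"fqdn": "img.akamai.example", "bytes": 800},
]

# map: emit (key, value) pairs
mapped = [(r["fqdn"], r["bytes"]) for r in logs]

# shuffle: bring pairs with the same key together
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# reduce: aggregate each group
totals = {k: sum(v for _, v in pairs) for k, pairs in grouped}
print(totals)  # {'cdn.other.example': 300, 'img.akamai.example': 2000}
```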
Recent research activities

- Raw measurements: how to process network traffic? How to scale to 10 Gbps?
- Repository: how to export/consolidate data continuously? What about BigData?
- Post-processing: case study of an Akamai "cache" performance

"DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring", TRAC'14
Monitoring an Akamai cache
- Focusing on a vantage point of ~20k ADSL customers
- 1 week of HTTP logs (May 2012)
- Content served by the Akamai CDN
- The ISP hosts an Akamai "preferred cache" (a specific /25 subnet)
Reasoning about the problem

- Q1: is this affecting specific FQDNs?
- Q2: are the variations due to "faulty" servers?
- Q3: was this triggered by CDN performance issues?
- Etc.

How to automate/simplify this reasoning? DBStream (FTW):
- continuous big data analytics
- flexible processing language
- full SQL processing capabilities
- processing in small batches
- storage for post-mortem analysis
Q1: Is this affecting a specific FQDN?

- Select the top 500 Fully Qualified Domain Names (FQDNs) served by Akamai
- Check if they are served by the preferred /25 subnet
- Repeat every 5 min

Observations: some FQDNs are not served by the preferred cache; others are hosted by the preferred cache except during the anomaly. The two sets have "services" in common, and the results are the same when extending beyond 500 FQDNs.

Answer: NO!
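The Q1 check can be sketched with the stdlib ipaddress module: bucket each HTTP log line into a 5-minute bin and record, per (bin, FQDN), whether the serving IP fell inside the preferred /25. The log layout, FQDN names, and the subnet below are all placeholders, not the actual ISP data:

```python
from ipaddress import ip_address, ip_network
from collections import defaultdict

PREFERRED = ip_network("203.0.113.0/25")  # placeholder for the ISP's /25
BIN = 300  # 5-minute bins, in seconds

logs = [  # (timestamp, fqdn, server_ip) extracted from HTTP logs
    (10, "img.akamai.example", "203.0.113.10"),
    (40, "img.akamai.example", "203.0.113.11"),
    (400, "img.akamai.example", "198.51.100.7"),  # left the preferred cache
]

served_from_preferred = defaultdict(set)
for ts, fqdn, ip in logs:
    hit = ip_address(ip) in PREFERRED
    served_from_preferred[(ts // BIN, fqdn)].add(hit)

for (b, fqdn), hits in sorted(served_from_preferred.items()):
    print(b, fqdn, "preferred" if hits == {True} else "not-only-preferred")
```

A (bin, FQDN) pair that flips from "preferred" to "not-only-preferred" is exactly the signal the slide describes: an FQDN leaving the preferred cache during the anomaly.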
Q2: Are the variations due to "faulty" servers?

- Compute the traffic volume per IP address
- Check the behavior during the disruption
- Repeat every 5 min

Answer: NO!
Q3: Was this triggered by performance issues?

- Compute the distribution of the server query elaboration time, i.e., the time between the TCP ACK of the HTTP GET and the reception of the first byte of the reply
- Focus on the traffic of the /25 preferred subnet
- Compare the quartiles of the server elaboration time every 5 min

[Diagram: the passive probe sits between client and server; after the TCP handshake (SYN, SYN+ACK, ACK), the query processing time spans from the ACK of the GET to the first DATA packet]

Performance decreases right before the anomaly @6pm

Answer: YES!
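The Q3 metric, i.e., quartiles of the server elaboration time per 5-minute bin, is a one-liner with statistics.quantiles. The sample times below are invented; in the real analysis each sample would be the gap between the probe's timestamp of the GET's ACK and of the first reply byte:

```python
from statistics import quantiles
from collections import defaultdict

BIN = 300  # 5-minute bins, in seconds
samples = [  # (timestamp, elaboration_time_ms), invented values
    (10, 12.0), (50, 15.0), (120, 11.0), (200, 14.0),
    (310, 80.0), (350, 95.0), (400, 90.0), (550, 85.0),
]

by_bin = defaultdict(list)
for ts, ms in samples:
    by_bin[ts // BIN].append(ms)

for b in sorted(by_bin):
    q1, q2, q3 = quantiles(by_bin[b], n=4)  # quartiles of the bin
    print(f"bin {b}: Q1={q1} median={q2} Q3={q3}")
```

Tracking quartiles rather than the mean is the point of the slide: a shift of the whole distribution (as in the second bin here) shows up robustly even with a few outliers.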
Reasoning about the problem

- Q1: is this affecting only specific services? NO
- Q2: are the variations due to "faulty" servers? NO
- Q3: was this triggered by CDN performance issues? YES
- What else?
  - Do other vantage points report the same problem? YES!
  - What about extending the time period? The anomaly is present along the whole period we considered; the analysis is being extended to more recent data sets (possibly exposing other effects/anomalies)
  - Routing? TODO: Route Views
  - DNS mapping? TODO: RIPE Atlas + the ISP's active probing infrastructure
  - Other suggestions are welcome
...ok, but what are the final takeaways?
- Try to automate your analysis
- Think about what you measure, and be creative, especially for visualization
- Enlarge your perspective: multiple vantage points, multiple data sources, analysis on large time windows
- Don't be afraid to ask for opinions
Questions? <[email protected]>
TMA: Traffic Monitoring and Analysis