DETECTING RESOLVERS AT .NZ

Jing Qiao, Sebastian Castro

DNS-OARC 29, Amsterdam, October 2018



BACKGROUND


DNS TRAFFIC IS NOISY

• Contrary to common belief, not every source seen at an authoritative nameserver is a resolver
• Traffic from non-resolvers can skew our Domain Popularity ranking results
• Can we tell whether a source is a resolver?


OBJECTIVE

A classifier to predict the probability of a source being a resolver.

It’s not a simple task:
1. Trail blazing
2. Uncertainty about resolvers’ patterns

“In the search of resolvers”, OARC 25

“Understanding traffic sources seen at .nz”, OARC 27


USING DNS KNOWLEDGE TO REDUCE THE SCOPE


REMOVE NOISE BY CRITERIA

Observation window: 4 weeks (28/08/2017 – 24/09/2017)

• 27.8% of sources queried for only 1 domain
• 45.5% of sources queried for only 1 query type
• 25% of sources queried only 1 of the 7 .nz servers
• 65.8% of sources sent no more than 10 queries per day

Removing these sources reduces the address list from 2M to 550k.
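The criteria above could be applied to per-source summaries roughly as follows. This is a minimal sketch: the `SourceStats` schema, the choice to drop a source when it meets any one criterion, and the use of a mean daily query count are all assumptions, since the slides give the percentages but not the implementation.

```python
from dataclasses import dataclass


@dataclass
class SourceStats:
    """Per-source summary over the 4-week window (hypothetical schema)."""
    address: str
    distinct_domains: int
    distinct_qtypes: int
    distinct_servers: int
    mean_queries_per_day: float


def is_noise(s: SourceStats) -> bool:
    # Assumption: a source counts as noise if it meets ANY of the four criteria.
    return (
        s.distinct_domains <= 1
        or s.distinct_qtypes <= 1
        or s.distinct_servers <= 1
        or s.mean_queries_per_day <= 10
    )


def remove_noise(sources):
    """Return only the sources that meet none of the noise criteria."""
    return [s for s in sources if not is_noise(s)]
```

For example, a source that sends 500 queries a day but only ever asks about one domain would be dropped, which matches the intent of keeping sources that behave like general-purpose resolvers.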


FOCUS ON ACTIVE SOURCES

Keep only sources that are:
• Active in at least 75% of the total hours
• Active on at least 5 of 7 days per week

The address list is further reduced to 82k.
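One way to express the two activity criteria in code is sketched below. It assumes that both criteria must hold, that the 5-of-7-days rule is checked for every week of the window, and that a source is "active" in an hour if it sent at least one query in it; none of these details are stated on the slides, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_active(timestamps, window_start, weeks=4,
              min_hour_share=0.75, min_days_per_week=5):
    """Check the activity criteria for one source.

    timestamps: iterable of timezone-aware datetimes, one per query.
    """
    total_hours = weeks * 7 * 24
    active_hours = set()
    active_days = [set() for _ in range(weeks)]
    for ts in timestamps:
        hour_idx = int((ts - window_start).total_seconds() // 3600)
        if not 0 <= hour_idx < total_hours:
            continue  # query falls outside the observation window
        active_hours.add(hour_idx)
        day_idx = hour_idx // 24
        active_days[day_idx // 7].add(day_idx % 7)
    enough_hours = len(active_hours) / total_hours >= min_hour_share
    # Assumption: the days-per-week rule must hold in every week.
    enough_days = all(len(days) >= min_days_per_week for days in active_days)
    return enough_hours and enough_days
```

Note that under these assumptions a source active around the clock on exactly 5 days a week covers only 5/7 (about 71%) of the hours, so it would pass the day rule but fail the hour rule.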


ASSUMPTION: EACH REMAINING SOURCE IS EITHER A RESOLVER OR A MONITOR.

THE KEY IS TO FIND DISCRIMINATING FEATURES.


DATASET

• .nz DNS traffic across 4 weeks
• 82k unique sources


KNOWN SAMPLES

2515 resolvers
• ISPs, Google DNS, OpenDNS, Education & Research, addresses collected from RIPE Atlas probes

106 monitors
• ICANN, Pingdom, ThousandEyes, RIPE Atlas probes, RIPE Atlas anchors


NO CLEAR SPLIT OVER SINGLE FEATURE

For any single feature, resolvers and monitors are distributed across overlapping ranges.


TEMPORAL FEATURES

Query features: query type, query name, query rate, DNS flags, response code, ...

• Each feature can be described as a time series
• Summarized by mean, standard deviation, and percentiles (10th, 90th)
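As an illustration of the summary statistics listed on this slide, the sketch below reduces one per-source time series (for example, queries per hour) to four numbers. The function names, the linear-interpolation percentile, and the use of the population standard deviation are assumptions, not details from the talk.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile of a non-empty sequence, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(series):
    """Summarize one feature's time series for a single source."""
    return {
        "mean": statistics.fmean(series),
        "std": statistics.pstdev(series),  # population standard deviation
        "p10": percentile(series, 10),
        "p90": percentile(series, 90),
    }
```

Applying `summarize` to each query feature of a source yields a fixed-length vector regardless of how many queries the source sent.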


ENTROPY FEATURES

Queries from a resolver should be more random than those from a monitor.

Timing entropy:
• Time lag between successive queries
• Time lag between successive queries of the same query name and query type

Query name entropy:
• Similarity of the query names (Jaro-Winkler string distance) between successive queries
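The timing part of this idea can be sketched as a Shannon entropy over binned inter-query lags. The slides do not say which entropy estimator or bin width was used, so both are assumptions here, and the query name entropy (Jaro-Winkler similarity) is not covered by this sketch.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def timing_entropy(timestamps, bin_seconds=1.0):
    """Entropy of the time lags between successive queries.

    timestamps: query times in seconds. Lags are grouped into bins of
    bin_seconds (an assumed discretization) before counting.
    """
    ordered = sorted(timestamps)
    lags = [b - a for a, b in zip(ordered, ordered[1:])]
    if not lags:
        return 0.0
    binned = [int(lag // bin_seconds) for lag in lags]
    return shannon_entropy(binned)
```

A monitor probing every 60 seconds produces a single lag value and therefore zero entropy, while irregular resolver traffic spreads over many lag bins and scores higher.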


VARIABILITY

A monitor’s query flow is likely to be less variable than a resolver’s.
• Finer aggregation across hour tiles
• Variance metrics
  • Interquartile Range
  • Quartile Coefficient of Dispersion
  • Mean Absolute Difference
  • Median Absolute Deviation
  • Coefficient of Variation
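The five variance metrics have standard definitions, sketched below for one series of per-hour counts. The quartile method, the use of the population standard deviation, and returning 0.0 when a denominator is zero are assumptions made for this example.

```python
import statistics


def _ratio(num, den):
    # Assumption: report 0.0 instead of failing when the denominator is zero.
    return num / den if den else 0.0


def variability(values):
    """Variability metrics for a series with at least two observations."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.fmean(values)
    median = statistics.median(values)
    n = len(values)
    # Mean absolute difference: average |x_i - x_j| over all ordered pairs i != j.
    pair_sum = sum(abs(a - b) for a in values for b in values)
    return {
        "iqr": q3 - q1,
        "qcd": _ratio(q3 - q1, q3 + q1),
        "mean_abs_difference": pair_sum / (n * (n - 1)),
        "median_abs_deviation": statistics.median(abs(v - median) for v in values),
        "cv": _ratio(statistics.pstdev(values), mean),
    }
```

A perfectly regular monitor (the same count every hour) scores zero on all five metrics, which is the contrast with resolvers that this slide relies on.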


FEATURE SELECTION

• Remove redundant features whose correlation is above 0.95
• Select the most relevant features using statistical tests such as the F-test and Mutual Information (MI)
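The redundancy step can be sketched as a greedy filter on pairwise correlation. This example assumes Pearson correlation, compares absolute values against the 0.95 threshold, and keeps the first feature of each correlated group; the feature names are hypothetical, and the F-test and MI ranking step is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def drop_correlated(features, threshold=0.95):
    """Return the names of features to keep.

    features: dict mapping feature name to its column of values.
    A feature is kept only if its absolute correlation with every
    already-kept feature is at or below the threshold.
    """
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Because the filter is greedy, the result depends on the order of the features, so the most interpretable features are best listed first.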


FINAL FEATURE SET

50 features of different scales

Normalize to comparable scales:
• Standardization: zero mean and unit variance
• Quantile Transformation: robust to outliers
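The two normalizations can be illustrated on a single feature column. This is a simplified sketch: `quantile_transform` here is a plain rank-based mapping to [0, 1] written for this example, not the implementation of any particular library.

```python
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


def quantile_transform(column):
    """Map each value to its empirical quantile in [0, 1].

    Tied values share the average of their ranks.
    """
    ordered = sorted(column)
    n = len(ordered)
    if n == 1:
        return [0.5]
    first, last = {}, {}
    for i, v in enumerate(ordered):
        first.setdefault(v, i)  # lowest rank of this value
        last[v] = i             # highest rank of this value
    return [(first[v] + last[v]) / 2 / (n - 1) for v in column]
```

With a column such as `[1, 2, 3, 4, 1000]`, standardization leaves the four small values squeezed together because the outlier dominates the mean and variance, while the quantile transform spaces all five values evenly.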


VERIFY THE FEATURE SET BY CLUSTERING

Algorithms:
• K-Means
• Gaussian Mixture Model
• MeanShift
• DBSCAN
• Agglomerative Clustering

Evaluation metrics:
• Adjusted Rand Index
• Homogeneity Score
• Completeness Score
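For reference, the Adjusted Rand Index compares a clustering against known labels by counting pairs of samples that are grouped consistently, corrected for chance. The sketch below follows the standard pair-counting formula; it is written for illustration and is not taken from the authors' code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between known labels and cluster assignments."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single class and a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

A score of 1.0 means the clusters reproduce the known labels exactly (up to renaming), and a score near 0 means the agreement is no better than random assignment.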


CLUSTERING RESULT

The model with the best performance on the known samples: Gaussian Mixture with 5 clusters
• Adjusted Rand Index: 0.849789 (good)
• Homogeneity score: 0.960086 (good)
• Completeness score: 0.671609 (average, but acceptable because resolvers can be separated into a set of clusters different from those of monitors)
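The remark about completeness can be made concrete with the entropy-based definitions of the two scores: homogeneity is 1 - H(class | cluster) / H(class) and completeness is 1 - H(cluster | class) / H(cluster). The sketch below uses toy labels, not the .nz data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (_, g), c in joint.items())


def homogeneity_completeness(classes, clusters):
    """Return (homogeneity, completeness) of clusters against known classes."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    return homogeneity, completeness
```

If four resolvers are split evenly over two pure clusters and two monitors form a third, homogeneity is 1.0 while completeness is about 0.58: every cluster is pure, but one class is spread over several clusters. That is the situation described as acceptable on this slide.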


VERIFY WITH GROUND TRUTH

• Most resolvers are separated from monitors
• OpenDNS is a special case
  • Only 7 samples
  • Specific behavior


CONCLUSION

Excluding the OpenDNS samples, 90% of resolvers (Clusters 0, 2, 4) are separated from 91% of monitors (Clusters 1, 3) using our feature set.


SUPERVISED CLASSIFIER

Training data
• 2515 positive (resolver)
• 106 negative (monitor)

50 features
4-week period


AUTOML

Challenges of training a classifier:
• Searching algorithms and hyperparameters
• Performance and efficiency

We use auto-sklearn to address these challenges:
• Off-the-shelf toolkit that automates model selection and hyperparameter tuning
• Improves performance and efficiency using Bayesian optimization, meta-learning, and ensemble construction
• Reference: Feurer et al., “Efficient and Robust Automated Machine Learning”, Advances in Neural Information Processing Systems 28 (NIPS 2015)


PRELIMINARY RESULT

An ensemble of 28 models:
• Running time: 10 minutes
• Accuracy score: 0.991
• Precision score: 0.991
• Recall score: 1.000
• F1 score: 0.995
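The four scores follow from the confusion counts of a binary classifier. The sketch below uses toy labels (1 for resolver, 0 for monitor) purely to show the definitions; it does not reproduce the evaluation behind the numbers on this slide.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

A recall of 1.0 combined with a precision slightly below 1.0 arises when every known resolver is found but a few monitors are also labelled as resolvers.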


CURRENT STATUS

• Train the classifier regularly on a sliding window
  • Takes a few hours, mostly on feature generation
• Predict the type of a source using a threshold on the probability of behaving like a resolver
  • We picked probabilities higher than 0.7
• Integrate with the popularity ranking algorithm
  • Improved ranking for known domains
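The thresholding step and its use as a filter ahead of ranking might look like the sketch below. The probability mapping, the `(source, domain)` query tuples, and counting distinct resolver sources per domain are all hypothetical stand-ins; the actual popularity ranking algorithm is not described in these slides.

```python
from collections import defaultdict


def resolver_addresses(probabilities, threshold=0.7):
    """Keep addresses whose resolver probability is strictly above the threshold."""
    return {addr for addr, p in probabilities.items() if p > threshold}


def resolver_counts_per_domain(queries, resolvers):
    """Count distinct resolver sources per domain.

    queries: iterable of (source_address, domain) tuples.
    This count is only an illustrative input to a ranking, not the ranking itself.
    """
    seen = defaultdict(set)
    for source, domain in queries:
        if source in resolvers:
            seen[domain].add(source)
    return {domain: len(sources) for domain, sources in seen.items()}
```

Queries from sources at or below the threshold are simply ignored, so a domain that is only ever probed by monitors does not appear in the counts at all.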


CURRENT STATUS

• Out of 100k source addresses classified, 73k were detected as resolvers

• The model identified 96% of the Illuminati probes as resolvers (a check requested by SIDN)


POTENTIAL USES & FUTURE WORK

With adjustments to the feature set and training data, source classification can be used to measure, from authoritative data (e.g. at the root servers):
• validating resolvers
• adoption of QNAME minimization in the wild
• etc.


ACKNOWLEDGEMENTS

• Daniel Griggs for his input on reducing noise
• Sebastian Castro for the variability metrics


REFERENCE

https://blog.nzrs.net.nz/source-address-clustering-feature-engineering/

https://blog.nzrs.net.nz/source-address-classification-clustering/
