Anomaly Detection with Apache Spark
Sean Owen, Director of Data Science, Cloudera. Summer Course "Disruptive Innovation in Security Technologies" (Curso de Verano "Innovación Disruptiva en tecnologías de seguridad"), URJC's Vicálvaro Campus.
Anomaly Detection with Apache Spark: Workshop
Sean Owen / Director of Data Science / Cloudera
Anomaly Detection
• What is “unusual”?
  • Server metrics
  • Access patterns
  • Transactions
• Labeled, or not
  • Sometimes we know examples of “unusual”
  • Sometimes not
• Applications
  • Network security
  • IT monitoring
  • Fraud detection
  • Error detection
Clustering
• Identify dense clusters of data points
• Unusual = far from any cluster (see the scoring sketch below)
• What is “far”?
• Unsupervised learning
• Can “supervise” with some labels to improve or interpret
en.wikipedia.org/wiki/Cluster_analysis
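In code terms, the scoring idea is just “distance to the nearest cluster center”. A minimal sketch of that scoring function (the name anomalyScore is hypothetical, not from the slides):

import scala.math.sqrt

// Anomaly score = distance to the nearest of the cluster centers;
// a point far from every center gets a high score
def anomalyScore(point: Array[Double], centers: Seq[Array[Double]]): Double =
  centers.map(c =>
    sqrt(point.zip(c).map { case (x, y) => (x - y) * (x - y) }.sum)
  ).min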
k-means++ clustering
• Simple, well-known, parallel algorithm
• Iteratively assign points, update centers (“means”); a minimal sketch of the loop follows
• Goal: points close to their nearest cluster center
• Must choose k, the number of clusters
mahout.apache.org/users/clustering/fuzzy-k-means.html
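To make the iteration concrete, here is a minimal local sketch of the assign/update loop (plain Lloyd's iteration with random seeding; k-means++ differs only in how the initial centers are chosen, and this is not the MLlib implementation):

import scala.math.sqrt
import scala.util.Random

def dist(a: Array[Double], b: Array[Double]) =
  sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def kMeansSketch(points: Seq[Array[Double]], k: Int, iterations: Int) = {
  // Seed with k random points; k-means++ would spread these out instead
  val centers = Random.shuffle(points).take(k).toArray
  for (_ <- 1 to iterations) {
    // Assignment step: each point joins the cluster of its nearest center
    val clusters = points.groupBy(p =>
      centers.indices.minBy(i => dist(p, centers(i))))
    // Update step: each center moves to the mean of its assigned points
    for ((i, members) <- clusters)
      centers(i) = members.transpose.map(_.sum / members.size).toArray
  }
  centers
}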
Anomaly Detection in KDD Cup ‘99
KDD Cup 1999
• Annual ML competition: www.sigkdd.org/kddcup/index.php
• ’99: computer network intrusion detection
• 4.9M connections
• Most are normal, but many are known to be attacks
• Not a realistic sample!
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
Callouts on the slide mark the service (“http”), bytes received (45076), the % SYN errors rate columns, and the label (“normal.”, the last field).
Apache Spark: Something For Everyone
• From MS Dryad, UC Berkeley, Databricks
• Scala-based
  • Expressive, efficient
  • JVM-based
• Scala-like API
  • Distributed works like local, works like streaming
  • Collection-like, as Apache Crunch is
• Interactive REPL
• Distributed
• Hadoop-friendly
  • Integrates with where the data and cluster already are
  • ETL no longer a separate step
• MLlib
Clustering, Take #0
val rawData = sc.textFile("/user/srowen/kddcup.data", 120)
rawData: org.apache.spark.rdd.RDD[String] = MappedRDD[13] at textFile at <console>:15

rawData.count
...
res1: Long = 4898431
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
import scala.collection.mutable.ArrayBuffer

val dataAndLabel = rawData.map { line =>
  val buffer = ArrayBuffer[String]()
  buffer.appendAll(line.split(","))
  buffer.remove(1, 3)  // drop the three non-numeric, categorical columns
  val label = buffer.remove(buffer.length - 1)  // last field is the label
  val vector = buffer.map(_.toDouble).toArray
  (vector, label)
}

val data = dataAndLabel.map(_._1).cache()
import org.apache.spark.mllib.clustering._

val kmeans = new KMeans()
val model = kmeans.run(data)

model.clusterCenters.foreach(centroid =>
  println(java.util.Arrays.toString(centroid)))

// Count how the known labels distribute over the fitted clusters
val clusterAndLabel = dataAndLabel.map { case (data, label) =>
  (model.predict(data), label)
}
val clusterLabelCount = clusterAndLabel.countByValue

clusterLabelCount.toList.sorted.foreach { case ((cluster, label), count) =>
  println(f"$cluster%1s$label%18s$count%8s")
}
0  back.             2203
0  buffer_overflow.    30
0  ftp_write.           8
0  guess_passwd.       53
0  imap.               12
0  ipsweep.         12481
0  land.               21
0  loadmodule.          9
0  multihop.            7
0  neptune.       1072017
0  nmap.             2316
0  normal.         972781
0  perl.                3
0  phf.                 4
0  pod.               264
0  portsweep.       10412
0  rootkit.            10
0  satan.           15892
0  smurf.         2807886
0  spy.                 2
0  teardrop.          979
0  warezclient.      1020
0  warezmaster.        20
1  portsweep.           1
Terrible: with KMeans’ default k = 2, essentially every point falls into one giant cluster.
Clustering, Take #1: Choose k
import scala.math._
import org.apache.spark.rdd._

// Euclidean distance between two feature vectors
def distance(a: Array[Double], b: Array[Double]) =
  sqrt(a.zip(b).map(p => p._1 - p._2).map(d => d * d).sum)

// Average distance from each point to its nearest centroid, for a given k
def clusteringScore(data: RDD[Array[Double]], k: Int) = {
  val kmeans = new KMeans()
  kmeans.setK(k)
  val model = kmeans.run(data)
  val centroids = model.clusterCenters
  data.map(datum =>
    distance(centroids(model.predict(datum)), datum)).mean
}

val kScores = (5 to 40 by 5).par.map(k => (k, clusteringScore(data, k)))
(5,  1938.8583418059309)
(10, 1614.7511288131)
(15, 1406.5960973638971)
(20, 1111.5970245349558)
(25, 905.536686115762)
(30, 931.7399112938756)
(35, 550.3231624120361)
(40, 443.10108628017787)
kmeans.setRuns(10)
kmeans.setEpsilon(1.0e-6)

(30 to 100 by 10)

(30, 886.974050712821)
(40, 747.4268153420192)
(50, 370.2801596900413)
(60, 325.883722754848)
(70, 276.05785104442657)
(80, 193.53996444359856)
(90, 162.72596475533814)
(100, 133.19275833671574)
Clustering, Take #2: Normalize
data.unpersist(true)  // release the unnormalized RDD from the cache

val numCols = data.take(1)(0).length
val n = data.count
val sums = data.reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
// Per-column sums of squares; map-then-reduce rather than fold, since fold
// reuses the same function to merge partition results and would square them
val sumSquares = data.map(_.map(d => d * d)).
  reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
val stdevs = sumSquares.zip(sums).map {
  case (sumSq, sum) => sqrt(n * sumSq - sum * sum) / n
}
val means = sums.map(_ / n)

// Standardize: subtract each column's mean, divide by its standard deviation
val normalizedData = data.map(datum =>
  (datum, means, stdevs).zipped.map((value, mean, stdev) =>
    if (stdev <= 0) (value - mean) else (value - mean) / stdev)).cache()

val kScores = (50 to 120 by 10).par.map(k => (k, clusteringScore(normalizedData, k)))
(50,  0.008184436460307516)
(60,  0.005003794119180148)
(70,  0.0036252446694127255)
(80,  0.003448993315406253)
(90,  0.0028508261816040984)
(100, 0.0024371619202127343)
(110, 0.002273862516438719)
(120, 0.0022075535103855447)
Clustering, Take #3: Categoricals
val protocols = rawData.map(
  _.split(",")(1)).distinct.collect.zipWithIndex.toMap
...

val dataAndLabel = rawData.map { line =>
  val buffer = ArrayBuffer[String]()
  buffer.appendAll(line.split(","))
  val protocol = buffer.remove(1)
  val vector = buffer.map(_.toDouble)
  // One-hot encode the protocol: one 0/1 column per distinct value
  val newProtocolFeatures = new Array[Double](protocols.size)
  newProtocolFeatures(protocols(protocol)) = 1.0
  ...
  vector.insertAll(1, newProtocolFeatures)
  ...
  (vector.toArray, label)
}

(The elisions on the slide cover the same treatment of the other two categorical columns and of the label.)
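For a concrete picture of the encoding (this mapping is hypothetical; the real indices depend on whatever order distinct happens to return):

// Hypothetical mapping, for illustration only
val protocolsExample = Map("tcp" -> 0, "udp" -> 1, "icmp" -> 2)
val oneHot = new Array[Double](protocolsExample.size)
oneHot(protocolsExample("udp")) = 1.0
// oneHot is now Array(0.0, 1.0, 0.0)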
(50,  0.09807063330707691)
(60,  0.07344136010921463)
(70,  0.05098421746285664)
(80,  0.04059365147197857)
(90,  0.03647143491690264)
(100, 0.02384443440377552)
(110, 0.016909326439972006)
(120, 0.01610738339266529)
(130, 0.014301399891441647)
(140, 0.008563067306283041)
Clustering, Take #4: Labels, Entropy
0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
(The slide highlights the final field: the label, “normal.”)
Using Labels with Entropy
• Measures mixed-ness
• Bad clusters have very mixed labels
• A function of the cluster’s label frequencies, p(x)
• Good clustering = low-entropy clusters

entropy = -Σ p(x) log p(x)
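As a quick worked example: a cluster that is 90% “normal.” and 10% “smurf.” has entropy -(0.9 ln 0.9 + 0.1 ln 0.1) ≈ 0.33, while a 50/50 mixed cluster has entropy ≈ 0.69. Lower is purer.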
def entropy(counts: Iterable[Int]) = {
  val values = counts.filter(_ > 0)
  val sum: Double = values.sum
  values.map { v =>
    val p = v / sum
    -p * log(p)
  }.sum
}

// Score = average entropy of the labels in each cluster, weighted by cluster size
def clusteringScore(data: RDD[Array[Double]], labels: RDD[String], k: Int) = {
  ...
  val labelsInCluster = data.map(model.predict(_)).zip(labels).
    groupByKey.values
  val labelCounts = labelsInCluster.map(
    _.groupBy(l => l).map(t => t._2.length))
  val n = data.count
  labelCounts.map(m => m.sum * entropy(m)).sum / n
}
(30,  1.0266922080881913)
(40,  1.0226914826265483)
(50,  1.019971839275925)
(60,  1.0162839563855304)
(70,  1.0108882243857347)
(80,  1.0076114958062241)
(95,  0.4731290640152461)
(100, 0.5756131018520718)
(105, 0.9090079450132587)
(110, 0.8480807836884104)
(120, 0.3923520444828631)
Detecting an Anomaly
val kmeans = new KMeans()
kmeans.setK(95)
kmeans.setRuns(10)
kmeans.setEpsilon(1.0e-6)
val model = kmeans.run(normalizedData)

def distance(a: Array[Double], b: Array[Double]) =
  sqrt(a.zip(b).map(p => p._1 - p._2).map(d => d * d).sum)

val centroids = model.clusterCenters
val distances = normalizedData.map(datum =>
  (distance(centroids(model.predict(datum)), datum), datum))

// The five points farthest from their assigned centroids: candidate anomalies
distances.top(5)(Ordering.by[(Double, Array[Double]), Double](_._1))
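The slides stop at eyeballing the farthest points. One plausible next step (an assumption, not shown in the slides) is to turn the distances into a detector by fixing a cutoff, here arbitrarily the 100th-largest distance:

// Hypothetical thresholding sketch: flag anything farther from its
// centroid than the 100th most-distant training point
val threshold = distances.map(_._1).top(100).last
val anomalies = distances.filter { case (dist, _) => dist > threshold }
anomalies.count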
From Here to Production?
• Real data set!
• Algorithmic
  • Other distance metrics
  • k-means|| init
  • Use data point IDs
• Real-Time
  • Spark Streaming? (sketched below)
  • Storm?
• Continuous Pipeline
• Visualization
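As a closing illustration of the Spark Streaming idea, here is a hedged sketch of real-time scoring (it assumes the model, centroids, distance, and threshold from above, plus a hypothetical parse function that turns a raw connection record into a normalized feature vector; none of this appears in the slides):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Score each incoming record against the trained model every 10 seconds
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val anomalies = lines.map(parse).filter(datum =>
  distance(centroids(model.predict(datum)), datum) > threshold)
anomalies.print()  // or push alerts to a downstream system
ssc.start()
ssc.awaitTermination()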