clickstream data with spark

MAKING BIG DATA COME ALIVE

Clustering click-stream data using Spark

Marissa Saunders

Slides available at: http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark




Agenda

• Why?
– Why clustering?
– Why Spark?
– Why click-stream?
• What?
– What is the raw data?
• How?
– Parsing user agent data on Spark
– Distributed K-modes on Spark
• So what?
– Details of applying the method to this use case
– Resulting clusters
– Time access patterns
– Preferred websites
• Questions

Objectives

Understand:
• k-means and k-modes clustering
• why Spark is a good choice
• different data structures in Spark
– RDD, dataframe and dataset
• clickstream data and how user-agent parsing works

Demonstrate:
• mapping a function over an RDD
• defining a custom UDF and mapping it over a dataframe
• mapping a Python function over a partition
• how identifying different user types can drive insight into user behavior


Why Clustering

Why clustering?

We have a plot like this …

[Figure: Flight velocities vs. bird mass; legend: Bird Type]

• 2 groups of data
• Clustering can find them
• This can lead to insight …
– There are two different groups of unladen swallows
– The heavy species flies more slowly
– When asking for airspeed, we should specify if we mean African or European swallows

… with apologies to Monty Python

How does it work?

For 2 clusters:
1. Pick 2 points at random as centroids
2. Assign each data point to the cluster of its closest centroid
3. Calculate the mean of each cluster and use it as the new centroid
4. Repeat 2 and 3 until the centroids stop moving (converged)

This is called K-means clustering… and there is a Spark function for this
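The four steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the Spark function mentioned on the slide: the function names, the seed, and the swallow measurements are all made up for the example.

```python
import random

def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    return dists.index(min(dists))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k=2, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # 1. pick k points at random
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                            # 2. assign to closest centroid
            clusters[closest(p, centroids)].append(p)
        new_centroids = [mean(c) if c else centroids[i]   # 3. recompute the means
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:              # 4. stop once nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

# Toy data (invented): (mass, airspeed) for two groups of swallows
swallows = [(20.0, 11.0), (21.0, 10.5), (19.5, 11.2),   # lighter, faster
            (35.0, 8.0), (36.0, 7.5), (34.5, 8.3)]      # heavier, slower
centroids, clusters = kmeans(swallows, k=2)
```

With two well-separated groups like these, the light and heavy birds end up in separate clusters whichever starting points are drawn.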


What about categorical data?

• Use modes instead of means
– Most frequently occurring value
• Use binary distance metric for each dimension
– 0 = the same
– 1 = not the same
• Use the same iterative cluster assignment algorithm

This is called K-modes clustering

Color        Mass   Speed  Type
Green/Grey   Heavy  Slow   African
Green/Grey   Heavy  Fast   African
Green/Black  Heavy  Slow   African
Green/Grey   Light  Slow   African
Blue/White   Heavy  Fast   European
Blue/White   Light  Fast   European
Blue/Grey    Light  Slow   European
Blue/White   Light  Fast   European

… and we’ve open-sourced a Spark function for this
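The two ingredients of K-modes, the per-column mode and the 0/1 distance, can be tried on the swallow table directly. This is a small sketch in plain Python, not the open-sourced Spark package. It assumes the Type column is the group label and the other three columns are the features, and the helper names and the extra bird are invented for the example.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two categorical records differ (0 or 1 per dimension)."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(records):
    """Column-wise most frequent value: the K-modes analogue of a centroid."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Rows from the swallow table: (color, mass, speed)
african = [("Green/Grey", "Heavy", "Slow"), ("Green/Grey", "Heavy", "Fast"),
           ("Green/Black", "Heavy", "Slow"), ("Green/Grey", "Light", "Slow")]
european = [("Blue/White", "Heavy", "Fast"), ("Blue/White", "Light", "Fast"),
            ("Blue/Grey", "Light", "Slow"), ("Blue/White", "Light", "Fast")]

african_mode = mode_of(african)
european_mode = mode_of(european)

# Assign a new (invented) record to the group whose mode is nearest
new_bird = ("Blue/White", "Heavy", "Fast")
nearest = min([("African", african_mode), ("European", european_mode)],
              key=lambda item: hamming(new_bird, item[1]))[0]
```

The new bird differs from the African mode in two dimensions and from the European mode in one, so it is assigned to the European group.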


Why Spark?

What is Spark?

"Apache Spark™ is a fast and general engine for large-scale data processing." - spark.apache.org

• Distributed computing
• Relies on HDFS (or other DFS)
• In-memory
• Optimized execution
• High level functionality

Why Spark?

It is FAST

• Take the computation to the data
• Spark works faster on partitioned data than map-reduce
– In-memory operation avoids I/O costs
– DAG optimization reduces computational costs
• Fast to develop
– Data transformation and machine learning libraries are part of Spark

[Figure: data split into Block1 to Block4]

http://spark.apache.org/docs/latest/cluster-overview.html

Basic data structures in Spark

• Resilient Distributed Dataset (RDD)
• Dataframe = RDD with a schema
– SQL-style syntax
– Refer to column by name
– Optimized queries
• Dataset = best of both worlds?!?

[Figure: full data set split into Block1 to Block8]

What makes it resilient?
• Multiple copies
• Stores lineage

A little terminology …

[Figure: full data set, with labels: node, partition, record]


Why Clickstream?

What is clickstream data?

• Information trail left behind by each user
• Semi-structured website log files
• Includes:
– User agent information
- Device
- OS
- Browser
– Geo information
- Timezone
- Lat/Longitude
- City
- Country
– Time of access
– Referring website
– Website accessed

Photo credit: Tim Franklin Photography via Foter.com

What is this good for?

• Web analytics can answer questions like:
– How long do users take from first visit to purchase?
– When do users visit the website?
– What marketing channels are effective in attracting users?
– Where are users located?
– What are the paths that users take through the website?
– How long do users stay on a specific page?
– Which pages draw the most users?
– etc…

The sample use case

Clickstream data from 1usagov
– Created whenever anyone shortens a .gov or .mil URL with bitly
– Feed at http://developer.usa.gov/1usagov
– Archive for 2011-2013: http://bitly.measuredvoice.com/bitly_archive/?C=M;O=D

Why this is a great dataset:
– Large volume
– Realistic format
- Streaming
- Not cleaned
– Interesting questions
- What subtypes of users are there?
- How do the activity patterns of these subtypes differ?
– Publicly available archive




What is the raw data?

{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL"}

• json format
• Fields include:
– Website clicked/long url ("u")
– Referring url ("r")
– User agent ("a"): what machine is this?
– Time accessed ("t")
– etc…
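Reading those fields out of one log line needs only the standard json module. The sketch below is a shortened copy of the sample record; the function name and the output key names are invented for the example, and unparseable lines are returned as None because the feed is not cleaned.

```python
import json

# Shortened copy of the sample record shown above
raw = ('{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov",'
       '"u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct",'
       '"a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",'
       '"t":1460753233,"c":"US","ll":[38,-97]}')

def extract_fields(line):
    """Parse one log line; return None for lines that are not valid JSON objects."""
    try:
        rec = json.loads(line)
    except ValueError:          # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(rec, dict):
        return None
    return {
        "long_url": rec.get("u"),
        "referrer": rec.get("r"),
        "user_agent": rec.get("a"),
        "timestamp": rec.get("t"),
    }

fields = extract_fields(raw)
```

Using rec.get(...) rather than rec[...] keeps records with missing keys in the data set, with None in the missing fields.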


Parsing click stream data on Spark

High level picture

• Need to extract:
– Time in date, hours
– Information about the user:
- Device type
- OS
- Timezone
– Main domain of the url
– Referring url
• Do this for one record in python
• Map this function over all records using Spark

Example record:

{"h":"1rzB4JL","g":"1laU0gx","l":"anonymous","hh":"1.usa.gov","u":"http://www.cdc.gov/cdcgrandrounds/index.htm","r":"direct","a":"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)","i":"","t":1460753233,"k":"","nk":0,"hc":1413468615,"_id":"52d237f4-8c0e-0ac2-a0ed-a32acabe05bb","al":"en-US","c":"US","ll":[38,-97],"sl":"1rzB4JL","tz":"America/New_York"}

Parsed output:

Day: Friday
Local_hour: 16
Device_type: pc
Browser: IE
OS: Windows 7
Is_bot: false
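The time and domain parts of this extraction can be sketched for a single record with the standard library. The function names are invented for the example. The fixed UTC offset of -4 hours (New York in April) is a simplification: a real pipeline would presumably resolve the record's "tz" name through a time zone database.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

def time_features(epoch_seconds, utc_offset_hours):
    """Day name and hour of a Unix timestamp at a fixed offset from UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromtimestamp(epoch_seconds, tz)
    return {"day": local.strftime("%A"), "local_hour": local.hour}

def main_domain(url):
    """Host part of a URL, or the input unchanged for markers such as 'direct'."""
    host = urlparse(url).netloc
    return host if host else url

# Three fields taken from the example record
record = {"u": "http://www.cdc.gov/cdcgrandrounds/index.htm",
          "r": "direct",
          "t": 1460753233}

features = time_features(record["t"], utc_offset_hours=-4)  # assumed offset (EDT)
features["domain"] = main_domain(record["u"])
features["referrer"] = main_domain(record["r"])
```

For this record the sketch reproduces the Day and Local_hour values shown on the slide (Friday, 16).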

Actual transformation

• Define parsing function
• Map parsing function over RDD

[Code screenshots not reproduced. Annotations on the code:]
– Leverage user-agents library
– For every record s, apply user_agent library
– RDD containing parsed json data
– Keep every entry as item in list
– Apply custom function to user agent string
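Since the code on these slides did not survive, here is a rough stand-in for the idea: a function that turns one user-agent string into a few features, applied to every record. It is not the user_agents library and not the talk's code. The substring rules and lookup table are deliberately crude and incomplete, and a list comprehension plays the role that rdd.map would play in Spark.

```python
import re

# Windows NT version tokens -> familiar names (small, incomplete table)
WINDOWS_NAMES = {"5.1": "Windows XP", "6.0": "Windows Vista",
                 "6.1": "Windows 7", "6.2": "Windows 8", "10.0": "Windows 10"}
BOT_HINTS = ("bot", "crawler", "spider")

def parse_user_agent(ua):
    """Very rough user-agent classifier; a stand-in for a real parsing library."""
    lowered = ua.lower()
    is_bot = any(hint in lowered for hint in BOT_HINTS)
    windows = re.search(r"windows nt (\d+\.\d+)", lowered)

    # iOS is tested before Mac OS X because iOS strings contain "like Mac OS X"
    if "iphone" in lowered or "ipad" in lowered:
        os_name, device = "iOS", "mobile"
    elif "android" in lowered:
        os_name, device = "Android", "mobile"
    elif windows:
        os_name, device = WINDOWS_NAMES.get(windows.group(1), "Windows"), "pc"
    elif "mac os x" in lowered:
        os_name, device = "Mac OS X", "pc"
    else:
        os_name, device = "Other", "other"

    if "msie" in lowered or "trident" in lowered:
        browser = "IE"
    elif "firefox" in lowered:
        browser = "Firefox"
    elif "chrome" in lowered:
        browser = "Chrome"
    elif "safari" in lowered:
        browser = "Safari"
    else:
        browser = "Other"

    return {"device_type": device, "browser": browser,
            "os": os_name, "is_bot": is_bot}

records = [
    {"a": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"},
    {"a": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
]

# Plain-Python stand-in for mapping the function over an RDD of parsed records
parsed = [parse_user_agent(r["a"]) for r in records]
```

A dedicated library covers far more devices and browsers than these rules do, which is the reason to leverage one in practice.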


Distributed K-modes

How does clustering have to change to be distributed?

K-means example:
• Clustering is a collective operation. How can we distribute it?
• Do k-means on each partition
• Cluster the collected centroids
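The two-stage idea can be tried without a cluster by treating a few Python lists as partitions. This is an illustrative sketch with one-dimensional toy numbers and a deterministic start, not the packaged implementation. It also treats every local centroid as equally important in the second stage, which is a simplification.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain k-means on numbers; starting centroids are spread over the sorted data."""
    ordered = sorted(values)
    # Deterministic start: evenly spaced picks from the sorted values
    centroids = [ordered[(len(ordered) - 1) * i // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

# Three "partitions" of one data set (toy numbers with groups near 10 and near 50)
partitions = [[9, 11, 49, 51], [10, 12, 50, 48], [8, 10, 52, 50]]

# Stage 1: k-means on each partition independently (the part that can run in parallel)
local_centroids = [c for part in partitions for c in kmeans_1d(part, k=2)]

# Stage 2: cluster the collected centroids in one place
final_centroids = sorted(kmeans_1d(local_centroids, k=2))
```

Only the handful of local centroids has to travel to one place, not the records themselves, which is what makes the second stage cheap.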

Mapping over data in Spark

• Map over a record:

    def f(record):
        return transform(record)
    rdd2 = rdd1.map(f)

[Figure: full data set split into blocks; map applied to Block1]

What is the equivalent for a whole partition? Spark has two possibilities:
1. mapPartitions:
• Get each record in turn and do something; return after all records are done
2. mapPartitionsWithIndex:
• Keep track of which partition returned which result

• Map over a partition:

    def f(iterator):
        yield cluster(iterator)
    rdd2 = rdd1.mapPartitions(f)

• Map over a partition with a partition key:

    def f(splitIndex, iterator):
        yield (splitIndex, cluster(iterator))
    rdd2 = rdd1.mapPartitionsWithIndex(f)

Iterator = cycle once through each record

For K-modes, we have open-sourced an implementation of distributed clustering: https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
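The difference between the three kinds of mapping is easy to see in a plain-Python emulation, where a list of lists stands in for an RDD and its partitions. Nothing below is Spark code: the three helper functions are invented to mimic the behaviour of map, mapPartitions and mapPartitionsWithIndex, and a simple count-and-sum stands in for the clustering step.

```python
def map_records(partitions, f):
    """Like rdd.map(f): f sees one record at a time."""
    return [[f(record) for record in part] for part in partitions]

def map_partitions(partitions, f):
    """Like rdd.mapPartitions(f): f gets an iterator over one whole partition."""
    return [list(f(iter(part))) for part in partitions]

def map_partitions_with_index(partitions, f):
    """Like rdd.mapPartitionsWithIndex(f): f also receives the partition index."""
    return [list(f(i, iter(part))) for i, part in enumerate(partitions)]

def summarize(iterator):
    # Cycle once through the records, then yield a single result for the partition
    values = list(iterator)
    yield (len(values), sum(values))

def summarize_with_index(index, iterator):
    values = list(iterator)
    yield (index, len(values), sum(values))

partitions = [[1, 2, 3], [10, 20], [100]]

doubled = map_records(partitions, lambda x: 2 * x)
per_partition = map_partitions(partitions, summarize)
tagged = map_partitions_with_index(partitions, summarize_with_index)
```

Mapping over records keeps one output per record, while the partition variants return one result per partition; the indexed variant also records which partition produced it.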


Applying to 1USAGOV data

Getting 1usagov clickstream data

• Scrape data from archive site:
– http://1usagov.measuredvoice.com/
– json format
• Concatenate into files by month
• Store in HDFS
• Load into Spark


Loading json data

Parse to extract user agent information

• Python package user_agents
– Input string -> output information
• Add some custom parsing to extract features
– os family, os_version, device
• Use Spark to map this over each clickstream entry

Prepare for K-modes clustering

The CURSE of dimensionality ….

To reduce dimensionality:
• Decide which variables to use for clustering
• Keep only the top few categories for each variable

Prasad Patil, as referenced on http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/

Prepare for K-modes clustering

• Decide which variables to use for clustering
– Country
– Timezone
– Device Type
– OS
– Browser
• Keep only the top few categories for each variable
– Custom UDF for Spark dataframes
– Apply a series of UDFs
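Keeping only the top few categories amounts to counting values per column and folding the rare ones into a catch-all label. This is a minimal sketch in plain Python with an invented browser column and invented function names. In Spark, a function like collapse is the kind of logic that could be wrapped in a custom UDF and applied per column.

```python
from collections import Counter

def top_categories(values, n):
    """The n most frequent values in a column."""
    return {value for value, _ in Counter(values).most_common(n)}

def collapse(value, keep, other="Other"):
    """Keep frequent categories; map everything else to one catch-all label."""
    return value if value in keep else other

# Invented sample column
browsers = ["Firefox", "Chrome", "IE", "Firefox", "Opera", "Chrome",
            "Firefox", "IE", "Konqueror", "Chrome"]

keep = top_categories(browsers, n=3)
reduced = [collapse(b, keep) for b in browsers]
```

Here the column goes from five distinct values to four (three kept categories plus "Other"), which is the kind of reduction the "Other" entries in the cluster table reflect.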

Perform distributed k-modes clustering

• Uses open-source package https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes
• Code annotations: # of modes, max. iterations

[Figure: the full log is split into partitions (Create RDD); local clustering yields centroids for each partition; distributed clustering combines them into the final centroids]





Clustering results: 10 clusters

What do the clusters look like?

#   Size    Country         Timezone        Device Type  OS                     Browser
1   617820  US: 93%         US/NY: 53%      PC: 97%      Win 7: 75%             Firefox: 57%
2   226035  NotUS: 68%      Other: 57%      Mobile: 75%  iOS: 84%               MobileSafari: 78%
3   152053  NoGeoInfo: 86%  NoGeoInfo: 86%  PC: 99%      Windows: 81%           Chrome/IE: 72%
4   161947  US: 96%         US/NY: 60%      PC: 99%      Windows not 7: 99%     IE: 81%
5   105090  NoGeoInfo: 76%  NoGeoInfo: 76%  Mobile: 70%  Other: 70%             Other: 99%
6   235719  NotUS: 99%      Other: 89%      PC: 99%      Win 7: 68%             Chrome: 51%
7   121464  US: 100%        US/LA: 59%      PC: 95%      MacOSX: 72%            Chrome: 54%
8   121115  US: 48%         NoGeoInfo: 40%  Mobile: 93%  Android: 100%          Android: 99%
9   101052  NotUS: 98%      Other: 90%      PC: 100%     Win other than 7: 84%  Firefox: 57%
10  173424  US: 100%        US/NY: 48%      Mobile: 68%  iOS: 100%              MobileSafari: 74%

Access patterns

[Figures: time access patterns (two slides)]

Top sites visited: January 2012

Description             Top 3 domains
US, pc, Win7            www.nysdot.gov 212K  www.nasa.gov 59K               www.fda.gov 18K
US, pc, Win_not7, IE    www.nasa.gov 15K     www.shrewsbury-ma.gov 9K       www.fda.gov 5K
US, pc, Mac OS X        www.nysdot.gov 29K   www.nasa.gov 16K               www.whitehouse.gov 6K
notUS, pc, Win7         www.nasa.gov 87K     earthobservatory.nasa.gov 15K  www.nysdot.gov 14K
notUS, pc, Win_not7     www.nasa.gov 30K     www.navy.mil 8K                globalhealth.gov 7K
noGeo, pc, Win, Chrome  www.nasa.gov 34K     www.nysdot.gov 17K             (not recovered)
US, mobile, iOS         www.nasa.gov 33K     (not recovered)                forecast.weather.gov 9K
notUS, mobile, iOS      www.nasa.gov 82K     (not recovered)                www.navy.mil 13K
Mobile, Android         www.nasa.gov 29K     (not recovered)                www.navy.mil 6K
noGeo, mobile, OtherOS  www.nasa.gov 24K     www.nysdot.gov 8K              www.army.mil 5K

Where do users come from: January 2012

Description             Top 3 domains
US, pc, Win7            direct 342K       t.co 135K             www.facebook.com 67K
US, pc, Win_not7, IE    direct 69K        t.co 33K              www.facebook.com 19K
US, pc, Mac OS X        t.co 49K          direct 41K            www.facebook.com 15K
notUS, pc, Win7         t.co 125K         www.facebook.com 45K  direct 38K
notUS, pc, Win_not7     t.co 41K          direct 29K            www.facebook.com 14K
noGeo, pc, Win, Chrome  t.co 56K          direct 47K            www.facebook.com 24K
US, mobile, iOS         twitter.com 83K   direct 59K            m.facebook.com 17K
notUS, mobile, iOS      twitter.com 119K  direct 69K            t.co 21K
Mobile, Android         t.co 62K          direct 34K            m.facebook.com 17K
noGeo, mobile, OtherOS  direct 63K        t.co 20K              m.facebook.com 13K










What happened in space that had the twitter-sphere abuzz in January 2012?

Solar Flares!

Especially non-US users
• to: nasa.gov, earthobservatory.nasa.gov
• from: Twitter

http://earthobservatory.nasa.gov/NaturalHazards/view.php?id=76998


Summary

• Data processing operations, like parsing user-agent strings, can be distributed using Spark
• Clustering of large data sets can be distributed using Spark
• Clustering finds groups of related users/records
• These user types show distinct behaviors
• Segmenting users can drive insight and facilitate appropriate messaging
– When are they visiting?
– Where are they looking?
– Where are they coming from?

[Figure: web log data, user information, user groups, targeted message]


Questions?

Slides available at: http://www.slideshare.net/MarissaSaunders/clickstream-data-with-spark

Distributed K-modes clustering for pyspark: https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes





